vimmoos@Thor committed on
Commit
b49af5c
·
0 Parent(s):

In the beginning there was darkness

Files changed (40)
  1. .gitignore +4 -0
  2. README.md +64 -0
  3. old_code/experiment_1/train_agent.py +405 -0
  4. old_code/experiment_1/utils.py +121 -0
  5. old_code/experiment_2/catch.py +93 -0
  6. old_code/experiment_2/catch_v2.py +98 -0
  7. old_code/experiment_2/catch_v3.py +91 -0
  8. old_code/experiment_2/catch_v4.py +88 -0
  9. old_code/experiment_2/train_catch_cnn_agent.py +295 -0
  10. old_code/experiment_2/utils.py +121 -0
  11. old_code/experiment_3/q_networks/a2c.py +104 -0
  12. old_code/experiment_3/q_networks/buffers/CartPole-v0/1/DQN/memory_buffer.p +0 -0
  13. old_code/experiment_3/q_networks/ddqn.py +136 -0
  14. old_code/experiment_3/q_networks/dqn.py +130 -0
  15. old_code/experiment_3/q_networks/prepare_buffer.py +104 -0
  16. old_code/experiment_3/q_networks/train_offline_a2c.py +123 -0
  17. old_code/experiment_3/q_networks/train_offline_ddqn.py +126 -0
  18. old_code/experiment_3/q_networks/train_offline_dqn.py +124 -0
  19. old_code/experiment_3/q_networks/utils.py +29 -0
  20. old_code/experiment_3/upside_down/prepare_offline_buffer.py +157 -0
  21. old_code/experiment_3/upside_down/train_agent.py +248 -0
  22. old_code/experiment_3/upside_down/train_offline_agent.py +196 -0
  23. old_code/experiment_3/upside_down/utils.py +131 -0
  24. old_code/train_atari_agent.py +321 -0
  25. old_code/utils.py +121 -0
  26. poetry.lock +0 -0
  27. udrl/__main__.py +238 -0
  28. udrl/agent.py +180 -0
  29. udrl/buffer.py +70 -0
  30. udrl/catch/__init__.py +35 -0
  31. udrl/catch/adptor.py +126 -0
  32. udrl/catch/core.py +190 -0
  33. udrl/catch/renderer.py +65 -0
  34. udrl/cli.py +192 -0
  35. udrl/data_proc.py +51 -0
  36. udrl/inference.py +122 -0
  37. udrl/plot.py +189 -0
  38. udrl/policies.py +364 -0
  39. udrl/test.py +137 -0
  40. udrl/viz.py +310 -0
.gitignore ADDED
@@ -0,0 +1,4 @@
1
+ __pycache__/
2
+ data/
3
+ **/*.npy
4
+ **/*.pyc
README.md ADDED
@@ -0,0 +1,64 @@
1
+
2
+ # Upside-Down RL
3
+
4
+ This project implements an Upside-Down Reinforcement Learning (UDRL) agent.
5
+
6
+ ### Installation
7
+
8
+ 1. Make sure you have Python 3.10 installed. You can check your version with `python --version`.
9
+ **NOTE** Use a virtual environment to avoid dependency clashes.
10
+ 2. Install the project dependencies using Poetry:
11
+ ```bash
12
+ poetry install
13
+ ```
14
+ If you do not have Poetry, install the requirements with pip:
15
+ ```bash
16
+ pip install -r requirements.txt
17
+ ```
18
+
19
+
20
+ ### Running the Experiment
21
+
22
+ You can run the experiment with various configuration options using the command line:
23
+
24
+ ```bash
25
+ poetry run python -m udrl [options]
26
+ ```
27
+ **Note** If you are already inside a virtual environment, `python -m udrl [options]` is enough.
28
+ **Note** All defaults are set for CartPole-v0.
29
+ Available options include:
30
+
31
+ * `--env_name`: Name of the Gym environment (default: `CartPole-v0`)
32
+ * `--estimator_name`: `"neural"` for the neural network, or the fully qualified name of a scikit-learn estimator class (default: `ensemble.RandomForestClassifier`); see the resolution sketch below
33
+ * `--seed`: Random seed (default: `42`)
34
+ * `--max_episode`: Maximum training episodes (default: `500`)
35
+ * `--collect_episode`: Episodes to collect between training (default: `15`)
36
+ * `--batch_size`: Batch size for training (default: `0`, uses entire replay buffer)
37
+ * Other options related to warm-up, memory size, exploration, testing, saving, etc.
38
+
39
+ **NOTE** The CartPole, Acrobot, MountainCar, and LunarLander environments have been tested.
40
+
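+ The `--estimator_name` value is a dotted path inside scikit-learn. As a rough illustration (an assumption made here for clarity; the project's actual loader may resolve it differently), such a string can be turned into an estimator class like this:
+
+ ```python
+ import importlib
+
+ def resolve_estimator(qualified_name: str):
+     # e.g. "ensemble.RandomForestClassifier" -> sklearn.ensemble.RandomForestClassifier
+     module_path, _, class_name = qualified_name.rpartition(".")
+     module = importlib.import_module(f"sklearn.{module_path}")
+     return getattr(module, class_name)
+
+ # resolve_estimator("ensemble.RandomForestClassifier")() gives an untrained estimator
+ ```
+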
41
+ ### Result Data
42
+
43
+ * Experiment configuration and final test results are saved in a JSON file (`conf.json`) within a directory structure based on the environment, seed, and non-default configuration values (e.g., `data/[env-name]/[experiment_name]/[seed]/conf.json`).
44
+ * If `save_policy` is True, the trained policy is saved in the same directory (`policy`).
45
+ * If `save_learning_infos` is True, learning information and the rewards collected during training are saved in the same directory as a NumPy file (e.g. `test_rewards.npy`) and a JSON file (e.g. `learning_infos.json`); a loading sketch is shown below.
46
+
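+ For reference, a minimal sketch of loading these artifacts back for analysis (the run directory below is hypothetical; substitute your own environment, experiment name, and seed):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ import numpy as np
+
+ run_dir = Path("data/CartPole-v0/base/42")  # hypothetical run directory
+
+ conf = json.loads((run_dir / "conf.json").read_text())        # config + final test results
+ test_rewards = np.load(run_dir / "test_rewards.npy")          # only if save_learning_infos was True
+ learning_infos = json.loads((run_dir / "learning_infos.json").read_text())
+ ```
+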
47
+ ### Process Data
48
+ * A basic post-processing step is available to convert the result data into CSV files; run it with `python -m udrl.data_proc`
49
+
50
+ ### Project Structure
51
+
52
+ * `data`: Stores experiment results and other data.
53
+ * `old_code`: Contains previous code versions (not used in the current setup).
54
+ * `poetry.lock`, `pyproject.toml`: Manage project dependencies and configuration.
55
+ * `README.md`: This file.
56
+ * `udrl`: Contains the main Python modules for the UDRL agent.
57
+
58
+ Please refer to the code and comments for further details on the implementation.
59
+
60
+
61
+
62
+ ## Troubleshooting
63
+
64
+ If you encounter any errors during installation or execution, or if you have any questions about the project, feel free to reach out to me at [[email protected]](mailto:[email protected]). I'll be happy to assist you!
old_code/experiment_1/train_agent.py ADDED
@@ -0,0 +1,405 @@
1
+ import os
2
+ import math
3
+ import time
4
+ import gymnasium as gym
5
+ import random
6
+ import utils
7
+ import keras
8
+ import numpy as np
9
+
10
+ from collections import deque
11
+ from matplotlib import pyplot as plt
12
+ from sklearn.preprocessing import OneHotEncoder
13
+ from sklearn.tree import DecisionTreeClassifier
14
+ from sklearn.ensemble import (
15
+ RandomForestClassifier,
16
+ ExtraTreesClassifier,
17
+ AdaBoostClassifier,
18
+ )
19
+ from sklearn.neighbors import KNeighborsClassifier
20
+ from sklearn.svm import SVC
21
+
22
+ from sklearn.exceptions import NotFittedError
23
+ from sklearn.ensemble import GradientBoostingClassifier
24
+ from tqdm import trange
25
+
26
+
27
+ class ReplayBuffer:
28
+ """
29
+ Thank you: https://github.com/BY571/
30
+ """
31
+
32
+ def __init__(self, max_size):
33
+ self.max_size = max_size
34
+ self.buffer = []
35
+
36
+ def add_sample(self, states, actions, rewards):
37
+ episode = {
38
+ "states": states,
39
+ "actions": actions,
40
+ "rewards": rewards,
41
+ "summed_rewards": sum(rewards),
42
+ }
43
+ self.buffer.append(episode)
44
+
45
+ def sort(self):
46
+ # sort buffer
47
+ self.buffer = sorted(
48
+ self.buffer, key=lambda i: i["summed_rewards"], reverse=True
49
+ )
50
+ # keep the max buffer size
51
+ self.buffer = self.buffer[: self.max_size]
52
+
53
+ def get_random_samples(self, batch_size):
54
+ self.sort()
55
+
56
+ idxs = np.random.randint(0, len(self.buffer), batch_size)
57
+ batch = [self.buffer[idx] for idx in idxs]
58
+
59
+ return batch
60
+
61
+ def get_n_best(self, n):
62
+ self.sort()
63
+ return self.buffer[:n]
64
+
65
+ def __len__(self):
66
+ return len(self.buffer)
67
+
68
+
69
+ class UpsideDownAgent:
70
+ def __init__(self, environment, approximator):
71
+ print(environment)
72
+ self.environment = gym.make(environment)
73
+ self.approximator = approximator
74
+ self.state_size = self.environment.observation_space.shape[0]
75
+ self.action_size = self.environment.action_space.n
76
+ self.warm_up_episodes = 50
77
+ self.render = False
78
+ self.memory = ReplayBuffer(700)
79
+ self.last_few = 75
80
+ self.batch_size = 32
81
+ self.command_size = 2 # desired return + desired horizon
82
+ self.desired_return = 1
83
+ self.desired_horizon = 1
84
+ self.horizon_scale = 0.02
85
+ self.return_scale = 0.02
86
+ self.testing_state = 0
87
+
88
+ if approximator == "neural_network":
89
+ self.behaviour_function = utils.get_functional_behaviour_function(
90
+ self.state_size, self.command_size, self.action_size
91
+ )
92
+
93
+ elif approximator == "forest":
94
+ self.behaviour_function = RandomForestClassifier(200)
95
+
96
+ elif approximator == "extra-trees":
97
+ self.behaviour_function = ExtraTreesClassifier()
98
+
99
+ elif approximator == "knn":
100
+ self.behaviour_function = KNeighborsClassifier()
101
+
102
+ elif approximator == "adaboost":
103
+ self.behaviour_function = AdaBoostClassifier()
104
+
105
+ self.testing_rewards = []
106
+ self.warm_up_buffer()
107
+
108
+ def warm_up_buffer(self):
109
+
110
+ for i in range(self.warm_up_episodes):
111
+ # Gymnasium returns (state,info_dict)
112
+ state, _ = self.environment.reset()
113
+ states = []
114
+ rewards = []
115
+ actions = []
116
+ done = False
117
+ desired_return = 1
118
+ desired_horizon = 1
119
+
120
+ while not done:
121
+ state = np.reshape(state, [1, self.state_size])
122
+ states.append(state)
123
+
124
+ observation = state
125
+
126
+ command = np.asarray(
127
+ [
128
+ desired_return * self.return_scale,
129
+ desired_horizon * self.horizon_scale,
130
+ ]
131
+ )
132
+
133
+ command = np.reshape(command, [1, len(command)])
134
+
135
+ action = self.get_action(observation, command)
136
+ actions.append(action)
137
+ # Gymnasium returns (s,r,tr,te,info)
138
+ next_state, reward, tru, ter, info = self.environment.step(action)
139
+ done = tru or ter
140
+ next_state = np.reshape(next_state, [1, self.state_size])
141
+
142
+ rewards.append(reward)
143
+
144
+ state = next_state
145
+
146
+ desired_return -= reward # Line 8 Algorithm 2
147
+ desired_horizon -= 1 # Line 9 Algorithm 2
148
+ desired_horizon = np.maximum(desired_horizon, 1)
149
+
150
+ self.memory.add_sample(states, actions, rewards)
151
+
152
+ def get_action(self, observation, command):
153
+ """
154
+ We will sample from the action distribution modeled by the Behavior Function
155
+ """
156
+
157
+ if self.approximator == "neural_network":
158
+ action_probs = self.behaviour_function.predict([observation, command])
159
+ action = np.random.choice(np.arange(0, self.action_size), p=action_probs[0])
160
+
161
+ return action
162
+
163
+ elif self.approximator in ["forest", "extra-trees", "knn", "svm", "adaboost"]:
164
+ try:
165
+ input_state = np.concatenate((observation, command), axis=1)
166
+ action = self.behaviour_function.predict(input_state)
167
+ # print(action)
168
+ if np.random.rand() > 0.8:
169
+ return int(not np.argmax(action))
170
+
171
+ return np.argmax(action)
172
+
173
+ except NotFittedError as e:
174
+ return random.randint(0, 1)
175
+
176
+ def get_greedy_action(self, observation, command):
177
+
178
+ if self.approximator == "neural_network":
179
+ action_probs = self.behaviour_function.predict([observation, command])
180
+ action = np.argmax(action_probs)
181
+
182
+ return action
183
+
184
+ else:
185
+ input_state = np.concatenate((observation, command), axis=1)
186
+ action = self.behaviour_function.predict(input_state)
187
+
188
+ self.testing_state += 1
189
+
190
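+ # accumulate, per input feature, the impurity decrease along each tree's decision path for this sample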
+ feature_importances = {}
191
+
192
+ for t in self.behaviour_function.estimators_:
193
+ branch = t.decision_path(input_state).todense()
194
+ branch = np.array(branch, dtype=bool)
195
+ imp = t.tree_.impurity[branch[0]]
196
+ for f, i in zip(t.tree_.feature[branch[0]][:-1], imp[:-1] - imp[1:]):
197
+ feature_importances.setdefault(f, []).append(i)
198
+
199
+ summed_importances = [
200
+ sum(feature_importances[0]),
201
+ sum(feature_importances[1]),
202
+ sum(feature_importances[2]),
203
+ sum(feature_importances[3]),
204
+ sum(feature_importances[4]),
205
+ sum(feature_importances[5]),
206
+ ]
207
+
208
+ x = np.arange(len(summed_importances))
209
+
210
+ plt.figure()
211
+ plt.title("CartPole-v0")
212
+ plt.bar(x, summed_importances)
213
+ plt.xticks(
214
+ x,
215
+ [
216
+ "feature-1",
217
+ "feature-2",
218
+ "feature-3",
219
+ "feature-4",
220
+ r"$d_t^{r}$",
221
+ r"$d_t^{h}$",
222
+ ],
223
+ )
224
+ plt.savefig("importances_state_" + str(self.testing_state) + ".jpg")
225
+
226
+ return np.argmax(action)
227
+
228
+ def train_behaviour_function(self):
229
+
230
+ random_episodes = self.memory.get_random_samples(self.batch_size)
231
+
232
+ training_observations = np.zeros((self.batch_size, self.state_size))
233
+ training_commands = np.zeros((self.batch_size, 2))
234
+
235
+ y = []
236
+
237
+ for idx, episode in enumerate(random_episodes):
238
+ T = len(episode["states"])
239
+ t1 = np.random.randint(0, T - 1)
240
+ t2 = np.random.randint(t1 + 1, T)
241
+
242
+ state = episode["states"][t1]
243
+ desired_return = sum(episode["rewards"][t1:t2])
244
+ desired_horizon = t2 - t1
245
+
246
+ target = episode["actions"][t1]
247
+
248
+ training_observations[idx] = state[0]
249
+ training_commands[idx] = np.asarray(
250
+ [
251
+ desired_return * self.return_scale,
252
+ desired_horizon * self.horizon_scale,
253
+ ]
254
+ )
255
+ y.append(target)
256
+
257
+ _y = keras.utils.to_categorical(y)
258
+
259
+ if self.approximator == "neural_network":
260
+ self.behaviour_function.fit(
261
+ [training_observations, training_commands], _y, verbose=0
262
+ )
263
+
264
+ elif self.approximator in ["forest", "extra-trees", "adaboost"]:
265
+ input_classifier = np.concatenate(
266
+ (training_observations, training_commands), axis=1
267
+ )
268
+
269
+ self.behaviour_function.fit(input_classifier, _y)
270
+
271
+ def sample_exploratory_commands(self):
272
+ best_episodes = self.memory.get_n_best(self.last_few)
273
+ exploratory_desired_horizon = np.mean([len(i["states"]) for i in best_episodes])
274
+
275
+ returns = [i["summed_rewards"] for i in best_episodes]
276
+ exploratory_desired_returns = np.random.uniform(
277
+ np.mean(returns), np.mean(returns) + np.std(returns)
278
+ )
279
+
280
+ return [exploratory_desired_returns, exploratory_desired_horizon]
281
+
282
+ def generate_episode(
283
+ self, environment, e, desired_return, desired_horizon, testing
284
+ ):
285
+
286
+ env = gym.make(environment)
287
+ tot_rewards = []
288
+ done = False
289
+
290
+ score = 0
291
+ # Gymnasium returns (state,info_dict)
292
+ state, _ = env.reset()
293
+
294
+ scores = []
295
+ states = []
296
+ actions = []
297
+ rewards = []
298
+
299
+ while not done:
300
+ state = np.reshape(state, [1, self.state_size])
301
+ states.append(state)
302
+
303
+ observation = state
304
+
305
+ command = np.asarray(
306
+ [
307
+ desired_return * self.return_scale,
308
+ desired_horizon * self.horizon_scale,
309
+ ]
310
+ )
311
+ command = np.reshape(command, [1, len(command)])
312
+
313
+ if not testing:
314
+ action = self.get_action(observation, command)
315
+ actions.append(action)
316
+ else:
317
+ action = self.get_greedy_action(observation, command)
318
+
319
+ # Gymnasium returns (s,r,tr,te,info)
320
+ next_state, reward, tru, ter, info = env.step(action)
321
+ done = tru or ter
322
+ next_state = np.reshape(next_state, [1, self.state_size])
323
+
324
+ rewards.append(reward)
325
+ score += reward
326
+
327
+ state = next_state
328
+
329
+ desired_return -= reward # Line 8 Algorithm 2
330
+ desired_horizon -= 1 # Line 9 Algorithm 2
331
+ desired_horizon = np.maximum(desired_horizon, 1)
332
+
333
+ self.memory.add_sample(states, actions, rewards)
334
+
335
+ self.testing_rewards.append(score)
336
+
337
+ if testing:
338
+ print("Querying the model ...")
339
+ print("Testing score: {}".format(score))
340
+
341
+ return score
342
+
343
+
344
+ def run_experiment():
345
+
346
+ import argparse
347
+
348
+ parser = argparse.ArgumentParser()
349
+
350
+ parser.add_argument("--approximator", type=str, default="forest")
351
+ parser.add_argument("--environment", type=str, default="CartPole-v0")
352
+ parser.add_argument("--seed", type=int, default=42)
353
+
354
+ args = parser.parse_args()
355
+
356
+ approximator = args.approximator
357
+ environment = args.environment
358
+ seed = args.seed
359
+ print(args)
360
+
361
+ episodes = 500
362
+ returns = []
363
+
364
+ agent = UpsideDownAgent(environment, approximator)
365
+ epi_bar = trange(episodes)
366
+ for e in epi_bar:
367
+ for i in range(100):
368
+ agent.train_behaviour_function()
369
+
370
+ for i in range(15):
371
+ tmp_r = []
372
+ exploratory_commands = (
373
+ agent.sample_exploratory_commands()
374
+ ) # Line 5 Algorithm 1
375
+ desired_return = exploratory_commands[0]
376
+ desired_horizon = exploratory_commands[1]
377
+ r = agent.generate_episode(
378
+ environment, e, desired_return, desired_horizon, False
379
+ )
380
+ tmp_r.append(r)
381
+
382
+ epi_bar.set_postfix(
383
+ {
384
+ "mean": np.mean(tmp_r),
385
+ "std": np.std(tmp_r),
386
+ }
387
+ )
388
+ # print()
389
+ returns.append(np.mean(tmp_r))
390
+
391
+ exploratory_commands = agent.sample_exploratory_commands()
392
+
393
+ agent.generate_episode(environment, 1, 200, 200, True)
394
+
395
+ utils.save_results(environment, approximator, seed, returns)
396
+
397
+ if approximator == "neural_network":
398
+ utils.save_trained_model(environment, seed, agent.behaviour_function)
399
+
400
+
401
+ if __name__ == "__main__":
402
+ import warnings
403
+
404
+ warnings.simplefilter("ignore", DeprecationWarning)
405
+ run_experiment()
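+
+ # Illustrative sketch (not called by run_experiment above): the "Line 8/9
+ # Algorithm 2" comments mark the UDRL command update, i.e. the desired
+ # return is reduced by the observed reward and the desired horizon is
+ # decremented with a floor of 1.
+ def update_command(desired_return, desired_horizon, reward):
+     desired_return -= reward
+     desired_horizon = max(desired_horizon - 1, 1)
+     return desired_return, desired_horizon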
old_code/experiment_1/utils.py ADDED
@@ -0,0 +1,121 @@
1
+ import os
2
+ import argparse
3
+ import pickle
4
+ import keras
5
+ import numpy as np
6
+
7
+ from keras.layers import Dense, Multiply, Input, Conv2D, Flatten
8
+ from keras.models import Sequential, Model
9
+ from keras.optimizers import Adam, RMSprop, SGD
10
+
11
+ from skimage.transform import resize
12
+ from skimage.color import rgb2gray
13
+
14
+ STORING_PATH = './results/'
15
+ MODELS_PATH = './trained_models/'
16
+
17
+ def save_results(environment, approximator, seed, rewards):
18
+ storing_path = os.path.join(STORING_PATH, environment, approximator, str(seed))
19
+ if not os.path.exists(storing_path):
20
+ os.makedirs(storing_path)
21
+
22
+ np.save(storing_path + '/' + 'upside_down_rewards.npy', rewards)
23
+
24
+ def get_functional_behaviour_function(state_size, command_size, action_size):
25
+ observation_input = keras.Input(shape=(state_size,))
26
+ linear_layer = Dense(64, activation='sigmoid')(observation_input)
27
+
28
+ command_input = keras.Input(shape=(command_size,))
29
+ sigmoidal_layer = Dense(64, activation='sigmoid')(command_input)
30
+
31
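+ # element-wise gating: the observation embedding is scaled by a sigmoid embedding of the (return, horizon) command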
+ multiplied_layer = Multiply()([linear_layer, sigmoidal_layer])
32
+
33
+ layer_1 = Dense(64, activation='relu')(multiplied_layer)
34
+ layer_2 = Dense(64, activation='relu')(layer_1)
35
+ layer_3 = Dense(64, activation='relu')(layer_2)
36
+ layer_4 = Dense(64, activation='relu')(layer_3)
37
+ final_layer = Dense(action_size, activation='softmax')(layer_4)
38
+
39
+ model = Model(inputs=[observation_input, command_input], outputs=final_layer)
40
+ model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.001))
41
+
42
+ return model
43
+
44
+ def get_atari_behaviour_function(action_size):
45
+
46
+ print('Getting the model')
47
+
48
+ input_state = Input(shape=(84,84,4))
49
+
50
+ first_conv = Conv2D(
51
+ 32, (8, 8), strides=(4,4), activation='relu')(input_state)
52
+ second_conv = Conv2D(
53
+ 64, (4, 4), strides=(2,2), activation='relu')(first_conv)
54
+ third_conv = Conv2D(
55
+ 64, (3, 3), strides=(1,1), activation='relu')(second_conv)
56
+
57
+ flattened = Flatten()(third_conv)
58
+ dense_layer = Dense(512, activation='relu')(flattened)
59
+
60
+ command_input = keras.Input(shape=(2,))
61
+ sigmoidal_layer = Dense(512, activation='sigmoid')(command_input)
62
+
63
+ multiplied_layer = Multiply()([dense_layer, sigmoidal_layer])
64
+ final_layer = Dense(256, activation='relu')(multiplied_layer)
65
+
66
+ action_layer = Dense(action_size, activation='softmax')(final_layer)
67
+
68
+ model = Model(inputs=[input_state, command_input], outputs=action_layer)
69
+ model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.001, rho=0.95, epsilon=0.01))
70
+
71
+
72
+ print(model.summary())
73
+
74
+ return model
75
+
76
+ def get_catch_behaviour_function(action_size):
77
+
78
+ print('Getting the Catch-model')
79
+
80
+ input_state = Input(shape=(84,84,4))
81
+
82
+ first_conv = Conv2D(
83
+ 32, (8, 8), strides=(4,4), activation='relu')(input_state)
84
+ second_conv = Conv2D(
85
+ 64, (4, 4), strides=(2,2), activation='relu')(first_conv)
86
+ third_conv = Conv2D(
87
+ 64, (3, 3), strides=(1,1), activation='relu')(second_conv)
88
+
89
+ flattened = Flatten()(third_conv)
90
+ dense_layer = Dense(512, activation='relu')(flattened)
91
+
92
+ command_input = keras.Input(shape=(2,))
93
+ sigmoidal_layer = Dense(512, activation='sigmoid')(command_input)
94
+
95
+ multiplied_layer = Multiply()([dense_layer, sigmoidal_layer])
96
+ final_layer = Dense(256, activation='relu')(multiplied_layer)
97
+
98
+ action_layer = Dense(action_size, activation='softmax')(final_layer)
99
+
100
+ model = Model(inputs=[input_state, command_input], outputs=action_layer)
101
+ model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.001, rho=0.95, epsilon=0.01))
102
+
103
+
104
+ print(model.summary())
105
+
106
+ return model
107
+
108
+
109
+ def pre_processing(state):
110
+ processed_state = np.uint8(
111
+ resize(rgb2gray(state), (84, 84), mode='constant')*255)
112
+
113
+ return processed_state
114
+
115
+ def save_trained_model(environment, seed, model):
116
+ storing_path = os.path.join(MODELS_PATH, environment, str(seed))
117
+ if not os.path.exists(storing_path):
118
+ os.makedirs(storing_path)
119
+
120
+ model.save_weights(storing_path + '/' + 'trained_model.h5')
121
+
old_code/experiment_2/catch.py ADDED
@@ -0,0 +1,93 @@
1
+ from scipy.misc import imresize
2
+ import gym
3
+ import random
4
+ import numpy as np
5
+ from queue import Queue
6
+
7
+ from matplotlib import pyplot as plt
8
+ from PIL import Image
9
+
10
+ class CatchEnv:
11
+ def __init__(self):
12
+ self.size = 21
13
+ self.image = np.zeros((self.size, self.size))
14
+ self.state = []
15
+ self.fps = 4
16
+ self.output_shape = (84, 84)
17
+
18
+ def reset_random(self):
19
+ self.image.fill(0)
20
+ self.pos = np.random.randint(2, self.size-2)
21
+ self.vx = np.random.randint(5) - 2
22
+ self.vy = 1
23
+ self.ballx, self.bally = np.random.randint(self.size), 4
24
+ self.image[self.bally, self.ballx] = 1
25
+ self.image[-5, self.pos - 2:self.pos + 3] = np.ones(5)
26
+
27
+ return self.step(2)[0]
28
+
29
+
30
+ def step(self, action):
31
+ def left():
32
+ if self.pos > 3:
33
+ self.pos -= 2
34
+ def right():
35
+ if self.pos < 17:
36
+ self.pos += 2
37
+ def noop():
38
+ pass
39
+ {0: left, 1: right, 2: noop}[action]()
40
+
41
+
42
+ self.image[self.bally, self.ballx] = 0
43
+ self.ballx += self.vx
44
+ self.bally += self.vy
45
+ if self.ballx > self.size - 1:
46
+ self.ballx -= 2 * (self.ballx - (self.size-1))
47
+ self.vx *= -1
48
+ elif self.ballx < 0:
49
+ self.ballx += 2 * (0 - self.ballx)
50
+ self.vx *= -1
51
+ self.image[self.bally, self.ballx] = 1
52
+
53
+ self.image[-5].fill(0)
54
+ self.image[-5, self.pos-2:self.pos+3] = np.ones(5)
55
+
56
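+ # the episode ends when the ball reaches the paddle row; reward is 1 if the 5-cell paddle catches it, else 0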
+ terminal = self.bally == self.size - 1 - 4
57
+ reward = int(self.pos - 2 <= self.ballx <= self.pos + 2) if terminal else 0
58
+
59
+ [self.state.append(imresize(self.image, (84, 84))) for _ in range(self.fps - len(self.state) + 1)]
60
+ self.state = self.state[-self.fps:]
61
+
62
+ return np.transpose(self.state, [1, 2, 0]), reward, terminal
63
+
64
+ def get_num_actions(self):
65
+ return 3
66
+
67
+ def reset(self):
68
+ return self.reset_random()
69
+
70
+ def state_shape(self):
71
+ return (self.fps,) + self.output_shape
72
+
73
+
74
+ def test():
75
+ env = CatchEnv()
76
+ i = 0
77
+
78
+ for ep in range(1):
79
+ env.reset()
80
+
81
+ state, reward, terminal = env.step(1)
82
+
83
+ while not terminal:
84
+ state, reward, terminal = env.step(random.randint(0,2))
85
+
86
+ state = np.squeeze(state)
87
+
88
+ #print(reward)
89
+ #print(terminal)
90
+ i += 1
91
+
92
+ if __name__ == "__main__":
93
+ test()
old_code/experiment_2/catch_v2.py ADDED
@@ -0,0 +1,98 @@
1
+ import random
+ import numpy as np
+ from scipy.ndimage import rotate
+ from scipy.misc import imresize
+ from matplotlib import pyplot as plt
10
+
11
+ class CatchEnv2:
12
+ def __init__(self):
13
+ self.size = 21
14
+ self.image = np.zeros((self.size, self.size))
15
+ self.state = []
16
+ self.fps = 4
17
+ self.output_shape = (84, 84)
18
+
19
+ def reset_random(self):
20
+ self.image.fill(0)
21
+ self.pos = np.random.randint(2, self.size-2)
22
+ self.vx = np.random.randint(5) - 2
23
+ self.vy = 1
24
+ self.ballx, self.bally = np.random.randint(self.size), 4
25
+
26
+ self.image[self.bally, self.ballx] = 1
27
+ self.image[-5, self.pos - 2:self.pos + 3] = np.ones(5)
28
+
29
+ for i in range(0, self.size):
30
+ for j in range(0, self.size):
31
+ self.image[i][j] = random.randint(2,5)
32
+
33
+ return self.step(2)[0]
34
+
35
+
36
+ def step(self, action):
37
+ def left():
38
+ if self.pos > 3:
39
+ self.pos -= 2
40
+ def right():
41
+ if self.pos < 17:
42
+ self.pos += 2
43
+ def noop():
44
+ pass
45
+ {0: left, 1: right, 2: noop}[action]()
46
+
47
+
48
+ self.image[self.bally, self.ballx] = 0
49
+ self.ballx += self.vx
50
+ self.bally += self.vy
51
+ if self.ballx > self.size - 1:
52
+ self.ballx -= 2 * (self.ballx - (self.size-1))
53
+ self.vx *= -1
54
+ elif self.ballx < 0:
55
+ self.ballx += 2 * (0 - self.ballx)
56
+ self.vx *= -1
57
+
58
+ self.image[self.bally, self.ballx] = 1
59
+ self.image[-5].fill(random.randint(2,5))
60
+ self.image[-5, self.pos-2:self.pos+3] = np.ones(5)
61
+
62
+ terminal = self.bally == self.size - 1 - 4
63
+ reward = int(self.pos - 2 <= self.ballx <= self.pos + 2) if terminal else 0
64
+
65
+ [self.state.append(imresize(self.image, (84, 84))) for _ in range(self.fps - len(self.state) + 1)]
66
+ self.state = self.state[-self.fps:]
67
+
68
+ return np.transpose(self.state, [1, 2, 0]), reward, terminal
69
+
70
+ def get_num_actions(self):
71
+ return 3
72
+
73
+ def reset(self):
74
+ return self.reset_random()
75
+
76
+ def state_shape(self):
77
+ return (self.fps,) + self.output_shape
78
+
79
+ def show_state(self, i):
80
+ plt.imshow(self.image)
81
+ plt.imsave('image_'+str(i)+'.jpg', self.image)
82
+
83
+ def test():
84
+ env = CatchEnv2()
85
+ i = 0
86
+ for ep in range(1):
87
+ env.reset()
88
+ env.show_state(i)
89
+
90
+ state, reward, terminal = env.step(1)
91
+ while not terminal:
92
+ env.show_state(i)
93
+ state, reward, terminal = env.step(np.random.randint(0,2))
94
+ i += 1
95
+ #print(reward)
96
+
97
+ if __name__ == "__main__":
98
+ test()
old_code/experiment_2/catch_v3.py ADDED
@@ -0,0 +1,91 @@
1
+ import random
2
+ import numpy as np
3
+ from scipy.misc import imresize
+ from matplotlib import pyplot as plt  # needed by test() below
4
+
5
+ class CatchEnv3:
6
+ def __init__(self):
7
+ self.size = 21
8
+ self.image = np.zeros((self.size, self.size))
9
+ self.state = []
10
+ self.fps = 4
11
+ self.output_shape = (84, 84)
12
+
13
+ def reset_random(self):
14
+ self.image.fill(0)
15
+ self.pos = np.random.randint(2, self.size-2)
16
+ self.vx = np.random.randint(5) - 2
17
+ self.vy = 1
18
+ self.ballx, self.bally = np.random.randint(self.size), 4
19
+
20
+ self.image[self.bally, self.ballx] = 1
21
+ self.image[-5, self.pos - 2:self.pos + 3] = np.ones(5)
22
+
23
+ return self.step(2)[0]
24
+
25
+
26
+ def step(self, action):
27
+ def left():
28
+ if self.pos > 3:
29
+ self.pos -= 2
30
+ def right():
31
+ if self.pos < 17:
32
+ self.pos += 2
33
+ def noop():
34
+ pass
35
+ {0: left, 1: right, 2: noop}[action]()
36
+
37
+
38
+ self.image[self.bally, self.ballx] = 0
39
+ self.ballx += self.vx
40
+ self.bally += self.vy
41
+ if self.ballx > self.size - 1:
42
+ self.ballx -= 2 * (self.ballx - (self.size-1))
43
+ self.vx *= -1
44
+ elif self.ballx < 0:
45
+ self.ballx += 2 * (0 - self.ballx)
46
+ self.vx *= -1
47
+ self.image[self.bally, self.ballx] = 1
48
+
49
+ self.image[-5].fill(0)
50
+ self.image[-5, self.pos-2:self.pos+3] = np.ones(5)
51
+
52
+ terminal = self.bally == self.size - 1 - 4
53
+ reward = int(self.pos - 2 <= self.ballx <= self.pos + 2) if terminal else 0
54
+
55
+ [self.state.append(imresize(self.image, (84, 84))) for _ in range(self.fps - len(self.state) + 1)]
56
+ self.state = self.state[-self.fps:]
57
+
58
+ self.state[0] = self.state[0][::-1,:]
59
+ self.state[1] = self.state[1][::-1,:]
60
+ self.state[2] = self.state[2][::-1,:]
61
+ self.state[3] = self.state[3][::-1,:]
62
+
63
+ return np.transpose(self.state, [1, 2, 0]), reward, terminal
64
+
65
+ def get_num_actions(self):
66
+ return 3
67
+
68
+ def reset(self):
69
+ return self.reset_random()
70
+
71
+ def state_shape(self):
72
+ return (self.fps,) + self.output_shape
73
+
74
+
75
+ def test():
76
+ env = CatchEnv3()
77
+ i = 0
78
+ for ep in range(1):
79
+ env.reset()
80
+ state, reward, terminal = env.step(1)
81
+ while not terminal:
82
+ # CatchEnv3 has no show_state(); the frame is saved via plt.imsave below
83
+ state, reward, terminal = env.step(1)
84
+ state = np.squeeze(state)
85
+
86
+ plt.imsave('image_'+str(i)+'.jpg', state)
87
+
88
+ i += 1
89
+
90
+ if __name__ == "__main__":
91
+ test()
old_code/experiment_2/catch_v4.py ADDED
@@ -0,0 +1,88 @@
1
+ import random
2
+ import numpy as np
3
+ from scipy.ndimage import rotate
4
+ from scipy.misc import imresize
+ from matplotlib import pyplot as plt  # needed by show_state() below
5
+
6
+ class CatchEnv4:
7
+ def __init__(self):
8
+ self.size = 21
9
+ self.image = np.zeros((self.size, self.size))
10
+ self.state = []
11
+ self.fps = 4
12
+ self.output_shape = (84, 84)
13
+
14
+ def reset_random(self):
15
+ self.image.fill(0)
16
+ self.pos = np.random.randint(2, self.size-2)
17
+ self.vx = np.random.randint(5) - 2
18
+ self.vy = 1
19
+ self.ballx, self.bally = np.random.randint(self.size), 4
20
+
21
+ self.image[self.bally, self.ballx] = 1
22
+ self.image[-5, self.pos - 1:self.pos+1] = np.ones(1)
23
+
24
+ return self.step(2)[0]
25
+
26
+
27
+ def step(self, action):
28
+ def left():
29
+ if self.pos > 3:
30
+ self.pos -= 2
31
+ def right():
32
+ if self.pos < 17:
33
+ self.pos += 2
34
+ def noop():
35
+ pass
36
+ {0: left, 1: right, 2: noop}[action]()
37
+
38
+
39
+ self.image[self.bally, self.ballx] = 0
40
+ self.ballx += self.vx
41
+ self.bally += self.vy
42
+ if self.ballx > self.size - 1:
43
+ self.ballx -= 2 * (self.ballx - (self.size-1))
44
+ self.vx *= -1
45
+ elif self.ballx < 0:
46
+ self.ballx += 2 * (0 - self.ballx)
47
+ self.vx *= -1
48
+ self.image[self.bally, self.ballx] = 1
49
+
50
+ self.image[-5].fill(0)
51
+ self.image[-5, self.pos-1:self.pos+1] = np.ones(1)
52
+
53
+ terminal = self.bally == self.size - 2 - 4
54
+ reward = int(self.pos - 1 <= self.ballx <= self.pos + 1) if terminal else 0
55
+
56
+ [self.state.append(imresize(self.image, (84, 84))) for _ in range(self.fps - len(self.state) + 1)]
57
+ self.state = self.state[-self.fps:]
58
+
59
+ return np.transpose(self.state, [1, 2, 0]), reward, terminal
60
+
61
+ def get_num_actions(self):
62
+ return 3
63
+
64
+ def reset(self):
65
+ return self.reset_random()
66
+
67
+ def state_shape(self):
68
+ return (self.fps,) + self.output_shape
69
+
70
+ def show_state(self, i):
71
+ plt.imshow(self.image)
72
+ plt.imsave('image_'+str(i)+'.jpg', self.image)
73
+
74
+ def test():
75
+ env = CatchEnv4()
76
+ i = 0
77
+ for ep in range(1):
78
+ env.reset()
79
+ state, reward, terminal = env.step(1)
80
+ while not terminal:
81
+ state, reward, terminal = env.step(np.random.randint(0,2))
82
+ state = np.squeeze(state)
83
+ env.show_state(i)
84
+ i += 1
85
+ print(reward)
86
+
87
+ if __name__ == "__main__":
88
+ test()
old_code/experiment_2/train_catch_cnn_agent.py ADDED
@@ -0,0 +1,295 @@
1
+ import os
2
+ import math
3
+ import time
4
+ import gym
5
+ import random
6
+ import utils
7
+ import keras
8
+ import catch
9
+ import catch_v2
10
+ import catch_v3
11
+ import catch_v4
12
+
13
+ import numpy as np
14
+
15
+ from collections import deque
16
+ from matplotlib import pyplot as plt
17
+ from sklearn.preprocessing import OneHotEncoder
18
+
19
+ class ReplayBuffer():
20
+ """
21
+ Thank you: https://github.com/BY571/
22
+ """
23
+
24
+ def __init__(self, max_size):
25
+ self.max_size = max_size
26
+ self.buffer = []
27
+
28
+ def add_sample(self, states, actions, rewards):
29
+ episode = {"states": states, "actions":actions, "rewards": rewards, "summed_rewards":sum(rewards)}
30
+ self.buffer.append(episode)
31
+
32
+ def sort(self):
33
+ #sort buffer
34
+ self.buffer = sorted(self.buffer, key = lambda i: i["summed_rewards"],reverse=True)
35
+ # keep the max buffer size
36
+ self.buffer = self.buffer[:self.max_size]
37
+
38
+ def get_random_samples(self, batch_size):
39
+ self.sort()
40
+ idxs = np.random.randint(0, len(self.buffer), batch_size)
41
+ batch = [self.buffer[idx] for idx in idxs]
42
+ return batch
43
+
44
+ def get_n_best(self, n):
45
+ self.sort()
46
+ return self.buffer[:n]
47
+
48
+ def __len__(self):
49
+ return len(self.buffer)
50
+
51
+ class UpsideDownAgent():
52
+ def __init__(self, environment, approximator):
53
+ if environment == "Catch-v0":
54
+ self.environment = catch.CatchEnv()
55
+ elif environment == "Catch-v2":
+ self.environment = catch_v2.CatchEnv2()
+ elif environment == "Catch-v3":
+ self.environment = catch_v3.CatchEnv3()
+ elif environment == "Catch-v4":
+ self.environment = catch_v4.CatchEnv4()
61
+
62
+ self.approximator = approximator
63
+ self.state_size = (84, 84, 4)
64
+ self.action_size = 3
65
+ self.warm_up_episodes = 50
66
+ self.memory = ReplayBuffer(700)
67
+ self.last_few = 50
68
+ self.batch_size = 32
69
+ self.command_size = 2 # desired return + desired horizon
70
+ self.desired_return = 1
71
+ self.desired_horizon = 1
72
+ self.horizon_scale = 0.02
73
+ self.return_scale = 0.02
74
+
75
+ self.behaviour_function = utils.get_catch_behaviour_function(self.action_size)
76
+
77
+ self.testing_rewards = []
78
+ self.warm_up_buffer()
79
+
80
+ def warm_up_buffer(self):
81
+ print('Warming up')
82
+
83
+ for i in range(self.warm_up_episodes):
84
+
85
+ states = []
86
+ rewards = []
87
+ actions = []
88
+
89
+ dead = False
90
+ done = False
91
+ desired_return = 1
92
+ desired_horizon = 1
93
+
94
+ step, score, start_life = 0, 0, 5
95
+ observe = self.environment.reset()
96
+
97
+ observe, reward, terminal = self.environment.step(1)
98
+
99
+ state = utils.pre_processing(observe)
100
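+ # initial history: the first preprocessed frame repeated 4 times, reshaped to (1, 84, 84, 4)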
+ history = np.stack((state, state, state, state), axis=2)
101
+ history = np.reshape([history], (1, 84, 84, 4))
102
+
103
+
104
+ while not done:
105
+
106
+ states.append(history)
107
+ command = np.asarray([desired_return * self.return_scale, desired_horizon * self.horizon_scale])
108
+ command = np.reshape(command, [1, len(command)])
109
+
110
+ action = self.get_action(history, command)
111
+ actions.append(action)
112
+
113
+ next_state, reward, done = self.environment.step(action)
114
+ next_state = utils.pre_processing(observe)
115
+ next_state = np.reshape([next_state], (1, 84, 84, 1))
116
+ next_history = np.append(next_state, history[:, :, :, :3], axis = 3)
117
+
118
+ rewards.append(reward)
119
+
120
+ state = next_state
121
+ history = next_history
122
+
123
+ desired_return -= reward # Line 8 Algorithm 2
124
+ desired_horizon -= 1 # Line 9 Algorithm 2
125
+ desired_horizon = np.maximum(desired_horizon, 1)
126
+
127
+ self.memory.add_sample(states, actions, rewards)
128
+
129
+
130
+ def get_action(self, observation, command):
131
+ """
132
+ We will sample from the action distribution modeled by the Behavior Function
133
+ """
134
+
135
+ observation = np.float32(observation / 255.0)
136
+
137
+ action_probs = self.behaviour_function.predict([observation, command])
138
+ action = np.random.choice(np.arange(0, self.action_size), p=action_probs[0])
139
+
140
+ return action
141
+
142
+ def get_greedy_action(self, observation, command):
143
+
144
+ action_probs = self.behaviour_function.predict([observation, command])
145
+ action = np.argmax(action_probs)
146
+
147
+ return action
148
+
149
+ def train_behaviour_function(self):
150
+
151
+ random_episodes = self.memory.get_random_samples(self.batch_size)
152
+
153
+ training_observations = np.zeros((self.batch_size, self.state_size[0], self.state_size[1], self.state_size[2]))
154
+ training_commands = np.zeros((self.batch_size, 2))
155
+
156
+ y = []
157
+
158
+ for idx, episode in enumerate(random_episodes):
159
+ T = len(episode['states'])
160
+ t1 = np.random.randint(0, T-1)
161
+ t2 = np.random.randint(t1+1, T)
162
+
163
+ state = np.float32(episode['states'][t1] / 255.)
164
+ desired_return = sum(episode["rewards"][t1:t2])
165
+ desired_horizon = t2 -t1
166
+
167
+ target = episode['actions'][t1]
168
+
169
+ training_observations[idx] = state[0]
170
+ training_commands[idx] = np.asarray([desired_return*self.return_scale, desired_horizon*self.horizon_scale])
171
+ y.append(target)
172
+
173
+ _y = keras.utils.to_categorical(y, num_classes=self.action_size)
174
+
175
+ self.behaviour_function.fit([training_observations, training_commands], _y, verbose=0)
176
+
177
+
178
+ def sample_exploratory_commands(self):
179
+ best_episodes = self.memory.get_n_best(self.last_few)
180
+ exploratory_desired_horizon = np.mean([len(i["states"]) for i in best_episodes])
181
+
182
+ returns = [i["summed_rewards"] for i in best_episodes]
183
+ exploratory_desired_returns = np.random.uniform(np.mean(returns), np.mean(returns)+np.std(returns))
184
+
185
+ return [exploratory_desired_returns, exploratory_desired_horizon]
186
+
187
+ def generate_episode(self, environment, e, desired_return, desired_horizon, testing):
188
+
189
+ if environment == "Catch-v0":
190
+ env = catch.CatchEnv()
191
+ elif environment == "Catch-v2":
+ env = catch_v2.CatchEnv2()
+ elif environment == "Catch-v3":
+ env = catch_v3.CatchEnv3()
+ elif environment == "Catch-v4":
+ env = catch_v4.CatchEnv4()
197
+
198
+ tot_rewards = []
199
+
200
+ done = False
201
+ dead = False
202
+
203
+ scores = []
204
+ states = []
205
+ actions = []
206
+ rewards = []
207
+
208
+ step, score, start_life = 0, 0, 5
209
+
210
+ observe = env.reset()
211
+ observe, _, _ = env.step(1)
212
+
213
+ state = utils.pre_processing(observe)
214
+ history = np.stack((state, state, state, state), axis=2)
215
+ history = np.reshape([history], (1, 84, 84, 4))
216
+
217
+ while not done:
218
+ states.append(history)
219
+
220
+ command = np.asarray([desired_return * self.return_scale, desired_horizon * self.horizon_scale])
221
+ command = np.reshape(command, [1, len(command)])
222
+
223
+ if not testing:
224
+ action = self.get_action(history, command)
225
+ actions.append(action)
226
+ else:
227
+ action = self.get_greedy_action(history, command)
228
+
229
+ next_state, reward, done = env.step(action)
230
+ next_state = utils.pre_processing(observe)
231
+ next_state = np.reshape([next_state], (1, 84, 84, 1))
232
+ next_history = np.append(next_state, history[:, :, :, :3], axis = 3)
233
+
234
+
235
+ score += reward
236
+ history = next_history
237
+
238
+ desired_return -= reward # Line 8 Algorithm 2
239
+ desired_horizon -= 1 # Line 9 Algorithm 2
240
+ desired_horizon = np.maximum(desired_horizon, 1)
241
+
242
+ self.memory.add_sample(states, actions, rewards)
243
+ self.testing_rewards.append(score)
244
+
245
+ if testing:
246
+ print('Querying the model ...')
247
+ print('Testing score: {}'.format(score))
248
+
249
+ return score
250
+
251
+ def run_experiment():
252
+
253
+ import argparse
254
+
255
+ parser = argparse.ArgumentParser()
256
+
257
+ parser.add_argument('--approximator', type=str, default='neural_network')
258
+ parser.add_argument('--environment', type=str, default='PongDeterministic-v4')
259
+ parser.add_argument('--seed', type=int, default=1)
260
+
261
+ args = parser.parse_args()
262
+
263
+ approximator = args.approximator
264
+ environment = args.environment
265
+ seed = args.seed
266
+
267
+ training_episodes = 10
268
+ warm_up_episodes = 10
269
+ testing_returns = []
270
+
271
+ agent = UpsideDownAgent(environment, approximator)
272
+
273
+ for e in range(training_episodes):
274
+ print("Training Episode {}".format(e))
275
+
276
+ for i in range(100):
277
+ agent.train_behaviour_function()
278
+
279
+ print("Finished training B!")
280
+
281
+ for i in range(15):
282
+ exploratory_commands = agent.sample_exploratory_commands() # Line 5 Algorithm 1
283
+ desired_return = exploratory_commands[0]
284
+ desired_horizon = exploratory_commands[1]
285
+ agent.generate_episode(environment, e, desired_return, desired_horizon, False)
286
+
287
+ if e % 2 == 0:
288
+ for i in range(1):
289
+ r = agent.generate_episode(environment, e, desired_return, desired_horizon, True)
290
+ testing_returns.append(r)
291
+
292
+ exploratory_commands = agent.sample_exploratory_commands()
293
+
294
+ if __name__ == "__main__":
295
+ run_experiment()
old_code/experiment_2/utils.py ADDED
@@ -0,0 +1,121 @@
1
+ import os
2
+ import argparse
3
+ import pickle
4
+ import keras
5
+ import numpy as np
6
+
7
+ from keras.layers import Dense, Multiply, Input, Conv2D, Flatten
8
+ from keras.models import Sequential, Model
9
+ from keras.optimizers import Adam, RMSprop, SGD
10
+
11
+ from skimage.transform import resize
12
+ from skimage.color import rgb2gray
13
+
14
+ STORING_PATH = './results/'
15
+ MODELS_PATH = './trained_models/'
16
+
17
+ def save_results(environment, approximator, seed, rewards):
18
+ storing_path = os.path.join(STORING_PATH, environment, approximator, str(seed))
19
+ if not os.path.exists(storing_path):
20
+ os.makedirs(storing_path)
21
+
22
+ np.save(storing_path + '/' + 'upside_down_rewards.npy', rewards)
23
+
24
+ def get_functional_behaviour_function(state_size, command_size, action_size):
25
+ observation_input = keras.Input(shape=(state_size,))
26
+ linear_layer = Dense(64, activation='sigmoid')(observation_input)
27
+
28
+ command_input = keras.Input(shape=(command_size,))
29
+ sigmoidal_layer = Dense(64, activation='sigmoid')(command_input)
30
+
31
+ multiplied_layer = Multiply()([linear_layer, sigmoidal_layer])
32
+
33
+ layer_1 = Dense(64, activation='relu')(multiplied_layer)
34
+ layer_2 = Dense(64, activation='relu')(layer_1)
35
+ layer_3 = Dense(64, activation='relu')(layer_2)
36
+ layer_4 = Dense(64, activation='relu')(layer_3)
37
+ final_layer = Dense(action_size, activation='softmax')(layer_4)
38
+
39
+ model = Model(inputs=[observation_input, command_input], outputs=final_layer)
40
+ model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.001))
41
+
42
+ return model
43
+
44
+ def get_atari_behaviour_function(action_size):
45
+
46
+ print('Getting the model')
47
+
48
+ input_state = Input(shape=(84,84,4))
49
+
50
+ first_conv = Conv2D(
51
+ 32, (8, 8), strides=(4,4), activation='relu')(input_state)
52
+ second_conv = Conv2D(
53
+ 64, (4, 4), strides=(2,2), activation='relu')(first_conv)
54
+ third_conv = Conv2D(
55
+ 64, (3, 3), strides=(1,1), activation='relu')(second_conv)
56
+
57
+ flattened = Flatten()(third_conv)
58
+ dense_layer = Dense(512, activation='relu')(flattened)
59
+
60
+ command_input = keras.Input(shape=(2,))
61
+ sigmoidal_layer = Dense(512, activation='sigmoid')(command_input)
62
+
63
+ multiplied_layer = Multiply()([dense_layer, sigmoidal_layer])
64
+ final_layer = Dense(256, activation='relu')(multiplied_layer)
65
+
66
+ action_layer = Dense(action_size, activation='softmax')(final_layer)
67
+
68
+ model = Model(inputs=[input_state, command_input], outputs=action_layer)
69
+ model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.001, rho=0.95, epsilon=0.01))
70
+
71
+
72
+ print(model.summary())
73
+
74
+ return model
75
+
76
+ def get_catch_behaviour_function(action_size):
77
+
78
+ print('Getting the Catch-model')
79
+
80
+ input_state = Input(shape=(84,84,4))
81
+
82
+ first_conv = Conv2D(
83
+ 32, (8, 8), strides=(4,4), activation='relu')(input_state)
84
+ second_conv = Conv2D(
85
+ 64, (4, 4), strides=(2,2), activation='relu')(first_conv)
86
+ third_conv = Conv2D(
87
+ 64, (3, 3), strides=(1,1), activation='relu')(second_conv)
88
+
89
+ flattened = Flatten()(third_conv)
90
+ dense_layer = Dense(512, activation='relu')(flattened)
91
+
92
+ command_input = keras.Input(shape=(2,))
93
+ sigmoidal_layer = Dense(512, activation='sigmoid')(command_input)
94
+
95
+ multiplied_layer = Multiply()([dense_layer, sigmoidal_layer])
96
+ final_layer = Dense(256, activation='relu')(multiplied_layer)
97
+
98
+ action_layer = Dense(action_size, activation='softmax')(final_layer)
99
+
100
+ model = Model(inputs=[input_state, command_input], outputs=action_layer)
101
+ model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.001, rho=0.95, epsilon=0.01))
102
+
103
+
104
+ print(model.summary())
105
+
106
+ return model
107
+
108
+
109
+ def pre_processing(state):
110
+ processed_state = np.uint8(
111
+ resize(rgb2gray(state), (84, 84), mode='constant')*255)
112
+
113
+ return processed_state
114
+
115
+ def save_trained_model(environment, seed, model):
116
+ storing_path = os.path.join(MODELS_PATH, environment, str(seed))
117
+ if not os.path.exists(storing_path):
118
+ os.makedirs(storing_path)
119
+
120
+ model.save_weights(storing_path + '/' + 'trained_model.h5')
121
+
old_code/experiment_3/q_networks/a2c.py ADDED
@@ -0,0 +1,104 @@
1
+ import os
2
+ import sys
3
+ import gym
4
+ import utils
5
+
6
+ import numpy as np
7
+
8
+ from keras.layers import Dense
9
+ from keras.models import Sequential
10
+ from keras.optimizers import Adam
11
+
12
+ from matplotlib import pyplot as plt
13
+
14
+ class A2CAgent:
15
+ def __init__(self, state_size, action_size):
16
+ self.render = False
17
+ self.state_size = state_size
18
+ self.action_size = action_size
19
+ self.value_size = 1
20
+ self.discount_factor = 0.99
21
+ self.actor_lr = 0.001
22
+ self.critic_lr = 0.005
23
+ self.actor = self.build_actor()
24
+ self.critic = self.build_critic()
25
+
26
+ def build_actor(self):
27
+ actor = Sequential()
28
+ actor.add(Dense(24, input_dim=self.state_size, activation='relu',
29
+ kernel_initializer='he_uniform'))
30
+ actor.add(Dense(self.action_size, activation='softmax',
31
+ kernel_initializer='he_uniform'))
32
+ actor.compile(loss='categorical_crossentropy',
33
+ optimizer=Adam(lr=self.actor_lr))
34
+
35
+ return actor
36
+
37
+ def build_critic(self):
38
+ critic = Sequential()
39
+ critic.add(Dense(24, input_dim=self.state_size, activation='relu',
40
+ kernel_initializer='he_uniform'))
41
+ critic.add(Dense(self.value_size, activation='linear',
42
+ kernel_initializer='he_uniform'))
43
+ critic.compile(loss="mse", optimizer=Adam(lr=self.critic_lr))
44
+
45
+ return critic
46
+
47
+ def get_action(self, state):
48
+ policy = self.actor.predict(state, batch_size=1).flatten()
49
+
50
+ return np.random.choice(self.action_size, 1, p=policy)[0]
51
+
52
+ def train_model(self, state, action, reward, next_state, done):
53
+ target = np.zeros((1, self.value_size))
54
+ advantages = np.zeros((1, self.action_size))
55
+
56
+ value = self.critic.predict(state)[0]
57
+ next_value = self.critic.predict(next_state)[0]
58
+
59
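+ # one-step advantage actor-critic update: the TD error r + gamma * V(s') - V(s) is used as the advantage of the taken action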
+ if done:
60
+ advantages[0][action] = reward - value
61
+ target[0][0] = reward
62
+ else:
63
+ advantages[0][action] = reward + self.discount_factor * (next_value) - value
64
+ target[0][0] = reward + self.discount_factor * next_value
65
+
66
+ self.actor.fit(state, advantages, epochs=1, verbose=0)
67
+ self.critic.fit(state, target, epochs=1, verbose=0)
68
+
69
+ def run_A2C():
70
+ episodes = 500
71
+ seed = 1
72
+ results = []
73
+ game = 'CartPole-v0'
74
+
75
+ env = gym.make(game)
76
+
77
+ state_size = env.observation_space.shape[0]
78
+ action_size = env.action_space.n
79
+
80
+ agent = A2CAgent(state_size, action_size)
81
+
82
+ for e in range(episodes):
83
+ done = False
84
+ score = 0
85
+ state = env.reset()
86
+ state = np.reshape(state, [1, state_size])
87
+
88
+ while not done:
89
+ action = agent.get_action(state)
90
+ next_state, reward, done, info = env.step(action)
91
+ next_state = np.reshape(next_state, [1, state_size])
92
+ agent.train_model(state, action, reward, next_state, done)
93
+
94
+ score += reward
95
+ state = next_state
96
+
97
+ results.append(score)
98
+
99
+ utils.save_trained_model(game, seed, 'A2C', agent.actor)
100
+
101
+ plt.plot(results)
102
+ plt.show()
103
+
104
+ run_A2C()
old_code/experiment_3/q_networks/buffers/CartPole-v0/1/DQN/memory_buffer.p ADDED
The diff for this file is too large to render. See raw diff
 
old_code/experiment_3/q_networks/ddqn.py ADDED
@@ -0,0 +1,136 @@
1
+ import os
2
+
3
+ import sys
4
+ import gym
5
+ import random
6
+ import utils
7
+ import numpy as np
8
+
9
+ from collections import deque
10
+ from keras.layers import Dense
11
+ from keras.optimizers import Adam
12
+ from keras.models import Sequential
13
+
14
+ from matplotlib import pyplot as plt
15
+
16
+ class DoubleDQNAgent:
17
+ def __init__(self, state_size, action_size):
18
+ self.render = False
19
+ self.load_model = False
20
+ self.state_size = state_size
21
+ self.action_size = action_size
22
+ self.discount_factor = 0.99
23
+ self.learning_rate = 0.001
24
+ self.epsilon = 1.0
25
+ self.epsilon_decay = 0.999
26
+ self.epsilon_min = 0.01
27
+ self.batch_size = 64
28
+ self.train_start = 1000
29
+ self.memory = deque(maxlen=2000)
30
+
31
+ self.model = self.build_model()
32
+ self.target_model = self.build_model()
33
+
34
+ self.update_target_model()
35
+
36
+ def build_model(self):
37
+ model = Sequential()
38
+ model.add(Dense(24, input_dim=self.state_size, activation='relu',
39
+ kernel_initializer='he_uniform'))
40
+ model.add(Dense(24, activation='relu',
41
+ kernel_initializer='he_uniform'))
42
+ model.add(Dense(self.action_size, activation='linear',
43
+ kernel_initializer='he_uniform'))
44
+ model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
45
+
46
+ return model
47
+
48
+ def update_target_model(self):
49
+ self.target_model.set_weights(self.model.get_weights())
50
+
51
+ def get_action(self, state):
52
+ if np.random.rand() <= self.epsilon:
53
+ return random.randrange(self.action_size)
54
+ else:
55
+ q_value = self.model.predict(state)
56
+ return np.argmax(q_value[0])
57
+
58
+ def append_sample(self, state, action, reward, next_state, done):
59
+ self.memory.append((state, action, reward, next_state, done))
60
+ if self.epsilon > self.epsilon_min:
61
+ self.epsilon *= self.epsilon_decay
62
+
63
+ def train_model(self):
64
+ if len(self.memory) < self.train_start:
65
+ return
66
+ batch_size = min(self.batch_size, len(self.memory))
67
+ mini_batch = random.sample(self.memory, batch_size)
68
+
69
+ update_input = np.zeros((batch_size, self.state_size))
70
+ update_target = np.zeros((batch_size, self.state_size))
71
+ action, reward, done = [], [], []
72
+
73
+ for i in range(batch_size):
74
+ update_input[i] = mini_batch[i][0]
75
+ action.append(mini_batch[i][1])
76
+ reward.append(mini_batch[i][2])
77
+ update_target[i] = mini_batch[i][3]
78
+ done.append(mini_batch[i][4])
79
+
80
+ target = self.model.predict(update_input)
81
+ target_next = self.model.predict(update_target)
82
+ target_val = self.target_model.predict(update_target)
83
+
84
+ for i in range(self.batch_size):
85
+ if done[i]:
86
+ target[i][action[i]] = reward[i]
87
+ else:
88
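+ # Double DQN: the online network selects the next action, the target network evaluates it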
+ a = np.argmax(target_next[i])
89
+ target[i][action[i]] = reward[i] + self.discount_factor * (
90
+ target_val[i][a])
91
+
92
+ self.model.fit(update_input, target, batch_size=self.batch_size,
93
+ epochs=1, verbose=0)
94
+
95
+
96
+ def run_DDQN():
97
+ episodes = 500
98
+ seed = 1
99
+ results = []
100
+ game = 'CartPole-v0'
101
+
102
+ env = gym.make(game)
103
+
104
+ state_size = env.observation_space.shape[0]
105
+ action_size = env.action_space.n
106
+
107
+ agent = DoubleDQNAgent(state_size, action_size)
108
+
109
+ for e in range(episodes):
110
+ done = False
111
+ score = 0
112
+ state = env.reset()
113
+ state = np.reshape(state, [1, state_size])
114
+
115
+ while not done:
116
+ action = agent.get_action(state)
117
+ next_state, reward, done, info = env.step(action)
118
+ next_state = np.reshape(next_state, [1, state_size])
119
+
120
+ agent.append_sample(state, action, reward, next_state, done)
121
+ agent.train_model()
122
+ score += reward
123
+ state = next_state
124
+
125
+ if done:
126
+ agent.update_target_model()
127
+
128
+ results.append(score)
129
+
130
+ utils.save_trained_model(game, seed, 'DDQN', agent.model)
131
+
132
+ plt.plot(results)
133
+ plt.show()
134
+
135
+ run_DDQN()
136
+
old_code/experiment_3/q_networks/dqn.py ADDED
@@ -0,0 +1,130 @@
1
+ import os
2
+ import sys
3
+ import gym
4
+ import random
5
+ import utils
6
+ import numpy as np
7
+
8
+ from collections import deque
9
+
10
+ from keras.layers import Dense
11
+ from keras.optimizers import Adam
12
+ from keras.models import Sequential
13
+ from matplotlib import pyplot as plt
14
+
15
+ class DQNAgent:
16
+ def __init__(self, state_size, action_size):
17
+ self.state_size = state_size
18
+ self.action_size = action_size
19
+ self.discount_factor = 0.99
20
+ self.learning_rate = 0.001
21
+ self.epsilon = 1.0
22
+ self.epsilon_decay = 0.999
23
+ self.epsilon_min = 0.01
24
+ self.batch_size = 64
25
+ self.train_start = 1000
26
+ self.memory = deque(maxlen=2000)
27
+ self.model = self.build_model()
28
+ self.target_model = self.build_model()
29
+
30
+ self.update_target_model()
31
+
32
+ def build_model(self):
33
+ model = Sequential()
34
+ model.add(Dense(24, input_dim=self.state_size, activation='relu',
35
+ kernel_initializer='he_uniform'))
36
+ model.add(Dense(24, activation='relu',
37
+ kernel_initializer='he_uniform'))
38
+ model.add(Dense(self.action_size, activation='linear',
39
+ kernel_initializer='he_uniform'))
40
+ model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
41
+ return model
42
+
43
+ def update_target_model(self):
44
+ self.target_model.set_weights(self.model.get_weights())
45
+
46
+ def get_action(self, state):
47
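+ # Epsilon-greedy exploration: random action with probability epsilon,
+ # otherwise act greedily with respect to the online Q-network.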
+ if np.random.rand() <= self.epsilon:
48
+ return random.randrange(self.action_size)
49
+ else:
50
+ q_value = self.model.predict(state)
51
+ return np.argmax(q_value[0])
52
+
53
+ def append_sample(self, state, action, reward, next_state, done):
54
+ self.memory.append((state, action, reward, next_state, done))
55
+ if self.epsilon > self.epsilon_min:
56
+ self.epsilon *= self.epsilon_decay
57
+
58
+ def train_model(self):
59
+ if len(self.memory) < self.train_start:
60
+ return
61
+
62
+ batch_size = min(self.batch_size, len(self.memory))
63
+ mini_batch = random.sample(self.memory, batch_size)
64
+
65
+ update_input = np.zeros((batch_size, self.state_size))
66
+ update_target = np.zeros((batch_size, self.state_size))
67
+ action, reward, done = [], [], []
68
+
69
+ for i in range(batch_size):
70
+ update_input[i] = mini_batch[i][0]
71
+ action.append(mini_batch[i][1])
72
+ reward.append(mini_batch[i][2])
73
+ update_target[i] = mini_batch[i][3]
74
+ done.append(mini_batch[i][4])
75
+
76
+ target = self.model.predict(update_input)
77
+
78
+ target_val = self.target_model.predict(update_target)
79
+
80
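+ # Standard DQN target: bootstrap with the maximum target-network Q-value of the next state.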
+ for i in range(batch_size):
81
+ if done[i]:
82
+ target[i][action[i]] = reward[i]
83
+ else:
84
+ target[i][action[i]] = reward[i] + self.discount_factor * (
85
+ np.amax(target_val[i]))
86
+
87
+ self.model.fit(update_input, target, batch_size=self.batch_size,
88
+ epochs=1, verbose=0)
89
+
90
+ def run_DQN():
91
+ episodes = 500
92
+ seed = 1
93
+ results = []
94
+ game = 'CartPole-v0'
95
+
96
+ env = gym.make(game)
97
+
98
+ state_size = env.observation_space.shape[0]
99
+ action_size = env.action_space.n
100
+
101
+ agent = DQNAgent(state_size, action_size)
102
+
103
+ for e in range(episodes):
104
+ done = False
105
+ score = 0
106
+ state = env.reset()
107
+ state = np.reshape(state, [1, state_size])
108
+
109
+ while not done:
110
+ action = agent.get_action(state)
111
+ next_state, reward, done, info = env.step(action)
112
+ next_state = np.reshape(next_state, [1, state_size])
113
+
114
+ agent.append_sample(state, action, reward, next_state, done)
115
+ agent.train_model()
116
+
117
+ score += reward
118
+ state = next_state
119
+
120
+ if done:
121
+ agent.update_target_model()
122
+
123
+ results.append(score)
124
+
125
+ utils.save_trained_model(game, seed, 'DQN', agent.model)
126
+
127
+ plt.plot(results)
128
+ plt.show()
129
+
130
+ run_DQN()
old_code/experiment_3/q_networks/prepare_buffer.py ADDED
@@ -0,0 +1,104 @@
1
+ import os
2
+ import sys
3
+ import gym
4
+ import random
5
+ import numpy as np
6
+ import pickle
7
+
8
+ from collections import deque
9
+
10
+ from keras.layers import Dense
11
+ from keras.optimizers import Adam
12
+ from keras.models import Sequential
13
+ from matplotlib import pyplot as plt
14
+
15
+ WEIGHTS_PATH = './trained_models/CartPole-v0/1/'
16
+ BUFFER_PATH = './buffers/CartPole-v0/1/'
17
+
18
+ class Agent:
19
+ def __init__(self, algorithm, state_size, action_size):
20
+ self.algorithm = algorithm
21
+ self.render = False
22
+ self.state_size = state_size
23
+ self.action_size = action_size
24
+ self.memory = deque()
25
+
26
+ if self.algorithm in ['DQN', 'DDQN', 'DQV']:
27
+ self.model = self.build_model()
28
+ self.model.load_weights(os.path.join(WEIGHTS_PATH, self.algorithm, 'trained_model.h5'))
29
+ else:
30
+ self.model = self.build_actor()
31
+ self.model.load_weights(os.path.join(WEIGHTS_PATH, self.algorithm, 'trained_model.h5'))
32
+
33
+
34
+ def build_actor(self):
35
+ actor = Sequential()
36
+ actor.add(Dense(24, input_dim=self.state_size, activation='relu', kernel_initializer='he_uniform'))
37
+ actor.add(Dense(self.action_size, activation='softmax', kernel_initializer='he_uniform'))
38
+
39
+ return actor
40
+
41
+ def build_model(self):
42
+ model = Sequential()
43
+ model.add(Dense(24, input_dim=self.state_size, activation='relu',
44
+ kernel_initializer='he_uniform'))
45
+ model.add(Dense(24, activation='relu',
46
+ kernel_initializer='he_uniform'))
47
+ model.add(Dense(self.action_size, activation='linear',
48
+ kernel_initializer='he_uniform'))
49
+
50
+ return model
51
+
52
+ def get_action(self, state):
53
+ if self.algorithm == 'A2C':
54
+ policy = self.model.predict(state, batch_size=1).flatten()
55
+
56
+ return np.random.choice(self.action_size, 1, p=policy)[0]
57
+
58
+ else:
59
+ q_value = self.model.predict(state)
60
+ return np.argmax(q_value[0])
61
+
62
+ def append_sample(self, state, action, reward, next_state, done):
63
+ self.memory.append((state, action, reward, next_state, done))
64
+
65
+ def save_buffer(self):
66
+ if not os.path.exists(os.path.join(BUFFER_PATH, self.algorithm)):
67
+ os.makedirs(os.path.join(BUFFER_PATH, self.algorithm))
68
+
69
+ with open(os.path.join(BUFFER_PATH, self.algorithm, 'memory_buffer.p'), 'wb') as filehandler:
70
+ pickle.dump(self.memory, filehandler)
71
+
72
+ def fill_buffer(algorithm):
73
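+ # Roll out the pretrained policy, storing (s, a, r, s', done) transitions until the buffer
+ # exceeds max_len, then pickle it for offline training.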
+ max_len = 10000
74
+ results = []
75
+ game = 'CartPole-v0'
76
+
77
+ env = gym.make(game)
78
+
79
+ state_size = env.observation_space.shape[0]
80
+ action_size = env.action_space.n
81
+
82
+ agent = Agent(algorithm, state_size, action_size)
83
+
84
+ while True:
85
+ done = False
86
+ score = 0
87
+ state = env.reset()
88
+ state = np.reshape(state, [1, state_size])
89
+
90
+ while not done:
91
+ action = agent.get_action(state)
92
+ next_state, reward, done, info = env.step(action)
93
+ next_state = np.reshape(next_state, [1, state_size])
94
+
95
+ agent.append_sample(state, action, reward, next_state, done)
96
+
97
+ score += reward
98
+ state = next_state
99
+
100
+ if len(agent.memory) > max_len:
101
+ agent.save_buffer()
102
+ break
103
+
104
+ fill_buffer('DQN')
old_code/experiment_3/q_networks/train_offline_a2c.py ADDED
@@ -0,0 +1,123 @@
1
+ import os
2
+ import sys
3
+ import gym
4
+ import pickle
5
+ import random
6
+ import utils
7
+
8
+ import numpy as np
9
+
10
+ from keras.layers import Dense
11
+ from keras.models import Sequential
12
+ from keras.optimizers import Adam
13
+
14
+ from matplotlib import pyplot as plt
15
+
16
+ MEMORY_PATH = './buffers/CartPole-v0/1/A2C/'
17
+
18
+ class A2CAgent:
19
+ def __init__(self, state_size, action_size):
20
+ self.render = False
21
+ self.state_size = state_size
22
+ self.action_size = action_size
23
+ self.value_size = 1
24
+ self.discount_factor = 0.99
25
+ self.actor_lr = 0.0001
26
+ self.critic_lr = 0.005
27
+ self.actor = self.build_actor()
28
+ self.critic = self.build_critic()
29
+ self.get_memory_buffer()
30
+
31
+ def build_actor(self):
32
+ actor = Sequential()
33
+ actor.add(Dense(24, input_dim=self.state_size, activation='relu',
34
+ kernel_initializer='he_uniform'))
35
+ actor.add(Dense(self.action_size, activation='softmax',
36
+ kernel_initializer='he_uniform'))
37
+ actor.compile(loss='categorical_crossentropy',
38
+ optimizer=Adam(lr=self.actor_lr))
39
+
40
+ return actor
41
+
42
+ def build_critic(self):
43
+ critic = Sequential()
44
+ critic.add(Dense(24, input_dim=self.state_size, activation='relu',
45
+ kernel_initializer='he_uniform'))
46
+ critic.add(Dense(self.value_size, activation='linear',
47
+ kernel_initializer='he_uniform'))
48
+ critic.compile(loss="mse", optimizer=Adam(lr=self.critic_lr))
49
+
50
+ return critic
51
+
52
+ def get_memory_buffer(self):
53
+ memory_buffer_path = os.path.join(MEMORY_PATH, 'memory_buffer.p')
54
+ with open(memory_buffer_path, 'rb') as f:
55
+ self.memory = pickle.load(f)
56
+
57
+ def get_action(self, state):
58
+ policy = self.actor.predict(state, batch_size=1).flatten()
59
+
60
+ return np.random.choice(self.action_size, 1, p=policy)[0]
61
+
62
+ def train_model(self):
63
+ mini_batch = random.sample(self.memory, 1)
64
+
65
+ state = mini_batch[0][0]
66
+ action = mini_batch[0][1]
67
+ reward = mini_batch[0][2]
68
+ next_state = mini_batch[0][3]
69
+ done = mini_batch[0][4]
70
+
71
+ target = np.zeros((1, self.value_size))
72
+ advantages = np.zeros((1, self.action_size))
73
+
74
+ value = self.critic.predict(state)[0]
75
+ next_value = self.critic.predict(next_state)[0]
76
+
77
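+ # One-step actor-critic targets: advantage = r + gamma * V(s') - V(s) and
+ # critic target = r + gamma * V(s'); bootstrapping is dropped at terminal states.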
+ if done:
78
+ advantages[0][action] = reward - value
79
+ target[0][0] = reward
80
+ else:
81
+ advantages[0][action] = reward + self.discount_factor * (next_value) - value
82
+ target[0][0] = reward + self.discount_factor * next_value
83
+
84
+ self.actor.fit(state, advantages, epochs=1, verbose=0)
85
+ self.critic.fit(state, target, epochs=1, verbose=0)
86
+
87
+ def run_A2C():
88
+ episodes = 500
89
+ seed = 2
90
+ results = []
91
+ game = 'CartPole-v0'
92
+
93
+ env = gym.make(game)
94
+
95
+ state_size = env.observation_space.shape[0]
96
+ action_size = env.action_space.n
97
+
98
+ agent = A2CAgent(state_size, action_size)
99
+
100
+ for e in range(episodes):
101
+ done = False
102
+ score = 0
103
+ state = env.reset()
104
+ state = np.reshape(state, [1, state_size])
105
+
106
+ while not done:
107
+ action = agent.get_action(state)
108
+ next_state, reward, done, info = env.step(action)
109
+ next_state = np.reshape(next_state, [1, state_size])
110
+ agent.train_model()
111
+
112
+ score += reward
113
+ state = next_state
114
+
115
+ print(score)
116
+ results.append(score)
117
+
118
+ utils.save_offline_results(game, 'A2C', seed, results)
119
+
120
+ plt.plot(results)
121
+ plt.show()
122
+
123
+ run_A2C()
old_code/experiment_3/q_networks/train_offline_ddqn.py ADDED
@@ -0,0 +1,126 @@
1
+ import os
2
+ import sys
3
+ import gym
4
+ import pickle
5
+ import random
6
+ import utils
7
+ import numpy as np
8
+
9
+ from collections import deque
10
+
11
+ from keras.layers import Dense
12
+ from keras.optimizers import Adam
13
+ from keras.models import Sequential
14
+ from matplotlib import pyplot as plt
15
+
16
+ MEMORY_PATH = './buffers/CartPole-v0/1/DDQN/'
17
+
18
+ class DDQNAgent:
19
+ def __init__(self, state_size, action_size):
20
+ self.render = False
21
+ self.state_size = state_size
22
+ self.action_size = action_size
23
+ self.discount_factor = 0.99
24
+ self.learning_rate = 0.00001
25
+ self.batch_size = 256
26
+ self.model = self.build_model()
27
+ self.target_model = self.build_model()
28
+
29
+ self.get_memory_buffer()
30
+ self.update_target_model()
31
+
32
+ def build_model(self):
33
+ model = Sequential()
34
+ model.add(Dense(24, input_dim=self.state_size, activation='relu',
35
+ kernel_initializer='he_uniform'))
36
+ model.add(Dense(24, activation='relu',
37
+ kernel_initializer='he_uniform'))
38
+ model.add(Dense(self.action_size, activation='linear',
39
+ kernel_initializer='he_uniform'))
40
+ model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
41
+
42
+ return model
43
+
44
+ def update_target_model(self):
45
+ self.target_model.set_weights(self.model.get_weights())
46
+
47
+ def get_memory_buffer(self):
48
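+ # Offline RL setting: the replay buffer is loaded once from disk and never grows during training.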
+ memory_buffer_path = os.path.join(MEMORY_PATH, 'memory_buffer.p')
49
+
50
+ with open(memory_buffer_path, 'rb') as f:
51
+ self.memory = pickle.load(f)
52
+
53
+ print(len(self.memory))
54
+
55
+ def get_action(self, state):
56
+ q_value = self.model.predict(state)
57
+ return np.argmax(q_value[0])
58
+
59
+ def train_model(self):
60
+ batch_size = min(self.batch_size, len(self.memory))
61
+ mini_batch = random.sample(self.memory, batch_size)
62
+
63
+ update_input = np.zeros((batch_size, self.state_size))
64
+ update_target = np.zeros((batch_size, self.state_size))
65
+ action, reward, done = [], [], []
66
+
67
+ for i in range(batch_size):
68
+ update_input[i] = mini_batch[i][0]
69
+ action.append(mini_batch[i][1])
70
+ reward.append(mini_batch[i][2])
71
+ update_target[i] = mini_batch[i][3]
72
+ done.append(mini_batch[i][4])
73
+
74
+ target = self.model.predict(update_input)
75
+ target_next = self.model.predict(update_target)
76
+ target_val = self.target_model.predict(update_target)
77
+
78
+ for i in range(batch_size):
79
+ if done[i]:
80
+ target[i][action[i]] = reward[i]
81
+ else:
82
+ a = np.argmax(target_next[i])
83
+ target[i][action[i]] = reward[i] + self.discount_factor * (
84
+ target_val[i][a])
85
+
86
+ self.model.fit(update_input, target, batch_size=self.batch_size, epochs=1, verbose=0)
87
+
88
+
89
+ def run_DDQN():
90
+ episodes = 500
91
+ seed = 2
92
+ results = []
93
+ game = 'CartPole-v0'
94
+
95
+ env = gym.make(game)
96
+
97
+ state_size = env.observation_space.shape[0]
98
+ action_size = env.action_space.n
99
+
100
+ agent = DDQNAgent(state_size, action_size)
101
+
102
+ for e in range(episodes):
103
+ done = False
104
+ score = 0
105
+ state = env.reset()
106
+ state = np.reshape(state, [1, state_size])
107
+
108
+ while not done:
109
+ action = agent.get_action(state)
110
+ next_state, reward, done, info = env.step(action)
111
+ next_state = np.reshape(next_state, [1, state_size])
112
+
113
+ agent.train_model()
114
+
115
+ score += reward
116
+ state = next_state
117
+
118
+ print(score)
119
+ results.append(score)
120
+
121
+ utils.save_offline_results(game, 'DDQN', seed, results)
122
+
123
+ plt.plot(results)
124
+ plt.show()
125
+
126
+ run_DDQN()
old_code/experiment_3/q_networks/train_offline_dqn.py ADDED
@@ -0,0 +1,124 @@
1
+ import os
2
+ import sys
3
+ import gym
4
+ import pickle
5
+ import random
6
+ import utils
7
+
8
+ import numpy as np
9
+
10
+ from collections import deque
11
+
12
+ from keras.layers import Dense
13
+ from keras.optimizers import Adam
14
+ from keras.models import Sequential
15
+ from matplotlib import pyplot as plt
16
+
17
+ MEMORY_PATH = './buffers/CartPole-v0/1/DQN/'
18
+
19
+ class DQNAgent:
20
+ def __init__(self, state_size, action_size):
21
+ self.render = False
22
+ self.state_size = state_size
23
+ self.action_size = action_size
24
+ self.discount_factor = 0.99
25
+ self.learning_rate = 0.00001
26
+ self.batch_size = 256
27
+ self.model = self.build_model()
28
+ self.target_model = self.build_model()
29
+
30
+ self.get_memory_buffer()
31
+ self.update_target_model()
32
+
33
+ def build_model(self):
34
+ model = Sequential()
35
+ model.add(Dense(24, input_dim=self.state_size, activation='relu',
36
+ kernel_initializer='he_uniform'))
37
+ model.add(Dense(24, activation='relu',
38
+ kernel_initializer='he_uniform'))
39
+ model.add(Dense(self.action_size, activation='linear',
40
+ kernel_initializer='he_uniform'))
41
+ model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
42
+
43
+ return model
44
+
45
+ def update_target_model(self):
46
+ self.target_model.set_weights(self.model.get_weights())
47
+
48
+ def get_memory_buffer(self):
49
+ memory_buffer_path = os.path.join(MEMORY_PATH, 'memory_buffer.p')
50
+
51
+ with open(memory_buffer_path, 'rb') as f:
52
+ self.memory = pickle.load(f)
53
+
54
+ print(len(self.memory))
55
+
56
+ def get_action(self, state):
57
+ q_value = self.model.predict(state)
58
+ return np.argmax(q_value[0])
59
+
60
+ def train_model(self):
61
+ batch_size = min(self.batch_size, len(self.memory))
62
+ mini_batch = random.sample(self.memory, batch_size)
63
+
64
+ update_input = np.zeros((batch_size, self.state_size))
65
+ update_target = np.zeros((batch_size, self.state_size))
66
+ action, reward, done = [], [], []
67
+
68
+ for i in range(batch_size):
69
+ update_input[i] = mini_batch[i][0]
70
+ action.append(mini_batch[i][1])
71
+ reward.append(mini_batch[i][2])
72
+ update_target[i] = mini_batch[i][3]
73
+ done.append(mini_batch[i][4])
74
+
75
+ target = self.model.predict(update_input)
76
+ target_val = self.target_model.predict(update_target)
77
+
78
+ for i in range(batch_size):
79
+ if done[i]:
80
+ target[i][action[i]] = reward[i]
81
+ else:
82
+ target[i][action[i]] = reward[i] + self.discount_factor * (np.amax(target_val[i]))
83
+
84
+ self.model.fit(update_input, target, batch_size=self.batch_size,
85
+ epochs=1, verbose=0)
86
+
87
+ def run_DQN():
88
+ episodes = 500
89
+ seed = 2
90
+ results = []
91
+ game = 'CartPole-v0'
92
+
93
+ env = gym.make(game)
94
+
95
+ state_size = env.observation_space.shape[0]
96
+ action_size = env.action_space.n
97
+
98
+ agent = DQNAgent(state_size, action_size)
99
+
100
+ for e in range(episodes):
101
+ done = False
102
+ score = 0
103
+ state = env.reset()
104
+ state = np.reshape(state, [1, state_size])
105
+
106
+ while not done:
107
+ action = agent.get_action(state)
108
+ next_state, reward, done, info = env.step(action)
109
+ next_state = np.reshape(next_state, [1, state_size])
110
+
111
+ agent.train_model()
112
+
113
+ score += reward
114
+ state = next_state
115
+
116
+ print(score)
117
+ results.append(score)
118
+
119
+ utils.save_offline_results(game, 'DQN', seed, results)
120
+
121
+ plt.plot(results)
122
+ plt.show()
123
+
124
+ run_DQN()
old_code/experiment_3/q_networks/utils.py ADDED
@@ -0,0 +1,29 @@
1
+ import os
2
+ import argparse
3
+ import pickle
4
+ import keras
5
+ import numpy as np
6
+
7
+ STORING_PATH = '../offline_rl_results/'
8
+ MODELS_PATH = './trained_models/'
9
+
10
+ def save_results(environment, approximator, seed, rewards):
11
+ storing_path = os.path.join(STORING_PATH, environment, approximator, str(seed))
12
+ if not os.path.exists(storing_path):
13
+ os.makedirs(storing_path)
14
+
15
+ np.save(storing_path + '/' + 'upside_down_rewards.npy', rewards)
16
+
17
+ def save_trained_model(environment, seed, algorithm, model):
18
+ storing_path = os.path.join(MODELS_PATH, environment, str(seed), algorithm)
19
+ if not os.path.exists(storing_path):
20
+ os.makedirs(storing_path)
21
+
22
+ model.save_weights(storing_path + '/' + 'trained_model.h5')
23
+
24
+ def save_offline_results(environment, algorithm, seed, returns):
25
+ storing_path = os.path.join(STORING_PATH, algorithm, str(seed))
26
+ if not os.path.exists(storing_path):
27
+ os.makedirs(storing_path)
28
+
29
+ np.save(storing_path + '/rewards.npy', returns)
old_code/experiment_3/upside_down/prepare_offline_buffer.py ADDED
@@ -0,0 +1,157 @@
1
+ import os
2
+ import math
3
+ import time
4
+ import gym
5
+ import random
6
+ import utils
7
+ import keras
8
+ import numpy as np
9
+
10
+ from collections import deque
11
+ from matplotlib import pyplot as plt
12
+
13
+
14
+ class ReplayBuffer():
15
+ """
16
+ Thank you: https://github.com/BY571/
17
+ """
18
+
19
+ def __init__(self, max_size):
20
+ self.max_size = max_size
21
+ self.buffer = []
22
+
23
+ def add_sample(self, states, actions, rewards):
24
+ episode = {"states": states, "actions":actions, "rewards": rewards, "summed_rewards":sum(rewards)}
25
+ self.buffer.append(episode)
26
+
27
+ def sort(self):
28
+ #sort buffer
29
+ self.buffer = sorted(self.buffer, key = lambda i: i["summed_rewards"],reverse=True)
30
+ # keep the max buffer size
31
+ self.buffer = self.buffer[:self.max_size]
32
+
33
+ def get_random_samples(self, batch_size):
34
+ self.sort()
35
+
36
+ idxs = np.random.randint(0, len(self.buffer), batch_size)
37
+ batch = [self.buffer[idx] for idx in idxs]
38
+
39
+ return batch
40
+
41
+ def get_n_best(self, n):
42
+ self.sort()
43
+ return self.buffer[:n]
44
+
45
+ def __len__(self):
46
+ return len(self.buffer)
47
+
48
+ class UpsideDownAgent():
49
+ def __init__(self, environment):
50
+ self.environment = gym.make(environment)
51
+ self.state_size = self.environment.observation_space.shape[0]
52
+ self.action_size = self.environment.action_space.n
53
+ self.memory = ReplayBuffer(700)
54
+ self.last_few = 75
55
+ self.batch_size = 32
56
+ self.command_size = 2 # desired return + desired horizon
57
+ self.desired_return = 1
58
+ self.desired_horizon = 1
59
+ self.horizon_scale = 0.02
60
+ self.return_scale = 0.02
61
+ self.testing_state = 0
62
+
63
+ self.behaviour_function = utils.get_functional_behaviour_function(self.state_size, self.command_size, self.action_size, True)
64
+
65
+ self.testing_rewards = []
66
+
67
+ def get_action(self, observation, command):
68
+ """
69
+ We will sample from the action distribution modeled by the Behavior Function
70
+ """
71
+
72
+ action_probs = self.behaviour_function.predict([observation, command])
73
+ action = np.random.choice(np.arange(0, self.action_size), p=action_probs[0])
74
+
75
+ return action
76
+
77
+ def get_greedy_action(self, observation, command):
78
+
79
+ action_probs = self.behaviour_function.predict([observation, command])
80
+ action = np.argmax(action_probs)
81
+
82
+ return action
83
+
84
+
85
+ def sample_exploratory_commands(self):
86
+ best_episodes = self.memory.get_n_best(self.last_few)
87
+ exploratory_desired_horizon = np.mean([len(i["states"]) for i in best_episodes])
88
+
89
+ returns = [i["summed_rewards"] for i in best_episodes]
90
+ exploratory_desired_returns = np.random.uniform(np.mean(returns), np.mean(returns)+np.std(returns))
91
+
92
+ return [exploratory_desired_returns, exploratory_desired_horizon]
93
+
94
+ def generate_offline_episodes(self, environment, e, desired_return, desired_horizon):
95
+
96
+ env = gym.make(environment)
97
+ tot_rewards = []
98
+ done = False
99
+
100
+ score = 0
101
+ state = env.reset()
102
+
103
+ scores = []
104
+ states = []
105
+ actions = []
106
+ rewards = []
107
+
108
+ while not done:
109
+ state = np.reshape(state, [1, self.state_size])
110
+ states.append(state)
111
+
112
+ observation = state
113
+
114
+ command = np.asarray([desired_return * self.return_scale, desired_horizon * self.horizon_scale])
115
+ command = np.reshape(command, [1, len(command)])
116
+
117
+ action = self.get_action(observation, command)
118
+ actions.append(action)
119
+
120
+ next_state, reward, done, info = env.step(action)
121
+ next_state = np.reshape(next_state, [1, self.state_size])
122
+
123
+ rewards.append(reward)
124
+ score += reward
125
+
126
+ state = next_state
127
+
128
+ desired_return -= reward # Line 8 Algorithm 2
129
+ desired_horizon -= 1 # Line 9 Algorithm 2
130
+ desired_horizon = np.maximum(desired_horizon, 1)
131
+
132
+ self.memory.add_sample(states, actions, rewards)
133
+
134
+ print('Testing score: {}'.format(score))
135
+
136
+ def save_buffer(self, environment, seed):
137
+ utils.save_buffer(environment, seed, self.memory.buffer)
138
+
139
+ def run_experiment():
140
+
141
+ environment = 'CartPole-v0'
142
+ seed = 1
143
+
144
+ offline_episodes = 700
145
+ returns = []
146
+
147
+ agent = UpsideDownAgent(environment)
148
+
149
+ for e in range(offline_episodes):
150
+ tmp_r = []
151
+ r = agent.generate_offline_episodes(environment, e, 200, 200)
152
+ tmp_r.append(r)
153
+
154
+ agent.save_buffer(environment, seed)
155
+
156
+ if __name__ == "__main__":
157
+ run_experiment()
old_code/experiment_3/upside_down/train_agent.py ADDED
@@ -0,0 +1,248 @@
1
+ import os
2
+ import math
3
+ import time
4
+ import gym
5
+ import random
6
+ import utils
7
+ import keras
8
+ import numpy as np
9
+
10
+ from collections import deque
11
+ from matplotlib import pyplot as plt
12
+
13
+
14
+ class ReplayBuffer():
15
+ """
16
+ Thank you: https://github.com/BY571/
17
+ """
18
+
19
+ def __init__(self, max_size):
20
+ self.max_size = max_size
21
+ self.buffer = []
22
+
23
+ def add_sample(self, states, actions, rewards):
24
+ episode = {"states": states, "actions":actions, "rewards": rewards, "summed_rewards":sum(rewards)}
25
+ self.buffer.append(episode)
26
+
27
+ def sort(self):
28
+ #sort buffer
29
+ self.buffer = sorted(self.buffer, key = lambda i: i["summed_rewards"],reverse=True)
30
+ # keep the max buffer size
31
+ self.buffer = self.buffer[:self.max_size]
32
+
33
+ def get_random_samples(self, batch_size):
34
+ self.sort()
35
+
36
+ idxs = np.random.randint(0, len(self.buffer), batch_size)
37
+ batch = [self.buffer[idx] for idx in idxs]
38
+
39
+ return batch
40
+
41
+ def get_n_best(self, n):
42
+ self.sort()
43
+ return self.buffer[:n]
44
+
45
+ def __len__(self):
46
+ return len(self.buffer)
47
+
48
+ class UpsideDownAgent():
49
+ def __init__(self, environment):
50
+ self.environment = gym.make(environment)
51
+ self.state_size = self.environment.observation_space.shape[0]
52
+ self.action_size = self.environment.action_space.n
53
+ self.warm_up_episodes = 50
54
+ self.render = False
55
+ self.memory = ReplayBuffer(700)
56
+ self.last_few = 75
57
+ self.batch_size = 32
58
+ self.command_size = 2 # desired return + desired horizon
59
+ self.desired_return = 1
60
+ self.desired_horizon = 1
61
+ self.horizon_scale = 0.02
62
+ self.return_scale = 0.02
63
+ self.testing_state = 0
64
+
65
+ self.behaviour_function = utils.get_functional_behaviour_function(self.state_size, self.command_size, self.action_size, False)
66
+
67
+ self.testing_rewards = []
68
+ self.warm_up_buffer()
69
+
70
+ def warm_up_buffer(self):
71
+
72
+ for i in range(self.warm_up_episodes):
73
+ state = self.environment.reset()
74
+ states = []
75
+ rewards = []
76
+ actions = []
77
+ done = False
78
+ desired_return = 1
79
+ desired_horizon = 1
80
+
81
+ while not done:
82
+
83
+ state = np.reshape(state, [1, self.state_size])
84
+ states.append(state)
85
+
86
+ observation = state
87
+
88
+ command = np.asarray([desired_return * self.return_scale, desired_horizon * self.horizon_scale])
89
+
90
+ command = np.reshape(command, [1, len(command)])
91
+
92
+ action = self.get_action(observation, command)
93
+ actions.append(action)
94
+
95
+ next_state, reward, done, info = self.environment.step(action)
96
+ next_state = np.reshape(next_state, [1, self.state_size])
97
+
98
+ rewards.append(reward)
99
+
100
+ state = next_state
101
+
102
+ desired_return -= reward # Line 8 Algorithm 2
103
+ desired_horizon -= 1 # Line 9 Algorithm 2
104
+ desired_horizon = np.maximum(desired_horizon, 1)
105
+
106
+ self.memory.add_sample(states, actions, rewards)
107
+
108
+
109
+ def get_action(self, observation, command):
110
+ """
111
+ We will sample from the action distribution modeled by the Behavior Function
112
+ """
113
+
114
+ action_probs = self.behaviour_function.predict([observation, command])
115
+ action = np.random.choice(np.arange(0, self.action_size), p=action_probs[0])
116
+
117
+ return action
118
+
119
+ def get_greedy_action(self, observation, command):
120
+
121
+ action_probs = self.behaviour_function.predict([observation, command])
122
+ action = np.argmax(action_probs)
123
+
124
+ return action
125
+
126
+ def train_behaviour_function(self):
127
+
128
+ random_episodes = self.memory.get_random_samples(self.batch_size)
129
+
130
+ training_observations = np.zeros((self.batch_size, self.state_size))
131
+ training_commands = np.zeros((self.batch_size, 2))
132
+
133
+ y = []
134
+
135
+ for idx, episode in enumerate(random_episodes):
136
+ T = len(episode['states'])
137
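+ # Hindsight relabelling: pick a random segment [t1, t2) of the episode; the command becomes the
+ # return and horizon actually achieved over that segment, and the target is the action taken at t1.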
+ t1 = np.random.randint(0, T-1)
138
+ t2 = np.random.randint(t1+1, T)
139
+
140
+ state = episode['states'][t1]
141
+ desired_return = sum(episode["rewards"][t1:t2])
142
+ desired_horizon = t2 -t1
143
+
144
+ target = episode['actions'][t1]
145
+
146
+ training_observations[idx] = state[0]
147
+ training_commands[idx] = np.asarray([desired_return*self.return_scale, desired_horizon*self.horizon_scale])
148
+ y.append(target)
149
+
150
+ _y = keras.utils.to_categorical(y)
151
+
152
+ self.behaviour_function.fit([training_observations, training_commands], _y, verbose=0)
153
+
154
+
155
+ def sample_exploratory_commands(self):
156
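+ # Exploratory commands (Line 5, Algorithm 1): desired horizon is the mean length of the best
+ # episodes; desired return is drawn uniformly from [mean, mean + std] of their returns.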
+ best_episodes = self.memory.get_n_best(self.last_few)
157
+ exploratory_desired_horizon = np.mean([len(i["states"]) for i in best_episodes])
158
+
159
+ returns = [i["summed_rewards"] for i in best_episodes]
160
+ exploratory_desired_returns = np.random.uniform(np.mean(returns), np.mean(returns)+np.std(returns))
161
+
162
+ return [exploratory_desired_returns, exploratory_desired_horizon]
163
+
164
+ def generate_episode(self, environment, e, desired_return, desired_horizon, testing):
165
+
166
+ env = gym.make(environment)
167
+ tot_rewards = []
168
+ done = False
169
+
170
+ score = 0
171
+ state = env.reset()
172
+
173
+ scores = []
174
+ states = []
175
+ actions = []
176
+ rewards = []
177
+
178
+ while not done:
179
+ state = np.reshape(state, [1, self.state_size])
180
+ states.append(state)
181
+
182
+ observation = state
183
+
184
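+ # The behaviour function is conditioned on the command (desired return, desired horizon),
+ # both rescaled by fixed scale factors.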
+ command = np.asarray([desired_return * self.return_scale, desired_horizon * self.horizon_scale])
185
+ command = np.reshape(command, [1, len(command)])
186
+
187
+ if not testing:
188
+ action = self.get_action(observation, command)
189
+ actions.append(action)
190
+ else:
191
+ action = self.get_greedy_action(observation, command)
192
+
193
+ next_state, reward, done, info = env.step(action)
194
+ next_state = np.reshape(next_state, [1, self.state_size])
195
+
196
+ rewards.append(reward)
197
+ score += reward
198
+
199
+ state = next_state
200
+
201
+ desired_return -= reward # Line 8 Algorithm 2
202
+ desired_horizon -= 1 # Line 9 Algorithm 2
203
+ desired_horizon = np.maximum(desired_horizon, 1)
204
+
205
+ self.memory.add_sample(states, actions, rewards)
206
+
207
+ self.testing_rewards.append(score)
208
+
209
+ if testing:
210
+ print('Querying the model ...')
211
+ print('Testing score: {}'.format(score))
212
+
213
+ return score
214
+
215
+ def run_experiment():
216
+
217
+ environment = 'CartPole-v0'
218
+ seed = 1
219
+ episodes = 500
220
+
221
+ returns = []
222
+
223
+ agent = UpsideDownAgent(environment)
224
+
225
+ for e in range(episodes):
226
+ for i in range(100):
227
+ agent.train_behaviour_function()
228
+
229
+ for i in range(15):
230
+ tmp_r = []
231
+ exploratory_commands = agent.sample_exploratory_commands() # Line 5 Algorithm 1
232
+ desired_return = exploratory_commands[0]
233
+ desired_horizon = exploratory_commands[1]
234
+ r = agent.generate_episode(environment, e, desired_return, desired_horizon, False)
235
+ tmp_r.append(r)
236
+
237
+ print(np.mean(tmp_r))
238
+ returns.append(np.mean(tmp_r))
239
+
240
+ exploratory_commands = agent.sample_exploratory_commands()
241
+
242
+ agent.generate_episode(environment, 1, 200, 200, True)
243
+
244
+ utils.save_results(environment, 'upside_down_agent', seed, returns)
245
+ utils.save_trained_model(environment, seed, agent.behaviour_function)
246
+
247
+ if __name__ == "__main__":
248
+ run_experiment()
old_code/experiment_3/upside_down/train_offline_agent.py ADDED
@@ -0,0 +1,196 @@
1
+ import os
2
+ import math
3
+ import time
4
+ import gym
5
+ import random
6
+ import utils
7
+ import keras
8
+ import numpy as np
9
+
10
+ from collections import deque
11
+ from matplotlib import pyplot as plt
12
+
13
+ class ReplayBuffer():
14
+ """
15
+ Thank you: https://github.com/BY571/
16
+ """
17
+
18
+ def __init__(self, max_size):
19
+ self.max_size = max_size
20
+ self.buffer = np.load('./buffers/CartPole-v0/1/memory_buffer.npy')
21
+
22
+ def sort(self):
23
+ #sort buffer
24
+ self.buffer = sorted(self.buffer, key = lambda i: i["summed_rewards"],reverse=True)
25
+ # keep the max buffer size
26
+ self.buffer = self.buffer[:self.max_size]
27
+
28
+ def get_random_samples(self, batch_size):
29
+ self.sort()
30
+
31
+ idxs = np.random.randint(0, len(self.buffer), batch_size)
32
+ batch = [self.buffer[idx] for idx in idxs]
33
+
34
+ return batch
35
+
36
+ def get_n_best(self, n):
37
+ self.sort()
38
+ return self.buffer[:n]
39
+
40
+ def __len__(self):
41
+ return len(self.buffer)
42
+
43
+
44
+ class UpsideDownAgent():
45
+ def __init__(self, environment):
46
+ self.environment = gym.make(environment)
47
+ self.state_size = self.environment.observation_space.shape[0]
48
+ self.action_size = self.environment.action_space.n
49
+ self.memory = ReplayBuffer(700)
50
+ self.last_few = 75
51
+ self.batch_size = 32
52
+ self.command_size = 2 # desired return + desired horizon
53
+ self.desired_return = 1
54
+ self.desired_horizon = 1
55
+ self.horizon_scale = 0.02
56
+ self.return_scale = 0.02
57
+ self.testing_state = 0
58
+
59
+ self.behaviour_function = utils.get_functional_behaviour_function(self.state_size, self.command_size, self.action_size, False)
60
+
61
+ self.testing_rewards = []
62
+
63
+
64
+ def get_action(self, observation, command):
65
+ """
66
+ We will sample from the action distribution modeled by the Behavior Function
67
+ """
68
+
69
+ action_probs = self.behaviour_function.predict([observation, command])
70
+ action = np.random.choice(np.arange(0, self.action_size), p=action_probs[0])
71
+
72
+ return action
73
+
74
+ def get_greedy_action(self, observation, command):
75
+
76
+ action_probs = self.behaviour_function.predict([observation, command])
77
+ action = np.argmax(action_probs)
78
+
79
+ return action
80
+
81
+ def train_behaviour_function(self):
82
+
83
+ random_episodes = self.memory.get_random_samples(self.batch_size)
84
+
85
+ training_observations = np.zeros((self.batch_size, self.state_size))
86
+ training_commands = np.zeros((self.batch_size, 2))
87
+
88
+ y = []
89
+
90
+ for idx, episode in enumerate(random_episodes):
91
+ T = len(episode['states'])
92
+ t1 = np.random.randint(0, T-1)
93
+ t2 = np.random.randint(t1+1, T)
94
+
95
+ state = episode['states'][t1]
96
+ desired_return = sum(episode["rewards"][t1:t2])
97
+ desired_horizon = t2 -t1
98
+
99
+ target = episode['actions'][t1]
100
+
101
+ training_observations[idx] = state[0]
102
+ training_commands[idx] = np.asarray([desired_return*self.return_scale, desired_horizon*self.horizon_scale])
103
+ y.append(target)
104
+
105
+ _y = keras.utils.to_categorical(y)
106
+
107
+ self.behaviour_function.fit([training_observations, training_commands], _y, verbose=0)
108
+
109
+
110
+ def sample_exploratory_commands(self):
111
+ best_episodes = self.memory.get_n_best(self.last_few)
112
+ exploratory_desired_horizon = np.mean([len(i["states"]) for i in best_episodes])
113
+
114
+ returns = [i["summed_rewards"] for i in best_episodes]
115
+ exploratory_desired_returns = np.random.uniform(np.mean(returns), np.mean(returns)+np.std(returns))
116
+
117
+ return [exploratory_desired_returns, exploratory_desired_horizon]
118
+
119
+ def generate_episode(self, environment, e, desired_return, desired_horizon, testing):
120
+
121
+ env = gym.make(environment)
122
+ tot_rewards = []
123
+ done = False
124
+
125
+ score = 0
126
+ state = env.reset()
127
+
128
+ scores = []
129
+ states = []
130
+ actions = []
131
+ rewards = []
132
+
133
+ while not done:
134
+ state = np.reshape(state, [1, self.state_size])
135
+ states.append(state)
136
+
137
+ observation = state
138
+
139
+ command = np.asarray([desired_return * self.return_scale, desired_horizon * self.horizon_scale])
140
+ command = np.reshape(command, [1, len(command)])
141
+
142
+ if not testing:
143
+ action = self.get_action(observation, command)
144
+ actions.append(action)
145
+ else:
146
+ action = self.get_greedy_action(observation, command)
147
+
148
+ next_state, reward, done, info = env.step(action)
149
+ next_state = np.reshape(next_state, [1, self.state_size])
150
+
151
+ rewards.append(reward)
152
+ score += reward
153
+
154
+ state = next_state
155
+
156
+ desired_return -= reward # Line 8 Algorithm 2
157
+ desired_horizon -= 1 # Line 9 Algorithm 2
158
+ desired_horizon = np.maximum(desired_horizon, 1)
159
+
160
+ self.testing_rewards.append(score)
161
+
162
+ if testing:
163
+ print('Querying the model ...')
164
+ print('Testing score: {}'.format(score))
165
+
166
+ return score
167
+
168
+ def run_experiment():
169
+
170
+ environment = 'CartPole-v0'
171
+ seed = 1
172
+
173
+ episodes = 500
174
+ returns = []
175
+
176
+ agent = UpsideDownAgent(environment)
177
+
178
+ for e in range(episodes):
179
+ for i in range(100):
180
+ agent.train_behaviour_function()
181
+
182
+ for i in range(15):
183
+ tmp_r = []
184
+ r = agent.generate_episode(environment, e, 200, 200, False)
185
+ tmp_r.append(r)
186
+
187
+ print(np.mean(tmp_r))
188
+ returns.append(np.mean(tmp_r))
189
+
190
+ agent.generate_episode(environment, 1, 200, 200, True)
191
+
192
+ utils.save_results(environment, 'upside_down_agent', seed, returns)
193
+
194
+
195
+ if __name__ == "__main__":
196
+ run_experiment()
old_code/experiment_3/upside_down/utils.py ADDED
@@ -0,0 +1,131 @@
1
+ import os
2
+ import argparse
3
+ import pickle
4
+ import keras
5
+ import numpy as np
6
+
7
+ from keras.layers import Dense, Multiply, Input, Conv2D, Flatten
8
+ from keras.models import Sequential, Model
9
+ from keras.optimizers import Adam, RMSprop, SGD
10
+
11
+ from skimage.transform import resize
12
+ from skimage.color import rgb2gray
13
+
14
+ STORING_PATH = './results/'
15
+ MODELS_PATH = './trained_models/'
16
+ BUFFERS_PATH = './buffers/'
17
+
18
+ def save_results(environment, approximator, seed, rewards):
19
+ storing_path = os.path.join(STORING_PATH, environment, approximator, str(seed))
20
+ if not os.path.exists(storing_path):
21
+ os.makedirs(storing_path)
22
+
23
+ np.save(storing_path + '/' + 'upside_down_rewards.npy', rewards)
24
+
25
+ def get_functional_behaviour_function(state_size, command_size, action_size, pretrained):
26
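+ # Behaviour function B(observation, command): both inputs are embedded separately and combined
+ # with an element-wise product, so the command gates the observation features.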
+ observation_input = keras.Input(shape=(state_size,))
27
+ linear_layer = Dense(64, activation='sigmoid')(observation_input)
28
+
29
+ command_input = keras.Input(shape=(command_size,))
30
+ sigmoidal_layer = Dense(64, activation='sigmoid')(command_input)
31
+
32
+ multiplied_layer = Multiply()([linear_layer, sigmoidal_layer])
33
+
34
+ layer_1 = Dense(64, activation='relu')(multiplied_layer)
35
+ layer_2 = Dense(64, activation='relu')(layer_1)
36
+ layer_3 = Dense(64, activation='relu')(layer_2)
37
+ layer_4 = Dense(64, activation='relu')(layer_3)
38
+ final_layer = Dense(action_size, activation='softmax')(layer_4)
39
+
40
+ model = Model(inputs=[observation_input, command_input], outputs=final_layer)
41
+ model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.001))
42
+
43
+ if pretrained:
44
+ model.load_weights(os.path.join(MODELS_PATH, 'CartPole-v0', '1', 'trained_model.h5'))
45
+
46
+ return model
47
+
48
+ def get_atari_behaviour_function(action_size):
49
+
50
+ print('Getting the model')
51
+
52
+ input_state = Input(shape=(84,84,4))
53
+
54
+ first_conv = Conv2D(
55
+ 32, (8, 8), strides=(4,4), activation='relu')(input_state)
56
+ second_conv = Conv2D(
57
+ 64, (4, 4), strides=(2,2), activation='relu')(first_conv)
58
+ third_conv = Conv2D(
59
+ 64, (3, 3), strides=(1,1), activation='relu')(second_conv)
60
+
61
+ flattened = Flatten()(third_conv)
62
+ dense_layer = Dense(512, activation='relu')(flattened)
63
+
64
+ command_input = keras.Input(shape=(2,))
65
+ sigmoidal_layer = Dense(512, activation='sigmoid')(command_input)
66
+
67
+ multiplied_layer = Multiply()([dense_layer, sigmoidal_layer])
68
+ final_layer = Dense(256, activation='relu')(multiplied_layer)
69
+
70
+ action_layer = Dense(action_size, activation='softmax')(final_layer)
71
+
72
+ model = Model(inputs=[input_state, command_input], outputs=action_layer)
73
+ model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.001, rho=0.95, epsilon=0.01))
74
+
75
+
76
+ print(model.summary())
77
+
78
+ return model
79
+
80
+ def get_catch_behaviour_function(action_size):
81
+
82
+ print('Getting the Catch-model')
83
+
84
+ input_state = Input(shape=(84,84,4))
85
+
86
+ first_conv = Conv2D(
87
+ 32, (8, 8), strides=(4,4), activation='relu')(input_state)
88
+ second_conv = Conv2D(
89
+ 64, (4, 4), strides=(2,2), activation='relu')(first_conv)
90
+ third_conv = Conv2D(
91
+ 64, (3, 3), strides=(1,1), activation='relu')(second_conv)
92
+
93
+ flattened = Flatten()(third_conv)
94
+ dense_layer = Dense(512, activation='relu')(flattened)
95
+
96
+ command_input = keras.Input(shape=(2,))
97
+ sigmoidal_layer = Dense(512, activation='sigmoid')(command_input)
98
+
99
+ multiplied_layer = Multiply()([dense_layer, sigmoidal_layer])
100
+ final_layer = Dense(256, activation='relu')(multiplied_layer)
101
+
102
+ action_layer = Dense(action_size, activation='softmax')(final_layer)
103
+
104
+ model = Model(inputs=[input_state, command_input], outputs=action_layer)
105
+ model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.001, rho=0.95, epsilon=0.01))
106
+
107
+
108
+ print(model.summary())
109
+
110
+ return model
111
+
112
+
113
+ def pre_processing(state):
114
+ processed_state = np.uint8(
115
+ resize(rgb2gray(state), (84, 84), mode='constant')*255)
116
+
117
+ return processed_state
118
+
119
+ def save_trained_model(environment, seed, model):
120
+ storing_path = os.path.join(MODELS_PATH, environment, str(seed))
121
+ if not os.path.exists(storing_path):
122
+ os.makedirs(storing_path)
123
+
124
+ model.save_weights(storing_path + '/' + 'trained_model.h5')
125
+
126
+ def save_buffer(environment, seed, memory_buffer):
127
+ storing_path = os.path.join(BUFFERS_PATH, environment, str(seed))
128
+ if not os.path.exists(storing_path):
129
+ os.makedirs(storing_path)
130
+
131
+ np.save(os.path.join(storing_path,'memory_buffer.npy'), memory_buffer)
old_code/train_atari_agent.py ADDED
@@ -0,0 +1,321 @@
1
+ import os
2
+ import math
3
+ import time
4
+ import gym
5
+ import random
6
+ import utils
7
+ import keras
8
+ import numpy as np
9
+
10
+ from collections import deque
11
+ from matplotlib import pyplot as plt
12
+ from sklearn.preprocessing import OneHotEncoder
13
+
14
+ class ReplayBuffer():
15
+ """
16
+ Thank you: https://github.com/BY571/
17
+ """
18
+
19
+ def __init__(self, max_size):
20
+ self.max_size = max_size
21
+ self.buffer = []
22
+
23
+ def add_sample(self, states, actions, rewards):
24
+ episode = {"states": states, "actions":actions, "rewards": rewards, "summed_rewards":sum(rewards)}
25
+ self.buffer.append(episode)
26
+
27
+ def sort(self):
28
+ #sort buffer
29
+ self.buffer = sorted(self.buffer, key = lambda i: i["summed_rewards"],reverse=True)
30
+ # keep the max buffer size
31
+ self.buffer = self.buffer[:self.max_size]
32
+
33
+ def get_random_samples(self, batch_size):
34
+ self.sort()
35
+ idxs = np.random.randint(0, len(self.buffer), batch_size)
36
+ batch = [self.buffer[idx] for idx in idxs]
37
+ return batch
38
+
39
+ def get_n_best(self, n):
40
+ self.sort()
41
+ return self.buffer[:n]
42
+
43
+ def __len__(self):
44
+ return len(self.buffer)
45
+
46
+ class UpsideDownAgent():
47
+ def __init__(self, environment, approximator):
48
+ self.environment = gym.make(environment)
49
+ self.approximator = approximator
50
+ self.state_size = (84, 84, 4)
51
+ self.action_size = 3
52
+ self.warm_up_episodes = 1 #50
53
+ self.render = False
54
+ self.memory = ReplayBuffer(700)
55
+ self.last_few = 50
56
+ self.batch_size = 256
57
+ self.command_size = 2 # desired return + desired horizon
58
+ self.desired_return = 1
59
+ self.desired_horizon = 1
60
+ self.horizon_scale = 0.02
61
+ self.return_scale = 0.02
62
+
63
+ self.behaviour_function = utils.get_atari_behaviour_function(self.action_size)
64
+
65
+ self.testing_rewards = []
66
+ self.warm_up_buffer()
67
+
68
+ def warm_up_buffer(self):
69
+ print('Warming up')
70
+
71
+ for i in range(self.warm_up_episodes):
72
+
73
+ states = []
74
+ rewards = []
75
+ actions = []
76
+
77
+ dead = False
78
+ done = False
79
+ desired_return = 1
80
+ desired_horizon = 1
81
+
82
+ step, score, start_life = 0, 0, 5
83
+ observe = self.environment.reset()
84
+
85
+ for _ in range(random.randint(1, 30)):
86
+ observe, _, _, _ = self.environment.step(1)
87
+
88
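+ # Frames are converted to 84x84 grayscale and the last four are stacked along the channel axis.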
+ state = utils.pre_processing(observe)
89
+ history = np.stack((state, state, state, state), axis=2)
90
+ history = np.reshape([history], (1, 84, 84, 4))
91
+
92
+
93
+ while not done:
94
+
95
+ states.append(history)
96
+ command = np.asarray([desired_return * self.return_scale, desired_horizon * self.horizon_scale])
97
+ command = np.reshape(command, [1, len(command)])
98
+
99
+ action = self.get_action(history, command)
100
+ actions.append(action)
101
+
102
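+ # Map the agent's reduced action set {0, 1, 2} onto ALE actions {1, 2, 3}
+ # (for the default Pong environment: FIRE, RIGHT/up, LEFT/down).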
+ if action == 0:
103
+ real_action = 1
104
+ elif action == 1:
105
+ real_action = 2
106
+ else:
107
+ real_action = 3
108
+
109
+ next_state, reward, done, info = self.environment.step(real_action)
110
+ next_state = utils.pre_processing(next_state)
111
+ next_state = np.reshape([next_state], (1, 84, 84, 1))
112
+ next_history = np.append(next_state, history[:, :, :, :3], axis = 3)
113
+
114
+ rewards.append(reward)
115
+
116
+ state = next_state
117
+
118
+ if start_life > info['ale.lives']:
119
+ dead = True
120
+ start_life = info['ale.lives']
121
+
122
+ if dead:
123
+ dead = False
124
+ else:
125
+ history = next_history
126
+
127
+ desired_return -= reward # Line 8 Algorithm 2
128
+ desired_horizon -= 1 # Line 9 Algorithm 2
129
+ desired_horizon = np.maximum(desired_horizon, 1)
130
+
131
+ self.memory.add_sample(states, actions, rewards)
132
+
133
+
134
+ def get_action(self, observation, command):
135
+ """
136
+ We will sample from the action distribution modeled by the Behavior Function
137
+ """
138
+
139
+ observation = np.float32(observation / 255.0)
140
+
141
+ action_probs = self.behaviour_function.predict([observation, command])
142
+ action = np.random.choice(np.arange(0, self.action_size), p=action_probs[0])
143
+
144
+ return action
145
+
146
+ def get_greedy_action(self, observation, command):
147
+
148
+ action_probs = self.behaviour_function.predict([observation, command])
149
+ action = np.argmax(action_probs)
150
+
151
+ return action
152
+
153
+ def train_behaviour_function(self):
154
+
155
+ random_episodes = self.memory.get_random_samples(self.batch_size)
156
+
157
+ training_observations = np.zeros((self.batch_size, self.state_size[0], self.state_size[1], self.state_size[2]))
158
+ training_commands = np.zeros((self.batch_size, 2))
159
+
160
+ y = []
161
+
162
+ for idx, episode in enumerate(random_episodes):
163
+ T = len(episode['states'])
164
+ t1 = np.random.randint(0, T-1)
165
+ t2 = np.random.randint(t1+1, T)
166
+
167
+ state = np.float32(episode['states'][t1] / 255.)
168
+ desired_return = sum(episode["rewards"][t1:t2])
169
+ desired_horizon = t2 -t1
170
+
171
+ target = episode['actions'][t1]
172
+
173
+ training_observations[idx] = state[0]
174
+ training_commands[idx] = np.asarray([desired_return*self.return_scale, desired_horizon*self.horizon_scale])
175
+ y.append(target)
176
+
177
+ _y = keras.utils.to_categorical(y, num_classes=self.action_size)
178
+
179
+ self.behaviour_function.fit([training_observations, training_commands], _y, verbose=0)
180
+
181
+
182
+ def sample_exploratory_commands(self):
183
+ best_episodes = self.memory.get_n_best(self.last_few)
184
+ exploratory_desired_horizon = np.mean([len(i["states"]) for i in best_episodes])
185
+
186
+ returns = [i["summed_rewards"] for i in best_episodes]
187
+ exploratory_desired_returns = np.random.uniform(np.mean(returns), np.mean(returns)+np.std(returns))
188
+
189
+ return [exploratory_desired_returns, exploratory_desired_horizon]
190
+
191
+ def generate_episode(self, environment, e, desired_return, desired_horizon, testing):
192
+
193
+ env = gym.make(environment)
194
+
195
+ tot_rewards = []
196
+
197
+ done = False
198
+ dead = False
199
+
200
+ scores = []
201
+ states = []
202
+ actions = []
203
+ rewards = []
204
+
205
+ step, score, start_life = 0, 0, 5
206
+
207
+ observe = env.reset()
208
+ for _ in range(random.randint(1, 30)):
209
+ observe, _, _, _ = env.step(1)
210
+
211
+ state = utils.pre_processing(observe)
212
+ history = np.stack((state, state, state, state), axis=2)
213
+ history = np.reshape([history], (1, 84, 84, 4))
214
+
215
+ while not done:
216
+ states.append(history)
217
+
218
+ command = np.asarray([desired_return * self.return_scale, desired_horizon * self.horizon_scale])
219
+ command = np.reshape(command, [1, len(command)])
220
+
221
+ if not testing:
222
+ action = self.get_action(history, command)
223
+ actions.append(action)
224
+ else:
225
+ action = self.get_greedy_action(history, command)
226
+
227
+ if action == 0:
228
+ real_action = 1
229
+ elif action == 1:
230
+ real_action = 2
231
+ else:
232
+ real_action = 3
233
+
234
+ next_state, reward, done, info = env.step(real_action)
235
+ next_state = utils.pre_processing(next_state)
236
+ next_state = np.reshape([next_state], (1, 84, 84, 1))
237
+ next_history = np.append(next_state, history[:, :, :, :3], axis = 3)
238
+
239
+ clipped_reward = np.clip(reward, -1, 1)
240
+ rewards.append(clipped_reward)
241
+
242
+ score += reward
243
+
244
+ if start_life > info['ale.lives']:
245
+ dead = True
246
+ start_life = info['ale.lives']
247
+
248
+ if dead:
249
+ dead = False
250
+ else:
251
+ history = next_history
252
+
253
+ desired_return -= reward # Line 8 Algorithm 2
254
+ desired_horizon -= 1 # Line 9 Algorithm 2
255
+ desired_horizon = np.maximum(desired_horizon, 1)
256
+
257
+ self.memory.add_sample(states, actions, rewards)
258
+
259
+ self.testing_rewards.append(score)
260
+
261
+ if testing:
262
+ print('Querying the model ...')
263
+ print('Testing score: {}'.format(score))
264
+
265
+ return score
266
+
267
+ def run_experiment():
268
+
269
+ import argparse
270
+
271
+ parser = argparse.ArgumentParser()
272
+
273
+ parser.add_argument('--approximator', type=str, default='neural_network')
274
+ parser.add_argument('--environment', type=str, default='PongDeterministic-v4')
275
+ parser.add_argument('--seed', type=int, default=1)
276
+
277
+ args = parser.parse_args()
278
+
279
+ approximator = args.approximator
280
+ environment = args.environment
281
+ seed = args.seed
282
+
283
+ episodes = 1500
284
+ returns = []
285
+
286
+ agent = UpsideDownAgent(environment, approximator)
287
+
288
+ for e in range(episodes):
289
+
290
+ print("Episode {}".format(e))
291
+
292
+ for i in range(100):
293
+ agent.train_behaviour_function()
294
+
295
+ print("Finished training B!")
296
+
297
+ for i in range(15):
298
+ tmp_r = []
299
+ exploratory_commands = agent.sample_exploratory_commands() # Line 5 Algorithm 1
300
+ desired_return = exploratory_commands[0]
301
+ desired_horizon = exploratory_commands[1]
302
+ r = agent.generate_episode(environment, e, desired_return, desired_horizon, False)
303
+ tmp_r.append(r)
304
+
305
+ print(np.mean(tmp_r))
306
+ returns.append(np.mean(tmp_r))
307
+
308
+ exploratory_commands = agent.sample_exploratory_commands()
309
+
310
+ #agent.generate_episode(environment, 1, 200, 200, True)
311
+
312
+ utils.save_results(environment, approximator, seed, returns)
313
+
314
+ if approximator == 'neural_network':
315
+ utils.save_trained_model(environment, seed, agent.behaviour_function)
316
+
317
+ plt.plot(returns)
318
+ plt.show()
319
+
320
+ if __name__ == "__main__":
321
+ run_experiment()
old_code/utils.py ADDED
@@ -0,0 +1,121 @@
1
+ import os
2
+ import argparse
3
+ import pickle
4
+ import keras
5
+ import numpy as np
6
+
7
+ from keras.layers import Dense, Multiply, Input, Conv2D, Flatten
8
+ from keras.models import Sequential, Model
9
+ from keras.optimizers import Adam, RMSprop, SGD
10
+
11
+ from skimage.transform import resize
12
+ from skimage.color import rgb2gray
13
+
14
+ STORING_PATH = './results/'
15
+ MODELS_PATH = './trained_models/'
16
+
17
+ def save_results(environment, approximator, seed, rewards):
18
+ storing_path = os.path.join(STORING_PATH, environment, approximator, str(seed))
19
+ if not os.path.exists(storing_path):
20
+ os.makedirs(storing_path)
21
+
22
+ np.save(storing_path + '/' + 'upside_down_rewards.npy', rewards)
23
+
24
+ def get_functional_behaviour_function(state_size, command_size, action_size):
25
+ observation_input = keras.Input(shape=(state_size,))
26
+ linear_layer = Dense(64, activation='sigmoid')(observation_input)
27
+
28
+ command_input = keras.Input(shape=(command_size,))
29
+ sigmoidal_layer = Dense(64, activation='sigmoid')(command_input)
30
+
31
+ multiplied_layer = Multiply()([linear_layer, sigmoidal_layer])
32
+
33
+ layer_1 = Dense(64, activation='relu')(multiplied_layer)
34
+ layer_2 = Dense(64, activation='relu')(layer_1)
35
+ layer_3 = Dense(64, activation='relu')(layer_2)
36
+ layer_4 = Dense(64, activation='relu')(layer_3)
37
+ final_layer = Dense(action_size, activation='softmax')(layer_4)
38
+
39
+ model = Model(inputs=[observation_input, command_input], outputs=final_layer)
40
+ model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.001))
41
+
42
+ return model
43
+
44
+ def get_atari_behaviour_function(action_size):
45
+
46
+ print('Getting the model')
47
+
48
+ input_state = Input(shape=(84,84,4))
49
+
50
+ first_conv = Conv2D(
51
+ 32, (8, 8), strides=(4,4), activation='relu')(input_state)
52
+ second_conv = Conv2D(
53
+ 64, (4, 4), strides=(2,2), activation='relu')(first_conv)
54
+ third_conv = Conv2D(
55
+ 64, (3, 3), strides=(1,1), activation='relu')(second_conv)
56
+
57
+ flattened = Flatten()(third_conv)
58
+ dense_layer = Dense(512, activation='relu')(flattened)
59
+
60
+ command_input = keras.Input(shape=(2,))
61
+ sigmoidal_layer = Dense(512, activation='sigmoid')(command_input)
62
+
63
+ multiplied_layer = Multiply()([dense_layer, sigmoidal_layer])
64
+ final_layer = Dense(256, activation='relu')(multiplied_layer)
65
+
66
+ action_layer = Dense(action_size, activation='softmax')(final_layer)
67
+
68
+ model = Model(inputs=[input_state, command_input], outputs=action_layer)
69
+ model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.001, rho=0.95, epsilon=0.01))
70
+
71
+
72
+ print(model.summary())
73
+
74
+ return model
75
+
76
+ def get_catch_behaviour_function(action_size):
77
+
78
+ print('Getting the Catch-model')
79
+
80
+ input_state = Input(shape=(84,84,4))
81
+
82
+ first_conv = Conv2D(
83
+ 32, (8, 8), strides=(4,4), activation='relu')(input_state)
84
+ second_conv = Conv2D(
85
+ 64, (4, 4), strides=(2,2), activation='relu')(first_conv)
86
+ third_conv = Conv2D(
87
+ 64, (3, 3), strides=(1,1), activation='relu')(second_conv)
88
+
89
+ flattened = Flatten()(third_conv)
90
+ dense_layer = Dense(512, activation='relu')(flattened)
91
+
92
+ command_input = keras.Input(shape=(2,))
93
+ sigmoidal_layer = Dense(512, activation='sigmoid')(command_input)
94
+
95
+ multiplied_layer = Multiply()([dense_layer, sigmoidal_layer])
96
+ final_layer = Dense(256, activation='relu')(multiplied_layer)
97
+
98
+ action_layer = Dense(action_size, activation='softmax')(final_layer)
99
+
100
+ model = Model(inputs=[input_state, command_input], outputs=action_layer)
101
+ model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.001, rho=0.95, epsilon=0.01))
102
+
103
+
104
+ print(model.summary())
105
+
106
+ return model
107
+
108
+
109
+ def pre_processing(state):
110
+ processed_state = np.uint8(
111
+ resize(rgb2gray(state), (84, 84), mode='constant')*255)
112
+
113
+ return processed_state
114
+
115
+ def save_trained_model(environment, seed, model):
116
+ storing_path = os.path.join(MODELS_PATH, environment, str(seed))
117
+ if not os.path.exists(storing_path):
118
+ os.makedirs(storing_path)
119
+
120
+ model.save_weights(storing_path + '/' + 'trained_model.h5')
121
+
poetry.lock ADDED
The diff for this file is too large to render. See raw diff
 
udrl/__main__.py ADDED
@@ -0,0 +1,238 @@
1
+ from .agent import UpsideDownAgent, AgentHyper
2
+ from .policies import SklearnPolicy, NeuralPolicy
3
+ from .catch import CatchAdaptor
4
+ from dataclasses import dataclass, asdict
5
+ import gymnasium as gym
6
+ from tqdm import trange
7
+ import numpy as np
8
+ import warnings
9
+ import argparse
10
+ from udrl.cli import (
11
+ with_meta,
12
+ create_argparse_dict,
13
+ create_experiment_from_args,
14
+ dataclass_non_defaults_to_string,
15
+ apply,
16
+ )
17
+ from pathlib import Path
18
+ import json
19
+ import torch
20
+ import random as rnd
21
+
22
+
23
+ @dataclass
24
+ class UDRLExperiment:
25
+ """Configuration for an Upside-Down Reinforcement Learning experiment."""
26
+
27
+ env_name: str = with_meta(
28
+ "CartPole-v0", "Name of the Gym environment to use "
29
+ )
30
+ estimator_name: str = with_meta(
31
+ "ensemble.RandomForestClassifier",
32
+ "neural for the NN or a fully qualified name of the "
33
+ "scikit-learn estimator class "
34
+ "for the policy",
35
+ )
36
+ seed: int = with_meta(42, "Random seed for reproducibility")
37
+
38
+ max_episode: int = with_meta(500, "Maximum number of training episodes ")
39
+ collect_iter: int = with_meta(
40
+ 15, "Number of episodes to collect between training steps "
41
+ )
42
+ train_per_iter: int = with_meta(
43
+ 100, "Number of train iteration for each collected episode "
44
+ )
45
+ batch_size: int = with_meta(
46
+ 0,
47
+ "Batch size for training the policy. "
48
+ "If batch_size <= 0, use the entire replay buffer",
49
+ )
50
+
51
+ warm_up: int = with_meta(
52
+ 50, "Number of initial random episodes to populate the replay buffer"
53
+ )
54
+ memory_size: int = with_meta(700, "Maximum size of the replay buffer")
55
+ last_few: int = with_meta(
56
+ 75,
57
+ "Number of recent episodes to consider for exploratory command sampling",
58
+ )
59
+ testing_period: int = with_meta(
60
+ 10, "After how many training loops we perform the testing of the agent"
61
+ )
62
+
63
+ horizon_scale: float = with_meta(
64
+ 0.02, "Scaling factor for desired horizon in commands "
65
+ )
66
+ return_scale: float = with_meta(
67
+ 0.02, "Scaling factor for desired return in commands"
68
+ )
69
+
70
+ epsilon: float = with_meta(
71
+ 0.2, "Exploration rate for epsilon-greedy action selection"
72
+ )
73
+ save_desired: bool = with_meta(
74
+ False, "Save desired_horizon and desired_return during training"
75
+ )
76
+
77
+ final_testing: bool = with_meta(
78
+ True, "Whether to perform final testing after training "
79
+ )
80
+ final_testing_sample: int = with_meta(
81
+ 100, "Number of episodes to evaluate during final testing "
82
+ )
83
+ final_desired_return: int = with_meta(
84
+ 200, "Desired return for final testing episodes"
85
+ )
86
+ final_desired_horizon: int = with_meta(
87
+ 200, "Desired horizon for final testing episodes "
88
+ )
89
+ save_policy: bool = with_meta(True, "Whether to save the trained policy ")
90
+ save_learning_infos: bool = with_meta(
91
+ True, "Whether to save the learning infos"
92
+ )
93
+
94
+
95
+ def dump_dict(data, file_path):
96
+ with open(file_path, "w") as file:
97
+ json.dump(data, file, indent=4)
98
+
99
+
100
+ def run_experiment(conf: UDRLExperiment):
101
+ """Runs an Upside-Down Reinforcement Learning experiment.
102
+
103
+ Parameters
104
+ ----------
105
+ conf : UDRLExperiment
106
+ Configuration for the experiment.
107
+
108
+ Returns
109
+ -------
110
+ None
111
+
112
+ Notes
113
+ -----
114
+ * Trains an agent using the specified policy and environment.
115
+ * Collects episodes of experience and updates the policy.
116
+ * Optionally performs final testing, saves the policy and learning infos.
117
+ """
118
+ torch.manual_seed(conf.seed)
119
+ np.random.seed(conf.seed)
120
+ rnd.seed(conf.seed)
121
+
122
+ toy_env = (
123
+ CatchAdaptor(dense=True)
124
+ if conf.env_name == "catch"
125
+ else gym.make(conf.env_name)
126
+ )
127
+ if conf.estimator_name == "neural":
128
+ policy = NeuralPolicy(
129
+ toy_env.observation_space.shape[0],
130
+ action_size=toy_env.action_space.n,
131
+ )
132
+ else:
133
+ policy = SklearnPolicy(
134
+ epsilon=conf.epsilon,
135
+ estimator_name=conf.estimator_name,
136
+ action_size=toy_env.action_space.n,
137
+ )
138
+ agent = UpsideDownAgent(
139
+ conf=apply(AgentHyper, asdict(conf)),
140
+ policy=policy,
141
+ )
142
+ epi_bar = trange(conf.max_episode)
143
+
144
+ returns = []
145
+ test_returns = []
146
+ infos = []
147
+ desired_returns = []
148
+ desired_horizons = []
149
+ test_reward_mean = 0
150
+ test_reward_std = 0
151
+ for e in epi_bar:
152
+ metric = []
153
+ for _ in range(conf.train_per_iter):
154
+ info = agent.train()
155
+ metric.append(info["metric"])
156
+ infos.append(info)
157
+
158
+ episodic_rewards = []
159
+ for _ in range(conf.collect_iter):
160
+ r, dr, dh = agent.collect_episode(
161
+ *agent.sample_exploratory_commands()
162
+ )
163
+ episodic_rewards.append(r)
164
+ desired_returns.extend(dr)
165
+ desired_horizons.extend(dh)
166
+
167
+ ep_r_mean = np.mean(episodic_rewards)
168
+ ep_r_std = np.std(episodic_rewards)
169
+ returns.append((ep_r_mean, ep_r_std))
170
+
171
+ if e % conf.testing_period == 0:
172
+ test_reward = [
173
+ agent.collect_episode(
174
+ conf.final_desired_return,
175
+ conf.final_desired_horizon,
176
+ test=True,
177
+ store_episode=False,
178
+ )[0]
179
+ for _ in range(conf.final_testing_sample)
180
+ ]
181
+ test_reward_mean = np.mean(test_reward)
182
+ test_reward_std = np.std(test_reward)
183
+ test_returns.append((test_reward_mean, test_reward_std))
184
+
185
+ epi_bar.set_postfix(
186
+ {
187
+ "mean": test_reward_mean,
188
+ "std": test_reward_std,
189
+ "mean_m": np.mean(metric),
190
+ "std_m": np.std(metric),
191
+ }
192
+ )
193
+
194
+ exp_name = dataclass_non_defaults_to_string(conf)
195
+ base_path = Path("data") / conf.env_name / exp_name / str(conf.seed)
196
+ base_path.mkdir(parents=True, exist_ok=True)
197
+ final_res = {}
198
+ if conf.final_testing:
199
+ print("Start Testing...")
200
+ final_r = [
201
+ agent.collect_episode(
202
+ conf.final_desired_return,
203
+ conf.final_desired_horizon,
204
+ test=True,
205
+ store_episode=False,
206
+ )[0]
207
+ for _ in trange(conf.final_testing_sample)
208
+ ]
209
+ final_res["test_mean"] = np.mean(final_r)
210
+ final_res["test_std"] = np.std(final_r)
211
+ print(f"Final result:\n{np.mean(final_r)} +- {np.std(final_r)}")
212
+
213
+ dump_dict(asdict(conf) | final_res, str(base_path / "conf.json"))
214
+ if conf.save_policy:
215
+ agent.policy.save(str(base_path / "policy"))
216
+
217
+ if conf.save_learning_infos:
218
+ np.save(str(base_path / "train_rewards.npy"), returns)
219
+ np.save(str(base_path / "test_rewards.npy"), test_returns)
220
+ np.save(str(base_path / "desired_returns.npy"), desired_returns)
221
+ np.save(str(base_path / "desired_horizons.npy"), desired_horizons)
222
+ dump_dict(infos, str(base_path / "learning_infos.json"))
223
+
224
+
225
+ warnings.simplefilter("ignore", DeprecationWarning)
226
+ warnings.simplefilter("ignore", FutureWarning)
227
+ parser = argparse.ArgumentParser(
228
+ description="Runs an Upside-Down Reinforcement Learning experiment."
229
+ "NOTE: Default values are for the CartPole env with RandomForestClassifier"
230
+ )
231
+ arguments = create_argparse_dict(UDRLExperiment)
232
+ for k, v in arguments.items():
233
+ parser.add_argument(k, **v)
234
+ args = parser.parse_args()
235
+ conf = create_experiment_from_args(args, UDRLExperiment)
236
+ print(conf)
237
+
238
+ run_experiment(conf)
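
For reference, a minimal sketch of reading the artifacts that `run_experiment` writes for a single run. It assumes a completed run with default settings and seed 42, so the experiment name resolves to `base`; adjust the path components otherwise.

```python
# Hedged sketch: loading the outputs of one run (layout: data/<env>/<exp_name>/<seed>).
import json
from pathlib import Path

import numpy as np

run_dir = Path("data") / "CartPole-v0" / "base" / "42"
conf = json.loads((run_dir / "conf.json").read_text())    # config merged with final test stats
train_rewards = np.load(run_dir / "train_rewards.npy")    # shape (max_episode, 2): mean, std
test_rewards = np.load(run_dir / "test_rewards.npy")      # periodic test evaluations
print(conf.get("test_mean"), train_rewards.shape, test_rewards.shape)
```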
udrl/agent.py ADDED
@@ -0,0 +1,180 @@
1
+ from dataclasses import dataclass
2
+ import gymnasium as gym
3
+ import numpy as np
4
+
5
+ from .catch import CatchAdaptor
6
+ from .policies import ABCPolicy
7
+ from .buffer import ReplayBuffer
8
+
9
+
10
+ @dataclass
11
+ class AgentHyper:
12
+ """Hyperparameters for an agent interacting with an environment.
13
+
14
+ Parameters
15
+ ----------
16
+ env_name : str
17
+ Name of the environment the agent interacts with.
18
+ warm_up : int, optional
19
+ Number of initial random episodes collected before training begins (default: 50).
20
+ memory_size : int, optional
21
+ Maximum size of the agent's experience replay memory (default: 700).
22
+ last_few : int, optional
23
+ Number of recent experiences to prioritize in training (default: 75).
24
+ batch_size : int, optional
25
+ Number of experiences sampled from memory for each training update
26
+ (default: 32).
27
+ horizon_scale : float, optional
28
+ Scaling factor for the horizon length in reinforcement learning
29
+ (default: 0.02).
30
+ return_scale : float, optional
31
+ Scaling factor for rewards or returns in reinforcement learning
32
+ (default: 0.02).
33
+ """
34
+
35
+ env_name: str
36
+ warm_up: int = 50
37
+ memory_size: int = 700
38
+ last_few: int = 75
39
+ batch_size: int = 32
40
+
41
+ horizon_scale: float = 0.02
42
+ return_scale: float = 0.02
43
+
44
+
45
+ class UpsideDownAgent:
46
+ """An agent that interacts with an environment using an
47
+ Upside-Down Reinforcement Learning approach.
48
+
49
+ Parameters
50
+ ----------
51
+ conf : AgentHyper
52
+ Hyperparameters for the agent.
53
+ policy : ABCPolicy
54
+ A policy object used by the agent to select actions.
55
+
56
+ Attributes
57
+ ----------
58
+ environment : gym.Env
59
+ The Gym environment the agent interacts with.
60
+ state_size : int
61
+ The size of the state space in the environment.
62
+ memory : ReplayBuffer
63
+ The replay buffer used to store experiences for training.
64
+ policy : ABCPolicy
65
+ The policy object used by the agent to select actions.
66
+
67
+ Methods
68
+ -------
69
+ collect_episode(desired_return=1, desired_horizon=1, random=False,
70
+ store_episode=True, test=False)
71
+ Collects an episode of experience from the environment.
72
+ sample_exploratory_commands()
73
+ Samples exploratory commands based on past experiences.
74
+ train()
75
+ Trains the agent's policy using experiences from the replay buffer.
76
+ """
77
+
78
+ def __init__(self, conf: AgentHyper, policy: ABCPolicy):
79
+ self.conf = conf
80
+ self.environment = (
81
+ CatchAdaptor(dense=True)
82
+ if conf.env_name == "catch"
83
+ else gym.make(conf.env_name)
84
+ )
85
+ self.state_size = self.environment.observation_space.shape[0]
86
+ self.memory = ReplayBuffer(conf.memory_size)
87
+ self.policy = policy
88
+ for x in range(conf.warm_up):
89
+ self.collect_episode(random=True)
90
+
91
+ def collect_episode(
92
+ self,
93
+ desired_return: int = 1,
94
+ desired_horizon: int = 1,
95
+ random: bool = False,
96
+ store_episode: bool = True,
97
+ test: bool = False,
98
+ ):
99
+ state, _ = self.environment.reset()
100
+ epochs = []
101
+ horizons = []
102
+ returns = []
103
+ cum_rew = 0
104
+ tru, ter = False, False
105
+
106
+ while not (tru or ter):
107
+ state = np.expand_dims(state, axis=0)
108
+ command = np.array(
109
+ [
110
+ desired_return * self.conf.return_scale,
111
+ desired_horizon * self.conf.horizon_scale,
112
+ ]
113
+ )
114
+ command = np.expand_dims(command, axis=0)
115
+ action = (
116
+ self.environment.action_space.sample()
117
+ if random
118
+ else self.policy(state, command, test)
119
+ )
120
+ next_state, reward, ter, tru, _ = self.environment.step(action)
121
+
122
+ epochs.append([state, action, reward])
123
+ cum_rew += reward
124
+ horizons.append(desired_horizon)
125
+ returns.append(desired_return)
126
+
127
+ state = next_state
128
+ # Line 8 Algorithm 2
129
+ desired_return -= reward
130
+ # Line 9 Algorithm 2
131
+ desired_horizon = max(desired_horizon - 1, 1)
132
+ if store_episode:
133
+ self.memory.add_sample(*list(zip(*epochs)))
134
+ return cum_rew, returns, horizons
135
+
136
+ def sample_exploratory_commands(self):
137
+ best_ep = self.memory.get_n_best(self.conf.last_few)
138
+ expl_desired_horizon = np.mean([len(i["states"]) for i in best_ep])
139
+
140
+ returns = [i["summed_rewards"] for i in best_ep]
141
+ expl_desired_returns = np.random.uniform(
142
+ np.mean(returns), np.mean(returns) + np.std(returns)
143
+ )
144
+
145
+ return [expl_desired_returns, expl_desired_horizon]
146
+
147
+ def train(self):
148
+ batch_size = self.conf.batch_size
149
+ if self.conf.batch_size <= 0:
150
+ batch_size = len(self.memory.buffer)
151
+
152
+ random_episodes = self.memory.get_random_samples(batch_size)
153
+
154
+ training_states = np.zeros((batch_size, self.state_size))
155
+ training_commands = np.zeros((batch_size, 2))
156
+
157
+ actions = []
158
+
159
+ for idx, episode in enumerate(random_episodes):
160
+ T = len(episode["states"])
161
+ t1 = np.random.randint(0, T - 1)
162
+ # t2 = np.random.randint(t1 + 1, T)
163
+ t2 = T
164
+
165
+ state = episode["states"][t1]
166
+ desired_return = sum(episode["rewards"][t1:t2])
167
+ desired_horizon = t2 - t1
168
+
169
+ action = episode["actions"][t1]
170
+
171
+ training_states[idx] = state[0]
172
+ training_commands[idx] = np.array(
173
+ [
174
+ desired_return * self.conf.return_scale,
175
+ desired_horizon * self.conf.horizon_scale,
176
+ ]
177
+ )
178
+ actions.append(action)
179
+
180
+ return self.policy.train(training_states, training_commands, actions)
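
A minimal sketch of wiring the agent to a policy by hand, outside the CLI entry point. It assumes the `udrl` package from this commit is importable and uses small, illustrative hyperparameter values (`batch_size=0` means the whole buffer is used, matching the experiment defaults).

```python
# Hedged usage sketch of UpsideDownAgent with a scikit-learn policy.
from udrl.agent import AgentHyper, UpsideDownAgent
from udrl.policies import SklearnPolicy

hyper = AgentHyper("CartPole-v0", warm_up=40, memory_size=100, batch_size=0)
policy = SklearnPolicy(
    epsilon=0.2,
    action_size=2,  # CartPole-v0 has two discrete actions
    estimator_name="ensemble.RandomForestClassifier",
)
agent = UpsideDownAgent(conf=hyper, policy=policy)   # warm-up episodes are collected here

info = agent.train()                                 # one supervised update on the buffer
ret, hor = agent.sample_exploratory_commands()       # commands from the best stored episodes
cum_rew, _, _ = agent.collect_episode(ret, hor)
print(info["metric"], cum_rew)
```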
udrl/buffer.py ADDED
@@ -0,0 +1,70 @@
1
+ import random
2
+
3
+
4
+ class ReplayBuffer:
5
+ """A replay buffer for storing and sampling experiences.
6
+ Thank you: https://github.com/BY571/
7
+
8
+ Parameters
9
+ ----------
10
+ max_size : int
11
+ The maximum number of experiences the buffer can store.
12
+
13
+ Attributes
14
+ ----------
15
+ max_size : int
16
+ The maximum number of experiences the buffer can store.
17
+ buffer : list
18
+ The list storing the experiences.
19
+
20
+ Methods
21
+ -------
22
+ add_sample(states, actions, rewards)
23
+ Adds an episode of experience to the buffer and sorts the buffer
24
+ by summed rewards in descending order.
25
+
26
+ sort()
27
+ Sorts the buffer by summed rewards in descending order and keeps only
28
+ the top `max_size` experiences.
29
+
30
+ get_random_samples(batch_size)
31
+ Returns a random sample of `batch_size` experiences from the buffer.
32
+
33
+ get_n_best(n)
34
+ Returns the `n` experiences with the highest summed rewards.
35
+
36
+ __len__()
37
+ Returns the current number of experiences in the buffer.
38
+ """
39
+
40
+ def __init__(self, max_size):
41
+ self.max_size = max_size
42
+ self.buffer = []
43
+
44
+ def add_sample(self, states, actions, rewards):
45
+ episode = {
46
+ "states": states,
47
+ "actions": actions,
48
+ "rewards": rewards,
49
+ "summed_rewards": sum(rewards),
50
+ }
51
+ self.buffer.append(episode)
52
+ self.sort()
53
+
54
+ def sort(self):
55
+ # sort buffer
56
+ self.buffer = sorted(
57
+ self.buffer, key=lambda i: i["summed_rewards"], reverse=True
58
+ )
59
+ # keep the max buffer size
60
+ self.buffer = self.buffer[: self.max_size]
61
+
62
+ def get_random_samples(self, batch_size):
63
+ return random.sample(self.buffer, batch_size)
64
+
65
+ def get_n_best(self, n):
66
+ self.sort()
67
+ return self.buffer[:n]
68
+
69
+ def __len__(self):
70
+ return len(self.buffer)
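
A small sketch of the buffer's semantics: episodes stay sorted by summed reward and the buffer is truncated to `max_size` on every insertion (toy data, assuming the package is importable).

```python
# Hedged sketch of ReplayBuffer behaviour with toy episodes.
from udrl.buffer import ReplayBuffer

buf = ReplayBuffer(max_size=2)
buf.add_sample(states=["s0", "s1"], actions=[0, 1], rewards=[1, 0])  # summed_rewards = 1
buf.add_sample(states=["s0"], actions=[1], rewards=[5])              # summed_rewards = 5
buf.add_sample(states=["s0", "s1"], actions=[0, 0], rewards=[2, 1])  # summed_rewards = 3

print(len(buf))                                             # 2: the worst episode was dropped
print([ep["summed_rewards"] for ep in buf.get_n_best(2)])   # [5, 3]
print(len(buf.get_random_samples(batch_size=1)))            # 1
```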
udrl/catch/__init__.py ADDED
@@ -0,0 +1,35 @@
1
+ from .adptor import CatchAdaptor
2
+ from .core import CatchEnv
3
+
4
+ env_names = [
5
+ "base",
6
+ "small_paddle",
7
+ "random_background",
8
+ "hardest",
9
+ "discrete_background",
10
+ ]
11
+
12
+ env_names = ["catch_" + x for x in env_names]
13
+
14
+
15
+ def make_catch_conf(env_name: str):
16
+ base_args = {
17
+ "random_background": False,
18
+ "discrete_background": False,
19
+ "paddle_size": 5,
20
+ }
21
+ match env_name:
22
+ case "catch_small_paddle":
23
+ base_args["paddle_size"] = 2
24
+ case "catch_discrete_background":
25
+ base_args["random_background"] = True
26
+ base_args["discrete_background"] = True
27
+ case "catch_random_background":
28
+ base_args["random_background"] = True
29
+ case "catch_hardest":
30
+ base_args["random_background"] = True
31
+ base_args["paddle_size"] = 2
32
+ return base_args
33
+
34
+
35
+ __all__ = ["CatchEnv", "CatchAdaptor", "make_catch_conf", "env_names"]
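
An illustrative sketch of how the predefined names map onto `CatchEnv` configurations; it simply mirrors the `match` statement above.

```python
# Hedged sketch: instantiate every predefined Catch variant from its name.
from udrl.catch import CatchEnv, env_names, make_catch_conf

for name in env_names:
    conf = make_catch_conf(name)   # {"random_background": ..., "discrete_background": ..., "paddle_size": ...}
    env = CatchEnv(**conf)
    obs = env.reset()              # 84x84 image scaled to [0, 255] by default
    print(name, conf)
```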
udrl/catch/adptor.py ADDED
@@ -0,0 +1,126 @@
1
+ from typing import Any, Dict, Tuple
2
+
3
+ import gymnasium as gym
4
+ import numpy as np
5
+ from gymnasium import spaces
6
+ from numpy.typing import NDArray
7
+
8
+ from .core import CatchEnv
9
+ from .renderer import Renderer
10
+
11
+
12
+ class CatchAdaptor(gym.Env):
13
+ """Adapts the CatchEnv game to the OpenAI Gym interface.
14
+
15
+ This class provides a wrapper for the CatchEnv game,
16
+ making it compatible with the Gymnasium environment framework.
17
+ It handles action and observation space definitions, rendering,
18
+ and environment interaction.
19
+
20
+ Parameters
21
+ ----------
22
+ render : bool, optional
23
+ If True or "human", renders the environment in a human-viewable window.
24
+ If "rgb_array", renders the environment to an RGB array.
25
+ Default is False.
26
+ numpy_type : str, optional
27
+ The NumPy data type for the observation array. Default is "float32".
28
+ **catch_kwargs
29
+ Additional keyword arguments to pass to the CatchEnv constructor.
30
+ """
31
+
32
+ def __init__(
33
+ self, render: bool = False, numpy_type: str = "float32", **catch_kwargs
34
+ ):
35
+ super().__init__()
36
+ self.catch = CatchEnv(**catch_kwargs)
37
+ self.np_type = numpy_type
38
+ self.action_space = spaces.Discrete(3)
39
+ self.obs_shape = (84, 84)
40
+ self.dense = catch_kwargs.get("dense", None)
41
+ if self.dense:
42
+ self.observation_space = spaces.Box(
43
+ np.array([0, 0, 0]), np.array([21, 21, 21]), dtype=np.uint8
44
+ )
45
+ else:
46
+ self.observation_space = spaces.Box(
47
+ low=0, high=255, shape=self.obs_shape, dtype=np.uint8
48
+ )
49
+ self.render_mode = render
50
+ if self.render_mode:
51
+ self.GUI = Renderer(self.obs_shape)
52
+
53
+ def step(
54
+ self, action: int
55
+ ) -> Tuple[NDArray, float, bool, bool, Dict[str, Any]]:
56
+ """Run one timestep of the environment's dynamics.
57
+
58
+ Parameters
59
+ ----------
60
+ action : int
61
+ The action to take in the environment
62
+ (0: move left, 1: move right, 2: stay).
63
+
64
+ Returns
65
+ -------
66
+ observation : np.ndarray
67
+ The agent's observation of the current environment.
68
+ reward : float
69
+ The amount of reward returned after the previous action.
70
+ terminated : bool
71
+ Whether the episode has ended.
72
+ truncated : bool
73
+ Whether the episode was truncated.
74
+ info : dict
75
+ Contains auxiliary diagnostic information.
76
+ """
77
+ state, reward, done = self.catch.step(action)
78
+ self.state = state
79
+ if self.render_mode:
80
+ self.render()
81
+
82
+ # terminated vs truncated: see gymnasium documentation
83
+ # https://gymnasium.farama.org/api/env/
84
+ # in this environment we do not have a difference between the two.
85
+ obs = state
86
+ if not self.dense:
87
+ obs = np.reshape(obs, self.obs_shape).astype(self.np_type)
88
+
89
+ return (
90
+ obs,
91
+ reward,
92
+ done, # terminated
93
+ done, # truncated
94
+ {}, # empty info
95
+ )
96
+
97
+ def reset(self, **_) -> Tuple[NDArray, Dict[str, Any]]:
98
+ """Resets the environment to an initial state and
99
+ returns the initial observation.
100
+
101
+ Returns
102
+ -------
103
+ observation : np.ndarray
104
+ The initial observation.
105
+ info : dict
106
+ Contains auxiliary diagnostic information.
107
+ """
108
+ obs = self.catch.reset()
109
+ if not self.dense:
110
+ obs = np.reshape(obs, self.obs_shape)
111
+ return obs, {}
112
+
113
+ def render(self):
114
+ """Renders the environment.
115
+
116
+ If the 'render' parameter is set,
117
+ this method will display the environment
118
+ either in a human-viewable window or as an RGB array.
119
+ """
120
+ if self.render_mode:
121
+ self.GUI(self.state)
122
+
123
+ def close(self):
124
+ """Closes the renderer if it is active."""
125
+ if self.render_mode:
126
+ self.GUI.quit()
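
A minimal Gymnasium-style interaction loop against the adaptor with random actions, assuming the dense (non-pixel) observation mode.

```python
# Hedged sketch: one random-policy episode of dense-state Catch through the adaptor.
from udrl.catch import CatchAdaptor

env = CatchAdaptor(dense=True)     # observations are [ball_x, ball_y, paddle_pos]
obs, info = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()             # 0: left, 1: right, 2: stay
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print(total_reward)                # 1.0 only if the random paddle happened to catch the ball
```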
udrl/catch/core.py ADDED
@@ -0,0 +1,190 @@
1
+ from dataclasses import dataclass, field
2
+ from typing import Tuple
3
+
4
+ import numpy as np
5
+ from numpy.typing import NDArray
6
+ from skimage.transform import resize
7
+
8
+
9
+ @dataclass
10
+ class CatchEnv:
11
+ """A simple 2D Catch environment for reinforcement learning.
12
+
13
+ This environment simulates a game where the agent controls a paddle
14
+ at the bottom of the screen and tries to catch a falling ball.
15
+ The state is represented as an image, and the actions are discrete
16
+ movements of the paddle.
17
+
18
+ Attributes
19
+ ----------
20
+ paddle_size: int, default=5
21
+ The size of the paddle in pixels.
22
+ random_background: bool, default=False
23
+ Whether to use a random background image.
24
+ discrete_background: bool, default=False
25
+ If True and random_background is True,
26
+ the background will be chosen from a discrete set of values.
27
+ scale_value: int, default=255
28
+ The scaling factor for the image values.
29
+ """
30
+
31
+ paddle_size: int = 5
32
+ random_background: bool = False
33
+ discrete_background: bool = False
34
+ scale_value: int = 255
35
+ dense: bool = False
36
+
37
+ size: int = field(init=False, default_factory=lambda: 21)
38
+ scale_factor: int = field(init=False, default_factory=lambda: 4)
39
+ image: np.ndarray = field(init=False)
40
+ background: np.ndarray = field(init=False)
41
+ left_paddle_offset: int = field(init=False)
42
+ right_paddle_offset: int = field(init=False)
43
+
44
+ def __post_init__(self):
45
+ """Initializes internal environment variables after object creation."""
46
+ self.final_size = (
47
+ self.size * self.scale_factor,
48
+ self.size * self.scale_factor,
49
+ )
50
+ self.default_size = (self.size, self.size)
51
+ if self.random_background:
52
+ if self.discrete_background:
53
+ self.background = np.random.choice(
54
+ np.linspace(0, 0.5, 10),
55
+ size=self.final_size,
56
+ )
57
+ else:
58
+ self.background = resize(
59
+ np.random.choice(
60
+ np.linspace(0, 0.999, 10),
61
+ size=self.default_size,
62
+ ),
63
+ self.final_size,
64
+ )
65
+
66
+ self.image = np.zeros(self.default_size)
67
+ self.left_paddle_offset = self.paddle_size // 2
68
+ self.right_paddle_offset = self.left_paddle_offset + (
69
+ self.paddle_size % 2
70
+ )
71
+
72
+ self.actions = {
73
+ 0: lambda self=self: max(self.pos - 2, self.left_paddle_offset),
74
+ 1: lambda self=self: min(
75
+ self.pos + 2, self.size - self.right_paddle_offset - 1
76
+ ),
77
+ 2: lambda self=self: self.pos,
78
+ }
79
+
80
+ def _update_ball(self):
81
+ """Updates the position of the ball in the environment.
82
+
83
+ This method updates the ball's position based on its current velocity
84
+ and checks for collisions with the walls.
85
+ If a collision occurs, the ball's velocity is reversed appropriately.
86
+ """
87
+ self.image[self.bally, self.ballx] = 0
88
+ self.ballx += self.vx
89
+ self.bally += self.vy
90
+ if self.ballx > self.size - 1:
91
+ self.ballx -= 2 * (self.ballx - (self.size - 1))
92
+ self.vx *= -1
93
+ elif self.ballx < 0:
94
+ self.ballx -= 2 * self.ballx
95
+ self.vx *= -1
96
+ self.image[self.bally, self.ballx] = 1
97
+
98
+ def _update_paddle(self):
99
+ """Updates the position of the paddle in the environment.
100
+
101
+ This method clears the previous position of the paddle and
102
+ redraws it at its new position based on the current `self.pos` value.
103
+ """
104
+ self.image[-5].fill(0)
105
+ left_pos = self.pos - self.left_paddle_offset
106
+ right_pos = self.pos + self.right_paddle_offset
107
+
108
+ self.image[
109
+ -5,
110
+ left_pos:right_pos,
111
+ ] = np.ones(self.paddle_size)
112
+
113
+ def _compute_terminal(self):
114
+ """Determines if the episode is terminal and calculates the reward.
115
+
116
+ This method checks if the ball has reached the bottom of the screen,
117
+ indicating the end of an episode. If so, it calculates a reward based
118
+ on whether the ball was caught by the paddle.
119
+
120
+ Returns
121
+ -------
122
+ reward : int
123
+ The reward for the current timestep
124
+ (1 if the ball is caught, 0 otherwise).
125
+ terminal : bool
126
+ Whether the episode has ended.
127
+ """
128
+ terminal = self.bally == self.size - 5
129
+ reward = terminal and (
130
+ -self.left_paddle_offset
131
+ <= self.ballx - self.pos
132
+ <= self.right_paddle_offset
133
+ )
134
+ return int(reward), terminal
135
+
136
+ def step(self, action: int) -> Tuple[NDArray, int, bool]:
137
+ """Takes a step in the environment.
138
+
139
+ Parameters
140
+ ----------
141
+ action: int
142
+ The action to take: 0 (move left), 1 (move right), or 2 (stay).
143
+
144
+ Returns
145
+ -------
146
+ image: np.ndarray
147
+ The rendered image of the environment.
148
+ reward: int
149
+ The reward obtained after taking the action.
150
+ terminal: bool
151
+ Whether the episode has ended.
152
+ """
153
+ self.pos = self.actions[action]()
154
+ self._update_ball()
155
+ self._update_paddle()
156
+
157
+ image = resize(
158
+ self.image,
159
+ (self.size * self.scale_factor, self.size * self.scale_factor),
160
+ )
161
+ image[image != 0] = 1
162
+ if self.random_background:
163
+ mask = image == 0
164
+ image[mask] = self.background[mask]
165
+ if self.dense:
166
+ return (
167
+ [self.ballx, self.bally, self.pos],
168
+ *self._compute_terminal(),
169
+ )
170
+ return (image * self.scale_value, *self._compute_terminal())
171
+
172
+ def reset(self) -> NDArray:
173
+ """Resets the environment to its initial state.
174
+
175
+ Returns
176
+ -------
177
+ image: np.ndarray
178
+ The initial rendered image of the environment.
179
+ """
180
+ self.image = np.zeros((self.size, self.size))
181
+ self.pos = np.random.randint(
182
+ self.left_paddle_offset, self.size - self.right_paddle_offset
183
+ )
184
+ self.vx = np.random.randint(5) - 2
185
+ self.vy = 1
186
+ self.ballx, self.bally = np.random.randint(self.size), 4
187
+ self.image[self.bally, self.ballx] = 1
188
+ self._update_paddle()
189
+
190
+ return self.step(2)[0]
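
A short sketch contrasting the two observation modes of `CatchEnv` directly, without the Gym adaptor.

```python
# Hedged sketch: pixel versus dense observations from the core environment.
import numpy as np

from udrl.catch import CatchEnv

image_env = CatchEnv()             # default: 84x84 image scaled to [0, 255]
img = image_env.reset()
print(np.asarray(img).shape)       # (84, 84)

dense_env = CatchEnv(dense=True)   # compact state instead of pixels
state = dense_env.reset()
print(state)                       # [ball_x, ball_y, paddle_pos]
```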
udrl/catch/renderer.py ADDED
@@ -0,0 +1,65 @@
1
+ from dataclasses import dataclass, field
2
+ from os import environ
3
+ from typing import Tuple
4
+ # Set before importing pygame so the support prompt is actually suppressed.
+ environ["PYGAME_HIDE_SUPPORT_PROMPT"] = "1"
5
+ import pygame
6
+ from numpy.typing import NDArray
7
+
8
+
9
+
10
+
11
+ @dataclass
12
+ class Renderer:
13
+ """A renderer for visualizing the CatchEnv game using Pygame.
14
+
15
+ This class initializes a Pygame screen and provides a method to render the
16
+ CatchEnv game state as an image onto the screen.
17
+
18
+ Attributes
19
+ ----------
20
+ size : Tuple[int, int]
21
+ The size of the environment to render (height, width).
22
+ scale_factor : int, default=5
23
+ The scaling factor for the rendered image.
24
+ """
25
+
26
+ size: Tuple[int, int]
27
+ scale_factor: int = 5
28
+
29
+ screen: pygame.surface.Surface = field(init=False)
30
+
31
+ def __post_init__(self):
32
+ """Initializes the Pygame display after object creation."""
33
+ pygame.init()
34
+ self.screen = pygame.display.set_mode(
35
+ (
36
+ self.size[1] * self.scale_factor,
37
+ self.size[0] * self.scale_factor,
38
+ )
39
+ )
40
+
41
+ def quit(self):
42
+ """Quits the Pygame display."""
43
+ pygame.quit()
44
+
45
+ def __call__(self, image: NDArray):
46
+ """Renders the CatchEnv game state onto the Pygame screen.
47
+
48
+ Parameters
49
+ ----------
50
+ image : np.ndarray
51
+ A 2D NumPy array representing the game state. The array should have
52
+ values that correspond to pixel intensities or colors.
53
+ """
54
+
55
+ scaled_size = (
56
+ image.shape[0] * self.scale_factor,
57
+ image.shape[1] * self.scale_factor,
58
+ )
59
+ scaled_image = pygame.transform.scale(
60
+ pygame.surfarray.make_surface(image.T), scaled_size
61
+ )
62
+
63
+ # Blit (copy) the scaled image onto the screen
64
+ self.screen.blit(scaled_image, (0, 0))
65
+ pygame.display.flip()
udrl/cli.py ADDED
@@ -0,0 +1,192 @@
1
+ import argparse
2
+ import dataclasses
3
+ import inspect as i
4
+ from typing import Callable, Dict, Any
5
+ from dataclasses import fields, is_dataclass
6
+
7
+
8
+ def sel_args(kw: Dict[str, Any], fun: Callable) -> Dict[str, Any]:
9
+ """
10
+ Selects keyword arguments relevant to a function.
11
+
12
+ Parameters
13
+ ----------
14
+ kw : Dict[str, Any]
15
+ A dictionary of keyword arguments.
16
+ fun : Callable
17
+ The function for which arguments are to be selected.
18
+
19
+ Returns
20
+ -------
21
+ Dict[str, Any]
22
+ A new dictionary containing only the keyword arguments
23
+ that are valid parameters for the given function.
24
+ """
25
+ return {
26
+ k: v for k, v in kw.items() if k in list(i.signature(fun).parameters)
27
+ }
28
+
29
+
30
+ def apply(fun: Callable, kw: Dict[str, Any]) -> Any:
31
+ """
32
+ Applies a function with selected keyword arguments.
33
+
34
+ Parameters
35
+ ----------
36
+ fun : Callable
37
+ The function to apply.
38
+ kw : Dict[str, Any]
39
+ A dictionary of keyword arguments.
40
+
41
+ Returns
42
+ -------
43
+ Any
44
+ The result of calling the function with the selected keyword arguments.
45
+ """
46
+ return fun(**sel_args(kw, fun))
47
+
48
+
49
+ def create_argparse_dict(dataclass_cls):
50
+ """
51
+ Creates an argument parser dictionary configuration from a dataclass.
52
+
53
+ This function examines the fields of a dataclass and generates a dictionary
54
+ that can be used to configure an argparse.ArgumentParser.
55
+ It handles boolean fields with special actions, sets default values,
56
+ includes help messages with defaults, and supports optional choices
57
+ and required arguments based on metadata.
58
+
59
+ Parameters
60
+ ----------
61
+ dataclass_cls : type
62
+ The dataclass type to create the argument parser dictionary from.
63
+
64
+ Returns
65
+ -------
66
+ Dict[str, Dict[str, Any]]
67
+ A dictionary mapping argument names to dictionaries containing
68
+ argparse configuration options.
69
+ """
70
+ result = {}
71
+ for field in dataclasses.fields(dataclass_cls):
72
+ if not field.init:
73
+ continue
74
+ arg_name = f"--{field.name.replace('_', '-')}"
75
+ if field.type == bool:
76
+ result[arg_name] = dict(
77
+ action=argparse.BooleanOptionalAction,
78
+ default=field.default,
79
+ )
80
+ continue
81
+ result[arg_name] = {
82
+ "type": field.type,
83
+ "default": (
84
+ field.default
85
+ if not dataclasses.is_dataclass(field.type)
86
+ else None
87
+ ),
88
+ "help": f"{field.metadata.get('help', '')}"
89
+ f" (default: {field.default})",
90
+ }
91
+ if choices := field.metadata.get("choices", None):
92
+ result[arg_name]["choices"] = choices
93
+ if required := field.metadata.get("required", None):
94
+ result[arg_name]["required"] = required
95
+ return result
96
+
97
+
98
+ def create_experiment_from_args(
99
+ args: argparse.Namespace, dataclass: Callable[..., Any]
100
+ ) -> Any:
101
+ """
102
+ Creates an experiment instance from parsed command-line arguments.
103
+
104
+ Parameters
105
+ ----------
106
+ args : argparse.Namespace
107
+ An argparse Namespace object containing parsed command-line arguments.
108
+ dataclass : Callable[..., Any]
109
+ A dataclass constructor that takes keyword arguments corresponding
110
+ to experiment parameters.
111
+
112
+ Returns
113
+ -------
114
+ Any
115
+ An instance of the dataclass initialized with the parsed arguments.
116
+ """
117
+ return apply(
118
+ dataclass,
119
+ {
120
+ k.replace("--", "").replace("-", "_"): v
121
+ for k, v in vars(args).items()
122
+ },
123
+ )
124
+
125
+
126
+ def with_meta(default: Any, help: str, **kwargs):
127
+ """
128
+ Creates a dataclass field with default value, help string,
129
+ and additional metadata.
130
+
131
+ This function simplifies the creation of dataclass fields by
132
+ providing a convenient way to set a default value,
133
+ a help string, and other metadata attributes for a field.
134
+
135
+ Parameters
136
+ ----------
137
+ default : Any
138
+ The default value for the field. If callable,
139
+ it's treated as a default factory.
140
+ help : str
141
+ The help string describing the field's purpose.
142
+ **kwargs
143
+ Additional keyword arguments to be included in the field's metadata.
144
+
145
+ Returns
146
+ -------
147
+ dataclasses.Field
148
+ A dataclass Field object with the specified default,
149
+ help, and metadata.
150
+ """
151
+ args: Dict[str, Any] = {"metadata": {"help": help, **kwargs}}
152
+ if callable(default):
153
+ args["default_factory"] = default
154
+ else:
155
+ args["default"] = default
156
+ return dataclasses.field(**args)
157
+
158
+
159
+ def dataclass_non_defaults_to_string(data_obj):
160
+ """Converts non-default values of a dataclass object's fields to a string,
161
+ excluding 'seed' and 'env_name'.
162
+
163
+ Parameters
164
+ ----------
165
+ data_obj : dataclass object
166
+ The dataclass object to process.
167
+
168
+ Returns
169
+ -------
170
+ str
171
+ A string representation of non-default field values, or "base" if all
172
+ fields have default values
173
+ (excluding the 'seed' and 'env_name' attributes).
174
+
175
+ Raises
176
+ ------
177
+ TypeError
178
+ If the input is not a dataclass object.
179
+ """
180
+ if not is_dataclass(data_obj):
181
+ raise TypeError("Input must be a dataclass object.")
182
+
183
+ non_defaults = []
184
+ for field in fields(data_obj):
185
+ if field.name == "seed" or field.name == "env_name":
186
+ continue
187
+ if getattr(data_obj, field.name) != field.default:
188
+ non_defaults.append(
189
+ field.name + str(getattr(data_obj, field.name))
190
+ )
191
+
192
+ return "_".join(non_defaults) or "base"
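
A small end-to-end sketch of these helpers on a hypothetical toy dataclass: `with_meta` attaches the help text, `create_argparse_dict` builds the parser arguments, and `create_experiment_from_args` reconstructs the dataclass.

```python
# Hedged sketch: with_meta -> argparse -> dataclass round trip on a toy config.
import argparse
from dataclasses import dataclass

from udrl.cli import create_argparse_dict, create_experiment_from_args, with_meta


@dataclass
class ToyConf:
    env_name: str = with_meta("CartPole-v0", "Environment to use")
    episodes: int = with_meta(10, "Number of episodes")
    render: bool = with_meta(False, "Render the environment")


parser = argparse.ArgumentParser()
for name, kwargs in create_argparse_dict(ToyConf).items():
    parser.add_argument(name, **kwargs)

args = parser.parse_args(["--episodes", "3", "--render"])
conf = create_experiment_from_args(args, ToyConf)
print(conf)   # ToyConf(env_name='CartPole-v0', episodes=3, render=True)
```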
udrl/data_proc.py ADDED
@@ -0,0 +1,51 @@
1
+ from pathlib import Path
2
+ import numpy as np
3
+ import json
4
+ import csv
5
+
6
+ naming = {
7
+ "neural": "NN",
8
+ "ensemble.ExtraTreesClassifier": "ET",
9
+ "ensemble.RandomForestClassifier": "RF",
10
+ }
11
+
12
+ if __name__ == "__main__":
13
+ path = Path("data")
14
+ csvs_path = path / "csvs"
15
+ csvs_path.mkdir(parents=True, exist_ok=True)
16
+ for env in path.iterdir():
17
+ all_paths = list(set([p.parent for p in env.rglob("*.npy")]))
18
+ if not all_paths:
19
+ continue
20
+ toy_rewards = np.load(all_paths[0] / "train_rewards.npy")
21
+ data = {"episode": list(range(len(toy_rewards)))}
22
+ estimators = {
23
+ "neural": ([], [], [], []),
24
+ "ensemble.ExtraTreesClassifier": ([], [], [], []),
25
+ "ensemble.RandomForestClassifier": ([], [], [], []),
26
+ }
27
+ for exp in all_paths:
28
+ print(exp)
29
+ rewards = np.load(exp / "train_rewards.npy")
30
+
31
+ with open((exp / "conf.json"), "r") as f:
32
+ conf = json.load(f)
33
+
34
+ estimators[conf["estimator_name"]][0].append(list(rewards[:, 0]))
35
+ estimators[conf["estimator_name"]][1].append(list(rewards[:, 1]))
36
+ estimators[conf["estimator_name"]][2].append(conf["test_mean"])
37
+ estimators[conf["estimator_name"]][3].append(conf["test_std"])
38
+
39
+ for k, v in estimators.items():
40
+ data[naming[k] + "_mean"] = [
41
+ "{:.2f}".format(np.mean(x)) for x in zip(*v[0])
42
+ ]
43
+ data[naming[k] + "_std"] = [
44
+ "{:.2f}".format(np.std(x)) for x in zip(*v[0])
45
+ ]
46
+ print(f"{k}:{env.name}-> {np.median(v[2])} +- {np.median(v[3])}")
47
+
48
+ with open(csvs_path / f"{env.name}.csv", "w") as f:
49
+ w = csv.writer(f)
50
+ w.writerow(data.keys())
51
+ w.writerows(zip(*data.values()))
udrl/inference.py ADDED
@@ -0,0 +1,122 @@
1
+ import matplotlib.pyplot as plt
2
+ import numpy as np
3
+ from .policies import SklearnPolicy, NeuralPolicy
4
+ from .agent import UpsideDownAgent, AgentHyper
5
+ from pathlib import Path
6
+ from collections import Counter
7
+ from tqdm import trange
+ from pprint import pprint
8
+
9
+
10
+ def get_common(base, env, conf, seed):
11
+
12
+ path = base / env / conf / seed
13
+
14
+ if not path.exists():
15
+ print("Cannot find path")
16
+ return None, None
17
+ algo_name = (
18
+ "NN" if "neural" in conf else ("ET" if "Extra" in conf else "RT")
19
+ )
20
+
21
+ des_ret = np.load(str(path / "desired_returns.npy")).astype(int)
22
+ des_hor = np.load(str(path / "desired_horizons.npy")).astype(int)
23
+ # rew = np.load(str(path / "train_rewards.npy")).astype(int)[:, 0]
24
+
25
+ te = []
26
+ prev = -np.inf
27
+ for i, x in enumerate(des_hor):
28
+ if prev < x:
29
+ te.append(i)
30
+ prev = x
31
+
32
+ init_des_ret = des_ret[te]
33
+ init_des_hor = des_hor[te]
34
+
35
+ mean_des_ret = []
36
+ mean_des_hor = []
37
+ tmp_r = []
38
+
39
+ tmp_h = []
40
+ for i, (ret, hor) in enumerate(zip(init_des_ret, init_des_hor)):
41
+ tmp_r.append(ret)
42
+ tmp_h.append(hor)
43
+ if i % 15 == 0:
44
+ mean_des_hor.append(np.mean(tmp_h))
45
+ mean_des_ret.append(np.mean(tmp_r))
46
+ tmp_r = []
47
+ tmp_h = []
48
+
49
+ common_hor = Counter(init_des_hor[-1500:]).most_common()[0][0]
50
+ common_ret = Counter(init_des_ret[-1500:]).most_common()[0][0]
51
+ print(f"{env}:{algo_name}.horizon-> {common_hor}")
52
+ print(f"{env}:{algo_name}.return-> {common_ret}")
53
+ return common_ret, common_hor
54
+
55
+
56
+ def test_desired(base, env, conf, des_ret, des_hor):
57
+
58
+ algo_name = (
59
+ "NN" if "neural" in conf else ("ET" if "Extra" in conf else "RT")
60
+ )
61
+ if des_hor is None or des_ret is None:
62
+ print(f"Invalid desired for {env}:{algo_name}")
63
+ return
64
+ for path in (base / env / conf).iterdir():
65
+ if "neural" in conf:
66
+ policy = NeuralPolicy.load(str(path / "policy"))
67
+ else:
68
+ policy = SklearnPolicy.load(str(path / "policy"))
69
+
70
+ hyper = AgentHyper(env, warm_up=0)
71
+
72
+ agent = UpsideDownAgent(hyper, policy)
73
+
74
+ final_r = [
75
+ agent.collect_episode(
76
+ des_ret,
77
+ des_hor,
78
+ test=True,
79
+ store_episode=False,
80
+ )[0]
81
+ for _ in range(100)
82
+ ]
83
+ print(
84
+ f"{env}:{algo_name}:{path.name}:r.{des_ret}:h.{des_hor}"
85
+ f" -> {np.median(final_r):.2f} +- {np.std(final_r):.2f}"
86
+ f",max {np.max(final_r):.2f},min {np.min(final_r):.2f}"
87
+ )
88
+
89
+
90
+ base = Path("/home/vimmoos/upside_down_rl/data")
91
+ confs = {
92
+ "NN": "estimator_nameneural_batch_size256_warm_up260",
93
+ "ET": "estimator_nameensemble.ExtraTreesClassifier_train_per_iter1",
94
+ "RT": "train_per_iter1",
95
+ }
96
+ envs = ["LunarLander-v2", "Acrobot-v1"]
97
+ seeds = [str(45), str(46)]
98
+
99
+ res = {}
100
+
101
+
102
+ for env in envs:
103
+ res[env] = {}
104
+ for algo_name, conf in confs.items():
105
+ res[env][algo_name] = {}
106
+ for seed in seeds:
107
+ ret, hor = get_common(base, env, conf + "_save_desiredTrue", seed)
108
+ res[env][algo_name][seed] = (ret, hor)
109
+
110
+
111
+ pprint(res)
112
+
113
+ for env, algos in res.items():
114
+ for algo, seeds in algos.items():
115
+ for _, vals in seeds.items():
116
+ test_desired(base, env, confs[algo], *vals)
117
+
118
+
119
+ # plt.plot(mean_des_ret)
120
+ # plt.plot(mean_des_hor)
121
+ # plt.plot(rew)
122
+ # plt.show()
udrl/plot.py ADDED
@@ -0,0 +1,189 @@
1
+ from .policies import SklearnPolicy
2
+ from .agent import UpsideDownAgent, AgentHyper
3
+ from pathlib import Path
4
+ import matplotlib.pyplot as plt
5
+ import numpy as np
6
+ from itertools import zip_longest, tee
7
+ from tqdm import tqdm, trange
8
+ import imageio
9
+
10
+
11
+ def calculate_ep_feat_importance(
12
+ episode, agent, desired_return, desired_horizon
13
+ ):
14
+ ep_features = []
15
+
16
+ for state, _, reward in zip(*episode.values()):
17
+ command = np.array(
18
+ [
19
+ desired_return * agent.conf.return_scale,
20
+ desired_horizon * agent.conf.horizon_scale,
21
+ ]
22
+ )
23
+ command = np.expand_dims(command, axis=0)
24
+ ext_state = np.concatenate((state, command), axis=1)
25
+
26
+ feature_importances = {}
27
+
28
+ for t in agent.policy.estimator.estimators_:
29
+ branch = np.array(t.decision_path(ext_state).todense(), dtype=bool)
30
+ imp = t.tree_.impurity[branch[0]]
31
+ for f, i in zip(
32
+ t.tree_.feature[branch[0]][:-1], imp[:-1] - imp[1:]
33
+ ):
34
+ feature_importances.setdefault(f, []).append(i)
35
+
36
+ # Line 8 Algorithm 2
37
+ desired_return -= reward
38
+ # Line 9 Algorithm 2
39
+ desired_horizon = max(desired_horizon - 1, 1)
40
+
41
+ summed_importances = [
42
+ sum(feature_importances[k])
43
+ for k in range(len(feature_importances.keys()))
44
+ ]
45
+ ep_features.append(summed_importances)
46
+ return ep_features
47
+
48
+
49
+ def summarize_episodes_feat(
50
+ episodes_feat, summarize_funs: list = [np.mean, np.std]
51
+ ):
52
+ return [
53
+ [
54
+ [
55
+ fun(list(data))
56
+ for fun, data in zip(
57
+ summarize_funs,
58
+ tee(
59
+ (s for s in state if s is not None),
60
+ len(summarize_funs),
61
+ ),
62
+ )
63
+ ]
64
+ for state in zip_longest(*ep)
65
+ ]
66
+ for ep in zip_longest(*episodes_feat, fillvalue=[])
67
+ ]
68
+
69
+
70
+ def calculate_features_importance(
71
+ path: Path,
72
+ env: str,
73
+ desired_return: int,
74
+ desired_horizon: int,
75
+ horizon_scale: float,
76
+ return_scale: float,
77
+ redundancy: int = 100,
78
+ ):
79
+ policy = SklearnPolicy.load(str(path / "policy"))
80
+ hyper = AgentHyper(
81
+ env,
82
+ warm_up=0,
83
+ horizon_scale=horizon_scale,
84
+ return_scale=return_scale,
85
+ )
86
+
87
+ agent = UpsideDownAgent(hyper, policy)
88
+
89
+ for _ in trange(redundancy, desc="Collect Data"):
90
+ agent.collect_episode(desired_return, desired_horizon, test=True)
91
+
92
+ episodes = [
93
+ {k: v for k, v in ep.items() if k != "summed_rewards"}
94
+ for ep in agent.memory.buffer
95
+ ]
96
+
97
+ episodes_feat = [
98
+ calculate_ep_feat_importance(
99
+ ep, agent, desired_return, desired_horizon
100
+ )
101
+ for ep in tqdm(episodes, desc="Calculate importance features")
102
+ ]
103
+
104
+ feature_importances = summarize_episodes_feat(episodes_feat)
105
+ return feature_importances
106
+
107
+
108
+ def example_plot(feature_importances):
109
+ for idx, state_feat in tqdm(
110
+ enumerate(feature_importances),
111
+ desc="Plotting",
112
+ total=len(feature_importances),
113
+ ):
114
+ x = np.arange(len(state_feat))
115
+
116
+ plt.figure()
117
+ plt.title(f"Cartpole-v0 State {idx}")
118
+ plt.bar(x, [x[0] for x in state_feat], yerr=[x[1] for x in state_feat])
119
+
120
+ plt.xticks(
121
+ x,
122
+ [
123
+ *[f"feature-{index}" for index in range(len(state_feat) - 2)],
124
+ r"$d_t^{r}$",
125
+ r"$d_t^{h}$",
126
+ ],
127
+ )
128
+ plt.savefig(f"data/example_plot2/importances_state_{idx}")
129
+ plt.close()
130
+
131
+
132
+ def create_gif_from_plots(
133
+ image_filenames, output_filename="animation.gif", duration=0.5
134
+ ):
135
+ """Creates a GIF from a list of image filenames."""
136
+
137
+ images = [imageio.imread(filename) for filename in image_filenames]
138
+ imageio.mimsave(output_filename, images, duration=duration)
139
+
140
+
141
+ base_path = Path("data")
142
+ env = "CartPole-v0"
143
+ estimator = "ExtraTreesClassifier"
144
+ seed = str(42)
145
+ conf_name = "estimator_nameensemble.ExtraTreesClassifier_train_per_iter1"
146
+ desired_return = 200
147
+ desired_horizon = 200
148
+
149
+ path = base_path / env / conf_name / seed
150
+
151
+
152
+ res = calculate_features_importance(
153
+ path, env, desired_return, desired_horizon, 0.02, 0.02
154
+ )
155
+ example_plot(res)
156
+
157
+ image_filenames = [
158
+ f"data/example_plot2/importances_state_{idx}.png"
159
+ for idx in range(len(res))
160
+ ]
161
+
162
+ create_gif_from_plots(image_filenames)
163
+
164
+
165
+ # import numpy as np
166
+ # import matplotlib.pyplot as plt
167
+ # from sklearn.cluster import KMeans, HDBSCAN
168
+ # from sklearn.decomposition import PCA
169
+
170
+ # # Assuming you have your data in a numpy array 'data'
171
+ # data = np.array(res)[:, :, 0]
172
+
173
+ # # 1. Apply K-Means clustering
174
+ # kmeans = HDBSCAN()
175
+ # kmeans.fit(data)
176
+ # labels = kmeans.labels_
177
+
178
+ # # 2. Dimensionality Reduction for visualization (PCA)
179
+ # pca = PCA(n_components=2) # Reduce to 2 dimensions for plotting
180
+ # data_pca = pca.fit_transform(data)
181
+
182
+ # # 3. Plotting
183
+ # plt.figure(figsize=(10, 8))
184
+ # plt.scatter(data_pca[:, 0], data_pca[:, 1], c=labels, cmap="viridis")
185
+ # plt.title("K-Means Clustering Visualization")
186
+ # plt.xlabel("Principal Component 1")
187
+ # plt.ylabel("Principal Component 2")
188
+ # plt.colorbar()
189
+ # plt.show()
udrl/policies.py ADDED
@@ -0,0 +1,364 @@
1
+ from dataclasses import dataclass, field
2
+ from typing import Dict, Any, Union
3
+ from abc import ABC
4
+ import importlib
5
+ from pickle import dump, load
6
+
7
+
8
+ from sklearn.exceptions import NotFittedError
9
+ from sklearn.base import BaseEstimator
10
+ from sklearn.metrics import classification_report
11
+ import numpy as np
12
+
13
+ import torch
14
+ from torch import nn
15
+ from torch.distributions import Categorical
16
+
17
+
18
+ class ABCPolicy(ABC):
19
+ """An abstract base class for defining agent policies.
20
+
21
+ Methods
22
+ -------
23
+ __call__(state, command, test)
24
+ Selects an action based on the given state and command.
25
+
26
+ Parameters
27
+ ----------
28
+ state : np.array
29
+ The current state of the environment.
30
+ command : np.array
31
+ The command or goal provided to the policy.
32
+ test : bool
33
+ Whether the policy is being used in a testing scenario.
34
+
35
+ Returns
36
+ -------
37
+ int or np.array
38
+ The selected action.
39
+
40
+ train(states, commands, actions)
41
+ Trains the policy using the provided experiences.
42
+
43
+ Parameters
44
+ ----------
45
+ states : np.array
46
+ A batch of states.
47
+ commands : np.array
48
+ A batch of corresponding commands.
49
+ actions : np.array
50
+ A batch of corresponding actions taken.
51
+
52
+ Returns
53
+ -------
54
+ Dict[str, Any]
55
+ A dictionary containing training metrics or other information.
56
+ It MUST contain the key "metric"
57
+
58
+ save(path)
59
+ Saves the policy to the specified path.
60
+
61
+ Parameters
62
+ ----------
63
+ path : str
64
+ The path to save the policy to.
65
+
66
+ load(path)
67
+ Loads the policy from the specified path.
68
+
69
+ Parameters
70
+ ----------
71
+ path : str
72
+ The path to load the policy from.
73
+ """
74
+
75
+ def __call__(
76
+ self,
77
+ state: np.array,
78
+ command: np.array,
79
+ test: bool,
80
+ ) -> Union[int, np.array]: ...
81
+
82
+ def train(
83
+ self,
84
+ states: np.array,
85
+ commands: np.array,
86
+ actions: np.array,
87
+ ) -> Dict[str, Any]: ...
88
+
89
+ def save(self, path: str): ...
90
+ def load(path: str): ...
91
+
92
+
93
+ @dataclass
94
+ class SklearnPolicy(ABCPolicy):
95
+ """A policy using a scikit-learn estimator for action selection.
96
+
97
+ Parameters
98
+ ----------
99
+ epsilon : float
100
+ Exploration rate for epsilon-greedy action selection.
101
+ action_size : int
102
+ The number of possible actions in the environment.
103
+ estimator_name : str
104
+ The fully qualified name of the scikit-learn estimator class
105
+ (e.g., 'ensemble.RandomForestClassifier').
106
+ estimator_kwargs : Dict[str, Any], optional
107
+ Keyword arguments to pass to the estimator constructor (default: {}).
108
+
109
+ Attributes
110
+ ----------
111
+ estimator : BaseEstimator
112
+ The initialized scikit-learn estimator.
113
+
114
+ Methods
115
+ -------
116
+ __call__(state, command, test)
117
+ Selects an action based on the given state and command,
118
+ using the estimator or epsilon-greedy exploration.
119
+
120
+ train(states, commands, actions)
121
+ Trains the estimator using the provided experiences.
122
+
123
+ save(path)
124
+ Saves the policy (including the estimator) to a pickle file.
125
+
126
+ load(path)
127
+ Loads the policy (including the estimator) from a pickle file.
128
+ """
129
+
130
+ epsilon: float
131
+ action_size: int
132
+ estimator_name: str
133
+ estimator_kwargs: Dict[str, Any] = field(default_factory=dict)
134
+ estimator: BaseEstimator = field(init=False)
135
+
136
+ def __post_init__(self):
137
+ module, clf_name = self.estimator_name.split(".")
138
+ module = importlib.import_module("sklearn." + module)
139
+ self.estimator = getattr(module, clf_name)(
140
+ **self.estimator_kwargs,
141
+ )
142
+
143
+ def __call__(
144
+ self,
145
+ state: np.array,
146
+ command: np.array,
147
+ test: bool,
148
+ ):
149
+ input_state = np.concatenate((state, command), axis=1)
150
+ actions = None
151
+ try:
152
+ actions = self.estimator.predict(input_state)
153
+ except NotFittedError:
154
+ ...
155
+
156
+ if not test and (actions is None or np.random.rand() <= self.epsilon):
157
+ return np.random.choice(self.action_size)
158
+ return actions[0]
159
+
160
+ def train(
161
+ self,
162
+ states: np.array,
163
+ commands: np.array,
164
+ actions: np.array,
165
+ ):
166
+ input_state = np.concatenate((states, commands), axis=1)
167
+ self.estimator.fit(input_state, actions)
168
+ pred = self.estimator.predict(input_state)
169
+ report = classification_report(actions, pred, output_dict=True)
170
+ report["metric"] = report["accuracy"]
171
+ return report
172
+
173
+ def save(self, path: str):
174
+ with open(path + ".pkl", "wb") as f:
175
+ dump(self, f)
176
+
177
+ def load(path: str):
178
+ with open(path + ".pkl", "rb") as f:
179
+ policy = load(f)
180
+ return policy
181
+
182
+
183
+ class BehaviorNet(nn.Module):
184
+ """
185
+ A neural network module designed to model agent behavior based on state
186
+ and command inputs.
187
+
188
+ Parameters
189
+ ----------
190
+ state_size : int
191
+ Dimensionality of the state input.
192
+ action_size : int
193
+ Dimensionality of the action output.
194
+ command_size : int
195
+ Dimensionality of the command input.
196
+ hidden_size : int, optional
197
+ Number of neurons in the hidden layers. Defaults to 64.
198
+
199
+ Returns
200
+ -------
201
+ torch.Tensor
202
+ A probability distribution over actions,
203
+ shaped (batch_size, action_size).
204
+ """
205
+
206
+ def __init__(
207
+ self,
208
+ state_size: int,
209
+ action_size: int,
210
+ command_size: int,
211
+ hidden_size: int = 64,
212
+ ):
213
+ super().__init__()
214
+ self.state_entry = nn.Sequential(
215
+ nn.Linear(state_size, hidden_size), nn.Sigmoid()
216
+ )
217
+ self.command_entry = nn.Sequential(
218
+ nn.Linear(command_size, hidden_size), nn.Sigmoid()
219
+ )
220
+ self.model = nn.Sequential(
221
+ nn.Linear(hidden_size, hidden_size),
222
+ nn.ReLU(),
223
+ nn.Linear(hidden_size, hidden_size),
224
+ nn.ReLU(),
225
+ nn.Linear(hidden_size, hidden_size),
226
+ nn.ReLU(),
227
+ nn.Linear(hidden_size, action_size),
228
+ nn.Softmax(dim=-1),
229
+ )
230
+
231
+ def forward(self, state, command):
232
+ state_out = self.state_entry(state)
233
+ command_out = self.command_entry(command)
234
+ out = state_out * command_out
235
+ return self.model(out)
236
+
237
+
238
+ @dataclass
239
+ class NeuralPolicy(ABCPolicy):
240
+ """
241
+ A policy that uses a neural network to map states and commands to actions.
242
+
243
+ Parameters
244
+ ----------
245
+ state_size : int
246
+ The dimensionality of the state input.
247
+ action_size : int
248
+ The dimensionality of the action output.
249
+ command_size : int, optional
250
+ The dimensionality of the command input. Defaults to 2.
251
+ hidden_size : int, optional
252
+ The number of neurons in the hidden layers of the neural network.
253
+ Defaults to 64.
254
+ device : str, optional
255
+ The device on which to run the neural network.
256
+ Can be "auto" (to automatically select CUDA if available, else CPU),
257
+ or a valid torch device string. Defaults to "auto".
258
+ loss : nn.Module, optional
259
+ The loss function class used for training.
260
+ Defaults to `nn.CrossEntropyLoss`.
261
+
262
+ Attributes
263
+ ----------
264
+ estimator : nn.Module
265
+ The neural network used to estimate the action probabilities.
266
+ loss : nn.Module
267
+ The instantiated loss function used for training.
268
+ optim : torch.optim.Adam
269
+ The optimizer used for training.
270
+
271
+ Methods
272
+ -------
273
+ __call__(state, command, test)
274
+ Selects an action based on the given state and command
275
+
276
+ train(states, commands, actions)
277
+ Trains the estimator using the provided experiences.
278
+
279
+ save(path)
280
+ Saves the policy.
281
+
282
+ load(path)
283
+ Loads the policy.
284
+ """
285
+
286
+ state_size: int
287
+ action_size: int
288
+ command_size: int = 2
289
+ hidden_size: int = 64
290
+ # NOTE: GPU may be drastically slower for small batch_size
291
+ device: str = "cpu"
292
+ loss: nn.Module = nn.CrossEntropyLoss
293
+ estimator: nn.Module = field(init=False)
294
+
295
+ def __post_init__(self):
296
+ self.estimator = BehaviorNet(
297
+ self.state_size,
298
+ self.action_size,
299
+ self.command_size,
300
+ self.hidden_size,
301
+ )
302
+ if self.device == "auto":
303
+ self.device = torch.device(
304
+ "cuda" if torch.cuda.is_available() else "cpu"
305
+ )
306
+ self.estimator.to(self.device)
307
+
308
+ self.loss = self.loss()
309
+ self.optim = torch.optim.Adam(self.estimator.parameters())
310
+
311
+ def __call__(
312
+ self,
313
+ state: np.ndarray,
314
+ command: np.ndarray,
315
+ test: bool,
316
+ ):
317
+ state = torch.FloatTensor(state).to(self.device)
318
+ command = torch.FloatTensor(command).to(self.device)
319
+ action_probs = self.estimator(state, command)
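+ # Greedy (argmax) action when test=True; otherwise sample from the
+ # predicted action distribution.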
320
+ if test:
321
+ return torch.argmax(action_probs).item()
322
+ return Categorical(action_probs).sample().item()
323
+
324
+ def train(
325
+ self,
326
+ states: np.ndarray,
327
+ commands: np.ndarray,
328
+ actions: np.ndarray,
329
+ ):
330
+ states = torch.FloatTensor(states).to(self.device)
331
+ commands = torch.FloatTensor(commands).to(self.device)
332
+ actions = torch.LongTensor(actions).to(self.device)
333
+
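+ # Supervised update: cross-entropy between the predicted action distribution
+ # and the actions actually taken for each (state, command) pair.
+ # NOTE: nn.CrossEntropyLoss expects raw logits, while BehaviorNet already
+ # applies a Softmax; training still works, but the loss is effectively
+ # computed on log-softmaxed probabilities.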
334
+ pred = self.estimator(states, commands)
335
+ self.optim.zero_grad()
336
+ loss = self.loss(pred, actions)
337
+ loss.backward()
338
+ self.optim.step()
339
+ return {"metric": loss.item()}
340
+
341
+ def save(self, path: str):
342
+ torch.save(
343
+ {
344
+ "model": self.estimator.state_dict(),
345
+ "optim": self.optim.state_dict(),
346
+ "state_size": self.state_size,
347
+ "action_size": self.action_size,
348
+ "command_size": self.command_size,
349
+ "hidden_size": self.hidden_size,
350
+ },
351
+ path + ".pth",
352
+ )
353
+
354
+ @staticmethod
+ def load(path: str):
355
+ saved_dict = torch.load(path + ".pth")
356
+ policy = NeuralPolicy(
357
+ saved_dict["state_size"],
358
+ saved_dict["action_size"],
359
+ saved_dict["command_size"],
360
+ saved_dict["hidden_size"],
361
+ )
362
+ policy.estimator.load_state_dict(saved_dict["model"])
363
+ policy.optim.load_state_dict(saved_dict["optim"])
364
+ return policy
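+ # Usage sketch (hypothetical CartPole-v0 sizes and save path, not part of the
+ # original commit):
+ #
+ #     policy = NeuralPolicy(state_size=4, action_size=2)
+ #     state = np.zeros((1, 4), dtype=np.float32)   # one observation, batched
+ #     command = np.array([[1.0, 0.5]])             # scaled (desired_return, desired_horizon)
+ #     action = policy(state, command, test=True)
+ #     policy.save("data/neural_policy")            # writes data/neural_policy.pth
+ #     restored = NeuralPolicy.load("data/neural_policy")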
udrl/test.py ADDED
@@ -0,0 +1,137 @@
1
+ # import gymnasium as gym
2
+ # import pygame
3
+ # import numpy as np
4
+
5
+
6
+ # def normalize_value(value, is_bounded, low=None, high=None):
7
+ # if is_bounded:
8
+ # return (value - low) / (high - low)
9
+ # else:
10
+ # return 0.5 * (np.tanh(value / 2) + 1)
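+ # Note: 0.5 * (np.tanh(x / 2) + 1) equals the logistic sigmoid 1 / (1 + np.exp(-x)),
+ # so unbounded observations are squashed smoothly into (0, 1) with 0 mapping to 0.5.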
11
+
12
+
13
+ # def draw_bar(screen, start, value, max_length, color, height=20):
14
+ # bar_length = value * max_length
15
+ # pygame.draw.rect(screen, color, (*start, bar_length, height))
16
+ # pygame.draw.rect(
17
+ # screen, (0, 0, 0), (*start, max_length, height), 2
18
+ # ) # Border
19
+ # mid_x = start[0] + max_length / 2
20
+ # pygame.draw.line(
21
+ # screen, (0, 0, 0), (mid_x, start[1]), (mid_x, start[1] + height), 2
22
+ # )
23
+
24
+
25
+ # def visualize_environment(screen, state, env):
26
+ # screen_width, screen_height = screen.get_size()
27
+ # screen.fill((255, 255, 255))
28
+
29
+ # # Visualize environment-specific elements
30
+ # if env.spec.id.startswith("CartPole"):
31
+ # cart_x = int(state[0] * 50 + screen_width // 2)
32
+ # cart_y = screen_height - 100
33
+ # pole_angle = state[2]
34
+ # pygame.draw.rect(screen, (0, 0, 0), (cart_x - 30, cart_y - 15, 60, 30))
35
+ # pygame.draw.line(
36
+ # screen,
37
+ # (0, 0, 0),
38
+ # (cart_x, cart_y),
39
+ # (
40
+ # cart_x + int(np.sin(pole_angle) * 100),
41
+ # cart_y - int(np.cos(pole_angle) * 100),
42
+ # ),
43
+ # 6,
44
+ # )
45
+ # elif env.spec.id.startswith("Acrobot"):
46
+ # center_x, center_y = screen_width // 2, screen_height // 2
47
+ # l1, l2 = 100, 100 # Length of links
48
+ # s0, s1 = state[0], state[1] # sin(theta1), sin(theta2)
49
+ # c0, c1 = state[2], state[3] # cos(theta1), cos(theta2)
50
+ # x0, y0 = center_x, center_y
51
+ # x1 = x0 + l1 * s0
52
+ # y1 = y0 + l1 * c0
53
+ # x2 = x1 + l2 * s1
54
+ # y2 = y1 + l2 * c1
55
+ # pygame.draw.line(screen, (0, 0, 0), (x0, y0), (x1, y1), 6)
56
+ # pygame.draw.line(screen, (0, 0, 0), (x1, y1), (x2, y2), 6)
57
+ # pygame.draw.circle(screen, (0, 0, 255), (int(x0), int(y0)), 10)
58
+ # pygame.draw.circle(screen, (0, 255, 0), (int(x1), int(y1)), 10)
59
+ # pygame.draw.circle(screen, (255, 0, 0), (int(x2), int(y2)), 10)
60
+ # # Add more environment-specific visualizations here as needed
61
+
62
+ # # Draw bars for each state dimension
63
+ # num_dims = env.observation_space.shape[0]
64
+ # bar_colors = [
65
+ # (255, 0, 0),
66
+ # (0, 255, 0),
67
+ # (0, 0, 255),
68
+ # (255, 255, 0),
69
+ # (255, 0, 255),
70
+ # (0, 255, 255),
71
+ # ]
72
+ # bar_starts = [(50, 50 + i * 70) for i in range(num_dims)]
73
+ # max_length = 300
74
+
75
+ # for i, (start, color) in enumerate(zip(bar_starts, bar_colors)):
76
+ # is_bounded = not (
77
+ # env.observation_space.high[i] > 100
78
+ # ) and not (env.observation_space.low[i] < -100)
79
+ # normalized_value = normalize_value(
80
+ # state[i],
81
+ # is_bounded,
82
+ # env.observation_space.low[i],
83
+ # env.observation_space.high[i],
84
+ # )
85
+ # draw_bar(screen, start, normalized_value, max_length, color)
86
+
87
+ # # Draw labels
88
+ # font = pygame.font.Font(None, 30)
89
+ # text = font.render(f"Dim {i}: {state[i]:.2f}", True, (0, 0, 0))
90
+ # screen.blit(text, (start[0], start[1] - 30))
91
+
92
+ # # Add description of bar representation
93
+ # if is_bounded:
94
+ # desc = f"(Range: {env.observation_space.low[i]:.2f} to {env.observation_space.high[i]:.2f})"
95
+ # else:
96
+ # desc = "(Unbounded: Center is 0, edges are ±∞)"
97
+ # desc_text = pygame.font.Font(None, 24).render(
98
+ # desc, True, (100, 100, 100)
99
+ # )
100
+ # screen.blit(desc_text, (start[0], start[1] + 25))
101
+
102
+ # pygame.display.flip()
103
+
104
+
105
+ # def run_visualization(env_name):
106
+ # pygame.init()
107
+ # screen = pygame.display.set_mode((800, 600))
108
+ # pygame.display.set_caption(f"{env_name} Visualization")
109
+
110
+ # env = gym.make(env_name)
111
+ # state, _ = env.reset()
112
+
113
+ # clock = pygame.time.Clock()
114
+
115
+ # running = True
116
+ # while running:
117
+ # visualize_environment(screen, state, env)
118
+ # action = env.action_space.sample()
119
+ # state, reward, done, truncated, info = env.step(action)
120
+
121
+ # if done or truncated:
122
+ # state, _ = env.reset()
123
+
124
+ # for event in pygame.event.get():
125
+ # if event.type == pygame.QUIT:
126
+ # running = False
127
+
128
+ # clock.tick(60) # Limit to 60 FPS
129
+
130
+ # env.close()
131
+ # pygame.quit()
132
+
133
+
134
+ # # Example usage
135
+ # # run_visualization("CartPole-v1")
136
+ # # Uncomment the line below to run Acrobot visualization
137
+ # run_visualization("Acrobot-v1")
udrl/viz.py ADDED
@@ -0,0 +1,310 @@
1
+ import gymnasium as gym
2
+ import pygame
3
+ import numpy as np
4
+ from .policies import SklearnPolicy
5
+ from .agent import UpsideDownAgent, AgentHyper
6
+ from pathlib import Path
7
+ import json
8
+
9
+
10
+ def normalize_value(value, is_bounded, low=None, high=None):
11
+ return (value - low) / (high - low)
12
+
13
+
14
+ def draw_bar(screen, start, value, max_length, color, height=20, mid=True):
15
+ bar_length = value * max_length
16
+ pygame.draw.rect(screen, color, (*start, bar_length, height))
17
+ pygame.draw.rect(screen, (0, 0, 0), (*start, max_length, height), 2)
18
+ if mid:
19
+ mid_x = start[0] + max_length / 2
20
+ pygame.draw.line(
21
+ screen, (0, 0, 0), (mid_x, start[1]), (mid_x, start[1] + height), 2
22
+ )
23
+
24
+
25
+ def create_button(text, position, size):
26
+ font = pygame.font.Font(None, 36)
27
+ button_rect = pygame.Rect(position, size)
28
+ text_surf = font.render(text, True, (0, 0, 0))
29
+ text_rect = text_surf.get_rect(center=button_rect.center)
30
+ return button_rect, text_surf, text_rect
31
+
32
+
33
+ def visualize_environment(
34
+ screen,
35
+ state,
36
+ env,
37
+ env_surface,
38
+ paused,
39
+ feature_importances,
40
+ epoch,
41
+ max_epoch=200,
42
+ ):
43
+ screen_width, screen_height = screen.get_size()
44
+ screen.fill((255, 255, 255))
45
+ screen.blit(env_surface, (0, 0))
46
+
47
+ num_dims = len(feature_importances)
48
+ bar_colors = [
49
+ (255, 0, 0), # Red
50
+ (0, 255, 0), # Green
51
+ (0, 0, 255), # Blue
52
+ (255, 255, 0), # Yellow
53
+ (255, 0, 255), # Magenta
54
+ (0, 255, 255), # Cyan
55
+ (128, 128, 0), # Olive
56
+ (0, 128, 128), # Teal
57
+ (128, 0, 0), # Maroon
58
+ (0, 128, 0), # Dark Green
59
+ (0, 0, 128), # Navy
60
+ (128, 128, 128), # Gray
61
+ (192, 192, 192), # Light Gray
62
+ (255, 165, 0), # Orange
63
+ (255, 192, 203), # Pink
64
+ ]
65
+
66
+ bar_starts = [
67
+ (screen_width - 350, 50 + i * 70) for i in range(num_dims + 1)
68
+ ]
69
+ max_length = 300
70
+
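+ # The first bar tracks episode progress (epoch / max_epoch); the remaining
+ # bars show per-feature importances, where the last two inputs correspond to
+ # the desired-return and desired-horizon commands.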
71
+ for i, (start, color) in enumerate(zip(bar_starts, bar_colors)):
72
+ if i == 0:
73
+ normalized_value = epoch / max_epoch
74
+ draw_bar(
75
+ screen, start, normalized_value, max_length, color, mid=False
76
+ )
77
+ font = pygame.font.Font(None, 30)
78
+ text = font.render(
79
+ f"Epoch {epoch}",
80
+ True,
81
+ (0, 0, 0),
82
+ )
83
+ screen.blit(text, (start[0], start[1] - 30))
84
+ continue
85
+ i -= 1
86
+ normalized_value = feature_importances[i] / 30
87
+ draw_bar(screen, start, normalized_value, max_length, color)
88
+
89
+ font = pygame.font.Font(None, 30)
90
+ what = f"Importance {i}"
91
+ if len(feature_importances) - i == 2:
92
+ what = "Desired Return"
93
+ if len(feature_importances) - i == 1:
94
+ what = "Desired Horizon"
95
+
96
+ text = font.render(
97
+ f"{what}: {feature_importances[i]:.2f}", True, (0, 0, 0)
98
+ )
99
+ screen.blit(text, (start[0], start[1] - 30))
100
+
101
+ desc = "(Range: 0 to 30)"
102
+ desc_text = pygame.font.Font(None, 24).render(
103
+ desc, True, (100, 100, 100)
104
+ )
105
+ screen.blit(desc_text, (start[0], start[1] + 25))
106
+
107
+ button_width, button_height = 100, 50
108
+ reset_button, reset_text, reset_text_rect = create_button(
109
+ "Reset", (10, screen_height - 60), (button_width, button_height)
110
+ )
111
+ pause_play_button, pause_play_text, pause_play_text_rect = create_button(
112
+ "Pause" if not paused else "Play",
113
+ (120, screen_height - 60),
114
+ (button_width, button_height),
115
+ )
116
+ next_button, next_text, next_text_rect = create_button(
117
+ "Next", (230, screen_height - 60), (button_width, button_height)
118
+ )
119
+ save_button, save_text, save_text_rect = create_button(
120
+ "Save", (340, screen_height - 60), (button_width, button_height)
121
+ )
122
+
123
+ pygame.draw.rect(screen, (200, 200, 200), reset_button)
124
+ pygame.draw.rect(screen, (200, 200, 200), pause_play_button)
125
+ pygame.draw.rect(screen, (200, 200, 200), next_button)
126
+ pygame.draw.rect(screen, (200, 200, 200), save_button)
127
+ screen.blit(reset_text, reset_text_rect)
128
+ screen.blit(pause_play_text, pause_play_text_rect)
129
+ screen.blit(next_text, next_text_rect)
130
+ screen.blit(save_text, save_text_rect)
131
+
132
+ pygame.display.flip()
133
+ return reset_button, pause_play_button, next_button, save_button
134
+
135
+
136
+ def run_visualization(
137
+ env_name,
138
+ agent,
139
+ init_desired_return,
140
+ init_desired_horizon,
141
+ max_epoch,
142
+ base_path,
143
+ ):
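+ # The estimator repr (e.g. "RandomForestClassifier()") is trimmed of its
+ # trailing "()" so it can be used as a directory name for saved snapshots.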
144
+ base_path = (
145
+ Path(base_path) / env_name / agent.policy.estimator.__str__()[:-2]
146
+ )
147
+ base_path.mkdir(parents=True, exist_ok=True)
148
+ desired_return = init_desired_return
149
+ desired_horizon = init_desired_horizon
150
+
151
+ pygame.init()
152
+ screen = pygame.display.set_mode((1000, 800))
153
+ pygame.display.set_caption(f"{env_name} Visualization")
154
+
155
+ env = gym.make(env_name, render_mode="rgb_array")
156
+ state, _ = env.reset()
157
+
158
+ clock = pygame.time.Clock()
159
+ epoch = 0
160
+ save_index = 0
161
+
162
+ running = True
163
+ paused = False
164
+ step = False
165
+ while running:
166
+
167
+ env_render = env.render()
168
+ env_surface = pygame.surfarray.make_surface(env_render.swapaxes(0, 1))
169
+ if not paused or step:
170
+ command = np.array(
171
+ [
172
+ desired_return * agent.conf.return_scale,
173
+ desired_horizon * agent.conf.horizon_scale,
174
+ ]
175
+ )
176
+ command = np.expand_dims(command, axis=0)
177
+ state = np.expand_dims(state, axis=0)
178
+
179
+ action = agent.policy(state, command, True)
180
+
181
+ ext_state = np.concatenate((state, command), axis=1)
182
+
183
+ state, reward, done, truncated, info = env.step(action)
184
+
185
+ feature_importances = {
186
+ idx: [] for idx in range(ext_state.shape[1])
187
+ }
188
+
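+ # For each tree in the forest, follow the decision path taken by this
+ # (state, command) sample and credit every split feature with the impurity
+ # decrease between consecutive nodes on that path, giving a per-step estimate
+ # of which inputs drove the chosen action.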
189
+ for t in agent.policy.estimator.estimators_:
190
+ branch = np.array(
191
+ t.decision_path(ext_state).todense(), dtype=bool
192
+ )
193
+ imp = t.tree_.impurity[branch[0]]
194
+
195
+ for f, i in zip(
196
+ t.tree_.feature[branch[0]][:-1], imp[:-1] - imp[1:]
197
+ ):
198
+ feature_importances.setdefault(f, []).append(i)
199
+
200
+ # Line 8 of Algorithm 2: subtract the obtained reward from the desired return
201
+ desired_return -= reward
202
+ # Line 9 of Algorithm 2: decrement the desired horizon, never letting it drop below 1
203
+ desired_horizon = max(desired_horizon - 1, 1)
204
+
205
+ summed_importances = [
206
+ sum(feature_importances.get(k, [0.001]))
207
+ for k in range(len(feature_importances.keys()))
208
+ ]
209
+
210
+ epoch += 1
211
+
212
+ reset_button, pause_play_button, next_button, save_button = (
213
+ visualize_environment(
214
+ screen,
215
+ state,
216
+ env,
217
+ env_surface,
218
+ paused,
219
+ summed_importances,
220
+ epoch,
221
+ max_epoch,
222
+ )
223
+ )
224
+
225
+ if done or truncated:
226
+ state, _ = env.reset()
227
+ desired_horizon = init_desired_horizon
228
+ desired_return = init_desired_return
229
+ epoch = 0
230
+
231
+ step = False
232
+ for event in pygame.event.get():
233
+ if event.type == pygame.QUIT:
234
+ running = False
235
+ elif event.type == pygame.MOUSEBUTTONDOWN:
236
+ if reset_button.collidepoint(event.pos):
237
+ state, _ = env.reset()
238
+
239
+ desired_horizon = init_desired_horizon
240
+ desired_return = init_desired_return
241
+ epoch = 0
242
+ elif pause_play_button.collidepoint(event.pos):
243
+ paused = not paused
244
+ elif (
245
+ next_button.collidepoint(event.pos) and paused
246
+ ): # Only when paused
247
+ step = True
248
+ elif save_button.collidepoint(event.pos):
249
+ pygame.image.save(
250
+ env_surface,
251
+ str(base_path / f"env_image_{save_index}.png"),
252
+ )
253
+ with open(
254
+ str(base_path / f"info_{save_index}.json"), "w"
255
+ ) as f:
256
+ json.dump(
257
+ {
258
+ "state": {
259
+ i: str(val) for i, val in enumerate(state)
260
+ },
261
+ "feature": {
262
+ i: str(val)
263
+ for i, val in enumerate(summed_importances)
264
+ },
265
+ "action": str(action),
266
+ "reward": str(reward),
267
+ "desired_return": str(desired_return + reward),
268
+ "desired_horizon": str(desired_horizon + 1),
269
+ },
270
+ f,
271
+ indent=4,
272
+ )
273
+
274
+ save_index += 1
275
+ clock.tick(5)
276
+
277
+ env.close()
278
+ pygame.quit()
279
+
280
+
281
+ # LunarLander-v2:RT:43:r.57:h.102 -> -92.03 +- 81.51,max 36.37,min -327.94
282
+ # Acrobot-v1:RT:44:r.-79:h.82 -> -79.00 +- 47.01,max -64.00,min -500.00
283
+
284
+
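+ # The settings below correspond to the Acrobot-v1 run recorded above
+ # (seed 44, desired return -79, desired horizon 82).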
285
+ base_path = Path("data")
286
+ # env = "CartPole-v0"
287
+ env = "Acrobot-v1"
288
+ # env = "LunarLander-v2"
289
+ estimator = "RandomForestClassifier"
290
+ seed = str(44)
291
+ conf_name = "train_per_iter1"
292
+ desired_return = -79
293
+ desired_horizon = 82
294
+ max_epoch = 500
295
+
296
+ path = base_path / env / conf_name / seed
297
+
298
+ policy = SklearnPolicy.load(str(path / "policy"))
299
+ hyper = AgentHyper(
300
+ env,
301
+ warm_up=0,
302
+ # horizon_scale=horizon_scale,
303
+ # return_scale=return_scale,
304
+ )
305
+
306
+ agent = UpsideDownAgent(hyper, policy)
307
+
308
+ run_visualization(
309
+ env, agent, desired_return, desired_horizon, max_epoch, "data/viz_examples"
310
+ )