vimmoos@Thor committed on
Commit 610d19f · 1 Parent(s): bcb5af3

update website

Files changed (5)
  1. udrl/app/home.py +239 -52
  2. udrl/app/sim.py +4 -3
  3. udrl/data_proc.py +108 -0
  4. udrl/inference.py +152 -28
  5. udrl/test.py +0 -137
udrl/app/home.py CHANGED
@@ -4,63 +4,188 @@ import streamlit as st
4
  st.image("logo.jpg")
5
 
6
  st.html(
7
- """<div>
8
- <h1>Upside-Down Reinforcement Learning for More Interpretable Optimal Control</h1>
9
 
10
- <div class="abstract">
11
- <h2>Abstract</h2>
12
- <p>This research introduces a novel approach to reinforcement learning that emphasizes interpretability and explainability. By leveraging tree-based methods within the Upside-Down Reinforcement Learning (UDRL) framework, we demonstrate that it's possible to achieve performance comparable to neural networks while gaining significant advantages in terms of interpretability.</p>
13
- </div>
14
 
15
- <h2>What is Upside-Down Reinforcement Learning?</h2>
16
- <p>UDRL is an innovative paradigm that transforms reinforcement learning problems into supervised learning tasks. Unlike traditional approaches that focus on predicting rewards or learning environment models, UDRL learns to predict actions based on:</p>
17
- <ul>
18
- <li>Current state (s<sub>t</sub>)</li>
19
- <li>Desired reward (d<sub>r</sub>)</li>
20
- <li>Time horizon (d<sub>t</sub>)</li>
21
- </ul>
22
-
23
- <h2>Motivation</h2>
24
- <p>While neural networks have been the go-to choice for implementing UDRL, they lack interpretability. Our research explores whether other supervised learning algorithms, particularly tree-based methods, can:</p>
25
- <ul>
26
- <li>Match the performance of neural networks</li>
27
- <li>Provide more interpretable policies</li>
28
- <li>Enhance the explainability of reinforcement learning systems</li>
29
- </ul>
30
-
31
- <div class="results">
32
- <h2>Results</h2>
33
- <p>We tested three different implementations of the Behaviour Function:</p>
34
- <ul>
35
- <li>Neural Networks (NN)</li>
36
- <li>Random Forests (RF)</li>
37
- <li>Extremely Randomized Trees (ET)</li>
38
- </ul>
39
- <p>Tests were conducted on three popular OpenAI Gym environments:</p>
40
- <ul>
41
- <li>CartPole</li>
42
- <li>Acrobot</li>
43
- <li>Lunar-Lander</li>
44
- </ul>
45
 
46
- </div>
47
 
48
- <h2>Key Findings</h2>
49
- <ul>
50
- <li>Tree-based methods performed comparably to neural networks</li>
51
- <li>Random Forests and Extremely Randomized Trees provided fully interpretable policies</li>
52
- <li>Feature importance analysis revealed insights into decision-making processes</li>
53
- </ul>
54
-
55
- <h2>Implications</h2>
56
- <p>This research opens new avenues for:</p>
57
- <ul>
58
- <li>More explainable reinforcement learning systems</li>
59
- <li>Enhanced safety in AI decision-making</li>
60
- <li>Better understanding of agent behavior in complex environments</li>
61
- </ul>
62
  </div>
63
- """
 
64
  )
65
  st.html(
66
  """
@@ -74,8 +199,12 @@ st.html(
74
  margin: 0 auto;
75
  padding: 20px;
76
  }
77
  h1, h2 {
78
  color: #81a1c1;
 
79
  }
80
  .abstract {
81
  background-color: #2e3440;
@@ -108,3 +237,61 @@ st.html(
108
  </style>
109
  """
110
  )
4
  st.image("logo.jpg")
5
 
6
  st.html(
7
+ """
8
+ <body>
9
+ <div class="container">
10
+ <h1 >Upside-Down Reinforcement Learning for More Interpretable Optimal Control</h1>
11
+ <div class="authors" style="text-align: center;">
12
+ <p class="author-names">Juan Cardenas-Cartagena, Massimiliano Falzari, Marco Zullich, Matthia Sabatelli</p>
13
+ <p class="institution">Bernoulli Institute, University of Groningen, The Netherlands</p>
14
+ </div>
15
+ <h2><a href="https://arxiv.org/abs/2411.11457" target="_blank">Read the full paper on arXiv</a></h2>
16
 
17
+ <section class="motivation">
18
+ <h2>Research Motivation</h2>
19
+ <p>The dramatic growth in the adoption of Neural Networks (NNs) over the last 15 years has created a pressing need for increased transparency, especially in high-stakes applications. While NNs have demonstrated remarkable performance across various domains, they are essentially black boxes whose decision-making processes remain opaque to human understanding. This research addresses this fundamental challenge by exploring alternative approaches that maintain performance while dramatically improving interpretability.</p>
 
20
 
21
+ <div class="key-challenges">
22
+ <h3>Current Challenges in Reinforcement Learning</h3>
23
+ <p>Traditional approaches to Reinforcement Learning (RL) face several key limitations:</p>
24
+ <ul>
25
+ <li>Complex neural network policies are difficult to interpret and explain</li>
26
+ <li>Lack of transparency in decision-making processes poses risks in critical applications</li>
27
+ <li>Traditional RL approaches either focus on predicting rewards or learning environment models, making interpretation challenging</li>
28
+ <li>The gap between performance and interpretability has been difficult to bridge</li>
29
+ </ul>
30
+ </div>
31
+ </section>
32
 
33
+ <section class="udrl-framework">
34
+ <h2>The UDRL Framework: A Novel Approach</h2>
35
+ <p>Upside-Down Reinforcement Learning represents a fundamental shift in how we approach reinforcement learning problems. Instead of traditional methods that focus on predicting rewards or learning environment models, UDRL transforms the reinforcement learning problem into a supervised learning task.</p>
36
+
37
+ <div class="framework-details">
38
+ <h3>Key Components</h3>
39
+ <p>The UDRL approach centers on learning a behavior function f(s<sub>t</sub>, d<sub>r</sub>, d<sub>t</sub>) = a<sub>t</sub> (see the code sketch after this list), where:</p>
40
+ <ul>
41
+ <li><strong>s<sub>t</sub></strong>: The current state of the environment</li>
42
+ <li><strong>d<sub>r</sub></strong>: The desired reward the agent aims to achieve</li>
43
+ <li><strong>d<sub>t</sub></strong>: The time horizon within which to achieve the reward</li>
44
+ <li><strong>a<sub>t</sub></strong>: The action to take to achieve the desired reward</li>
45
+ </ul>
46
+ </div>
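To make the supervised-learning view concrete, here is a minimal, hedged sketch of how such a behavior function could be fit with a tree-based classifier. The `episodes` buffer, the command construction, and the hyperparameters are illustrative assumptions made for this sketch, not the training loop actually implemented in `udrl/agent.py` and `udrl/policies.py`.

```python
# Hedged sketch only: a behaviour function as a supervised classifier that
# maps (state s_t, desired return d_r, desired horizon d_t) -> action a_t.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier


def fit_behaviour_function(episodes):
    """episodes: list of trajectories, each a list of (state, action, reward)."""
    X, y = [], []
    for ep in episodes:
        rewards = [r for _, _, r in ep]
        for t, (state, action, _) in enumerate(ep):
            d_r = sum(rewards[t:])      # return actually obtained from step t onwards
            d_t = len(ep) - t           # remaining horizon
            X.append(np.concatenate([state, [d_r, d_t]]))
            y.append(action)
    bf = ExtraTreesClassifier(n_estimators=100)
    bf.fit(np.asarray(X), np.asarray(y))  # supervised fit: (s_t, d_r, d_t) -> a_t
    return bf


def act(bf, state, desired_return, desired_horizon):
    command = np.concatenate([state, [desired_return, desired_horizon]])
    return bf.predict(command.reshape(1, -1))[0]
```

Swapping `ExtraTreesClassifier` for `RandomForestClassifier` or an MLP yields the other behavior-function variants compared in the study.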
47
+
48
+ <div class="mathematical-framework">
49
+ <h3>Mathematical Foundation</h3>
50
+ <p>The framework is built on a Markov Decision Process (MDP) defined as a tuple M = ⟨S,A,P,R⟩ where:</p>
51
+ <ul>
52
+ <li>S: The state space of the environment</li>
53
+ <li>A: The action space modeling all possible actions</li>
54
+ <li>P: The transition function P : S × A × S → [0,1]</li>
55
+ <li>R: The reward function R : S × A × S → ℝ</li>
56
+ </ul>
57
+ </div>
58
+ </section>
59
+
60
+ <section class="implementation">
61
+ <h2>Implementation and Methodology</h2>
62
+ <div class="algorithms-detailed">
63
+ <h3>Studied Algorithms</h3>
64
+ <div class="tree-based">
65
+ <h4>Tree-Based Methods</h4>
66
+ <p>We extensively evaluated two primary tree-based approaches:</p>
67
+ <ul>
68
+ <li><strong>Random Forests (RF):</strong> An ensemble method building multiple decision trees and merging their predictions</li>
69
+ <li><strong>Extremely Randomized Trees (ET):</strong> A variation that adds additional randomization in the tree-building process</li>
70
+ </ul>
71
+ </div>
72
+
73
+ <div class="boosting">
74
+ <h4>Boosting Algorithms</h4>
75
+ <p>We also investigated sequential ensemble methods:</p>
76
+ <ul>
77
+ <li><strong>AdaBoost:</strong> Adaptive Boosting for sequential tree construction</li>
78
+ <li><strong>XGBoost:</strong> A more advanced implementation of gradient boosting</li>
79
+ </ul>
80
+ </div>
81
+
82
+ <div class="baseline">
83
+ <h4>Baseline Methods</h4>
84
+ <ul>
85
+ <li><strong>Neural Networks:</strong> Traditional multi-layer perceptron architecture</li>
86
+ <li><strong>K-Nearest Neighbours:</strong> Non-parametric baseline for comparison</li>
87
+ </ul>
88
+ </div>
89
+ </div>
90
+
91
+ <div class="experimental-environments">
92
+ <h3>Test Environments</h3>
93
+ <div class="cartpole">
94
+ <h4>CartPole</h4>
95
+ <p>A 4-dimensional continuous state space including:</p>
96
+ <ul>
97
+ <li>Cart Position (x)</li>
98
+ <li>Cart Velocity (ẋ)</li>
99
+ <li>Pole Angle (θ)</li>
100
+ <li>Pole Angular Velocity (θ̇)</li>
101
+ </ul>
102
+ </div>
103
+
104
+ <div class="acrobot">
105
+ <h4>Acrobot</h4>
106
+ <p>A 6-dimensional state space representing:</p>
107
+ <ul>
108
+ <li>First Link: sin(θ1), cos(θ1), θ̇1</li>
109
+ <li>Second Link: sin(θ2), cos(θ2), θ̇2</li>
110
+ </ul>
111
+ </div>
112
+
113
+ <div class="lunar-lander">
114
+ <h4>Lunar Lander</h4>
115
+ <p>An 8-dimensional state space including:</p>
116
+ <ul>
117
+ <li>Position (x, y)</li>
118
+ <li>Velocity (ẋ, ẏ)</li>
119
+ <li>Angle (θ) and Angular velocity (θ̇)</li>
120
+ <li>Left and right leg contact points</li>
121
+ </ul>
122
+ </div>
123
+ </div>
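As a quick, hedged sanity check of the state dimensionalities listed above, the observation spaces can be inspected directly with Gymnasium. The environment IDs match those used elsewhere in this repository, but the snippet is only an illustrative aside, not part of the application.

```python
# Print the observation-space shape of each benchmark environment.
import gymnasium as gym

for env_id in ["CartPole-v0", "Acrobot-v1", "LunarLander-v2"]:
    env = gym.make(env_id)
    print(env_id, env.observation_space.shape)  # expected: (4,), (6,), (8,)
    env.close()
```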
124
+ </section>
125
+
126
+ <section class="results">
127
+ <h2>Comprehensive Results and Analysis</h2>
128
+ <div class="performance-analysis">
129
+ <h3>Performance Metrics</h3>
130
+ <p>Our experiments revealed surprising competitiveness of tree-based methods:</p>
131
+ <ul>
132
+ <li><strong>CartPole Environment:</strong>
133
+ <ul>
134
+ <li>Neural Networks: 199.93 ± 0.255</li>
135
+ <li>Random Forests: 188.25 ± 13.82</li>
136
+ <li>XGBoost: 199.27 ± 4.06</li>
137
+ </ul>
138
+ </li>
139
+ <li><strong>Acrobot Environment:</strong>
140
+ <ul>
141
+ <li>Neural Networks: -75.00 ± 15.36</li>
142
+ <li>Random Forests: -100.05 ± 62.80</li>
143
+ <li>Extra Trees: -100.00 ± 93.72</li>
144
+ </ul>
145
+ </li>
146
+ <li><strong>Lunar Lander:</strong>
147
+ <ul>
148
+ <li>Random Forests: -54.74 ± 96.22</li>
149
+ <li>XGBoost: -76.96 ± 89.69</li>
150
+ <li>Neural Networks: -157.04 ± 71.26</li>
151
+ </ul>
152
+ </li>
153
+ </ul>
154
+ </div>
155
+
156
+ <div class="interpretability-analysis">
157
+ <h3>Interpretability Insights</h3>
158
+ <p>The tree-based methods provided direct insight into the learned decision-making (see the code sketch after this list):</p>
159
+ <ul>
160
+ <li><strong>CartPole:</strong> Pole angular velocity emerged as the most crucial feature for balancing</li>
161
+ <li><strong>Acrobot:</strong> Angular velocities of both links proved essential for control</li>
162
+ <li><strong>Lunar Lander:</strong> Vertical position showed highest importance for landing decisions</li>
163
+ </ul>
164
+ </div>
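A hedged sketch of how such feature-importance readings can be obtained: once the behavior function is a fitted scikit-learn forest, its `feature_importances_` attribute can be read off directly. The feature names below follow the CartPole layout described above with the two command inputs appended; they are an assumption made for illustration, not the repository's actual naming.

```python
# Illustrative only: impurity-based importances of a fitted tree-based
# behaviour function (RandomForestClassifier / ExtraTreesClassifier).
feature_names = [
    "cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity",
    "desired_return", "desired_horizon",   # assumed CartPole input layout
]


def report_importances(bf, names=feature_names):
    for name, imp in sorted(
        zip(names, bf.feature_importances_), key=lambda p: -p[1]
    ):
        print(f"{name:>24s}: {imp:.3f}")
```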
165
+ </section>
166
+
167
+ <section class="future-directions">
168
+ <h2>Future Research Directions</h2>
169
+ <p>Our findings open several promising avenues for future research:</p>
170
+ <ul>
171
+ <li>Scaling to High-Dimensional Spaces:
172
+ <p>Investigating the applicability of tree-based UDRL to more complex environments with higher-dimensional state spaces</p>
173
+ </li>
174
+ <li>Enhanced Interpretation Tools:
175
+ <p>Development of specialized tools for analyzing and visualizing decision processes in tree-based UDRL systems</p>
176
+ </li>
177
+ <li>Real-World Applications:
178
+ <p>Exploring applications in safety-critical domains where interpretability is crucial</p>
179
+ </li>
180
+ <li>Theoretical Analysis:
181
+ <p>Deeper investigation of the theoretical foundations underlying the success of tree-based methods in UDRL</p>
182
+ </li>
183
+ </ul>
184
+ </section>
185
186
  </div>
187
+ </body>
188
+ """
189
  )
190
  st.html(
191
  """
 
199
  margin: 0 auto;
200
  padding: 20px;
201
  }
202
+ h3 {
203
+ text-align: center;
204
+ }
205
  h1, h2 {
206
  color: #81a1c1;
207
+ text-align: center;
208
  }
209
  .abstract {
210
  background-color: #2e3440;
 
237
  </style>
238
  """
239
  )
240
+
241
+
242
+ # """<div>
243
+ # <h1>Upside-Down Reinforcement Learning for More Interpretable Optimal Control</h1>
244
+ # <div class="abstract">
245
+ # <h2>Abstract</h2>
246
+ # <p>This research introduces a novel approach to reinforcement learning that emphasizes interpretability and explainability. By leveraging tree-based methods within the Upside-Down Reinforcement Learning (UDRL) framework, we demonstrate that it's possible to achieve performance comparable to neural networks while gaining significant advantages in terms of interpretability.</p>
247
+ # </div>
248
+
249
+ # <h2>What is Upside-Down Reinforcement Learning?</h2>
250
+ # <p>UDRL is an innovative paradigm that transforms reinforcement learning problems into supervised learning tasks. Unlike traditional approaches that focus on predicting rewards or learning environment models, UDRL learns to predict actions based on:</p>
251
+ # <ul>
252
+ # <li>Current state (s<sub>t</sub>)</li>
253
+ # <li>Desired reward (d<sub>r</sub>)</li>
254
+ # <li>Time horizon (d<sub>t</sub>)</li>
255
+ # </ul>
256
+
257
+ # <h2>Motivation</h2>
258
+ # <p>While neural networks have been the go-to choice for implementing UDRL, they lack interpretability. Our research explores whether other supervised learning algorithms, particularly tree-based methods, can:</p>
259
+ # <ul>
260
+ # <li>Match the performance of neural networks</li>
261
+ # <li>Provide more interpretable policies</li>
262
+ # <li>Enhance the explainability of reinforcement learning systems</li>
263
+ # </ul>
264
+
265
+ # <div class="results">
266
+ # <h2>Results</h2>
267
+ # <p>We tested three different implementations of the Behaviour Function:</p>
268
+ # <ul>
269
+ # <li>Neural Networks (NN)</li>
270
+ # <li>Random Forests (RF)</li>
271
+ # <li>Extremely Randomized Trees (ET)</li>
272
+ # </ul>
273
+ # <p>Tests were conducted on three popular OpenAI Gym environments:</p>
274
+ # <ul>
275
+ # <li>CartPole</li>
276
+ # <li>Acrobot</li>
277
+ # <li>Lunar-Lander</li>
278
+ # </ul>
279
+
280
+ # </div>
281
+
282
+ # <h2>Key Findings</h2>
283
+ # <ul>
284
+ # <li>Tree-based methods performed comparably to neural networks</li>
285
+ # <li>Random Forests and Extremely Randomized Trees provided fully interpretable policies</li>
286
+ # <li>Feature importance analysis revealed insights into decision-making processes</li>
287
+ # </ul>
288
+
289
+ # <h2>Implications</h2>
290
+ # <p>This research opens new avenues for:</p>
291
+ # <ul>
292
+ # <li>More explainable reinforcement learning systems</li>
293
+ # <li>Enhanced safety in AI decision-making</li>
294
+ # <li>Better understanding of agent behavior in complex environments</li>
295
+ # </ul>
296
+ # </div>
297
+ # """
udrl/app/sim.py CHANGED
@@ -265,7 +265,8 @@ def make_viz(state):
265
 
266
  def make_commands(state):
267
  # Add control buttons in a horizontal layout
268
- col1, col2, col3, col4 = st.columns(4)
 
269
  with col1:
270
  st.button("Reset Environment", on_click=state.reset_env)
271
  with col2:
@@ -274,8 +275,8 @@ def make_commands(state):
274
  )
275
  with col3:
276
  st.button("Next", on_click=state.next_epoch)
277
- with col4:
278
- st.button("Save")
279
  return state
280
 
281
 
 
265
 
266
  def make_commands(state):
267
  # Add control buttons in a horizontal layout
268
+ # col1, col2, col3, col4 = st.columns(4)
269
+ col1, col2, col3 = st.columns(3)
270
  with col1:
271
  st.button("Reset Environment", on_click=state.reset_env)
272
  with col2:
 
275
  )
276
  with col3:
277
  st.button("Next", on_click=state.next_epoch)
278
+ # with col4:
279
+ # st.button("Save")
280
  return state
281
 
282
 
udrl/data_proc.py CHANGED
@@ -3,10 +3,112 @@ import numpy as np
3
  import json
4
  import csv
5
 
6
  naming = {
7
  "neural": "NN",
8
  "ensemble.ExtraTreesClassifier": "ET",
9
  "ensemble.RandomForestClassifier": "RF",
10
  }
11
 
12
  if __name__ == "__main__":
@@ -23,6 +125,10 @@ if __name__ == "__main__":
23
  "neural": ([], [], [], []),
24
  "ensemble.ExtraTreesClassifier": ([], [], [], []),
25
  "ensemble.RandomForestClassifier": ([], [], [], []),
26
  }
27
  for exp in all_paths:
28
  print(exp)
@@ -30,6 +136,8 @@ if __name__ == "__main__":
30
 
31
  with open((exp / "conf.json"), "r") as f:
32
  conf = json.load(f)
33
 
34
  estimators[conf["estimator_name"]][0].append(list(rewards[:, 0]))
35
  estimators[conf["estimator_name"]][1].append(list(rewards[:, 1]))
 
3
  import json
4
  import csv
5
 
6
+
7
+ def convert_json_to_pgfplots(json_file, output_file):
8
+ """
9
+ Convert JSON feature data to pgfplots-compatible format
10
+ """
11
+ # Read JSON file
12
+ with open(json_file, "r") as f:
13
+ data = json.load(f)
14
+
15
+ # Extract feature values
16
+ features = data["feature"]
17
+
18
+ # Create pgfplots data
19
+ # Each line will be "x y" format
20
+ plot_data = []
21
+ for i, value in features.items():
22
+ plot_data.append(f"{i} {float(value)}")
23
+
24
+ # Write to file
25
+ with open(output_file, "w") as f:
26
+ f.write("\n".join(plot_data))
27
+
28
+
29
+ def convert_all_viz(
30
+ base=Path("data") / "viz_examples", algo="RandomForestClassifier"
31
+ ):
32
+
33
+ for env_p in base.iterdir():
34
+ path = env_p / algo
35
+ for fil in path.iterdir():
36
+ if "info" not in fil.name:
37
+ continue
38
+ convert_json_to_pgfplots(
39
+ str(fil), f"{str(fil.parent)}/{fil.stem}.dat"
40
+ )
41
+
42
+
43
+ def smooth_curve(
44
+ df,
45
+ column_name="mean_reward",
46
+ window_size=10,
47
+ method="exponential",
48
+ alpha=0.1,
49
+ ):
50
+ """
51
+ Smooth a column in a pandas DataFrame using different methods.
52
+
53
+ Parameters:
54
+ - df: pandas DataFrame containing the data
55
+ - column_name: name of the column to smooth
56
+ - window_size: size of the rolling window for simple moving average
57
+ - method: 'simple' for simple moving average, 'exponential' for exponential moving average
58
+ - alpha: smoothing factor for exponential moving average (0 < alpha < 1)
59
+
60
+ Returns:
61
+ - DataFrame with both original and smoothed data
62
+ """
63
+
64
+ # Create a copy to avoid modifying the original DataFrame
65
+ df_smoothed = df.copy()
66
+
67
+ if method == "simple":
68
+ # Simple Moving Average
69
+ df_smoothed[f"{column_name}_smoothed"] = (
70
+ df[column_name]
71
+ .rolling(window=window_size, center=True, min_periods=1)
72
+ .mean()
73
+ )
74
+
75
+ # # Handle NaN values at the beginning and end
76
+ df_smoothed[f"{column_name}_smoothed"].fillna(
77
+ df[column_name], inplace=True
78
+ )
79
+
80
+ elif method == "exponential":
81
+ # Exponential Moving Average
82
+ df_smoothed[f"{column_name}_smoothed"] = (
83
+ df[column_name].ewm(alpha=alpha, adjust=False).mean()
84
+ )
85
+
86
+ return df_smoothed
87
+
88
+
89
+ # import pandas as pd
90
+
91
+ # path = "data/csvs/Acrobot-v1.csv"
92
+ # path = "data/csvs/CartPole-v0.csv"
93
+ # path = "data/csvs/LunarLander-v2.csv"
94
+
95
+ # dat = pd.read_csv(path)
96
+
97
+
98
+ # for name in ["NN", "ET", "RF", "KNN", "SVM", "AdaBoost", "XGBoost"]:
99
+ # dat = smooth_curve(dat, name + "_mean", window_size=20, method="simple")
100
+
101
+ # dat.to_csv(path)
102
+
103
+
104
  naming = {
105
  "neural": "NN",
106
  "ensemble.ExtraTreesClassifier": "ET",
107
  "ensemble.RandomForestClassifier": "RF",
108
+ "neighbors.KNeighborsClassifier": "KNN",
109
+ "svm.SVC": "SVM",
110
+ "ensemble.AdaBoostClassifier": "AdaBoost",
111
+ "ensemble.GradientBoostingClassifier": "XGBoost",
112
  }
113
 
114
  if __name__ == "__main__":
 
125
  "neural": ([], [], [], []),
126
  "ensemble.ExtraTreesClassifier": ([], [], [], []),
127
  "ensemble.RandomForestClassifier": ([], [], [], []),
128
+ "neighbors.KNeighborsClassifier": ([], [], [], []),
129
+ "svm.SVC": ([], [], [], []),
130
+ "ensemble.AdaBoostClassifier": ([], [], [], []),
131
+ "ensemble.GradientBoostingClassifier": ([], [], [], []),
132
  }
133
  for exp in all_paths:
134
  print(exp)
 
136
 
137
  with open((exp / "conf.json"), "r") as f:
138
  conf = json.load(f)
139
+ if conf["estimator_name"] not in list(estimators.keys()):
140
+ continue
141
 
142
  estimators[conf["estimator_name"]][0].append(list(rewards[:, 0]))
143
  estimators[conf["estimator_name"]][1].append(list(rewards[:, 1]))
udrl/inference.py CHANGED
@@ -1,22 +1,25 @@
1
- import matplotlib.pyplot as plt
2
  import numpy as np
3
  from udrl.policies import SklearnPolicy, NeuralPolicy
4
  from udrl.agent import UpsideDownAgent, AgentHyper
5
  from pathlib import Path
6
  from collections import Counter
7
- from tqdm import trange
8
 
 
9
 
10
- def get_common(base, env, conf, seed):
11
 
12
- path = base / env / conf / seed
 
 
13
 
14
  if not path.exists():
15
  print("Cannot find path")
16
  return None, None
17
- algo_name = (
18
- "NN" if "neural" in conf else ("ET" if "Extra" in conf else "RT")
19
- )
20
 
21
  des_ret = np.load(str(path / "desired_returns.npy")).astype(int)
22
  des_hor = np.load(str(path / "desired_horizons.npy")).astype(int)
@@ -53,14 +56,12 @@ def get_common(base, env, conf, seed):
53
  return common_ret, common_hor
54
 
55
 
56
- def test_desired(base, env, conf, des_ret, des_hor):
57
 
58
- algo_name = (
59
- "NN" if "neural" in conf else ("ET" if "Extra" in conf else "RT")
60
- )
61
  if des_hor is None or des_ret is None:
62
  print(f"Invalid desired for {env}:{algo_name}")
63
  return
 
64
  for path in (base / env / conf).iterdir():
65
  if "neural" in conf:
66
  policy = NeuralPolicy.load(str(path / "policy"))
@@ -80,43 +81,166 @@ def test_desired(base, env, conf, des_ret, des_hor):
80
  )[0]
81
  for _ in range(100)
82
  ]
83
  print(
84
  f"{env}:{algo_name}:{path.name}:r.{des_ret}:h.{des_hor}"
85
  f" -> {np.median(final_r):.2f} +- {np.std(final_r):.2f}"
86
  f",max {np.max(final_r):.2f},min {np.min(final_r):.2f}"
87
  )
88
 
89
 
90
- base = Path("/home/vimmoos/upside_down_rl/data")
91
- confs = {
92
- "NN": "estimator_nameneural_batch_size256_warm_up260",
93
- "ET": "estimator_nameensemble.ExtraTreesClassifier_train_per_iter1",
94
- "RT": "train_per_iter1",
95
- }
96
  envs = ["LunarLander-v2", "Acrobot-v1"]
97
- seeds = [str(45), str(46)]
98
 
99
  res = {}
100
 
101
 
102
  for env in envs:
103
  res[env] = {}
104
- for algo_name, conf in confs.items():
105
- res[env][algo_name] = {}
106
- for seed in seeds:
107
- ret, hor = get_common(base, env, conf + "_save_desiredTrue", seed)
108
- res[env][algo_name][seed] = (ret, hor)
109
 
110
 
111
  pprint(res)
 
112
 
113
  for env, algos in res.items():
114
  for algo, seeds in algos.items():
115
  for _, vals in seeds.items():
116
- test_desired(base, env, confs[algo], *vals)
117
 
118
 
119
- # plt.plot(mean_des_ret)
120
- # plt.plot(mean_des_hor)
121
- # plt.plot(rew)
122
- # plt.show()
1
  import numpy as np
2
  from udrl.policies import SklearnPolicy, NeuralPolicy
3
  from udrl.agent import UpsideDownAgent, AgentHyper
4
  from pathlib import Path
5
  from collections import Counter
6
+ from pprint import pprint
7
+ import re
8
+ from dataclasses import dataclass, field
9
+ from typing import Dict, Any
10
 
11
+ # from tqdm import trange, tqdm
12
 
 
13
 
14
+ def get_common(base="", env="", conf="", seed="", path=None, algo_name=None):
15
+ if path is None:
16
+ path = base / env / conf / seed
17
 
18
  if not path.exists():
19
  print("Cannot find path")
20
  return None, None
21
+ if algo_name is None:
22
+ algo_name = "UNKNOWN"
 
23
 
24
  des_ret = np.load(str(path / "desired_returns.npy")).astype(int)
25
  des_hor = np.load(str(path / "desired_horizons.npy")).astype(int)
 
56
  return common_ret, common_hor
57
 
58
 
59
+ def test_desired(base, env, conf, algo_name, des_ret, des_hor):
60
 
61
  if des_hor is None or des_ret is None:
62
  print(f"Invalid desired for {env}:{algo_name}")
63
  return
64
+ ret = []
65
  for path in (base / env / conf).iterdir():
66
  if "neural" in conf:
67
  policy = NeuralPolicy.load(str(path / "policy"))
 
81
  )[0]
82
  for _ in range(100)
83
  ]
84
+ ret.append(
85
+ {
86
+ "env": env,
87
+ "algo": algo_name,
88
+ "seed": path.name,
89
+ "des_ret": des_ret,
90
+ "des_hor": des_hor,
91
+ "final_r": np.median(final_r),
92
+ "final_r_std": np.std(final_r),
93
+ "final_r_max": np.max(final_r),
94
+ "final_r_min": np.min(final_r),
95
+ "final_raw": final_r,
96
+ }
97
+ )
98
  print(
99
  f"{env}:{algo_name}:{path.name}:r.{des_ret}:h.{des_hor}"
100
  f" -> {np.median(final_r):.2f} +- {np.std(final_r):.2f}"
101
  f",max {np.max(final_r):.2f},min {np.min(final_r):.2f}"
102
  )
103
+ return ret
104
+
105
+
106
+ @dataclass
107
+ class RunStats:
108
+ median: float
109
+ mean: float
110
+ std: float
111
+ min_val: float
112
+ max_val: float
113
+ infos: Dict[str, Any] = field(repr=False)
114
+ weights: Dict[str, float] = field(
115
+ repr=False,
116
+ default_factory=lambda: {
117
+ "median": 0.5,
118
+ "mean": 0.1,
119
+ "std_penalty": 0.3,
120
+ "range_penalty": 0.1,
121
+ },
122
+ )
123
+ score: float = field(init=False, default=-np.inf)
124
+
125
+ def __post_init__(self):
126
+ self.score = self.calculate_score()
127
+
128
+ def calculate_score(self):
129
+ """
130
+ Calculate a composite score for a run.
131
+
132
+ The score is calculated as:
133
+ score = (w1 * median + w2 * mean) * stability_factor
134
+
135
+ where stability_factor penalizes high std and wide ranges
136
+ """
137
+ shift = 1000 if self.median < 0 else 0
138
+
139
+ base_score = self.weights["median"] * (
140
+ self.median + shift
141
+ ) + self.weights["mean"] * (self.mean + shift)
142
+
143
+ std_factor = 1 / (1 + self.weights["std_penalty"] * self.std)
144
+ range_factor = 1 / (
145
+ 1 + self.weights["range_penalty"] * (self.max_val - self.min_val)
146
+ )
147
+
148
+ return base_score + std_factor + range_factor
149
+
150
+
151
+ def extract_statistics(run):
152
+ return RunStats(
153
+ median=run["final_r"],
154
+ mean=np.mean(run["final_raw"]),
155
+ std=run["final_r_std"],
156
+ min_val=run["final_r_min"],
157
+ max_val=run["final_r_max"],
158
+ infos=run,
159
+ )
160
+
161
+
162
+ base = Path("data")
163
 
164
 
165
  envs = ["LunarLander-v2", "Acrobot-v1"]
166
+ algo_name_extract = r"(.*/)+(estimator_name(.+?)_train.*save_desired.*)$"
167
+ available_algo = []
168
+ special_confs = {
169
+ "NN": "estimator_nameneural_batch_size256_warm_up260_save_desiredTrue",
170
+ "RT": "train_per_iter1_save_desiredTrue",
171
+ }
172
+ wanted_algo = [
173
+ "ensemble.ExtraTreesClassifier",
174
+ "neighbors.KNeighborsClassifier",
175
+ "NN",
176
+ "RT",
177
+ "svm.SVC",
178
+ "ensemble.AdaBoostClassifier",
179
+ "ensemble.GradientBoostingClassifier",
180
+ ]
181
 
182
  res = {}
183
 
184
 
185
  for env in envs:
186
  res[env] = {}
187
+ available_algo = [
188
+ re.match(algo_name_extract, str(x))
189
+ for x in (base / env).iterdir()
190
+ if re.match(algo_name_extract, str(x))
191
+ ]
192
+ for algo_match in available_algo:
193
+ conf = algo_match.group(2)
194
+ algo_name = algo_match.group(3)
195
+ res[env][(algo_name, conf)] = {}
196
+ for seed_path in Path(algo_match.string).iterdir():
197
+ seed = seed_path.stem
198
+ ret, hor = get_common(path=seed_path, algo_name=algo_name)
199
+ res[env][(algo_name, conf)][seed] = (ret, hor)
200
+ for algo_name, conf in special_confs.items():
201
+ res[env][(algo_name, conf)] = {}
202
+ for seed_path in (base / env / conf).iterdir():
203
+ seed = seed_path.stem
204
+ ret, hor = get_common(path=seed_path, algo_name=algo_name)
205
+ res[env][(algo_name, conf)][seed] = (ret, hor)
206
 
207
 
208
  pprint(res)
209
+ data = []
210
 
211
  for env, algos in res.items():
212
  for algo, seeds in algos.items():
213
  for _, vals in seeds.items():
214
+ data.append(test_desired(base, env, algo[1], algo[0], *vals))
215
+
216
+
217
+ best_res = {env: {algo: None for algo in wanted_algo} for env in envs}
218
+
219
+ for runs in data:
220
+ if runs[0]["algo"] not in wanted_algo:
221
+ continue
222
+ for run in runs:
223
+ stats = extract_statistics(run)
224
+ print(
225
+ f"{run['env']}:{run['algo']}:{run['seed']}:r.{run['des_ret']}:h.{run['des_hor']}"
226
+ f" -> SCORE {stats.score} \t {run['final_r']:.2f} +- {run['final_r_std']:.2f}"
227
+ f",max {run['final_r_max']:.2f},min {run['final_r_min']:.2f}"
228
+ )
229
+ current_best = best_res[run["env"]][run["algo"]]
230
+ if current_best is None:
231
+ best_res[run["env"]][run["algo"]] = stats
232
+ continue
233
+
234
+ if current_best.score < stats.score:
235
+ best_res[run["env"]][run["algo"]] = stats
236
 
237
 
238
+ pprint(best_res)
239
+
240
+ best_res["LunarLander-v2"]["svm.SVC"].infos
241
+ for env, vs in best_res.items():
242
+ for algo, v in vs.items():
243
+ run = v.infos
244
+ print(
245
+ f"{run['env']}:{run['algo']}: r. {run['des_ret']} : h. {run['des_hor']}"
246
+ )
udrl/test.py DELETED
@@ -1,137 +0,0 @@
1
- # import gymnasium as gym
2
- # import pygame
3
- # import numpy as np
4
-
5
-
6
- # def normalize_value(value, is_bounded, low=None, high=None):
7
- # if is_bounded:
8
- # return (value - low) / (high - low)
9
- # else:
10
- # return 0.5 * (np.tanh(value / 2) + 1)
11
-
12
-
13
- # def draw_bar(screen, start, value, max_length, color, height=20):
14
- # bar_length = value * max_length
15
- # pygame.draw.rect(screen, color, (*start, bar_length, height))
16
- # pygame.draw.rect(
17
- # screen, (0, 0, 0), (*start, max_length, height), 2
18
- # ) # Border
19
- # mid_x = start[0] + max_length / 2
20
- # pygame.draw.line(
21
- # screen, (0, 0, 0), (mid_x, start[1]), (mid_x, start[1] + height), 2
22
- # )
23
-
24
-
25
- # def visualize_environment(screen, state, env):
26
- # screen_width, screen_height = screen.get_size()
27
- # screen.fill((255, 255, 255))
28
-
29
- # # Visualize environment-specific elements
30
- # if env.spec.id.startswith("CartPole"):
31
- # cart_x = int(state[0] * 50 + screen_width // 2)
32
- # cart_y = screen_height - 100
33
- # pole_angle = state[2]
34
- # pygame.draw.rect(screen, (0, 0, 0), (cart_x - 30, cart_y - 15, 60, 30))
35
- # pygame.draw.line(
36
- # screen,
37
- # (0, 0, 0),
38
- # (cart_x, cart_y),
39
- # (
40
- # cart_x + int(np.sin(pole_angle) * 100),
41
- # cart_y - int(np.cos(pole_angle) * 100),
42
- # ),
43
- # 6,
44
- # )
45
- # elif env.spec.id.startswith("Acrobot"):
46
- # center_x, center_y = screen_width // 2, screen_height // 2
47
- # l1, l2 = 100, 100 # Length of links
48
- # s0, s1 = state[0], state[1] # sin(theta1), sin(theta2)
49
- # c0, c1 = state[2], state[3] # cos(theta1), cos(theta2)
50
- # x0, y0 = center_x, center_y
51
- # x1 = x0 + l1 * s0
52
- # y1 = y0 + l1 * c0
53
- # x2 = x1 + l2 * s1
54
- # y2 = y1 + l2 * c1
55
- # pygame.draw.line(screen, (0, 0, 0), (x0, y0), (x1, y1), 6)
56
- # pygame.draw.line(screen, (0, 0, 0), (x1, y1), (x2, y2), 6)
57
- # pygame.draw.circle(screen, (0, 0, 255), (int(x0), int(y0)), 10)
58
- # pygame.draw.circle(screen, (0, 255, 0), (int(x1), int(y1)), 10)
59
- # pygame.draw.circle(screen, (255, 0, 0), (int(x2), int(y2)), 10)
60
- # # Add more environment-specific visualizations here as needed
61
-
62
- # # Draw bars for each state dimension
63
- # num_dims = env.observation_space.shape[0]
64
- # bar_colors = [
65
- # (255, 0, 0),
66
- # (0, 255, 0),
67
- # (0, 0, 255),
68
- # (255, 255, 0),
69
- # (255, 0, 255),
70
- # (0, 255, 255),
71
- # ]
72
- # bar_starts = [(50, 50 + i * 70) for i in range(num_dims)]
73
- # max_length = 300
74
-
75
- # for i, (start, color) in enumerate(zip(bar_starts, bar_colors)):
76
- # is_bounded = not (
77
- # env.observation_space.high[i] > 100
78
- # ) and not np.isinf(env.observation_space.low[i] < -100)
79
- # normalized_value = normalize_value(
80
- # state[i],
81
- # is_bounded,
82
- # env.observation_space.low[i],
83
- # env.observation_space.high[i],
84
- # )
85
- # draw_bar(screen, start, normalized_value, max_length, color)
86
-
87
- # # Draw labels
88
- # font = pygame.font.Font(None, 30)
89
- # text = font.render(f"Dim {i}: {state[i]:.2f}", True, (0, 0, 0))
90
- # screen.blit(text, (start[0], start[1] - 30))
91
-
92
- # # Add description of bar representation
93
- # if is_bounded:
94
- # desc = f"(Range: {env.observation_space.low[i]:.2f} to {env.observation_space.high[i]:.2f})"
95
- # else:
96
- # desc = "(Unbounded: Center is 0, edges are ±∞)"
97
- # desc_text = pygame.font.Font(None, 24).render(
98
- # desc, True, (100, 100, 100)
99
- # )
100
- # screen.blit(desc_text, (start[0], start[1] + 25))
101
-
102
- # pygame.display.flip()
103
-
104
-
105
- # def run_visualization(env_name):
106
- # pygame.init()
107
- # screen = pygame.display.set_mode((800, 600))
108
- # pygame.display.set_caption(f"{env_name} Visualization")
109
-
110
- # env = gym.make(env_name)
111
- # state, _ = env.reset()
112
-
113
- # clock = pygame.time.Clock()
114
-
115
- # running = True
116
- # while running:
117
- # visualize_environment(screen, state, env)
118
- # action = env.action_space.sample()
119
- # state, reward, done, truncated, info = env.step(action)
120
-
121
- # if done or truncated:
122
- # state, _ = env.reset()
123
-
124
- # for event in pygame.event.get():
125
- # if event.type == pygame.QUIT:
126
- # running = False
127
-
128
- # clock.tick(60) # Limit to 60 FPS
129
-
130
- # env.close()
131
- # pygame.quit()
132
-
133
-
134
- # # Example usage
135
- # # run_visualization("CartPole-v1")
136
- # # Uncomment the line below to run Acrobot visualization
137
- # run_visualization("Acrobot-v1")