### 3.5. `research_agent.py` Refactoring
* **Rationale:** To improve browser instance management, error handling, and configuration.
* **Proposals:**
    1. **Browser Lifecycle Management:** Instead of initializing the browser (`start_chrome`) at the module level, manage its lifecycle explicitly (a class-based sketch appears at the end of this subsection). Options:
        * Initialize the browser within the agent's initialization and provide a method or tool to explicitly close it (`kill_browser`) when the agent's task is done or the application shuts down.
        * Use a context manager (`with start_chrome(...) as browser:`) if the browser is only needed for a specific scope within a tool call (less likely for a persistent agent).
        * Ensure `kill_browser` is reliably called. Perhaps the `planner_agent` could invoke a cleanup tool/method on the `research_agent` after its tasks are complete.
    2. **Configuration:** Move hardcoded Chrome options to configuration. Externalize API keys/IDs if not already done (they appear to use `os.getenv`, which is good).
    3. **Robust Error Handling:** For browser interaction tools (`visit`, `get_text_by_css`, `click_element`), raise specific custom exceptions instead of returning error strings. This allows more structured error handling by the agent or workflow.
    4. **Tool Consolidation (Optional):** The agent has many tools. Consider whether related tools (e.g., different search APIs) could be consolidated behind a single tool that internally chooses the best source, or whether the LLM handles the large toolset effectively.
* **Diff Patch (Illustrative - Configuration & Browser Init):**
```diff
--- a/research_agent.py
+++ b/research_agent.py
@@ -1,5 +1,6 @@
 import os
 import time
+import logging
 from typing import List
 from llama_index.core.agent.workflow import ReActAgent
@@ -15,17 +16,21 @@
 from helium import start_chrome, go_to, find_all, Text, kill_browser
 from helium import get_driver
+logger = logging.getLogger(__name__)
+
 # 1. Helium
-chrome_options = webdriver.ChromeOptions()
-chrome_options.add_argument("--no-sandbox")
-chrome_options.add_argument("--disable-dev-shm-usage")
-chrome_options.add_experimental_option("prefs", {
-    "download.prompt_for_download": False,
-    "plugins.always_open_pdf_externally": True,
-    "profile.default_content_settings.popups": 0
-})
-
-browser = start_chrome(headless=True, options=chrome_options)
+# Browser instance should be managed explicitly, not as a module-level global
+# browser = start_chrome(headless=True, options=chrome_options)
+
+def get_chrome_options():
+    options = webdriver.ChromeOptions()
+    if os.getenv("RESEARCH_AGENT_CHROME_NO_SANDBOX", "true").lower() == "true":
+        options.add_argument("--no-sandbox")
+    if os.getenv("RESEARCH_AGENT_CHROME_DISABLE_DEV_SHM", "true").lower() == "true":
+        options.add_argument("--disable-dev-shm-usage")
+    # Add other options from config as needed, e.g.:
+    # options.add_experimental_option(...)
+    return options

 def visit(url: str, wait_seconds: float = 2.0) -> str | None:
     """
@@ -36,10 +41,11 @@
         wait_seconds (float): Time to wait after navigation.
     """
     try:
+        # Assumes browser is available in context (e.g., class member)
         go_to(url)
         time.sleep(wait_seconds)
         return f"Visited: {url}"
     except Exception as e:
+        logger.error(f"Error visiting {url}: {e}", exc_info=True)
         return f"Error visiting {url}: {e}"

 def get_text_by_css(selector: str) -> List[str] | str:
@@ -52,13 +58,15 @@
         List[str]: List of text contents.
     """
     try:
+        # Assumes browser/helium context is active
         if selector.lower() == 'body':
             elements = find_all(Text())
         else:
             elements = find_all(selector)
         texts = [elem.web_element.text for elem in elements]
-        print(f"Extracted {len(texts)} elements for selector '{selector}'")
+        logger.info(f"Extracted {len(texts)} elements for selector '{selector}'")
         return texts
     except Exception as e:
+        logger.error(f"Error extracting text for selector {selector}: {e}", exc_info=True)
         return f"Error extracting text for selector {selector}: {e}"

 def get_page_html() -> str:
@@ -70,9 +78,11 @@
         str: HTML content, or empty string on error.
     """
     try:
+        # Assumes browser/helium context is active
         driver = get_driver()
         html = driver.page_source
         return html
     except Exception as e:
+        logger.error(f"Error extracting HTML: {e}", exc_info=True)
         return f"Error extracting HTML: {e}"

 def click_element(selector: str, index_element: int = 0) -> str:
@@ -83,10 +93,12 @@
         selector (str): CSS selector of the element to click.
     """
     try:
+        # Assumes browser/helium context is active
         element = find_all(selector)[index_element]
         element.click()
         time.sleep(1)
         return f"Clicked element matching selector '{selector}'"
     except Exception as e:
+        logger.error(f"Error clicking element {selector}: {e}", exc_info=True)
         return f"Error clicking element {selector}: {e}"

 def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
@@ -97,6 +109,7 @@
         nth_result: Which occurrence to jump to (default: 1)
     """
+    # Assumes browser is available in context
     elements = browser.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
     if nth_result > len(elements):
         return f"Match n°{nth_result} not found (only {len(elements)} matches found)"
     result = f"Found {len(elements)} matches for '{text}'."
@@ -107,19 +120,22 @@
 def go_back() -> None:
     """Goes back to previous page."""
+    # Assumes browser is available in context
     browser.back()

 def close_popups() -> None:
     """
     Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows! This does not work on cookie consent banners.
     """
+    # Assumes browser is available in context
     webdriver.ActionChains(browser).send_keys(Keys.ESCAPE).perform()

 def close() -> None:
     """
     Close the browser instance.
     """
     try:
+        # Assumes kill_browser is appropriate here
         kill_browser()
-        print("Browser closed")
+        logger.info("Browser closed via kill_browser()")
     except Exception as e:
-        print(f"Error closing browser: {e}")
+        logger.error(f"Error closing browser: {e}", exc_info=True)

 visit_tool = FunctionTool.from_defaults(
     fn=visit,
@@ -240,9 +256,14 @@
 def initialize_research_agent() -> ReActAgent:
+    # Browser initialization should happen here or be managed externally.
+    # Example: browser = start_chrome(headless=True, options=get_chrome_options())
+    # Ensure the browser instance is passed to tools or accessible via agent state/class.
+
+    llm_model_name = os.getenv("RESEARCH_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
     llm = GoogleGenAI(
         api_key=os.getenv("GEMINI_API_KEY"),
-        model="models/gemini-1.5-pro",
+        model=llm_model_name,
     )
     system_prompt = """\
```
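The diff above deliberately comments out the module-level browser without showing its replacement. Below is a minimal sketch of the class-based lifecycle from proposal 1; it assumes helium's `start_chrome`/`kill_browser` API already imported in `research_agent.py`, and the `BrowserSession` wrapper itself is hypothetical:
```python
import logging

from helium import kill_browser, start_chrome
from selenium import webdriver

logger = logging.getLogger(__name__)


class BrowserSession:
    """Hypothetical wrapper that owns the browser lifecycle instead of a module-level global."""

    def __init__(self, headless: bool = True):
        self.headless = headless
        self.browser = None

    def start(self):
        # Lazily start Chrome on first use.
        if self.browser is None:
            options = webdriver.ChromeOptions()
            options.add_argument("--no-sandbox")
            self.browser = start_chrome(headless=self.headless, options=options)
        return self.browser

    def close(self) -> None:
        # Safe to call repeatedly; planner_agent can invoke this as a cleanup step.
        if self.browser is not None:
            try:
                kill_browser()
            except Exception as e:
                logger.error(f"Error closing browser: {e}", exc_info=True)
            finally:
                self.browser = None

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
```
Tools such as `visit` and `click_element` would then receive the session (or become methods on it) rather than relying on a module-level `browser`.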
### 3.6. `text_analyzer_agent.py` Refactoring | |
* **Rationale:** To improve configuration management and error handling. | |
* **Proposals:** | |
1. **Configuration:** Move the hardcoded LLM model name (`models/gemini-1.5-pro`) to environment variables or a configuration file. | |
2. **Prompt Management:** Move the `analyze_text` prompt to a separate template file. | |
3. **Error Handling:** In `extract_text_from_pdf`, consider raising specific exceptions (e.g., `PDFDownloadError`, `PDFParsingError`) instead of returning error strings, allowing the agent to handle failures more gracefully (a caller-side sketch appears at the end of this subsection).
* **Diff Patch (Illustrative - Configuration & Error Handling):** | |
```diff | |
--- a/text_analyzer_agent.py | |
+++ b/text_analyzer_agent.py | |
@@ -6,6 +6,14 @@ | |
logger = logging.getLogger(__name__) | |
+ class PDFExtractionError(Exception): | |
+ """Custom exception for PDF extraction failures.""" | |
+ pass | |
+ | |
+ class PDFDownloadError(PDFExtractionError): | |
+ """Custom exception for PDF download failures.""" | |
+ pass | |
+ | |
def extract_text_from_pdf(source: str) -> str: | |
""" | |
Extract raw text from a PDF file on disk or at a URL. | |
@@ -19,21 +27,21 @@ | |
try: | |
resp = requests.get(source, timeout=10) | |
resp.raise_for_status() | |
- except Exception as e: | |
- return f"Error downloading PDF from {source}: {e}" | |
+ except requests.exceptions.RequestException as e: | |
+ raise PDFDownloadError(f"Error downloading PDF from {source}: {e}") from e | |
try: | |
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") | |
tmp.write(resp.content) | |
tmp.flush() | |
tmp_path = tmp.name | |
tmp.close() | |
- except Exception as e: | |
- return f"Error writing temp PDF file: {e}" | |
+ except IOError as e: | |
+ raise PDFExtractionError(f"Error writing temp PDF file: {e}") from e | |
path = tmp_path | |
else: | |
path = source | |
# Now extract text from the PDF on disk | |
if not os.path.isfile(path): | |
- return f"PDF not found: {path}" | |
+ raise PDFExtractionError(f"PDF not found: {path}") | |
text = "" | |
@@ -41,10 +49,10 @@ | |
reader = PdfReader(path) | |
pages = [page.extract_text() or "" for page in reader.pages] | |
text = "\n".join(pages) | |
- print(f"Extracted {len(pages)} pages of text from PDF") | |
+ logger.info(f"Extracted {len(pages)} pages of text from PDF: {path}") | |
except Exception as e: | |
# Catch specific PyPDF2 errors if possible, otherwise general Exception | |
- return f"Error reading PDF: {e}" | |
+ raise PDFExtractionError(f"Error reading PDF {path}: {e}") from e | |
# Clean up temporary file if one was created | |
if source.lower().startswith(("http://", "https://")): | |
@@ -67,6 +75,14 @@ | |
str: A plain-text string containing: | |
• A “Summary:” section with bullet points. | |
• A “Facts:” section with bullet points. | |
+ """ | |
+ # Load prompt from file ideally | |
+ prompt_template = """You are an expert analyst. | |
+ | |
+ Please analyze the following text and produce a plain-text response | |
+ with two sections: | |
+ | |
+ Summary: | |
+ • Provide 2–3 concise bullet points summarizing the main ideas. | |
+ | |
+ Facts: | |
+ • List each verifiable fact found in the text as a bullet point. | |
+ | |
+ Respond with exactly that format—no JSON, no extra commentary. | |
+ | |
+ Text to analyze: | |
+ \"\"\" | |
+ {text} | |
+ \"\"\" | |
""" | |
# Build the prompt to guide the LLM’s output format | |
input_prompt = f"""You are an expert analyst. | |
@@ -84,13 +100,14 @@ | |
{text} | |
\"\"\" | |
""" | |
+ input_prompt = prompt_template.format(text=text) | |
# Use the LLM to generate the analysis | |
+ llm_model_name = os.getenv("TEXT_ANALYZER_LLM_MODEL", "models/gemini-1.5-pro") | |
llm = GoogleGenAI( | |
api_key=os.getenv("GEMINI_API_KEY"), | |
- model="models/gemini-1.5-pro", | |
+ model=llm_model_name, | |
) | |
generated = llm.complete(input_prompt) | |
@@ -124,9 +141,10 @@ | |
FunctionAgent: Configured analysis agent. | |
""" | |
+ llm_model_name = os.getenv("TEXT_ANALYZER_AGENT_LLM_MODEL", "models/gemini-1.5-pro") | |
llm = GoogleGenAI( | |
api_key=os.getenv("GEMINI_API_KEY"), | |
- model="models/gemini-1.5-pro", | |
+ model=llm_model_name, | |
) | |
system_prompt = """\ | |
``` | |
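As a caller-side illustration of the exceptions introduced above, here is a hedged sketch of a wrapper that might live alongside `extract_text_from_pdf` and `analyze_text` in `text_analyzer_agent.py` (the wrapper function itself is hypothetical):
```python
import logging

logger = logging.getLogger(__name__)


def analyze_pdf_safely(source: str) -> str:
    """Hypothetical wrapper showing structured handling of the custom exceptions."""
    try:
        # extract_text_from_pdf now raises instead of returning error strings.
        text = extract_text_from_pdf(source)
        return analyze_text(text)
    except PDFDownloadError as e:
        # Likely transient (network): the agent could retry or try another URL.
        logger.warning(f"Download failed for {source}; consider retrying: {e}")
        raise
    except PDFExtractionError as e:
        # Unrecoverable parse failure: surface a clear error to the planner.
        logger.error(f"PDF extraction failed for {source}: {e}", exc_info=True)
        raise
```
Because `PDFDownloadError` subclasses `PDFExtractionError`, the more specific handler must come first.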
### 3.7. `reasoning_agent.py` Refactoring | |
* **Rationale:** To simplify the agent structure, improve configuration, and potentially optimize LLM usage. | |
* **Proposals:** | |
1. **Configuration:** Move hardcoded LLM model names (`models/gemini-1.5-pro`, `o4-mini`) and the API key environment variable name (`ALPAFLOW_OPENAI_API_KEY`) to configuration. | |
2. **Prompt Management:** Move the detailed CoT prompt from `reasoning_tool_fn` to a separate template file. | |
3. **Agent Structure Simplification:** Given the rigid workflow (call tool -> handoff), consider replacing the `ReActAgent` with a simpler `FunctionAgent` that directly calls the `reasoning_tool` and formats the output before handing off (a sketch appears at the end of this subsection). Alternatively, evaluate whether the `reasoning_tool` logic could be integrated as a direct LLM call within agents that need CoT (like `planner_agent`), potentially removing the need for a separate `reasoning_agent` altogether, unless its specific CoT prompt/model (`o4-mini`) is crucial.
* **Diff Patch (Illustrative - Configuration & Prompt Loading):** | |
```diff | |
--- a/reasoning_agent.py | |
+++ b/reasoning_agent.py | |
@@ -1,10 +1,19 @@ | |
import os | |
+ import logging | |
from llama_index.core.agent.workflow import ReActAgent | |
from llama_index.llms.google_genai import GoogleGenAI | |
from llama_index.core.tools import FunctionTool | |
from llama_index.llms.openai import OpenAI | |
+ logger = logging.getLogger(__name__) | |
+ | |
+ def load_prompt_from_file(filename="reasoning_tool_prompt.txt") -> str: | |
+ try: | |
+ with open(filename, "r") as f: | |
+ return f.read() | |
+ except FileNotFoundError: | |
+ logger.error(f"Prompt file {filename} not found.") | |
+ return "Perform chain-of-thought reasoning on the context: {context}" | |
+ | |
def reasoning_tool_fn(context: str) -> str: | |
""" | |
Perform end-to-end chain-of-thought reasoning over the full multi-agent workflow context, | |
@@ -17,45 +26,12 @@ | |
str: A structured reasoning trace with numbered thought steps, intermediate checks, | |
and a concise final recommendation or conclusion. | |
""" | |
- prompt = f"""You are an expert reasoning engine. You have the following full context of a multi-agent workflow: | |
- | |
- {context} | |
- | |
- Your job is to: | |
- 1. **Comprehension** | |
- - Read the entire question or problem statement carefully. | |
- - Identify key terms, constraints, and desired outcomes. | |
- | |
- 2. **Decomposition** | |
- - Break down the problem into logical sub-steps or sub-questions. | |
- - Ensure each sub-step is necessary and sufficient to progress toward a solution. | |
- | |
- 3. **Chain-of-Thought** | |
- - Articulate your internal reasoning in clear, numbered steps. | |
- - At each step, state your assumptions, derive implications, and check for consistency. | |
- | |
- 4. **Intermediate Verification** | |
- - After each reasoning step, validate your conclusion against the problem’s constraints. | |
- - If a contradiction or uncertainty arises, revisit and refine the previous step. | |
- | |
- 5. **Synthesis** | |
- - Once all sub-steps are resolved, integrate the intermediate results into a cohesive answer. | |
- - Ensure the final answer directly addresses the user’s request and all specified criteria. | |
- | |
- 6. **Clarity & Precision** | |
- - Use formal, precise language. | |
- - Avoid ambiguity: define any technical terms you introduce. | |
- - Provide just enough detail to justify each conclusion without digression. | |
- | |
- 7. **Final Answer** | |
- - Present a concise, well-structured response. | |
- - If appropriate, include a brief summary of your reasoning steps. | |
- | |
- Respond with your reasoning steps followed by the final recommendation. | |
- """ | |
+ prompt_template = load_prompt_from_file() | |
+ prompt = prompt_template.format(context=context) | |
+ reasoning_llm_model = os.getenv("REASONING_TOOL_LLM_MODEL", "o4-mini") | |
+ # Use specific API key if needed, e.g., ALPAFLOW_OPENAI_API_KEY | |
+ reasoning_api_key_env = os.getenv("REASONING_TOOL_API_KEY_ENV", "ALPAFLOW_OPENAI_API_KEY") | |
+ reasoning_api_key = os.getenv(reasoning_api_key_env) | |
llm = OpenAI( | |
- model="o4-mini", | |
- api_key=os.getenv("ALPAFLOW_OPENAI_API_KEY"), | |
+ model=reasoning_llm_model, | |
+ api_key=reasoning_api_key, | |
reasoning_effort="high" | |
) | |
response = llm.complete(prompt) | |
@@ -74,9 +50,10 @@ | |
""" | |
Create a pure reasoning agent with no tools, relying solely on chain-of-thought. | |
""" | |
+ agent_llm_model = os.getenv("REASONING_AGENT_LLM_MODEL", "models/gemini-1.5-pro") | |
llm = GoogleGenAI( | |
api_key=os.getenv("GEMINI_API_KEY"), | |
- model="models/gemini-1.5-pro", | |
+ model=agent_llm_model, | |
) | |
system_prompt = """\ | |
``` | |
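If the `FunctionAgent` simplification from proposal 3 is adopted, a minimal sketch follows, assuming `FunctionAgent` accepts the same constructor arguments (`name`, `description`, `llm`, `tools`, `system_prompt`) used for agents elsewhere in this codebase:
```python
import os

from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.google_genai import GoogleGenAI


def initialize_reasoning_agent() -> FunctionAgent:
    llm = GoogleGenAI(
        api_key=os.getenv("GEMINI_API_KEY"),
        model=os.getenv("REASONING_AGENT_LLM_MODEL", "models/gemini-1.5-pro"),
    )
    reasoning_tool = FunctionTool.from_defaults(
        fn=reasoning_tool_fn,  # defined earlier in reasoning_agent.py
        name="reasoning_tool",
        description="Performs chain-of-thought reasoning over the workflow context.",
    )
    # A FunctionAgent follows a simpler call-tool-then-respond loop than ReActAgent,
    # which matches this agent's rigid "call tool -> handoff" workflow.
    return FunctionAgent(
        name="reasoning_agent",
        description="Pure reasoning agent that delegates to reasoning_tool.",
        llm=llm,
        tools=[reasoning_tool],
        system_prompt="Call reasoning_tool on the provided context, then hand off the result.",
    )
```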
### 3.8. `planner_agent.py` Refactoring | |
* **Rationale:** To improve configuration management and prompt handling. | |
* **Proposals:** | |
1. **Configuration:** Move the hardcoded LLM model name (`models/gemini-1.5-pro`) to environment variables or a configuration file. | |
2. **Prompt Management:** Move the system prompt and the prompts within the `plan` and `synthesize_and_respond` functions to separate template files for better readability and maintainability. | |
* **Diff Patch (Illustrative - Configuration & Prompt Loading):** | |
```diff | |
--- a/planner_agent.py | |
+++ b/planner_agent.py | |
@@ -1,10 +1,19 @@ | |
import os | |
+ import logging | |
from typing import List, Any | |
from llama_index.core.agent.workflow import FunctionAgent, ReActAgent | |
from llama_index.core.tools import FunctionTool | |
from llama_index.llms.google_genai import GoogleGenAI | |
+ logger = logging.getLogger(__name__) | |
+ | |
+ def load_prompt_from_file(filename: str, default_prompt: str) -> str: | |
+ try: | |
+ with open(filename, "r") as f: | |
+ return f.read() | |
+ except FileNotFoundError: | |
+ logger.warning(f"Prompt file {filename} not found. Using default.") | |
+ return default_prompt | |
+ | |
def plan(objective: str) -> List[str]: | |
""" | |
Generate a list of sub-questions from the given objective. | |
@@ -15,14 +24,16 @@ | |
Returns: | |
List[str]: A list of sub-steps as strings. | |
""" | |
-    input_prompt: str = (
+    default_plan_prompt = (
         "You are a research assistant. "
         "Given an objective, break it down into a list of concise, actionable sub-steps.\n"
-        f"Objective: {objective}\n"
+        "Objective: {objective}\n"
         "Sub-steps (one per line):"
     )
+    plan_prompt_template = load_prompt_from_file("planner_plan_prompt.txt", default_plan_prompt)
+    input_prompt = plan_prompt_template.format(objective=objective)
+ llm_model_name = os.getenv("PLANNER_TOOL_LLM_MODEL", "models/gemini-1.5-pro") | |
llm = GoogleGenAI( | |
api_key=os.getenv("GEMINI_API_KEY"), | |
- model="models/gemini-1.5-pro", | |
+ model=llm_model_name, | |
) | |
@@ -44,13 +55,16 @@ | |
Returns: | |
str: A unified, well-structured response addressing the original objective. | |
""" | |
- # Join each ready-made QA block directly | |
summary_blocks = "\n".join(results) | |
- input_prompt = f"""You are an expert synthesizer. Given the following sub-questions and their answers, | |
+    default_synth_prompt = """You are an expert synthesizer. Given the following sub-questions and their answers,
produce a single, coherent, comprehensive report that addresses the original objective: | |
{summary_blocks} | |
Final Report: | |
""" | |
+ synth_prompt_template = load_prompt_from_file("planner_synthesize_prompt.txt", default_synth_prompt) | |
+ input_prompt = synth_prompt_template.format(summary_blocks=summary_blocks) | |
+ | |
+ llm_model_name = os.getenv("PLANNER_TOOL_LLM_MODEL", "models/gemini-1.5-pro") # Can use same model as plan | |
llm = GoogleGenAI( | |
api_key=os.getenv("GEMINI_API_KEY"), | |
- model="models/gemini-1.5-pro", | |
+ model=llm_model_name, | |
) | |
response = llm.complete(input_prompt) | |
return response.text | |
@@ -77,9 +91,10 @@ | |
""" | |
Initialize a LlamaIndex agent specialized in research planning and question engineering. | |
""" | |
+ agent_llm_model = os.getenv("PLANNER_AGENT_LLM_MODEL", "models/gemini-1.5-pro") | |
llm = GoogleGenAI( | |
api_key=os.getenv("GEMINI_API_KEY"), | |
- model="models/gemini-1.5-pro", | |
+ model=agent_llm_model, | |
) | |
system_prompt = """\ | |
@@ -108,6 +123,7 @@ | |
**Completion & Synthesis** | |
If the final result fully completes the original objective, produce a consolidated synthesis of the roadmap and send it as your concluding output. | |
""" | |
+ system_prompt = load_prompt_from_file("planner_system_prompt.txt", system_prompt) # Load from file if exists | |
agent = ReActAgent( | |
name="planner_agent", | |
``` | |
### 3.9. `code_agent.py` Refactoring | |
* **Rationale:** To address the critical security vulnerability of the `SimpleCodeExecutor`, improve configuration management, and align code execution with safer practices. | |
* **Proposals:** | |
1. **Remove `SimpleCodeExecutor`:** This class and its `execute` method using `subprocess` with raw code strings are fundamentally insecure and **must be removed entirely**. | |
2. **Use `CodeInterpreterToolSpec`:** Rely *exclusively* on the `code_interpreter` tool derived from LlamaIndex's `CodeInterpreterToolSpec` for code execution. This tool is designed for safer, sandboxed execution (see the sketch at the end of this subsection).
3. **Update `CodeActAgent` Initialization:** Remove the `code_execute_fn` parameter when initializing `CodeActAgent`, as the agent should use the provided `code_interpreter` tool for execution via the standard ReAct/Act loop, not a direct execution function. | |
4. **Configuration:** Move hardcoded LLM model names (`o4-mini`, `models/gemini-1.5-pro`) and the API key environment variable name (`ALPAFLOW_OPENAI_API_KEY`) to configuration. | |
5. **Prompt Management:** Move the `generate_python_code` prompt to a separate template file. | |
* **Diff Patch (Illustrative - Security Fix & Configuration):** | |
```diff | |
--- a/code_agent.py | |
+++ b/code_agent.py | |
@@ -1,5 +1,6 @@ | |
import os | |
import subprocess | |
+ import logging | |
from llama_index.core.agent.workflow import ReActAgent, CodeActAgent | |
from llama_index.core.tools import FunctionTool | |
@@ -7,6 +8,16 @@ | |
from llama_index.llms.openai import OpenAI | |
from llama_index.tools.code_interpreter import CodeInterpreterToolSpec | |
+ logger = logging.getLogger(__name__) | |
+ | |
+ def load_prompt_from_file(filename: str, default_prompt: str) -> str: | |
+ try: | |
+ with open(filename, "r") as f: | |
+ return f.read() | |
+ except FileNotFoundError: | |
+ logger.warning(f"Prompt file {filename} not found. Using default.") | |
+ return default_prompt | |
+ | |
def generate_python_code(prompt: str) -> str: | |
""" | |
Generate valid Python code from a natural language description. | |
@@ -27,7 +38,7 @@ | |
it before execution. | |
- This function only generates code and does not execute it. | |
""" | |
- | |
- input_prompt = f"""You are also a helpful assistant that writes Python code. | |
+    default_gen_prompt = """You are also a helpful assistant that writes Python code.
You will be given a prompt and you must generate Python code based on that prompt. | |
You must only generate Python code and nothing else. | |
Do not include any explanations or any other text. | |
@@ -40,10 +51,14 @@ | |
Code:\n | |
""" | |
+ gen_prompt_template = load_prompt_from_file("code_gen_prompt.txt", default_gen_prompt) | |
+ input_prompt = gen_prompt_template.format(prompt=prompt) | |
+ | |
+ gen_llm_model = os.getenv("CODE_GEN_LLM_MODEL", "o4-mini") | |
+ gen_api_key_env = os.getenv("CODE_GEN_API_KEY_ENV", "ALPAFLOW_OPENAI_API_KEY") | |
+ gen_api_key = os.getenv(gen_api_key_env) | |
llm = OpenAI( | |
- model="o4-mini", | |
- api_key=os.getenv("ALPAFLOW_OPENAI_API_KEY") | |
+ model=gen_llm_model, | |
+ api_key=gen_api_key | |
) | |
generated_code = llm.complete(input_prompt) | |
@@ -74,60 +89,11 @@ | |
), | |
) | |
-from typing import Any, Dict, Tuple | |
-import io | |
-import contextlib | |
-import ast | |
-import traceback | |
- | |
- | |
-class SimpleCodeExecutor: | |
- """ | |
- A simple code executor that runs Python code with state persistence. | |
- | |
- This executor maintains a global and local state between executions, | |
- allowing for variables to persist across multiple code runs. | |
- | |
- NOTE: not safe for production use! Use with caution. | |
- """ | |
- | |
- def __init__(self): | |
- pass | |
- | |
- def execute(self, code: str) -> str: | |
- """ | |
- Execute Python code and capture output and return values. | |
- | |
- Args: | |
- code: Python code to execute | |
- | |
- Returns: | |
- Dict with keys `success`, `output`, and `return_value` | |
- """ | |
- print(f"Executing code: {code}") | |
- try: | |
- result = subprocess.run( | |
- ["python", code], | |
- stdout=subprocess.PIPE, | |
- stderr=subprocess.PIPE, | |
- text=True, | |
- timeout=60 | |
- ) | |
- if result.returncode != 0: | |
- print(f"Execution failed with error: {result.stderr.strip()}") | |
- return f"Error: {result.stderr.strip()}" | |
- else: | |
- output = result.stdout.strip() | |
- print(f"Captured Output: {output}") | |
- return output | |
- except subprocess.TimeoutExpired: | |
- print("Execution timed out.") | |
- return "Error: Timeout" | |
- except Exception as e: | |
- print(f"Execution failed with error: {e}") | |
- return f"Error: {e}" | |
- | |
def initialize_code_agent() -> CodeActAgent: | |
- code_executor = SimpleCodeExecutor() | |
+ # DO NOT USE SimpleCodeExecutor - it is insecure. | |
+ # Rely on the code_interpreter tool provided below. | |
+ agent_llm_model = os.getenv("CODE_AGENT_LLM_MODEL", "models/gemini-1.5-pro") | |
llm = GoogleGenAI( | |
api_key=os.getenv("GEMINI_API_KEY"), | |
- model="models/gemini-1.5-pro", | |
+ model=agent_llm_model, | |
) | |
system_prompt = """\ | |
@@ -151,6 +117,7 @@ | |
- If further logical reasoning or verification is needed, delegate to **reasoning_agent**. | |
- Otherwise, once you have the final code or execution result, pass your output to **planner_agent** for overall synthesis and presentation. | |
""" | |
+ system_prompt = load_prompt_from_file("code_agent_system_prompt.txt", system_prompt) | |
agent = CodeActAgent( | |
name="code_agent", | |
@@ -161,7 +128,7 @@ | |
"pipelines, and library development, CodeAgent delivers production-ready Python solutions." | |
), | |
+        # REMOVED: code_execute_fn=code_executor.execute (execution now flows through the code_interpreter tool)
- code_execute_fn=code_executor.execute, | |
tools=[ | |
python_code_generator_tool, | |
code_interpreter_tool, | |
``` | |
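Complementing the diff above, a minimal sketch of proposal 2, assuming the standard `to_tool_list()` helper available on LlamaIndex tool specs:
```python
from llama_index.tools.code_interpreter import CodeInterpreterToolSpec

# Derive the sandboxed execution tool once; all code execution flows through it,
# so neither code_execute_fn nor SimpleCodeExecutor is needed.
code_interpreter_tool = CodeInterpreterToolSpec().to_tool_list()[0]
```
The agent then receives this tool in its `tools` list, as the end of the diff above shows.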
### 3.10. `math_agent.py` Refactoring | |
* **Rationale:** To improve configuration management and potentially simplify the tool interface for the LLM. | |
* **Proposals:** | |
1. **Configuration:** Move the hardcoded agent LLM model name (`models/gemini-1.5-pro`) to configuration. Ensure the WolframAlpha App ID is configured via environment variable (`WOLFRAM_ALPHA_APP_ID`) as intended. | |
2. **Tool Granularity:** The current approach creates a separate tool for almost every single math function (solve, derivative, integral, add, multiply, inverse, mean, median, etc.). While explicit, this results in a very large number of tools for the `ReActAgent` to manage. Consider: | |
* **Grouping:** Group related functions under fewer tools. For example, a `symbolic_math_tool` that takes the operation type (solve, diff, integrate) as a parameter, or a `matrix_ops_tool` (see the dispatch sketch at the end of this subsection).
* **Natural Language Interface:** Create a single `calculate` tool that takes a natural language math query (e.g., "solve x**2 - 4 = 0 for x", "mean of [1, 2, 3]") and uses an LLM (or rule-based parsing) internally to dispatch to the appropriate NumPy/SciPy/SymPy function. This simplifies the interface for the main agent LLM but adds complexity within the tool. | |
* **WolframAlpha Prioritization:** Evaluate if WolframAlpha can handle many of these requests directly, potentially reducing the need for numerous specific SymPy/NumPy tools, especially for symbolic tasks. | |
3. **Truncated File:** The original file was truncated during review; obtain and review the complete file if possible, since it may contain additional issues or tools not covered here.
* **Diff Patch (Illustrative - Configuration):** | |
```diff | |
--- a/math_agent.py | |
+++ b/math_agent.py | |
@@ -1,5 +1,6 @@ | |
import os | |
from typing import List, Optional, Union | |
+ import logging | |
import sympy as sp | |
import numpy as np | |
from llama_index.core.agent.workflow import ReActAgent | |
@@ -12,6 +13,8 @@ | |
from scipy.integrate import odeint | |
import numpy.fft as fft | |
+ logger = logging.getLogger(__name__) | |
+ | |
# --- Symbolic math functions --- | |
@@ -451,10 +454,11 @@ | |
def initialize_math_agent() -> ReActAgent: | |
+ agent_llm_model = os.getenv("MATH_AGENT_LLM_MODEL", "models/gemini-1.5-pro") | |
llm = GoogleGenAI( | |
api_key=os.getenv("GEMINI_API_KEY"), | |
- model="models/gemini-1.5-pro", | |
+ model=agent_llm_model, | |
) | |
# Ensure WolframAlpha App ID is set | |
``` | |
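A minimal sketch of the grouping option from proposal 2, using SymPy as already imported in `math_agent.py`; the `symbolic_math` dispatcher itself is hypothetical:
```python
import sympy as sp


def symbolic_math(operation: str, expression: str, symbol: str = "x") -> str:
    """Hypothetical grouped tool: one entry point for solve/derivative/integral.

    Args:
        operation: One of "solve", "derivative", or "integral".
        expression: An expression in SymPy syntax, e.g. "x**2 - 4".
        symbol: The symbol to operate on.
    """
    x = sp.Symbol(symbol)
    expr = sp.sympify(expression)
    if operation == "solve":
        return str(sp.solve(expr, x))
    if operation == "derivative":
        return str(sp.diff(expr, x))
    if operation == "integral":
        return str(sp.integrate(expr, x))
    raise ValueError(f"Unsupported operation: {operation}")


# Example: symbolic_math("solve", "x**2 - 4") -> "[-2, 2]"
```
One `FunctionTool` wrapping this dispatcher replaces three separate solve/derivative/integral tools, shrinking the toolset the `ReActAgent` must choose from.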
*(Refactoring proposals section complete)* | |
## 4. New Feature Designs | |
This section outlines the design for the new features requested: YouTube Ingestion and Generic Audio Transcription. | |
### 4.1. YouTube Ingestion | |
* **Rationale:** To enable the framework to process YouTube videos by extracting audio, transcribing it, and summarizing the content, as requested by the user. | |
* **Design Proposal:** | |
* **Implementation:** Introduce a new dedicated agent, `youtube_agent`, or add tools to the existing `research_agent` or `text_analyzer_agent`. A dedicated agent seems cleaner given the specific multi-step workflow. | |
* **Agent (`youtube_agent`):** | |
* **Purpose:** Manages the end-to-end process of downloading YouTube audio, chunking, transcribing, and summarizing. | |
* **Tools:** | |
1. `download_youtube_audio`: Takes a YouTube URL, uses a library like `yt-dlp` (or potentially `pytube`) to download the audio stream into a temporary file (e.g., `.mp3` or `.opus`). Returns the path to the audio file. | |
2. `chunk_audio_file`: Takes an audio file path and a maximum chunk duration (e.g., 60 seconds). Uses a library like `pydub` or `librosa`+`soundfile` to split the audio into smaller, sequentially numbered temporary files. Returns a list of chunk file paths. (Tools 1 and 2 are sketched at the end of this subsection.)
3. `transcribe_audio_chunk_gemini`: Takes an audio file path (representing a chunk). Uses the Google Generative AI SDK (`google.generativeai`) to call the Gemini 1.5 Pro model with the audio file for transcription. Returns the transcribed text. | |
4. `summarize_transcript`: Takes the full concatenated transcript text. Uses a Gemini model (e.g., 1.5 Pro or Flash) with a specific prompt to generate a one-paragraph summary. Returns the summary text. | |
* **Workflow (ReAct or Function sequence):** | |
1. Receive YouTube URL. | |
2. Call `download_youtube_audio`. | |
3. Call `chunk_audio_file` with the downloaded audio path. | |
4. Iterate through the list of chunk paths: | |
* Call `transcribe_audio_chunk_gemini` for each chunk. | |
* Collect transcribed text segments. | |
5. Concatenate all transcribed text segments into a full transcript. | |
6. Call `summarize_transcript` with the full transcript. | |
7. Return the full transcript and the summary. | |
8. Clean up temporary audio files (downloaded and chunks). | |
* **Handoff:** Could hand off the transcript and summary to `planner_agent` or `text_analyzer_agent` for further processing or integration. | |
* **Dependencies:** `yt-dlp`, `pydub` (requires `ffmpeg` or `libav`), `google-generativeai`. | |
* **Configuration:** Gemini API Key, chunk duration. | |
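A hedged sketch of tools 1 and 2, assuming `yt-dlp`'s Python API and `pydub`; the output directory, mp3 codec, and chunk naming are illustrative choices:
```python
import os
import tempfile

from pydub import AudioSegment  # requires ffmpeg or libav
from yt_dlp import YoutubeDL


def download_youtube_audio(url: str) -> str:
    """Download the audio stream of a YouTube video to a temporary mp3 file."""
    out_dir = tempfile.mkdtemp(prefix="yt_audio_")
    opts = {
        "format": "bestaudio/best",
        "outtmpl": os.path.join(out_dir, "audio.%(ext)s"),
        "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    }
    with YoutubeDL(opts) as ydl:
        ydl.extract_info(url, download=True)
    return os.path.join(out_dir, "audio.mp3")


def chunk_audio_file(path: str, chunk_seconds: int = 60) -> list:
    """Split an audio file into sequentially numbered chunk files."""
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_seconds * 1000
    paths = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = f"{path}.chunk{i:03d}.mp3"
        audio[start:start + chunk_ms].export(chunk_path, format="mp3")
        paths.append(chunk_path)
    return paths
```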
### 4.2. Generic Audio Transcription | |
* **Rationale:** To provide a flexible audio transcription capability for local files and remote URLs, preferring Gemini Pro when its latency is acceptable and falling back to Whisper.cpp for low-latency or offline use, exposed via a Python API as requested.
* **Design Proposal:** | |
* **Implementation:** Introduce a new dedicated agent, `transcription_agent`, or add tools to `text_analyzer_agent`. A dedicated agent allows for clearer separation of concerns, especially managing the Whisper.cpp dependency and logic. | |
* **Agent (`transcription_agent`):** | |
* **Purpose:** Transcribes audio from various sources (local path, URL) using either Gemini or Whisper.cpp based on latency requirements or availability. | |
* **Tools:** | |
1. `prepare_audio_source`: Takes a source string (URL or local path). If it's a URL, downloads it to a temporary file using `requests`. Validates the local file path. Returns the path to the local audio file. | |
2. `transcribe_gemini`: Takes an audio file path. Uses the `google-generativeai` SDK to call Gemini 1.5 Pro for transcription. Returns the transcribed text. This is the preferred method when latency is acceptable. | |
3. `transcribe_whisper_cpp`: Takes an audio file path. Uses a Python wrapper around `whisper.cpp` (e.g., installing `whisper.cpp` via `apt` or compiling from source, then using `subprocess` or a dedicated Python binding if available) to perform local transcription. Returns the transcribed text. This is the fallback or low-latency option. (Tools 2 and 3 are sketched at the end of this subsection.)
4. `choose_transcription_method`: (Internal logic or a simple tool) Takes latency preference (e.g., 'high_quality' vs 'low_latency') or checks Gemini availability/quota. Decides whether to use `transcribe_gemini` or `transcribe_whisper_cpp`. | |
* **Workflow (ReAct or Function sequence):** | |
1. Receive audio source (URL/path) and potentially a latency preference. | |
2. Call `prepare_audio_source` to get a local file path. | |
3. Call `choose_transcription_method` (or execute internal logic) to decide between Gemini and Whisper. | |
4. If Gemini: Call `transcribe_gemini`. | |
5. If Whisper: Call `transcribe_whisper_cpp`. | |
6. Return the resulting transcript. | |
7. Clean up temporary downloaded audio file if applicable. | |
* **Handoff:** Could hand off the transcript to `planner_agent` or `text_analyzer_agent`. | |
* **Python API:** | |
* Define a simple Python function (e.g., in a `transcription_api.py` module) that encapsulates the agent's logic or directly calls the underlying transcription functions. | |
```python | |
# Example API function in transcription_api.py
from .transcription_agent import transcribe_audio  # Assuming agent logic is refactored


class TranscriptionError(Exception):
    """Raised when transcription fails in any backend."""
    pass


def get_transcript(source: str, prefer_gemini: bool = True) -> str:
    """Transcribes audio from a local path or URL.

    Args:
        source: Path to the local audio file or URL.
        prefer_gemini: If True, attempts to use Gemini Pro first.
            If False or Gemini fails, falls back to Whisper.cpp.

    Returns:
        The transcribed text.

    Raises:
        TranscriptionError: If transcription fails.
    """
    # Simplified logic: the actual implementation needs error handling and
    # Gemini/Whisper selection based on preference/availability.
    try:
        transcript = transcribe_audio(source, prefer_gemini)
        return transcript
    except Exception as e:
        # Log the error before re-raising as the API-level exception.
        raise TranscriptionError(f"Failed to transcribe {source}: {e}") from e
``` | |
* **Dependencies:** `requests`, `google-generativeai`, `whisper.cpp` (requires separate installation/compilation), potentially Python bindings for `whisper.cpp`. | |
* **Configuration:** Gemini API Key, path to `whisper.cpp` executable or library, Whisper model selection. | |
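A hedged sketch of tools 2 and 3, assuming the `google-generativeai` file-upload API and a locally compiled `whisper.cpp` binary; the binary path, model path, and CLI flags are illustrative assumptions to verify against the installed versions:
```python
import os
import subprocess

import google.generativeai as genai


def transcribe_gemini(audio_path: str) -> str:
    """Transcribe with Gemini 1.5 Pro via the google-generativeai SDK."""
    genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
    model = genai.GenerativeModel("gemini-1.5-pro")
    audio_file = genai.upload_file(audio_path)
    response = model.generate_content(
        [audio_file, "Transcribe this audio verbatim. Return only the transcript."]
    )
    return response.text


def transcribe_whisper_cpp(audio_path: str) -> str:
    """Fallback: local transcription through a whisper.cpp binary via subprocess."""
    whisper_bin = os.getenv("WHISPER_CPP_BIN", "./main")  # path to the compiled binary
    whisper_model = os.getenv("WHISPER_CPP_MODEL", "models/ggml-base.en.bin")
    result = subprocess.run(
        [whisper_bin, "-m", whisper_model, "-f", audio_path, "-nt"],  # -nt: no timestamps
        capture_output=True,
        text=True,
        timeout=600,
        check=True,
    )
    return result.stdout.strip()
```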
## 5. Extra Agent Designs | |
This section proposes three additional specialized agents designed to enhance performance on the GAIA benchmark by addressing common challenges like complex fact verification, interpreting visual data representations, and handling long contexts. | |
### 5.1. Agent Design 1: Advanced Validation Agent (`validation_agent`) | |
* **Purpose:** To perform rigorous validation of factual claims or intermediate results generated by other agents, going beyond the simple contradiction check of the current `verifier_agent`. This agent aims to improve the accuracy and trustworthiness of the final answer by cross-referencing information and performing checks. | |
* **Key Tool Calls:** | |
* `web_search` (from `research_agent` or similar): To find external evidence supporting or refuting a claim. | |
* `browse_and_extract` (from `research_agent` or similar): To access specific URLs found during search and extract relevant text snippets. | |
* `code_interpreter` (from `code_agent`): To perform calculations or simple data manipulations needed for verification (e.g., checking unit conversions, calculating percentages). | |
* `knowledge_base_lookup` (New Tool - Optional): Interface with a structured knowledge base (e.g., Wikidata, internal DB) to verify entities, relationships, or properties. | |
* `llm_check_consistency` (New Tool or LLM call): Use a powerful LLM with a specific prompt to assess the logical consistency between a claim and a set of provided evidence snippets or existing context. (Sketched at the end of this subsection.)
* **Agent Loop Sketch (ReAct style):** | |
1. **Input:** A specific claim or statement to validate, along with relevant context or source information. | |
2. **Thought:** Identify the core assertion in the claim. Determine the best validation strategy (e.g., web search for current events, calculation for numerical claims, consistency check for logical statements). | |
3. **Action:** Call the appropriate tool (`web_search`, `code_interpreter`, `llm_check_consistency`). | |
4. **Observation:** Analyze the tool's output (search results, calculation result, consistency assessment). | |
5. **Thought:** Does the observation confirm, refute, or remain inconclusive about the claim? Is more information needed? (e.g., need to browse a specific search result). | |
6. **Action (if needed):** Call another tool (`browse_and_extract`, `llm_check_consistency` with new evidence). | |
7. **Observation:** Analyze new output. | |
8. **Thought:** Synthesize findings. Assign a final validation status (e.g., Confirmed, Refuted, Uncertain) and provide supporting evidence or reasoning. | |
9. **Output:** Validation status and justification. | |
10. **Handoff:** Return result to `planner_agent` or `verifier_agent` (if this agent replaces the contradiction part). | |
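A minimal sketch of `llm_check_consistency`, reusing the `GoogleGenAI` plus `complete()` pattern used throughout the codebase; the status labels and environment variable are illustrative:
```python
import os

from llama_index.llms.google_genai import GoogleGenAI


def llm_check_consistency(claim: str, evidence: list) -> str:
    """Ask an LLM whether the evidence confirms, refutes, or is inconclusive about a claim."""
    llm = GoogleGenAI(
        api_key=os.getenv("GEMINI_API_KEY"),
        model=os.getenv("VALIDATION_AGENT_LLM_MODEL", "models/gemini-1.5-pro"),
    )
    evidence_block = "\n".join(f"- {snippet}" for snippet in evidence)
    prompt = (
        "Given the claim and evidence below, respond with exactly one of "
        "CONFIRMED, REFUTED, or UNCERTAIN, followed by a one-sentence justification.\n"
        f"Claim: {claim}\nEvidence:\n{evidence_block}"
    )
    return llm.complete(prompt).text
```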
### 5.2. Agent Design 2: Figure Interpretation Agent (`figure_interpretation_agent`) | |
* **Purpose:** To specialize in extracting structured data and meaning from figures, charts, graphs, and tables embedded within images or documents, which are common in GAIA tasks and often require more than just a textual description. | |
* **Key Tool Calls:** | |
* `image_ocr` (New Tool or enhanced `image_analyzer_agent` capability): High-precision OCR focused on extracting text specifically from figures, including axes labels, legends, titles, and data points. | |
* `chart_data_extractor` (New Tool): Utilizes specialized vision models (e.g., DePlot, ChartOCR, or similar fine-tuned models) designed to parse chart types (bar, line, pie) and extract underlying data series or key values. | |
* `table_parser` (New Tool): Uses vision or document AI models to detect table structures in images/PDFs and extract cell content into a structured format (e.g., list of lists, Pandas DataFrame via code execution). | |
* `code_interpreter` (from `code_agent`): To process extracted data (e.g., load into DataFrame, perform simple analysis, re-plot for verification). | |
* `llm_interpret_figure` (New Tool or LLM call): Takes extracted text, data, and potentially the image itself (multimodal) to provide a semantic interpretation of the figure's message or trends. | |
* **Agent Loop Sketch (Function sequence or ReAct):** | |
1. **Input:** An image or document page containing a figure/table, potentially with context or a specific question about it. | |
2. **Action:** Call `image_ocr` to get all text elements. | |
3. **Action:** Call `chart_data_extractor` or `table_parser` based on visual analysis (or try both) to get structured data. | |
4. **Action (Optional):** Call `code_interpreter` to load structured data into a DataFrame for easier handling. | |
5. **Action:** Call `llm_interpret_figure`, providing the extracted text, data (raw or DataFrame), and potentially the original image, asking it to answer the specific question or summarize the figure's key insights. | |
6. **Output:** Structured data (if requested) and/or the semantic interpretation/answer. | |
7. **Handoff:** Return results to `planner_agent` or `reasoning_agent`. | |
### 5.3. Agent Design 3: Long Context Management Agent (`long_context_agent`) | |
* **Purpose:** To effectively manage and query information from very long documents or conversation histories that exceed the context window limits of standard models or require efficient information retrieval techniques. | |
* **Key Tool Calls:** | |
* `document_chunker` (New Tool): Splits long text into semantically meaningful chunks (e.g., using `SentenceSplitter` from LlamaIndex or more advanced methods). | |
* `vector_store_builder` (New Tool): Takes text chunks and builds an in-memory or persistent vector index (using libraries like `llama-index`, `langchain`, `faiss`, `chromadb`). | |
* `vector_retriever` (New Tool): Queries the built vector index with a specific question to find the most relevant chunks. | |
* `summarizer_tool` (New Tool or LLM call): Generates summaries of long text or selected chunks, potentially using different levels of detail. | |
* `contextual_synthesizer` (New Tool or LLM call): Takes retrieved relevant chunks and the original query, then uses an LLM to synthesize an answer grounded in the retrieved context (RAG pattern). (Sketched, together with retrieval, at the end of this subsection.)
* **Agent Loop Sketch (Can be stateful):** | |
1. **Input:** A long document (text or path) or a long conversation history, and a specific query or task related to it. | |
2. **(Initialization/First Use):** | |
* **Action:** Call `document_chunker`. | |
* **Action:** Call `vector_store_builder` to create an index from the chunks. Store the index reference. | |
3. **(Querying):** | |
* **Action:** Call `vector_retriever` with the user's query to get relevant chunks. | |
* **Action:** Call `contextual_synthesizer`, providing the query and retrieved chunks, to generate the final answer. | |
4. **(Alternative: Summarization Task):** | |
* **Action:** Call `summarizer_tool` on the full text (if feasible for the tool) or on retrieved chunks based on a high-level query. | |
5. **Output:** The synthesized answer or the summary. | |
6. **Handoff:** Return results to `planner_agent`. | |
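A minimal RAG sketch of steps 2 and 3, assuming LlamaIndex's in-memory `VectorStoreIndex` and default embedding model; a persistent faiss or chromadb store would slot in behind the same interface:
```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter


def build_index(long_text: str) -> VectorStoreIndex:
    """document_chunker + vector_store_builder: chunk and index a long document."""
    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
    nodes = splitter.get_nodes_from_documents([Document(text=long_text)])
    return VectorStoreIndex(nodes)  # embeds nodes with the configured embedding model


def answer_from_index(index: VectorStoreIndex, query: str) -> str:
    """vector_retriever + contextual_synthesizer: retrieve top chunks and synthesize an answer."""
    query_engine = index.as_query_engine(similarity_top_k=4)
    return str(query_engine.query(query))
```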
## 6. Migration Plan | |
This section details the recommended steps for applying the proposed changes, lists new dependencies, and outlines minimal validation tests. | |
### 6.1. Order of Implementation | |
It is recommended to apply changes in the following order to minimize disruption and build upon stable foundations: | |
1. **Core Refactoring (`app.py`, Configuration, Logging):** | |
* Implement centralized configuration (e.g., `.env` file) and update all agents to use it for API keys, model names, etc. (see the sketch after this list).
* Integrate Python's `logging` module throughout `app.py` and all agent files, replacing `print` statements. | |
* Refactor `app.py`: Implement singleton agent initialization and break down `run_and_submit_all`. | |
* Apply structural refactors to agents (class-based structure, avoiding globals) like `role_agent`, `verifier_agent`, `research_agent`. | |
2. **Critical Security Fix (`code_agent`):** | |
* Immediately remove the `SimpleCodeExecutor` and modify `code_agent` to rely solely on the `code_interpreter` tool. | |
3. **Core Functionality Refactoring (`verifier_agent`, `math_agent`):** | |
* Improve `verifier_agent`'s contradiction detection (e.g., using an LLM or NLI model). | |
* Refactor `math_agent` tools if choosing to group them or use a natural language interface. | |
4. **New Feature: Generic Audio Transcription (`transcription_agent`):** | |
* Install `whisper.cpp` and its dependencies. | |
* Implement the `transcription_agent` and its tools (`prepare_audio_source`, `transcribe_gemini`, `transcribe_whisper_cpp`). | |
* Implement the Python API function `get_transcript`. | |
5. **New Feature: YouTube Ingestion (`youtube_agent`):** | |
* Install `yt-dlp` and `pydub` (and `ffmpeg`). | |
* Implement the `youtube_agent` and its tools (`download_youtube_audio`, `chunk_audio_file`, `transcribe_audio_chunk_gemini`, `summarize_transcript`). | |
6. **New Agent Implementation (Validation, Figure, Long Context):** | |
* Implement `validation_agent` and its tools. | |
* Implement `figure_interpretation_agent` and its tools (requires sourcing/installing chart/table parsing models/libraries). | |
* Implement `long_context_agent` and its tools (requires vector DB setup like `faiss` or `chromadb`). | |
7. **Integration and Workflow Adjustments:** | |
* Update `planner_agent`'s system prompt and handoff logic to incorporate the new agents. | |
* Update other agents' handoff targets as needed. | |
* Update `app.py` if the overall agent initialization or workflow invocation changes. | |
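For step 1, a minimal sketch of centralized configuration with `python-dotenv`; the `config.py` module name and variable names are illustrative:
```python
# config.py (hypothetical central configuration module)
import logging
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root, if present

logging.basicConfig(
    level=os.getenv("LOG_LEVEL", "INFO"),
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
PLANNER_AGENT_LLM_MODEL = os.getenv("PLANNER_AGENT_LLM_MODEL", "models/gemini-1.5-pro")

# Agents then import from config instead of calling os.getenv() inline:
#   from config import GEMINI_API_KEY, PLANNER_AGENT_LLM_MODEL
```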
### 6.2. New Dependencies (`requirements.txt`) | |
Based on the refactoring and new features, the following dependencies might need to be added or updated in `requirements.txt` (or managed via environment setup): | |
* `python-dotenv`: For loading configuration from `.env` files. | |
* `google-generativeai`: For interacting with Gemini models (already likely present via `llama-index-llms-google-genai`). | |
* `yt-dlp`: For downloading YouTube videos. | |
* `pydub`: For audio manipulation (chunking). Requires `ffmpeg` or `libav` system dependency. | |
* `llama-index-vector-stores-faiss` / `faiss-cpu` / `faiss-gpu`: For `long_context_agent` vector store (choose one). | |
* `chromadb` / `llama-index-vector-stores-chroma`: Alternative vector store for `long_context_agent`. | |
* `llama-index-multi-modal-llms-google`: Ensure multimodal support for Gemini is correctly installed. | |
* *Possibly*: Libraries for NLI models (e.g., `transformers`, `torch`) if used in `validation_agent`. | |
* *Possibly*: Libraries for chart/table parsing (e.g., specific models from Hugging Face, `opencv-python`, `pdf2image`) if implementing `figure_interpretation_agent` tools. | |
* *Possibly*: Python bindings for `whisper.cpp` if not using `subprocess`. | |
**System Dependencies:** | |
* `ffmpeg` or `libav`: Required by `pydub`. | |
* `whisper.cpp`: Needs to be compiled or installed separately. Follow its specific instructions. | |
### 6.3. Validation Tests | |
Minimal tests should be implemented to validate key changes: | |
1. **Configuration:** Test loading of API keys and model names from the configuration source. | |
2. **Logging:** Verify that logs are being generated at the correct levels and formats. | |
3. **`code_agent` Security:** Test that `code_agent` uses `code_interpreter` and *not* the removed `SimpleCodeExecutor`. Attempt a malicious code execution via prompt to ensure it fails safely within the interpreter's sandbox. | |
4. **`verifier_agent` Contradiction:** Test the improved contradiction detection with sample pairs of contradictory and non-contradictory statements. | |
5. **`transcription_agent`:** | |
* Test with a short local audio file using both Gemini and Whisper.cpp, comparing output quality/speed. | |
* Test with an audio URL. | |
* Test the Python API function `get_transcript`. | |
6. **`youtube_agent`:** | |
* Test with a short YouTube video URL. | |
* Verify audio download, chunking, transcription of chunks, and final summary generation. | |
* Check cleanup of temporary files. | |
7. **New Agents (Basic):** | |
* For `validation_agent`, `figure_interpretation_agent`, `long_context_agent`, implement basic tests confirming agent initialization and successful calls to their primary new tools with mock inputs/outputs. | |
8. **End-to-End Smoke Test:** Run `app.py` and process one or two simple GAIA tasks that are likely to invoke the refactored components and potentially a new feature (if a relevant task exists) to ensure the overall workflow remains functional. | |
*(Implementation plan complete. Ready for user confirmation.)* | |