GAIA_Agent / gaia_improvement_plan.md

3.5. research_agent.py Refactoring

  • Rationale: To improve browser instance management, error handling, and configuration.

  • Proposals:

    1. Browser Lifecycle Management: Instead of initializing the browser (start_chrome) at the module level, manage its lifecycle explicitly. Options:
      • Initialize the browser during the agent's initialization and provide a method or tool to explicitly close it (kill_browser) when the agent's task is done or the application shuts down (a class-based lifecycle sketch follows the diff below).
      • Use a context manager (with start_chrome(...) as browser:) if the browser is only needed for a specific scope within a tool call (less likely for a persistent agent).
      • Ensure kill_browser is reliably called. Perhaps the planner_agent could invoke a cleanup tool/method on the research_agent after its tasks are complete.
    2. Configuration: Move hardcoded Chrome options to configuration. Externalize API keys/IDs if not already done (they seem to be using os.getenv, which is good).
    3. Robust Error Handling: For browser interaction tools (visit, get_text_by_css, click_element), raise specific custom exceptions instead of returning error strings. This allows for more structured error handling by the agent or workflow.
    4. Tool Consolidation (Optional): The agent has many tools. Consider if some related tools (e.g., different search APIs) could be consolidated behind a single tool that internally chooses the best source, or if the LLM handles the large toolset effectively.
  • Diff Patch (Illustrative - Configuration & Browser Init):

    --- a/research_agent.py
    +++ b/research_agent.py
    @@ -1,5 +1,6 @@
     import os
     import time
    +import logging
     from typing import List

     from llama_index.core.agent.workflow import ReActAgent
    @@ -15,17 +16,21 @@
     from helium import start_chrome, go_to, find_all, Text, kill_browser
     from helium import get_driver

    +logger = logging.getLogger(__name__)
    +
     # 1. Helium
    -chrome_options = webdriver.ChromeOptions()
    -chrome_options.add_argument("--no-sandbox")
    -chrome_options.add_argument("--disable-dev-shm-usage")
    -chrome_options.add_experimental_option("prefs", {
    -    "download.prompt_for_download": False,
    -    "plugins.always_open_pdf_externally": True,
    -    "profile.default_content_settings.popups": 0
    -})
    -
    -browser = start_chrome(headless=True, options=chrome_options)
    +# Browser instance should be managed, not global at module level
    +# browser = start_chrome(headless=True, options=chrome_options)
    +
    +def get_chrome_options():
    +    options = webdriver.ChromeOptions()
    +    if os.getenv("RESEARCH_AGENT_CHROME_NO_SANDBOX", "true").lower() == "true":
    +        options.add_argument("--no-sandbox")
    +    if os.getenv("RESEARCH_AGENT_CHROME_DISABLE_DEV_SHM", "true").lower() == "true":
    +        options.add_argument("--disable-dev-shm-usage")
    +    # Add other options from config as needed
    +    # options.add_experimental_option(...)  # Example
    +    return options

     def visit(url: str, wait_seconds: float = 2.0) -> str | None:
         """
    @@ -36,10 +41,11 @@
             wait_seconds (float): Time to wait after navigation.
         """
         try:
    +        # Assumes browser is available in context (e.g., class member)
             go_to(url)
             time.sleep(wait_seconds)
             return f"Visited: {url}"
         except Exception as e:
    +        logger.error(f"Error visiting {url}: {e}", exc_info=True)
             return f"Error visiting {url}: {e}"

     def get_text_by_css(selector: str) -> List[str] | str:
    @@ -52,13 +58,15 @@
             List[str]: List of text contents.
         """
         try:
    +        # Assumes browser/helium context is active
             if selector.lower() == 'body':
                 elements = find_all(Text())
             else:
                 elements = find_all(selector)
             texts = [elem.web_element.text for elem in elements]
    -        print(f"Extracted {len(texts)} elements for selector '{selector}'")
    +        logger.info(f"Extracted {len(texts)} elements for selector '{selector}'")
             return texts
         except Exception as e:
    +        logger.error(f"Error extracting text for selector {selector}: {e}", exc_info=True)
             return f"Error extracting text for selector {selector}: {e}"

     def get_page_html() -> str:
    @@ -70,9 +78,11 @@
             str: HTML content, or empty string on error.
         """
         try:
    +        # Assumes browser/helium context is active
             driver = get_driver()
             html = driver.page_source
             return html
         except Exception as e:
    +        logger.error(f"Error extracting HTML: {e}", exc_info=True)
             return f"Error extracting HTML: {e}"

     def click_element(selector: str, index_element: int = 0) -> str:
    @@ -83,10 +93,12 @@
             selector (str): CSS selector of the element to click.
         """
         try:
    +        # Assumes browser/helium context is active
             element = find_all(selector)[index_element]
             element.click()
             time.sleep(1)
             return f"Clicked element matching selector '{selector}'"
         except Exception as e:
    +        logger.error(f"Error clicking element {selector}: {e}", exc_info=True)
             return f"Error clicking element {selector}: {e}"

     def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    @@ -97,6 +109,7 @@
             nth_result: Which occurrence to jump to (default: 1)
         """
    +    # Assumes browser is available in context
         elements = browser.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
         if nth_result > len(elements):
             return f"Match n°{nth_result} not found (only {len(elements)} matches found)"
         result = f"Found {len(elements)} matches for '{text}'."
    @@ -107,19 +120,22 @@

     def go_back() -> None:
         """Goes back to previous page."""
    +    # Assumes browser is available in context
         browser.back()

     def close_popups() -> None:
         """
         Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows!
         This does not work on cookie consent banners.
         """
    +    # Assumes browser is available in context
         webdriver.ActionChains(browser).send_keys(Keys.ESCAPE).perform()

     def close() -> None:
         """
         Close the browser instance.
         """
         try:
    +        # Assumes kill_browser is appropriate here
             kill_browser()
    -        print("Browser closed")
    +        logger.info("Browser closed via kill_browser()")
         except Exception as e:
    -        print(f"Error closing browser: {e}")
    +        logger.error(f"Error closing browser: {e}", exc_info=True)

     visit_tool = FunctionTool.from_defaults(
         fn=visit,
    @@ -240,9 +256,14 @@

     def initialize_research_agent() -> ReActAgent:
    +    # Browser initialization should happen here or be managed externally
    +    # Example: browser = start_chrome(headless=True, options=get_chrome_options())
    +    # Ensure browser instance is passed to tools or accessible via agent state/class
    +    llm_model_name = os.getenv("RESEARCH_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
         llm = GoogleGenAI(
             api_key=os.getenv("GEMINI_API_KEY"),
    -        model="models/gemini-1.5-pro",
    +        model=llm_model_name,
         )

         system_prompt = """\
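
A minimal sketch of the class-based lifecycle from Proposal 1, wrapping helium's start_chrome/kill_browser and reusing the get_chrome_options helper from the diff above; the ResearchBrowser name and its wiring into the agent's tools are illustrative assumptions:

    import logging
    from helium import kill_browser, start_chrome

    logger = logging.getLogger(__name__)

    class ResearchBrowser:
        """Owns the single headless Chrome instance used by the research tools."""

        def __init__(self, headless: bool = True):
            self._headless = headless
            self._browser = None

        def start(self):
            # Lazily create the browser the first time a tool needs it.
            if self._browser is None:
                self._browser = start_chrome(headless=self._headless,
                                             options=get_chrome_options())
            return self._browser

        def close(self):
            # Invoked by a cleanup tool, or by planner_agent once research tasks finish.
            if self._browser is not None:
                try:
                    kill_browser()
                finally:
                    self._browser = None
                    logger.info("Research browser closed")

        def __enter__(self):
            self.start()
            return self

        def __exit__(self, exc_type, exc, tb):
            self.close()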

    
    

3.6. text_analyzer_agent.py Refactoring

  • Rationale: To improve configuration management and error handling.

  • Proposals:

    1. Configuration: Move the hardcoded LLM model name (models/gemini-1.5-pro) to environment variables or a configuration file.
    2. Prompt Management: Move the analyze_text prompt to a separate template file.
    3. Error Handling: In extract_text_from_pdf, raise specific exceptions (e.g., PDFDownloadError, PDFExtractionError) instead of returning error strings, so the agent can handle failures in a structured way (a caller-side example follows the diff below).
  • Diff Patch (Illustrative - Configuration & Error Handling):

    --- a/text_analyzer_agent.py
    +++ b/text_analyzer_agent.py
    @@ -6,6 +6,14 @@

     logger = logging.getLogger(__name__)

    +class PDFExtractionError(Exception):
    +    """Custom exception for PDF extraction failures."""
    +    pass
    +
    +class PDFDownloadError(PDFExtractionError):
    +    """Custom exception for PDF download failures."""
    +    pass
    +
     def extract_text_from_pdf(source: str) -> str:
         """
         Extract raw text from a PDF file on disk or at a URL.
    @@ -19,21 +27,21 @@
         try:
             resp = requests.get(source, timeout=10)
             resp.raise_for_status()
    -    except Exception as e:
    -        return f"Error downloading PDF from {source}: {e}"
    +    except requests.exceptions.RequestException as e:
    +        raise PDFDownloadError(f"Error downloading PDF from {source}: {e}") from e
         try:
             tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
             tmp.write(resp.content)
             tmp.flush()
             tmp_path = tmp.name
             tmp.close()
    -    except Exception as e:
    -        return f"Error writing temp PDF file: {e}"
    +    except IOError as e:
    +        raise PDFExtractionError(f"Error writing temp PDF file: {e}") from e
         path = tmp_path
     else:
         path = source

     # Now extract text from the PDF on disk
     if not os.path.isfile(path):
    -    return f"PDF not found: {path}"
    +    raise PDFExtractionError(f"PDF not found: {path}")
     text = ""
    @@ -41,10 +49,10 @@
         reader = PdfReader(path)
         pages = [page.extract_text() or "" for page in reader.pages]
         text = "\n".join(pages)
    -    print(f"Extracted {len(pages)} pages of text from PDF")
    +    logger.info(f"Extracted {len(pages)} pages of text from PDF: {path}")
     except Exception as e:  # Catch specific PyPDF2 errors if possible, otherwise general Exception
    -    return f"Error reading PDF: {e}"
    +    raise PDFExtractionError(f"Error reading PDF {path}: {e}") from e

     # Clean up temporary file if one was created
     if source.lower().startswith(("http://", "https://")):
    @@ -67,6 +75,14 @@
             str: A plain-text string containing:
             • A “Summary:” section with bullet points.
             • A “Facts:” section with bullet points.
         """
    +    # Load prompt from file ideally
    +    prompt_template = """You are an expert analyst.
    +Please analyze the following text and produce a plain-text response
    +with two sections:
    +
    +Summary:
    +• Provide 2–3 concise bullet points summarizing the main ideas.
    +
    +Facts:
    +• List each verifiable fact found in the text as a bullet point.
    +
    +Respond with exactly that format—no JSON, no extra commentary.
    +
    +Text to analyze:
    +\"\"\"
    +{text}
    +\"\"\"
    +"""
         # Build the prompt to guide the LLM's output format
    -    input_prompt = f"""You are an expert analyst.
    @@ -84,13 +100,14 @@
    -    {text}
    -    """
    +    input_prompt = prompt_template.format(text=text)

         # Use the LLM to generate the analysis
    +    llm_model_name = os.getenv("TEXT_ANALYZER_LLM_MODEL", "models/gemini-1.5-pro")
         llm = GoogleGenAI(
             api_key=os.getenv("GEMINI_API_KEY"),
    -        model="models/gemini-1.5-pro",
    +        model=llm_model_name,
         )

         generated = llm.complete(input_prompt)
    @@ -124,9 +141,10 @@
         FunctionAgent: Configured analysis agent.
         """
    +    llm_model_name = os.getenv("TEXT_ANALYZER_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
         llm = GoogleGenAI(
             api_key=os.getenv("GEMINI_API_KEY"),
    -        model="models/gemini-1.5-pro",
    +        model=llm_model_name,
         )

         system_prompt = """\
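
With these custom exceptions, callers can branch on the failure mode instead of string-matching error messages. A minimal caller-side sketch (the analyze_pdf wrapper is hypothetical; it assumes the module's analyze_text helper):

    def analyze_pdf(source: str) -> str:
        """Extract and analyze a PDF, reporting each failure mode distinctly."""
        try:
            text = extract_text_from_pdf(source)
        except PDFDownloadError as e:
            # Network-level problem: worth retrying or sourcing the file elsewhere.
            return f"Could not download the PDF: {e}"
        except PDFExtractionError as e:
            # File-level problem: report rather than retry.
            return f"Could not extract text from the PDF: {e}"
        return analyze_text(text)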

    
    

3.7. reasoning_agent.py Refactoring

  • Rationale: To simplify the agent structure, improve configuration, and potentially optimize LLM usage.

  • Proposals:

    1. Configuration: Move hardcoded LLM model names (models/gemini-1.5-pro, o4-mini) and the API key environment variable name (ALPAFLOW_OPENAI_API_KEY) to configuration.
    2. Prompt Management: Move the detailed CoT prompt from reasoning_tool_fn to a separate template file.
    3. Agent Structure Simplification: Given the rigid workflow (call tool -> handoff), consider replacing the ReActAgent with a simpler FunctionAgent that directly calls the reasoning_tool and formats the output before handing off (a sketch follows the diff below). Alternatively, evaluate whether the reasoning_tool logic could become a direct LLM call inside the agents that need CoT (such as planner_agent), removing the need for a separate reasoning_agent altogether, unless its specific CoT prompt/model (o4-mini) is crucial.
  • Diff Patch (Illustrative - Configuration & Prompt Loading):

    --- a/reasoning_agent.py
    +++ b/reasoning_agent.py
    @@ -1,10 +1,19 @@
     import os
    +import logging

     from llama_index.core.agent.workflow import ReActAgent
     from llama_index.llms.google_genai import GoogleGenAI
     from llama_index.core.tools import FunctionTool
     from llama_index.llms.openai import OpenAI

    +logger = logging.getLogger(__name__)
    +
    +def load_prompt_from_file(filename="reasoning_tool_prompt.txt") -> str:
    +    try:
    +        with open(filename, "r") as f:
    +            return f.read()
    +    except FileNotFoundError:
    +        logger.error(f"Prompt file {filename} not found.")
    +        return "Perform chain-of-thought reasoning on the context: {context}"
    +
     def reasoning_tool_fn(context: str) -> str:
         """
         Perform end-to-end chain-of-thought reasoning over the full multi-agent workflow context,
    @@ -17,45 +26,12 @@
             str: A structured reasoning trace with numbered thought steps, intermediate checks,
             and a concise final recommendation or conclusion.
         """
    -    prompt = f"""You are an expert reasoning engine. You have the following full context of a multi-agent workflow:
    -
    -{context}
    -
    -Your job is to:
    -1. Comprehension
    -   - Read the entire question or problem statement carefully.
    -   - Identify key terms, constraints, and desired outcomes.
    -2. Decomposition
    -   - Break down the problem into logical sub-steps or sub-questions.
    -   - Ensure each sub-step is necessary and sufficient to progress toward a solution.
    -3. Chain-of-Thought
    -   - Articulate your internal reasoning in clear, numbered steps.
    -   - At each step, state your assumptions, derive implications, and check for consistency.
    -4. Intermediate Verification
    -   - After each reasoning step, validate your conclusion against the problem’s constraints.
    -   - If a contradiction or uncertainty arises, revisit and refine the previous step.
    -5. Synthesis
    -   - Once all sub-steps are resolved, integrate the intermediate results into a cohesive answer.
    -   - Ensure the final answer directly addresses the user’s request and all specified criteria.
    -6. Clarity & Precision
    -   - Use formal, precise language.
    -   - Avoid ambiguity: define any technical terms you introduce.
    -   - Provide just enough detail to justify each conclusion without digression.
    -7. Final Answer
    -   - Present a concise, well-structured response.
    -   - If appropriate, include a brief summary of your reasoning steps.
    -
    -Respond with your reasoning steps followed by the final recommendation.
    -"""
    +    prompt_template = load_prompt_from_file()
    +    prompt = prompt_template.format(context=context)
    +
    +    reasoning_llm_model = os.getenv("REASONING_TOOL_LLM_MODEL", "o4-mini")
    +    # Use specific API key if needed, e.g., ALPAFLOW_OPENAI_API_KEY
    +    reasoning_api_key_env = os.getenv("REASONING_TOOL_API_KEY_ENV", "ALPAFLOW_OPENAI_API_KEY")
    +    reasoning_api_key = os.getenv(reasoning_api_key_env)
         llm = OpenAI(
    -        model="o4-mini",
    -        api_key=os.getenv("ALPAFLOW_OPENAI_API_KEY"),
    +        model=reasoning_llm_model,
    +        api_key=reasoning_api_key,
             reasoning_effort="high"
         )
         response = llm.complete(prompt)
    @@ -74,9 +50,10 @@
         """
         Create a pure reasoning agent with no tools, relying solely on chain-of-thought.
         """
    +    agent_llm_model = os.getenv("REASONING_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
         llm = GoogleGenAI(
             api_key=os.getenv("GEMINI_API_KEY"),
    -        model="models/gemini-1.5-pro",
    +        model=agent_llm_model,
         )

         system_prompt = """\
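
If Proposal 3 is adopted, the simplified agent could be a FunctionAgent that exposes only the reasoning tool. A sketch, assuming the constructor arguments used elsewhere in this plan plus an AgentWorkflow-style can_handoff_to parameter (verify against the installed LlamaIndex version):

    def initialize_reasoning_agent() -> FunctionAgent:
        agent_llm_model = os.getenv("REASONING_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
        llm = GoogleGenAI(
            api_key=os.getenv("GEMINI_API_KEY"),
            model=agent_llm_model,
        )

        reasoning_tool = FunctionTool.from_defaults(
            fn=reasoning_tool_fn,
            name="reasoning_tool",
            description="Performs chain-of-thought reasoning over the provided context.",
        )

        return FunctionAgent(
            name="reasoning_agent",
            description="Pure chain-of-thought reasoning over the workflow context.",
            system_prompt=(
                "Call reasoning_tool on the provided context, then hand the result "
                "to planner_agent."
            ),
            llm=llm,
            tools=[reasoning_tool],
            can_handoff_to=["planner_agent"],  # assumes AgentWorkflow-style handoffs
        )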

    
    

3.8. planner_agent.py Refactoring

  • Rationale: To improve configuration management and prompt handling.

  • Proposals:

    1. Configuration: Move the hardcoded LLM model name (models/gemini-1.5-pro) to environment variables or a configuration file.
    2. Prompt Management: Move the system prompt and the prompts within the plan and synthesize_and_respond functions to separate template files for better readability and maintainability.
  • Diff Patch (Illustrative - Configuration & Prompt Loading):

    --- a/planner_agent.py
    +++ b/planner_agent.py
    @@ -1,10 +1,19 @@
     import os
    +import logging
     from typing import List, Any

     from llama_index.core.agent.workflow import FunctionAgent, ReActAgent
     from llama_index.core.tools import FunctionTool
     from llama_index.llms.google_genai import GoogleGenAI

    +logger = logging.getLogger(__name__)
    +
    +def load_prompt_from_file(filename: str, default_prompt: str) -> str:
    +    try:
    +        with open(filename, "r") as f:
    +            return f.read()
    +    except FileNotFoundError:
    +        logger.warning(f"Prompt file {filename} not found. Using default.")
    +        return default_prompt
    +
     def plan(objective: str) -> List[str]:
         """
         Generate a list of sub-questions from the given objective.
    @@ -15,14 +24,16 @@
         Returns:
             List[str]: A list of sub-steps as strings.
         """
    -    input_prompt: str = (
    +    default_plan_prompt = (
             "You are a research assistant. "
             "Given an objective, break it down into a list of concise, actionable sub-steps.\n"
             f"Objective: {objective}\n"
             "Sub-steps (one per line):"
         )
    +    plan_prompt_template = load_prompt_from_file("planner_plan_prompt.txt", default_plan_prompt)
    +    input_prompt = plan_prompt_template.format(objective=objective)
    +
    +    llm_model_name = os.getenv("PLANNER_TOOL_LLM_MODEL", "models/gemini-1.5-pro")
         llm = GoogleGenAI(
             api_key=os.getenv("GEMINI_API_KEY"),
    -        model="models/gemini-1.5-pro",
    +        model=llm_model_name,
         )

    @@ -44,13 +55,16 @@
         Returns:
             str: A unified, well-structured response addressing the original objective.
         """
         # Join each ready-made QA block directly
         summary_blocks = "\n".join(results)
    -    input_prompt = f"""You are an expert synthesizer. Given the following sub-questions and their answers,
    +    default_synth_prompt = f"""You are an expert synthesizer. Given the following sub-questions and their answers,
     produce a single, coherent, comprehensive report that addresses the original objective:

     {summary_blocks}

     Final Report: """
    +    synth_prompt_template = load_prompt_from_file("planner_synthesize_prompt.txt", default_synth_prompt)
    +    input_prompt = synth_prompt_template.format(summary_blocks=summary_blocks)
    +
    +    llm_model_name = os.getenv("PLANNER_TOOL_LLM_MODEL", "models/gemini-1.5-pro")  # Can use same model as plan
         llm = GoogleGenAI(
             api_key=os.getenv("GEMINI_API_KEY"),
    -        model="models/gemini-1.5-pro",
    +        model=llm_model_name,
         )
         response = llm.complete(input_prompt)
         return response.text
    @@ -77,9 +91,10 @@
         """
         Initialize a LlamaIndex agent specialized in research planning and question engineering.
         """
    +    agent_llm_model = os.getenv("PLANNER_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
         llm = GoogleGenAI(
             api_key=os.getenv("GEMINI_API_KEY"),
    -        model="models/gemini-1.5-pro",
    +        model=agent_llm_model,
         )

         system_prompt = """\
    @@ -108,6 +123,7 @@
         Completion & Synthesis
         If the final result fully completes the original objective, produce a consolidated synthesis of the roadmap and send it as your concluding output.
         """
    +    system_prompt = load_prompt_from_file("planner_system_prompt.txt", system_prompt)  # Load from file if exists

         agent = ReActAgent(
             name="planner_agent",

3.9. code_agent.py Refactoring

  • Rationale: To address the critical security vulnerability of the SimpleCodeExecutor, improve configuration management, and align code execution with safer practices.

  • Proposals:

    1. Remove SimpleCodeExecutor: This class and its execute method using subprocess with raw code strings are fundamentally insecure and must be removed entirely.
    2. Use CodeInterpreterToolSpec: Rely exclusively on the code_interpreter tool derived from LlamaIndex's CodeInterpreterToolSpec for code execution; this tool is designed for safer, sandboxed execution (a wiring sketch follows the diff below).
    3. Update CodeActAgent Initialization: Remove the code_execute_fn parameter when initializing CodeActAgent, as the agent should use the provided code_interpreter tool for execution via the standard ReAct/Act loop, not a direct execution function.
    4. Configuration: Move hardcoded LLM model names (o4-mini, models/gemini-1.5-pro) and the API key environment variable name (ALPAFLOW_OPENAI_API_KEY) to configuration.
    5. Prompt Management: Move the generate_python_code prompt to a separate template file.
  • Diff Patch (Illustrative - Security Fix & Configuration):

    --- a/code_agent.py
    +++ b/code_agent.py
    @@ -1,5 +1,6 @@
     import os
     import subprocess
    +import logging

     from llama_index.core.agent.workflow import ReActAgent, CodeActAgent
     from llama_index.core.tools import FunctionTool
    @@ -7,6 +8,16 @@
     from llama_index.llms.openai import OpenAI
     from llama_index.tools.code_interpreter import CodeInterpreterToolSpec

    +logger = logging.getLogger(__name__)
    +
    +def load_prompt_from_file(filename: str, default_prompt: str) -> str:
    +    try:
    +        with open(filename, "r") as f:
    +            return f.read()
    +    except FileNotFoundError:
    +        logger.warning(f"Prompt file {filename} not found. Using default.")
    +        return default_prompt
    +
     def generate_python_code(prompt: str) -> str:
         """
         Generate valid Python code from a natural language description.
    @@ -27,7 +38,7 @@
         it before execution.
         - This function only generates code and does not execute it.
         """
    -    input_prompt = f"""You are also a helpful assistant that writes Python code.
    +    default_gen_prompt = f"""You are also a helpful assistant that writes Python code.
     You will be given a prompt and you must generate Python code based on that prompt.
     You must only generate Python code and nothing else.
     Do not include any explanations or any other text.
    @@ -40,10 +51,14 @@
     Code:\n
     """
    +    gen_prompt_template = load_prompt_from_file("code_gen_prompt.txt", default_gen_prompt)
    +    input_prompt = gen_prompt_template.format(prompt=prompt)
    +
    +    gen_llm_model = os.getenv("CODE_GEN_LLM_MODEL", "o4-mini")
    +    gen_api_key_env = os.getenv("CODE_GEN_API_KEY_ENV", "ALPAFLOW_OPENAI_API_KEY")
    +    gen_api_key = os.getenv(gen_api_key_env)
         llm = OpenAI(
    -        model="o4-mini",
    -        api_key=os.getenv("ALPAFLOW_OPENAI_API_KEY")
    +        model=gen_llm_model,
    +        api_key=gen_api_key
         )

         generated_code = llm.complete(input_prompt)
    @@ -74,60 +89,11 @@
         ),
     )

    -from typing import Any, Dict, Tuple
    -import io
    -import contextlib
    -import ast
    -import traceback
    -
    -class SimpleCodeExecutor:
    -    """
    -    A simple code executor that runs Python code with state persistence.
    -    This executor maintains a global and local state between executions,
    -    allowing for variables to persist across multiple code runs.
    -    NOTE: not safe for production use! Use with caution.
    -    """
    -    def __init__(self):
    -        pass
    -
    -    def execute(self, code: str) -> str:
    -        """
    -        Execute Python code and capture output and return values.
    -
    -        Args:
    -            code: Python code to execute
    -
    -        Returns:
    -            Dict with keys `success`, `output`, and `return_value`
    -        """
    -        print(f"Executing code: {code}")
    -        try:
    -            result = subprocess.run(
    -                ["python", code],
    -                stdout=subprocess.PIPE,
    -                stderr=subprocess.PIPE,
    -                text=True,
    -                timeout=60
    -            )
    -            if result.returncode != 0:
    -                print(f"Execution failed with error: {result.stderr.strip()}")
    -                return f"Error: {result.stderr.strip()}"
    -            else:
    -                output = result.stdout.strip()
    -                print(f"Captured Output: {output}")
    -                return output
    -        except subprocess.TimeoutExpired:
    -            print("Execution timed out.")
    -            return "Error: Timeout"
    -        except Exception as e:
    -            print(f"Execution failed with error: {e}")
    -            return f"Error: {e}"
    -
     def initialize_code_agent() -> CodeActAgent:
    -    code_executor = SimpleCodeExecutor()
    +    # DO NOT USE SimpleCodeExecutor - it is insecure.
    +    # Rely on the code_interpreter tool provided below.
    +    agent_llm_model = os.getenv("CODE_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
         llm = GoogleGenAI(
             api_key=os.getenv("GEMINI_API_KEY"),
    -        model="models/gemini-1.5-pro",
    +        model=agent_llm_model,
         )

         system_prompt = """\
    @@ -151,6 +117,7 @@
         - If further logical reasoning or verification is needed, delegate to reasoning_agent.
         - Otherwise, once you have the final code or execution result, pass your output to planner_agent for overall synthesis and presentation.
         """
    +    system_prompt = load_prompt_from_file("code_agent_system_prompt.txt", system_prompt)

         agent = CodeActAgent(
             name="code_agent",
    @@ -161,7 +128,7 @@
                 "pipelines, and library development, CodeAgent delivers production-ready Python solutions."
             ),
    -        code_execute_fn=code_executor.execute,
    +        # REMOVED: code_execute_fn=code_executor.execute,  # Use code_interpreter tool instead
             tools=[
                 python_code_generator_tool,
                 code_interpreter_tool,
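
For Proposal 2, the interpreter tool comes straight from the tool spec the module already imports. A minimal wiring sketch (whether CodeActAgent also requires its own execution hook depends on the installed LlamaIndex version, so verify against its current API):

    from llama_index.tools.code_interpreter import CodeInterpreterToolSpec

    # Expose the interpreter as standard FunctionTools for the agent's tool list.
    code_interpreter_tools = CodeInterpreterToolSpec().to_tool_list()

    # Pass these (together with python_code_generator_tool) via `tools=` when
    # building the agent, instead of wiring a raw code_execute_fn.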
    
    
    

3.10. math_agent.py Refactoring

  • Rationale: To improve configuration management and potentially simplify the tool interface for the LLM.

  • Proposals:

    1. Configuration: Move the hardcoded agent LLM model name (models/gemini-1.5-pro) to configuration. Ensure the WolframAlpha App ID is configured via environment variable (WOLFRAM_ALPHA_APP_ID) as intended.
    2. Tool Granularity: The current approach creates a separate tool for almost every single math function (solve, derivative, integral, add, multiply, inverse, mean, median, etc.). While explicit, this results in a very large number of tools for the ReActAgent to manage. Consider:
      • Grouping: Group related functions under fewer tools, for example a symbolic_math_tool that takes the operation type (solve, diff, integrate) as a parameter, or a matrix_ops_tool (a sketch of this grouping follows the diff below).
      • Natural Language Interface: Create a single calculate tool that takes a natural language math query (e.g., "solve x**2 - 4 = 0 for x", "mean of [1, 2, 3]") and uses an LLM (or rule-based parsing) internally to dispatch to the appropriate NumPy/SciPy/SymPy function. This simplifies the interface for the main agent LLM but adds complexity within the tool.
      • WolframAlpha Prioritization: Evaluate if WolframAlpha can handle many of these requests directly, potentially reducing the need for numerous specific SymPy/NumPy tools, especially for symbolic tasks.
    3. Truncated File: Since the original file was truncated, ensure the full file is reviewed if possible, as there might be other issues or tools not seen.
  • Diff Patch (Illustrative - Configuration):

    --- a/math_agent.py
    +++ b/math_agent.py
    @@ -1,5 +1,6 @@
     import os
     from typing import List, Optional, Union
    +import logging
     import sympy as sp
     import numpy as np
     from llama_index.core.agent.workflow import ReActAgent
    @@ -12,6 +13,8 @@
     from scipy.integrate import odeint
     import numpy.fft as fft

    +logger = logging.getLogger(__name__)
    +
     # --- Symbolic math functions ---

    @@ -451,10 +454,11 @@

     def initialize_math_agent() -> ReActAgent:
    +    agent_llm_model = os.getenv("MATH_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
         llm = GoogleGenAI(
             api_key=os.getenv("GEMINI_API_KEY"),
    -        model="models/gemini-1.5-pro",
    +        model=agent_llm_model,
         )

         # Ensure WolframAlpha App ID is set
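
To illustrate the grouping idea from Proposal 2, a single symbolic tool could dispatch on an operation parameter. A hypothetical sketch using SymPy (the name and signature are illustrative):

    import sympy as sp

    def symbolic_math(operation: str, expression: str, symbol: str = "x") -> str:
        """Grouped symbolic tool; operation is one of 'solve', 'diff', 'integrate'."""
        x = sp.Symbol(symbol)
        expr = sp.sympify(expression)
        if operation == "solve":
            result = sp.solve(expr, x)
        elif operation == "diff":
            result = sp.diff(expr, x)
        elif operation == "integrate":
            result = sp.integrate(expr, x)
        else:
            return f"Unsupported operation: {operation}"
        return str(result)

    # Example: symbolic_math("solve", "x**2 - 4") -> "[-2, 2]"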

    
    

(Refactoring proposals section complete)

4. New Feature Designs

This section outlines the design for the new features requested: YouTube Ingestion and Generic Audio Transcription.

4.1. YouTube Ingestion

  • Rationale: To enable the framework to process YouTube videos by extracting audio, transcribing it, and summarizing the content, as requested by the user.
  • Design Proposal:
    • Implementation: Introduce a new dedicated agent, youtube_agent, or add tools to the existing research_agent or text_analyzer_agent. A dedicated agent seems cleaner given the specific multi-step workflow.
    • Agent (youtube_agent):
      • Purpose: Manages the end-to-end process of downloading YouTube audio, chunking, transcribing, and summarizing.
      • Tools:
        1. download_youtube_audio: Takes a YouTube URL, uses a library like yt-dlp (or potentially pytube) to download the audio stream into a temporary file (e.g., .mp3 or .opus). Returns the path to the audio file.
        2. chunk_audio_file: Takes an audio file path and a maximum chunk duration (e.g., 60 seconds). Uses a library like pydub or librosa+soundfile to split the audio into smaller, sequentially numbered temporary files. Returns a list of chunk file paths (a sketch of tools 1 and 2 follows this list).
        3. transcribe_audio_chunk_gemini: Takes an audio file path (representing a chunk). Uses the Google Generative AI SDK (google.generativeai) to call the Gemini 1.5 Pro model with the audio file for transcription. Returns the transcribed text.
        4. summarize_transcript: Takes the full concatenated transcript text. Uses a Gemini model (e.g., 1.5 Pro or Flash) with a specific prompt to generate a one-paragraph summary. Returns the summary text.
      • Workflow (ReAct or Function sequence):
        1. Receive YouTube URL.
        2. Call download_youtube_audio.
        3. Call chunk_audio_file with the downloaded audio path.
        4. Iterate through the list of chunk paths:
          • Call transcribe_audio_chunk_gemini for each chunk.
          • Collect transcribed text segments.
        5. Concatenate all transcribed text segments into a full transcript.
        6. Call summarize_transcript with the full transcript.
        7. Return the full transcript and the summary.
        8. Clean up temporary audio files (downloaded and chunks).
      • Handoff: Could hand off the transcript and summary to planner_agent or text_analyzer_agent for further processing or integration.
    • Dependencies: yt-dlp, pydub (requires ffmpeg or libav), google-generativeai.
    • Configuration: Gemini API Key, chunk duration.
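
A minimal sketch of tools 1 and 2, assuming yt-dlp's Python API and pydub (output paths, option values, and the 60-second default are illustrative):

    import os
    import tempfile

    import yt_dlp
    from pydub import AudioSegment  # requires ffmpeg or libav

    def download_youtube_audio(url: str) -> str:
        """Download the audio stream of a YouTube video and return its local path."""
        out_dir = tempfile.mkdtemp(prefix="yt_audio_")
        ydl_opts = {
            "format": "bestaudio/best",
            "outtmpl": os.path.join(out_dir, "%(id)s.%(ext)s"),
            "quiet": True,
        }
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(url, download=True)
            return ydl.prepare_filename(info)

    def chunk_audio_file(audio_path: str, max_seconds: int = 60) -> list[str]:
        """Split audio into sequentially numbered chunks of at most max_seconds."""
        audio = AudioSegment.from_file(audio_path)
        chunk_ms = max_seconds * 1000
        chunk_paths = []
        for i, start in enumerate(range(0, len(audio), chunk_ms)):
            chunk = audio[start:start + chunk_ms]
            path = f"{audio_path}.chunk{i:03d}.mp3"
            chunk.export(path, format="mp3")
            chunk_paths.append(path)
        return chunk_paths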

4.2. Generic Audio Transcription

  • Rationale: To provide a flexible audio transcription capability for local files or remote URLs, preferring Gemini Pro for quality when its latency is acceptable and falling back to Whisper.cpp otherwise, exposed via a Python API as requested.
  • Design Proposal:
    • Implementation: Introduce a new dedicated agent, transcription_agent, or add tools to text_analyzer_agent. A dedicated agent allows for clearer separation of concerns, especially managing the Whisper.cpp dependency and logic.
    • Agent (transcription_agent):
      • Purpose: Transcribes audio from various sources (local path, URL) using either Gemini or Whisper.cpp based on latency requirements or availability.
      • Tools:
        1. prepare_audio_source: Takes a source string (URL or local path). If it's a URL, downloads it to a temporary file using requests. Validates the local file path. Returns the path to the local audio file.
        2. transcribe_gemini: Takes an audio file path. Uses the google-generativeai SDK to call Gemini 1.5 Pro for transcription. Returns the transcribed text. This is the preferred method when latency is acceptable.
        3. transcribe_whisper_cpp: Takes an audio file path. Uses a Python wrapper around whisper.cpp (e.g., installing whisper.cpp via apt or compiling from source, then using subprocess or a dedicated Python binding if available) to perform local transcription. Returns the transcribed text. This is the fallback or low-latency option.
        4. choose_transcription_method: (Internal logic or a simple tool) Takes a latency preference (e.g., 'high_quality' vs 'low_latency') or checks Gemini availability/quota, and decides whether to use transcribe_gemini or transcribe_whisper_cpp (a fallback sketch follows this section).
      • Workflow (ReAct or Function sequence):
        1. Receive audio source (URL/path) and potentially a latency preference.
        2. Call prepare_audio_source to get a local file path.
        3. Call choose_transcription_method (or execute internal logic) to decide between Gemini and Whisper.
        4. If Gemini: Call transcribe_gemini.
        5. If Whisper: Call transcribe_whisper_cpp.
        6. Return the resulting transcript.
        7. Clean up temporary downloaded audio file if applicable.
      • Handoff: Could hand off the transcript to planner_agent or text_analyzer_agent.
    • Python API:
      • Define a simple Python function (e.g., in a transcription_api.py module) that encapsulates the agent's logic or directly calls the underlying transcription functions.
      # Example API function in transcription_api.py
      import logging
      
      from .transcription_agent import transcribe_audio  # Assuming agent logic is refactored
      
      logger = logging.getLogger(__name__)
      
      class TranscriptionError(Exception):
          """Raised when transcription fails for any backend."""
      
      def get_transcript(source: str, prefer_gemini: bool = True) -> str:
          """Transcribes audio from a local path or URL.
      
          Args:
              source: Path to the local audio file or URL.
              prefer_gemini: If True, attempts to use Gemini Pro first.
                             If False or Gemini fails, falls back to Whisper.cpp.
      
          Returns:
              The transcribed text.
      
          Raises:
              TranscriptionError: If transcription fails.
          """
          try:
              # Simplified logic - the actual implementation selects Gemini or
              # Whisper.cpp based on preference and availability.
              return transcribe_audio(source, prefer_gemini)
          except Exception as e:
              logger.error(f"Transcription failed for {source}: {e}", exc_info=True)
              raise TranscriptionError(f"Failed to transcribe {source}: {e}") from e
      
    • Dependencies: requests, google-generativeai, whisper.cpp (requires separate installation/compilation), potentially Python bindings for whisper.cpp.
    • Configuration: Gemini API Key, path to whisper.cpp executable or library, Whisper model selection.
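
A sketch of the Gemini-first, Whisper-fallback path, assuming the google-generativeai file-upload API and a locally built whisper.cpp CLI (the binary and model paths, environment variable names, and helper signatures are assumptions to confirm during implementation):

    import os
    import subprocess

    import google.generativeai as genai

    def transcribe_gemini(audio_path: str) -> str:
        """Transcribe an audio file with Gemini 1.5 Pro via the file-upload API."""
        genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
        model = genai.GenerativeModel("gemini-1.5-pro")
        audio_file = genai.upload_file(audio_path)
        response = model.generate_content(["Transcribe this audio verbatim.", audio_file])
        return response.text

    def transcribe_whisper_cpp(audio_path: str) -> str:
        """Transcribe locally with a compiled whisper.cpp binary (paths via env vars)."""
        binary = os.getenv("WHISPER_CPP_BIN", "./main")
        model_path = os.getenv("WHISPER_CPP_MODEL", "models/ggml-base.en.bin")
        result = subprocess.run(
            [binary, "-m", model_path, "-f", audio_path, "--no-timestamps"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    def transcribe_audio(source_path: str, prefer_gemini: bool = True) -> str:
        """Use Gemini when preferred; fall back to whisper.cpp on any failure."""
        if prefer_gemini:
            try:
                return transcribe_gemini(source_path)
            except Exception:
                pass  # e.g. quota or network error; fall through to the local path
        return transcribe_whisper_cpp(source_path)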

5. Extra Agent Designs

This section proposes three additional specialized agents designed to enhance performance on the GAIA benchmark by addressing common challenges like complex fact verification, interpreting visual data representations, and handling long contexts.

5.1. Agent Design 1: Advanced Validation Agent (validation_agent)

  • Purpose: To perform rigorous validation of factual claims or intermediate results generated by other agents, going beyond the simple contradiction check of the current verifier_agent. This agent aims to improve the accuracy and trustworthiness of the final answer by cross-referencing information and performing checks.
  • Key Tool Calls:
    • web_search (from research_agent or similar): To find external evidence supporting or refuting a claim.
    • browse_and_extract (from research_agent or similar): To access specific URLs found during search and extract relevant text snippets.
    • code_interpreter (from code_agent): To perform calculations or simple data manipulations needed for verification (e.g., checking unit conversions, calculating percentages).
    • knowledge_base_lookup (New Tool - Optional): Interface with a structured knowledge base (e.g., Wikidata, internal DB) to verify entities, relationships, or properties.
    • llm_check_consistency (New Tool or LLM call): Use a powerful LLM with a specific prompt to assess the logical consistency between a claim and a set of provided evidence snippets or existing context (a sketch follows the agent loop below).
  • Agent Loop Sketch (ReAct style):
    1. Input: A specific claim or statement to validate, along with relevant context or source information.
    2. Thought: Identify the core assertion in the claim. Determine the best validation strategy (e.g., web search for current events, calculation for numerical claims, consistency check for logical statements).
    3. Action: Call the appropriate tool (web_search, code_interpreter, llm_check_consistency).
    4. Observation: Analyze the tool's output (search results, calculation result, consistency assessment).
    5. Thought: Does the observation confirm, refute, or remain inconclusive about the claim? Is more information needed? (e.g., need to browse a specific search result).
    6. Action (if needed): Call another tool (browse_and_extract, llm_check_consistency with new evidence).
    7. Observation: Analyze new output.
    8. Thought: Synthesize findings. Assign a final validation status (e.g., Confirmed, Refuted, Uncertain) and provide supporting evidence or reasoning.
    9. Output: Validation status and justification.
    10. Handoff: Return result to planner_agent or verifier_agent (if this agent replaces the contradiction part).
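
A minimal sketch of the llm_check_consistency tool, reusing the GoogleGenAI client used elsewhere in this plan (the prompt wording, environment variable, and three-way status vocabulary are illustrative):

    import os

    from llama_index.llms.google_genai import GoogleGenAI

    def llm_check_consistency(claim: str, evidence: list[str]) -> str:
        """Return 'Confirmed', 'Refuted', or 'Uncertain' plus a one-sentence justification."""
        evidence_block = "\n".join(f"- {snippet}" for snippet in evidence)
        prompt = (
            "You are a strict fact-checking assistant.\n"
            f"Claim: {claim}\n"
            f"Evidence:\n{evidence_block}\n\n"
            "Answer with exactly one of: Confirmed, Refuted, Uncertain, "
            "followed by a one-sentence justification."
        )
        llm = GoogleGenAI(
            api_key=os.getenv("GEMINI_API_KEY"),
            model=os.getenv("VALIDATION_AGENT_LLM_MODEL", "models/gemini-1.5-pro"),
        )
        return llm.complete(prompt).text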

5.2. Agent Design 2: Figure Interpretation Agent (figure_interpretation_agent)

  • Purpose: To specialize in extracting structured data and meaning from figures, charts, graphs, and tables embedded within images or documents, which are common in GAIA tasks and often require more than just a textual description.
  • Key Tool Calls:
    • image_ocr (New Tool or enhanced image_analyzer_agent capability): High-precision OCR focused on extracting text specifically from figures, including axes labels, legends, titles, and data points.
    • chart_data_extractor (New Tool): Utilizes specialized vision models (e.g., DePlot, ChartOCR, or similar fine-tuned models) designed to parse chart types (bar, line, pie) and extract underlying data series or key values.
    • table_parser (New Tool): Uses vision or document AI models to detect table structures in images/PDFs and extract cell content into a structured format (e.g., list of lists, Pandas DataFrame via code execution).
    • code_interpreter (from code_agent): To process extracted data (e.g., load into DataFrame, perform simple analysis, re-plot for verification).
    • llm_interpret_figure (New Tool or LLM call): Takes extracted text, data, and potentially the image itself (multimodal) to provide a semantic interpretation of the figure's message or trends.
  • Agent Loop Sketch (Function sequence or ReAct):
    1. Input: An image or document page containing a figure/table, potentially with context or a specific question about it.
    2. Action: Call image_ocr to get all text elements.
    3. Action: Call chart_data_extractor or table_parser based on visual analysis (or try both) to get structured data.
    4. Action (Optional): Call code_interpreter to load structured data into a DataFrame for easier handling.
    5. Action: Call llm_interpret_figure, providing the extracted text, data (raw or DataFrame), and potentially the original image, asking it to answer the specific question or summarize the figure's key insights.
    6. Output: Structured data (if requested) and/or the semantic interpretation/answer.
    7. Handoff: Return results to planner_agent or reasoning_agent.

5.3. Agent Design 3: Long Context Management Agent (long_context_agent)

  • Purpose: To effectively manage and query information from very long documents or conversation histories that exceed the context window limits of standard models or require efficient information retrieval techniques.
  • Key Tool Calls:
    • document_chunker (New Tool): Splits long text into semantically meaningful chunks (e.g., using SentenceSplitter from LlamaIndex or more advanced methods).
    • vector_store_builder (New Tool): Takes text chunks and builds an in-memory or persistent vector index (using libraries like llama-index, langchain, faiss, chromadb).
    • vector_retriever (New Tool): Queries the built vector index with a specific question to find the most relevant chunks.
    • summarizer_tool (New Tool or LLM call): Generates summaries of long text or selected chunks, potentially using different levels of detail.
    • contextual_synthesizer (New Tool or LLM call): Takes retrieved relevant chunks and the original query, then uses an LLM to synthesize an answer grounded in the retrieved context (RAG pattern; a LlamaIndex sketch follows the agent loop below).
  • Agent Loop Sketch (Can be stateful):
    1. Input: A long document (text or path) or a long conversation history, and a specific query or task related to it.
    2. (Initialization/First Use):
      • Action: Call document_chunker.
      • Action: Call vector_store_builder to create an index from the chunks. Store the index reference.
    3. (Querying):
      • Action: Call vector_retriever with the user's query to get relevant chunks.
      • Action: Call contextual_synthesizer, providing the query and retrieved chunks, to generate the final answer.
    4. (Alternative: Summarization Task):
      • Action: Call summarizer_tool on the full text (if feasible for the tool) or on retrieved chunks based on a high-level query.
    5. Output: The synthesized answer or the summary.
    6. Handoff: Return results to planner_agent.
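
A sketch of the chunk-index-retrieve-synthesize path using LlamaIndex primitives (an in-memory version; it assumes recent llama-index core imports and an embedding model already configured via Settings):

    from llama_index.core import Document, VectorStoreIndex
    from llama_index.core.node_parser import SentenceSplitter

    def answer_from_long_document(text: str, query: str, chunk_size: int = 1024) -> str:
        """Chunk the document, build an in-memory vector index, and answer via RAG."""
        # document_chunker: split into semantically sized chunks.
        splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=64)
        nodes = splitter.get_nodes_from_documents([Document(text=text)])

        # vector_store_builder: embed and index the chunks (in memory here).
        index = VectorStoreIndex(nodes)

        # vector_retriever + contextual_synthesizer: retrieve top chunks and synthesize.
        query_engine = index.as_query_engine(similarity_top_k=5)
        return str(query_engine.query(query))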

6. Migration Plan

This section details the recommended steps for applying the proposed changes, lists new dependencies, and outlines minimal validation tests.

6.1. Order of Implementation

It is recommended to apply changes in the following order to minimize disruption and build upon stable foundations:

  1. Core Refactoring (app.py, Configuration, Logging):
    • Implement centralized configuration (e.g., a .env file) and update all agents to use it for API keys, model names, etc. (a minimal bootstrap sketch follows this list).
    • Integrate Python's logging module throughout app.py and all agent files, replacing print statements.
    • Refactor app.py: Implement singleton agent initialization and break down run_and_submit_all.
    • Apply structural refactors to agents (class-based structure, avoiding globals) like role_agent, verifier_agent, research_agent.
  2. Critical Security Fix (code_agent):
    • Immediately remove the SimpleCodeExecutor and modify code_agent to rely solely on the code_interpreter tool.
  3. Core Functionality Refactoring (verifier_agent, math_agent):
    • Improve verifier_agent's contradiction detection (e.g., using an LLM or NLI model).
    • Refactor math_agent tools if choosing to group them or use a natural language interface.
  4. New Feature: Generic Audio Transcription (transcription_agent):
    • Install whisper.cpp and its dependencies.
    • Implement the transcription_agent and its tools (prepare_audio_source, transcribe_gemini, transcribe_whisper_cpp).
    • Implement the Python API function get_transcript.
  5. New Feature: YouTube Ingestion (youtube_agent):
    • Install yt-dlp and pydub (and ffmpeg).
    • Implement the youtube_agent and its tools (download_youtube_audio, chunk_audio_file, transcribe_audio_chunk_gemini, summarize_transcript).
  6. New Agent Implementation (Validation, Figure, Long Context):
    • Implement validation_agent and its tools.
    • Implement figure_interpretation_agent and its tools (requires sourcing/installing chart/table parsing models/libraries).
    • Implement long_context_agent and its tools (requires vector DB setup like faiss or chromadb).
  7. Integration and Workflow Adjustments:
    • Update planner_agent's system prompt and handoff logic to incorporate the new agents.
    • Update other agents' handoff targets as needed.
    • Update app.py if the overall agent initialization or workflow invocation changes.
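
For step 1, a minimal bootstrap the agents could share (a sketch; the config.py module name and variable names are illustrative):

    # config.py - shared configuration and logging bootstrap
    import logging
    import os

    from dotenv import load_dotenv

    load_dotenv()  # read API keys and model names from a local .env file

    logging.basicConfig(
        level=os.getenv("LOG_LEVEL", "INFO"),
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )

    GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
    PLANNER_AGENT_LLM_MODEL = os.getenv("PLANNER_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
    # ...one entry per agent/tool model, replacing the hardcoded names.

Agents would then import these constants instead of calling os.getenv with hardcoded defaults scattered across modules.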

6.2. New Dependencies (requirements.txt)

Based on the refactoring and new features, the following dependencies might need to be added or updated in requirements.txt (or managed via environment setup):

  • python-dotenv: For loading configuration from .env files.
  • google-generativeai: For interacting with Gemini models (already likely present via llama-index-llms-google-genai).
  • yt-dlp: For downloading YouTube videos.
  • pydub: For audio manipulation (chunking). Requires ffmpeg or libav system dependency.
  • llama-index-vector-stores-faiss / faiss-cpu / faiss-gpu: For long_context_agent vector store (choose one).
  • chromadb / llama-index-vector-stores-chroma: Alternative vector store for long_context_agent.
  • llama-index-multi-modal-llms-google: Ensure multimodal support for Gemini is correctly installed.
  • Possibly: Libraries for NLI models (e.g., transformers, torch) if used in validation_agent.
  • Possibly: Libraries for chart/table parsing (e.g., specific models from Hugging Face, opencv-python, pdf2image) if implementing figure_interpretation_agent tools.
  • Possibly: Python bindings for whisper.cpp if not using subprocess.

System Dependencies:

  • ffmpeg or libav: Required by pydub.
  • whisper.cpp: Needs to be compiled or installed separately. Follow its specific instructions.

6.3. Validation Tests

Minimal tests should be implemented to validate key changes (an illustrative pytest sketch follows this list):

  1. Configuration: Test loading of API keys and model names from the configuration source.
  2. Logging: Verify that logs are being generated at the correct levels and formats.
  3. code_agent Security: Test that code_agent uses code_interpreter and not the removed SimpleCodeExecutor. Attempt a malicious code execution via prompt to ensure it fails safely within the interpreter's sandbox.
  4. verifier_agent Contradiction: Test the improved contradiction detection with sample pairs of contradictory and non-contradictory statements.
  5. transcription_agent:
    • Test with a short local audio file using both Gemini and Whisper.cpp, comparing output quality/speed.
    • Test with an audio URL.
    • Test the Python API function get_transcript.
  6. youtube_agent:
    • Test with a short YouTube video URL.
    • Verify audio download, chunking, transcription of chunks, and final summary generation.
    • Check cleanup of temporary files.
  7. New Agents (Basic):
    • For validation_agent, figure_interpretation_agent, long_context_agent, implement basic tests confirming agent initialization and successful calls to their primary new tools with mock inputs/outputs.
  8. End-to-End Smoke Test: Run app.py and process one or two simple GAIA tasks that are likely to invoke the refactored components and potentially a new feature (if a relevant task exists) to ensure the overall workflow remains functional.
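
An illustrative pytest sketch for tests 1 and 5 (the config and transcription_api module paths follow the naming used in this plan; adapt to the final layout):

    # test_smoke.py - minimal validation tests (sketch)
    import importlib

    import pytest

    def test_configuration_loads_model_names(monkeypatch):
        monkeypatch.setenv("PLANNER_AGENT_LLM_MODEL", "models/test-model")
        import config
        importlib.reload(config)  # pick up the patched environment
        assert config.PLANNER_AGENT_LLM_MODEL == "models/test-model"

    def test_get_transcript_wraps_failures(monkeypatch):
        from transcription_api import TranscriptionError, get_transcript

        def boom(source, prefer_gemini):
            raise RuntimeError("backend unavailable")

        monkeypatch.setattr("transcription_api.transcribe_audio", boom)
        with pytest.raises(TranscriptionError):
            get_transcript("sample.mp3")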

(Implementation plan complete. Ready for user confirmation.)