3.5. `research_agent.py` Refactoring
Rationale: To improve browser instance management, error handling, and configuration.
Proposals:
- Browser Lifecycle Management: Instead of initializing the browser (`start_chrome`) at the module level, manage its lifecycle explicitly (a minimal sketch follows this list). Options:
  - Initialize the browser within the agent's initialization and provide a method or tool to explicitly close it (`kill_browser`) when the agent's task is done or the application shuts down.
  - Use a context manager (`with start_chrome(...) as browser:`) if the browser is only needed for a specific scope within a tool call (less likely for a persistent agent).
  - Ensure `kill_browser` is reliably called. Perhaps the `planner_agent` could invoke a cleanup tool/method on the `research_agent` after its tasks are complete.
- Configuration: Move hardcoded Chrome options to configuration. Externalize API keys/IDs if not already done (they seem to be using `os.getenv`, which is good).
- Robust Error Handling: For browser interaction tools (`visit`, `get_text_by_css`, `click_element`), raise specific custom exceptions instead of returning error strings. This allows for more structured error handling by the agent or workflow.
- Tool Consolidation (Optional): The agent has many tools. Consider whether some related tools (e.g., different search APIs) could be consolidated behind a single tool that internally chooses the best source, or whether the LLM handles the large toolset effectively.
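Before the illustrative diff below, here is a minimal sketch of the first option (the agent owning the browser lifecycle explicitly), assuming helium's `start_chrome`/`kill_browser` API; the `ResearchBrowserSession` name and structure are illustrative, not part of the existing code:

```python
# Hypothetical wrapper illustrating explicit browser lifecycle management.
from helium import start_chrome, kill_browser
from selenium import webdriver


class ResearchBrowserSession:
    """Owns the Chrome instance for the research agent's lifetime."""

    def __init__(self) -> None:
        self._browser = None

    def start(self, headless: bool = True) -> None:
        if self._browser is None:
            options = webdriver.ChromeOptions()
            options.add_argument("--no-sandbox")
            self._browser = start_chrome(headless=headless, options=options)

    def close(self) -> None:
        # Called by a cleanup tool, or by planner_agent once research is done.
        if self._browser is not None:
            kill_browser()
            self._browser = None

    def __enter__(self) -> "ResearchBrowserSession":
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


# Usage: scope the browser to a block of tool calls instead of module import time.
# with ResearchBrowserSession() as session:
#     ...  # call visit(), get_text_by_css(), etc.
```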
Diff Patch (Illustrative - Configuration & Browser Init):
```diff
--- a/research_agent.py
+++ b/research_agent.py
@@ -1,5 +1,6 @@
 import os
 import time
+import logging
 from typing import List
 
 from llama_index.core.agent.workflow import ReActAgent
@@ -15,17 +16,21 @@
 from helium import start_chrome, go_to, find_all, Text, kill_browser
 from helium import get_driver
 
+logger = logging.getLogger(__name__)
+
 # 1. Helium
-chrome_options = webdriver.ChromeOptions()
-chrome_options.add_argument("--no-sandbox")
-chrome_options.add_argument("--disable-dev-shm-usage")
-chrome_options.add_experimental_option("prefs", {
-    "download.prompt_for_download": False,
-    "plugins.always_open_pdf_externally": True,
-    "profile.default_content_settings.popups": 0
-})
-browser = start_chrome(headless=True, options=chrome_options)
+# Browser instance should be managed, not global at module level
+# browser = start_chrome(headless=True, options=chrome_options)
+
+def get_chrome_options():
+    options = webdriver.ChromeOptions()
+    if os.getenv("RESEARCH_AGENT_CHROME_NO_SANDBOX", "true").lower() == "true":
+        options.add_argument("--no-sandbox")
+    if os.getenv("RESEARCH_AGENT_CHROME_DISABLE_DEV_SHM", "true").lower() == "true":
+        options.add_argument("--disable-dev-shm-usage")
+    # Add other options from config as needed
+    # options.add_experimental_option(...)  # Example
+    return options
 
 def visit(url: str, wait_seconds: float = 2.0) -> str | None:
     """
@@ -36,10 +41,11 @@
     wait_seconds (float): Time to wait after navigation.
     """
     try:
+        # Assumes browser is available in context (e.g., class member)
         go_to(url)
         time.sleep(wait_seconds)
         return f"Visited: {url}"
     except Exception as e:
+        logger.error(f"Error visiting {url}: {e}", exc_info=True)
         return f"Error visiting {url}: {e}"
 
 def get_text_by_css(selector: str) -> List[str] | str:
@@ -52,13 +58,15 @@
     List[str]: List of text contents.
     """
     try:
+        # Assumes browser/helium context is active
         if selector.lower() == 'body':
             elements = find_all(Text())
         else:
             elements = find_all(selector)
         texts = [elem.web_element.text for elem in elements]
-        print(f"Extracted {len(texts)} elements for selector '{selector}'")
+        logger.info(f"Extracted {len(texts)} elements for selector '{selector}'")
         return texts
     except Exception as e:
+        logger.error(f"Error extracting text for selector {selector}: {e}", exc_info=True)
         return f"Error extracting text for selector {selector}: {e}"
 
 def get_page_html() -> str:
@@ -70,9 +78,11 @@
     str: HTML content, or empty string on error.
     """
     try:
+        # Assumes browser/helium context is active
         driver = get_driver()
         html = driver.page_source
         return html
     except Exception as e:
+        logger.error(f"Error extracting HTML: {e}", exc_info=True)
         return f"Error extracting HTML: {e}"
 
 def click_element(selector: str, index_element: int = 0) -> str:
@@ -83,10 +93,12 @@
     selector (str): CSS selector of the element to click.
     """
     try:
+        # Assumes browser/helium context is active
         element = find_all(selector)[index_element]
         element.click()
         time.sleep(1)
         return f"Clicked element matching selector '{selector}'"
     except Exception as e:
+        logger.error(f"Error clicking element {selector}: {e}", exc_info=True)
         return f"Error clicking element {selector}: {e}"
 
 def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
@@ -97,6 +109,7 @@
     nth_result: Which occurrence to jump to (default: 1)
     """
+    # Assumes browser is available in context
     elements = browser.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
     if nth_result > len(elements):
         return f"Match n°{nth_result} not found (only {len(elements)} matches found)"
     result = f"Found {len(elements)} matches for '{text}'."
@@ -107,19 +120,22 @@
 def go_back() -> None:
     """Goes back to previous page."""
     browser.back()
 
 def close_popups() -> None:
     """
     Closes any visible modal or pop-up on the page.
     Use this to dismiss pop-up windows! This does not work on cookie consent banners.
     """
     webdriver.ActionChains(browser).send_keys(Keys.ESCAPE).perform()
 
 def close() -> None:
     """
     Close the browser instance.
     """
     try:
+        # Assumes kill_browser is appropriate here
         kill_browser()
-        print("Browser closed")
+        logger.info("Browser closed via kill_browser()")
     except Exception as e:
-        print(f"Error closing browser: {e}")
+        logger.error(f"Error closing browser: {e}", exc_info=True)
 
 visit_tool = FunctionTool.from_defaults(
     fn=visit,
@@ -240,9 +256,14 @@
 def initialize_research_agent() -> ReActAgent:
+    # Browser initialization should happen here or be managed externally
+    # Example: browser = start_chrome(headless=True, options=get_chrome_options())
+    # Ensure the browser instance is passed to tools or accessible via agent state/class
+    llm_model_name = os.getenv("RESEARCH_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
     llm = GoogleGenAI(
         api_key=os.getenv("GEMINI_API_KEY"),
-        model="models/gemini-1.5-pro",
+        model=llm_model_name,
     )
     system_prompt = """\
```
3.6. `text_analyzer_agent.py` Refactoring
Rationale: To improve configuration management and error handling.
Proposals:
- Configuration: Move the hardcoded LLM model name (`models/gemini-1.5-pro`) to environment variables or a configuration file.
- Prompt Management: Move the `analyze_text` prompt to a separate template file.
- Error Handling: In `extract_text_from_pdf`, consider raising specific exceptions (e.g., `PDFDownloadError`, `PDFParsingError`) instead of returning error strings, allowing the agent to handle failures more gracefully.
Diff Patch (Illustrative - Configuration & Error Handling):
```diff
--- a/text_analyzer_agent.py
+++ b/text_analyzer_agent.py
@@ -6,6 +6,14 @@
 logger = logging.getLogger(__name__)
 
+class PDFExtractionError(Exception):
+    """Custom exception for PDF extraction failures."""
+    pass
+
+class PDFDownloadError(PDFExtractionError):
+    """Custom exception for PDF download failures."""
+    pass
+
 def extract_text_from_pdf(source: str) -> str:
     """
     Extract raw text from a PDF file on disk or at a URL.
@@ -19,21 +27,21 @@
         try:
             resp = requests.get(source, timeout=10)
             resp.raise_for_status()
-        except Exception as e:
-            return f"Error downloading PDF from {source}: {e}"
+        except requests.exceptions.RequestException as e:
+            raise PDFDownloadError(f"Error downloading PDF from {source}: {e}") from e
         try:
             tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
             tmp.write(resp.content)
             tmp.flush()
             tmp_path = tmp.name
             tmp.close()
-        except Exception as e:
-            return f"Error writing temp PDF file: {e}"
+        except IOError as e:
+            raise PDFExtractionError(f"Error writing temp PDF file: {e}") from e
         path = tmp_path
     else:
         path = source
 
     # Now extract text from the PDF on disk
     if not os.path.isfile(path):
-        return f"PDF not found: {path}"
+        raise PDFExtractionError(f"PDF not found: {path}")
     text = ""
@@ -41,10 +49,10 @@
     try:
         reader = PdfReader(path)
         pages = [page.extract_text() or "" for page in reader.pages]
         text = "\n".join(pages)
-        print(f"Extracted {len(pages)} pages of text from PDF")
-    except Exception as e:
-        return f"Error reading PDF: {e}"
+        logger.info(f"Extracted {len(pages)} pages of text from PDF: {path}")
+    except Exception as e:  # Catch specific PyPDF2 errors if possible, otherwise general Exception
+        raise PDFExtractionError(f"Error reading PDF {path}: {e}") from e
 
     # Clean up temporary file if one was created
     if source.lower().startswith(("http://", "https://")):
@@ -67,6 +75,14 @@
     str: A plain-text string containing:
       • A "Summary:" section with bullet points.
       • A "Facts:" section with bullet points.
     """
+    # Load prompt from file ideally
+    prompt_template = """You are an expert analyst.
+Please analyze the following text and produce a plain-text response
+with two sections:
+
+Summary:
+• Provide 2–3 concise bullet points summarizing the main ideas.
+
+Facts:
+• List each verifiable fact found in the text as a bullet point.
+
+Respond with exactly that format—no JSON, no extra commentary.
+
+Text to analyze:
+{text}
+"""
-    # Build the prompt to guide the LLM's output format
-    input_prompt = f"""You are an expert analyst.
@@ -84,13 +100,14 @@
-    {text}
-    """
+    input_prompt = prompt_template.format(text=text)
 
     # Use the LLM to generate the analysis
+    llm_model_name = os.getenv("TEXT_ANALYZER_LLM_MODEL", "models/gemini-1.5-pro")
     llm = GoogleGenAI(
         api_key=os.getenv("GEMINI_API_KEY"),
-        model="models/gemini-1.5-pro",
+        model=llm_model_name,
     )
     generated = llm.complete(input_prompt)
@@ -124,9 +141,10 @@
     FunctionAgent: Configured analysis agent.
     """
+    llm_model_name = os.getenv("TEXT_ANALYZER_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
     llm = GoogleGenAI(
         api_key=os.getenv("GEMINI_API_KEY"),
-        model="models/gemini-1.5-pro",
+        model=llm_model_name,
     )
     system_prompt = """\
```
3.7. `reasoning_agent.py` Refactoring
Rationale: To simplify the agent structure, improve configuration, and potentially optimize LLM usage.
Proposals:
- Configuration: Move hardcoded LLM model names (`models/gemini-1.5-pro`, `o4-mini`) and the API key environment variable name (`ALPAFLOW_OPENAI_API_KEY`) to configuration.
- Prompt Management: Move the detailed CoT prompt from `reasoning_tool_fn` to a separate template file.
- Agent Structure Simplification: Given the rigid workflow (call tool -> handoff), consider replacing the `ReActAgent` with a simpler `FunctionAgent` that directly calls the `reasoning_tool` and formats the output before handing off. Alternatively, evaluate whether the `reasoning_tool` logic could be integrated as a direct LLM call within agents that need CoT (like `planner_agent`), potentially removing the need for a separate `reasoning_agent` altogether, unless its specific CoT prompt/model (`o4-mini`) is crucial.
Diff Patch (Illustrative - Configuration & Prompt Loading):
```diff
--- a/reasoning_agent.py
+++ b/reasoning_agent.py
@@ -1,10 +1,19 @@
 import os
+import logging
 
 from llama_index.core.agent.workflow import ReActAgent
 from llama_index.llms.google_genai import GoogleGenAI
 from llama_index.core.tools import FunctionTool
 from llama_index.llms.openai import OpenAI
 
+logger = logging.getLogger(__name__)
+
+def load_prompt_from_file(filename="reasoning_tool_prompt.txt") -> str:
+    try:
+        with open(filename, "r") as f:
+            return f.read()
+    except FileNotFoundError:
+        logger.error(f"Prompt file {filename} not found.")
+        return "Perform chain-of-thought reasoning on the context: {context}"
+
 def reasoning_tool_fn(context: str) -> str:
     """
     Perform end-to-end chain-of-thought reasoning over the full multi-agent workflow context,
@@ -17,45 +26,12 @@
     str: A structured reasoning trace with numbered thought steps, intermediate checks,
     and a concise final recommendation or conclusion.
     """
-    prompt = f"""You are an expert reasoning engine. You have the following full context of a multi-agent workflow:
-
-    {context}
-
-    Your job is to:
-    Comprehension
-      - Read the entire question or problem statement carefully.
-      - Identify key terms, constraints, and desired outcomes.
-    Decomposition
-      - Break down the problem into logical sub-steps or sub-questions.
-      - Ensure each sub-step is necessary and sufficient to progress toward a solution.
-    Chain-of-Thought
-      - Articulate your internal reasoning in clear, numbered steps.
-      - At each step, state your assumptions, derive implications, and check for consistency.
-    Intermediate Verification
-      - After each reasoning step, validate your conclusion against the problem's constraints.
-      - If a contradiction or uncertainty arises, revisit and refine the previous step.
-    Synthesis
-      - Once all sub-steps are resolved, integrate the intermediate results into a cohesive answer.
-      - Ensure the final answer directly addresses the user's request and all specified criteria.
-    Clarity & Precision
-      - Use formal, precise language.
-      - Avoid ambiguity: define any technical terms you introduce.
-      - Provide just enough detail to justify each conclusion without digression.
-    Final Answer
-      - Present a concise, well-structured response.
-      - If appropriate, include a brief summary of your reasoning steps.
-
-    Respond with your reasoning steps followed by the final recommendation.
-    """
+    prompt_template = load_prompt_from_file()
+    prompt = prompt_template.format(context=context)
 
+    reasoning_llm_model = os.getenv("REASONING_TOOL_LLM_MODEL", "o4-mini")
+    # Use specific API key if needed, e.g., ALPAFLOW_OPENAI_API_KEY
+    reasoning_api_key_env = os.getenv("REASONING_TOOL_API_KEY_ENV", "ALPAFLOW_OPENAI_API_KEY")
+    reasoning_api_key = os.getenv(reasoning_api_key_env)
     llm = OpenAI(
-        model="o4-mini",
-        api_key=os.getenv("ALPAFLOW_OPENAI_API_KEY"),
+        model=reasoning_llm_model,
+        api_key=reasoning_api_key,
         reasoning_effort="high"
     )
 
     response = llm.complete(prompt)
@@ -74,9 +50,10 @@
     """
     Create a pure reasoning agent with no tools, relying solely on chain-of-thought.
     """
+    agent_llm_model = os.getenv("REASONING_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
     llm = GoogleGenAI(
         api_key=os.getenv("GEMINI_API_KEY"),
-        model="models/gemini-1.5-pro",
+        model=agent_llm_model,
     )
     system_prompt = """\
```
3.8. `planner_agent.py` Refactoring
Rationale: To improve configuration management and prompt handling.
Proposals:
- Configuration: Move the hardcoded LLM model name (`models/gemini-1.5-pro`) to environment variables or a configuration file.
- Prompt Management: Move the system prompt and the prompts within the `plan` and `synthesize_and_respond` functions to separate template files for better readability and maintainability.
Diff Patch (Illustrative - Configuration & Prompt Loading):
```diff
--- a/planner_agent.py
+++ b/planner_agent.py
@@ -1,10 +1,19 @@
 import os
+import logging
 from typing import List, Any
 
 from llama_index.core.agent.workflow import FunctionAgent, ReActAgent
 from llama_index.core.tools import FunctionTool
 from llama_index.llms.google_genai import GoogleGenAI
 
+logger = logging.getLogger(__name__)
+
+def load_prompt_from_file(filename: str, default_prompt: str) -> str:
+    try:
+        with open(filename, "r") as f:
+            return f.read()
+    except FileNotFoundError:
+        logger.warning(f"Prompt file {filename} not found. Using default.")
+        return default_prompt
+
 def plan(objective: str) -> List[str]:
     """
     Generate a list of sub-questions from the given objective.
@@ -15,14 +24,16 @@
     Returns:
     List[str]: A list of sub-steps as strings.
     """
-    input_prompt: str = (
+    default_plan_prompt = (
         "You are a research assistant. "
         "Given an objective, break it down into a list of concise, actionable sub-steps.\n"
         f"Objective: {objective}\n"
         "Sub-steps (one per line):"
     )
+    plan_prompt_template = load_prompt_from_file("planner_plan_prompt.txt", default_plan_prompt)
+    input_prompt = plan_prompt_template.format(objective=objective)
+    llm_model_name = os.getenv("PLANNER_TOOL_LLM_MODEL", "models/gemini-1.5-pro")
     llm = GoogleGenAI(
         api_key=os.getenv("GEMINI_API_KEY"),
-        model="models/gemini-1.5-pro",
+        model=llm_model_name,
     )
@@ -44,13 +55,16 @@
     Returns:
     str: A unified, well-structured response addressing the original objective.
     """
     # Join each ready-made QA block directly
     summary_blocks = "\n".join(results)
-    input_prompt = f"""You are an expert synthesizer. Given the following sub-questions and their answers,
+    default_synth_prompt = f"""You are an expert synthesizer. Given the following sub-questions and their answers,
 produce a single, coherent, comprehensive report that addresses the original objective:
 
 {summary_blocks}
 
 Final Report:
 """
+    synth_prompt_template = load_prompt_from_file("planner_synthesize_prompt.txt", default_synth_prompt)
+    input_prompt = synth_prompt_template.format(summary_blocks=summary_blocks)
+    llm_model_name = os.getenv("PLANNER_TOOL_LLM_MODEL", "models/gemini-1.5-pro")  # Can use the same model as plan
     llm = GoogleGenAI(
         api_key=os.getenv("GEMINI_API_KEY"),
-        model="models/gemini-1.5-pro",
+        model=llm_model_name,
     )
     response = llm.complete(input_prompt)
     return response.text
@@ -77,9 +91,10 @@
     """
     Initialize a LlamaIndex agent specialized in research planning and question engineering.
     """
+    agent_llm_model = os.getenv("PLANNER_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
     llm = GoogleGenAI(
         api_key=os.getenv("GEMINI_API_KEY"),
-        model="models/gemini-1.5-pro",
+        model=agent_llm_model,
     )
     system_prompt = """\
@@ -108,6 +123,7 @@
     Completion & Synthesis
     – If the final result fully completes the original objective, produce a consolidated synthesis of the roadmap and send it as your concluding output.
     """
+    system_prompt = load_prompt_from_file("planner_system_prompt.txt", system_prompt)  # Load from file if it exists
 
     agent = ReActAgent(
         name="planner_agent",
```
3.9. `code_agent.py` Refactoring
Rationale: To address the critical security vulnerability of the `SimpleCodeExecutor`, improve configuration management, and align code execution with safer practices.
Proposals:
- Remove `SimpleCodeExecutor`: This class and its `execute` method using `subprocess` with raw code strings are fundamentally insecure and must be removed entirely.
- Use `CodeInterpreterToolSpec`: Rely exclusively on the `code_interpreter` tool derived from LlamaIndex's `CodeInterpreterToolSpec` for code execution. This tool is designed for safer, sandboxed execution.
- Update `CodeActAgent` Initialization: Remove the `code_execute_fn` parameter when initializing `CodeActAgent`, as the agent should use the provided `code_interpreter` tool for execution via the standard ReAct/Act loop, not a direct execution function.
- Configuration: Move hardcoded LLM model names (`o4-mini`, `models/gemini-1.5-pro`) and the API key environment variable name (`ALPAFLOW_OPENAI_API_KEY`) to configuration.
- Prompt Management: Move the `generate_python_code` prompt to a separate template file.
Diff Patch (Illustrative - Security Fix & Configuration):
```diff
--- a/code_agent.py
+++ b/code_agent.py
@@ -1,5 +1,6 @@
 import os
 import subprocess
+import logging
 
 from llama_index.core.agent.workflow import ReActAgent, CodeActAgent
 from llama_index.core.tools import FunctionTool
@@ -7,6 +8,16 @@
 from llama_index.llms.openai import OpenAI
 from llama_index.tools.code_interpreter import CodeInterpreterToolSpec
 
+logger = logging.getLogger(__name__)
+
+def load_prompt_from_file(filename: str, default_prompt: str) -> str:
+    try:
+        with open(filename, "r") as f:
+            return f.read()
+    except FileNotFoundError:
+        logger.warning(f"Prompt file {filename} not found. Using default.")
+        return default_prompt
+
 def generate_python_code(prompt: str) -> str:
     """
     Generate valid Python code from a natural language description.
@@ -27,7 +38,7 @@
     it before execution.
     - This function only generates code and does not execute it.
     """
-    input_prompt = f"""You are also a helpful assistant that writes Python code.
+    default_gen_prompt = f"""You are also a helpful assistant that writes Python code.
 You will be given a prompt and you must generate Python code based on that prompt.
 You must only generate Python code and nothing else.
 Do not include any explanations or any other text.
@@ -40,10 +51,14 @@
 Code:\n
 """
+    gen_prompt_template = load_prompt_from_file("code_gen_prompt.txt", default_gen_prompt)
+    input_prompt = gen_prompt_template.format(prompt=prompt)
+
+    gen_llm_model = os.getenv("CODE_GEN_LLM_MODEL", "o4-mini")
+    gen_api_key_env = os.getenv("CODE_GEN_API_KEY_ENV", "ALPAFLOW_OPENAI_API_KEY")
+    gen_api_key = os.getenv(gen_api_key_env)
     llm = OpenAI(
-        model="o4-mini",
-        api_key=os.getenv("ALPAFLOW_OPENAI_API_KEY")
+        model=gen_llm_model,
+        api_key=gen_api_key
     )
 
     generated_code = llm.complete(input_prompt)
@@ -74,60 +89,11 @@
     ),
 )
 
-from typing import Any, Dict, Tuple
-import io
-import contextlib
-import ast
-import traceback
-
-class SimpleCodeExecutor:
-    """
-    A simple code executor that runs Python code with state persistence.
-    This executor maintains a global and local state between executions,
-    allowing for variables to persist across multiple code runs.
-    NOTE: not safe for production use! Use with caution.
-    """
-    def __init__(self):
-        pass
-
-    def execute(self, code: str) -> str:
-        """
-        Execute Python code and capture output and return values.
-
-        Args:
-            code: Python code to execute
-
-        Returns:
-            Dict with keys `success`, `output`, and `return_value`
-        """
-        print(f"Executing code: {code}")
-        try:
-            result = subprocess.run(
-                ["python", code],
-                stdout=subprocess.PIPE,
-                stderr=subprocess.PIPE,
-                text=True,
-                timeout=60
-            )
-            if result.returncode != 0:
-                print(f"Execution failed with error: {result.stderr.strip()}")
-                return f"Error: {result.stderr.strip()}"
-            else:
-                output = result.stdout.strip()
-                print(f"Captured Output: {output}")
-                return output
-        except subprocess.TimeoutExpired:
-            print("Execution timed out.")
-            return "Error: Timeout"
-        except Exception as e:
-            print(f"Execution failed with error: {e}")
-            return f"Error: {e}"
-
 def initialize_code_agent() -> CodeActAgent:
-    code_executor = SimpleCodeExecutor()
+    # DO NOT USE SimpleCodeExecutor - it is insecure.
+    # Rely on the code_interpreter tool provided below.
+    agent_llm_model = os.getenv("CODE_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
     llm = GoogleGenAI(
         api_key=os.getenv("GEMINI_API_KEY"),
-        model="models/gemini-1.5-pro",
+        model=agent_llm_model,
     )
     system_prompt = """\
@@ -151,6 +117,7 @@
     - If further logical reasoning or verification is needed, delegate to reasoning_agent.
     - Otherwise, once you have the final code or execution result, pass your output to planner_agent for overall synthesis and presentation.
     """
+    system_prompt = load_prompt_from_file("code_agent_system_prompt.txt", system_prompt)
 
     agent = CodeActAgent(
         name="code_agent",
@@ -161,7 +128,7 @@
         "pipelines, and library development, CodeAgent delivers production-ready Python solutions."
         ),
-        code_execute_fn=code_executor.execute,
+        # REMOVED: code_execute_fn=code_executor.execute,  # Use the code_interpreter tool instead
         tools=[
             python_code_generator_tool,
             code_interpreter_tool,
```
3.10. `math_agent.py` Refactoring
Rationale: To improve configuration management and potentially simplify the tool interface for the LLM.
Proposals:
- Configuration: Move the hardcoded agent LLM model name (`models/gemini-1.5-pro`) to configuration. Ensure the WolframAlpha App ID is configured via environment variable (`WOLFRAM_ALPHA_APP_ID`) as intended.
- Tool Granularity: The current approach creates a separate tool for almost every single math function (solve, derivative, integral, add, multiply, inverse, mean, median, etc.). While explicit, this results in a very large number of tools for the `ReActAgent` to manage. Consider:
  - Grouping: Group related functions under fewer tools. For example, a `symbolic_math_tool` that takes the operation type (solve, diff, integrate) as a parameter, or a `matrix_ops_tool` (see the sketch after the diff patch below).
  - Natural Language Interface: Create a single `calculate` tool that takes a natural language math query (e.g., "solve x**2 - 4 = 0 for x", "mean of [1, 2, 3]") and uses an LLM (or rule-based parsing) internally to dispatch to the appropriate NumPy/SciPy/SymPy function. This simplifies the interface for the main agent LLM but adds complexity within the tool.
  - WolframAlpha Prioritization: Evaluate whether WolframAlpha can handle many of these requests directly, potentially reducing the need for numerous specific SymPy/NumPy tools, especially for symbolic tasks.
- Truncated File: Since the original file was truncated, ensure the full file is reviewed if possible, as there might be other issues or tools not seen.
Diff Patch (Illustrative - Configuration):
```diff
--- a/math_agent.py
+++ b/math_agent.py
@@ -1,5 +1,6 @@
 import os
 from typing import List, Optional, Union
+import logging
 import sympy as sp
 import numpy as np
 from llama_index.core.agent.workflow import ReActAgent
@@ -12,6 +13,8 @@
 from scipy.integrate import odeint
 import numpy.fft as fft
 
+logger = logging.getLogger(__name__)
+
 # --- Symbolic math functions ---
@@ -451,10 +454,11 @@
 def initialize_math_agent() -> ReActAgent:
+    agent_llm_model = os.getenv("MATH_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
     llm = GoogleGenAI(
         api_key=os.getenv("GEMINI_API_KEY"),
-        model="models/gemini-1.5-pro",
+        model=agent_llm_model,
     )
```
(Refactoring proposals section complete)
4. New Feature Designs
This section outlines the design for the new features requested: YouTube Ingestion and Generic Audio Transcription.
4.1. YouTube Ingestion
- Rationale: To enable the framework to process YouTube videos by extracting audio, transcribing it, and summarizing the content, as requested by the user.
- Design Proposal:
  - Implementation: Introduce a new dedicated agent, `youtube_agent`, or add tools to the existing `research_agent` or `text_analyzer_agent`. A dedicated agent seems cleaner given the specific multi-step workflow.
  - Agent (`youtube_agent`):
    - Purpose: Manages the end-to-end process of downloading YouTube audio, chunking, transcribing, and summarizing.
    - Tools:
      - `download_youtube_audio`: Takes a YouTube URL, uses a library like `yt-dlp` (or potentially `pytube`) to download the audio stream into a temporary file (e.g., `.mp3` or `.opus`). Returns the path to the audio file.
      - `chunk_audio_file`: Takes an audio file path and a maximum chunk duration (e.g., 60 seconds). Uses a library like `pydub` or `librosa` + `soundfile` to split the audio into smaller, sequentially numbered temporary files. Returns a list of chunk file paths.
      - `transcribe_audio_chunk_gemini`: Takes an audio file path (representing a chunk). Uses the Google Generative AI SDK (`google.generativeai`) to call the Gemini 1.5 Pro model with the audio file for transcription. Returns the transcribed text.
      - `summarize_transcript`: Takes the full concatenated transcript text. Uses a Gemini model (e.g., 1.5 Pro or Flash) with a specific prompt to generate a one-paragraph summary. Returns the summary text.
    - Workflow (ReAct or Function sequence; a minimal code sketch follows this section):
      - Receive the YouTube URL.
      - Call `download_youtube_audio`.
      - Call `chunk_audio_file` with the downloaded audio path.
      - Iterate through the list of chunk paths: call `transcribe_audio_chunk_gemini` for each chunk and collect the transcribed text segments.
      - Concatenate all transcribed text segments into a full transcript.
      - Call `summarize_transcript` with the full transcript.
      - Return the full transcript and the summary.
      - Clean up temporary audio files (downloaded and chunks).
    - Handoff: Could hand off the transcript and summary to `planner_agent` or `text_analyzer_agent` for further processing or integration.
  - Dependencies: `yt-dlp`, `pydub` (requires `ffmpeg` or `libav`), `google-generativeai`.
  - Configuration: Gemini API key, chunk duration.
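A minimal sketch of the workflow above, assuming `yt-dlp`, `pydub`, and the `google-generativeai` SDK; the function names, output paths, and the Gemini File API upload call are assumptions that may need adjusting to the SDK version in use:

```python
# Illustrative sketch of the youtube_agent pipeline (names and paths are assumptions).
import os
import tempfile

import google.generativeai as genai
import yt_dlp
from pydub import AudioSegment

genai.configure(api_key=os.getenv("GEMINI_API_KEY"))


def download_youtube_audio(url: str) -> str:
    """Download the audio stream of a YouTube video to a temporary mp3 file."""
    out_template = os.path.join(tempfile.gettempdir(), "yt_audio.%(ext)s")
    opts = {
        "format": "bestaudio/best",
        "outtmpl": out_template,
        "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([url])
    return os.path.join(tempfile.gettempdir(), "yt_audio.mp3")


def chunk_audio_file(path: str, chunk_seconds: int = 60) -> list[str]:
    """Split the audio into sequentially numbered chunks of chunk_seconds each."""
    audio = AudioSegment.from_file(path)
    chunk_paths = []
    for i, start_ms in enumerate(range(0, len(audio), chunk_seconds * 1000)):
        chunk_path = f"{path}.chunk{i:03d}.mp3"
        audio[start_ms:start_ms + chunk_seconds * 1000].export(chunk_path, format="mp3")
        chunk_paths.append(chunk_path)
    return chunk_paths


def transcribe_audio_chunk_gemini(chunk_path: str) -> str:
    """Transcribe one audio chunk with Gemini (File API upload assumed available)."""
    audio_file = genai.upload_file(chunk_path)
    model = genai.GenerativeModel("models/gemini-1.5-pro")
    response = model.generate_content(["Transcribe this audio verbatim.", audio_file])
    return response.text
```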
4.2. Generic Audio Transcription
- Rationale: To provide a flexible audio transcription capability for local files or remote URLs, using Gemini Pro for quality/latency tolerance and Whisper.cpp as a fallback, exposing it via a Python API as requested.
- Design Proposal:
  - Implementation: Introduce a new dedicated agent, `transcription_agent`, or add tools to `text_analyzer_agent`. A dedicated agent allows for clearer separation of concerns, especially managing the Whisper.cpp dependency and logic.
  - Agent (`transcription_agent`):
    - Purpose: Transcribes audio from various sources (local path, URL) using either Gemini or Whisper.cpp based on latency requirements or availability.
    - Tools:
      - `prepare_audio_source`: Takes a source string (URL or local path). If it is a URL, downloads it to a temporary file using `requests`. Validates the local file path. Returns the path to the local audio file.
      - `transcribe_gemini`: Takes an audio file path. Uses the `google-generativeai` SDK to call Gemini 1.5 Pro for transcription. Returns the transcribed text. This is the preferred method when latency is acceptable.
      - `transcribe_whisper_cpp`: Takes an audio file path. Uses a Python wrapper around `whisper.cpp` (e.g., installing `whisper.cpp` via `apt` or compiling from source, then using `subprocess` or a dedicated Python binding if available) to perform local transcription. Returns the transcribed text. This is the fallback or low-latency option.
      - `choose_transcription_method`: (Internal logic or a simple tool) Takes a latency preference (e.g., 'high_quality' vs 'low_latency') or checks Gemini availability/quota. Decides whether to use `transcribe_gemini` or `transcribe_whisper_cpp` (a fallback-selection sketch follows this section).
    - Workflow (ReAct or Function sequence):
      - Receive the audio source (URL/path) and potentially a latency preference.
      - Call `prepare_audio_source` to get a local file path.
      - Call `choose_transcription_method` (or execute internal logic) to decide between Gemini and Whisper.
      - If Gemini: call `transcribe_gemini`.
      - If Whisper: call `transcribe_whisper_cpp`.
      - Return the resulting transcript.
      - Clean up the temporary downloaded audio file if applicable.
    - Handoff: Could hand off the transcript to `planner_agent` or `text_analyzer_agent`.
  - Python API:
    - Define a simple Python function (e.g., in a `transcription_api.py` module) that encapsulates the agent's logic or directly calls the underlying transcription functions, as shown below.
```python
# Example API function in transcription_api.py
from .transcription_agent import transcribe_audio  # Assuming agent logic is refactored


class TranscriptionError(Exception):
    pass


def get_transcript(source: str, prefer_gemini: bool = True) -> str:
    """Transcribes audio from a local path or URL.

    Args:
        source: Path to the local audio file or URL.
        prefer_gemini: If True, attempts to use Gemini Pro first.
            If False or Gemini fails, falls back to Whisper.cpp.

    Returns:
        The transcribed text.

    Raises:
        TranscriptionError: If transcription fails.
    """
    # Implementation would call the agent or its refactored functions
    try:
        # Simplified logic - the actual implementation needs error handling and
        # Gemini/Whisper selection based on preference/availability
        transcript = transcribe_audio(source, prefer_gemini)
        return transcript
    except Exception as e:
        # Log error
        raise TranscriptionError(f"Failed to transcribe {source}: {e}") from e
```
  - Dependencies: `requests`, `google-generativeai`, `whisper.cpp` (requires separate installation/compilation), potentially Python bindings for `whisper.cpp`.
  - Configuration: Gemini API key, path to the `whisper.cpp` executable or library, Whisper model selection.
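A minimal sketch of the Gemini-first, Whisper.cpp-fallback selection described above; `prepare_audio_source` and `transcribe_gemini` are assumed to exist as in the tool list, and the `whisper.cpp` binary path and command-line flags are assumptions that depend on how it was built:

```python
# Illustrative selection logic for transcription_agent (helper names are assumptions).
import logging
import os
import subprocess

from transcription_agent import prepare_audio_source, transcribe_gemini  # assumed helpers

logger = logging.getLogger(__name__)


def transcribe_whisper_cpp(audio_path: str) -> str:
    """Local fallback via the whisper.cpp CLI (binary path and flags are assumptions)."""
    whisper_bin = os.getenv("WHISPER_CPP_BIN", "./main")
    model_path = os.getenv("WHISPER_CPP_MODEL", "models/ggml-base.en.bin")
    result = subprocess.run(
        [whisper_bin, "-m", model_path, "-f", audio_path, "--no-timestamps"],
        capture_output=True, text=True, timeout=600,
    )
    result.check_returncode()
    return result.stdout.strip()


def transcribe(source: str, prefer_gemini: bool = True) -> str:
    """Prefer Gemini when allowed; fall back to whisper.cpp on failure."""
    audio_path = prepare_audio_source(source)
    if prefer_gemini:
        try:
            return transcribe_gemini(audio_path)
        except Exception as exc:
            logger.warning("Gemini transcription failed (%s); falling back to whisper.cpp", exc)
    return transcribe_whisper_cpp(audio_path)
```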
5. Extra Agent Designs
This section proposes three additional specialized agents designed to enhance performance on the GAIA benchmark by addressing common challenges like complex fact verification, interpreting visual data representations, and handling long contexts.
5.1. Agent Design 1: Advanced Validation Agent (`validation_agent`)
- Purpose: To perform rigorous validation of factual claims or intermediate results generated by other agents, going beyond the simple contradiction check of the current `verifier_agent`. This agent aims to improve the accuracy and trustworthiness of the final answer by cross-referencing information and performing checks.
- Key Tool Calls:
  - `web_search` (from `research_agent` or similar): To find external evidence supporting or refuting a claim.
  - `browse_and_extract` (from `research_agent` or similar): To access specific URLs found during search and extract relevant text snippets.
  - `code_interpreter` (from `code_agent`): To perform calculations or simple data manipulations needed for verification (e.g., checking unit conversions, calculating percentages).
  - `knowledge_base_lookup` (New Tool - Optional): Interface with a structured knowledge base (e.g., Wikidata, internal DB) to verify entities, relationships, or properties.
  - `llm_check_consistency` (New Tool or LLM call): Use a powerful LLM with a specific prompt to assess the logical consistency between a claim and a set of provided evidence snippets or existing context (a minimal sketch follows this section).
- Agent Loop Sketch (ReAct style):
  - Input: A specific claim or statement to validate, along with relevant context or source information.
  - Thought: Identify the core assertion in the claim. Determine the best validation strategy (e.g., web search for current events, calculation for numerical claims, consistency check for logical statements).
  - Action: Call the appropriate tool (`web_search`, `code_interpreter`, `llm_check_consistency`).
  - Observation: Analyze the tool's output (search results, calculation result, consistency assessment).
  - Thought: Does the observation confirm, refute, or remain inconclusive about the claim? Is more information needed (e.g., to browse a specific search result)?
  - Action (if needed): Call another tool (`browse_and_extract`, `llm_check_consistency` with new evidence).
  - Observation: Analyze the new output.
  - Thought: Synthesize findings. Assign a final validation status (e.g., Confirmed, Refuted, Uncertain) and provide supporting evidence or reasoning.
  - Output: Validation status and justification.
- Handoff: Return the result to `planner_agent` or `verifier_agent` (if this agent replaces the contradiction part).
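A minimal sketch of `llm_check_consistency` as a plain LLM call, mirroring the GoogleGenAI usage elsewhere in the codebase; the prompt wording, status labels, and environment variable name are assumptions:

```python
# Illustrative consistency-check tool for validation_agent (prompt/labels are assumptions).
import os

from llama_index.llms.google_genai import GoogleGenAI


def llm_check_consistency(claim: str, evidence: list[str]) -> str:
    """Ask an LLM whether the evidence confirms, refutes, or is inconclusive about a claim."""
    evidence_block = "\n".join(f"- {snippet}" for snippet in evidence)
    prompt = (
        "You are a strict fact-checking assistant.\n"
        f"Claim: {claim}\n"
        f"Evidence:\n{evidence_block}\n"
        "Answer with exactly one of: Confirmed, Refuted, Uncertain, "
        "followed by a one-sentence justification."
    )
    llm = GoogleGenAI(
        api_key=os.getenv("GEMINI_API_KEY"),
        model=os.getenv("VALIDATION_AGENT_LLM_MODEL", "models/gemini-1.5-pro"),
    )
    return llm.complete(prompt).text
```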
5.2. Agent Design 2: Figure Interpretation Agent (`figure_interpretation_agent`)
- Purpose: To specialize in extracting structured data and meaning from figures, charts, graphs, and tables embedded within images or documents, which are common in GAIA tasks and often require more than just a textual description.
- Key Tool Calls:
  - `image_ocr` (New Tool or enhanced `image_analyzer_agent` capability): High-precision OCR focused on extracting text specifically from figures, including axis labels, legends, titles, and data points.
  - `chart_data_extractor` (New Tool): Utilizes specialized vision models (e.g., DePlot, ChartOCR, or similar fine-tuned models) designed to parse chart types (bar, line, pie) and extract underlying data series or key values.
  - `table_parser` (New Tool): Uses vision or document AI models to detect table structures in images/PDFs and extract cell content into a structured format (e.g., list of lists, Pandas DataFrame via code execution).
  - `code_interpreter` (from `code_agent`): To process extracted data (e.g., load into a DataFrame, perform simple analysis, re-plot for verification); a minimal sketch of this step follows this section.
  - `llm_interpret_figure` (New Tool or LLM call): Takes extracted text, data, and potentially the image itself (multimodal) to provide a semantic interpretation of the figure's message or trends.
- Agent Loop Sketch (Function sequence or ReAct):
  - Input: An image or document page containing a figure/table, potentially with context or a specific question about it.
  - Action: Call `image_ocr` to get all text elements.
  - Action: Call `chart_data_extractor` or `table_parser` based on visual analysis (or try both) to get structured data.
  - Action (Optional): Call `code_interpreter` to load the structured data into a DataFrame for easier handling.
  - Action: Call `llm_interpret_figure`, providing the extracted text, data (raw or DataFrame), and potentially the original image, asking it to answer the specific question or summarize the figure's key insights.
  - Output: Structured data (if requested) and/or the semantic interpretation/answer.
- Handoff: Return results to `planner_agent` or `reasoning_agent`.
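A minimal sketch of the optional `code_interpreter` step, assuming `table_parser` returns a list of rows with a header row (the input shape and column names are assumptions):

```python
# Illustrative post-processing of table_parser output inside code_interpreter.
import pandas as pd

# Assumed shape: first row is the header, remaining rows are cell values.
extracted_rows = [
    ["Year", "Revenue"],
    ["2022", "1.2"],
    ["2023", "1.8"],
]

df = pd.DataFrame(extracted_rows[1:], columns=extracted_rows[0])
df["Revenue"] = df["Revenue"].astype(float)

# Simple checks the agent might run before handing the data to llm_interpret_figure.
print(df.describe())
print("Max revenue year:", df.loc[df["Revenue"].idxmax(), "Year"])
```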
5.3. Agent Design 3: Long Context Management Agent (`long_context_agent`)
- Purpose: To effectively manage and query information from very long documents or conversation histories that exceed the context window limits of standard models or require efficient information retrieval techniques.
- Key Tool Calls:
  - `document_chunker` (New Tool): Splits long text into semantically meaningful chunks (e.g., using `SentenceSplitter` from LlamaIndex or more advanced methods).
  - `vector_store_builder` (New Tool): Takes text chunks and builds an in-memory or persistent vector index (using libraries like `llama-index`, `langchain`, `faiss`, `chromadb`).
  - `vector_retriever` (New Tool): Queries the built vector index with a specific question to find the most relevant chunks.
  - `summarizer_tool` (New Tool or LLM call): Generates summaries of long text or selected chunks, potentially using different levels of detail.
  - `contextual_synthesizer` (New Tool or LLM call): Takes the retrieved relevant chunks and the original query, then uses an LLM to synthesize an answer grounded in the retrieved context (RAG pattern); a minimal sketch follows this section.
- Agent Loop Sketch (can be stateful):
  - Input: A long document (text or path) or a long conversation history, and a specific query or task related to it.
  - (Initialization/First Use):
    - Action: Call `document_chunker`.
    - Action: Call `vector_store_builder` to create an index from the chunks. Store the index reference.
  - (Querying):
    - Action: Call `vector_retriever` with the user's query to get the relevant chunks.
    - Action: Call `contextual_synthesizer`, providing the query and retrieved chunks, to generate the final answer.
  - (Alternative: Summarization Task):
    - Action: Call `summarizer_tool` on the full text (if feasible for the tool) or on retrieved chunks based on a high-level query.
  - Output: The synthesized answer or the summary.
- Handoff: Return results to `planner_agent`.
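A minimal sketch of the chunk → index → retrieve → synthesize flow using LlamaIndex components the project already depends on; the splitter settings and `similarity_top_k` value are assumptions, and a configured embedding model/LLM is assumed to be available:

```python
# Illustrative RAG flow for long_context_agent (settings are assumptions).
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter


def build_index(long_text: str) -> VectorStoreIndex:
    """document_chunker + vector_store_builder: chunk the text and index it in memory."""
    splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=100)
    return VectorStoreIndex.from_documents(
        [Document(text=long_text)], transformations=[splitter]
    )


def answer_from_long_context(index: VectorStoreIndex, query: str) -> str:
    """vector_retriever + contextual_synthesizer: retrieve relevant chunks and synthesize."""
    query_engine = index.as_query_engine(similarity_top_k=5)
    return str(query_engine.query(query))
```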
6. Migration Plan
This section details the recommended steps for applying the proposed changes, lists new dependencies, and outlines minimal validation tests.
6.1. Order of Implementation
It is recommended to apply changes in the following order to minimize disruption and build upon stable foundations:
- Core Refactoring (`app.py`, Configuration, Logging):
  - Implement centralized configuration (e.g., a `.env` file) and update all agents to use it for API keys, model names, etc. (a minimal configuration sketch follows this list).
  - Integrate Python's `logging` module throughout `app.py` and all agent files, replacing `print` statements.
  - Refactor `app.py`: Implement singleton agent initialization and break down `run_and_submit_all`.
  - Apply structural refactors to agents (class-based structure, avoiding globals) such as `role_agent`, `verifier_agent`, and `research_agent`.
- Critical Security Fix (`code_agent`):
  - Immediately remove the `SimpleCodeExecutor` and modify `code_agent` to rely solely on the `code_interpreter` tool.
- Core Functionality Refactoring (`verifier_agent`, `math_agent`):
  - Improve `verifier_agent`'s contradiction detection (e.g., using an LLM or NLI model).
  - Refactor `math_agent` tools if choosing to group them or use a natural language interface.
- New Feature: Generic Audio Transcription (`transcription_agent`):
  - Install `whisper.cpp` and its dependencies.
  - Implement the `transcription_agent` and its tools (`prepare_audio_source`, `transcribe_gemini`, `transcribe_whisper_cpp`).
  - Implement the Python API function `get_transcript`.
- New Feature: YouTube Ingestion (`youtube_agent`):
  - Install `yt-dlp` and `pydub` (and `ffmpeg`).
  - Implement the `youtube_agent` and its tools (`download_youtube_audio`, `chunk_audio_file`, `transcribe_audio_chunk_gemini`, `summarize_transcript`).
- New Agent Implementation (Validation, Figure, Long Context):
  - Implement `validation_agent` and its tools.
  - Implement `figure_interpretation_agent` and its tools (requires sourcing/installing chart/table parsing models/libraries).
  - Implement `long_context_agent` and its tools (requires vector DB setup such as `faiss` or `chromadb`).
- Integration and Workflow Adjustments:
  - Update `planner_agent`'s system prompt and handoff logic to incorporate the new agents.
  - Update other agents' handoff targets as needed.
  - Update `app.py` if the overall agent initialization or workflow invocation changes.
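A minimal sketch of the centralized configuration from step 1, assuming `python-dotenv`; the variable names follow the illustrative diff patches above:

```python
# Illustrative configuration bootstrap (variable names follow the diff patches above).
# Example .env contents:
#   GEMINI_API_KEY=...
#   RESEARCH_AGENT_LLM_MODEL=models/gemini-1.5-pro
#   CODE_GEN_LLM_MODEL=o4-mini
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory, if present

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
RESEARCH_AGENT_LLM_MODEL = os.getenv("RESEARCH_AGENT_LLM_MODEL", "models/gemini-1.5-pro")

if GEMINI_API_KEY is None:
    raise RuntimeError("GEMINI_API_KEY is not configured")
```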
6.2. New Dependencies (`requirements.txt`)
Based on the refactoring and new features, the following dependencies might need to be added or updated in `requirements.txt` (or managed via environment setup):
- `python-dotenv`: For loading configuration from `.env` files.
- `google-generativeai`: For interacting with Gemini models (likely already present via `llama-index-llms-google-genai`).
- `yt-dlp`: For downloading YouTube videos.
- `pydub`: For audio manipulation (chunking). Requires the `ffmpeg` or `libav` system dependency.
- `llama-index-vector-stores-faiss` / `faiss-cpu` / `faiss-gpu`: For the `long_context_agent` vector store (choose one).
- `chromadb` / `llama-index-vector-stores-chroma`: Alternative vector store for `long_context_agent`.
- `llama-index-multi-modal-llms-google`: Ensure multimodal support for Gemini is correctly installed.
- Possibly: libraries for NLI models (e.g., `transformers`, `torch`) if used in `validation_agent`.
- Possibly: libraries for chart/table parsing (e.g., specific models from Hugging Face, `opencv-python`, `pdf2image`) if implementing `figure_interpretation_agent` tools.
- Possibly: Python bindings for `whisper.cpp` if not using `subprocess`.
System Dependencies:
- `ffmpeg` or `libav`: Required by `pydub`.
- `whisper.cpp`: Needs to be compiled or installed separately. Follow its specific instructions.
6.3. Validation Tests
Minimal tests should be implemented to validate key changes:
- Configuration: Test loading of API keys and model names from the configuration source.
- Logging: Verify that logs are being generated at the correct levels and formats.
- `code_agent` Security: Test that `code_agent` uses `code_interpreter` and not the removed `SimpleCodeExecutor`. Attempt a malicious code execution via prompt to ensure it fails safely within the interpreter's sandbox (see the pytest sketch after this list).
- `verifier_agent` Contradiction: Test the improved contradiction detection with sample pairs of contradictory and non-contradictory statements.
- `transcription_agent`:
  - Test with a short local audio file using both Gemini and Whisper.cpp, comparing output quality/speed.
  - Test with an audio URL.
  - Test the Python API function `get_transcript`.
- `youtube_agent`:
  - Test with a short YouTube video URL.
  - Verify audio download, chunking, transcription of chunks, and final summary generation.
  - Check cleanup of temporary files.
- New Agents (Basic): For `validation_agent`, `figure_interpretation_agent`, and `long_context_agent`, implement basic tests confirming agent initialization and successful calls to their primary new tools with mock inputs/outputs.
- End-to-End Smoke Test: Run `app.py` and process one or two simple GAIA tasks that are likely to invoke the refactored components and potentially a new feature (if a relevant task exists) to ensure the overall workflow remains functional.
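Two minimal pytest sketches for the configuration and `code_agent` security tests; the module names and exact assertions are assumptions about the refactored layout:

```python
# Illustrative pytest sketches (module layout and assertions are assumptions).
import os


def test_model_name_is_configurable(monkeypatch):
    """Configuration: the env var should override the default model name."""
    monkeypatch.setenv("RESEARCH_AGENT_LLM_MODEL", "models/gemini-1.5-flash")
    assert os.getenv("RESEARCH_AGENT_LLM_MODEL", "models/gemini-1.5-pro") == "models/gemini-1.5-flash"


def test_simple_code_executor_removed():
    """code_agent security: the insecure executor class must no longer exist."""
    import code_agent  # assumed module name
    assert not hasattr(code_agent, "SimpleCodeExecutor")
```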
(Implementation plan complete. Ready for user confirmation.)