### 3.5. `research_agent.py` Refactoring
* **Rationale:** To improve browser instance management, error handling, and configuration.
* **Proposals:**
1. **Browser Lifecycle Management:** Instead of initializing the browser (`start_chrome`) at the module level, manage its lifecycle explicitly. Options:
* Initialize the browser within the agent's initialization and provide a method or tool to explicitly close it (`kill_browser`) when the agent's task is done or the application shuts down.
* Use a context manager (`with start_chrome(...) as browser:`) if the browser is only needed for a specific scope within a tool call (less likely for a persistent agent).
* Ensure `kill_browser` is reliably called; for example, the `planner_agent` could invoke a cleanup tool/method on the `research_agent` after its tasks are complete (a minimal lifecycle sketch follows the diff patch below).
2. **Configuration:** Move hardcoded Chrome options to configuration. Externalize API keys/IDs if not already done (they seem to be using `os.getenv`, which is good).
3. **Robust Error Handling:** For browser interaction tools (`visit`, `get_text_by_css`, `click_element`), raise specific custom exceptions instead of returning error strings. This allows for more structured error handling by the agent or workflow.
4. **Tool Consolidation (Optional):** The agent has many tools. Consider if some related tools (e.g., different search APIs) could be consolidated behind a single tool that internally chooses the best source, or if the LLM handles the large toolset effectively.
* **Diff Patch (Illustrative - Configuration & Browser Init):**
```diff
--- a/research_agent.py
+++ b/research_agent.py
@@ -1,5 +1,6 @@
import os
import time
+ import logging
from typing import List
from llama_index.core.agent.workflow import ReActAgent
@@ -15,17 +16,21 @@
from helium import start_chrome, go_to, find_all, Text, kill_browser
from helium import get_driver
+ logger = logging.getLogger(__name__)
+
# 1. Helium
-chrome_options = webdriver.ChromeOptions()
-chrome_options.add_argument("--no-sandbox")
-chrome_options.add_argument("--disable-dev-shm-usage")
-chrome_options.add_experimental_option("prefs", {
- "download.prompt_for_download": False,
- "plugins.always_open_pdf_externally": True,
- "profile.default_content_settings.popups": 0
-})
-
-browser = start_chrome(headless=True, options=chrome_options)
+# Browser instance should be managed, not global at module level
+# browser = start_chrome(headless=True, options=chrome_options)
+
+def get_chrome_options():
+ options = webdriver.ChromeOptions()
+ if os.getenv("RESEARCH_AGENT_CHROME_NO_SANDBOX", "true").lower() == "true":
+ options.add_argument("--no-sandbox")
+ if os.getenv("RESEARCH_AGENT_CHROME_DISABLE_DEV_SHM", "true").lower() == "true":
+ options.add_argument("--disable-dev-shm-usage")
+ # Add other options from config as needed
+ # options.add_experimental_option(...) # Example
+ return options
def visit(url: str, wait_seconds: float = 2.0) -> str | None:
"""
@@ -36,10 +41,11 @@
wait_seconds (float): Time to wait after navigation.
"""
try:
+ # Assumes browser is available in context (e.g., class member)
go_to(url)
time.sleep(wait_seconds)
return f"Visited: {url}"
except Exception as e:
+ logger.error(f"Error visiting {url}: {e}", exc_info=True)
return f"Error visiting {url}: {e}"
def get_text_by_css(selector: str) -> List[str] | str:
@@ -52,13 +58,15 @@
List[str]: List of text contents.
"""
try:
+ # Assumes browser/helium context is active
if selector.lower() == 'body':
elements = find_all(Text())
else:
elements = find_all(selector)
texts = [elem.web_element.text for elem in elements]
- print(f"Extracted {len(texts)} elements for selector \'{selector}\'")
+ logger.info(f"Extracted {len(texts)} elements for selector \'{selector}\'")
return texts
except Exception as e:
+ logger.error(f"Error extracting text for selector {selector}: {e}", exc_info=True)
return f"Error extracting text for selector {selector}: {e}"
def get_page_html() -> str:
@@ -70,9 +78,11 @@
str: HTML content, or empty string on error.
"""
try:
+ # Assumes browser/helium context is active
driver = get_driver()
html = driver.page_source
return html
except Exception as e:
+ logger.error(f"Error extracting HTML: {e}", exc_info=True)
return f"Error extracting HTML: {e}"
def click_element(selector: str, index_element: int = 0) -> str:
@@ -83,10 +93,12 @@
selector (str): CSS selector of the element to click.
"""
try:
+ # Assumes browser/helium context is active
element = find_all(selector)[index_element]
element.click()
time.sleep(1)
return f"Clicked element matching selector \'{selector}\'"
except Exception as e:
+ logger.error(f"Error clicking element {selector}: {e}", exc_info=True)
return f"Error clicking element {selector}: {e}"
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
@@ -97,6 +109,7 @@
nth_result: Which occurrence to jump to (default: 1)
"""
+    # Assumes browser is available in context
     elements = browser.find_elements(By.XPATH, f"//*[contains(text(), \'{text}\')]")
if nth_result > len(elements):
return f"Match n°{nth_result} not found (only {len(elements)} matches found)"
result = f"Found {len(elements)} matches for \'{text}\'."
@@ -107,19 +120,22 @@
def go_back() -> None:
"""Goes back to previous page."""
+    # Assumes browser is available in context
     browser.back()
def close_popups() -> None:
"""
Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows! This does not work on cookie consent banners.
"""
+    # Assumes browser is available in context
     webdriver.ActionChains(browser).send_keys(Keys.ESCAPE).perform()
def close() -> None:
"""
Close the browser instance.
"""
try:
+ # Assumes kill_browser is appropriate here
kill_browser()
- print("Browser closed")
+ logger.info("Browser closed via kill_browser()")
except Exception as e:
- print(f"Error closing browser: {e}")
+ logger.error(f"Error closing browser: {e}", exc_info=True)
visit_tool = FunctionTool.from_defaults(
fn=visit,
@@ -240,9 +256,14 @@
def initialize_research_agent() -> ReActAgent:
+ # Browser initialization should happen here or be managed externally
+ # Example: browser = start_chrome(headless=True, options=get_chrome_options())
+ # Ensure browser instance is passed to tools or accessible via agent state/class
+
+ llm_model_name = os.getenv("RESEARCH_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
llm = GoogleGenAI(
api_key=os.getenv("GEMINI_API_KEY"),
- model="models/gemini-1.5-pro",
+ model=llm_model_name,
)
system_prompt = """\
```
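* **Illustrative Sketch (Hypothetical - Browser Lifecycle):** A minimal sketch of proposal 1, assuming a small class that owns the Helium browser instance; `BrowserSession` and its methods are illustrative names, not existing code.
```python
# Hypothetical sketch: explicit browser lifecycle management for research_agent.py.
import logging

from helium import kill_browser, start_chrome
from selenium import webdriver

logger = logging.getLogger(__name__)


class BrowserSession:
    """Owns a single headless Chrome instance for the research agent."""

    def __init__(self):
        self._browser = None

    def start(self):
        """Lazily start Chrome on first use and reuse it afterwards."""
        if self._browser is None:
            options = webdriver.ChromeOptions()
            options.add_argument("--no-sandbox")
            self._browser = start_chrome(headless=True, options=options)
        return self._browser

    def close(self):
        """Tear down the browser; safe to call multiple times."""
        try:
            kill_browser()
            logger.info("Browser closed")
        finally:
            self._browser = None


# Usage: planner_agent (or an app shutdown hook) calls session.close()
# once the research tasks are complete.
session = BrowserSession()
```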
### 3.6. `text_analyzer_agent.py` Refactoring
* **Rationale:** To improve configuration management and error handling.
* **Proposals:**
1. **Configuration:** Move the hardcoded LLM model name (`models/gemini-1.5-pro`) to environment variables or a configuration file.
2. **Prompt Management:** Move the `analyze_text` prompt to a separate template file.
3. **Error Handling:** In `extract_text_from_pdf`, consider raising specific exceptions (e.g., `PDFDownloadError`, `PDFExtractionError`) instead of returning error strings, allowing the agent to handle failures more gracefully (a usage sketch follows the diff patch below).
* **Diff Patch (Illustrative - Configuration & Error Handling):**
```diff
--- a/text_analyzer_agent.py
+++ b/text_analyzer_agent.py
@@ -6,6 +6,14 @@
logger = logging.getLogger(__name__)
+ class PDFExtractionError(Exception):
+ """Custom exception for PDF extraction failures."""
+ pass
+
+ class PDFDownloadError(PDFExtractionError):
+ """Custom exception for PDF download failures."""
+ pass
+
def extract_text_from_pdf(source: str) -> str:
"""
Extract raw text from a PDF file on disk or at a URL.
@@ -19,21 +27,21 @@
try:
resp = requests.get(source, timeout=10)
resp.raise_for_status()
- except Exception as e:
- return f"Error downloading PDF from {source}: {e}"
+ except requests.exceptions.RequestException as e:
+ raise PDFDownloadError(f"Error downloading PDF from {source}: {e}") from e
try:
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
tmp.write(resp.content)
tmp.flush()
tmp_path = tmp.name
tmp.close()
- except Exception as e:
- return f"Error writing temp PDF file: {e}"
+ except IOError as e:
+ raise PDFExtractionError(f"Error writing temp PDF file: {e}") from e
path = tmp_path
else:
path = source
# Now extract text from the PDF on disk
if not os.path.isfile(path):
- return f"PDF not found: {path}"
+ raise PDFExtractionError(f"PDF not found: {path}")
text = ""
@@ -41,10 +49,10 @@
reader = PdfReader(path)
pages = [page.extract_text() or "" for page in reader.pages]
text = "\n".join(pages)
- print(f"Extracted {len(pages)} pages of text from PDF")
+ logger.info(f"Extracted {len(pages)} pages of text from PDF: {path}")
except Exception as e:
# Catch specific PyPDF2 errors if possible, otherwise general Exception
- return f"Error reading PDF: {e}"
+ raise PDFExtractionError(f"Error reading PDF {path}: {e}") from e
# Clean up temporary file if one was created
if source.lower().startswith(("http://", "https://")):
@@ -67,6 +75,14 @@
str: A plain-text string containing:
• A “Summary:” section with bullet points.
• A “Facts:” section with bullet points.
+ """
+ # Load prompt from file ideally
+ prompt_template = """You are an expert analyst.
+
+ Please analyze the following text and produce a plain-text response
+ with two sections:
+
+ Summary:
+ • Provide 2–3 concise bullet points summarizing the main ideas.
+
+ Facts:
+ • List each verifiable fact found in the text as a bullet point.
+
+ Respond with exactly that format—no JSON, no extra commentary.
+
+ Text to analyze:
+ \"\"\"
+ {text}
+ \"\"\"
"""
# Build the prompt to guide the LLM’s output format
input_prompt = f"""You are an expert analyst.
@@ -84,13 +100,14 @@
-    {text}
-    \"\"\"
-    """
+ input_prompt = prompt_template.format(text=text)
# Use the LLM to generate the analysis
+ llm_model_name = os.getenv("TEXT_ANALYZER_LLM_MODEL", "models/gemini-1.5-pro")
llm = GoogleGenAI(
api_key=os.getenv("GEMINI_API_KEY"),
- model="models/gemini-1.5-pro",
+ model=llm_model_name,
)
generated = llm.complete(input_prompt)
@@ -124,9 +141,10 @@
FunctionAgent: Configured analysis agent.
"""
+ llm_model_name = os.getenv("TEXT_ANALYZER_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
llm = GoogleGenAI(
api_key=os.getenv("GEMINI_API_KEY"),
- model="models/gemini-1.5-pro",
+ model=llm_model_name,
)
system_prompt = """\
```
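* **Usage Sketch (Hypothetical - Exception Handling):** With the custom exceptions above, callers can branch on the failure mode instead of parsing error strings. A minimal sketch, assuming the refactored `extract_text_from_pdf` and the exception classes defined in the diff:
```python
# Minimal sketch: structured handling of the custom PDF exceptions.
# extract_text_from_pdf, PDFDownloadError, and PDFExtractionError are the
# names defined in the diff above.
import logging

logger = logging.getLogger(__name__)


def safe_extract(source: str) -> str | None:
    try:
        return extract_text_from_pdf(source)
    except PDFDownloadError as e:
        # Network failure: the agent could retry or try an alternate URL.
        logger.warning(f"Download failed, may retry: {e}")
        return None
    except PDFExtractionError as e:
        # Parsing/IO failure: retrying the same file is unlikely to help.
        logger.error(f"Extraction failed: {e}")
        return None
```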
### 3.7. `reasoning_agent.py` Refactoring
* **Rationale:** To simplify the agent structure, improve configuration, and potentially optimize LLM usage.
* **Proposals:**
1. **Configuration:** Move hardcoded LLM model names (`models/gemini-1.5-pro`, `o4-mini`) and the API key environment variable name (`ALPAFLOW_OPENAI_API_KEY`) to configuration.
2. **Prompt Management:** Move the detailed CoT prompt from `reasoning_tool_fn` to a separate template file.
3. **Agent Structure Simplification:** Given the rigid workflow (call tool -> hand off), consider replacing the `ReActAgent` with a simpler `FunctionAgent` that directly calls the `reasoning_tool` and formats the output before handing off (a minimal sketch follows the diff patch below). Alternatively, evaluate whether the `reasoning_tool` logic could be integrated as a direct LLM call within agents that need CoT (like `planner_agent`), removing the need for a separate `reasoning_agent` altogether, unless its specific CoT prompt/model (`o4-mini`) is crucial.
* **Diff Patch (Illustrative - Configuration & Prompt Loading):**
```diff
--- a/reasoning_agent.py
+++ b/reasoning_agent.py
@@ -1,10 +1,19 @@
import os
+ import logging
from llama_index.core.agent.workflow import ReActAgent
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
+ logger = logging.getLogger(__name__)
+
+ def load_prompt_from_file(filename="reasoning_tool_prompt.txt") -> str:
+ try:
+ with open(filename, "r") as f:
+ return f.read()
+ except FileNotFoundError:
+ logger.error(f"Prompt file {filename} not found.")
+ return "Perform chain-of-thought reasoning on the context: {context}"
+
def reasoning_tool_fn(context: str) -> str:
"""
Perform end-to-end chain-of-thought reasoning over the full multi-agent workflow context,
@@ -17,45 +26,12 @@
str: A structured reasoning trace with numbered thought steps, intermediate checks,
and a concise final recommendation or conclusion.
"""
- prompt = f"""You are an expert reasoning engine. You have the following full context of a multi-agent workflow:
-
- {context}
-
- Your job is to:
- 1. **Comprehension**
- - Read the entire question or problem statement carefully.
- - Identify key terms, constraints, and desired outcomes.
-
- 2. **Decomposition**
- - Break down the problem into logical sub-steps or sub-questions.
- - Ensure each sub-step is necessary and sufficient to progress toward a solution.
-
- 3. **Chain-of-Thought**
- - Articulate your internal reasoning in clear, numbered steps.
- - At each step, state your assumptions, derive implications, and check for consistency.
-
- 4. **Intermediate Verification**
- - After each reasoning step, validate your conclusion against the problem’s constraints.
- - If a contradiction or uncertainty arises, revisit and refine the previous step.
-
- 5. **Synthesis**
- - Once all sub-steps are resolved, integrate the intermediate results into a cohesive answer.
- - Ensure the final answer directly addresses the user’s request and all specified criteria.
-
- 6. **Clarity & Precision**
- - Use formal, precise language.
- - Avoid ambiguity: define any technical terms you introduce.
- - Provide just enough detail to justify each conclusion without digression.
-
- 7. **Final Answer**
- - Present a concise, well-structured response.
- - If appropriate, include a brief summary of your reasoning steps.
-
- Respond with your reasoning steps followed by the final recommendation.
- """
+ prompt_template = load_prompt_from_file()
+ prompt = prompt_template.format(context=context)
+ reasoning_llm_model = os.getenv("REASONING_TOOL_LLM_MODEL", "o4-mini")
+ # Use specific API key if needed, e.g., ALPAFLOW_OPENAI_API_KEY
+ reasoning_api_key_env = os.getenv("REASONING_TOOL_API_KEY_ENV", "ALPAFLOW_OPENAI_API_KEY")
+ reasoning_api_key = os.getenv(reasoning_api_key_env)
llm = OpenAI(
- model="o4-mini",
- api_key=os.getenv("ALPAFLOW_OPENAI_API_KEY"),
+ model=reasoning_llm_model,
+ api_key=reasoning_api_key,
reasoning_effort="high"
)
response = llm.complete(prompt)
@@ -74,9 +50,10 @@
"""
Create a pure reasoning agent with no tools, relying solely on chain-of-thought.
"""
+ agent_llm_model = os.getenv("REASONING_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
llm = GoogleGenAI(
api_key=os.getenv("GEMINI_API_KEY"),
- model="models/gemini-1.5-pro",
+ model=agent_llm_model,
)
system_prompt = """\
```
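* **Illustrative Sketch (Hypothetical - FunctionAgent Simplification):** A sketch of proposal 3, assuming `FunctionAgent` accepts the same keyword arguments as the `ReActAgent` constructions elsewhere in this codebase (`name`, `description`, `system_prompt`, `tools`, `can_handoff_to`); verify against the installed llama-index version.
```python
# Hypothetical sketch: a simpler FunctionAgent that always routes through
# reasoning_tool (the FunctionTool wrapping reasoning_tool_fn above).
import os

from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.google_genai import GoogleGenAI


def initialize_reasoning_agent() -> FunctionAgent:
    llm = GoogleGenAI(
        api_key=os.getenv("GEMINI_API_KEY"),
        model=os.getenv("REASONING_AGENT_LLM_MODEL", "models/gemini-1.5-pro"),
    )
    return FunctionAgent(
        name="reasoning_agent",
        description="Performs chain-of-thought reasoning over workflow context.",
        system_prompt="Call reasoning_tool on the given context, then hand off to planner_agent.",
        llm=llm,
        tools=[reasoning_tool],
        can_handoff_to=["planner_agent"],
    )
```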
### 3.8. `planner_agent.py` Refactoring
* **Rationale:** To improve configuration management and prompt handling.
* **Proposals:**
1. **Configuration:** Move the hardcoded LLM model name (`models/gemini-1.5-pro`) to environment variables or a configuration file.
2. **Prompt Management:** Move the system prompt and the prompts within the `plan` and `synthesize_and_respond` functions to separate template files for better readability and maintainability.
* **Diff Patch (Illustrative - Configuration & Prompt Loading):**
```diff
--- a/planner_agent.py
+++ b/planner_agent.py
@@ -1,10 +1,19 @@
import os
+ import logging
from typing import List, Any
from llama_index.core.agent.workflow import FunctionAgent, ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.google_genai import GoogleGenAI
+ logger = logging.getLogger(__name__)
+
+ def load_prompt_from_file(filename: str, default_prompt: str) -> str:
+ try:
+ with open(filename, "r") as f:
+ return f.read()
+ except FileNotFoundError:
+ logger.warning(f"Prompt file {filename} not found. Using default.")
+ return default_prompt
+
def plan(objective: str) -> List[str]:
"""
Generate a list of sub-questions from the given objective.
@@ -15,14 +24,16 @@
Returns:
List[str]: A list of sub-steps as strings.
"""
- input_prompt: str = (
+ default_plan_prompt = (
"You are a research assistant. "
"Given an objective, break it down into a list of concise, actionable sub-steps.\n"
f"Objective: {objective}\n"
"Sub-steps (one per line):"
)
+ plan_prompt_template = load_prompt_from_file("planner_plan_prompt.txt", default_plan_prompt)
+ input_prompt = plan_prompt_template.format(objective=objective)
+ llm_model_name = os.getenv("PLANNER_TOOL_LLM_MODEL", "models/gemini-1.5-pro")
llm = GoogleGenAI(
api_key=os.getenv("GEMINI_API_KEY"),
- model="models/gemini-1.5-pro",
+ model=llm_model_name,
)
@@ -44,13 +55,16 @@
Returns:
str: A unified, well-structured response addressing the original objective.
"""
- # Join each ready-made QA block directly
summary_blocks = "\n".join(results)
- input_prompt = f"""You are an expert synthesizer. Given the following sub-questions and their answers,
+ default_synth_prompt = f"""You are an expert synthesizer. Given the following sub-questions and their answers,
produce a single, coherent, comprehensive report that addresses the original objective:
{summary_blocks}
Final Report:
"""
+ synth_prompt_template = load_prompt_from_file("planner_synthesize_prompt.txt", default_synth_prompt)
+ input_prompt = synth_prompt_template.format(summary_blocks=summary_blocks)
+
+ llm_model_name = os.getenv("PLANNER_TOOL_LLM_MODEL", "models/gemini-1.5-pro") # Can use same model as plan
llm = GoogleGenAI(
api_key=os.getenv("GEMINI_API_KEY"),
- model="models/gemini-1.5-pro",
+ model=llm_model_name,
)
response = llm.complete(input_prompt)
return response.text
@@ -77,9 +91,10 @@
"""
Initialize a LlamaIndex agent specialized in research planning and question engineering.
"""
+ agent_llm_model = os.getenv("PLANNER_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
llm = GoogleGenAI(
api_key=os.getenv("GEMINI_API_KEY"),
- model="models/gemini-1.5-pro",
+ model=agent_llm_model,
)
system_prompt = """\
@@ -108,6 +123,7 @@
**Completion & Synthesis**
If the final result fully completes the original objective, produce a consolidated synthesis of the roadmap and send it as your concluding output.
"""
+ system_prompt = load_prompt_from_file("planner_system_prompt.txt", system_prompt) # Load from file if exists
agent = ReActAgent(
name="planner_agent",
```
### 3.9. `code_agent.py` Refactoring
* **Rationale:** To address the critical security vulnerability of the `SimpleCodeExecutor`, improve configuration management, and align code execution with safer practices.
* **Proposals:**
1. **Remove `SimpleCodeExecutor`:** This class and its `execute` method using `subprocess` with raw code strings are fundamentally insecure and **must be removed entirely**.
2. **Use `CodeInterpreterToolSpec`:** Rely *exclusively* on the `code_interpreter` tool derived from LlamaIndex's `CodeInterpreterToolSpec` for code execution. This tool is designed for safer, sandboxed execution (a minimal usage sketch follows the diff patch below).
3. **Update `CodeActAgent` Initialization:** Remove the `code_execute_fn` parameter when initializing `CodeActAgent`, as the agent should use the provided `code_interpreter` tool for execution via the standard ReAct/Act loop, not a direct execution function.
4. **Configuration:** Move hardcoded LLM model names (`o4-mini`, `models/gemini-1.5-pro`) and the API key environment variable name (`ALPAFLOW_OPENAI_API_KEY`) to configuration.
5. **Prompt Management:** Move the `generate_python_code` prompt to a separate template file.
* **Diff Patch (Illustrative - Security Fix & Configuration):**
```diff
--- a/code_agent.py
+++ b/code_agent.py
@@ -1,5 +1,6 @@
import os
import subprocess
+ import logging
from llama_index.core.agent.workflow import ReActAgent, CodeActAgent
from llama_index.core.tools import FunctionTool
@@ -7,6 +8,16 @@
from llama_index.llms.openai import OpenAI
from llama_index.tools.code_interpreter import CodeInterpreterToolSpec
+ logger = logging.getLogger(__name__)
+
+ def load_prompt_from_file(filename: str, default_prompt: str) -> str:
+ try:
+ with open(filename, "r") as f:
+ return f.read()
+ except FileNotFoundError:
+ logger.warning(f"Prompt file {filename} not found. Using default.")
+ return default_prompt
+
def generate_python_code(prompt: str) -> str:
"""
Generate valid Python code from a natural language description.
@@ -27,7 +38,7 @@
it before execution.
- This function only generates code and does not execute it.
"""
-
- input_prompt = f"""You are also a helpful assistant that writes Python code.
+ default_gen_prompt = """You are also a helpful assistant that writes Python code.
You will be given a prompt and you must generate Python code based on that prompt.
You must only generate Python code and nothing else.
Do not include any explanations or any other text.
@@ -40,10 +51,14 @@
Code:\n
"""
+ gen_prompt_template = load_prompt_from_file("code_gen_prompt.txt", default_gen_prompt)
+ input_prompt = gen_prompt_template.format(prompt=prompt)
+
+ gen_llm_model = os.getenv("CODE_GEN_LLM_MODEL", "o4-mini")
+ gen_api_key_env = os.getenv("CODE_GEN_API_KEY_ENV", "ALPAFLOW_OPENAI_API_KEY")
+ gen_api_key = os.getenv(gen_api_key_env)
llm = OpenAI(
- model="o4-mini",
- api_key=os.getenv("ALPAFLOW_OPENAI_API_KEY")
+ model=gen_llm_model,
+ api_key=gen_api_key
)
generated_code = llm.complete(input_prompt)
@@ -74,60 +89,11 @@
),
)
-from typing import Any, Dict, Tuple
-import io
-import contextlib
-import ast
-import traceback
-
-
-class SimpleCodeExecutor:
- """
- A simple code executor that runs Python code with state persistence.
-
- This executor maintains a global and local state between executions,
- allowing for variables to persist across multiple code runs.
-
- NOTE: not safe for production use! Use with caution.
- """
-
- def __init__(self):
- pass
-
- def execute(self, code: str) -> str:
- """
- Execute Python code and capture output and return values.
-
- Args:
- code: Python code to execute
-
- Returns:
- Dict with keys `success`, `output`, and `return_value`
- """
- print(f"Executing code: {code}")
- try:
- result = subprocess.run(
- ["python", code],
- stdout=subprocess.PIPE,
- stderr=subprocess.PIPE,
- text=True,
- timeout=60
- )
- if result.returncode != 0:
- print(f"Execution failed with error: {result.stderr.strip()}")
- return f"Error: {result.stderr.strip()}"
- else:
- output = result.stdout.strip()
- print(f"Captured Output: {output}")
- return output
- except subprocess.TimeoutExpired:
- print("Execution timed out.")
- return "Error: Timeout"
- except Exception as e:
- print(f"Execution failed with error: {e}")
- return f"Error: {e}"
-
def initialize_code_agent() -> CodeActAgent:
- code_executor = SimpleCodeExecutor()
+ # DO NOT USE SimpleCodeExecutor - it is insecure.
+ # Rely on the code_interpreter tool provided below.
+ agent_llm_model = os.getenv("CODE_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
llm = GoogleGenAI(
api_key=os.getenv("GEMINI_API_KEY"),
- model="models/gemini-1.5-pro",
+ model=agent_llm_model,
)
system_prompt = """\
@@ -151,6 +117,7 @@
- If further logical reasoning or verification is needed, delegate to **reasoning_agent**.
- Otherwise, once you have the final code or execution result, pass your output to **planner_agent** for overall synthesis and presentation.
"""
+ system_prompt = load_prompt_from_file("code_agent_system_prompt.txt", system_prompt)
agent = CodeActAgent(
name="code_agent",
@@ -161,7 +128,7 @@
"pipelines, and library development, CodeAgent delivers production-ready Python solutions."
),
+        # REMOVED: code_execute_fn; the agent uses the code_interpreter tool instead
- code_execute_fn=code_executor.execute,
tools=[
python_code_generator_tool,
code_interpreter_tool,
```
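* **Usage Sketch (Hypothetical - CodeInterpreterToolSpec):** A minimal sketch of proposals 2 and 3: source all execution tooling from the tool spec and drop `code_execute_fn`. Verify the sandboxing guarantees of the installed `llama-index-tools-code-interpreter` version before relying on it.
```python
# Minimal sketch: derive execution tools from CodeInterpreterToolSpec.
from llama_index.tools.code_interpreter import CodeInterpreterToolSpec

code_spec = CodeInterpreterToolSpec()
code_interpreter_tools = code_spec.to_tool_list()

# initialize_code_agent() then passes these tools to CodeActAgent with no
# code_execute_fn argument:
#   tools=[python_code_generator_tool, *code_interpreter_tools]
```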
### 3.10. `math_agent.py` Refactoring
* **Rationale:** To improve configuration management and potentially simplify the tool interface for the LLM.
* **Proposals:**
1. **Configuration:** Move the hardcoded agent LLM model name (`models/gemini-1.5-pro`) to configuration. Ensure the WolframAlpha App ID is configured via environment variable (`WOLFRAM_ALPHA_APP_ID`) as intended.
2. **Tool Granularity:** The current approach creates a separate tool for almost every single math function (solve, derivative, integral, add, multiply, inverse, mean, median, etc.). While explicit, this results in a very large number of tools for the `ReActAgent` to manage. Consider:
* **Grouping:** Group related functions under fewer tools. For example, a `symbolic_math_tool` that takes the operation type (solve, diff, integrate) as a parameter, or a `matrix_ops_tool` (see the sketch after the diff patch below).
* **Natural Language Interface:** Create a single `calculate` tool that takes a natural language math query (e.g., "solve x**2 - 4 = 0 for x", "mean of [1, 2, 3]") and uses an LLM (or rule-based parsing) internally to dispatch to the appropriate NumPy/SciPy/SymPy function. This simplifies the interface for the main agent LLM but adds complexity within the tool.
* **WolframAlpha Prioritization:** Evaluate if WolframAlpha can handle many of these requests directly, potentially reducing the need for numerous specific SymPy/NumPy tools, especially for symbolic tasks.
3. **Truncated File:** Since the original file was truncated, ensure the full file is reviewed if possible, as there might be other issues or tools not seen.
* **Diff Patch (Illustrative - Configuration):**
```diff
--- a/math_agent.py
+++ b/math_agent.py
@@ -1,5 +1,6 @@
import os
from typing import List, Optional, Union
+ import logging
import sympy as sp
import numpy as np
from llama_index.core.agent.workflow import ReActAgent
@@ -12,6 +13,8 @@
from scipy.integrate import odeint
import numpy.fft as fft
+ logger = logging.getLogger(__name__)
+
# --- Symbolic math functions ---
@@ -451,10 +454,11 @@
def initialize_math_agent() -> ReActAgent:
+ agent_llm_model = os.getenv("MATH_AGENT_LLM_MODEL", "models/gemini-1.5-pro")
llm = GoogleGenAI(
api_key=os.getenv("GEMINI_API_KEY"),
- model="models/gemini-1.5-pro",
+ model=agent_llm_model,
)
# Ensure WolframAlpha App ID is set
```
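* **Illustrative Sketch (Hypothetical - Grouped Symbolic Tool):** A minimal sketch of the grouping option from proposal 2: one tool that dispatches on an operation name instead of one tool per SymPy function. The function name and operation set are illustrative.
```python
# Hypothetical sketch: a single symbolic-math tool dispatching on operation.
import sympy as sp


def symbolic_math(operation: str, expression: str, symbol: str = "x") -> str:
    """Apply a symbolic operation ('solve', 'diff', or 'integrate') to an expression."""
    x = sp.Symbol(symbol)
    expr = sp.sympify(expression)
    if operation == "solve":
        return str(sp.solve(expr, x))
    if operation == "diff":
        return str(sp.diff(expr, x))
    if operation == "integrate":
        return str(sp.integrate(expr, x))
    return f"Unknown operation: {operation}"


# Example: symbolic_math("solve", "x**2 - 4") returns "[-2, 2]"
```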
*(Refactoring proposals section complete)*
## 4. New Feature Designs
This section outlines the design for the new features requested: YouTube Ingestion and Generic Audio Transcription.
### 4.1. YouTube Ingestion
* **Rationale:** To enable the framework to process YouTube videos by extracting audio, transcribing it, and summarizing the content, as requested by the user.
* **Design Proposal:**
* **Implementation:** Introduce a new dedicated agent, `youtube_agent`, or add tools to the existing `research_agent` or `text_analyzer_agent`. A dedicated agent seems cleaner given the specific multi-step workflow.
* **Agent (`youtube_agent`):**
* **Purpose:** Manages the end-to-end process of downloading YouTube audio, chunking, transcribing, and summarizing.
* **Tools:**
1. `download_youtube_audio`: Takes a YouTube URL, uses a library like `yt-dlp` (or potentially `pytube`) to download the audio stream into a temporary file (e.g., `.mp3` or `.opus`). Returns the path to the audio file.
2. `chunk_audio_file`: Takes an audio file path and a maximum chunk duration (e.g., 60 seconds). Uses a library like `pydub` or `librosa`+`soundfile` to split the audio into smaller, sequentially numbered temporary files. Returns a list of chunk file paths.
3. `transcribe_audio_chunk_gemini`: Takes an audio file path (representing a chunk). Uses the Google Generative AI SDK (`google.generativeai`) to call the Gemini 1.5 Pro model with the audio file for transcription. Returns the transcribed text.
4. `summarize_transcript`: Takes the full concatenated transcript text. Uses a Gemini model (e.g., 1.5 Pro or Flash) with a specific prompt to generate a one-paragraph summary. Returns the summary text.
* **Workflow (ReAct or Function sequence):**
1. Receive YouTube URL.
2. Call `download_youtube_audio`.
3. Call `chunk_audio_file` with the downloaded audio path.
4. Iterate through the list of chunk paths:
* Call `transcribe_audio_chunk_gemini` for each chunk.
* Collect transcribed text segments.
5. Concatenate all transcribed text segments into a full transcript.
6. Call `summarize_transcript` with the full transcript.
7. Return the full transcript and the summary.
8. Clean up temporary audio files (downloaded and chunks).
* **Handoff:** Could hand off the transcript and summary to `planner_agent` or `text_analyzer_agent` for further processing or integration.
* **Dependencies:** `yt-dlp`, `pydub` (requires `ffmpeg` or `libav`), `google-generativeai`.
* **Configuration:** Gemini API Key, chunk duration.
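* **Illustrative Sketch (Hypothetical - Download & Chunking Tools):** A minimal sketch of the first two tools, assuming `yt-dlp` and `pydub` (with `ffmpeg` installed); output paths and option values are illustrative.
```python
# Hypothetical sketch of download_youtube_audio and chunk_audio_file.
import os
import tempfile

import yt_dlp
from pydub import AudioSegment


def download_youtube_audio(url: str) -> str:
    """Download a video's audio stream as MP3; return the file path."""
    out_template = os.path.join(tempfile.gettempdir(), "yt_audio.%(ext)s")
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": out_template,
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "mp3"},
        ],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])
    return os.path.join(tempfile.gettempdir(), "yt_audio.mp3")


def chunk_audio_file(path: str, chunk_seconds: int = 60) -> list[str]:
    """Split an audio file into sequentially numbered chunk files."""
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_seconds * 1000
    chunk_paths = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = f"{path}.chunk{i:03d}.mp3"
        audio[start:start + chunk_ms].export(chunk_path, format="mp3")
        chunk_paths.append(chunk_path)
    return chunk_paths
```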
### 4.2. Generic Audio Transcription
* **Rationale:** To provide flexible audio transcription for local files or remote URLs, using Gemini Pro when its latency is acceptable and Whisper.cpp as a local fallback, exposed via a Python API as requested.
* **Design Proposal:**
* **Implementation:** Introduce a new dedicated agent, `transcription_agent`, or add tools to `text_analyzer_agent`. A dedicated agent allows for clearer separation of concerns, especially managing the Whisper.cpp dependency and logic.
* **Agent (`transcription_agent`):**
* **Purpose:** Transcribes audio from various sources (local path, URL) using either Gemini or Whisper.cpp based on latency requirements or availability.
* **Tools:**
1. `prepare_audio_source`: Takes a source string (URL or local path). If it's a URL, downloads it to a temporary file using `requests`. Validates the local file path. Returns the path to the local audio file.
2. `transcribe_gemini`: Takes an audio file path. Uses the `google-generativeai` SDK to call Gemini 1.5 Pro for transcription. Returns the transcribed text. This is the preferred method when latency is acceptable.
3. `transcribe_whisper_cpp`: Takes an audio file path. Uses a Python wrapper around `whisper.cpp` (e.g., installing `whisper.cpp` via `apt` or compiling from source, then using `subprocess` or a dedicated Python binding if available) to perform local transcription. Returns the transcribed text. This is the fallback or low-latency option.
4. `choose_transcription_method`: (Internal logic or a simple tool) Takes latency preference (e.g., 'high_quality' vs 'low_latency') or checks Gemini availability/quota. Decides whether to use `transcribe_gemini` or `transcribe_whisper_cpp`.
* **Workflow (ReAct or Function sequence):**
1. Receive audio source (URL/path) and potentially a latency preference.
2. Call `prepare_audio_source` to get a local file path.
3. Call `choose_transcription_method` (or execute internal logic) to decide between Gemini and Whisper.
4. If Gemini: Call `transcribe_gemini`.
5. If Whisper: Call `transcribe_whisper_cpp`.
6. Return the resulting transcript.
7. Clean up temporary downloaded audio file if applicable.
* **Handoff:** Could hand off the transcript to `planner_agent` or `text_analyzer_agent`.
* **Python API:**
* Define a simple Python function (e.g., in a `transcription_api.py` module) that encapsulates the agent's logic or directly calls the underlying transcription functions.
```python
# Example API function in transcription_api.py
from .transcription_agent import transcribe_audio  # assuming agent logic is refactored


class TranscriptionError(Exception):
    """Raised when transcription fails."""
    pass


def get_transcript(source: str, prefer_gemini: bool = True) -> str:
    """Transcribes audio from a local path or URL.

    Args:
        source: Path to the local audio file or URL.
        prefer_gemini: If True, attempts to use Gemini Pro first.
            If False or Gemini fails, falls back to Whisper.cpp.

    Returns:
        The transcribed text.

    Raises:
        TranscriptionError: If transcription fails.
    """
    # Simplified logic: the actual implementation needs error handling
    # and Gemini/Whisper selection based on preference/availability.
    try:
        return transcribe_audio(source, prefer_gemini)
    except Exception as e:
        # Log the error before surfacing it to callers.
        raise TranscriptionError(f"Failed to transcribe {source}: {e}") from e
```
* **Dependencies:** `requests`, `google-generativeai`, `whisper.cpp` (requires separate installation/compilation), potentially Python bindings for `whisper.cpp`.
* **Configuration:** Gemini API Key, path to `whisper.cpp` executable or library, Whisper model selection.
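* **Illustrative Sketch (Hypothetical - Whisper.cpp Fallback):** A minimal sketch of `transcribe_whisper_cpp` via `subprocess`, assuming a compiled whisper.cpp binary and model whose locations come from environment variables; the flags (`-m`, `-f`, `-otxt`) match common whisper.cpp builds but should be verified against the installed version.
```python
# Hypothetical sketch: local whisper.cpp transcription via subprocess.
import os
import subprocess


def transcribe_whisper_cpp(audio_path: str) -> str:
    """Transcribe a 16 kHz WAV file with a local whisper.cpp binary."""
    binary = os.getenv("WHISPER_CPP_BIN", "./main")
    model = os.getenv("WHISPER_CPP_MODEL", "models/ggml-base.en.bin")
    # -otxt writes the transcript next to the input as <audio_path>.txt
    subprocess.run(
        [binary, "-m", model, "-f", audio_path, "-otxt"],
        check=True,
        capture_output=True,
        timeout=600,
    )
    with open(f"{audio_path}.txt", "r") as f:
        return f.read().strip()
```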
## 5. Extra Agent Designs
This section proposes three additional specialized agents designed to enhance performance on the GAIA benchmark by addressing common challenges like complex fact verification, interpreting visual data representations, and handling long contexts.
### 5.1. Agent Design 1: Advanced Validation Agent (`validation_agent`)
* **Purpose:** To perform rigorous validation of factual claims or intermediate results generated by other agents, going beyond the simple contradiction check of the current `verifier_agent`. This agent aims to improve the accuracy and trustworthiness of the final answer by cross-referencing information and performing checks.
* **Key Tool Calls:**
* `web_search` (from `research_agent` or similar): To find external evidence supporting or refuting a claim.
* `browse_and_extract` (from `research_agent` or similar): To access specific URLs found during search and extract relevant text snippets.
* `code_interpreter` (from `code_agent`): To perform calculations or simple data manipulations needed for verification (e.g., checking unit conversions, calculating percentages).
* `knowledge_base_lookup` (New Tool - Optional): Interface with a structured knowledge base (e.g., Wikidata, internal DB) to verify entities, relationships, or properties.
* `llm_check_consistency` (New Tool or LLM call): Use a powerful LLM with a specific prompt to assess the logical consistency between a claim and a set of provided evidence snippets or existing context.
* **Agent Loop Sketch (ReAct style):**
1. **Input:** A specific claim or statement to validate, along with relevant context or source information.
2. **Thought:** Identify the core assertion in the claim. Determine the best validation strategy (e.g., web search for current events, calculation for numerical claims, consistency check for logical statements).
3. **Action:** Call the appropriate tool (`web_search`, `code_interpreter`, `llm_check_consistency`).
4. **Observation:** Analyze the tool's output (search results, calculation result, consistency assessment).
5. **Thought:** Does the observation confirm, refute, or remain inconclusive about the claim? Is more information needed? (e.g., need to browse a specific search result).
6. **Action (if needed):** Call another tool (`browse_and_extract`, `llm_check_consistency` with new evidence).
7. **Observation:** Analyze new output.
8. **Thought:** Synthesize findings. Assign a final validation status (e.g., Confirmed, Refuted, Uncertain) and provide supporting evidence or reasoning.
9. **Output:** Validation status and justification.
10. **Handoff:** Return result to `planner_agent` or `verifier_agent` (if this agent replaces the contradiction part).
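* **Illustrative Tool Sketch (Hypothetical - `llm_check_consistency`):** A minimal sketch of the proposed consistency-check tool as a single LLM call; the prompt wording and `VALIDATION_LLM_MODEL` variable are illustrative assumptions.
```python
# Hypothetical sketch: classify a claim against evidence snippets.
import os

from llama_index.llms.google_genai import GoogleGenAI


def llm_check_consistency(claim: str, evidence: list[str]) -> str:
    """Return 'Confirmed', 'Refuted', or 'Uncertain' plus a short rationale."""
    evidence_block = "\n".join(f"- {snippet}" for snippet in evidence)
    prompt = (
        "Assess whether the evidence supports the claim.\n"
        f"Claim: {claim}\n"
        f"Evidence:\n{evidence_block}\n"
        "Answer with one of Confirmed/Refuted/Uncertain, followed by one "
        "sentence of justification."
    )
    llm = GoogleGenAI(
        api_key=os.getenv("GEMINI_API_KEY"),
        model=os.getenv("VALIDATION_LLM_MODEL", "models/gemini-1.5-pro"),
    )
    return llm.complete(prompt).text
```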
### 5.2. Agent Design 2: Figure Interpretation Agent (`figure_interpretation_agent`)
* **Purpose:** To specialize in extracting structured data and meaning from figures, charts, graphs, and tables embedded within images or documents, which are common in GAIA tasks and often require more than just a textual description.
* **Key Tool Calls:**
* `image_ocr` (New Tool or enhanced `image_analyzer_agent` capability): High-precision OCR focused on extracting text specifically from figures, including axes labels, legends, titles, and data points.
* `chart_data_extractor` (New Tool): Utilizes specialized vision models (e.g., DePlot, ChartOCR, or similar fine-tuned models) designed to parse chart types (bar, line, pie) and extract underlying data series or key values.
* `table_parser` (New Tool): Uses vision or document AI models to detect table structures in images/PDFs and extract cell content into a structured format (e.g., list of lists, Pandas DataFrame via code execution).
* `code_interpreter` (from `code_agent`): To process extracted data (e.g., load into DataFrame, perform simple analysis, re-plot for verification).
* `llm_interpret_figure` (New Tool or LLM call): Takes extracted text, data, and potentially the image itself (multimodal) to provide a semantic interpretation of the figure's message or trends.
* **Agent Loop Sketch (Function sequence or ReAct):**
1. **Input:** An image or document page containing a figure/table, potentially with context or a specific question about it.
2. **Action:** Call `image_ocr` to get all text elements.
3. **Action:** Call `chart_data_extractor` or `table_parser` based on visual analysis (or try both) to get structured data.
4. **Action (Optional):** Call `code_interpreter` to load structured data into a DataFrame for easier handling.
5. **Action:** Call `llm_interpret_figure`, providing the extracted text, data (raw or DataFrame), and potentially the original image, asking it to answer the specific question or summarize the figure's key insights.
6. **Output:** Structured data (if requested) and/or the semantic interpretation/answer.
7. **Handoff:** Return results to `planner_agent` or `reasoning_agent`.
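* **Illustrative Sketch (Hypothetical - Structured Data Handling):** A minimal sketch of step 4: loading hypothetical `chart_data_extractor` output into a DataFrame for sanity checks before interpretation. The input format is an assumption; real extractor models vary.
```python
# Hypothetical sketch: post-process extracted chart data with pandas.
import pandas as pd

# Assume chart_data_extractor returns series as {label: [(x, y), ...]}.
extracted = {"Revenue": [(2021, 3.1), (2022, 4.0), (2023, 5.2)]}

frames = []
for label, points in extracted.items():
    df = pd.DataFrame(points, columns=["x", "y"])
    df["series"] = label
    frames.append(df)
data = pd.concat(frames, ignore_index=True)

# Simple checks the agent could run before asking the LLM to interpret:
print(data.groupby("series")["y"].agg(["min", "max", "mean"]))
```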
### 5.3. Agent Design 3: Long Context Management Agent (`long_context_agent`)
* **Purpose:** To effectively manage and query information from very long documents or conversation histories that exceed the context window limits of standard models or require efficient information retrieval techniques.
* **Key Tool Calls:**
* `document_chunker` (New Tool): Splits long text into semantically meaningful chunks (e.g., using `SentenceSplitter` from LlamaIndex or more advanced methods).
* `vector_store_builder` (New Tool): Takes text chunks and builds an in-memory or persistent vector index (using libraries like `llama-index`, `langchain`, `faiss`, `chromadb`).
* `vector_retriever` (New Tool): Queries the built vector index with a specific question to find the most relevant chunks.
* `summarizer_tool` (New Tool or LLM call): Generates summaries of long text or selected chunks, potentially using different levels of detail.
* `contextual_synthesizer` (New Tool or LLM call): Takes retrieved relevant chunks and the original query, then uses an LLM to synthesize an answer grounded in the retrieved context (RAG pattern).
* **Agent Loop Sketch (Can be stateful):**
1. **Input:** A long document (text or path) or a long conversation history, and a specific query or task related to it.
2. **(Initialization/First Use):**
* **Action:** Call `document_chunker`.
* **Action:** Call `vector_store_builder` to create an index from the chunks. Store the index reference.
3. **(Querying):**
* **Action:** Call `vector_retriever` with the user's query to get relevant chunks.
* **Action:** Call `contextual_synthesizer`, providing the query and retrieved chunks, to generate the final answer.
4. **(Alternative: Summarization Task):**
* **Action:** Call `summarizer_tool` on the full text (if feasible for the tool) or on retrieved chunks based on a high-level query.
5. **Output:** The synthesized answer or the summary.
6. **Handoff:** Return results to `planner_agent`.
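* **Illustrative Sketch (Hypothetical - RAG Core):** A minimal sketch of the chunk/index/retrieve/synthesize loop using LlamaIndex's in-memory vector index (API names per llama-index >= 0.10; an embedding model must be configured, e.g. via `Settings.embed_model`).
```python
# Hypothetical sketch: core RAG loop for long_context_agent.
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter


def build_index(long_text: str) -> VectorStoreIndex:
    """Chunk the document and build an in-memory vector index."""
    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
    nodes = splitter.get_nodes_from_documents([Document(text=long_text)])
    return VectorStoreIndex(nodes)


def query_long_document(index: VectorStoreIndex, question: str) -> str:
    """Retrieve relevant chunks and synthesize a grounded answer."""
    query_engine = index.as_query_engine(similarity_top_k=4)
    return str(query_engine.query(question))
```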
## 6. Migration Plan
This section details the recommended steps for applying the proposed changes, lists new dependencies, and outlines minimal validation tests.
### 6.1. Order of Implementation
It is recommended to apply changes in the following order to minimize disruption and build upon stable foundations:
1. **Core Refactoring (`app.py`, Configuration, Logging):**
* Implement centralized configuration (e.g., `.env` file) and update all agents to use it for API keys, model names, etc. (a minimal `config.py` sketch follows this list).
* Integrate Python's `logging` module throughout `app.py` and all agent files, replacing `print` statements.
* Refactor `app.py`: Implement singleton agent initialization and break down `run_and_submit_all`.
* Apply structural refactors to agents (class-based structure, avoiding globals) like `role_agent`, `verifier_agent`, `research_agent`.
2. **Critical Security Fix (`code_agent`):**
* Immediately remove the `SimpleCodeExecutor` and modify `code_agent` to rely solely on the `code_interpreter` tool.
3. **Core Functionality Refactoring (`verifier_agent`, `math_agent`):**
* Improve `verifier_agent`'s contradiction detection (e.g., using an LLM or NLI model).
* Refactor `math_agent` tools if choosing to group them or use a natural language interface.
4. **New Feature: Generic Audio Transcription (`transcription_agent`):**
* Install `whisper.cpp` and its dependencies.
* Implement the `transcription_agent` and its tools (`prepare_audio_source`, `transcribe_gemini`, `transcribe_whisper_cpp`).
* Implement the Python API function `get_transcript`.
5. **New Feature: YouTube Ingestion (`youtube_agent`):**
* Install `yt-dlp` and `pydub` (and `ffmpeg`).
* Implement the `youtube_agent` and its tools (`download_youtube_audio`, `chunk_audio_file`, `transcribe_audio_chunk_gemini`, `summarize_transcript`).
6. **New Agent Implementation (Validation, Figure, Long Context):**
* Implement `validation_agent` and its tools.
* Implement `figure_interpretation_agent` and its tools (requires sourcing/installing chart/table parsing models/libraries).
* Implement `long_context_agent` and its tools (requires vector DB setup like `faiss` or `chromadb`).
7. **Integration and Workflow Adjustments:**
* Update `planner_agent`'s system prompt and handoff logic to incorporate the new agents.
* Update other agents' handoff targets as needed.
* Update `app.py` if the overall agent initialization or workflow invocation changes.
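The centralized configuration in step 1 can be as small as a shared module. A minimal sketch using `python-dotenv`; the module and variable names are illustrative assumptions:
```python
# config.py (hypothetical): centralized configuration via python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads a .env file from the working directory, if present

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
DEFAULT_GEMINI_MODEL = os.getenv("DEFAULT_GEMINI_MODEL", "models/gemini-1.5-pro")

# Agents then import from here instead of hardcoding values:
#   from config import GEMINI_API_KEY, DEFAULT_GEMINI_MODEL
```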
### 6.2. New Dependencies (`requirements.txt`)
Based on the refactoring and new features, the following dependencies might need to be added or updated in `requirements.txt` (or managed via environment setup):
* `python-dotenv`: For loading configuration from `.env` files.
* `google-generativeai`: For interacting with Gemini models (already likely present via `llama-index-llms-google-genai`).
* `yt-dlp`: For downloading YouTube videos.
* `pydub`: For audio manipulation (chunking). Requires `ffmpeg` or `libav` system dependency.
* `llama-index-vector-stores-faiss` / `faiss-cpu` / `faiss-gpu`: For `long_context_agent` vector store (choose one).
* `chromadb` / `llama-index-vector-stores-chroma`: Alternative vector store for `long_context_agent`.
* `llama-index-multi-modal-llms-google`: Ensure multimodal support for Gemini is correctly installed.
* *Possibly*: Libraries for NLI models (e.g., `transformers`, `torch`) if used in `validation_agent`.
* *Possibly*: Libraries for chart/table parsing (e.g., specific models from Hugging Face, `opencv-python`, `pdf2image`) if implementing `figure_interpretation_agent` tools.
* *Possibly*: Python bindings for `whisper.cpp` if not using `subprocess`.
**System Dependencies:**
* `ffmpeg` or `libav`: Required by `pydub`.
* `whisper.cpp`: Needs to be compiled or installed separately. Follow its specific instructions.
### 6.3. Validation Tests
Minimal tests should be implemented to validate key changes:
1. **Configuration:** Test loading of API keys and model names from the configuration source (see the pytest sketch after this list).
2. **Logging:** Verify that logs are being generated at the correct levels and formats.
3. **`code_agent` Security:** Test that `code_agent` uses `code_interpreter` and *not* the removed `SimpleCodeExecutor`. Attempt a malicious code execution via prompt to ensure it fails safely within the interpreter's sandbox.
4. **`verifier_agent` Contradiction:** Test the improved contradiction detection with sample pairs of contradictory and non-contradictory statements.
5. **`transcription_agent`:**
* Test with a short local audio file using both Gemini and Whisper.cpp, comparing output quality/speed.
* Test with an audio URL.
* Test the Python API function `get_transcript`.
6. **`youtube_agent`:**
* Test with a short YouTube video URL.
* Verify audio download, chunking, transcription of chunks, and final summary generation.
* Check cleanup of temporary files.
7. **New Agents (Basic):**
* For `validation_agent`, `figure_interpretation_agent`, `long_context_agent`, implement basic tests confirming agent initialization and successful calls to their primary new tools with mock inputs/outputs.
8. **End-to-End Smoke Test:** Run `app.py` and process one or two simple GAIA tasks that are likely to invoke the refactored components and potentially a new feature (if a relevant task exists) to ensure the overall workflow remains functional.
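As an example of test 1, a minimal pytest sketch, assuming the hypothetical `config.py` module sketched in Section 6.1:
```python
# Hypothetical sketch: verify model names are read from the environment.
import importlib


def test_model_name_from_env(monkeypatch):
    monkeypatch.setenv("DEFAULT_GEMINI_MODEL", "models/gemini-1.5-flash")
    import config

    importlib.reload(config)  # re-read env vars after monkeypatching
    assert config.DEFAULT_GEMINI_MODEL == "models/gemini-1.5-flash"
```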
*(Implementation plan complete. Ready for user confirmation.)*