
NeuroRAG

AI Assistant for Neurobiologists



Overview

NeuroRAG is an open-source retrieval-augmented generation (RAG) system for language processing in neurobiology, medicine, and psychology. By integrating advanced language models with graph-based operations, it lets users grade documents, evaluate answers, and rewrite queries for more accurate information retrieval. It is aimed at researchers, educators, and AI practitioners working with biomedical text.

The NeuroRAG system architecture. The diagram shows the multi-agent workflow: query input → routing → retrieval → filtering → ensemble answer generation with multiple LLMs, including hallucination and relevance checks. (a) query transformation chains; (b) HyDE; (c) retrievers; (d) routing LLM; (e) document grader chain; (f) ensembling of LLMs; (g) validation chain.
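The workflow in the caption can be sketched in plain Python. Every name below (`route_query`, `grade_document`, the stub corpus) is a hypothetical stand-in for the project's langchain/langgraph chains, and the heuristics are deliberately trivial:

```python
def route_query(query: str) -> str:
    """(d) Routing: pick a retriever for the query (stub: keyword-based)."""
    return "ncbi" if "gene" in query.lower() else "vectorstore"

def retrieve(query: str, source: str) -> list[str]:
    """(c) Retrieval: fetch candidate documents (stub corpus)."""
    corpus = {
        "vectorstore": ["NMDA receptors mediate synaptic plasticity."],
        "ncbi": ["GRIN1 encodes an NMDA receptor subunit."],
    }
    return corpus[source]

def grade_document(query: str, doc: str) -> bool:
    """(e) Grading: keep documents sharing a keyword with the query."""
    return bool(set(query.lower().split()) & set(doc.lower().split()))

def generate(query: str, docs: list[str]) -> str:
    """(f) Generation: stub that echoes the supporting context."""
    return " ".join(docs) if docs else "I don't know."

def answer(query: str) -> str:
    source = route_query(query)                           # routing
    docs = retrieve(query, source)                        # retrieval
    kept = [d for d in docs if grade_document(query, d)]  # filtering
    return generate(query, kept)                          # generation
```

In the real pipeline each stage is an LLM-backed chain and the loop can re-enter (query rewriting, hallucination checks); the sketch only shows the forward pass.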

Results (evaluation on biomedical QA datasets)

| Dataset          | GPT-4o | Mistral Large | Llama 3.3 70B | BioMistral | NeuroRAG |
|------------------|--------|---------------|---------------|------------|----------|
| Medical Genetics | 0.9500 | 0.8600        | 0.9400        | 0.9400     | 0.9600   |
| College Biology  | 0.9167 | 0.9514        | 0.9236        | 0.9306     | 0.9722   |
| College Medicine | 0.8382 | 0.8266        | 0.7977        | 0.7861     | 0.8728   |

Accuracy on Biological MMLU Datasets.

| Metric  | GPT-4o | Mistral Large | Llama 3.3 70B | BioMistral | NeuroRAG |
|---------|--------|---------------|---------------|------------|----------|
| CosSim  | 0.6005 | 0.6008        | 0.6015        | 0.4953     | 0.6346   |
| BLEU    | 0.0233 | 0.0183        | 0.0122        | 0.0018     | 0.0166   |
| ROUGE-1 | 0.2973 | 0.2963        | 0.2570        | 0.2349     | 0.2738   |
| ROUGE-L | 0.1601 | 0.1542        | 0.1471        | 0.2082     | 0.1744   |

Performance metrics (Cosine Similarity, BLEU, ROUGE-1, ROUGE-L) on the MEDIQA dataset with string answers.


Features

More details about features
Feature Summary
⚙️ Architecture
  • NeuroRAG utilizes a modular architecture with components such as document processing, retrievers, chains, and answer grading.
  • The architecture enables advanced language processing and graph-based operations for tasks like document grading and query rewriting.
  • Central hub in neurorag.py orchestrates data flow through different modules for seamless integration and operation.
🔩 Code Quality
  • Codebase maintains high code quality standards with consistent formatting and linting rules defined in pyproject.toml.
  • Utilizes essential libraries like `scikit-learn`, `numpy`, and `pandas` for efficient data manipulation and processing.
  • Includes detailed documentation within code files to enhance readability and maintainability.
📄 Documentation
  • Extensive documentation in various formats (e.g., `.ipynb`, `.py`) covering dataset generation, model evaluation, and application interfaces.
  • Uses `pip` for dependency management, with clear installation instructions via requirements.txt.
  • Documentation includes detailed explanations of code files and their roles within the project architecture.
🔌 Integrations
  • Integrates with external libraries and frameworks like `langchain`, `langgraph`, and `langchainhub` for enhanced language processing capabilities.
  • FastAPI endpoint in api.py enables seamless integration of NeuroRAG model for answering queries based on pre-loaded documents.
  • Utilizes Streamlit chatbot interface in app.py for user interaction and content display.
🧩 Modularity
  • Project design emphasizes modularity with distinct components like retrievers, chains, and document grading for specific tasks.
  • Each chain (e.g., FusingChain, GenerationChain) encapsulates logic for specific operations, promoting reusability and maintainability.
  • Modular approach allows for easy scalability and extension of functionality through additional chains or components.
🧪 Testing
  • Testing commands provided in documentation for running tests with `pytest` to ensure code functionality and reliability.
  • Test files likely exist within the codebase to validate different components and functionalities.
  • Test-driven development approach may be employed to maintain code quality and prevent regressions.
⚡️ Performance
  • Utilizes advanced language models such as GPT, OpenBio, and Mistral for generating responses and enhancing performance.
  • Efficient data retrieval and processing mechanisms in chains like NCBIRetriever and HyDEChain contribute to overall system performance.
  • Performance optimization likely implemented through parallel execution and query optimization strategies.
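The parallel-execution idea mentioned above can be sketched with `concurrent.futures`; the three model functions are illustrative stubs, not the project's actual langchain wrappers:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for the GPT / OpenBio / Mistral calls, which in the
# real project go through langchain model wrappers over the network.
def gpt(q: str) -> str:      return f"gpt:{q}"
def openbio(q: str) -> str:  return f"openbio:{q}"
def mistral(q: str) -> str:  return f"mistral:{q}"

def ensemble(query: str) -> list[str]:
    """Query all models concurrently; network-bound calls overlap in time.

    Results are collected in submission order, so the output list always
    aligns with the model list regardless of completion order.
    """
    models = [gpt, openbio, mistral]
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(m, query) for m in models]
        return [f.result() for f in futures]
```

Threads are appropriate here because remote LLM calls are I/O-bound; for CPU-bound work a process pool would be the usual choice.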

Project Structure

More details about structure
└── NeuroRAG/
    ├── README.md
    ├── apps
    │   ├── .streamlit
    │   ├── api.py
    │   ├── app.py
    │   ├── grades.json
    │   ├── llm-arena.py
    │   └── questions.csv
    ├── datasets
    │   ├── __init__.py
    │   ├── brainscape.csv
    │   ├── brainscape.ipynb
    │   ├── expert_questions.csv
    │   ├── final.csv
    │   ├── final.ipynb
    │   ├── mediqa.csv
    │   ├── mediqa.ipynb
    │   ├── medmcqa.csv
    │   ├── medmcqa.ipynb
    │   ├── mmlu.csv
    │   └── mmlu.ipynb
    ├── documents
    │   ├── Alwyn Scott — Neuroscience: A Mathematical Primer.pdf
    │   ├── Constance Hammond — Cellular and Molecular Neurophysiology.pdf
    │   ├── Dale Purves, George J. Augustine, David Fitzpatrick William C. Hall, Anthony-Samuel Lamantia, James O. McNamara, S. Mark Williams — Neuroscience, Third Edition.pdf
    │   └── Sarah Piper, Abdullah Ahmed — Microscopy for neuroscience research.pdf
    ├── neurorag
    │   ├── __init__.py
    │   ├── chains
    │   ├── neurorag.py
    │   ├── retrievers
    │   └── utils
    ├── notebooks
    │   ├── RAPTOR.ipynb
    │   ├── __init__.py
    │   ├── cosine-evaluation.ipynb
    │   ├── llm-blender.ipynb
    │   ├── mmlu-evaluation.ipynb
    │   ├── mmlu.ipynb
    │   └── raw-llms.ipynb
    ├── pyproject.toml
    └── requirements.txt

Project Index

NEURORAG/
__root__
requirements.txt Manage project dependencies using the provided requirements.txt file to ensure proper functioning of the codebase architecture.
pyproject.toml Configure code formatting and linting rules in the project using the provided pyproject.toml file.
datasets
mmlu.ipynb - Generates a dataset by aggregating questions and answers from various subsets related to anatomy, biology, medicine, and psychology
- The resulting CSV file 'mmlu.csv' contains a comprehensive collection of questions and their corresponding answers for further analysis and processing within the project architecture.
medmcqa.ipynb - The code file `medmcqa.ipynb` in the datasets directory of the project is responsible for importing datasets and pandas for data manipulation
- It likely plays a role in loading and preprocessing medical multiple-choice question and answer data for further analysis within the project architecture.
final.ipynb Merge datasets to create a comprehensive final dataset for analysis and export it as a CSV file.
mediqa.ipynb - The code file `datasets/mediqa.ipynb` in the project architecture integrates datasets and performs data processing tasks using language models and prompts from the Langchain framework
- It leverages the Ollama language model system and PromptTemplate for generating outputs related to medical question answering.
brainscape.ipynb - Extracts data from a website to create a dataset of flashcards related to neurobiology
- The code initializes an empty DataFrame, scrapes URLs, extracts flashcard content, and saves the data to a CSV file
- This process automates the collection of educational content for further analysis and study.
neurorag
neurorag.py - The `neurorag.py` file in the project serves as a central hub for integrating various components such as document processing, embeddings, retrievers, and chains for tasks like document grading, answer grading, and query rewriting
- It orchestrates the flow of data and operations through the different modules to enable advanced language processing and graph-based operations within the codebase architecture.
retrievers
NCBIRetriever.py - Retrieves and processes gene or protein data from the NCBI database based on a search query
- Generates structured documents with relevant information for each gene or protein record fetched.
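A minimal sketch of the query side of such a retriever, assuming the standard NCBI E-utilities `esearch` endpoint. Only URL construction is shown so the example runs offline; fetching and parsing records (as NCBIRetriever presumably does via `efetch`/`esummary`) is omitted:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db: str, term: str, retmax: int = 5) -> str:
    """Build an NCBI E-utilities esearch URL for a gene or protein query.

    `db` selects the database ("gene", "protein", ...), `term` is the
    Entrez query string, and `retmax` caps the number of returned IDs.
    """
    params = urlencode({"db": db, "term": term, "retmode": "json", "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{params}"
```

A caller would fetch this URL, read the ID list from the JSON response, and then request full records for each ID.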
chains
fusing.py - The `FusingChain` class orchestrates the merging of multiple AI-generated responses into a coherent and comprehensive answer
- It evaluates, identifies common answers, synthesizes information, and formats the final response in JSON format
- By leveraging various components like parsers, prompts, and runnables, it intelligently combines insights from different sources to produce a unified output.
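As an illustration of the ensembling idea, here is a majority-vote fusion over candidate answers. The real FusingChain synthesizes a unified answer with an LLM rather than voting, so this is a stand-in for the concept only:

```python
from collections import Counter

def fuse(candidates: list[str]) -> str:
    """Pick the most common normalized answer; ties resolve to first seen.

    Normalization (strip + lowercase) lets trivially different phrasings of
    the same answer count as one vote.
    """
    normalized = [c.strip().lower() for c in candidates]
    winner, _ = Counter(normalized).most_common(1)[0]
    # Return the original casing of the first candidate matching the winner.
    return next(c for c in candidates if c.strip().lower() == winner)
```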
ncbi_protein.py - Facilitates transforming user queries into precise NCBI protein database searches
- Utilizes Pydantic for schema validation and RetryOutputParser for handling retries
- Implements a chain of operations including prompt generation, language model processing, and data retrieval
- Enables efficient query optimization for bioinformatics experts.
hyde.py - Enables generation of scientific paper passages in response to queries by chaining together a prompt, language model, and output parser
- The HyDEChain class initializes the chain and provides a method to invoke it with a query, returning the generated passage.
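The HyDE idea — retrieving with a hypothetical generated passage instead of the raw query — can be sketched with a stub LLM and token-overlap similarity; the real chain uses an actual language model and embedding-based retrieval:

```python
def fake_llm(query: str) -> str:
    """Stub for the passage-generating LLM in the HyDE chain."""
    return f"A scientific passage about {query}: NMDA receptors gate calcium influx."

def hyde_retrieve(query: str, corpus: list[str]) -> str:
    """HyDE: match documents against the *hypothetical* passage, not the query.

    The generated passage is richer than the query, so overlap with the
    right document is larger. Similarity here is raw token overlap; the
    real chain would compare dense embeddings instead.
    """
    hypothetical = fake_llm(query)
    hypo_tokens = set(hypothetical.lower().split())
    return max(corpus, key=lambda doc: len(hypo_tokens & set(doc.lower().split())))
```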
step_back.py - Generates step-back queries to enhance context retrieval in a RAG system
- Utilizes a chain of processes to create broader, more general queries based on the original input
- The code orchestrates the flow of operations, including parsing, prompting, and invoking the query generation process.
generation.py - The `GenerationChain` class orchestrates multiple language models to fuse responses for question-answering tasks
- It integrates GPT, OpenBio, and Mistral models, combining their outputs to generate a coherent response
- The class encapsulates the logic for invoking the models and fusing their responses, providing a streamlined interface for generating answers based on user queries and context.
route.py - Defines a RouteChain class that orchestrates retrieval methods for user questions
- It leverages Pydantic for data validation and RetryOutputParser for error handling
- The class encapsulates a chain of operations, including prompts, language models, and JSON extraction, to process user queries effectively.
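The routing contract might be illustrated with a keyword heuristic. The real RouteChain delegates this decision to an LLM and validates the choice with Pydantic, so the rules and retriever names below are purely hypothetical:

```python
def route(question: str) -> str:
    """Choose a retrieval method for a user question (stub heuristic).

    Returns one of three illustrative retriever names; the real chain's
    label set may differ.
    """
    q = question.lower()
    if any(k in q for k in ("gene", "locus", "chromosome")):
        return "ncbi_gene"
    if any(k in q for k in ("protein", "enzyme")):
        return "ncbi_protein"
    return "vectorstore"  # default: the local document index
```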
json_extractor.py Extracts the last JSON object from input data, removing escape characters.
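A sketch of that behavior, assuming "last JSON object" means the final balanced-brace span that parses; this is an illustration, not the module's actual implementation:

```python
import json

def extract_last_json(text: str) -> dict:
    """Return the last top-level JSON object embedded in `text`.

    Collects balanced-brace spans with a depth counter, then parses them
    from the end; json.loads itself resolves backslash escapes. Braces
    inside string literals would confuse the counter — acceptable for a
    sketch of LLM-output cleanup.
    """
    spans, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start : i + 1])
    for candidate in reversed(spans):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    raise ValueError("no JSON object found")
```

This pattern is useful because LLMs often wrap their JSON answer in explanatory prose; only the final object is the structured result.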
document_grade.py - Implement a document grading chain that assesses document relevance to a user query
- Utilizes Pydantic for schema validation and RetryOutputParser for error handling
- The chain orchestrates prompts, language models, and JSON extraction to evaluate and assign a binary relevance score ('yes' or 'no') based on keyword and semantic alignment between the query and document.
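The grading contract (a binary 'yes'/'no' relevance score) could be mimicked offline with a keyword-overlap rule; the threshold and stop-word list below are arbitrary illustrations, while the real chain asks an LLM for the judgment:

```python
def grade_relevance(query: str, document: str) -> str:
    """Binary relevance score ('yes'/'no') from keyword overlap.

    Strips punctuation and common stop words, then requires at least two
    shared content words — a crude proxy for the LLM's keyword and
    semantic alignment check.
    """
    stop = {"the", "a", "an", "is", "are", "of", "in", "do", "what", "how", "for"}
    q = {w.strip("?.,").lower() for w in query.split()} - stop
    d = {w.strip("?.,").lower() for w in document.split()} - stop
    return "yes" if len(q & d) >= 2 else "no"
```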
ncbi_gene.py - Facilitates transforming user questions into precise queries for the NCBI gene database
- Utilizes a chain of operations to optimize user queries, parse outputs, and retrieve gene loci
- The code orchestrates a series of steps to enhance user query effectiveness and streamline database searches.
answer_grade.py - Defines an Answer Grade Chain that assesses if an answer resolves a question
- It utilizes a binary scoring system ('yes' or 'no') based on user input and LLM generation
- The chain includes a retry mechanism and various parsers for processing the input
- The main purpose is to evaluate answers and provide a binary score indicating if the question is addressed.
decomposition.py - Facilitates decomposition of complex queries into simpler sub-queries for a RAG system
- Parses input query, generates sub-queries, and handles retries for comprehensive responses
- Integrates Pydantic for schema validation and prompts for user interaction
- Orchestrates parallel execution of components for efficient processing.
query_rewriting.py - Enables query rewriting for improved information retrieval in a RAG system by reformulating user queries
- The code defines a schema for rewritten queries, sets up a prompt template for AI assistants, and constructs a chain for query processing using various components like parsers and extractors
- The QueryRewritingChain class facilitates invoking the chain to generate more specific and relevant queries.
hallucinations.py - Facilitates assessing if an LLM answer aligns with facts by providing a binary score
- Utilizes a structured template for grading and incorporates Pydantic for parsing
- Implements a chain of operations to process input and generate the binary score.
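The same binary-scoring contract can be shown with a token-grounding heuristic: an answer counts as factual only if most of its tokens appear in the retrieved facts. The 0.7 support threshold is an arbitrary illustration; the real chain obtains this judgment from an LLM:

```python
def grade_hallucination(answer: str, facts: list[str]) -> str:
    """'yes' if the answer is grounded in the retrieved facts, else 'no'.

    Grounding = fraction of answer tokens that also occur in the facts.
    """
    fact_tokens = {t.strip(".,").lower() for t in " ".join(facts).split()}
    ans_tokens = [t.strip(".,").lower() for t in answer.split()]
    supported = sum(t in fact_tokens for t in ans_tokens)
    return "yes" if ans_tokens and supported / len(ans_tokens) >= 0.7 else "no"
```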
apps
grades.json - Stores detailed information about NMDA receptors, their subunit compositions, and their significance in various physiological and pathological processes
- Serves as a reference for understanding the functions of NMDA receptors in brain activity and neurological disorders.
llm-arena.py - Generates and saves rankings of answers from neural networks for given questions in the LLM-arena app
- Users rank answers by preference, with the option to save rankings for analysis
- The code orchestrates the interaction between the neural networks, user interface, and data storage, facilitating user engagement and data collection.
api.py - Implements a FastAPI endpoint for invoking a NeuroRAG model to answer questions based on pre-loaded documents
- The code initializes the model with pre-processed documents and handles incoming queries to generate answers.
app.py - The code orchestrates a Streamlit chatbot interface for NeuroRAG, enabling users to interact with the chatbot for assistance
- It manages chat messages, user prompts, and responses, along with the display of documents and sources
- The interface allows users to engage in conversations with the chatbot, receive generated content, and view relevant documents within the application.
.streamlit
config.toml Customize the primary color theme for the Streamlit app in the project configuration file located at apps/.streamlit/config.toml.
notebooks
mmlu.ipynb - The notebook `mmlu.ipynb` in the notebooks directory evaluates language models using the Massive Multitask Language Understanding (MMLU) benchmark
- This benchmark assesses language models across a wide range of domains, spanning from fundamental topics like history and mathematics to specialized fields such as law and medicine
- The code facilitates the evaluation of language understanding capabilities in diverse subject areas, contributing to the enhancement of language models' performance and applicability across various domains within the project architecture.
raw-llms.ipynb - The code file `raw-llms.ipynb` in the project structure is responsible for importing necessary packages and setting up the initial environment for natural language processing tasks
- It handles tasks such as data preprocessing, feature extraction, and evaluation using various libraries like NLTK, NumPy, Pandas, and scikit-learn
- This notebook serves as a foundational step in the data processing pipeline of the project, ensuring that the data is ready for further analysis and modeling.
RAPTOR.ipynb - The code file `RAPTOR.ipynb` in the notebooks directory serves as a key component in the project architecture
- It plays a crucial role in leveraging the RAPTOR algorithm to enhance the project's capabilities
- This code file facilitates the efficient processing and analysis of data, contributing significantly to the project's overall functionality and performance.
cosine-evaluation.ipynb - The `cosine-evaluation.ipynb` file in the project focuses on evaluating cosine similarity using GraphRAG
- It imports necessary packages, processes data, and calculates cosine similarity scores for the project's text data
- This evaluation is crucial for understanding the semantic similarity between different text elements within the project's architecture.
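Cosine similarity itself reduces to a normalized dot product; a dependency-free bag-of-words version shows the metric, while the notebook presumably computes it over dense embeddings:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings.

    cos(a, b) = (a · b) / (|a| |b|); identical texts score 1.0, texts with
    no shared words score 0.0.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```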
llm-blender.ipynb - The notebook `llm-blender.ipynb` in the `notebooks` directory imports the packages required for the LLM-Blender experiments
- It ensures that the required dependencies, such as NLTK and NumPy, are already installed and available for use in the project's workflow
- This file plays a crucial role in setting up the environment and enabling the project to leverage these essential libraries seamlessly.
mmlu-evaluation.ipynb - The code file `mmlu-evaluation.ipynb` in the notebooks directory of the project focuses on utilizing the GraphRAG library for evaluating machine learning models
- It imports necessary packages, processes data, and likely contains code for model evaluation and analysis
- This file plays a crucial role in assessing the performance and effectiveness of machine learning models within the project's architecture.

Getting Started

Prerequisites

Before getting started with NeuroRAG, ensure your runtime environment meets the following requirements:

  • Programming Language: Python
  • Package Manager: Pip

Installation

Install NeuroRAG using one of the following methods:

Build from source:

  1. Clone the NeuroRAG repository:
❯ git clone https://github.com/Biomed-imaging-lab/NeuroRAG
  2. Navigate to the project directory:
❯ cd NeuroRAG
  3. Install the project dependencies using pip:
❯ pip install -r requirements.txt

Usage

Run the NeuroRAG Streamlit app with Docker using the following commands:

❯ docker build -t neurorag-app .
❯ docker run -p 8501:8501 --add-host=host.docker.internal:host-gateway -e HTTP_PROXY="http://host.docker.internal:2080" -e HTTPS_PROXY="http://host.docker.internal:2080" -e OLLAMA_HOST="http://host.docker.internal:11434" -e NO_PROXY="localhost,127.0.0.1,host.docker.internal" neurorag-app

Contributing

Contributing Guidelines
  1. Fork the Repository: Start by forking the project repository to your GitHub account.
  2. Clone Locally: Clone the forked repository to your local machine using a git client.
    git clone https://github.com/Biomed-imaging-lab/NeuroRAG
    
  3. Create a New Branch: Always work on a new branch, giving it a descriptive name.
    git checkout -b new-feature-x
    
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message describing your updates.
    git commit -m 'Implemented new feature x.'
    
  6. Push to GitHub: Push the changes to your forked repository.
    git push origin new-feature-x
    
  7. Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
  8. Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!


License

This project is licensed under the Apache License 2.0. For more details, refer to the LICENSE file.


Authors

Vladimir Skvortsov¹, Ivan Zolin¹,², Vyacheslav Chukanov¹, Ekaterina Pchitskaya¹

  1. Laboratory of Biomedical Imaging and Data Analysis, Institute of Biomedical Systems and Biotechnology, Peter the Great St. Petersburg Polytechnic University
  2. ITMO University