zrguo
committed on
Commit
·
d21ad66
1
Parent(s):
ca5cc52
RAG-Anything Integration
- README-zh.md +44 -16
- README.md +45 -17
- docs/mineru_integration_en.md +0 -360
- docs/mineru_integration_zh.md +0 -358
- examples/mineru_example.py +0 -85
- examples/modalprocessors_example.py +1 -1
- examples/raganything_example.py +1 -1
- lightrag/mineru_parser.py +0 -513
- lightrag/modalprocessors.py +0 -699
- lightrag/raganything.py +0 -686
README-zh.md
CHANGED
@@ -4,7 +4,7 @@
|
|
4 |
|
5 |
## 🎉 News
|
6 |
|
7 |
-
- [X] [2025.06.05]🎯📢LightRAG now integrates
|
8 |
- [X] [2025.03.18]🎯📢LightRAG now supports citation functionality.
|
9 |
- [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) for understanding extremely long-context videos.
|
10 |
- [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG), which simplifies RAG with small models.
|
@@ -1003,31 +1003,59 @@ rag.merge_entities(
|
|
1003 |
|
1004 |
</details>
|
1005 |
|
1006 |
-
## Multimodal Document Processing (
|
1007 |
|
1008 |
-
LightRAG
|
1009 |
|
1010 |
**Key Features:**
|
1011 |
-
-
|
1012 |
-
-
|
1013 |
-
-
|
1014 |
-
-
|
|
|
1015 |
|
1016 |
**Quick Start:**
|
1017 |
-
1.
|
1018 |
```bash
|
1019 |
-
pip install
|
1020 |
```
|
1021 |
-
2.
|
1022 |
-
3. Process files with the new `MineruParser` or RAGAnything's `process_document_complete`:
|
1023 |
```python
|
1024 |
-
|
1025 |
-
|
1026 |
-
|
1027 |
-
|
1028 |
```
|
1029 |
-
4. To query multimodal content with LightRAG, see [docs/mineru_integration_zh.md](docs/mineru_integration_zh.md).
|
1030 |
|
|
|
1031 |
|
1032 |
## Token Usage Tracking
|
1033 |
|
|
|
4 |
|
5 |
## 🎉 News
|
6 |
|
7 |
+
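- [X] [2025.06.05]🎯📢LightRAG now integrates RAG-Anything, supporting comprehensive multimodal document parsing and RAG capabilities (PDFs, images, Office documents, tables, formulas, etc.). See the [multimodal processing module](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#多模态文档处理rag-anything集成) below for details.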
|
8 |
- [X] [2025.03.18]🎯📢LightRAG now supports citation functionality.
|
9 |
- [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) for understanding extremely long-context videos.
|
10 |
- [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG), which simplifies RAG with small models.
|
|
|
1003 |
|
1004 |
</details>
|
1005 |
|
1006 |
+
## Multimodal Document Processing (RAG-Anything Integration)
|
1007 |
|
1008 |
+
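LightRAG now integrates seamlessly with [RAG-Anything](https://github.com/HKUDS/RAG-Anything), an **all-in-one multimodal document processing RAG system** built specifically for LightRAG. RAG-Anything provides advanced parsing and retrieval-augmented generation (RAG) capabilities, letting you process multimodal documents and extract structured content (text, images, tables, and formulas) from a variety of document formats for integration into your RAG pipeline.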
|
1009 |
|
1010 |
**Key Features:**
|
1011 |
+
- **End-to-End Multimodal Pipeline**: Complete workflow from document ingestion and parsing to intelligent multimodal question answering
|
1012 |
+
- **Universal Document Support**: Seamless processing of PDFs, Office documents (DOC/DOCX/PPT/PPTX/XLS/XLSX), images, and other file formats
|
1013 |
+
- **Specialized Content Analysis**: Dedicated processors for images, tables, mathematical formulas, and heterogeneous content types
|
1014 |
+
- **Multimodal Knowledge Graph**: Automatic entity extraction and cross-modal relationship discovery for enhanced understanding
|
1015 |
+
- **Hybrid Intelligent Retrieval**: Advanced search across textual and multimodal content with contextual understanding
|
1016 |
|
1017 |
**Quick Start:**
|
1018 |
+
1. Install RAG-Anything:
|
1019 |
```bash
|
1020 |
+
pip install raganything
|
1021 |
```
|
1022 |
+
2. Process multimodal documents:
|
|
|
1023 |
```python
|
1024 |
+
import asyncio
|
1025 |
+
from raganything import RAGAnything
|
1026 |
+
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
|
1027 |
+
|
1028 |
+
async def main():
|
1029 |
+
# Initialize RAGAnything with LightRAG integration
|
1030 |
+
rag = RAGAnything(
|
1031 |
+
working_dir="./rag_storage",
|
1032 |
+
llm_model_func=lambda prompt, **kwargs: openai_complete_if_cache(
|
1033 |
+
"gpt-4o-mini", prompt, api_key="your-api-key", **kwargs
|
1034 |
+
),
|
1035 |
+
embedding_func=lambda texts: openai_embed(
|
1036 |
+
texts, model="text-embedding-3-large", api_key="your-api-key"
|
1037 |
+
),
|
1038 |
+
embedding_dim=3072,
|
1039 |
+
)
|
1040 |
+
|
1041 |
+
# Process a multimodal document
|
1042 |
+
await rag.process_document_complete(
|
1043 |
+
file_path="path/to/your/document.pdf",
|
1044 |
+
output_dir="./output"
|
1045 |
+
)
|
1046 |
+
|
1047 |
+
# Query multimodal content
|
1048 |
+
result = await rag.query_with_multimodal(
|
1049 |
+
"What are the main findings shown in the charts?",
|
1050 |
+
mode="hybrid"
|
1051 |
+
)
|
1052 |
+
print(result)
|
1053 |
+
|
1054 |
+
if __name__ == "__main__":
|
1055 |
+
asyncio.run(main())
|
1056 |
```
|
|
|
1057 |
|
1058 |
+
For detailed documentation and advanced usage, see the [RAG-Anything repository](https://github.com/HKUDS/RAG-Anything).
|
1059 |
|
1060 |
## Token Usage Tracking
|
1061 |
|
README.md
CHANGED
@@ -39,7 +39,7 @@
|
|
39 |
</div>
|
40 |
|
41 |
## 🎉 News
|
42 |
-
- [X] [2025.06.05]🎯📢LightRAG now supports
|
43 |
- [X] [2025.03.18]🎯📢LightRAG now supports citation functionality, enabling proper source attribution.
|
44 |
- [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) for understanding extremely long-context videos.
|
45 |
- [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG) making RAG simpler with small models.
|
@@ -1058,31 +1058,59 @@ When merging entities:
|
|
1058 |
|
1059 |
</details>
|
1060 |
|
1061 |
-
## Multimodal Document Processing (
|
1062 |
|
1063 |
-
LightRAG now
|
1064 |
|
1065 |
**Key Features:**
|
1066 |
-
- **Multimodal
|
1067 |
-
- **
|
1068 |
-
- **
|
1069 |
-
- **Multimodal
|
1070 |
-
- **
|
1071 |
|
1072 |
**Quick Start:**
|
1073 |
-
1. Install
|
1074 |
```bash
|
1075 |
-
pip install
|
1076 |
```
|
1077 |
-
2.
|
1078 |
-
3. Process multi-modal documents using the new MineruParser or RAG-Anything's process_document_complete:
|
1079 |
```python
|
1080 |
-
|
1081 |
-
|
1082 |
-
|
1083 |
-
|
1084 |
```
|
1085 |
-
|
|
|
1086 |
|
1087 |
## Token Usage Tracking
|
1088 |
|
|
|
39 |
</div>
|
40 |
|
41 |
## 🎉 News
|
42 |
+
- [X] [2025.06.05]🎯📢LightRAG now supports comprehensive multimodal data handling through RAG-Anything integration, enabling seamless document parsing and RAG capabilities across diverse formats including PDFs, images, Office documents, tables, and formulas. Please refer to the new [multimodal section](https://github.com/HKUDS/LightRAG/?tab=readme-ov-file#multimodal-document-processing-rag-anything-integration) for details.
|
43 |
- [X] [2025.03.18]🎯📢LightRAG now supports citation functionality, enabling proper source attribution.
|
44 |
- [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) for understanding extremely long-context videos.
|
45 |
- [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG) making RAG simpler with small models.
|
|
|
1058 |
|
1059 |
</details>
|
1060 |
|
1061 |
+
## Multimodal Document Processing (RAG-Anything Integration)
|
1062 |
|
1063 |
+
LightRAG now seamlessly integrates with [RAG-Anything](https://github.com/HKUDS/RAG-Anything), a comprehensive **All-in-One Multimodal Document Processing RAG system** built specifically for LightRAG. RAG-Anything enables advanced parsing and retrieval-augmented generation (RAG) capabilities, allowing you to handle multimodal documents seamlessly and extract structured content—including text, images, tables, and formulas—from various document formats for integration into your RAG pipeline.
|
1064 |
|
1065 |
**Key Features:**
|
1066 |
+
- **End-to-End Multimodal Pipeline**: Complete workflow from document ingestion and parsing to intelligent multimodal query answering
|
1067 |
+
- **Universal Document Support**: Seamless processing of PDFs, Office documents (DOC/DOCX/PPT/PPTX/XLS/XLSX), images, and diverse file formats
|
1068 |
+
- **Specialized Content Analysis**: Dedicated processors for images, tables, mathematical equations, and heterogeneous content types
|
1069 |
+
- **Multimodal Knowledge Graph**: Automatic entity extraction and cross-modal relationship discovery for enhanced understanding
|
1070 |
+
- **Hybrid Intelligent Retrieval**: Advanced search capabilities spanning textual and multimodal content with contextual understanding
|
1071 |
|
1072 |
**Quick Start:**
|
1073 |
+
1. Install RAG-Anything:
|
1074 |
```bash
|
1075 |
+
pip install raganything
|
1076 |
```
|
1077 |
+
2. Process multimodal documents:
|
|
|
1078 |
```python
|
1079 |
+
import asyncio
|
1080 |
+
from raganything import RAGAnything
|
1081 |
+
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
|
1082 |
+
|
1083 |
+
async def main():
|
1084 |
+
# Initialize RAGAnything with LightRAG integration
|
1085 |
+
rag = RAGAnything(
|
1086 |
+
working_dir="./rag_storage",
|
1087 |
+
llm_model_func=lambda prompt, **kwargs: openai_complete_if_cache(
|
1088 |
+
"gpt-4o-mini", prompt, api_key="your-api-key", **kwargs
|
1089 |
+
),
|
1090 |
+
embedding_func=lambda texts: openai_embed(
|
1091 |
+
texts, model="text-embedding-3-large", api_key="your-api-key"
|
1092 |
+
),
|
1093 |
+
embedding_dim=3072,
|
1094 |
+
)
|
1095 |
+
|
1096 |
+
# Process multimodal documents
|
1097 |
+
await rag.process_document_complete(
|
1098 |
+
file_path="path/to/your/document.pdf",
|
1099 |
+
output_dir="./output"
|
1100 |
+
)
|
1101 |
+
|
1102 |
+
# Query multimodal content
|
1103 |
+
result = await rag.query_with_multimodal(
|
1104 |
+
"What are the main findings shown in the figures and tables?",
|
1105 |
+
mode="hybrid"
|
1106 |
+
)
|
1107 |
+
print(result)
|
1108 |
+
|
1109 |
+
if __name__ == "__main__":
|
1110 |
+
asyncio.run(main())
|
1111 |
```
|
1112 |
+
|
1113 |
+
For detailed documentation and advanced usage, please refer to the [RAG-Anything repository](https://github.com/HKUDS/RAG-Anything).
|
1114 |
|
1115 |
## Token Usage Tracking
|
1116 |
|
docs/mineru_integration_en.md
DELETED
@@ -1,360 +0,0 @@
|
|
1 |
-
# MinerU Integration Guide
|
2 |
-
|
3 |
-
### About MinerU
|
4 |
-
|
5 |
-
MinerU is a powerful open-source tool for extracting high-quality structured data from PDF, image, and office documents. It provides the following features:
|
6 |
-
|
7 |
-
- Text extraction while preserving document structure (headings, paragraphs, lists, etc.)
|
8 |
-
- Handling complex layouts including multi-column formats
|
9 |
-
- Automatic formula recognition and conversion to LaTeX format
|
10 |
-
- Image, table, and footnote extraction
|
11 |
-
- Automatic scanned document detection and OCR application
|
12 |
-
- Support for multiple output formats (Markdown, JSON)
|
13 |
-
|
14 |
-
### Installation
|
15 |
-
|
16 |
-
#### Installing MinerU Dependencies
|
17 |
-
|
18 |
-
If you have already installed LightRAG but don't have MinerU support, you can add MinerU support by installing the magic-pdf package directly:
|
19 |
-
|
20 |
-
```bash
|
21 |
-
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
|
22 |
-
```
|
23 |
-
|
24 |
-
These are the MinerU-related dependencies required by LightRAG.
|
25 |
-
|
26 |
-
#### MinerU Model Weights
|
27 |
-
|
28 |
-
MinerU requires model weight files to function properly. After installation, you need to download the required model weights. You can use either Hugging Face or ModelScope to download the models.
|
29 |
-
|
30 |
-
##### Option 1: Download from Hugging Face
|
31 |
-
|
32 |
-
```bash
|
33 |
-
pip install huggingface_hub
|
34 |
-
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
|
35 |
-
python download_models_hf.py
|
36 |
-
```
|
37 |
-
|
38 |
-
##### Option 2: Download from ModelScope (Recommended for users in China)
|
39 |
-
|
40 |
-
```bash
|
41 |
-
pip install modelscope
|
42 |
-
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models.py -O download_models.py
|
43 |
-
python download_models.py
|
44 |
-
```
|
45 |
-
|
46 |
-
Both methods will automatically download the model files and configure the model directory in the configuration file. The configuration file is located in your user directory and named `magic-pdf.json`.
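
Since the configuration file lives in the user's home directory, its path can be resolved the same way on every platform. A minimal sketch, assuming only the file name `magic-pdf.json` given above; the helper name `magic_pdf_config_path` is hypothetical and not part of MinerU or LightRAG:

```python
from pathlib import Path


def magic_pdf_config_path() -> Path:
    """Return the expected location of magic-pdf.json in the user's home directory."""
    return Path.home() / "magic-pdf.json"


if __name__ == "__main__":
    path = magic_pdf_config_path()
    # Report whether the download script has already written the file.
    print(f"{path} exists: {path.exists()}")
```

`Path.home()` resolves to the per-platform user directory, so no platform-specific branching is needed.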
|
47 |
-
|
48 |
-
> **Note for Windows users**: User directory is at `C:\Users\username`
|
49 |
-
> **Note for Linux users**: User directory is at `/home/username`
|
50 |
-
> **Note for macOS users**: User directory is at `/Users/username`
|
51 |
-
|
52 |
-
#### Optional: LibreOffice Installation
|
53 |
-
|
54 |
-
To process Office documents (DOC, DOCX, PPT, PPTX), you need to install LibreOffice:
|
55 |
-
|
56 |
-
**Linux/macOS:**
|
57 |
-
```bash
|
58 |
-
apt-get/yum/brew install libreoffice
|
59 |
-
```
|
60 |
-
|
61 |
-
**Windows:**
|
62 |
-
1. Install LibreOffice
|
63 |
-
2. Add the installation directory to your PATH: `install_dir\LibreOffice\program`
|
64 |
-
|
65 |
-
### Using MinerU Parser
|
66 |
-
|
67 |
-
#### Basic Usage
|
68 |
-
|
69 |
-
```python
|
70 |
-
from lightrag.mineru_parser import MineruParser
|
71 |
-
|
72 |
-
# Parse a PDF document
|
73 |
-
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
|
74 |
-
|
75 |
-
# Parse an image
|
76 |
-
content_list, md_content = MineruParser.parse_image('path/to/image.jpg', 'output_dir')
|
77 |
-
|
78 |
-
# Parse an Office document
|
79 |
-
content_list, md_content = MineruParser.parse_office_doc('path/to/document.docx', 'output_dir')
|
80 |
-
|
81 |
-
# Auto-detect and parse any supported document type
|
82 |
-
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
|
83 |
-
```
|
84 |
-
|
85 |
-
#### RAGAnything Integration
|
86 |
-
|
87 |
-
In RAGAnything, you can directly use file paths as input to the `process_document_complete` method to process documents. Here's a complete configuration example:
|
88 |
-
|
89 |
-
```python
|
90 |
-
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
|
91 |
-
from lightrag.raganything import RAGAnything
|
92 |
-
|
93 |
-
|
94 |
-
# Initialize RAGAnything
|
95 |
-
rag = RAGAnything(
|
96 |
-
working_dir="./rag_storage", # Working directory
|
97 |
-
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
|
98 |
-
"gpt-4o-mini", # Model to use
|
99 |
-
prompt,
|
100 |
-
system_prompt=system_prompt,
|
101 |
-
history_messages=history_messages,
|
102 |
-
api_key="your-api-key", # Replace with your API key
|
103 |
-
base_url="your-base-url", # Replace with your API base URL
|
104 |
-
**kwargs,
|
105 |
-
),
|
106 |
-
vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
|
107 |
-
"gpt-4o", # Vision model
|
108 |
-
"",
|
109 |
-
system_prompt=None,
|
110 |
-
history_messages=[],
|
111 |
-
messages=[
|
112 |
-
{"role": "system", "content": system_prompt} if system_prompt else None,
|
113 |
-
{"role": "user", "content": [
|
114 |
-
{"type": "text", "text": prompt},
|
115 |
-
{
|
116 |
-
"type": "image_url",
|
117 |
-
"image_url": {
|
118 |
-
"url": f"data:image/jpeg;base64,{image_data}"
|
119 |
-
}
|
120 |
-
}
|
121 |
-
]} if image_data else {"role": "user", "content": prompt}
|
122 |
-
],
|
123 |
-
api_key="your-api-key", # Replace with your API key
|
124 |
-
base_url="your-base-url", # Replace with your API base URL
|
125 |
-
**kwargs,
|
126 |
-
) if image_data else openai_complete_if_cache(
|
127 |
-
"gpt-4o-mini",
|
128 |
-
prompt,
|
129 |
-
system_prompt=system_prompt,
|
130 |
-
history_messages=history_messages,
|
131 |
-
api_key="your-api-key", # Replace with your API key
|
132 |
-
base_url="your-base-url", # Replace with your API base URL
|
133 |
-
**kwargs,
|
134 |
-
),
|
135 |
-
embedding_func=lambda texts: openai_embed(
|
136 |
-
texts,
|
137 |
-
model="text-embedding-3-large",
|
138 |
-
api_key="your-api-key", # Replace with your API key
|
139 |
-
base_url="your-base-url", # Replace with your API base URL
|
140 |
-
),
|
141 |
-
embedding_dim=3072,
|
142 |
-
max_token_size=8192
|
143 |
-
)
|
144 |
-
|
145 |
-
# Process a single file
|
146 |
-
await rag.process_document_complete(
|
147 |
-
file_path="path/to/document.pdf",
|
148 |
-
output_dir="./output",
|
149 |
-
parse_method="auto"
|
150 |
-
)
|
151 |
-
|
152 |
-
# Query the processed document
|
153 |
-
result = await rag.query_with_multimodal(
|
154 |
-
"What is the main content of the document?",
|
155 |
-
mode="hybrid"
|
156 |
-
)
|
157 |
-
|
158 |
-
```
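
In the `vision_model_func` lambda above, the `messages` list is built inline, and its first element is `None` whenever `system_prompt` is empty, which a chat API may reject. The sketch below shows one way to assemble the same list without that placeholder. `build_vision_messages` is a hypothetical helper written for illustration, not part of LightRAG or RAG-Anything:

```python
from typing import Any, Optional


def build_vision_messages(
    prompt: str,
    system_prompt: Optional[str] = None,
    image_data: Optional[str] = None,
) -> list[dict[str, Any]]:
    """Build a chat message list, skipping the system entry when it is empty."""
    messages: list[dict[str, Any]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    if image_data:
        # Text plus a base64-encoded JPEG, mirroring the structure used above.
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
            ],
        })
    else:
        messages.append({"role": "user", "content": prompt})
    return messages
```

The returned list could then be passed as the `messages=` argument in place of the inline literal.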
|
159 |
-
|
160 |
-
MinerU categorizes document content into text, formulas, images, and tables, processing each with its corresponding ingestion type:
|
161 |
-
- Text content: `ingestion_type='text'`
|
162 |
-
- Image content: `ingestion_type='image'`
|
163 |
-
- Table content: `ingestion_type='table'`
|
164 |
-
- Formula content: `ingestion_type='equation'`
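
To make the mapping concrete, the sketch below buckets parsed items by the four ingestion types listed above. It assumes each item is a dict with a `type` key holding one of those values, which is an assumption about the parsed content list rather than something this guide states. `group_by_ingestion_type` and the `generic` fallback bucket are hypothetical:

```python
from collections import defaultdict

# The four ingestion types named in the guide.
INGESTION_TYPES = ("text", "image", "table", "equation")


def group_by_ingestion_type(content_list):
    """Bucket content items by their (assumed) "type" field."""
    groups = defaultdict(list)
    for item in content_list:
        kind = item.get("type")
        # Anything outside the four documented types goes to a fallback bucket.
        groups[kind if kind in INGESTION_TYPES else "generic"].append(item)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        {"type": "text", "text": "Introduction"},
        {"type": "table", "table_body": "| a | b |"},
        {"type": "equation", "text": "E = mc^2"},
    ]
    for kind, items in group_by_ingestion_type(sample).items():
        print(kind, len(items))
```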
|
165 |
-
|
166 |
-
#### Query Examples
|
167 |
-
|
168 |
-
Here are some common query examples:
|
169 |
-
|
170 |
-
```python
|
171 |
-
# Query text content
|
172 |
-
result = await rag.query_with_multimodal(
|
173 |
-
"What is the main topic of the document?",
|
174 |
-
mode="hybrid"
|
175 |
-
)
|
176 |
-
|
177 |
-
# Query image-related content
|
178 |
-
result = await rag.query_with_multimodal(
|
179 |
-
"Describe the images and figures in the document",
|
180 |
-
mode="hybrid"
|
181 |
-
)
|
182 |
-
|
183 |
-
# Query table-related content
|
184 |
-
result = await rag.query_with_multimodal(
|
185 |
-
"Tell me about the experimental results and data tables",
|
186 |
-
mode="hybrid"
|
187 |
-
)
|
188 |
-
```
|
189 |
-
|
190 |
-
#### Command Line Tool
|
191 |
-
|
192 |
-
We also provide a command-line tool for document parsing:
|
193 |
-
|
194 |
-
```bash
|
195 |
-
python examples/mineru_example.py path/to/document.pdf
|
196 |
-
```
|
197 |
-
|
198 |
-
Optional parameters:
|
199 |
-
- `--output` or `-o`: Specify output directory
|
200 |
-
- `--method` or `-m`: Choose parsing method (auto, ocr, txt)
|
201 |
-
- `--stats`: Display content statistics
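
The flags listed above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the option list only, not the contents of `examples/mineru_example.py`. The positional argument name, the defaults, and the help strings are assumptions:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options; defaults here are illustrative guesses."""
    parser = argparse.ArgumentParser(description="Parse a document with MinerU")
    parser.add_argument("file_path", help="Path to the document to parse")
    parser.add_argument("--output", "-o", default=None, help="Output directory")
    parser.add_argument(
        "--method", "-m",
        choices=["auto", "ocr", "txt"],
        default="auto",
        help="Parsing method",
    )
    parser.add_argument(
        "--stats", action="store_true", help="Display content statistics"
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Using `choices` means an unsupported method is rejected at the command line instead of failing later during parsing.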
|
202 |
-
|
203 |
-
### Output Format
|
204 |
-
|
205 |
-
MinerU generates three files for each parsed document:
|
206 |
-
|
207 |
-
1. `{filename}.md` - Markdown representation of the document
|
208 |
-
2. `{filename}_content_list.json` - Structured JSON content
|
209 |
-
3. `{filename}_model.json` - Detailed model parsing results
|
210 |
-
|
211 |
-
The `content_list.json` file contains all structured content extracted from the document, including:
|
212 |
-
- Text blocks (body text, headings, etc.)
|
213 |
-
- Images (paths and optional captions)
|
214 |
-
- Tables (table content and optional captions)
|
215 |
-
- Lists
|
216 |
-
- Formulas
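
The three file names follow a fixed pattern, so they can be derived from the input path. A minimal sketch, assuming `{filename}` means the input name without its extension and that the files sit directly in the output directory (MinerU may nest them in subdirectories); `expected_outputs` is a hypothetical helper:

```python
from pathlib import Path


def expected_outputs(document, output_dir):
    """Map each documented output to its expected path under output_dir."""
    stem = Path(document).stem  # file name without extension
    out = Path(output_dir)
    return {
        "markdown": out / f"{stem}.md",
        "content_list": out / f"{stem}_content_list.json",
        "model": out / f"{stem}_model.json",
    }


if __name__ == "__main__":
    for label, path in expected_outputs("papers/report.pdf", "./output").items():
        print(f"{label}: {path}")
```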
|
217 |
-
|
218 |
-
### Troubleshooting
|
219 |
-
|
220 |
-
If you encounter issues with MinerU:
|
221 |
-
|
222 |
-
1. Check that model weights are correctly downloaded
|
223 |
-
2. Ensure you have sufficient RAM (16GB+ recommended)
|
224 |
-
3. For CUDA acceleration issues, see [MinerU documentation](https://mineru.readthedocs.io/en/latest/additional_notes/faq.html)
|
225 |
-
4. If parsing Office documents fails, verify LibreOffice is properly installed
|
226 |
-
5. If you encounter `pickle.UnpicklingError: invalid load key, 'v'.`, it might be due to an incomplete model download. Try re-downloading the models.
|
227 |
-
6. For users with newer graphics cards (H100, etc.) and garbled OCR text, try upgrading the CUDA version used by Paddle:
|
228 |
-
```bash
|
229 |
-
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
|
230 |
-
```
|
231 |
-
7. If you encounter a "filename too long" error, the latest version of MineruParser includes logic to automatically handle this issue.
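
One known way to hit the `invalid load key, 'v'` error from item 5 is a weight file that is actually a Git LFS pointer, a small text stub that begins with `version https://git-lfs...`. Whether that is what happened in a given MinerU download is an assumption to verify. A minimal check, with hypothetical helper names:

```python
from pathlib import Path

# Git LFS pointer files start with this text instead of binary model data.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/"


def looks_like_lfs_pointer(head: bytes) -> bool:
    """Return True if the leading bytes match a Git LFS pointer header."""
    return head.startswith(LFS_PREFIX)


def is_lfs_pointer(path) -> bool:
    """Read only the first bytes of a file and test them for the LFS header."""
    with Path(path).open("rb") as fh:
        return looks_like_lfs_pointer(fh.read(len(LFS_PREFIX)))
```

If a weight file matches, re-running the download script as suggested in item 5 is the fix to try.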
|
232 |
-
|
233 |
-
#### Updating Existing Models
|
234 |
-
|
235 |
-
If you have previously downloaded models and need to update them, you can simply run the download script again. The script will update the model directory to the latest version.
|
236 |
-
|
237 |
-
### Advanced Configuration
|
238 |
-
|
239 |
-
The MinerU configuration file `magic-pdf.json` supports various customization options, including:
|
240 |
-
|
241 |
-
- Model directory path
|
242 |
-
- OCR engine selection
|
243 |
-
- GPU acceleration settings
|
244 |
-
- Cache settings
|
245 |
-
|
246 |
-
For complete configuration options, refer to the [MinerU official documentation](https://mineru.readthedocs.io/).
|
247 |
-
|
248 |
-
### Using Modal Processors Directly
|
249 |
-
|
250 |
-
You can also use LightRAG's modal processors directly without going through MinerU. This is useful when you want to process specific types of content or have more control over the processing pipeline.
|
251 |
-
|
252 |
-
Each modal processor returns a tuple containing:
|
253 |
-
1. A description of the processed content
|
254 |
-
2. Entity information that can be used for further processing or storage
|
255 |
-
|
256 |
-
The processors support different types of content:
|
257 |
-
- `ImageModalProcessor`: Processes images with captions and footnotes
|
258 |
-
- `TableModalProcessor`: Processes tables with captions and footnotes
|
259 |
-
- `EquationModalProcessor`: Processes mathematical equations in LaTeX format
|
260 |
-
- `GenericModalProcessor`: A base processor that can be extended for custom content types
|
261 |
-
|
262 |
-
> **Note**: A complete working example can be found in `examples/modalprocessors_example.py`. You can run it using:
|
263 |
-
> ```bash
|
264 |
-
> python examples/modalprocessors_example.py --api-key YOUR_API_KEY
|
265 |
-
> ```
|
266 |
-
|
267 |
-
<details>
|
268 |
-
<summary> Here's an example of how to use different modal processors: </summary>
|
269 |
-
|
270 |
-
```python
|
271 |
-
from lightrag.modalprocessors import (
|
272 |
-
ImageModalProcessor,
|
273 |
-
TableModalProcessor,
|
274 |
-
EquationModalProcessor,
|
275 |
-
GenericModalProcessor
|
276 |
-
)
|
277 |
-
|
278 |
-
# Initialize LightRAG
|
279 |
-
lightrag = LightRAG(
|
280 |
-
working_dir="./rag_storage",
|
281 |
-
embedding_func=lambda texts: openai_embed(
|
282 |
-
texts,
|
283 |
-
model="text-embedding-3-large",
|
284 |
-
api_key="your-api-key",
|
285 |
-
base_url="your-base-url",
|
286 |
-
),
|
287 |
-
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
|
288 |
-
"gpt-4o-mini",
|
289 |
-
prompt,
|
290 |
-
system_prompt=system_prompt,
|
291 |
-
history_messages=history_messages,
|
292 |
-
api_key="your-api-key",
|
293 |
-
base_url="your-base-url",
|
294 |
-
**kwargs,
|
295 |
-
),
|
296 |
-
)
|
297 |
-
|
298 |
-
# Process an image
|
299 |
-
image_processor = ImageModalProcessor(
|
300 |
-
lightrag=lightrag,
|
301 |
-
modal_caption_func=vision_model_func
|
302 |
-
)
|
303 |
-
|
304 |
-
image_content = {
|
305 |
-
"img_path": "image.jpg",
|
306 |
-
"img_caption": ["Example image caption"],
|
307 |
-
"img_footnote": ["Example image footnote"]
|
308 |
-
}
|
309 |
-
|
310 |
-
description, entity_info = await image_processor.process_multimodal_content(
|
311 |
-
modal_content=image_content,
|
312 |
-
content_type="image",
|
313 |
-
file_path="image_example.jpg",
|
314 |
-
entity_name="Example Image"
|
315 |
-
)
|
316 |
-
|
317 |
-
# Process a table
|
318 |
-
table_processor = TableModalProcessor(
|
319 |
-
lightrag=lightrag,
|
320 |
-
modal_caption_func=llm_model_func
|
321 |
-
)
|
322 |
-
|
323 |
-
table_content = {
|
324 |
-
"table_body": """
|
325 |
-
| Name | Age | Occupation |
|
326 |
-
|------|-----|------------|
|
327 |
-
| John | 25 | Engineer |
|
328 |
-
| Mary | 30 | Designer |
|
329 |
-
""",
|
330 |
-
"table_caption": ["Employee Information Table"],
|
331 |
-
"table_footnote": ["Data updated as of 2024"]
|
332 |
-
}
|
333 |
-
|
334 |
-
description, entity_info = await table_processor.process_multimodal_content(
|
335 |
-
modal_content=table_content,
|
336 |
-
content_type="table",
|
337 |
-
file_path="table_example.md",
|
338 |
-
entity_name="Employee Table"
|
339 |
-
)
|
340 |
-
|
341 |
-
# Process an equation
|
342 |
-
equation_processor = EquationModalProcessor(
|
343 |
-
lightrag=lightrag,
|
344 |
-
modal_caption_func=llm_model_func
|
345 |
-
)
|
346 |
-
|
347 |
-
equation_content = {
|
348 |
-
"text": "E = mc^2",
|
349 |
-
"text_format": "LaTeX"
|
350 |
-
}
|
351 |
-
|
352 |
-
description, entity_info = await equation_processor.process_multimodal_content(
|
353 |
-
modal_content=equation_content,
|
354 |
-
content_type="equation",
|
355 |
-
file_path="equation_example.txt",
|
356 |
-
entity_name="Mass-Energy Equivalence"
|
357 |
-
)
|
358 |
-
```
|
359 |
-
|
360 |
-
</details>
|
docs/mineru_integration_zh.md
DELETED
@@ -1,358 +0,0 @@
|
|
1 |
-
# MinerU Integration Guide
|
2 |
-
|
3 |
-
### About MinerU
|
4 |
-
|
5 |
-
MinerU is a powerful open-source tool for extracting high-quality structured data from PDF, image, and Office documents. It provides the following features:
|
6 |
-
|
7 |
-
- Text extraction that preserves document structure (headings, paragraphs, lists, etc.)
|
8 |
-
- Handling of complex layouts, including multi-column formats
|
9 |
-
- Automatic formula recognition and conversion to LaTeX format
|
10 |
-
- Extraction of images, tables, and footnotes
|
11 |
-
- Automatic detection of scanned documents and application of OCR
|
12 |
-
- Support for multiple output formats (Markdown, JSON)
|
13 |
-
|
14 |
-
### Installation
|
15 |
-
|
16 |
-
#### Installing MinerU Dependencies
|
17 |
-
|
18 |
-
If you have already installed LightRAG without MinerU support, you can add it directly by installing the magic-pdf package:
|
19 |
-
|
20 |
-
```bash
|
21 |
-
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
|
22 |
-
```
|
23 |
-
|
24 |
-
These are the MinerU-related dependencies required by LightRAG.
|
25 |
-
|
26 |
-
#### MinerU Model Weights
|
27 |
-
|
28 |
-
MinerU requires model weight files to run properly. After installation, you need to download the required model weights, using either Hugging Face or ModelScope.
|
29 |
-
|
30 |
-
##### Option 1: Download from Hugging Face
|
31 |
-
|
32 |
-
```bash
|
33 |
-
pip install huggingface_hub
|
34 |
-
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
|
35 |
-
python download_models_hf.py
|
36 |
-
```
|
37 |
-
|
38 |
-
##### Option 2: Download from ModelScope (recommended for users in China)
|
39 |
-
|
40 |
-
```bash
|
41 |
-
pip install modelscope
|
42 |
-
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models.py -O download_models.py
|
43 |
-
python download_models.py
|
44 |
-
```
|
45 |
-
|
46 |
-
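Both methods automatically download the model files and set the model directory in the configuration file. The configuration file is located in your user directory and is named `magic-pdf.json`.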
|
47 |
-
|
48 |
-
> **Note for Windows users**: The user directory is at `C:\Users\username`
|
49 |
-
> **Note for Linux users**: The user directory is at `/home/username`
|
50 |
-
> **Note for macOS users**: The user directory is at `/Users/username`
|
51 |
-
|
52 |
-
#### Optional: Installing LibreOffice
|
53 |
-
|
54 |
-
To process Office documents (DOC, DOCX, PPT, PPTX), you need to install LibreOffice:
|
55 |
-
|
56 |
-
**Linux/macOS:**
|
57 |
-
```bash
|
58 |
-
apt-get/yum/brew install libreoffice
|
59 |
-
```
|
60 |
-
|
61 |
-
**Windows:**
|
62 |
-
1. Install LibreOffice
|
63 |
-
2. Add the installation directory to your PATH environment variable: `install_dir\LibreOffice\program`
|
64 |
-
|
65 |
-
### Using the MinerU Parser
|
66 |
-
|
67 |
-
#### Basic Usage
|
68 |
-
|
69 |
-
```python
|
70 |
-
from lightrag.mineru_parser import MineruParser
|
71 |
-
|
72 |
-
# Parse a PDF document
|
73 |
-
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
|
74 |
-
|
75 |
-
# Parse an image
|
76 |
-
content_list, md_content = MineruParser.parse_image('path/to/image.jpg', 'output_dir')
|
77 |
-
|
78 |
-
# Parse an Office document
|
79 |
-
content_list, md_content = MineruParser.parse_office_doc('path/to/document.docx', 'output_dir')
|
80 |
-
|
81 |
-
# Auto-detect and parse any supported document type
|
82 |
-
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
|
83 |
-
```
|
84 |
-
|
85 |
-
#### RAGAnything Integration
|
86 |
-
|
87 |
-
In RAGAnything, you can pass a file path directly to the `process_document_complete` method to process a document. Here is a complete configuration example:
|
88 |
-
|
89 |
-
```python
|
90 |
-
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
|
91 |
-
from lightrag.raganything import RAGAnything
|
92 |
-
|
93 |
-
|
94 |
-
# Initialize RAGAnything
|
95 |
-
rag = RAGAnything(
|
96 |
-
working_dir="./rag_storage", # Working directory
|
97 |
-
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
|
98 |
-
"gpt-4o-mini", # Model to use
|
99 |
-
prompt,
|
100 |
-
system_prompt=system_prompt,
|
101 |
-
history_messages=history_messages,
|
102 |
-
api_key="your-api-key", # Replace with your API key
|
103 |
-
base_url="your-base-url", # Replace with your API base URL
|
104 |
-
**kwargs,
|
105 |
-
),
|
106 |
-
vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
|
107 |
-
"gpt-4o", # Vision model
|
108 |
-
"",
|
109 |
-
system_prompt=None,
|
110 |
-
history_messages=[],
|
111 |
-
messages=[
|
112 |
-
{"role": "system", "content": system_prompt} if system_prompt else None,
|
113 |
-
{"role": "user", "content": [
|
114 |
-
{"type": "text", "text": prompt},
|
115 |
-
{
|
116 |
-
"type": "image_url",
|
117 |
-
"image_url": {
|
118 |
-
"url": f"data:image/jpeg;base64,{image_data}"
|
119 |
-
}
|
120 |
-
}
|
121 |
-
]} if image_data else {"role": "user", "content": prompt}
|
122 |
-
],
|
123 |
-
api_key="your-api-key", # Replace with your API key
|
124 |
-
base_url="your-base-url", # Replace with your API base URL
|
125 |
-
**kwargs,
|
126 |
-
) if image_data else openai_complete_if_cache(
|
127 |
-
"gpt-4o-mini",
|
128 |
-
prompt,
|
129 |
-
system_prompt=system_prompt,
|
130 |
-
history_messages=history_messages,
|
131 |
-
api_key="your-api-key", # Replace with your API key
|
132 |
-
base_url="your-base-url", # Replace with your API base URL
|
133 |
-
**kwargs,
|
134 |
-
),
|
135 |
-
embedding_func=lambda texts: openai_embed(
|
136 |
-
texts,
|
137 |
-
model="text-embedding-3-large",
|
138 |
-
api_key="your-api-key", # Replace with your API key
|
139 |
-
base_url="your-base-url", # Replace with your API base URL
|
140 |
-
),
|
141 |
-
embedding_dim=3072,
|
142 |
-
max_token_size=8192
|
143 |
-
)
|
144 |
-
|
145 |
-
# Process a single file
|
146 |
-
await rag.process_document_complete(
|
147 |
-
file_path="path/to/document.pdf",
|
148 |
-
output_dir="./output",
|
149 |
-
parse_method="auto"
|
150 |
-
)
|
151 |
-
|
152 |
-
# Query the processed document
|
153 |
-
result = await rag.query_with_multimodal(
|
154 |
-
"What is the main content of the document?",
|
155 |
-
mode="hybrid"
|
156 |
-
)
|
157 |
-
```
|
158 |
-
|
159 |
-
MinerU categorizes document content into text, formulas, images, and tables, and processes each with its corresponding ingestion type:
|
160 |
-
- Text content: `ingestion_type='text'`
|
161 |
-
- Image content: `ingestion_type='image'`
|
162 |
-
- Table content: `ingestion_type='table'`
|
163 |
-
- Formula content: `ingestion_type='equation'`
|
164 |
-
|
165 |
-
#### Query Examples
|
166 |
-
|
167 |
-
Here are some common query examples:
|
168 |
-
|
169 |
-
```python
|
170 |
-
# Query text content
|
171 |
-
result = await rag.query_with_multimodal(
|
172 |
-
"What is the main topic of the document?",
|
173 |
-
mode="hybrid"
|
174 |
-
)
|
175 |
-
|
176 |
-
# Query image-related content
|
177 |
-
result = await rag.query_with_multimodal(
|
178 |
-
"Describe the images and figures in the document",
|
179 |
-
mode="hybrid"
|
180 |
-
)
|
181 |
-
|
182 |
-
# Query table-related content
|
183 |
-
result = await rag.query_with_multimodal(
|
184 |
-
"Tell me about the experimental results and data tables",
|
185 |
-
mode="hybrid"
|
186 |
-
)
|
187 |
-
```
|
188 |
-
|
189 |
-
#### Command Line Tool
|
190 |
-
|
191 |
-
We also provide a command-line tool for document parsing:
|
192 |
-
|
193 |
-
```bash
|
194 |
-
python examples/mineru_example.py path/to/document.pdf
|
195 |
-
```
|
196 |
-
|
197 |
-
Optional parameters:
|
198 |
-
- `--output` or `-o`: Specify the output directory
|
199 |
-
- `--method` or `-m`: Choose the parsing method (auto, ocr, txt)
|
200 |
-
- `--stats`: Display content statistics
|
201 |
-
|
202 |
-
### Output Format
|
203 |
-
|
204 |
-
MinerU generates three files for each parsed document:
|
205 |
-
|
206 |
-
1. `{filename}.md` - Markdown representation of the document
|
207 |
-
2. `{filename}_content_list.json` - Structured JSON content
|
208 |
-
3. `{filename}_model.json` - Detailed model parsing results
|
209 |
-
|
210 |
-
The `content_list.json` file contains all structured content extracted from the document, including:
|
211 |
-
- Text blocks (body text, headings, etc.)
|
212 |
-
- Images (paths and optional captions)
|
213 |
-
- Tables (table content and optional captions)
|
214 |
-
- Lists
|
215 |
-
- 公式
|
216 |
-
|
-### Troubleshooting
-
-If you run into problems when using MinerU:
-
-1. Check that the model weights were downloaded correctly
-2. Make sure you have enough memory (16GB+ recommended)
-3. For CUDA acceleration issues, see the [MinerU documentation](https://mineru.readthedocs.io/en/latest/additional_notes/faq.html)
-4. If parsing Office documents fails, verify that LibreOffice is installed correctly
-5. If you encounter `pickle.UnpicklingError: invalid load key, 'v'.`, the model download may be incomplete. Try downloading the models again.
-6. For users with newer GPUs (H100, etc.) who see garbled OCR text, try upgrading the CUDA version used by Paddle:
-   ```bash
-   pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
-   ```
-7. If you encounter a "file name too long" error, the latest version of MineruParser already includes logic to handle this automatically.
-
-#### Updating existing models
-
-If you have previously downloaded the models and need to update them, simply run the download script again. The script will update the model directory to the latest version.
-
-### Advanced configuration
-
-The MinerU configuration file `magic-pdf.json` supports several customization options, including:
-
-- Model directory path
-- OCR engine selection
-- GPU acceleration settings
-- Cache settings
-
-For the full set of configuration options, see the [official MinerU documentation](https://mineru.readthedocs.io/).
-
-### Using the modal processors directly
-
-You can also use LightRAG's modal processors directly, without going through MinerU. This is especially useful when you want to process specific types of content or need more control over the processing pipeline.
-
-Each modal processor returns a tuple containing:
-1. A description of the processed content
-2. Entity information that can be used for further processing or storage
-
-The processors support different types of content:
-- `ImageModalProcessor`: processes images with captions and footnotes
-- `TableModalProcessor`: processes tables with captions and footnotes
-- `EquationModalProcessor`: processes mathematical equations in LaTeX format
-- `GenericModalProcessor`: a base processor that can be extended for custom content types
-
-> **Note**: A complete runnable example can be found in `examples/modalprocessors_example.py`. You can run it with:
-> ```bash
-> python examples/modalprocessors_example.py --api-key YOUR_API_KEY
-> ```
-
-<details>
-<summary> Examples of using the different modal processors </summary>
-
-```python
-from lightrag.modalprocessors import (
-    ImageModalProcessor,
-    TableModalProcessor,
-    EquationModalProcessor,
-    GenericModalProcessor
-)
-
-# Initialize LightRAG
-lightrag = LightRAG(
-    working_dir="./rag_storage",
-    embedding_func=lambda texts: openai_embed(
-        texts,
-        model="text-embedding-3-large",
-        api_key="your-api-key",
-        base_url="your-base-url",
-    ),
-    llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
-        "gpt-4o-mini",
-        prompt,
-        system_prompt=system_prompt,
-        history_messages=history_messages,
-        api_key="your-api-key",
-        base_url="your-base-url",
-        **kwargs,
-    ),
-)
-
-# Process an image
-image_processor = ImageModalProcessor(
-    lightrag=lightrag,
-    modal_caption_func=vision_model_func
-)
-
-image_content = {
-    "img_path": "image.jpg",
-    "img_caption": ["Example image caption"],
-    "img_footnote": ["Example image footnote"]
-}
-
-description, entity_info = await image_processor.process_multimodal_content(
-    modal_content=image_content,
-    content_type="image",
-    file_path="image_example.jpg",
-    entity_name="Example image"
-)
-
-# Process a table
-table_processor = TableModalProcessor(
-    lightrag=lightrag,
-    modal_caption_func=llm_model_func
-)
-
-table_content = {
-    "table_body": """
-    | Name | Age | Occupation |
-    |------|-----|------|
-    | 张三 | 25 | Engineer |
-    | 李四 | 30 | Designer |
-    """,
-    "table_caption": ["Employee information table"],
-    "table_footnote": ["Data updated through 2024"]
-}
-
-description, entity_info = await table_processor.process_multimodal_content(
-    modal_content=table_content,
-    content_type="table",
-    file_path="table_example.md",
-    entity_name="Employee table"
-)
-
-# Process an equation
-equation_processor = EquationModalProcessor(
-    lightrag=lightrag,
-    modal_caption_func=llm_model_func
-)
-
-equation_content = {
-    "text": "E = mc^2",
-    "text_format": "LaTeX"
-}
-
-description, entity_info = await equation_processor.process_multimodal_content(
-    modal_content=equation_content,
-    content_type="equation",
-    file_path="equation_example.txt",
-    entity_name="Mass-energy equivalence equation"
-)
-```
-</details>
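The removed MinerU integration guide describes `content_list.json` as a list of typed blocks (text, images, tables, lists, equations). As a rough illustration of working with that output, the sketch below tallies blocks by type. It assumes each entry is a JSON object with a `type` key, which is how the removed `examples/mineru_example.py` reads it; the helper names `count_content_types` and `load_content_list` and the sample entries are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_content_types(items):
    """Tally content blocks by their "type" field; entries without one count as "unknown"."""
    return Counter(item.get("type", "unknown") for item in items)


def load_content_list(path):
    """Read a `*_content_list.json` file (assumed to hold a JSON array of objects)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Illustrative data only; real entries carry more fields than shown here.
    sample = [
        {"type": "text", "text": "Introduction"},
        {"type": "table", "table_body": "| a | b |"},
        {"type": "text", "text": "Results"},
        {"img_path": "images/fig1.jpg"},
    ]
    for content_type, count in count_content_types(sample).items():
        print(f"- {content_type}: {count}")
```

Counting missing types as "unknown" mirrors the `item.get("type", "unknown")` call in the removed example script.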
examples/mineru_example.py
DELETED
@@ -1,85 +0,0 @@
-#!/usr/bin/env python
-"""
-Example script demonstrating the basic usage of MinerU parser
-
-This example shows how to:
-1. Parse different types of documents (PDF, images, office documents)
-2. Use different parsing methods
-3. Display document statistics
-"""
-
-import os
-import argparse
-from lightrag.mineru_parser import MineruParser
-
-
-def parse_document(
-    file_path: str, output_dir: str = None, method: str = "auto", stats: bool = False
-):
-    """
-    Parse a document using MinerU parser
-
-    Args:
-        file_path: Path to the document
-        output_dir: Output directory for parsed results
-        method: Parsing method (auto, ocr, txt)
-        stats: Whether to display content statistics
-    """
-    try:
-        # Parse the document
-        content_list, md_content = MineruParser.parse_document(
-            file_path=file_path, parse_method=method, output_dir=output_dir
-        )
-
-        # Display statistics if requested
-        if stats:
-            print("\nDocument Statistics:")
-            print(f"Total content blocks: {len(content_list)}")
-
-            # Count different types of content
-            content_types = {}
-            for item in content_list:
-                content_type = item.get("type", "unknown")
-                content_types[content_type] = content_types.get(content_type, 0) + 1
-
-            print("\nContent Type Distribution:")
-            for content_type, count in content_types.items():
-                print(f"- {content_type}: {count}")
-
-        return content_list, md_content
-
-    except Exception as e:
-        print(f"Error parsing document: {str(e)}")
-        return None, None
-
-
-def main():
-    """Main function to run the example"""
-    parser = argparse.ArgumentParser(description="MinerU Parser Example")
-    parser.add_argument("file_path", help="Path to the document to parse")
-    parser.add_argument("--output", "-o", help="Output directory path")
-    parser.add_argument(
-        "--method",
-        "-m",
-        choices=["auto", "ocr", "txt"],
-        default="auto",
-        help="Parsing method (auto, ocr, txt)",
-    )
-    parser.add_argument(
-        "--stats", action="store_true", help="Display content statistics"
-    )
-
-    args = parser.parse_args()
-
-    # Create output directory if specified
-    if args.output:
-        os.makedirs(args.output, exist_ok=True)
-
-    # Parse document
-    content_list, md_content = parse_document(
-        args.file_path, args.output, args.method, args.stats
-    )
-
-
-if __name__ == "__main__":
-    main()
examples/modalprocessors_example.py
CHANGED
@@ -9,7 +9,7 @@ import argparse
 from lightrag.llm.openai import openai_complete_if_cache, openai_embed
 from lightrag.kg.shared_storage import initialize_pipeline_status
 from lightrag import LightRAG
-from
+from raganything.modalprocessors import (
     ImageModalProcessor,
     TableModalProcessor,
     EquationModalProcessor,
examples/raganything_example.py
CHANGED
@@ -12,7 +12,7 @@ import os
 import argparse
 import asyncio
 from lightrag.llm.openai import openai_complete_if_cache, openai_embed
-from
+from raganything.raganything import RAGAnything
 
 
 async def process_with_rag(
lightrag/mineru_parser.py
DELETED
@@ -1,513 +0,0 @@
-# type: ignore
-"""
-MinerU Document Parser Utility
-
-This module provides functionality for parsing PDF, image and office documents using MinerU library,
-and converts the parsing results into markdown and JSON formats
-"""
-
-from __future__ import annotations
-
-__all__ = ["MineruParser"]
-
-import os
-import json
-import argparse
-from pathlib import Path
-from typing import (
-    Dict,
-    List,
-    Optional,
-    Union,
-    Tuple,
-    Any,
-    TypeVar,
-    cast,
-    TYPE_CHECKING,
-    ClassVar,
-)
-
-# Type stubs for magic_pdf
-FileBasedDataWriter = Any
-FileBasedDataReader = Any
-PymuDocDataset = Any
-InferResult = Any
-PipeResult = Any
-SupportedPdfParseMethod = Any
-doc_analyze = Any
-read_local_office = Any
-read_local_images = Any
-
-if TYPE_CHECKING:
-    from magic_pdf.data.data_reader_writer import (
-        FileBasedDataWriter,
-        FileBasedDataReader,
-    )
-    from magic_pdf.data.dataset import PymuDocDataset
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.config.enums import SupportedPdfParseMethod
-    from magic_pdf.data.read_api import read_local_office, read_local_images
-else:
-    # MinerU imports
-    from magic_pdf.data.data_reader_writer import (
-        FileBasedDataWriter,
-        FileBasedDataReader,
-    )
-    from magic_pdf.data.dataset import PymuDocDataset
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.config.enums import SupportedPdfParseMethod
-    from magic_pdf.data.read_api import read_local_office, read_local_images
-
-T = TypeVar("T")
-
-
-class MineruParser:
-    """
-    MinerU document parsing utility class
-
-    Supports parsing PDF, image and office documents (like Word, PPT, etc.),
-    converting the content into structured data and generating markdown and JSON output
-    """
-
-    __slots__: ClassVar[Tuple[str, ...]] = ()
-
-    def __init__(self) -> None:
-        """Initialize MineruParser"""
-        pass
-
-    @staticmethod
-    def safe_write(
-        writer: Any,
-        content: Union[str, bytes, Dict[str, Any], List[Any]],
-        filename: str,
-    ) -> None:
-        """
-        Safely write content to a file, ensuring the filename is valid
-
-        Args:
-            writer: The writer object to use
-            content: The content to write
-            filename: The filename to write to
-        """
-        # Ensure the filename isn't too long
-        if len(filename) > 200:  # Most filesystems have limits around 255 characters
-            # Truncate the filename while keeping the extension
-            base, ext = os.path.splitext(filename)
-            filename = base[:190] + ext  # Leave room for the extension and some margin
-
-        # Handle specific content types
-        if isinstance(content, str):
-            # Ensure str content is encoded to bytes if required
-            try:
-                writer.write(content, filename)
-            except TypeError:
-                # If the writer expects bytes, convert string to bytes
-                writer.write(content.encode("utf-8"), filename)
-        else:
-            # For dict/list content, always encode as JSON string first
-            if isinstance(content, (dict, list)):
-                try:
-                    writer.write(
-                        json.dumps(content, ensure_ascii=False, indent=4), filename
-                    )
-                except TypeError:
-                    # If the writer expects bytes, convert JSON string to bytes
-                    writer.write(
-                        json.dumps(content, ensure_ascii=False, indent=4).encode(
-                            "utf-8"
-                        ),
-                        filename,
-                    )
-            else:
-                # Regular content (assumed to be bytes or compatible)
-                writer.write(content, filename)
-
-    @staticmethod
-    def parse_pdf(
-        pdf_path: Union[str, Path],
-        output_dir: Optional[str] = None,
-        use_ocr: bool = False,
-    ) -> Tuple[List[Dict[str, Any]], str]:
-        """
-        Parse PDF document
-
-        Args:
-            pdf_path: Path to the PDF file
-            output_dir: Output directory path
-            use_ocr: Whether to force OCR parsing
-
-        Returns:
-            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
-        """
-        try:
-            # Convert to Path object for easier handling
-            pdf_path = Path(pdf_path)
-            name_without_suff = pdf_path.stem
-
-            # Prepare output directories - ensure file name is in path
-            if output_dir:
-                base_output_dir = Path(output_dir)
-                local_md_dir = base_output_dir / name_without_suff
-            else:
-                local_md_dir = pdf_path.parent / name_without_suff
-
-            local_image_dir = local_md_dir / "images"
-            image_dir = local_image_dir.name
-
-            # Create directories
-            os.makedirs(local_image_dir, exist_ok=True)
-            os.makedirs(local_md_dir, exist_ok=True)
-
-            # Initialize writers and reader
-            image_writer = FileBasedDataWriter(str(local_image_dir))  # type: ignore
-            md_writer = FileBasedDataWriter(str(local_md_dir))  # type: ignore
-            reader = FileBasedDataReader("")  # type: ignore
-
-            # Read PDF bytes
-            pdf_bytes = reader.read(str(pdf_path))  # type: ignore
-
-            # Create dataset instance
-            ds = PymuDocDataset(pdf_bytes)  # type: ignore
-
-            # Process based on PDF type and user preference
-            if use_ocr or ds.classify() == SupportedPdfParseMethod.OCR:  # type: ignore
-                infer_result = ds.apply(doc_analyze, ocr=True)  # type: ignore
-                pipe_result = infer_result.pipe_ocr_mode(image_writer)  # type: ignore
-            else:
-                infer_result = ds.apply(doc_analyze, ocr=False)  # type: ignore
-                pipe_result = infer_result.pipe_txt_mode(image_writer)  # type: ignore
-
-            # Draw visualizations
-            try:
-                infer_result.draw_model(
-                    os.path.join(local_md_dir, f"{name_without_suff}_model.pdf")
-                )  # type: ignore
-                pipe_result.draw_layout(
-                    os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf")
-                )  # type: ignore
-                pipe_result.draw_span(
-                    os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf")
-                )  # type: ignore
-            except Exception as e:
-                print(f"Warning: Failed to draw visualizations: {str(e)}")
-
-            # Get data using API methods
-            md_content = pipe_result.get_markdown(image_dir)  # type: ignore
-            content_list = pipe_result.get_content_list(image_dir)  # type: ignore
-
-            # Save files using dump methods (consistent with API)
-            pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)  # type: ignore
-            pipe_result.dump_content_list(
-                md_writer, f"{name_without_suff}_content_list.json", image_dir
-            )  # type: ignore
-            pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json")  # type: ignore
-
-            # Save model result - convert JSON string to bytes before writing
-            model_inference_result = infer_result.get_infer_res()  # type: ignore
-            json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
-
-            try:
-                # Try to write to a file manually to avoid FileBasedDataWriter issues
-                model_file_path = os.path.join(
-                    local_md_dir, f"{name_without_suff}_model.json"
-                )
-                with open(model_file_path, "w", encoding="utf-8") as f:
-                    f.write(json_str)
-            except Exception as e:
-                print(
-                    f"Warning: Failed to save model result using file write: {str(e)}"
-                )
-                try:
-                    # If direct file write fails, try using the writer with bytes encoding
-                    md_writer.write(
-                        json_str.encode("utf-8"), f"{name_without_suff}_model.json"
-                    )  # type: ignore
-                except Exception as e2:
-                    print(
-                        f"Warning: Failed to save model result using writer: {str(e2)}"
-                    )
-
-            return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
-
-        except Exception as e:
-            print(f"Error in parse_pdf: {str(e)}")
-            raise
-
-    @staticmethod
-    def parse_office_doc(
-        doc_path: Union[str, Path], output_dir: Optional[str] = None
-    ) -> Tuple[List[Dict[str, Any]], str]:
-        """
-        Parse office document (Word, PPT, etc.)
-
-        Args:
-            doc_path: Path to the document file
-            output_dir: Output directory path
-
-        Returns:
-            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
-        """
-        try:
-            # Convert to Path object for easier handling
-            doc_path = Path(doc_path)
-            name_without_suff = doc_path.stem
-
-            # Prepare output directories - ensure file name is in path
-            if output_dir:
-                base_output_dir = Path(output_dir)
-                local_md_dir = base_output_dir / name_without_suff
-            else:
-                local_md_dir = doc_path.parent / name_without_suff
-
-            local_image_dir = local_md_dir / "images"
-            image_dir = local_image_dir.name
-
-            # Create directories
-            os.makedirs(local_image_dir, exist_ok=True)
-            os.makedirs(local_md_dir, exist_ok=True)
-
-            # Initialize writers
-            image_writer = FileBasedDataWriter(str(local_image_dir))  # type: ignore
-            md_writer = FileBasedDataWriter(str(local_md_dir))  # type: ignore
-
-            # Read office document
-            ds = read_local_office(str(doc_path))[0]  # type: ignore
-
-            # Apply chain of operations according to API documentation
-            # This follows the pattern shown in MS-Office example in the API docs
-            ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
-                md_writer, f"{name_without_suff}.md", image_dir
-            )  # type: ignore
-
-            # Re-execute for getting the content data
-            infer_result = ds.apply(doc_analyze, ocr=True)  # type: ignore
-            pipe_result = infer_result.pipe_txt_mode(image_writer)  # type: ignore
-
-            # Get data for return values and additional outputs
-            md_content = pipe_result.get_markdown(image_dir)  # type: ignore
-            content_list = pipe_result.get_content_list(image_dir)  # type: ignore
-
-            # Save additional output files
-            pipe_result.dump_content_list(
-                md_writer, f"{name_without_suff}_content_list.json", image_dir
-            )  # type: ignore
-            pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json")  # type: ignore
-
-            # Save model result - convert JSON string to bytes before writing
-            model_inference_result = infer_result.get_infer_res()  # type: ignore
-            json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
-
-            try:
-                # Try to write to a file manually to avoid FileBasedDataWriter issues
-                model_file_path = os.path.join(
-                    local_md_dir, f"{name_without_suff}_model.json"
-                )
-                with open(model_file_path, "w", encoding="utf-8") as f:
-                    f.write(json_str)
-            except Exception as e:
-                print(
-                    f"Warning: Failed to save model result using file write: {str(e)}"
-                )
-                try:
-                    # If direct file write fails, try using the writer with bytes encoding
-                    md_writer.write(
-                        json_str.encode("utf-8"), f"{name_without_suff}_model.json"
-                    )  # type: ignore
-                except Exception as e2:
-                    print(
-                        f"Warning: Failed to save model result using writer: {str(e2)}"
-                    )
-
-            return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
-
-        except Exception as e:
-            print(f"Error in parse_office_doc: {str(e)}")
-            raise
-
-    @staticmethod
-    def parse_image(
-        image_path: Union[str, Path], output_dir: Optional[str] = None
-    ) -> Tuple[List[Dict[str, Any]], str]:
-        """
-        Parse image document
-
-        Args:
-            image_path: Path to the image file
-            output_dir: Output directory path
-
-        Returns:
-            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
-        """
-        try:
-            # Convert to Path object for easier handling
-            image_path = Path(image_path)
-            name_without_suff = image_path.stem
-
-            # Prepare output directories - ensure file name is in path
-            if output_dir:
-                base_output_dir = Path(output_dir)
-                local_md_dir = base_output_dir / name_without_suff
-            else:
-                local_md_dir = image_path.parent / name_without_suff
-
-            local_image_dir = local_md_dir / "images"
-            image_dir = local_image_dir.name
-
-            # Create directories
-            os.makedirs(local_image_dir, exist_ok=True)
-            os.makedirs(local_md_dir, exist_ok=True)
-
-            # Initialize writers
-            image_writer = FileBasedDataWriter(str(local_image_dir))  # type: ignore
-            md_writer = FileBasedDataWriter(str(local_md_dir))  # type: ignore
-
-            # Read image
-            ds = read_local_images(str(image_path))[0]  # type: ignore
-
-            # Apply chain of operations according to API documentation
-            # This follows the pattern shown in Image example in the API docs
-            ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
-                md_writer, f"{name_without_suff}.md", image_dir
-            )  # type: ignore
-
-            # Re-execute for getting the content data
-            infer_result = ds.apply(doc_analyze, ocr=True)  # type: ignore
-            pipe_result = infer_result.pipe_ocr_mode(image_writer)  # type: ignore
-
-            # Get data for return values and additional outputs
-            md_content = pipe_result.get_markdown(image_dir)  # type: ignore
-            content_list = pipe_result.get_content_list(image_dir)  # type: ignore
-
-            # Save additional output files
-            pipe_result.dump_content_list(
-                md_writer, f"{name_without_suff}_content_list.json", image_dir
-            )  # type: ignore
-            pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json")  # type: ignore
-
-            # Save model result - convert JSON string to bytes before writing
-            model_inference_result = infer_result.get_infer_res()  # type: ignore
-            json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
-
-            try:
-                # Try to write to a file manually to avoid FileBasedDataWriter issues
-                model_file_path = os.path.join(
-                    local_md_dir, f"{name_without_suff}_model.json"
-                )
-                with open(model_file_path, "w", encoding="utf-8") as f:
-                    f.write(json_str)
-            except Exception as e:
-                print(
-                    f"Warning: Failed to save model result using file write: {str(e)}"
-                )
-                try:
-                    # If direct file write fails, try using the writer with bytes encoding
-                    md_writer.write(
-                        json_str.encode("utf-8"), f"{name_without_suff}_model.json"
-                    )  # type: ignore
-                except Exception as e2:
-                    print(
-                        f"Warning: Failed to save model result using writer: {str(e2)}"
-                    )
-
-            return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
-
-        except Exception as e:
-            print(f"Error in parse_image: {str(e)}")
-            raise
-
-    @staticmethod
-    def parse_document(
-        file_path: Union[str, Path],
-        parse_method: str = "auto",
-        output_dir: Optional[str] = None,
-        save_results: bool = True,
-    ) -> Tuple[List[Dict[str, Any]], str]:
-        """
-        Parse document using MinerU based on file extension
-
-        Args:
-            file_path: Path to the file to be parsed
-            parse_method: Parsing method, supports "auto", "ocr", "txt", default is "auto"
-            output_dir: Output directory path, if None, use the directory of the input file
-            save_results: Whether to save parsing results to files
-
-        Returns:
-            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
-        """
-        # Convert to Path object
-        file_path = Path(file_path)
-        if not file_path.exists():
-            raise FileNotFoundError(f"File does not exist: {file_path}")
-
-        # Get file extension
-        ext = file_path.suffix.lower()
-
-        # Choose appropriate parser based on file type
-        if ext in [".pdf"]:
-            return MineruParser.parse_pdf(
-                file_path, output_dir, use_ocr=(parse_method == "ocr")
-            )
-        elif ext in [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]:
-            return MineruParser.parse_image(file_path, output_dir)
-        elif ext in [".doc", ".docx", ".ppt", ".pptx"]:
-            return MineruParser.parse_office_doc(file_path, output_dir)
-        else:
-            # For unsupported file types, default to PDF parsing
-            print(
-                f"Warning: Unsupported file extension '{ext}', trying generic PDF parser"
-            )
-            return MineruParser.parse_pdf(
-                file_path, output_dir, use_ocr=(parse_method == "ocr")
-            )
-
-
-def main():
-    """
-    Main function to run the MinerU parser from command line
-    """
-    parser = argparse.ArgumentParser(description="Parse documents using MinerU")
-    parser.add_argument("file_path", help="Path to the document to parse")
-    parser.add_argument("--output", "-o", help="Output directory path")
-    parser.add_argument(
-        "--method",
-        "-m",
-        choices=["auto", "ocr", "txt"],
-        default="auto",
-        help="Parsing method (auto, ocr, txt)",
-    )
-    parser.add_argument(
-        "--stats", action="store_true", help="Display content statistics"
-    )
-
-    args = parser.parse_args()
-
-    try:
-        # Parse the document
-        content_list, md_content = MineruParser.parse_document(
-            file_path=args.file_path, parse_method=args.method, output_dir=args.output
-        )
-
-        # Display statistics if requested
-        if args.stats:
-            print("\nDocument Statistics:")
-            print(f"Total content blocks: {len(content_list)}")
-
-            # Count different types of content
-            content_types = {}
-            for item in content_list:
-                content_type = item.get("type", "unknown")
-                content_types[content_type] = content_types.get(content_type, 0) + 1
-
-            print("\nContent Type Distribution:")
-            for content_type, count in content_types.items():
-                print(f"- {content_type}: {count}")
-
-    except Exception as e:
-        print(f"Error: {str(e)}")
-        return 1
-
-    return 0
-
-
-if __name__ == "__main__":
-    exit(main())
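Two behaviours of the removed `MineruParser` are easy to miss when reading the deletion. `safe_write` shortens file names longer than 200 characters to the first 190 characters of the base name plus the extension, and `parse_document` picks a parser from the file extension, falling back to the PDF parser for anything it does not recognise. The sketch below restates that logic on its own, with no MinerU dependency; the helper names `truncate_filename` and `pick_parser` and the returned labels are hypothetical and are not part of the removed module.

```python
import os
from pathlib import Path

# Extension groups copied from the removed parse_document dispatch.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"}
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx"}


def truncate_filename(filename, limit=200, keep=190):
    """Shorten an over-long file name while keeping its extension."""
    if len(filename) <= limit:
        return filename
    base, ext = os.path.splitext(filename)
    return base[:keep] + ext


def pick_parser(file_path):
    """Return "pdf", "image", or "office" from the extension; unknown types fall back to "pdf"."""
    ext = Path(file_path).suffix.lower()
    if ext in IMAGE_EXTS:
        return "image"
    if ext in OFFICE_EXTS:
        return "office"
    return "pdf"


if __name__ == "__main__":
    print(pick_parser("report.docx"))
    print(len(truncate_filename("x" * 300 + ".json")))
```

The fallback means a file with an unsupported extension is still handed to the PDF parser, which is what the removed code did after printing a warning.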
lightrag/modalprocessors.py
DELETED
@@ -1,699 +0,0 @@
-"""
-Specialized processors for different modalities
-
-Includes:
-- ImageModalProcessor: Specialized processor for image content
-- TableModalProcessor: Specialized processor for table content
-- EquationModalProcessor: Specialized processor for equation content
-- GenericModalProcessor: Processor for other modal content
-"""
-
-import re
-import json
-import time
-import asyncio
-import base64
-from typing import Dict, Any, Tuple, cast
-from pathlib import Path
-
-from lightrag.base import StorageNameSpace
-from lightrag.utils import (
-    logger,
-    compute_mdhash_id,
-)
-from lightrag.lightrag import LightRAG
-from dataclasses import asdict
-from lightrag.kg.shared_storage import get_namespace_data, get_pipeline_status_lock
-
-
-class BaseModalProcessor:
-    """Base class for modal processors"""
-
-    def __init__(self, lightrag: LightRAG, modal_caption_func):
-        """Initialize base processor
-
-        Args:
-            lightrag: LightRAG instance
-            modal_caption_func: Function for generating descriptions
-        """
-        self.lightrag = lightrag
-        self.modal_caption_func = modal_caption_func
-
-        # Use LightRAG's storage instances
-        self.text_chunks_db = lightrag.text_chunks
-        self.chunks_vdb = lightrag.chunks_vdb
-        self.entities_vdb = lightrag.entities_vdb
-        self.relationships_vdb = lightrag.relationships_vdb
-        self.knowledge_graph_inst = lightrag.chunk_entity_relation_graph
-
-        # Use LightRAG's configuration and functions
-        self.embedding_func = lightrag.embedding_func
-        self.llm_model_func = lightrag.llm_model_func
-        self.global_config = asdict(lightrag)
-        self.hashing_kv = lightrag.llm_response_cache
-        self.tokenizer = lightrag.tokenizer
-
-    async def process_multimodal_content(
-        self,
-        modal_content,
-        content_type: str,
-        file_path: str = "manual_creation",
-        entity_name: str = None,
-    ) -> Tuple[str, Dict[str, Any]]:
-        """Process multimodal content"""
-        # Subclasses need to implement specific processing logic
-        raise NotImplementedError("Subclasses must implement this method")
-
-    async def _create_entity_and_chunk(
-        self, modal_chunk: str, entity_info: Dict[str, Any], file_path: str
-    ) -> Tuple[str, Dict[str, Any]]:
-        """Create entity and text chunk"""
-        # Create chunk
-        chunk_id = compute_mdhash_id(str(modal_chunk), prefix="chunk-")
-        tokens = len(self.tokenizer.encode(modal_chunk))
-
-        chunk_data = {
-            "tokens": tokens,
-            "content": modal_chunk,
-            "chunk_order_index": 0,
-            "full_doc_id": chunk_id,
-            "file_path": file_path,
-        }
-
-        # Store chunk
-        await self.text_chunks_db.upsert({chunk_id: chunk_data})
-
-        # Create entity node
-        node_data = {
-            "entity_id": entity_info["entity_name"],
-            "entity_type": entity_info["entity_type"],
-            "description": entity_info["summary"],
-            "source_id": chunk_id,
-            "file_path": file_path,
-            "created_at": int(time.time()),
-        }
-
-        await self.knowledge_graph_inst.upsert_node(
-            entity_info["entity_name"], node_data
-        )
-
-        # Insert entity into vector database
-        entity_vdb_data = {
-            compute_mdhash_id(entity_info["entity_name"], prefix="ent-"): {
-                "entity_name": entity_info["entity_name"],
-                "entity_type": entity_info["entity_type"],
-                "content": f"{entity_info['entity_name']}\n{entity_info['summary']}",
-                "source_id": chunk_id,
-                "file_path": file_path,
-            }
-        }
-        await self.entities_vdb.upsert(entity_vdb_data)
-
-        # Process entity and relationship extraction
-        await self._process_chunk_for_extraction(chunk_id, entity_info["entity_name"])
-
-        # Ensure all storage updates are complete
-        await self._insert_done()
-
-        return entity_info["summary"], {
-            "entity_name": entity_info["entity_name"],
-            "entity_type": entity_info["entity_type"],
-            "description": entity_info["summary"],
-            "chunk_id": chunk_id,
-        }
-
-    async def _process_chunk_for_extraction(
-        self, chunk_id: str, modal_entity_name: str
-    ):
-        """Process chunk for entity and relationship extraction"""
-        chunk_data = await self.text_chunks_db.get_by_id(chunk_id)
-        if not chunk_data:
-            logger.error(f"Chunk {chunk_id} not found")
-            return
-
-        # Create text chunk for vector database
-        chunk_vdb_data = {
-            chunk_id: {
-                "content": chunk_data["content"],
-                "full_doc_id": chunk_id,
-                "tokens": chunk_data["tokens"],
-                "chunk_order_index": chunk_data["chunk_order_index"],
-                "file_path": chunk_data["file_path"],
-            }
-        }
-
-        await self.chunks_vdb.upsert(chunk_vdb_data)
-
-        # Trigger extraction process
-        from lightrag.operate import extract_entities, merge_nodes_and_edges
-
-        pipeline_status = await get_namespace_data("pipeline_status")
-        pipeline_status_lock = get_pipeline_status_lock()
-
-        # Prepare chunk for extraction
-        chunks = {chunk_id: chunk_data}
-
-        # Extract entities and relationships
-        chunk_results = await extract_entities(
-            chunks=chunks,
-            global_config=self.global_config,
-            pipeline_status=pipeline_status,
-            pipeline_status_lock=pipeline_status_lock,
-            llm_response_cache=self.hashing_kv,
-        )
-
-        # Add "belongs_to" relationships for all extracted entities
-        for maybe_nodes, _ in chunk_results:
-            for entity_name in maybe_nodes.keys():
-                if entity_name != modal_entity_name:  # Skip self-relationship
-                    # Create belongs_to relationship
-                    relation_data = {
-                        "description": f"Entity {entity_name} belongs to {modal_entity_name}",
-                        "keywords": "belongs_to,part_of,contained_in",
-                        "source_id": chunk_id,
-                        "weight": 10.0,
-                        "file_path": chunk_data.get("file_path", "manual_creation"),
-                    }
-                    await self.knowledge_graph_inst.upsert_edge(
-                        entity_name, modal_entity_name, relation_data
-                    )
-
-                    relation_id = compute_mdhash_id(
-                        entity_name + modal_entity_name, prefix="rel-"
-                    )
-                    relation_vdb_data = {
-                        relation_id: {
-                            "src_id": entity_name,
-                            "tgt_id": modal_entity_name,
-                            "keywords": relation_data["keywords"],
-                            "content": f"{relation_data['keywords']}\t{entity_name}\n{modal_entity_name}\n{relation_data['description']}",
-                            "source_id": chunk_id,
-                            "file_path": chunk_data.get("file_path", "manual_creation"),
-                        }
-                    }
-                    await self.relationships_vdb.upsert(relation_vdb_data)
-
-        await merge_nodes_and_edges(
-            chunk_results=chunk_results,
-            knowledge_graph_inst=self.knowledge_graph_inst,
-            entity_vdb=self.entities_vdb,
-            relationships_vdb=self.relationships_vdb,
-            global_config=self.global_config,
-            pipeline_status=pipeline_status,
-            pipeline_status_lock=pipeline_status_lock,
-            llm_response_cache=self.hashing_kv,
-        )
-
-    async def _insert_done(self) -> None:
-        await asyncio.gather(
-            *[
-                cast(StorageNameSpace, storage_inst).index_done_callback()
-                for storage_inst in [
-                    self.text_chunks_db,
-                    self.chunks_vdb,
-                    self.entities_vdb,
-                    self.relationships_vdb,
-                    self.knowledge_graph_inst,
-                ]
-            ]
-        )
-
-
-class ImageModalProcessor(BaseModalProcessor):
-    """Processor specialized for image content"""
-
-    def __init__(self, lightrag: LightRAG, modal_caption_func):
-        """Initialize image processor
-
-        Args:
-            lightrag: LightRAG instance
-            modal_caption_func: Function for generating descriptions (supporting image understanding)
-        """
-        super().__init__(lightrag, modal_caption_func)
-
-    def _encode_image_to_base64(self, image_path: str) -> str:
-        """Encode image to base64"""
-        try:
-            with open(image_path, "rb") as image_file:
-                encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
-            return encoded_string
-        except Exception as e:
-            logger.error(f"Failed to encode image {image_path}: {e}")
-            return ""
-
-    async def process_multimodal_content(
-        self,
-        modal_content,
-        content_type: str,
-        file_path: str = "manual_creation",
-        entity_name: str = None,
-    ) -> Tuple[str, Dict[str, Any]]:
-        """Process image content"""
-        try:
-            # Parse image content
-            if isinstance(modal_content, str):
-                try:
-                    content_data = json.loads(modal_content)
-                except json.JSONDecodeError:
-                    content_data = {"description": modal_content}
-            else:
-                content_data = modal_content
-
-            image_path = content_data.get("img_path")
-            captions = content_data.get("img_caption", [])
-            footnotes = content_data.get("img_footnote", [])
-
-            # Build detailed visual analysis prompt
-            vision_prompt = f"""Please analyze this image in detail and provide a JSON response with the following structure:
-
-            {{
-                "detailed_description": "A comprehensive and detailed visual description of the image following these guidelines:
-                - Describe the overall composition and layout
-                - Identify all objects, people, text, and visual elements
-                - Explain relationships between elements
-                - Note colors, lighting, and visual style
-                - Describe any actions or activities shown
-                - Include technical details if relevant (charts, diagrams, etc.)
-                - Always use specific names instead of pronouns",
-                "entity_info": {{
-                    "entity_name": "{entity_name if entity_name else 'unique descriptive name for this image'}",
-                    "entity_type": "image",
-                    "summary": "concise summary of the image content and its significance (max 100 words)"
-                }}
-            }}
-
-            Additional context:
-            - Image Path: {image_path}
-            - Captions: {captions if captions else 'None'}
-            - Footnotes: {footnotes if footnotes else 'None'}
-
-            Focus on providing accurate, detailed visual analysis that would be useful for knowledge retrieval."""
-
-            # If image path exists, try to encode image
-            image_base64 = ""
-            if image_path and Path(image_path).exists():
-                image_base64 = self._encode_image_to_base64(image_path)
-
-            # Call vision model
-            if image_base64:
-                # Use real image for analysis
-                response = await self.modal_caption_func(
-                    vision_prompt,
-                    image_data=image_base64,
-                    system_prompt="You are an expert image analyst. Provide detailed, accurate descriptions.",
-                )
-            else:
-                # Analyze based on existing text information
-                text_prompt = f"""Based on the following image information, provide analysis:
-
-                Image Path: {image_path}
-                Captions: {captions}
-                Footnotes: {footnotes}
-
-                {vision_prompt}"""
-
-                response = await self.modal_caption_func(
-                    text_prompt,
-                    system_prompt="You are an expert image analyst. Provide detailed analysis based on available information.",
-                )
-
-            # Parse response
-            enhanced_caption, entity_info = self._parse_response(response, entity_name)
-
-            # Build complete image content
-            modal_chunk = f"""
-            Image Content Analysis:
-            Image Path: {image_path}
-            Captions: {', '.join(captions) if captions else 'None'}
-            Footnotes: {', '.join(footnotes) if footnotes else 'None'}
-
-            Visual Analysis: {enhanced_caption}"""
-
-            return await self._create_entity_and_chunk(
-                modal_chunk, entity_info, file_path
-            )
-
-        except Exception as e:
-            logger.error(f"Error processing image content: {e}")
-            # Fallback processing
-            fallback_entity = {
-                "entity_name": entity_name
-                if entity_name
-                else f"image_{compute_mdhash_id(str(modal_content))}",
-                "entity_type": "image",
-                "summary": f"Image content: {str(modal_content)[:100]}",
-            }
-            return str(modal_content), fallback_entity
-
-    def _parse_response(
-        self, response: str, entity_name: str = None
-    ) -> Tuple[str, Dict[str, Any]]:
-        """Parse model response"""
-        try:
-            response_data = json.loads(
-                re.search(r"\{.*\}", response, re.DOTALL).group(0)
-            )
-
-            description = response_data.get("detailed_description", "")
-            entity_data = response_data.get("entity_info", {})
-
-            if not description or not entity_data:
-                raise ValueError("Missing required fields in response")
-
-            if not all(
-                key in entity_data for key in ["entity_name", "entity_type", "summary"]
-            ):
-                raise ValueError("Missing required fields in entity_info")
-
-            entity_data["entity_name"] = (
-                entity_data["entity_name"] + f" ({entity_data['entity_type']})"
-            )
-            if entity_name:
-                entity_data["entity_name"] = entity_name
-
-            return description, entity_data
-
-        except (json.JSONDecodeError, AttributeError, ValueError) as e:
-            logger.error(f"Error parsing image analysis response: {e}")
-            fallback_entity = {
-                "entity_name": entity_name
-                if entity_name
-                else f"image_{compute_mdhash_id(response)}",
-                "entity_type": "image",
-                "summary": response[:100] + "..." if len(response) > 100 else response,
-            }
-            return response, fallback_entity
-
-
-class TableModalProcessor(BaseModalProcessor):
-    """Processor specialized for table content"""
-
-    async def process_multimodal_content(
-        self,
-        modal_content,
-        content_type: str,
-        file_path: str = "manual_creation",
-        entity_name: str = None,
-    ) -> Tuple[str, Dict[str, Any]]:
-        """Process table content"""
-        # Parse table content
-        if isinstance(modal_content, str):
-            try:
-                content_data = json.loads(modal_content)
-            except json.JSONDecodeError:
-                content_data = {"table_body": modal_content}
-        else:
-            content_data = modal_content
-
-        table_img_path = content_data.get("img_path")
-        table_caption = content_data.get("table_caption", [])
-        table_body = content_data.get("table_body", "")
-        table_footnote = content_data.get("table_footnote", [])
-
-        # Build table analysis prompt
-        table_prompt = f"""Please analyze this table content and provide a JSON response with the following structure:
-
-        {{
-            "detailed_description": "A comprehensive analysis of the table including:
-            - Table structure and organization
-            - Column headers and their meanings
-            - Key data points and patterns
-            - Statistical insights and trends
-            - Relationships between data elements
-            - Significance of the data presented
-            Always use specific names and values instead of general references.",
-            "entity_info": {{
-                "entity_name": "{entity_name if entity_name else 'descriptive name for this table'}",
-                "entity_type": "table",
-                "summary": "concise summary of the table's purpose and key findings (max 100 words)"
-            }}
-        }}
-
-        Table Information:
-        Image Path: {table_img_path}
-        Caption: {table_caption if table_caption else 'None'}
-        Body: {table_body}
-        Footnotes: {table_footnote if table_footnote else 'None'}
-
-        Focus on extracting meaningful insights and relationships from the tabular data."""
-
-        response = await self.modal_caption_func(
-            table_prompt,
-            system_prompt="You are an expert data analyst. Provide detailed table analysis with specific insights.",
-        )
-
-        # Parse response
-        enhanced_caption, entity_info = self._parse_table_response(
-            response, entity_name
-        )
-
-        # TODO: Add Retry Mechanism
-
-        # Build complete table content
-        modal_chunk = f"""Table Analysis:
-        Image Path: {table_img_path}
-        Caption: {', '.join(table_caption) if table_caption else 'None'}
-        Structure: {table_body}
-        Footnotes: {', '.join(table_footnote) if table_footnote else 'None'}
-
-        Analysis: {enhanced_caption}"""
-
-        return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
-
-    def _parse_table_response(
-        self, response: str, entity_name: str = None
-    ) -> Tuple[str, Dict[str, Any]]:
-        """Parse table analysis response"""
-        try:
-            response_data = json.loads(
-                re.search(r"\{.*\}", response, re.DOTALL).group(0)
-            )
-
-            description = response_data.get("detailed_description", "")
-            entity_data = response_data.get("entity_info", {})
-
-            if not description or not entity_data:
-                raise ValueError("Missing required fields in response")
-
-            if not all(
-                key in entity_data for key in ["entity_name", "entity_type", "summary"]
-            ):
-                raise ValueError("Missing required fields in entity_info")
-
-            entity_data["entity_name"] = (
-                entity_data["entity_name"] + f" ({entity_data['entity_type']})"
-            )
-            if entity_name:
-                entity_data["entity_name"] = entity_name
-
-            return description, entity_data
-
-        except (json.JSONDecodeError, AttributeError, ValueError) as e:
-            logger.error(f"Error parsing table analysis response: {e}")
-            fallback_entity = {
-                "entity_name": entity_name
-                if entity_name
-                else f"table_{compute_mdhash_id(response)}",
-                "entity_type": "table",
-                "summary": response[:100] + "..." if len(response) > 100 else response,
-            }
-            return response, fallback_entity
-
-
-class EquationModalProcessor(BaseModalProcessor):
-    """Processor specialized for equation content"""
-
-    async def process_multimodal_content(
-        self,
-        modal_content,
-        content_type: str,
-        file_path: str = "manual_creation",
-        entity_name: str = None,
-    ) -> Tuple[str, Dict[str, Any]]:
-        """Process equation content"""
-        # Parse equation content
-        if isinstance(modal_content, str):
-            try:
-                content_data = json.loads(modal_content)
-            except json.JSONDecodeError:
-                content_data = {"equation": modal_content}
-        else:
-            content_data = modal_content
-
-        equation_text = content_data.get("text")
-        equation_format = content_data.get("text_format", "")
-
-        # Build equation analysis prompt
-        equation_prompt = f"""Please analyze this mathematical equation and provide a JSON response with the following structure:
-
-        {{
-            "detailed_description": "A comprehensive analysis of the equation including:
-            - Mathematical meaning and interpretation
-            - Variables and their definitions
-            - Mathematical operations and functions used
-            - Application domain and context
-            - Physical or theoretical significance
-            - Relationship to other mathematical concepts
-            - Practical applications or use cases
-            Always use specific mathematical terminology.",
-            "entity_info": {{
-                "entity_name": "{entity_name if entity_name else 'descriptive name for this equation'}",
-                "entity_type": "equation",
-                "summary": "concise summary of the equation's purpose and significance (max 100 words)"
-            }}
-        }}
-
-        Equation Information:
-        Equation: {equation_text}
-        Format: {equation_format}
-
-        Focus on providing mathematical insights and explaining the equation's significance."""
-
-        response = await self.modal_caption_func(
-            equation_prompt,
-            system_prompt="You are an expert mathematician. Provide detailed mathematical analysis.",
-        )
-
-        # Parse response
-        enhanced_caption, entity_info = self._parse_equation_response(
-            response, entity_name
-        )
-
-        # Build complete equation content
-        modal_chunk = f"""Mathematical Equation Analysis:
-        Equation: {equation_text}
-        Format: {equation_format}
-
-        Mathematical Analysis: {enhanced_caption}"""
-
-        return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
-
-    def _parse_equation_response(
-        self, response: str, entity_name: str = None
-    ) -> Tuple[str, Dict[str, Any]]:
-        """Parse equation analysis response"""
-        try:
-            response_data = json.loads(
-                re.search(r"\{.*\}", response, re.DOTALL).group(0)
-            )
-
-            description = response_data.get("detailed_description", "")
-            entity_data = response_data.get("entity_info", {})
-
-            if not description or not entity_data:
-                raise ValueError("Missing required fields in response")
-
-            if not all(
-                key in entity_data for key in ["entity_name", "entity_type", "summary"]
-            ):
-                raise ValueError("Missing required fields in entity_info")
-
-            entity_data["entity_name"] = (
-                entity_data["entity_name"] + f" ({entity_data['entity_type']})"
-            )
-            if entity_name:
-                entity_data["entity_name"] = entity_name
-
-            return description, entity_data
-
-        except (json.JSONDecodeError, AttributeError, ValueError) as e:
-            logger.error(f"Error parsing equation analysis response: {e}")
-            fallback_entity = {
-                "entity_name": entity_name
-                if entity_name
-                else f"equation_{compute_mdhash_id(response)}",
-                "entity_type": "equation",
-                "summary": response[:100] + "..." if len(response) > 100 else response,
-            }
-            return response, fallback_entity
-
-
-class GenericModalProcessor(BaseModalProcessor):
-    """Generic processor for other types of modal content"""
-
-    async def process_multimodal_content(
-        self,
-        modal_content,
-        content_type: str,
-        file_path: str = "manual_creation",
-        entity_name: str = None,
-    ) -> Tuple[str, Dict[str, Any]]:
-        """Process generic modal content"""
-        # Build generic analysis prompt
-        generic_prompt = f"""Please analyze this {content_type} content and provide a JSON response with the following structure:
-
-        {{
-            "detailed_description": "A comprehensive analysis of the content including:
-            - Content structure and organization
-            - Key information and elements
-            - Relationships between components
-            - Context and significance
-            - Relevant details for knowledge retrieval
-            Always use specific terminology appropriate for {content_type} content.",
-            "entity_info": {{
-                "entity_name": "{entity_name if entity_name else f'descriptive name for this {content_type}'}",
-                "entity_type": "{content_type}",
-                "summary": "concise summary of the content's purpose and key points (max 100 words)"
-            }}
-        }}
-
-        Content: {str(modal_content)}
-
-        Focus on extracting meaningful information that would be useful for knowledge retrieval."""
-
-        response = await self.modal_caption_func(
-            generic_prompt,
-            system_prompt=f"You are an expert content analyst specializing in {content_type} content.",
-        )
-
-        # Parse response
-        enhanced_caption, entity_info = self._parse_generic_response(
-            response, entity_name, content_type
-        )
-
-        # Build complete content
-        modal_chunk = f"""{content_type.title()} Content Analysis:
-        Content: {str(modal_content)}
-
-        Analysis: {enhanced_caption}"""
-
-        return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
-
-    def _parse_generic_response(
-        self, response: str, entity_name: str = None, content_type: str = "content"
-    ) -> Tuple[str, Dict[str, Any]]:
-        """Parse generic analysis response"""
-        try:
-            response_data = json.loads(
-                re.search(r"\{.*\}", response, re.DOTALL).group(0)
-            )
-
-            description = response_data.get("detailed_description", "")
-            entity_data = response_data.get("entity_info", {})
-
-            if not description or not entity_data:
-                raise ValueError("Missing required fields in response")
-
-            if not all(
-                key in entity_data for key in ["entity_name", "entity_type", "summary"]
-            ):
-                raise ValueError("Missing required fields in entity_info")
-
-            entity_data["entity_name"] = (
-                entity_data["entity_name"] + f" ({entity_data['entity_type']})"
-            )
-            if entity_name:
-                entity_data["entity_name"] = entity_name
-
-            return description, entity_data
-
-        except (json.JSONDecodeError, AttributeError, ValueError) as e:
-            logger.error(f"Error parsing generic analysis response: {e}")
-            fallback_entity = {
-                "entity_name": entity_name
-                if entity_name
-                else f"{content_type}_{compute_mdhash_id(response)}",
-                "entity_type": content_type,
-                "summary": response[:100] + "..." if len(response) > 100 else response,
-            }
-            return response, fallback_entity
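All four removed processors parse the model reply the same way: take the outermost `{...}` span with a greedy, `re.DOTALL` regular expression, load it as JSON, check the required keys, and fall back to the raw text when anything fails. The sketch below isolates that pattern using only the standard library. The function name `parse_analysis_response` is hypothetical, and `hashlib.md5` is an assumed stand-in for LightRAG's `compute_mdhash_id`, not a claim about how that helper works.

```python
import hashlib
import json
import re

REQUIRED_KEYS = ("entity_name", "entity_type", "summary")


def parse_analysis_response(response, content_type="content", entity_name=None):
    """Return (description, entity_info), falling back to the raw reply on failure."""
    try:
        # Greedy match from the first "{" to the last "}", across newlines
        match = re.search(r"\{.*\}", response, re.DOTALL)
        if match is None:
            raise ValueError("no JSON object found in response")
        data = json.loads(match.group(0))

        description = data.get("detailed_description", "")
        entity = data.get("entity_info", {})
        if not description or not all(key in entity for key in REQUIRED_KEYS):
            raise ValueError("missing required fields")

        # Suffix the type, unless the caller supplied an explicit name
        entity["entity_name"] = entity_name or f"{entity['entity_name']} ({entity['entity_type']})"
        return description, entity
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError, so it lands here too.
        # Assumption: an MD5 hex digest stands in for compute_mdhash_id.
        digest = hashlib.md5(response.encode("utf-8")).hexdigest()
        summary = response[:100] + "..." if len(response) > 100 else response
        return response, {
            "entity_name": entity_name or f"{content_type}_{digest}",
            "entity_type": content_type,
            "summary": summary,
        }
```

One design note: the removed code catches `AttributeError` to cover a failed `re.search`, while this sketch checks for `None` explicitly, which keeps unrelated attribute errors from being silently swallowed.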
lightrag/raganything.py
DELETED
@@ -1,686 +0,0 @@
-"""
-Complete MinerU parsing + multimodal content insertion Pipeline
-
-This script integrates:
-1. MinerU document parsing
-2. Pure text content LightRAG insertion
-3. Specialized processing for multimodal content (using different processors)
-"""
-
-import os
-import asyncio
-import logging
-from pathlib import Path
-from typing import Dict, List, Any, Tuple, Optional, Callable
-import sys
-
-# Add project root directory to Python path
-sys.path.insert(0, str(Path(__file__).parent.parent))
-
-from lightrag import LightRAG, QueryParam
-from lightrag.utils import EmbeddingFunc, setup_logger
-
-# Import parser and multimodal processors
-from lightrag.mineru_parser import MineruParser
-
-# Import specialized processors
-from lightrag.modalprocessors import (
-    ImageModalProcessor,
-    TableModalProcessor,
-    EquationModalProcessor,
-    GenericModalProcessor,
-)
-
-
-class RAGAnything:
-    """Multimodal Document Processing Pipeline - Complete document parsing and insertion pipeline"""
-
-    def __init__(
-        self,
-        lightrag: Optional[LightRAG] = None,
-        llm_model_func: Optional[Callable] = None,
-        vision_model_func: Optional[Callable] = None,
-        embedding_func: Optional[Callable] = None,
-        working_dir: str = "./rag_storage",
-        embedding_dim: int = 3072,
-        max_token_size: int = 8192,
-    ):
-        """
-        Initialize Multimodal Document Processing Pipeline
-
-        Args:
-            lightrag: Optional pre-initialized LightRAG instance
-            llm_model_func: LLM model function for text analysis
-            vision_model_func: Vision model function for image analysis
-            embedding_func: Embedding function for text vectorization
-            working_dir: Working directory for storage (used when creating new RAG)
-            embedding_dim: Embedding dimension (used when creating new RAG)
-            max_token_size: Maximum token size for embeddings (used when creating new RAG)
-        """
-        self.working_dir = working_dir
-        self.llm_model_func = llm_model_func
-        self.vision_model_func = vision_model_func
-        self.embedding_func = embedding_func
-        self.embedding_dim = embedding_dim
-        self.max_token_size = max_token_size
-
-        # Set up logging
-        setup_logger("RAGAnything")
-        self.logger = logging.getLogger("RAGAnything")
-
-        # Create working directory if needed
-        if not os.path.exists(working_dir):
-            os.makedirs(working_dir)
-
-        # Use provided LightRAG or mark for later initialization
-        self.lightrag = lightrag
-        self.modal_processors = {}
-
-        # If LightRAG is provided, initialize processors immediately
-        if self.lightrag is not None:
-            self._initialize_processors()
-
-    def _initialize_processors(self):
-        """Initialize multimodal processors with appropriate model functions"""
-        if self.lightrag is None:
-            raise ValueError(
-                "LightRAG instance must be initialized before creating processors"
-            )
-
-        # Create different multimodal processors
-        self.modal_processors = {
-            "image": ImageModalProcessor(
-                lightrag=self.lightrag,
-                modal_caption_func=self.vision_model_func or self.llm_model_func,
-            ),
-            "table": TableModalProcessor(
-                lightrag=self.lightrag, modal_caption_func=self.llm_model_func
-            ),
-            "equation": EquationModalProcessor(
-                lightrag=self.lightrag, modal_caption_func=self.llm_model_func
-            ),
-            "generic": GenericModalProcessor(
-                lightrag=self.lightrag, modal_caption_func=self.llm_model_func
-            ),
-        }
-
-        self.logger.info("Multimodal processors initialized")
-        self.logger.info(f"Available processors: {list(self.modal_processors.keys())}")
-
-    async def _ensure_lightrag_initialized(self):
-        """Ensure LightRAG instance is initialized, create if necessary"""
-        if self.lightrag is not None:
-            return
-
-        # Validate required functions
-        if self.llm_model_func is None:
-            raise ValueError(
-                "llm_model_func must be provided when LightRAG is not pre-initialized"
-            )
-        if self.embedding_func is None:
-            raise ValueError(
-                "embedding_func must be provided when LightRAG is not pre-initialized"
-            )
-
-        from lightrag.kg.shared_storage import initialize_pipeline_status
-
-        # Create LightRAG instance with provided functions
-        self.lightrag = LightRAG(
-            working_dir=self.working_dir,
-            llm_model_func=self.llm_model_func,
-            embedding_func=EmbeddingFunc(
-                embedding_dim=self.embedding_dim,
-                max_token_size=self.max_token_size,
-                func=self.embedding_func,
-            ),
-        )
-
-        await self.lightrag.initialize_storages()
-        await initialize_pipeline_status()
-
-        # Initialize processors after LightRAG is ready
-        self._initialize_processors()
-
-        self.logger.info("LightRAG and multimodal processors initialized")
-
-    def parse_document(
-        self,
-        file_path: str,
-        output_dir: str = "./output",
-        parse_method: str = "auto",
-        display_stats: bool = True,
-    ) -> Tuple[List[Dict[str, Any]], str]:
-        """
-        Parse document using MinerU
-
-        Args:
-            file_path: Path to the file to parse
-            output_dir: Output directory
-            parse_method: Parse method ("auto", "ocr", "txt")
-            display_stats: Whether to display content statistics
-
-        Returns:
-            (content_list, md_content): Content list and markdown text
-        """
-        self.logger.info(f"Starting document parsing: {file_path}")
-
-        file_path = Path(file_path)
-        if not file_path.exists():
-            raise FileNotFoundError(f"File not found: {file_path}")
-
-        # Choose appropriate parsing method based on file extension
-        ext = file_path.suffix.lower()
-
-        try:
-            if ext in [".pdf"]:
-                self.logger.info(
-                    f"Detected PDF file, using PDF parser (OCR={parse_method == 'ocr'})..."
-                )
-                content_list, md_content = MineruParser.parse_pdf(
-                    file_path, output_dir, use_ocr=(parse_method == "ocr")
-                )
-            elif ext in [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]:
-                self.logger.info("Detected image file, using image parser...")
-                content_list, md_content = MineruParser.parse_image(
-                    file_path, output_dir
-                )
-            elif ext in [".doc", ".docx", ".ppt", ".pptx"]:
-                self.logger.info("Detected Office document, using Office parser...")
-                content_list, md_content = MineruParser.parse_office_doc(
-                    file_path, output_dir
-                )
-            else:
-                # For other or unknown formats, use generic parser
-                self.logger.info(
-                    f"Using generic parser for {ext} file (method={parse_method})..."
-                )
-                content_list, md_content = MineruParser.parse_document(
-                    file_path, parse_method=parse_method, output_dir=output_dir
-                )
-
-        except Exception as e:
-            self.logger.error(f"Error during parsing with specific parser: {str(e)}")
-            self.logger.warning("Falling back to generic parser...")
-            # If specific parser fails, fall back to generic parser
-            content_list, md_content = MineruParser.parse_document(
-                file_path, parse_method=parse_method, output_dir=output_dir
-            )
-
-        self.logger.info(
-            f"Parsing complete! Extracted {len(content_list)} content blocks"
-        )
-        self.logger.info(f"Markdown text length: {len(md_content)} characters")
-
-        # Display content statistics if requested
-        if display_stats:
-            self.logger.info("\nContent Information:")
-            self.logger.info(f"* Total blocks in content_list: {len(content_list)}")
-            self.logger.info(f"* Markdown content length: {len(md_content)} characters")
-
-            # Count elements by type
-            block_types: Dict[str, int] = {}
-            for block in content_list:
-                if isinstance(block, dict):
-                    block_type = block.get("type", "unknown")
-                    if isinstance(block_type, str):
-                        block_types[block_type] = block_types.get(block_type, 0) + 1
-
-            self.logger.info("* Content block types:")
-            for block_type, count in block_types.items():
-                self.logger.info(f" - {block_type}: {count}")
-
-        return content_list, md_content
-
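Reviewer note: the extension-based routing in the removed `parse_document` can be read in isolation. The sketch below reproduces only the routing decision, using the extension groups from the deleted code. The helper name `pick_parser` and the string labels it returns are hypothetical and do not exist in LightRAG or RAG-Anything; the real method called `MineruParser` and fell back to the generic parser on any exception.

```python
from pathlib import Path

# Extension groups copied from the removed parse_document method.
PDF_EXTS = {".pdf"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"}
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx"}


def pick_parser(file_path: str) -> str:
    """Return a label for the parser family (hypothetical helper)."""
    # Lower-case the suffix so "REPORT.PDF" and "report.pdf" route the same way.
    ext = Path(file_path).suffix.lower()
    if ext in PDF_EXTS:
        return "pdf"
    if ext in IMAGE_EXTS:
        return "image"
    if ext in OFFICE_EXTS:
        return "office"
    # Unknown or missing extensions go to the generic parser.
    return "generic"


if __name__ == "__main__":
    for name in ["docs/REPORT.PDF", "scan.tif", "slides.pptx", "notes.md", "README"]:
        print(name, "->", pick_parser(name))
```

Keeping the groups as module-level sets makes the supported formats easy to audit against the list used in `process_folder_complete`.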
-    def _separate_content(
-        self, content_list: List[Dict[str, Any]]
-    ) -> Tuple[str, List[Dict[str, Any]]]:
-        """
-        Separate text content and multimodal content
-
-        Args:
-            content_list: Content list from MinerU parsing
-
-        Returns:
-            (text_content, multimodal_items): Pure text content and multimodal items list
-        """
-        text_parts = []
-        multimodal_items = []
-
-        for item in content_list:
-            content_type = item.get("type", "text")
-
-            if content_type == "text":
-                # Text content
-                text = item.get("text", "")
-                if text.strip():
-                    text_parts.append(text)
-            else:
-                # Multimodal content (image, table, equation, etc.)
-                multimodal_items.append(item)
-
-        # Merge all text content
-        text_content = "\n\n".join(text_parts)
-
-        self.logger.info("Content separation complete:")
-        self.logger.info(f" - Text content length: {len(text_content)} characters")
-        self.logger.info(f" - Multimodal items count: {len(multimodal_items)}")
-
-        # Count multimodal types
-        modal_types = {}
-        for item in multimodal_items:
-            modal_type = item.get("type", "unknown")
-            modal_types[modal_type] = modal_types.get(modal_type, 0) + 1
-
-        if modal_types:
-            self.logger.info(f" - Multimodal type distribution: {modal_types}")
-
-        return text_content, multimodal_items
-
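Reviewer note: the text/multimodal split in the removed `_separate_content` is easy to check on a small input. The sketch below is a standalone version without logging. The sample blocks are made up for illustration, and the keys `img_path` and `table_body` are assumptions about what a parser might emit, not a documented schema.

```python
from typing import Any, Dict, List, Tuple


def separate_content(
    content_list: List[Dict[str, Any]],
) -> Tuple[str, List[Dict[str, Any]]]:
    """Split parsed blocks into joined plain text and non-text items."""
    text_parts: List[str] = []
    multimodal_items: List[Dict[str, Any]] = []
    for item in content_list:
        # Blocks without a "type" key are treated as text, as in the removed code.
        if item.get("type", "text") == "text":
            text = item.get("text", "")
            if text.strip():  # whitespace-only text blocks are dropped
                text_parts.append(text)
        else:
            multimodal_items.append(item)
    return "\n\n".join(text_parts), multimodal_items


# Illustrative input only; the field names besides "type" and "text" are assumed.
sample_blocks = [
    {"type": "text", "text": "Intro"},
    {"type": "image", "img_path": "fig1.png"},
    {"text": "   "},
    {"type": "table", "table_body": "a|b"},
    {"type": "text", "text": "Outro"},
]

text_content, items = separate_content(sample_blocks)
print(repr(text_content))
print([item["type"] for item in items])
```

Note that joining with blank lines discards the position of each image or table relative to the surrounding text; the multimodal items are processed separately afterwards.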
-    async def _insert_text_content(
-        self,
-        input: str | list[str],
-        split_by_character: str | None = None,
-        split_by_character_only: bool = False,
-        ids: str | list[str] | None = None,
-        file_paths: str | list[str] | None = None,
-    ):
-        """
-        Insert pure text content into LightRAG
-
-        Args:
-            input: Single document string or list of document strings
-            split_by_character: if split_by_character is not None, split the string by character, if chunk longer than
-            chunk_token_size, it will be split again by token size.
-            split_by_character_only: if split_by_character_only is True, split the string by character only, when
-            split_by_character is None, this parameter is ignored.
-            ids: single string of the document ID or list of unique document IDs, if not provided, MD5 hash IDs will be generated
-            file_paths: single string of the file path or list of file paths, used for citation
-        """
-        self.logger.info("Starting text content insertion into LightRAG...")
-
-        # Use LightRAG's insert method with all parameters
-        await self.lightrag.ainsert(
-            input=input,
-            file_paths=file_paths,
-            split_by_character=split_by_character,
-            split_by_character_only=split_by_character_only,
-            ids=ids,
-        )
-
-        self.logger.info("Text content insertion complete")
-
-    async def _process_multimodal_content(
-        self, multimodal_items: List[Dict[str, Any]], file_path: str
-    ):
-        """
-        Process multimodal content (using specialized processors)
-
-        Args:
-            multimodal_items: List of multimodal items
-            file_path: File path (for reference)
-        """
-        if not multimodal_items:
-            self.logger.debug("No multimodal content to process")
-            return
-
-        self.logger.info("Starting multimodal content processing...")
-
-        file_name = os.path.basename(file_path)
-
-        for i, item in enumerate(multimodal_items):
-            try:
-                content_type = item.get("type", "unknown")
-                self.logger.info(
-                    f"Processing item {i+1}/{len(multimodal_items)}: {content_type} content"
-                )
-
-                # Select appropriate processor
-                processor = self._get_processor_for_type(content_type)
-
-                if processor:
-                    (
-                        enhanced_caption,
-                        entity_info,
-                    ) = await processor.process_multimodal_content(
-                        modal_content=item,
-                        content_type=content_type,
-                        file_path=file_name,
-                    )
-                    self.logger.info(
-                        f"{content_type} processing complete: {entity_info.get('entity_name', 'Unknown')}"
-                    )
-                else:
-                    self.logger.warning(
-                        f"No suitable processor found for {content_type} type content"
-                    )
-
-            except Exception as e:
-                self.logger.error(f"Error processing multimodal content: {str(e)}")
-                self.logger.debug("Exception details:", exc_info=True)
-                continue
-
-        self.logger.info("Multimodal content processing complete")
-
-    def _get_processor_for_type(self, content_type: str):
-        """
-        Get appropriate processor based on content type
-
-        Args:
-            content_type: Content type
-
-        Returns:
-            Corresponding processor instance
-        """
-        # Direct mapping to corresponding processor
-        if content_type == "image":
-            return self.modal_processors.get("image")
-        elif content_type == "table":
-            return self.modal_processors.get("table")
-        elif content_type == "equation":
-            return self.modal_processors.get("equation")
-        else:
-            # For other types, use generic processor
-            return self.modal_processors.get("generic")
-
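Reviewer note: the `if`/`elif` chain in the removed `_get_processor_for_type` amounts to a dictionary lookup with a generic fallback. The sketch below shows that equivalent form. It assumes the registry holds exactly the four keys created in `_initialize_processors`; the string values stand in for processor instances and are placeholders only.

```python
from typing import Any, Dict, Optional


def get_processor_for_type(
    processors: Dict[str, Any], content_type: str
) -> Optional[Any]:
    """Return the processor for content_type, else the generic one, else None."""
    return processors.get(content_type, processors.get("generic"))


# Placeholder registry; real values would be processor objects.
registry = {"image": "IMG", "table": "TBL", "equation": "EQ", "generic": "GEN"}

print(get_processor_for_type(registry, "image"))
print(get_processor_for_type(registry, "chart"))
print(get_processor_for_type({}, "image"))
```

As in the removed method, an empty registry yields `None`, which the caller in `_process_multimodal_content` handled by logging a warning and skipping the item.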
-    async def process_document_complete(
-        self,
-        file_path: str,
-        output_dir: str = "./output",
-        parse_method: str = "auto",
-        display_stats: bool = True,
-        split_by_character: str | None = None,
-        split_by_character_only: bool = False,
-        doc_id: str | None = None,
-    ):
-        """
-        Complete document processing workflow
-
-        Args:
-            file_path: Path to the file to process
-            output_dir: MinerU output directory
-            parse_method: Parse method
-            display_stats: Whether to display content statistics
-            split_by_character: Optional character to split the text by
-            split_by_character_only: If True, split only by the specified character
-            doc_id: Optional document ID, if not provided MD5 hash will be generated
-        """
-        # Ensure LightRAG is initialized
-        await self._ensure_lightrag_initialized()
-
-        self.logger.info(f"Starting complete document processing: {file_path}")
-
-        # Step 1: Parse document using MinerU
-        content_list, md_content = self.parse_document(
-            file_path, output_dir, parse_method, display_stats
-        )
-
-        # Step 2: Separate text and multimodal content
-        text_content, multimodal_items = self._separate_content(content_list)
-
-        # Step 3: Insert pure text content with all parameters
-        if text_content.strip():
-            file_name = os.path.basename(file_path)
-            await self._insert_text_content(
-                text_content,
-                file_paths=file_name,
-                split_by_character=split_by_character,
-                split_by_character_only=split_by_character_only,
-                ids=doc_id,
-            )
-
-        # Step 4: Process multimodal content (using specialized processors)
-        if multimodal_items:
-            await self._process_multimodal_content(multimodal_items, file_path)
-
-        self.logger.info(f"Document {file_path} processing complete!")
-
-    async def process_folder_complete(
-        self,
-        folder_path: str,
-        output_dir: str = "./output",
-        parse_method: str = "auto",
-        display_stats: bool = False,
-        split_by_character: str | None = None,
-        split_by_character_only: bool = False,
-        file_extensions: Optional[List[str]] = None,
-        recursive: bool = True,
-        max_workers: int = 1,
-    ):
-        """
-        Process all files in a folder in batch
-
-        Args:
-            folder_path: Path to the folder to process
-            output_dir: MinerU output directory
-            parse_method: Parse method
-            display_stats: Whether to display content statistics for each file (recommended False for batch processing)
-            split_by_character: Optional character to split text by
-            split_by_character_only: If True, split only by the specified character
-            file_extensions: List of file extensions to process, e.g. [".pdf", ".docx"]. If None, process all supported formats
-            recursive: Whether to recursively process subfolders
-            max_workers: Maximum number of concurrent workers
-        """
-        # Ensure LightRAG is initialized
-        await self._ensure_lightrag_initialized()
-
-        folder_path = Path(folder_path)
-        if not folder_path.exists() or not folder_path.is_dir():
-            raise ValueError(
-                f"Folder does not exist or is not a valid directory: {folder_path}"
-            )
-
-        # Supported file formats
-        supported_extensions = {
-            ".pdf",
-            ".jpg",
-            ".jpeg",
-            ".png",
-            ".bmp",
-            ".tiff",
-            ".tif",
-            ".doc",
-            ".docx",
-            ".ppt",
-            ".pptx",
-            ".txt",
-            ".md",
-        }
-
-        # Use specified extensions or all supported formats
-        if file_extensions:
-            target_extensions = set(ext.lower() for ext in file_extensions)
-            # Validate if all are supported formats
-            unsupported = target_extensions - supported_extensions
-            if unsupported:
-                self.logger.warning(
-                    f"The following file formats may not be fully supported: {unsupported}"
-                )
-        else:
-            target_extensions = supported_extensions
-
-        # Collect all files to process
-        files_to_process = []
-
-        if recursive:
-            # Recursively traverse all subfolders
-            for file_path in folder_path.rglob("*"):
-                if (
-                    file_path.is_file()
-                    and file_path.suffix.lower() in target_extensions
-                ):
-                    files_to_process.append(file_path)
-        else:
-            # Process only current folder
-            for file_path in folder_path.glob("*"):
-                if (
-                    file_path.is_file()
-                    and file_path.suffix.lower() in target_extensions
-                ):
-                    files_to_process.append(file_path)
-
-        if not files_to_process:
-            self.logger.info(f"No files to process found in {folder_path}")
-            return
-
-        self.logger.info(f"Found {len(files_to_process)} files to process")
-        self.logger.info("File type distribution:")
-
-        # Count file types
-        file_type_count = {}
-        for file_path in files_to_process:
-            ext = file_path.suffix.lower()
-            file_type_count[ext] = file_type_count.get(ext, 0) + 1
-
-        for ext, count in sorted(file_type_count.items()):
-            self.logger.info(f" {ext}: {count} files")
-
-        # Create progress tracking
-        processed_count = 0
-        failed_files = []
-
-        # Use semaphore to control concurrency
-        semaphore = asyncio.Semaphore(max_workers)
-
-        async def process_single_file(file_path: Path, index: int) -> None:
-            """Process a single file"""
-            async with semaphore:
-                nonlocal processed_count
-                try:
-                    self.logger.info(
-                        f"[{index}/{len(files_to_process)}] Processing: {file_path}"
-                    )
-
-                    # Create separate output directory for each file
-                    file_output_dir = Path(output_dir) / file_path.stem
-                    file_output_dir.mkdir(parents=True, exist_ok=True)
-
-                    # Process file
-                    await self.process_document_complete(
-                        file_path=str(file_path),
-                        output_dir=str(file_output_dir),
-                        parse_method=parse_method,
-                        display_stats=display_stats,
-                        split_by_character=split_by_character,
-                        split_by_character_only=split_by_character_only,
-                    )
-
-                    processed_count += 1
-                    self.logger.info(
-                        f"[{index}/{len(files_to_process)}] Successfully processed: {file_path}"
-                    )
-
-                except Exception as e:
-                    self.logger.error(
-                        f"[{index}/{len(files_to_process)}] Failed to process: {file_path}"
-                    )
-                    self.logger.error(f"Error: {str(e)}")
-                    failed_files.append((file_path, str(e)))
-
-        # Create all processing tasks
-        tasks = []
-        for index, file_path in enumerate(files_to_process, 1):
-            task = process_single_file(file_path, index)
-            tasks.append(task)
-
-        # Wait for all tasks to complete
-        await asyncio.gather(*tasks, return_exceptions=True)
-
-        # Output processing statistics
-        self.logger.info("\n===== Batch Processing Complete =====")
-        self.logger.info(f"Total files: {len(files_to_process)}")
-        self.logger.info(f"Successfully processed: {processed_count}")
-        self.logger.info(f"Failed: {len(failed_files)}")
-
-        if failed_files:
-            self.logger.info("\nFailed files:")
-            for file_path, error in failed_files:
-                self.logger.info(f" - {file_path}: {error}")
-
-        return {
-            "total": len(files_to_process),
-            "success": processed_count,
-            "failed": len(failed_files),
-            "failed_files": failed_files,
-        }
-
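Reviewer note: the batch loop in the removed `process_folder_complete` limits concurrency with an `asyncio.Semaphore` and records failures per file instead of aborting the batch. The sketch below isolates that pattern. `run_bounded` and `demo_worker` are hypothetical names; `demo_worker` stands in for `process_document_complete` and fails on purpose for one input.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple


async def run_bounded(
    items: List[str],
    worker: Callable[[str], Awaitable[None]],
    max_workers: int = 1,
) -> Dict[str, Any]:
    """Run worker over items with at most max_workers running at once."""
    semaphore = asyncio.Semaphore(max_workers)
    succeeded = 0
    failed: List[Tuple[str, str]] = []

    async def guarded(item: str) -> None:
        nonlocal succeeded
        async with semaphore:  # blocks while max_workers tasks hold the semaphore
            try:
                await worker(item)
                succeeded += 1
            except Exception as exc:
                # Record the failure and let the other items continue.
                failed.append((item, str(exc)))

    await asyncio.gather(*(guarded(item) for item in items))
    return {
        "total": len(items),
        "success": succeeded,
        "failed": len(failed),
        "failed_files": failed,
    }


async def demo_worker(name: str) -> None:
    """Stand-in for per-file processing; raises for the item named 'bad'."""
    await asyncio.sleep(0)
    if name == "bad":
        raise ValueError(f"cannot parse {name}")


result = asyncio.run(run_bounded(["a", "bad", "c"], demo_worker, max_workers=2))
print(result)
```

Because each task catches its own exceptions, `asyncio.gather` never sees one here; the `return_exceptions=True` flag in the removed code was an extra safeguard rather than the main error path.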
-    async def query_with_multimodal(self, query: str, mode: str = "hybrid") -> str:
-        """
-        Query with multimodal content support
-
-        Args:
-            query: Query content
-            mode: Query mode
-
-        Returns:
-            Query result
-        """
-        if self.lightrag is None:
-            raise ValueError(
-                "No LightRAG instance available. "
-                "Please either:\n"
-                "1. Provide a pre-initialized LightRAG instance when creating RAGAnything, or\n"
-                "2. Process documents first using process_document_complete() or process_folder_complete() "
-                "to create and populate the LightRAG instance."
-            )
-
-        result = await self.lightrag.aquery(query, param=QueryParam(mode=mode))
-
-        return result
-
-    def get_processor_info(self) -> Dict[str, Any]:
-        """Get processor information"""
-        if not self.modal_processors:
-            return {"status": "Not initialized"}
-
-        info = {
-            "status": "Initialized",
-            "processors": {},
-            "models": {
-                "llm_model": "External function"
-                if self.llm_model_func
-                else "Not provided",
-                "vision_model": "External function"
-                if self.vision_model_func
-                else "Not provided",
-                "embedding_model": "External function"
-                if self.embedding_func
-                else "Not provided",
-            },
-        }
-
-        for proc_type, processor in self.modal_processors.items():
-            info["processors"][proc_type] = {
-                "class": processor.__class__.__name__,
-                "supports": self._get_processor_supports(proc_type),
-            }
-
-        return info
-
-    def _get_processor_supports(self, proc_type: str) -> List[str]:
-        """Get processor supported features"""
-        supports_map = {
-            "image": [
-                "Image content analysis",
-                "Visual understanding",
-                "Image description generation",
-                "Image entity extraction",
-            ],
-            "table": [
-                "Table structure analysis",
-                "Data statistics",
-                "Trend identification",
-                "Table entity extraction",
-            ],
-            "equation": [
-                "Mathematical formula parsing",
-                "Variable identification",
-                "Formula meaning explanation",
-                "Formula entity extraction",
-            ],
-            "generic": [
-                "General content analysis",
-                "Structured processing",
-                "Entity extraction",
-            ],
-        }
-        return supports_map.get(proc_type, ["Basic processing"])