Commit 8352b84 by zrguo
Parent(s): 050a1c0
MinerU integration
- README-zh.md +27 -0
- README.md +26 -1
- docs/mineru_integration_en.md +246 -0
- docs/mineru_integration_zh.md +245 -0
- examples/mineru_example.py +82 -0
- examples/raganything_example.py +129 -0
- lightrag/mineru_parser.py +454 -0
- lightrag/modalprocessors.py +708 -0
- lightrag/raganything.py +632 -0
README-zh.md
CHANGED
@@ -4,6 +4,7 @@
|
|
4 |
|
5 |
## 🎉 新闻
|
6 |
|
|
|
7 |
- [X] [2025.03.18]🎯📢LightRAG现已支持引文功能。
|
8 |
- [X] [2025.02.05]🎯📢我们团队发布了[VideoRAG](https://github.com/HKUDS/VideoRAG),用于理解超长上下文视频。
|
9 |
- [X] [2025.01.13]🎯📢我们团队发布了[MiniRAG](https://github.com/HKUDS/MiniRAG),使用小型模型简化RAG。
|
@@ -1002,6 +1003,32 @@ rag.merge_entities(
|
|
1002 |
|
1003 |
</details>
|
1004 |
|
1005 |
## Token统计功能
|
1006 |
|
1007 |
<details>
|
|
|
4 |
|
5 |
## 🎉 新闻
|
6 |
|
7 |
+
- [X] [2025.06.05]🎯📢LightRAG现已集成MinerU,支持多模态文档解析与RAG(PDF、图片、Office、表格、公式等)。详见下方多模态处理模块。
|
8 |
- [X] [2025.03.18]🎯📢LightRAG现已支持引文功能。
|
9 |
- [X] [2025.02.05]🎯📢我们团队发布了[VideoRAG](https://github.com/HKUDS/VideoRAG),用于理解超长上下文视频。
|
10 |
- [X] [2025.01.13]🎯📢我们团队发布了[MiniRAG](https://github.com/HKUDS/MiniRAG),使用小型模型简化RAG。
|
|
|
1003 |
|
1004 |
</details>
|
1005 |
|
1006 |
+
## 多模态文档处理(MinerU集成)
|
1007 |
+
|
1008 |
+
LightRAG 现已支持通过 [MinerU](https://github.com/opendatalab/MinerU) 实现多模态文档解析与检索增强生成(RAG)。您可以从 PDF、图片、Office 文档中提取结构化内容(文本、图片、表格、公式等),并在 RAG 流程中使用。
|
1009 |
+
|
1010 |
+
**主要特性:**
|
1011 |
+
- 支持解析 PDF、图片、DOC/DOCX/PPT/PPTX 等多种格式
|
1012 |
+
- 提取并索引文本、图片、表格、公式及文档结构
|
1013 |
+
- 在 RAG 中查询和检索多模态内容(文本、图片、表格、公式)
|
1014 |
+
- 与 LightRAG Core 及 RAGAnything 无缝集成
|
1015 |
+
|
1016 |
+
**快速开始:**
|
1017 |
+
1. 安装依赖:
|
1018 |
+
```bash
|
1019 |
+
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
|
1020 |
+
```
|
1021 |
+
2. 下载 MinerU 模型权重(详见 [MinerU 集成指南](docs/mineru_integration_zh.md))
|
1022 |
+
3. 使用新版 `MineruParser` 或 RAGAnything 的 `process_document_complete` 处理文件:
|
1023 |
+
```python
|
1024 |
+
from lightrag.mineru_parser import MineruParser
|
1025 |
+
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
|
1026 |
+
# 或自动识别类型:
|
1027 |
+
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
|
1028 |
+
```
|
1029 |
+
4. 使用 LightRAG 查询多模态内容请参见 [docs/mineru_integration_zh.md](docs/mineru_integration_zh.md)。
|
1030 |
+
|
1031 |
+
|
1032 |
## Token统计功能
|
1033 |
|
1034 |
<details>
|
README.md
CHANGED
@@ -39,7 +39,7 @@
|
|
39 |
</div>
|
40 |
|
41 |
## 🎉 News
|
42 |
-
|
43 |
- [X] [2025.03.18]🎯📢LightRAG now supports citation functionality, enabling proper source attribution.
|
44 |
- [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) understanding extremely long-context videos.
|
45 |
- [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG) making RAG simpler with small models.
|
@@ -1051,6 +1051,31 @@ When merging entities:
|
|
1051 |
|
1052 |
</details>
|
1053 |
|
1054 |
## Token Usage Tracking
|
1055 |
|
1056 |
<details>
|
|
|
39 |
</div>
|
40 |
|
41 |
## 🎉 News
|
42 |
+
- [X] [2025.06.05]🎯📢LightRAG now supports multimodal document parsing and RAG with MinerU integration (PDF, images, Office, tables, formulas, etc.). See the new multimodal section below.
|
43 |
- [X] [2025.03.18]🎯📢LightRAG now supports citation functionality, enabling proper source attribution.
|
44 |
- [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) understanding extremely long-context videos.
|
45 |
- [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG) making RAG simpler with small models.
|
|
|
1051 |
|
1052 |
</details>
|
1053 |
|
1054 |
+
## Multimodal Document Processing (MinerU Integration)
|
1055 |
+
|
1056 |
+
LightRAG now supports multimodal document parsing and retrieval-augmented generation (RAG) via [MinerU](https://github.com/opendatalab/MinerU). You can extract structured content (text, images, tables, formulas, etc.) from PDFs, images, and Office documents and use it in your RAG pipeline.
|
1057 |
+
|
1058 |
+
**Key Features:**
|
1059 |
+
- Parse PDFs, images, DOC/DOCX/PPT/PPTX, and more
|
1060 |
+
- Extract and index text, images, tables, formulas, and document structure
|
1061 |
+
- Query and retrieve multimodal content (text, image, table, formula) in RAG
|
1062 |
+
- Seamless integration with LightRAG core and RAGAnything
|
1063 |
+
|
1064 |
+
**Quick Start:**
|
1065 |
+
1. Install dependencies:
|
1066 |
+
```bash
|
1067 |
+
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
|
1068 |
+
```
|
1069 |
+
2. Download MinerU model weights (see [MinerU Integration Guide](docs/mineru_integration_en.md))
|
1070 |
+
3. Use the new `MineruParser` or RAGAnything's `process_document_complete` to process files:
|
1071 |
+
```python
|
1072 |
+
from lightrag.mineru_parser import MineruParser
|
1073 |
+
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
|
1074 |
+
# or for any file type:
|
1075 |
+
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
|
1076 |
+
```
|
1077 |
+
4. To query multimodal content with LightRAG, see [docs/mineru_integration_en.md](docs/mineru_integration_en.md).
|
1078 |
+
|
1079 |
## Token Usage Tracking
|
1080 |
|
1081 |
<details>
|
docs/mineru_integration_en.md
ADDED
@@ -0,0 +1,246 @@
|
1 |
+
# MinerU Integration Guide
|
2 |
+
|
3 |
+
### About MinerU
|
4 |
+
|
5 |
+
MinerU is an open-source tool for extracting high-quality structured data from PDFs, images, and Office documents. It provides the following features:
|
6 |
+
|
7 |
+
- Text extraction while preserving document structure (headings, paragraphs, lists, etc.)
|
8 |
+
- Handling complex layouts including multi-column formats
|
9 |
+
- Automatic formula recognition and conversion to LaTeX format
|
10 |
+
- Image, table, and footnote extraction
|
11 |
+
- Automatic scanned document detection and OCR application
|
12 |
+
- Support for multiple output formats (Markdown, JSON)
|
13 |
+
|
14 |
+
### Installation
|
15 |
+
|
16 |
+
#### Installing MinerU Dependencies
|
17 |
+
|
18 |
+
If you have already installed LightRAG but don't have MinerU support, you can add MinerU support by installing the magic-pdf package directly:
|
19 |
+
|
20 |
+
```bash
|
21 |
+
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
|
22 |
+
```
|
23 |
+
|
24 |
+
These are the MinerU-related dependencies required by LightRAG.
|
25 |
+
|
26 |
+
#### MinerU Model Weights
|
27 |
+
|
28 |
+
MinerU needs model weight files to run. After installing the package, download the required weights from either Hugging Face or ModelScope.
|
29 |
+
|
30 |
+
##### Option 1: Download from Hugging Face
|
31 |
+
|
32 |
+
```bash
|
33 |
+
pip install huggingface_hub
|
34 |
+
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
|
35 |
+
python download_models_hf.py
|
36 |
+
```
|
37 |
+
|
38 |
+
##### Option 2: Download from ModelScope (Recommended for users in China)
|
39 |
+
|
40 |
+
```bash
|
41 |
+
pip install modelscope
|
42 |
+
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models.py -O download_models.py
|
43 |
+
python download_models.py
|
44 |
+
```
|
45 |
+
|
46 |
+
Both methods will automatically download the model files and configure the model directory in the configuration file. The configuration file is located in your user directory and named `magic-pdf.json`.
|
47 |
+
|
48 |
+
> **Note for Windows users**: User directory is at `C:\Users\username`
|
49 |
+
> **Note for Linux users**: User directory is at `/home/username`
|
50 |
+
> **Note for macOS users**: User directory is at `/Users/username`
|
51 |
+
|
52 |
+
#### Optional: LibreOffice Installation
|
53 |
+
|
54 |
+
To process Office documents (DOC, DOCX, PPT, PPTX), you need to install LibreOffice:
|
55 |
+
|
56 |
+
**Linux/macOS:**
|
57 |
+
```bash
|
58 |
+
sudo apt-get install libreoffice  # Debian/Ubuntu; use yum on RHEL/CentOS, or `brew install --cask libreoffice` on macOS
|
59 |
+
```
|
60 |
+
|
61 |
+
**Windows:**
|
62 |
+
1. Install LibreOffice
|
63 |
+
2. Add the installation directory to your PATH: `install_dir\LibreOffice\program`
|
64 |
+
|
65 |
+
### Using MinerU Parser
|
66 |
+
|
67 |
+
#### Basic Usage
|
68 |
+
|
69 |
+
```python
|
70 |
+
from lightrag.mineru_parser import MineruParser
|
71 |
+
|
72 |
+
# Parse a PDF document
|
73 |
+
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
|
74 |
+
|
75 |
+
# Parse an image
|
76 |
+
content_list, md_content = MineruParser.parse_image('path/to/image.jpg', 'output_dir')
|
77 |
+
|
78 |
+
# Parse an Office document
|
79 |
+
content_list, md_content = MineruParser.parse_office_doc('path/to/document.docx', 'output_dir')
|
80 |
+
|
81 |
+
# Auto-detect and parse any supported document type
|
82 |
+
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
|
83 |
+
```
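
The choice between these entry points can be sketched as a lookup on the file extension. This is only an illustration: the extension groups below are assumptions, not the list `MineruParser` actually supports, and `pick_parser_method` is a hypothetical helper. The method names it returns are the ones shown above.

```python
from pathlib import Path

# Assumed extension groups for illustration; the real parser may accept more or fewer.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx"}


def pick_parser_method(file_path: str) -> str:
    """Return the name of the MineruParser method that matches the file extension."""
    ext = Path(file_path).suffix.lower()
    if ext == ".pdf":
        return "parse_pdf"
    if ext in IMAGE_EXTS:
        return "parse_image"
    if ext in OFFICE_EXTS:
        return "parse_office_doc"
    # Fall back to the auto-detecting entry point for anything else.
    return "parse_document"
```

Note that `parse_document` takes the parse method as its second argument, so the returned name cannot be called with the same arguments in every case.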
|
84 |
+
|
85 |
+
#### RAGAnything Integration
|
86 |
+
|
87 |
+
In RAGAnything, you can directly use file paths as input to the `process_document_complete` method to process documents. Here's a complete configuration example:
|
88 |
+
|
89 |
+
```python
|
90 |
+
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
|
91 |
+
from lightrag.raganything import RAGAnything
|
92 |
+
|
93 |
+
|
94 |
+
# Initialize RAGAnything
|
95 |
+
rag = RAGAnything(
|
96 |
+
working_dir="./rag_storage", # Working directory
|
97 |
+
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
|
98 |
+
"gpt-4o-mini", # Model to use
|
99 |
+
prompt,
|
100 |
+
system_prompt=system_prompt,
|
101 |
+
history_messages=history_messages,
|
102 |
+
api_key="your-api-key", # Replace with your API key
|
103 |
+
base_url="your-base-url", # Replace with your API base URL
|
104 |
+
**kwargs,
|
105 |
+
),
|
106 |
+
vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
|
107 |
+
"gpt-4o", # Vision model
|
108 |
+
"",
|
109 |
+
system_prompt=None,
|
110 |
+
history_messages=[],
|
111 |
+
messages=[
|
112 |
+
*([{"role": "system", "content": system_prompt}] if system_prompt else []),
|
113 |
+
{"role": "user", "content": [
|
114 |
+
{"type": "text", "text": prompt},
|
115 |
+
{
|
116 |
+
"type": "image_url",
|
117 |
+
"image_url": {
|
118 |
+
"url": f"data:image/jpeg;base64,{image_data}"
|
119 |
+
}
|
120 |
+
}
|
121 |
+
]} if image_data else {"role": "user", "content": prompt}
|
122 |
+
],
|
123 |
+
api_key="your-api-key", # Replace with your API key
|
124 |
+
base_url="your-base-url", # Replace with your API base URL
|
125 |
+
**kwargs,
|
126 |
+
) if image_data else openai_complete_if_cache(
|
127 |
+
"gpt-4o-mini",
|
128 |
+
prompt,
|
129 |
+
system_prompt=system_prompt,
|
130 |
+
history_messages=history_messages,
|
131 |
+
api_key="your-api-key", # Replace with your API key
|
132 |
+
base_url="your-base-url", # Replace with your API base URL
|
133 |
+
**kwargs,
|
134 |
+
),
|
135 |
+
embedding_func=lambda texts: openai_embed(
|
136 |
+
texts,
|
137 |
+
model="text-embedding-3-large",
|
138 |
+
api_key="your-api-key", # Replace with your API key
|
139 |
+
base_url="your-base-url", # Replace with your API base URL
|
140 |
+
),
|
141 |
+
embedding_dim=3072,
|
142 |
+
max_token_size=8192
|
143 |
+
)
|
144 |
+
|
145 |
+
# Process a single file
|
146 |
+
await rag.process_document_complete(
|
147 |
+
file_path="path/to/document.pdf",
|
148 |
+
output_dir="./output",
|
149 |
+
parse_method="auto"
|
150 |
+
)
|
151 |
+
|
152 |
+
# Query the processed document
|
153 |
+
result = await rag.query_with_multimodal(
|
154 |
+
"What is the main content of the document?",
|
155 |
+
mode="hybrid"
|
156 |
+
)
|
157 |
+
|
158 |
+
```
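
The `await` calls above only work inside a coroutine. A minimal sketch of driving them from a plain script with `asyncio.run` follows. `FakeRAG` is a hypothetical stand-in that mimics the two method names used above so the sketch runs without API keys; it is not part of LightRAG, and `run_pipeline` is likewise an illustrative name.

```python
import asyncio


class FakeRAG:
    """Hypothetical stand-in for RAGAnything; it only records calls."""

    def __init__(self):
        self.processed = []

    async def process_document_complete(self, file_path, output_dir, parse_method="auto"):
        # Record the call instead of parsing anything.
        self.processed.append((file_path, output_dir, parse_method))

    async def query_with_multimodal(self, query, mode="hybrid"):
        return f"[{mode}] answer to: {query}"


async def run_pipeline(rag, file_path: str, output_dir: str, question: str) -> str:
    """Process one document, then ask one question about it."""
    await rag.process_document_complete(
        file_path=file_path, output_dir=output_dir, parse_method="auto"
    )
    return await rag.query_with_multimodal(question, mode="hybrid")


if __name__ == "__main__":
    answer = asyncio.run(
        run_pipeline(FakeRAG(), "path/to/document.pdf", "./output", "What is the main content?")
    )
    print(answer)
```

With a real `RAGAnything` instance in place of `FakeRAG()`, the same `run_pipeline` coroutine should apply, assuming the method signatures match the example above.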
|
159 |
+
|
160 |
+
MinerU categorizes document content into text, formulas, images, and tables, processing each with its corresponding ingestion type:
|
161 |
+
- Text content: `ingestion_type='text'`
|
162 |
+
- Image content: `ingestion_type='image'`
|
163 |
+
- Table content: `ingestion_type='table'`
|
164 |
+
- Formula content: `ingestion_type='equation'`
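
To illustrate the routing described above, here is a minimal sketch that buckets parsed items by ingestion type. It assumes each entry of `content_list` is a dict whose `type` field uses the four names listed above; `group_by_ingestion_type` is a hypothetical helper, not RAGAnything's own code.

```python
from collections import defaultdict

# Ingestion types named in the list above.
KNOWN_TYPES = {"text", "image", "table", "equation"}


def group_by_ingestion_type(content_list):
    """Bucket parsed items by ingestion type; anything else goes under 'unknown'."""
    groups = defaultdict(list)
    for item in content_list:
        item_type = item.get("type")
        key = item_type if item_type in KNOWN_TYPES else "unknown"
        groups[key].append(item)
    return dict(groups)
```

Keeping an explicit `unknown` bucket makes it easy to spot content that none of the four processors would handle.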
|
165 |
+
|
166 |
+
#### Query Examples
|
167 |
+
|
168 |
+
Here are some common query examples:
|
169 |
+
|
170 |
+
```python
|
171 |
+
# Query text content
|
172 |
+
result = await rag.query_with_multimodal(
|
173 |
+
"What is the main topic of the document?",
|
174 |
+
mode="hybrid"
|
175 |
+
)
|
176 |
+
|
177 |
+
# Query image-related content
|
178 |
+
result = await rag.query_with_multimodal(
|
179 |
+
"Describe the images and figures in the document",
|
180 |
+
mode="hybrid"
|
181 |
+
)
|
182 |
+
|
183 |
+
# Query table-related content
|
184 |
+
result = await rag.query_with_multimodal(
|
185 |
+
"Tell me about the experimental results and data tables",
|
186 |
+
mode="hybrid"
|
187 |
+
)
|
188 |
+
```
|
189 |
+
|
190 |
+
#### Command Line Tool
|
191 |
+
|
192 |
+
We also provide a command-line tool for document parsing:
|
193 |
+
|
194 |
+
```bash
|
195 |
+
python examples/mineru_example.py path/to/document.pdf
|
196 |
+
```
|
197 |
+
|
198 |
+
Optional parameters:
|
199 |
+
- `--output` or `-o`: Specify output directory
|
200 |
+
- `--method` or `-m`: Choose parsing method (auto, ocr, txt)
|
201 |
+
- `--stats`: Display content statistics
|
202 |
+
|
203 |
+
### Output Format
|
204 |
+
|
205 |
+
MinerU generates three files for each parsed document:
|
206 |
+
|
207 |
+
1. `{filename}.md` - Markdown representation of the document
|
208 |
+
2. `{filename}_content_list.json` - Structured JSON content
|
209 |
+
3. `{filename}_model.json` - Detailed model parsing results
|
210 |
+
|
211 |
+
The `{filename}_content_list.json` file contains all structured content extracted from the document, including:
|
212 |
+
- Text blocks (body text, headings, etc.)
|
213 |
+
- Images (paths and optional captions)
|
214 |
+
- Tables (table content and optional captions)
|
215 |
+
- Lists
|
216 |
+
- Formulas
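
The naming scheme above can be sketched in a few lines, together with a quick count of entries per type. Two assumptions apply: the three files sit directly under the output directory (MinerU may nest them in subfolders), and the content list is a JSON array of objects with a `type` field. Both function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def expected_outputs(document_path: str, output_dir: str) -> dict:
    """Build the three output paths named above for one parsed document."""
    stem = Path(document_path).stem
    out = Path(output_dir)
    return {
        "markdown": out / f"{stem}.md",
        "content_list": out / f"{stem}_content_list.json",
        "model": out / f"{stem}_model.json",
    }


def summarize_content_list(path) -> Counter:
    """Count entries per 'type' in a content list JSON file."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    return Counter(item.get("type", "unknown") for item in items)
```

For example, `expected_outputs("docs/paper.pdf", "out")["content_list"]` is the path `out/paper_content_list.json`.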
|
217 |
+
|
218 |
+
### Troubleshooting
|
219 |
+
|
220 |
+
If you encounter issues with MinerU:
|
221 |
+
|
222 |
+
1. Check that model weights are correctly downloaded
|
223 |
+
2. Ensure you have sufficient RAM (16GB+ recommended)
|
224 |
+
3. For CUDA acceleration issues, see [MinerU documentation](https://mineru.readthedocs.io/en/latest/additional_notes/faq.html)
|
225 |
+
4. If parsing Office documents fails, verify LibreOffice is properly installed
|
226 |
+
5. If you encounter `pickle.UnpicklingError: invalid load key, 'v'.`, it might be due to an incomplete model download. Try re-downloading the models.
|
227 |
+
6. For users with newer graphics cards (H100, etc.) and garbled OCR text, try upgrading the CUDA version used by Paddle:
|
228 |
+
```bash
|
229 |
+
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
|
230 |
+
```
|
231 |
+
7. If you encounter a "filename too long" error, the latest version of MineruParser includes logic to automatically handle this issue.
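
For item 4, a quick way to verify the LibreOffice installation is to check whether its launcher is on `PATH`. The sketch below assumes the executable is named `soffice` (or `libreoffice`, as some Linux packages install it); `find_libreoffice` is a hypothetical helper, not part of LightRAG or MinerU.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the full path of the first LibreOffice launcher found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_libreoffice()
    if location:
        print(f"LibreOffice found at {location}")
    else:
        print("LibreOffice not found on PATH")
```

If this reports that nothing was found on Windows, the `program` directory is probably missing from `PATH`.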
|
232 |
+
|
233 |
+
#### Updating Existing Models
|
234 |
+
|
235 |
+
If you have previously downloaded models and need to update them, you can simply run the download script again. The script will update the model directory to the latest version.
|
236 |
+
|
237 |
+
### Advanced Configuration
|
238 |
+
|
239 |
+
The MinerU configuration file `magic-pdf.json` supports various customization options, including:
|
240 |
+
|
241 |
+
- Model directory path
|
242 |
+
- OCR engine selection
|
243 |
+
- GPU acceleration settings
|
244 |
+
- Cache settings
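
To see which options your local configuration currently sets, you can load the file and list its top-level keys. This is a minimal sketch that assumes `magic-pdf.json` lives directly in the user directory, as described in the installation section; it does not assume any particular key name, and `load_magic_pdf_config` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_magic_pdf_config(home=None):
    """Return the parsed magic-pdf.json from the user directory, or None if absent."""
    home = Path(home) if home is not None else Path.home()
    config_path = home / "magic-pdf.json"
    if not config_path.is_file():
        return None
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    config = load_magic_pdf_config()
    if config is None:
        print("magic-pdf.json not found; run a model download script first")
    else:
        # Print only the top-level option names.
        print(sorted(config))
```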
|
245 |
+
|
246 |
+
For complete configuration options, refer to the [MinerU official documentation](https://mineru.readthedocs.io/).
|
docs/mineru_integration_zh.md
ADDED
@@ -0,0 +1,245 @@
|
1 |
+
# MinerU 集成指南
|
2 |
+
|
3 |
+
### 关于 MinerU
|
4 |
+
|
5 |
+
MinerU 是一个强大的开源工具,用于从 PDF、图像和 Office 文档中提取高质量的结构化数据。它提供以下功能:
|
6 |
+
|
7 |
+
- 保留文档结构(标题、段落、列表等)的文本提取
|
8 |
+
- 处理包括多列格式在内的复杂布局
|
9 |
+
- 自动识别并将公式转换为 LaTeX 格式
|
10 |
+
- 提取图像、表格和脚注
|
11 |
+
- 自动检测扫描文档并应用 OCR
|
12 |
+
- 支持多种输出格式(Markdown、JSON)
|
13 |
+
|
14 |
+
### 安装
|
15 |
+
|
16 |
+
#### 安装 MinerU 依赖
|
17 |
+
|
18 |
+
如果您已经安装了 LightRAG,但没有 MinerU 支持,您可以通过安装 magic-pdf 包来直接添加 MinerU 支持:
|
19 |
+
|
20 |
+
```bash
|
21 |
+
pip install "magic-pdf[full]>=1.2.2" huggingface_hub
|
22 |
+
```
|
23 |
+
|
24 |
+
这些是 LightRAG 所需的 MinerU 相关依赖项。
|
25 |
+
|
26 |
+
#### MinerU 模型权重
|
27 |
+
|
28 |
+
MinerU 需要模型权重文件才能正常运行。安装后,您需要下载所需的模型权重。您可以使用 Hugging Face 或 ModelScope 下载模型。
|
29 |
+
|
30 |
+
##### 选项 1:从 Hugging Face 下载
|
31 |
+
|
32 |
+
```bash
|
33 |
+
pip install huggingface_hub
|
34 |
+
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
|
35 |
+
python download_models_hf.py
|
36 |
+
```
|
37 |
+
|
38 |
+
##### 选项 2:从 ModelScope 下载(推荐中国用户使用)
|
39 |
+
|
40 |
+
```bash
|
41 |
+
pip install modelscope
|
42 |
+
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models.py -O download_models.py
|
43 |
+
python download_models.py
|
44 |
+
```
|
45 |
+
|
46 |
+
两种方法都会自动下载模型文件并在配置文件中配置模型目录。配置文件位于用户目录中,名为 `magic-pdf.json`。
|
47 |
+
|
48 |
+
> **Windows 用户注意**:用户目录位于 `C:\Users\用户名`
|
49 |
+
> **Linux 用户注意**:用户目录位于 `/home/用户名`
|
50 |
+
> **macOS 用户注意**:用户目录位于 `/Users/用户名`
|
51 |
+
|
52 |
+
#### 可选:安装 LibreOffice
|
53 |
+
|
54 |
+
要处理 Office 文档(DOC、DOCX、PPT、PPTX),您需要安装 LibreOffice:
|
55 |
+
|
56 |
+
**Linux/macOS:**
|
57 |
+
```bash
|
58 |
+
sudo apt-get install libreoffice  # Debian/Ubuntu;RHEL/CentOS 使用 yum,macOS 使用 `brew install --cask libreoffice`
|
59 |
+
```
|
60 |
+
|
61 |
+
**Windows:**
|
62 |
+
1. 安装 LibreOffice
|
63 |
+
2. 将安装目录添加到 PATH 环境变量:`安装目录\LibreOffice\program`
|
64 |
+
|
65 |
+
### 使用 MinerU 解析器
|
66 |
+
|
67 |
+
#### 基本用法
|
68 |
+
|
69 |
+
```python
|
70 |
+
from lightrag.mineru_parser import MineruParser
|
71 |
+
|
72 |
+
# 解析 PDF 文档
|
73 |
+
content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
|
74 |
+
|
75 |
+
# 解析图像
|
76 |
+
content_list, md_content = MineruParser.parse_image('path/to/image.jpg', 'output_dir')
|
77 |
+
|
78 |
+
# 解析 Office 文档
|
79 |
+
content_list, md_content = MineruParser.parse_office_doc('path/to/document.docx', 'output_dir')
|
80 |
+
|
81 |
+
# 自动检测并解析任何支持的文档类型
|
82 |
+
content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
|
83 |
+
```
|
84 |
+
|
85 |
+
#### RAGAnything 集成
|
86 |
+
|
87 |
+
在 RAGAnything 中,您可以直接使用文件路径作为 `process_document_complete` 方法的输入来处理文档。以下是一个完整的配置示例:
|
88 |
+
|
89 |
+
```python
|
90 |
+
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
|
91 |
+
from lightrag.raganything import RAGAnything
|
92 |
+
|
93 |
+
|
94 |
+
# 初始化 RAGAnything
|
95 |
+
rag = RAGAnything(
|
96 |
+
working_dir="./rag_storage", # 工作目录
|
97 |
+
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
|
98 |
+
"gpt-4o-mini", # 使用的模型
|
99 |
+
prompt,
|
100 |
+
system_prompt=system_prompt,
|
101 |
+
history_messages=history_messages,
|
102 |
+
api_key="your-api-key", # 替换为您的 API 密钥
|
103 |
+
base_url="your-base-url", # 替换为您的 API 基础 URL
|
104 |
+
**kwargs,
|
105 |
+
),
|
106 |
+
vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
|
107 |
+
"gpt-4o", # 视觉模型
|
108 |
+
"",
|
109 |
+
system_prompt=None,
|
110 |
+
history_messages=[],
|
111 |
+
messages=[
|
112 |
+
*([{"role": "system", "content": system_prompt}] if system_prompt else []),
|
113 |
+
{"role": "user", "content": [
|
114 |
+
{"type": "text", "text": prompt},
|
115 |
+
{
|
116 |
+
"type": "image_url",
|
117 |
+
"image_url": {
|
118 |
+
"url": f"data:image/jpeg;base64,{image_data}"
|
119 |
+
}
|
120 |
+
}
|
121 |
+
]} if image_data else {"role": "user", "content": prompt}
|
122 |
+
],
|
123 |
+
api_key="your-api-key", # 替换为您的 API 密钥
|
124 |
+
base_url="your-base-url", # 替换为您的 API 基础 URL
|
125 |
+
**kwargs,
|
126 |
+
) if image_data else openai_complete_if_cache(
|
127 |
+
"gpt-4o-mini",
|
128 |
+
prompt,
|
129 |
+
system_prompt=system_prompt,
|
130 |
+
history_messages=history_messages,
|
131 |
+
api_key="your-api-key", # 替换为您的 API 密钥
|
132 |
+
base_url="your-base-url", # 替换为您的 API 基础 URL
|
133 |
+
**kwargs,
|
134 |
+
),
|
135 |
+
embedding_func=lambda texts: openai_embed(
|
136 |
+
texts,
|
137 |
+
model="text-embedding-3-large",
|
138 |
+
api_key="your-api-key", # 替换为您的 API 密钥
|
139 |
+
base_url="your-base-url", # 替换为您的 API 基础 URL
|
140 |
+
),
|
141 |
+
embedding_dim=3072,
|
142 |
+
max_token_size=8192
|
143 |
+
)
|
144 |
+
|
145 |
+
# 处理单个文件
|
146 |
+
await rag.process_document_complete(
|
147 |
+
file_path="path/to/document.pdf",
|
148 |
+
output_dir="./output",
|
149 |
+
parse_method="auto"
|
150 |
+
)
|
151 |
+
|
152 |
+
# 查询处理后的文档
|
153 |
+
result = await rag.query_with_multimodal(
|
154 |
+
"What is the main content of the document?",
|
155 |
+
mode="hybrid"
|
156 |
+
)
|
157 |
+
```
|
158 |
+
|
159 |
+
MinerU 会将文档内容分类为文本、公式、图像和表格,分别使用相应的摄入类型进行处理:
|
160 |
+
- 文本内容:`ingestion_type='text'`
|
161 |
+
- 图像内容:`ingestion_type='image'`
|
162 |
+
- 表格内容:`ingestion_type='table'`
|
163 |
+
- 公式内容:`ingestion_type='equation'`
|
164 |
+
|
165 |
+
#### 查询示例
|
166 |
+
|
167 |
+
以下是一些常见的查询示例:
|
168 |
+
|
169 |
+
```python
|
170 |
+
# 查询文本内容
|
171 |
+
result = await rag.query_with_multimodal(
|
172 |
+
"What is the main topic of the document?",
|
173 |
+
mode="hybrid"
|
174 |
+
)
|
175 |
+
|
176 |
+
# 查询图片相关内容
|
177 |
+
result = await rag.query_with_multimodal(
|
178 |
+
"Describe the images and figures in the document",
|
179 |
+
mode="hybrid"
|
180 |
+
)
|
181 |
+
|
182 |
+
# 查询表格相关内容
|
183 |
+
result = await rag.query_with_multimodal(
|
184 |
+
"Tell me about the experimental results and data tables",
|
185 |
+
mode="hybrid"
|
186 |
+
)
|
187 |
+
```
|
188 |
+
|
189 |
+
#### 命令行工具
|
190 |
+
|
191 |
+
我们还提供了一个用于文档解析的命令行工具:
|
192 |
+
|
193 |
+
```bash
|
194 |
+
python examples/mineru_example.py path/to/document.pdf
|
195 |
+
```
|
196 |
+
|
197 |
+
可选参数:
|
198 |
+
- `--output` 或 `-o`:指定输出目录
|
199 |
+
- `--method` 或 `-m`:选择解析方法(auto、ocr、txt)
|
200 |
+
- `--stats`:显示内容统计信息
|
201 |
+
|
202 |
+
### 输出格式
|
203 |
+
|
204 |
+
MinerU 为每个解析的文档生成三个文件:
|
205 |
+
|
206 |
+
1. `{文件名}.md` - 文档的 Markdown 表示
|
207 |
+
2. `{文件名}_content_list.json` - 结构化 JSON 内容
|
208 |
+
3. `{文件名}_model.json` - 详细的模型解析结果
|
209 |
+
|
210 |
+
`content_list.json` 文件包含从文档中提取的所有结构化内容,包括:
|
211 |
+
- 文本块(正文、标题等)
|
212 |
+
- 图像(路径和可选的标题)
|
213 |
+
- 表格(表格内容和可选的标题)
|
214 |
+
- 列表
|
215 |
+
- 公式
|
216 |
+
|
217 |
+
### 疑难解答
|
218 |
+
|
219 |
+
如果您在使用 MinerU 时遇到问题:
|
220 |
+
|
221 |
+
1. 检查模型权重是否正确下载
|
222 |
+
2. 确保有足够的内存(建议 16GB+)
|
223 |
+
3. 对于 CUDA 加速问题,请参阅 [MinerU 文档](https://mineru.readthedocs.io/en/latest/additional_notes/faq.html)
|
224 |
+
4. 如果解析 Office 文档失败,请验证 LibreOffice 是否正确安装
|
225 |
+
5. 如果遇到 `pickle.UnpicklingError: invalid load key, 'v'.`,可能是因为模型下载不完整。尝试重新下载模型。
|
226 |
+
6. 对于使用较新显卡(H100 等)并出现 OCR 文本乱码的用户,请尝试升级 Paddle 使用的 CUDA 版本:
|
227 |
+
```bash
|
228 |
+
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
|
229 |
+
```
|
230 |
+
7. 如果遇到 "文件名太长" 错误,最新版本的 MineruParser 已经包含了自动处理此问题的逻辑。
|
231 |
+
|
232 |
+
#### 更新现有模型
|
233 |
+
|
234 |
+
如果您之前已经下载了模型并需要更新它们,只需再次运行下载脚本即可。脚本将更新模型目录到最新版本。
|
235 |
+
|
236 |
+
### 高级配置
|
237 |
+
|
238 |
+
MinerU 配置文件 `magic-pdf.json` 支持多种自定义选项,包括:
|
239 |
+
|
240 |
+
- 模型目录路径
|
241 |
+
- OCR 引擎选择
|
242 |
+
- GPU 加速设置
|
243 |
+
- 缓存设置
|
244 |
+
|
245 |
+
有关完整的配置选项,请参阅 [MinerU 官方文档](https://mineru.readthedocs.io/)。
|
examples/mineru_example.py
ADDED
@@ -0,0 +1,82 @@
|
1 |
+
#!/usr/bin/env python
|
2 |
+
"""
|
3 |
+
Example script demonstrating the basic usage of MinerU parser
|
4 |
+
|
5 |
+
This example shows how to:
|
6 |
+
1. Parse different types of documents (PDF, images, office documents)
|
7 |
+
2. Use different parsing methods
|
8 |
+
3. Display document statistics
|
9 |
+
"""
|
10 |
+
|
11 |
+
import os
|
12 |
+
import argparse
|
13 |
+
from pathlib import Path
|
14 |
+
from lightrag.mineru_parser import MineruParser
|
15 |
+
|
16 |
+
def parse_document(file_path: str, output_dir: str = None, method: str = "auto", stats: bool = False):
|
17 |
+
"""
|
18 |
+
Parse a document using MinerU parser
|
19 |
+
|
20 |
+
Args:
|
21 |
+
file_path: Path to the document
|
22 |
+
output_dir: Output directory for parsed results
|
23 |
+
method: Parsing method (auto, ocr, txt)
|
24 |
+
stats: Whether to display content statistics
|
25 |
+
"""
|
26 |
+
try:
|
27 |
+
# Parse the document
|
28 |
+
content_list, md_content = MineruParser.parse_document(
|
29 |
+
file_path=file_path,
|
30 |
+
parse_method=method,
|
31 |
+
output_dir=output_dir
|
32 |
+
)
|
33 |
+
|
34 |
+
# Display statistics if requested
|
35 |
+
if stats:
|
36 |
+
print("\nDocument Statistics:")
|
37 |
+
print(f"Total content blocks: {len(content_list)}")
|
38 |
+
|
39 |
+
# Count different types of content
|
40 |
+
content_types = {}
|
41 |
+
for item in content_list:
|
42 |
+
content_type = item.get('type', 'unknown')
|
43 |
+
content_types[content_type] = content_types.get(content_type, 0) + 1
|
44 |
+
|
45 |
+
print("\nContent Type Distribution:")
|
46 |
+
for content_type, count in content_types.items():
|
47 |
+
print(f"- {content_type}: {count}")
|
48 |
+
|
49 |
+
return content_list, md_content
|
50 |
+
|
51 |
+
except Exception as e:
|
52 |
+
print(f"Error parsing document: {str(e)}")
|
53 |
+
return None, None
|
54 |
+
|
55 |
+
def main():
|
56 |
+
"""Main function to run the example"""
|
57 |
+
parser = argparse.ArgumentParser(description='MinerU Parser Example')
|
58 |
+
parser.add_argument('file_path', help='Path to the document to parse')
|
59 |
+
parser.add_argument('--output', '-o', help='Output directory path')
|
60 |
+
parser.add_argument('--method', '-m',
|
61 |
+
choices=['auto', 'ocr', 'txt'],
|
62 |
+
default='auto',
|
63 |
+
help='Parsing method (auto, ocr, txt)')
|
64 |
+
parser.add_argument('--stats', action='store_true',
|
65 |
+
help='Display content statistics')
|
66 |
+
|
67 |
+
args = parser.parse_args()
|
68 |
+
|
69 |
+
# Create output directory if specified
|
70 |
+
if args.output:
|
71 |
+
os.makedirs(args.output, exist_ok=True)
|
72 |
+
|
73 |
+
# Parse document
|
74 |
+
content_list, md_content = parse_document(
|
75 |
+
args.file_path,
|
76 |
+
args.output,
|
77 |
+
args.method,
|
78 |
+
args.stats
|
79 |
+
)
|
80 |
+
|
81 |
+
if __name__ == '__main__':
|
82 |
+
main()
|
examples/raganything_example.py
ADDED
@@ -0,0 +1,129 @@
|
1 |
+
#!/usr/bin/env python
|
2 |
+
"""
|
3 |
+
Example script demonstrating the integration of MinerU parser with RAGAnything
|
4 |
+
|
5 |
+
This example shows how to:
|
6 |
+
1. Process parsed documents with RAGAnything
|
7 |
+
2. Perform multimodal queries on the processed documents
|
8 |
+
3. Handle different types of content (text, images, tables)
|
9 |
+
"""
|
10 |
+
|
11 |
+
import os
|
12 |
+
import argparse
|
13 |
+
import asyncio
|
14 |
+
from pathlib import Path
|
15 |
+
from lightrag.mineru_parser import MineruParser
|
16 |
+
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
|
17 |
+
from lightrag.raganything import RAGAnything
|
18 |
+
|
19 |
+
async def process_with_rag(file_path: str, output_dir: str, api_key: str, base_url: str = None, working_dir: str = None):
|
20 |
+
"""
|
21 |
+
Process document with RAGAnything
|
22 |
+
|
23 |
+
Args:
|
24 |
+
file_path: Path to the document
|
25 |
+
output_dir: Output directory for RAG results
|
26 |
+
api_key: OpenAI API key
|
27 |
+
base_url: Optional base URL for API
working_dir: Working directory for RAG storage
|
28 |
+
"""
|
29 |
+
try:
|
30 |
+
# Initialize RAGAnything
|
31 |
+
rag = RAGAnything(
|
32 |
+
working_dir=working_dir,
|
33 |
+
llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
|
34 |
+
"gpt-4o-mini",
|
35 |
+
prompt,
|
36 |
+
system_prompt=system_prompt,
|
37 |
+
history_messages=history_messages,
|
38 |
+
api_key=api_key,
|
39 |
+
base_url=base_url,
|
40 |
+
**kwargs,
|
41 |
+
),
|
42 |
+
vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
|
43 |
+
"gpt-4o",
|
44 |
+
"",
|
45 |
+
system_prompt=None,
|
46 |
+
history_messages=[],
|
47 |
+
messages=[
|
48 |
+
*([{"role": "system", "content": system_prompt}] if system_prompt else []),
|
49 |
+
{"role": "user", "content": [
|
50 |
+
{"type": "text", "text": prompt},
|
51 |
+
{
|
52 |
+
"type": "image_url",
|
53 |
+
"image_url": {
|
54 |
+
"url": f"data:image/jpeg;base64,{image_data}"
|
55 |
+
}
|
56 |
+
}
|
57 |
+
]} if image_data else {"role": "user", "content": prompt}
|
58 |
+
],
|
59 |
+
api_key=api_key,
|
60 |
+
base_url=base_url,
|
61 |
+
**kwargs,
|
62 |
+
) if image_data else openai_complete_if_cache(
|
63 |
+
"gpt-4o-mini",
|
64 |
+
prompt,
|
65 |
+
system_prompt=system_prompt,
|
66 |
+
history_messages=history_messages,
|
67 |
+
api_key=api_key,
|
68 |
+
base_url=base_url,
|
69 |
+
**kwargs,
|
70 |
+
),
|
71 |
+
embedding_func=lambda texts: openai_embed(
|
72 |
+
texts,
|
73 |
+
model="text-embedding-3-large",
|
74 |
+
api_key=api_key,
|
75 |
+
base_url=base_url,
|
76 |
+
),
|
77 |
+
embedding_dim=3072,
|
78 |
+
max_token_size=8192
|
79 |
+
)
|
80 |
+
|
81 |
+
# Process document
|
82 |
+
await rag.process_document_complete(
|
83 |
+
file_path=file_path,
|
84 |
+
output_dir=output_dir,
|
85 |
+
parse_method="auto"
|
86 |
+
)
|
87 |
+
|
88 |
+
# Example queries
|
89 |
+
queries = [
|
90 |
+
"What is the main content of the document?",
|
91 |
+
"Describe the images and figures in the document",
|
92 |
+
"Tell me about the experimental results and data tables"
|
93 |
+
]
|
94 |
+
|
95 |
+
print("\nQuerying processed document:")
|
96 |
+
for query in queries:
|
97 |
+
print(f"\nQuery: {query}")
|
98 |
+
result = await rag.query_with_multimodal(query, mode="hybrid")
|
99 |
+
print(f"Answer: {result}")
|
100 |
+
|
101 |
+
except Exception as e:
|
102 |
+
print(f"Error processing with RAG: {str(e)}")
|
103 |
+
|
104 |
+
def main():
|
105 |
+
"""Main function to run the example"""
|
106 |
+
parser = argparse.ArgumentParser(description='MinerU RAG Example')
|
107 |
+
parser.add_argument('file_path', help='Path to the document to process')
|
108 |
+
parser.add_argument('--working_dir', '-w', default="./rag_storage", help='Working directory path')
|
109 |
+
parser.add_argument('--output', '-o', default="./output", help='Output directory path')
|
110 |
+
parser.add_argument('--api-key', required=True, help='OpenAI API key for RAG processing')
|
111 |
+
parser.add_argument('--base-url', help='Optional base URL for API')
|
112 |
+
|
113 |
+
args = parser.parse_args()
|
114 |
+
|
115 |
+
# Create output directory if specified
|
116 |
+
if args.output:
|
117 |
+
os.makedirs(args.output, exist_ok=True)
|
118 |
+
|
119 |
+
# Process with RAG
|
120 |
+
asyncio.run(process_with_rag(
|
121 |
+
args.file_path,
|
122 |
+
args.output,
|
123 |
+
args.api_key,
|
124 |
+
args.base_url,
|
125 |
+
args.working_dir
|
126 |
+
))
|
127 |
+
|
128 |
+
if __name__ == '__main__':
|
129 |
+
main()
|
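Review note on `vision_model_func` in the example above: its `messages` list contains a literal `None` whenever no system prompt is supplied. The following is a minimal sketch, not part of the commit, of building the same list without that placeholder. The helper name `build_vision_messages` is hypothetical, and the message layout simply mirrors the dictionaries used in the example.

```python
from typing import Any, Dict, List, Optional


def build_vision_messages(
    prompt: str,
    image_data: Optional[str] = None,
    system_prompt: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Build a chat `messages` list, omitting the system entry when absent.

    Hypothetical helper: `image_data` is assumed to be a base64 string,
    as in the example's `data:image/jpeg;base64,...` URL.
    """
    messages: List[Dict[str, Any]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    if image_data:
        # Multimodal user turn: text part followed by an inline image part.
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
            ],
        })
    else:
        # Text-only user turn.
        messages.append({"role": "user", "content": prompt})
    return messages
```

Whether `openai_complete_if_cache` accepts a caller-supplied `messages` argument at this commit is not verified here; the sketch only covers how the list is assembled.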
lightrag/mineru_parser.py
ADDED
@@ -0,0 +1,454 @@
+# type: ignore
+"""
+MinerU Document Parser Utility
+
+This module provides functionality for parsing PDF, image and office documents using the MinerU library,
+and converts the parsing results into Markdown and JSON formats.
+"""
+
+from __future__ import annotations
+
+__all__ = ["MineruParser"]
+
+import os
+import json
+import argparse
+from pathlib import Path
+from typing import Dict, List, Optional, Union, Tuple, Any, TypeVar, cast, TYPE_CHECKING, ClassVar
+
+# Type stubs for magic_pdf
+FileBasedDataWriter = Any
+FileBasedDataReader = Any
+PymuDocDataset = Any
+InferResult = Any
+PipeResult = Any
+SupportedPdfParseMethod = Any
+doc_analyze = Any
+read_local_office = Any
+read_local_images = Any
+
+if TYPE_CHECKING:
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
+    from magic_pdf.data.dataset import PymuDocDataset
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.config.enums import SupportedPdfParseMethod
+    from magic_pdf.data.read_api import read_local_office, read_local_images
+else:
+    # MinerU imports
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
+    from magic_pdf.data.dataset import PymuDocDataset
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.config.enums import SupportedPdfParseMethod
+    from magic_pdf.data.read_api import read_local_office, read_local_images
+
+T = TypeVar('T')
+
+class MineruParser:
+    """
+    MinerU document parsing utility class
+
+    Supports parsing PDF, image and office documents (like Word, PPT, etc.),
+    converting the content into structured data and generating markdown and JSON output
+    """
+
+    __slots__: ClassVar[Tuple[str, ...]] = ()
+
+    def __init__(self) -> None:
+        """Initialize MineruParser"""
+        pass
+
+    @staticmethod
+    def safe_write(writer: Any, content: Union[str, bytes, Dict[str, Any], List[Any]], filename: str) -> None:
+        """
+        Safely write content to a file, ensuring the filename is valid
+
+        Args:
+            writer: The writer object to use
+            content: The content to write
+            filename: The filename to write to
+        """
+        # Ensure the filename isn't too long
+        if len(filename) > 200:  # Most filesystems have limits around 255 characters
+            # Truncate the filename while keeping the extension
+            base, ext = os.path.splitext(filename)
+            filename = base[:190] + ext  # Leave room for the extension and some margin
+
+        # Handle specific content types
+        if isinstance(content, str):
+            # Ensure str content is encoded to bytes if required
+            try:
+                writer.write(content, filename)
+            except TypeError:
+                # If the writer expects bytes, convert string to bytes
+                writer.write(content.encode('utf-8'), filename)
+        else:
+            # For dict/list content, always encode as JSON string first
+            if isinstance(content, (dict, list)):
+                try:
+                    writer.write(json.dumps(content, ensure_ascii=False, indent=4), filename)
+                except TypeError:
+                    # If the writer expects bytes, convert JSON string to bytes
+                    writer.write(json.dumps(content, ensure_ascii=False, indent=4).encode('utf-8'), filename)
+            else:
+                # Regular content (assumed to be bytes or compatible)
+                writer.write(content, filename)
+
+    @staticmethod
+    def parse_pdf(
+        pdf_path: Union[str, Path],
+        output_dir: Optional[str] = None,
+        use_ocr: bool = False
+    ) -> Tuple[List[Dict[str, Any]], str]:
+        """
+        Parse PDF document
+
+        Args:
+            pdf_path: Path to the PDF file
+            output_dir: Output directory path
+            use_ocr: Whether to force OCR parsing
+
+        Returns:
+            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
+        """
+        try:
+            # Convert to Path object for easier handling
+            pdf_path = Path(pdf_path)
+            name_without_suff = pdf_path.stem
+
+            # Prepare output directories - ensure file name is in path
+            if output_dir:
+                base_output_dir = Path(output_dir)
+                local_md_dir = base_output_dir / name_without_suff
+            else:
+                local_md_dir = pdf_path.parent / name_without_suff
+
+            local_image_dir = local_md_dir / "images"
+            image_dir = local_image_dir.name
+
+            # Create directories
+            os.makedirs(local_image_dir, exist_ok=True)
+            os.makedirs(local_md_dir, exist_ok=True)
+
+            # Initialize writers and reader
+            image_writer = FileBasedDataWriter(str(local_image_dir))  # type: ignore
+            md_writer = FileBasedDataWriter(str(local_md_dir))  # type: ignore
+            reader = FileBasedDataReader("")  # type: ignore
+
+            # Read PDF bytes
+            pdf_bytes = reader.read(str(pdf_path))  # type: ignore
+
+            # Create dataset instance
+            ds = PymuDocDataset(pdf_bytes)  # type: ignore
+
+            # Process based on PDF type and user preference
+            if use_ocr or ds.classify() == SupportedPdfParseMethod.OCR:  # type: ignore
+                infer_result = ds.apply(doc_analyze, ocr=True)  # type: ignore
+                pipe_result = infer_result.pipe_ocr_mode(image_writer)  # type: ignore
+            else:
+                infer_result = ds.apply(doc_analyze, ocr=False)  # type: ignore
+                pipe_result = infer_result.pipe_txt_mode(image_writer)  # type: ignore
+
+            # Draw visualizations
+            try:
+                infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))  # type: ignore
+                pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))  # type: ignore
+                pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))  # type: ignore
+            except Exception as e:
+                print(f"Warning: Failed to draw visualizations: {str(e)}")
+
+            # Get data using API methods
+            md_content = pipe_result.get_markdown(image_dir)  # type: ignore
+            content_list = pipe_result.get_content_list(image_dir)  # type: ignore
+
+            # Save files using dump methods (consistent with API)
+            pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)  # type: ignore
+            pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)  # type: ignore
+            pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json")  # type: ignore
+
+            # Save model result as JSON text; fall back to the writer with UTF-8 bytes if that fails
+            model_inference_result = infer_result.get_infer_res()  # type: ignore
+            json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
+
+            try:
+                # Try to write to a file manually to avoid FileBasedDataWriter issues
+                model_file_path = os.path.join(local_md_dir, f"{name_without_suff}_model.json")
+                with open(model_file_path, 'w', encoding='utf-8') as f:
+                    f.write(json_str)
+            except Exception as e:
+                print(f"Warning: Failed to save model result using file write: {str(e)}")
+                try:
+                    # If direct file write fails, try using the writer with bytes encoding
+                    md_writer.write(json_str.encode('utf-8'), f"{name_without_suff}_model.json")  # type: ignore
+                except Exception as e2:
+                    print(f"Warning: Failed to save model result using writer: {str(e2)}")
+
+            return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
+
+        except Exception as e:
+            print(f"Error in parse_pdf: {str(e)}")
+            raise
+
+    @staticmethod
+    def parse_office_doc(
+        doc_path: Union[str, Path],
+        output_dir: Optional[str] = None
+    ) -> Tuple[List[Dict[str, Any]], str]:
+        """
+        Parse office document (Word, PPT, etc.)
+
+        Args:
+            doc_path: Path to the document file
+            output_dir: Output directory path
+
+        Returns:
+            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
+        """
+        try:
+            # Convert to Path object for easier handling
+            doc_path = Path(doc_path)
+            name_without_suff = doc_path.stem
+
+            # Prepare output directories - ensure file name is in path
+            if output_dir:
+                base_output_dir = Path(output_dir)
+                local_md_dir = base_output_dir / name_without_suff
+            else:
+                local_md_dir = doc_path.parent / name_without_suff
+
+            local_image_dir = local_md_dir / "images"
+            image_dir = local_image_dir.name
+
+            # Create directories
+            os.makedirs(local_image_dir, exist_ok=True)
+            os.makedirs(local_md_dir, exist_ok=True)
+
+            # Initialize writers
+            image_writer = FileBasedDataWriter(str(local_image_dir))  # type: ignore
+            md_writer = FileBasedDataWriter(str(local_md_dir))  # type: ignore
+
+            # Read office document
+            ds = read_local_office(str(doc_path))[0]  # type: ignore
+
+            # Apply chain of operations according to API documentation
+            # This follows the pattern shown in MS-Office example in the API docs
+            ds.apply(doc_analyze, ocr=True)\
+                .pipe_txt_mode(image_writer)\
+                .dump_md(md_writer, f"{name_without_suff}.md", image_dir)  # type: ignore
+
+            # Run the pipeline again to obtain the result objects used below
+            infer_result = ds.apply(doc_analyze, ocr=True)  # type: ignore
+            pipe_result = infer_result.pipe_txt_mode(image_writer)  # type: ignore
+
+            # Get data for return values and additional outputs
+            md_content = pipe_result.get_markdown(image_dir)  # type: ignore
+            content_list = pipe_result.get_content_list(image_dir)  # type: ignore
+
+            # Save additional output files
+            pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)  # type: ignore
+            pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json")  # type: ignore
+
+            # Save model result as JSON text; fall back to the writer with UTF-8 bytes if that fails
+            model_inference_result = infer_result.get_infer_res()  # type: ignore
+            json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
+
+            try:
+                # Try to write to a file manually to avoid FileBasedDataWriter issues
+                model_file_path = os.path.join(local_md_dir, f"{name_without_suff}_model.json")
+                with open(model_file_path, 'w', encoding='utf-8') as f:
+                    f.write(json_str)
+            except Exception as e:
+                print(f"Warning: Failed to save model result using file write: {str(e)}")
+                try:
+                    # If direct file write fails, try using the writer with bytes encoding
+                    md_writer.write(json_str.encode('utf-8'), f"{name_without_suff}_model.json")  # type: ignore
+                except Exception as e2:
+                    print(f"Warning: Failed to save model result using writer: {str(e2)}")
+
+            return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
+
+        except Exception as e:
+            print(f"Error in parse_office_doc: {str(e)}")
+            raise
+
+    @staticmethod
+    def parse_image(
+        image_path: Union[str, Path],
+        output_dir: Optional[str] = None
+    ) -> Tuple[List[Dict[str, Any]], str]:
+        """
+        Parse image document
+
+        Args:
+            image_path: Path to the image file
+            output_dir: Output directory path
+
+        Returns:
+            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
+        """
+        try:
+            # Convert to Path object for easier handling
+            image_path = Path(image_path)
+            name_without_suff = image_path.stem
+
+            # Prepare output directories - ensure file name is in path
+            if output_dir:
+                base_output_dir = Path(output_dir)
+                local_md_dir = base_output_dir / name_without_suff
+            else:
+                local_md_dir = image_path.parent / name_without_suff
+
+            local_image_dir = local_md_dir / "images"
+            image_dir = local_image_dir.name
+
+            # Create directories
+            os.makedirs(local_image_dir, exist_ok=True)
+            os.makedirs(local_md_dir, exist_ok=True)
+
+            # Initialize writers
+            image_writer = FileBasedDataWriter(str(local_image_dir))  # type: ignore
+            md_writer = FileBasedDataWriter(str(local_md_dir))  # type: ignore
+
+            # Read image
+            ds = read_local_images(str(image_path))[0]  # type: ignore
+
+            # Apply chain of operations according to API documentation
+            # This follows the pattern shown in Image example in the API docs
+            ds.apply(doc_analyze, ocr=True)\
+                .pipe_ocr_mode(image_writer)\
+                .dump_md(md_writer, f"{name_without_suff}.md", image_dir)  # type: ignore
+
+            # Run the pipeline again to obtain the result objects used below
+            infer_result = ds.apply(doc_analyze, ocr=True)  # type: ignore
+            pipe_result = infer_result.pipe_ocr_mode(image_writer)  # type: ignore
+
+            # Get data for return values and additional outputs
+            md_content = pipe_result.get_markdown(image_dir)  # type: ignore
+            content_list = pipe_result.get_content_list(image_dir)  # type: ignore
+
+            # Save additional output files
+            pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)  # type: ignore
+            pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json")  # type: ignore
+
+            # Save model result as JSON text; fall back to the writer with UTF-8 bytes if that fails
+            model_inference_result = infer_result.get_infer_res()  # type: ignore
+            json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
+
+            try:
+                # Try to write to a file manually to avoid FileBasedDataWriter issues
+                model_file_path = os.path.join(local_md_dir, f"{name_without_suff}_model.json")
+                with open(model_file_path, 'w', encoding='utf-8') as f:
+                    f.write(json_str)
+            except Exception as e:
+                print(f"Warning: Failed to save model result using file write: {str(e)}")
+                try:
+                    # If direct file write fails, try using the writer with bytes encoding
+                    md_writer.write(json_str.encode('utf-8'), f"{name_without_suff}_model.json")  # type: ignore
+                except Exception as e2:
+                    print(f"Warning: Failed to save model result using writer: {str(e2)}")
+
+            return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
+
+        except Exception as e:
+            print(f"Error in parse_image: {str(e)}")
+            raise
+
+    @staticmethod
+    def parse_document(
+        file_path: Union[str, Path],
+        parse_method: str = "auto",
+        output_dir: Optional[str] = None,
+        save_results: bool = True
+    ) -> Tuple[List[Dict[str, Any]], str]:
+        """
+        Parse document using MinerU based on file extension
+
+        Args:
+            file_path: Path to the file to be parsed
+            parse_method: Parsing method, supports "auto", "ocr", "txt", default is "auto"
+            output_dir: Output directory path; if None, a directory named after the input file is created next to it
+            save_results: Whether to save parsing results to files (currently unused; results are always saved)
+
+        Returns:
+            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
+        """
+        # Convert to Path object
+        file_path = Path(file_path)
+        if not file_path.exists():
+            raise FileNotFoundError(f"File does not exist: {file_path}")
+
+        # Get file extension
+        ext = file_path.suffix.lower()
+
+        # Choose appropriate parser based on file type
+        if ext in [".pdf"]:
+            return MineruParser.parse_pdf(
+                file_path,
+                output_dir,
+                use_ocr=(parse_method == "ocr")
+            )
+        elif ext in [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]:
+            return MineruParser.parse_image(
+                file_path,
+                output_dir
+            )
+        elif ext in [".doc", ".docx", ".ppt", ".pptx"]:
+            return MineruParser.parse_office_doc(
+                file_path,
+                output_dir
+            )
+        else:
+            # For unsupported file types, default to PDF parsing
+            print(f"Warning: Unsupported file extension '{ext}', trying generic PDF parser")
+            return MineruParser.parse_pdf(
+                file_path,
+                output_dir,
+                use_ocr=(parse_method == "ocr")
+            )
+
+def main():
+    """
+    Main function to run the MinerU parser from command line
+    """
+    parser = argparse.ArgumentParser(description='Parse documents using MinerU')
+    parser.add_argument('file_path', help='Path to the document to parse')
+    parser.add_argument('--output', '-o', help='Output directory path')
+    parser.add_argument('--method', '-m',
+                        choices=['auto', 'ocr', 'txt'],
+                        default='auto',
+                        help='Parsing method (auto, ocr, txt)')
+    parser.add_argument('--stats', action='store_true',
+                        help='Display content statistics')
+
+    args = parser.parse_args()
+
+    try:
+        # Parse the document
+        content_list, md_content = MineruParser.parse_document(
+            file_path=args.file_path,
+            parse_method=args.method,
+            output_dir=args.output
+        )
+
+        # Display statistics if requested
+        if args.stats:
+            print("\nDocument Statistics:")
+            print(f"Total content blocks: {len(content_list)}")
+
+            # Count different types of content
+            content_types = {}
+            for item in content_list:
+                content_type = item.get('type', 'unknown')
+                content_types[content_type] = content_types.get(content_type, 0) + 1
+
+            print("\nContent Type Distribution:")
+            for content_type, count in content_types.items():
+                print(f"- {content_type}: {count}")
+
+    except Exception as e:
+        print(f"Error: {str(e)}")
+        return 1
+
+    return 0
+
+if __name__ == '__main__':
+    exit(main())
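Review note on `lightrag/mineru_parser.py` above: two small pieces of its logic are easy to check in isolation, namely the filename truncation in `safe_write` and the extension routing in `parse_document`. The sketch below restates both as pure functions so they can be exercised without MinerU installed. The names `truncate_filename` and `select_parser` are hypothetical and not part of the commit; the thresholds and extension lists are copied from the file.

```python
import os
from pathlib import Path
from typing import Union

# Extension groups as listed in parse_document.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"}
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx"}


def truncate_filename(filename: str) -> str:
    """Shorten names over 200 characters to 190 base characters plus extension."""
    if len(filename) > 200:
        base, ext = os.path.splitext(filename)
        return base[:190] + ext
    return filename


def select_parser(file_path: Union[str, Path]) -> str:
    """Return the name of the MineruParser method that would handle the file."""
    ext = Path(file_path).suffix.lower()
    if ext == ".pdf":
        return "parse_pdf"
    if ext in IMAGE_EXTS:
        return "parse_image"
    if ext in OFFICE_EXTS:
        return "parse_office_doc"
    # Unknown extensions fall back to the PDF parser, as in the original.
    return "parse_pdf"
```

Note that the fallback branch means a `.txt` or `.xlsx` file is still handed to the PDF parser; callers who prefer a hard failure would need to check the extension themselves before calling `parse_document`.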
lightrag/modalprocessors.py
ADDED
@@ -0,0 +1,708 @@
+"""
+Specialized processors for different modalities
+
+Includes:
+- ImageModalProcessor: Specialized processor for image content
+- TableModalProcessor: Specialized processor for table content
+- EquationModalProcessor: Specialized processor for equation content
+- GenericModalProcessor: Processor for other modal content
+"""
+
+import re
+import json
+import time
+import asyncio
+import base64
+from typing import Dict, Any, Tuple, cast
+from pathlib import Path
+
+from lightrag.base import StorageNameSpace
+from lightrag.utils import (
+    logger,
+    compute_mdhash_id,
+)
+from lightrag.lightrag import LightRAG
+from dataclasses import asdict
+from lightrag.kg.shared_storage import get_namespace_data, get_pipeline_status_lock
+
+
+class BaseModalProcessor:
+    """Base class for modal processors"""
+
+    def __init__(self, lightrag: LightRAG, modal_caption_func):
+        """Initialize base processor
+
+        Args:
+            lightrag: LightRAG instance
+            modal_caption_func: Function for generating descriptions
+        """
+        self.lightrag = lightrag
+        self.modal_caption_func = modal_caption_func
+
+        # Use LightRAG's storage instances
+        self.text_chunks_db = lightrag.text_chunks
+        self.chunks_vdb = lightrag.chunks_vdb
+        self.entities_vdb = lightrag.entities_vdb
+        self.relationships_vdb = lightrag.relationships_vdb
+        self.knowledge_graph_inst = lightrag.chunk_entity_relation_graph
+
+        # Use LightRAG's configuration and functions
+        self.embedding_func = lightrag.embedding_func
+        self.llm_model_func = lightrag.llm_model_func
+        self.global_config = asdict(lightrag)
+        self.hashing_kv = lightrag.llm_response_cache
+        self.tokenizer = lightrag.tokenizer
+
+    async def process_multimodal_content(
+        self,
+        modal_content,
+        content_type: str,
+        file_path: str = "manual_creation",
+        entity_name: str = None,
+    ) -> Tuple[str, Dict[str, Any]]:
+        """Process multimodal content"""
+        # Subclasses need to implement specific processing logic
+        raise NotImplementedError("Subclasses must implement this method")
+
+    async def _create_entity_and_chunk(
+            self, modal_chunk: str, entity_info: Dict[str, Any],
+            file_path: str) -> Tuple[str, Dict[str, Any]]:
+        """Create entity and text chunk"""
+        # Create chunk
+        chunk_id = compute_mdhash_id(str(modal_chunk), prefix="chunk-")
+        tokens = len(self.tokenizer.encode(modal_chunk))
+
+        chunk_data = {
+            "tokens": tokens,
+            "content": modal_chunk,
+            "chunk_order_index": 0,
+            "full_doc_id": chunk_id,
+            "file_path": file_path,
+        }
+
+        # Store chunk
+        await self.text_chunks_db.upsert({chunk_id: chunk_data})
+
+        # Create entity node
+        node_data = {
+            "entity_id": entity_info["entity_name"],
+            "entity_type": entity_info["entity_type"],
+            "description": entity_info["summary"],
+            "source_id": chunk_id,
+            "file_path": file_path,
+            "created_at": int(time.time()),
+        }
+
+        await self.knowledge_graph_inst.upsert_node(entity_info["entity_name"],
+                                                    node_data)
+
+        # Insert entity into vector database
+        entity_vdb_data = {
+            compute_mdhash_id(entity_info["entity_name"], prefix="ent-"): {
+                "entity_name": entity_info["entity_name"],
+                "entity_type": entity_info["entity_type"],
+                "content":
+                f"{entity_info['entity_name']}\n{entity_info['summary']}",
+                "source_id": chunk_id,
+                "file_path": file_path,
+            }
+        }
+        await self.entities_vdb.upsert(entity_vdb_data)
+
+        # Process entity and relationship extraction
+        await self._process_chunk_for_extraction(chunk_id,
+                                                 entity_info["entity_name"])
+
+        # Ensure all storage updates are complete
+        await self._insert_done()
+
+        return entity_info["summary"], {
+            "entity_name": entity_info["entity_name"],
+            "entity_type": entity_info["entity_type"],
+            "description": entity_info["summary"],
+            "chunk_id": chunk_id
+        }
+
+    async def _process_chunk_for_extraction(self, chunk_id: str,
+                                            modal_entity_name: str):
+        """Process chunk for entity and relationship extraction"""
+        chunk_data = await self.text_chunks_db.get_by_id(chunk_id)
+        if not chunk_data:
+            logger.error(f"Chunk {chunk_id} not found")
+            return
+
+        # Create text chunk for vector database
+        chunk_vdb_data = {
+            chunk_id: {
+                "content": chunk_data["content"],
+                "full_doc_id": chunk_id,
+                "tokens": chunk_data["tokens"],
+                "chunk_order_index": chunk_data["chunk_order_index"],
+                "file_path": chunk_data["file_path"],
+            }
+        }
+
+        await self.chunks_vdb.upsert(chunk_vdb_data)
+
+        # Trigger extraction process
+        from lightrag.operate import extract_entities, merge_nodes_and_edges
+
+        pipeline_status = await get_namespace_data("pipeline_status")
+        pipeline_status_lock = get_pipeline_status_lock()
+
+        # Prepare chunk for extraction
+        chunks = {chunk_id: chunk_data}
+
+        # Extract entities and relationships
+        chunk_results = await extract_entities(
+            chunks=chunks,
+            global_config=self.global_config,
+            pipeline_status=pipeline_status,
+            pipeline_status_lock=pipeline_status_lock,
+            llm_response_cache=self.hashing_kv,
+        )
+
+        # Add "belongs_to" relationships for all extracted entities
+        for maybe_nodes, _ in chunk_results:
+            for entity_name in maybe_nodes.keys():
+                if entity_name != modal_entity_name:  # Skip self-relationship
+                    # Create belongs_to relationship
+                    relation_data = {
+                        "description":
+                        f"Entity {entity_name} belongs to {modal_entity_name}",
+                        "keywords":
+                        "belongs_to,part_of,contained_in",
+                        "source_id":
+                        chunk_id,
+                        "weight":
+                        10.0,
+                        "file_path":
+                        chunk_data.get("file_path", "manual_creation"),
+                    }
+                    await self.knowledge_graph_inst.upsert_edge(
+                        entity_name, modal_entity_name, relation_data)
+
+                    relation_id = compute_mdhash_id(entity_name +
+                                                    modal_entity_name,
+                                                    prefix="rel-")
+                    relation_vdb_data = {
+                        relation_id: {
+                            "src_id":
+                            entity_name,
+                            "tgt_id":
+                            modal_entity_name,
+                            "keywords":
+                            relation_data["keywords"],
+                            "content":
+                            f"{relation_data['keywords']}\t{entity_name}\n{modal_entity_name}\n{relation_data['description']}",
+                            "source_id":
+                            chunk_id,
+                            "file_path":
+                            chunk_data.get("file_path", "manual_creation"),
+                        }
+                    }
+                    await self.relationships_vdb.upsert(relation_vdb_data)
+
+        await merge_nodes_and_edges(
+            chunk_results=chunk_results,
+            knowledge_graph_inst=self.knowledge_graph_inst,
+            entity_vdb=self.entities_vdb,
210 |
+
relationships_vdb=self.relationships_vdb,
|
211 |
+
global_config=self.global_config,
|
212 |
+
pipeline_status=pipeline_status,
|
213 |
+
pipeline_status_lock=pipeline_status_lock,
|
214 |
+
llm_response_cache=self.hashing_kv,
|
215 |
+
)
|
216 |
+
|
217 |
+
async def _insert_done(self) -> None:
|
218 |
+
await asyncio.gather(*[
|
219 |
+
cast(StorageNameSpace, storage_inst).index_done_callback()
|
220 |
+
for storage_inst in [
|
221 |
+
self.text_chunks_db,
|
222 |
+
self.chunks_vdb,
|
223 |
+
self.entities_vdb,
|
224 |
+
self.relationships_vdb,
|
225 |
+
self.knowledge_graph_inst,
|
226 |
+
]
|
227 |
+
])
|
228 |
+
|
229 |
+
|
230 |
+
class ImageModalProcessor(BaseModalProcessor):
|
231 |
+
"""Processor specialized for image content"""
|
232 |
+
|
233 |
+
def __init__(self, lightrag: LightRAG, modal_caption_func):
|
234 |
+
"""Initialize image processor
|
235 |
+
|
236 |
+
Args:
|
237 |
+
lightrag: LightRAG instance
|
238 |
+
modal_caption_func: Function for generating descriptions (supporting image understanding)
|
239 |
+
"""
|
240 |
+
super().__init__(lightrag, modal_caption_func)
|
241 |
+
|
242 |
+
def _encode_image_to_base64(self, image_path: str) -> str:
|
243 |
+
"""Encode image to base64"""
|
244 |
+
try:
|
245 |
+
with open(image_path, "rb") as image_file:
|
246 |
+
encoded_string = base64.b64encode(
|
247 |
+
image_file.read()).decode('utf-8')
|
248 |
+
return encoded_string
|
249 |
+
except Exception as e:
|
250 |
+
logger.error(f"Failed to encode image {image_path}: {e}")
|
251 |
+
return ""
|
252 |
+
|
253 |
+
async def process_multimodal_content(
|
254 |
+
self,
|
255 |
+
modal_content,
|
256 |
+
content_type: str,
|
257 |
+
file_path: str = "manual_creation",
|
258 |
+
entity_name: str = None,
|
259 |
+
) -> Tuple[str, Dict[str, Any]]:
|
260 |
+
"""Process image content"""
|
261 |
+
try:
|
262 |
+
# Parse image content
|
263 |
+
if isinstance(modal_content, str):
|
264 |
+
try:
|
265 |
+
content_data = json.loads(modal_content)
|
266 |
+
except json.JSONDecodeError:
|
267 |
+
content_data = {"description": modal_content}
|
268 |
+
else:
|
269 |
+
content_data = modal_content
|
270 |
+
|
271 |
+
image_path = content_data.get("img_path")
|
272 |
+
captions = content_data.get("img_caption", [])
|
273 |
+
footnotes = content_data.get("img_footnote", [])
|
274 |
+
|
275 |
+
# Build detailed visual analysis prompt
|
276 |
+
vision_prompt = f"""Please analyze this image in detail and provide a JSON response with the following structure:
|
277 |
+
|
278 |
+
{{
|
279 |
+
"detailed_description": "A comprehensive and detailed visual description of the image following these guidelines:
|
280 |
+
- Describe the overall composition and layout
|
281 |
+
- Identify all objects, people, text, and visual elements
|
282 |
+
- Explain relationships between elements
|
283 |
+
- Note colors, lighting, and visual style
|
284 |
+
- Describe any actions or activities shown
|
285 |
+
- Include technical details if relevant (charts, diagrams, etc.)
|
286 |
+
- Always use specific names instead of pronouns",
|
287 |
+
"entity_info": {{
|
288 |
+
"entity_name": "{entity_name if entity_name else 'unique descriptive name for this image'}",
|
289 |
+
"entity_type": "image",
|
290 |
+
"summary": "concise summary of the image content and its significance (max 100 words)"
|
291 |
+
}}
|
292 |
+
}}
|
293 |
+
|
294 |
+
Additional context:
|
295 |
+
- Image Path: {image_path}
|
296 |
+
- Captions: {captions if captions else 'None'}
|
297 |
+
- Footnotes: {footnotes if footnotes else 'None'}
|
298 |
+
|
299 |
+
Focus on providing accurate, detailed visual analysis that would be useful for knowledge retrieval."""
|
300 |
+
|
301 |
+
# If image path exists, try to encode image
|
302 |
+
image_base64 = ""
|
303 |
+
if image_path and Path(image_path).exists():
|
304 |
+
image_base64 = self._encode_image_to_base64(image_path)
|
305 |
+
|
306 |
+
# Call vision model
|
307 |
+
if image_base64:
|
308 |
+
# Use real image for analysis
|
309 |
+
response = await self.modal_caption_func(
|
310 |
+
vision_prompt,
|
311 |
+
image_data=image_base64,
|
312 |
+
system_prompt=
|
313 |
+
"You are an expert image analyst. Provide detailed, accurate descriptions."
|
314 |
+
)
|
315 |
+
else:
|
316 |
+
# Analyze based on existing text information
|
317 |
+
text_prompt = f"""Based on the following image information, provide analysis:
|
318 |
+
|
319 |
+
Image Path: {image_path}
|
320 |
+
Captions: {captions}
|
321 |
+
Footnotes: {footnotes}
|
322 |
+
|
323 |
+
{vision_prompt}"""
|
324 |
+
|
325 |
+
response = await self.modal_caption_func(
|
326 |
+
text_prompt,
|
327 |
+
system_prompt=
|
328 |
+
"You are an expert image analyst. Provide detailed analysis based on available information."
|
329 |
+
)
|
330 |
+
|
331 |
+
# Parse response
|
332 |
+
enhanced_caption, entity_info = self._parse_response(
|
333 |
+
response, entity_name)
|
334 |
+
|
335 |
+
# Build complete image content
|
336 |
+
modal_chunk = f"""
|
337 |
+
Image Content Analysis:
|
338 |
+
Image Path: {image_path}
|
339 |
+
Captions: {', '.join(captions) if captions else 'None'}
|
340 |
+
Footnotes: {', '.join(footnotes) if footnotes else 'None'}
|
341 |
+
|
342 |
+
Visual Analysis: {enhanced_caption}"""
|
343 |
+
|
344 |
+
return await self._create_entity_and_chunk(modal_chunk,
|
345 |
+
entity_info, file_path)
|
346 |
+
|
347 |
+
except Exception as e:
|
348 |
+
logger.error(f"Error processing image content: {e}")
|
349 |
+
# Fallback processing
|
350 |
+
fallback_entity = {
|
351 |
+
"entity_name": entity_name if entity_name else
|
352 |
+
f"image_{compute_mdhash_id(str(modal_content))}",
|
353 |
+
"entity_type": "image",
|
354 |
+
"summary": f"Image content: {str(modal_content)[:100]}"
|
355 |
+
}
|
356 |
+
return str(modal_content), fallback_entity
|
357 |
+
|
358 |
+
def _parse_response(self,
|
359 |
+
response: str,
|
360 |
+
entity_name: str = None) -> Tuple[str, Dict[str, Any]]:
|
361 |
+
"""Parse model response"""
|
362 |
+
try:
|
363 |
+
response_data = json.loads(
|
364 |
+
re.search(r"\{.*\}", response, re.DOTALL).group(0))
|
365 |
+
|
366 |
+
description = response_data.get("detailed_description", "")
|
367 |
+
entity_data = response_data.get("entity_info", {})
|
368 |
+
|
369 |
+
if not description or not entity_data:
|
370 |
+
raise ValueError("Missing required fields in response")
|
371 |
+
|
372 |
+
if not all(key in entity_data
|
373 |
+
for key in ["entity_name", "entity_type", "summary"]):
|
374 |
+
raise ValueError("Missing required fields in entity_info")
|
375 |
+
|
376 |
+
entity_data["entity_name"] = entity_data["entity_name"] + f" ({entity_data['entity_type']})"
|
377 |
+
if entity_name:
|
378 |
+
entity_data["entity_name"] = entity_name
|
379 |
+
|
380 |
+
return description, entity_data
|
381 |
+
|
382 |
+
except (json.JSONDecodeError, AttributeError, ValueError) as e:
|
383 |
+
logger.error(f"Error parsing image analysis response: {e}")
|
384 |
+
fallback_entity = {
|
385 |
+
"entity_name":
|
386 |
+
entity_name
|
387 |
+
if entity_name else f"image_{compute_mdhash_id(response)}",
|
388 |
+
"entity_type":
|
389 |
+
"image",
|
390 |
+
"summary":
|
391 |
+
response[:100] + "..." if len(response) > 100 else response
|
392 |
+
}
|
393 |
+
return response, fallback_entity
|
394 |
+
|
395 |
+
|
396 |
+
class TableModalProcessor(BaseModalProcessor):
|
397 |
+
"""Processor specialized for table content"""
|
398 |
+
|
399 |
+
async def process_multimodal_content(
|
400 |
+
self,
|
401 |
+
modal_content,
|
402 |
+
content_type: str,
|
403 |
+
file_path: str = "manual_creation",
|
404 |
+
entity_name: str = None,
|
405 |
+
) -> Tuple[str, Dict[str, Any]]:
|
406 |
+
"""Process table content"""
|
407 |
+
# Parse table content
|
408 |
+
if isinstance(modal_content, str):
|
409 |
+
try:
|
410 |
+
content_data = json.loads(modal_content)
|
411 |
+
except json.JSONDecodeError:
|
412 |
+
content_data = {"table_body": modal_content}
|
413 |
+
else:
|
414 |
+
content_data = modal_content
|
415 |
+
|
416 |
+
table_img_path = content_data.get("img_path")
|
417 |
+
table_caption = content_data.get("table_caption", [])
|
418 |
+
table_body = content_data.get("table_body", "")
|
419 |
+
table_footnote = content_data.get("table_footnote", [])
|
420 |
+
|
421 |
+
# Build table analysis prompt
|
422 |
+
table_prompt = f"""Please analyze this table content and provide a JSON response with the following structure:
|
423 |
+
|
424 |
+
{{
|
425 |
+
"detailed_description": "A comprehensive analysis of the table including:
|
426 |
+
- Table structure and organization
|
427 |
+
- Column headers and their meanings
|
428 |
+
- Key data points and patterns
|
429 |
+
- Statistical insights and trends
|
430 |
+
- Relationships between data elements
|
431 |
+
- Significance of the data presented
|
432 |
+
Always use specific names and values instead of general references.",
|
433 |
+
"entity_info": {{
|
434 |
+
"entity_name": "{entity_name if entity_name else 'descriptive name for this table'}",
|
435 |
+
"entity_type": "table",
|
436 |
+
"summary": "concise summary of the table's purpose and key findings (max 100 words)"
|
437 |
+
}}
|
438 |
+
}}
|
439 |
+
|
440 |
+
Table Information:
|
441 |
+
Image Path: {table_img_path}
|
442 |
+
Caption: {table_caption if table_caption else 'None'}
|
443 |
+
Body: {table_body}
|
444 |
+
Footnotes: {table_footnote if table_footnote else 'None'}
|
445 |
+
|
446 |
+
Focus on extracting meaningful insights and relationships from the tabular data."""
|
447 |
+
|
448 |
+
response = await self.modal_caption_func(
|
449 |
+
table_prompt,
|
450 |
+
system_prompt=
|
451 |
+
"You are an expert data analyst. Provide detailed table analysis with specific insights."
|
452 |
+
)
|
453 |
+
|
454 |
+
# Parse response
|
455 |
+
enhanced_caption, entity_info = self._parse_table_response(
|
456 |
+
response, entity_name)
|
457 |
+
|
458 |
+
        # TODO: Add retry mechanism
|
459 |
+
|
460 |
+
# Build complete table content
|
461 |
+
modal_chunk = f"""Table Analysis:
|
462 |
+
Image Path: {table_img_path}
|
463 |
+
Caption: {', '.join(table_caption) if table_caption else 'None'}
|
464 |
+
Structure: {table_body}
|
465 |
+
Footnotes: {', '.join(table_footnote) if table_footnote else 'None'}
|
466 |
+
|
467 |
+
Analysis: {enhanced_caption}"""
|
468 |
+
|
469 |
+
return await self._create_entity_and_chunk(modal_chunk, entity_info,
|
470 |
+
file_path)
|
471 |
+
|
472 |
+
def _parse_table_response(
|
473 |
+
self,
|
474 |
+
response: str,
|
475 |
+
entity_name: str = None) -> Tuple[str, Dict[str, Any]]:
|
476 |
+
"""Parse table analysis response"""
|
477 |
+
try:
|
478 |
+
response_data = json.loads(
|
479 |
+
re.search(r"\{.*\}", response, re.DOTALL).group(0))
|
480 |
+
|
481 |
+
description = response_data.get("detailed_description", "")
|
482 |
+
entity_data = response_data.get("entity_info", {})
|
483 |
+
|
484 |
+
if not description or not entity_data:
|
485 |
+
raise ValueError("Missing required fields in response")
|
486 |
+
|
487 |
+
if not all(key in entity_data
|
488 |
+
for key in ["entity_name", "entity_type", "summary"]):
|
489 |
+
raise ValueError("Missing required fields in entity_info")
|
490 |
+
|
491 |
+
entity_data["entity_name"] = entity_data["entity_name"] + f" ({entity_data['entity_type']})"
|
492 |
+
if entity_name:
|
493 |
+
entity_data["entity_name"] = entity_name
|
494 |
+
|
495 |
+
return description, entity_data
|
496 |
+
|
497 |
+
except (json.JSONDecodeError, AttributeError, ValueError) as e:
|
498 |
+
logger.error(f"Error parsing table analysis response: {e}")
|
499 |
+
fallback_entity = {
|
500 |
+
"entity_name":
|
501 |
+
entity_name
|
502 |
+
if entity_name else f"table_{compute_mdhash_id(response)}",
|
503 |
+
"entity_type":
|
504 |
+
"table",
|
505 |
+
"summary":
|
506 |
+
response[:100] + "..." if len(response) > 100 else response
|
507 |
+
}
|
508 |
+
return response, fallback_entity
|
509 |
+
|
510 |
+
|
511 |
+
class EquationModalProcessor(BaseModalProcessor):
|
512 |
+
"""Processor specialized for equation content"""
|
513 |
+
|
514 |
+
async def process_multimodal_content(
|
515 |
+
self,
|
516 |
+
modal_content,
|
517 |
+
content_type: str,
|
518 |
+
file_path: str = "manual_creation",
|
519 |
+
entity_name: str = None,
|
520 |
+
) -> Tuple[str, Dict[str, Any]]:
|
521 |
+
"""Process equation content"""
|
522 |
+
# Parse equation content
|
523 |
+
if isinstance(modal_content, str):
|
524 |
+
try:
|
525 |
+
content_data = json.loads(modal_content)
|
526 |
+
except json.JSONDecodeError:
|
527 |
+
content_data = {"equation": modal_content}
|
528 |
+
else:
|
529 |
+
content_data = modal_content
|
530 |
+
|
531 |
+
equation_text = content_data.get("text")
|
532 |
+
equation_format = content_data.get("text_format", "")
|
533 |
+
|
534 |
+
# Build equation analysis prompt
|
535 |
+
equation_prompt = f"""Please analyze this mathematical equation and provide a JSON response with the following structure:
|
536 |
+
|
537 |
+
{{
|
538 |
+
"detailed_description": "A comprehensive analysis of the equation including:
|
539 |
+
- Mathematical meaning and interpretation
|
540 |
+
- Variables and their definitions
|
541 |
+
- Mathematical operations and functions used
|
542 |
+
- Application domain and context
|
543 |
+
- Physical or theoretical significance
|
544 |
+
- Relationship to other mathematical concepts
|
545 |
+
- Practical applications or use cases
|
546 |
+
Always use specific mathematical terminology.",
|
547 |
+
"entity_info": {{
|
548 |
+
"entity_name": "{entity_name if entity_name else 'descriptive name for this equation'}",
|
549 |
+
"entity_type": "equation",
|
550 |
+
"summary": "concise summary of the equation's purpose and significance (max 100 words)"
|
551 |
+
}}
|
552 |
+
}}
|
553 |
+
|
554 |
+
Equation Information:
|
555 |
+
Equation: {equation_text}
|
556 |
+
Format: {equation_format}
|
557 |
+
|
558 |
+
Focus on providing mathematical insights and explaining the equation's significance."""
|
559 |
+
|
560 |
+
response = await self.modal_caption_func(
|
561 |
+
equation_prompt,
|
562 |
+
system_prompt=
|
563 |
+
"You are an expert mathematician. Provide detailed mathematical analysis."
|
564 |
+
)
|
565 |
+
|
566 |
+
# Parse response
|
567 |
+
enhanced_caption, entity_info = self._parse_equation_response(
|
568 |
+
response, entity_name)
|
569 |
+
|
570 |
+
# Build complete equation content
|
571 |
+
modal_chunk = f"""Mathematical Equation Analysis:
|
572 |
+
Equation: {equation_text}
|
573 |
+
Format: {equation_format}
|
574 |
+
|
575 |
+
Mathematical Analysis: {enhanced_caption}"""
|
576 |
+
|
577 |
+
return await self._create_entity_and_chunk(modal_chunk, entity_info,
|
578 |
+
file_path)
|
579 |
+
|
580 |
+
def _parse_equation_response(
|
581 |
+
self,
|
582 |
+
response: str,
|
583 |
+
entity_name: str = None) -> Tuple[str, Dict[str, Any]]:
|
584 |
+
"""Parse equation analysis response"""
|
585 |
+
try:
|
586 |
+
response_data = json.loads(
|
587 |
+
re.search(r"\{.*\}", response, re.DOTALL).group(0))
|
588 |
+
|
589 |
+
description = response_data.get("detailed_description", "")
|
590 |
+
entity_data = response_data.get("entity_info", {})
|
591 |
+
|
592 |
+
if not description or not entity_data:
|
593 |
+
raise ValueError("Missing required fields in response")
|
594 |
+
|
595 |
+
if not all(key in entity_data
|
596 |
+
for key in ["entity_name", "entity_type", "summary"]):
|
597 |
+
raise ValueError("Missing required fields in entity_info")
|
598 |
+
|
599 |
+
entity_data["entity_name"] = entity_data["entity_name"] + f" ({entity_data['entity_type']})"
|
600 |
+
if entity_name:
|
601 |
+
entity_data["entity_name"] = entity_name
|
602 |
+
|
603 |
+
return description, entity_data
|
604 |
+
|
605 |
+
except (json.JSONDecodeError, AttributeError, ValueError) as e:
|
606 |
+
logger.error(f"Error parsing equation analysis response: {e}")
|
607 |
+
fallback_entity = {
|
608 |
+
"entity_name":
|
609 |
+
entity_name
|
610 |
+
if entity_name else f"equation_{compute_mdhash_id(response)}",
|
611 |
+
"entity_type":
|
612 |
+
"equation",
|
613 |
+
"summary":
|
614 |
+
response[:100] + "..." if len(response) > 100 else response
|
615 |
+
}
|
616 |
+
return response, fallback_entity
|
617 |
+
|
618 |
+
|
619 |
+
class GenericModalProcessor(BaseModalProcessor):
|
620 |
+
"""Generic processor for other types of modal content"""
|
621 |
+
|
622 |
+
async def process_multimodal_content(
|
623 |
+
self,
|
624 |
+
modal_content,
|
625 |
+
content_type: str,
|
626 |
+
file_path: str = "manual_creation",
|
627 |
+
entity_name: str = None,
|
628 |
+
) -> Tuple[str, Dict[str, Any]]:
|
629 |
+
"""Process generic modal content"""
|
630 |
+
# Build generic analysis prompt
|
631 |
+
generic_prompt = f"""Please analyze this {content_type} content and provide a JSON response with the following structure:
|
632 |
+
|
633 |
+
{{
|
634 |
+
"detailed_description": "A comprehensive analysis of the content including:
|
635 |
+
- Content structure and organization
|
636 |
+
- Key information and elements
|
637 |
+
- Relationships between components
|
638 |
+
- Context and significance
|
639 |
+
- Relevant details for knowledge retrieval
|
640 |
+
Always use specific terminology appropriate for {content_type} content.",
|
641 |
+
"entity_info": {{
|
642 |
+
"entity_name": "{entity_name if entity_name else f'descriptive name for this {content_type}'}",
|
643 |
+
"entity_type": "{content_type}",
|
644 |
+
"summary": "concise summary of the content's purpose and key points (max 100 words)"
|
645 |
+
}}
|
646 |
+
}}
|
647 |
+
|
648 |
+
Content: {str(modal_content)}
|
649 |
+
|
650 |
+
Focus on extracting meaningful information that would be useful for knowledge retrieval."""
|
651 |
+
|
652 |
+
response = await self.modal_caption_func(
|
653 |
+
generic_prompt,
|
654 |
+
system_prompt=
|
655 |
+
f"You are an expert content analyst specializing in {content_type} content."
|
656 |
+
)
|
657 |
+
|
658 |
+
# Parse response
|
659 |
+
enhanced_caption, entity_info = self._parse_generic_response(
|
660 |
+
response, entity_name, content_type)
|
661 |
+
|
662 |
+
# Build complete content
|
663 |
+
modal_chunk = f"""{content_type.title()} Content Analysis:
|
664 |
+
Content: {str(modal_content)}
|
665 |
+
|
666 |
+
Analysis: {enhanced_caption}"""
|
667 |
+
|
668 |
+
return await self._create_entity_and_chunk(modal_chunk, entity_info,
|
669 |
+
file_path)
|
670 |
+
|
671 |
+
def _parse_generic_response(
|
672 |
+
self,
|
673 |
+
response: str,
|
674 |
+
entity_name: str = None,
|
675 |
+
content_type: str = "content") -> Tuple[str, Dict[str, Any]]:
|
676 |
+
"""Parse generic analysis response"""
|
677 |
+
try:
|
678 |
+
response_data = json.loads(
|
679 |
+
re.search(r"\{.*\}", response, re.DOTALL).group(0))
|
680 |
+
|
681 |
+
description = response_data.get("detailed_description", "")
|
682 |
+
entity_data = response_data.get("entity_info", {})
|
683 |
+
|
684 |
+
if not description or not entity_data:
|
685 |
+
raise ValueError("Missing required fields in response")
|
686 |
+
|
687 |
+
if not all(key in entity_data
|
688 |
+
for key in ["entity_name", "entity_type", "summary"]):
|
689 |
+
raise ValueError("Missing required fields in entity_info")
|
690 |
+
|
691 |
+
entity_data["entity_name"] = entity_data["entity_name"] + f" ({entity_data['entity_type']})"
|
692 |
+
if entity_name:
|
693 |
+
entity_data["entity_name"] = entity_name
|
694 |
+
|
695 |
+
return description, entity_data
|
696 |
+
|
697 |
+
except (json.JSONDecodeError, AttributeError, ValueError) as e:
|
698 |
+
logger.error(f"Error parsing generic analysis response: {e}")
|
699 |
+
fallback_entity = {
|
700 |
+
"entity_name":
|
701 |
+
entity_name if entity_name else
|
702 |
+
f"{content_type}_{compute_mdhash_id(response)}",
|
703 |
+
"entity_type":
|
704 |
+
content_type,
|
705 |
+
"summary":
|
706 |
+
response[:100] + "..." if len(response) > 100 else response
|
707 |
+
}
|
708 |
+
return response, fallback_entity
|
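The four `_parse_*_response` methods in `lightrag/modalprocessors.py` appear to share one pattern: pull the first-to-last `{...}` span out of the model's reply, validate the expected keys, and fall back to a hash-named entity when anything fails. The following is a minimal standalone sketch of that pattern, not the code from this commit. The helper name `parse_modal_response` and the `REQUIRED_KEYS` constant are hypothetical, and an MD5 digest from `hashlib` stands in for `compute_mdhash_id`, whose exact output format is assumed rather than reproduced here.

```python
import hashlib
import json
import re
from typing import Any, Dict, Optional, Tuple

# Keys the processors require inside "entity_info".
REQUIRED_KEYS = ("entity_name", "entity_type", "summary")


def parse_modal_response(
    response: str,
    content_type: str,
    entity_name: Optional[str] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper mirroring the shared _parse_*_response logic."""
    try:
        # Greedy match from the first "{" to the last "}" in the reply.
        match = re.search(r"\{.*\}", response, re.DOTALL)
        if match is None:
            raise ValueError("No JSON object found in response")
        data = json.loads(match.group(0))

        description = data.get("detailed_description", "")
        entity = data.get("entity_info", {})
        if not description or not isinstance(entity, dict) or not entity:
            raise ValueError("Missing required fields in response")
        if not all(key in entity for key in REQUIRED_KEYS):
            raise ValueError("Missing required fields in entity_info")

        # A caller-supplied name wins; otherwise tag the model's name with its type.
        entity["entity_name"] = (
            entity_name or f"{entity['entity_name']} ({entity['entity_type']})"
        )
        return description, entity
    except ValueError:
        # json.JSONDecodeError is a ValueError subclass, so it is handled here too.
        # Assumption: MD5 of the raw reply stands in for compute_mdhash_id.
        digest = hashlib.md5(response.encode("utf-8")).hexdigest()
        summary = response[:100] + "..." if len(response) > 100 else response
        return response, {
            "entity_name": entity_name or f"{content_type}_{digest}",
            "entity_type": content_type,
            "summary": summary,
        }
```

A single helper of this shape, parameterized by `content_type`, could replace the four near-identical methods; whether that refactor suits the project is a design choice left to the maintainers.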
lightrag/raganything.py
ADDED
@@ -0,0 +1,632 @@
|
1 |
+
"""
|
2 |
+
Complete MinerU parsing and multimodal content insertion pipeline
|
3 |
+
|
4 |
+
This script integrates:
|
5 |
+
1. MinerU document parsing
|
6 |
+
2. Insertion of plain text content into LightRAG
|
7 |
+
3. Specialized processing for multimodal content (using different processors)
|
8 |
+
"""
|
9 |
+
|
10 |
+
import os
|
11 |
+
import asyncio
|
12 |
+
import logging
|
13 |
+
from pathlib import Path
|
14 |
+
from typing import Dict, List, Any, Tuple, Optional, Callable
|
15 |
+
import sys
|
16 |
+
|
17 |
+
# Add project root directory to Python path
|
18 |
+
sys.path.insert(0, str(Path(__file__).parent.parent))
|
19 |
+
|
20 |
+
from lightrag import LightRAG, QueryParam
|
21 |
+
from lightrag.utils import EmbeddingFunc, setup_logger
|
22 |
+
|
23 |
+
# Import parser and multimodal processors
|
24 |
+
from lightrag.mineru_parser import MineruParser
|
25 |
+
|
26 |
+
# Import specialized processors
|
27 |
+
from lightrag.modalprocessors import (
|
28 |
+
ImageModalProcessor,
|
29 |
+
TableModalProcessor,
|
30 |
+
EquationModalProcessor,
|
31 |
+
GenericModalProcessor
|
32 |
+
)
|
33 |
+
|
34 |
+
|
35 |
+
class RAGAnything:
|
36 |
+
"""Multimodal Document Processing Pipeline - Complete document parsing and insertion pipeline"""
|
37 |
+
|
38 |
+
def __init__(
|
39 |
+
self,
|
40 |
+
lightrag: Optional[LightRAG] = None,
|
41 |
+
llm_model_func: Optional[Callable] = None,
|
42 |
+
vision_model_func: Optional[Callable] = None,
|
43 |
+
embedding_func: Optional[Callable] = None,
|
44 |
+
working_dir: str = "./rag_storage",
|
45 |
+
embedding_dim: int = 3072,
|
46 |
+
max_token_size: int = 8192
|
47 |
+
):
|
48 |
+
"""
|
49 |
+
Initialize Multimodal Document Processing Pipeline
|
50 |
+
|
51 |
+
Args:
|
52 |
+
lightrag: Optional pre-initialized LightRAG instance
|
53 |
+
llm_model_func: LLM model function for text analysis
|
54 |
+
vision_model_func: Vision model function for image analysis
|
55 |
+
embedding_func: Embedding function for text vectorization
|
56 |
+
working_dir: Working directory for storage (used when creating new RAG)
|
57 |
+
embedding_dim: Embedding dimension (used when creating new RAG)
|
58 |
+
max_token_size: Maximum token size for embeddings (used when creating new RAG)
|
59 |
+
"""
|
60 |
+
self.working_dir = working_dir
|
61 |
+
self.llm_model_func = llm_model_func
|
62 |
+
self.vision_model_func = vision_model_func
|
63 |
+
self.embedding_func = embedding_func
|
64 |
+
self.embedding_dim = embedding_dim
|
65 |
+
self.max_token_size = max_token_size
|
66 |
+
|
67 |
+
# Set up logging
|
68 |
+
setup_logger("RAGAnything")
|
69 |
+
self.logger = logging.getLogger("RAGAnything")
|
70 |
+
|
71 |
+
# Create working directory if needed
|
72 |
+
        # exist_ok=True avoids a race between an existence check and creation
|
73 |
+
        os.makedirs(working_dir, exist_ok=True)
|
74 |
+
|
75 |
+
# Use provided LightRAG or mark for later initialization
|
76 |
+
self.lightrag = lightrag
|
77 |
+
self.modal_processors = {}
|
78 |
+
|
79 |
+
# If LightRAG is provided, initialize processors immediately
|
80 |
+
if self.lightrag is not None:
|
81 |
+
self._initialize_processors()
|
82 |
+
|
83 |
+
def _initialize_processors(self):
|
84 |
+
"""Initialize multimodal processors with appropriate model functions"""
|
85 |
+
if self.lightrag is None:
|
86 |
+
raise ValueError("LightRAG instance must be initialized before creating processors")
|
87 |
+
|
88 |
+
# Create different multimodal processors
|
89 |
+
self.modal_processors = {
|
90 |
+
"image": ImageModalProcessor(
|
91 |
+
lightrag=self.lightrag,
|
92 |
+
modal_caption_func=self.vision_model_func or self.llm_model_func
|
93 |
+
),
|
94 |
+
"table": TableModalProcessor(
|
95 |
+
lightrag=self.lightrag,
|
96 |
+
modal_caption_func=self.llm_model_func
|
97 |
+
),
|
98 |
+
"equation": EquationModalProcessor(
|
99 |
+
lightrag=self.lightrag,
|
100 |
+
modal_caption_func=self.llm_model_func
|
101 |
+
),
|
102 |
+
"generic": GenericModalProcessor(
|
103 |
+
lightrag=self.lightrag,
|
104 |
+
modal_caption_func=self.llm_model_func
|
105 |
+
)
|
106 |
+
}
|
107 |
+
|
108 |
+
self.logger.info("Multimodal processors initialized")
|
109 |
+
self.logger.info(f"Available processors: {list(self.modal_processors.keys())}")
|
110 |
+
|
111 |
+
async def _ensure_lightrag_initialized(self):
|
112 |
+
"""Ensure LightRAG instance is initialized, create if necessary"""
|
113 |
+
if self.lightrag is not None:
|
114 |
+
return
|
115 |
+
|
116 |
+
# Validate required functions
|
117 |
+
if self.llm_model_func is None:
|
118 |
+
raise ValueError("llm_model_func must be provided when LightRAG is not pre-initialized")
|
119 |
+
if self.embedding_func is None:
|
120 |
+
raise ValueError("embedding_func must be provided when LightRAG is not pre-initialized")
|
121 |
+
|
122 |
+
from lightrag.kg.shared_storage import initialize_pipeline_status
|
123 |
+
|
124 |
+
# Create LightRAG instance with provided functions
|
125 |
+
self.lightrag = LightRAG(
|
126 |
+
working_dir=self.working_dir,
|
127 |
+
llm_model_func=self.llm_model_func,
|
128 |
+
embedding_func=EmbeddingFunc(
|
129 |
+
embedding_dim=self.embedding_dim,
|
130 |
+
max_token_size=self.max_token_size,
|
131 |
+
func=self.embedding_func,
|
132 |
+
),
|
133 |
+
)
|
134 |
+
|
135 |
+
await self.lightrag.initialize_storages()
|
136 |
+
await initialize_pipeline_status()
|
137 |
+
|
138 |
+
# Initialize processors after LightRAG is ready
|
139 |
+
self._initialize_processors()
|
140 |
+
|
141 |
+
self.logger.info("LightRAG and multimodal processors initialized")
|
142 |
+
|
143 |
+
def parse_document(
|
144 |
+
self,
|
145 |
+
file_path: str,
|
146 |
+
output_dir: str = "./output",
|
147 |
+
parse_method: str = "auto",
|
148 |
+
display_stats: bool = True
|
149 |
+
) -> Tuple[List[Dict[str, Any]], str]:
|
150 |
+
"""
|
151 |
+
Parse document using MinerU
|
152 |
+
|
153 |
+
Args:
|
154 |
+
file_path: Path to the file to parse
|
155 |
+
output_dir: Output directory
|
156 |
+
parse_method: Parse method ("auto", "ocr", "txt")
|
157 |
+
display_stats: Whether to display content statistics
|
158 |
+
|
159 |
+
Returns:
|
160 |
+
(content_list, md_content): Content list and markdown text
|
161 |
+
"""
|
162 |
+
self.logger.info(f"Starting document parsing: {file_path}")
|
163 |
+
|
164 |
+
file_path = Path(file_path)
|
165 |
+
if not file_path.exists():
|
166 |
+
raise FileNotFoundError(f"File not found: {file_path}")
|
167 |
+
|
168 |
+
# Choose appropriate parsing method based on file extension
|
169 |
+
ext = file_path.suffix.lower()
|
170 |
+
|
171 |
+
try:
|
172 |
+
if ext in [".pdf"]:
|
173 |
+
self.logger.info(f"Detected PDF file, using PDF parser (OCR={parse_method == 'ocr'})...")
|
174 |
+
content_list, md_content = MineruParser.parse_pdf(
|
175 |
+
file_path,
|
176 |
+
output_dir,
|
177 |
+
use_ocr=(parse_method == "ocr")
|
178 |
+
)
|
179 |
+
elif ext in [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]:
|
180 |
+
self.logger.info("Detected image file, using image parser...")
|
181 |
+
content_list, md_content = MineruParser.parse_image(
|
182 |
+
file_path,
|
183 |
+
output_dir
|
184 |
+
)
|
185 |
+
elif ext in [".doc", ".docx", ".ppt", ".pptx"]:
|
186 |
+
self.logger.info("Detected Office document, using Office parser...")
|
187 |
+
content_list, md_content = MineruParser.parse_office_doc(
|
188 |
+
file_path,
|
189 |
+
output_dir
|
190 |
+
)
|
191 |
+
else:
|
192 |
+
# For other or unknown formats, use generic parser
|
193 |
+
self.logger.info(f"Using generic parser for {ext} file (method={parse_method})...")
|
194 |
+
content_list, md_content = MineruParser.parse_document(
|
195 |
+
file_path,
|
196 |
+
parse_method=parse_method,
|
197 |
+
output_dir=output_dir
|
198 |
+
)
|
199 |
+
|
200 |
+
except Exception as e:
|
201 |
+
self.logger.error(f"Error during parsing with specific parser: {str(e)}")
|
202 |
+
self.logger.warning("Falling back to generic parser...")
|
203 |
+
# If specific parser fails, fall back to generic parser
|
204 |
+
content_list, md_content = MineruParser.parse_document(
|
205 |
+
file_path,
|
206 |
+
parse_method=parse_method,
|
207 |
+
output_dir=output_dir
|
208 |
+
)
|
209 |
+
|
210 |
+
self.logger.info(f"Parsing complete! Extracted {len(content_list)} content blocks")
|
211 |
+
self.logger.info(f"Markdown text length: {len(md_content)} characters")
|
212 |
+
|
213 |
+
# Display content statistics if requested
|
214 |
+
if display_stats:
|
215 |
+
self.logger.info("\nContent Information:")
|
216 |
+
self.logger.info(f"* Total blocks in content_list: {len(content_list)}")
|
217 |
+
self.logger.info(f"* Markdown content length: {len(md_content)} characters")
|
218 |
+
|
219 |
+
# Count elements by type
|
220 |
+
block_types: Dict[str, int] = {}
|
221 |
+
for block in content_list:
|
222 |
+
if isinstance(block, dict):
|
223 |
+
block_type = block.get("type", "unknown")
|
224 |
+
if isinstance(block_type, str):
|
225 |
+
block_types[block_type] = block_types.get(block_type, 0) + 1
|
226 |
+
|
227 |
+
self.logger.info("* Content block types:")
|
228 |
+
for block_type, count in block_types.items():
|
229 |
+
self.logger.info(f" - {block_type}: {count}")
|
230 |
+
|
231 |
+
return content_list, md_content
|
232 |
+
|
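Review note: the extension checks in `parse_document` above amount to a small routing table. A minimal sketch of that routing in isolation, assuming a hypothetical helper `pick_parser` (not part of this commit) that only reports which parser family would be tried first; the extension groups are copied from the diff, the labels are illustrative:

```python
from pathlib import Path

# Extension groups copied from parse_document; the returned labels are illustrative only.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"}
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx"}


def pick_parser(file_path: str) -> str:
    """Return the parser family parse_document would try first (hypothetical helper)."""
    ext = Path(file_path).suffix.lower()  # case-insensitive, as in the diff
    if ext == ".pdf":
        return "pdf"
    if ext in IMAGE_EXTS:
        return "image"
    if ext in OFFICE_EXTS:
        return "office"
    # Anything else (including files with no extension) goes to the generic parser.
    return "generic"


for name in ["report.PDF", "scan.tiff", "slides.pptx", "notes.md"]:
    print(name, "->", pick_parser(name))
```

Note that when the branch taken is already the generic one, the `except` fallback in the diff retries the same generic parser, so a failure there will most likely surface again from the second call.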
233 |
+
def _separate_content(self, content_list: List[Dict[str, Any]]) -> Tuple[str, List[Dict[str, Any]]]:
|
234 |
+
"""
|
235 |
+
Separate text content and multimodal content
|
236 |
+
|
237 |
+
Args:
|
238 |
+
content_list: Content list from MinerU parsing
|
239 |
+
|
240 |
+
Returns:
|
241 |
+
(text_content, multimodal_items): Pure text content and multimodal items list
|
242 |
+
"""
|
243 |
+
text_parts = []
|
244 |
+
multimodal_items = []
|
245 |
+
|
246 |
+
for item in content_list:
|
247 |
+
content_type = item.get("type", "text")
|
248 |
+
|
249 |
+
if content_type == "text":
|
250 |
+
# Text content
|
251 |
+
text = item.get("text", "")
|
252 |
+
if text.strip():
|
253 |
+
text_parts.append(text)
|
254 |
+
else:
|
255 |
+
# Multimodal content (image, table, equation, etc.)
|
256 |
+
multimodal_items.append(item)
|
257 |
+
|
258 |
+
# Merge all text content
|
259 |
+
text_content = "\n\n".join(text_parts)
|
260 |
+
|
261 |
+
self.logger.info("Content separation complete:")
|
262 |
+
self.logger.info(f" - Text content length: {len(text_content)} characters")
|
263 |
+
self.logger.info(f" - Multimodal items count: {len(multimodal_items)}")
|
264 |
+
|
265 |
+
# Count multimodal types
|
266 |
+
modal_types = {}
|
267 |
+
for item in multimodal_items:
|
268 |
+
modal_type = item.get("type", "unknown")
|
269 |
+
modal_types[modal_type] = modal_types.get(modal_type, 0) + 1
|
270 |
+
|
271 |
+
if modal_types:
|
272 |
+
self.logger.info(f" - Multimodal type distribution: {modal_types}")
|
273 |
+
|
274 |
+
return text_content, multimodal_items
|
275 |
+
|
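Review note: a standalone sketch of what `_separate_content` does to a content list, with logging removed. The sample blocks are hypothetical; keys such as `img_path` and `table_body` are assumptions made for illustration, and only `type` and `text` are actually read by the logic.

```python
from typing import Any, Dict, List, Tuple


def separate_content(
    content_list: List[Dict[str, Any]],
) -> Tuple[str, List[Dict[str, Any]]]:
    """Standalone copy of the separation logic from the diff, without logging."""
    text_parts: List[str] = []
    multimodal_items: List[Dict[str, Any]] = []
    for item in content_list:
        # Blocks without a "type" key are treated as text.
        if item.get("type", "text") == "text":
            text = item.get("text", "")
            if text.strip():  # whitespace-only text blocks are dropped
                text_parts.append(text)
        else:
            multimodal_items.append(item)
    return "\n\n".join(text_parts), multimodal_items


# Hypothetical blocks for illustration.
sample = [
    {"type": "text", "text": "Introduction"},
    {"type": "image", "img_path": "figure1.png"},
    {"type": "text", "text": "   "},
    {"type": "table", "table_body": "<table></table>"},
    {"type": "text", "text": "Conclusion"},
]
text_content, multimodal_items = separate_content(sample)
print(repr(text_content))
print([item["type"] for item in multimodal_items])
```

The text blocks are joined with blank lines, so the position of each image or table relative to the surrounding text is not preserved in `text_content`.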
276 |
+
async def _insert_text_content(
|
277 |
+
self,
|
278 |
+
input: str | list[str],
|
279 |
+
split_by_character: str | None = None,
|
280 |
+
split_by_character_only: bool = False,
|
281 |
+
ids: str | list[str] | None = None,
|
282 |
+
file_paths: str | list[str] | None = None,
|
283 |
+
):
|
284 |
+
"""
|
285 |
+
Insert pure text content into LightRAG
|
286 |
+
|
287 |
+
Args:
|
288 |
+
input: Single document string or list of document strings
|
289 |
+
split_by_character: If not None, split the text on this character first; any chunk longer than
|
290 |
+
chunk_token_size is then split again by token size.
|
291 |
+
split_by_character_only: If True, split on split_by_character only, with no further token-size split;
|
292 |
+
ignored when split_by_character is None.
|
293 |
+
ids: A single document ID or a list of unique document IDs; if omitted, MD5 hash IDs are generated
|
294 |
+
file_paths: A single file path or a list of file paths, used for citation
|
295 |
+
"""
|
296 |
+
self.logger.info("Starting text content insertion into LightRAG...")
|
297 |
+
|
298 |
+
# Use LightRAG's insert method with all parameters
|
299 |
+
await self.lightrag.ainsert(
|
300 |
+
input=input,
|
301 |
+
file_paths=file_paths,
|
302 |
+
split_by_character=split_by_character,
|
303 |
+
split_by_character_only=split_by_character_only,
|
304 |
+
ids=ids
|
305 |
+
)
|
306 |
+
|
307 |
+
self.logger.info("Text content insertion complete")
|
308 |
+
|
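Review note: the interaction between `split_by_character` and `split_by_character_only` described in the docstring is easy to misread. The sketch below illustrates the documented behavior only; it is not LightRAG's chunker. It assumes whitespace-separated words as a stand-in for tokens, ignores chunk overlap, and `sketch_chunks` and `max_tokens` are hypothetical names.

```python
from typing import List, Optional


def sketch_chunks(
    text: str,
    split_by_character: Optional[str] = None,
    split_by_character_only: bool = False,
    max_tokens: int = 4,
) -> List[str]:
    """Simplified illustration: words stand in for tokens, overlap is ignored."""

    def by_token_window(piece: str) -> List[str]:
        words = piece.split()
        return [
            " ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)
        ]

    if split_by_character is None:
        # split_by_character_only has no effect in this case.
        return by_token_window(text)

    chunks: List[str] = []
    for piece in text.split(split_by_character):
        piece = piece.strip()
        if not piece:
            continue
        if split_by_character_only or len(piece.split()) <= max_tokens:
            chunks.append(piece)  # keep the character-delimited piece whole
        else:
            chunks.extend(by_token_window(piece))  # oversized piece is split again
    return chunks


doc = "one two three four five six\nseven eight"
print(sketch_chunks(doc))
print(sketch_chunks(doc, split_by_character="\n"))
print(sketch_chunks(doc, split_by_character="\n", split_by_character_only=True))
```

With `split_by_character_only=True` a piece can exceed the size limit, which may matter for documents containing long paragraphs.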
309 |
+
async def _process_multimodal_content(self, multimodal_items: List[Dict[str, Any]], file_path: str):
|
310 |
+
"""
|
311 |
+
Process multimodal content (using specialized processors)
|
312 |
+
|
313 |
+
Args:
|
314 |
+
multimodal_items: List of multimodal items
|
315 |
+
file_path: File path (for reference)
|
316 |
+
"""
|
317 |
+
if not multimodal_items:
|
318 |
+
self.logger.debug("No multimodal content to process")
|
319 |
+
return
|
320 |
+
|
321 |
+
self.logger.info("Starting multimodal content processing...")
|
322 |
+
|
323 |
+
file_name = os.path.basename(file_path)
|
324 |
+
|
325 |
+
for i, item in enumerate(multimodal_items):
|
326 |
+
try:
|
327 |
+
content_type = item.get("type", "unknown")
|
328 |
+
self.logger.info(f"Processing item {i+1}/{len(multimodal_items)}: {content_type} content")
|
329 |
+
|
330 |
+
# Select appropriate processor
|
331 |
+
processor = self._get_processor_for_type(content_type)
|
332 |
+
|
333 |
+
if processor:
|
334 |
+
enhanced_caption, entity_info = await processor.process_multimodal_content(
|
335 |
+
modal_content=item,
|
336 |
+
content_type=content_type,
|
337 |
+
file_path=file_name
|
338 |
+
)
|
339 |
+
self.logger.info(f"{content_type} processing complete: {entity_info.get('entity_name', 'Unknown')}")
|
340 |
+
else:
|
341 |
+
self.logger.warning(f"No suitable processor found for {content_type} type content")
|
342 |
+
|
343 |
+
except Exception as e:
|
344 |
+
self.logger.error(f"Error processing multimodal content: {str(e)}")
|
345 |
+
self.logger.debug("Exception details:", exc_info=True)
|
346 |
+
continue
|
347 |
+
|
348 |
+
self.logger.info("Multimodal content processing complete")
|
349 |
+
|
350 |
+
def _get_processor_for_type(self, content_type: str):
|
351 |
+
"""
|
352 |
+
Get appropriate processor based on content type
|
353 |
+
|
354 |
+
Args:
|
355 |
+
content_type: Content type
|
356 |
+
|
357 |
+
Returns:
|
358 |
+
Corresponding processor instance
|
359 |
+
"""
|
360 |
+
# Direct mapping to corresponding processor
|
361 |
+
if content_type == "image":
|
362 |
+
return self.modal_processors.get("image")
|
363 |
+
elif content_type == "table":
|
364 |
+
return self.modal_processors.get("table")
|
365 |
+
elif content_type == "equation":
|
366 |
+
return self.modal_processors.get("equation")
|
367 |
+
else:
|
368 |
+
# For other types, use generic processor
|
369 |
+
return self.modal_processors.get("generic")
|
370 |
+
|
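Review note: the `if`/`elif` chain in `_get_processor_for_type` could probably be reduced to one lookup. A sketch with the same behavior, written as a free function for illustration (`SPECIALIZED_TYPES` is a hypothetical name, and plain strings stand in for processor instances):

```python
from typing import Any, Dict, Optional

# Types that have a dedicated processor in the diff; everything else is "generic".
SPECIALIZED_TYPES = {"image", "table", "equation"}


def get_processor_for_type(
    modal_processors: Dict[str, Any], content_type: str
) -> Optional[Any]:
    """Same mapping as the if/elif chain: known types map to themselves, others to generic."""
    key = content_type if content_type in SPECIALIZED_TYPES else "generic"
    return modal_processors.get(key)


# Strings stand in for processor instances in this sketch.
processors = {"image": "ImageProc", "table": "TableProc", "generic": "GenericProc"}
print(get_processor_for_type(processors, "table"))
print(get_processor_for_type(processors, "chart"))
print(get_processor_for_type(processors, "equation"))
```

As in the original chain, a known type whose processor is not registered (here `equation`) yields `None` rather than falling back to the generic processor; the caller then logs the "No suitable processor" warning.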
371 |
+
async def process_document_complete(
|
372 |
+
self,
|
373 |
+
file_path: str,
|
374 |
+
output_dir: str = "./output",
|
375 |
+
parse_method: str = "auto",
|
376 |
+
display_stats: bool = True,
|
377 |
+
split_by_character: str | None = None,
|
378 |
+
split_by_character_only: bool = False,
|
379 |
+
doc_id: str | None = None
|
380 |
+
):
|
381 |
+
"""
|
382 |
+
Complete document processing workflow
|
383 |
+
|
384 |
+
Args:
|
385 |
+
file_path: Path to the file to process
|
386 |
+
output_dir: MinerU output directory
|
387 |
+
parse_method: Parse method ("auto", "ocr", "txt")
|
388 |
+
display_stats: Whether to display content statistics
|
389 |
+
split_by_character: Optional character to split the text by
|
390 |
+
split_by_character_only: If True, split only by the specified character
|
391 |
+
doc_id: Optional document ID; if omitted, an MD5 hash ID is generated
|
392 |
+
"""
|
393 |
+
# Ensure LightRAG is initialized
|
394 |
+
await self._ensure_lightrag_initialized()
|
395 |
+
|
396 |
+
self.logger.info(f"Starting complete document processing: {file_path}")
|
397 |
+
|
398 |
+
# Step 1: Parse document using MinerU
|
399 |
+
content_list, md_content = self.parse_document(
|
400 |
+
file_path,
|
401 |
+
output_dir,
|
402 |
+
parse_method,
|
403 |
+
display_stats
|
404 |
+
)
|
405 |
+
|
406 |
+
# Step 2: Separate text and multimodal content
|
407 |
+
text_content, multimodal_items = self._separate_content(content_list)
|
408 |
+
|
409 |
+
# Step 3: Insert pure text content with all parameters
|
410 |
+
if text_content.strip():
|
411 |
+
file_name = os.path.basename(file_path)
|
412 |
+
await self._insert_text_content(
|
413 |
+
text_content,
|
414 |
+
file_paths=file_name,
|
415 |
+
split_by_character=split_by_character,
|
416 |
+
split_by_character_only=split_by_character_only,
|
417 |
+
ids=doc_id
|
418 |
+
)
|
419 |
+
|
420 |
+
# Step 4: Process multimodal content (using specialized processors)
|
421 |
+
if multimodal_items:
|
422 |
+
await self._process_multimodal_content(multimodal_items, file_path)
|
423 |
+
|
424 |
+
self.logger.info(f"Document {file_path} processing complete!")
|
425 |
+
|
426 |
+
async def process_folder_complete(
|
427 |
+
self,
|
428 |
+
folder_path: str,
|
429 |
+
output_dir: str = "./output",
|
430 |
+
parse_method: str = "auto",
|
431 |
+
display_stats: bool = False,
|
432 |
+
split_by_character: str | None = None,
|
433 |
+
split_by_character_only: bool = False,
|
434 |
+
file_extensions: Optional[List[str]] = None,
|
435 |
+
recursive: bool = True,
|
436 |
+
max_workers: int = 1
|
437 |
+
):
|
438 |
+
"""
|
439 |
+
Process all files in a folder in batch
|
440 |
+
|
441 |
+
Args:
|
442 |
+
folder_path: Path to the folder to process
|
443 |
+
output_dir: MinerU output directory
|
444 |
+
parse_method: Parse method ("auto", "ocr", "txt")
|
445 |
+
display_stats: Whether to display content statistics for each file (recommended False for batch processing)
|
446 |
+
split_by_character: Optional character to split text by
|
447 |
+
split_by_character_only: If True, split only by the specified character
|
448 |
+
file_extensions: List of file extensions to process, e.g. [".pdf", ".docx"]. If None, process all supported formats
|
449 |
+
recursive: Whether to recursively process subfolders
|
450 |
+
max_workers: Maximum number of concurrent workers
|
451 |
+
"""
|
452 |
+
# Ensure LightRAG is initialized
|
453 |
+
await self._ensure_lightrag_initialized()
|
454 |
+
|
455 |
+
folder_path = Path(folder_path)
|
456 |
+
if not folder_path.exists() or not folder_path.is_dir():
|
457 |
+
raise ValueError(f"Folder does not exist or is not a valid directory: {folder_path}")
|
458 |
+
|
459 |
+
# Supported file formats
|
460 |
+
supported_extensions = {
|
461 |
+
".pdf", ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif",
|
462 |
+
".doc", ".docx", ".ppt", ".pptx", ".txt", ".md"
|
463 |
+
}
|
464 |
+
|
465 |
+
# Use specified extensions or all supported formats
|
466 |
+
if file_extensions:
|
467 |
+
target_extensions = {ext.lower() for ext in file_extensions}
|
468 |
+
# Validate if all are supported formats
|
469 |
+
unsupported = target_extensions - supported_extensions
|
470 |
+
if unsupported:
|
471 |
+
self.logger.warning(f"The following file formats may not be fully supported: {unsupported}")
|
472 |
+
else:
|
473 |
+
target_extensions = supported_extensions
|
474 |
+
|
475 |
+
# Collect all files to process
|
476 |
+
files_to_process = []
|
477 |
+
|
478 |
+
if recursive:
|
479 |
+
# Recursively traverse all subfolders
|
480 |
+
for file_path in folder_path.rglob("*"):
|
481 |
+
if file_path.is_file() and file_path.suffix.lower() in target_extensions:
|
482 |
+
files_to_process.append(file_path)
|
483 |
+
else:
|
484 |
+
# Process only current folder
|
485 |
+
for file_path in folder_path.glob("*"):
|
486 |
+
if file_path.is_file() and file_path.suffix.lower() in target_extensions:
|
487 |
+
files_to_process.append(file_path)
|
488 |
+
|
489 |
+
if not files_to_process:
|
490 |
+
self.logger.info(f"No files to process found in {folder_path}")
|
491 |
+
return {"total": 0, "success": 0, "failed": 0, "failed_files": []}
|
492 |
+
|
493 |
+
self.logger.info(f"Found {len(files_to_process)} files to process")
|
494 |
+
self.logger.info("File type distribution:")
|
495 |
+
|
496 |
+
# Count file types
|
497 |
+
file_type_count = {}
|
498 |
+
for file_path in files_to_process:
|
499 |
+
ext = file_path.suffix.lower()
|
500 |
+
file_type_count[ext] = file_type_count.get(ext, 0) + 1
|
501 |
+
|
502 |
+
for ext, count in sorted(file_type_count.items()):
|
503 |
+
self.logger.info(f" {ext}: {count} files")
|
504 |
+
|
505 |
+
# Create progress tracking
|
506 |
+
processed_count = 0
|
507 |
+
failed_files = []
|
508 |
+
|
509 |
+
# Use semaphore to control concurrency
|
510 |
+
semaphore = asyncio.Semaphore(max_workers)
|
511 |
+
|
512 |
+
async def process_single_file(file_path: Path, index: int) -> None:
|
513 |
+
"""Process a single file"""
|
514 |
+
async with semaphore:
|
515 |
+
nonlocal processed_count
|
516 |
+
try:
|
517 |
+
self.logger.info(f"[{index}/{len(files_to_process)}] Processing: {file_path}")
|
518 |
+
|
519 |
+
# Create separate output directory for each file
|
520 |
+
file_output_dir = Path(output_dir) / file_path.stem
|
521 |
+
file_output_dir.mkdir(parents=True, exist_ok=True)
|
522 |
+
|
523 |
+
# Process file
|
524 |
+
await self.process_document_complete(
|
525 |
+
file_path=str(file_path),
|
526 |
+
output_dir=str(file_output_dir),
|
527 |
+
parse_method=parse_method,
|
528 |
+
display_stats=display_stats,
|
529 |
+
split_by_character=split_by_character,
|
530 |
+
split_by_character_only=split_by_character_only
|
531 |
+
)
|
532 |
+
|
533 |
+
processed_count += 1
|
534 |
+
self.logger.info(f"[{index}/{len(files_to_process)}] Successfully processed: {file_path}")
|
535 |
+
|
536 |
+
except Exception as e:
|
537 |
+
self.logger.error(f"[{index}/{len(files_to_process)}] Failed to process: {file_path}")
|
538 |
+
self.logger.error(f"Error: {str(e)}")
|
539 |
+
failed_files.append((file_path, str(e)))
|
540 |
+
|
541 |
+
# Create all processing tasks
|
542 |
+
tasks = []
|
543 |
+
for index, file_path in enumerate(files_to_process, 1):
|
544 |
+
task = process_single_file(file_path, index)
|
545 |
+
tasks.append(task)
|
546 |
+
|
547 |
+
# Wait for all tasks to complete
|
548 |
+
await asyncio.gather(*tasks, return_exceptions=True)
|
549 |
+
|
550 |
+
# Output processing statistics
|
551 |
+
self.logger.info("\n===== Batch Processing Complete =====")
|
552 |
+
self.logger.info(f"Total files: {len(files_to_process)}")
|
553 |
+
self.logger.info(f"Successfully processed: {processed_count}")
|
554 |
+
self.logger.info(f"Failed: {len(failed_files)}")
|
555 |
+
|
556 |
+
if failed_files:
|
557 |
+
self.logger.info("\nFailed files:")
|
558 |
+
for file_path, error in failed_files:
|
559 |
+
self.logger.info(f" - {file_path}: {error}")
|
560 |
+
|
561 |
+
return {
|
562 |
+
"total": len(files_to_process),
|
563 |
+
"success": processed_count,
|
564 |
+
"failed": len(failed_files),
|
565 |
+
"failed_files": failed_files
|
566 |
+
}
|
567 |
+
|
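Review note: the batch loop in `process_folder_complete` bounds concurrency with an `asyncio.Semaphore` and collects failures instead of raising. A self-contained sketch of that pattern follows; `run_batch` is a hypothetical name, `asyncio.sleep` stands in for `process_document_complete`, and the `peak_concurrency` counter is added here only to make the bound observable.

```python
import asyncio
from typing import Dict, List, Tuple


async def run_batch(names: List[str], max_workers: int = 2) -> Dict[str, object]:
    """Process names with at most max_workers running at once (illustrative sketch)."""
    semaphore = asyncio.Semaphore(max_workers)
    processed = 0
    failed: List[Tuple[str, str]] = []
    running = 0
    peak = 0

    async def worker(name: str) -> None:
        nonlocal processed, running, peak
        async with semaphore:  # blocks while max_workers tasks are already inside
            running += 1
            peak = max(peak, running)
            try:
                await asyncio.sleep(0.01)  # stand-in for process_document_complete
                if name.endswith(".bad"):
                    raise ValueError("simulated parse failure")
                processed += 1
            except Exception as exc:
                # Record the failure and keep going, as the diff does.
                failed.append((name, str(exc)))
            finally:
                running -= 1

    await asyncio.gather(*(worker(n) for n in names))
    return {
        "total": len(names),
        "success": processed,
        "failed": len(failed),
        "failed_files": failed,
        "peak_concurrency": peak,
    }


summary = asyncio.run(run_batch(["a.pdf", "b.docx", "c.bad", "d.png"], max_workers=2))
print(summary)
```

Because each worker catches its own exceptions, `gather` never sees one. The semaphore limits only how many coroutines are in flight; since `parse_document` is a synchronous call, the parsing step itself still blocks the event loop while it runs, so raising `max_workers` may not speed up the parsing phase.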
568 |
+
async def query_with_multimodal(
|
569 |
+
self,
|
570 |
+
query: str,
|
571 |
+
mode: str = "hybrid"
|
572 |
+
) -> str:
|
573 |
+
"""
|
574 |
+
Query with multimodal content support
|
575 |
+
|
576 |
+
Args:
|
577 |
+
query: Query text
|
578 |
+
mode: LightRAG query mode passed to QueryParam (e.g. "local", "global", "hybrid", "naive", "mix")
|
579 |
+
|
580 |
+
Returns:
|
581 |
+
Query result
|
582 |
+
"""
|
583 |
+
if self.lightrag is None:
|
584 |
+
raise ValueError(
|
585 |
+
"No LightRAG instance available. "
|
586 |
+
"Please either:\n"
|
587 |
+
"1. Provide a pre-initialized LightRAG instance when creating RAGAnything, or\n"
|
588 |
+
"2. Process documents first using process_document_complete() or process_folder_complete() "
|
589 |
+
"to create and populate the LightRAG instance."
|
590 |
+
)
|
591 |
+
|
592 |
+
result = await self.lightrag.aquery(
|
593 |
+
query,
|
594 |
+
param=QueryParam(mode=mode)
|
595 |
+
)
|
596 |
+
|
597 |
+
return result
|
598 |
+
|
599 |
+
def get_processor_info(self) -> Dict[str, Any]:
|
600 |
+
"""Get processor information"""
|
601 |
+
if not self.modal_processors:
|
602 |
+
return {"status": "Not initialized"}
|
603 |
+
|
604 |
+
info = {
|
605 |
+
"status": "Initialized",
|
606 |
+
"processors": {},
|
607 |
+
"models": {
|
608 |
+
"llm_model": "External function" if self.llm_model_func else "Not provided",
|
609 |
+
"vision_model": "External function" if self.vision_model_func else "Not provided",
|
610 |
+
"embedding_model": "External function" if self.embedding_func else "Not provided"
|
611 |
+
}
|
612 |
+
}
|
613 |
+
|
614 |
+
for proc_type, processor in self.modal_processors.items():
|
615 |
+
info["processors"][proc_type] = {
|
616 |
+
"class": processor.__class__.__name__,
|
617 |
+
"supports": self._get_processor_supports(proc_type)
|
618 |
+
}
|
619 |
+
|
620 |
+
return info
|
621 |
+
|
622 |
+
def _get_processor_supports(self, proc_type: str) -> List[str]:
|
623 |
+
"""Get processor supported features"""
|
624 |
+
supports_map = {
|
625 |
+
"image": ["Image content analysis", "Visual understanding", "Image description generation", "Image entity extraction"],
|
626 |
+
"table": ["Table structure analysis", "Data statistics", "Trend identification", "Table entity extraction"],
|
627 |
+
"equation": ["Mathematical formula parsing", "Variable identification", "Formula meaning explanation", "Formula entity extraction"],
|
628 |
+
"generic": ["General content analysis", "Structured processing", "Entity extraction"]
|
629 |
+
}
|
630 |
+
return supports_map.get(proc_type, ["Basic processing"])
|
631 |
+
|
632 |
+
|