zrguo committed
Commit 8352b84 · 1 Parent(s): 050a1c0

MinerU integration
README-zh.md CHANGED
@@ -4,6 +4,7 @@

 ## 🎉 News

+ - [X] [2025.06.05]🎯📢LightRAG now integrates MinerU, supporting multimodal document parsing and RAG (PDF, images, Office documents, tables, formulas, etc.). See the multimodal processing section below.
 - [X] [2025.03.18]🎯📢LightRAG now supports citation functionality.
 - [X] [2025.02.05]🎯📢Our team released [VideoRAG](https://github.com/HKUDS/VideoRAG) for understanding extremely long-context videos.
 - [X] [2025.01.13]🎯📢Our team released [MiniRAG](https://github.com/HKUDS/MiniRAG), which simplifies RAG with small models.
@@ -1002,6 +1003,32 @@ rag.merge_entities(

 </details>

+ ## Multimodal Document Processing (MinerU Integration)
+
+ LightRAG now supports multimodal document parsing and retrieval-augmented generation (RAG) via [MinerU](https://github.com/opendatalab/MinerU). You can extract structured content (text, images, tables, formulas, etc.) from PDF, image, and Office documents and use it in your RAG pipeline.
+
+ **Key Features:**
+ - Parses many formats, including PDF, images, and DOC/DOCX/PPT/PPTX
+ - Extracts and indexes text, images, tables, formulas, and document structure
+ - Queries and retrieves multimodal content (text, images, tables, formulas) in RAG
+ - Integrates seamlessly with LightRAG Core and RAGAnything
+
+ **Quick Start:**
+ 1. Install the dependencies:
+ ```bash
+ pip install "magic-pdf[full]>=1.2.2" huggingface_hub
+ ```
+ 2. Download the MinerU model weights (see the [MinerU Integration Guide](docs/mineru_integration_zh.md))
+ 3. Process files with the new `MineruParser` or RAGAnything's `process_document_complete`:
+ ```python
+ from lightrag.mineru_parser import MineruParser
+ content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
+ # Or auto-detect the file type:
+ content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
+ ```
+ 4. To query multimodal content with LightRAG, see [docs/mineru_integration_zh.md](docs/mineru_integration_zh.md).
+
+
 ## Token Usage Statistics

 <details>
README.md CHANGED
@@ -39,7 +39,7 @@

 </div>

 ## 🎉 News
-
+ - [X] [2025.06.05]🎯📢LightRAG now supports multimodal document parsing and RAG with MinerU integration (PDF, images, Office documents, tables, formulas, etc.). See the new multimodal section below.
 - [X] [2025.03.18]🎯📢LightRAG now supports citation functionality, enabling proper source attribution.
 - [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) for understanding extremely long-context videos.
 - [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG), making RAG simpler with small models.
@@ -1051,6 +1051,31 @@ When merging entities:

 </details>

+ ## Multimodal Document Processing (MinerU Integration)
+
+ LightRAG now supports multimodal document parsing and retrieval-augmented generation (RAG) via [MinerU](https://github.com/opendatalab/MinerU). You can extract structured content (text, images, tables, formulas, etc.) from PDF, image, and Office documents and use it in your RAG pipeline.
+
+ **Key Features:**
+ - Parses PDFs, images, DOC/DOCX/PPT/PPTX, and more
+ - Extracts and indexes text, images, tables, formulas, and document structure
+ - Queries and retrieves multimodal content (text, images, tables, formulas) in RAG
+ - Integrates seamlessly with LightRAG core and RAGAnything
+
+ **Quick Start:**
+ 1. Install the dependencies:
+ ```bash
+ pip install "magic-pdf[full]>=1.2.2" huggingface_hub
+ ```
+ 2. Download the MinerU model weights (see the [MinerU Integration Guide](docs/mineru_integration_en.md))
+ 3. Use the new `MineruParser` or RAGAnything's `process_document_complete` to process files:
+ ```python
+ from lightrag.mineru_parser import MineruParser
+ content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
+ # or for any file type:
+ content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
+ ```
+ 4. To query multimodal content with LightRAG, see [docs/mineru_integration_en.md](docs/mineru_integration_en.md).
+
 ## Token Usage Tracking

 <details>
docs/mineru_integration_en.md ADDED
@@ -0,0 +1,246 @@
+ # MinerU Integration Guide
+
+ ### About MinerU
+
+ MinerU is a powerful open-source tool for extracting high-quality structured data from PDF, image, and Office documents. It provides the following features:
+
+ - Text extraction that preserves document structure (headings, paragraphs, lists, etc.)
+ - Handling of complex layouts, including multi-column formats
+ - Automatic formula recognition and conversion to LaTeX format
+ - Image, table, and footnote extraction
+ - Automatic detection of scanned documents and application of OCR
+ - Support for multiple output formats (Markdown, JSON)
+
+ ### Installation
+
+ #### Installing MinerU Dependencies
+
+ If you have already installed LightRAG but without MinerU support, you can add it by installing the magic-pdf package directly:
+
+ ```bash
+ pip install "magic-pdf[full]>=1.2.2" huggingface_hub
+ ```
+
+ These are the MinerU-related dependencies required by LightRAG.
+
+ #### MinerU Model Weights
+
+ MinerU requires model weight files to function properly. After installation, download the required model weights from either Hugging Face or ModelScope.
+
+ ##### Option 1: Download from Hugging Face
+
+ ```bash
+ pip install huggingface_hub
+ wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
+ python download_models_hf.py
+ ```
+
+ ##### Option 2: Download from ModelScope (recommended for users in China)
+
+ ```bash
+ pip install modelscope
+ wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models.py -O download_models.py
+ python download_models.py
+ ```
+
+ Both methods automatically download the model files and record the model directory in the configuration file, `magic-pdf.json`, located in your user directory.
+
+ > **Note for Windows users**: the user directory is `C:\Users\username`
+ > **Note for Linux users**: the user directory is `/home/username`
+ > **Note for macOS users**: the user directory is `/Users/username`
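To confirm the setup before parsing anything, a small standalone check can read the configuration back. This sketch assumes only the file location stated above and the `models-dir` key that the download scripts write; it is not part of the LightRAG API:

```python
import json
import os

# Locate the magic-pdf.json that the download scripts write to the home
# directory, and report the configured model directory if it is present.
config_path = os.path.join(os.path.expanduser("~"), "magic-pdf.json")
if os.path.exists(config_path):
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    print("models-dir:", config.get("models-dir"))
else:
    print("magic-pdf.json not found; run one of the download scripts first")
```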
+
+ #### Optional: LibreOffice Installation
+
+ To process Office documents (DOC, DOCX, PPT, PPTX), you need to install LibreOffice:
+
+ **Linux/macOS:**
+ ```bash
+ apt-get/yum/brew install libreoffice
+ ```
+
+ **Windows:**
+ 1. Install LibreOffice
+ 2. Add the installation directory to your PATH: `install_dir\LibreOffice\program`
+
+ ### Using the MinerU Parser
+
+ #### Basic Usage
+
+ ```python
+ from lightrag.mineru_parser import MineruParser
+
+ # Parse a PDF document
+ content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
+
+ # Parse an image
+ content_list, md_content = MineruParser.parse_image('path/to/image.jpg', 'output_dir')
+
+ # Parse an Office document
+ content_list, md_content = MineruParser.parse_office_doc('path/to/document.docx', 'output_dir')
+
+ # Auto-detect and parse any supported document type
+ content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
+ ```
+
+ #### RAGAnything Integration
+
+ In RAGAnything, you can pass a file path directly to the `process_document_complete` method to process a document. Here is a complete configuration example:
+
+ ```python
+ from lightrag.llm.openai import openai_complete_if_cache, openai_embed
+ from lightrag.raganything import RAGAnything
+
+ # Initialize RAGAnything
+ rag = RAGAnything(
+     working_dir="./rag_storage",  # Working directory
+     llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
+         "gpt-4o-mini",  # Model to use
+         prompt,
+         system_prompt=system_prompt,
+         history_messages=history_messages,
+         api_key="your-api-key",  # Replace with your API key
+         base_url="your-base-url",  # Replace with your API base URL
+         **kwargs,
+     ),
+     vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
+         "gpt-4o",  # Vision model
+         "",
+         system_prompt=None,
+         history_messages=[],
+         messages=[
+             {"role": "system", "content": system_prompt} if system_prompt else None,
+             {"role": "user", "content": [
+                 {"type": "text", "text": prompt},
+                 {
+                     "type": "image_url",
+                     "image_url": {
+                         "url": f"data:image/jpeg;base64,{image_data}"
+                     }
+                 }
+             ]} if image_data else {"role": "user", "content": prompt}
+         ],
+         api_key="your-api-key",  # Replace with your API key
+         base_url="your-base-url",  # Replace with your API base URL
+         **kwargs,
+     ) if image_data else openai_complete_if_cache(
+         "gpt-4o-mini",
+         prompt,
+         system_prompt=system_prompt,
+         history_messages=history_messages,
+         api_key="your-api-key",  # Replace with your API key
+         base_url="your-base-url",  # Replace with your API base URL
+         **kwargs,
+     ),
+     embedding_func=lambda texts: openai_embed(
+         texts,
+         model="text-embedding-3-large",
+         api_key="your-api-key",  # Replace with your API key
+         base_url="your-base-url",  # Replace with your API base URL
+     ),
+     embedding_dim=3072,
+     max_token_size=8192
+ )
+
+ # Process a single file
+ await rag.process_document_complete(
+     file_path="path/to/document.pdf",
+     output_dir="./output",
+     parse_method="auto"
+ )
+
+ # Query the processed document
+ result = await rag.query_with_multimodal(
+     "What is the main content of the document?",
+     mode="hybrid"
+ )
+ ```
+
+ MinerU categorizes document content into text, formulas, images, and tables, and each category is ingested with a corresponding type:
+ - Text content: `ingestion_type='text'`
+ - Image content: `ingestion_type='image'`
+ - Table content: `ingestion_type='table'`
+ - Formula content: `ingestion_type='equation'`
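The categorization above can be sketched as a plain mapping from MinerU `content_list` entries to ingestion types. This is a standalone illustration; the helper name and fallback rule are ours, not LightRAG API:

```python
# Map MinerU content-list block types to the ingestion types listed above.
INGESTION_TYPES = {
    "text": "text",
    "image": "image",
    "table": "table",
    "equation": "equation",
}

def ingestion_type_for(block: dict) -> str:
    # Fall back to plain text for any block type MinerU may add later.
    return INGESTION_TYPES.get(block.get("type", "text"), "text")

sample = [
    {"type": "text", "text": "Introduction ..."},
    {"type": "image", "img_path": "images/fig1.jpg"},
    {"type": "equation", "text": "E = mc^2"},
]
print([ingestion_type_for(b) for b in sample])  # → ['text', 'image', 'equation']
```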
+
+ #### Query Examples
+
+ Here are some common query examples:
+
+ ```python
+ # Query text content
+ result = await rag.query_with_multimodal(
+     "What is the main topic of the document?",
+     mode="hybrid"
+ )
+
+ # Query image-related content
+ result = await rag.query_with_multimodal(
+     "Describe the images and figures in the document",
+     mode="hybrid"
+ )
+
+ # Query table-related content
+ result = await rag.query_with_multimodal(
+     "Tell me about the experimental results and data tables",
+     mode="hybrid"
+ )
+ ```
+
+ #### Command Line Tool
+
+ We also provide a command-line tool for document parsing:
+
+ ```bash
+ python examples/mineru_example.py path/to/document.pdf
+ ```
+
+ Optional parameters:
+ - `--output` or `-o`: specify the output directory
+ - `--method` or `-m`: choose the parsing method (auto, ocr, or txt)
+ - `--stats`: display content statistics
+
+ ### Output Format
+
+ MinerU generates three files for each parsed document:
+
+ 1. `{filename}.md` - a Markdown representation of the document
+ 2. `{filename}_content_list.json` - structured JSON content
+ 3. `{filename}_model.json` - detailed model parsing results
+
+ The `content_list.json` file contains all structured content extracted from the document, including:
+ - Text blocks (body text, headings, etc.)
+ - Images (paths and optional captions)
+ - Tables (table content and optional captions)
+ - Lists
+ - Formulas
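To illustrate how a consumer might read the structured output back, the standalone sketch below writes a miniature content list and loads it again. The sample entries are made up for the example and only shaped like MinerU's output:

```python
import json
import os
import tempfile

# Build a miniature content list shaped like MinerU's output, write it to
# disk as {filename}_content_list.json, and read it back, collecting the
# image paths the way a downstream consumer might.
content_list = [
    {"type": "text", "text": "1. Introduction"},
    {"type": "image", "img_path": "images/fig1.jpg", "img_caption": ["Figure 1"]},
    {"type": "table", "table_caption": ["Table 1"], "table_body": "<table>...</table>"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "document_content_list.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(content_list, f, ensure_ascii=False, indent=4)
    with open(path, encoding="utf-8") as f:
        blocks = json.load(f)

image_paths = [b["img_path"] for b in blocks if b.get("type") == "image"]
print(image_paths)  # → ['images/fig1.jpg']
```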
+
+ ### Troubleshooting
+
+ If you encounter issues with MinerU:
+
+ 1. Check that the model weights are correctly downloaded
+ 2. Ensure you have sufficient RAM (16 GB+ recommended)
+ 3. For CUDA acceleration issues, see the [MinerU documentation](https://mineru.readthedocs.io/en/latest/additional_notes/faq.html)
+ 4. If parsing Office documents fails, verify that LibreOffice is properly installed
+ 5. If you encounter `pickle.UnpicklingError: invalid load key, 'v'.`, the model download may be incomplete; try re-downloading the models
+ 6. If OCR output is garbled on newer graphics cards (H100, etc.), try upgrading the CUDA version used by Paddle:
+ ```bash
+ pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
+ ```
+ 7. If you encounter a "filename too long" error, the latest version of MineruParser includes logic to handle this automatically
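The "filename too long" handling in point 7 mirrors the guard in `MineruParser.safe_write`: names beyond roughly 200 characters are trimmed while the extension is kept. The helper name below is ours, written as a standalone sketch of that logic:

```python
import os

def truncate_filename(filename: str, limit: int = 200, keep: int = 190) -> str:
    # Most filesystems cap names near 255 characters, so trim the stem
    # of an over-long name while preserving its extension.
    if len(filename) > limit:
        base, ext = os.path.splitext(filename)
        filename = base[:keep] + ext
    return filename

print(truncate_filename("report.md"))             # short names pass through
print(len(truncate_filename("x" * 300 + ".json")))  # trimmed to 190 + len(".json")
```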
+
+ #### Updating Existing Models
+
+ If you have previously downloaded the models and need to update them, simply run the download script again; it updates the model directory to the latest version.
+
+ ### Advanced Configuration
+
+ The MinerU configuration file `magic-pdf.json` supports various customization options, including:
+
+ - Model directory path
+ - OCR engine selection
+ - GPU acceleration settings
+ - Cache settings
+
+ For complete configuration options, refer to the [MinerU official documentation](https://mineru.readthedocs.io/).
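For orientation, a `magic-pdf.json` might look like the fragment below. The `models-dir` and `device-mode` keys are written by the download scripts, but the exact set of keys depends on your magic-pdf version, so treat this as an illustrative sketch rather than a template to copy:

```json
{
    "models-dir": "/home/username/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit",
    "device-mode": "cuda"
}
```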
docs/mineru_integration_zh.md ADDED
@@ -0,0 +1,245 @@
+ # MinerU Integration Guide
+
+ ### About MinerU
+
+ MinerU is a powerful open-source tool for extracting high-quality structured data from PDF, image, and Office documents. It provides the following features:
+
+ - Text extraction that preserves document structure (headings, paragraphs, lists, etc.)
+ - Handling of complex layouts, including multi-column formats
+ - Automatic formula recognition and conversion to LaTeX format
+ - Image, table, and footnote extraction
+ - Automatic detection of scanned documents and application of OCR
+ - Support for multiple output formats (Markdown, JSON)
+
+ ### Installation
+
+ #### Installing MinerU Dependencies
+
+ If you have already installed LightRAG but without MinerU support, you can add it by installing the magic-pdf package directly:
+
+ ```bash
+ pip install "magic-pdf[full]>=1.2.2" huggingface_hub
+ ```
+
+ These are the MinerU-related dependencies required by LightRAG.
+
+ #### MinerU Model Weights
+
+ MinerU requires model weight files to function properly. After installation, download the required model weights from either Hugging Face or ModelScope.
+
+ ##### Option 1: Download from Hugging Face
+
+ ```bash
+ pip install huggingface_hub
+ wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
+ python download_models_hf.py
+ ```
+
+ ##### Option 2: Download from ModelScope (recommended for users in China)
+
+ ```bash
+ pip install modelscope
+ wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models.py -O download_models.py
+ python download_models.py
+ ```
+
+ Both methods automatically download the model files and record the model directory in the configuration file, `magic-pdf.json`, located in your user directory.
+
+ > **Note for Windows users**: the user directory is `C:\Users\username`
+ > **Note for Linux users**: the user directory is `/home/username`
+ > **Note for macOS users**: the user directory is `/Users/username`
+
+ #### Optional: LibreOffice Installation
+
+ To process Office documents (DOC, DOCX, PPT, PPTX), you need to install LibreOffice:
+
+ **Linux/macOS:**
+ ```bash
+ apt-get/yum/brew install libreoffice
+ ```
+
+ **Windows:**
+ 1. Install LibreOffice
+ 2. Add the installation directory to your PATH environment variable: `install_dir\LibreOffice\program`
+
+ ### Using the MinerU Parser
+
+ #### Basic Usage
+
+ ```python
+ from lightrag.mineru_parser import MineruParser
+
+ # Parse a PDF document
+ content_list, md_content = MineruParser.parse_pdf('path/to/document.pdf', 'output_dir')
+
+ # Parse an image
+ content_list, md_content = MineruParser.parse_image('path/to/image.jpg', 'output_dir')
+
+ # Parse an Office document
+ content_list, md_content = MineruParser.parse_office_doc('path/to/document.docx', 'output_dir')
+
+ # Auto-detect and parse any supported document type
+ content_list, md_content = MineruParser.parse_document('path/to/file', 'auto', 'output_dir')
+ ```
+
+ #### RAGAnything Integration
+
+ In RAGAnything, you can pass a file path directly to the `process_document_complete` method to process a document. Here is a complete configuration example:
+
+ ```python
+ from lightrag.llm.openai import openai_complete_if_cache, openai_embed
+ from lightrag.raganything import RAGAnything
+
+ # Initialize RAGAnything
+ rag = RAGAnything(
+     working_dir="./rag_storage",  # Working directory
+     llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
+         "gpt-4o-mini",  # Model to use
+         prompt,
+         system_prompt=system_prompt,
+         history_messages=history_messages,
+         api_key="your-api-key",  # Replace with your API key
+         base_url="your-base-url",  # Replace with your API base URL
+         **kwargs,
+     ),
+     vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
+         "gpt-4o",  # Vision model
+         "",
+         system_prompt=None,
+         history_messages=[],
+         messages=[
+             {"role": "system", "content": system_prompt} if system_prompt else None,
+             {"role": "user", "content": [
+                 {"type": "text", "text": prompt},
+                 {
+                     "type": "image_url",
+                     "image_url": {
+                         "url": f"data:image/jpeg;base64,{image_data}"
+                     }
+                 }
+             ]} if image_data else {"role": "user", "content": prompt}
+         ],
+         api_key="your-api-key",  # Replace with your API key
+         base_url="your-base-url",  # Replace with your API base URL
+         **kwargs,
+     ) if image_data else openai_complete_if_cache(
+         "gpt-4o-mini",
+         prompt,
+         system_prompt=system_prompt,
+         history_messages=history_messages,
+         api_key="your-api-key",  # Replace with your API key
+         base_url="your-base-url",  # Replace with your API base URL
+         **kwargs,
+     ),
+     embedding_func=lambda texts: openai_embed(
+         texts,
+         model="text-embedding-3-large",
+         api_key="your-api-key",  # Replace with your API key
+         base_url="your-base-url",  # Replace with your API base URL
+     ),
+     embedding_dim=3072,
+     max_token_size=8192
+ )
+
+ # Process a single file
+ await rag.process_document_complete(
+     file_path="path/to/document.pdf",
+     output_dir="./output",
+     parse_method="auto"
+ )
+
+ # Query the processed document
+ result = await rag.query_with_multimodal(
+     "What is the main content of the document?",
+     mode="hybrid"
+ )
+ ```
+
+ MinerU categorizes document content into text, formulas, images, and tables, and each category is ingested with a corresponding type:
+ - Text content: `ingestion_type='text'`
+ - Image content: `ingestion_type='image'`
+ - Table content: `ingestion_type='table'`
+ - Formula content: `ingestion_type='equation'`
+
+ #### Query Examples
+
+ Here are some common query examples:
+
+ ```python
+ # Query text content
+ result = await rag.query_with_multimodal(
+     "What is the main topic of the document?",
+     mode="hybrid"
+ )
+
+ # Query image-related content
+ result = await rag.query_with_multimodal(
+     "Describe the images and figures in the document",
+     mode="hybrid"
+ )
+
+ # Query table-related content
+ result = await rag.query_with_multimodal(
+     "Tell me about the experimental results and data tables",
+     mode="hybrid"
+ )
+ ```
+
+ #### Command Line Tool
+
+ We also provide a command-line tool for document parsing:
+
+ ```bash
+ python examples/mineru_example.py path/to/document.pdf
+ ```
+
+ Optional parameters:
+ - `--output` or `-o`: specify the output directory
+ - `--method` or `-m`: choose the parsing method (auto, ocr, or txt)
+ - `--stats`: display content statistics
+
+ ### Output Format
+
+ MinerU generates three files for each parsed document:
+
+ 1. `{filename}.md` - a Markdown representation of the document
+ 2. `{filename}_content_list.json` - structured JSON content
+ 3. `{filename}_model.json` - detailed model parsing results
+
+ The `content_list.json` file contains all structured content extracted from the document, including:
+ - Text blocks (body text, headings, etc.)
+ - Images (paths and optional captions)
+ - Tables (table content and optional captions)
+ - Lists
+ - Formulas
+
+ ### Troubleshooting
+
+ If you encounter issues with MinerU:
+
+ 1. Check that the model weights are correctly downloaded
+ 2. Ensure you have sufficient RAM (16 GB+ recommended)
+ 3. For CUDA acceleration issues, see the [MinerU documentation](https://mineru.readthedocs.io/en/latest/additional_notes/faq.html)
+ 4. If parsing Office documents fails, verify that LibreOffice is properly installed
+ 5. If you encounter `pickle.UnpicklingError: invalid load key, 'v'.`, the model download may be incomplete; try re-downloading the models
+ 6. If OCR output is garbled on newer graphics cards (H100, etc.), try upgrading the CUDA version used by Paddle:
+ ```bash
+ pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
+ ```
+ 7. If you encounter a "filename too long" error, the latest version of MineruParser includes logic to handle this automatically
+
+ #### Updating Existing Models
+
+ If you have previously downloaded the models and need to update them, simply run the download script again; it updates the model directory to the latest version.
+
+ ### Advanced Configuration
+
+ The MinerU configuration file `magic-pdf.json` supports various customization options, including:
+
+ - Model directory path
+ - OCR engine selection
+ - GPU acceleration settings
+ - Cache settings
+
+ For complete configuration options, refer to the [MinerU official documentation](https://mineru.readthedocs.io/).
examples/mineru_example.py ADDED
@@ -0,0 +1,82 @@
+ #!/usr/bin/env python
+ """
+ Example script demonstrating the basic usage of the MinerU parser
+
+ This example shows how to:
+ 1. Parse different types of documents (PDF, images, office documents)
+ 2. Use different parsing methods
+ 3. Display document statistics
+ """
+
+ import os
+ import argparse
+ from typing import Optional
+
+ from lightrag.mineru_parser import MineruParser
+
+
+ def parse_document(file_path: str, output_dir: Optional[str] = None, method: str = "auto", stats: bool = False):
+     """
+     Parse a document using the MinerU parser
+
+     Args:
+         file_path: Path to the document
+         output_dir: Output directory for parsed results
+         method: Parsing method (auto, ocr, txt)
+         stats: Whether to display content statistics
+     """
+     try:
+         # Parse the document
+         content_list, md_content = MineruParser.parse_document(
+             file_path=file_path,
+             parse_method=method,
+             output_dir=output_dir
+         )
+
+         # Display statistics if requested
+         if stats:
+             print("\nDocument Statistics:")
+             print(f"Total content blocks: {len(content_list)}")
+
+             # Count the different types of content
+             content_types = {}
+             for item in content_list:
+                 content_type = item.get('type', 'unknown')
+                 content_types[content_type] = content_types.get(content_type, 0) + 1
+
+             print("\nContent Type Distribution:")
+             for content_type, count in content_types.items():
+                 print(f"- {content_type}: {count}")
+
+         return content_list, md_content
+
+     except Exception as e:
+         print(f"Error parsing document: {str(e)}")
+         return None, None
+
+
+ def main():
+     """Main function to run the example"""
+     parser = argparse.ArgumentParser(description='MinerU Parser Example')
+     parser.add_argument('file_path', help='Path to the document to parse')
+     parser.add_argument('--output', '-o', help='Output directory path')
+     parser.add_argument('--method', '-m',
+                         choices=['auto', 'ocr', 'txt'],
+                         default='auto',
+                         help='Parsing method (auto, ocr, txt)')
+     parser.add_argument('--stats', action='store_true',
+                         help='Display content statistics')
+
+     args = parser.parse_args()
+
+     # Create the output directory if specified
+     if args.output:
+         os.makedirs(args.output, exist_ok=True)
+
+     # Parse the document
+     content_list, md_content = parse_document(
+         args.file_path,
+         args.output,
+         args.method,
+         args.stats
+     )
+
+
+ if __name__ == '__main__':
+     main()
examples/raganything_example.py ADDED
@@ -0,0 +1,129 @@
+ #!/usr/bin/env python
+ """
+ Example script demonstrating the integration of the MinerU parser with RAGAnything
+
+ This example shows how to:
+ 1. Process parsed documents with RAGAnything
+ 2. Perform multimodal queries on the processed documents
+ 3. Handle different types of content (text, images, tables)
+ """
+
+ import os
+ import argparse
+ import asyncio
+
+ from lightrag.llm.openai import openai_complete_if_cache, openai_embed
+ from lightrag.raganything import RAGAnything
+
+
+ async def process_with_rag(file_path: str, output_dir: str, api_key: str, base_url: str = None, working_dir: str = None):
+     """
+     Process a document with RAGAnything
+
+     Args:
+         file_path: Path to the document
+         output_dir: Output directory for RAG results
+         api_key: OpenAI API key
+         base_url: Optional base URL for the API
+         working_dir: Working directory for RAG storage
+     """
+     try:
+         # Initialize RAGAnything
+         rag = RAGAnything(
+             working_dir=working_dir,
+             llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: openai_complete_if_cache(
+                 "gpt-4o-mini",
+                 prompt,
+                 system_prompt=system_prompt,
+                 history_messages=history_messages,
+                 api_key=api_key,
+                 base_url=base_url,
+                 **kwargs,
+             ),
+             vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache(
+                 "gpt-4o",
+                 "",
+                 system_prompt=None,
+                 history_messages=[],
+                 messages=[
+                     {"role": "system", "content": system_prompt} if system_prompt else None,
+                     {"role": "user", "content": [
+                         {"type": "text", "text": prompt},
+                         {
+                             "type": "image_url",
+                             "image_url": {
+                                 "url": f"data:image/jpeg;base64,{image_data}"
+                             }
+                         }
+                     ]} if image_data else {"role": "user", "content": prompt}
+                 ],
+                 api_key=api_key,
+                 base_url=base_url,
+                 **kwargs,
+             ) if image_data else openai_complete_if_cache(
+                 "gpt-4o-mini",
+                 prompt,
+                 system_prompt=system_prompt,
+                 history_messages=history_messages,
+                 api_key=api_key,
+                 base_url=base_url,
+                 **kwargs,
+             ),
+             embedding_func=lambda texts: openai_embed(
+                 texts,
+                 model="text-embedding-3-large",
+                 api_key=api_key,
+                 base_url=base_url,
+             ),
+             embedding_dim=3072,
+             max_token_size=8192
+         )
+
+         # Process the document
+         await rag.process_document_complete(
+             file_path=file_path,
+             output_dir=output_dir,
+             parse_method="auto"
+         )
+
+         # Example queries
+         queries = [
+             "What is the main content of the document?",
+             "Describe the images and figures in the document",
+             "Tell me about the experimental results and data tables"
+         ]
+
+         print("\nQuerying processed document:")
+         for query in queries:
+             print(f"\nQuery: {query}")
+             result = await rag.query_with_multimodal(query, mode="hybrid")
+             print(f"Answer: {result}")
+
+     except Exception as e:
+         print(f"Error processing with RAG: {str(e)}")
+
+
+ def main():
+     """Main function to run the example"""
+     parser = argparse.ArgumentParser(description='MinerU RAG Example')
+     parser.add_argument('file_path', help='Path to the document to process')
+     parser.add_argument('--working_dir', '-w', default="./rag_storage", help='Working directory path')
+     parser.add_argument('--output', '-o', default="./output", help='Output directory path')
+     parser.add_argument('--api-key', required=True, help='OpenAI API key for RAG processing')
+     parser.add_argument('--base-url', help='Optional base URL for the API')
+
+     args = parser.parse_args()
+
+     # Create the output directory if specified
+     if args.output:
+         os.makedirs(args.output, exist_ok=True)
+
+     # Process with RAG
+     asyncio.run(process_with_rag(
+         args.file_path,
+         args.output,
+         args.api_key,
+         args.base_url,
+         args.working_dir
+     ))
+
+
+ if __name__ == '__main__':
+     main()
lightrag/mineru_parser.py ADDED
@@ -0,0 +1,454 @@
+# type: ignore
+"""
+MinerU Document Parser Utility
+
+This module parses PDF, image and office documents with the MinerU library
+and converts the parsing results into markdown and JSON formats.
+"""
+
+from __future__ import annotations
+
+__all__ = ["MineruParser"]
+
+import os
+import json
+import argparse
+from pathlib import Path
+from typing import Dict, List, Optional, Union, Tuple, Any, cast, TYPE_CHECKING, ClassVar
+
+# Fallback type stubs so the module can be type-checked without magic_pdf installed
+FileBasedDataWriter = Any
+FileBasedDataReader = Any
+PymuDocDataset = Any
+InferResult = Any
+PipeResult = Any
+SupportedPdfParseMethod = Any
+doc_analyze = Any
+read_local_office = Any
+read_local_images = Any
+
+if TYPE_CHECKING:
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
+    from magic_pdf.data.dataset import PymuDocDataset
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.config.enums import SupportedPdfParseMethod
+    from magic_pdf.data.read_api import read_local_office, read_local_images
+else:
+    # MinerU imports
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
+    from magic_pdf.data.dataset import PymuDocDataset
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.config.enums import SupportedPdfParseMethod
+    from magic_pdf.data.read_api import read_local_office, read_local_images
+
+
+class MineruParser:
+    """
+    MinerU document parsing utility class.
+
+    Supports parsing PDF, image and office documents (Word, PPT, etc.),
+    converting the content into structured data and generating markdown
+    and JSON output.
+    """
+
+    __slots__: ClassVar[Tuple[str, ...]] = ()
+
+    def __init__(self) -> None:
+        """Initialize MineruParser"""
+        pass
+
+    @staticmethod
+    def safe_write(
+        writer: Any,
+        content: Union[str, bytes, Dict[str, Any], List[Any]],
+        filename: str,
+    ) -> None:
+        """
+        Safely write content to a file, ensuring the filename is valid.
+
+        Args:
+            writer: The writer object to use
+            content: The content to write
+            filename: The filename to write to
+        """
+        # Most filesystems limit filenames to around 255 characters
+        if len(filename) > 200:
+            # Truncate the filename while keeping the extension
+            base, ext = os.path.splitext(filename)
+            filename = base[:190] + ext  # Leave room for the extension and some margin
+
+        if isinstance(content, str):
+            try:
+                writer.write(content, filename)
+            except TypeError:
+                # The writer expects bytes; encode the string
+                writer.write(content.encode("utf-8"), filename)
+        elif isinstance(content, (dict, list)):
+            # Serialize dict/list content as a JSON string first
+            json_str = json.dumps(content, ensure_ascii=False, indent=4)
+            try:
+                writer.write(json_str, filename)
+            except TypeError:
+                # The writer expects bytes; encode the JSON string
+                writer.write(json_str.encode("utf-8"), filename)
+        else:
+            # Regular content (assumed to be bytes or compatible)
+            writer.write(content, filename)
+
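The filename guard at the top of `safe_write` is easy to verify in isolation. A standalone sketch of the same rule; `truncate_filename` and its limits are illustrative names, not part of the module:

```python
import os

def truncate_filename(filename: str, max_len: int = 200, keep: int = 190) -> str:
    """Mirror safe_write's rule: trim the base name but keep the extension."""
    if len(filename) <= max_len:
        return filename
    base, ext = os.path.splitext(filename)
    return base[:keep] + ext

# A 300-character base name collapses to 190 characters plus ".json"
short = truncate_filename("x" * 300 + ".json")
```

Keeping the extension matters: downstream consumers key on `.md` vs `.json` when picking up the parser's outputs.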
+    @staticmethod
+    def parse_pdf(
+        pdf_path: Union[str, Path],
+        output_dir: Optional[str] = None,
+        use_ocr: bool = False,
+    ) -> Tuple[List[Dict[str, Any]], str]:
+        """
+        Parse PDF document
+
+        Args:
+            pdf_path: Path to the PDF file
+            output_dir: Output directory path
+            use_ocr: Whether to force OCR parsing
+
+        Returns:
+            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
+        """
+        try:
+            # Convert to Path object for easier handling
+            pdf_path = Path(pdf_path)
+            name_without_suff = pdf_path.stem
+
+            # Prepare output directories, keyed by the file name
+            if output_dir:
+                local_md_dir = Path(output_dir) / name_without_suff
+            else:
+                local_md_dir = pdf_path.parent / name_without_suff
+            local_image_dir = local_md_dir / "images"
+            image_dir = local_image_dir.name
+
+            os.makedirs(local_image_dir, exist_ok=True)
+            os.makedirs(local_md_dir, exist_ok=True)
+
+            # Initialize writers and reader
+            image_writer = FileBasedDataWriter(str(local_image_dir))
+            md_writer = FileBasedDataWriter(str(local_md_dir))
+            reader = FileBasedDataReader("")
+
+            # Read PDF bytes and create a dataset instance
+            pdf_bytes = reader.read(str(pdf_path))
+            ds = PymuDocDataset(pdf_bytes)
+
+            # Process based on PDF type and user preference
+            if use_ocr or ds.classify() == SupportedPdfParseMethod.OCR:
+                infer_result = ds.apply(doc_analyze, ocr=True)
+                pipe_result = infer_result.pipe_ocr_mode(image_writer)
+            else:
+                infer_result = ds.apply(doc_analyze, ocr=False)
+                pipe_result = infer_result.pipe_txt_mode(image_writer)
+
+            # Draw visualizations
+            try:
+                infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))
+                pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))
+                pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))
+            except Exception as e:
+                print(f"Warning: Failed to draw visualizations: {str(e)}")
+
+            # Get data using API methods
+            md_content = pipe_result.get_markdown(image_dir)
+            content_list = pipe_result.get_content_list(image_dir)
+
+            # Save files using dump methods (consistent with the API)
+            pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
+            pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
+            pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json")
+
+            # Save model result, writing the file directly to avoid
+            # FileBasedDataWriter str/bytes inconsistencies
+            model_inference_result = infer_result.get_infer_res()
+            json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
+            try:
+                model_file_path = os.path.join(local_md_dir, f"{name_without_suff}_model.json")
+                with open(model_file_path, "w", encoding="utf-8") as f:
+                    f.write(json_str)
+            except Exception as e:
+                print(f"Warning: Failed to save model result using file write: {str(e)}")
+                try:
+                    # Fall back to the writer with bytes encoding
+                    md_writer.write(json_str.encode("utf-8"), f"{name_without_suff}_model.json")
+                except Exception as e2:
+                    print(f"Warning: Failed to save model result using writer: {str(e2)}")
+
+            return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
+
+        except Exception as e:
+            print(f"Error in parse_pdf: {str(e)}")
+            raise
+
+    @staticmethod
+    def parse_office_doc(
+        doc_path: Union[str, Path],
+        output_dir: Optional[str] = None,
+    ) -> Tuple[List[Dict[str, Any]], str]:
+        """
+        Parse office document (Word, PPT, etc.)
+
+        Args:
+            doc_path: Path to the document file
+            output_dir: Output directory path
+
+        Returns:
+            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
+        """
+        try:
+            # Convert to Path object for easier handling
+            doc_path = Path(doc_path)
+            name_without_suff = doc_path.stem
+
+            # Prepare output directories, keyed by the file name
+            if output_dir:
+                local_md_dir = Path(output_dir) / name_without_suff
+            else:
+                local_md_dir = doc_path.parent / name_without_suff
+            local_image_dir = local_md_dir / "images"
+            image_dir = local_image_dir.name
+
+            os.makedirs(local_image_dir, exist_ok=True)
+            os.makedirs(local_md_dir, exist_ok=True)
+
+            # Initialize writers
+            image_writer = FileBasedDataWriter(str(local_image_dir))
+            md_writer = FileBasedDataWriter(str(local_md_dir))
+
+            # Read office document
+            ds = read_local_office(str(doc_path))[0]
+
+            # Run analysis once and reuse the result (the MS-Office example in
+            # the API docs chains these same calls; re-running doc_analyze
+            # would double the work)
+            infer_result = ds.apply(doc_analyze, ocr=True)
+            pipe_result = infer_result.pipe_txt_mode(image_writer)
+
+            # Get data for return values and save outputs
+            md_content = pipe_result.get_markdown(image_dir)
+            content_list = pipe_result.get_content_list(image_dir)
+
+            pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
+            pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
+            pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json")
+
+            # Save model result, writing the file directly to avoid
+            # FileBasedDataWriter str/bytes inconsistencies
+            model_inference_result = infer_result.get_infer_res()
+            json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
+            try:
+                model_file_path = os.path.join(local_md_dir, f"{name_without_suff}_model.json")
+                with open(model_file_path, "w", encoding="utf-8") as f:
+                    f.write(json_str)
+            except Exception as e:
+                print(f"Warning: Failed to save model result using file write: {str(e)}")
+                try:
+                    # Fall back to the writer with bytes encoding
+                    md_writer.write(json_str.encode("utf-8"), f"{name_without_suff}_model.json")
+                except Exception as e2:
+                    print(f"Warning: Failed to save model result using writer: {str(e2)}")
+
+            return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
+
+        except Exception as e:
+            print(f"Error in parse_office_doc: {str(e)}")
+            raise
+
+    @staticmethod
+    def parse_image(
+        image_path: Union[str, Path],
+        output_dir: Optional[str] = None,
+    ) -> Tuple[List[Dict[str, Any]], str]:
+        """
+        Parse image document
+
+        Args:
+            image_path: Path to the image file
+            output_dir: Output directory path
+
+        Returns:
+            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
+        """
+        try:
+            # Convert to Path object for easier handling
+            image_path = Path(image_path)
+            name_without_suff = image_path.stem
+
+            # Prepare output directories, keyed by the file name
+            if output_dir:
+                local_md_dir = Path(output_dir) / name_without_suff
+            else:
+                local_md_dir = image_path.parent / name_without_suff
+            local_image_dir = local_md_dir / "images"
+            image_dir = local_image_dir.name
+
+            os.makedirs(local_image_dir, exist_ok=True)
+            os.makedirs(local_md_dir, exist_ok=True)
+
+            # Initialize writers
+            image_writer = FileBasedDataWriter(str(local_image_dir))
+            md_writer = FileBasedDataWriter(str(local_md_dir))
+
+            # Read image
+            ds = read_local_images(str(image_path))[0]
+
+            # Run analysis once and reuse the result (the Image example in the
+            # API docs chains these same calls; re-running doc_analyze would
+            # double the work)
+            infer_result = ds.apply(doc_analyze, ocr=True)
+            pipe_result = infer_result.pipe_ocr_mode(image_writer)
+
+            # Get data for return values and save outputs
+            md_content = pipe_result.get_markdown(image_dir)
+            content_list = pipe_result.get_content_list(image_dir)
+
+            pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
+            pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
+            pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json")
+
+            # Save model result, writing the file directly to avoid
+            # FileBasedDataWriter str/bytes inconsistencies
+            model_inference_result = infer_result.get_infer_res()
+            json_str = json.dumps(model_inference_result, ensure_ascii=False, indent=4)
+            try:
+                model_file_path = os.path.join(local_md_dir, f"{name_without_suff}_model.json")
+                with open(model_file_path, "w", encoding="utf-8") as f:
+                    f.write(json_str)
+            except Exception as e:
+                print(f"Warning: Failed to save model result using file write: {str(e)}")
+                try:
+                    # Fall back to the writer with bytes encoding
+                    md_writer.write(json_str.encode("utf-8"), f"{name_without_suff}_model.json")
+                except Exception as e2:
+                    print(f"Warning: Failed to save model result using writer: {str(e2)}")
+
+            return cast(Tuple[List[Dict[str, Any]], str], (content_list, md_content))
+
+        except Exception as e:
+            print(f"Error in parse_image: {str(e)}")
+            raise
+
+    @staticmethod
+    def parse_document(
+        file_path: Union[str, Path],
+        parse_method: str = "auto",
+        output_dir: Optional[str] = None,
+        save_results: bool = True,
+    ) -> Tuple[List[Dict[str, Any]], str]:
+        """
+        Parse document using MinerU based on file extension
+
+        Args:
+            file_path: Path to the file to be parsed
+            parse_method: Parsing method, supports "auto", "ocr", "txt"; default is "auto"
+            output_dir: Output directory path; if None, the input file's directory is used
+            save_results: Whether to save parsing results to files
+
+        Returns:
+            Tuple[List[Dict[str, Any]], str]: Tuple containing (content list JSON, Markdown text)
+        """
+        # Convert to Path object
+        file_path = Path(file_path)
+        if not file_path.exists():
+            raise FileNotFoundError(f"File does not exist: {file_path}")
+
+        # Choose the appropriate parser based on the file extension
+        ext = file_path.suffix.lower()
+        if ext == ".pdf":
+            return MineruParser.parse_pdf(file_path, output_dir, use_ocr=(parse_method == "ocr"))
+        elif ext in [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]:
+            return MineruParser.parse_image(file_path, output_dir)
+        elif ext in [".doc", ".docx", ".ppt", ".pptx"]:
+            return MineruParser.parse_office_doc(file_path, output_dir)
+        else:
+            # For unsupported file types, fall back to the PDF parser
+            print(f"Warning: Unsupported file extension '{ext}', trying generic PDF parser")
+            return MineruParser.parse_pdf(file_path, output_dir, use_ocr=(parse_method == "ocr"))
+
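The extension dispatch in `parse_document` can be sketched as a lookup table; `PARSER_BY_EXT` and `choose_parser` are hypothetical names, and unknown extensions fall through to the PDF parser just as the method above does:

```python
from pathlib import Path

# Hypothetical mapping mirroring parse_document's extension dispatch
PARSER_BY_EXT = {
    ".pdf": "parse_pdf",
    **dict.fromkeys([".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"], "parse_image"),
    **dict.fromkeys([".doc", ".docx", ".ppt", ".pptx"], "parse_office_doc"),
}

def choose_parser(path: str) -> str:
    # Unknown extensions fall back to the PDF parser, as in parse_document
    return PARSER_BY_EXT.get(Path(path).suffix.lower(), "parse_pdf")
```

Lower-casing the suffix makes the dispatch case-insensitive, so `slides.PPTX` routes the same way as `slides.pptx`.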
+
+def main():
+    """Run the MinerU parser from the command line"""
+    parser = argparse.ArgumentParser(description="Parse documents using MinerU")
+    parser.add_argument("file_path", help="Path to the document to parse")
+    parser.add_argument("--output", "-o", help="Output directory path")
+    parser.add_argument(
+        "--method", "-m",
+        choices=["auto", "ocr", "txt"],
+        default="auto",
+        help="Parsing method (auto, ocr, txt)",
+    )
+    parser.add_argument("--stats", action="store_true", help="Display content statistics")
+    args = parser.parse_args()
+
+    try:
+        # Parse the document
+        content_list, md_content = MineruParser.parse_document(
+            file_path=args.file_path,
+            parse_method=args.method,
+            output_dir=args.output,
+        )
+
+        # Display statistics if requested
+        if args.stats:
+            print("\nDocument Statistics:")
+            print(f"Total content blocks: {len(content_list)}")
+
+            # Count each type of content
+            content_types = {}
+            for item in content_list:
+                content_type = item.get("type", "unknown")
+                content_types[content_type] = content_types.get(content_type, 0) + 1
+
+            print("\nContent Type Distribution:")
+            for content_type, count in content_types.items():
+                print(f"- {content_type}: {count}")
+    except Exception as e:
+        print(f"Error: {str(e)}")
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    exit(main())
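The `--stats` branch boils down to tallying `type` fields; the same counting with a toy content list in the shape `get_content_list()` returns (`collections.Counter` stands in for the manual dict bookkeeping):

```python
from collections import Counter

# A toy content list in the shape MinerU's get_content_list() returns
content_list = [
    {"type": "text"}, {"type": "image"}, {"type": "text"},
    {"type": "table"}, {"type": "equation"}, {"type": "text"},
]

content_types = Counter(item.get("type", "unknown") for item in content_list)
for content_type, count in content_types.items():
    print(f"- {content_type}: {count}")
```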
lightrag/modalprocessors.py ADDED
@@ -0,0 +1,708 @@
+"""
+Specialized processors for different modalities
+
+Includes:
+- ImageModalProcessor: Specialized processor for image content
+- TableModalProcessor: Specialized processor for table content
+- EquationModalProcessor: Specialized processor for equation content
+- GenericModalProcessor: Processor for other modal content
+"""
+
+import re
+import json
+import time
+import asyncio
+import base64
+from typing import Dict, Any, Tuple, cast
+from pathlib import Path
+
+from lightrag.base import StorageNameSpace
+from lightrag.utils import (
+    logger,
+    compute_mdhash_id,
+)
+from lightrag.lightrag import LightRAG
+from dataclasses import asdict
+from lightrag.kg.shared_storage import get_namespace_data, get_pipeline_status_lock
+
+
+class BaseModalProcessor:
+    """Base class for modal processors"""
+
+    def __init__(self, lightrag: LightRAG, modal_caption_func):
+        """Initialize base processor
+
+        Args:
+            lightrag: LightRAG instance
+            modal_caption_func: Function for generating descriptions
+        """
+        self.lightrag = lightrag
+        self.modal_caption_func = modal_caption_func
+
+        # Use LightRAG's storage instances
+        self.text_chunks_db = lightrag.text_chunks
+        self.chunks_vdb = lightrag.chunks_vdb
+        self.entities_vdb = lightrag.entities_vdb
+        self.relationships_vdb = lightrag.relationships_vdb
+        self.knowledge_graph_inst = lightrag.chunk_entity_relation_graph
+
+        # Use LightRAG's configuration and functions
+        self.embedding_func = lightrag.embedding_func
+        self.llm_model_func = lightrag.llm_model_func
+        self.global_config = asdict(lightrag)
+        self.hashing_kv = lightrag.llm_response_cache
+        self.tokenizer = lightrag.tokenizer
+
+    async def process_multimodal_content(
+        self,
+        modal_content,
+        content_type: str,
+        file_path: str = "manual_creation",
+        entity_name: str = None,
+    ) -> Tuple[str, Dict[str, Any]]:
+        """Process multimodal content; subclasses implement the specific logic"""
+        raise NotImplementedError("Subclasses must implement this method")
+
+    async def _create_entity_and_chunk(
+        self, modal_chunk: str, entity_info: Dict[str, Any], file_path: str
+    ) -> Tuple[str, Dict[str, Any]]:
+        """Create entity and text chunk"""
+        # Create chunk
+        chunk_id = compute_mdhash_id(str(modal_chunk), prefix="chunk-")
+        tokens = len(self.tokenizer.encode(modal_chunk))
+
+        chunk_data = {
+            "tokens": tokens,
+            "content": modal_chunk,
+            "chunk_order_index": 0,
+            "full_doc_id": chunk_id,
+            "file_path": file_path,
+        }
+
+        # Store chunk
+        await self.text_chunks_db.upsert({chunk_id: chunk_data})
+
+        # Create entity node
+        node_data = {
+            "entity_id": entity_info["entity_name"],
+            "entity_type": entity_info["entity_type"],
+            "description": entity_info["summary"],
+            "source_id": chunk_id,
+            "file_path": file_path,
+            "created_at": int(time.time()),
+        }
+        await self.knowledge_graph_inst.upsert_node(entity_info["entity_name"], node_data)
+
+        # Insert entity into vector database
+        entity_vdb_data = {
+            compute_mdhash_id(entity_info["entity_name"], prefix="ent-"): {
+                "entity_name": entity_info["entity_name"],
+                "entity_type": entity_info["entity_type"],
+                "content": f"{entity_info['entity_name']}\n{entity_info['summary']}",
+                "source_id": chunk_id,
+                "file_path": file_path,
+            }
+        }
+        await self.entities_vdb.upsert(entity_vdb_data)
+
+        # Process entity and relationship extraction
+        await self._process_chunk_for_extraction(chunk_id, entity_info["entity_name"])
+
+        # Ensure all storage updates are complete
+        await self._insert_done()
+
+        return entity_info["summary"], {
+            "entity_name": entity_info["entity_name"],
+            "entity_type": entity_info["entity_type"],
+            "description": entity_info["summary"],
+            "chunk_id": chunk_id,
+        }
+
+    async def _process_chunk_for_extraction(self, chunk_id: str, modal_entity_name: str):
+        """Process chunk for entity and relationship extraction"""
+        chunk_data = await self.text_chunks_db.get_by_id(chunk_id)
+        if not chunk_data:
+            logger.error(f"Chunk {chunk_id} not found")
+            return
+
+        # Create text chunk for vector database
+        chunk_vdb_data = {
+            chunk_id: {
+                "content": chunk_data["content"],
+                "full_doc_id": chunk_id,
+                "tokens": chunk_data["tokens"],
+                "chunk_order_index": chunk_data["chunk_order_index"],
+                "file_path": chunk_data["file_path"],
+            }
+        }
+        await self.chunks_vdb.upsert(chunk_vdb_data)
+
+        # Trigger extraction process
+        from lightrag.operate import extract_entities, merge_nodes_and_edges
+
+        pipeline_status = await get_namespace_data("pipeline_status")
+        pipeline_status_lock = get_pipeline_status_lock()
+
+        # Prepare chunk for extraction
+        chunks = {chunk_id: chunk_data}
+
+        # Extract entities and relationships
+        chunk_results = await extract_entities(
+            chunks=chunks,
+            global_config=self.global_config,
+            pipeline_status=pipeline_status,
+            pipeline_status_lock=pipeline_status_lock,
+            llm_response_cache=self.hashing_kv,
+        )
+
+        # Add "belongs_to" relationships for all extracted entities
+        for maybe_nodes, _ in chunk_results:
+            for entity_name in maybe_nodes.keys():
+                if entity_name == modal_entity_name:
+                    continue  # Skip self-relationship
+
+                # Create belongs_to relationship
+                relation_data = {
+                    "description": f"Entity {entity_name} belongs to {modal_entity_name}",
+                    "keywords": "belongs_to,part_of,contained_in",
+                    "source_id": chunk_id,
+                    "weight": 10.0,
+                    "file_path": chunk_data.get("file_path", "manual_creation"),
+                }
+                await self.knowledge_graph_inst.upsert_edge(
+                    entity_name, modal_entity_name, relation_data
+                )
+
+                relation_id = compute_mdhash_id(entity_name + modal_entity_name, prefix="rel-")
+                relation_vdb_data = {
+                    relation_id: {
+                        "src_id": entity_name,
+                        "tgt_id": modal_entity_name,
+                        "keywords": relation_data["keywords"],
+                        "content": f"{relation_data['keywords']}\t{entity_name}\n{modal_entity_name}\n{relation_data['description']}",
+                        "source_id": chunk_id,
+                        "file_path": chunk_data.get("file_path", "manual_creation"),
+                    }
+                }
+                await self.relationships_vdb.upsert(relation_vdb_data)
+
+        await merge_nodes_and_edges(
+            chunk_results=chunk_results,
+            knowledge_graph_inst=self.knowledge_graph_inst,
+            entity_vdb=self.entities_vdb,
+            relationships_vdb=self.relationships_vdb,
+            global_config=self.global_config,
+            pipeline_status=pipeline_status,
+            pipeline_status_lock=pipeline_status_lock,
+            llm_response_cache=self.hashing_kv,
+        )
+
+    async def _insert_done(self) -> None:
+        await asyncio.gather(*[
+            cast(StorageNameSpace, storage_inst).index_done_callback()
+            for storage_inst in [
+                self.text_chunks_db,
+                self.chunks_vdb,
+                self.entities_vdb,
+                self.relationships_vdb,
+                self.knowledge_graph_inst,
+            ]
+        ])
+
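`compute_mdhash_id` is used throughout to derive the stable `chunk-`/`ent-`/`rel-` keys. A minimal sketch of that pattern, assuming an MD5-hex implementation (`mdhash_id` is an illustrative name; LightRAG's actual helper may differ in detail):

```python
from hashlib import md5

def mdhash_id(content: str, prefix: str = "") -> str:
    # Deterministic ID: identical content always maps to the same key
    return prefix + md5(content.encode()).hexdigest()

chunk_id = mdhash_id("Image Content Analysis: ...", prefix="chunk-")
```

Content-derived keys make the upserts above idempotent: re-processing the same chunk overwrites the existing record instead of duplicating it.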
+
+class ImageModalProcessor(BaseModalProcessor):
+    """Processor specialized for image content"""
+
+    def __init__(self, lightrag: LightRAG, modal_caption_func):
+        """Initialize image processor
+
+        Args:
+            lightrag: LightRAG instance
+            modal_caption_func: Function for generating descriptions (supporting image understanding)
+        """
+        super().__init__(lightrag, modal_caption_func)
+
+    def _encode_image_to_base64(self, image_path: str) -> str:
+        """Encode image to base64"""
+        try:
+            with open(image_path, "rb") as image_file:
+                return base64.b64encode(image_file.read()).decode("utf-8")
+        except Exception as e:
+            logger.error(f"Failed to encode image {image_path}: {e}")
+            return ""
+
+    async def process_multimodal_content(
+        self,
+        modal_content,
+        content_type: str,
+        file_path: str = "manual_creation",
+        entity_name: str = None,
+    ) -> Tuple[str, Dict[str, Any]]:
+        """Process image content"""
+        try:
+            # Parse image content
+            if isinstance(modal_content, str):
+                try:
+                    content_data = json.loads(modal_content)
+                except json.JSONDecodeError:
+                    content_data = {"description": modal_content}
+            else:
+                content_data = modal_content
+
+            image_path = content_data.get("img_path")
+            captions = content_data.get("img_caption", [])
+            footnotes = content_data.get("img_footnote", [])
+
+            # Build detailed visual analysis prompt
+            vision_prompt = f"""Please analyze this image in detail and provide a JSON response with the following structure:
+
+{{
+    "detailed_description": "A comprehensive and detailed visual description of the image following these guidelines:
+    - Describe the overall composition and layout
+    - Identify all objects, people, text, and visual elements
+    - Explain relationships between elements
+    - Note colors, lighting, and visual style
+    - Describe any actions or activities shown
+    - Include technical details if relevant (charts, diagrams, etc.)
+    - Always use specific names instead of pronouns",
+    "entity_info": {{
+        "entity_name": "{entity_name if entity_name else 'unique descriptive name for this image'}",
+        "entity_type": "image",
+        "summary": "concise summary of the image content and its significance (max 100 words)"
+    }}
+}}
+
+Additional context:
+- Image Path: {image_path}
+- Captions: {captions if captions else 'None'}
+- Footnotes: {footnotes if footnotes else 'None'}
+
+Focus on providing accurate, detailed visual analysis that would be useful for knowledge retrieval."""
+
+            # If an image path exists, try to encode the image
+            image_base64 = ""
+            if image_path and Path(image_path).exists():
+                image_base64 = self._encode_image_to_base64(image_path)
+
+            # Call the vision model
+            if image_base64:
+                # Use the actual image for analysis
+                response = await self.modal_caption_func(
+                    vision_prompt,
+                    image_data=image_base64,
+                    system_prompt="You are an expert image analyst. Provide detailed, accurate descriptions.",
+                )
+            else:
+                # Fall back to analysis based on the available text information
+                text_prompt = f"""Based on the following image information, provide analysis:
+
+Image Path: {image_path}
+Captions: {captions}
+Footnotes: {footnotes}
+
+{vision_prompt}"""
+                response = await self.modal_caption_func(
+                    text_prompt,
+                    system_prompt="You are an expert image analyst. Provide detailed analysis based on available information.",
+                )
+
+            # Parse response
+            enhanced_caption, entity_info = self._parse_response(response, entity_name)
+
+            # Build complete image content
+            modal_chunk = f"""
+Image Content Analysis:
+Image Path: {image_path}
+Captions: {', '.join(captions) if captions else 'None'}
+Footnotes: {', '.join(footnotes) if footnotes else 'None'}
+
+Visual Analysis: {enhanced_caption}"""
+
+            return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
+
+        except Exception as e:
+            logger.error(f"Error processing image content: {e}")
+            # Fallback processing
+            fallback_entity = {
+                "entity_name": entity_name
+                if entity_name
+                else f"image_{compute_mdhash_id(str(modal_content))}",
+                "entity_type": "image",
+                "summary": f"Image content: {str(modal_content)[:100]}",
+            }
+            return str(modal_content), fallback_entity
+
+    def _parse_response(self, response: str, entity_name: str = None) -> Tuple[str, Dict[str, Any]]:
+        """Parse model response"""
+        try:
+            response_data = json.loads(re.search(r"\{.*\}", response, re.DOTALL).group(0))
+
+            description = response_data.get("detailed_description", "")
+            entity_data = response_data.get("entity_info", {})
+
+            if not description or not entity_data:
+                raise ValueError("Missing required fields in response")
+            if not all(key in entity_data for key in ["entity_name", "entity_type", "summary"]):
+                raise ValueError("Missing required fields in entity_info")
+
+            # Suffix the type onto the name unless the caller supplied one
+            entity_data["entity_name"] = entity_data["entity_name"] + f" ({entity_data['entity_type']})"
+            if entity_name:
+                entity_data["entity_name"] = entity_name
+
+            return description, entity_data
+
+        except (json.JSONDecodeError, AttributeError, ValueError) as e:
+            logger.error(f"Error parsing image analysis response: {e}")
+            fallback_entity = {
+                "entity_name": entity_name if entity_name else f"image_{compute_mdhash_id(response)}",
+                "entity_type": "image",
+                "summary": response[:100] + "..." if len(response) > 100 else response,
+            }
+            return response, fallback_entity
+
+
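Both `_parse_response` and `_parse_table_response` recover structured output by grabbing the first `{...}` span from the model reply and falling back on any parse error. The same pattern in isolation (`extract_json` is an illustrative name):

```python
import json
import re
from typing import Optional

def extract_json(response: str) -> Optional[dict]:
    """Pull the outermost {...} span out of an LLM reply; None if unusable."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

data = extract_json('Sure! {"entity_info": {"entity_type": "table"}}')
```

The greedy `\{.*\}` tolerates chatter before and after the JSON, which is why the processors can survive prose-wrapped replies; genuinely malformed JSON still routes to the fallback entity.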
396
+class TableModalProcessor(BaseModalProcessor):
+    """Processor specialized for table content"""
+
+    async def process_multimodal_content(
+        self,
+        modal_content,
+        content_type: str,
+        file_path: str = "manual_creation",
+        entity_name: str = None,
+    ) -> Tuple[str, Dict[str, Any]]:
+        """Process table content"""
+        # Parse table content
+        if isinstance(modal_content, str):
+            try:
+                content_data = json.loads(modal_content)
+            except json.JSONDecodeError:
+                content_data = {"table_body": modal_content}
+        else:
+            content_data = modal_content
+
+        table_img_path = content_data.get("img_path")
+        table_caption = content_data.get("table_caption", [])
+        table_body = content_data.get("table_body", "")
+        table_footnote = content_data.get("table_footnote", [])
+
+        # Build table analysis prompt
+        table_prompt = f"""Please analyze this table content and provide a JSON response with the following structure:
+
+{{
+    "detailed_description": "A comprehensive analysis of the table including:
+    - Table structure and organization
+    - Column headers and their meanings
+    - Key data points and patterns
+    - Statistical insights and trends
+    - Relationships between data elements
+    - Significance of the data presented
+    Always use specific names and values instead of general references.",
+    "entity_info": {{
+        "entity_name": "{entity_name if entity_name else 'descriptive name for this table'}",
+        "entity_type": "table",
+        "summary": "concise summary of the table's purpose and key findings (max 100 words)"
+    }}
+}}
+
+Table Information:
+Image Path: {table_img_path}
+Caption: {table_caption if table_caption else 'None'}
+Body: {table_body}
+Footnotes: {table_footnote if table_footnote else 'None'}
+
+Focus on extracting meaningful insights and relationships from the tabular data."""
+
+        response = await self.modal_caption_func(
+            table_prompt,
+            system_prompt="You are an expert data analyst. Provide detailed table analysis with specific insights.",
+        )
+
+        # Parse response
+        enhanced_caption, entity_info = self._parse_table_response(response, entity_name)
+
+        # TODO: add a retry mechanism
+
+        # Build complete table content
+        modal_chunk = f"""Table Analysis:
+Image Path: {table_img_path}
+Caption: {', '.join(table_caption) if table_caption else 'None'}
+Structure: {table_body}
+Footnotes: {', '.join(table_footnote) if table_footnote else 'None'}
+
+Analysis: {enhanced_caption}"""
+
+        return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
+
+    def _parse_table_response(
+        self, response: str, entity_name: str = None
+    ) -> Tuple[str, Dict[str, Any]]:
+        """Parse table analysis response"""
+        try:
+            response_data = json.loads(re.search(r"\{.*\}", response, re.DOTALL).group(0))
+
+            description = response_data.get("detailed_description", "")
+            entity_data = response_data.get("entity_info", {})
+
+            if not description or not entity_data:
+                raise ValueError("Missing required fields in response")
+
+            if not all(key in entity_data for key in ["entity_name", "entity_type", "summary"]):
+                raise ValueError("Missing required fields in entity_info")
+
+            entity_data["entity_name"] = entity_data["entity_name"] + f" ({entity_data['entity_type']})"
+            if entity_name:
+                entity_data["entity_name"] = entity_name
+
+            return description, entity_data
+
+        except (json.JSONDecodeError, AttributeError, ValueError) as e:
+            logger.error(f"Error parsing table analysis response: {e}")
+            fallback_entity = {
+                "entity_name": entity_name if entity_name else f"table_{compute_mdhash_id(response)}",
+                "entity_type": "table",
+                "summary": response[:100] + "..." if len(response) > 100 else response,
+            }
+            return response, fallback_entity
+
+
+class EquationModalProcessor(BaseModalProcessor):
+    """Processor specialized for equation content"""
+
+    async def process_multimodal_content(
+        self,
+        modal_content,
+        content_type: str,
+        file_path: str = "manual_creation",
+        entity_name: str = None,
+    ) -> Tuple[str, Dict[str, Any]]:
+        """Process equation content"""
+        # Parse equation content
+        if isinstance(modal_content, str):
+            try:
+                content_data = json.loads(modal_content)
+            except json.JSONDecodeError:
+                content_data = {"equation": modal_content}
+        else:
+            content_data = modal_content
+
+        equation_text = content_data.get("text")
+        equation_format = content_data.get("text_format", "")
+
+        # Build equation analysis prompt
+        equation_prompt = f"""Please analyze this mathematical equation and provide a JSON response with the following structure:
+
+{{
+    "detailed_description": "A comprehensive analysis of the equation including:
+    - Mathematical meaning and interpretation
+    - Variables and their definitions
+    - Mathematical operations and functions used
+    - Application domain and context
+    - Physical or theoretical significance
+    - Relationship to other mathematical concepts
+    - Practical applications or use cases
+    Always use specific mathematical terminology.",
+    "entity_info": {{
+        "entity_name": "{entity_name if entity_name else 'descriptive name for this equation'}",
+        "entity_type": "equation",
+        "summary": "concise summary of the equation's purpose and significance (max 100 words)"
+    }}
+}}
+
+Equation Information:
+Equation: {equation_text}
+Format: {equation_format}
+
+Focus on providing mathematical insights and explaining the equation's significance."""
+
+        response = await self.modal_caption_func(
+            equation_prompt,
+            system_prompt="You are an expert mathematician. Provide detailed mathematical analysis.",
+        )
+
+        # Parse response
+        enhanced_caption, entity_info = self._parse_equation_response(response, entity_name)
+
+        # Build complete equation content
+        modal_chunk = f"""Mathematical Equation Analysis:
+Equation: {equation_text}
+Format: {equation_format}
+
+Mathematical Analysis: {enhanced_caption}"""
+
+        return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
+
+    def _parse_equation_response(
+        self, response: str, entity_name: str = None
+    ) -> Tuple[str, Dict[str, Any]]:
+        """Parse equation analysis response"""
+        try:
+            response_data = json.loads(re.search(r"\{.*\}", response, re.DOTALL).group(0))
+
+            description = response_data.get("detailed_description", "")
+            entity_data = response_data.get("entity_info", {})
+
+            if not description or not entity_data:
+                raise ValueError("Missing required fields in response")
+
+            if not all(key in entity_data for key in ["entity_name", "entity_type", "summary"]):
+                raise ValueError("Missing required fields in entity_info")
+
+            entity_data["entity_name"] = entity_data["entity_name"] + f" ({entity_data['entity_type']})"
+            if entity_name:
+                entity_data["entity_name"] = entity_name
+
+            return description, entity_data
+
+        except (json.JSONDecodeError, AttributeError, ValueError) as e:
+            logger.error(f"Error parsing equation analysis response: {e}")
+            fallback_entity = {
+                "entity_name": entity_name if entity_name else f"equation_{compute_mdhash_id(response)}",
+                "entity_type": "equation",
+                "summary": response[:100] + "..." if len(response) > 100 else response,
+            }
+            return response, fallback_entity
+
+
+class GenericModalProcessor(BaseModalProcessor):
+    """Generic processor for other types of modal content"""
+
+    async def process_multimodal_content(
+        self,
+        modal_content,
+        content_type: str,
+        file_path: str = "manual_creation",
+        entity_name: str = None,
+    ) -> Tuple[str, Dict[str, Any]]:
+        """Process generic modal content"""
+        # Build generic analysis prompt
+        generic_prompt = f"""Please analyze this {content_type} content and provide a JSON response with the following structure:
+
+{{
+    "detailed_description": "A comprehensive analysis of the content including:
+    - Content structure and organization
+    - Key information and elements
+    - Relationships between components
+    - Context and significance
+    - Relevant details for knowledge retrieval
+    Always use specific terminology appropriate for {content_type} content.",
+    "entity_info": {{
+        "entity_name": "{entity_name if entity_name else f'descriptive name for this {content_type}'}",
+        "entity_type": "{content_type}",
+        "summary": "concise summary of the content's purpose and key points (max 100 words)"
+    }}
+}}
+
+Content: {str(modal_content)}
+
+Focus on extracting meaningful information that would be useful for knowledge retrieval."""
+
+        response = await self.modal_caption_func(
+            generic_prompt,
+            system_prompt=f"You are an expert content analyst specializing in {content_type} content.",
+        )
+
+        # Parse response
+        enhanced_caption, entity_info = self._parse_generic_response(response, entity_name, content_type)
+
+        # Build complete content
+        modal_chunk = f"""{content_type.title()} Content Analysis:
+Content: {str(modal_content)}
+
+Analysis: {enhanced_caption}"""
+
+        return await self._create_entity_and_chunk(modal_chunk, entity_info, file_path)
+
+    def _parse_generic_response(
+        self, response: str, entity_name: str = None, content_type: str = "content"
+    ) -> Tuple[str, Dict[str, Any]]:
+        """Parse generic analysis response"""
+        try:
+            response_data = json.loads(re.search(r"\{.*\}", response, re.DOTALL).group(0))
+
+            description = response_data.get("detailed_description", "")
+            entity_data = response_data.get("entity_info", {})
+
+            if not description or not entity_data:
+                raise ValueError("Missing required fields in response")
+
+            if not all(key in entity_data for key in ["entity_name", "entity_type", "summary"]):
+                raise ValueError("Missing required fields in entity_info")
+
+            entity_data["entity_name"] = entity_data["entity_name"] + f" ({entity_data['entity_type']})"
+            if entity_name:
+                entity_data["entity_name"] = entity_name
+
+            return description, entity_data
+
+        except (json.JSONDecodeError, AttributeError, ValueError) as e:
+            logger.error(f"Error parsing generic analysis response: {e}")
+            fallback_entity = {
+                "entity_name": entity_name if entity_name else f"{content_type}_{compute_mdhash_id(response)}",
+                "entity_type": content_type,
+                "summary": response[:100] + "..." if len(response) > 100 else response,
+            }
+            return response, fallback_entity
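All four processors share the same response-handling convention: grab the first `{...}` block from the model's reply with a greedy `re.search`, validate the required fields, and fall back to a stub entity when parsing fails. The extraction step can be exercised standalone; a minimal sketch (the helper name `extract_json_payload` is illustrative, not part of the module, and it returns `None` instead of raising):

```python
import json
import re
from typing import Optional


def extract_json_payload(response: str) -> Optional[dict]:
    """Return the first {...} object embedded in a model reply, or None."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# The greedy, DOTALL pattern tolerates prose and code fences around the JSON
reply = 'Here you go:\n```json\n{"entity_info": {"entity_name": "t1", "entity_type": "table"}}\n```'
payload = extract_json_payload(reply)
```

Because `.*` is greedy, a reply containing several separate objects would be captured from the first `{` to the last `}` and may fail to parse, which is exactly the case the processors' fallback branch covers.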
lightrag/raganything.py ADDED
@@ -0,0 +1,632 @@
+"""
+Complete MinerU parsing + multimodal content insertion pipeline
+
+This module integrates:
+1. MinerU document parsing
+2. Plain-text content insertion into LightRAG
+3. Specialized processing for multimodal content (using dedicated processors)
+"""
+
+import os
+import asyncio
+import logging
+from pathlib import Path
+from typing import Dict, List, Any, Tuple, Optional, Callable
+import sys
+
+# Add project root directory to Python path
+sys.path.insert(0, str(Path(__file__).parent.parent))
+
+from lightrag import LightRAG, QueryParam
+from lightrag.utils import EmbeddingFunc, setup_logger
+
+# Import parser and multimodal processors
+from lightrag.mineru_parser import MineruParser
+
+# Import specialized processors
+from lightrag.modalprocessors import (
+    ImageModalProcessor,
+    TableModalProcessor,
+    EquationModalProcessor,
+    GenericModalProcessor,
+)
+
+
+class RAGAnything:
+    """Multimodal document processing pipeline: complete document parsing and insertion"""
+
+    def __init__(
+        self,
+        lightrag: Optional[LightRAG] = None,
+        llm_model_func: Optional[Callable] = None,
+        vision_model_func: Optional[Callable] = None,
+        embedding_func: Optional[Callable] = None,
+        working_dir: str = "./rag_storage",
+        embedding_dim: int = 3072,
+        max_token_size: int = 8192,
+    ):
+        """
+        Initialize the multimodal document processing pipeline
+
+        Args:
+            lightrag: Optional pre-initialized LightRAG instance
+            llm_model_func: LLM model function for text analysis
+            vision_model_func: Vision model function for image analysis
+            embedding_func: Embedding function for text vectorization
+            working_dir: Working directory for storage (used when creating a new RAG instance)
+            embedding_dim: Embedding dimension (used when creating a new RAG instance)
+            max_token_size: Maximum token size for embeddings (used when creating a new RAG instance)
+        """
+        self.working_dir = working_dir
+        self.llm_model_func = llm_model_func
+        self.vision_model_func = vision_model_func
+        self.embedding_func = embedding_func
+        self.embedding_dim = embedding_dim
+        self.max_token_size = max_token_size
+
+        # Set up logging
+        setup_logger("RAGAnything")
+        self.logger = logging.getLogger("RAGAnything")
+
+        # Create working directory if needed
+        os.makedirs(working_dir, exist_ok=True)
+
+        # Use the provided LightRAG or mark it for later initialization
+        self.lightrag = lightrag
+        self.modal_processors = {}
+
+        # If a LightRAG instance is provided, initialize processors immediately
+        if self.lightrag is not None:
+            self._initialize_processors()
+
+    def _initialize_processors(self):
+        """Initialize multimodal processors with the appropriate model functions"""
+        if self.lightrag is None:
+            raise ValueError("LightRAG instance must be initialized before creating processors")
+
+        # Create the different multimodal processors
+        self.modal_processors = {
+            "image": ImageModalProcessor(
+                lightrag=self.lightrag,
+                modal_caption_func=self.vision_model_func or self.llm_model_func,
+            ),
+            "table": TableModalProcessor(
+                lightrag=self.lightrag,
+                modal_caption_func=self.llm_model_func,
+            ),
+            "equation": EquationModalProcessor(
+                lightrag=self.lightrag,
+                modal_caption_func=self.llm_model_func,
+            ),
+            "generic": GenericModalProcessor(
+                lightrag=self.lightrag,
+                modal_caption_func=self.llm_model_func,
+            ),
+        }
+
+        self.logger.info("Multimodal processors initialized")
+        self.logger.info(f"Available processors: {list(self.modal_processors.keys())}")
+
+    async def _ensure_lightrag_initialized(self):
+        """Ensure the LightRAG instance is initialized, creating it if necessary"""
+        if self.lightrag is not None:
+            return
+
+        # Validate required functions
+        if self.llm_model_func is None:
+            raise ValueError("llm_model_func must be provided when LightRAG is not pre-initialized")
+        if self.embedding_func is None:
+            raise ValueError("embedding_func must be provided when LightRAG is not pre-initialized")
+
+        from lightrag.kg.shared_storage import initialize_pipeline_status
+
+        # Create a LightRAG instance with the provided functions
+        self.lightrag = LightRAG(
+            working_dir=self.working_dir,
+            llm_model_func=self.llm_model_func,
+            embedding_func=EmbeddingFunc(
+                embedding_dim=self.embedding_dim,
+                max_token_size=self.max_token_size,
+                func=self.embedding_func,
+            ),
+        )
+
+        await self.lightrag.initialize_storages()
+        await initialize_pipeline_status()
+
+        # Initialize processors after LightRAG is ready
+        self._initialize_processors()
+
+        self.logger.info("LightRAG and multimodal processors initialized")
+
+    def parse_document(
+        self,
+        file_path: str,
+        output_dir: str = "./output",
+        parse_method: str = "auto",
+        display_stats: bool = True,
+    ) -> Tuple[List[Dict[str, Any]], str]:
+        """
+        Parse a document using MinerU
+
+        Args:
+            file_path: Path to the file to parse
+            output_dir: Output directory
+            parse_method: Parse method ("auto", "ocr", or "txt")
+            display_stats: Whether to display content statistics
+
+        Returns:
+            (content_list, md_content): Content list and markdown text
+        """
+        self.logger.info(f"Starting document parsing: {file_path}")
+
+        file_path = Path(file_path)
+        if not file_path.exists():
+            raise FileNotFoundError(f"File not found: {file_path}")
+
+        # Choose the appropriate parsing method based on the file extension
+        ext = file_path.suffix.lower()
+
+        try:
+            if ext == ".pdf":
+                self.logger.info(f"Detected PDF file, using PDF parser (OCR={parse_method == 'ocr'})...")
+                content_list, md_content = MineruParser.parse_pdf(
+                    file_path, output_dir, use_ocr=(parse_method == "ocr")
+                )
+            elif ext in [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]:
+                self.logger.info("Detected image file, using image parser...")
+                content_list, md_content = MineruParser.parse_image(file_path, output_dir)
+            elif ext in [".doc", ".docx", ".ppt", ".pptx"]:
+                self.logger.info("Detected Office document, using Office parser...")
+                content_list, md_content = MineruParser.parse_office_doc(file_path, output_dir)
+            else:
+                # For other or unknown formats, use the generic parser
+                self.logger.info(f"Using generic parser for {ext} file (method={parse_method})...")
+                content_list, md_content = MineruParser.parse_document(
+                    file_path, parse_method=parse_method, output_dir=output_dir
+                )
+
+        except Exception as e:
+            self.logger.error(f"Error during parsing with specific parser: {str(e)}")
+            self.logger.warning("Falling back to generic parser...")
+            # If the specific parser fails, fall back to the generic parser
+            content_list, md_content = MineruParser.parse_document(
+                file_path, parse_method=parse_method, output_dir=output_dir
+            )
+
+        self.logger.info(f"Parsing complete! Extracted {len(content_list)} content blocks")
+        self.logger.info(f"Markdown text length: {len(md_content)} characters")
+
+        # Display content statistics if requested
+        if display_stats:
+            self.logger.info("\nContent Information:")
+            self.logger.info(f"* Total blocks in content_list: {len(content_list)}")
+            self.logger.info(f"* Markdown content length: {len(md_content)} characters")
+
+            # Count elements by type
+            block_types: Dict[str, int] = {}
+            for block in content_list:
+                if isinstance(block, dict):
+                    block_type = block.get("type", "unknown")
+                    if isinstance(block_type, str):
+                        block_types[block_type] = block_types.get(block_type, 0) + 1
+
+            self.logger.info("* Content block types:")
+            for block_type, count in block_types.items():
+                self.logger.info(f"  - {block_type}: {count}")
+
+        return content_list, md_content
+
+    def _separate_content(
+        self, content_list: List[Dict[str, Any]]
+    ) -> Tuple[str, List[Dict[str, Any]]]:
+        """
+        Separate text content from multimodal content
+
+        Args:
+            content_list: Content list from MinerU parsing
+
+        Returns:
+            (text_content, multimodal_items): Plain text content and list of multimodal items
+        """
+        text_parts = []
+        multimodal_items = []
+
+        for item in content_list:
+            content_type = item.get("type", "text")
+
+            if content_type == "text":
+                # Text content
+                text = item.get("text", "")
+                if text.strip():
+                    text_parts.append(text)
+            else:
+                # Multimodal content (image, table, equation, etc.)
+                multimodal_items.append(item)
+
+        # Merge all text content
+        text_content = "\n\n".join(text_parts)
+
+        self.logger.info("Content separation complete:")
+        self.logger.info(f"  - Text content length: {len(text_content)} characters")
+        self.logger.info(f"  - Multimodal items count: {len(multimodal_items)}")
+
+        # Count multimodal types
+        modal_types = {}
+        for item in multimodal_items:
+            modal_type = item.get("type", "unknown")
+            modal_types[modal_type] = modal_types.get(modal_type, 0) + 1
+
+        if modal_types:
+            self.logger.info(f"  - Multimodal type distribution: {modal_types}")
+
+        return text_content, multimodal_items
+
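The split above can be exercised standalone; a minimal sketch with a toy MinerU-style content list (the block shapes are illustrative of what MinerU emits, not exhaustive):

```python
def separate_content(content_list):
    """Split MinerU-style blocks into merged text and multimodal items."""
    text_parts, multimodal_items = [], []
    for item in content_list:
        if item.get("type", "text") == "text":
            text = item.get("text", "")
            if text.strip():  # whitespace-only text blocks are dropped
                text_parts.append(text)
        else:
            multimodal_items.append(item)
    return "\n\n".join(text_parts), multimodal_items


blocks = [
    {"type": "text", "text": "Introduction"},
    {"type": "image", "img_path": "images/fig1.png"},
    {"type": "text", "text": "   "},
    {"type": "table", "table_body": "|a|b|"},
]
text_content, multimodal_items = separate_content(blocks)
```

Text blocks are merged with blank lines for a single `ainsert` call, while each non-text block is routed to its own modal processor downstream.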
+    async def _insert_text_content(
+        self,
+        input: str | list[str],
+        split_by_character: str | None = None,
+        split_by_character_only: bool = False,
+        ids: str | list[str] | None = None,
+        file_paths: str | list[str] | None = None,
+    ):
+        """
+        Insert plain text content into LightRAG
+
+        Args:
+            input: Single document string or list of document strings
+            split_by_character: If not None, split the string by this character; chunks longer than
+                chunk_token_size are split again by token size.
+            split_by_character_only: If True, split the string by character only; ignored when
+                split_by_character is None.
+            ids: Single document ID or list of unique document IDs; if not provided, MD5 hash IDs are generated
+            file_paths: Single file path or list of file paths, used for citation
+        """
+        self.logger.info("Starting text content insertion into LightRAG...")
+
+        # Use LightRAG's insert method with all parameters
+        await self.lightrag.ainsert(
+            input=input,
+            file_paths=file_paths,
+            split_by_character=split_by_character,
+            split_by_character_only=split_by_character_only,
+            ids=ids,
+        )
+
+        self.logger.info("Text content insertion complete")
+
+    async def _process_multimodal_content(
+        self, multimodal_items: List[Dict[str, Any]], file_path: str
+    ):
+        """
+        Process multimodal content (using the specialized processors)
+
+        Args:
+            multimodal_items: List of multimodal items
+            file_path: File path (for reference)
+        """
+        if not multimodal_items:
+            self.logger.debug("No multimodal content to process")
+            return
+
+        self.logger.info("Starting multimodal content processing...")
+
+        file_name = os.path.basename(file_path)
+
+        for i, item in enumerate(multimodal_items):
+            try:
+                content_type = item.get("type", "unknown")
+                self.logger.info(f"Processing item {i + 1}/{len(multimodal_items)}: {content_type} content")
+
+                # Select the appropriate processor
+                processor = self._get_processor_for_type(content_type)
+
+                if processor:
+                    enhanced_caption, entity_info = await processor.process_multimodal_content(
+                        modal_content=item,
+                        content_type=content_type,
+                        file_path=file_name,
+                    )
+                    self.logger.info(f"{content_type} processing complete: {entity_info.get('entity_name', 'Unknown')}")
+                else:
+                    self.logger.warning(f"No suitable processor found for {content_type} content")
+
+            except Exception as e:
+                self.logger.error(f"Error processing multimodal content: {str(e)}")
+                self.logger.debug("Exception details:", exc_info=True)
+                continue
+
+        self.logger.info("Multimodal content processing complete")
+
+    def _get_processor_for_type(self, content_type: str):
+        """
+        Get the appropriate processor for a content type
+
+        Args:
+            content_type: Content type
+
+        Returns:
+            The corresponding processor instance
+        """
+        # Map known types directly; everything else falls back to the generic processor
+        if content_type in ("image", "table", "equation"):
+            return self.modal_processors.get(content_type)
+        return self.modal_processors.get("generic")
+
+    async def process_document_complete(
+        self,
+        file_path: str,
+        output_dir: str = "./output",
+        parse_method: str = "auto",
+        display_stats: bool = True,
+        split_by_character: str | None = None,
+        split_by_character_only: bool = False,
+        doc_id: str | None = None,
+    ):
+        """
+        Complete document processing workflow
+
+        Args:
+            file_path: Path to the file to process
+            output_dir: MinerU output directory
+            parse_method: Parse method
+            display_stats: Whether to display content statistics
+            split_by_character: Optional character to split the text by
+            split_by_character_only: If True, split only by the specified character
+            doc_id: Optional document ID; if not provided, an MD5 hash ID is generated
+        """
+        # Ensure LightRAG is initialized
+        await self._ensure_lightrag_initialized()
+
+        self.logger.info(f"Starting complete document processing: {file_path}")
+
+        # Step 1: Parse the document using MinerU
+        content_list, md_content = self.parse_document(
+            file_path, output_dir, parse_method, display_stats
+        )
+
+        # Step 2: Separate text and multimodal content
+        text_content, multimodal_items = self._separate_content(content_list)
+
+        # Step 3: Insert the plain text content with all parameters
+        if text_content.strip():
+            file_name = os.path.basename(file_path)
+            await self._insert_text_content(
+                text_content,
+                file_paths=file_name,
+                split_by_character=split_by_character,
+                split_by_character_only=split_by_character_only,
+                ids=doc_id,
+            )
+
+        # Step 4: Process multimodal content (using the specialized processors)
+        if multimodal_items:
+            await self._process_multimodal_content(multimodal_items, file_path)
+
+        self.logger.info(f"Document {file_path} processing complete!")
+
+    async def process_folder_complete(
+        self,
+        folder_path: str,
+        output_dir: str = "./output",
+        parse_method: str = "auto",
+        display_stats: bool = False,
+        split_by_character: str | None = None,
+        split_by_character_only: bool = False,
+        file_extensions: Optional[List[str]] = None,
+        recursive: bool = True,
+        max_workers: int = 1,
+    ):
+        """
+        Process all files in a folder in batch
+
+        Args:
+            folder_path: Path to the folder to process
+            output_dir: MinerU output directory
+            parse_method: Parse method
+            display_stats: Whether to display content statistics for each file (False is recommended for batch processing)
+            split_by_character: Optional character to split text by
+            split_by_character_only: If True, split only by the specified character
+            file_extensions: List of file extensions to process, e.g. [".pdf", ".docx"]; if None, process all supported formats
+            recursive: Whether to recursively process subfolders
+            max_workers: Maximum number of concurrent workers
+
+        Returns:
+            A summary dict with total, success, and failed counts plus the list of failed files
+        """
+        # Ensure LightRAG is initialized
+        await self._ensure_lightrag_initialized()
+
+        folder_path = Path(folder_path)
+        if not folder_path.exists() or not folder_path.is_dir():
+            raise ValueError(f"Folder does not exist or is not a valid directory: {folder_path}")
+
+        # Supported file formats
+        supported_extensions = {
+            ".pdf", ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif",
+            ".doc", ".docx", ".ppt", ".pptx", ".txt", ".md",
+        }
+
+        # Use the specified extensions or all supported formats
+        if file_extensions:
+            target_extensions = set(ext.lower() for ext in file_extensions)
+            # Warn about formats outside the supported set
+            unsupported = target_extensions - supported_extensions
+            if unsupported:
+                self.logger.warning(f"The following file formats may not be fully supported: {unsupported}")
+        else:
+            target_extensions = supported_extensions
+
+        # Collect all files to process, traversing subfolders when recursive is set
+        files_to_process = []
+        iterator = folder_path.rglob("*") if recursive else folder_path.glob("*")
+        for file_path in iterator:
+            if file_path.is_file() and file_path.suffix.lower() in target_extensions:
+                files_to_process.append(file_path)
+
+        if not files_to_process:
+            self.logger.info(f"No files to process found in {folder_path}")
+            return
+
+        self.logger.info(f"Found {len(files_to_process)} files to process")
+        self.logger.info("File type distribution:")
+
+        # Count file types
+        file_type_count = {}
+        for file_path in files_to_process:
+            ext = file_path.suffix.lower()
+            file_type_count[ext] = file_type_count.get(ext, 0) + 1
+
+        for ext, count in sorted(file_type_count.items()):
+            self.logger.info(f"  {ext}: {count} files")
+
+        # Progress tracking
+        processed_count = 0
+        failed_files = []
+
+        # Use a semaphore to control concurrency
+        semaphore = asyncio.Semaphore(max_workers)
+
+        async def process_single_file(file_path: Path, index: int) -> None:
+            """Process a single file"""
+            async with semaphore:
+                nonlocal processed_count
+                try:
+                    self.logger.info(f"[{index}/{len(files_to_process)}] Processing: {file_path}")
+
+                    # Create a separate output directory for each file
+                    file_output_dir = Path(output_dir) / file_path.stem
+                    file_output_dir.mkdir(parents=True, exist_ok=True)
+
+                    # Process the file
+                    await self.process_document_complete(
+                        file_path=str(file_path),
+                        output_dir=str(file_output_dir),
+                        parse_method=parse_method,
+                        display_stats=display_stats,
+                        split_by_character=split_by_character,
+                        split_by_character_only=split_by_character_only,
+                    )
+
+                    processed_count += 1
+                    self.logger.info(f"[{index}/{len(files_to_process)}] Successfully processed: {file_path}")
+
+                except Exception as e:
+                    self.logger.error(f"[{index}/{len(files_to_process)}] Failed to process: {file_path}")
+                    self.logger.error(f"Error: {str(e)}")
+                    failed_files.append((file_path, str(e)))
+
+        # Create all processing tasks and wait for them to complete
+        tasks = [
+            process_single_file(file_path, index)
+            for index, file_path in enumerate(files_to_process, 1)
+        ]
+        await asyncio.gather(*tasks, return_exceptions=True)
+
+        # Report processing statistics
+        self.logger.info("\n===== Batch Processing Complete =====")
+        self.logger.info(f"Total files: {len(files_to_process)}")
+        self.logger.info(f"Successfully processed: {processed_count}")
+        self.logger.info(f"Failed: {len(failed_files)}")
+
+        if failed_files:
+            self.logger.info("\nFailed files:")
+            for file_path, error in failed_files:
+                self.logger.info(f"  - {file_path}: {error}")
+
+        return {
+            "total": len(files_to_process),
+            "success": processed_count,
+            "failed": len(failed_files),
+            "failed_files": failed_files,
+        }
+
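The worker cap in `process_folder_complete` is the standard asyncio pattern: one `Semaphore` shared by all tasks, and `gather(..., return_exceptions=True)` so a single failure cannot cancel the rest of the batch. A self-contained sketch of that pattern (dummy workload; the names `bounded_batch` and `"bad.pdf"` are illustrative):

```python
import asyncio


async def bounded_batch(items, max_workers=2):
    """Run one task per item, at most max_workers at a time; collect failures."""
    semaphore = asyncio.Semaphore(max_workers)
    done = []

    async def worker(item):
        async with semaphore:  # limits how many workers run concurrently
            await asyncio.sleep(0)  # stand-in for real per-file processing
            if item == "bad.pdf":
                raise ValueError(item)
            done.append(item)

    # return_exceptions=True keeps one failure from cancelling the whole batch
    outcomes = await asyncio.gather(*(worker(i) for i in items), return_exceptions=True)
    failed = [o for o in outcomes if isinstance(o, Exception)]
    return done, failed


ok, failed = asyncio.run(bounded_batch(["a.pdf", "bad.pdf", "c.pdf"]))
```

With `max_workers=1` (the default above in `process_folder_complete`) this degrades to strictly sequential processing, which is the safe choice when each file already fans out into many LLM calls.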
+    async def query_with_multimodal(self, query: str, mode: str = "hybrid") -> str:
+        """
+        Query with multimodal content support
+
+        Args:
+            query: Query content
+            mode: Query mode
+
+        Returns:
+            Query result
+        """
+        if self.lightrag is None:
+            raise ValueError(
+                "No LightRAG instance available. "
+                "Please either:\n"
+                "1. Provide a pre-initialized LightRAG instance when creating RAGAnything, or\n"
+                "2. Process documents first using process_document_complete() or process_folder_complete() "
+                "to create and populate the LightRAG instance."
+            )
+
+        result = await self.lightrag.aquery(query, param=QueryParam(mode=mode))
+
+        return result
+
+    def get_processor_info(self) -> Dict[str, Any]:
+        """Get processor information"""
+        if not self.modal_processors:
+            return {"status": "Not initialized"}
+
+        info = {
+            "status": "Initialized",
+            "processors": {},
+            "models": {
+                "llm_model": "External function" if self.llm_model_func else "Not provided",
+                "vision_model": "External function" if self.vision_model_func else "Not provided",
+                "embedding_model": "External function" if self.embedding_func else "Not provided",
+            },
+        }
+
+        for proc_type, processor in self.modal_processors.items():
+            info["processors"][proc_type] = {
+                "class": processor.__class__.__name__,
+                "supports": self._get_processor_supports(proc_type),
+            }
+
+        return info
+
+    def _get_processor_supports(self, proc_type: str) -> List[str]:
+        """Get the features supported by a processor"""
+        supports_map = {
+            "image": ["Image content analysis", "Visual understanding", "Image description generation", "Image entity extraction"],
+            "table": ["Table structure analysis", "Data statistics", "Trend identification", "Table entity extraction"],
+            "equation": ["Mathematical formula parsing", "Variable identification", "Formula meaning explanation", "Formula entity extraction"],
+            "generic": ["General content analysis", "Structured processing", "Entity extraction"],
+        }
+        return supports_map.get(proc_type, ["Basic processing"])