MrGidea committed
Commit 05c25ae · 1 Parent(s): 06681ea

Update README.md

Files changed (1)
  1. README.md +874 -10
README.md CHANGED
@@ -1,18 +1,882 @@
1
- ## Quick start
2
- Currently, the test supports pptx, pdf, csv, docx, txt file types
3
- * install textract
4
  ```bash
5
- pip install textract
6
  ```
7
- * example
 
8
  ```bash
9
  import textract
10
- # Specify the path of the file to extract text from
11
- file_path = 'path/to/your/file.pdf'
12
- # Extract text from the file
13
  text_content = textract.process(file_path)
14
- # Print the extracted text
15
- print(text_content.decode('utf-8'))
16
  ```
17
 
18
 
 
1
+ <center><h2>🚀 LightRAG: Simple and Fast Retrieval-Augmented Generation</h2></center>
2
+
3
+
4
+ ![请添加图片描述](https://i-blog.csdnimg.cn/direct/567139f1a36e4564abc63ce5c12b6271.jpeg)
5
+
6
+ <div align='center'>
7
+ <p>
8
+ <a href='https://lightrag.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
9
+ <a href='https://youtu.be/oageL-1I0GE'><img src='https://badges.aleen42.com/src/youtube.svg'></a>
10
+ <a href='https://arxiv.org/abs/2410.05779'><img src='https://img.shields.io/badge/arXiv-2410.05779-b31b1b'></a>
11
+ <a href='https://discord.gg/rdE8YVPm'><img src='https://discordapp.com/api/guilds/1296348098003734629/widget.png?style=shield'></a>
12
+ </p>
13
+ <p>
14
+ <img src='https://img.shields.io/github/stars/hkuds/lightrag?color=green&style=social' />
15
+ <img src="https://img.shields.io/badge/python->=3.9.11-blue">
16
+ <a href="https://pypi.org/project/lightrag-hku/"><img src="https://img.shields.io/pypi/v/lightrag-hku.svg"></a>
17
+ <a href="https://pepy.tech/project/lightrag-hku"><img src="https://static.pepy.tech/badge/lightrag-hku/month"></a>
18
+ </p>
19
+
20
+ This repository hosts the code of LightRAG. The structure of this code is based on [nano-graphrag](https://github.com/gusye1234/nano-graphrag).
21
+ ![LightRAG](https://i-blog.csdnimg.cn/direct/b2aaf634151b4706892693ffb43d9093.png)
22
+ </div>
23
+
24
+ ## 🎉 News
25
+ - [x] [2024.10.29]🎯🎯📢📢Multi-file types are now supported by `textract`.
26
+ - [x] [2024.10.20]🎯🎯📢📢We’ve added a new feature to LightRAG: Graph Visualization.
27
+ - [x] [2024.10.18]🎯🎯📢📢We’ve added a link to a [LightRAG Introduction Video](https://youtu.be/oageL-1I0GE). Thanks to the author!
28
+ - [x] [2024.10.17]🎯🎯📢📢We have created a [Discord channel](https://discord.gg/mvsfu2Tg)! Welcome to join for sharing and discussions! 🎉🎉
29
+ - [x] [2024.10.16]🎯🎯📢📢LightRAG now supports [Ollama models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
30
+ - [x] [2024.10.15]🎯🎯📢📢LightRAG now supports [Hugging Face models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
31
+
32
+ ## Algorithm Flowchart
33
+
34
+ ![LightRAG_Self excalidraw](https://github.com/user-attachments/assets/aa5c4892-2e44-49e6-a116-2403ed80a1a3)
35
+
36
+
37
+ ## Install
38
+
39
+ * Install from source (Recommend)
40
+
41
+ ```bash
42
+ cd LightRAG
43
+ pip install -e .
44
+ ```
45
+ * Install from PyPI
46
+ ```bash
47
+ pip install lightrag-hku
48
+ ```
49
+
50
+ ## Quick Start
51
+ * [Video demo](https://www.youtube.com/watch?v=g21royNJ4fw) of running LightRAG locally.
52
+ * All the code can be found in the `examples` directory.
53
+ * Set your OpenAI API key in the environment if using OpenAI models: `export OPENAI_API_KEY="sk-..."`.
54
+ * Download the demo text "A Christmas Carol by Charles Dickens":
55
+ ```bash
56
+ curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt
57
+ ```
58
+ Use the below Python snippet (in a script) to initialize LightRAG and perform queries:
59
+
60
+ ```python
61
+ import os
62
+ from lightrag import LightRAG, QueryParam
63
+ from lightrag.llm import gpt_4o_mini_complete, gpt_4o_complete
64
+
65
+ #########
66
+ # Uncomment the below two lines if running in a jupyter notebook to handle the async nature of rag.insert()
67
+ # import nest_asyncio
68
+ # nest_asyncio.apply()
69
+ #########
70
+
71
+ WORKING_DIR = "./dickens"
72
+
73
+
74
+ if not os.path.exists(WORKING_DIR):
75
+ os.mkdir(WORKING_DIR)
76
+
77
+ rag = LightRAG(
78
+ working_dir=WORKING_DIR,
79
+ llm_model_func=gpt_4o_mini_complete # Use gpt_4o_mini_complete LLM model
80
+ # llm_model_func=gpt_4o_complete # Optionally, use a stronger model
81
+ )
82
+
83
+ with open("./book.txt") as f:
84
+ rag.insert(f.read())
85
+
86
+ # Perform naive search
87
+ print(rag.query("What are the top themes in this story?", param=QueryParam(mode="naive")))
88
+
89
+ # Perform local search
90
+ print(rag.query("What are the top themes in this story?", param=QueryParam(mode="local")))
91
+
92
+ # Perform global search
93
+ print(rag.query("What are the top themes in this story?", param=QueryParam(mode="global")))
94
+
95
+ # Perform hybrid search
96
+ print(rag.query("What are the top themes in this story?", param=QueryParam(mode="hybrid")))
97
+ ```
98
+
99
+ <details>
100
+ <summary> Using OpenAI-like APIs </summary>
101
+
102
+ * LightRAG also supports OpenAI-like chat/embedding APIs:
103
+ ```python
+ import os
+ import numpy as np
+ from lightrag.utils import EmbeddingFunc
+ from lightrag.llm import openai_complete_if_cache, openai_embedding
+
104
+ async def llm_model_func(
105
+ prompt, system_prompt=None, history_messages=[], **kwargs
106
+ ) -> str:
107
+ return await openai_complete_if_cache(
108
+ "solar-mini",
109
+ prompt,
110
+ system_prompt=system_prompt,
111
+ history_messages=history_messages,
112
+ api_key=os.getenv("UPSTAGE_API_KEY"),
113
+ base_url="https://api.upstage.ai/v1/solar",
114
+ **kwargs
115
+ )
116
+
117
+ async def embedding_func(texts: list[str]) -> np.ndarray:
118
+ return await openai_embedding(
119
+ texts,
120
+ model="solar-embedding-1-large-query",
121
+ api_key=os.getenv("UPSTAGE_API_KEY"),
122
+ base_url="https://api.upstage.ai/v1/solar"
123
+ )
124
+
125
+ rag = LightRAG(
126
+ working_dir=WORKING_DIR,
127
+ llm_model_func=llm_model_func,
128
+ embedding_func=EmbeddingFunc(
129
+ embedding_dim=4096,
130
+ max_token_size=8192,
131
+ func=embedding_func
132
+ )
133
+ )
134
+ ```
135
+ </details>
136
+
137
+ <details>
138
+ <summary> Using Hugging Face Models </summary>
139
+
140
+ * If you want to use Hugging Face models, you only need to set LightRAG as follows:
141
+ ```python
142
+ from lightrag.llm import hf_model_complete, hf_embedding
143
+ from transformers import AutoModel, AutoTokenizer
144
+
145
+ # Initialize LightRAG with Hugging Face model
146
+ rag = LightRAG(
147
+ working_dir=WORKING_DIR,
148
+ llm_model_func=hf_model_complete, # Use Hugging Face model for text generation
149
+ llm_model_name='meta-llama/Llama-3.1-8B-Instruct', # Model name from Hugging Face
150
+ # Use Hugging Face embedding function
151
+ embedding_func=EmbeddingFunc(
152
+ embedding_dim=384,
153
+ max_token_size=5000,
154
+ func=lambda texts: hf_embedding(
155
+ texts,
156
+ tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
157
+ embed_model=AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
158
+ )
159
+ ),
160
+ )
161
+ ```
162
+ </details>
163
+
164
+ <details>
165
+ <summary> Using Ollama Models </summary>
166
+
167
+ ### Overview
168
+ If you want to use Ollama models, you first need to pull the model you plan to use, as well as an embedding model such as `nomic-embed-text` (see the pull commands below).
169
+
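+ For example, you can pull the models ahead of time (the `qwen2` model here is only an illustration; pull whichever model you plan to run):
+
+ ```bash
+ ollama pull qwen2
+ ollama pull nomic-embed-text
+ ```
+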
170
+ Then you only need to set LightRAG as follows:
171
+
172
+ ```python
173
+ from lightrag.llm import ollama_model_complete, ollama_embedding
174
+
175
+ # Initialize LightRAG with Ollama model
176
+ rag = LightRAG(
177
+ working_dir=WORKING_DIR,
178
+ llm_model_func=ollama_model_complete, # Use Ollama model for text generation
179
+ llm_model_name='your_model_name', # Your model name
180
+ # Use Ollama embedding function
181
+ embedding_func=EmbeddingFunc(
182
+ embedding_dim=768,
183
+ max_token_size=8192,
184
+ func=lambda texts: ollama_embedding(
185
+ texts,
186
+ embed_model="nomic-embed-text"
187
+ )
188
+ ),
189
+ )
190
+ ```
191
+
192
+ ### Increasing context size
193
+ For LightRAG to work well, the context size should be at least 32k tokens. By default, Ollama models have a context size of 8k. You can increase it in one of two ways:
194
+
195
+ #### Increasing the `num_ctx` parameter in Modelfile.
196
+
197
+ 1. Pull the model:
198
+ ```bash
199
+ ollama pull qwen2
200
+ ```
201
+
202
+ 2. Display the model file:
203
+ ```bash
204
+ ollama show --modelfile qwen2 > Modelfile
205
+ ```
206
+
207
+ 3. Edit the Modelfile by adding the following line:
208
  ```bash
209
+ PARAMETER num_ctx 32768
210
  ```
211
+
212
+ 4. Create the modified model:
213
  ```bash
214
+ ollama create -f Modelfile qwen2m
215
+ ```
216
+
217
+ #### Setting `num_ctx` via the Ollama API.
218
+ You can use the `llm_model_kwargs` parameter to configure Ollama:
219
+
220
+ ```python
221
+ rag = LightRAG(
222
+ working_dir=WORKING_DIR,
223
+ llm_model_func=ollama_model_complete, # Use Ollama model for text generation
224
+ llm_model_name='your_model_name', # Your model name
225
+ llm_model_kwargs={"options": {"num_ctx": 32768}},
226
+ # Use Ollama embedding function
227
+ embedding_func=EmbeddingFunc(
228
+ embedding_dim=768,
229
+ max_token_size=8192,
230
+ func=lambda texts: ollama_embedding(
231
+ texts,
232
+ embed_model="nomic-embed-text"
233
+ )
234
+ ),
235
+ )
236
+ ```
237
+ #### Fully functional example
238
+
239
+ There is a fully functional example, `examples/lightrag_ollama_demo.py`, that uses the `gemma2:2b` model, runs only 4 requests in parallel, and sets the context size to 32k.
240
+
241
+ #### Low RAM GPUs
242
+
243
+ To run this experiment on a low-RAM GPU, you should select a small model and tune the context window (increasing the context increases memory consumption). For example, running this Ollama example on a repurposed mining GPU with 6 GB of RAM required setting the context size to 26k while using `gemma2:2b`. It was able to find 197 entities and 19 relations in `book.txt`.
244
+
245
+ </details>
246
+
247
+ ### Query Param
248
+
249
+ ```python
250
+ class QueryParam:
251
+ mode: Literal["local", "global", "hybrid", "naive"] = "global"
252
+ only_need_context: bool = False
253
+ response_type: str = "Multiple Paragraphs"
254
+ # Number of top-k items to retrieve; corresponds to entities in "local" mode and relationships in "global" mode.
255
+ top_k: int = 60
256
+ # Number of tokens for the original chunks.
257
+ max_token_for_text_unit: int = 4000
258
+ # Number of tokens for the relationship descriptions
259
+ max_token_for_global_context: int = 4000
260
+ # Number of tokens for the entity descriptions
261
+ max_token_for_local_context: int = 4000
262
+ ```
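+
+ For example, a query can be issued with custom retrieval settings (a minimal sketch; the `rag` instance comes from Quick Start and the field values are illustrative):
+
+ ```python
+ from lightrag import QueryParam
+
+ # Hybrid retrieval with a larger candidate pool and a shorter response format
+ custom_param = QueryParam(
+     mode="hybrid",
+     top_k=80,
+     response_type="Single Paragraph",
+ )
+ print(rag.query("What are the top themes in this story?", param=custom_param))
+ ```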
263
+
264
+ ### Batch Insert
265
+
266
+ ```python
267
+ # Batch Insert: Insert multiple texts at once
268
+ rag.insert(["TEXT1", "TEXT2",...])
269
+ ```
270
+
271
+ ### Incremental Insert
272
+
273
+ ```python
274
+ # Incremental Insert: Insert new documents into an existing LightRAG instance
275
+ rag = LightRAG(
276
+ working_dir=WORKING_DIR,
277
+ llm_model_func=llm_model_func,
278
+ embedding_func=EmbeddingFunc(
279
+ embedding_dim=embedding_dimension,
280
+ max_token_size=8192,
281
+ func=embedding_func,
282
+ ),
283
+ )
284
+
285
+ with open("./newText.txt") as f:
286
+ rag.insert(f.read())
287
+ ```
288
+
289
+ ### Multi-file Type Support
290
+
291
+ With `textract`, LightRAG supports reading file types such as TXT, DOCX, PPTX, CSV, and PDF.
292
+
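+ If `textract` is not already installed, it can be installed from PyPI first:
+
+ ```bash
+ pip install textract
+ ```
+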
293
+ ```python
294
  import textract
295
+
296
+ file_path = 'TEXT.pdf'
 
297
  text_content = textract.process(file_path)
298
+
299
+ rag.insert(text_content.decode('utf-8'))
300
+ ```
301
+
302
+ ### Graph Visualization
303
+
304
+ <details>
305
+ <summary> Graph visualization with html </summary>
306
+
307
+ * The following code can be found in `examples/graph_visual_with_html.py`
308
+
309
+ ```python
310
+ import networkx as nx
311
+ from pyvis.network import Network
312
+
313
+ # Load the GraphML file
314
+ G = nx.read_graphml('./dickens/graph_chunk_entity_relation.graphml')
315
+
316
+ # Create a Pyvis network
317
+ net = Network(notebook=True)
318
+
319
+ # Convert NetworkX graph to Pyvis network
320
+ net.from_nx(G)
321
+
322
+ # Save and display the network
323
+ net.show('knowledge_graph.html')
324
+ ```
325
+
326
+ </details>
327
+
328
+ <details>
329
+ <summary> Graph visualization with Neo4j </summary>
330
+
331
+ * The following code can be found in `examples/graph_visual_with_neo4j.py`
332
+
333
+ ```python
334
+ import os
335
+ import json
336
+ from lightrag.utils import xml_to_json
337
+ from neo4j import GraphDatabase
338
+
339
+ # Constants
340
+ WORKING_DIR = "./dickens"
341
+ BATCH_SIZE_NODES = 500
342
+ BATCH_SIZE_EDGES = 100
343
+
344
+ # Neo4j connection credentials
345
+ NEO4J_URI = "bolt://localhost:7687"
346
+ NEO4J_USERNAME = "neo4j"
347
+ NEO4J_PASSWORD = "your_password"
348
+
349
+ def convert_xml_to_json(xml_path, output_path):
350
+ """Converts XML file to JSON and saves the output."""
351
+ if not os.path.exists(xml_path):
352
+ print(f"Error: File not found - {xml_path}")
353
+ return None
354
+
355
+ json_data = xml_to_json(xml_path)
356
+ if json_data:
357
+ with open(output_path, 'w', encoding='utf-8') as f:
358
+ json.dump(json_data, f, ensure_ascii=False, indent=2)
359
+ print(f"JSON file created: {output_path}")
360
+ return json_data
361
+ else:
362
+ print("Failed to create JSON data")
363
+ return None
364
+
365
+ def process_in_batches(tx, query, data, batch_size):
366
+ """Process data in batches and execute the given query."""
367
+ for i in range(0, len(data), batch_size):
368
+ batch = data[i:i + batch_size]
369
+ tx.run(query, {"nodes": batch} if "nodes" in query else {"edges": batch})
370
+
371
+ def main():
372
+ # Paths
373
+ xml_file = os.path.join(WORKING_DIR, 'graph_chunk_entity_relation.graphml')
374
+ json_file = os.path.join(WORKING_DIR, 'graph_data.json')
375
+
376
+ # Convert XML to JSON
377
+ json_data = convert_xml_to_json(xml_file, json_file)
378
+ if json_data is None:
379
+ return
380
+
381
+ # Load nodes and edges
382
+ nodes = json_data.get('nodes', [])
383
+ edges = json_data.get('edges', [])
384
+
385
+ # Neo4j queries
386
+ create_nodes_query = """
387
+ UNWIND $nodes AS node
388
+ MERGE (e:Entity {id: node.id})
389
+ SET e.entity_type = node.entity_type,
390
+ e.description = node.description,
391
+ e.source_id = node.source_id,
392
+ e.displayName = node.id
393
+ REMOVE e:Entity
394
+ WITH e, node
395
+ CALL apoc.create.addLabels(e, [node.entity_type]) YIELD node AS labeledNode
396
+ RETURN count(*)
397
+ """
398
+
399
+ create_edges_query = """
400
+ UNWIND $edges AS edge
401
+ MATCH (source {id: edge.source})
402
+ MATCH (target {id: edge.target})
403
+ WITH source, target, edge,
404
+ CASE
405
+ WHEN edge.keywords CONTAINS 'lead' THEN 'lead'
406
+ WHEN edge.keywords CONTAINS 'participate' THEN 'participate'
407
+ WHEN edge.keywords CONTAINS 'uses' THEN 'uses'
408
+ WHEN edge.keywords CONTAINS 'located' THEN 'located'
409
+ WHEN edge.keywords CONTAINS 'occurs' THEN 'occurs'
410
+ ELSE REPLACE(SPLIT(edge.keywords, ',')[0], '\"', '')
411
+ END AS relType
412
+ CALL apoc.create.relationship(source, relType, {
413
+ weight: edge.weight,
414
+ description: edge.description,
415
+ keywords: edge.keywords,
416
+ source_id: edge.source_id
417
+ }, target) YIELD rel
418
+ RETURN count(*)
419
+ """
420
+
421
+ set_displayname_and_labels_query = """
422
+ MATCH (n)
423
+ SET n.displayName = n.id
424
+ WITH n
425
+ CALL apoc.create.setLabels(n, [n.entity_type]) YIELD node
426
+ RETURN count(*)
427
+ """
428
+
429
+ # Create a Neo4j driver
430
+ driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))
431
+
432
+ try:
433
+ # Execute queries in batches
434
+ with driver.session() as session:
435
+ # Insert nodes in batches
436
+ session.execute_write(process_in_batches, create_nodes_query, nodes, BATCH_SIZE_NODES)
437
+
438
+ # Insert edges in batches
439
+ session.execute_write(process_in_batches, create_edges_query, edges, BATCH_SIZE_EDGES)
440
+
441
+ # Set displayName and labels
442
+ session.run(set_displayname_and_labels_query)
443
+
444
+ except Exception as e:
445
+ print(f"Error occurred: {e}")
446
+
447
+ finally:
448
+ driver.close()
449
+
450
+ if __name__ == "__main__":
451
+ main()
452
+ ```
453
+
454
+ </details>
455
+
456
+ ## API Server Implementation
457
+
458
+ LightRAG also provides a FastAPI-based server implementation for RESTful API access to RAG operations. This allows you to run LightRAG as a service and interact with it through HTTP requests.
459
+
460
+ ### Setting up the API Server
461
+ <details>
462
+ <summary>Click to expand setup instructions</summary>
463
+
464
+ 1. First, ensure you have the required dependencies:
465
+ ```bash
466
+ pip install fastapi uvicorn pydantic
467
+ ```
468
+
469
+ 2. Set up your environment variables:
470
+ ```bash
471
+ export RAG_DIR="your_index_directory" # Optional: Defaults to "index_default"
472
+ ```
473
+
474
+ 3. Run the API server:
475
+ ```bash
476
+ python examples/lightrag_api_openai_compatible_demo.py
477
+ ```
478
+
479
+ The server will start on `http://0.0.0.0:8020`.
480
+ </details>
481
+
482
+ ### API Endpoints
483
+
484
+ The API server provides the following endpoints:
485
+
486
+ #### 1. Query Endpoint
487
+ <details>
488
+ <summary>Click to view Query endpoint details</summary>
489
+
490
+ - **URL:** `/query`
491
+ - **Method:** POST
492
+ - **Body:**
493
+ ```json
494
+ {
495
+ "query": "Your question here",
496
+ "mode": "hybrid" // Can be "naive", "local", "global", or "hybrid"
497
+ }
498
+ ```
499
+ - **Example:**
500
+ ```bash
501
+ curl -X POST "http://127.0.0.1:8020/query" \
502
+ -H "Content-Type: application/json" \
503
+ -d '{"query": "What are the main themes?", "mode": "hybrid"}'
504
+ ```
505
+ </details>
506
+
507
+ #### 2. Insert Text Endpoint
508
+ <details>
509
+ <summary>Click to view Insert Text endpoint details</summary>
510
+
511
+ - **URL:** `/insert`
512
+ - **Method:** POST
513
+ - **Body:**
514
+ ```json
515
+ {
516
+ "text": "Your text content here"
517
+ }
518
+ ```
519
+ - **Example:**
520
+ ```bash
521
+ curl -X POST "http://127.0.0.1:8020/insert" \
522
+ -H "Content-Type: application/json" \
523
+ -d '{"text": "Content to be inserted into RAG"}'
524
+ ```
525
+ </details>
526
+
527
+ #### 3. Insert File Endpoint
528
+ <details>
529
+ <summary>Click to view Insert File endpoint details</summary>
530
+
531
+ - **URL:** `/insert_file`
532
+ - **Method:** POST
533
+ - **Body:**
534
+ ```json
535
+ {
536
+ "file_path": "path/to/your/file.txt"
537
+ }
538
+ ```
539
+ - **Example:**
540
+ ```bash
541
+ curl -X POST "http://127.0.0.1:8020/insert_file" \
542
+ -H "Content-Type: application/json" \
543
+ -d '{"file_path": "./book.txt"}'
544
+ ```
545
+ </details>
546
+
547
+ #### 4. Health Check Endpoint
548
+ <details>
549
+ <summary>Click to view Health Check endpoint details</summary>
550
+
551
+ - **URL:** `/health`
552
+ - **Method:** GET
553
+ - **Example:**
554
+ ```bash
555
+ curl -X GET "http://127.0.0.1:8020/health"
556
+ ```
557
+ </details>
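+
+ For programmatic access, the same endpoints can also be called from Python. The snippet below is a minimal sketch using the third-party `requests` library against the default host and port shown above:
+
+ ```python
+ import requests
+
+ BASE_URL = "http://127.0.0.1:8020"
+
+ # Insert a document, then query it in hybrid mode
+ requests.post(f"{BASE_URL}/insert", json={"text": "Content to be inserted into RAG"})
+
+ response = requests.post(
+     f"{BASE_URL}/query",
+     json={"query": "What are the main themes?", "mode": "hybrid"},
+ )
+ print(response.json())
+ ```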
558
+
559
+ ### Configuration
560
+
561
+ The API server can be configured using environment variables:
562
+ - `RAG_DIR`: Directory for storing the RAG index (default: "index_default")
563
+ - API keys and base URLs should be configured in the code for your specific LLM and embedding model providers
564
+
565
+ ### Error Handling
566
+ <details>
567
+ <summary>Click to view error handling details</summary>
568
+
569
+ The API includes comprehensive error handling:
570
+ - File not found errors (404)
571
+ - Processing errors (500)
572
+ - Supports multiple file encodings (UTF-8 and GBK)
573
+ </details>
574
+
575
+ ## Evaluation
576
+ ### Dataset
577
+ The dataset used in LightRAG can be downloaded from [TommyChien/UltraDomain](https://huggingface.co/datasets/TommyChien/UltraDomain).
578
+
579
+ ### Generate Query
580
+ LightRAG uses the following prompt to generate high-level queries, with the corresponding code in `examples/generate_query.py`.
581
+
582
+ <details>
583
+ <summary> Prompt </summary>
584
+
585
+ ```python
586
+ Given the following description of a dataset:
587
+
588
+ {description}
589
+
590
+ Please identify 5 potential users who would engage with this dataset. For each user, list 5 tasks they would perform with this dataset. Then, for each (user, task) combination, generate 5 questions that require a high-level understanding of the entire dataset.
591
+
592
+ Output the results in the following structure:
593
+ - User 1: [user description]
594
+ - Task 1: [task description]
595
+ - Question 1:
596
+ - Question 2:
597
+ - Question 3:
598
+ - Question 4:
599
+ - Question 5:
600
+ - Task 2: [task description]
601
+ ...
602
+ - Task 5: [task description]
603
+ - User 2: [user description]
604
+ ...
605
+ - User 5: [user description]
606
+ ...
607
+ ```
608
+ </details>
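+
+ A minimal sketch of how such a prompt could be sent to an LLM to produce the queries (the dataset description and the choice of `gpt_4o_mini_complete` are illustrative assumptions; see `examples/generate_query.py` for the actual script):
+
+ ```python
+ import asyncio
+ from lightrag.llm import gpt_4o_mini_complete
+
+ # Illustrative description; in the Reproduce section below it is built from the corpus itself.
+ description = "A collection of agricultural research articles covering crops, soil, and irrigation."
+
+ prompt = (
+     "Given the following description of a dataset:\n\n"
+     f"{description}\n\n"
+     "Please identify 5 potential users who would engage with this dataset. "
+     "For each user, list 5 tasks they would perform with this dataset. "
+     "Then, for each (user, task) combination, generate 5 questions that require "
+     "a high-level understanding of the entire dataset."
+ )
+
+ print(asyncio.run(gpt_4o_mini_complete(prompt)))
+ ```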
609
+
610
+ ### Batch Eval
611
+ To evaluate the performance of two RAG systems on high-level queries, LightRAG uses the following prompt, with the specific code available in `examples/batch_eval.py`.
612
+
613
+ <details>
614
+ <summary> Prompt </summary>
615
+
616
+ ```python
617
+ ---Role---
618
+ You are an expert tasked with evaluating two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
619
+ ---Goal---
620
+ You will evaluate two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
621
+
622
+ - **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?
623
+ - **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?
624
+ - **Empowerment**: How well does the answer help the reader understand and make informed judgments about the topic?
625
+
626
+ For each criterion, choose the better answer (either Answer 1 or Answer 2) and explain why. Then, select an overall winner based on these three categories.
627
+
628
+ Here is the question:
629
+ {query}
630
+
631
+ Here are the two answers:
632
+
633
+ **Answer 1:**
634
+ {answer1}
635
+
636
+ **Answer 2:**
637
+ {answer2}
638
+
639
+ Evaluate both answers using the three criteria listed above and provide detailed explanations for each criterion.
640
+
641
+ Output your evaluation in the following JSON format:
642
+
643
+ {{
644
+ "Comprehensiveness": {{
645
+ "Winner": "[Answer 1 or Answer 2]",
646
+ "Explanation": "[Provide explanation here]"
647
+ }},
648
+ "Empowerment": {{
649
+ "Winner": "[Answer 1 or Answer 2]",
650
+ "Explanation": "[Provide explanation here]"
651
+ }},
652
+ "Overall Winner": {{
653
+ "Winner": "[Answer 1 or Answer 2]",
654
+ "Explanation": "[Summarize why this answer is the overall winner based on the three criteria]"
655
+ }}
656
+ }}
657
+ ```
658
+ </details>
659
+
660
+ ### Overall Performance Table
661
+ | | **Agriculture** | | **CS** | | **Legal** | | **Mix** | |
662
+ |----------------------|-------------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
663
+ | | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** |
664
+ | **Comprehensiveness** | 32.69% | **67.31%** | 35.44% | **64.56%** | 19.05% | **80.95%** | 36.36% | **63.64%** |
665
+ | **Diversity** | 24.09% | **75.91%** | 35.24% | **64.76%** | 10.98% | **89.02%** | 30.76% | **69.24%** |
666
+ | **Empowerment** | 31.35% | **68.65%** | 35.48% | **64.52%** | 17.59% | **82.41%** | 40.95% | **59.05%** |
667
+ | **Overall** | 33.30% | **66.70%** | 34.76% | **65.24%** | 17.46% | **82.54%** | 37.59% | **62.40%** |
668
+ | | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** |
669
+ | **Comprehensiveness** | 32.05% | **67.95%** | 39.30% | **60.70%** | 18.57% | **81.43%** | 38.89% | **61.11%** |
670
+ | **Diversity** | 29.44% | **70.56%** | 38.71% | **61.29%** | 15.14% | **84.86%** | 28.50% | **71.50%** |
671
+ | **Empowerment** | 32.51% | **67.49%** | 37.52% | **62.48%** | 17.80% | **82.20%** | 43.96% | **56.04%** |
672
+ | **Overall** | 33.29% | **66.71%** | 39.03% | **60.97%** | 17.80% | **82.20%** | 39.61% | **60.39%** |
673
+ | | HyDE | **LightRAG** | HyDE | **LightRAG** | HyDE | **LightRAG** | HyDE | **LightRAG** |
674
+ | **Comprehensiveness** | 24.39% | **75.61%** | 36.49% | **63.51%** | 27.68% | **72.32%** | 42.17% | **57.83%** |
675
+ | **Diversity** | 24.96% | **75.34%** | 37.41% | **62.59%** | 18.79% | **81.21%** | 30.88% | **69.12%** |
676
+ | **Empowerment** | 24.89% | **75.11%** | 34.99% | **65.01%** | 26.99% | **73.01%** | **45.61%** | **54.39%** |
677
+ | **Overall** | 23.17% | **76.83%** | 35.67% | **64.33%** | 27.68% | **72.32%** | 42.72% | **57.28%** |
678
+ | | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** |
679
+ | **Comprehensiveness** | 45.56% | **54.44%** | 45.98% | **54.02%** | 47.13% | **52.87%** | **51.86%** | 48.14% |
680
+ | **Diversity** | 19.65% | **80.35%** | 39.64% | **60.36%** | 25.55% | **74.45%** | 35.87% | **64.13%** |
681
+ | **Empowerment** | 36.69% | **63.31%** | 45.09% | **54.91%** | 42.81% | **57.19%** | **52.94%** | 47.06% |
682
+ | **Overall** | 43.62% | **56.38%** | 45.98% | **54.02%** | 45.70% | **54.30%** | **51.86%** | 48.14% |
683
+
684
+ ## Reproduce
685
+ All the code can be found in the `./reproduce` directory.
686
+
687
+ ### Step-0 Extract Unique Contexts
688
+ First, we need to extract unique contexts in the datasets.
689
+
690
+ <details>
691
+ <summary> Code </summary>
692
+
693
+ ```python
+ import os
+ import glob
+ import json
+
694
+ def extract_unique_contexts(input_directory, output_directory):
695
+
696
+ os.makedirs(output_directory, exist_ok=True)
697
+
698
+ jsonl_files = glob.glob(os.path.join(input_directory, '*.jsonl'))
699
+ print(f"Found {len(jsonl_files)} JSONL files.")
700
+
701
+ for file_path in jsonl_files:
702
+ filename = os.path.basename(file_path)
703
+ name, ext = os.path.splitext(filename)
704
+ output_filename = f"{name}_unique_contexts.json"
705
+ output_path = os.path.join(output_directory, output_filename)
706
+
707
+ unique_contexts_dict = {}
708
+
709
+ print(f"Processing file: {filename}")
710
+
711
+ try:
712
+ with open(file_path, 'r', encoding='utf-8') as infile:
713
+ for line_number, line in enumerate(infile, start=1):
714
+ line = line.strip()
715
+ if not line:
716
+ continue
717
+ try:
718
+ json_obj = json.loads(line)
719
+ context = json_obj.get('context')
720
+ if context and context not in unique_contexts_dict:
721
+ unique_contexts_dict[context] = None
722
+ except json.JSONDecodeError as e:
723
+ print(f"JSON decoding error in file {filename} at line {line_number}: {e}")
724
+ except FileNotFoundError:
725
+ print(f"File not found: {filename}")
726
+ continue
727
+ except Exception as e:
728
+ print(f"An error occurred while processing file {filename}: {e}")
729
+ continue
730
+
731
+ unique_contexts_list = list(unique_contexts_dict.keys())
732
+ print(f"There are {len(unique_contexts_list)} unique `context` entries in the file {filename}.")
733
+
734
+ try:
735
+ with open(output_path, 'w', encoding='utf-8') as outfile:
736
+ json.dump(unique_contexts_list, outfile, ensure_ascii=False, indent=4)
737
+ print(f"Unique `context` entries have been saved to: {output_filename}")
738
+ except Exception as e:
739
+ print(f"An error occurred while saving to the file {output_filename}: {e}")
740
+
741
+ print("All files have been processed.")
742
+
743
+ ```
744
+ </details>
745
+
746
+ ### Step-1 Insert Contexts
747
+ We insert the extracted contexts into the LightRAG system.
748
+
749
+ <details>
750
+ <summary> Code </summary>
751
+
752
+ ```python
+ import json
+ import time
+
753
+ def insert_text(rag, file_path):
754
+ with open(file_path, mode='r') as f:
755
+ unique_contexts = json.load(f)
756
+
757
+ retries = 0
758
+ max_retries = 3
759
+ while retries < max_retries:
760
+ try:
761
+ rag.insert(unique_contexts)
762
+ break
763
+ except Exception as e:
764
+ retries += 1
765
+ print(f"Insertion failed, retrying ({retries}/{max_retries}), error: {e}")
766
+ time.sleep(10)
767
+ if retries == max_retries:
768
+ print("Insertion failed after exceeding the maximum number of retries")
769
+ ```
770
+ </details>
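+
+ For example, with a LightRAG instance initialized as in Quick Start, the unique contexts produced in Step-0 can be inserted like this (the file name is a placeholder following the `{name}_unique_contexts.json` pattern):
+
+ ```python
+ insert_text(rag, "./datasets/unique_contexts/agriculture_unique_contexts.json")
+ ```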
771
+
772
+ ### Step-2 Generate Queries
773
+
774
+ We extract tokens from the first and the second half of each context in the dataset, then combine them as dataset descriptions to generate queries.
775
+
776
+ <details>
777
+ <summary> Code </summary>
778
+
779
+ ```python
+ from transformers import GPT2Tokenizer
+
780
+ tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
781
+
782
+ def get_summary(context, tot_tokens=2000):
783
+ tokens = tokenizer.tokenize(context)
784
+ half_tokens = tot_tokens // 2
785
+
786
+ start_tokens = tokens[1000:1000 + half_tokens]
787
+ end_tokens = tokens[-(1000 + half_tokens):1000]
788
+
789
+ summary_tokens = start_tokens + end_tokens
790
+ summary = tokenizer.convert_tokens_to_string(summary_tokens)
791
+
792
+ return summary
793
  ```
794
+ </details>
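+
+ A minimal sketch of how `get_summary` might be combined with the Step-0 output to build a dataset description (the file name and the number of summarized contexts are illustrative):
+
+ ```python
+ import json
+
+ with open("./datasets/unique_contexts/agriculture_unique_contexts.json", encoding="utf-8") as f:
+     unique_contexts = json.load(f)
+
+ # Summarize a few contexts and join them into one dataset description
+ summaries = [get_summary(ctx, tot_tokens=2000) for ctx in unique_contexts[:5]]
+ description = "\n".join(summaries)
+ ```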
795
+
796
+ ### Step-3 Query
797
+ We extract the queries generated in Step-2 and use them to query LightRAG.
798
+
799
+ <details>
800
+ <summary> Code </summary>
801
+
802
+ ```python
+ import re
+
803
+ def extract_queries(file_path):
804
+ with open(file_path, 'r') as f:
805
+ data = f.read()
806
+
807
+ data = data.replace('**', '')
808
+
809
+ queries = re.findall(r'- Question \d+: (.+)', data)
810
+
811
+ return queries
812
+ ```
813
+ </details>
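+
+ The extracted queries can then be run against LightRAG, for example in hybrid mode (a minimal sketch; `rag` is an initialized LightRAG instance and the result file path is a placeholder):
+
+ ```python
+ from lightrag import QueryParam
+
+ queries = extract_queries("./queries/agriculture_queries.txt")
+
+ for query in queries:
+     answer = rag.query(query, param=QueryParam(mode="hybrid"))
+     print(answer)
+ ```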
814
+
815
+ ## Code Structure
816
+
817
+ ```python
818
+ .
819
+ ├── examples
820
+ │ ├── batch_eval.py
821
+ │ ├── generate_query.py
822
+ │ ├── graph_visual_with_html.py
823
+ │ ├── graph_visual_with_neo4j.py
824
+ │ ├── lightrag_api_openai_compatible_demo.py
825
+ │ ├── lightrag_azure_openai_demo.py
826
+ │ ├── lightrag_bedrock_demo.py
827
+ │ ├── lightrag_hf_demo.py
828
+ │ ├── lightrag_lmdeploy_demo.py
829
+ │ ├── lightrag_ollama_demo.py
830
+ │ ├── lightrag_openai_compatible_demo.py
831
+ │ ├── lightrag_openai_demo.py
832
+ │ ├── lightrag_siliconcloud_demo.py
833
+ │ └── vram_management_demo.py
834
+ ├── lightrag
835
+ │ ├── __init__.py
836
+ │ ├── base.py
837
+ │ ├── lightrag.py
838
+ │ ├── llm.py
839
+ │ ├── operate.py
840
+ │ ├── prompt.py
841
+ │ ├── storage.py
842
+ │ └── utils.py
843
+ ├── reproduce
844
+ │ ├── Step_0.py
845
+ │ ├── Step_1_openai_compatible.py
846
+ │ ├── Step_1.py
847
+ │ ├── Step_2.py
848
+ │ ├── Step_3_openai_compatible.py
849
+ │ └── Step_3.py
850
+ ├── .gitignore
851
+ ├── .pre-commit-config.yaml
852
+ ├── LICENSE
853
+ ├── README.md
854
+ ├── requirements.txt
855
+ └── setup.py
856
+ ```
857
+
858
+ ## Star History
859
+
860
+ <a href="https://star-history.com/#HKUDS/LightRAG&Date">
861
+ <picture>
862
+ <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=HKUDS/LightRAG&type=Date&theme=dark" />
863
+ <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=HKUDS/LightRAG&type=Date" />
864
+ <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=HKUDS/LightRAG&type=Date" />
865
+ </picture>
866
+ </a>
867
+
868
+ ## Citation
869
+
870
+ ```python
871
+ @article{guo2024lightrag,
872
+ title={LightRAG: Simple and Fast Retrieval-Augmented Generation},
873
+ author={Zirui Guo and Lianghao Xia and Yanhua Yu and Tu Ao and Chao Huang},
874
+ year={2024},
875
+ eprint={2410.05779},
876
+ archivePrefix={arXiv},
877
+ primaryClass={cs.IR}
878
+ }
879
+ ```
880
+
881
 
882