rushichaganti committed on
Commit
bceecd4
·
1 Parent(s): fa65a80

Fixed lint and added new imports at the top of the file

Files changed (3)
  1. README.md +166 -67
  2. lightrag/lightrag.py +320 -1
  3. requirements.txt +7 -2
README.md CHANGED
@@ -37,28 +37,30 @@ This repository hosts the code of LightRAG. The structure of this code is based
37
  </br>
38
 
39
 
 
 
 
40
  <details>
41
  <summary style="font-size: 1.4em; font-weight: bold; cursor: pointer; display: list-item;">
42
  🎉 News
43
  </summary>
44
 
45
-
46
- - [x] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) understanding extremely long-context videos.
47
- - [x] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG) making RAG simpler with small models.
48
- - [x] [2025.01.06]🎯📢You can now [use PostgreSQL for Storage](#using-postgresql-for-storage).
49
- - [x] [2024.12.31]🎯📢LightRAG now supports [deletion by document ID](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#delete).
50
- - [x] [2024.11.25]🎯📢LightRAG now supports seamless integration of [custom knowledge graphs](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#insert-custom-kg), empowering users to enhance the system with their own domain expertise.
51
- - [x] [2024.11.19]🎯📢A comprehensive guide to LightRAG is now available on [LearnOpenCV](https://learnopencv.com/lightrag). Many thanks to the blog author.
52
- - [x] [2024.11.12]🎯📢LightRAG now supports [Oracle Database 23ai for all storage types (KV, vector, and graph)](https://github.com/HKUDS/LightRAG/blob/main/examples/lightrag_oracle_demo.py).
53
- - [x] [2024.11.11]🎯📢LightRAG now supports [deleting entities by their names](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#delete).
54
- - [x] [2024.11.09]🎯📢Introducing the [LightRAG Gui](https://lightrag-gui.streamlit.app), which allows you to insert, query, visualize, and download LightRAG knowledge.
55
- - [x] [2024.11.04]🎯📢You can now [use Neo4J for Storage](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#using-neo4j-for-storage).
56
- - [x] [2024.10.29]🎯📢LightRAG now supports multiple file types, including PDF, DOC, PPT, and CSV via `textract`.
57
- - [x] [2024.10.20]🎯📢We've added a new feature to LightRAG: Graph Visualization.
58
- - [x] [2024.10.18]🎯📢We've added a link to a [LightRAG Introduction Video](https://youtu.be/oageL-1I0GE). Thanks to the author!
59
- - [x] [2024.10.17]🎯📢We have created a [Discord channel](https://discord.gg/yF2MmDJyGJ)! Welcome to join for sharing and discussions! 🎉🎉
60
- - [x] [2024.10.16]🎯📢LightRAG now supports [Ollama models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
61
- - [x] [2024.10.15]🎯📢LightRAG now supports [Hugging Face models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
62
 
63
  </details>
64
 
@@ -82,16 +84,20 @@ This repository hosts the code of LightRAG. The structure of this code is based
82
  cd LightRAG
83
  pip install -e .
84
  ```
 
85
  * Install from PyPI
 
86
  ```bash
87
  pip install lightrag-hku
88
  ```
89
 
90
  ## Quick Start
 
91
  * [Video demo](https://www.youtube.com/watch?v=g21royNJ4fw) of running LightRAG locally.
92
* All the code can be found in the `examples` directory.
93
* Set your OpenAI API key in the environment if using OpenAI models: `export OPENAI_API_KEY="sk-..."`.
94
  * Download the demo text "A Christmas Carol by Charles Dickens":
 
95
  ```bash
96
  curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt
97
  ```
@@ -187,6 +193,7 @@ class QueryParam:
187
<summary> <b>Using OpenAI-like APIs</b> </summary>
188
 
189
* LightRAG also supports OpenAI-like chat/embedding APIs:
 
190
  ```python
191
  async def llm_model_func(
192
  prompt, system_prompt=None, history_messages=[], keyword_extraction=False, **kwargs
@@ -225,6 +232,7 @@ async def initialize_rag():
225
 
226
  return rag
227
  ```
 
228
  </details>
229
 
230
  <details>
@@ -252,12 +260,14 @@ rag = LightRAG(
252
  ),
253
  )
254
  ```
 
255
  </details>
256
 
257
  <details>
258
  <summary> <b>Using Ollama Models</b> </summary>
259
 
260
  ### Overview
 
261
If you want to use Ollama models, you need to pull the model you plan to use as well as an embedding model, for example `nomic-embed-text`.
262
 
263
Then you only need to set up LightRAG as follows:
@@ -281,31 +291,37 @@ rag = LightRAG(
281
  ```
282
 
283
  ### Increasing context size
 
284
For LightRAG to work properly, the context should be at least 32k tokens. By default, Ollama models have a context size of 8k. You can increase it in one of two ways:
285
 
286
#### Increasing the `num_ctx` parameter in the Modelfile
287
 
288
  1. Pull the model:
 
289
  ```bash
290
  ollama pull qwen2
291
  ```
292
 
293
  2. Display the model file:
 
294
  ```bash
295
  ollama show --modelfile qwen2 > Modelfile
296
  ```
297
 
298
  3. Edit the Modelfile by adding the following line:
 
299
  ```bash
300
  PARAMETER num_ctx 32768
301
  ```
302
 
303
  4. Create the modified model:
 
304
  ```bash
305
  ollama create -f Modelfile qwen2m
306
  ```
307
 
308
#### Setting `num_ctx` via the Ollama API
 
309
You can use the `llm_model_kwargs` parameter to configure Ollama:
310
 
311
  ```python
@@ -325,6 +341,7 @@ rag = LightRAG(
325
  ),
326
  )
327
  ```
 
328
  #### Low RAM GPUs
329
 
330
To run this experiment on a low-RAM GPU, you should select a small model and tune the context window (increasing the context increases memory consumption). For example, running this Ollama example on a repurposed mining GPU with 6 GB of RAM required setting the context size to 26k while using `gemma2:2b`. It was able to find 197 entities and 19 relations on `book.txt`.
@@ -402,6 +419,7 @@ if __name__ == "__main__":
402
  ```
403
 
404
  #### For detailed documentation and examples, see:
 
405
  - [LlamaIndex Documentation](lightrag/llm/Readme.md)
406
  - [Direct OpenAI Example](examples/lightrag_llamaindex_direct_demo.py)
407
  - [LiteLLM Proxy Example](examples/lightrag_llamaindex_litellm_demo.py)
@@ -483,13 +501,16 @@ print(response_custom)
483
  We've introduced a new function `query_with_separate_keyword_extraction` to enhance the keyword extraction capabilities. This function separates the keyword extraction process from the user's prompt, focusing solely on the query to improve the relevance of extracted keywords.
484
 
485
##### How It Works
 
486
  The function operates by dividing the input into two parts:
 
487
  - `User Query`
488
  - `Prompt`
489
 
490
  It then performs keyword extraction exclusively on the `user query`. This separation ensures that the extraction process is focused and relevant, unaffected by any additional language in the `prompt`. It also allows the `prompt` to serve purely for response formatting, maintaining the intent and clarity of the user's original question.
491
 
492
  ##### Usage Example
 
493
This example shows how to tailor the function for educational content, focusing on detailed explanations for older students.
494
 
495
  ```python
@@ -563,6 +584,7 @@ custom_kg = {
563
 
564
  rag.insert_custom_kg(custom_kg)
565
  ```
 
566
  </details>
567
 
568
  ## Insert
@@ -593,6 +615,7 @@ rag.insert(["TEXT1", "TEXT2", "TEXT3", ...]) # Documents will be processed in b
593
  ```
594
 
595
  The `insert_batch_size` parameter in `addon_params` controls how many documents are processed in each batch during insertion. This is useful for:
 
596
  - Managing memory usage with large document collections
597
  - Optimizing processing speed
598
  - Providing better progress tracking
@@ -647,6 +670,7 @@ text_content = textract.process(file_path)
647
 
648
  rag.insert(text_content.decode('utf-8'))
649
  ```
 
650
  </details>
651
 
652
  ## Storage
@@ -685,6 +709,7 @@ async def initialize_rag():
685
 
686
  return rag
687
  ```
 
688
See `test_neo4j.py` for a working example.
689
 
690
  </details>
@@ -693,6 +718,7 @@ see test_neo4j.py for a working example.
693
  <summary> <b>Using PostgreSQL for Storage</b> </summary>
694
 
695
For production-level scenarios, you will most likely want to leverage an enterprise solution. PostgreSQL can provide a one-stop solution as a KV store, VectorDB (pgvector), and GraphDB (Apache AGE).
 
696
* PostgreSQL is lightweight; the whole binary distribution, including all necessary plugins, can be zipped to 40MB. See the [Windows Release](https://github.com/ShanGor/apache-age-windows/releases/tag/PG17%2Fv1.5.0-rc0); it is also easy to install on Linux/Mac.
697
* If you prefer Docker and are a beginner, please start with this image to avoid hiccups (DO read the overview): https://hub.docker.com/r/shangor/postgres-for-rag
698
* How to start? See: [examples/lightrag_zhipu_postgres_demo.py](https://github.com/HKUDS/LightRAG/blob/main/examples/lightrag_zhipu_postgres_demo.py)
@@ -735,6 +761,7 @@ For production level scenarios you will most likely want to leverage an enterpri
735
  > It is a known issue of the release version: https://github.com/apache/age/pull/1721
736
  >
737
> You can compile AGE from source code to fix it.
 
738
 
739
  </details>
740
 
@@ -742,9 +769,11 @@ For production level scenarios you will most likely want to leverage an enterpri
742
  <summary> <b>Using Faiss for Storage</b> </summary>
743
 
744
  - Install the required dependencies:
 
745
  ```
746
  pip install faiss-cpu
747
  ```
 
748
  You can also install `faiss-gpu` if you have GPU support.
749
 
750
- Here we are using `sentence-transformers`, but you can also use the `OpenAIEmbedding` model with `3072` dimensions.
@@ -810,6 +839,7 @@ relation = rag.create_relation("Google", "Gmail", {
810
  "weight": 2.0
811
  })
812
  ```
 
813
  </details>
814
 
815
  <details>
@@ -835,6 +865,7 @@ updated_relation = rag.edit_relation("Google", "Google Mail", {
835
  "weight": 3.0
836
  })
837
  ```
 
838
  </details>
839
 
840
  All operations are available in both synchronous and asynchronous versions. The asynchronous versions have the prefix "a" (e.g., `acreate_entity`, `aedit_relation`).
@@ -851,6 +882,55 @@ All operations are available in both synchronous and asynchronous versions. The
851
 
852
  These operations maintain data consistency across both the graph database and vector database components, ensuring your knowledge graph remains coherent.
853
 
854
  ## Entity Merging
855
 
856
  <details>
@@ -913,6 +993,7 @@ rag.merge_entities(
913
  ```
914
 
915
  When merging entities:
 
916
  * All relationships from source entities are redirected to the target entity
917
  * Duplicate relationships are intelligently merged
918
  * Self-relationships (loops) are prevented
@@ -946,6 +1027,7 @@ rag.clear_cache(modes=["local"])
946
  ```
947
 
948
  Valid modes are:
 
949
  - `"default"`: Extraction cache
950
  - `"naive"`: Naive search cache
951
  - `"local"`: Local search cache
@@ -960,33 +1042,33 @@ Valid modes are:
960
  <details>
961
  <summary> Parameters </summary>
962
 
963
- | **Parameter** | **Type** | **Explanation** | **Default** |
964
- |----------------------------------------------| --- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
965
- | **working\_dir** | `str` | Directory where the cache will be stored | `lightrag_cache+timestamp` |
966
- | **kv\_storage** | `str` | Storage type for documents and text chunks. Supported types: `JsonKVStorage`, `OracleKVStorage` | `JsonKVStorage` |
967
- | **vector\_storage** | `str` | Storage type for embedding vectors. Supported types: `NanoVectorDBStorage`, `OracleVectorDBStorage` | `NanoVectorDBStorage` |
968
- | **graph\_storage** | `str` | Storage type for graph edges and nodes. Supported types: `NetworkXStorage`, `Neo4JStorage`, `OracleGraphStorage` | `NetworkXStorage` |
969
- | **chunk\_token\_size** | `int` | Maximum token size per chunk when splitting documents | `1200` |
970
- | **chunk\_overlap\_token\_size** | `int` | Overlap token size between two chunks when splitting documents | `100` |
971
- | **tiktoken\_model\_name** | `str` | Model name for the Tiktoken encoder used to calculate token numbers | `gpt-4o-mini` |
972
- | **entity\_extract\_max\_gleaning** | `int` | Number of loops in the entity extraction process, appending history messages | `1` |
973
- | **entity\_summary\_to\_max\_tokens** | `int` | Maximum token size for each entity summary | `500` |
974
- | **node\_embedding\_algorithm** | `str` | Algorithm for node embedding (currently not used) | `node2vec` |
975
- | **node2vec\_params** | `dict` | Parameters for node embedding | `{"dimensions": 1536,"num_walks": 10,"walk_length": 40,"window_size": 2,"iterations": 3,"random_seed": 3,}` |
976
- | **embedding\_func** | `EmbeddingFunc` | Function to generate embedding vectors from text | `openai_embed` |
977
- | **embedding\_batch\_num** | `int` | Maximum batch size for embedding processes (multiple texts sent per batch) | `32` |
978
- | **embedding\_func\_max\_async** | `int` | Maximum number of concurrent asynchronous embedding processes | `16` |
979
- | **llm\_model\_func** | `callable` | Function for LLM generation | `gpt_4o_mini_complete` |
980
- | **llm\_model\_name** | `str` | LLM model name for generation | `meta-llama/Llama-3.2-1B-Instruct` |
981
- | **llm\_model\_max\_token\_size** | `int` | Maximum token size for LLM generation (affects entity relation summaries) | `32768`(default value changed by env var MAX_TOKENS) |
982
- | **llm\_model\_max\_async** | `int` | Maximum number of concurrent asynchronous LLM processes | `16`(default value changed by env var MAX_ASYNC) |
983
- | **llm\_model\_kwargs** | `dict` | Additional parameters for LLM generation | |
984
- | **vector\_db\_storage\_cls\_kwargs** | `dict` | Additional parameters for vector database, like setting the threshold for nodes and relations retrieval. | cosine_better_than_threshold: 0.2(default value changed by env var COSINE_THRESHOLD) |
985
- | **enable\_llm\_cache** | `bool` | If `TRUE`, stores LLM results in cache; repeated prompts return cached responses | `TRUE` |
986
- | **enable\_llm\_cache\_for\_entity\_extract** | `bool` | If `TRUE`, stores LLM results in cache for entity extraction; Good for beginners to debug your application | `TRUE` |
987
- | **addon\_params** | `dict` | Additional parameters, e.g., `{"example_number": 1, "language": "Simplified Chinese", "entity_types": ["organization", "person", "geo", "event"], "insert_batch_size": 10}`: sets example limit, output language, and batch size for document processing | `example_number: all examples, language: English, insert_batch_size: 10` |
988
- | **convert\_response\_to\_json\_func** | `callable` | Not used | `convert_response_to_json` |
989
- | **embedding\_cache\_config** | `dict` | Configuration for question-answer caching. Contains three parameters:<br>- `enabled`: Boolean value to enable/disable cache lookup functionality. When enabled, the system will check cached responses before generating new answers.<br>- `similarity_threshold`: Float value (0-1), similarity threshold. When a new question's similarity with a cached question exceeds this threshold, the cached answer will be returned directly without calling the LLM.<br>- `use_llm_check`: Boolean value to enable/disable LLM similarity verification. When enabled, LLM will be used as a secondary check to verify the similarity between questions before returning cached answers. | Default: `{"enabled": False, "similarity_threshold": 0.95, "use_llm_check": False}` |
990
 
991
  </details>
992
 
@@ -996,12 +1078,15 @@ Valid modes are:
996
  <summary>Click to view error handling details</summary>
997
 
998
  The API includes comprehensive error handling:
 
999
  - File not found errors (404)
1000
  - Processing errors (500)
1001
  - Supports multiple file encodings (UTF-8 and GBK)
 
1002
  </details>
1003
 
1004
  ## API
 
1005
LightRAG can be installed with API support to serve a FastAPI interface for data upload, indexing, RAG operations, rescanning of the input folder, and more.
1006
 
1007
  [LightRag API](lightrag/api/README.md)
@@ -1035,7 +1120,6 @@ net.show('knowledge_graph.html')
1035
  <details>
1036
<summary> <b>Graph visualization with Neo4j</b> </summary>
1037
 
1038
-
1039
  * The following code can be found in `examples/graph_visual_with_neo4j.py`
1040
 
1041
  ```python
@@ -1171,10 +1255,13 @@ LightRag can be installed with Tools support to add extra tools like the graphml
1171
  </details>
1172
 
1173
  ## Evaluation
 
1174
  ### Dataset
 
1175
  The dataset used in LightRAG can be downloaded from [TommyChien/UltraDomain](https://huggingface.co/datasets/TommyChien/UltraDomain).
1176
 
1177
  ### Generate Query
 
1178
  LightRAG uses the following prompt to generate high-level queries, with the corresponding code in `example/generate_query.py`.
1179
 
1180
  <details>
@@ -1203,9 +1290,11 @@ Output the results in the following structure:
1203
  - User 5: [user description]
1204
  ...
1205
  ```
 
1206
  </details>
1207
 
1208
  ### Batch Eval
 
1209
  To evaluate the performance of two RAG systems on high-level queries, LightRAG uses the following prompt, with the specific code available in `example/batch_eval.py`.
1210
 
1211
  <details>
@@ -1253,37 +1342,40 @@ Output your evaluation in the following JSON format:
1253
  }}
1254
  }}
1255
  ```
 
1256
  </details>
1257
 
1258
  ### Overall Performance Table
1259
 
1260
- | | **Agriculture** | | **CS** | | **Legal** | | **Mix** | |
1261
- |----------------------|-------------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
1262
- | | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** |
1263
- | **Comprehensiveness** | 32.4% | **67.6%** | 38.4% | **61.6%** | 16.4% | **83.6%** | 38.8% | **61.2%** |
1264
- | **Diversity** | 23.6% | **76.4%** | 38.0% | **62.0%** | 13.6% | **86.4%** | 32.4% | **67.6%** |
1265
- | **Empowerment** | 32.4% | **67.6%** | 38.8% | **61.2%** | 16.4% | **83.6%** | 42.8% | **57.2%** |
1266
- | **Overall** | 32.4% | **67.6%** | 38.8% | **61.2%** | 15.2% | **84.8%** | 40.0% | **60.0%** |
1267
- | | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** |
1268
- | **Comprehensiveness** | 31.6% | **68.4%** | 38.8% | **61.2%** | 15.2% | **84.8%** | 39.2% | **60.8%** |
1269
- | **Diversity** | 29.2% | **70.8%** | 39.2% | **60.8%** | 11.6% | **88.4%** | 30.8% | **69.2%** |
1270
- | **Empowerment** | 31.6% | **68.4%** | 36.4% | **63.6%** | 15.2% | **84.8%** | 42.4% | **57.6%** |
1271
- | **Overall** | 32.4% | **67.6%** | 38.0% | **62.0%** | 14.4% | **85.6%** | 40.0% | **60.0%** |
1272
- | | HyDE | **LightRAG** | HyDE | **LightRAG** | HyDE | **LightRAG** | HyDE | **LightRAG** |
1273
- | **Comprehensiveness** | 26.0% | **74.0%** | 41.6% | **58.4%** | 26.8% | **73.2%** | 40.4% | **59.6%** |
1274
- | **Diversity** | 24.0% | **76.0%** | 38.8% | **61.2%** | 20.0% | **80.0%** | 32.4% | **67.6%** |
1275
- | **Empowerment** | 25.2% | **74.8%** | 40.8% | **59.2%** | 26.0% | **74.0%** | 46.0% | **54.0%** |
1276
- | **Overall** | 24.8% | **75.2%** | 41.6% | **58.4%** | 26.4% | **73.6%** | 42.4% | **57.6%** |
1277
- | | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** |
1278
- | **Comprehensiveness** | 45.6% | **54.4%** | 48.4% | **51.6%** | 48.4% | **51.6%** | **50.4%** | 49.6% |
1279
- | **Diversity** | 22.8% | **77.2%** | 40.8% | **59.2%** | 26.4% | **73.6%** | 36.0% | **64.0%** |
1280
- | **Empowerment** | 41.2% | **58.8%** | 45.2% | **54.8%** | 43.6% | **56.4%** | **50.8%** | 49.2% |
1281
- | **Overall** | 45.2% | **54.8%** | 48.0% | **52.0%** | 47.2% | **52.8%** | **50.4%** | 49.6% |
1282
 
1283
  ## Reproduce
 
1284
  All the code can be found in the `./reproduce` directory.
1285
 
1286
  ### Step-0 Extract Unique Contexts
 
1287
  First, we need to extract unique contexts in the datasets.
1288
 
1289
  <details>
@@ -1340,9 +1432,11 @@ def extract_unique_contexts(input_directory, output_directory):
1340
  print("All files have been processed.")
1341
 
1342
  ```
 
1343
  </details>
1344
 
1345
  ### Step-1 Insert Contexts
 
1346
  For the extracted contexts, we insert them into the LightRAG system.
1347
 
1348
  <details>
@@ -1366,6 +1460,7 @@ def insert_text(rag, file_path):
1366
  if retries == max_retries:
1367
  print("Insertion failed after exceeding the maximum number of retries")
1368
  ```
 
1369
  </details>
1370
 
1371
  ### Step-2 Generate Queries
@@ -1390,9 +1485,11 @@ def get_summary(context, tot_tokens=2000):
1390
 
1391
  return summary
1392
  ```
 
1393
  </details>
1394
 
1395
  ### Step-3 Query
 
1396
  For the queries generated in Step-2, we will extract them and query LightRAG.
1397
 
1398
  <details>
@@ -1409,6 +1506,7 @@ def extract_queries(file_path):
1409
 
1410
  return queries
1411
  ```
 
1412
  </details>
1413
 
1414
  ## Star History
@@ -1441,4 +1539,5 @@ archivePrefix={arXiv},
1441
  primaryClass={cs.IR}
1442
  }
1443
  ```
 
1444
  **Thank you for your interest in our work!**
 
37
  </br>
38
 
39
 
40
+
41
+
42
+
43
  <details>
44
  <summary style="font-size: 1.4em; font-weight: bold; cursor: pointer; display: list-item;">
45
  🎉 News
46
  </summary>
47
 
48
+ - [X] [2025.02.05]🎯📢Our team has released [VideoRAG](https://github.com/HKUDS/VideoRAG) for understanding extremely long-context videos.
49
+ - [X] [2025.01.13]🎯📢Our team has released [MiniRAG](https://github.com/HKUDS/MiniRAG), making RAG simpler with small models.
50
+ - [X] [2025.01.06]🎯📢You can now [use PostgreSQL for Storage](#using-postgresql-for-storage).
51
+ - [X] [2024.12.31]🎯📢LightRAG now supports [deletion by document ID](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#delete).
52
+ - [X] [2024.11.25]🎯📢LightRAG now supports seamless integration of [custom knowledge graphs](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#insert-custom-kg), empowering users to enhance the system with their own domain expertise.
53
+ - [X] [2024.11.19]🎯📢A comprehensive guide to LightRAG is now available on [LearnOpenCV](https://learnopencv.com/lightrag). Many thanks to the blog author.
54
+ - [X] [2024.11.12]🎯📢LightRAG now supports [Oracle Database 23ai for all storage types (KV, vector, and graph)](https://github.com/HKUDS/LightRAG/blob/main/examples/lightrag_oracle_demo.py).
55
+ - [X] [2024.11.11]🎯📢LightRAG now supports [deleting entities by their names](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#delete).
56
+ - [X] [2024.11.09]🎯📢Introducing the [LightRAG Gui](https://lightrag-gui.streamlit.app), which allows you to insert, query, visualize, and download LightRAG knowledge.
57
+ - [X] [2024.11.04]🎯📢You can now [use Neo4J for Storage](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#using-neo4j-for-storage).
58
+ - [X] [2024.10.29]🎯📢LightRAG now supports multiple file types, including PDF, DOC, PPT, and CSV via `textract`.
59
+ - [X] [2024.10.20]🎯📢We've added a new feature to LightRAG: Graph Visualization.
60
+ - [X] [2024.10.18]🎯📢We've added a link to a [LightRAG Introduction Video](https://youtu.be/oageL-1I0GE). Thanks to the author!
61
+ - [X] [2024.10.17]🎯📢We have created a [Discord channel](https://discord.gg/yF2MmDJyGJ)! Welcome to join for sharing and discussions! 🎉🎉
62
+ - [X] [2024.10.16]🎯📢LightRAG now supports [Ollama models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
63
+ - [X] [2024.10.15]🎯📢LightRAG now supports [Hugging Face models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
 
64
 
65
  </details>
66
 
 
84
  cd LightRAG
85
  pip install -e .
86
  ```
87
+
88
  * Install from PyPI
89
+
90
  ```bash
91
  pip install lightrag-hku
92
  ```
93
 
94
  ## Quick Start
95
+
96
  * [Video demo](https://www.youtube.com/watch?v=g21royNJ4fw) of running LightRAG locally.
97
* All the code can be found in the `examples` directory.
98
* Set your OpenAI API key in the environment if using OpenAI models: `export OPENAI_API_KEY="sk-..."`.
99
  * Download the demo text "A Christmas Carol by Charles Dickens":
100
+
101
  ```bash
102
  curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt
103
  ```
 
193
<summary> <b>Using OpenAI-like APIs</b> </summary>
194
 
195
* LightRAG also supports OpenAI-like chat/embedding APIs:
196
+
197
  ```python
198
  async def llm_model_func(
199
  prompt, system_prompt=None, history_messages=[], keyword_extraction=False, **kwargs
 
232
 
233
  return rag
234
  ```
235
+
236
  </details>
237
 
238
  <details>
 
260
  ),
261
  )
262
  ```
263
+
264
  </details>
265
 
266
  <details>
267
  <summary> <b>Using Ollama Models</b> </summary>
268
 
269
  ### Overview
270
+
271
If you want to use Ollama models, you need to pull the model you plan to use as well as an embedding model, for example `nomic-embed-text`.
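For example, pulling a chat model and the embedding model (the chat model name here is only an illustration; any Ollama model works):

```bash
# Pull a chat model and an embedding model for LightRAG
ollama pull qwen2
ollama pull nomic-embed-text
```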
272
 
273
Then you only need to set up LightRAG as follows:
 
291
  ```
292
 
293
  ### Increasing context size
294
+
295
For LightRAG to work properly, the context should be at least 32k tokens. By default, Ollama models have a context size of 8k. You can increase it in one of two ways:
296
 
297
#### Increasing the `num_ctx` parameter in the Modelfile
298
 
299
  1. Pull the model:
300
+
301
  ```bash
302
  ollama pull qwen2
303
  ```
304
 
305
  2. Display the model file:
306
+
307
  ```bash
308
  ollama show --modelfile qwen2 > Modelfile
309
  ```
310
 
311
  3. Edit the Modelfile by adding the following line:
312
+
313
  ```bash
314
  PARAMETER num_ctx 32768
315
  ```
316
 
317
  4. Create the modified model:
318
+
319
  ```bash
320
  ollama create -f Modelfile qwen2m
321
  ```
322
 
323
#### Setting `num_ctx` via the Ollama API
324
+
325
You can use the `llm_model_kwargs` parameter to configure Ollama:
326
 
327
  ```python
 
341
  ),
342
  )
343
  ```
344
+
345
  #### Low RAM GPUs
346
 
347
To run this experiment on a low-RAM GPU, you should select a small model and tune the context window (increasing the context increases memory consumption). For example, running this Ollama example on a repurposed mining GPU with 6 GB of RAM required setting the context size to 26k while using `gemma2:2b`. It was able to find 197 entities and 19 relations on `book.txt`.
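A rough sketch of that low-RAM setup, assuming the same Ollama helpers used in the example above (`ollama_model_complete`, `ollama_embed`, `EmbeddingFunc`; exact import paths can vary between LightRAG versions):

```python
from lightrag import LightRAG
from lightrag.llm.ollama import ollama_model_complete, ollama_embed  # assumed module path
from lightrag.utils import EmbeddingFunc

rag = LightRAG(
    working_dir="./lightrag_cache",
    llm_model_func=ollama_model_complete,
    llm_model_name="gemma2:2b",                        # small model for a ~6 GB GPU
    llm_model_kwargs={"options": {"num_ctx": 26000}},  # the ~26k context reported above
    embedding_func=EmbeddingFunc(
        embedding_dim=768,        # nomic-embed-text dimension
        max_token_size=8192,
        func=lambda texts: ollama_embed(texts, embed_model="nomic-embed-text"),
    ),
)
```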
 
419
  ```
420
 
421
  #### For detailed documentation and examples, see:
422
+
423
  - [LlamaIndex Documentation](lightrag/llm/Readme.md)
424
  - [Direct OpenAI Example](examples/lightrag_llamaindex_direct_demo.py)
425
  - [LiteLLM Proxy Example](examples/lightrag_llamaindex_litellm_demo.py)
 
501
  We've introduced a new function `query_with_separate_keyword_extraction` to enhance the keyword extraction capabilities. This function separates the keyword extraction process from the user's prompt, focusing solely on the query to improve the relevance of extracted keywords.
502
 
503
##### How It Works
504
+
505
  The function operates by dividing the input into two parts:
506
+
507
  - `User Query`
508
  - `Prompt`
509
 
510
  It then performs keyword extraction exclusively on the `user query`. This separation ensures that the extraction process is focused and relevant, unaffected by any additional language in the `prompt`. It also allows the `prompt` to serve purely for response formatting, maintaining the intent and clarity of the user's original question.
511
 
512
  ##### Usage Example
513
+
514
This example shows how to tailor the function for educational content, focusing on detailed explanations for older students.
515
 
516
  ```python
 
584
 
585
  rag.insert_custom_kg(custom_kg)
586
  ```
587
+
588
  </details>
589
 
590
  ## Insert
 
615
  ```
616
 
617
  The `insert_batch_size` parameter in `addon_params` controls how many documents are processed in each batch during insertion. This is useful for:
618
+
619
  - Managing memory usage with large document collections
620
  - Optimizing processing speed
621
  - Providing better progress tracking
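A minimal sketch of setting the batch size (the value `4` is only an illustration; keep your usual model and embedding configuration):

```python
rag = LightRAG(
    working_dir="./lightrag_cache",
    # ... llm_model_func / embedding_func as in the Quick Start examples ...
    addon_params={"insert_batch_size": 4},  # process 4 documents per batch
)

rag.insert(["TEXT1", "TEXT2", "TEXT3", "TEXT4", "TEXT5"])  # inserted in batches of 4
```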
 
670
 
671
  rag.insert(text_content.decode('utf-8'))
672
  ```
673
+
674
  </details>
675
 
676
  ## Storage
 
709
 
710
  return rag
711
  ```
712
+
713
See `test_neo4j.py` for a working example.
714
 
715
  </details>
 
718
  <summary> <b>Using PostgreSQL for Storage</b> </summary>
719
 
720
For production-level scenarios, you will most likely want to leverage an enterprise solution. PostgreSQL can provide a one-stop solution as a KV store, VectorDB (pgvector), and GraphDB (Apache AGE).
721
+
722
* PostgreSQL is lightweight; the whole binary distribution, including all necessary plugins, can be zipped to 40MB. See the [Windows Release](https://github.com/ShanGor/apache-age-windows/releases/tag/PG17%2Fv1.5.0-rc0); it is also easy to install on Linux/Mac.
723
* If you prefer Docker and are a beginner, please start with this image to avoid hiccups (DO read the overview): https://hub.docker.com/r/shangor/postgres-for-rag
724
* How to start? See: [examples/lightrag_zhipu_postgres_demo.py](https://github.com/HKUDS/LightRAG/blob/main/examples/lightrag_zhipu_postgres_demo.py)
 
761
  > It is a known issue of the release version: https://github.com/apache/age/pull/1721
762
  >
763
> You can compile AGE from source code to fix it.
764
+ >
765
 
766
  </details>
767
 
 
769
  <summary> <b>Using Faiss for Storage</b> </summary>
770
 
771
  - Install the required dependencies:
772
+
773
  ```
774
  pip install faiss-cpu
775
  ```
776
+
777
  You can also install `faiss-gpu` if you have GPU support.
778
 
779
- Here we are using `sentence-transformers`, but you can also use the `OpenAIEmbedding` model with `3072` dimensions.
 
839
  "weight": 2.0
840
  })
841
  ```
842
+
843
  </details>
844
 
845
  <details>
 
865
  "weight": 3.0
866
  })
867
  ```
868
+
869
  </details>
870
 
871
  All operations are available in both synchronous and asynchronous versions. The asynchronous versions have the prefix "a" (e.g., `acreate_entity`, `aedit_relation`).
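For example, a short sketch of the asynchronous form inside your own coroutine (the arguments mirror the synchronous `edit_relation` example above):

```python
import asyncio

async def update_graph(rag):
    # Async variants carry the "a" prefix but take the same arguments
    await rag.aedit_relation("Google", "Google Mail", {"weight": 3.0})

# asyncio.run(update_graph(rag))
```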
 
882
 
883
  These operations maintain data consistency across both the graph database and vector database components, ensuring your knowledge graph remains coherent.
884
 
885
+ ## Data Export Functions
886
+
887
+ ### Overview
888
+
889
+ LightRAG allows you to export your knowledge graph data in various formats for analysis, sharing, and backup purposes. The system supports exporting entities, relations, and relationship data.
890
+
891
+ ### Export Functions
892
+
893
+ #### Basic Usage
894
+
895
+ ```python
896
+ # Basic CSV export (default format)
897
+ rag.export_data("knowledge_graph.csv")
898
+
899
+ # Specify any format
900
+ rag.export_data("output.xlsx", file_format="excel")
901
+ ```
902
+
903
+ #### Supported File Formats
904
+
905
+ ```python
906
+ # Export data in CSV format
907
+ rag.export_data("graph_data.csv", file_format="csv")
908
+
909
+ # Export data in Excel format
910
+ rag.export_data("graph_data.xlsx", file_format="excel")
911
+
912
+ # Export data in markdown format
913
+ rag.export_data("graph_data.md", file_format="md")
914
+
915
+ # Export data in plain text format
916
+ rag.export_data("graph_data.txt", file_format="txt")
917
+ ```
918
+ ### Additional Options
919
+
920
+ Include vector embeddings in the export (optional):
921
+
922
+ ```python
923
+ rag.export_data("complete_data.csv", include_vector_data=True)
924
+ ```
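An asynchronous counterpart, `aexport_data`, takes the same arguments and can be awaited from an existing event loop:

```python
# Inside an async function, alongside other "a"-prefixed LightRAG calls
await rag.aexport_data("complete_data.csv", include_vector_data=True)
```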
925
+ ### Data Included in Export
926
+
927
+ All exports include:
928
+
929
+ * Entity information (names, IDs, metadata)
930
+ * Relation data (connections between entities)
931
+ * Relationship information from vector database
932
+
933
+
934
  ## Entity Merging
935
 
936
  <details>
 
993
  ```
994
 
995
  When merging entities:
996
+
997
  * All relationships from source entities are redirected to the target entity
998
  * Duplicate relationships are intelligently merged
999
  * Self-relationships (loops) are prevented
 
1027
  ```
1028
 
1029
  Valid modes are:
1030
+
1031
  - `"default"`: Extraction cache
1032
  - `"naive"`: Naive search cache
1033
  - `"local"`: Local search cache
 
1042
  <details>
1043
  <summary> Parameters </summary>
1044
 
1045
+ | **Parameter** | **Type** | **Explanation** | **Default** |
1046
+ | -------------------------------------------------- | ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- |
1047
+ | **working\_dir** | `str` | Directory where the cache will be stored | `lightrag_cache+timestamp` |
1048
+ | **kv\_storage** | `str` | Storage type for documents and text chunks. Supported types:`JsonKVStorage`, `OracleKVStorage` | `JsonKVStorage` |
1049
+ | **vector\_storage** | `str` | Storage type for embedding vectors. Supported types:`NanoVectorDBStorage`, `OracleVectorDBStorage` | `NanoVectorDBStorage` |
1050
+ | **graph\_storage** | `str` | Storage type for graph edges and nodes. Supported types:`NetworkXStorage`, `Neo4JStorage`, `OracleGraphStorage` | `NetworkXStorage` |
1051
+ | **chunk\_token\_size** | `int` | Maximum token size per chunk when splitting documents | `1200` |
1052
+ | **chunk\_overlap\_token\_size** | `int` | Overlap token size between two chunks when splitting documents | `100` |
1053
+ | **tiktoken\_model\_name** | `str` | Model name for the Tiktoken encoder used to calculate token numbers | `gpt-4o-mini` |
1054
+ | **entity\_extract\_max\_gleaning** | `int` | Number of loops in the entity extraction process, appending history messages | `1` |
1055
+ | **entity\_summary\_to\_max\_tokens** | `int` | Maximum token size for each entity summary | `500` |
1056
+ | **node\_embedding\_algorithm** | `str` | Algorithm for node embedding (currently not used) | `node2vec` |
1057
+ | **node2vec\_params** | `dict` | Parameters for node embedding | `{"dimensions": 1536,"num_walks": 10,"walk_length": 40,"window_size": 2,"iterations": 3,"random_seed": 3,}` |
1058
+ | **embedding\_func** | `EmbeddingFunc` | Function to generate embedding vectors from text | `openai_embed` |
1059
+ | **embedding\_batch\_num** | `int` | Maximum batch size for embedding processes (multiple texts sent per batch) | `32` |
1060
+ | **embedding\_func\_max\_async** | `int` | Maximum number of concurrent asynchronous embedding processes | `16` |
1061
+ | **llm\_model\_func** | `callable` | Function for LLM generation | `gpt_4o_mini_complete` |
1062
+ | **llm\_model\_name** | `str` | LLM model name for generation | `meta-llama/Llama-3.2-1B-Instruct` |
1063
+ | **llm\_model\_max\_token\_size** | `int` | Maximum token size for LLM generation (affects entity relation summaries) | `32768`(default value changed by env var MAX_TOKENS) |
1064
+ | **llm\_model\_max\_async** | `int` | Maximum number of concurrent asynchronous LLM processes | `16`(default value changed by env var MAX_ASYNC) |
1065
+ | **llm\_model\_kwargs** | `dict` | Additional parameters for LLM generation | |
1066
+ | **vector\_db\_storage\_cls\_kwargs** | `dict` | Additional parameters for vector database, like setting the threshold for nodes and relations retrieval. | cosine_better_than_threshold: 0.2(default value changed by env var COSINE_THRESHOLD) |
1067
+ | **enable\_llm\_cache** | `bool` | If `TRUE`, stores LLM results in cache; repeated prompts return cached responses | `TRUE` |
1068
+ | **enable\_llm\_cache\_for\_entity\_extract** | `bool` | If `TRUE`, stores LLM results in cache for entity extraction; Good for beginners to debug your application | `TRUE` |
1069
+ | **addon\_params** | `dict` | Additional parameters, e.g.,`{"example_number": 1, "language": "Simplified Chinese", "entity_types": ["organization", "person", "geo", "event"], "insert_batch_size": 10}`: sets example limit, output language, and batch size for document processing | `example_number: all examples, language: English, insert_batch_size: 10` |
1070
+ | **convert\_response\_to\_json\_func** | `callable` | Not used | `convert_response_to_json` |
1071
+ | **embedding\_cache\_config** | `dict` | Configuration for question-answer caching. Contains three parameters:<br>- `enabled`: Boolean value to enable/disable cache lookup functionality. When enabled, the system will check cached responses before generating new answers.<br>- `similarity_threshold`: Float value (0-1), similarity threshold. When a new question's similarity with a cached question exceeds this threshold, the cached answer will be returned directly without calling the LLM.<br>- `use_llm_check`: Boolean value to enable/disable LLM similarity verification. When enabled, LLM will be used as a secondary check to verify the similarity between questions before returning cached answers. | Default: `{"enabled": False, "similarity_threshold": 0.95, "use_llm_check": False}` |
1072
 
1073
  </details>
1074
 
 
1078
  <summary>Click to view error handling details</summary>
1079
 
1080
  The API includes comprehensive error handling:
1081
+
1082
  - File not found errors (404)
1083
  - Processing errors (500)
1084
  - Supports multiple file encodings (UTF-8 and GBK)
1085
+
1086
  </details>
1087
 
1088
  ## API
1089
+
1090
LightRAG can be installed with API support to serve a FastAPI interface for data upload, indexing, RAG operations, rescanning of the input folder, and more.
1091
 
1092
  [LightRag API](lightrag/api/README.md)
 
1120
  <details>
1121
<summary> <b>Graph visualization with Neo4j</b> </summary>
1122
 
 
1123
  * The following code can be found in `examples/graph_visual_with_neo4j.py`
1124
 
1125
  ```python
 
1255
  </details>
1256
 
1257
  ## Evaluation
1258
+
1259
  ### Dataset
1260
+
1261
  The dataset used in LightRAG can be downloaded from [TommyChien/UltraDomain](https://huggingface.co/datasets/TommyChien/UltraDomain).
1262
 
1263
  ### Generate Query
1264
+
1265
  LightRAG uses the following prompt to generate high-level queries, with the corresponding code in `example/generate_query.py`.
1266
 
1267
  <details>
 
1290
  - User 5: [user description]
1291
  ...
1292
  ```
1293
+
1294
  </details>
1295
 
1296
  ### Batch Eval
1297
+
1298
  To evaluate the performance of two RAG systems on high-level queries, LightRAG uses the following prompt, with the specific code available in `example/batch_eval.py`.
1299
 
1300
  <details>
 
1342
  }}
1343
  }}
1344
  ```
1345
+
1346
  </details>
1347
 
1348
  ### Overall Performance Table
1349
 
1350
+ | | **Agriculture** | | **CS** | | **Legal** | | **Mix** | |
1351
+ | --------------------------- | --------------------- | ------------------ | ------------ | ------------------ | --------------- | ------------------ | --------------- | ------------------ |
1352
+ | | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** |
1353
+ | **Comprehensiveness** | 32.4% | **67.6%** | 38.4% | **61.6%** | 16.4% | **83.6%** | 38.8% | **61.2%** |
1354
+ | **Diversity** | 23.6% | **76.4%** | 38.0% | **62.0%** | 13.6% | **86.4%** | 32.4% | **67.6%** |
1355
+ | **Empowerment** | 32.4% | **67.6%** | 38.8% | **61.2%** | 16.4% | **83.6%** | 42.8% | **57.2%** |
1356
+ | **Overall** | 32.4% | **67.6%** | 38.8% | **61.2%** | 15.2% | **84.8%** | 40.0% | **60.0%** |
1357
+ | | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** |
1358
+ | **Comprehensiveness** | 31.6% | **68.4%** | 38.8% | **61.2%** | 15.2% | **84.8%** | 39.2% | **60.8%** |
1359
+ | **Diversity** | 29.2% | **70.8%** | 39.2% | **60.8%** | 11.6% | **88.4%** | 30.8% | **69.2%** |
1360
+ | **Empowerment** | 31.6% | **68.4%** | 36.4% | **63.6%** | 15.2% | **84.8%** | 42.4% | **57.6%** |
1361
+ | **Overall** | 32.4% | **67.6%** | 38.0% | **62.0%** | 14.4% | **85.6%** | 40.0% | **60.0%** |
1362
+ | | HyDE | **LightRAG** | HyDE | **LightRAG** | HyDE | **LightRAG** | HyDE | **LightRAG** |
1363
+ | **Comprehensiveness** | 26.0% | **74.0%** | 41.6% | **58.4%** | 26.8% | **73.2%** | 40.4% | **59.6%** |
1364
+ | **Diversity** | 24.0% | **76.0%** | 38.8% | **61.2%** | 20.0% | **80.0%** | 32.4% | **67.6%** |
1365
+ | **Empowerment** | 25.2% | **74.8%** | 40.8% | **59.2%** | 26.0% | **74.0%** | 46.0% | **54.0%** |
1366
+ | **Overall** | 24.8% | **75.2%** | 41.6% | **58.4%** | 26.4% | **73.6%** | 42.4% | **57.6%** |
1367
+ | | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** |
1368
+ | **Comprehensiveness** | 45.6% | **54.4%** | 48.4% | **51.6%** | 48.4% | **51.6%** | **50.4%** | 49.6% |
1369
+ | **Diversity** | 22.8% | **77.2%** | 40.8% | **59.2%** | 26.4% | **73.6%** | 36.0% | **64.0%** |
1370
+ | **Empowerment** | 41.2% | **58.8%** | 45.2% | **54.8%** | 43.6% | **56.4%** | **50.8%** | 49.2% |
1371
+ | **Overall** | 45.2% | **54.8%** | 48.0% | **52.0%** | 47.2% | **52.8%** | **50.4%** | 49.6% |
1372
 
1373
  ## Reproduce
1374
+
1375
  All the code can be found in the `./reproduce` directory.
1376
 
1377
  ### Step-0 Extract Unique Contexts
1378
+
1379
  First, we need to extract unique contexts in the datasets.
1380
 
1381
  <details>
 
1432
  print("All files have been processed.")
1433
 
1434
  ```
1435
+
1436
  </details>
1437
 
1438
  ### Step-1 Insert Contexts
1439
+
1440
  For the extracted contexts, we insert them into the LightRAG system.
1441
 
1442
  <details>
 
1460
  if retries == max_retries:
1461
  print("Insertion failed after exceeding the maximum number of retries")
1462
  ```
1463
+
1464
  </details>
1465
 
1466
  ### Step-2 Generate Queries
 
1485
 
1486
  return summary
1487
  ```
1488
+
1489
  </details>
1490
 
1491
  ### Step-3 Query
1492
+
1493
  For the queries generated in Step-2, we will extract them and query LightRAG.
1494
 
1495
  <details>
 
1506
 
1507
  return queries
1508
  ```
1509
+
1510
  </details>
1511
 
1512
  ## Star History
 
1539
  primaryClass={cs.IR}
1540
  }
1541
  ```
1542
+
1543
  **Thank you for your interest in our work!**
lightrag/lightrag.py CHANGED
@@ -3,11 +3,14 @@ from __future__ import annotations
3
  import asyncio
4
  import configparser
5
  import os
 
6
  import warnings
7
  from dataclasses import asdict, dataclass, field
8
  from datetime import datetime
9
  from functools import partial
10
- from typing import Any, AsyncIterator, Callable, Iterator, cast, final
 
 
11
 
12
  from lightrag.kg import (
13
  STORAGE_ENV_REQUIREMENTS,
@@ -2592,6 +2595,322 @@ class LightRAG:
2592
  logger.error(f"Error merging entities: {e}")
2593
  raise
2594
 
2595
  def merge_entities(
2596
  self,
2597
  source_entities: list[str],
 
3
  import asyncio
4
  import configparser
5
  import os
6
+ import csv
7
  import warnings
8
  from dataclasses import asdict, dataclass, field
9
  from datetime import datetime
10
  from functools import partial
11
+ from typing import Any, AsyncIterator, Callable, Iterator, cast, final, Literal
12
+ import pandas as pd
13
+
14
 
15
  from lightrag.kg import (
16
  STORAGE_ENV_REQUIREMENTS,
 
2595
  logger.error(f"Error merging entities: {e}")
2596
  raise
2597
 
2598
+ async def aexport_data(
2599
+ self,
2600
+ output_path: str,
2601
+ file_format: Literal["csv", "excel", "md", "txt"] = "csv",
2602
+ include_vector_data: bool = False,
2603
+ ) -> None:
2604
+ """
2605
+ Asynchronously exports all entities, relations, and relationships to various formats.
2606
+ Args:
2607
+ output_path: The path to the output file (including extension).
2608
+ file_format: Output format - "csv", "excel", "md", "txt".
2609
+ - csv: Comma-separated values file
2610
+ - excel: Microsoft Excel file with multiple sheets
2611
+ - md: Markdown tables
2612
+ - txt: Plain text formatted output
2614
+ include_vector_data: Whether to include data from the vector database.
2615
+ """
2616
+ # Collect data
2617
+ entities_data = []
2618
+ relations_data = []
2619
+ relationships_data = []
2620
+
2621
+ # --- Entities ---
2622
+ all_entities = await self.chunk_entity_relation_graph.get_all_labels()
2623
+ for entity_name in all_entities:
2624
+ entity_info = await self.get_entity_info(
2625
+ entity_name, include_vector_data=include_vector_data
2626
+ )
2627
+ entity_row = {
2628
+ "entity_name": entity_name,
2629
+ "source_id": entity_info["source_id"],
2630
+ "graph_data": str(
2631
+ entity_info["graph_data"]
2632
+ ), # Convert to string to ensure compatibility
2633
+ }
2634
+ if include_vector_data and "vector_data" in entity_info:
2635
+ entity_row["vector_data"] = str(entity_info["vector_data"])
2636
+ entities_data.append(entity_row)
2637
+
2638
+ # --- Relations ---
2639
+ for src_entity in all_entities:
2640
+ for tgt_entity in all_entities:
2641
+ if src_entity == tgt_entity:
2642
+ continue
2643
+
2644
+ edge_exists = await self.chunk_entity_relation_graph.has_edge(
2645
+ src_entity, tgt_entity
2646
+ )
2647
+ if edge_exists:
2648
+ relation_info = await self.get_relation_info(
2649
+ src_entity, tgt_entity, include_vector_data=include_vector_data
2650
+ )
2651
+ relation_row = {
2652
+ "src_entity": src_entity,
2653
+ "tgt_entity": tgt_entity,
2654
+ "source_id": relation_info["source_id"],
2655
+ "graph_data": str(
2656
+ relation_info["graph_data"]
2657
+ ), # Convert to string
2658
+ }
2659
+ if include_vector_data and "vector_data" in relation_info:
2660
+ relation_row["vector_data"] = str(relation_info["vector_data"])
2661
+ relations_data.append(relation_row)
2662
+
2663
+ # --- Relationships (from VectorDB) ---
2664
+ all_relationships = await self.relationships_vdb.client_storage
2665
+ for rel in all_relationships["data"]:
2666
+ relationships_data.append(
2667
+ {
2668
+ "relationship_id": rel["__id__"],
2669
+ "data": str(rel), # Convert to string for compatibility
2670
+ }
2671
+ )
2672
+
2673
+ # Export based on format
2674
+ if file_format == "csv":
2675
+ # CSV export
2676
+ with open(output_path, "w", newline="", encoding="utf-8") as csvfile:
2677
+ # Entities
2678
+ if entities_data:
2679
+ csvfile.write("# ENTITIES\n")
2680
+ writer = csv.DictWriter(csvfile, fieldnames=entities_data[0].keys())
2681
+ writer.writeheader()
2682
+ writer.writerows(entities_data)
2683
+ csvfile.write("\n\n")
2684
+
2685
+ # Relations
2686
+ if relations_data:
2687
+ csvfile.write("# RELATIONS\n")
2688
+ writer = csv.DictWriter(
2689
+ csvfile, fieldnames=relations_data[0].keys()
2690
+ )
2691
+ writer.writeheader()
2692
+ writer.writerows(relations_data)
2693
+ csvfile.write("\n\n")
2694
+
2695
+ # Relationships
2696
+ if relationships_data:
2697
+ csvfile.write("# RELATIONSHIPS\n")
2698
+ writer = csv.DictWriter(
2699
+ csvfile, fieldnames=relationships_data[0].keys()
2700
+ )
2701
+ writer.writeheader()
2702
+ writer.writerows(relationships_data)
2703
+
2704
+ elif file_format == "excel":
2705
+ # Excel export
2706
+ entities_df = (
2707
+ pd.DataFrame(entities_data) if entities_data else pd.DataFrame()
2708
+ )
2709
+ relations_df = (
2710
+ pd.DataFrame(relations_data) if relations_data else pd.DataFrame()
2711
+ )
2712
+ relationships_df = (
2713
+ pd.DataFrame(relationships_data)
2714
+ if relationships_data
2715
+ else pd.DataFrame()
2716
+ )
2717
+
2718
+ with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
2719
+ if not entities_df.empty:
2720
+ entities_df.to_excel(writer, sheet_name="Entities", index=False)
2721
+ if not relations_df.empty:
2722
+ relations_df.to_excel(writer, sheet_name="Relations", index=False)
2723
+ if not relationships_df.empty:
2724
+ relationships_df.to_excel(
2725
+ writer, sheet_name="Relationships", index=False
2726
+ )
2727
+
2728
+ elif file_format == "md":
2729
+ # Markdown export
2730
+ with open(output_path, "w", encoding="utf-8") as mdfile:
2731
+ mdfile.write("# LightRAG Data Export\n\n")
2732
+
2733
+ # Entities
2734
+ mdfile.write("## Entities\n\n")
2735
+ if entities_data:
2736
+ # Write header
2737
+ mdfile.write("| " + " | ".join(entities_data[0].keys()) + " |\n")
2738
+ mdfile.write(
2739
+ "| "
2740
+ + " | ".join(["---"] * len(entities_data[0].keys()))
2741
+ + " |\n"
2742
+ )
2743
+
2744
+ # Write rows
2745
+ for entity in entities_data:
2746
+ mdfile.write(
2747
+ "| " + " | ".join(str(v) for v in entity.values()) + " |\n"
2748
+ )
2749
+ mdfile.write("\n\n")
2750
+ else:
2751
+ mdfile.write("*No entity data available*\n\n")
2752
+
2753
+ # Relations
2754
+ mdfile.write("## Relations\n\n")
2755
+ if relations_data:
2756
+ # Write header
2757
+ mdfile.write("| " + " | ".join(relations_data[0].keys()) + " |\n")
2758
+ mdfile.write(
2759
+ "| "
2760
+ + " | ".join(["---"] * len(relations_data[0].keys()))
2761
+ + " |\n"
2762
+ )
2763
+
2764
+ # Write rows
2765
+ for relation in relations_data:
2766
+ mdfile.write(
2767
+ "| "
2768
+ + " | ".join(str(v) for v in relation.values())
2769
+ + " |\n"
2770
+ )
2771
+ mdfile.write("\n\n")
2772
+ else:
2773
+ mdfile.write("*No relation data available*\n\n")
2774
+
2775
+ # Relationships
2776
+ mdfile.write("## Relationships\n\n")
2777
+ if relationships_data:
2778
+ # Write header
2779
+ mdfile.write(
2780
+ "| " + " | ".join(relationships_data[0].keys()) + " |\n"
2781
+ )
2782
+ mdfile.write(
2783
+ "| "
2784
+ + " | ".join(["---"] * len(relationships_data[0].keys()))
2785
+ + " |\n"
2786
+ )
2787
+
2788
+ # Write rows
2789
+ for relationship in relationships_data:
2790
+ mdfile.write(
2791
+ "| "
2792
+ + " | ".join(str(v) for v in relationship.values())
2793
+ + " |\n"
2794
+ )
2795
+ else:
2796
+ mdfile.write("*No relationship data available*\n\n")
2797
+
2798
+ elif file_format == "txt":
2799
+ # Plain text export
2800
+ with open(output_path, "w", encoding="utf-8") as txtfile:
2801
+ txtfile.write("LIGHTRAG DATA EXPORT\n")
2802
+ txtfile.write("=" * 80 + "\n\n")
2803
+
2804
+ # Entities
2805
+ txtfile.write("ENTITIES\n")
2806
+ txtfile.write("-" * 80 + "\n")
2807
+ if entities_data:
2808
+ # Create fixed width columns
2809
+ col_widths = {
2810
+ k: max(len(k), max(len(str(e[k])) for e in entities_data))
2811
+ for k in entities_data[0]
2812
+ }
2813
+ header = " ".join(k.ljust(col_widths[k]) for k in entities_data[0])
2814
+ txtfile.write(header + "\n")
2815
+ txtfile.write("-" * len(header) + "\n")
2816
+
2817
+ # Write rows
2818
+ for entity in entities_data:
2819
+ row = " ".join(
2820
+ str(v).ljust(col_widths[k]) for k, v in entity.items()
2821
+ )
2822
+ txtfile.write(row + "\n")
2823
+ txtfile.write("\n\n")
2824
+ else:
2825
+ txtfile.write("No entity data available\n\n")
2826
+
2827
+ # Relations
2828
+ txtfile.write("RELATIONS\n")
2829
+ txtfile.write("-" * 80 + "\n")
2830
+ if relations_data:
2831
+ # Create fixed width columns
2832
+ col_widths = {
2833
+ k: max(len(k), max(len(str(r[k])) for r in relations_data))
2834
+ for k in relations_data[0]
2835
+ }
2836
+ header = " ".join(
2837
+ k.ljust(col_widths[k]) for k in relations_data[0]
2838
+ )
2839
+ txtfile.write(header + "\n")
2840
+ txtfile.write("-" * len(header) + "\n")
2841
+
2842
+ # Write rows
2843
+ for relation in relations_data:
2844
+ row = " ".join(
2845
+ str(v).ljust(col_widths[k]) for k, v in relation.items()
2846
+ )
2847
+ txtfile.write(row + "\n")
2848
+ txtfile.write("\n\n")
2849
+ else:
2850
+ txtfile.write("No relation data available\n\n")
2851
+
2852
+ # Relationships
2853
+ txtfile.write("RELATIONSHIPS\n")
2854
+ txtfile.write("-" * 80 + "\n")
2855
+ if relationships_data:
2856
+ # Create fixed width columns
2857
+ col_widths = {
2858
+ k: max(len(k), max(len(str(r[k])) for r in relationships_data))
2859
+ for k in relationships_data[0]
2860
+ }
2861
+ header = " ".join(
2862
+ k.ljust(col_widths[k]) for k in relationships_data[0]
2863
+ )
2864
+ txtfile.write(header + "\n")
2865
+ txtfile.write("-" * len(header) + "\n")
2866
+
2867
+ # Write rows
2868
+ for relationship in relationships_data:
2869
+ row = " ".join(
2870
+ str(v).ljust(col_widths[k]) for k, v in relationship.items()
2871
+ )
2872
+ txtfile.write(row + "\n")
2873
+ else:
2874
+ txtfile.write("No relationship data available\n\n")
2875
+
2876
+ else:
2877
+ raise ValueError(
2878
+ f"Unsupported file format: {file_format}. "
2879
+ f"Choose from: csv, excel, md, txt"
2880
+ )
2881
+ if file_format is not None:
2882
+ print(f"Data exported to: {output_path} with format: {file_format}")
2883
+ else:
2884
+ print("Data displayed as table format")
2885
+
2886
+ def export_data(
2887
+ self,
2888
+ output_path: str,
2889
+ file_format: Literal["csv", "excel", "md", "txt"] = "csv",
2890
+ include_vector_data: bool = False,
2891
+ ) -> None:
2892
+ """
2893
+ Synchronously exports all entities, relations, and relationships to various formats.
2894
+ Args:
2895
+ output_path: The path to the output file (including extension).
2896
+ file_format: Output format - "csv", "excel", "md", "txt".
2897
+ - csv: Comma-separated values file
2898
+ - excel: Microsoft Excel file with multiple sheets
2899
+ - md: Markdown tables
2900
+ - txt: Plain text formatted output
2902
+ include_vector_data: Whether to include data from the vector database.
2903
+ """
2904
+ try:
2905
+ loop = asyncio.get_event_loop()
2906
+ except RuntimeError:
2907
+ loop = asyncio.new_event_loop()
2908
+ asyncio.set_event_loop(loop)
2909
+
2910
+ loop.run_until_complete(
2911
+ self.aexport_data(output_path, file_format, include_vector_data)
2912
+ )
2913
+
2914
  def merge_entities(
2915
  self,
2916
  source_entities: list[str],
requirements.txt CHANGED
@@ -4,6 +4,12 @@ future
4
 
5
  # Basic modules
6
  gensim
7
  pipmaster
8
  pydantic
9
  python-dotenv
@@ -13,5 +19,4 @@ tenacity
13
 
14
  # LLM packages
15
  tiktoken
16
-
17
- # Extra libraries are installed when needed using pipmaster
 
4
 
5
  # Basic modules
6
  gensim
7
+
8
+ # Additional packages for export functionality
9
+ pandas>=2.0.0
10
+
11
+ # Extra libraries are installed when needed using pipmaster
12
+
13
  pipmaster
14
  pydantic
15
  python-dotenv
 
19
 
20
  # LLM packages
21
  tiktoken
22
+ xlsxwriter>=3.1.0