Update README.md
<p>
</h4>

This is a [ColBERT](https://doi.org/10.48550/arXiv.2112.01488) model that can be used for semantic search in many languages. It encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning in a high-resource language, like English, and perform zero-shot retrieval across multiple languages.
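As a rough illustration of the MaxSim operator mentioned above, here is a toy sketch in plain Python: each query token embedding is matched against its best-scoring passage token embedding, and those maxima are summed (late interaction). The 2-d vectors below are made up for illustration; real ColBERT runs this over learned embeddings with optimized vector search.

```python
# Toy sketch of the MaxSim operator -- NOT the actual ColBERT implementation.
# The 2-d "embeddings" below are made-up stand-ins for model outputs.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_embs, passage_embs):
    # For each query token embedding, take its best match among the passage
    # token embeddings, then sum those maxima (late interaction).
    return sum(max(dot(q, p) for p in passage_embs) for q in query_embs)

query = [[1.0, 0.0], [0.0, 1.0]]      # two query token embeddings
passage_a = [[0.9, 0.1], [0.2, 0.8]]  # matches both query tokens well
passage_b = [[0.1, 0.1], [0.0, 0.2]]  # matches both query tokens poorly

scores = {name: maxsim_score(query, p)
          for name, p in [("a", passage_a), ("b", passage_b)]}
# passage_a scores higher, so it would be ranked first
```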

## Usage

Start by installing the [colbert-ir](https://github.com/stanford-futuredata/ColBERT) package and some extra requirements:

```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git@main#egg=colbert-ir torch==2.1.2 faiss-gpu==1.7.2 langdetect==1.0.9
```
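Among the extra requirements, `langdetect` is presumably what lets the custom indexer and searcher pick the matching XMOD language adapter per passage. A minimal sketch of that routing idea, with a stubbed-out detector so it stays self-contained (the `route` helper, the adapter names, and the stub `detect` are illustrative, not part of the actual modules, which would call `langdetect.detect`):

```python
# Sketch of per-passage adapter routing. detect() is a crude stand-in for
# langdetect.detect(), which returns ISO 639-1 codes; the adapter names in
# ADAPTERS are hypothetical.
def detect(text: str) -> str:
    return "fr" if any(w in text for w in ("est", "un", "voici")) else "en"

ADAPTERS = {"fr": "fr_FR", "en": "en_XX"}  # hypothetical adapter names

def route(passages):
    # Group passages by detected language so each batch can be encoded with
    # the matching language-specific adapter activated.
    batches = {}
    for p in passages:
        batches.setdefault(ADAPTERS[detect(p)], []).append(p)
    return batches

batches = route(["Ceci est un premier document.", "This is a document."])
```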

Then, you can use the model like this:

```python
# Use of custom modules that automatically detect the language of the passages
# to index and activate the language-specific adapters accordingly
from .custom import CustomIndexer, CustomSearcher
from colbert.infra import Run, RunConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = "colbert"  # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index"  # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."]  # Corpus

# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = CustomIndexer(checkpoint="antoinelouis/colbert-xm")
    indexer.index(name=index_name, collection=documents)

# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = CustomSearcher(index=index_name)  # You don't need to specify the checkpoint again; the model name is stored in the index.
    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
    # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
```
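The tuple layout noted in the final comment can be unpacked as follows; the `documents` list and result tuples below are made-up sample values, not real model output:

```python
# Unpacking `results` per the layout described above:
# ((passage_id, passage_rank, passage_score), ...). Sample values are made up.
documents = ["Ceci est un premier document.", "Voici un second document."]
results = ((1, 1, 17.4), (0, 2, 12.9))  # hypothetical top-2 hits for one query

# passage_id indexes into the collection that was passed to the indexer
ranked = [(rank, score, documents[pid]) for pid, rank, score in results]
for rank, score, text in ranked:
    print(f"#{rank} (score {score:.1f}): {text}")
```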
|
***