---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- colbert
- passage-retrieval
base_model: camembert-base
library_name: RAGatouille
inference: false
model-index:
- name: colbertv1-camembert-base-mmarcoFR
  results:
  - task:
      type: sentence-similarity
      name: Passage Retrieval
    dataset:
      type: unicamp-dl/mmarco
      name: mMARCO-fr
      config: french
      split: validation
    metrics:
    - type: recall_at_1000
      name: Recall@1000
      value: 89.7
    - type: recall_at_500
      name: Recall@500
      value: 88.4
    - type: recall_at_100
      name: Recall@100
      value: 80.0
    - type: recall_at_10
      name: Recall@10
      value: 54.2
    - type: mrr_at_10
      name: MRR@10
      value: 29.5
---
					
						
# colbertv1-camembert-base-mmarcoFR

This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for **French** that can be used for semantic search. It encodes queries and passages into matrices
of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
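
To make the MaxSim operator concrete, here is a minimal sketch in plain Python. It is only an illustration, not the code the model runs: the function name `maxsim_score` and the toy 2-dimensional embeddings are hypothetical, and real ColBERT embeddings are 128-dimensional and L2-normalized, which the toy vectors skip for readability.

```python
def maxsim_score(query_emb, passage_emb):
    """Late-interaction score: each query token keeps its best dot-product
    match among the passage tokens, and the per-token maxima are summed."""
    return sum(
        max(sum(q * p for q, p in zip(q_vec, p_vec)) for p_vec in passage_emb)
        for q_vec in query_emb
    )


# Toy token-level embedding matrices (hypothetical values, one row per token).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
passage_b = [[0.5, 0.5], [0.0, 0.0]]

score_a = maxsim_score(query, passage_a)  # every query token finds an exact match
score_b = maxsim_score(query, passage_b)  # only partial matches are available
print(score_a, score_b)
```

Because every query token is matched independently, a passage scores well when it covers all parts of the query, which is why `passage_a` ranks above `passage_b` here.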
					
						
## Usage

Here are some examples of using the model with [RAGatouille](https://github.com/bclavie/RAGatouille) or [colbert-ai](https://github.com/stanford-futuredata/ColBERT).

### Using RAGatouille

First, install the library:

```bash
pip install -U ragatouille
```

Then, you can use the model like this:
					
						
```python
from ragatouille import RAGPretrainedModel

index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

# Step 1: Indexing.
RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
RAG.index(name=index_name, collection=documents)

# Step 2: Searching.
RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
```
					
						
### Using colbert-ai

First, you will need to install the following libraries:

```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```

Then, you can use the model like this:
					
						
```python
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    indexer.index(name=index_name, collection=documents)

# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
    results = searcher.search("Comment effectuer une recherche avec ColBERT ?", k=10)
    # results: tuple of three lists of length k: (passage_ids, passage_ranks, passage_scores)
```
					
						
## Evaluation

The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of
8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
Below, we compare its performance with other publicly available French ColBERT models fine-tuned on the same dataset. To see how it compares to other neural retrievers in French,
check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
					
						
| model                                                                                                      | #Param.(↓) |  Size | Dim. | Index | R@1000 | R@500 | R@100 | R@10 | MRR@10 |
|:-----------------------------------------------------------------------------------------------------------|-----------:|------:|-----:|------:|-------:|------:|------:|-----:|-------:|
| [colbertv2-camembert-L4-mmarcoFR](https://huggingface.co/antoinelouis/colbertv2-camembert-L4-mmarcoFR)     |        54M | 0.2GB |   32 |   9GB |   91.9 |  90.3 |  81.9 | 56.7 |   32.3 |
| [FraColBERTv2](https://huggingface.co/bclavie/FraColBERTv2)                                                |       111M | 0.4GB |  128 |  28GB |   90.0 |  88.9 |  81.2 | 57.1 |   32.4 |
| **colbertv1-camembert-base-mmarcoFR**                                                                      |       111M | 0.4GB |  128 |  28GB |   89.7 |  88.4 |  80.0 | 54.2 |   29.5 |

NB: Index corresponds to the size of the mMARCO-fr index (8.8M passages) on disk when using ColBERTv2's residual compression mechanism.
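
For readers who want to reproduce this kind of number on their own runs, the following is a minimal sketch of MRR@10 and R@k in plain Python. It is not the evaluation script behind the table: the function names, the `run` and `qrels` dictionaries, and the passage ids are hypothetical, and the scores come out as fractions in [0, 1], whereas the table appears to report percentages.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(run, qrels, k_values=(10, 100, 500, 1000)):
    """Average both metrics over every query that has relevance judgments."""
    n = len(qrels)
    report = {
        "MRR@10": sum(reciprocal_rank_at_k(run.get(q, []), rel) for q, rel in qrels.items()) / n
    }
    for k in k_values:
        report[f"R@{k}"] = sum(recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()) / n
    return report


# Hypothetical judgments (query id -> relevant passage ids) and ranked results.
qrels = {"q1": {"p3"}, "q2": {"p7", "p9"}}
run = {"q1": ["p1", "p3", "p5"], "q2": ["p9", "p2", "p4"]}

report = evaluate(run, qrels, k_values=(1, 3))
print(report)
```

A query that is missing from `run` counts as a miss for every metric, so dropped queries lower the averages instead of being silently ignored.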
					
						
## Training

#### Data

We use the French training set from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset,
a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries.
We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).
					
						
#### Implementation

The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via a combination of the pairwise softmax
cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832))
and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). It was trained on a single Tesla V100 GPU
with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
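
The two loss terms can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the training code: the function names are hypothetical, the two terms are assumed to be added with equal weight (the text only says "a combination"), and the in-batch term is simplified to use only the other queries' positive passages as negatives.

```python
import math


def log_softmax_at(scores, target):
    """Log-probability of index `target` under a softmax over `scores`,
    computed with the max-shift trick for numerical stability."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_z


def pairwise_loss(pos_scores, neg_scores):
    """Mean cross-entropy over [positive, hard negative] score pairs,
    where the positive passage is always the correct class (index 0)."""
    losses = [-log_softmax_at([p, n], 0) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


def in_batch_loss(score_matrix):
    """score_matrix[i][j] is the score of query i against the positive passage
    of query j, so the correct class for row i sits on the diagonal."""
    losses = [-log_softmax_at(row, i) for i, row in enumerate(score_matrix)]
    return sum(losses) / len(losses)


def combined_loss(pos_scores, neg_scores, score_matrix):
    # Assumption: equal-weight sum of the two terms.
    return pairwise_loss(pos_scores, neg_scores) + in_batch_loss(score_matrix)


# Hypothetical scores for a batch of two queries.
uninformed = combined_loss([2.0, 2.0], [2.0, 2.0], [[2.0, 2.0], [2.0, 2.0]])
confident = combined_loss([10.0, 10.0], [0.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
print(uninformed, confident)
```

When the model cannot tell positives from negatives, each term equals log 2; the loss falls toward zero as the positive passages receive clearly higher scores.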
					
						
## Citation

```bibtex
@online{louis2024decouvrir,
    author    = {Antoine Louis},
    title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
    publisher = {Hugging Face},
    month     = {mar},
    year      = {2024},
    url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
```