NeuML/txtai-wikipedia

davidmezzetti committed on Apr 20

Commit 32d216e (1 parent: 56726f6)

April 2026 data update

Files changed (4)

README.md +24 -7
config.json +22 -6
documents +2 -2
embeddings +2 -2

README.md CHANGED

@@ -8,18 +8,20 @@ library_name: txtai
 tags:
 - sentence-similarity
 datasets:
-- NeuML/wikipedia-20251101
 ---
 # Wikipedia txtai embeddings index
 This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [English edition of Wikipedia](https://en.wikipedia.org/).
-This index is built from the [Wikipedia November 2025 dataset](https://huggingface.co/datasets/neuml/wikipedia-20251101). Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index. This is similar to an abstract of the article.
-It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
 to only match commonly visited pages.
 txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model.
 ## Example
@@ -41,6 +43,12 @@ embeddings.search("""
    SELECT id, text, score, percentile FROM txtai WHERE similar('Boston') AND
    percentile >= 0.99
 """)
 ```
 ## Use Cases
@@ -65,7 +73,7 @@ Performance was evaluated using the [NDCG@10](https://en.wikipedia.org/wiki/Disc
 ## Build the index
-The following steps show how to build this index. These scripts are using the latest data available as of 2025-11-01, update as appropriate.
 - Install required build dependencies
 ```bash
@@ -75,7 +83,7 @@ pip install ragdata mwparserfromhell
 - Download and build pageviews database
 ```bash
 mkdir -p pageviews/data
-wget -P pageviews/data https://dumps.wikimedia.org/other/pageview_complete/monthly/2025/2025-10/pageviews-202510-user.bz2
 python -m ragdata.wikipedia.views -p en.wikipedia -v pageviews
 ```
@@ -85,17 +93,26 @@ python -m ragdata.wikipedia.views -p en.wikipedia -v pageviews
 from datasets import load_dataset
 # Data dump date from https://dumps.wikimedia.org/enwiki/
-date = "20251101"
 # Build and save dataset
 ds = load_dataset("neuml/wikipedia", language="en", date=date)
 ds.save_to_disk(f"wikipedia-{date}")
 ```
 - Build txtai-wikipedia index
 ```bash
 python -m ragdata.wikipedia.index \
-       -d wikipedia-20251101 \
        -o txtai-wikipedia \
        -v pageviews/pageviews.sqlite
 ```

 tags:
 - sentence-similarity
 datasets:
+- NeuML/wikipedia-20260401
 ---
 # Wikipedia txtai embeddings index
 This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [English edition of Wikipedia](https://en.wikipedia.org/).
+This index is built from the [Wikipedia April 2026 dataset](https://huggingface.co/datasets/neuml/wikipedia-20260401). Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index. This is similar to an abstract of the article.
+It uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
 to only match commonly visited pages.
+Domain labels are applied using [this model](https://huggingface.co/NeuML/domain-labeler) and stored in a `domain` field.
 txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model.
 ## Example
    SELECT id, text, score, percentile FROM txtai WHERE similar('Boston') AND
    percentile >= 0.99
 """)
+# Find most popular articles for a domain label
+embeddings.search("""
+   SELECT id, text, score, domain FROM txtai WHERE domain = 'news'
+   ORDER BY percentile DESC
+""")
 ```
 ## Use Cases
 ## Build the index
+The following steps show how to build this index. These scripts use the latest data available as of 2026-04-01; update as appropriate.
 - Install required build dependencies
 ```bash
 - Download and build pageviews database
 ```bash
 mkdir -p pageviews/data
+wget -P pageviews/data https://dumps.wikimedia.org/other/pageview_complete/monthly/2026/2026-04/pageviews-202604-user.bz2
 python -m ragdata.wikipedia.views -p en.wikipedia -v pageviews
 ```
 from datasets import load_dataset
 # Data dump date from https://dumps.wikimedia.org/enwiki/
+date = "20260401"
 # Build and save dataset
 ds = load_dataset("neuml/wikipedia", language="en", date=date)
 ds.save_to_disk(f"wikipedia-{date}")
 ```
+- Generate domain labels
+```bash
+python -m ragdata.wikipedia.label \
+       -d wikipedia-20260401 \
+       -o labels.csv
+```
 - Build txtai-wikipedia index
 ```bash
 python -m ragdata.wikipedia.index \
+       -d wikipedia-20260401 \
+       -l labels.csv \
        -o txtai-wikipedia \
        -v pageviews/pageviews.sqlite
 ```
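
The build steps in this README diff repeat the dump date in several places (the `date` variable, the `-d` argument, and the pageviews URL). A minimal sketch of deriving those values from one place follows. The helper names `dataset_dir` and `pageviews_url` are hypothetical and are not part of ragdata or txtai. The URL pattern is copied from the `wget` line above, and the choice of which pageviews month to pair with a given dump is left to the caller.

```python
def dataset_dir(date: str) -> str:
    """Return the directory name used by save_to_disk and the -d flag."""
    if len(date) != 8 or not date.isdigit():
        raise ValueError(f"expected a YYYYMMDD dump date, got {date!r}")
    return f"wikipedia-{date}"


def pageviews_url(year: int, month: int) -> str:
    """Build the monthly pageviews URL following the pattern in the wget step."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return (
        "https://dumps.wikimedia.org/other/pageview_complete/monthly/"
        f"{year:04d}/{year:04d}-{month:02d}/"
        f"pageviews-{year:04d}{month:02d}-user.bz2"
    )


if __name__ == "__main__":
    # Values taken from the updated README in this commit
    print(dataset_dir("20260401"))
    print(pageviews_url(2026, 4))
```

Keeping the year and month as explicit arguments avoids guessing whether a dump should be paired with the same month or the preceding one.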

config.json CHANGED

@@ -12,17 +12,33 @@
     "sample": 0.05
   },
   "content": true,
   "dimensions": 768,
   "backend": "faiss",
-  "offset": 6458964,
   "build": {
-    "create": "2025-11-16T20:24:36Z",
-    "python": "3.10.19",
     "settings": {
-      "components": "IVF2273,SQ8"
     },
     "system": "Linux (x86_64)",
-    "txtai": "9.1.0"
   },
-  "update": "2025-11-16T20:24:36Z"
 }

     "sample": 0.05
   },
   "content": true,
+  "columns": {
+    "store": [
+      "percentile",
+      "domain"
+    ]
+  },
+  "expressions": [
+    {
+      "name": "percentile",
+      "index": true
+    },
+    {
+      "name": "domain",
+      "index": true
+    }
+  ],
   "dimensions": 768,
   "backend": "faiss",
+  "offset": 6527334,
   "build": {
+    "create": "2026-04-19T04:36:14Z",
+    "python": "3.10.20",
     "settings": {
+      "components": "IVF2285,SQ8"
     },
     "system": "Linux (x86_64)",
+    "txtai": "9.8.0"
   },
+  "update": "2026-04-19T04:36:14Z"
 }
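
The `offset` and `components` values in this config.json diff change together. The sketch below compares the two builds, assuming `offset` is the number of indexed rows and that the `IVF` cell count tracks `4 * sqrt(sample * offset)` with `sample` set to 0.05 as shown in the config. That relationship is an observation that happens to fit both builds, not a confirmed description of txtai internals. The `ivf_cells` helper and the `builds` dictionary are hypothetical names used only for illustration.

```python
import math


def ivf_cells(components: str) -> int:
    """Extract the cell count from a Faiss factory string such as 'IVF2285,SQ8'."""
    first = components.split(",")[0]
    if not first.startswith("IVF"):
        raise ValueError(f"no IVF component in {components!r}")
    return int(first[len("IVF"):])


SAMPLE = 0.05  # "sample" value shown in config.json

# offset and components copied from the old and new sides of the diff
builds = {
    "2025-11": {"offset": 6458964, "components": "IVF2273,SQ8"},
    "2026-04": {"offset": 6527334, "components": "IVF2285,SQ8"},
}

for name, cfg in builds.items():
    # Assumed heuristic: cells scale with the square root of the sampled row count
    estimate = round(4 * math.sqrt(SAMPLE * cfg["offset"]))
    print(name, "configured:", ivf_cells(cfg["components"]), "estimated:", estimate)

# Difference in offset between the two builds
added = builds["2026-04"]["offset"] - builds["2025-11"]["offset"]
print("offset increase:", added)
```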

documents CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7e5d1161dcebdef5acd25061e7c08e3b3d8f1f1c3cfe3ff59774fd92dfe5bd7e
-size 3474526208

 version https://git-lfs.github.com/spec/v1
+oid sha256:a4d28962a359692703c932dc0dccea5971952dbfcf5c2cdf60fba921f1a80223
+size 3925708800

embeddings CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a0df2c192d0612b9b0b03f42089c6e0dde124dcf0df336f4828d85268e70fe5b
-size 5019163232

 version https://git-lfs.github.com/spec/v1
+oid sha256:7dc16dfb26e4c93bd108176e879a24de47ea3fa37426c710a4bebb7d27bd77e3
+size 5072255312
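
The `documents` and `embeddings` entries are Git LFS pointer files, so the diff only shows a new hash and size for each. A minimal sketch of reading those pointers and computing how much each file grew is shown below. The `parse_pointer` helper is a hypothetical name, and the pointer text is copied from the old and new sides of the diffs above.

```python
def parse_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file ('key value' per line) into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])
    return fields


POINTERS = {
    "documents": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:7e5d1161dcebdef5acd25061e7c08e3b3d8f1f1c3cfe3ff59774fd92dfe5bd7e\n"
        "size 3474526208\n",
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:a4d28962a359692703c932dc0dccea5971952dbfcf5c2cdf60fba921f1a80223\n"
        "size 3925708800\n",
    ),
    "embeddings": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:a0df2c192d0612b9b0b03f42089c6e0dde124dcf0df336f4828d85268e70fe5b\n"
        "size 5019163232\n",
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:7dc16dfb26e4c93bd108176e879a24de47ea3fa37426c710a4bebb7d27bd77e3\n"
        "size 5072255312\n",
    ),
}

# Size increase in bytes for each file between the old and new pointer
growth = {}
for name, (old_text, new_text) in POINTERS.items():
    old, new = parse_pointer(old_text), parse_pointer(new_text)
    growth[name] = new["size"] - old["size"]
    print(f"{name}: +{growth[name]} bytes ({growth[name] / old['size']:.1%})")
```

The larger increase for `documents` is consistent with the new stored `domain` column, although the diff alone does not confirm the cause.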