---
title: Tibetan Text Metrics
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
python_version: 3.11
app_file: app.py
---
# Tibetan Text Metrics Web App
[Python](https://www.python.org/downloads/)
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
[Project Status: Active](https://www.repostatus.org/#active)
Compare Tibetan texts to discover how similar they are. This tool helps scholars identify shared passages, textual variations, and relationships between different versions of Tibetan manuscripts — no programming required.
## Quick Start (3 Steps)
1. **Upload** two or more Tibetan text files (.txt format)
2. **Click** "Compare My Texts"
3. **View** the results — higher scores mean more similarity
That's it! The default settings work well for most cases. See the results section for colorful heatmaps showing which chapters are most similar.
> **Tip:** If your texts have chapters, separate them with the ༈ marker so the tool can compare chapter-by-chapter.
## What's New (v0.4.0)
- **New preset-based UI**: Choose "Quick Start" for simple analysis or "Custom" for full control
- **Three analysis presets**: Standard, Deep (with AI), and Quick (fastest)
- **Word-level tokenization** is now the default (recommended for Jaccard similarity)
- **Particle normalization**: Treat grammatical particle variants as equivalent (གི/ཀྱི/གྱི → གི)
- **LCS normalization options**: Choose how to handle texts of different lengths
- **Improved stopword matching**: Fixed tsek (་) handling for consistent filtering
- **Tibetan-optimized fuzzy matching**: Syllable-level methods only (removed character-level methods)
- **Dharmamitra models**: Buddhist-specific semantic similarity models as default
- **Modernized theme**: Cleaner UI with better responsive design
## Background
The Tibetan Text Metrics project provides quantitative methods for assessing textual similarities at the chapter or segment level, helping researchers understand patterns of textual evolution. This web application makes these capabilities accessible through an intuitive interface — no command-line or Python experience needed.
## Key Features of the Web App
- **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
- **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
- **Core Metrics Computed**:
- **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. Word-level tokenization recommended. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
- **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels. Supports multiple normalization modes (average, min, max).
- **Fuzzy Similarity**: Uses syllable-level fuzzy matching to detect approximate matches, accommodating spelling variations and scribal differences in Tibetan text.
- **Semantic Similarity**: Uses Buddhist-specific sentence-transformer embeddings (Dharmamitra) to compare the contextual meaning of segments.
- **Handles Long Texts**: Implements automated handling for long segments when computing semantic embeddings.
- **Model Selection**: Semantic similarity uses Hugging Face sentence-transformer models. Default is Dharmamitra's `buddhist-nlp/buddhist-sentence-similarity`, trained specifically for Buddhist texts.
- **Tokenization Modes**:
- **Word** (default, recommended): Keeps multi-syllable words together for more meaningful comparison
- **Syllable**: Splits into individual syllables for finer-grained analysis
- **Stopword Filtering**: Three levels of filtering for Tibetan words:
- **None**: No filtering, includes all words
- **Standard**: Filters only common particles and punctuation
- **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
- **Particle Normalization**: Optional normalization of grammatical particles to canonical forms (e.g., གི/ཀྱི/གྱི → གི, ལ/ར/སུ/ཏུ/དུ → ལ). Reduces false negatives from sandhi variation.
- **Interactive Visualizations**:
- Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
- Bar chart displaying word counts per segment.
- **Vocabulary containment chart** showing what percentage of each text's unique vocabulary appears in the other text (directional metric).
- **Advanced Interpretation**: Get scholarly insights about your results with a built-in analysis engine that:
- Examines your metrics and provides contextual interpretation of textual relationships
- Generates a dual-layer narrative analysis (scholarly and accessible)
- Identifies patterns across chapters and highlights notable textual relationships
- Connects findings to Tibetan textual studies concepts (transmission lineages, regional variants)
- Suggests questions for further investigation
- **Downloadable Results**: Export detailed metrics as a CSV file and save heatmaps as PNG files.
- **Simplified Workflow**: No command-line interaction or Python scripting needed for analysis.
## Advanced Features
### Using AI-Powered Analysis
The application includes an "Interpret Results" button that provides scholarly insights about your text similarity metrics. This feature:
1. **Dynamic model selection**: Automatically discovers available free models from OpenRouter (Qwen, Google Gemma, Meta Llama, Mistral, DeepSeek)
2. Requires an OpenRouter API key (set via environment variable `OPENROUTER_API_KEY`)
3. Falls back to rule-based analysis if no API key is provided or all models fail
4. Provides a comprehensive scholarly analysis including:
- Introduction explaining the texts compared and general observations
- Overall patterns across all chapters with visualized trends
- Detailed examination of notable chapters (highest/lowest similarity)
- Discussion of what different metrics reveal about textual relationships
- Conclusions suggesting implications for Tibetan textual scholarship
- Specific questions these findings raise for further investigation
- Cautionary notes about interpreting perfect matches or zero similarity scores
### Data Processing
- **Automatic Filtering**: The system automatically filters out perfect matches (1.0 across all metrics) that may result from empty cells or identical text comparisons
- **Robust Analysis**: The system handles edge cases and provides meaningful metrics even with imperfect data
## Text Segmentation and Best Practices
**Why segment your texts?**
To obtain meaningful results, it is highly recommended to divide your Tibetan texts into logical chapters or sections before uploading. Comparing entire texts as a single unit often produces shallow or misleading results, especially for long or complex works. Chapters or sections allow the tool to detect stylistic, lexical, or structural differences that would otherwise be hidden.
**How to segment your texts:**
- Use the Tibetan section marker `༈` (sbrul shad) to separate chapters/sections in your `.txt` files.
- Each segment should represent a coherent part of the text (e.g., a chapter, legal clause, or thematic section).
- The tool will automatically split your file on this marker for analysis. If no marker is found, the entire file is treated as a single segment, and a warning will be issued.
**Best practices:**
- Ensure the marker is unique and does not appear within a chapter.
- Try to keep chapters/sections of similar length for more balanced comparisons.
- For poetry or short texts, consider grouping several poems or stanzas as one segment.
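As a rough illustration, the marker-based splitting described above can be sketched in Python (the function name `split_into_segments` is illustrative, not the app's actual API):

```python
def split_into_segments(text: str, marker: str = "\u0f08") -> list[str]:
    """Split a Tibetan text into segments on the sbrul shad marker (U+0F08)."""
    parts = [seg.strip() for seg in text.split(marker)]
    segments = [seg for seg in parts if seg]  # drop empty segments
    if len(segments) <= 1:
        # Mirrors the app's behaviour: no marker means one segment plus a warning.
        print("Warning: no section marker found; treating the file as a single segment.")
    return segments
```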
## Implemented Metrics
**Stopword Filtering:**
To enhance the accuracy and relevance of similarity scores, the Jaccard Similarity and Fuzzy Similarity calculations incorporate a stopword filtering step. This process removes high-frequency, low-information Tibetan words (e.g., common particles, pronouns, and grammatical markers) before the metrics are computed. Stopwords are normalized to handle tsek (་) variations consistently.
**Particle Normalization:**
Tibetan grammatical particles change form based on the preceding syllable (sandhi). For example, the genitive particle appears as གི, ཀྱི, གྱི, ཡི, or འི depending on context. When particle normalization is enabled, all variants are treated as equivalent, reducing false negatives when comparing texts with different scribal conventions.
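Conceptually, the normalization is a lookup from each sandhi variant to a canonical form. The mapping below is a minimal sketch covering only the variants mentioned above; the app's actual table lives in `pipeline/normalize_bo.py`:

```python
# Illustrative variant-to-canonical mapping (not the app's full table).
PARTICLE_MAP = {
    "ཀྱི": "གི", "གྱི": "གི", "ཡི": "གི", "འི": "གི",  # genitive variants
    "ར": "ལ", "སུ": "ལ", "ཏུ": "ལ", "དུ": "ལ",        # la-don variants
}

def normalize_particles(tokens: list[str]) -> list[str]:
    """Replace each grammatical-particle variant with its canonical form."""
    return [PARTICLE_MAP.get(tok, tok) for tok in tokens]
```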
The comprehensive list of Tibetan stopwords used is adapted and compiled from the following valuable resources:
- The **Divergent Discourses** project (specifically, its Tibetan stopwords list, available at [Zenodo Record 10148636](https://zenodo.org/records/10148636)).
- The **Tibetan Lucene Analyser** by the Buddhist Digital Archives (BUDA), available on [GitHub: buda-base/lucene-bo](https://github.com/buda-base/lucene-bo).
We extend our gratitude to the creators and maintainers of these projects for making their work available to the community.
Feel free to edit this list of stopwords to better suit your needs. The list is stored in the `pipeline/stopwords_bo.py` file.
### Similarity Metrics
The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
1. **Jaccard Similarity (%)**: This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words, optionally **filtering out common Tibetan stopwords**.
It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?'
It is calculated as `(Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100`.
Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent.
A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
**Stopword Filtering**: Three levels of filtering are available:
- **None**: No filtering, includes all words in the comparison
- **Standard**: Filters only common particles and punctuation
- **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
This helps focus on meaningful content words rather than grammatical elements.
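The formula above can be sketched directly in Python over tokenized segments (a minimal illustration, not the app's implementation in `pipeline/metrics.py`):

```python
def jaccard_similarity(tokens_a: list[str], tokens_b: list[str],
                       stopwords: set[str] = frozenset()) -> float:
    """Jaccard similarity (%) over unique tokens, after stopword filtering."""
    set_a = {t for t in tokens_a if t not in stopwords}
    set_b = {t for t in tokens_b if t not in stopwords}
    if not set_a and not set_b:
        return 0.0
    # Shared unique words / all unique words across both segments.
    return len(set_a & set_b) / len(set_a | set_b) * 100
```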
2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text.
**Normalization options:**
- **Average** (default): Divides LCS length by the average of both text lengths. Balanced comparison.
- **Min**: Divides by the shorter text length. Useful for detecting if one text contains the other (e.g., quotes within commentary). Can return 1.0 if the shorter text is fully contained in the longer one.
- **Max**: Divides by the longer text length. Stricter metric that penalizes length differences.
A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism.
*Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary.
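A minimal sketch of the metric, using the textbook dynamic-programming LCS with the three normalization modes described above (the app's optimized version is the Cython extension built from `setup.py`):

```python
def normalized_lcs(a: list[str], b: list[str], mode: str = "avg") -> float:
    """LCS length over token sequences, normalized by avg/min/max length."""
    m, n = len(a), len(b)
    if m == 0 or n == 0:
        return 0.0
    # Standard DP table, kept one row at a time to save memory.
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    lcs_len = prev[n]
    denom = {"avg": (m + n) / 2, "min": min(m, n), "max": max(m, n)}[mode]
    return lcs_len / denom
```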
3. **Fuzzy Similarity**: This metric uses syllable-level fuzzy matching algorithms to detect approximate matches, making it particularly valuable for Tibetan texts where spelling variations, dialectal differences, or scribal errors might be present. Unlike exact matching methods (such as Jaccard), fuzzy similarity can recognize when words are similar but not identical.
**Available methods (all work at syllable level):**
- **Syllable N-gram Overlap** (default, recommended): Compares syllable bigrams between texts. Best for detecting shared phrases and local patterns.
- **Syllable-level Edit Distance**: Computes Levenshtein distance at the syllable/token level. Detects minor variations while respecting syllable boundaries.
- **Weighted Jaccard**: Like standard Jaccard but considers token frequency, giving more weight to frequently shared terms.
Scores range from 0 to 1, where 1 indicates perfect or near-perfect matches. All methods work at the syllable level, which is linguistically appropriate for Tibetan.
**Stopword Filtering**: The same three levels of filtering used for Jaccard Similarity are applied to fuzzy matching:
- **None**: No filtering, includes all words in the comparison
- **Standard**: Filters only common particles and punctuation
- **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
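The default syllable-bigram method can be sketched as a Dice-style overlap of adjacent-syllable pairs (an illustrative coefficient; the app's exact scoring may differ):

```python
def bigram_overlap(syllables_a: list[str], syllables_b: list[str]) -> float:
    """Dice-style overlap of syllable bigrams, in [0, 1]."""
    def bigrams(sylls: list[str]) -> set[tuple[str, str]]:
        # Each bigram is a pair of adjacent syllables.
        return {tuple(sylls[i:i + 2]) for i in range(len(sylls) - 1)}
    ba, bb = bigrams(syllables_a), bigrams(syllables_b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))
```

Because bigrams capture local order, two segments sharing a phrase score well here even when single-token overlap is modest.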
4. **Semantic Similarity**: Computes the cosine similarity between sentence-transformer embeddings of text segments. Uses Dharmamitra's Buddhist-specific models by default. Segments are embedded into high-dimensional vectors and compared via cosine similarity. Scores closer to 1 indicate a higher degree of semantic overlap.
*Note*: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
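Once each segment has been embedded, the comparison itself is plain cosine similarity. A minimal sketch (obtaining the embeddings requires the sentence-transformers library and a model download, shown only in the comment):

```python
import numpy as np

# With sentence-transformers installed, embeddings would come from e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("buddhist-nlp/buddhist-sentence-similarity")
#   emb_a, emb_b = model.encode([segment_a, segment_b])

def cosine_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors; 1.0 = same direction."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
```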
### Visualization Metrics
5. **Vocabulary Containment**: A directional metric showing what percentage of one text's unique vocabulary appears in the other text. Unlike Jaccard (which is symmetric), containment is calculated in both directions:
- "Text A → Text B" answers: "What % of Text A's unique words also appear in Text B?"
- Calculated as: `(shared vocabulary size) / (source text vocabulary size) × 100`
**Interpreting asymmetric containment:**
- If "Base Text → Commentary" is 95% but "Commentary → Base Text" is 60%, the commentary contains almost all of the base text's vocabulary plus additional words
- This pattern suggests an expansion or commentary relationship
- Useful for identifying which text is the "base" version (its vocabulary will be highly contained in expanded versions)
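The directional formula above reduces to a few set operations (a minimal sketch; the app computes this per chapter pair):

```python
def containment(source_tokens: list[str], target_tokens: list[str]) -> float:
    """Percentage of the source text's unique vocabulary found in the target."""
    source_vocab, target_vocab = set(source_tokens), set(target_tokens)
    if not source_vocab:
        return 0.0
    return len(source_vocab & target_vocab) / len(source_vocab) * 100
```

Note the asymmetry: `containment(a, b)` and `containment(b, a)` generally differ, which is exactly what reveals base-text/commentary relationships.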
## Getting Started (Running Locally)
1. Ensure you have Python 3.10 or newer.
2. Navigate to the `webapp` directory:
```bash
cd path/to/tibetan-text-metrics/webapp
```
3. Create a virtual environment (recommended):
```bash
python -m venv .venv
source .venv/bin/activate # On macOS/Linux
# .venv\Scripts\activate # On Windows
```
4. Install dependencies:
```bash
pip install -r requirements.txt
```
5. **Compile Cython Extension (Recommended for Performance)**:
To speed up the Longest Common Subsequence (LCS) calculation, a Cython extension is provided. To compile it:
```bash
# Ensure you are in the webapp directory
python setup.py build_ext --inplace
```
This step requires a C compiler. If you skip this, the application will use a slower, pure Python implementation for LCS.
6. **Run the Web Application**:
```bash
python app.py
```
7. Open your web browser and go to the local URL provided (usually `http://127.0.0.1:7860`).
## Usage
### Quick Start (Recommended for Most Users)
1. **Upload Files**: Select one or more `.txt` files containing Tibetan Unicode text.
2. **Choose a Preset**: In the "Quick Start" tab, select an analysis type:
| Preset | What it does | Best for |
|--------|--------------|----------|
| **Standard** | Vocabulary + Sequences + Fuzzy matching | Most comparisons |
| **Deep** | All metrics including AI meaning analysis | Finding semantic parallels |
| **Quick** | Vocabulary overlap only | Fast initial scan |
3. **Click "Compare My Texts"**: Results appear below with heatmaps and downloadable CSV.
### Custom Analysis (Advanced Users)
For fine-grained control, use the "Custom" tab:
- **Lexical Metrics**: Configure tokenization (word/syllable), stopword filtering, and particle normalization
- **Sequence Matching (LCS)**: Enable/disable and choose normalization mode (avg/min/max)
- **Fuzzy Matching**: Choose method (N-gram, Syllable Edit, or Weighted Jaccard)
- **Semantic Analysis**: Enable AI-based meaning comparison with model selection
### Viewing Results
- **Metrics Preview**: Summary table of similarity scores
- **Heatmaps**: Visual comparison across all chapter pairs (darker = more similar)
- **Word Counts**: Bar chart showing segment lengths
- **Vocabulary Containment**: Directional metric showing what % of one text's vocabulary is in another
- **CSV Download**: Full results for further analysis
### AI Interpretation (Optional)
After running analysis, click "Help Interpret Results" for scholarly insights:
- Pattern identification across chapters
- Notable textual relationships
- Suggestions for further investigation
## Embedding Model
Semantic similarity uses Hugging Face sentence-transformer models. The following models are available:
- **`buddhist-nlp/buddhist-sentence-similarity`** (default, recommended): Developed by [Dharmamitra](https://huggingface.co/buddhist-nlp), this model is specifically trained for sentence similarity on Buddhist texts in Tibetan, Buddhist Chinese, Sanskrit (IAST), and Pāli. Best choice for Tibetan Buddhist manuscripts.
- **`buddhist-nlp/bod-eng-similarity`**: Also from Dharmamitra, optimized for Tibetan-English bitext alignment tasks.
- **`sentence-transformers/LaBSE`**: General multilingual model, good baseline for non-Buddhist texts.
- **`BAAI/bge-m3`**: Strong multilingual alternative with broad language coverage.
These models provide context-aware, segment-level embeddings suitable for comparing Tibetan text passages.
## Structure
- `app.py` — Gradio web app entry point and UI definition.
- `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
- `process.py`: Core logic for segmenting texts and orchestrating metric computation.
- `metrics.py`: Implementation of Jaccard, LCS, Fuzzy, and Semantic Similarity.
- `hf_embedding.py`: Handles loading and using sentence-transformer models.
- `tokenize.py`: Tibetan text tokenization using `botok`.
- `normalize_bo.py`: Tibetan particle normalization for grammatical variants.
- `stopwords_bo.py`: Comprehensive Tibetan stopword list with tsek normalization.
- `visualize.py`: Generates heatmaps and word count plots.
- `requirements.txt` — Python dependencies for the web application.
## License
This project is licensed under the Creative Commons Attribution 4.0 International License - see the [LICENSE](../../LICENSE) file in the main project directory for details.
## Research and Acknowledgements
We acknowledge the broader Tibetan NLP community for tokenization and stopword resources leveraged in this project, including the Divergent Discourses stopword list and BUDA's lucene-bo analyzer.
## Citation
If you use this web application or the underlying TTM tool in your research, please cite the main project:
```bibtex
@software{wojahn2025ttm,
title = {TibetanTextMetrics (TTM): Computing Text Similarity Metrics on POS-tagged Tibetan Texts},
author = {Daniel Wojahn},
year = {2025},
url = {https://github.com/daniel-wojahn/tibetan-text-metrics},
version = {0.4.0}
}
```
---
For questions or issues specifically regarding the web application, please refer to the main project's [issue tracker](https://github.com/daniel-wojahn/tibetan-text-metrics/issues) or contact Daniel Wojahn.