---
title: Tibetan Text Metrics
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
python_version: 3.11
app_file: app.py
---

# Tibetan Text Metrics Web App

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Project Status: Active – Web app version for accessible text analysis.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)

Compare Tibetan texts to discover how similar they are. This tool helps scholars identify shared passages, textual variations, and relationships between different versions of Tibetan manuscripts — no programming required.

## Quick Start (3 Steps)

1. **Upload** two or more Tibetan text files (.txt format)
2. **Click** "Compare My Texts" 
3. **View** the results — higher scores mean more similarity

That's it! The default settings work well for most cases. See the results section for colorful heatmaps showing which chapters are most similar.

> **Tip:** If your texts have chapters, separate them with the ༈ marker so the tool can compare chapter-by-chapter.

## What's New (v0.4.0)

- **New preset-based UI**: Choose "Quick Start" for simple analysis or "Custom" for full control
- **Three analysis presets**: Standard, Deep (with AI), and Quick (fastest)
- **Word-level tokenization** is now the default (recommended for Jaccard similarity)
- **Particle normalization**: Treat grammatical particle variants as equivalent (གི/ཀྱི/གྱི → གི)
- **LCS normalization options**: Choose how to handle texts of different lengths
- **Improved stopword matching**: Fixed tsek (་) handling for consistent filtering
- **Tibetan-optimized fuzzy matching**: Syllable-level methods only (removed character-level methods)
- **Dharmamitra models**: Buddhist-specific semantic similarity models as default
- **Modernized theme**: Cleaner UI with better responsive design

## Background

The Tibetan Text Metrics project provides quantitative methods for assessing textual similarities at the chapter or segment level, helping researchers understand patterns of textual evolution. This web application makes these capabilities accessible through an intuitive interface — no command-line or Python experience needed.

## Key Features of the Web App

-   **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
-   **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
-   **Core Metrics Computed**:
    -   **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. Word-level tokenization recommended. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
    -   **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels. Supports multiple normalization modes (average, min, max).
    -   **Fuzzy Similarity**: Uses syllable-level fuzzy matching to detect approximate matches, accommodating spelling variations and scribal differences in Tibetan text.
    -   **Semantic Similarity**: Uses Buddhist-specific sentence-transformer embeddings (Dharmamitra) to compare the contextual meaning of segments.
-   **Handles Long Texts**: Implements automated handling for long segments when computing semantic embeddings.
-   **Model Selection**: Semantic similarity uses Hugging Face sentence-transformer models. Default is Dharmamitra's `buddhist-nlp/buddhist-sentence-similarity`, trained specifically for Buddhist texts.
-   **Tokenization Modes**:
    -   **Word** (default, recommended): Keeps multi-syllable words together for more meaningful comparison
    -   **Syllable**: Splits into individual syllables for finer-grained analysis
-   **Stopword Filtering**: Three levels of filtering for Tibetan words:
    -   **None**: No filtering, includes all words
    -   **Standard**: Filters only common particles and punctuation
    -   **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
-   **Particle Normalization**: Optional normalization of grammatical particles to canonical forms (e.g., གི/ཀྱི/གྱི → གི, ལ/ར/སུ/ཏུ/དུ → ལ). Reduces false negatives from sandhi variation.
-   **Interactive Visualizations**:
    -   Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
    -   Bar chart displaying word counts per segment.
    -   **Vocabulary containment chart** showing what percentage of each text's unique vocabulary appears in the other text (directional metric).
-   **Advanced Interpretation**: Get scholarly insights about your results with a built-in analysis engine that:
    -   Examines your metrics and provides contextual interpretation of textual relationships
    -   Generates a dual-layer narrative analysis (scholarly and accessible)
    -   Identifies patterns across chapters and highlights notable textual relationships
    -   Connects findings to Tibetan textual studies concepts (transmission lineages, regional variants)
    -   Suggests questions for further investigation
-   **Downloadable Results**: Export detailed metrics as a CSV file and save heatmaps as PNG files.
-   **Simplified Workflow**: No command-line interaction or Python scripting needed for analysis.

## Advanced Features

### Using AI-Powered Analysis

The application includes an "Interpret Results" button that provides scholarly insights about your text similarity metrics. This feature:

1. **Dynamic model selection**: Automatically discovers available free models from OpenRouter (Qwen, Google Gemma, Meta Llama, Mistral, DeepSeek)
2. Requires an OpenRouter API key (set via environment variable `OPENROUTER_API_KEY`)
3. Falls back to rule-based analysis if no API key is provided or all models fail
4. The AI will provide a comprehensive scholarly analysis including:
   - Introduction explaining the texts compared and general observations
   - Overall patterns across all chapters with visualized trends
   - Detailed examination of notable chapters (highest/lowest similarity)
   - Discussion of what different metrics reveal about textual relationships
   - Conclusions suggesting implications for Tibetan textual scholarship
   - Specific questions these findings raise for further investigation
   - Cautionary notes about interpreting perfect matches or zero similarity scores

### Data Processing

- **Automatic Filtering**: The system automatically filters out perfect matches (1.0 across all metrics) that may result from empty cells or identical text comparisons
- **Robust Analysis**: The system handles edge cases and provides meaningful metrics even with imperfect data

## Text Segmentation and Best Practices

**Why segment your texts?**

To obtain meaningful results, it is highly recommended to divide your Tibetan texts into logical chapters or sections before uploading. Comparing entire texts as a single unit often produces shallow or misleading results, especially for long or complex works. Chapters or sections allow the tool to detect stylistic, lexical, or structural differences that would otherwise be hidden.

**How to segment your texts:**

-   Use the Tibetan section marker `༈` (sbrul shad) to separate chapters/sections in your `.txt` files.
-   Each segment should represent a coherent part of the text (e.g., a chapter, legal clause, or thematic section).
-   The tool will automatically split your file on this marker for analysis. If no marker is found, the entire file is treated as a single segment, and a warning will be issued.
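The splitting behaviour described above can be sketched in a few lines. This is an illustrative helper, not the app's actual implementation (which lives in `pipeline/process.py`):

```python
# Minimal sketch of marker-based segmentation on the sbrul shad (༈).
# Empty segments (e.g. from a trailing marker) are dropped.

SBRUL_SHAD = "༈"

def split_into_segments(raw_text: str) -> list[str]:
    """Split a Tibetan text on the section marker, discarding empty parts."""
    parts = [seg.strip() for seg in raw_text.split(SBRUL_SHAD)]
    return [seg for seg in parts if seg]

text = "ཆོས་ཐམས་ཅད། ༈ བདེན་པ་གཉིས། ༈ ལམ་གྱི་རིམ་པ།"
segments = split_into_segments(text)
print(len(segments))  # 3 segments
```

If no marker is present, `split` returns the whole text as a single element, which matches the single-segment fallback described above.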

**Best practices:**

-   Ensure the marker is unique and does not appear within a chapter.
-   Try to keep chapters/sections of similar length for more balanced comparisons.
-   For poetry or short texts, consider grouping several poems or stanzas as one segment.

## Implemented Metrics

**Stopword Filtering:**
To enhance the accuracy and relevance of similarity scores, the Jaccard Similarity and Fuzzy Similarity calculations incorporate a stopword filtering step. This process removes high-frequency, low-information Tibetan words (e.g., common particles, pronouns, and grammatical markers) before the metrics are computed. Stopwords are normalized to handle tsek (་) variations consistently.
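The tsek normalization mentioned above can be illustrated as follows. The stopword entries and function names here are examples only; the app's full list lives in `pipeline/stopwords_bo.py`:

```python
# Illustrative sketch of tsek-normalized stopword filtering: a trailing
# tsek (་) is stripped before lookup, so ནི and ནི་ match the same entry.

TSEK = "་"

def normalize_token(token: str) -> str:
    """Strip trailing tseks so spelling variants compare as the same token."""
    return token.rstrip(TSEK)

STOPWORDS = {normalize_token(w) for w in ["གི་", "ཀྱི་", "ནི་", "དང་"]}

def filter_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if normalize_token(t) not in STOPWORDS]

tokens = ["བདེན་པ་", "དང་", "ལམ་", "ནི"]
print(filter_stopwords(tokens))  # ['བདེན་པ་', 'ལམ་']
```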

**Particle Normalization:**
Tibetan grammatical particles change form based on the preceding syllable (sandhi). For example, the genitive particle appears as གི, ཀྱི, གྱི, ཡི, or འི depending on context. When particle normalization is enabled, all variants are treated as equivalent, reducing false negatives when comparing texts with different scribal conventions.
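A minimal sketch of this normalization, covering only the variants named in this README (the app's full mapping is in `pipeline/normalize_bo.py`):

```python
# Map sandhi variants of grammatical particles to a canonical form.
# Table is illustrative; genitive variants → གི, la-don variants → ལ.

PARTICLE_CANON = {
    "ཀྱི": "གི", "གྱི": "གི", "ཡི": "གི", "འི": "གི",  # genitive → གི
    "ར": "ལ", "སུ": "ལ", "ཏུ": "ལ", "དུ": "ལ",        # la-don → ལ
}

def normalize_particles(tokens: list[str]) -> list[str]:
    """Replace known particle variants with their canonical form."""
    return [PARTICLE_CANON.get(t, t) for t in tokens]

print(normalize_particles(["མི", "ཀྱི", "ཁང་པ"]))  # ['མི', 'གི', 'ཁང་པ']
```

After normalization, two texts that differ only in particle sandhi will score as identical on the lexical metrics.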

The comprehensive list of Tibetan stopwords used is adapted and compiled from the following valuable resources:
- The **Divergent Discourses** project (specifically, their Tibetan stopwords list available at [Zenodo Record 10148636](https://zenodo.org/records/10148636)).
- The **Tibetan Lucene Analyser** by the Buddhist Digital Archives (BUDA), available on [GitHub: buda-base/lucene-bo](https://github.com/buda-base/lucene-bo).

We extend our gratitude to the creators and maintainers of these projects for making their work available to the community.

Feel free to edit this list of stopwords to better suit your needs. The list is stored in the `pipeline/stopwords_bo.py` file.

### Computed Metrics

The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:

1.  **Jaccard Similarity (%)**: This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words, optionally **filtering out common Tibetan stopwords**. 
It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?' 
It is calculated as `(Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100`. 
Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent. 
A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.

**Stopword Filtering**: Three levels of filtering are available:
- **None**: No filtering, includes all words in the comparison
- **Standard**: Filters only common particles and punctuation
- **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries

This helps focus on meaningful content words rather than grammatical elements.
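The Jaccard formula above reduces to a few lines of set arithmetic. Tokenization and stopword filtering are assumed to have already happened:

```python
# Jaccard similarity over sets of unique tokens, expressed as a percentage:
# |A ∩ B| / |A ∪ B| × 100. Order and frequency are ignored by design.

def jaccard_percent(tokens_a: list[str], tokens_b: list[str]) -> float:
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b) * 100

seg1 = ["ཆོས", "ཐམས་ཅད", "སྟོང་པ"]
seg2 = ["ཆོས", "སྟོང་པ", "ཉིད"]
print(jaccard_percent(seg1, seg2))  # 2 shared of 4 unique words → 50.0
```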

2.  **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text.

    **Normalization options:**
    - **Average** (default): Divides LCS length by the average of both text lengths. Balanced comparison.
    - **Min**: Divides by the shorter text length. Useful for detecting if one text contains the other (e.g., quotes within commentary). Can return 1.0 if shorter text is fully contained.
    - **Max**: Divides by the longer text length. Stricter metric that penalizes length differences.

    A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism.
    
    *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary.
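The three normalization modes can be sketched over a classic dynamic-programming LCS. The app uses a compiled Cython implementation for speed; this pure-Python version is for illustration only:

```python
# Word-level LCS length via dynamic programming, then normalized by the
# average, min, or max of the two segment lengths.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of word tokens."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if wa == wb else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def normalized_lcs(a: list[str], b: list[str], mode: str = "average") -> float:
    denom = {
        "average": (len(a) + len(b)) / 2,
        "min": min(len(a), len(b)),
        "max": max(len(a), len(b)),
    }[mode]
    return lcs_length(a, b) / denom if denom else 0.0

a = ["ཀ", "ཁ", "ག", "ང"]
b = ["ཀ", "ག"]
print(normalized_lcs(a, b, "min"))  # shorter text fully contained → 1.0
print(normalized_lcs(a, b, "max"))  # stricter: penalizes the length gap → 0.5
```

Note how `min` reports full containment while `max` halves the score for the same pair, which is exactly the trade-off described above.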
3.  **Fuzzy Similarity**: This metric uses syllable-level fuzzy matching algorithms to detect approximate matches, making it particularly valuable for Tibetan texts where spelling variations, dialectal differences, or scribal errors might be present. Unlike exact matching methods (such as Jaccard), fuzzy similarity can recognize when words are similar but not identical.

    **Available methods (all work at syllable level):**
    - **Syllable N-gram Overlap** (default, recommended): Compares syllable bigrams between texts. Best for detecting shared phrases and local patterns.
    - **Syllable-level Edit Distance**: Computes Levenshtein distance at the syllable/token level. Detects minor variations while respecting syllable boundaries.
    - **Weighted Jaccard**: Like standard Jaccard but considers token frequency, giving more weight to frequently shared terms.

    Scores range from 0 to 1, where 1 indicates perfect or near-perfect matches. All methods work at the syllable level, which is linguistically appropriate for Tibetan.

**Stopword Filtering**: The same three levels of filtering used for Jaccard Similarity (None, Standard, Aggressive) are applied to fuzzy matching.
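The default method, syllable n-gram overlap, can be approximated as a Dice-style overlap of syllable bigrams. The exact scoring in `pipeline/metrics.py` may differ; treat this as an illustration of the idea:

```python
# Syllable bigram overlap: build the set of adjacent syllable pairs for each
# segment, then score 2·|shared| / (|A| + |B|), giving a value in [0, 1].

def syllable_bigrams(syllables: list[str]) -> set[tuple[str, str]]:
    return set(zip(syllables, syllables[1:]))

def bigram_overlap(a: list[str], b: list[str]) -> float:
    ga, gb = syllable_bigrams(a), syllable_bigrams(b)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

a = ["བདེ", "བ", "ཆེན", "པོ"]
b = ["བདེ", "བ", "ཆེན", "མོ"]
print(round(bigram_overlap(a, b), 2))  # shares 2 of 3 bigrams each → 0.67
```

Because only the final syllable differs, the score stays high, which is the behaviour you want for scribal variants that exact matching would miss.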

4.  **Semantic Similarity**: Computes the cosine similarity between sentence-transformer embeddings of text segments. Uses Dharmamitra's Buddhist-specific models by default. Segments are embedded into high-dimensional vectors and compared via cosine similarity. Scores closer to 1 indicate a higher degree of semantic overlap.

    *Note*: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
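At its core, the semantic metric is a cosine similarity between embedding vectors. The toy vectors below stand in for real sentence-transformer output (produced in the app by `pipeline/hf_embedding.py`); only the comparison step is shown:

```python
# Cosine similarity between two embedding vectors: dot product divided by
# the product of their Euclidean norms. Values near 1 mean high semantic
# overlap; the vectors here are illustrative stand-ins, not real embeddings.
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

emb_a = [0.2, 0.7, 0.1]     # toy embedding of segment A
emb_b = [0.25, 0.65, 0.05]  # toy embedding of segment B
print(round(cosine(emb_a, emb_b), 3))  # → 0.994, i.e. nearly parallel vectors
```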

### Visualization Metrics

5.  **Vocabulary Containment**: A directional metric showing what percentage of one text's unique vocabulary appears in the other text. Unlike Jaccard (which is symmetric), containment is calculated in both directions:
    - "Text A → Text B" answers: "What % of Text A's unique words also appear in Text B?"
    - Calculated as: `(shared vocabulary size) / (source text vocabulary size) × 100`
    
    **Interpreting asymmetric containment:**
    - If "Base Text → Commentary" is 95% but "Commentary → Base Text" is 60%, the commentary contains almost all of the base text's vocabulary plus additional words
    - This pattern suggests an expansion or commentary relationship
    - Useful for identifying which text is the "base" version (its vocabulary will be highly contained in expanded versions)
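The containment formula above, computed in both directions:

```python
# Directional vocabulary containment: what % of the source's unique words
# also appear in the target. Unlike Jaccard, swapping the arguments
# generally changes the score.

def containment_percent(source_tokens: list[str], target_tokens: list[str]) -> float:
    src, tgt = set(source_tokens), set(target_tokens)
    return len(src & tgt) / len(src) * 100 if src else 0.0

base = ["ཆོས", "ལམ", "སེམས"]
commentary = ["ཆོས", "ལམ", "སེམས", "འགྲེལ", "བཤད"]
print(containment_percent(base, commentary))  # 100.0 — base fully contained
print(containment_percent(commentary, base))  # 60.0 — asymmetry flags expansion
```

The asymmetric pair of scores is what signals the base-text/commentary relationship described above.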

## Getting Started (Running Locally)

1.  Ensure you have Python 3.10 or newer.
2.  Navigate to the `webapp` directory:
    ```bash
    cd path/to/tibetan-text-metrics/webapp
    ```
3.  Create a virtual environment (recommended):
    ```bash
    python -m venv .venv
    source .venv/bin/activate  # On macOS/Linux
    # .venv\Scripts\activate    # On Windows
    ```
4.  Install dependencies:
    ```bash
    pip install -r requirements.txt
    ```
5.  **Compile Cython Extension (Recommended for Performance)**:
    To speed up the Longest Common Subsequence (LCS) calculation, a Cython extension is provided. To compile it:
    ```bash
    # Ensure you are in the webapp directory
    python setup.py build_ext --inplace
    ```
    This step requires a C compiler. If you skip this, the application will use a slower, pure Python implementation for LCS.

6.  **Run the Web Application**:
    ```bash
    python app.py
    ```
7.  Open your web browser and go to the local URL provided (usually `http://127.0.0.1:7860`).

## Usage

### Quick Start (Recommended for Most Users)

1.  **Upload Files**: Select one or more `.txt` files containing Tibetan Unicode text.
2.  **Choose a Preset**: In the "Quick Start" tab, select an analysis type:

| Preset | What it does | Best for |
|--------|--------------|----------|
| **Standard** | Vocabulary + Sequences + Fuzzy matching | Most comparisons |
| **Deep** | All metrics including AI meaning analysis | Finding semantic parallels |
| **Quick** | Vocabulary overlap only | Fast initial scan |

3.  **Click "Compare My Texts"**: Results appear below with heatmaps and downloadable CSV.

### Custom Analysis (Advanced Users)

For fine-grained control, use the "Custom" tab:

-   **Lexical Metrics**: Configure tokenization (word/syllable), stopword filtering, and particle normalization
-   **Sequence Matching (LCS)**: Enable/disable and choose normalization mode (avg/min/max)
-   **Fuzzy Matching**: Choose method (N-gram, Syllable Edit, or Weighted Jaccard)
-   **Semantic Analysis**: Enable AI-based meaning comparison with model selection

### Viewing Results

-   **Metrics Preview**: Summary table of similarity scores
-   **Heatmaps**: Visual comparison across all chapter pairs (darker = more similar)
-   **Word Counts**: Bar chart showing segment lengths
-   **Vocabulary Containment**: Directional metric showing what % of one text's vocabulary is in another
-   **CSV Download**: Full results for further analysis

### AI Interpretation (Optional)

After running analysis, click "Help Interpret Results" for scholarly insights:
-   Pattern identification across chapters
-   Notable textual relationships
-   Suggestions for further investigation

## Embedding Model

Semantic similarity uses Hugging Face sentence-transformer models. The following models are available:

- **`buddhist-nlp/buddhist-sentence-similarity`** (default, recommended): Developed by [Dharmamitra](https://huggingface.co/buddhist-nlp), this model is specifically trained for sentence similarity on Buddhist texts in Tibetan, Buddhist Chinese, Sanskrit (IAST), and Pāli. Best choice for Tibetan Buddhist manuscripts.
- **`buddhist-nlp/bod-eng-similarity`**: Also from Dharmamitra, optimized for Tibetan-English bitext alignment tasks.
- **`sentence-transformers/LaBSE`**: General multilingual model, good baseline for non-Buddhist texts.
- **`BAAI/bge-m3`**: Strong multilingual alternative with broad language coverage.

These models provide context-aware, segment-level embeddings suitable for comparing Tibetan text passages.

## Structure

-   `app.py` — Gradio web app entry point and UI definition.
-   `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
    -   `process.py`: Core logic for segmenting texts and orchestrating metric computation.
    -   `metrics.py`: Implementation of Jaccard, LCS, Fuzzy, and Semantic Similarity.
    -   `hf_embedding.py`: Handles loading and using sentence-transformer models.
    -   `tokenize.py`: Tibetan text tokenization using `botok`.
    -   `normalize_bo.py`: Tibetan particle normalization for grammatical variants.
    -   `stopwords_bo.py`: Comprehensive Tibetan stopword list with tsek normalization.
    -   `visualize.py`: Generates heatmaps and word count plots.
-   `requirements.txt` — Python dependencies for the web application.

## License

This project is licensed under the Creative Commons Attribution 4.0 International License - see the [LICENSE](../../LICENSE) file in the main project directory for details.

## Research and Acknowledgements

We acknowledge the broader Tibetan NLP community for tokenization and stopword resources leveraged in this project, including the Divergent Discourses stopword list and BUDA's lucene-bo analyzer.

## Citation

If you use this web application or the underlying TTM tool in your research, please cite the main project:

```bibtex
@software{wojahn2025ttm,
  title = {TibetanTextMetrics (TTM): Computing Text Similarity Metrics on POS-tagged Tibetan Texts},
  author = {Daniel Wojahn},
  year = {2025},
  url = {https://github.com/daniel-wojahn/tibetan-text-metrics},
  version = {0.4.0}
}
```

---
For questions or issues specifically regarding the web application, please refer to the main project's [issue tracker](https://github.com/daniel-wojahn/tibetan-text-metrics/issues) or contact Daniel Wojahn.