Commit b8b2303 · Parent(s): ee7fa4f
vocab containment update
Files changed:
- README.md +17 -0
- app.py +38 -2
- pipeline/process.py +85 -34
- pipeline/visualize.py +89 -0
README.md
CHANGED
|
@@ -65,6 +65,8 @@ The Tibetan Text Metrics project provides quantitative methods for assessing tex
|
|
| 65 |
- **Interactive Visualizations**:
|
| 66 |
- Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
|
| 67 |
- Bar chart displaying word counts per segment.
|
| 68 |
- **Advanced Interpretation**: Get scholarly insights about your results with a built-in analysis engine that:
|
| 69 |
- Examines your metrics and provides contextual interpretation of textual relationships
|
| 70 |
- Generates a dual-layer narrative analysis (scholarly and accessible)
|
|
@@ -174,6 +176,19 @@ This helps focus on meaningful content words rather than grammatical elements.
|
|
| 174 |
|
| 175 |
*Note*: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
|
| 176 |
|
| 177 |
## Getting Started (if run Locally)
|
| 178 |
|
| 179 |
1. Ensure you have Python 3.10 or newer.
|
|
@@ -234,6 +249,8 @@ For fine-grained control, use the "Custom" tab:
|
|
| 234 |
- **Metrics Preview**: Summary table of similarity scores
|
| 235 |
- **Heatmaps**: Visual comparison across all chapter pairs (darker = more similar)
|
| 236 |
- **Word Counts**: Bar chart showing segment lengths
|
| 237 |
- **CSV Download**: Full results for further analysis
|
| 238 |
|
| 239 |
### AI Interpretation (Optional)
|
|
|
|
| 65 |
- **Interactive Visualizations**:
|
| 66 |
- Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
|
| 67 |
- Bar chart displaying word counts per segment.
|
| 68 |
+
- Length ratio chart comparing text lengths relative to the shortest text per chapter.
|
| 69 |
+
- **Vocabulary containment chart** showing what percentage of each text's unique vocabulary appears in the other text (directional metric).
|
| 70 |
- **Advanced Interpretation**: Get scholarly insights about your results with a built-in analysis engine that:
|
| 71 |
- Examines your metrics and provides contextual interpretation of textual relationships
|
| 72 |
- Generates a dual-layer narrative analysis (scholarly and accessible)
|
|
|
|
| 176 |
|
| 177 |
*Note*: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
|
| 178 |
|
| 179 |
+
### Visualization Metrics
|
| 180 |
+
|
| 181 |
+
5. **Vocabulary Containment**: A directional metric showing what percentage of one text's unique vocabulary appears in the other text. Unlike Jaccard (which is symmetric), containment is calculated in both directions:
|
| 182 |
+
- "Text A → Text B" answers: "What % of Text A's unique words also appear in Text B?"
|
| 183 |
+
- Calculated as: `(shared vocabulary size) / (source text vocabulary size) × 100`
|
| 184 |
+
|
| 185 |
+
**Interpreting asymmetric containment:**
|
| 186 |
+
- If "Base Text → Commentary" is 95% but "Commentary → Base Text" is 60%, the commentary contains almost all of the base text's vocabulary plus additional words
|
| 187 |
+
- This pattern suggests an expansion or commentary relationship
|
| 188 |
+
- Useful for identifying which text is the "base" version (its vocabulary will be highly contained in expanded versions)
|
| 189 |
+
|
| 190 |
+
6. **Length Ratios**: Shows how much longer each text is compared to the shortest text per chapter. A ratio of 1.0x indicates the shortest (base) text; higher ratios indicate expanded content. Helps explain why Jaccard might be lower for related texts when one contains additional material.
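The containment formula described above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's code: the `containment` helper and the sample token lists are hypothetical, and the empty-vocabulary guard mirrors the one added to `pipeline/process.py` in this commit.

```python
def containment(source_tokens, target_tokens):
    """Percentage of the source's unique tokens that also occur in the target."""
    source_vocab = set(source_tokens)
    target_vocab = set(target_tokens)
    if not source_vocab:
        # Guard against division by zero when the source has no tokens.
        return 0.0
    return len(source_vocab & target_vocab) / len(source_vocab) * 100


# Hypothetical token lists; real input would come from the project's tokenizer.
base = ["ka", "kha", "ga", "nga"]
commentary = ["ka", "kha", "ga", "ca", "cha", "ja"]

base_in_commentary = containment(base, commentary)  # 3 of 4 unique words shared
commentary_in_base = containment(commentary, base)  # 3 of 6 unique words shared

print(f"base -> commentary: {base_in_commentary:.1f}%")
print(f"commentary -> base: {commentary_in_base:.1f}%")
```

The two directions differ because each divides the same shared-vocabulary count by a different source vocabulary size, which is what makes the metric directional.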
|
| 191 |
+
|
| 192 |
## Getting Started (if run Locally)
|
| 193 |
|
| 194 |
1. Ensure you have Python 3.10 or newer.
|
|
|
|
| 249 |
- **Metrics Preview**: Summary table of similarity scores
|
| 250 |
- **Heatmaps**: Visual comparison across all chapter pairs (darker = more similar)
|
| 251 |
- **Word Counts**: Bar chart showing segment lengths
|
| 252 |
+
- **Length Ratios**: Compare text lengths to identify base text vs. expanded versions
|
| 253 |
+
- **Vocabulary Containment**: Directional metric showing what % of one text's vocabulary is in another
|
| 254 |
- **CSV Download**: Full results for further analysis
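As a rough illustration of the Length Ratios entry above, the ratio is each text's word count divided by the shortest text's word count within one chapter. The sketch below is an assumption about how that could be computed; the body of `generate_length_ratio_chart` is not shown in this diff, and the `length_ratios` helper, its handling of empty segments, and the sample counts are all hypothetical.

```python
def length_ratios(word_counts):
    """Map each text name to its word count relative to the shortest non-empty text."""
    positive_counts = [count for count in word_counts.values() if count > 0]
    if not positive_counts:
        # Assumption: with no non-empty texts there is nothing to compare.
        return {}
    shortest = min(positive_counts)
    return {name: count / shortest for name, count in word_counts.items()}


# Hypothetical word counts for a single chapter.
chapter_counts = {"base": 200, "commentary": 500}
ratios = length_ratios(chapter_counts)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

A ratio of 1.0x marks the shortest text in the chapter, and larger values indicate how much additional material the other texts carry.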
|
| 255 |
|
| 256 |
### AI Interpretation (Optional)
|
app.py
CHANGED
|
@@ -1,7 +1,7 @@
|
|
| 1 |
import gradio as gr
|
| 2 |
from pathlib import Path
|
| 3 |
from pipeline.process import process_texts
|
| 4 |
-
from pipeline.visualize import generate_visualizations, generate_word_count_chart, generate_length_ratio_chart
|
| 5 |
from pipeline.llm_service import LLMService
|
| 6 |
from pipeline.progressive_ui import ProgressiveUI, create_progressive_callback
|
| 7 |
import logging
|
|
@@ -256,6 +256,7 @@ def main_interface():
|
|
| 256 |
"Semantic Similarity": "Compares actual meaning using AI. Higher = texts say similar things.",
|
| 257 |
"Word Counts": "How long is each section? Helps you understand text structure.",
|
| 258 |
"Length Ratios": "Compare text lengths to identify base text vs. commentary additions.",
|
|
|
|
| 259 |
}
|
| 260 |
|
| 261 |
metric_tooltips = {
|
|
@@ -375,6 +376,27 @@ Two texts might share many words (high Jaccard) but arrange them differently (lo
|
|
| 375 |
- Explaining why Jaccard similarity might be lower despite texts being related
|
| 376 |
|
| 377 |
**Tip:** When one text is a base and others add commentary, Jaccard penalizes the additions. This chart helps you see that relationship clearly.
|
| 378 |
""",
|
| 379 |
"Structural Analysis": """
|
| 380 |
### How Texts Relate to Each Other
|
|
@@ -423,6 +445,9 @@ Two texts might share many words (high Jaccard) but arrange them differently (lo
|
|
| 423 |
elif metric_key == "Length Ratios":
|
| 424 |
css_class = "metric-info-accordion lengthratio-info"
|
| 425 |
accordion_title = "ℹ️ What does this mean?"
|
| 426 |
else:
|
| 427 |
css_class = "metric-info-accordion"
|
| 428 |
accordion_title = f"ℹ️ About {metric_key}"
|
|
@@ -447,6 +472,8 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
|
|
| 447 |
word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
|
| 448 |
elif metric_key == "Length Ratios":
|
| 449 |
length_ratio_plot = gr.Plot(label="Length Ratios per Chapter", show_label=False, scale=1, elem_classes="metric-description")
|
| 450 |
else:
|
| 451 |
heatmap_tabs[metric_key] = gr.Plot(label=f"Heatmap: {metric_key}", show_label=False, elem_classes="metric-heatmap")
|
| 452 |
|
|
@@ -486,6 +513,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
|
|
| 486 |
metrics_preview_df_res = pd.DataFrame()
|
| 487 |
word_count_fig_res = None
|
| 488 |
length_ratio_fig_res = None
|
|
|
|
| 489 |
jaccard_heatmap_res = None
|
| 490 |
lcs_heatmap_res = None
|
| 491 |
fuzzy_heatmap_res = None
|
|
@@ -519,6 +547,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
|
|
| 519 |
pd.DataFrame({"Message": ["Please upload files to analyze."]}),
|
| 520 |
None, # word_count_plot
|
| 521 |
None, # length_ratio_plot
|
|
|
|
| 522 |
None, # jaccard_heatmap
|
| 523 |
None, # lcs_heatmap
|
| 524 |
None, # fuzzy_heatmap
|
|
@@ -537,6 +566,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
|
|
| 537 |
pd.DataFrame({"Error": [f"File '{Path(file.name).name}' exceeds the 10MB size limit (size: {file_size_mb:.2f}MB)."]}),
|
| 538 |
None, # word_count_plot
|
| 539 |
None, # length_ratio_plot
|
|
|
|
| 540 |
None, # jaccard_heatmap
|
| 541 |
None, # lcs_heatmap
|
| 542 |
None, # fuzzy_heatmap
|
|
@@ -581,6 +611,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
|
|
| 581 |
pd.DataFrame({"Error": [f"Could not decode file '{filename}'. Please ensure it contains valid Tibetan text in UTF-8 or UTF-16 encoding."]}),
|
| 582 |
None, # word_count_plot
|
| 583 |
None, # length_ratio_plot
|
|
|
|
| 584 |
None, # jaccard_heatmap
|
| 585 |
None, # lcs_heatmap
|
| 586 |
None, # fuzzy_heatmap
|
|
@@ -617,7 +648,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
|
|
| 617 |
# For Hugging Face models, the UI value is the correct model ID
|
| 618 |
internal_model_id = model_name
|
| 619 |
|
| 620 |
-
df_results, word_counts_df_data, warning_raw = process_texts(
|
| 621 |
text_data=text_data,
|
| 622 |
filenames=filenames,
|
| 623 |
enable_semantic=enable_semantic_bool,
|
|
@@ -665,6 +696,9 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
|
|
| 665 |
# Generate length ratio chart
|
| 666 |
length_ratio_fig_res = generate_length_ratio_chart(word_counts_df_data)
|
| 667 |
|
|
|
|
|
|
|
|
|
|
| 668 |
# Store state data for potential future use
|
| 669 |
state_text_data_res = text_data
|
| 670 |
state_df_results_res = df_results
|
|
@@ -702,6 +736,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
|
|
| 702 |
metrics_preview_df_res,
|
| 703 |
word_count_fig_res,
|
| 704 |
length_ratio_fig_res,
|
|
|
|
| 705 |
jaccard_heatmap_res,
|
| 706 |
lcs_heatmap_res,
|
| 707 |
fuzzy_heatmap_res,
|
|
@@ -784,6 +819,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
|
|
| 784 |
metrics_preview,
|
| 785 |
word_count_plot,
|
| 786 |
length_ratio_plot,
|
|
|
|
| 787 |
heatmap_tabs["Jaccard Similarity (%)"],
|
| 788 |
heatmap_tabs["Normalized LCS"],
|
| 789 |
heatmap_tabs["Fuzzy Similarity"],
|
|
|
|
| 1 |
import gradio as gr
|
| 2 |
from pathlib import Path
|
| 3 |
from pipeline.process import process_texts
|
| 4 |
+
from pipeline.visualize import generate_visualizations, generate_word_count_chart, generate_length_ratio_chart, generate_vocab_containment_chart
|
| 5 |
from pipeline.llm_service import LLMService
|
| 6 |
from pipeline.progressive_ui import ProgressiveUI, create_progressive_callback
|
| 7 |
import logging
|
|
|
|
| 256 |
"Semantic Similarity": "Compares actual meaning using AI. Higher = texts say similar things.",
|
| 257 |
"Word Counts": "How long is each section? Helps you understand text structure.",
|
| 258 |
"Length Ratios": "Compare text lengths to identify base text vs. commentary additions.",
|
| 259 |
+
"Vocabulary Containment": "What % of one text's vocabulary appears in the other?",
|
| 260 |
}
|
| 261 |
|
| 262 |
metric_tooltips = {
|
|
|
|
| 376 |
- Explaining why Jaccard similarity might be lower despite texts being related
|
| 377 |
|
| 378 |
**Tip:** When one text is a base and others add commentary, Jaccard penalizes the additions. This chart helps you see that relationship clearly.
|
| 379 |
+
""",
|
| 380 |
+
"Vocabulary Containment": """
|
| 381 |
+
### Vocabulary Containment (Directional)
|
| 382 |
+
|
| 383 |
+
**What it shows:** What percentage of one text's unique vocabulary appears in the other text.
|
| 384 |
+
|
| 385 |
+
**How to read it:**
|
| 386 |
+
- "Text A → Text B" means: "What % of Text A's vocabulary is found in Text B?"
|
| 387 |
+
- 90% means 90% of the unique words in the source text also appear in the target text
|
| 388 |
+
|
| 389 |
+
**What it tells you:**
|
| 390 |
+
- If Text A → Text B is 95% but Text B → Text A is 60%, then Text B contains almost all of Text A's vocabulary plus additional words
|
| 391 |
+
- This suggests Text B might be an expansion or commentary on Text A
|
| 392 |
+
- Asymmetric containment often indicates a base text + commentary relationship
|
| 393 |
+
|
| 394 |
+
**Useful for:**
|
| 395 |
+
- Identifying which text is the "base" (shorter vocabulary fully contained in longer text)
|
| 396 |
+
- Understanding directionality of textual relationships
|
| 397 |
+
- Distinguishing between shared sources vs. one text derived from another
|
| 398 |
+
|
| 399 |
+
**Tip:** Unlike Jaccard (which is symmetric), containment is directional — it tells you which text's vocabulary is "inside" the other.
|
| 400 |
""",
|
| 401 |
"Structural Analysis": """
|
| 402 |
### How Texts Relate to Each Other
|
|
|
|
| 445 |
elif metric_key == "Length Ratios":
|
| 446 |
css_class = "metric-info-accordion lengthratio-info"
|
| 447 |
accordion_title = "ℹ️ What does this mean?"
|
| 448 |
+
elif metric_key == "Vocabulary Containment":
|
| 449 |
+
css_class = "metric-info-accordion vocabcontain-info"
|
| 450 |
+
accordion_title = "ℹ️ What does this mean?"
|
| 451 |
else:
|
| 452 |
css_class = "metric-info-accordion"
|
| 453 |
accordion_title = f"ℹ️ About {metric_key}"
|
|
|
|
| 472 |
word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
|
| 473 |
elif metric_key == "Length Ratios":
|
| 474 |
length_ratio_plot = gr.Plot(label="Length Ratios per Chapter", show_label=False, scale=1, elem_classes="metric-description")
|
| 475 |
+
elif metric_key == "Vocabulary Containment":
|
| 476 |
+
vocab_containment_plot = gr.Plot(label="Vocabulary Containment per Chapter", show_label=False, scale=1, elem_classes="metric-description")
|
| 477 |
else:
|
| 478 |
heatmap_tabs[metric_key] = gr.Plot(label=f"Heatmap: {metric_key}", show_label=False, elem_classes="metric-heatmap")
|
| 479 |
|
|
|
|
| 513 |
metrics_preview_df_res = pd.DataFrame()
|
| 514 |
word_count_fig_res = None
|
| 515 |
length_ratio_fig_res = None
|
| 516 |
+
vocab_containment_fig_res = None
|
| 517 |
jaccard_heatmap_res = None
|
| 518 |
lcs_heatmap_res = None
|
| 519 |
fuzzy_heatmap_res = None
|
|
|
|
| 547 |
pd.DataFrame({"Message": ["Please upload files to analyze."]}),
|
| 548 |
None, # word_count_plot
|
| 549 |
None, # length_ratio_plot
|
| 550 |
+
None, # vocab_containment_plot
|
| 551 |
None, # jaccard_heatmap
|
| 552 |
None, # lcs_heatmap
|
| 553 |
None, # fuzzy_heatmap
|
|
|
|
| 566 |
pd.DataFrame({"Error": [f"File '{Path(file.name).name}' exceeds the 10MB size limit (size: {file_size_mb:.2f}MB)."]}),
|
| 567 |
None, # word_count_plot
|
| 568 |
None, # length_ratio_plot
|
| 569 |
+
None, # vocab_containment_plot
|
| 570 |
None, # jaccard_heatmap
|
| 571 |
None, # lcs_heatmap
|
| 572 |
None, # fuzzy_heatmap
|
|
|
|
| 611 |
pd.DataFrame({"Error": [f"Could not decode file '{filename}'. Please ensure it contains valid Tibetan text in UTF-8 or UTF-16 encoding."]}),
|
| 612 |
None, # word_count_plot
|
| 613 |
None, # length_ratio_plot
|
| 614 |
+
None, # vocab_containment_plot
|
| 615 |
None, # jaccard_heatmap
|
| 616 |
None, # lcs_heatmap
|
| 617 |
None, # fuzzy_heatmap
|
|
|
|
| 648 |
# For Hugging Face models, the UI value is the correct model ID
|
| 649 |
internal_model_id = model_name
|
| 650 |
|
| 651 |
+
df_results, word_counts_df_data, vocab_containment_df_data, warning_raw = process_texts(
|
| 652 |
text_data=text_data,
|
| 653 |
filenames=filenames,
|
| 654 |
enable_semantic=enable_semantic_bool,
|
|
|
|
| 696 |
# Generate length ratio chart
|
| 697 |
length_ratio_fig_res = generate_length_ratio_chart(word_counts_df_data)
|
| 698 |
|
| 699 |
+
# Generate vocabulary containment chart
|
| 700 |
+
vocab_containment_fig_res = generate_vocab_containment_chart(vocab_containment_df_data)
|
| 701 |
+
|
| 702 |
# Store state data for potential future use
|
| 703 |
state_text_data_res = text_data
|
| 704 |
state_df_results_res = df_results
|
|
|
|
| 736 |
metrics_preview_df_res,
|
| 737 |
word_count_fig_res,
|
| 738 |
length_ratio_fig_res,
|
| 739 |
+
vocab_containment_fig_res,
|
| 740 |
jaccard_heatmap_res,
|
| 741 |
lcs_heatmap_res,
|
| 742 |
fuzzy_heatmap_res,
|
|
|
|
| 819 |
metrics_preview,
|
| 820 |
word_count_plot,
|
| 821 |
length_ratio_plot,
|
| 822 |
+
vocab_containment_plot,
|
| 823 |
heatmap_tabs["Jaccard Similarity (%)"],
|
| 824 |
heatmap_tabs["Normalized LCS"],
|
| 825 |
heatmap_tabs["Fuzzy Similarity"],
|
pipeline/process.py
CHANGED
|
@@ -62,7 +62,7 @@ def process_texts(
|
|
| 62 |
progressive_callback = None,
|
| 63 |
batch_size: int = 32,
|
| 64 |
show_progress_bar: bool = False
|
| 65 |
-
) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
|
| 66 |
"""
|
| 67 |
Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
|
| 68 |
|
|
@@ -101,12 +101,15 @@ def process_texts(
|
|
| 101 |
Used for progressive loading of metrics as they become available. Defaults to None.
|
| 102 |
|
| 103 |
Returns:
|
| 104 |
-
Tuple[pd.DataFrame, pd.DataFrame, str]:
|
| 105 |
- metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
|
| 106 |
Contains columns: 'Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS',
|
| 107 |
'Fuzzy Similarity' (if enable_fuzzy=True), 'Semantic Similarity' (if enable_semantic=True).
|
| 108 |
- word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
|
| 109 |
Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
|
| 110 |
- warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
|
| 111 |
|
| 112 |
Raises:
|
|
@@ -432,7 +435,7 @@ def process_texts(
|
|
| 432 |
metrics_df = pd.DataFrame()
|
| 433 |
warning += " No valid metrics could be computed. Please check your files and try again."
|
| 434 |
|
| 435 |
-
# Calculate word counts
|
| 436 |
if progress_callback is not None:
|
| 437 |
try:
|
| 438 |
progress_callback(0.75, desc="Calculating word counts...")
|
|
@@ -441,12 +444,12 @@ def process_texts(
|
|
| 441 |
|
| 442 |
word_counts_data = []
|
| 443 |
|
| 444 |
-
# Process each segment
|
| 445 |
-
for i,
|
| 446 |
# Update progress
|
| 447 |
if progress_callback is not None and len(segment_texts) > 0:
|
| 448 |
try:
|
| 449 |
-
progress_percentage = 0.75 + (0.
|
| 450 |
progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
|
| 451 |
except Exception as e:
|
| 452 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
|
@@ -454,33 +457,18 @@ def process_texts(
|
|
| 454 |
fname, chapter_info = seg_id.split("|", 1)
|
| 455 |
chapter_num = int(chapter_info.replace("chapter ", ""))
|
| 456 |
|
| 457 |
-
|
| 458 |
-
|
| 459 |
-
|
| 460 |
-
|
| 461 |
-
|
| 462 |
-
|
| 463 |
-
|
| 464 |
-
|
| 465 |
-
|
| 466 |
-
|
| 467 |
-
|
| 468 |
-
|
| 469 |
-
"SegmentID": seg_id,
|
| 470 |
-
"WordCount": word_count,
|
| 471 |
-
}
|
| 472 |
-
)
|
| 473 |
-
except Exception as e:
|
| 474 |
-
logger.error(f"Error calculating word count for segment {seg_id}: {e}")
|
| 475 |
-
# Add entry with 0 word count to maintain consistency
|
| 476 |
-
word_counts_data.append(
|
| 477 |
-
{
|
| 478 |
-
"Filename": fname.replace(".txt", ""),
|
| 479 |
-
"ChapterNumber": chapter_num,
|
| 480 |
-
"SegmentID": seg_id,
|
| 481 |
-
"WordCount": 0,
|
| 482 |
-
}
|
| 483 |
-
)
|
| 484 |
|
| 485 |
# Create and sort the word counts DataFrame
|
| 486 |
word_counts_df = pd.DataFrame(word_counts_data)
|
|
@@ -489,6 +477,69 @@ def process_texts(
|
|
| 489 |
by=["Filename", "ChapterNumber"]
|
| 490 |
).reset_index(drop=True)
|
| 491 |
|
| 492 |
if progress_callback is not None:
|
| 493 |
try:
|
| 494 |
progress_callback(0.95, desc="Analysis complete!")
|
|
@@ -510,4 +561,4 @@ def process_texts(
|
|
| 510 |
logger.warning(f"Final progressive callback error (non-critical): {e}")
|
| 511 |
|
| 512 |
# Return the results
|
| 513 |
-
return metrics_df, word_counts_df, warning
|
|
|
|
| 62 |
progressive_callback = None,
|
| 63 |
batch_size: int = 32,
|
| 64 |
show_progress_bar: bool = False
|
| 65 |
+
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, str]:
|
| 66 |
"""
|
| 67 |
Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
|
| 68 |
|
|
|
|
| 101 |
Used for progressive loading of metrics as they become available. Defaults to None.
|
| 102 |
|
| 103 |
Returns:
|
| 104 |
+
Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, str]:
|
| 105 |
- metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
|
| 106 |
Contains columns: 'Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS',
|
| 107 |
'Fuzzy Similarity' (if enable_fuzzy=True), 'Semantic Similarity' (if enable_semantic=True).
|
| 108 |
- word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
|
| 109 |
Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
|
| 110 |
+
- vocab_containment_df: DataFrame with vocabulary containment percentages per chapter.
|
| 111 |
+
Shows what % of each text's unique vocabulary appears in the other text.
|
| 112 |
+
Contains columns: 'ChapterNumber', 'SourceText', 'TargetText', 'Containment', 'SourceVocabSize', 'SharedVocabSize'.
|
| 113 |
- warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
|
| 114 |
|
| 115 |
Raises:
|
|
|
|
| 435 |
metrics_df = pd.DataFrame()
|
| 436 |
warning += " No valid metrics could be computed. Please check your files and try again."
|
| 437 |
|
| 438 |
+
# Calculate word counts using cached tokens
|
| 439 |
if progress_callback is not None:
|
| 440 |
try:
|
| 441 |
progress_callback(0.75, desc="Calculating word counts...")
|
|
|
|
| 444 |
|
| 445 |
word_counts_data = []
|
| 446 |
|
| 447 |
+
# Process each segment using cached tokens
|
| 448 |
+
for i, seg_id in enumerate(segment_texts.keys()):
|
| 449 |
# Update progress
|
| 450 |
if progress_callback is not None and len(segment_texts) > 0:
|
| 451 |
try:
|
| 452 |
+
progress_percentage = 0.75 + (0.10 * (i / len(segment_texts)))
|
| 453 |
progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
|
| 454 |
except Exception as e:
|
| 455 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
|
|
|
| 457 |
fname, chapter_info = seg_id.split("|", 1)
|
| 458 |
chapter_num = int(chapter_info.replace("chapter ", ""))
|
| 459 |
|
| 460 |
+
# Use cached tokens instead of re-tokenizing
|
| 461 |
+
tokens = segment_tokens.get(seg_id, [])
|
| 462 |
+
word_count = len(tokens)
|
| 463 |
+
|
| 464 |
+
word_counts_data.append(
|
| 465 |
+
{
|
| 466 |
+
"Filename": fname.replace(".txt", ""),
|
| 467 |
+
"ChapterNumber": chapter_num,
|
| 468 |
+
"SegmentID": seg_id,
|
| 469 |
+
"WordCount": word_count,
|
| 470 |
+
}
|
| 471 |
+
)
|
| 472 |
|
| 473 |
# Create and sort the word counts DataFrame
|
| 474 |
word_counts_df = pd.DataFrame(word_counts_data)
|
|
|
|
| 477 |
by=["Filename", "ChapterNumber"]
|
| 478 |
).reset_index(drop=True)
|
| 479 |
|
| 480 |
+
# Calculate vocabulary containment per chapter
|
| 481 |
+
# "X% of Text A's vocabulary appears in Text B"
|
| 482 |
+
if progress_callback is not None:
|
| 483 |
+
try:
|
| 484 |
+
progress_callback(0.87, desc="Calculating vocabulary containment...")
|
| 485 |
+
except Exception as e:
|
| 486 |
+
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 487 |
+
|
| 488 |
+
vocab_containment_data = []
|
| 489 |
+
|
| 490 |
+
# For each pair of files, calculate vocabulary containment per chapter
|
| 491 |
+
for file1, file2 in combinations(files, 2):
|
| 492 |
+
chaps1 = file_to_chapters[file1]
|
| 493 |
+
chaps2 = file_to_chapters[file2]
|
| 494 |
+
min_chaps = min(len(chaps1), len(chaps2))
|
| 495 |
+
|
| 496 |
+
for idx in range(min_chaps):
|
| 497 |
+
seg1 = chaps1[idx]
|
| 498 |
+
seg2 = chaps2[idx]
|
| 499 |
+
|
| 500 |
+
# Get unique vocabularies (sets) for each segment
|
| 501 |
+
vocab1 = set(segment_tokens.get(seg1, []))
|
| 502 |
+
vocab2 = set(segment_tokens.get(seg2, []))
|
| 503 |
+
|
| 504 |
+
chapter_num = idx + 1
|
| 505 |
+
fname1 = file1.replace(".txt", "")
|
| 506 |
+
fname2 = file2.replace(".txt", "")
|
| 507 |
+
|
| 508 |
+
# Calculate containment in both directions
|
| 509 |
+
# "What % of Text A's vocabulary is in Text B?"
|
| 510 |
+
if len(vocab1) > 0:
|
| 511 |
+
containment_1_in_2 = len(vocab1 & vocab2) / len(vocab1) * 100
|
| 512 |
+
else:
|
| 513 |
+
containment_1_in_2 = 0.0
|
| 514 |
+
|
| 515 |
+
if len(vocab2) > 0:
|
| 516 |
+
containment_2_in_1 = len(vocab1 & vocab2) / len(vocab2) * 100
|
| 517 |
+
else:
|
| 518 |
+
containment_2_in_1 = 0.0
|
| 519 |
+
|
| 520 |
+
vocab_containment_data.append({
|
| 521 |
+
"ChapterNumber": chapter_num,
|
| 522 |
+
"SourceText": fname1,
|
| 523 |
+
"TargetText": fname2,
|
| 524 |
+
"Containment": containment_1_in_2, # % of source vocab in target
|
| 525 |
+
"SourceVocabSize": len(vocab1),
|
| 526 |
+
"SharedVocabSize": len(vocab1 & vocab2),
|
| 527 |
+
})
|
| 528 |
+
vocab_containment_data.append({
|
| 529 |
+
"ChapterNumber": chapter_num,
|
| 530 |
+
"SourceText": fname2,
|
| 531 |
+
"TargetText": fname1,
|
| 532 |
+
"Containment": containment_2_in_1, # % of source vocab in target
|
| 533 |
+
"SourceVocabSize": len(vocab2),
|
| 534 |
+
"SharedVocabSize": len(vocab1 & vocab2),
|
| 535 |
+
})
|
| 536 |
+
|
| 537 |
+
vocab_containment_df = pd.DataFrame(vocab_containment_data)
|
| 538 |
+
if not vocab_containment_df.empty:
|
| 539 |
+
vocab_containment_df = vocab_containment_df.sort_values(
|
| 540 |
+
by=["ChapterNumber", "SourceText"]
|
| 541 |
+
).reset_index(drop=True)
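The pairwise loop above can be exercised on toy data without pandas. The following is a simplified sketch, not the committed code: the `segment_tokens` and `file_to_chapters` contents are invented for illustration, the `pct` helper is hypothetical, and a plain list of dicts stands in for the DataFrame.

```python
from itertools import combinations

# Hypothetical cached tokens, keyed in the "filename|chapter N" style used by seg_id.
segment_tokens = {
    "a.txt|chapter 1": ["ka", "kha", "ga"],
    "b.txt|chapter 1": ["ka", "kha", "ga", "nga"],
    "a.txt|chapter 2": ["ca", "cha"],
    "b.txt|chapter 2": ["ca", "ja"],
}
file_to_chapters = {
    "a.txt": ["a.txt|chapter 1", "a.txt|chapter 2"],
    "b.txt": ["b.txt|chapter 1", "b.txt|chapter 2"],
}


def pct(shared, total):
    """Shared vocabulary as a percentage of the source vocabulary (0.0 if empty)."""
    return shared / total * 100 if total else 0.0


rows = []
for file1, file2 in combinations(sorted(file_to_chapters), 2):
    # zip stops at the shorter chapter list, like min_chaps in the diff.
    pairs = zip(file_to_chapters[file1], file_to_chapters[file2])
    for chapter_num, (seg1, seg2) in enumerate(pairs, start=1):
        vocab1 = set(segment_tokens.get(seg1, []))
        vocab2 = set(segment_tokens.get(seg2, []))
        shared = len(vocab1 & vocab2)
        # Emit one row per direction: file1 -> file2, then file2 -> file1.
        for src, tgt, src_vocab in ((file1, file2, vocab1), (file2, file1, vocab2)):
            rows.append({
                "ChapterNumber": chapter_num,
                "SourceText": src.replace(".txt", ""),
                "TargetText": tgt.replace(".txt", ""),
                "Containment": pct(shared, len(src_vocab)),
                "SourceVocabSize": len(src_vocab),
                "SharedVocabSize": shared,
            })

rows.sort(key=lambda r: (r["ChapterNumber"], r["SourceText"]))
for row in rows:
    print(row)
```

Each file pair contributes two rows per chapter, so the output has twice as many rows as there are compared chapter pairs.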
|
| 542 |
+
|
| 543 |
if progress_callback is not None:
|
| 544 |
try:
|
| 545 |
progress_callback(0.95, desc="Analysis complete!")
|
|
|
|
| 561 |
logger.warning(f"Final progressive callback error (non-critical): {e}")
|
| 562 |
|
| 563 |
# Return the results
|
| 564 |
+
return metrics_df, word_counts_df, vocab_containment_df, warning
|
pipeline/visualize.py
CHANGED
|
@@ -283,3 +283,92 @@ def generate_length_ratio_chart(word_counts_df: pd.DataFrame):
|
|
| 283 |
)
|
| 284 |
|
| 285 |
return fig
|
| 283 |
)
|
| 284 |
|
| 285 |
return fig
|
| 286 |
+
|
| 287 |
+
|
| 288 |
+
def generate_vocab_containment_chart(vocab_containment_df: pd.DataFrame):
|
| 289 |
+
"""
|
| 290 |
+
Generates a bar chart showing vocabulary containment per chapter.
|
| 291 |
+
Shows what percentage of each text's unique vocabulary appears in the other text.
|
| 292 |
+
|
| 293 |
+
Args:
|
| 294 |
+
vocab_containment_df: DataFrame with 'ChapterNumber', 'SourceText', 'TargetText',
|
| 295 |
+
'Containment', 'SourceVocabSize', 'SharedVocabSize'.
|
| 296 |
+
Returns:
|
| 297 |
+
plotly Figure for the vocabulary containment chart, or None if input is empty.
|
| 298 |
+
"""
|
| 299 |
+
if vocab_containment_df is None or vocab_containment_df.empty:
|
| 300 |
+
return None
|
| 301 |
+
|
| 302 |
+
fig = go.Figure()
|
| 303 |
+
|
| 304 |
+
# Create a label for each direction: "TextA → TextB" means "% of TextA's vocab in TextB"
|
| 305 |
+
vocab_containment_df = vocab_containment_df.copy()
|
| 306 |
+
vocab_containment_df["Direction"] = (
|
| 307 |
+
vocab_containment_df["SourceText"] + " → " + vocab_containment_df["TargetText"]
|
| 308 |
+
)
|
| 309 |
+
|
| 310 |
+
# Get unique directions and assign colors
|
| 311 |
+
unique_directions = sorted(vocab_containment_df["Direction"].unique())
|
| 312 |
+
colors = px.colors.qualitative.Plotly
|
| 313 |
+
|
| 314 |
+
for i, direction in enumerate(unique_directions):
|
| 315 |
+
dir_df = vocab_containment_df[vocab_containment_df["Direction"] == direction].sort_values(
|
| 316 |
+
"ChapterNumber"
|
| 317 |
+
)
|
| 318 |
+
fig.add_trace(
|
| 319 |
+
go.Bar(
|
| 320 |
+
x=dir_df["ChapterNumber"],
|
| 321 |
+
y=dir_df["Containment"],
|
| 322 |
+
name=direction,
|
| 323 |
+
marker_color=colors[i % len(colors)],
|
| 324 |
+
text=[f"{v:.1f}%" for v in dir_df["Containment"]],
|
| 325 |
+
textposition="auto",
|
| 326 |
+
customdata=dir_df[["SourceVocabSize", "SharedVocabSize", "SourceText", "TargetText"]].values,
|
| 327 |
+
hovertemplate=(
|
| 328 |
+
"<b>Chapter %{x}</b><br>"
|
| 329 |
+
+ "<b>%{customdata[2]}</b> vocabulary in <b>%{customdata[3]}</b>: %{y:.1f}%<br>"
|
| 330 |
+
+ "Unique words in source: %{customdata[0]}<br>"
|
| 331 |
+
+ "Shared words: %{customdata[1]}<extra></extra>"
|
| 332 |
+
),
|
| 333 |
+
)
|
| 334 |
+
)
|
| 335 |
+
|
| 336 |
+
fig.update_layout(
|
| 337 |
+
title_text="Vocabulary Containment per Chapter",
|
| 338 |
+
xaxis_title="Chapter Number",
|
| 339 |
+
yaxis_title="Vocabulary Containment (%)",
|
| 340 |
+
barmode="group",
|
| 341 |
+
font=dict(size=14),
|
| 342 |
+
legend_title_text="Direction (Source → Target)",
|
| 343 |
+
xaxis=dict(
|
| 344 |
+
type="category",
|
| 345 |
+
automargin=True
|
| 346 |
+
),
|
| 347 |
+
yaxis=dict(
|
| 348 |
+
rangemode='tozero',
|
| 349 |
+
automargin=True,
|
| 350 |
+
range=[0, 105], # Slightly above 100% for visual clarity
|
| 351 |
+
),
|
| 352 |
+
autosize=True,
|
| 353 |
+
margin=dict(l=80, r=50, b=100, t=60, pad=4),
|
| 354 |
+
height=450,
|
| 355 |
+
)
|
| 356 |
+
|
| 357 |
+
# Add a reference line at 100%
|
| 358 |
+
fig.add_hline(
|
| 359 |
+
y=100,
|
| 360 |
+
line_dash="dash",
|
| 361 |
+
line_color="gray",
|
| 362 |
+
annotation_text="100%",
|
| 363 |
+
annotation_position="right"
|
| 364 |
+
)
|
| 365 |
+
|
| 366 |
+
# Ensure x-axis ticks are shown for all chapter numbers
|
| 367 |
+
chapters = sorted(vocab_containment_df["ChapterNumber"].unique())
|
| 368 |
+
fig.update_xaxes(
|
| 369 |
+
tickmode="array",
|
| 370 |
+
tickvals=chapters,
|
| 371 |
+
ticktext=[str(ch) for ch in chapters],
|
| 372 |
+
)
|
| 373 |
+
|
| 374 |
+
return fig
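The chart function above builds one bar trace per direction, each ordered by chapter. The grouping step can be sketched without Plotly as follows; the sample rows are hypothetical and the `traces` dictionary merely stands in for the `go.Bar` traces that the real function adds to the figure.

```python
from collections import defaultdict

# Hypothetical rows in the shape of vocab_containment_df records.
rows = [
    {"ChapterNumber": 2, "SourceText": "a", "TargetText": "b", "Containment": 50.0},
    {"ChapterNumber": 1, "SourceText": "a", "TargetText": "b", "Containment": 100.0},
    {"ChapterNumber": 1, "SourceText": "b", "TargetText": "a", "Containment": 75.0},
    {"ChapterNumber": 2, "SourceText": "b", "TargetText": "a", "Containment": 50.0},
]

# Group rows by their "Source → Target" label, keeping chapters in ascending order.
traces = defaultdict(list)
for row in sorted(rows, key=lambda r: r["ChapterNumber"]):
    direction = f'{row["SourceText"]} → {row["TargetText"]}'
    label = f'{row["Containment"]:.1f}%'  # same formatting as the bar text labels
    traces[direction].append((row["ChapterNumber"], label))

for direction in sorted(traces):
    print(direction, traces[direction])
```

With `barmode="group"`, each chapter on the x-axis then shows one bar per direction side by side.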
|