daniel-wojahn committed
Commit b8b2303 · 1 Parent(s): ee7fa4f

vocab containment update

Files changed (4)
  1. README.md +17 -0
  2. app.py +38 -2
  3. pipeline/process.py +85 -34
  4. pipeline/visualize.py +89 -0
README.md CHANGED
@@ -65,6 +65,8 @@ The Tibetan Text Metrics project provides quantitative methods for assessing tex
65
  - **Interactive Visualizations**:
66
  - Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
67
  - Bar chart displaying word counts per segment.
68
  - **Advanced Interpretation**: Get scholarly insights about your results with a built-in analysis engine that:
69
  - Examines your metrics and provides contextual interpretation of textual relationships
70
  - Generates a dual-layer narrative analysis (scholarly and accessible)
@@ -174,6 +176,19 @@ This helps focus on meaningful content words rather than grammatical elements.
174
 
175
  *Note*: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
176
 
177
  ## Getting Started (if run Locally)
178
 
179
  1. Ensure you have Python 3.10 or newer.
@@ -234,6 +249,8 @@ For fine-grained control, use the "Custom" tab:
234
  - **Metrics Preview**: Summary table of similarity scores
235
  - **Heatmaps**: Visual comparison across all chapter pairs (darker = more similar)
236
  - **Word Counts**: Bar chart showing segment lengths
237
  - **CSV Download**: Full results for further analysis
238
 
239
  ### AI Interpretation (Optional)
 
65
  - **Interactive Visualizations**:
66
  - Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
67
  - Bar chart displaying word counts per segment.
68
+ - Length ratio chart comparing text lengths relative to the shortest text per chapter.
69
+ - **Vocabulary containment chart** showing what percentage of each text's unique vocabulary appears in the other text (directional metric).
70
  - **Advanced Interpretation**: Get scholarly insights about your results with a built-in analysis engine that:
71
  - Examines your metrics and provides contextual interpretation of textual relationships
72
  - Generates a dual-layer narrative analysis (scholarly and accessible)
 
176
 
177
  *Note*: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
178
 
179
+ ### Visualization Metrics
180
+
181
+ 5. **Vocabulary Containment**: A directional metric showing what percentage of one text's unique vocabulary appears in the other text. Unlike Jaccard (which is symmetric), containment is calculated in both directions:
182
+ - "Text A → Text B" answers: "What % of Text A's unique words also appear in Text B?"
183
+ - Calculated as: `(shared vocabulary size) / (source text vocabulary size) × 100`
184
+
185
+ **Interpreting asymmetric containment:**
186
+ - If "Base Text → Commentary" is 95% but "Commentary → Base Text" is 60%, the commentary contains almost all of the base text's vocabulary plus additional words
187
+ - This pattern suggests an expansion or commentary relationship
188
+ - Useful for identifying which text is the "base" version (its vocabulary will be highly contained in expanded versions)
189
+
190
+ 6. **Length Ratios**: Shows how much longer each text is compared to the shortest text per chapter. A ratio of 1.0x indicates the shortest (base) text; higher ratios indicate expanded content. Helps explain why Jaccard might be lower for related texts when one contains additional material.
191
+
192
  ## Getting Started (if run Locally)
193
 
194
  1. Ensure you have Python 3.10 or newer.
 
249
  - **Metrics Preview**: Summary table of similarity scores
250
  - **Heatmaps**: Visual comparison across all chapter pairs (darker = more similar)
251
  - **Word Counts**: Bar chart showing segment lengths
252
+ - **Length Ratios**: Compare text lengths to identify base text vs. expanded versions
253
+ - **Vocabulary Containment**: Directional metric showing what % of one text's vocabulary is in another
254
  - **CSV Download**: Full results for further analysis
255
 
256
  ### AI Interpretation (Optional)
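
The containment formula and the length ratio added to the README above can be checked by hand on a toy pair. The following is a minimal sketch, not code from this commit: the helper names `containment` and `jaccard` and the placeholder tokens are assumptions made for illustration, and the empty-source convention (return 0.0) simply follows what the commit does in `pipeline/process.py`.

```python
def containment(source_tokens, target_tokens):
    """Percent of the source's unique vocabulary that also appears in the target."""
    source_vocab = set(source_tokens)
    target_vocab = set(target_tokens)
    if not source_vocab:
        return 0.0  # assumed convention: an empty source yields 0.0
    return len(source_vocab & target_vocab) / len(source_vocab) * 100


def jaccard(tokens_a, tokens_b):
    """Symmetric overlap: shared vocabulary divided by combined vocabulary."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return len(vocab_a & vocab_b) / len(union) * 100


# Placeholder tokens: a short "base" text and a longer "commentary" that reuses it.
base = ["a", "b", "c", "d"]
commentary = ["a", "b", "c", "d", "e", "f", "g", "h", "a"]

base_in_commentary = containment(base, commentary)   # 4 shared / 4 unique in base
commentary_in_base = containment(commentary, base)   # 4 shared / 8 unique in commentary
jaccard_score = jaccard(base, commentary)            # 4 shared / 8 in the union
length_ratio = len(commentary) / len(base)           # 9 tokens vs. 4 tokens

print(base_in_commentary, commentary_in_base, jaccard_score, length_ratio)
```

On this toy pair the base text is fully contained in the commentary while only half of the commentary's vocabulary is found in the base, which is the asymmetric pattern the README describes. Jaccard collapses both directions into a single number.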
app.py CHANGED
@@ -1,7 +1,7 @@
1
  import gradio as gr
2
  from pathlib import Path
3
  from pipeline.process import process_texts
4
- from pipeline.visualize import generate_visualizations, generate_word_count_chart, generate_length_ratio_chart
5
  from pipeline.llm_service import LLMService
6
  from pipeline.progressive_ui import ProgressiveUI, create_progressive_callback
7
  import logging
@@ -256,6 +256,7 @@ def main_interface():
256
  "Semantic Similarity": "Compares actual meaning using AI. Higher = texts say similar things.",
257
  "Word Counts": "How long is each section? Helps you understand text structure.",
258
  "Length Ratios": "Compare text lengths to identify base text vs. commentary additions.",
259
  }
260
 
261
  metric_tooltips = {
@@ -375,6 +376,27 @@ Two texts might share many words (high Jaccard) but arrange them differently (lo
375
  - Explaining why Jaccard similarity might be lower despite texts being related
376
 
377
  **Tip:** When one text is a base and others add commentary, Jaccard penalizes the additions. This chart helps you see that relationship clearly.
378
  """,
379
  "Structural Analysis": """
380
  ### How Texts Relate to Each Other
@@ -423,6 +445,9 @@ Two texts might share many words (high Jaccard) but arrange them differently (lo
423
  elif metric_key == "Length Ratios":
424
  css_class = "metric-info-accordion lengthratio-info"
425
  accordion_title = "ℹ️ What does this mean?"
426
  else:
427
  css_class = "metric-info-accordion"
428
  accordion_title = f"ℹ️ About {metric_key}"
@@ -447,6 +472,8 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
447
  word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
448
  elif metric_key == "Length Ratios":
449
  length_ratio_plot = gr.Plot(label="Length Ratios per Chapter", show_label=False, scale=1, elem_classes="metric-description")
450
  else:
451
  heatmap_tabs[metric_key] = gr.Plot(label=f"Heatmap: {metric_key}", show_label=False, elem_classes="metric-heatmap")
452
 
@@ -486,6 +513,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
486
  metrics_preview_df_res = pd.DataFrame()
487
  word_count_fig_res = None
488
  length_ratio_fig_res = None
489
  jaccard_heatmap_res = None
490
  lcs_heatmap_res = None
491
  fuzzy_heatmap_res = None
@@ -519,6 +547,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
519
  pd.DataFrame({"Message": ["Please upload files to analyze."]}),
520
  None, # word_count_plot
521
  None, # length_ratio_plot
522
  None, # jaccard_heatmap
523
  None, # lcs_heatmap
524
  None, # fuzzy_heatmap
@@ -537,6 +566,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
537
  pd.DataFrame({"Error": [f"File '{Path(file.name).name}' exceeds the 10MB size limit (size: {file_size_mb:.2f}MB)."]}),
538
  None, # word_count_plot
539
  None, # length_ratio_plot
540
  None, # jaccard_heatmap
541
  None, # lcs_heatmap
542
  None, # fuzzy_heatmap
@@ -581,6 +611,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
581
  pd.DataFrame({"Error": [f"Could not decode file '{filename}'. Please ensure it contains valid Tibetan text in UTF-8 or UTF-16 encoding."]}),
582
  None, # word_count_plot
583
  None, # length_ratio_plot
584
  None, # jaccard_heatmap
585
  None, # lcs_heatmap
586
  None, # fuzzy_heatmap
@@ -617,7 +648,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
617
  # For Hugging Face models, the UI value is the correct model ID
618
  internal_model_id = model_name
619
 
620
- df_results, word_counts_df_data, warning_raw = process_texts(
621
  text_data=text_data,
622
  filenames=filenames,
623
  enable_semantic=enable_semantic_bool,
@@ -665,6 +696,9 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
665
  # Generate length ratio chart
666
  length_ratio_fig_res = generate_length_ratio_chart(word_counts_df_data)
667
 
668
  # Store state data for potential future use
669
  state_text_data_res = text_data
670
  state_df_results_res = df_results
@@ -702,6 +736,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
702
  metrics_preview_df_res,
703
  word_count_fig_res,
704
  length_ratio_fig_res,
705
  jaccard_heatmap_res,
706
  lcs_heatmap_res,
707
  fuzzy_heatmap_res,
@@ -784,6 +819,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
784
  metrics_preview,
785
  word_count_plot,
786
  length_ratio_plot,
787
  heatmap_tabs["Jaccard Similarity (%)"],
788
  heatmap_tabs["Normalized LCS"],
789
  heatmap_tabs["Fuzzy Similarity"],
 
1
  import gradio as gr
2
  from pathlib import Path
3
  from pipeline.process import process_texts
4
+ from pipeline.visualize import generate_visualizations, generate_word_count_chart, generate_length_ratio_chart, generate_vocab_containment_chart
5
  from pipeline.llm_service import LLMService
6
  from pipeline.progressive_ui import ProgressiveUI, create_progressive_callback
7
  import logging
 
256
  "Semantic Similarity": "Compares actual meaning using AI. Higher = texts say similar things.",
257
  "Word Counts": "How long is each section? Helps you understand text structure.",
258
  "Length Ratios": "Compare text lengths to identify base text vs. commentary additions.",
259
+ "Vocabulary Containment": "What % of one text's vocabulary appears in the other?",
260
  }
261
 
262
  metric_tooltips = {
 
376
  - Explaining why Jaccard similarity might be lower despite texts being related
377
 
378
  **Tip:** When one text is a base and others add commentary, Jaccard penalizes the additions. This chart helps you see that relationship clearly.
379
+ """,
380
+ "Vocabulary Containment": """
381
+ ### Vocabulary Containment (Directional)
382
+
383
+ **What it shows:** What percentage of one text's unique vocabulary appears in the other text.
384
+
385
+ **How to read it:**
386
+ - "Text A → Text B" means: "What % of Text A's vocabulary is found in Text B?"
387
+ - 90% means 90% of the unique words in the source text also appear in the target text
388
+
389
+ **What it tells you:**
390
+ - If Text A → Text B is 95% but Text B → Text A is 60%, then Text B contains almost all of Text A's vocabulary plus additional words
391
+ - This suggests Text B might be an expansion or commentary on Text A
392
+ - Asymmetric containment often indicates a base text + commentary relationship
393
+
394
+ **Useful for:**
395
+ - Identifying which text is the "base" (its smaller vocabulary is almost entirely contained in the longer text)
396
+ - Understanding directionality of textual relationships
397
+ - Distinguishing between shared sources vs. one text derived from another
398
+
399
+ **Tip:** Unlike Jaccard (which is symmetric), containment is directional — it tells you which text's vocabulary is "inside" the other.
400
  """,
401
  "Structural Analysis": """
402
  ### How Texts Relate to Each Other
 
445
  elif metric_key == "Length Ratios":
446
  css_class = "metric-info-accordion lengthratio-info"
447
  accordion_title = "ℹ️ What does this mean?"
448
+ elif metric_key == "Vocabulary Containment":
449
+ css_class = "metric-info-accordion vocabcontain-info"
450
+ accordion_title = "ℹ️ What does this mean?"
451
  else:
452
  css_class = "metric-info-accordion"
453
  accordion_title = f"ℹ️ About {metric_key}"
 
472
  word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
473
  elif metric_key == "Length Ratios":
474
  length_ratio_plot = gr.Plot(label="Length Ratios per Chapter", show_label=False, scale=1, elem_classes="metric-description")
475
+ elif metric_key == "Vocabulary Containment":
476
+ vocab_containment_plot = gr.Plot(label="Vocabulary Containment per Chapter", show_label=False, scale=1, elem_classes="metric-description")
477
  else:
478
  heatmap_tabs[metric_key] = gr.Plot(label=f"Heatmap: {metric_key}", show_label=False, elem_classes="metric-heatmap")
479
 
 
513
  metrics_preview_df_res = pd.DataFrame()
514
  word_count_fig_res = None
515
  length_ratio_fig_res = None
516
+ vocab_containment_fig_res = None
517
  jaccard_heatmap_res = None
518
  lcs_heatmap_res = None
519
  fuzzy_heatmap_res = None
 
547
  pd.DataFrame({"Message": ["Please upload files to analyze."]}),
548
  None, # word_count_plot
549
  None, # length_ratio_plot
550
+ None, # vocab_containment_plot
551
  None, # jaccard_heatmap
552
  None, # lcs_heatmap
553
  None, # fuzzy_heatmap
 
566
  pd.DataFrame({"Error": [f"File '{Path(file.name).name}' exceeds the 10MB size limit (size: {file_size_mb:.2f}MB)."]}),
567
  None, # word_count_plot
568
  None, # length_ratio_plot
569
+ None, # vocab_containment_plot
570
  None, # jaccard_heatmap
571
  None, # lcs_heatmap
572
  None, # fuzzy_heatmap
 
611
  pd.DataFrame({"Error": [f"Could not decode file '{filename}'. Please ensure it contains valid Tibetan text in UTF-8 or UTF-16 encoding."]}),
612
  None, # word_count_plot
613
  None, # length_ratio_plot
614
+ None, # vocab_containment_plot
615
  None, # jaccard_heatmap
616
  None, # lcs_heatmap
617
  None, # fuzzy_heatmap
 
648
  # For Hugging Face models, the UI value is the correct model ID
649
  internal_model_id = model_name
650
 
651
+ df_results, word_counts_df_data, vocab_containment_df_data, warning_raw = process_texts(
652
  text_data=text_data,
653
  filenames=filenames,
654
  enable_semantic=enable_semantic_bool,
 
696
  # Generate length ratio chart
697
  length_ratio_fig_res = generate_length_ratio_chart(word_counts_df_data)
698
 
699
+ # Generate vocabulary containment chart
700
+ vocab_containment_fig_res = generate_vocab_containment_chart(vocab_containment_df_data)
701
+
702
  # Store state data for potential future use
703
  state_text_data_res = text_data
704
  state_df_results_res = df_results
 
736
  metrics_preview_df_res,
737
  word_count_fig_res,
738
  length_ratio_fig_res,
739
+ vocab_containment_fig_res,
740
  jaccard_heatmap_res,
741
  lcs_heatmap_res,
742
  fuzzy_heatmap_res,
 
819
  metrics_preview,
820
  word_count_plot,
821
  length_ratio_plot,
822
+ vocab_containment_plot,
823
  heatmap_tabs["Jaccard Similarity (%)"],
824
  heatmap_tabs["Normalized LCS"],
825
  heatmap_tabs["Fuzzy Similarity"],
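
Because `process_texts` now returns four values instead of three, the call site in `app.py` had to change in the same commit. The sketch below shows why, using a hypothetical stand-in (`process_texts_stub`) that mirrors only the shape of the new return value; it is not the real function and takes none of its parameters.

```python
from typing import Tuple


def process_texts_stub() -> Tuple[list, list, list, str]:
    """Hypothetical stand-in: (metrics, word counts, vocab containment, warning)."""
    return [], [], [], ""


# Old call-site shape from before this commit: three targets for four values.
try:
    metrics, word_counts, warning = process_texts_stub()
    old_unpack_ok = True
except ValueError:
    # Python refuses to unpack four values into three names.
    old_unpack_ok = False

# New call-site shape, matching the updated return annotation.
metrics, word_counts, vocab_containment, warning = process_texts_stub()

print(old_unpack_ok, vocab_containment, repr(warning))
```

The same ordering concern applies to the early-return tuples and the Gradio outputs list in the diff, where the new `vocab_containment_plot` slot is inserted at the same position each time.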
pipeline/process.py CHANGED
@@ -62,7 +62,7 @@ def process_texts(
62
  progressive_callback = None,
63
  batch_size: int = 32,
64
  show_progress_bar: bool = False
65
- ) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
66
  """
67
  Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
68
 
@@ -101,12 +101,15 @@ def process_texts(
101
  Used for progressive loading of metrics as they become available. Defaults to None.
102
 
103
  Returns:
104
- Tuple[pd.DataFrame, pd.DataFrame, str]:
105
  - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
106
  Contains columns: 'Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS',
107
  'Fuzzy Similarity' (if enable_fuzzy=True), 'Semantic Similarity' (if enable_semantic=True).
108
  - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
109
  Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
110
  - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
111
 
112
  Raises:
@@ -432,7 +435,7 @@ def process_texts(
432
  metrics_df = pd.DataFrame()
433
  warning += " No valid metrics could be computed. Please check your files and try again."
434
 
435
- # Calculate word counts
436
  if progress_callback is not None:
437
  try:
438
  progress_callback(0.75, desc="Calculating word counts...")
@@ -441,12 +444,12 @@ def process_texts(
441
 
442
  word_counts_data = []
443
 
444
- # Process each segment
445
- for i, (seg_id, text_content) in enumerate(segment_texts.items()):
446
  # Update progress
447
  if progress_callback is not None and len(segment_texts) > 0:
448
  try:
449
- progress_percentage = 0.75 + (0.15 * (i / len(segment_texts)))
450
  progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
451
  except Exception as e:
452
  logger.warning(f"Progress callback error (non-critical): {e}")
@@ -454,33 +457,18 @@ def process_texts(
454
  fname, chapter_info = seg_id.split("|", 1)
455
  chapter_num = int(chapter_info.replace("chapter ", ""))
456
 
457
- try:
458
- # Use botok for accurate word count for raw Tibetan text
459
- tokenized_segments = tokenize_texts([text_content]) # Returns a list of lists
460
- if tokenized_segments and tokenized_segments[0]:
461
- word_count = len(tokenized_segments[0])
462
- else:
463
- word_count = 0
464
-
465
- word_counts_data.append(
466
- {
467
- "Filename": fname.replace(".txt", ""),
468
- "ChapterNumber": chapter_num,
469
- "SegmentID": seg_id,
470
- "WordCount": word_count,
471
- }
472
- )
473
- except Exception as e:
474
- logger.error(f"Error calculating word count for segment {seg_id}: {e}")
475
- # Add entry with 0 word count to maintain consistency
476
- word_counts_data.append(
477
- {
478
- "Filename": fname.replace(".txt", ""),
479
- "ChapterNumber": chapter_num,
480
- "SegmentID": seg_id,
481
- "WordCount": 0,
482
- }
483
- )
484
 
485
  # Create and sort the word counts DataFrame
486
  word_counts_df = pd.DataFrame(word_counts_data)
@@ -489,6 +477,69 @@ def process_texts(
489
  by=["Filename", "ChapterNumber"]
490
  ).reset_index(drop=True)
491
 
492
  if progress_callback is not None:
493
  try:
494
  progress_callback(0.95, desc="Analysis complete!")
@@ -510,4 +561,4 @@ def process_texts(
510
  logger.warning(f"Final progressive callback error (non-critical): {e}")
511
 
512
  # Return the results
513
- return metrics_df, word_counts_df, warning
 
62
  progressive_callback = None,
63
  batch_size: int = 32,
64
  show_progress_bar: bool = False
65
+ ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, str]:
66
  """
67
  Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
68
 
 
101
  Used for progressive loading of metrics as they become available. Defaults to None.
102
 
103
  Returns:
104
+ Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, str]:
105
  - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
106
  Contains columns: 'Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS',
107
  'Fuzzy Similarity' (if enable_fuzzy=True), 'Semantic Similarity' (if enable_semantic=True).
108
  - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
109
  Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
110
+ - vocab_containment_df: DataFrame with vocabulary containment percentages per chapter.
111
+ Shows what % of each text's unique vocabulary appears in the other text.
112
+ Contains columns: 'ChapterNumber', 'SourceText', 'TargetText', 'Containment', 'SourceVocabSize', 'SharedVocabSize'.
113
  - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
114
 
115
  Raises:
 
435
  metrics_df = pd.DataFrame()
436
  warning += " No valid metrics could be computed. Please check your files and try again."
437
 
438
+ # Calculate word counts using cached tokens
439
  if progress_callback is not None:
440
  try:
441
  progress_callback(0.75, desc="Calculating word counts...")
 
444
 
445
  word_counts_data = []
446
 
447
+ # Process each segment using cached tokens
448
+ for i, seg_id in enumerate(segment_texts.keys()):
449
  # Update progress
450
  if progress_callback is not None and len(segment_texts) > 0:
451
  try:
452
+ progress_percentage = 0.75 + (0.10 * (i / len(segment_texts)))
453
  progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
454
  except Exception as e:
455
  logger.warning(f"Progress callback error (non-critical): {e}")
 
457
  fname, chapter_info = seg_id.split("|", 1)
458
  chapter_num = int(chapter_info.replace("chapter ", ""))
459
 
460
+ # Use cached tokens instead of re-tokenizing
461
+ tokens = segment_tokens.get(seg_id, [])
462
+ word_count = len(tokens)
463
+
464
+ word_counts_data.append(
465
+ {
466
+ "Filename": fname.replace(".txt", ""),
467
+ "ChapterNumber": chapter_num,
468
+ "SegmentID": seg_id,
469
+ "WordCount": word_count,
470
+ }
471
+ )
 
472
 
473
  # Create and sort the word counts DataFrame
474
  word_counts_df = pd.DataFrame(word_counts_data)
 
477
  by=["Filename", "ChapterNumber"]
478
  ).reset_index(drop=True)
479
 
480
+ # Calculate vocabulary containment per chapter
481
+ # "X% of Text A's vocabulary appears in Text B"
482
+ if progress_callback is not None:
483
+ try:
484
+ progress_callback(0.87, desc="Calculating vocabulary containment...")
485
+ except Exception as e:
486
+ logger.warning(f"Progress callback error (non-critical): {e}")
487
+
488
+ vocab_containment_data = []
489
+
490
+ # For each pair of files, calculate vocabulary containment per chapter
491
+ for file1, file2 in combinations(files, 2):
492
+ chaps1 = file_to_chapters[file1]
493
+ chaps2 = file_to_chapters[file2]
494
+ min_chaps = min(len(chaps1), len(chaps2))
495
+
496
+ for idx in range(min_chaps):
497
+ seg1 = chaps1[idx]
498
+ seg2 = chaps2[idx]
499
+
500
+ # Get unique vocabularies (sets) for each segment
501
+ vocab1 = set(segment_tokens.get(seg1, []))
502
+ vocab2 = set(segment_tokens.get(seg2, []))
503
+
504
+ chapter_num = idx + 1
505
+ fname1 = file1.replace(".txt", "")
506
+ fname2 = file2.replace(".txt", "")
507
+
508
+ # Calculate containment in both directions
509
+ # "What % of Text A's vocabulary is in Text B?"
510
+ if len(vocab1) > 0:
511
+ containment_1_in_2 = len(vocab1 & vocab2) / len(vocab1) * 100
512
+ else:
513
+ containment_1_in_2 = 0.0
514
+
515
+ if len(vocab2) > 0:
516
+ containment_2_in_1 = len(vocab1 & vocab2) / len(vocab2) * 100
517
+ else:
518
+ containment_2_in_1 = 0.0
519
+
520
+ vocab_containment_data.append({
521
+ "ChapterNumber": chapter_num,
522
+ "SourceText": fname1,
523
+ "TargetText": fname2,
524
+ "Containment": containment_1_in_2, # % of source vocab in target
525
+ "SourceVocabSize": len(vocab1),
526
+ "SharedVocabSize": len(vocab1 & vocab2),
527
+ })
528
+ vocab_containment_data.append({
529
+ "ChapterNumber": chapter_num,
530
+ "SourceText": fname2,
531
+ "TargetText": fname1,
532
+ "Containment": containment_2_in_1, # % of source vocab in target
533
+ "SourceVocabSize": len(vocab2),
534
+ "SharedVocabSize": len(vocab1 & vocab2),
535
+ })
536
+
537
+ vocab_containment_df = pd.DataFrame(vocab_containment_data)
538
+ if not vocab_containment_df.empty:
539
+ vocab_containment_df = vocab_containment_df.sort_values(
540
+ by=["ChapterNumber", "SourceText"]
541
+ ).reset_index(drop=True)
542
+
543
  if progress_callback is not None:
544
  try:
545
  progress_callback(0.95, desc="Analysis complete!")
 
561
  logger.warning(f"Final progressive callback error (non-critical): {e}")
562
 
563
  # Return the results
564
+ return metrics_df, word_counts_df, vocab_containment_df, warning
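
The containment loop added above can be read in isolation as a small function. This is a sketch under assumptions rather than the commit's code: `containment_rows` is a hypothetical name, the shapes of `file_to_chapters` (filename to ordered segment IDs) and `segment_tokens` (segment ID to token list) are inferred from how the diff uses them, and plain dicts are returned instead of a DataFrame. It computes the shared vocabulary once per chapter pair and emits one row per direction.

```python
from itertools import combinations


def containment_rows(files, file_to_chapters, segment_tokens):
    """Build one row per chapter and direction, mirroring the columns in the diff."""
    rows = []
    for file1, file2 in combinations(files, 2):
        chaps1, chaps2 = file_to_chapters[file1], file_to_chapters[file2]
        # zip stops at the shorter list, like range(min(len(chaps1), len(chaps2))).
        for idx, (seg1, seg2) in enumerate(zip(chaps1, chaps2)):
            vocab1 = set(segment_tokens.get(seg1, []))
            vocab2 = set(segment_tokens.get(seg2, []))
            shared = len(vocab1 & vocab2)  # computed once, reused for both directions
            for source, target, source_vocab in (
                (file1, file2, vocab1),
                (file2, file1, vocab2),
            ):
                rows.append({
                    "ChapterNumber": idx + 1,  # positional numbering, as in the diff
                    "SourceText": source.replace(".txt", ""),
                    "TargetText": target.replace(".txt", ""),
                    "Containment": shared / len(source_vocab) * 100 if source_vocab else 0.0,
                    "SourceVocabSize": len(source_vocab),
                    "SharedVocabSize": shared,
                })
    return rows


# Invented example data: the second file has an extra chapter that is ignored.
files = ["base.txt", "comm.txt"]
file_to_chapters = {
    "base.txt": ["base.txt|chapter 1"],
    "comm.txt": ["comm.txt|chapter 1", "comm.txt|chapter 2"],
}
segment_tokens = {
    "base.txt|chapter 1": ["ka", "kha", "ga"],
    "comm.txt|chapter 1": ["ka", "kha", "ga", "nga", "ca", "cha"],
    "comm.txt|chapter 2": ["ja"],
}
rows = containment_rows(files, file_to_chapters, segment_tokens)
for row in rows:
    print(row)
```

Like the commit, this numbers chapters by position (`idx + 1`) rather than parsing the number out of the segment ID, so it assumes chapters are listed consecutively starting at 1.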
pipeline/visualize.py CHANGED
@@ -283,3 +283,92 @@ def generate_length_ratio_chart(word_counts_df: pd.DataFrame):
283
  )
284
 
285
  return fig
 
283
  )
284
 
285
  return fig
286
+
287
+
288
+ def generate_vocab_containment_chart(vocab_containment_df: pd.DataFrame):
289
+ """
290
+ Generates a bar chart showing vocabulary containment per chapter.
291
+ Shows what percentage of each text's unique vocabulary appears in the other text.
292
+
293
+ Args:
294
+ vocab_containment_df: DataFrame with 'ChapterNumber', 'SourceText', 'TargetText',
295
+ 'Containment', 'SourceVocabSize', 'SharedVocabSize'.
296
+ Returns:
297
+ plotly Figure for the vocabulary containment chart, or None if input is empty.
298
+ """
299
+ if vocab_containment_df is None or vocab_containment_df.empty:
300
+ return None
301
+
302
+ fig = go.Figure()
303
+
304
+ # Create a label for each direction: "TextA → TextB" means "% of TextA's vocab in TextB"
305
+ vocab_containment_df = vocab_containment_df.copy()
306
+ vocab_containment_df["Direction"] = (
307
+ vocab_containment_df["SourceText"] + " → " + vocab_containment_df["TargetText"]
308
+ )
309
+
310
+ # Get unique directions and assign colors
311
+ unique_directions = sorted(vocab_containment_df["Direction"].unique())
312
+ colors = px.colors.qualitative.Plotly
313
+
314
+ for i, direction in enumerate(unique_directions):
315
+ dir_df = vocab_containment_df[vocab_containment_df["Direction"] == direction].sort_values(
316
+ "ChapterNumber"
317
+ )
318
+ fig.add_trace(
319
+ go.Bar(
320
+ x=dir_df["ChapterNumber"],
321
+ y=dir_df["Containment"],
322
+ name=direction,
323
+ marker_color=colors[i % len(colors)],
324
+ text=[f"{v:.1f}%" for v in dir_df["Containment"]],
325
+ textposition="auto",
326
+ customdata=dir_df[["SourceVocabSize", "SharedVocabSize", "SourceText", "TargetText"]].values,
327
+ hovertemplate=(
328
+ "<b>Chapter %{x}</b><br>"
329
+ + "<b>%{customdata[2]}</b> vocabulary in <b>%{customdata[3]}</b>: %{y:.1f}%<br>"
330
+ + "Unique words in source: %{customdata[0]}<br>"
331
+ + "Shared words: %{customdata[1]}<extra></extra>"
332
+ ),
333
+ )
334
+ )
335
+
336
+ fig.update_layout(
337
+ title_text="Vocabulary Containment per Chapter",
338
+ xaxis_title="Chapter Number",
339
+ yaxis_title="Vocabulary Containment (%)",
340
+ barmode="group",
341
+ font=dict(size=14),
342
+ legend_title_text="Direction (Source → Target)",
343
+ xaxis=dict(
344
+ type="category",
345
+ automargin=True
346
+ ),
347
+ yaxis=dict(
348
+ rangemode='tozero',
349
+ automargin=True,
350
+ range=[0, 105], # Slightly above 100% for visual clarity
351
+ ),
352
+ autosize=True,
353
+ margin=dict(l=80, r=50, b=100, t=60, pad=4),
354
+ height=450,
355
+ )
356
+
357
+ # Add a reference line at 100%
358
+ fig.add_hline(
359
+ y=100,
360
+ line_dash="dash",
361
+ line_color="gray",
362
+ annotation_text="100%",
363
+ annotation_position="right"
364
+ )
365
+
366
+ # Ensure x-axis ticks are shown for all chapter numbers
367
+ chapters = sorted(vocab_containment_df["ChapterNumber"].unique())
368
+ fig.update_xaxes(
369
+ tickmode="array",
370
+ tickvals=chapters,
371
+ ticktext=[str(ch) for ch in chapters],
372
+ )
373
+
374
+ return fig
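
The grouping step inside `generate_vocab_containment_chart` (one bar trace per "Source → Target" direction, sorted by chapter) can be sketched without any plotting library. The helper name `series_by_direction` and the sample rows below are assumptions for illustration; the real function builds `go.Bar` traces from a DataFrame instead of returning plain dicts.

```python
from collections import defaultdict


def series_by_direction(rows):
    """Group containment rows into one chapter-ordered series per direction."""
    grouped = defaultdict(list)
    for row in rows:
        direction = f"{row['SourceText']} → {row['TargetText']}"
        grouped[direction].append((row["ChapterNumber"], row["Containment"]))

    series = {}
    for direction in sorted(grouped):       # stable legend order
        points = sorted(grouped[direction])  # chapters in ascending order
        series[direction] = {
            "x": [str(chapter) for chapter, _ in points],     # category labels
            "y": [value for _, value in points],
            "text": [f"{value:.1f}%" for _, value in points],  # bar labels
        }
    return series


# Invented sample rows, deliberately out of chapter order.
sample_rows = [
    {"ChapterNumber": 2, "SourceText": "A", "TargetText": "B", "Containment": 80.0},
    {"ChapterNumber": 1, "SourceText": "A", "TargetText": "B", "Containment": 95.0},
    {"ChapterNumber": 1, "SourceText": "B", "TargetText": "A", "Containment": 60.0},
]
series = series_by_direction(sample_rows)
for direction, data in series.items():
    print(direction, data)
```

Each resulting series corresponds to one grouped bar trace, which is why a two-file comparison yields two bars per chapter in the chart.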