Spaces: daniel-wojahn/ttm-webapp-hf

daniel-wojahn committed on Dec 17, 2025

Commit b8b2303 (1 parent: ee7fa4f)

vocab containment update

Files changed (4)

README.md +17 -0
app.py +38 -2
pipeline/process.py +85 -34
pipeline/visualize.py +89 -0

README.md

@@ -65,6 +65,8 @@ The Tibetan Text Metrics project provides quantitative methods for assessing tex
 -   **Interactive Visualizations**:
     -   Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
     -   Bar chart displaying word counts per segment.
 -   **Advanced Interpretation**: Get scholarly insights about your results with a built-in analysis engine that:
     -   Examines your metrics and provides contextual interpretation of textual relationships
     -   Generates a dual-layer narrative analysis (scholarly and accessible)
@@ -174,6 +176,19 @@ This helps focus on meaningful content words rather than grammatical elements.
     *Note*: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
 ## Getting Started (if run Locally)
 1.  Ensure you have Python 3.10 or newer.
@@ -234,6 +249,8 @@ For fine-grained control, use the "Custom" tab:
 -   **Metrics Preview**: Summary table of similarity scores
 -   **Heatmaps**: Visual comparison across all chapter pairs (darker = more similar)
 -   **Word Counts**: Bar chart showing segment lengths
 -   **CSV Download**: Full results for further analysis
 ### AI Interpretation (Optional)

 -   **Interactive Visualizations**:
     -   Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
     -   Bar chart displaying word counts per segment.
+    -   Length ratio chart comparing text lengths relative to the shortest text per chapter.
+    -   **Vocabulary containment chart** showing what percentage of each text's unique vocabulary appears in the other text (directional metric).
 -   **Advanced Interpretation**: Get scholarly insights about your results with a built-in analysis engine that:
     -   Examines your metrics and provides contextual interpretation of textual relationships
     -   Generates a dual-layer narrative analysis (scholarly and accessible)
     *Note*: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
+### Visualization Metrics
+5.  **Vocabulary Containment**: A directional metric showing what percentage of one text's unique vocabulary appears in the other text. Unlike Jaccard (which is symmetric), containment is calculated in both directions:
+    - "Text A → Text B" answers: "What % of Text A's unique words also appear in Text B?"
+    - Calculated as: `(shared vocabulary size) / (source text vocabulary size) × 100`
+    **Interpreting asymmetric containment:**
+    - If "Base Text → Commentary" is 95% but "Commentary → Base Text" is 60%, the commentary contains almost all of the base text's vocabulary plus additional words
+    - This pattern suggests an expansion or commentary relationship
+    - Useful for identifying which text is the "base" version (its vocabulary will be highly contained in expanded versions)
+6.  **Length Ratios**: Shows how much longer each text is compared to the shortest text per chapter. A ratio of 1.0x indicates the shortest (base) text; higher ratios indicate expanded content. Helps explain why Jaccard might be lower for related texts when one contains additional material.
 ## Getting Started (if run Locally)
 1.  Ensure you have Python 3.10 or newer.
 -   **Metrics Preview**: Summary table of similarity scores
 -   **Heatmaps**: Visual comparison across all chapter pairs (darker = more similar)
 -   **Word Counts**: Bar chart showing segment lengths
+-   **Length Ratios**: Compare text lengths to identify base text vs. expanded versions
+-   **Vocabulary Containment**: Directional metric showing what % of one text's vocabulary is in another
 -   **CSV Download**: Full results for further analysis
 ### AI Interpretation (Optional)
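
The containment formula added to the README can be checked on a toy example. The sketch below is not the project's code: the function names `containment` and `jaccard` and the English placeholder tokens are assumptions for illustration (the app itself works on tokenized Tibetan text).

```python
def containment(source_tokens, target_tokens):
    """Percentage of the source's unique vocabulary that also appears in the target."""
    source_vocab = set(source_tokens)
    target_vocab = set(target_tokens)
    if not source_vocab:
        return 0.0  # mirrors the diff's choice of 0.0 for an empty source vocabulary
    return len(source_vocab & target_vocab) / len(source_vocab) * 100


def jaccard(tokens_a, tokens_b):
    """Symmetric Jaccard similarity in percent, for comparison."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b) * 100


# Placeholder data: the "commentary" repeats the base text and adds five new words.
base = ["king", "went", "to", "the", "mountain"]
commentary = base + ["here", "king", "means", "ruler", "of", "dharma"]

base_in_commentary = containment(base, commentary)   # base -> commentary
commentary_in_base = containment(commentary, base)   # commentary -> base
similarity = jaccard(base, commentary)
```

In this example the base text's vocabulary is fully contained in the commentary, while only half of the commentary's vocabulary appears in the base text. Jaccard gives a single symmetric value for the pair, so it cannot show which direction the expansion runs.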

app.py

@@ -1,7 +1,7 @@
 import gradio as gr
 from pathlib import Path
 from pipeline.process import process_texts
-from pipeline.visualize import generate_visualizations, generate_word_count_chart, generate_length_ratio_chart
 from pipeline.llm_service import LLMService
 from pipeline.progressive_ui import ProgressiveUI, create_progressive_callback
 import logging
@@ -256,6 +256,7 @@ def main_interface():
             "Semantic Similarity": "Compares actual meaning using AI. Higher = texts say similar things.",
             "Word Counts": "How long is each section? Helps you understand text structure.",
             "Length Ratios": "Compare text lengths to identify base text vs. commentary additions.",
         }
         metric_tooltips = {
@@ -375,6 +376,27 @@ Two texts might share many words (high Jaccard) but arrange them differently (lo
 - Explaining why Jaccard similarity might be lower despite texts being related
 **Tip:** When one text is a base and others add commentary, Jaccard penalizes the additions. This chart helps you see that relationship clearly.
 """,
             "Structural Analysis": """
 ### How Texts Relate to Each Other
@@ -423,6 +445,9 @@ Two texts might share many words (high Jaccard) but arrange them differently (lo
                     elif metric_key == "Length Ratios":
                         css_class = "metric-info-accordion lengthratio-info"
                         accordion_title = "ℹ️ What does this mean?"
                     else:
                         css_class = "metric-info-accordion"
                         accordion_title = f"ℹ️ About {metric_key}"
@@ -447,6 +472,8 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
                         word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
                     elif metric_key == "Length Ratios":
                         length_ratio_plot = gr.Plot(label="Length Ratios per Chapter", show_label=False, scale=1, elem_classes="metric-description")
                     else:
                         heatmap_tabs[metric_key] = gr.Plot(label=f"Heatmap: {metric_key}", show_label=False, elem_classes="metric-heatmap")
@@ -486,6 +513,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
             metrics_preview_df_res = pd.DataFrame()
             word_count_fig_res = None
             length_ratio_fig_res = None
             jaccard_heatmap_res = None
             lcs_heatmap_res = None
             fuzzy_heatmap_res = None
@@ -519,6 +547,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
                     pd.DataFrame({"Message": ["Please upload files to analyze."]}),
                     None,  # word_count_plot
                     None,  # length_ratio_plot
                     None,  # jaccard_heatmap
                     None,  # lcs_heatmap
                     None,  # fuzzy_heatmap
@@ -537,6 +566,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
                         pd.DataFrame({"Error": [f"File '{Path(file.name).name}' exceeds the 10MB size limit (size: {file_size_mb:.2f}MB)."]}),
                         None,  # word_count_plot
                         None,  # length_ratio_plot
                         None,  # jaccard_heatmap
                         None,  # lcs_heatmap
                         None,  # fuzzy_heatmap
@@ -581,6 +611,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
                                 pd.DataFrame({"Error": [f"Could not decode file '{filename}'. Please ensure it contains valid Tibetan text in UTF-8 or UTF-16 encoding."]}),
                                 None,  # word_count_plot
                                 None,  # length_ratio_plot
                                 None,  # jaccard_heatmap
                                 None,  # lcs_heatmap
                                 None,  # fuzzy_heatmap
@@ -617,7 +648,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
                 # For Hugging Face models, the UI value is the correct model ID
                 internal_model_id = model_name
-                df_results, word_counts_df_data, warning_raw = process_texts(
                     text_data=text_data,
                     filenames=filenames,
                     enable_semantic=enable_semantic_bool,
@@ -665,6 +696,9 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
                     # Generate length ratio chart
                     length_ratio_fig_res = generate_length_ratio_chart(word_counts_df_data)
                     # Store state data for potential future use
                     state_text_data_res = text_data
                     state_df_results_res = df_results
@@ -702,6 +736,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
                 metrics_preview_df_res,
                 word_count_fig_res,
                 length_ratio_fig_res,
                 jaccard_heatmap_res,
                 lcs_heatmap_res,
                 fuzzy_heatmap_res,
@@ -784,6 +819,7 @@ This chart shows how many words are in each chapter or section. Taller bars = lo
             metrics_preview,
             word_count_plot,
             length_ratio_plot,
             heatmap_tabs["Jaccard Similarity (%)"],
             heatmap_tabs["Normalized LCS"],
             heatmap_tabs["Fuzzy Similarity"],

 import gradio as gr
 from pathlib import Path
 from pipeline.process import process_texts
+from pipeline.visualize import generate_visualizations, generate_word_count_chart, generate_length_ratio_chart, generate_vocab_containment_chart
 from pipeline.llm_service import LLMService
 from pipeline.progressive_ui import ProgressiveUI, create_progressive_callback
 import logging
             "Semantic Similarity": "Compares actual meaning using AI. Higher = texts say similar things.",
             "Word Counts": "How long is each section? Helps you understand text structure.",
             "Length Ratios": "Compare text lengths to identify base text vs. commentary additions.",
+            "Vocabulary Containment": "What % of one text's vocabulary appears in the other?",
         }
         metric_tooltips = {
 - Explaining why Jaccard similarity might be lower despite texts being related
 **Tip:** When one text is a base and others add commentary, Jaccard penalizes the additions. This chart helps you see that relationship clearly.
+""",
+            "Vocabulary Containment": """
+### Vocabulary Containment (Directional)
+**What it shows:** What percentage of one text's unique vocabulary appears in the other text.
+**How to read it:**
+- "Text A → Text B" means: "What % of Text A's vocabulary is found in Text B?"
+- 90% means 90% of the unique words in the source text also appear in the target text
+**What it tells you:**
+- If Text A → Text B is 95% but Text B → Text A is 60%, then Text B contains almost all of Text A's vocabulary plus additional words
+- This suggests Text B might be an expansion or commentary on Text A
+- Asymmetric containment often indicates a base text + commentary relationship
+**Useful for:**
+- Identifying which text is the "base" (shorter vocabulary fully contained in longer text)
+- Understanding directionality of textual relationships
+- Distinguishing between shared sources vs. one text derived from another
+**Tip:** Unlike Jaccard (which is symmetric), containment is directional — it tells you which text's vocabulary is "inside" the other.
 """,
             "Structural Analysis": """
 ### How Texts Relate to Each Other
                     elif metric_key == "Length Ratios":
                         css_class = "metric-info-accordion lengthratio-info"
                         accordion_title = "ℹ️ What does this mean?"
+                    elif metric_key == "Vocabulary Containment":
+                        css_class = "metric-info-accordion vocabcontain-info"
+                        accordion_title = "ℹ️ What does this mean?"
                     else:
                         css_class = "metric-info-accordion"
                         accordion_title = f"ℹ️ About {metric_key}"
                         word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
                     elif metric_key == "Length Ratios":
                         length_ratio_plot = gr.Plot(label="Length Ratios per Chapter", show_label=False, scale=1, elem_classes="metric-description")
+                    elif metric_key == "Vocabulary Containment":
+                        vocab_containment_plot = gr.Plot(label="Vocabulary Containment per Chapter", show_label=False, scale=1, elem_classes="metric-description")
                     else:
                         heatmap_tabs[metric_key] = gr.Plot(label=f"Heatmap: {metric_key}", show_label=False, elem_classes="metric-heatmap")
             metrics_preview_df_res = pd.DataFrame()
             word_count_fig_res = None
             length_ratio_fig_res = None
+            vocab_containment_fig_res = None
             jaccard_heatmap_res = None
             lcs_heatmap_res = None
             fuzzy_heatmap_res = None
                     pd.DataFrame({"Message": ["Please upload files to analyze."]}),
                     None,  # word_count_plot
                     None,  # length_ratio_plot
+                    None,  # vocab_containment_plot
                     None,  # jaccard_heatmap
                     None,  # lcs_heatmap
                     None,  # fuzzy_heatmap
                         pd.DataFrame({"Error": [f"File '{Path(file.name).name}' exceeds the 10MB size limit (size: {file_size_mb:.2f}MB)."]}),
                         None,  # word_count_plot
                         None,  # length_ratio_plot
+                        None,  # vocab_containment_plot
                         None,  # jaccard_heatmap
                         None,  # lcs_heatmap
                         None,  # fuzzy_heatmap
                                 pd.DataFrame({"Error": [f"Could not decode file '{filename}'. Please ensure it contains valid Tibetan text in UTF-8 or UTF-16 encoding."]}),
                                 None,  # word_count_plot
                                 None,  # length_ratio_plot
+                                None,  # vocab_containment_plot
                                 None,  # jaccard_heatmap
                                 None,  # lcs_heatmap
                                 None,  # fuzzy_heatmap
                 # For Hugging Face models, the UI value is the correct model ID
                 internal_model_id = model_name
+                df_results, word_counts_df_data, vocab_containment_df_data, warning_raw = process_texts(
                     text_data=text_data,
                     filenames=filenames,
                     enable_semantic=enable_semantic_bool,
                     # Generate length ratio chart
                     length_ratio_fig_res = generate_length_ratio_chart(word_counts_df_data)
+                    # Generate vocabulary containment chart
+                    vocab_containment_fig_res = generate_vocab_containment_chart(vocab_containment_df_data)
                     # Store state data for potential future use
                     state_text_data_res = text_data
                     state_df_results_res = df_results
                 metrics_preview_df_res,
                 word_count_fig_res,
                 length_ratio_fig_res,
+                vocab_containment_fig_res,
                 jaccard_heatmap_res,
                 lcs_heatmap_res,
                 fuzzy_heatmap_res,
             metrics_preview,
             word_count_plot,
             length_ratio_plot,
+            vocab_containment_plot,
             heatmap_tabs["Jaccard Similarity (%)"],
             heatmap_tabs["Normalized LCS"],
             heatmap_tabs["Fuzzy Similarity"],
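
The app.py changes add the same `None,  # vocab_containment_plot` placeholder at three early-return sites, one value in the final return, and one component in the outputs list, because the returned tuple has to line up positionally with the output components. One possible way to keep these in sync is sketched below. It is not part of this commit: `OUTPUT_SLOTS` and `build_outputs` are hypothetical names, and only the outputs visible in the diff are listed (the real function returns more values).

```python
# Hypothetical slot names, in the order shown in the diff's return statements.
OUTPUT_SLOTS = [
    "metrics_preview",
    "word_count_plot",
    "length_ratio_plot",
    "vocab_containment_plot",
    "jaccard_heatmap",
    "lcs_heatmap",
    "fuzzy_heatmap",
]


def build_outputs(**values):
    """Return a tuple ordered like OUTPUT_SLOTS, with None for anything not supplied."""
    unknown = set(values) - set(OUTPUT_SLOTS)
    if unknown:
        raise KeyError(f"Unknown output slot(s): {sorted(unknown)}")
    return tuple(values.get(slot) for slot in OUTPUT_SLOTS)


# An early-exit path only needs to name the slot it actually fills.
early_exit = build_outputs(metrics_preview="Please upload files to analyze.")
```

With a helper like this, adding a new chart would mean editing one list rather than every return site, at the cost of an extra layer of indirection.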

pipeline/process.py

@@ -62,7 +62,7 @@ def process_texts(
     progressive_callback = None,
     batch_size: int = 32,
     show_progress_bar: bool = False
-) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
     """
     Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
@@ -101,12 +101,15 @@ def process_texts(
             Used for progressive loading of metrics as they become available. Defaults to None.
     Returns:
-        Tuple[pd.DataFrame, pd.DataFrame, str]:
             - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
                 Contains columns: 'Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS',
                 'Fuzzy Similarity' (if enable_fuzzy=True), 'Semantic Similarity' (if enable_semantic=True).
             - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
                 Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
             - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
     Raises:
@@ -432,7 +435,7 @@ def process_texts(
         metrics_df = pd.DataFrame()
         warning += " No valid metrics could be computed. Please check your files and try again."
-    # Calculate word counts
     if progress_callback is not None:
         try:
             progress_callback(0.75, desc="Calculating word counts...")
@@ -441,12 +444,12 @@ def process_texts(
     word_counts_data = []
-    # Process each segment
-    for i, (seg_id, text_content) in enumerate(segment_texts.items()):
         # Update progress
         if progress_callback is not None and len(segment_texts) > 0:
             try:
-                progress_percentage = 0.75 + (0.15 * (i / len(segment_texts)))
                 progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
             except Exception as e:
                 logger.warning(f"Progress callback error (non-critical): {e}")
@@ -454,33 +457,18 @@ def process_texts(
         fname, chapter_info = seg_id.split("|", 1)
         chapter_num = int(chapter_info.replace("chapter ", ""))
-        try:
-            # Use botok for accurate word count for raw Tibetan text
-            tokenized_segments = tokenize_texts([text_content])  # Returns a list of lists
-            if tokenized_segments and tokenized_segments[0]:
-                word_count = len(tokenized_segments[0])
-            else:
-                word_count = 0
-            word_counts_data.append(
-                {
-                    "Filename": fname.replace(".txt", ""),
-                    "ChapterNumber": chapter_num,
-                    "SegmentID": seg_id,
-                    "WordCount": word_count,
-                }
-            )
-        except Exception as e:
-            logger.error(f"Error calculating word count for segment {seg_id}: {e}")
-            # Add entry with 0 word count to maintain consistency
-            word_counts_data.append(
-                {
-                    "Filename": fname.replace(".txt", ""),
-                    "ChapterNumber": chapter_num,
-                    "SegmentID": seg_id,
-                    "WordCount": 0,
-                }
-            )
     # Create and sort the word counts DataFrame
     word_counts_df = pd.DataFrame(word_counts_data)
@@ -489,6 +477,69 @@ def process_texts(
             by=["Filename", "ChapterNumber"]
         ).reset_index(drop=True)
     if progress_callback is not None:
         try:
             progress_callback(0.95, desc="Analysis complete!")
@@ -510,4 +561,4 @@ def process_texts(
             logger.warning(f"Final progressive callback error (non-critical): {e}")
     # Return the results
-    return metrics_df, word_counts_df, warning

     progressive_callback = None,
     batch_size: int = 32,
     show_progress_bar: bool = False
+) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, str]:
     """
     Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
             Used for progressive loading of metrics as they become available. Defaults to None.
     Returns:
+        Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, str]:
             - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
                 Contains columns: 'Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS',
                 'Fuzzy Similarity' (if enable_fuzzy=True), 'Semantic Similarity' (if enable_semantic=True).
             - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
                 Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
+            - vocab_containment_df: DataFrame with vocabulary containment percentages per chapter.
+                Shows what % of each text's unique vocabulary appears in the other text.
+                Contains columns: 'ChapterNumber', 'SourceText', 'TargetText', 'Containment', 'SourceVocabSize', 'SharedVocabSize'.
             - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
     Raises:
         metrics_df = pd.DataFrame()
         warning += " No valid metrics could be computed. Please check your files and try again."
+    # Calculate word counts using cached tokens
     if progress_callback is not None:
         try:
             progress_callback(0.75, desc="Calculating word counts...")
     word_counts_data = []
+    # Process each segment using cached tokens
+    for i, seg_id in enumerate(segment_texts.keys()):
         # Update progress
         if progress_callback is not None and len(segment_texts) > 0:
             try:
+                progress_percentage = 0.75 + (0.10 * (i / len(segment_texts)))
                 progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
             except Exception as e:
                 logger.warning(f"Progress callback error (non-critical): {e}")
         fname, chapter_info = seg_id.split("|", 1)
         chapter_num = int(chapter_info.replace("chapter ", ""))
+        # Use cached tokens instead of re-tokenizing
+        tokens = segment_tokens.get(seg_id, [])
+        word_count = len(tokens)
+        word_counts_data.append(
+            {
+                "Filename": fname.replace(".txt", ""),
+                "ChapterNumber": chapter_num,
+                "SegmentID": seg_id,
+                "WordCount": word_count,
+            }
+        )
     # Create and sort the word counts DataFrame
     word_counts_df = pd.DataFrame(word_counts_data)
             by=["Filename", "ChapterNumber"]
         ).reset_index(drop=True)
+    # Calculate vocabulary containment per chapter
+    # "X% of Text A's vocabulary appears in Text B"
+    if progress_callback is not None:
+        try:
+            progress_callback(0.87, desc="Calculating vocabulary containment...")
+        except Exception as e:
+            logger.warning(f"Progress callback error (non-critical): {e}")
+    vocab_containment_data = []
+    # For each pair of files, calculate vocabulary containment per chapter
+    for file1, file2 in combinations(files, 2):
+        chaps1 = file_to_chapters[file1]
+        chaps2 = file_to_chapters[file2]
+        min_chaps = min(len(chaps1), len(chaps2))
+        for idx in range(min_chaps):
+            seg1 = chaps1[idx]
+            seg2 = chaps2[idx]
+            # Get unique vocabularies (sets) for each segment
+            vocab1 = set(segment_tokens.get(seg1, []))
+            vocab2 = set(segment_tokens.get(seg2, []))
+            chapter_num = idx + 1
+            fname1 = file1.replace(".txt", "")
+            fname2 = file2.replace(".txt", "")
+            # Calculate containment in both directions
+            # "What % of Text A's vocabulary is in Text B?"
+            if len(vocab1) > 0:
+                containment_1_in_2 = len(vocab1 & vocab2) / len(vocab1) * 100
+            else:
+                containment_1_in_2 = 0.0
+            if len(vocab2) > 0:
+                containment_2_in_1 = len(vocab1 & vocab2) / len(vocab2) * 100
+            else:
+                containment_2_in_1 = 0.0
+            vocab_containment_data.append({
+                "ChapterNumber": chapter_num,
+                "SourceText": fname1,
+                "TargetText": fname2,
+                "Containment": containment_1_in_2,  # % of source vocab in target
+                "SourceVocabSize": len(vocab1),
+                "SharedVocabSize": len(vocab1 & vocab2),
+            })
+            vocab_containment_data.append({
+                "ChapterNumber": chapter_num,
+                "SourceText": fname2,
+                "TargetText": fname1,
+                "Containment": containment_2_in_1,  # % of source vocab in target
+                "SourceVocabSize": len(vocab2),
+                "SharedVocabSize": len(vocab1 & vocab2),
+            })
+    vocab_containment_df = pd.DataFrame(vocab_containment_data)
+    if not vocab_containment_df.empty:
+        vocab_containment_df = vocab_containment_df.sort_values(
+            by=["ChapterNumber", "SourceText"]
+        ).reset_index(drop=True)
     if progress_callback is not None:
         try:
             progress_callback(0.95, desc="Analysis complete!")
             logger.warning(f"Final progressive callback error (non-critical): {e}")
     # Return the results
+    return metrics_df, word_counts_df, vocab_containment_df, warning
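
The new loop emits two rows per file pair and chapter, one for each direction, and pairs chapters by position up to the shorter file's chapter count. The standalone sketch below reproduces that row shape on toy data so the output can be inspected without running the full pipeline. `containment_rows` and the input layout (a dict of filename to per-chapter token lists) are assumptions for illustration; the real function reads from `segment_tokens` and `file_to_chapters` and strips the `.txt` suffix from filenames.

```python
from itertools import combinations


def containment_rows(chapter_tokens):
    """Build per-chapter, per-direction rows from {filename: [token list per chapter]}."""
    rows = []
    for file1, file2 in combinations(chapter_tokens, 2):
        chapters1, chapters2 = chapter_tokens[file1], chapter_tokens[file2]
        # zip stops at the shorter file, like min(len(chaps1), len(chaps2)) in the diff
        for chapter_num, (toks1, toks2) in enumerate(zip(chapters1, chapters2), start=1):
            vocab1, vocab2 = set(toks1), set(toks2)
            shared = len(vocab1 & vocab2)  # computed once, reused for both directions
            for source, target, source_vocab in (
                (file1, file2, vocab1),
                (file2, file1, vocab2),
            ):
                rows.append({
                    "ChapterNumber": chapter_num,
                    "SourceText": source,
                    "TargetText": target,
                    "Containment": shared / len(source_vocab) * 100 if source_vocab else 0.0,
                    "SourceVocabSize": len(source_vocab),
                    "SharedVocabSize": shared,
                })
    return rows


# Toy input: file "A" has two chapters, file "B" has one.
toy = {
    "A": [["ka", "kha", "ga"], ["nga", "ca"]],
    "B": [["ka", "kha", "ga", "nga"]],
}
rows = containment_rows(toy)
```

Because chapters are matched by index, chapter 2 of "A" has no counterpart in "B" and produces no rows. A chapter missing from one file is therefore silently absent from the containment table rather than reported as 0%.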

pipeline/visualize.py

@@ -283,3 +283,92 @@ def generate_length_ratio_chart(word_counts_df: pd.DataFrame):
     )
     return fig

     )
     return fig
+def generate_vocab_containment_chart(vocab_containment_df: pd.DataFrame):
+    """
+    Generates a bar chart showing vocabulary containment per chapter.
+    Shows what percentage of each text's unique vocabulary appears in the other text.
+    Args:
+        vocab_containment_df: DataFrame with 'ChapterNumber', 'SourceText', 'TargetText',
+                              'Containment', 'SourceVocabSize', 'SharedVocabSize'.
+    Returns:
+        plotly Figure for the vocabulary containment chart, or None if input is empty.
+    """
+    if vocab_containment_df is None or vocab_containment_df.empty:
+        return None
+    fig = go.Figure()
+    # Create a label for each direction: "TextA → TextB" means "% of TextA's vocab in TextB"
+    vocab_containment_df = vocab_containment_df.copy()
+    vocab_containment_df["Direction"] = (
+        vocab_containment_df["SourceText"] + " → " + vocab_containment_df["TargetText"]
+    )
+    # Get unique directions and assign colors
+    unique_directions = sorted(vocab_containment_df["Direction"].unique())
+    colors = px.colors.qualitative.Plotly
+    for i, direction in enumerate(unique_directions):
+        dir_df = vocab_containment_df[vocab_containment_df["Direction"] == direction].sort_values(
+            "ChapterNumber"
+        )
+        fig.add_trace(
+            go.Bar(
+                x=dir_df["ChapterNumber"],
+                y=dir_df["Containment"],
+                name=direction,
+                marker_color=colors[i % len(colors)],
+                text=[f"{v:.1f}%" for v in dir_df["Containment"]],
+                textposition="auto",
+                customdata=dir_df[["SourceVocabSize", "SharedVocabSize", "SourceText", "TargetText"]].values,
+                hovertemplate=(
+                    "<b>Chapter %{x}</b><br>"
+                    + "<b>%{customdata[2]}</b> vocabulary in <b>%{customdata[3]}</b>: %{y:.1f}%<br>"
+                    + "Unique words in source: %{customdata[0]}<br>"
+                    + "Shared words: %{customdata[1]}<extra></extra>"
+                ),
+            )
+        )
+    fig.update_layout(
+        title_text="Vocabulary Containment per Chapter",
+        xaxis_title="Chapter Number",
+        yaxis_title="Vocabulary Containment (%)",
+        barmode="group",
+        font=dict(size=14),
+        legend_title_text="Direction (Source → Target)",
+        xaxis=dict(
+            type="category",
+            automargin=True
+        ),
+        yaxis=dict(
+            rangemode='tozero',
+            automargin=True,
+            range=[0, 105],  # Slightly above 100% for visual clarity
+        ),
+        autosize=True,
+        margin=dict(l=80, r=50, b=100, t=60, pad=4),
+        height=450,
+    )
+    # Add a reference line at 100%
+    fig.add_hline(
+        y=100,
+        line_dash="dash",
+        line_color="gray",
+        annotation_text="100%",
+        annotation_position="right"
+    )
+    # Ensure x-axis ticks are shown for all chapter numbers
+    chapters = sorted(vocab_containment_df["ChapterNumber"].unique())
+    fig.update_xaxes(
+        tickmode="array",
+        tickvals=chapters,
+        ticktext=[str(ch) for ch in chapters],
+    )
+    return fig
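
The chart draws one bar trace per direction label and picks colors with `colors[i % len(colors)]`. Since every file pair contributes two directions, the number of traces grows as n × (n − 1) for n files. The sketch below illustrates that growth without Plotly. `direction_labels` and `assign_color_indices` are hypothetical helpers, and the palette size of 10 is an assumption about the length of `px.colors.qualitative.Plotly`.

```python
from itertools import permutations

PALETTE_SIZE = 10  # assumed length of px.colors.qualitative.Plotly


def direction_labels(filenames):
    """All 'Source → Target' labels, sorted as in the chart's trace loop."""
    return sorted(f"{src} → {tgt}" for src, tgt in permutations(filenames, 2))


def assign_color_indices(labels, palette_size=PALETTE_SIZE):
    """Map each label to a palette index, wrapping around like i % len(colors)."""
    return {label: i % palette_size for i, label in enumerate(labels)}


three_files = direction_labels(["A", "B", "C"])
four_files = direction_labels(["A", "B", "C", "D"])
four_file_colors = assign_color_indices(four_files)
```

Under that assumption, three files give six traces with distinct colors, while four files give twelve traces, so the last two directions reuse the colors of the first two. For larger comparisons a longer palette or per-pair faceting may be easier to read.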