
# **1. Essential Libraries**
- **`transformers`, `datasets`, `accelerate`**:  
  - Hugging Face libraries for working with pre-trained models (e.g., BERT, GPT, LLaMA), loading datasets, and accelerating training across CPUs/GPUs/TPUs.

- **`torch`, `torchvision`, `torchaudio`**:  
  - Core PyTorch libraries for building and training deep learning models involving text, images, and audio.

- **`salesforce-lavis`**:  
  - A framework for vision-language tasks like image captioning, visual question answering (VQA), and image-text retrieval using models like BLIP.

- **`sentencepiece`**:  
  - A subword tokenization library used by models such as T5, mBART, and LLaMA.

- **`pdf2image`**:  
  - Converts PDF pages into images, useful for image-based processing of PDFs.

- **`pytesseract`**:  
  - An OCR tool (based on Google Tesseract) that extracts text from images, useful for scanned PDFs or diagrams.

- **`pdfplumber`**:  
  - Extracts structured text, tables, and metadata from PDFs, ideal for document analysis and information retrieval.
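
After the install cell below has run, it can help to confirm which of these packages are actually present and at what version. The following is a minimal sketch using only the standard library's `importlib.metadata`; the helper name `installed_versions` and the `PACKAGES` list are illustrative and simply mirror the libraries listed above.

```python
from importlib import metadata

# Package names as they are passed to pip in this notebook.
PACKAGES = [
    "transformers", "datasets", "accelerate",
    "torch", "torchvision", "torchaudio",
    "salesforce-lavis", "sentencepiece",
    "pdf2image", "pytesseract", "pdfplumber",
]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

A `None` entry usually means the install step failed or ran in a different environment than the notebook kernel.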



In [1]:
!pip install transformers datasets accelerate torch torchvision torchaudio --upgrade
!pip install salesforce-lavis
!pip install sentencepiece
!pip install pdf2image pytesseract pdfplumber


Collecting transformers
  Downloading transformers-4.51.0-py3-none-any.whl.metadata (38 kB)
Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting accelerate
  Downloading accelerate-1.6.0-py3-none-any.whl.metadata (19 kB)
Collecting torch
  Downloading torch-2.6.0-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchvision
  Downloading torchvision-0.21.0-cp310-cp310-manylinux1_x86_64.whl.metadata (6.1 kB)
Collecting torchaudio
  Downloading torchaudio-2.6.0-cp310-cp310-manylinux1_x86_64.whl.metadata (6.6 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers)
  Downloading huggingface_hub-0.30.1-py3-none-any.whl.metadata (13 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.meta

# **2. Upgrading Transformers & Installing Optimization Tools**


- **`!pip install --upgrade transformers accelerate bitsandbytes sentencepiece`**  
  - **`transformers`**: Upgrades to the latest version of Hugging Face's library for state-of-the-art language models (e.g., BERT, GPT, LLaMA).  
  - **`accelerate`**: Speeds up training and inference on multi-GPU/TPU setups with minimal code changes.  
  - **`bitsandbytes`**: A lightweight CUDA library for 8-bit and 4-bit quantization, essential for running large models efficiently with less GPU memory.  
  - **`sentencepiece`**: Used for subword tokenization, especially in multilingual and encoder-decoder models like T5 or mBART.

- **`!pip install git+https://github.com/huggingface/transformers.git`**  
  - Installs the **latest development version** of the `transformers` library directly from GitHub. Useful if you need the **newest features or bug fixes** that aren't yet in the official release on PyPI.

In [2]:
!pip install --upgrade transformers accelerate bitsandbytes sentencepiece
!pip install git+https://github.com/huggingface/transformers.git


Collecting transformers
  Using cached transformers-4.51.0-py3-none-any.whl.metadata (38 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.4-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Using cached transformers-4.51.0-py3-none-any.whl (10.4 MB)
Downloading bitsandbytes-0.45.4-py3-none-manylinux_2_24_x86_64.whl (76.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.0/76.0 MB 22.4 MB/s eta 0:00:00
Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 84.6 MB/s eta 0:00:00
Installing collected packages: tokenizers, transformers, bitsandbytes
  Attempting uninstall: tokenizers
    Found existing i

# 3. Fresh Installation: Latest Transformers with Performance Optimization


- **`!pip uninstall -y transformers`**  
  - Removes any installed version of the `transformers` library (the `-y` flag skips the confirmation prompt) so that the source install below starts from a clean state.

- **`!pip install git+https://github.com/huggingface/transformers.git`**  
  - Installs the **latest bleeding-edge version** of Hugging Face’s `transformers` library directly from the GitHub repository, giving access to the newest models, features, and fixes.

- **`!pip install --upgrade accelerate bitsandbytes sentencepiece`**  
  - **`accelerate`**: Optimizes training/inference on different hardware setups (CPU, GPU, TPU).  
  - **`bitsandbytes`**: Enables 8-bit/4-bit quantization to reduce memory usage and speed up model performance.  
  - **`sentencepiece`**: Required for tokenization in several models like T5, mBART, and LLaMA.

This setup is ideal for working with cutting-edge models and maximizing performance on resource-constrained environments like GPUs with limited VRAM.
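
To make the memory argument for quantization concrete, here is a back-of-the-envelope sketch. It counts weight storage only and ignores activations, the KV cache, and the small overhead of quantization metadata. The 7-billion figure is a nominal round number assumed for illustration, not an exact parameter count, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumption: a nominal "7B" model, rounded to exactly 7 billion parameters.
NOMINAL_7B = 7_000_000_000

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NOMINAL_7B, bits):.1f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is the main reason 8-bit and 4-bit loading can fit a model onto a GPU that cannot hold it in `float16`.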

In [3]:
!pip uninstall -y transformers
!pip install git+https://github.com/huggingface/transformers.git
!pip install --upgrade accelerate bitsandbytes sentencepiece


Found existing installation: transformers 4.52.0.dev0
Uninstalling transformers-4.52.0.dev0:
  Successfully uninstalled transformers-4.52.0.dev0
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-bu2u4lac
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-bu2u4lac
  Resolved https://github.com/huggingface/transformers.git to commit d1b92369ca193da49f9f7ecd01b08ece45c2c9aa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.52.0.dev0-py3-none-any.whl size=11203014 sha256=e4147764ad5a91366aaa5e83630efe0b5a985dfce1073f2fb9b0d29ecb29

# 4. Code Explanation: Medical PDF Processing & Model Setup

- **`os`**: For handling file paths and directories.
- **`torch`**: Core PyTorch library to utilize CPU/GPU for deep learning tasks.
- **`pdfplumber`**: Extracts text and tables from PDF files (text-based PDFs).
- **`pytesseract`**: OCR engine to extract text from images (for scanned or image-based PDFs).
- **`pdf2image.convert_from_path`**: Converts PDF pages into images for OCR or visual tasks.
- **`transformers` models**:
  - **`AutoProcessor` & `BlipForConditionalGeneration`**: Used for image captioning and understanding (BLIP model).
  - **`LlamaForCausalLM` & `LlamaTokenizer`**: Used for generating or understanding text using a LLaMA language model.
- **`PIL.Image`**: Image processing utility used with OCR and visual models.

- Automatically selects **GPU (CUDA)** if available, else defaults to CPU.
- Ensures faster model execution on supported machines.



In [4]:
import os
import torch
import pdfplumber
import pytesseract
from pdf2image import convert_from_path
from transformers import AutoProcessor, BlipForConditionalGeneration, LlamaForCausalLM, LlamaTokenizer
from PIL import Image

# Ensure we use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)


Using device: cuda


# 5. Logging In to Hugging Face

In [None]:
from huggingface_hub import login

# Enter your Hugging Face token
hf_token = "HF_TOKEN"  # Replace with your actual token

# Login to Hugging Face
login(token=hf_token)
print("✅ Logged in to Hugging Face successfully!")


✅ Logged in to Hugging Face successfully!
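
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, is to read the token from an environment variable. The variable name `HF_TOKEN` and the helper `get_hf_token` are assumptions made for illustration; how the variable gets set depends on your environment.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the given environment variable, or fail loudly."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before logging in.")
    return token

# Usage with the login call shown above (requires huggingface_hub):
#   from huggingface_hub import login
#   login(token=get_hf_token())
```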


# 6. Loading BLIP & LLaMA Models for Image and Text Processing
- Detects and sets the processing device: GPU (`cuda`) if available, else CPU  
- Loads **BLIP image captioning model** from Salesforce via Hugging Face  
- Uses `AutoProcessor` to handle image inputs for BLIP  
- Loads BLIP model to selected device for generating image-based captions  
- Loads **LLaMA-2 7B HF model** for causal language modeling  
- Fetches LLaMA tokenizer to convert text to tokens and vice versa  
- Loads LLaMA model in `float16` for faster, memory-efficient performance  
- Uses `device_map="auto"` to smartly allocate model across GPU(s)/CPU  
- Requires `hf_token` for authorized access to gated Hugging Face models  
- Prints a success message after models are loaded and ready

In [6]:
import torch
from transformers import AutoProcessor, BlipForConditionalGeneration, LlamaForCausalLM, LlamaTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load BLIP model for image captioning
# (`token` replaces the deprecated `use_auth_token` argument)
blip_processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base", token=hf_token)
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", token=hf_token).to(device)

# Load LLaMA-2 7B HF for text processing (gated model: the token must have access)
llama_model_name = "meta-llama/Llama-2-7b-hf"
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_model_name, token=hf_token)
llama_model = LlamaForCausalLM.from_pretrained(llama_model_name, torch_dtype=torch.float16, device_map="auto", token=hf_token)

print("✅ Models loaded successfully!")


Using device: cuda




preprocessor_config.json:   0%|          | 0.00/287 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


tokenizer_config.json:   0%|          | 0.00/506 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/4.56k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

✅ Models loaded successfully!


# 7. Installing Poppler for PDF-to-Image Conversion

- `!apt-get update`: Updates the system package list so that the latest versions are available for installation  
- `!apt-get install -y poppler-utils`: Installs **Poppler-utils**, a collection of tools (like `pdftoppm`, `pdfinfo`, `pdfimages`) used for working with PDF files  
- Enables PDF-to-image conversion via tools like `pdftoppm`, which is used internally by libraries like `pdf2image`  
- The `-y` flag auto-confirms the installation without prompting the user
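
Before calling `pdf2image`, a quick check that the Poppler binaries are reachable can turn an obscure conversion error into a clear message. This is a small sketch using `shutil.which` from the standard library; the helper name `missing_tools` is illustrative, and the default tool list is an assumption based on the tools named above.

```python
import shutil

def missing_tools(tools=("pdftoppm", "pdfinfo")):
    """Return the subset of the given command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("Missing Poppler tools:", ", ".join(missing))
else:
    print("Poppler tools found on PATH.")
```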

In [7]:
!apt-get update
!apt-get install -y poppler-utils


Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease                                              
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]                           
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]                                
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]                             
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,317 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease                        
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease                  
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:11 http://archive.ubuntu

# 8. Extracting Text and Images from PDFs


- **Set up directories**:
  - Creates a folder to store processed pages (images) in `/kaggle/working/processed_pages`.
  - Filters and selects only 2 PDFs from the `/kaggle/input/healthcare` directory for testing.

- **Function `process_pdfs(pdf_files)`**:
  - **Text Extraction**:
    - Uses `pdfplumber` to extract text from each page of the selected PDFs. This is done by opening the PDF and iterating through its pages, extracting text when available.
    - Text from each page is added to a list and stored in a dictionary with the PDF filename as the key.
  
  - **Image Extraction**:
    - Uses `pdf2image` to convert the first few pages (up to 5 for testing) of each PDF to images.
    - Saves these images as PNG files in the output folder (`/kaggle/working/processed_pages`).
    - Stores paths of these saved images in the dictionary under the respective PDF file.

- **Processing and Output**:
  - The PDFs are processed one by one, extracting both text and images.
  - A summary of the extracted data (text length and number of images) is printed for each PDF.
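
The notebook imports `pytesseract` but the extraction step relies on `pdfplumber` alone, which returns no text for scanned pages. A possible extension, sketched here under assumptions, is to fall back to OCR when a page yields little or no text. The helper names and the `min_chars` threshold are hypothetical; `make_tesseract_ocr` imports `pdf2image` and `pytesseract` lazily and needs both Poppler and Tesseract installed.

```python
def text_with_ocr_fallback(page_texts, ocr_page, min_chars=20):
    """Keep extracted text where it looks usable; otherwise call ocr_page(index).

    page_texts: list of str or None, e.g. page.extract_text() for each page.
    ocr_page:   callable taking a zero-based page index and returning a string.
    min_chars:  assumed threshold below which a page is treated as scanned.
    """
    result = []
    for index, text in enumerate(page_texts):
        if text and len(text.strip()) >= min_chars:
            result.append(text)
        else:
            result.append(ocr_page(index))
    return result

def make_tesseract_ocr(pdf_path):
    """Build an ocr_page callable that renders one page and runs Tesseract on it."""
    from pdf2image import convert_from_path
    import pytesseract

    def ocr_page(index):
        # pdf2image page numbers are 1-based, so shift the zero-based index.
        images = convert_from_path(pdf_path, first_page=index + 1, last_page=index + 1)
        return pytesseract.image_to_string(images[0])

    return ocr_page
```

Passing the OCR step in as a callable keeps the fallback logic easy to test without any PDF or OCR engine present.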



In [10]:
import os
import pdf2image
import pdfplumber
from PIL import Image

# Path to medical books
pdf_folder = "/kaggle/input/healthcare"
output_folder = "/kaggle/working/processed_pages"
os.makedirs(output_folder, exist_ok=True)

# Select only 2 PDFs for now
pdf_files = [f for f in os.listdir(pdf_folder) if f.endswith(".pdf")][:2]

# Function to extract images and text from PDFs
def process_pdfs(pdf_files):
    extracted_text = {}

    for pdf_file in pdf_files:
        pdf_path = os.path.join(pdf_folder, pdf_file)
        text_output = []

        print(f"📖 Processing: {pdf_file}")

        # Extract text using pdfplumber
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    text_output.append(text)

        # Convert only the first 5 pages to images (limit for testing),
        # instead of rendering the whole PDF and discarding the rest
        images = pdf2image.convert_from_path(pdf_path, first_page=1, last_page=5)
        image_paths = []
        for idx, img in enumerate(images):
            image_path = os.path.join(output_folder, f"{pdf_file}_page{idx+1}.png")
            img.save(image_path, "PNG")
            image_paths.append(image_path)

        extracted_text[pdf_file] = {
            "text": "\n".join(text_output),
            "images": image_paths
        }

    return extracted_text

# Process PDFs
pdf_data = process_pdfs(pdf_files)

# Print extracted data summary
for pdf, data in pdf_data.items():
    print(f"\n✅ Extracted from {pdf}:")
    print(f" - 📝 Text Length: {len(data['text'])} characters")
    print(f" - 🖼️ Images Extracted: {len(data['images'])}")



📖 Processing: Book18.pdf

✅ Extracted from Book18.pdf:
 - 📝 Text Length: 975877 characters
 - 🖼️ Images Extracted: 5


# 9. Generating Captions for Extracted Images


- **BLIP Model Setup**:
  - The **BLIP (Bootstrapping Language-Image Pre-training)** model and processor are loaded from Salesforce's pre-trained image captioning checkpoint.
  - It uses **GPU (CUDA)** if available, otherwise defaults to **CPU** for inference.

- **`generate_image_captions(image_paths)` Function**:
  - **Input**: A list of image paths extracted from the PDFs.
  - **Processing**: 
    - Each image is opened, converted to RGB, and processed using the BLIP processor.
    - The model generates a caption for each image using the **`generate()`** method from the BLIP model.
  - **Output**: 
    - The generated captions are decoded from tokens and stored in a dictionary with image paths as keys.
    - Captions are printed in the format: `🖼️ <image_path>: <caption>`.
  
- **Processing**:
  - The function is called with a list of all image paths extracted from PDFs (`all_image_paths`).
  - The resulting captions are stored in the `image_captions` dictionary.



In [11]:
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load BLIP model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

# Function to generate captions for extracted images
def generate_image_captions(image_paths):
    captions = {}
    for img_path in image_paths:
        image = Image.open(img_path).convert("RGB")

        # Process image and generate caption
        inputs = blip_processor(images=image, return_tensors="pt").to(device)
        with torch.no_grad():
            output = blip_model.generate(**inputs)

        caption = blip_processor.batch_decode(output, skip_special_tokens=True)[0]
        captions[img_path] = caption
        print(f"🖼️ {img_path}: {caption}")

    return captions

# Process all images from PDFs
all_image_paths = [img for pdf in pdf_data.values() for img in pdf["images"]]
image_captions = generate_image_captions(all_image_paths)


🖼️ /kaggle/working/processed_pages/Book18.pdf_page1.png: basic biology, third edition
🖼️ /kaggle/working/processed_pages/Book18.pdf_page2.png: the cover of the book basic and functional systems for the basic systems
🖼️ /kaggle/working/processed_pages/Book18.pdf_page3.png: a sample of a resume for a job
🖼️ /kaggle/working/processed_pages/Book18.pdf_page4.png: the cover of the book, the new yorks
🖼️ /kaggle/working/processed_pages/Book18.pdf_page5.png: a letterhead with the words ' the letterhead '


# 10. Processing Text with LLaMA Model


- **LLaMA Text Processing**:
  - This function uses the **LLaMA** language model to process the text data extracted from PDFs and generate responses based on the input text.
  - The **tokenizer** converts the text into tokens suitable for model input, and **LLaMA** generates a response by extending the input text.

- **`process_text_with_llama(text_data)` Function**:
  - **Input**: 
    - `text_data`: A dictionary where each key is the name of a PDF and the corresponding value is the extracted text.
  - **Processing**:
    - The **tokenizer** is used to tokenize the text, truncating the input to a maximum of 2048 tokens (to fit within model limits).
    - The input is passed to the LLaMA model for text generation, and the model generates up to 512 new tokens (`max_new_tokens=512`).
  - **Output**:
    - The generated text response is decoded from the model's token output and stored in a dictionary with PDF names as keys.
    - The function prints a short preview of the generated response (first 500 characters).
  
- **Processing**:
  - The function is invoked with text data from the PDFs, which was previously extracted using `pdfplumber`.
  - The `pdf_text_responses` dictionary contains the generated responses for each PDF.
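
Because the extracted text of a whole book is far longer than the truncation limit, only the opening pages ever reach the model. One common workaround, shown here as a sketch rather than as part of the notebook's pipeline, is to split the text into overlapping chunks and process each chunk separately. The sizes are measured in characters as a rough stand-in for tokens, and both default values are assumptions.

```python
def chunk_text(text, chunk_size=6000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that sentences cut at
    a boundary still appear whole in one of the two neighbouring chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```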



In [14]:
def process_text_with_llama(text_data):
    responses = {}
    for pdf_name, text in text_data.items():
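        # Tokenize with the LLaMA tokenizer loaded earlier and move tensors to the model's device
        inputs = llama_tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(llama_model.device)

        with torch.no_grad():
            outputs = llama_model.generate(**inputs, max_new_tokens=512)  # Use max_new_tokens

        # The decoded string contains the prompt followed by the generated continuation
        response_text = llama_tokenizer.decode(outputs[0], skip_special_tokens=True)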
        responses[pdf_name] = response_text
        print(f"📄 {pdf_name} Processed: {response_text[:500]}...\n")

    return responses

# Process text extracted from PDFs
pdf_text_responses = process_text_with_llama({pdf: pdf_data[pdf]["text"] for pdf in pdf_data})


📄 Book18.pdf Processed: Basic
Updated
Immunology
Functions and Disorders
of the Immune System
Abul K. Abbas, MBBS
Professor and Chair
Department of Pathology
University of California San Francisco, School of Medicine
San Francisco, California
Andrew H. Lichtman, MD, PhD
Professor of Pathology
Harvard Medical School
Brigham and Women’s Hospital
Boston, Massachusetts
Illustrated by David L. Baker, MA, and Alexandra Baker, MS, CMI
1600 John F. Kennedy Blvd. Ste 1800
Philadelphia, PA 19103-2899
BASIC IMMUNOLOGY: FUNCTIONS ...



-----------------------------------------------------------------------------------------

# 11. Processing Images with BLIP for Captions


- **BLIP Image Captioning**:
  - The function uses **BLIP** (Bootstrapping Language-Image Pre-training) to generate captions for images extracted from PDFs, producing a short text description of each page image.

- **`process_images_with_blip(image_data)` Function**:
  - **Input**: 
    - `image_data`: A dictionary where each key is the name of a PDF and the corresponding value is a list of image paths extracted from the PDF.
  - **Processing**:
    - For each image in the list, the image is opened and processed with the **BLIP processor**, which prepares the image for model input.
    - The **BLIP model** generates a caption for each image using the `generate()` function, limited to 50 new tokens.
    - The captions are decoded from token IDs using the `batch_decode()` method and stored in a list for each PDF.
  - **Output**:
    - The generated captions are stored in a dictionary (`image_captions`), with each PDF name as the key and a list of captions as the value.
    - Each caption is printed with the format: `🖼️ <pdf_name> Image Caption: <caption>`.

- **Processing**:
  - The function is called with image data from the PDFs, which was previously extracted using `pdf2image`.
  - The resulting dictionary `pdf_image_captions` contains the captions for each image.



In [15]:
def process_images_with_blip(image_data):
    image_captions = {}
    
    for pdf_name, images in image_data.items():
        captions = []
        for img_path in images:
            image = Image.open(img_path).convert("RGB")
            inputs = blip_processor(image, return_tensors="pt").to(device)

            with torch.no_grad():
                generated_ids = blip_model.generate(**inputs, max_new_tokens=50)
                caption = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

            captions.append(caption)
            print(f"🖼️ {pdf_name} Image Caption: {caption}")

        image_captions[pdf_name] = captions

    return image_captions

# Process extracted images
pdf_image_captions = process_images_with_blip({pdf: pdf_data[pdf]["images"] for pdf in pdf_data})


🖼️ Book18.pdf Image Caption: basic biology, third edition
🖼️ Book18.pdf Image Caption: the cover of the book basic and functional systems for the basic systems
🖼️ Book18.pdf Image Caption: a sample of a resume for a job
🖼️ Book18.pdf Image Caption: the cover of the book, the new yorks
🖼️ Book18.pdf Image Caption: a letterhead with the words ' the letterhead '


------------------------------------------------------------------------------------------

# 12. Merging Text and Image Data

- **Combining Text and Image Insights**:
  - This function merges the **textual responses** (generated by LLaMA) and **image captions** (generated by BLIP) into a single, unified output for each PDF. The goal is to provide a cohesive summary of both the extracted text and the image insights.

- **`merge_text_and_image_data(text_responses, image_captions)` Function**:
  - **Input**: 
    - `text_responses`: A dictionary containing the generated text for each PDF.
    - `image_captions`: A dictionary containing the generated captions for images in each PDF.
  - **Processing**:
    - For each PDF, the function retrieves its corresponding **text** and **images** (captions).
    - The data is combined into a single formatted string:
      - **Text** is prefixed with "📄 **Extracted Text:**"
      - **Images** (captions) are prefixed with "🖼️ **Image Insights:**"
    - The combined data is stored in the `combined_data` dictionary, where each key is a PDF name, and the value is the combined response.
  - **Output**:
    - Prints the **first 500 characters** of the merged data for preview.
    - Returns the `combined_data` dictionary containing the merged output.

- **Processing**:
  - The function is invoked with the **text responses** from LLaMA and **image captions** from BLIP.
  - The resulting dictionary, `pdf_combined_responses`, contains the merged text and image insights for each PDF.


For each PDF:
- Displays a preview of the **merged text** and **image insights**, showing both the extracted text and the captions for images.

This step helps in creating a comprehensive view of both the textual content and visual insights from the PDFs, making it easier to understand and utilize the extracted data. 

In [16]:
def merge_text_and_image_data(text_responses, image_captions):
    combined_data = {}

    for pdf in text_responses.keys():
        text = text_responses[pdf]
        images = image_captions.get(pdf, [])
        
        combined_response = f"📄 **Extracted Text:**\n{text}\n\n🖼️ **Image Insights:**\n" + "\n".join(images)
        combined_data[pdf] = combined_response
        print(f"\n✅ Merged Data for {pdf}:\n", combined_response[:500])  # Print first 500 chars for preview

    return combined_data

# Merge extracted text and images
pdf_combined_responses = merge_text_and_image_data(pdf_text_responses, pdf_image_captions)



✅ Merged Data for Book18.pdf:
 📄 **Extracted Text:**
Basic
Updated
Immunology
Functions and Disorders
of the Immune System
Abul K. Abbas, MBBS
Professor and Chair
Department of Pathology
University of California San Francisco, School of Medicine
San Francisco, California
Andrew H. Lichtman, MD, PhD
Professor of Pathology
Harvard Medical School
Brigham and Women’s Hospital
Boston, Massachusetts
Illustrated by David L. Baker, MA, and Alexandra Baker, MS, CMI
1600 John F. Kennedy Blvd. Ste 1800
Philadelphia, PA 19103-2899
BASIC 


------------------------------------------------------------------------------------------

# 13. Answering User Queries Using LLaMA-2


- **Generating Responses Based on User Queries**:
  - This function allows the AI to **answer user queries** using the **merged text and image captions** extracted from PDFs. The model utilizes **LLaMA-2** for generating responses based on the provided context.

- **`answer_user_query(query, pdf_combined_responses)` Function**:
  - **Input**:
    - `query`: A string containing the user’s question.
    - `pdf_combined_responses`: A dictionary containing the combined text and image insights for each PDF.
  - **Processing**:
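    - The function builds the context by joining all merged text and image captions and keeping only the first 2,048 characters (a character slice, not a token count). The tokenizer then truncates the full prompt to at most 2,048 tokens; LLaMA-2 itself supports a 4,096-token context window.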
    - The query is added to the context to form a complete prompt: "Context: ... User Query: ..."
    - The prompt is tokenized and fed into the **LLaMA-2 model**, which generates a response.
  - **Output**:
    - The response is decoded from token IDs back into readable text, providing an answer to the user’s query based on the extracted data.

- **Example**:
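  - In the provided example, the query asks about the **process of blood circulation**. The model is prompted with the truncated context from the **merged PDFs** followed by the question, so the quality of the answer depends on whether that short context contains relevant material.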
  - The final answer is printed to the console as: `💬 AI Response: <response_text>`.

This step enables the AI to answer detailed questions by analyzing the combined knowledge from both the extracted text and image captions, simulating an interactive learning experience.
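
With decoder-only models, `generate()` returns the prompt tokens followed by the newly generated tokens, which is why the printed response in the run below begins with the `Context:` block. A minimal sketch of keeping only the continuation is shown here; `continuation_only` is a hypothetical helper, and plain lists stand in for the tensors used in the notebook.

```python
def continuation_only(output_ids, prompt_length):
    """Drop the echoed prompt tokens and keep only the newly generated ones."""
    return output_ids[prompt_length:]

# With the objects from the cell below, this would look roughly like:
#   prompt_length = inputs["input_ids"].shape[1]
#   answer_ids = continuation_only(outputs[0], prompt_length)
#   ...then decode answer_ids with the same tokenizer used for the prompt.

# Stand-in demonstration with plain lists instead of tensors:
prompt_ids = [101, 7, 8, 9]
generated = prompt_ids + [42, 43]
print(continuation_only(generated, len(prompt_ids)))  # [42, 43]
```

Slicing by token count is more reliable than stripping the prompt string from the decoded text, since decoding does not always reproduce the prompt character for character.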

In [17]:
def answer_user_query(query, pdf_combined_responses):
    """
    Generates an answer based on user query using LLaMA-2.
    """
    # Keep the first 2048 characters of the merged data (a character slice, not a token count)
    context = "\n\n".join(pdf_combined_responses.values())[:2048]
    input_text = f"Context:\n{context}\n\nUser Query: {query}\n\nAnswer:"

    # Tokenize input; truncation caps the prompt at 2048 tokens
    inputs = llama_tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048).to(llama_model.device)

    # Generate response
    with torch.no_grad():
        outputs = llama_model.generate(**inputs, max_new_tokens=500)  # Limit response length

    # The decoded string includes the prompt followed by the generated answer
    response_text = llama_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response_text

# Example query
user_query = "Explain the process of blood circulation based on the book."
response = answer_user_query(user_query, pdf_combined_responses)

# Print the response
print("\n💬 AI Response:\n", response)



💬 AI Response:
 Context:
📄 **Extracted Text:**
Basic
Updated
Immunology
Functions and Disorders
of the Immune System
Abul K. Abbas, MBBS
Professor and Chair
Department of Pathology
University of California San Francisco, School of Medicine
San Francisco, California
Andrew H. Lichtman, MD, PhD
Professor of Pathology
Harvard Medical School
Brigham and Women’s Hospital
Boston, Massachusetts
Illustrated by David L. Baker, MA, and Alexandra Baker, MS, CMI
1600 John F. Kennedy Blvd. Ste 1800
Philadelphia, PA 19103-2899
BASIC IMMUNOLOGY: FUNCTIONS AND DISORDERS ISBN: 978-1-4160-5569-3
OF THE IMMUNE SYSTEM
Copyright © 2011 by Saunders, an imprint of Elsevier Inc.
All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any
means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval
system, without permission in writing from the publisher. Permissions may be sought directly from Elsevier’s
Rights Depart

------------------------------------------------------------------------------------------

# 14. Saving and Zipping the Trained Models


- **Saving and Archiving the Trained Models**:
  - This cell saves the loaded LLaMA and BLIP models (pretrained checkpoints; the notebook does not fine-tune them) together with their tokenizer and processor, then compresses the directory into a zip file for download or storage.

- **Steps Involved**:
  1. **Define the Save Path**: The variable `model_save_path` specifies the directory where the models will be saved.
  2. **Saving the Models**:
     - **LLaMA Model**: `llama_model.save_pretrained(model_save_path)` saves the LLaMA model.
     - **Tokenizer**: `llama_tokenizer.save_pretrained(model_save_path)` saves the tokenizer associated with LLaMA.
     - **BLIP Model**: `blip_model.save_pretrained(model_save_path)` saves the BLIP model.
     - **BLIP Processor**: `blip_processor.save_pretrained(model_save_path)` saves the processor used with the BLIP model.
  3. **Zipping the Model Directory**:
     - `shutil.make_archive(model_save_path, 'zip', model_save_path)` compresses the saved model directory into a zip file for easier management and download.
  4. **Print Confirmation**: After successful completion, the code prints a confirmation message with the path to the zip file: `✅ Model saved successfully! Download from /kaggle/working/llama_blip_trained.zip`.

- **Outcome**:
  - The models and their components are saved and compressed into a single zip file that can be downloaded from the specified path.

Note that all four `save_pretrained` calls write to the same directory, so files with shared names, such as `config.json`, are overwritten by the later BLIP save. This is why the directory loads as a BLIP model in the next section.

In [19]:
import shutil

# Define model save path
model_save_path = "/kaggle/working/llama_blip_trained"

# Save the models and tokenizer (pretrained weights; no fine-tuning was done)
llama_model.save_pretrained(model_save_path)
llama_tokenizer.save_pretrained(model_save_path)
blip_model.save_pretrained(model_save_path)
blip_processor.save_pretrained(model_save_path)

# Zip the model for easy download
shutil.make_archive(model_save_path, 'zip', model_save_path)

print("✅ Model saved successfully! Download from /kaggle/working/llama_blip_trained.zip")


✅ Model saved successfully! Download from /kaggle/working/llama_blip_trained.zip


------------------------------------------------------------------------------------------

# 15. Loading the Model and Counting Parameters



In [23]:
from transformers import BlipForConditionalGeneration

# Path to your trained model
model_path = "/kaggle/working/llama_blip_trained"

# Load BLIP model explicitly
model = BlipForConditionalGeneration.from_pretrained(model_path)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())

print(f"Total Parameters: {total_params:,} ({total_params / 1e9:.2f} billion)")


Total Parameters: 247,414,076 (0.25 billion)


In [30]:
import os

model_path = "/kaggle/working/llama_blip_trained"
print(os.listdir(model_path))


['tokenizer_config.json', 'tokenizer.model', 'config.json', 'vocab.txt', 'preprocessor_config.json', 'generation_config.json', 'model.safetensors', 'special_tokens_map.json', 'tokenizer.json', 'model.safetensors.index.json']
