---
license: apache-2.0
language: en
pipeline_tag: image-to-text
---

# TotalText-STDR: End-to-End Scene Text Detection and Recognition

This repository contains the official models and inference pipeline for the TotalText Scene Text Detection and Recognition (STDR) project. It provides a complete solution for detecting and transcribing text, including curved text, in images. The pipeline combines a fine-tuned Differentiable Binarization (DBNet) model for text detection with a pre-trained attention-based model (TPS-ResNet-BiLSTM-Attn) for text recognition.

## Models

### Text Detection

- **Architecture**: Differentiable Binarization (DBNet) with a ResNet-50 backbone.
- **Pretraining**: Pre-trained on the SynthText dataset.
- **Fine-tuning**: Fine-tuned on the Total-Text dataset for high precision on curved and oriented text.
- **Framework**: PyTorch

### Text Recognition

- **Architecture**: TPS-ResNet-BiLSTM-Attention.
- **Training**: Pre-trained on a large-scale dataset of real and synthetic word images.
- **Framework**: PyTorch

## How to Use

The end-to-end inference logic is encapsulated in the `OCR_Pipeline` class in `pipeline.py`.

### 1. Installation

First, clone the repository and install the required dependencies:

```bash
git clone https://huggingface.co/sakshamhooda/TotalText-STDR
cd TotalText-STDR

# Install dependencies (use of a virtual environment is recommended)
# Note: ensure you have the correct PyTorch version for your CUDA setup.
pip install -r requirements.txt
```

### 2. Inference

You can run the pipeline on an image with the following Python script. Make sure the model weights are present in the repository.
```python
import cv2
import numpy as np  # needed below to convert polygons to integer arrays
from pathlib import Path

from pipeline import OCR_Pipeline

# --- Configuration ---
DETECTOR_CKPT = "runs/dbnet_detector/dbnet_best_tt_1.pth"
RECOGNIZER_CKPT = "recognition-ptr-weights/TPS-ResNet-BiLSTM-Attn-case-sensitive.pth"
CHARSET_PATH = "config/charset_totaltext.txt"
IMAGE_PATH = "Total-Text-Dataset/test/img/img4.jpg"  # Example image

# --- Initialization ---
pipeline = OCR_Pipeline(
    det_model_path=DETECTOR_CKPT,
    rec_model_path=RECOGNIZER_CKPT,
    charset_path=CHARSET_PATH,
)

# --- Run Inference ---
print(f"Running inference on: {IMAGE_PATH}")
input_image = cv2.imread(IMAGE_PATH)
results, heatmap = pipeline.run(input_image)

# --- Visualize and Print Results ---
print(f"Found {len(results)} text instances.")
output_image = input_image.copy()
for res in results:
    poly = np.array(res["polygon"]).astype(np.int32)
    text = res["text"]
    cv2.polylines(output_image, [poly], isClosed=True, color=(0, 255, 0), thickness=2)
    cv2.putText(output_image, text, tuple(poly[0]), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 0, 0), 2)

# Save the output
output_path = Path("./pipeline_output.jpg")
cv2.imwrite(str(output_path), output_image)
print(f"Output image with results saved to: {output_path}")
```

## Project Information

This project was developed to provide a high-precision OCR solution for the Total-Text dataset. Experiment tracking was managed with W&B, and model versioning with MLflow. For more details on the training process, see the original project source.
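As a complement to the inference script above, here is a minimal sketch of serializing the pipeline's results to JSON instead of drawing them on the image. It assumes only the result structure shown in that script: a list of dicts with `polygon` (a list of `[x, y]` points) and `text` keys. The `results` values below are fabricated for illustration; in practice they come from `pipeline.run(image)`.

```python
import json

# Fabricated sample mirroring the structure returned by OCR_Pipeline.run():
# each entry has a 'polygon' (list of [x, y] points) and a 'text' string.
results = [
    {"polygon": [[10, 20], [110, 20], [110, 60], [10, 60]], "text": "HELLO"},
    {"polygon": [[15, 80], [95, 85], [90, 120], [12, 115]], "text": "world"},
]

def results_to_json(results):
    """Convert pipeline results into a JSON string with plain-int coordinates."""
    payload = [
        {
            "text": r["text"],
            # Cast coordinates to int so numpy scalars serialize cleanly.
            "polygon": [[int(x), int(y)] for x, y in r["polygon"]],
        }
        for r in results
    ]
    return json.dumps(payload, indent=2)

print(results_to_json(results))
```

This keeps the raw detections and transcriptions machine-readable, which is handy for evaluation or downstream processing alongside the visualized `pipeline_output.jpg`.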