BART-base Receipt Parser

Model Description

This model is a fine-tuned version of facebook/bart-base for receipt parsing. Given the OCR text of a receipt, it extracts three fields:

  • Date: Transaction date from the receipt
  • Company Name: Name of the merchant/store
  • Total Amount: Final amount paid

Dataset

The model was trained using the Receipt Dataset SSD300 V2 from Kaggle, which contains receipt images with corresponding labels.

Data Processing Pipeline

  1. OCR Processing: All receipt images from the dataset were processed using EasyOCR to extract raw text
  2. Input-Output Mapping: The extracted OCR text serves as input, while the labeled data from the Kaggle dataset serves as the target output
  3. Fine-tuning: Supervised fine-tuning was performed on the facebook/bart-base model
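
The first two pipeline steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's training script: the helper names `ocr_image` and `build_example`, the label keys (`date`, `company`, `total`), the English-only OCR reader, and joining OCR lines and target fields with newlines are all hypothetical choices made for the example.

```python
from typing import Dict, List


def ocr_image(image_path: str) -> List[str]:
    """Run EasyOCR on one receipt image and return the recognized text lines."""
    import easyocr  # imported lazily; requires `pip install easyocr`

    # Assumption: English receipts. In practice, create the Reader once and reuse it.
    reader = easyocr.Reader(["en"])
    # detail=0 returns only the recognized strings, without boxes or confidences
    return reader.readtext(image_path, detail=0)


def build_example(ocr_lines: List[str], label: Dict[str, str]) -> Dict[str, str]:
    """Pair OCR text (model input) with a formatted label string (target output)."""
    # Drop empty lines and join the rest into one input string
    source = "\n".join(line.strip() for line in ocr_lines if line.strip())
    # Assumption: the target mirrors the three-field format described in this card
    target = (
        f"Date: {label['date']}\n"
        f"Company Name: {label['company']}\n"
        f"Total Amount: {label['total']}"
    )
    return {"input_text": source, "target_text": target}


# Hand-written OCR lines stand in for ocr_image("receipt.jpg") here
example = build_example(
    ["SUPERMARKET ABC", "Date: 2024-01-15", "Total: $10.25"],
    {"date": "2024-01-15", "company": "SUPERMARKET ABC", "total": "10.25"},
)
```

Each resulting `input_text`/`target_text` pair would then be tokenized and passed to a standard sequence-to-sequence fine-tuning loop.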

Usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("farhamu/bart-base-receipt-parser-v1")
tokenizer = AutoTokenizer.from_pretrained("farhamu/bart-base-receipt-parser-v1")

# Example usage
receipt_text = """
SUPERMARKET ABC
123 Main Street
City, State 12345
Date: 2024-01-15
Item 1: $5.99
Item 2: $3.50
Tax: $0.76
Total: $10.25
Thank you for shopping!
"""

# Tokenize input
inputs = tokenizer(receipt_text, return_tensors="pt", max_length=512, truncation=True, padding=True)

# Generate output
outputs = model.generate(
    **inputs,
    max_length=150,
    num_beams=4,
    early_stopping=True
)

# Decode result
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
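
To process several receipts at once, the same calls can be wrapped in a batching helper. The sketch below is an assumption-laden convenience, not part of the released model: `parse_receipts` and its `batch_size` parameter are hypothetical names, and the generation settings are copied from the single-receipt example above.

```python
from typing import List


def parse_receipts(texts: List[str], model, tokenizer, batch_size: int = 8) -> List[str]:
    """Generate one output string per receipt text, processing in batches."""
    results: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # padding=True pads every sequence in the batch to the longest one
        inputs = tokenizer(
            batch, return_tensors="pt", max_length=512, truncation=True, padding=True
        )
        outputs = model.generate(
            **inputs, max_length=150, num_beams=4, early_stopping=True
        )
        # batch_decode turns each generated sequence back into a string
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```

Called as `parse_receipts([receipt_text], model, tokenizer)`, it returns a list with one decoded string per input, in input order.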

Expected Output Format

The model outputs structured information in the following format:

Date: [extracted_date]
Company Name: [extracted_company_name]
Total Amount: [extracted_total_amount]
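
For downstream use it is often convenient to turn this string into a dictionary. The helper below is a minimal sketch, assuming the output contains the three labels exactly as written above; `parse_output` is a hypothetical name, and the code makes no assumption about whether fields are separated by newlines or spaces. Fields the model did not produce come back as `None`.

```python
import re
from typing import Dict, Optional

FIELDS = ("Date", "Company Name", "Total Amount")


def parse_output(text: str) -> Dict[str, Optional[str]]:
    """Split the model's output string into a dict keyed by field label."""
    # Match any of the three field labels followed by a colon
    label_pattern = re.compile(r"(Date|Company Name|Total Amount)\s*:\s*")
    parsed: Dict[str, Optional[str]] = {name: None for name in FIELDS}
    matches = list(label_pattern.finditer(text))
    for i, match in enumerate(matches):
        # A field's value runs until the next label, or to the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        value = text[match.end():end].strip()
        parsed[match.group(1)] = value or None
    return parsed


fields = parse_output(
    "Date: 2024-01-15\nCompany Name: SUPERMARKET ABC\nTotal Amount: 10.25"
)
```

A value that itself contains one of the labels followed by a colon would be split incorrectly, so treat the result as a convenience rather than a guarantee.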

Training Details

  • Base Model: facebook/bart-base
  • Task: Text-to-Text Generation (Receipt Information Extraction)
  • Training Data: OCR-processed receipt text with labeled ground truth
  • Data Source: Receipt Dataset SSD300 V2
  • OCR Tool: EasyOCR

Limitations

  • Output quality depends on OCR quality: characters misread by the OCR step can carry over into the extracted fields
  • The model was trained only on receipts in the format and style of the training dataset
  • Receipts with significantly different formats or languages may require additional fine-tuning

Use Cases

  • Automated receipt processing for expense management
  • Financial document digitization
  • Retail analytics and data extraction
  • Accounting automation

Citation

If you use this model, please cite the original dataset:

@dataset{dhiaznaidi2024receipt,
  title={Receipt Dataset SSD300 V2},
  author={Dhiaz Naidi},
  year={2024},
  url={https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2}
}