BART-base Receipt Parser

Model Description

This model is a fine-tuned version of facebook/bart-base for receipt parsing. It takes OCR-extracted receipt text as input and generates three fields:

Date: Transaction date from the receipt
Company Name: Name of the merchant/store
Total Amount: Final amount paid

Dataset

The model was trained using the Receipt Dataset SSD300 V2 from Kaggle, which contains receipt images with corresponding labels.

Data Processing Pipeline

OCR Processing: All receipt images from the dataset were processed using EasyOCR to extract raw text
Input-Output Mapping: The extracted OCR text serves as input, while the labeled data from the Kaggle dataset serves as the target output
Fine-tuning: Supervised fine-tuning was performed on the facebook/bart-base model
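
The first two steps can be sketched roughly as follows. This is a minimal illustration, not the original preprocessing script: the helper names (ocr_receipt, build_source, build_target) and the label keys (date, company, total) are assumptions made for the example, and the actual column names in the Kaggle labels may differ. The target string follows the three-field output format this model card describes.

```python
from typing import Dict, List


def ocr_receipt(image_path: str, languages=("en",)) -> List[str]:
    """Run EasyOCR on one receipt image and return the recognized strings."""
    import easyocr  # imported lazily so the helpers below work without it

    # Creating a Reader loads the OCR models; reuse one Reader for many images.
    reader = easyocr.Reader(list(languages))
    # detail=0 returns only the recognized text, without boxes or scores.
    return reader.readtext(image_path, detail=0)


def build_source(ocr_lines: List[str]) -> str:
    """Join OCR fragments into one newline-separated model input."""
    return "\n".join(line.strip() for line in ocr_lines if line.strip())


def build_target(label: Dict[str, str]) -> str:
    """Serialize one label record into the three-field target string.

    The keys 'date', 'company' and 'total' are hypothetical placeholders.
    """
    return (
        f"Date: {label['date']}\n"
        f"Company Name: {label['company']}\n"
        f"Total Amount: {label['total']}"
    )
```

Each (build_source(...), build_target(...)) pair would then form one supervised training example for the sequence-to-sequence model.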

Usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("farhamu/bart-base-receipt-parser-v1")
tokenizer = AutoTokenizer.from_pretrained("farhamu/bart-base-receipt-parser-v1")

# Example usage
receipt_text = """
SUPERMARKET ABC
123 Main Street
City, State 12345
Date: 2024-01-15
Item 1: $5.99
Item 2: $3.50
Tax: $0.76
Total: $10.25
Thank you for shopping!
"""

# Tokenize input
inputs = tokenizer(receipt_text, return_tensors="pt", max_length=512, truncation=True, padding=True)

# Generate output
outputs = model.generate(
    **inputs,
    max_length=150,
    num_beams=4,
    early_stopping=True
)

# Decode result
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Expected Output Format

The model outputs structured information in the following format:

Date: [extracted_date]
Company Name: [extracted_company_name]
Total Amount: [extracted_total_amount]
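
To use the result downstream, the generated string can be turned into a dictionary. The sketch below is one possible way to do that and is not part of the model itself: parse_receipt_output is a hypothetical helper, and it assumes the three labels appear literally as shown above, separated either by newlines or by spaces. A field the model did not produce is returned as None.

```python
import re
from typing import Dict, Optional

FIELDS = ("Date", "Company Name", "Total Amount")
_LABELS = "|".join(re.escape(field) for field in FIELDS)

# A label, a colon, then the shortest value that runs up to the next label
# or to the end of the string.
_FIELD_RE = re.compile(
    rf"({_LABELS})\s*:\s*(.*?)(?=\s*(?:{_LABELS})\s*:|\s*$)",
    re.DOTALL,
)


def parse_receipt_output(text: str) -> Dict[str, Optional[str]]:
    """Split the generated string into its three fields (None if missing)."""
    parsed: Dict[str, Optional[str]] = {field: None for field in FIELDS}
    for label, value in _FIELD_RE.findall(text):
        value = value.strip()
        # Keep the first non-empty value seen for each label.
        if value and parsed[label] is None:
            parsed[label] = value
    return parsed
```

For example, parse_receipt_output("Date: 2024-01-15\nCompany Name: SUPERMARKET ABC\nTotal Amount: $10.25") returns {"Date": "2024-01-15", "Company Name": "SUPERMARKET ABC", "Total Amount": "$10.25"}. A company name that itself contains one of the labels followed by a colon would confuse this simple pattern.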

Training Details

Base Model: facebook/bart-base
Task: Text-to-Text Generation (Receipt Information Extraction)
Training Data: OCR-processed receipt text with labeled ground truth
Data Source: Receipt Dataset SSD300 V2
OCR Tool: EasyOCR

Limitations

Performance depends on OCR quality, since the model sees only the extracted text, not the receipt image
Trained specifically on the format and style of receipts in the training dataset
May require additional fine-tuning for receipts with significantly different formats or languages
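
One way to reduce the impact of poor OCR is to drop low-confidence fragments before passing text to the model. The sketch below assumes the OCR results are in EasyOCR's default readtext format, a list of (bounding box, text, confidence) tuples. The helper names and the 0.4 threshold are illustrative assumptions, not values used in training.

```python
from typing import List, Sequence, Tuple

# One OCR result: (bounding box, recognized text, confidence in [0, 1])
OcrResult = Tuple[Sequence, str, float]


def keep_confident(results: List[OcrResult], min_confidence: float = 0.4) -> List[str]:
    """Return the text of results at or above the confidence threshold."""
    return [text for _box, text, conf in results if conf >= min_confidence]


def mean_confidence(results: List[OcrResult]) -> float:
    """Average confidence over all results; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(conf for _box, _text, conf in results) / len(results)
```

A low mean_confidence for a receipt can be used to flag it for manual review instead of trusting the extracted fields.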

Use Cases

Automated receipt processing for expense management
Financial document digitization
Retail analytics and data extraction
Accounting automation

Citation

If you use this model, please cite the original dataset:

@dataset{dhiaznaidi2024receipt,
  title={Receipt Dataset SSD300 V2},
  author={Dhiaz Naidi},
  year={2024},
  url={https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2}
}