BART-base Receipt Parser
Model Description
This model is a fine-tuned version of facebook/bart-base for receipt parsing tasks. The model is trained to extract key information from receipt text, specifically:
- Date: Transaction date from the receipt
- Company Name: Name of the merchant/store
- Total Amount: Final amount paid
Dataset
The model was trained using the Receipt Dataset SSD300 V2 from Kaggle, which contains receipt images with corresponding labels.
Data Processing Pipeline
- OCR Processing: All receipt images from the dataset were processed using EasyOCR to extract raw text
- Input-Output Mapping: The extracted OCR text serves as input, while the labeled data from the Kaggle dataset serves as the target output
- Fine-tuning: Supervised fine-tuning was performed on the facebook/bart-base model
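The steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the author's actual preprocessing code: the helper names `ocr_results_to_text`, `build_target`, and `run_easyocr`, the confidence threshold, the top-to-bottom ordering, the newline joining, and the English language setting are all hypothetical. It relies only on the `(bbox, text, confidence)` tuples that EasyOCR's `Reader.readtext` returns, and the target string mirrors the output format documented in this card.

```python
from typing import List, Sequence, Tuple

# One EasyOCR detection: (four [x, y] corner points, recognized text, confidence).
Detection = Tuple[Sequence[Sequence[float]], str, float]


def ocr_results_to_text(results: List[Detection], min_conf: float = 0.0) -> str:
    """Join OCR detections into one input string.

    Detections are ordered by their top-left corner (y first, then x).
    The ordering and newline joining are assumptions, not a confirmed method.
    """
    kept = [r for r in results if r[2] >= min_conf]
    kept.sort(key=lambda r: (r[0][0][1], r[0][0][0]))
    return "\n".join(text.strip() for _, text, _ in kept if text.strip())


def build_target(date: str, company: str, total: str) -> str:
    """Format the labels in the output layout documented in this card."""
    return f"Date: {date}\nCompany Name: {company}\nTotal Amount: {total}"


def run_easyocr(image_path: str) -> List[Detection]:
    """Run EasyOCR on one image (requires `pip install easyocr`)."""
    import easyocr  # imported lazily so the helpers above work without it

    reader = easyocr.Reader(["en"])  # language list is an assumption
    return reader.readtext(image_path)
```

Sorting on exact coordinates is a simplification: fragments on the same printed line with slightly different `y` values may come out in the wrong order, so a real pipeline would likely group detections into lines first.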
Usage
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("farhamu/bart-base-receipt-parser-v1")
tokenizer = AutoTokenizer.from_pretrained("farhamu/bart-base-receipt-parser-v1")
# Example usage
receipt_text = """
SUPERMARKET ABC
123 Main Street
City, State 12345
Date: 2024-01-15
Item 1: $5.99
Item 2: $3.50
Tax: $0.76
Total: $10.25
Thank you for shopping!
"""
# Tokenize input (text beyond 512 tokens is truncated)
inputs = tokenizer(receipt_text, return_tensors="pt", max_length=512, truncation=True, padding=True)
# Generate output
outputs = model.generate(
    **inputs,
    max_length=150,
    num_beams=4,
    early_stopping=True,
)
# Decode result
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
Expected Output Format
The model outputs structured information in the following format:
Date: [extracted_date]
Company Name: [extracted_company_name]
Total Amount: [extracted_total_amount]
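To use the result programmatically, the generated string can be split into fields. The following is a minimal sketch, assuming the model emits exactly the three labels shown above, each followed by a colon. The function name `parse_receipt_output` is hypothetical, and whether the fields are separated by newlines or spaces is not confirmed, so the pattern accepts both.

```python
import re
from typing import Dict, Optional

# Field labels taken from the documented output format.
FIELDS = ("Date", "Company Name", "Total Amount")


def parse_receipt_output(generated: str) -> Dict[str, Optional[str]]:
    """Parse 'Label: value' segments into a dict; missing fields map to None."""
    labels = "|".join(re.escape(f) for f in FIELDS)
    # Capture the text after each label, up to the next label or end of string.
    pattern = re.compile(rf"({labels}):\s*(.*?)\s*(?=(?:{labels}):|$)", re.DOTALL)
    parsed: Dict[str, Optional[str]] = {f: None for f in FIELDS}
    for label, value in pattern.findall(generated):
        # Keep the first non-empty value seen for each label.
        if parsed[label] is None and value:
            parsed[label] = value
    return parsed


if __name__ == "__main__":
    sample = "Date: 2024-01-15\nCompany Name: SUPERMARKET ABC\nTotal Amount: $10.25"
    print(parse_receipt_output(sample))
```

Returning `None` for a missing field, rather than raising, lets the caller decide how to handle incomplete generations.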
Training Details
- Base Model: facebook/bart-base
- Task: Text-to-Text Generation (Receipt Information Extraction)
- Training Data: OCR-processed receipt text with labeled ground truth
- Data Source: Receipt Dataset SSD300 V2
- OCR Tool: EasyOCR
Limitations
- Performance may vary depending on OCR quality
- Trained only on receipts from the Receipt Dataset SSD300 V2, so outputs reflect the formats and styles found in that dataset
- May require additional fine-tuning for receipts with significantly different formats or languages
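Given these limitations, one possible safeguard is to check that an extracted total actually occurs in the OCR text before trusting it. The sketch below is an illustration rather than part of the model: the helper names `normalize_amount` and `total_is_grounded` are hypothetical, and it assumes amounts use a dot as the decimal separator and commas only as thousands separators.

```python
import re
from typing import Optional

# A number starting with a digit, with optional commas and an optional decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def normalize_amount(value: str) -> Optional[str]:
    """Return the first number in `value` as a two-decimal string, or None."""
    match = NUMBER.search(value)
    if match is None:
        return None
    return f"{float(match.group().replace(',', '')):.2f}"


def total_is_grounded(ocr_text: str, extracted_total: str) -> bool:
    """True if the extracted total matches some number in the OCR text."""
    target = normalize_amount(extracted_total)
    if target is None:
        return False
    candidates = NUMBER.findall(ocr_text)
    return any(f"{float(c.replace(',', '')):.2f}" == target for c in candidates)
```

A failed check does not prove the extraction is wrong, since OCR may have misread the digits, but it flags receipts that are worth reviewing manually.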
Use Cases
- Automated receipt processing for expense management
- Financial document digitization
- Retail analytics and data extraction
- Accounting automation
Citation
If you use this model, please cite the original dataset:
@dataset{dhiaznaidi2024receipt,
title={Receipt Dataset SSD300 V2},
author={Dhiaz Naidi},
year={2024},
url={https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2}
}