Automated Invoice Processing System

This project documents the process of designing and building a system to automatically collect and extract structured information from various invoice formats used in restaurant operations. It covers preprocessing techniques, strategies for leveraging large language models (LLMs), and performance optimization decisions. The goal was to strike a balance between cost efficiency and extraction accuracy.

1. Problem Definition and Initial Approach

The initial system design combined OCR (Optical Character Recognition) with GPT-based language models to extract the following key information:

Supplier Name
Invoice Date
Invoice ID or Number
Invoice Total
GST (Goods and Services Tax)
PST (Provincial Sales Tax)

After extracting raw text via OCR, the system passed the results to GPT for interpretation and structured data extraction.
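
As a rough illustration of the structured output this stage aims for, the sketch below models the six fields as a Python dataclass and parses a JSON reply into it. The class name InvoiceFields, the JSON key names, and the helper parse_model_reply are assumptions made for this example; the original system's schema is not described in this document.

```python
import json
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class InvoiceFields:
    """Hypothetical container for the six target fields."""
    supplier_name: Optional[str] = None
    invoice_date: Optional[str] = None  # kept as raw text; the format may be ambiguous
    invoice_id: Optional[str] = None
    invoice_total: Optional[Decimal] = None
    gst: Optional[Decimal] = None
    pst: Optional[Decimal] = None


def _to_decimal(value) -> Optional[Decimal]:
    """Convert a money-like value ("$1,234.50", 61.73) to Decimal, or None if absent."""
    if value is None or value == "":
        return None
    cleaned = str(value).replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)


def parse_model_reply(reply_text: str) -> InvoiceFields:
    """Parse a JSON object returned by the model (assumed key names) into InvoiceFields."""
    data = json.loads(reply_text)
    return InvoiceFields(
        supplier_name=data.get("supplier_name"),
        invoice_date=data.get("invoice_date"),
        invoice_id=data.get("invoice_id"),
        invoice_total=_to_decimal(data.get("invoice_total")),
        gst=_to_decimal(data.get("gst")),
        pst=_to_decimal(data.get("pst")),
    )
```

Missing keys are left as None rather than guessed, so that downstream checks can tell which fields were not found.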

2. Limitations of the Initial Model

When applied in a real-world setting, the following issues emerged:

Format Diversity: Each supplier used a different invoice layout, and inconsistent date formats (e.g., MM/DD/YYYY vs DD/MM/YYYY) led to frequent misinterpretation.
Excessive ID Candidates: Multiple numeric values were often extracted, making it difficult to reliably identify the correct Invoice ID.
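
Both issues can be reproduced without any model involved. The sketch below is an illustration only, not part of the original system: candidate_dates and id_candidates are hypothetical helpers, and the sample text is invented. It shows that a date such as 03/04/2024 is valid under both conventions, and that a simple numeric pattern returns several plausible IDs from one invoice.

```python
import re
from datetime import date, datetime
from typing import List, Set


def candidate_dates(text: str) -> Set[date]:
    """Return every calendar date the string could mean under MM/DD/YYYY or DD/MM/YYYY."""
    candidates = set()
    for fmt in ("%m/%d/%Y", "%d/%m/%Y"):
        try:
            candidates.add(datetime.strptime(text, fmt).date())
        except ValueError:
            # The string is not valid under this convention (e.g. month 25).
            pass
    return candidates


def id_candidates(ocr_text: str) -> List[str]:
    """Return every run of four or more digits, in order of appearance."""
    return re.findall(r"\b\d{4,}\b", ocr_text)


sample = "Invoice No: 100234\nPO: 558812\nTel: 6045551234"
print(candidate_dates("03/04/2024"))  # two distinct dates: ambiguous
print(candidate_dates("25/12/2024"))  # one date: only DD/MM/YYYY is valid
print(id_candidates(sample))          # three numbers, only one of which is the invoice ID
```

When candidate_dates returns more than one value, the text alone cannot settle the question, so some external knowledge about the supplier's format is needed.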

3. First Improvement: Supplier-Based Format-Aware Parsing

Goal: Achieve high accuracy using a low-cost model (e.g., GPT-4o mini)

Initially, the entire invoice image was provided to GPT for end-to-end inference. However, without structural guidance, accuracy suffered. To address this, the inference process was split into two distinct stages:

Supplier Identification via Name Matching

A list of known supplier names was provided alongside the OCR result. GPT was tasked with identifying the supplier name in the text and matching it to the closest entry in the list. If no entry matched, it returned "unknown".
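
A minimal sketch of this stage follows, under several assumptions. The prompt wording, the function names build_supplier_prompt and resolve_supplier, and the supplier names are all hypothetical, and the model call itself is omitted. The local check with difflib is an added safeguard for this example, not something the original system is stated to use: it confirms that the model's reply really corresponds to an entry in the list before the reply is trusted.

```python
import difflib
from typing import List

UNKNOWN = "unknown"


def build_supplier_prompt(ocr_text: str, known_suppliers: List[str]) -> str:
    """Assemble the stage-one prompt: known supplier list plus the OCR text."""
    supplier_lines = "\n".join(f"- {name}" for name in known_suppliers)
    return (
        "Identify the supplier that issued this invoice.\n"
        "Answer with exactly one name from the list below, "
        f"or '{UNKNOWN}' if none matches.\n\n"
        f"Known suppliers:\n{supplier_lines}\n\n"
        f"OCR text:\n{ocr_text}"
    )


def resolve_supplier(model_reply: str, known_suppliers: List[str], cutoff: float = 0.8) -> str:
    """Map the model's reply onto a known supplier, or return UNKNOWN."""
    by_lower = {name.lower(): name for name in known_suppliers}
    reply = model_reply.strip().lower()
    if reply in by_lower:
        return by_lower[reply]
    # Tolerate small spelling differences in the reply (cutoff is an assumed threshold).
    close = difflib.get_close_matches(reply, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else UNKNOWN


suppliers = ["Northside Produce", "Harbour Seafood Co."]  # invented names
print(resolve_supplier("Harbour Seafood Co", suppliers))
print(resolve_supplier("Fresh Farms Ltd", suppliers))
```

Returning a canonical name from the list, rather than the model's free-text reply, gives later stages a stable key for looking up supplier-specific format rules.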