Portfolio

AI System for Extracting Information from Invoice Scans: From Annotation to Implementation

Business process automation often involves processing complex structured paper documents such as invoices and delivery notes. A common scenario: a company receives invoice scans via email, the mailbox is parsed, and information such as invoice details and payment dates is entered into the ERP system. Parsing such documents manually is a time-consuming and labour-intensive process, and this is where artificial intelligence can help.

In this article, we describe our approach to developing a system that recognises information from paper invoices using computer vision and machine learning technologies.

What you will learn from the article:

- How we design complex computer vision pipelines.
- How to organise continuous fine-tuning of the model.
- How to track model quality metrics.

Training the models

A key part of the project is Datapipe, an open-source tool we use to build the data processing pipeline. When the input data changes, whether new data is added or existing data is modified, Datapipe automatically tracks the change and recalculates only the steps whose inputs were affected. Each step in the pipeline is a function whose result is passed on to the next step. This saves computing time and allows us to continuously fine-tune the model in production, recalculating only the functions that truly need to be rerun.
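
To illustrate the idea (this is not Datapipe's actual API, only a minimal sketch of the incremental-recomputation principle), each step's output can be cached under a fingerprint of its inputs, so a step is rerun only when its inputs change:

```python
import hashlib
import json
import pickle
from pathlib import Path

CACHE_DIR = Path("pipeline_cache")   # hypothetical location for cached step results
CACHE_DIR.mkdir(exist_ok=True)

def run_step(name, func, inputs):
    """Run a pipeline step only if its inputs have changed since the last run."""
    # Fingerprint the inputs: any change in the data produces a new cache key.
    key = hashlib.sha256(
        json.dumps(inputs, sort_keys=True, default=str).encode()
    ).hexdigest()
    cached = CACHE_DIR / f"{name}_{key}.pkl"
    if cached.exists():                       # inputs unchanged -> reuse the stored result
        return pickle.loads(cached.read_bytes())
    result = func(inputs)                     # inputs changed -> recompute this step only
    cached.write_bytes(pickle.dumps(result))
    return result

# Each step is a function whose result is passed on to the next step.
annotations = run_step("load_annotations",
                       lambda src: {"items": [], "source": src["path"]},
                       {"path": "label_studio_export.json"})
frozen = run_step("freeze_dataset", lambda ann: ann["items"], annotations)
```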

The pipeline of our system is structured like this:

1. Data annotated in Label Studio by a moderator is automatically loaded into the pipeline.

2. The annotated data is added to a so-called ‘frozen dataset’—it will no longer be modified and is used for ML model training.

3. We split the data into two parts (a small split sketch follows this list):

- Training set - for training the model.

- Validation set - to validate and evaluate the model's quality.

4. The images undergo transformation: each image is divided into smaller segments (subpictures) to facilitate model training. In each subpicture, even smaller regions are extracted to improve detection and classification accuracy.

5. We apply the models sequentially: first the object detection model (YOLOv5) is trained to locate the fields, and then OCR (Google Cloud Vision OCR) is applied to recognise the text inside them.

6. Sequential testing of the models. After each training run, predictions are made on the training and validation data, quality metrics are calculated, and the model with the best performance is deployed to production.
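
As a small illustration of the split from step 3, the frozen dataset can be divided with scikit-learn; the 80/20 ratio and file layout here are assumptions for the example, not the project's actual settings:

```python
from sklearn.model_selection import train_test_split

# Hypothetical frozen dataset: pairs of (image path, annotation path).
frozen_dataset = [(f"invoices/img_{i:04d}.jpg", f"labels/img_{i:04d}.json")
                  for i in range(100)]

# Training set for fitting the model, validation set for evaluating its quality.
train_set, val_set = train_test_split(frozen_dataset, test_size=0.2, random_state=42)
print(len(train_set), "training samples,", len(val_set), "validation samples")
```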

Now that the model has been trained and tested, it is ready for production use. The process is organised so that every new invoice entering the system is processed automatically, with our models extracting the necessary data. It follows these steps:
1. New images arrive in the RabbitMQ queue, which acts as a mailbox for new image data.
2. A daemon process monitors this queue. It runs in the background, checking for new images; when a new message appears, the daemon retrieves it from the queue and passes the image on for processing (a minimal consumer sketch follows this list).
3. Using the best model, the system makes a prediction.
4. The result of the prediction is sent to the API, where other programs can use this information for further actions.
5. The image is removed from the queue.
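
A sketch of such a daemon using the pika RabbitMQ client is shown below; the queue name, host and the predict/send_to_api helpers are placeholders, not the production code:

```python
import pika

def predict(image_bytes: bytes) -> dict:
    """Placeholder for running the best detection + OCR models on the image."""
    return {"status": "processed", "size": len(image_bytes)}

def send_to_api(result: dict) -> None:
    """Placeholder for posting the prediction to the downstream API."""
    print("prediction sent:", result)

# Connection settings are assumptions; use the actual RabbitMQ host and queue name.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="invoice_images", durable=True)

def on_message(ch, method, properties, body):
    result = predict(body)                           # step 3: run the best model
    send_to_api(result)                              # step 4: hand the result to the API
    ch.basic_ack(delivery_tag=method.delivery_tag)   # step 5: remove the image from the queue

channel.basic_consume(queue="invoice_images", on_message_callback=on_message)
channel.start_consuming()                            # step 2: the background daemon loop
```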

The Models in Action: Invoice Processing

1. Data collection

An email arrives with a scan of a paid invoice. This image is the starting point of automatic recognition.
Once the image is in the system, it is loaded into Label Studio, the moderator interface we chose for its ease of use. Here the original invoice image is displayed on the screen before being passed to our system's models for processing.

2. Detection of the Invoice in the Image

The first processing step is to detect the invoice itself in the image. Since photographs may contain additional objects like hands, backgrounds, or other items, it is crucial to isolate only the invoice. The model's task is to determine the exact boundaries of the invoice to eliminate unnecessary details and focus on the desired area. Using the labeled data, we trained the YOLOv5 model to accurately determine the invoice’s position in the image, considering the rotation angle at which the photo was taken (0°, 90°, 180°, 270°).
When a new image is presented, the YOLOv5 model analyses it and makes predictions:

- Bbox - the boundaries of the object in the image.

- Class - the category the object belongs to. At this stage we are interested in the class ‘invoice’.

- Score - the probability that the object belongs to the specified class.

The model highlights the detected object with a bounding frame and reports its confidence in the detection. The system then passes the object's coordinates and score on to the next stage of processing.
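
For illustration, this is roughly how a custom-trained YOLOv5 model can be queried through torch.hub to obtain the bbox, class and score for each detection; the weights path and image path are assumptions:

```python
import torch

# Load a custom-trained YOLOv5 model (the weights path is an assumption).
model = torch.hub.load("ultralytics/yolov5", "custom", path="weights/invoice_detector.pt")

results = model("incoming/invoice_scan.jpg")   # run detection on a new image
detections = results.pandas().xyxy[0]          # one row per detected object

# Each row carries the bbox, the class and the score described above.
for _, det in detections.iterrows():
    bbox = (det.xmin, det.ymin, det.xmax, det.ymax)
    print(det["name"], round(det.confidence, 2), bbox)
```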

3. Detecting Key Fields in the Invoice

After determining the position of the invoice in the photo, the system proceeds to detect the key fields in it.
Since invoices contain a wide variety of information, the image is first divided into small areas - subpictures. This cropping process improves the accuracy of key field detection. Each subpicture is then processed by the YOLOv5 model, which analyses its contents and identifies areas of interest: date, total amount, item names and other important elements.
Once all the subpictures have been processed, the system merges the results to form a complete picture of the recognised fields across the entire invoice.
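
A simplified sketch of this tiling-and-merging step is shown below; the tile size, overlap and detection format are assumptions for illustration rather than the exact values used in the project:

```python
from PIL import Image

TILE = 640      # subpicture size in pixels (assumed value)
OVERLAP = 64    # overlap between neighbouring tiles so fields on a border are not cut

def split_into_subpictures(image: Image.Image):
    """Yield (crop, x_offset, y_offset) tiles covering the whole invoice."""
    w, h = image.size
    step = TILE - OVERLAP
    for top in range(0, h, step):
        for left in range(0, w, step):
            box = (left, top, min(left + TILE, w), min(top + TILE, h))
            yield image.crop(box), left, top

def merge_detections(tile_detections):
    """Shift per-tile bounding boxes back into full-image coordinates."""
    merged = []
    for det, x_off, y_off in tile_detections:
        x1, y1, x2, y2 = det["bbox"]
        merged.append({**det, "bbox": (x1 + x_off, y1 + y_off, x2 + x_off, y2 + y_off)})
    return merged
```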

4. Numbers and text recognition

Now that the system has identified and categorised the key fields on the invoice, it is time to move on to recognising text and numbers. For this, we use Google Cloud Vision OCR, which is responsible for extracting all the necessary information.
Each highlighted area found in the previous steps is passed to Google Cloud Vision OCR. This service analyses the image and extracts the textual data: total amount, product names, invoice numbers and other important details.
Google Cloud Vision OCR handles text recognition well, even with complex backgrounds and varying fonts, ensuring high accuracy and completeness in extracting information. The system then links the extracted text and numeric values to the corresponding fields on the invoice.
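
A minimal example of such a call, assuming the google-cloud-vision client library and valid credentials (the helper name and file handling are illustrative):

```python
from google.cloud import vision

# Requires GOOGLE_APPLICATION_CREDENTIALS to point at a service-account key.
client = vision.ImageAnnotatorClient()

def recognise_field(cropped_field_path: str) -> str:
    """Send one detected field crop to Cloud Vision OCR and return the extracted text."""
    with open(cropped_field_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.text_detection(image=image)
    # The first annotation contains the full text found in the crop.
    return response.text_annotations[0].description if response.text_annotations else ""
```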

5. Fine-tuning the Model

Invoice recognition is constantly evolving: new invoice formats are introduced, fonts change, new fields are added, and invoice designs are regularly updated. To ensure that the models can adapt effectively to these changes and continue to recognise data accurately, they need to be retrained on a regular basis. This retraining or refining process is called fine-tuning.
We provide the customer not just with a trained model, but with a complete pipeline that includes an automated fine-tuning process based on newly labeled data. This ensures continuous model improvement, adapting to ever-changing conditions in the production environment.
During the workflow, the moderator manually annotates new data, which gradually accumulates in the dataset. When the volume of this data reaches a certain threshold, the dataset is frozen and transferred to the pipelines.
The Datapipe platform efficiently integrates the new data, automatically processing only the parts that have changed. This flexibility allows new processing steps to be easily added to the pipeline and adapted to evolving customer requirements. After each fine-tuning cycle, updated metrics are automatically calculated to monitor the model's performance quality. This cycle ensures the system remains up-to-date and accurate under any conditions.
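
Conceptually, the freeze-and-retrain trigger looks something like the sketch below; the threshold value and function names are hypothetical:

```python
# Hypothetical threshold: once enough newly labeled invoices have accumulated,
# the dataset is frozen and a fine-tuning run is started.
FREEZE_THRESHOLD = 500

def start_finetuning(frozen: list) -> None:
    """Placeholder for kicking off the training pipeline on the frozen snapshot."""
    print(f"fine-tuning on {len(frozen)} newly labeled invoices")

def maybe_trigger_finetuning(new_annotations: list, frozen_datasets: list) -> None:
    if len(new_annotations) >= FREEZE_THRESHOLD:
        frozen = list(new_annotations)      # freeze: this snapshot is never modified again
        frozen_datasets.append(frozen)
        new_annotations.clear()
        start_finetuning(frozen)            # hand the frozen snapshot to the training pipeline
```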

6. Monitoring and Evaluating Quality Metrics

In order to assess the effectiveness of the model, it is necessary to analyse its key metrics on a regular basis.
The quality analysis includes calculating the following metrics (a small computation sketch follows the list):
  • Precision and Recall: these metrics assess the accuracy and completeness of object recognition.
  • F1 Score: the harmonic mean of Precision and Recall, which shows how accurately and completely the model recognises the data of interest. The closer the F1 Score is to 1, the more accurate the prediction.
  • Weighted and Macro F1 Score: used for an overall evaluation of the model's performance, taking into account the different weights of object categories and their class distribution. The Weighted F1 Score reflects how well the model performs when the frequency of each class (type of detected object) is taken into account, providing a balanced view based on the class distribution. The Macro F1 Score evaluates the model's accuracy for each class regardless of its frequency, giving an average performance across all classes.
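
As referenced above, here is a minimal sketch of computing these metrics with scikit-learn on toy validation labels (the label values are made up purely for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy example: true vs predicted field classes on the validation set.
y_true = ["date", "total", "total", "item", "date", "item", "total"]
y_pred = ["date", "total", "item",  "item", "date", "item", "total"]

precision = precision_score(y_true, y_pred, average="weighted")
recall = recall_score(y_true, y_pred, average="weighted")
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # accounts for class frequency
f1_macro = f1_score(y_true, y_pred, average="macro")        # treats every class equally

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"weighted F1={f1_weighted:.2f} macro F1={f1_macro:.2f}")
```
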
For monitoring and visualising these metrics we use the Metabase platform, which lets us follow the real-time dynamics of changes in model quality.
You can see the accuracy of the model predictions in the diagrams:

Conclusion

Our team has successfully developed an AI system for extracting information from paper invoices, taking the project from data annotation and pipeline customisation through to the sequential training and fine-tuning of models.
A big thank you to the dedicated team that made this project a reality:
- Alexander Kozlov, Lead ML Engineer
- Andrei Tatarinov, CEO/CTO Epoch8.co / AGIMA.AI.