A Look at Intelligent Document Processing and E-Invoicing

By Jessie Hobb On Feb 7, 2024

In the past, invoices were sent on paper and manually keyed into the recipient’s ERP system for further processing. According to Brendan Foley (2019), among others, around 80 to 90 percent of the data in documents such as invoices and emails is still extracted manually. In recent years, however, there has been a notable shift toward fully digital transmission of documents such as business invoices, combined with automated data extraction.

Why should a company (or its managers) embrace this shift? The rationale is clear: it conserves resources (e.g., less paper) and makes workflows more efficient (e.g., no manual data entry).

Staying abreast of developments in this sector is also crucial for companies engaged in public sector contracts. Within the European Union, there has been a longstanding push (EU Directive 2014/55/EU) to standardize invoicing and improve machine readability, with the aim of enabling automated processing of business invoices. Similarly, the public sector in Germany is moving away from traditional paper-based document handling toward digital capture and automated processing.

As a consequence, legislators in Germany, for instance, have mandated that from approximately 2025, business invoices may be submitted digitally only in specific machine-readable formats. The pressure on companies involved in public sector contracts will therefore intensify in the coming years, and e-invoicing is gradually entering a pivotal phase.

Returning to the main topic: intelligent processing of business invoices is central to the end-to-end invoice workflow and is perhaps its most critical step. For instance, a misrecognized IBAN (International Bank Account Number) could result in payment to the wrong supplier, incurring substantial follow-up costs for the company.
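One inexpensive safeguard against a misread IBAN is the checksum that every IBAN carries (the mod-97 check defined in ISO 13616). The following is a minimal sketch of such a check; the class and method names are illustrative assumptions and not part of any product mentioned in this article. It verifies only the checksum, not country-specific lengths or whether the account actually belongs to the supplier.

```java
import java.math.BigInteger;
import java.util.Locale;

// Illustrative helper (hypothetical name): checks the IBAN mod-97 checksum.
final class IbanCheck {

    // Returns true if the given string has IBAN shape and a valid checksum.
    static boolean hasValidChecksum(String rawIban) {
        if (rawIban == null) {
            return false;
        }
        // Printed and OCR'd IBANs often contain spaces; remove them first.
        String iban = rawIban.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
        // Two letters, two check digits, then 11 to 30 alphanumeric characters.
        if (!iban.matches("[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}")) {
            return false;
        }
        // Move country code and check digits to the end.
        String rearranged = iban.substring(4) + iban.substring(0, 4);
        // Replace each letter by a number: A=10, B=11, ..., Z=35.
        StringBuilder digits = new StringBuilder();
        for (char c : rearranged.toCharArray()) {
            digits.append(Character.getNumericValue(c));
        }
        // The resulting integer must leave remainder 1 when divided by 97.
        BigInteger value = new BigInteger(digits.toString());
        return value.mod(BigInteger.valueOf(97)).intValue() == 1;
    }

    public static void main(String[] args) {
        // Widely published example IBAN, followed by the same IBAN with one digit altered.
        System.out.println(hasValidChecksum("DE89 3704 0044 0532 0130 00")); // true
        System.out.println(hasValidChecksum("DE89 3704 0044 0532 0130 01")); // false
    }
}
```

A check like this catches single-digit recognition errors before a payment is released, but it cannot tell whether a valid IBAN belongs to the right supplier; that still requires a comparison with master data.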

Exploring Two Approaches for Data Field Identification in Business Invoices: AI Integration and Standardized Formats

The following section outlines two possible approaches to identifying data fields in business invoices.

In the first approach, data is recognized using artificial intelligence (AI). Various providers, such as Microsoft (with the LayoutLM model), ABBYY, SAP, and EagleDoc, currently offer end-to-end AI-based data extraction solutions. SAP, for instance, uses a document reader to parse invoice documents and to identify and categorize the relevant data. AI-driven OCR (Optical Character Recognition) software captures the invoice data and cross-references it with vendor master records. Using existing master data, the invoice can be quickly assigned to the right supplier and the responsible employee. The classification software also interprets invoice line items and their values so that they can be matched immediately against the corresponding order data. Common to all providers, however, is that invoice processing is iterative: recognition has to be refined continually.
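The cross-referencing step can be pictured as a simple lookup: once a field such as the IBAN has been recognized, it is normalized and compared with the vendor master records. The sketch below is an assumption-laden illustration, not any provider's actual implementation; the Vendor fields, the class name, and the choice of the IBAN as the lookup key are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch (hypothetical names): match an extracted IBAN to vendor master data.
final class VendorMatcher {

    // Hypothetical master-data record; real ERP records carry many more fields.
    static final class Vendor {
        final String vendorId;
        final String name;
        final String iban;

        Vendor(String vendorId, String name, String iban) {
            this.vendorId = vendorId;
            this.name = name;
            this.iban = iban;
        }
    }

    // Index of vendors keyed by normalized IBAN.
    private final Map<String, Vendor> byIban = new HashMap<>();

    VendorMatcher(List<Vendor> masterData) {
        for (Vendor vendor : masterData) {
            byIban.put(normalize(vendor.iban), vendor);
        }
    }

    // Removes whitespace and upper-cases so that formatting differences do not matter.
    static String normalize(String iban) {
        return iban.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
    }

    // Returns the vendor whose IBAN equals the extracted one, if any.
    Optional<Vendor> match(String extractedIban) {
        if (extractedIban == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(byIban.get(normalize(extractedIban)));
    }
}
```

An empty result is a useful signal in itself: the invoice either comes from an unknown supplier or contains a recognition error, and in both cases it should go to manual review rather than straight to payment.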

In the second approach, data fields are extracted and identified using European or national invoice standards. At the European level, one notable example is the PEPPOL (Pan-European Public Procurement Online) initiative, which establishes a common invoice standard (the PEPPOL format) to streamline trade across member states. The format is widely recognized and endorsed by authorities in numerous member states.

Additionally, to facilitate domestic trade and adhere to EU directives at the national level, individual countries have established their national invoice standards alongside recognized European standards like the PEPPOL format. For instance, Austria has implemented the “ebInterface” standard, while Germany has adopted “ZUGFeRD” (Zentraler User Guide des Forums elektronische Rechnung Deutschland), serving as its national invoice standard.

Next, we look at the technical details of these invoice standards: the available formats and how they are used to extract and identify data from business invoices.

The conventional European method for electronic invoicing revolves around the XML format. This entails the creation of each invoice in XML format, subsequently transmitted to the recipient for seamless automated processing. The national invoice standard dictates the specific XML format required for such invoices.

This structured data format facilitates the automated processing of invoices. The European standard EN 16931 specifies two XML syntaxes for electronic invoices:

UN/CEFACT XML CII (Cross Industry Invoice)
UBL ISO/IEC 19845 (also known as UBL 2.1 Invoice, Universal Business Language)
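To illustrate why a structured XML invoice is easy to process, the following sketch reads a few fields from a heavily shortened, made-up UBL invoice using only the XML parser that ships with Java. The sample document, the class name, and the helper method are assumptions for illustration; a real EN 16931 invoice contains many more mandatory elements, and production code would validate the document against the standard first.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Illustrative sketch (hypothetical names): read fields from a minimal UBL invoice.
final class UblInvoiceSketch {

    // Namespace of the UBL "common basic components" (the cbc: prefix).
    static final String CBC =
        "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2";

    // Made-up, incomplete sample; not a valid EN 16931 invoice.
    static final String SAMPLE =
        "<Invoice xmlns=\"urn:oasis:names:specification:ubl:schema:xsd:Invoice-2\""
        + " xmlns:cac=\"urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2\""
        + " xmlns:cbc=\"" + CBC + "\">"
        + "<cbc:ID>RE-0001</cbc:ID>"
        + "<cbc:IssueDate>2024-02-07</cbc:IssueDate>"
        + "<cbc:DocumentCurrencyCode>EUR</cbc:DocumentCurrencyCode>"
        + "<cac:LegalMonetaryTotal>"
        + "<cbc:PayableAmount currencyID=\"EUR\">119.00</cbc:PayableAmount>"
        + "</cac:LegalMonetaryTotal>"
        + "</Invoice>";

    // Parses the XML with namespace support and DOCTYPE declarations disabled.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse invoice XML", e);
        }
    }

    // Returns the text of the first cbc element with this local name, or null if absent.
    // Note: real invoices contain several cbc:ID elements; this takes the first in document order.
    static String firstCbc(Document doc, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(CBC, localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("Invoice number: " + firstCbc(doc, "ID"));
        System.out.println("Issue date: " + firstCbc(doc, "IssueDate"));
        System.out.println("Currency: " + firstCbc(doc, "DocumentCurrencyCode"));
        System.out.println("Payable amount: " + firstCbc(doc, "PayableAmount"));
    }
}
```

Because every field sits in a named element, no recognition step is needed: the values are read directly, which is the main advantage of the standardized formats over image-based extraction.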

Unlocking Data Extraction in Business Invoices With “ZUGFeRD”: Insights From the Mustang Initiative

Consider “ZUGFeRD” as a prime illustration for extracting data fields from business invoices. As per findings from the open-source initiative “Mustang,” approximately 43% of companies in Germany currently transmit electronic invoices, with 45% of those utilizing the ZUGFeRD/Factur-X format.

The “Mustang” project provides an open-source Java library (available as a JAR or via Maven) and a .NET library, with tools for reading, editing, and validating ZUGFeRD invoices.

Suppose we have a PDF invoice that adheres to the “ZUGFeRD” standard. The following Java excerpt illustrates how individual data fields can be extracted from it:

// Assumes the Mustang library is on the classpath; the package name below
// matches Mustang 2.x and may differ in other versions.
import org.mustangproject.ZUGFeRD.ZUGFeRDImporter;

public class ZUGFeRDReader {

    public static void main(String[] args) {

        // Load a PDF invoice that is expected to contain embedded ZUGFeRD XML
        ZUGFeRDImporter zi = new ZUGFeRDImporter("./MustangGnuaccountingBeispielRE-20201121_508.pdf");

        // Proceed only if the embedded ZUGFeRD data could be read
        if (zi.canParse()) {
            System.out.println("Total Amount: " + zi.getAmount());
            System.out.println("BIC: " + zi.getBIC());
            System.out.println("IBAN: " + zi.getIBAN());
            System.out.println("Holder Name: " + zi.getHolder());
            System.out.println("Invoice Number: " + zi.getForeignReference());
            System.out.println("Invoice Date: " + zi.getInvoiceDate());
            System.out.println("Invoice Due Date: " + zi.getDueDate());
            System.out.println("Currency: " + zi.getCurrency());
            System.out.println("Tax ID: " + zi.getTaxID());
            System.out.println("Customer Reference: " + zi.getCustomerReference());
        } else {
            System.out.println("Invoice is not in the ZUGFeRD format");
        }
    }
}

Conclusion

In conclusion, as intelligent document processing becomes part of everyday business operations, the discussion of e-invoicing becomes increasingly unavoidable. Accurate recognition of the data fields in an invoice is pivotal for electronic processing, and there are two distinct approaches to it: AI-based extraction and standardized e-invoice formats.

The AI-based approach works well for unstructured formats such as TIF, JPEG, Word documents, or email text, while the e-invoice approach excels with hybrid invoice formats such as “ZUGFeRD” and structured data in XML files.
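In a system that supports both approaches, this division of labor can be expressed as a simple routing decision at intake. The sketch below is a simplified assumption: it routes by file extension only, and the class, enum, and route names are hypothetical. A PDF in particular may or may not carry embedded ZUGFeRD data, so it is marked for a check first rather than assigned to either approach outright.

```java
import java.util.Locale;

// Illustrative sketch (hypothetical names): choose a processing path per incoming file.
final class InvoiceRouter {

    enum Route {
        STRUCTURED_XML,      // read fields directly from the XML
        HYBRID_PDF_CHECK,    // look for embedded ZUGFeRD data, otherwise fall back to AI
        AI_EXTRACTION        // images, Word documents, email text, and anything else
    }

    // Picks a route from the file extension alone; real systems would also inspect content.
    static Route routeFor(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".xml")) {
            return Route.STRUCTURED_XML;
        }
        if (name.endsWith(".pdf")) {
            return Route.HYBRID_PDF_CHECK;
        }
        return Route.AI_EXTRACTION;
    }

    public static void main(String[] args) {
        System.out.println(routeFor("invoice.xml"));
        System.out.println(routeFor("invoice.PDF"));
        System.out.println(routeFor("scan.tif"));
    }
}
```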

As discussed, there is mounting pressure, especially from the public sector, to adopt standardized e-invoices. In addition, AI-based data capture is iterative: it starts with a notable error rate and requires substantial time and resources for refinement through training. By contrast, the e-invoice approach enables direct and nearly error-free data extraction, which presumably translates into lower processing costs.

To remain future-proof, companies offering digital solutions for automated invoice processing must cover both the AI and the e-invoice approach.