Data Extraction from Unstructured Invoices and Documents using RPA & AI

Vikas Kulhari
Sep 15, 2020

Hey, I hope you and your family are healthy and safe during this pandemic!!

Photo Credit: Blue Technologies

If you are reading this article, you probably have a big and complex problem. Extracting data from different types of documents isn’t simple. Reading data from invoices and typing it into software, Excel, a database, an ERP, or a CRM is one of the most common tasks in most companies. You may receive digital as well as scanned invoices, and according to some study reports more than 80% of documents arrive as hard copies or scans. You can certainly use OCR (Optical Character Recognition) technology to process well-structured documents, but what about complex and unstructured ones?

Your human workforce can help with that, since a human can find the right data even in a sea of complex data. Eventually. But humans are slow, error-prone, inconsistent, and expensive. (And, in some cases, perhaps not so excellent after all!)

Photo Credit: Vector Stock

On top of that, there are many issues with unstructured invoices:

  • Documents can come in multiple formats, sometimes even when they are shared by the same client.
  • You can’t force your clients to provide data in a template.
  • The content may be free-flowing.
  • Documents might have unstructured tables…or worse! Nested tables!
  • Part of a document, or all of it, could consist of images.
  • Documents might include handwriting…or worse! Messy handwriting!
  • [FILL IN YOUR OWN FAVORITE EXTRACTION PAIN HERE!]

The problem isn’t limited to invoices; you may face it with receipts, bills, emails, bank statements, claims, images, and a whole lot more. Companies put a lot of human effort into typing data into their systems, or at least converting it into a structured format. Traditional OCR engines struggle with handwriting and with telling look-alike characters apart: whether an entry is a zero or an “O”, a one or an “l”, an “I” or an “l”, and so on.
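To make the look-alike character problem concrete, here is a minimal sketch of a post-OCR cleanup step for a field that is known to hold an amount. The mapping table and the function name `clean_amount` are assumptions made for this illustration; they are not taken from any of the tools discussed in this article.

```python
import re
from decimal import Decimal

# Hypothetical mapping of letters that OCR commonly confuses with digits.
LOOKALIKE_TO_DIGIT = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def clean_amount(raw):
    """Return an OCR'd amount as a Decimal, or None if it still isn't numeric."""
    # Only safe because we already know this field should contain a number.
    candidate = raw.strip().translate(LOOKALIKE_TO_DIGIT)
    # Drop thousands separators and stray spaces.
    candidate = re.sub(r"[,\s]", "", candidate)
    if not re.fullmatch(r"\d+(\.\d+)?", candidate):
        return None
    return Decimal(candidate)


print(clean_amount("1,2O5.lO"))  # the letters O and l are read back as 0 and 1
print(clean_amount("Total"))     # not an amount, so the function returns None
```

A rule like this only works when you already know which field you are looking at, which is exactly the knowledge that is missing in an unstructured document.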

How can RPA/AI help?

Various companies, including some RPA vendors, offer low-code, AI-powered solutions for this problem. They use Machine Learning (ML) for data classification and extraction. These ML algorithms are configurable and can often be set up with simple drag-and-drop.

Photo: Reading data from document, converting into a structured format and storing into database

Most of these applications follow the steps below to train their ML models and convert data from an unstructured format to a structured one.

Photo Credit: infrrd.ai

These tools can recognize and classify data based on the training data provided.
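The overall flow (read the document text, convert it into a structured record, store it in a database) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text is assumed to be OCR'd already, simple regular expressions stand in for the trained ML model that a real product would use, and the field names, patterns, and `invoices` table are hypothetical.

```python
import re
import sqlite3

# Hypothetical patterns standing in for a trained extraction model.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:Number|No\.?|#)\s*[:\-]?\s*(\S+)", re.I),
    "invoice_date": re.compile(r"Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total": re.compile(r"Total\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}


def extract_fields(text):
    """Turn free text into a structured record; missing fields become None."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


def store_record(conn, record):
    """Store one extracted record in a (hypothetical) invoices table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices "
        "(invoice_number TEXT, invoice_date TEXT, total TEXT)"
    )
    conn.execute(
        "INSERT INTO invoices VALUES (:invoice_number, :invoice_date, :total)",
        record,
    )
    conn.commit()


sample_text = """ACME Supplies
Invoice No: INV-1042
Date: 2020-09-01
Total: $1,250.00
"""

conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
record = extract_fields(sample_text)
store_record(conn, record)
rows = conn.execute(
    "SELECT invoice_number, invoice_date, total FROM invoices"
).fetchall()
print(rows)
```

Fixed patterns like these break as soon as a client changes the layout, which is why the tools below learn from training data instead.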

Tools

  1. ROSSUM: A UK-based start-up founded in 2017, Rossum is among the leading providers of AI-based automated invoice data extraction. It has a reported market capitalization of $4.4M and offers a free trial to the public.
  2. IQBot: Automation Anywhere launched this product a few years ago, and it has improved a lot in the last year. Automation Anywhere offers end-to-end intelligent automation solutions for reading data from unstructured invoices. It also offers a free trial version.
  3. ABBYY: ABBYY is known chiefly for document processing. It offers a strong OCR engine that is used by many RPA tools and large organizations. For invoice processing, ABBYY offers three solutions: FineReader PDF, FlexiCapture, and FlexiCapture for Invoices.
  4. UiPath AI for Invoices and Receipts (Document Understanding): UiPath launched this feature in late 2019, and it is available in UiPath Studio 2019.10 and later. The feature is still in its early stages, but we look forward to seeing it improve and grow. AI needs a lot of training data, so we just need to be patient.
  5. Power Automate AI Builder (Form Processing): Microsoft’s Power Automate only reached the market last year, but it came with a lot of AI capabilities that set it apart from the others. Its AI Builder feature helps extract data from invoices. You need at least 5 invoices in a single format to train the model; once training is done, you can test the bot and deploy it into production.
  6. Infrrd: An Indian start-up founded in 2017, Infrrd offers a strong solution for extracting data from invoices. Its Intelligent Document Processing platform helps you maximize straight-through processing, and its template-free approach outperforms OCR when there are many document types and variations. Infrrd’s ML-first approach can automatically extract data from documents with complex visual elements, such as images, tables, graphs, handwriting, symbols, logos, and rubber stamps.
  7. Kofax Capture: Kofax Capture automates document processing and improves information visibility within the organization by capturing paper and electronic documents from common ingestion channels, transforming them into accurate and actionable information, and delivering it all into core business systems.
  8. Amazon Textract: Amazon Textract is a fully managed machine learning service that automatically extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. It can extract data from different sorts of documents, such as invoices, passports, and employee payslips, and because the service is pre-trained, you do not have to train a model yourself.
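As a taste of what working with one of these tools looks like in code, here is a hedged sketch for Amazon Textract using the `boto3` SDK. The `analyze_document` call with `FeatureTypes=["FORMS"]` returns a list of `Blocks`; the helper names (`analyze_invoice`, `key_value_pairs`, `_text_of`) and the tiny hand-built sample response are assumptions made for this illustration, and a real response contains many more blocks and fields.

```python
def analyze_invoice(path):
    """Send a local file to Textract (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the parsing code below runs without it

    with open(path, "rb") as f:
        return boto3.client("textract").analyze_document(
            Document={"Bytes": f.read()}, FeatureTypes=["FORMS"]
        )


def _text_of(block, blocks_by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(response):
    """Collect form fields from KEY_VALUE_SET blocks as a plain dict."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":  # link from a KEY block to its VALUE block
                value_text = " ".join(
                    _text_of(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        pairs[_text_of(block, blocks_by_id)] = value_text
    return pairs


# Hand-built sample in the shape of a Textract response, for illustration only.
sample_response = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "INV-1042"},
]}
print(key_value_pairs(sample_response))
```

With real credentials you would call `key_value_pairs(analyze_invoice("invoice.png"))`, where the file name is a placeholder. The other tools in the list expose similar results through their own studios or APIs.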

There are many more tools on the market that provide similar capabilities. Now it is up to you to decide which tool to go with.

#HappyRobotics

Written by Vikas Kulhari

Crafting Tomorrow: I help companies create intelligent machines | AI Maestro | AI & Intelligent Automation Consultant. LinkedIn @vikaskulhari
