Energy

Intelligent Documents Processing

The project was delivered for a multinational oil & gas corporation ranked as one of the top publicly traded companies by market capitalization.

Business Goals

Fast and reliable extraction of information from digitized unconventional documents – on a large scale

Challenge

  • Template extraction cannot be applied due to the domain (geophysics) specifics and the wide variety of forms of documents.
  • Millions of digitized documents need to be processed in parallel and the required information extracted from images, tables, and text fragments.
  • Images need to be classified into plots, pictures, and tens of domain-specific classes.
  • Linear- and scatterplots need to be transformed into lists of points and discovered qualities.
  • Text fragments and tables with typed, hand-written, or printed characters need to be converted into machine-encoded text.
  • The machine-encoded texts need to be separated according to the detected topic into multiple domain-specific categories.

Results

  • The Deep Learning classification model (1) was validated with an accuracy of 99.9%
  • The Deep Learning model ensemble for semantic segmentation (2) demonstrated an accuracy of 78% and IoU of 0.58.
  • The Deep Learning fragment classification model (3) demonstrated 86% accuracy.
  • The Deep Learning Optical Character Recognition (OCR) model ensemble (4) showed average accuracy of 88%.
  • The NLP Deep Learning text summarization model (5) showed good readability and overall responsiveness as well as near-human reviewers’ accuracy.

Implementation Details

  • The Deep Learning classification model (1) separates original digitized documents into hand-written, typed, and printed documents.
  • The ensemble of deep learning models for semantic segmentation (2) extracts images, tables, and text fragments from scanned documents and saves them to object storage. The ensemble (2) consists of three deep learning segmentation models. Each of them is trained to segment outcomes of the model (1). The ensemble (2) is triggered as the model (1) outputs arrive.
  • The Deep Learning fragment classification model (3) is invoked as the image fragments from models (2) arrive. Model (3) distributes image fragments into plots, pictures, and other domain-specific buckets.
  • The Deep Learning Optical Character Recognition (OCR) models ensemble (4) is invoked as text and table fragments from models (2) arrive. Model ensemble (4) converts text from image fragments into machine-encoded text.
  • The NLP Deep Learning text summarization model (5) distributes machine-encoded text received from the model (4) into multiple domain-specific categories.
  • Simultaneous processing of thousands of digitized documents was implemented via containerization of the workspaces and centralized management of the workloads with Kubernetes.
  • Real-time and batch information extraction from digitized documents was implemented as a pipeline of 5 Deep Learning algorithms, deployed and managed as a series of independent containerized workloads, horizontally scalable. The workloads are created when users upload documents and stop when the documents have been processed.
Get a technical consultation
Alex Gurbych

Alex Gurbych

Chief Solutions Architect

Receive a professional and in-depth consultation from an experienced expert. Get tailored advice to address your specific needs and achieve your goals effectively.