Parametric Data Extraction

Published On: February 27, 2024

By: Drew Marlatt

Imagine the last time you read a product manual, maintenance guide, or technical specification document. Buried in dozens of pages of text you likely found specific measures, values, and attributes about one or more components. A pump has a minimum inlet pressure, an engine has a maximum oil temperature, a tank has a range limit to its cannon.

These are called parametric data attributes – the minimum inlet pressure of a specific pump. The maximum oil temperature of a specific engine. The cannon range limit of a specific tank. And every item – the pump, the engine, the tank in our example here – can have hundreds or thousands of data attributes about it.

Across industries, troves of structured data like this exist only insidelong, complex unstructured data. Think: documents, manuals, reports. Extracting that information into a usable database is an onerous, manual task.

Even for the large language models (LLMs) that make a lot of AI technologies work, this is a challenging task. Out of the box models can’t operate on the vast and varied context needed to identify parametric data attributes. Some attributes show up in paragraphs of text, deep in embedded tables, or even overlaid on images in a document.

So, how do you get them out of these documents and into models that are workable and that serve a purpose?

At Finch, we’ve developed an AI-driven approach to this challenge. Information extraction and knowledge discovery in deep, technical domains. We have fine-tuned LLMs to understand, extract and return structed data objects related to, for example, specific pieces of military hardware.

Ontologies vs. Taxonomies

To understand our approach, one has to first make a clear distinction between ontologies and taxonomies. Taxonomies, generally, are hierarchical. A dog is a canine, canines are mammals, and mammals are animals. But ontologies change based on the context. If I own a grocery store, “oranges,” “plums,” and “greens” might be items in my produce department. But if I were a painter, those same terms, “oranges,” “plums,” and “greens” could be colors of paint I have.

Ontologies, by definition, must be more complex variations of taxonomies. They can take into account relationships, constraints, and conditions. While many industries have established taxonomies, fewer have ontologies, though that is changing. And that’s a good thing – because it enables the types of complex analysis and automation that will transform how we think about solving hard problems.

Medicine, for example, is one field where rich ontologies are propelling rapid innovation. Genetics in particular benefits from a gene ontology, which is a controlled vocabulary of terms that describe gene products and their characteristics as well as annotated data about those gene products. One can imagine all the ways this is accelerating research and discovery efforts around gene-based therapies.

For this reason and many others, successful parametric data extraction efforts must begin with the development of an ontology to guide the project, which is what we do at Finch.

Our Approach

We start with manual ontology creation and curation. This involves careful manual review of our training data corpus. We curate a small set of very high-quality samples to use in a chain-of-thought (CoT) prompt, which we then use in a hybrid few-shot/chain-of-thought prompt with a pre-trained LLM to generate a moderate number of synthetically generated examples. The synthetic data is then manually reviewed and used to fine-tune an LLM, which is then used to generate a very large set of synthetic data examples. After a final round of manual review and curation, we use this largest set of synthetic data to fine-tune our final model for iterative evaluation and deployment.

Finally, the custom LLM is put into production to generate JSON’s of parametric data, including entities with associated values and units, and relationships between the various entities identified.

From JSON to Knowledge Graph

From there, we take the JSON outputs and create a knowledge graph. Knowledge graphs, as they are commonly understood, are a collection of connected pieces of information and entities that give you context about those entitiesand enable teams to perform all sorts of data integration, analytics and sharing tasks.

There are enormous benefits to this structure for analysis. Because they enable entity linking, we can perform complex searching, recommending, and question-answering tasks. They’re also great for semantic understanding and analysis, and they can enhance our understanding, reasoning, and decision-making within a given domain because they provide an interconnected view of information.

A Parametric Data Extraction Evaluation Framework

The true test of how any parametric data extraction effort is working involves measuring the performance of its models. To do this, we focus on a few key dimensions. They include classification metrics such as precision, recall, and F1 score calculated for multiple levels in the ontology structure, as well as the local separation, distant separation, and the interlacing of classified observations. It’s vital to perform an in-depth error analysis based on parametric entity type, to ensure consistent performance and reduce model bias. Unlike traditional machine learning evaluations, it can be particularly challenging to devise an intelligent stratification strategy to ensure that document chucks in each train, test, and validation corpus contains a relatively similar distribution of entity and attribute types.

Parametric data extraction is a complex, but incredibly valuable undertaking. It’s also a very clear-cut example of how AI can, quite literally, unlock insights otherwise trapped in data.

You can learn more about how we use AI for all sorts of information extraction and knowledge discovery by visiting www.finchai.com.

###