
What Makes a Data Catalog “Smart”? #3 – Metadata Management

Actian Corporation

February 16, 2022


A data catalog harnesses enormous amounts of very diverse information, and its volume will grow exponentially. This raises two major challenges:

How to feed and maintain the volume of information without tripling (or more) the cost of metadata management?
How to find the most relevant datasets for any specific use case?

To answer these two questions, a data catalog must be Smart, with technological and conceptual features that go beyond the mere integration of AI algorithms.

In this respect, we have identified five areas in which a data catalog can be “Smart”, most of which do not involve machine learning:

Metamodeling
The data inventory
Metadata management
The search engine
User experience

It is in the field of metadata management that the notion of the Smart Data Catalog is most commonly associated with algorithms, machine learning, and AI.

How is Metadata Management Automated?

Metadata management is the discipline of assigning values to the metamodel attributes of the inventoried assets. The workload required is usually proportional to the number of attributes in the metamodel and the number of assets in the catalog.

The role of the Smart Data Catalog is to automate this activity as much as possible, or at the very least to help the human operators (Data Stewards) do so in order to ensure greater productivity and reliability.

As seen in our last article, a smart connectivity layer automates the collection of part of the metadata, but this automation is restricted to a limited subset of the metamodel, mostly technical metadata. A complete metamodel, even a modest one, also has dozens of metadata attributes that cannot be extracted from the source systems' registries (because they are not there to begin with).

Several approaches can help close this gap:

Pattern Recognition

The most direct approach consists of identifying patterns in the catalog in order to suggest metadata values for new assets.

Put simply, a pattern will include all the metadata of an asset and the metadata of its relations with other assets or other catalog entities. Pattern recognition is typically done with the help of machine learning algorithms.
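As an illustration only, one of the simplest ways to turn patterns into suggestions is a nearest-neighbor vote: find the already documented assets most similar to the new one and propose the metadata value they most often carry. The sketch below assumes each asset has already been reduced to a numeric feature vector. The function names, the `domain` attribute, and the sample vectors are hypothetical and do not describe Zeenea's actual algorithms.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_value(new_vector, documented_assets, attribute, k=3):
    """Suggest a value for `attribute` by majority vote among the k most similar assets.

    Returns (value, share_of_votes), or (None, 0.0) when the catalog is empty.
    """
    if not documented_assets:
        return None, 0.0  # no pattern to recognize yet
    ranked = sorted(
        documented_assets,
        key=lambda asset: cosine(new_vector, asset["vector"]),
        reverse=True,
    )
    neighbors = ranked[:k]
    votes = Counter(asset["metadata"][attribute] for asset in neighbors)
    value, count = votes.most_common(1)[0]
    return value, count / len(neighbors)


# Hypothetical catalog: feature vectors plus an already filled-in attribute.
catalog = [
    {"vector": [0.9, 0.1, 0.0], "metadata": {"domain": "Finance"}},
    {"vector": [0.8, 0.2, 0.1], "metadata": {"domain": "Finance"}},
    {"vector": [0.1, 0.9, 0.3], "metadata": {"domain": "Marketing"}},
    {"vector": [0.0, 0.8, 0.5], "metadata": {"domain": "Marketing"}},
]

suggestion, confidence = suggest_value([0.85, 0.15, 0.05], catalog, "domain", k=3)
print(suggestion, round(confidence, 2))  # Finance 0.67
```

The share of votes is returned with the suggestion so that a Data Steward can decide whether to accept it. With an empty catalog the function returns no suggestion at all.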

The difficulty in implementing this approach lies in representing the information assets in numerical form so that they can feed the algorithms and surface the relevant patterns. A simple structural analysis is not enough: two datasets can contain identical data in different structures. Comparing the data values themselves is not effective either: two datasets can hold the same kind of information with different values, for example 2020 client invoicing in one dataset and 2021 client invoicing in the other.

To solve this problem, Zeenea relies on a technology called fingerprinting. To build the fingerprint, we extract two types of features from our clients’ data:

A group of features adapted to numerical data (mostly statistical indicators).
Features derived from word embedding models (word vectorization) for textual data.

Fingerprinting is at the heart of our intelligent algorithms.
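To make the idea more concrete, here is a minimal sketch of what a column fingerprint could look like. It is not Zeenea's implementation: the choice of statistical indicators is an assumption, and the text features use hashed token counts as a simple stand-in for a real word embedding model (hashed counts capture shared vocabulary, not meaning). All function names are hypothetical.

```python
import math
import statistics
import zlib


def numeric_features(values):
    """Statistical indicators for a numeric column (an illustrative selection)."""
    return [
        statistics.mean(values),
        statistics.pstdev(values),
        min(values),
        max(values),
    ]


def text_features(values, dims=8):
    """Stand-in for word embeddings: L2-normalized hashed token counts."""
    vector = [0.0] * dims
    for value in values:
        for token in value.lower().split():
            # crc32 is deterministic across runs, unlike the built-in hash() for str.
            vector[zlib.crc32(token.encode("utf-8")) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def fingerprint(column):
    """Numeric columns get statistics, everything else is treated as text."""
    if all(isinstance(v, (int, float)) for v in column):
        return numeric_features(column)
    return text_features([str(v) for v in column])


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Invoice amounts from two different years: no value is shared,
# yet the fingerprints are very close.
amounts_2020 = [120.0, 80.5, 310.0, 95.0]
amounts_2021 = [118.0, 84.0, 305.5, 99.0]
similarity = cosine(fingerprint(amounts_2020), fingerprint(amounts_2021))
print(round(similarity, 3))

status_fingerprint = fingerprint(["Paid in full", "Payment overdue"])
print(len(status_fingerprint))  # 8
```

The two invoice columns share no values, but their statistical profiles are nearly identical, which is the property the invoicing example in the text calls for.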

The Other Embedded Approaches in a Suggestion Engine

While pattern recognition is indeed an efficient approach for suggesting the metadata of a new asset in a catalog, it rests on an important prerequisite: in order to recognize a pattern, there has to be one to recognize. In other words, it only works if the catalog already contains a sufficient number of documented assets (which is obviously not the case at the start of a project).

And it’s precisely in these initial phases of a catalog project that the metadata management load is the highest. It is, therefore, crucial to include other approaches likely to help the Data Stewards in these initial phases, when a catalog is more or less empty.

The Zeenea suggestion engine, which provides intelligent algorithms to assist the management of the metadata, also provides other approaches (which we enrich regularly).

Here are some of these approaches:

Structural similarity detection.
Fingerprint similarity detection.
Name approximation.
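Two of these approaches can be sketched with nothing but the Python standard library. The example below is an assumption about how such checks might work, not a description of the Zeenea engine: structural similarity is computed as a Jaccard index over (column name, type) pairs, and name approximation uses `difflib.SequenceMatcher`. The schemas and dataset names are made up for illustration.

```python
from difflib import SequenceMatcher


def structural_similarity(schema_a, schema_b):
    """Jaccard index over the (column name, type) pairs of two schemas."""
    a, b = set(schema_a.items()), set(schema_b.items())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def _normalize(name):
    """Lowercase the name and treat '_' and '-' as spaces."""
    return name.lower().replace("_", " ").replace("-", " ")


def name_similarity(name_a, name_b):
    """Similarity ratio in [0, 1] between two normalized asset names."""
    return SequenceMatcher(None, _normalize(name_a), _normalize(name_b)).ratio()


# Hypothetical schemas: the 2021 table has one extra column.
invoices_2020 = {
    "invoice_id": "int",
    "client_id": "int",
    "amount": "decimal",
    "issued_on": "date",
}
invoices_2021 = dict(invoices_2020, currency="string")

print(structural_similarity(invoices_2020, invoices_2021))  # 0.8
print(round(name_similarity("client_invoicing_2020", "client_invoicing_2021"), 2))
```

Neither check needs a populated catalog or a trained model, which is why approaches of this kind are useful in the early phases of a project.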

This suggestion engine, which analyzes the catalog content in order to determine the probable metadata values of the assets that have been integrated, is an ongoing subject of experimentation. We regularly add new approaches, sometimes very simple and sometimes much more sophisticated. In our architecture, it is a dedicated service whose performance improves as the catalog grows and as we enrich our algorithms.

We have chosen lead time as our main metric for the productivity of the Data Stewards (which is the ultimate objective of smart metadata management). Lead time is a notion that stems from lean management; in a data catalog context, it measures the time elapsed between the moment an asset is inventoried and the moment all of its metadata has been filled in.
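As a rough sketch of how this metric could be computed, the example below derives the lead time of one asset from the timestamp at which it was inventoried and the timestamps at which each attribute received a value. The data model (a dictionary of attribute names to timestamps, with `None` for attributes not yet filled in) and the attribute names are assumptions made for the example.

```python
from datetime import datetime


def lead_time(inventoried_at, valued_at_by_attribute):
    """Time between inventory and full documentation, or None if incomplete."""
    timestamps = list(valued_at_by_attribute.values())
    if not timestamps or any(t is None for t in timestamps):
        return None  # at least one attribute still has no value
    # The asset is fully documented when its last attribute is filled in.
    return max(timestamps) - inventoried_at


inventoried = datetime(2022, 2, 1, 9, 0)
complete = {
    "description": datetime(2022, 2, 2, 14, 0),
    "owner": datetime(2022, 2, 1, 11, 30),
    "business_domain": datetime(2022, 2, 4, 9, 0),
}
pending = {**complete, "sensitivity": None}

print(lead_time(inventoried, complete))  # 3 days, 0:00:00
print(lead_time(inventoried, pending))   # None
```

Averaging or taking the median of this value across assets would give a catalog-wide figure to track over time.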

For more information on how Smart metadata management enhances a Data Catalog, download our eBook: “What is a Smart Data Catalog?”.

About Actian Corporation

Actian makes data easy. Our data platform simplifies how people connect, manage, and analyze data across cloud, hybrid, and on-premises environments. With decades of experience in data management and analytics, Actian delivers high-performance solutions that empower businesses to make data-driven decisions. Actian is recognized by leading analysts and has received industry awards for performance and innovation. Our teams share proven use cases at conferences (e.g., Strata Data) and contribute to open-source projects. On the Actian blog, we cover topics ranging from real-time data ingestion, data analytics, data governance, data management, data quality, data intelligence to AI-driven analytics.
