
DataHub Hands-On Part II. Data Ingestion, Data Discovery and… | by Dr. Simon J. Preis | Jan, 2023



Photo by Duy Pham on Unsplash

Welcome to part II of my DataHub Hands-On story! In the previous part, we discussed how to set up a data catalog from scratch, how to do semantic modelling with business terms, and what the motivation for Data Governance (DG) is in the first place. In this part, we will take a closer look at the more technical parts of DataHub in order to ingest and link metadata.

The data ingestion (DI) section of DataHub can be accessed via the top right menu bar:

Source: Author

Once we have entered the DI area, we see a list of already configured data sources, including an overview that shows execution statistics. From there we can run a configured DI process, edit a configuration, delete or copy a configuration, or create a new source.

Source: Author

There is also a “Secrets” button that can be used to configure passwords for accessing data sources. This feature stores the passwords securely in the DataHub database. A secret can be associated with a data source during the source configuration.

Important Terms

To understand how the DI configuration works, let’s quickly define a few important terms. As briefly discussed above, there are sources for metadata. Sources refer to the databases — typically from a production environment — that we want to manage in our data catalog. Besides sources, DataHub also provides the concept of sinks. Sinks are destinations for metadata. By default, the DI engine sends the collected metadata to the “datahub-rest” sink to make it available in the DataHub UI. However, there are also options for Kafka and file export, if required. Finally, we have recipes: configuration files that tell the ingestion scripts where to pull the metadata from and where to put it. Recipes can be configured through the UI, if an adapter is available, or via a YAML script [1].
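
To make these terms concrete, here is a minimal recipe sketch for a MySQL source pushing to the default “datahub-rest” sink. All host names, database names and secret names are placeholders, and the exact configuration keys can vary slightly between DataHub versions, so treat it as an illustration rather than a copy-paste template:

```yaml
# Minimal ingestion recipe sketch (placeholder values; keys may differ per DataHub version)
source:
  type: mysql                          # which adapter/connector to use
  config:
    host_port: "mysql.example.com:3306"
    database: "northwind"
    username: "datahub_reader"         # ideally a dedicated read-only user
    password: "${MYSQL_PASSWORD}"      # references a Secret configured in the UI

sink:
  type: datahub-rest                   # default sink: push metadata to the DataHub service tier
  config:
    server: "http://localhost:8080"
```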

Data Sources and Adapters

Now, let’s have a look at the pre-configured database adapters (a.k.a. connectors) that DataHub provides out of the box:

Source: Author

We see a number of established SQL and NoSQL technology icons. To understand that this is not just for marketing purposes, let’s compare two selected adapters. On the left side of the following image we see an ingestion recipe for BigQuery, on the right side a recipe for SQL Server. The ingestion UI obviously expects different information to establish a database connection, depending on the technology.

Source: Author
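
As a rough, hedged sketch of why the two forms differ (the field names below are indicative only and depend on the adapter version), the SQL Server recipe asks for classic host and credential details, while the BigQuery recipe is organised around a GCP project and a service-account credential. The BigQuery part is commented out because a recipe holds a single source:

```yaml
# SQL Server source: classic host/credential connection details (placeholder values)
source:
  type: mssql
  config:
    host_port: "sqlserver.example.com:1433"
    database: "sales"
    username: "datahub_reader"
    password: "${MSSQL_PASSWORD}"

# BigQuery source (a separate recipe): project- and credential-oriented configuration
# source:
#   type: bigquery
#   config:
#     project_id: "my-gcp-project"     # project to scan (assumed field name)
#     credential:                       # service-account key details (assumed structure)
#       client_email: "svc@my-gcp-project.iam.gserviceaccount.com"
#       private_key: "${BQ_PRIVATE_KEY}"
```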

Most of the data sources follow a pull-based integration using a Python-based metadata-ingestion system. In a first step, the metadata is pulled from the source systems; in a second step, the information is pushed to the DataHub service tier either via HTTP or Kafka [2]. Each source adapter has different capabilities, hence reading the online documentation is recommended to avoid confusion. For instance, the MySQL and SAP HANA adapters enable data profiling and detection of deleted entities, which is not supported by the MongoDB or Oracle adapters. In contrast, the BigQuery adapter supports these features as well as table-level lineage and a few more. To understand why the adapter capabilities differ, we just need to visit GitHub and view the code for each adapter. Reviewing the entire adapter logic is beyond the scope of this story, but we can see, for instance, that SQLAlchemy is used to connect to MySQL sources, whereas the BigQuery adapter applies the standard connection package provided by Google. So let’s conclude that each adapter is implemented independently and potentially uses different connection packages. This, and the fact that the database technologies themselves differ in the metadata they can provide at all (e.g. there is no foreign key concept in MongoDB), has implications for the ingestion capabilities of an adapter.

Finalizing the Recipe

There is also a filter section that helps to narrow down the schemas or tables to ingest. If no filter is set, the DI engine scans the metadata of all schemas, tables and views that the configured database user has access to — this should be considered to avoid performance impacts on both the scanned database and the ingestion process. From a security perspective, it is rather recommended to restrict the user privileges directly at the database (e.g. by creating a dedicated read-only DataHub user). You can toggle advanced settings in an additional section, e.g. whether only tables, only views, or both should be considered for scanning. Also important: you can select whether tables and columns should be profiled (e.g. how many records are in a table, how many missing values are in a column). This is useful to gain transparency over the company’s data landscape.
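
In the YAML form of a recipe, the filters and the profiling switch roughly correspond to the following keys under the source config (a sketch with placeholder regex patterns; the available options differ per adapter and DataHub version):

```yaml
source:
  type: mysql
  config:
    host_port: "mysql.example.com:3306"
    username: "datahub_reader"
    password: "${MYSQL_PASSWORD}"
    schema_pattern:              # restrict scanning to selected schemas
      allow:
        - "northwind"
    table_pattern:               # regex allow/deny lists for tables
      deny:
        - ".*_tmp"
    include_tables: true         # advanced settings: scan tables and/or views
    include_views: true
    profiling:
      enabled: true              # table/column profiling (record counts, NULL counts, ...)
```

If you prefer to work outside the UI wizard, such a recipe file can also be executed from the command line with `datahub ingest -c recipe.yaml`.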

Once the recipe is configured, you can specify the execution schedule. Here, you define the frequency (e.g. hourly, daily, monthly) of the DI execution. In the final step of the creation wizard, you have to assign a unique name to the new source and click “save & run”, which directly triggers the first metadata scanning process.

Source: Author

Once the process has been executed, you can view the details in order to get a first impression of the ingested assets, or to see details on the failures. In my test case, we have a MySQL database server with 78 views and 38 tables from 4 databases.

Source: Author

Searching and filtering

Once our metadata has been pushed to the DataHub service tier, we can start our data discovery journey through the DataHub UI. As a fan of the 90s, I use the “Northwind” database from Microsoft in my test case. If I want to explore the schema, I can scroll through the tables and views or I can apply some filters (e.g. to exclude views).

Source: Author

Table Profiling Results

Let’s assume I want to have a closer look at the product data; I simply click on the product dataset in that list. The first information I see is a table schema overview that shows the fields, their data types and even some data model details such as primary and foreign keys. At the top (under the dataset name), we also see at a glance that there are currently 79 records in our table.

Source: Author

The second tab refers to documentation, which is empty by default. Here, data custodians could add technical knowledge about that table in DataHub and/or add links to already existing material. It is an important tool to foster the data architecture maturity of a company — the data catalog is only as helpful as the knowledge shared and entered by experts. The dataset has a few more interesting tabs such as lineage and validation, which we will cover in part III — stay tuned 🙂

Column Profiling Results

Now, I want to show what the DI engine provides because we selected the “column profiling” option in our recipe earlier. For that purpose, let’s open the tab called “Stats”. Here we see statistical information for each column of the selected table.

Source: Author

For instance, we see in the first row that the field “ProductID” has 100% distinct values. This finding is not surprising, because this field is the primary key of the table — it is technically enforced that the values differ per record. We also see NULL count and NULL percentage information; e.g. we see that two product records do not have a supplier reference. This can be considered a critical data quality issue, because without an associated supplier, you cannot physically provide those products to customers.

What else do we have? We see descriptive measures such as min, max and mean, which are more interesting for transactional tables with a higher number of records. Still, we see at a glance some statistical measures regarding the unit price of our product portfolio (e.g. half of our products cost less than $19.45 → median). In addition, we see sample values for each column in our dataset.

Entity linkage

Sure, the data discovery was already insightful and fosters transparency, especially if you have never experienced such a feature in practice. However, the full power of a data catalog is only unleashed when we bring the IT world and the business world together. Sounds great — but what does that mean? Well, the “IT world” in this sense refers to all the technical information around the datasets. The “business world” is what we have discussed in the previous part: all the business information we maintain in the glossary. Looking at real companies, it must be emphasized that (especially in bigger companies) there are different groups of people responsible for datasets and for the glossary. On the one hand, we have data custodians, who are experts on technical topics, e.g. developing and improving databases. On the other hand, we have data stewards, who are experts on business topics, e.g. how do we indicate which product can be manufactured at which site under which conditions — and which data is required to decide that. So when we talk about linking entities, we implicitly mean connecting people!

Big words! So how does that work with DataHub? Let’s continue with our dataset “products”, which reflects a physical database table. This is our starting point in DataHub. In the screenshot below we see several highlighted elements, which refer to multiple types of links to the “business world” that can be used jointly.

  1. We can add a glossary term to a table column. This link indicates that raw data of a certain term can be found in this column, e.g. the actual product names.
  2. We can add a glossary term to the whole table. This link indicates that a certain term is rather complex and described by multiple attributes that can be found in that table.
  3. We can add a domain to which the dataset belongs. This link is similar to the linkage of glossary terms to a domain (see part I for more information).
Source: Author

People

However, we still don’t have people in our model. So let’s add an owner for the dataset “products”:

Source: Author

DataHub suggests that this person should be a “technical owner”, which is similar to the role of a “data custodian” in other DG frameworks. Now we need to add people to the associated glossary terms. We can simply click on the “Product” term and DataHub navigates us to the glossary record. There, we can also add owners. For instance, we can add Brian as “data steward” and Lilly as “business owner”. If we hover over a name, we see the role assigned to it. DataHub allows adding the same person multiple times — this can be useful if a person has multiple roles (e.g. business owner + data steward). Unfortunately, the tool also allows adding the same combination of person and role multiple times, which is not useful.

Source: Author

Now let’s assume that Brian is new at the company and wants to change something in the product master data table, so he needs to find the right IT expert to clarify the implementation details. In this case, he simply clicks on the “products” dataset, which is visible in the “Related Entities” section (see screenshot above). He is then forwarded to the dataset (which we have seen further above) and can see: my IT counterpart is John! And of course, other people in the organization who need support or have change requests for product data can use this linked information as well — they can simply search for “product” and need only two clicks to find the right data stakeholders (Lilly, Brian and John).

We have seen that setting up a data ingestion process is done within a few minutes, as DataHub provides a configuration wizard for a number of established database technologies. We have also seen that data discovery is a powerful feature to gain transparency over a company’s data architecture. The profiling of tables and columns is performed automatically, and the statistical results can be viewed in the UI. Linking datasets and glossary terms or adding owners is also very simple; however, we have seen that consistency checks are not always performed by DataHub. We can add the same owner with the same role multiple times, but the tool at least prevents us from adding the same glossary term multiple times to a dataset. It is also possible to assign the same glossary term to multiple columns in the same table — maybe there are scenarios in practice where this flexibility is useful. In general, my practical advice is that — at least within one table — there should be a 1:1 relationship between columns and terms. If not, you should review the level of granularity in your glossary — or you have a problem in your physical data model.

Again — well done if you made it this far! In the next and final part, we will dive deeper into data quality management with data lineage and data validation. Maybe I will write a further story to discuss the DG role concept of DataHub in comparison with other DG frameworks.

[1] https://datahubproject.io/docs/metadata-ingestion

[2] https://datahubproject.io/docs/architecture/metadata-ingestion

