Building a scalable metadata-driven data ingestion framework

Ilse Epskamp
Oct 18, 2022

A typical task for a data engineer is to pick up data, validate it, apply transformations where required (such as an aggregation, or combining the data with other sources), and finally store it in a target location such as a storage folder or a database. When you capture all these tasks in an automated flow of activities, you get an ingestion framework, allowing you to process many files without manually having to check, move and ingest each one of them. Moreover, a reusable, scalable framework enables you to handle all your data requirements efficiently and effectively. Metadata plays a key role in building any scalable ingestion framework. In this blog I will deep-dive into the concepts of such a framework and provide practical examples and key takeaways for building a metadata-driven data ingestion framework, aiming to offer food for thought on how you can make use of metadata when processing high volumes of data.

In this blog I deliberately don’t talk about tooling. The concepts of a scalable, metadata-driven data ingestion framework are tooling-independent, and your choice of tooling depends on many project- or organization-specific elements such as cost, contracts, strategy and experience.

How it started

Our development team was given the requirement to build an application that addresses the business and technical data lineage requirements of the organization. Business data lineage shows, for example, which applications in the organization consume which data, the purpose of that usage, the risks associated with it, and the connection of all these elements to different organizational aspects such as regulations, policies and the organizational structure. Technical data lineage shows how data flows from any source application to all its consuming applications, and all the data transformations that are applied along the way. To build such an application you need to integrate many data sources: data about the applications running in the organization, the datasets they produce, the consuming applications of those datasets, the IT teams working on the related applications, and all the information related to usage and transformations of data in the providing and consuming applications. Considering the number of sources we could expect, their refresh schedules (most of them daily) and the announcement that many more sources could be in scope in the near future, we knew instantly that we needed a data processing framework capable of handling all the data requirements that come with these different datasets. We designed a metadata-driven data ingestion framework: a flexible and highly scalable framework to automate your data engineering activities.

The role of metadata in your data ingestion framework

Metadata in a data ingestion framework holds the information that describes what steps to execute when processing a datafile. Introducing metadata into your data processing framework has many benefits:

- Scalability: easily increase or decrease the amount of data flowing through your system, and easily add, modify or remove features to/from your framework. It’s plug and play!

- Reusability: features are easily reused by other sources, even by other applications, because of the generic design.

- Maintainability: only one set of code to maintain, which you can apply to all (or a selection of) sources.

- Decreased development time: by writing generic workflows the amount of required custom code decreases drastically. This saves a lot of time!

Blueprint of ingestion framework and example flow

Figure 1 shows an example blueprint of an ingestion framework. In this example, validated and (if required) transformed data is ingested into a SQL DB, a NoSQL DB (Graph), both, or neither. The sample framework can apply three types of validations:

  • snapshot validation (ensure the date of the received file is the latest date);
  • rowcount anomaly detection (sanity check on the number of records in the file);
  • datatype validation (ensure the datatype of a column is the expected datatype).

In reality you could have many more validation or quality checks that you want to apply. However, not all checks apply to all sources: some sources might be eligible for more (or fewer) validation checks, depending on file-specific features like sensitivity, governance, etc.

Figure 1: Example of a data ingestion framework. Image by Ilse Epskamp.

Let’s introduce some datafiles that we want to process with the framework. The requirements for these files are as follows:

Sample data requirements for files. Table by the author.

You see that with only four files you already have many possible combinations of framework features to apply. Some files require multiple validation checks, while others require only one. Some files have transformation requirements, while others don’t. One file is ingested only in the Graph, one only in SQL, one in both and one in neither. Writing a custom flow for every source not only takes a lot of time, but also results in a lot of duplicate code: even though not all sources require exactly the same handling, there is a lot of overlap in the applied features. You want a scalable processing framework that lets you select, on the fly, only the features you want to apply to each source, without having to write every workflow by hand. A minimal sketch of such a metadata-driven driver follows below.
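
This is a sketch only; the helper names (read_metadata, run_validations, apply_transformations, ingest) are hypothetical placeholders rather than the framework’s actual API:

def process_file(filename, data):
    # The metadata decides which features run for this file.
    md = read_metadata(filename)

    # Only run the validation checks that are enabled for this source.
    if md.get("validation"):
        run_validations(data, md["validation"])

    # Apply source-specific transformations if the metadata asks for them.
    if md.get("transformations", {}).get("preprocessing_flag"):
        data = apply_transformations(data, md["transformations"])

    # Ingest into every enabled target (e.g. SQL DB, Graph DB), or none at all.
    for target, target_config in md.get("ingestion", {}).items():
        if target_config.get("ingestion_enabled"):
            ingest(data, target, target_config)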

Three pillars of a scalable ingestion framework

A scalable ingestion framework is built on three pillars:

  • data (the subject);
  • metadata (the instructions);
  • code (the execution engine).
Figure 2: Data, metadata and code drive any scalable ingestion framework. Image by Ilse Epskamp.

Practical example: datatype validation with and without metadata

Let’s get some code into the story. We will apply a datatype validation on two sample datasets, once without and once with the use of metadata, to demonstrate the role of metadata in the code. Let’s assume we have two files: an Employees file and an Accounts file.

Figure 3: Sample data Employees and Accounts. Image by Ilse Epskamp.

For the Employees file, we want to verify that the ID column is of datatype integer. For the Accounts file, we want to verify that the ACTIVE column is of type boolean. Only then do we accept the files. We could write a custom validation for both files, which could look something like this:

if file=="employees":
isInteger=False
if(df.schema("ID").datatype.typeName=="integer"):
isInteger=True
result=isInteger
if file=="accounts":
isBoolean=False
if(df.schema("ACTIVE").datatype.typeName=="boolean"):
isBoolean=True
result=isBoolean

This would work fine. However, imagine you are loading 100 files per day and you want to apply this validation check to half of them. Or imagine you want to check multiple columns for the same file. Think of the number of custom workflows you would need to write! Writing custom workflows for every scenario is not scalable and is difficult to maintain. Let’s do this again, but now with the use of metadata.

First we define metadata for both sources. In this example, each source has a dedicated JSON file* containing all the relevant metadata key-value pairs. Specifically, we add a section for validation checks, where we enable the datatype validation on a specific column, with a specific expected outcome.

*In this example the metadata for the ingestion is configured with JSON files. Alternatively you can use any format or tool that fits your application.
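
As a minimal sketch, assuming the per-source JSON files live in a single metadata directory and are named after the source, a helper to load them could look like this (the directory name is illustrative):

import json
from pathlib import Path

METADATA_DIR = Path("metadata")  # illustrative location of the per-source JSON files

def read_metadata(filename):
    # Load the metadata configuration for a source, e.g. "employees" -> metadata/employees.json
    with open(METADATA_DIR / f"{filename}.json") as f:
        return json.load(f)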

employees.json:

"filename": "employees",
"validation": {
"validateColumnDataType": {
"enabled": true,
"colname": "ID"
"datatype": "integer"
}
}

accounts.json:

"filename": "accounts",
"validation": {
"validateColumnDataType": {
"enabled": true,
"colname": "ACTIVE"
"datatype": "boolean"
}
}

Now we have the metadata ready. Next step is to write a generic function which we can call when processing a datafile.

def validate_col_datatype(data, col, expected_type):
    # Returns True if the column's datatype matches the expected type name.
    return data.schema[col].dataType.typeName() == expected_type

All that is left is to add some code which reads the metadata and executes the validation function if it’s enabled for the source.

md = read_metadata(filename)

# Run the validation only if it is enabled for this source.
if md["validation"]["validateColumnDataType"]["enabled"]:
    colname = md["validation"]["validateColumnDataType"]["colname"]
    datatype = md["validation"]["validateColumnDataType"]["datatype"]
    result = validate_col_datatype(data, colname, datatype)

If you want to run this datatype validation on multiple columns, you simply add additional sections to your metadata configuration file. If you don’t need to apply the validation check to a source, you set enabled to false, and the check will be ignored for that source.
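
As an illustration, one possible shape for a multi-column check is a list of column/datatype pairs in the metadata, looped over in the generic code. This is a sketch, not the framework’s actual configuration schema:

# Hypothetical metadata shape:
# "validateColumnDataType": {
#     "enabled": true,
#     "columns": [
#         {"colname": "ID", "datatype": "integer"},
#         {"colname": "NAME", "datatype": "string"}
#     ]
# }
check = md["validation"]["validateColumnDataType"]
if check["enabled"]:
    results = {
        col["colname"]: validate_col_datatype(data, col["colname"], col["datatype"])
        for col in check["columns"]
    }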

When using this pattern, you need to think about how you want to handle the result. Do you want to raise a warning and proceed, or break the flow if a validation fails? Do you want to automatically inform the development team or stakeholders group? Designing your workflow strategy is important to leverage the benefits such a framework can provide.
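
For example, the failure policy could itself be metadata-driven. The sketch below assumes a hypothetical on_failure setting per source ("fail" breaks the flow, "warn" logs and continues):

import logging

logger = logging.getLogger("ingestion")

on_failure = md["validation"].get("on_failure", "fail")  # hypothetical setting

if not result:
    message = f"Datatype validation failed for {filename}: {colname} is not of type {datatype}"
    if on_failure == "fail":
        raise ValueError(message)   # break the flow for this file
    logger.warning(message)         # raise a warning and proceed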

Additional features

You can think of many more features you can include in your framework. For example:

  • More validation and quality checks. For example, you can validate filesize, filetype, file origin, value format, …
  • SCD type. In each metadata configuration, specify whether you want to store the data in your database with SCD1 or SCD2, i.e. whether you want to build up and keep historical data in your databases (SCD2) or not (SCD1). Design your ingestion logic so that if SCD1 is enabled the data is overwritten in your database, and if SCD2 is enabled history is retained. All records in your databases should have attributes reflecting their version, start date, end date and date of last update. For a SQL database you can use stored procedures to update the record versions, or use any out-of-the-box feature your database type offers; alternatively, your ingestion logic can handle the record versions on the fly. A minimal sketch follows below.
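
Below is a rough sketch of how the scd_type value from the metadata could drive the ingestion step. The overwrite and versioning helpers are hypothetical placeholders for whatever your database offers (for example a stored procedure):

from datetime import date

def ingest_with_scd(data, table, scd_type):
    if scd_type == "1":
        # SCD1: overwrite the current records, no history is kept.
        overwrite_table(table, data)                                  # hypothetical helper
    elif scd_type == "2":
        # SCD2: close the current versions and insert new ones with version and start date.
        close_current_versions(table, data, end_date=date.today())    # hypothetical helper
        insert_new_versions(table, data, start_date=date.today())     # hypothetical helper
    else:
        raise ValueError(f"Unknown SCD type: {scd_type}")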

Metadata configuration file blueprint

Below is an example of a metadata configuration file for the sample Employees file, with generic file details and sections for validation, transformations and ingestion.

"description": "This is the metadatafile for Employees data.",
"sourcename": "HRSYSTEM",
"tablename": "all_employees",
"cardinality": "1-to-1",
"primary_key": "EMPLOYEE_ID",
"daily_delivery": true,
"file_delimiter": ";",
"validation"": {
"validate_snapshotdate" : true,
"validate_rowcount_anomaly" : false,
"validate_col_datatype": {
"enabled": true,
"colname": "EMPLOYEE_ID"
"datatype": "integer"
}
},
"transformations": {
"preprocessing_description": "Join with file XYZ.",
"preprocessing_flag": true,
"preprocessing_logic": "join_HRSYSTEM_all_employees_with_XYZ()"
},
"ingestion": {
"sqldb": {
"ingestion_enabled": true,
"ingestion_environment": "dtap",
"scd_type": "2",
"sql_tablename": "employees"
},
"csmsdb": {
"ingestion_enabled": true,
"ingestion_environment": "dtap",
"scd_type": "1",
"vertex_label": "EMPLOYEES",
"is_mapping_file": false
}
}
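
The preprocessing_logic value above is just a string. One possible way to turn it into executable code, sketched here as an assumption rather than the framework’s actual mechanism, is a small registry of project-specific transformation functions; a registry lookup avoids eval-ing strings from configuration:

def join_HRSYSTEM_all_employees_with_XYZ(data):
    # Project-specific join logic would live here; a stub for illustration.
    return data

# Registry mapping preprocessing_logic strings from the metadata to actual functions.
TRANSFORMATIONS = {
    "join_HRSYSTEM_all_employees_with_XYZ()": join_HRSYSTEM_all_employees_with_XYZ,
}

def apply_transformations(data, transformations_config):
    if transformations_config.get("preprocessing_flag"):
        func = TRANSFORMATIONS[transformations_config["preprocessing_logic"]]
        return func(data)
    return data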

Takeaways

When designing and building a metadata-driven data ingestion framework, I suggest taking at least the following points into account:

1. Don’t cut corners. We all know the pressure you can experience when fixing bugs or writing solutions under tight timelines. For many reasons it can be tempting to focus only on the problem at hand without considering the generic implementation. However, spend that little extra effort to make your workflow generic and reusable. You will definitely benefit from it later!

2. Think about your naming conventions. Your framework should be able to dynamically find all kinds of resources, from storage accounts and their directories and files, to databases and their table names. Think about this and define a standard.

3. Framework monitoring. Automation is great, but stay in control. Find a way to stay on top of everything that is happening in your framework. For example, we built a monitoring layer on top of the framework that logs every decision and every outcome to a centralized event log, which we use to build operational reports. A minimal sketch of such an event logger follows after this list.

4. Keep it requirements-driven. Once you start you will think of many features that might be of use. Keep it requirements-driven, to make sure that any feature you add to the framework directly contributes to your customers or to your development team.
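
As mentioned in point 3, a centralized event log can be as simple as an append-only file of structured records. Here is a minimal sketch assuming a JSON-lines file; in practice this could be a database table or a logging service:

import json
from datetime import datetime, timezone

def log_event(source, step, outcome, details=None):
    # Append one structured record per framework decision or outcome.
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,        # e.g. "employees"
        "step": step,            # e.g. "validateColumnDataType"
        "outcome": outcome,      # e.g. "passed", "failed", "skipped"
        "details": details or {},
    }
    with open("event_log.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")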


Ilse Epskamp

Azure Certified IT Engineer with 9+ years of experience in the banking industry. Focus areas: Azure, Data Engineering, DevOps, CI/CD, Automation, Python