The Hidden Complexity of Datasets

The digital landscape of where and how work gets done is becoming more complex. We are creating exponentially more data across countless digital surfaces, and getting all of that data together into one place is no small ask.

My role at Indeed has focused on making datasets easier to create. At its core, it sounds like a simple task:

  1. Find some data
  2. Load the data
  3. Use the data
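
In code, the naive version of those three steps might look something like this. It is only a sketch using pandas; the file path and the "location" column are placeholders invented for illustration:

```python
import pandas as pd

# 1. Find some data -- here, a CSV that happens to live on the local machine
#    (the path is a placeholder for wherever your data actually is).
source_path = "job_postings.csv"

# 2. Load the data into a DataFrame.
postings = pd.read_csv(source_path)

# 3. Use the data -- for example, count postings per location.
print(postings.groupby("location").size())
```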

However, this framing hides some key questions about how datasets are actually created.

Where does this data actually come from?

The data is likely sourced from any number of systems and applications, both across the internet and locally. It can also include unstructured data such as event logs, messages, audio, pictures, and so on.

How can I access this data from my computer?

The data has likely traveled a long way to reach you. The most common way for users to access central datasets would be through cloud storage. Think Google Drive, but MUCH bigger. AWS, Google Cloud Platform, and Microsoft Azure are some of the biggest names in the game.
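
As a rough sketch, pulling a file down from one of these services might look like the following, using AWS S3 through boto3. The bucket and key names are invented for illustration, and you would need credentials configured for this to run:

```python
import boto3

# Connect to the cloud object store (S3 in this example).
s3 = boto3.client("s3")

# The bucket and key are placeholders for wherever the central dataset lives.
response = s3.get_object(
    Bucket="example-central-datasets",
    Key="postings/2024/part-0.parquet",
)
data_bytes = response["Body"].read()

print(f"Downloaded {len(data_bytes)} bytes from cloud storage")
```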

How do I know this data is useful?

Most datasets have validation checks to ensure specific quality standards are met. In the data mesh model, the team that creates the dataset would also be responsible for validating it.
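
A validation check can be as simple as asserting a few expectations about the data. A hypothetical sketch, with the column names invented for illustration:

```python
import pandas as pd

def validate_postings(postings: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    if postings.empty:
        problems.append("dataset is empty")
    if postings["job_id"].duplicated().any():
        problems.append("duplicate job_id values found")
    if (postings["salary"] < 0).any():
        problems.append("negative salaries found")
    return problems
```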

What if I need data from different sources?

If the data is not stored in one database, this can be quite a challenging task. Your best bet is likely to combine the data into a single dataset through a process known as Extract, Transform, and Load (ETL).
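
As a toy example, an ETL step that combines a database export with an event log might look like the sketch below. The file names, columns, and join key are all assumptions made for illustration:

```python
import pandas as pd

# Extract: pull raw data from two different sources.
postings = pd.read_csv("postings_export.csv")            # e.g. a database export
clicks = pd.read_json("click_events.json", lines=True)   # e.g. an event log

# Transform: aggregate the clicks and join them onto the postings.
clicks_per_posting = (
    clicks.groupby("job_id").size().rename("click_count").reset_index()
)
combined = postings.merge(clicks_per_posting, on="job_id", how="left")

# Load: write the combined result to a single dataset users can query.
combined.to_parquet("combined_postings.parquet")
```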

What if this dataset is out of date?

Most datasets aren't created once and then useful forever. They need to be periodically updated so that the information they contain remains relevant. There needs to be a process in place to recreate the dataset at a particular cadence.
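
In its simplest form, that process is just a schedule. The sketch below is a stand-in for what a real scheduler (cron, or an orchestration platform) would do; the rebuild function is a placeholder:

```python
import time
from datetime import datetime

def rebuild_dataset() -> None:
    # Placeholder for the actual extract/transform/load steps.
    print(f"{datetime.now().isoformat()} rebuilding dataset...")

# Recreate the dataset once a day, forever. Real systems hand this cadence
# to a scheduler rather than a sleeping loop, but the idea is the same.
while True:
    rebuild_dataset()
    time.sleep(24 * 60 * 60)
```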

Data Orchestration

The answers to these questions boil down to a common solution: an application that knows where all the data is and where to put it, how to combine varied and complex types of data, and when to do it. This application or service is typically referred to as a data orchestration platform.

The need for a centralized orchestration platform becomes vital as the quantity of data and the number of datasets grow. A person may be able to manage a single dataset with a custom script, but as the number of datasets climbs into the thousands, if not millions, the task becomes impossible. The only way to ensure that data is consistently available when needed is through data orchestration.

Where

As noted above, data is being created across countless systems and digital surfaces, and all of it needs to end up somewhere users can reach it.

Any good orchestration system needs to be able to connect to all of these servers creating data. It should also have a central location to output this data. This central location will be how users access the data.
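
In practice, that knowledge often lives in configuration: a list of the sources the platform can read from, and the single central location where finished datasets are written. A hypothetical sketch, with every name and URI invented:

```python
# Hypothetical orchestrator configuration: every source the platform can read from.
SOURCES = {
    "postings_db": "postgresql://postings-host/postings",
    "click_events": "s3://example-raw-events/clicks/",
    "resume_service": "https://internal.example.com/api/resumes",
}

# The one central location users go to for finished datasets.
CENTRAL_OUTPUT = "s3://example-central-datasets/"
```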

How

It is not enough to simply know where the data is. Data engineers must give the orchestration platform instructions on which data to extract and how to transform it. As previously mentioned, this process is known as ETL, and it is commonly expressed in SQL (Structured Query Language) or one of its many dialects. Which one you use depends on the type of data storage you are working with, the format of the data, and the query engine exposing the data to the orchestrator. This process is known as "building" the data.
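
For example, a build instruction handed to the orchestrator is often little more than a SQL statement to run against whatever query engine exposes the sources. The schema, table, and column names below are invented for illustration:

```python
# Hypothetical build definition: the SQL the orchestrator runs to (re)create
# the dataset.
BUILD_SQL = """
    CREATE TABLE analytics.daily_job_clicks AS
    SELECT p.job_id,
           p.location,
           COUNT(c.click_id) AS clicks
    FROM raw.postings AS p
    LEFT JOIN raw.click_events AS c
        ON c.job_id = p.job_id
    GROUP BY p.job_id, p.location
"""

def build(run_query) -> None:
    """Run the build SQL through whatever query engine the platform exposes."""
    run_query(BUILD_SQL)
```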

Further, checks should be in place to ensure the data is correct and complete. These checks, known as "validations," must also be written and provided to the orchestration system. If a validation fails, the orchestrator must have some mechanism for reporting and remedying the problem.
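
Continuing the sketch above, a validation might run right after the build, with the orchestrator reporting any failure. None of this is a real orchestrator API; it only illustrates the shape of the mechanism:

```python
def validate(run_query) -> list[str]:
    """Return a list of problems; an empty list means the dataset passed."""
    problems = []
    # Example check: the freshly built table should not be empty.
    row_count = run_query("SELECT COUNT(*) FROM analytics.daily_job_clicks")
    if row_count == 0:
        problems.append("daily_job_clicks is empty")
    return problems

def report_failure(problems: list[str]) -> None:
    # In a real platform this might page a team, open a ticket, or trigger a retry.
    print("validation failed:", "; ".join(problems))
```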

When

Lastly, and most importantly, the data must only be built when all of its dependencies are ready. Most datasets are not created in a single step: each generally depends on dozens of other datasets, which in turn depend on dozens more, and so on. Building the data on a fixed schedule is not enough, because a schedule doesn't guarantee that the upstream data has already been built and validated.

💡
This problem is solved with event-driven architecture. Instead of a dataset being created on a schedule, it is created when certain "events" occur. These events could be the creation and validation of another dataset, a log posted by an application, or any other measurable external event.

The when is the actual orchestration component of the platform. Much like a conductor leads each musician in an orchestra to play perfectly on time, the orchestration platform keeps all the moving parts organized and working only when they are needed.
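
A minimal sketch of that event-driven idea: the orchestrator tracks which events each dataset depends on and only triggers a build once every upstream event has arrived. All of the names and event strings below are illustrative:

```python
# Each dataset lists the events it depends on; a build fires only once
# all of them have been observed.
DEPENDENCIES = {
    "daily_job_clicks": {"postings.validated", "click_events.validated"},
}

seen_events: set[str] = set()

def handle_event(event: str) -> None:
    """Record an incoming event and build any dataset whose dependencies are now met."""
    seen_events.add(event)
    for dataset, needed in DEPENDENCIES.items():
        if needed <= seen_events:
            print(f"building {dataset}: all upstream data is ready")

# The downstream dataset is built only after both upstream events arrive.
handle_event("postings.validated")
handle_event("click_events.validated")
```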

So What?

There are many, many services that provide these capabilities. However, many of them have shortcomings that companies must keep in mind to keep their data relevant:

  • Inability to access all types of data sources
  • Insufficient security practices
  • Inefficient scaling
  • Insufficient availability of the service (minimal downtime is key)
  • Lack of event-based orchestration

Most of you will never need to worry about how the data is actually created. Nevertheless, this process is vital for everyone, because we all use data whether we realize it or not. Without it, you wouldn't be able to find results on Google, map your way home, or check the weather.