Modern or Not: What Is a Data Stack? | by Marie Lefevre | May, 2022


The “Modern Data Stack” Explained Simply…as a Stack

Not a data stack — Photo by Andrew Draper on Unsplash

Last year, as I found myself looking for a job with a broader scope of action than my position at the time (Operations Data Analyst), I started to get interested in topics around data architecture at companies. In particular, I wanted to better understand how data is structured and managed beyond my own scope of data analysis.

This is how I entered — still ongoing — discussions about “the modern data stack”:

What is it?

Should we try to “bundle it” or to “unbundle it”?

Who are all these data companies raising millions of dollars?

And many more questions around this topic that I don’t always fully comprehend.

For us data analysts, the major part of our role consists of turning the needs of business users into actionable data to help them make better decisions. To do so, our daily tools are focused on data analysis (e.g. any SQL console, such as Google BigQuery’s console) and data visualization (e.g. Google Data Studio).

On the other side of the “data spectrum” at companies, the raw extraction of data and its transformation to make data available and usable to data analysts are more the responsibility of data engineers. At least this is how I used to see it.

Broadening my “data architecture” horizon — Photo by Nurlan Tortbayev from Pexels

To get a broader perspective on what a data stack is, I conducted my own research. In this article I want to share with you the output of this reflection. The goal here is to keep it simple, so the following explanations should be seen as a “where to start” guide for exploring the concept of the modern data stack.

From a personal point of view, I regularly come back to the following two diagrams to elaborate on a specific point or to onboard colleagues on the broad topic of data architecture:

  1. Core and foundational blocks of the modern data stack
  2. Examples of companies in the modern data stack

First of all, what is called a “data stack” in a business context is the combination of multiple technologies that allow companies to make use of data for their decision making.

The addition of the adjective “modern” refers to developments in recent years, in particular:

  • the rise of cloud platforms that offer cheaper and more flexible pricing for storing large volumes of data
  • the emergence of new data companies that offer a higher level of expertise in a specific part of a company’s data stack

These trends imply two main changes that are commonly understood under the term “modern data stack”. First, cheaper and more efficient storage solutions for high volumes of data tend to favor a shift from ETL (Extract > Transform > Load) towards ELT (Extract > Load > Transform): raw data is loaded into the warehouse first and only transformed there afterwards.

Second, “all-in-one” solutions tend to have a worse price-quality ratio than a combination of several specialized data tools, because their approach is more generalist than that of these emerging data companies. This means that a business will need to combine several data technologies if it chooses not to go for an “all-in-one” solution.
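
To make the shift from ETL to ELT concrete, here is a minimal sketch of an ELT flow in Python. SQLite stands in for a cloud warehouse so the example stays self-contained, and the table names and the fetch_orders() helper are hypothetical.

```python
# Minimal ELT sketch: load raw data first, transform afterwards inside the "warehouse".
# SQLite stands in for a cloud warehouse; fetch_orders() is a hypothetical extractor.
import sqlite3

def fetch_orders():
    # Hypothetical extraction step (in practice: an API call, a file drop, etc.)
    return [("2022-05-01", "FR", 120.0), ("2022-05-01", "DE", 80.0)]

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (order_date TEXT, country TEXT, amount REAL)")

# E + L: load the raw rows as-is, with no transformation on the way in
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", fetch_orders())

# T: transform afterwards, directly in the warehouse, with SQL
conn.execute("""
    CREATE TABLE IF NOT EXISTS daily_revenue AS
    SELECT order_date, country, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY order_date, country
""")
conn.commit()
conn.close()
```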

To perform all the necessary tasks for data management at a company, any data stack must include the following blocks:

Figure 1 — Core and foundational blocks of the modern data stack

Let’s start with the 4 components of the core data stack:

1. Extract

To get started, data from a variety of sources must be extracted. This can be done via scripts written in Python, for example, or via native connectors offered by service providers. The best-suited solution will depend on the technological choices made, but also on the way raw data can be extracted: via API (probably the easiest way), via secure file transfer protocol (SFTP), via web scraping, etc.
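
As an illustration, here is a minimal sketch of such an extraction script in Python, pulling records from a hypothetical REST API with the requests library and writing them to a local file. The endpoint, token and pagination scheme are assumptions made for the example, not a real service.

```python
# Minimal extraction sketch: pull raw records from a (hypothetical) REST API
# and dump them to a local JSON file, ready to be loaded into the warehouse.
import json
import requests

API_URL = "https://api.example.com/v1/orders"  # hypothetical endpoint
API_TOKEN = "my-secret-token"                  # in practice, read from the environment

def extract_orders(since: str) -> list:
    """Fetch all orders created since a given date, following simple page-based pagination."""
    records, page = [], 1
    while True:
        response = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            params={"since": since, "page": page},
            timeout=30,
        )
        response.raise_for_status()
        batch = response.json()
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records

if __name__ == "__main__":
    orders = extract_orders(since="2022-05-01")
    with open("raw_orders.json", "w") as f:
        json.dump(orders, f)
```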

2. Load

The data extracted should then be stored in the appropriate infrastructure. In this block, companies can use data lakes and/or data warehouses to load data before the next step. Interestingly, some providers like Databricks or Snowflake tend to offer a third way with innovative data platforms that combine the advantages of both the data lake (unstructured and large data) and the data warehouse (structured and curated data).
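
As a rough illustration of the loading step, the sketch below appends the extracted file to a warehouse table with pandas and SQLAlchemy. The connection string, schema and table names are placeholders; in practice you would use the connector or SQLAlchemy dialect of your chosen warehouse.

```python
# Minimal load sketch: push the extracted raw data into a warehouse table as-is.
# The connection string, schema and table names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

# Replace with the connection string of your own warehouse (and its SQLAlchemy dialect)
engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")

raw_orders = pd.read_json("raw_orders.json")

# Append into a "raw" schema; transformations come later (ELT)
raw_orders.to_sql("raw_orders", engine, schema="raw", if_exists="append", index=False)
```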

3. Transform

Once data has been made available in one central place (there can be several places in a more complex data architecture), it needs to be transformed before it can be exploited for further analyses. This is where data is processed through several layers with two main goals: making data clean (e.g. removing wrong values, standardizing formats) and making data usable for step 4 (e.g. by consolidating data from multiple sources and aggregating it).
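
In an ELT setup this processing typically happens inside the warehouse itself, as plain SQL, often orchestrated by a tool such as dbt. The sketch below runs one hypothetical cleaning-and-aggregation query against the warehouse connection used above; table and column names are made up for the example.

```python
# Minimal transform sketch: clean and aggregate raw data directly in the warehouse.
# Table and column names are hypothetical; a tool such as dbt would normally manage this SQL.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")

transform_sql = text("""
    CREATE TABLE IF NOT EXISTS analytics.daily_revenue AS
    SELECT
        CAST(order_date AS date) AS order_date,    -- standardize formats
        UPPER(country)           AS country,
        SUM(amount)              AS revenue        -- aggregate for analysis
    FROM raw.raw_orders
    WHERE amount IS NOT NULL AND amount >= 0       -- filter out obviously wrong values
    GROUP BY 1, 2
""")

with engine.begin() as conn:  # begin() opens a transaction and commits on success
    conn.execute(transform_sql)
```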

4. Leverage

What would data be worth if it were never used? This is where data outputs are produced, hence this is the most visible part of a data stack for stakeholders outside the tech team. Data outputs can be reports and interactive dashboards, but also ad-hoc analyses, data discovery tools, etc.
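
To give a simple idea of what this block produces, here is a minimal sketch that reads the transformed table and exports a small report that a dashboard or a business stakeholder could consume. Table and column names are the hypothetical ones used in the previous sketches; a BI tool would usually query the warehouse directly instead.

```python
# Minimal "leverage" sketch: query the transformed layer and export a simple report.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")

report = pd.read_sql(
    "SELECT order_date, country, revenue FROM analytics.daily_revenue ORDER BY order_date",
    engine,
)

# A BI tool (Data Studio, Looker, Tableau...) would normally query the warehouse directly;
# exporting a CSV is just the simplest possible "data output".
report.to_csv("daily_revenue_report.csv", index=False)
print(report.groupby("country")["revenue"].sum())
```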

These four blocks constitute the core elements of a modern data stack. However, they would be of little use if the two foundational elements were absent from the data stack:

A. Store

In the “load” and “transform” blocks we assumed that the infrastructure used to store and transform data simply worked. This will only be the case if storage is configured properly and adapted to the needs of the whole data stack, in terms of volume, refresh frequency, types of usage, etc.
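
What “configured properly” means depends heavily on the warehouse, but partitioning large tables by date is a common example of adapting storage to volume and usage patterns. The sketch below shows what that could look like with PostgreSQL-style declarative partitioning; the table names are the hypothetical ones from the previous examples, and other warehouses (BigQuery, Snowflake...) have their own equivalents.

```python
# Minimal storage-configuration sketch: partition a large raw table by date (PostgreSQL syntax).
# Table names are hypothetical; cloud warehouses offer their own partitioning/clustering options.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")

with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS raw.raw_orders_partitioned (
            order_date date,
            country    text,
            amount     numeric
        ) PARTITION BY RANGE (order_date)
    """))
    # One partition per month keeps queries on recent data cheap as volume grows
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS raw.raw_orders_2022_05
        PARTITION OF raw.raw_orders_partitioned
        FOR VALUES FROM ('2022-05-01') TO ('2022-06-01')
    """))
```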

B. Govern

When building a data stack, we tend to focus on the construction part and to forget about the maintenance part, as is likely the case in many fields of application. Nevertheless, data governance tools and best practices are key to maintaining data quality across all the blocks mentioned above and to ensuring that data is treated correctly from its source to its final destination.
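
Data governance covers many practices (documentation, access control, lineage, quality monitoring). As one small, concrete illustration, the sketch below runs a few quality checks on the hypothetical table used throughout this article; dedicated tools such as Great Expectations or dbt tests cover the same ground more systematically.

```python
# Minimal governance sketch: a few data-quality checks on the transformed table.
# Table and column names are the hypothetical ones used in the previous examples.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")
df = pd.read_sql("SELECT order_date, country, revenue FROM analytics.daily_revenue", engine)

checks = {
    "no null revenue": df["revenue"].notna().all(),
    "no negative revenue": (df["revenue"] >= 0).all(),
    "no duplicate (date, country) rows": not df.duplicated(subset=["order_date", "country"]).any(),
    "data is fresh (updated in the last 3 days)":
        (pd.Timestamp.today() - pd.to_datetime(df["order_date"]).max()).days <= 3,
}

failed = [name for name, ok in checks.items() if not ok]
if failed:
    raise ValueError(f"Data quality checks failed: {failed}")
print("All data quality checks passed.")
```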

Without launching a discussion about which tool is best suited for each and every use case, I want to present some examples of technologies and data companies that correspond to each block.

When doing your own research and benchmarking solution providers, this view should allow you to better categorize them. Feel free to supplement the following template and adapt it to your own case:

Figure 2 — Examples of companies in the modern data stack

While all blocks will be necessary to form your company’s data stack, some trade-offs must be made to choose the best-suited combination of technologies. To help you do so, here are some key questions that you should ask yourself — and obviously the relevant stakeholders — when drawing your target data stack:

  • Where do you want to position yourself between “all in one place” and “one tool per specific task”?
  • Where do you want to position yourself between “doing it on your own” (possibly using open-source technologies) and “delegating implementation to providers” (possibly involving vendor lock-in)?
  • What are your internal human capabilities, in terms of availability as well as competencies?
  • What budget do you have?
  • What are your constraints in terms of time?

