
From Data Warehouses and Lakes to Data Mesh: A Guide to Enterprise Data Architecture
By Col Jung | May 2023



Understand how data works at large companies

Image: Headway (Unsplash)

There’s a disconnect between data science courses and the reality of working with data in the real world.

When I landed my first analytics job at one of Australia’s ‘Big Four’ banks half a decade ago, I was confronted by a complex data landscape characterised by…

  • Challenges in finding, accessing & using data;
  • Competing business priorities pulling people in different directions;
  • Legacy systems that are difficult to maintain & upgrade;
  • A legacy culture resistant to data-driven insights;
  • Siloed teams who don’t talk to each other, resulting in inefficiencies.

For a while, I plodded on and resigned myself to the fact that this was just the way things were in the world of enterprise data. I had faith that while our tech stack evolved at a fast pace, eventually UX would catch up…

I had trained myself in data science, but actually getting to do data science wasn’t straightforward. Online courses do not prepare you for this.

But here’s the kicker.

After some digging, I realised that my organisation wasn’t alone in facing these data challenges — they were pervasive across the industry.

We’re in a melting pot of technological innovation where things are moving at breakneck speed. Data is exploding, computing power is on the rise, AI is breaking through, and consumer expectations are ever-changing.

Everyone involved in the analytics industry is just trying to find their footing. We’re all stumbling forward together. Fail fast and fail forward.

That’s why I penned this article.

I want to share my insights and help professionals like graduates, new business analysts and self-taught data scientists quickly understand the data landscape at the enterprise level and shape expectations.

First up, let’s align on the crucial role data plays in today’s competitive fast-paced business environment.

Companies in every industry are moving towards data-driven decision-making.

At the same time, consumers increasingly expect hyper-personalised digital products & services powered by analytics like AI and machine learning, trained on all the quality data a company can muster.

How the worlds of AI & machine learning intersect with enterprise analytics. Image by author

It’s what enables you to watch personalised TV shows on demand (entertainment), order food and have it delivered within an hour (groceries & shopping), and get a pre-approved mortgage in minutes (housing).

This means a forward-thinking data stack is essential to survive and thrive, because data is the lifeblood of digital.

Or as British mathematician Clive Humby put it in 2006:

“Data is the new oil.”

IT departments and data platforms are no longer basement-dwellers — they’re now a core part of the enterprise strategy.

Because data powers everything. It is a first-class citizen.

So without further ado, let’s now dive into how data is organised, processed and stored at large companies.

Today’s landscape is divided into operational data and analytical data.

30,000-foot view of the enterprise data landscape. Source: Z. Dehghani at MartinFowler.com with amendments by author

Operational data often comes in the form of individual records that represent specific events, such as a sale, purchase, or a customer interaction, and is information a business relies on to run its day-to-day operations.

Operational data is stored in databases and is accessed by microservices, which are small software programs that help manage the data. The data is constantly being updated and represents the current state of the business.

Transactional data is an important type of operational data. I work at a large bank, where examples of transactions include:

  • money moving between bank accounts;
  • payments for goods and services;
  • a customer interaction with one of our channels, e.g. branch or online.

Transactional data that’s hot off the application is called source data, or System-of-Record (SOR) data. Source data is free of transformations and is the…

  • data format preferred by data scientists;
  • format of data ingested into data lakes;
  • beginning of any data lineage.

More on these ideas later.
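For concreteness, a raw source (System-of-Record) record for a single card payment might look something like the sketch below. Every field name is invented for illustration; real SOR layouts vary from system to system.

```python
# A hypothetical, untransformed transaction event, straight from the operational system
raw_payment = {
    "transaction_id": "txn-000123",
    "account_id": "acct-042",
    "type": "card_payment",
    "amount": 42.50,
    "currency": "AUD",
    "merchant": "Corner Cafe",
    "channel": "mobile_app",
    "timestamp": "2023-05-01T10:00:00+10:00",
}
```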

Transactional data processing systems, called Online Transactional Processing (OLTP) systems, must handle many transactions quickly. They rely on databases that can quickly store and retrieve data, and ensure the data stays accurate by enforcing rules called ACID semantics:

  • Atomicity: each transaction is treated as a single unit that either fully succeeds or fully fails;
  • Consistency: a transaction can only take the data from one valid state to another;
  • Isolation: concurrent transactions can run at the same time without interfering with each other;
  • Durability: committed changes are saved even if the system shuts down.

OLTP systems are used for important business applications that need to work quickly and accurately.

In my field of banking, these include systems that process deposits, withdrawals, transfers and balance enquiries. Specific examples include online banking systems such as web portals and mobile apps, credit and debit card authorisation systems, cheque processors and wire transfer systems that facilitate money transfers between banks.

Crucial systems that work fast and at scale!
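To make ACID semantics concrete, here is a minimal sketch of an atomic transfer between two accounts using Python’s built-in sqlite3 module. The table and account IDs are my own invention, and a real core-banking OLTP system is vastly more sophisticated; the point is simply that the two balance updates either both commit or both roll back.

```python
import sqlite3

conn = sqlite3.connect("bank.db", isolation_level=None)  # manage transactions manually
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 500.0), ('bob', 100.0)")

def transfer(conn, src, dst, amount):
    """Move money between two accounts as a single, all-or-nothing unit of work."""
    try:
        conn.execute("BEGIN")  # start an explicit transaction (Atomicity, Isolation)
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
        if balance < 0:                # Consistency: never leave an invalid state behind
            raise ValueError("insufficient funds")
        conn.execute("COMMIT")         # Durability: committed changes survive a shutdown
    except Exception:
        conn.execute("ROLLBACK")       # Atomicity: undo everything if any step fails
        raise

transfer(conn, "alice", "bob", 50.0)
print(conn.execute("SELECT id, balance FROM accounts").fetchall())
```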

Analytical data is a temporal (time-based) and aggregated (consolidated) view of a company’s operational or transactional data. This provides a summarised view of the facts of the organisation over time, designed to:

  • gain insights into past business performance (descriptive and diagnostic analytics);
  • make data-driven decisions for the future (predictive and prescriptive analytics).
From descriptive analysis to predictive modelling. Image by author

Analytical data is frequently used to create dashboards and reports (often built by data analysts) and train machine learning models (data scientists).

In short, analytical processing differs from transactional processing: the former focuses on analysing data, while the latter focuses on recording specific events.

Analytical processing typically leverages read-only systems that store vast volumes of historical data or business metrics. Analytics can be performed on a snapshot of the data at a given point in time.

Now, let’s connect the dots between operational and analytical data.

Operational data is transformed into analytical data through data pipelines, typically built by data engineers.

These are typically ETL pipelines — extracting data from operational systems, transforming it to suit business needs, and loading it into a data warehouse or data lake, ready for analysis.

ETL pipelines connect operational and analytical data stores. Image by author
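As an illustration of what such a pipeline does, here is a minimal ETL sketch in Python with pandas. The database, table, column names and file paths are assumptions for the example, not a description of any real pipeline.

```python
import sqlite3
import pandas as pd

def extract(db_path: str) -> pd.DataFrame:
    """Extract: pull raw transaction records from the operational (OLTP) database."""
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql_query(
            "SELECT id, customer_id, amount, created_at FROM transactions", conn)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: clean types and aggregate to a daily, per-customer view."""
    df["created_at"] = pd.to_datetime(df["created_at"])
    df["date"] = df["created_at"].dt.date
    return (df.groupby(["customer_id", "date"], as_index=False)
              .agg(total_spend=("amount", "sum"), n_transactions=("id", "count")))

def load(df: pd.DataFrame, out_path: str) -> None:
    """Load: write the analytical table to columnar storage (warehouse or lake zone)."""
    df.to_parquet(out_path, index=False)  # needs pyarrow or fastparquet installed

if __name__ == "__main__":
    load(transform(extract("operational.db")), "daily_customer_spend.parquet")
```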

The entire analytical data plane — where the enterprise stores its analytical data — has diverged into two main architectures and technology stacks:

  • Data Warehouses;
  • Data Lakes.

Different users might perform data work at different stages within the enterprise architecture.

  • Data analysts often query tables and aggregate data in the data warehouse to produce dashboards, reports and visualisations, which business users and decision-makers consume downstream.
  • Data scientists often work in a data lake to explore data in a shadow production environment, prototyping in a developer environment on live (i.e. production) data that has been meticulously prepared by data engineers. ML engineers operationalise these models into production so they can serve both internal and external customers at scale under the watch of a 24/7 operations team (MLOps).
Warehouses vs data lakes. Source: Z. Dehghani at MartinFowler.com with amendments by author

Let’s dive into some details of both architectures.

4.1 Data Warehouses

Data Warehouses are an established way to store structured data in a relational schema that’s optimised for read operations — primarily SQL queries to support business intelligence (BI), reporting and data visualisations.

Some features of warehouses:

  • Historical analysis: Data warehouses have been the mainstay for descriptive analytics for decades, offering the ability to query and join large volumes of historical data quickly.
  • Schema-on-Write: Data warehouses traditionally employ a Schema-on-Write approach, where the structure, or schema, of your tables is defined upfront.
A common star schema. Image by author
  • Data Modelling: While data analysts and data scientists can work with the data directly in the analytical data store, it’s common to create data models that pre-aggregate the data to make it easier to produce reports, dashboards and interactive visualisations. A common data model — called the star schema — is based on fact tables that contain the numeric values you want to analyse (for example, an amount relating to Sales), which are related (hence the term relational database) to dimension tables representing the entities you want to measure against (e.g. Customer or Product). A small sketch of this follows the list.
  • Fast queries: Data in warehouses may be aggregated and loaded into an Online Analytical Processing (OLAP) model, also known as the cube. Numeric values (measures) from fact tables are pre-aggregated across one or more dimensions — for example, total revenue (from the fact Sales table) by the dimensions Customer, Product and Time. Visually, this looks like the intersection of the 3 dimensions in a 3D cube. Benefit-wise, the OLAP/cube model captures relationships that support “drill-up/drill-down” analysis, and queries are fast because the data is pre-aggregated.
The “cube”. Measures (e.g. sales) are aggregated by dimensions time, customer & product. Image by author
  • File types: Structured data files include readable formats like CSV and XLSX (Excel), and optimised formats like Avro, ORC & Parquet. Relational databases can also store semi-structured data like JSON files.
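To ground the star schema and cube ideas from the list above, here is a small pandas sketch (every table and column name invented): a fact table of sales is joined to customer and product dimension tables, then measures are pre-aggregated by dimensions, a tiny in-memory stand-in for what an OLAP cube does ahead of time.

```python
import pandas as pd

# Dimension tables: descriptive attributes of the entities being measured
dim_customer = pd.DataFrame({"customer_id": [1, 2], "segment": ["Retail", "Business"]})
dim_product = pd.DataFrame({"product_id": [10, 20], "category": ["Home Loan", "Credit Card"]})

# Fact table: numeric measures plus foreign keys to the dimension tables
fact_sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "product_id": [10, 20, 10, 20],
    "month": ["2023-01", "2023-02", "2023-01", "2023-02"],
    "revenue": [300.0, 50.0, 900.0, 75.0],
})

# Join facts to dimensions (the "relational" part of the star schema)
star = (fact_sales
        .merge(dim_customer, on="customer_id")
        .merge(dim_product, on="product_id"))

# Pre-aggregate the measure by dimensions, similar in spirit to an OLAP cube
cube = pd.pivot_table(star, values="revenue",
                      index=["segment", "category"], columns="month",
                      aggfunc="sum", fill_value=0)
print(cube)
```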

4.2 Data Lakes

Data Lakes are the de facto industry approach to store a large volume of file-based data to support data science and large-scale analytical data processing scenarios.

  • Distributed compute & storage: Data lakes use distributed compute and distributed storage to process and store huge volumes of potentially unstructured data. This means the data is held and processed across potentially hundreds of machines, known as a big data cluster. This technology took off in the 2010s, enabled by Apache Hadoop, a collection of open-source big data software that empowered organisations to distribute huge amounts of data across many machines (HDFS distributed storage) and run SQL-like queries on tables stored across them (Hive & Spark distributed compute). Companies like Cloudera and Hortonworks later commercialised Apache software into packages that enabled easier onboarding and maintenance by organisations around the world.
  • Schema-on-Read: Data lakes use a Schema-on-Read paradigm where a schema is only created when the data is read. This means data can be dumped in the lake en masse without the costly need to define schemas immediately, while allowing schemas to be created for specific use cases down the line — precisely the kind of flexibility that data scientists require for modelling (a short sketch follows this list).
  • File types: Data lakes are the home of unstructured data — these include text files like TXT & DOC, audio files like MP3 & WAV, images like JPEG & PNG, videos like MP4, and even entire PDFs, social media posts, emails, webpages and sensor data. Data lakes (and NoSQL databases) also allow you to store your semi-structured data, like JSON and XML files, as-is.
  • Cloud computing: Data lakes are increasingly hosted on public cloud providers like Amazon Web Services, Microsoft Azure and Google Cloud. This elastic and scalable infrastructure enables an organisation to automatically and quickly adjust to changing demands for compute and storage while maintaining performance and paying only for exactly what it uses. There are three common models of cloud computing, with different divisions of shared responsibility between the cloud provider and the client. The most flexible, Infrastructure-as-a-Service (IaaS), essentially lets you rent empty space in the data centre; the cloud provider maintains the physical infrastructure and access to the internet. In contrast, the Software-as-a-Service (SaaS) model has the client renting a fully developed software solution run over the internet (think Microsoft Office). For enterprise data, the most popular cloud model is the middle ground, Platform-as-a-Service (PaaS), where the provider manages the OS and the client builds its data architecture and enterprise applications on top.
Cloud computing types & shared responsibility model. Image by author
  • Streaming: Technologies like Apache Kafka have enabled data to be processed in near real-time as a perpetual stream, enabling systems that surface instant insights and trends, or take immediate responsive action to events as they occur. For instance, the ability to send an instant mobile notification to customers who might be transferring money to scammers leverages this technology.
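The Schema-on-Read idea flagged earlier in this list can be sketched in a few lines: raw, semi-structured events are landed in the lake as-is, and a schema (which fields, which types) is only imposed later, when a particular use case reads the data. The directory layout and field names here are hypothetical.

```python
import json
import pathlib
import pandas as pd

lake = pathlib.Path("lake/raw/events")
lake.mkdir(parents=True, exist_ok=True)

# Write side: land raw events in the lake without defining any schema upfront
events = [
    {"type": "payment", "amount": 42.5, "channel": "mobile", "ts": "2023-05-01T10:00:00"},
    {"type": "login", "channel": "web", "ts": "2023-05-01T10:05:00"},  # no 'amount' field
]
(lake / "2023-05-01.json").write_text("\n".join(json.dumps(e) for e in events))

# Read side: the schema is decided only now, for this specific use case
records = [json.loads(line) for line in (lake / "2023-05-01.json").read_text().splitlines()]
payments = (pd.DataFrame(records)
              .query("type == 'payment'")
              .astype({"amount": "float64"}))
payments["ts"] = pd.to_datetime(payments["ts"])
print(payments[["ts", "channel", "amount"]])
```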

Architect Zhamak Dehghani condensed the evolution — the challenges, progress and failings — of the enterprise data landscape across three generations:

First generation: proprietary enterprise data warehouse and business intelligence platforms; solutions with large price tags that have left companies with equally large amounts of technical debt [in the form of] thousands of unmaintainable ETL jobs, and tables and reports that only a small group of specialised people understand, resulting in an under-realized positive impact on the business.

Second generation: big data ecosystem with a data lake as a silver bullet; complex big data ecosystem and long running batch jobs operated by a central team of hyper-specialised data engineers have created data lake monsters that at best has enabled pockets of R&D analytics; over-promised and under-realised.

Third (and current generation) data platforms: more or less similar to the previous generation, with a modern twist towards streaming for real-time data availability with architectures, unifying the batch and stream processing for data transformation, as well as fully embracing cloud-based managed services for storage, data pipeline execution engines and machine learning platforms.

The current data lake architecture can be summarised as:

  • Centralised. All analytical data is stored in one place, managed by a central data engineering team that lacks domain knowledge of the data, making it difficult to unlock its full potential or fix data quality issues stemming from the source. The opposite of a decentralised architecture that federates data ingestion to teams across the business.
  • Domain-agnostic. An architecture that strives to serve everyone without specifically catering for anyone: a jack-of-all-trades platform. The opposite of a domain-driven architecture whereby data is owned by the different business domains.
  • Monolithic. The data platform is built as one big piece that’s hard to change and upgrade. The opposite of a modular architecture that allows individual parts or microservices to be tweaked and modified independently.

The problems are clear, and so are some of the solutions.

Enter data mesh.

Data mesh is the next-generation data architecture that moves away from a single centralised data team towards a decentralised design where data is owned and managed by the teams across the organisation that understand it best, known as domain-driven ownership.

Importantly, each business unit or domain aims to apply product thinking to create quality, reusable data products — self-contained, accessible data sets treated as products by the data’s producers — which can then be published and shared across the mesh to consumers in other domains and business units, known as nodes on the mesh.

Individual business units share finely crafted data built to a ‘product standard’. Source: Data Mesh Architecture (with permission)

Data mesh enables teams to work independently with greater autonomy and agility, while still ensuring that data is consistent, reliable and well-governed.
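What “treating data as a product” can look like in practice is a domain publishing each data set together with an explicit, machine-readable contract covering ownership, schema and service levels. The sketch below is purely illustrative; the fields and names are my own assumptions rather than any standard.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Illustrative contract a domain might publish alongside its data set."""
    name: str                    # e.g. "credit_risk.customer_exposure"
    domain: str                  # owning business domain (a node on the mesh)
    owner: str                   # accountable team
    description: str
    schema: dict                 # column name -> type, agreed with consumers
    refresh_sla: str             # how fresh consumers can expect the data to be
    quality_checks: list = field(default_factory=list)

exposure = DataProduct(
    name="credit_risk.customer_exposure",
    domain="credit-risk",
    owner="credit-risk-data-squad",
    description="Daily view of each customer's total credit exposure.",
    schema={"customer_id": "string", "exposure_aud": "decimal", "as_of_date": "date"},
    refresh_sla="daily by 06:00 AEST",
    quality_checks=["exposure_aud >= 0", "customer_id unique per as_of_date"],
)
```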

Here’s an example from my work.

Right now, the data for our customers, along with their transactions, products, income and liabilities, is sitting in our centralised data lake. (And across our data warehouses too.)

In the future, as we federate our capabilities and ownership across the bank, the credit risk domain’s own data engineers can independently create and manage their data pipelines, without relying on a centralised ingestion team far removed from the business and lacking in credit expertise.

They will take pride in building and refining high-quality, strategic and reusable data products that can be shared with different nodes across the mesh, providing the mortgages team with reliable credit information to make better decisions about approving home loans.

These same data products can also be utilised by the consumer credit domain to develop machine learning models to better understand the behaviours of our credit card customers so that we may offer them better services and identify those at risk.

These are examples of leveraging the strategic value of data products within the mesh.

Data mesh fosters a culture of data ownership and collaboration where data is treated as a first-class citizen that is productised and seamlessly shared across teams and departments, rather than languishing in an entangled web of often-duplicated ETL pipelines crafted by siloed teams for specific ad hoc tasks.

Data mesh pushes organisations away from a costly and inefficient project-based mindset towards a scalable and forward-thinking product-based mindset.

Data governance is like a big game of Who’s the Boss, but for data. Just like the show, there are a lot of complicated relationships to navigate.

It’s about figuring out who’s in charge of what data, who gets to access it, who needs to protect it, and what controls and monitoring are in place to ensure things don’t go wrong.

With my workplace boasting 40,000 employees, tons of processes, and competing priorities, it can feel like a real challenge to maintain order and ensure everyone is on the same page.

To data analysts, data scientists and developers, data governance can feel like that annoying friend who always wants to know what you’re up to. But it’s absolutely necessary for organisations, especially heavily regulated ones. Otherwise, it would be like a circus without a ringmaster — chaotic, impossible to manage, and very risky.

Data governance components. Image by author

Some core considerations of data governance are:

Data privacy. It’s like trying to keep your embarrassing childhood photos hidden from the world. But for businesses, it’s a lot more serious than just dodgy haircuts. Let’s say a bank accidentally reveals all of its customers’ financial info. That’s going to cost them a ton of cash and, even more importantly, trust.

Data security. You want to make sure that your customers’ data is safe from both external threats (like hackers) and internal threats (like rogue employees). That means robust authentication systems, fault-tolerant firewalls, ironclad encryption technologies and vigilant 24/7 cybersecurity. Nobody wants their data ending up on the dark web, auctioned off to criminals.

Data quality. Think of making a sandwich — put in rotten ingredients and you’ll end up with a gross meal. If your data quality stinks, you’ll end up with unreliable insights that nobody wants to bite into. And if you’re in a regulated industry, you better make sure your sandwich is made with fresh ingredients, or your data might not pass your compliance obligations.

Maintaining reliable information on how data flows through the enterprise — a.k.a. data lineage — is crucial for ensuring data quality and troubleshooting when things go wrong.

Having weak data privacy, security and/or quality means more data risk.

This is where data ownership comes in. Who gets to call the shots and make decisions about the data? Who owns the risk of things going wrong?

In practice, it’s a bit like a game of hot potato where nobody really wants to hold onto the potato for too long. But somebody has to take responsibility for it, so we can avoid data mishaps and keep our potato hot, fresh and secure.

The move towards data mesh aims to:

  • enhance data quality across the board (via reusable data products)
  • simplify data lineage (goodbye to distant ETLs into a centralised data lake)
  • optimise data ownership (have appropriate domains own their data).

The realm of enterprise-level data can often be a perplexing one, marked by the accumulation of technical debt resulting from a cycle of experimentation followed by overcorrection, not unlike the fluctuations of the stock market.

While the tales of large companies are each unique, they share a few common threads. One such thread is the organic expansion towards an unwieldy and daunting enterprise data warehouse, subsequently succeeded by the eager embrace of a centralised data lake aimed at saving costs, concentrating expertise, and magnifying the value of data.

But this approach brought forth an entirely new set of issues. As a result, we’re now witnessing a dramatic swing towards decentralising and disseminating data management to the teams that best understand their own data.

To give some personal background, the bank I work for has journeyed through the usual epochs of data architectures. We spent decades on data warehouses. Around 2017, we embarked on a now-seven-year journey to stand up a strategic data lake intended to become the cornerstone of our data infrastructure. In 2018, I joined the company as a fresh, wide-eyed, in-training data scientist from academia.

My online courses — kindly sponsored by my company — taught me how to wrangle data and train logistic regression models and gradient boosted trees, but ill-prepared me for the realities of working with data at a large organisation.

Long story short, our data warehouses and data lake are still around today, admittedly living together in perhaps a bit of an awkward marriage. (It’s a work-in-progress.)

We’ve started the journey to decentralise our data lake towards mesh. At the same time, we’re busting the spaghetti-like complexity of our data landscape by leveraging the power of reusable data products. And among the Big 4 Banks of Australia, we’re apparently leading the way. Bravo.

The challenges are big, as all this technical debt is the culmination of countless pipelines and ad hoc solutions built across numerous projects by thousands of employees who have come and, for some, long gone.

But boy, this is an exciting field to be in. I joined the company to do data science, and years later, I’ve been swept up on a journey building something bigger.

I hope you found this article insightful. Let me know if you see a resemblance of my story in your own journey!

Feel free to connect with me on LinkedIn, Twitter & YouTube.



