
Building a Secure and Scalable Data and AI Platform



Empowering business through data-driven decision-making

Photo by Igor Omilaev on Unsplash

Over the last four years, I had the golden opportunity to lead the strategy, design, and implementation of global-scale big data and AI platforms across not one but two public cloud platforms — AWS and GCP. Furthermore, my team operationalized 70+ data science/machine learning (DSML) use cases and 10 digital applications, contributing to ~$100M+ in revenue growth.

The journey was full of exciting challenges and a few steep learning curves, but the end results were highly impactful. Through this post, I want to share my learnings and experiences, which will help fellow technology innovators think through their planning process and leapfrog their implementation.

This post will focus mainly on the foundational construct to provide a holistic picture of the overall production ecosystem. In later posts, I will discuss the technology choices and share more detailed prescriptive guidance.

Let me begin by giving you a view of the building blocks of the data and AI platform.

End-to-end block-level architecture of the data and AI platform

Thinking through the end-to-end architecture is an excellent idea, as it helps you avoid the common trap of getting things done quick and dirty. After all, the output of your ML model is only as good as the data you feed it. And you don't want to compromise on data security and integrity.

1. Data Acquisition and Ingestion

Creating a well-architected DataOps framework is essential to the overall data onboarding process. Much depends on the source generating the data (structured vs. unstructured) and how you receive it (batch, replication, near real-time, real-time).

As you ingest the data, there are different ways to onboard it (a minimal sketch of the second pattern follows this list):

  1. Extract → Load (no transformation needed)
  2. Extract → Load → Transform (primarily used in batch uploads)
  3. Extract → Transform → Load (works best for streaming data)
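
For instance, here is a minimal sketch of the second pattern (batch ELT) in Python: persist the raw extract untouched first, then transform. The file paths and column names are hypothetical placeholders.

```python
import pandas as pd

RAW_SOURCE = "raw/orders/2024-01-01.csv"   # hypothetical landing path
CURATED_ZONE = "curated/orders.parquet"    # hypothetical curated path

# Extract -> Load: persist the source exactly as received.
raw = pd.read_csv(RAW_SOURCE)
raw.to_parquet("raw/orders.parquet", index=False)

# Transform: cleanse and conform only after the raw copy is safely landed,
# so a failed transform never forces a re-extract from the source system.
curated = (
    raw.dropna(subset=["order_id"])
       .assign(order_ts=lambda df: pd.to_datetime(df["order_ts"]))
)
curated.to_parquet(CURATED_ZONE, index=False)
```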

For machine learning use cases, feature engineers must further combine and refine the data to create features (feature engineering).

2. Data Storage

Choosing the optimal data storage is essential. Object storage buckets like S3, GCS, or Azure Blob Storage are the best options for landing raw data, especially unstructured data.

For pure analytics use cases, or if you are bringing in structured SQL data, you can also land the data directly in a cloud data warehouse (BigQuery, etc.). Many engineering teams also prefer a dedicated data warehouse store (separate from object storage). Your choice will depend on the use cases and costs involved. Tread wisely!

Typically, you can bring data directly from internal and external (first- and third-party) sources without any intermediate step.

However, in a few cases the data provider will need access to your environment for data transactions. Plan a third-party landing zone in a DMZ setup to avoid exposing your entire data system to vendors.

Also, for compliance-sensitive data like PCI and PII, and data regulated under GDPR, MLPS, APPI, CCPA, etc., create structured storage zones so you treat the data sensibly right from the get-go.

Remember to plan retention and backup policies based on the time-travel and historical-context requirements of your ML models and analytics reports. While storage is cheap, data accumulated over time adds up in cost quickly.
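
On AWS, for example, you can codify retention as object lifecycle rules. Here is a minimal sketch with boto3; the bucket name, prefix, and retention windows are hypothetical placeholders to adapt to your own policies.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical policy: tier raw data to Glacier after 90 days, expire after 3 years.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-raw-data-zone",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-retention",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 1095},
            }
        ]
    },
)
```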

3. Data Governance

While most organizations are good at acquiring and storing data, most engineering teams struggle to make that data consumable for end users.

The main factors leading to poor adoption are —

  1. Inadequate data literacy in the organization
  2. Absence of a well-defined data catalog and data dictionary (metadata)
  3. Lack of an accessible query interface

Data teams must partner with legal, privacy, and security teams to understand the national and regional data regulations and compliance requirements for proper data governance.

Several methods you could use to implement data governance are (a masking sketch follows this list):

  1. Data masking and anonymization
  2. Attribute-based access control
  3. Data localization
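
To illustrate the first method, here is a minimal sketch that pseudonymizes PII columns with a keyed hash before data leaves the restricted zone. The column names and secret key are hypothetical; in production the key would live in a secrets manager.

```python
import hashlib
import hmac

import pandas as pd

SECRET_KEY = b"rotate-me-in-a-secrets-manager"  # hypothetical; never hard-code in production
PII_COLUMNS = ["email", "phone"]                # hypothetical sensitive columns


def pseudonymize(value: str) -> str:
    """Deterministic keyed hash: joins still work, but raw PII never leaves the zone."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()


df = pd.DataFrame({"email": ["a@x.com"], "phone": ["555-0100"], "amount": [42.0]})
for col in PII_COLUMNS:
    df[col] = df[col].astype(str).map(pseudonymize)
```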

Failure to properly secure storage and access to data could expose the organization to legal issues and associated penalties.

4. Data Consumption Patterns

As data gets transformed and enriched into business KPIs, its presentation and consumption take on different facets.

For pure visualization and dashboarding, simple access to the stored data through a query interface is all you will need.

As requirements become more complex, such as serving data to machine learning models, you will have to implement and enhance a feature store. This domain is still maturing, and most cloud-native solutions are in the early stages of production-grade readiness.
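
If you adopt an open-source option such as Feast, online feature retrieval at inference time looks roughly like the sketch below; the feature view, field, and entity key are hypothetical and assume an already configured Feast repository.

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a configured Feast repo in this directory

# Fetch fresh feature values for a single entity at inference time.
features = store.get_online_features(
    features=["driver_stats:avg_daily_trips"],  # hypothetical feature view and field
    entity_rows=[{"driver_id": 1001}],          # hypothetical entity key
).to_dict()
```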

Also, look to build a horizontal data layer where you can expose data through APIs for consumption by other applications. GraphQL is one good solution for creating this microservices layer, and it significantly helps with ease of access (data as a service).
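
As a sketch of that data-as-a-service idea, here is a minimal GraphQL schema built with the strawberry Python library; the type, field, and stubbed resolver are hypothetical, and a real resolver would query your warehouse or feature store.

```python
import strawberry


@strawberry.type
class CustomerKPI:
    customer_id: str
    lifetime_value: float


@strawberry.type
class Query:
    @strawberry.field
    def customer_kpi(self, customer_id: str) -> CustomerKPI:
        # Hypothetical stub; a real resolver would query the warehouse.
        return CustomerKPI(customer_id=customer_id, lifetime_value=123.45)


schema = strawberry.Schema(query=Query)
```

You would then mount this schema behind your API gateway, so consuming applications query exactly the fields they need.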

As you mature this area, look at structuring the data into data product domains and identifying data stewards within business units who can act as custodians of those domains.

5. Machine Learning

Once data processing is in place, there is a two-step approach to Machine Learning: Model Development, and Model Deployment & Governance.

Operationalizing the AI Platform

In the Model Development phase, ML Engineers partner closely with the Data Scientists until the model is packaged and ready to be deployed. Choosing ML Frameworks and Features and partnering with DS on Hyperparameter Tuning and Model Training are all part of the development lifecycle.

Creating deployment pipelines and choosing the tech stack for operationalizing and serving the model fall under MLOps. MLOps engineers also provide ML model management, which includes monitoring, scoring, drift detection, and initiating retraining.
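
Drift detection, for instance, can start as simply as comparing a feature's serving distribution against its training baseline. Below is a minimal sketch using a two-sample Kolmogorov-Smirnov test; the synthetic data and alert threshold are hypothetical starting points you would tune.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, 10_000)  # baseline captured at training time
serving_feature = rng.normal(0.3, 1.0, 10_000)   # recent production values

stat, p_value = ks_2samp(training_feature, serving_feature)
if p_value < 0.01:  # hypothetical alert threshold
    print(f"Drift detected (KS={stat:.3f}); trigger the retraining pipeline")
```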

Automating all these steps in the ML Model Lifecycle helps with scaling.

Don't forget to store all your trained models in an ML model registry and promote reuse for efficient operations.
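
With a registry such as MLflow, for example, registering a trained model from a completed run is a one-liner; the run ID and model name below are hypothetical.

```python
import mlflow

# Register a model logged in a previous training run so other teams can reuse it.
mlflow.register_model(
    model_uri="runs:/abc123/model",  # hypothetical run ID
    name="churn-classifier",         # hypothetical registry name
)
```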

6. Production Operations

Serving the model output requires constant collaboration with other functional areas. Advance planning and open communication channels are critical to keeping release calendars well-aligned, and they help you avoid missed deadlines, technology choice conflicts, and trouble at the integration layer.

Depending on the consumption layer and deployment targets, you would either publish the model output (model endpoint) through APIs or have applications fetch the inference directly from a store. Using GraphQL in conjunction with an API gateway is an efficient way to accomplish this.
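
For the API route, a minimal sketch of a model endpoint using FastAPI might look like this; the feature schema and scoring stub are hypothetical stand-ins for your packaged model.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Features(BaseModel):
    tenure_months: int    # hypothetical input features
    monthly_spend: float


@app.post("/predict")
def predict(features: Features) -> dict:
    # Hypothetical stub; a real handler would call model.predict(...)
    score = 0.1 * features.tenure_months + 0.01 * features.monthly_spend
    return {"churn_score": score}
```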

7. Security Layer

Detach the management plane and create a shared services layer, which will be the main entry and exit point for your cloud accounts. It will also serve as the meet-me-room between the external and internal public/private clouds within your organization.

Shared Services — Google Cloud Platform
Shared Services — Amazon Web Services

Your service control policies (AWS) or organization policy constraints (GCP) should be centralized and should prevent resources from being created or hosted without proper access controls.
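
On AWS, for example, such a guardrail can be codified as a service control policy. Here is a minimal sketch using boto3 that denies activity outside approved regions; the region list and policy name are hypothetical, and real policies typically exempt global services such as IAM.

```python
import json

import boto3

org = boto3.client("organizations")

# Hypothetical guardrail: deny all actions outside the approved regions.
scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyOutsideApprovedRegions",
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {"aws:RequestedRegion": ["us-east-1", "eu-west-1"]}
        },
    }],
}

org.create_policy(
    Content=json.dumps(scp),
    Description="Deny activity outside approved regions",
    Name="deny-unapproved-regions",  # hypothetical policy name
    Type="SERVICE_CONTROL_POLICY",
)
```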

8. User-Management Interface / Consumption Layer

It is wise to decide the structure of your cloud accounts in advance. You can structure them along lines of business (LOB), product domains, or a mix of both. Also, design and segregate your development, staging, and production environments.

You should also centralize your DevOps toolchain. I prefer a cloud-agnostic toolset to support seamless integration and transition within a hybrid, multi-cloud ecosystem.

For developer IDEs, there could be a mix of individual and shared IDEs. Make sure developers frequently check code into a code repository; otherwise, they risk losing work.

GCP setup with cloud-agnostic DevSecOps toolchain

End-to-End Data Science Process

Navigating organizational dynamics and rallying stakeholders around a common, aligned goal is vital to successful production deployment and ongoing operations.

I am sharing the cross-functional workflows and processes that make this complex engine run smoothly.

End-to-end data science model deployment process

Conclusion

Hopefully, this post triggered your thoughts, sparked new ideas, and helped you visualize the complete picture of your undertaking. It is a complex task, but with a well-thought-out design, properly planned execution, and a lot of cross-functional partnerships, you will navigate it easily.

One final piece of advice: Don't create technology solutions just because they seem cool. Start by understanding the business problem and assessing the potential return on investment. Ultimately, the goal is to create business value and contribute to the company's revenue growth.

Good luck with building or maturing your data and AI platform.

Bon Voyage!

~ Adil {LinkedIn}

<< Unless otherwise noted, all images are by the author>>


Building a Secure and Scalable Data and AI Platform was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

