
The Hierarchy of ML tooling on the Public Cloud | by Nathan Cheng | Mar, 2023




Not all ML services are built the same. As a consultant working in the public cloud, I can tell you that you are spoilt for options for Artificial Intelligence (AI) / Machine Learning (ML) tooling on the 3 big public clouds — Azure, AWS, and GCP.

It can be overwhelming to process and synthesize the wave of information, especially when these services are constantly shipping new features.

Just imagine how much of a nightmare it would be to explain to a layman which platform to choose, and why you chose to use this particular tool to solve your machine learning problem.

I’m writing this post to address that problem for others, as well as for myself, so that you walk away with a succinct and distilled understanding of what the public cloud has to offer. For the sake of simplicity, I will use the terms AI and ML interchangeably throughout this post.

Before we jump into tooling comparison, let’s understand why we should even use managed services on the public cloud. It’s a valid assumption to question — Why not build your own custom infrastructure and ML model from scratch? To answer this question, let’s take a quick look at the ML lifecycle.

The diagram below depicts a typical ML lifecycle (the cycle is iterative):

Machine Learning lifecycle.
Machine Learning lifecycle. Image by author.

As you can see, there are many parts to the entire lifecycle that must be considered.

A famous paper published by Google showed that writing the model training code is only a small fraction of the effort that goes into building maintainable ML models in production.

This phenomenon is known as the hidden technical debt of ML systems in production. The industry term Machine Learning Operations (MLOps) has become an umbrella term for the practices that address this debt.

Below is a visual explanation to support the statement above, adapted from Google’s paper:

Hidden technical debt in ML systems.
Hidden technical debt in ML systems. Image by Google Developers.

I won’t go into a detailed explanation of each stage in the lifecycle, but here’s a summarized list of definitions. If you’re interested in learning more, I would recommend reading Machine Learning Design Patterns Chapter 9 on ML Lifecycle and AI Readiness for a detailed answer.

ML lifecycle summarized definitions:

  1. Data pre-processing — prepare data for ML training; data pipeline engineering
  2. Feature engineering — transform input data into new features that are closely aligned with the ML model learning objective
  3. Model training — training and initial validation of ML model; iterate through algorithms, train / test splits, perform hyperparameter tuning
  4. Model evaluation — model performance assessed against predetermined evaluation metrics
  5. Model versioning — version control of model artifacts; model training parameters, model pipeline
  6. Model serving — serving model predictions via batch or real-time inference
  7. Model deployment — automated build, test, deployment to production, and model retraining
  8. Model monitoring — monitor infrastructure, input data quality, and model predictions

Don’t forget about platform infrastructure and security!

The ML lifecycle does not consider the supporting platform infrastructure, which has to be secure from an encryption, networking, and identity and access management (IAM) perspective.

Cloud services provide managed compute infrastructure, development environments, centralized IAM, encryption features, and network protection services that can achieve security compliance with internal IT policies — hence you really should not build these services yourself; instead, leverage the power of the cloud to add ML capabilities to your product roadmap.

This section illustrates that writing the model training code is a relatively tiny part of the entire ML lifecycle; the real difficulty lies in data prep, evaluation, deployment, and monitoring of ML models in production.

Naturally, the conclusion is that building your own custom infrastructure and ML model takes considerable time and effort, and the decision to do so should be a last resort.

Here is where leveraging public cloud services comes in to fill the gap. Broadly, the hyperscalers package two offerings for customers in the ML Tooling Hierarchy:

  • ML Platform — build, train, and operate custom ML models end-to-end.
  • 🧰 AI services, which come in one of three flavors:
  1. 🔨 Pre-Trained Standard – use the base model only; no option to customize by bringing your own training data.
  2. ⚒️ Pre-Trained Customizable – use the base model as-is, with optional customization by bringing your own training data.
  3. ⚙️ Bring Your Own Data – you must bring your own training data.

Honorable AI service mentions

Sharp-eyed readers will notice that I have purposefully omitted a few honorable mentions from the hierarchy:

  • Data Warehouse built-in ML models which enable ML development using SQL syntax. Further reading can be done on BigQuery ML, Redshift ML, and Synapse dedicated SQL pool PREDICT function. These services are meant to be used by data analysts, given that your data is already inside the cloud data warehouse.
  • AI Builder for Microsoft Power Platform, and Amazon SageMaker Canvas. These services are meant to be used by non-technical business users a.k.a. citizen data scientists.
  • Azure OpenAI, which is a nascent service whose access is regulated by Microsoft; you are required to request approval for a trial.

We will first discuss the ML Platform before discussing AI services. The platform provides auxiliary tooling required for MLOps.

Each public cloud has its own version of the ML Platform:

Who is it for?

Persona-wise, this is for teams who have internal data scientist resources, want to build custom state-of-the-art (SOTA) models with their own training data, and want to develop frameworks for custom management of MLOps across the ML lifecycle.

How do I use it?

Requirement-wise, the business use case requires engineering a custom ML model implementation that the AI services in Section 3.2 do not have the capabilities to meet.

As much as possible, this should not be your first option when looking to leverage a service on the public cloud.

Even with the ML platform, considerable time and effort have to be invested in learning its features and writing the code to build out a custom MLOps framework using the hyperscaler software development kits (SDKs).

Instead, first look for an AI service in the next Section 3.2 that could meet your need.

What technology capabilities does the service provide?

When you utilize a cloud platform, you gain access to a fully hyperscaler-managed environment that you would otherwise pull your hair out trying to get right:

  1. Managed Compute Infrastructure — these are clusters of machines with default environments, containing ubiquitous built-in ML libraries, and cloud-native SDKs. Compute can be used for distributed training, or to power model endpoints for serving batch and real-time predictions.
  2. Managed Development Environments — in the form of Notebooks, or through your choice of IDE given that there is integration with the ML platform.

This host of utilities enables data scientists and ML engineers to focus fully on the ML lifecycle instead of infrastructure configuration and dependency management.

Built-in libraries and cloud-native SDKs let data scientists write custom code for more seamless engineering throughout the ML lifecycle.

The following table shows the technology features of each cloud ML platform:

ML Platform Comparison Table. Gist by author.

Next, we will discuss AI services.

They enable ML development using a low-code / no-code approach, and mitigate the overhead of managing MLOps.

The overarching argument for these services is neatly put below by Jeff Atwood:

The best code is no code at all.

Every new line of code you willingly bring into the world is code that has to be debugged, code that has to be read and understood, code that has to be supported. Every time you write new code, you should do so reluctantly, under duress, because you completely exhausted all your other options.

Who is it for?

Persona-wise, these are for teams who lack one or more of the following:

  1. Internal data scientist resources.
  2. Own training data to train a custom ML model.
  3. The resources, effort, and time to invest in engineering a custom ML model end-to-end.

How do I use it?

Requirement-wise, the ML business use case would be met by cloud provider AI service capabilities.

The goal is to add ML features into the product by leveraging hyperscaler base models and training data; so the team can prioritize core application development, integrate with the AI service via retrieving predictions from API endpoints, and ultimately spend minimal effort on model training and MLOps.
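To illustrate what that integration typically looks like, here is a hedged sketch of retrieving a prediction from a REST endpoint. The URL, auth header, and response shape are assumptions for illustration only; each real AI service defines its own API and authentication scheme:

```python
import json
import urllib.request

# Hypothetical endpoint and credential -- placeholders, not a real service.
ENDPOINT = "https://api.example.com/v1/sentiment"

def parse_prediction(body: bytes) -> str:
    """Extract the highest-scoring label from a JSON prediction response."""
    payload = json.loads(body)
    return max(payload["scores"], key=payload["scores"].get)

def predict_sentiment(text: str) -> str:
    """POST the input text to the AI service and return the predicted label."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer <api-key>"},  # placeholder credential
    )
    with urllib.request.urlopen(req) as resp:
        return parse_prediction(resp.read())

# Offline demonstration of the response-parsing step:
sample = json.dumps({"scores": {"POSITIVE": 0.91, "NEGATIVE": 0.04,
                                "NEUTRAL": 0.05}}).encode()
print(parse_prediction(sample))  # POSITIVE
```

The application code stays this thin: serialize the input, call the endpoint, parse the scores. All of the model training and MLOps machinery lives on the provider’s side.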

What technology capabilities does the service provide?

We’re going to organize the following comparison table by the technology capabilities the AI service provides. This is closely interlinked with but should be differentiated from the ML business use case.

For example, the Amazon Comprehend service gives you the capability to perform text classification. That capability can be used to build models for business use cases such as:

  1. Sentiment analysis of customer reviews.
  2. Content quality moderation.
  3. Multi-class item classification into custom-defined categories.
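To make the capability-versus-use-case distinction concrete, here is a toy keyword-based classifier standing in for a managed text-classification capability. The keywords and labels are invented for illustration; the point is that the same capability serves different business use cases simply by swapping the label set:

```python
def classify(text: str, keyword_labels: dict[str, list[str]]) -> str:
    """Score each label by how many of its keywords appear in the text."""
    words = text.lower().split()
    scores = {label: sum(w in words for w in kws)
              for label, kws in keyword_labels.items()}
    return max(scores, key=scores.get)

# Use case 1: sentiment analysis of customer reviews.
sentiment_labels = {"positive": ["great", "love", "excellent"],
                    "negative": ["poor", "broken", "refund"]}

# Use case 3: multi-class item classification into custom categories.
category_labels = {"electronics": ["battery", "screen", "charger"],
                   "apparel": ["shirt", "cotton", "sleeve"]}

review = "love this screen and battery life is excellent"
print(classify(review, sentiment_labels))  # positive
print(classify(review, category_labels))   # electronics
```

A managed service works analogously: one text-classification capability, retargeted per use case by the labels (and, for customizable services, the training data) you supply.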

For certain AI services, the technology capability and business use case are exactly the same; in that scenario, the AI service was built to solve that exact ML business use case.

Industry-specific versions of AI services

Note that I have purposefully avoided mentioning industry-specific versions of AI services. Just know that hyperscalers train these models to achieve higher performance in their target domains, and you should prefer them over the generic version of the service when your use case falls within that industry or domain.

Notable mentions of these services include Amazon Comprehend Medical, Amazon HealthLake, Amazon Lookout for {domain}, Amazon Transcribe Call Analytics, Google Cloud Retail Search, etc.

The following legend and tables show the technology capabilities of each cloud AI service:

  • 🔨 Pre-Trained Standard – use the base model only; no option to customize by bringing your own training data.
  • ⚒️ Pre-Trained Customizable – use the base model as-is, with optional customization by bringing your own training data.
  • ⚙️ Bring Your Own Data – you must bring your own training data.

--- Speech ---

Speech AI Comparison Table. Gist by author.

--- Natural Language ---

Natural Language AI Comparison Table. Gist by author.

--- Vision ---

Vision AI Comparison Table. Gist by author.

--- Decision ---

Decision AI Comparison Table. Gist by author.

--- Search ---

Search AI Comparison Table. Gist by author.

We have covered considerable ground in this post regarding the spectrum of ML services the public cloud offers; however, there are still other concepts to consider when building an ML system.

I would encourage you to explore and find your own answers to the concepts that were not discussed, as AI / ML becomes more deeply embedded within the products we use.

What ML tooling do the 3 public clouds offer to implement the following functionality?

  • Model data lineage and provenance
  • Model catalog
  • Human review for post-prediction ground truth labeling
  • Models that work on video data
  • Models that do generic regression and classification

A special mention and thanks to the authors and creators of the following resources that helped me write this post:

ML Tooling

AI Services

ML Platform

