Four Approaches to build on top of Generative AI Foundational Models | by Lak Lakshmanan

What works, the pros and cons, and example code for each approach

If some of the terminology I use here is unfamiliar, I encourage you to read my earlier article on LLMs first.

There are teams that are employing ChatGPT or its competitors (Anthropic, Google’s Flan T5 or PaLM, Meta’s LLaMA, Cohere, AI21Labs, etc.) for real rather for cutesy demos. Unfortunately, informative content about how they are doing so is lost amidst marketing hype and technical jargon. Therefore, I see folks who are getting started with generative AI take approaches that experts in the field will tell you are not going to pan out. This article is my attempt at organizing this space and showing you what’s working.

Photo by Sen on Unsplash

The bar to clear

The problem with many of the cutesy demos and hype-filled posts about generative AI is that they hit the training dataset — they don’t really tell you how well it will work when applied to the chaos of real human users and actually novel input. Typical software is expected to work at 99%+ reliability —for example, it was only when speech recognition crossed this accuracy bar on phrases that the market for Voice AI took off. Same for automated captioning, translation, etc.

I see two ways in which teams are addressing this issue in their production systems:

Human users are more forgiving if the UX is in a situation where they already expect to correct errors (this seems to be what helps GitHub Copilot) or where it is positioned as being interactive and helpful but not ready to use (ChatGPT, Bing Chat, etc.)
Fully automated applications of generative AI are mostly in the trusted-tester stage today, and the jury is out on whether these applications are actually able to clear this bar. That said, the results are promising and trending upwards, and it’s likely only a matter of time before the bar’s met.

Personally, I have been experimenting with GPT 3.5 Turbo and Google Flan-T5 with specific production use cases in mind, and learning quite a bit about what works and what doesn’t. None of my models have crossed the 99% bar. I also haven’t yet gotten access to GPT-4 or to Google’s PaLM API at the time of writing (March 2023). I’m basing this article on my experiments, on published research, and on publicly announced projects.

Approach 1: Use the API Directly

The first approach is the simplest because many users encountered GPT through the interactive interface offered by ChatGPT. It seems very intuitive to try out various prompts until you get one that generates the output you want. This is why you have a lot of LinkedIn influencers publishing ChatGPT prompts that work for sales emails or whatever.

When it comes to automating this workflow, the natural method is to use the REST API endpoint of the service and directly invoke it with the final, working prompt:

import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.Edit.create(
model="text-davinci-edit-001",
input="It was so great to meet you .... ",
instruction="Summarize the text below in the form of an email that is 5 sentences or less."
)

However, this approach does not lend itself to operationalization. There are several reasons:

Brittleness. The underlying models keep improving. Sudden changes in the deployed models broke many production workloads, and people learned from that experience. ML workloads are brittle enough already; adding additional points of failure in the form of prompts that are fine-tuned to specific models is not wise.
Injection. It is rare that the instruction and input are plain strings as in the example above. Most often, they include variables that are input from users. These variables have to incorporated into the prompts and inputs. And as any programmer knows, injection by string concatenation is rife with security problems. You put yourself at the mercy of the guardrails placed around the Generative AI API when you do this. As when guarding against SQL injection, it’s better to use an API that handles variable injection for you.
Multiple prompts. It is rare that you will be able to get a prompt to work in one-shot. More common is to send multiple prompts to the model, and get the model to modify its output based on these prompts. These prompts themselves may have some human input (such as follow-up inputs) embedded in the workflow. Also common is for the prompts to provide a few examples of the desired output (called few-shot learning).

A way to resolve all three of these problems is to use langchain.

Approach 2: Use langchain

Langchain is rapidly becoming the library of choice that allows you to invoke LLMs from different vendors, handle variable injection, and do few-shot training. Here’s an example of using langchain:

from langchain.prompts.few_shot import FewShotPromptTemplateexamples = [
{
"question": "Who lived longer, Muhammad Ali or Alan Turing?",
"answer": 
"""
Are follow up questions needed here: Yes.
Follow up: How old was Muhammad Ali when he died?
Intermediate answer: Muhammad Ali was 74 years old when he died.
Follow up: How old was Alan Turing when he died?
Intermediate answer: Alan Turing was 41 years old when he died.
So the final answer is: Muhammad Ali
"""
},
{
"question": "When was the founder of craigslist born?",
"answer": 
"""
Are follow up questions needed here: Yes.
Follow up: Who was the founder of craigslist?
Intermediate answer: Craigslist was founded by Craig Newmark.
Follow up: When was Craig Newmark born?
Intermediate answer: Craig Newmark was born on December 6, 1952.
So the final answer is: December 6, 1952
"""
...
]
example_prompt = PromptTemplate(input_variables=["question", "answer"], 
template="Question: {question}\n{answer}")
prompt = FewShotPromptTemplate(
examples=examples, 
example_prompt=example_prompt, 
suffix="Question: {input}", 
input_variables=["input"]
)
print(prompt.format(input="Who was the father of Mary Ball Washington?"))

I strongly recommend using langchain vs. using a vendor’s API directly. Then, make sure that everything you do works with at least two APIs or use a LLM checkpoint that will not change under you. Either of these approaches will avoid your prompts/code being brittle to changes in the underlying LLM.

Langchain today supports APIs from Open AI, Cohere, HuggingFace Hub (and hence Google Flan-T5), etc. and LLMs from AI21, Anthropic, Open AI, HuggingFace Hub, etc.

Approach 3: Finetune the Generative AI Chain

This is the leading-edge approach in that it’s the one I see used by most of the sophisticated production applications of generative AI. As just an example (no endorsement), finetuning is how a startup consisting of Stanford PhDs is approaching standard enterprise use cases like SQL generation and record matching.

To understand the rationale behind this approach, it helps to know that there are four machine learning models that underpin ChatGPT (or its competitors):

A Large Language Model (LLM) is trained to predict the next word of text given the previous words. It does this by learning word associations and patterns on a vast corpus of documents. The model is large enough that it learns these patterns in different contexts.
A Reinforcement Learning based on Human Feedback Model (RL-HF) is trained by showing humans examples of generated text, and asking them to approve text that is pleasing to read. The reason this is needed is that an LLM’s output is probabilistic — it doesn’t predict a single next word; instead, it predicts a set of words each of which has a certain probability of coming next. The RL-HF uses human feedback to learn how to choose the continuation that will generate the text that appeals to humans.
Instruction Model is a supervised model that is trained by showing prompts (“generate a sales email that proposes a demo to the engineering leadership”) and training the model on examples of sales emails.
Context Model is trained to carry on a conversation with the user, allowing them to craft the output through successive prompts.

In addition, there are guardrails (filters on both the input and output). The model declines to answer certain types of queries, and retracts certain answers. In practice, these are both machine learning models that are constantly updated.

Step 2: How RL-HF works. Image from Stiennon et al, 2020

There are open-source generative AI models (Meta’s LLaMA, Google’s Flan-T5) which allow you to pick up at any of the above steps (e.g. use steps 1–2 from the released checkpoint, train 3 on your own data, don’t do 4). Note that LLaMA does not permit commercial use, and Flan-T5 is a year old (so you are compromising on quality). To learn where to break off, it is helpful to understand the cost/benefit of each stage:

If your application uses very different jargon and words, it may be helpful to build a LLM from scratch on your own data (i.e., start at step 1). The problem is that you may not have enough data and even if you have enough data, the training is going to be expensive (on the order of 3–5 million dollars per training run). This seems to be what Salesforce has done with the generative AI they use for developers.
The RL-HF model is trained to appeal to a group of testers who may not be subject-matter experts, or representative of your own users. If your application requires subject matter expertise, you may be better off starting with a LLM and branching off from step 2. The dataset you need for this is much smaller — Stiennon et al 2020 used 125k documents and presented a pair of outputs for each input document in each iteration (see diagram). So, you need human labelers on standby to rate about 1 million outputs. Assuming that a labeler takes 10 min to rate each pair of documents, the cost is that of 250 human-months of labor per training run. I’d estimate $250k to $2m depending on location and skillset.
ChatGPT is trained to respond to thousands of different prompts. Your application, on the other hand, probably requires only one or two specific ones. It can be convenient to train a model such as Google Flan-T5 on your specific instruction and input. Such a model can be much smaller (and therefore cheaper to deploy). This advantage in serving costs explains why step 3 is the most common point of branching off. It’s possible to fine-tune Google Flan-T5 for your specific task with about 10k documents using HuggingFace and/or Keras. You’d do this on your usual ML framework such as Databricks, Sagemaker, or Vertex AI and use the same services to deploy the trained model. Because Flan-T5 is a Google model, GCP makes training and deployment really easy by providing pre-built containers in Vertex AI. The cost would be perhaps $50 or so.
Theoretically, it’s possible to train a different way to maintain conversational context. However, I haven’t seen this in practice. What most people do instead is to use a conversational agent framework like Dialogflow that already has a LLM built into it, and design a custom chatbot for their application. The infra costs are negligible and you don’t need any AI expertise, just domain knowledge.

It is possible to break off at any of these stages. Limiting my examples to publicly published work in medicine:

This Nature article builds a custom 8.9-billion parameter LLM from 90 billion words extracted from medical records (i.e., they start from step 1). For comparison, Flan T5 is 540 billion parameters and the “small/efficient” PaLM is 62 billion parameters. Obviously, cost is a constraint in going much bigger on your custom language model.
This MIT CSAIL study forces the model to closely hew to existing text and also doing instruction fine-tuning (i.e., they are starting from step 2).
Deep Mind’s MedPaLM starts from an instruction-tuned variation of PaLM called Flan-PaLM (i.e. it starts after step 3). They report that 93% of healthcare professionals rated the AI as being on par with human answers.

My advice is to choose where to break off based on how different your application space is from the generic internet text on which the foundational models are trained. Which model should you fine-tune? Currently, Google Flan T5 is the most sophisticated fine-tuneable model available and open for commercial use. For non-commercial uses, Meta’s LLaMA is the most sophisticated model available.

A word of caution though: when you tap into the chain using open-source models, the guardrail filters won’t exist, so you will have to put in toxicity safeguards. One option is to use the detoxify library. Make sure to incorporate toxicity filtering around any API endpoint in production — otherwise, you’ll find yourself having to take it back down. API gateways can be a convenient way to ensure that you are doing this for all your ML model endpoints.

Approach 4: Simplify the problem

There are smart approaches to reframe the problem you are solving in such as way that you can use a Generative AI model (as in Approach 3) but avoid problems with hallucination, etc.

For example, suppose you want to do question-answering. You could start with a powerful LLM and then struggle to “tame” the wild beast to have it not hallucinate. A much simpler approach is to reframe the problem. Change the model from one that predicts the output text to a model that has three outputs: the URL of a document, the starting position within that document, and the length of text. That is what Google Search is doing here:

Google’s Q&A model predicts a URL, starting position, and length of text. This avoids problems with hallucination.

At worst, the model will show you irrelevant text. What it will not do is to hallucinate because you don’t allow it to actually predict text.

A Keras sample that follows this approach tokenizes the inputs and context (the document that you are finding the answer within):

from transformers import AutoTokenizermodel_checkpoint = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
...
examples["question"] = [q.lstrip() for q in examples["question"]]
examples["context"] = [c.lstrip() for c in examples["context"]]
tokenized_examples = tokenizer(
examples["question"],
examples["context"],
...
)
...

and then passes the tokens into a Keras regression model whose first layer is the Transformer model that takes in these tokens and that outputs the position of the answer within the “context” text:

from transformers import TFAutoModelForQuestionAnswering
import tensorflow as tf
from tensorflow import kerasmodel = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
optimizer = keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer)
model.fit(train_set, validation_data=validation_set, epochs=1)

During inference, you get the predicted locations:

inputs = tokenizer([context], [question], return_tensors="np")
outputs = model(inputs)
start_position = tf.argmax(outputs.start_logits, axis=1)
end_position = tf.argmax(outputs.end_logits, axis=1)

You will note that the sample does not predict the URL — the context is assumed to be the result of a typical search query (such as returned by a matching engine or vector database), and the sample model only does extraction. However, you can build the search also into the model by having a separate layer in Keras.

Summary

There are four approaches that I see being used to build production applications on top of generative AI foundational models:

Use the REST API of an all-in model such as GPT-4 for one-shot prompts.
Use langchain to abstract away the LLM, input injection, multi-turn conversations, and few-shot learning.
Finetune on your custom data by tapping into the set of models that comprise an end-to-end generative AI model.
Reframe the problem into a form that avoids the dangers of generative AI (bias, toxicity, hallucination).

Approach #3 is what I see most commonly used by sophisticated teams.

What works, the pros and cons, and example code for each approach

If some of the terminology I use here is unfamiliar, I encourage you to read my earlier article on LLMs first.

Photo by Sen on Unsplash

The bar to clear

I see two ways in which teams are addressing this issue in their production systems:

Human users are more forgiving if the UX is in a situation where they already expect to correct errors (this seems to be what helps GitHub Copilot) or where it is positioned as being interactive and helpful but not ready to use (ChatGPT, Bing Chat, etc.)
Fully automated applications of generative AI are mostly in the trusted-tester stage today, and the jury is out on whether these applications are actually able to clear this bar. That said, the results are promising and trending upwards, and it’s likely only a matter of time before the bar’s met.

Approach 1: Use the API Directly

When it comes to automating this workflow, the natural method is to use the REST API endpoint of the service and directly invoke it with the final, working prompt:

import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.Edit.create(
model="text-davinci-edit-001",
input="It was so great to meet you .... ",
instruction="Summarize the text below in the form of an email that is 5 sentences or less."
)

However, this approach does not lend itself to operationalization. There are several reasons:

Brittleness. The underlying models keep improving. Sudden changes in the deployed models broke many production workloads, and people learned from that experience. ML workloads are brittle enough already; adding additional points of failure in the form of prompts that are fine-tuned to specific models is not wise.
Injection. It is rare that the instruction and input are plain strings as in the example above. Most often, they include variables that are input from users. These variables have to incorporated into the prompts and inputs. And as any programmer knows, injection by string concatenation is rife with security problems. You put yourself at the mercy of the guardrails placed around the Generative AI API when you do this. As when guarding against SQL injection, it’s better to use an API that handles variable injection for you.
Multiple prompts. It is rare that you will be able to get a prompt to work in one-shot. More common is to send multiple prompts to the model, and get the model to modify its output based on these prompts. These prompts themselves may have some human input (such as follow-up inputs) embedded in the workflow. Also common is for the prompts to provide a few examples of the desired output (called few-shot learning).

A way to resolve all three of these problems is to use langchain.

Approach 2: Use langchain

Langchain is rapidly becoming the library of choice that allows you to invoke LLMs from different vendors, handle variable injection, and do few-shot training. Here’s an example of using langchain:

from langchain.prompts.few_shot import FewShotPromptTemplateexamples = [
{
"question": "Who lived longer, Muhammad Ali or Alan Turing?",
"answer": 
"""
Are follow up questions needed here: Yes.
Follow up: How old was Muhammad Ali when he died?
Intermediate answer: Muhammad Ali was 74 years old when he died.
Follow up: How old was Alan Turing when he died?
Intermediate answer: Alan Turing was 41 years old when he died.
So the final answer is: Muhammad Ali
"""
},
{
"question": "When was the founder of craigslist born?",
"answer": 
"""
Are follow up questions needed here: Yes.
Follow up: Who was the founder of craigslist?
Intermediate answer: Craigslist was founded by Craig Newmark.
Follow up: When was Craig Newmark born?
Intermediate answer: Craig Newmark was born on December 6, 1952.
So the final answer is: December 6, 1952
"""
...
]
example_prompt = PromptTemplate(input_variables=["question", "answer"], 
template="Question: {question}\n{answer}")
prompt = FewShotPromptTemplate(
examples=examples, 
example_prompt=example_prompt, 
suffix="Question: {input}", 
input_variables=["input"]
)
print(prompt.format(input="Who was the father of Mary Ball Washington?"))

Langchain today supports APIs from Open AI, Cohere, HuggingFace Hub (and hence Google Flan-T5), etc. and LLMs from AI21, Anthropic, Open AI, HuggingFace Hub, etc.

Approach 3: Finetune the Generative AI Chain

To understand the rationale behind this approach, it helps to know that there are four machine learning models that underpin ChatGPT (or its competitors):

A Large Language Model (LLM) is trained to predict the next word of text given the previous words. It does this by learning word associations and patterns on a vast corpus of documents. The model is large enough that it learns these patterns in different contexts.
A Reinforcement Learning based on Human Feedback Model (RL-HF) is trained by showing humans examples of generated text, and asking them to approve text that is pleasing to read. The reason this is needed is that an LLM’s output is probabilistic — it doesn’t predict a single next word; instead, it predicts a set of words each of which has a certain probability of coming next. The RL-HF uses human feedback to learn how to choose the continuation that will generate the text that appeals to humans.
Instruction Model is a supervised model that is trained by showing prompts (“generate a sales email that proposes a demo to the engineering leadership”) and training the model on examples of sales emails.
Context Model is trained to carry on a conversation with the user, allowing them to craft the output through successive prompts.

Step 2: How RL-HF works. Image from Stiennon et al, 2020

If your application uses very different jargon and words, it may be helpful to build a LLM from scratch on your own data (i.e., start at step 1). The problem is that you may not have enough data and even if you have enough data, the training is going to be expensive (on the order of 3–5 million dollars per training run). This seems to be what Salesforce has done with the generative AI they use for developers.
The RL-HF model is trained to appeal to a group of testers who may not be subject-matter experts, or representative of your own users. If your application requires subject matter expertise, you may be better off starting with a LLM and branching off from step 2. The dataset you need for this is much smaller — Stiennon et al 2020 used 125k documents and presented a pair of outputs for each input document in each iteration (see diagram). So, you need human labelers on standby to rate about 1 million outputs. Assuming that a labeler takes 10 min to rate each pair of documents, the cost is that of 250 human-months of labor per training run. I’d estimate $250k to $2m depending on location and skillset.
ChatGPT is trained to respond to thousands of different prompts. Your application, on the other hand, probably requires only one or two specific ones. It can be convenient to train a model such as Google Flan-T5 on your specific instruction and input. Such a model can be much smaller (and therefore cheaper to deploy). This advantage in serving costs explains why step 3 is the most common point of branching off. It’s possible to fine-tune Google Flan-T5 for your specific task with about 10k documents using HuggingFace and/or Keras. You’d do this on your usual ML framework such as Databricks, Sagemaker, or Vertex AI and use the same services to deploy the trained model. Because Flan-T5 is a Google model, GCP makes training and deployment really easy by providing pre-built containers in Vertex AI. The cost would be perhaps $50 or so.
Theoretically, it’s possible to train a different way to maintain conversational context. However, I haven’t seen this in practice. What most people do instead is to use a conversational agent framework like Dialogflow that already has a LLM built into it, and design a custom chatbot for their application. The infra costs are negligible and you don’t need any AI expertise, just domain knowledge.

It is possible to break off at any of these stages. Limiting my examples to publicly published work in medicine:

This Nature article builds a custom 8.9-billion parameter LLM from 90 billion words extracted from medical records (i.e., they start from step 1). For comparison, Flan T5 is 540 billion parameters and the “small/efficient” PaLM is 62 billion parameters. Obviously, cost is a constraint in going much bigger on your custom language model.
This MIT CSAIL study forces the model to closely hew to existing text and also doing instruction fine-tuning (i.e., they are starting from step 2).
Deep Mind’s MedPaLM starts from an instruction-tuned variation of PaLM called Flan-PaLM (i.e. it starts after step 3). They report that 93% of healthcare professionals rated the AI as being on par with human answers.

Approach 4: Simplify the problem

There are smart approaches to reframe the problem you are solving in such as way that you can use a Generative AI model (as in Approach 3) but avoid problems with hallucination, etc.

Google’s Q&A model predicts a URL, starting position, and length of text. This avoids problems with hallucination.

At worst, the model will show you irrelevant text. What it will not do is to hallucinate because you don’t allow it to actually predict text.

A Keras sample that follows this approach tokenizes the inputs and context (the document that you are finding the answer within):

from transformers import AutoTokenizermodel_checkpoint = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
...
examples["question"] = [q.lstrip() for q in examples["question"]]
examples["context"] = [c.lstrip() for c in examples["context"]]
tokenized_examples = tokenizer(
examples["question"],
examples["context"],
...
)
...

from transformers import TFAutoModelForQuestionAnswering
import tensorflow as tf
from tensorflow import kerasmodel = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
optimizer = keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer)
model.fit(train_set, validation_data=validation_set, epochs=1)

During inference, you get the predicted locations:

inputs = tokenizer([context], [question], return_tensors="np")
outputs = model(inputs)
start_position = tf.argmax(outputs.start_logits, axis=1)
end_position = tf.argmax(outputs.end_logits, axis=1)

Summary

There are four approaches that I see being used to build production applications on top of generative AI foundational models:

Use the REST API of an all-in model such as GPT-4 for one-shot prompts.
Use langchain to abstract away the LLM, input injection, multi-turn conversations, and few-shot learning.
Finetune on your custom data by tapping into the set of models that comprise an end-to-end generative AI model.
Reframe the problem into a form that avoids the dangers of generative AI (bias, toxicity, hallucination).

Approach #3 is what I see most commonly used by sophisticated teams.

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – admin@technoblender.com. The content will be deleted within 24 hours.

Four Approaches to build on top of Generative AI Foundational Models | by Lak Lakshmanan | Mar, 2023

What works, the pros and cons, and example code for each approach

The bar to clear

Approach 1: Use the API Directly

Approach 2: Use langchain

Approach 3: Finetune the Generative AI Chain

Approach 4: Simplify the problem

Summary

What works, the pros and cons, and example code for each approach

The bar to clear

Approach 1: Use the API Directly

Approach 2: Use langchain

Approach 3: Finetune the Generative AI Chain

Approach 4: Simplify the problem

Summary

Related Posts