
Building Owly, an AI Comic Video Generator for My Son
by Agustinus Nalwan, April 2023



Owly the AI Comic Story Teller [AI Generated Image]

Every evening, it has become a cherished routine to share bedtime stories with my 4-year-old son Dexie, who absolutely adores them. His collection of books is impressive, but he’s especially captivated when I create tales from scratch. Crafting stories this way also allows me to incorporate moral values I want him to learn, which can be difficult to find in store-bought books. Over time, I’ve honed my skills in crafting personalised narratives that ignite his imagination — from dragons with fractured walls to a lonely sky lantern seeking companionship. Lately, I’ve been spinning yarns about fictional superheroes like Slow-Mo Man and Fart-Man, which have become his favourites.

While it’s been a delightful journey for me, after half a year of nightly storytelling, my creative reservoir is being tested. To keep him engaged with fresh and exciting stories without exhausting myself, I need a more sustainable solution — an AI technology that can generate captivating tales automatically! I named her Owly, after his favourite bird, an owl.

Pookie and the secret door to a magic forest — Generated by AI Comic Generator.

As I started assembling my wish list, it quickly ballooned, driven by my eagerness to test the frontiers of modern technology. No ordinary text-based story would do — I envisioned an AI crafting a full-blown comic with up to 10 panels. To amp up the excitement for Dexie, I aimed to customise the comic using characters he knew and loved, like Zelda and Mario, and maybe even toss in his toys for good measure. Frankly, the personalisation angle emerged from a need for visual consistency across the comic strips, which I will dive into later. But hold your horses, that’s not all — I also wanted the AI to narrate the story aloud, backed by a fitting soundtrack to set the mood. Tackling this project would be equal parts amusing and challenging for me, while Dexie would be treated to a tailor-made, interactive storytelling extravaganza.

Dexie’s toys as comic story’s leading characters [Image by Author]

To conquer the aforementioned requirements, I realised I needed to assemble five marvellous modules:

  1. The Story Script Generator, conjuring up a multi-paragraph story where each paragraph will be transformed into a comic strip section. Plus, it recommends a musical style to pluck a fitting tune from my library. To pull this off, I enlisted the mighty OpenAI GPT3.5 Large Language Model (LLM).
  2. The Comic Strip Image Generator, whipping up images for each story segment. Stable Diffusion 2.1 teamed up with Amazon SageMaker JumpStart, SageMaker Studio and Batch Transform to bring this to life.
  3. The Text-to-Speech Module, turning the written tale into an audio narration. Amazon Polly’s neural engine leaped to the rescue.
  4. The Video Maker, weaving the comic strips, audio narration, and music into a self-playing masterpiece. MoviePy was the star of this show.
  5. And finally, The Controller, orchestrating the grand symphony of all four modules, built on the mighty foundation of AWS Batch.

The game plan? Get the Story Script Generator to weave a 7–10 paragraph narrative, with each paragraph morphing into a comic strip section. The Comic Strip Image Generator then generates images for each segment, while the Text-to-Speech Module crafts the audio narration. A melodious tune will be selected based on the story generator’s recommendation. And finally, the Video Maker combines images, audio narration, and music to create a whimsical video. Dexie is in for a treat with this one-of-a-kind, interactive story-time adventure!

Before delving into the Story Script Generator, let’s first explore the image generator module to provide context for any references to the image generation process. There are numerous text-to-image AI models available, but I chose the Stable Diffusion 2.1 model for its popularity and ease of building, fine-tuning, and deployment using Amazon SageMaker and the broader AWS ecosystem.

Amazon SageMaker Studio is an integrated development environment (IDE) that offers a unified web-based interface for all machine learning (ML) tasks, streamlining data preparation, model building, training, and deployment. This boosts data science team productivity by up to 10x. Within SageMaker Studio, users can seamlessly upload data, create notebooks, train and tune models, adjust experiments, collaborate with their team, and deploy models to production.

Amazon SageMaker JumpStart, a valuable feature within SageMaker Studio, provides an extensive collection of widely-used pre-trained AI models. Some models, including Stable Diffusion 2.1 base, can be fine-tuned with your own training set and come with a sample Jupyter Notebook. This enables you to quickly and efficiently experiment with the model.

Launching Stable Diffusion 2.1 Notebook on Amazon SageMaker JumpStart [Image by Author]

I navigated to the Stable Diffusion 2.1 base view model page and launched the Jupyter notebook by clicking on the Open Notebook button.

Stable Diffusion 2.1 Base model card [Image by Author]

In a matter of seconds, Amazon SageMaker Studio presented the example notebook, complete with all the necessary code to load the text-to-image model from JumpStart, deploy the model, and even fine-tune it for personalised image generation.

Amazon SageMaker Studio IDE [Image by Author]

Numerous text-to-image models are available, with many tailored to specific styles by their creators. Utilising the JumpStart API, I filtered and listed all text-to-image models using the filter_value “task == txt2img” and displayed them in a dropdown menu for convenient selection.

from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Retrieve all text-to-image generation models.
filter_value = "task == txt2img"
txt2img_models = list_jumpstart_models(filter=filter_value)

# Display the model IDs in a dropdown to select a model for inference.
model_dropdown = Dropdown(
    options=txt2img_models,
    value="model-txt2img-stabilityai-stable-diffusion-v2-1-base",
    description="Select a model",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(model_dropdown)

# Or just hard-code the model ID and version="*",
# e.g. if we want the latest 2.1 base model.
self._model_id, self._model_version = (
    "model-txt2img-stabilityai-stable-diffusion-v2-1-base",
    "*",
)

The model I required was model-txt2img-stabilityai-stable-diffusion-v2-1-base, which allows fine-tuning.

Huge selection of text-to-image models [Image by Author]

In under 5 minutes, utilising the provided code, I deployed the model to a SageMaker endpoint running on an ml.g4dn.2xlarge GPU instance. I swiftly generated my first image from my text prompts, which you can see showcased below.

My image generator crafts an image of a turtle swimming underwater [Image by Author]
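For reference, querying the deployed endpoint looks roughly like the snippet below. This is only a minimal sketch based on the JumpStart sample notebook: the endpoint name is a placeholder, and the accepted content type and response keys (such as generated_image) can vary between model versions, so treat it as an illustration rather than my exact code.

import json
import boto3
import numpy as np
from PIL import Image

endpoint_name = "jumpstart-stable-diffusion-v2-1-base"  # placeholder, created by the deploy step

runtime = boto3.client("sagemaker-runtime")
payload = {"prompt": "a photo of a turtle swimming underwater"}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)

# The sample notebook returns the generated image as a nested RGB array.
result = json.loads(response["Body"].read())
image_array = np.array(result["generated_image"], dtype=np.uint8)
Image.fromarray(image_array).save("turtle.png")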

Amazon SageMaker Studio streamlines my experimentation and prototyping process, allowing me to swiftly experiment with various image generation prompts and view the resulting images directly within the IDE using the file explorer and the preview window. Furthermore, I can upload images within the IDE, use the built-in terminal to launch the AWS CLI for uploading and downloading images to and from an S3 bucket, and execute SageMaker batch transform jobs against my models to generate numerous images at once for large-scale testing.

The task of this module is quite straightforward: produce a story script given a story topic and a character name. Generating a story on a specific topic with the GPT3.5 API is incredibly simple.

import openai

openai.api_key = self._api_key
prompt = "Write me a 1000-word story about Bob the penguin who wants to travel to Europe to see famous landmarks"
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.7,
    max_tokens=2089,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

For example, using the prompt “Write me a 1000-word story about Bob the penguin who wants to travel to Europe to see famous landmarks. He learns that his bravery and curiosity lead him to experience many exciting things.” GPT3.5 will effortlessly craft a captivating story on this topic as if it were penned by a professional storyteller, much like the example below.

Bob the penguin had always dreamed of traveling to Europe and visiting famous landmarks. He had heard stories from his friends about the Eiffel Tower in Paris, the Colosseum in Rome, and the Big Ben in London. He had grown tired of his routine life in Antarctica and yearned for adventure.

One day, Bob decided to take the plunge and begin planning his trip. He spent hours researching the best travel routes and the most affordable accommodations. After careful consideration, he decided to start his journey in Paris.

The boat ride was long and tiring, but he was excited to finally be in Europe. He checked into his hotel and immediately set off to see the Eiffel Tower. As he walked through the streets of Paris, he felt a sense of wonder and excitement that he had never felt before.

Over the next few days, he visited famous landmarks like the Louvre Museum, Notre Dame Cathedral, and the Palace of Versailles. He tried new foods and met new people, each experience adding to his adventure.

The story itself is fantastic, but to transform it into comic strips, I need to divide the story into sections and create an image for each one. The most logical approach would be to convert each paragraph into a section. However, as you can see, the images generated from those paragraphs present some significant challenges.

Our Bob the penguin is portrayed as different characters [AI generated image]
  • Character chaos ensued! Each comic strip depicted Bob as an entirely different character. In the first strip, he’s a penguin doll; in the second, he’s a person’s leg; in the third, a suited-up gentleman; and in the fourth, a man in a blue shirt. This happens because only the first paragraph mentions “Bob the penguin,” while the second refers to him as just “Bob,” and the rest as “he.” Given this scant information, it’s no wonder the image generator portrayed Bob in so many different guises.
  • The scenes also lacked focus. The first comic strip showed a penguin doll sitting on a table instead of an Antarctic-dwelling penguin dreaming of European adventures. A similar issue arose in the second comic strip, which showed someone’s leg soaring high above a city. It appears the image generator interpreted “One day, Bob decided to take the plunge” as skydiving from an airplane, while the paragraph’s focus should have been on Bob planning his trip to Europe. Long paragraphs with multiple focal points often confuse the image generator, resulting in out-of-context images.

To tackle the scene’s out-of-focus issue and improve character consistency, I refined my prompt to include a request for a concise, one-sentence scene description for each paragraph, wrapped in [] so it can be programmatically extracted. This allowed me to provide detailed examples and guide the image generator in creating more focused and accurate images.

For each section please describe the scene in details and always include the location in one sentence within [] with the following format [a photo of character in the location], [a photo of character in front of an object], [a photo of character next to an object], [a photo of a location]
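Because every scene description is wrapped in square brackets, extracting them from the generated story is straightforward. Here is a small hypothetical helper (not part of my original code) showing the idea:

import re

def extract_scene_prompts(story_text):
    # Each scene description sits inside [], e.g. "[a photo of Bob the penguin in Antarctica]"
    return re.findall(r"\[(.*?)\]", story_text)

story = "[a photo of Bob the penguin in Antarctica]\nBob the penguin was a happy and curious penguin..."
print(extract_scene_prompts(story))
# ['a photo of Bob the penguin in Antarctica']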

With the updated prompt, here’s the resulting story that was generated.

[a photo of Bob the penguin in Antarctica]
Bob the penguin was a happy and curious penguin who lived in the Antarctic. He was content with his life there, surrounded by his friends and family. But one day, he decided to take the plunge and explore the world beyond the icy continent. He had heard stories of the many beautiful and exotic places around the world, and he wanted to experience them for himself.

[a photo of Bob the penguin reading a book]
Bob the penguin started researching the world, looking at maps and reading up on different countries and cultures. He was particularly drawn to Europe, with its many famous landmarks and sights. He decided that Europe was the place he wanted to visit, so he began to plan his journey.

[a photo of Bob the penguin on a cruise ship]
He started to make the long journey by boat. He was excited and couldn’t wait to get there, and he was determined to make it to Europe. After a few weeks of travelling, he eventually arrived at his destination.

[a photo of Bob the penguin at Eiffel Tower]
Bob the penguin started exploring Europe and was amazed by all the different places he visited. He went to the Eiffel Tower in Paris, the Colosseum in Rome, and the Cliffs of Moher in Ireland. Everywhere he went he was filled with awe and delight.

As you can observe, the generated scene descriptions are considerably more focused. They mention a single scene, a location, and/or an activity being performed, often starting with the character’s name. These concise prompts prove to be much more effective for my image generator, as evidenced by the improved images generated below.

A more consistent look of our Bob the penguin [AI generated image]

Bob the penguin has made a triumphant return, but he’s still sporting a new look in each comic strip. Since the image generation process treats each image separately, and no information is provided about Bob’s colour, size, or type of penguin, consistency remains elusive.

I previously considered generating a detailed character description as part of the story generation to maintain character consistency across images. However, this approach proved to be impractical for two reasons:

  • Sometimes it’s nearly impossible to describe a character with enough detail without resorting to an overwhelming amount of text. While there may not be many types of penguins, consider birds in general — with countless shapes, colours, and species such as cockatoos, parrots, canaries, pelicans, and owls, the task becomes daunting.
  • The character generated doesn’t always adhere to the provided description within the prompt. For example, a prompt describing a green parrot with a red beak might result in an image of a green parrot with a yellow beak instead.

So, despite our best efforts, our penguin pal Bob continues to experience something of an identity crisis.

The solution to our penguin predicament lies in giving the Stable Diffusion model a visual cue of what our penguin character should look like to influence the image generation process and to maintain consistency across all generated images. In the world of Stable Diffusion, this process is known as fine-tuning, where you supply a handful (usually 5 to 15) of images containing the same object and a sentence describing it. These images shall henceforth be known as training images.

As it turns out, this added personalisation is not just a solution but also a mighty cool feature for my comic generator. Now, I can use many of Dexie’s toys as the main characters in the stories, such as his festive Christmas penguin, breathing new life into Bob the penguin, making them even more personalised and relatable for my young but tough audience. So, the quest for consistency turns into a triumph for tailor-made tales!

Dexie’s toy is now Bob the penguin [Image by Author]

During my exhilarating days of experimentation, I’ve discovered a few nuggets of wisdom to share for achieving the best results when fine-tuning the model to reduce the chance of overfitting:

  • Keep the backgrounds in your training images diverse. This way, the model won’t confuse the backdrop with the object, preventing unwanted background cameos in the generated images.
  • Capture the target object from various angles. This helps provide more visual information, enabling the model to generate the object with a greater range of angles, thus better matching the scene.
  • Mix close-ups with full-body shots. This ensures the model doesn’t assume a specific pose is necessary, granting more flexibility for the generated object to harmonise with the scene.

To perform the Stable Diffusion model fine-tuning, I launched a SageMaker Estimator training job with the Amazon SageMaker Python SDK on an ml.g5.2xlarge GPU instance and pointed the training process at my collection of training images in an S3 bucket. The resulting fine-tuned model file is then saved to s3_output_location. And, with just a few lines of code, the magic began to unfold!

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

# [Optional] Override default hyperparameters with custom values
hyperparams["max_steps"] = 400
hyperparams["with_prior_preservation"] = False
hyperparams["train_text_encoder"] = False

training_job_name = name_from_base(f"stable-diffusion-{self._model_id}-transfer-learning")

# Create SageMaker Estimator instance
sd_estimator = Estimator(
    role=self._aws_role,
    image_uri=image_uri,
    source_dir=source_uri,
    model_uri=model_uri,
    entry_point="transfer_learning.py",  # Entry-point file in source_dir and present in train_source_uri.
    instance_count=self._training_instance_count,
    instance_type=self._training_instance_type,
    max_run=360000,
    hyperparameters=hyperparams,
    output_path=s3_output_location,
    base_job_name=training_job_name,
    sagemaker_session=session,
)

# Launch a SageMaker training job by passing the S3 path of the training data
sd_estimator.fit({"training": training_dataset_s3_path}, logs=True)

To prepare the training set, ensure it contains the following files:

  1. A series of images named instance_image_x.jpg, where x is a number from 1 to N. In this case, N represents the number of images, ideally more than 10.
  2. A dataset_info.json file that includes a mandatory field called instance_prompt. This field should provide a detailed description of the object, with a unique identifier preceding the object’s name. For example, “a photo of Bob the penguin,” where ‘Bob’ acts as the unique identifier. By using this identifier, you can direct your fine-tuned model to generate either a standard penguin (referred to as “penguin”) or the penguin from your training set (referred to as “Bob the penguin”). Some sources suggest using unique names such as sks or xyz, but I discovered that it’s not essential to do so.

The dataset_info.json file can also include an optional field called class_prompt, which offers a general description of the object without the unique identifier (e.g., “a photo of a penguin”). This field is used only when the prior_preservation parameter is set to True; otherwise, it is disregarded. I will discuss this further in the advanced fine-tuning section below.

{"instance_prompt": "a photo of bob penguin",
"class_prompt": "a photo of a penguin"
}
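To tie this together, here is a rough sketch of how the training folder could be assembled and uploaded to S3 before launching the Estimator job. The bucket name, the helper functions, and the 768×768 resize are illustrative assumptions rather than my actual pipeline code:

import json
from pathlib import Path

import boto3
from PIL import Image

def prepare_training_set(image_paths, instance_prompt, class_prompt, out_dir="training_set"):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    # Resize each source image and save it as instance_image_x.jpg
    for i, path in enumerate(image_paths, start=1):
        img = Image.open(path).convert("RGB").resize((768, 768))
        img.save(out / f"instance_image_{i}.jpg")
    # Write the mandatory dataset_info.json
    with open(out / "dataset_info.json", "w") as f:
        json.dump({"instance_prompt": instance_prompt, "class_prompt": class_prompt}, f)
    return out

def upload_to_s3(local_dir, bucket, prefix):
    s3 = boto3.client("s3")
    for path in Path(local_dir).glob("*"):
        s3.upload_file(str(path), bucket, f"{prefix}/{path.name}")

# Example usage with placeholder names
folder = prepare_training_set(
    ["penguin1.jpg", "penguin2.jpg"],
    instance_prompt="a photo of bob the penguin",
    class_prompt="a photo of a penguin",
)
upload_to_s3(folder, "my-owly-bucket", "training-data/penguin-images")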

After a few test runs with Dexie’s toys, the image generator delivered some truly impressive results. It brought Dexie’s kangaroo magnetic block creation to life, hopping its way into the virtual world. The generator also masterfully depicted his beloved shower turtle toy swimming underwater, surrounded by a vibrant school of fish. The image generator certainly captured the magic of Dexie’s playtime favourites!

Dexie’s toys are brought to life [AI generated image]

Batch Transform against fine-tuned Stable Diffusion model

Since I needed to generate over a hundred images for each comic strip, deploying a SageMaker endpoint (think of it as a REST API) and generating one image at a time wasn’t the most efficient approach. Instead, I opted to run a batch transform against my model, supplying it with text files in an S3 bucket containing the prompts to generate the images.

I’ll provide more details about this process since I initially struggled with it, and I hope my explanation will save you some time. You’ll need to prepare one text file per image prompt with the following JSON content: {"prompt": "a photo of Bob the penguin in Antarctica"}. While it appears that there’s a way to combine multiple inputs into one file using the MultiRecord strategy, I was unable to figure out how it works.
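Generating those input files is just a loop that writes one small JSON object per prompt into the batch transform input prefix. A quick sketch (the bucket and prefix are placeholders):

import json
import boto3

s3 = boto3.client("s3")
bucket = "my-owly-bucket"  # placeholder
prefix = "processing/penguin-images/batch_transform_input"

prompts = [
    "a photo of Bob the penguin in Antarctica",
    "a photo of Bob the penguin reading a book",
    "a photo of Bob the penguin on a cruise ship",
]

# One file per prompt, each containing a single JSON object.
for i, prompt in enumerate(prompts):
    body = json.dumps({"prompt": prompt})
    s3.put_object(Bucket=bucket, Key=f"{prefix}/prompt_{i}.json", Body=body)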

Another challenge I encountered was executing a batch transform against my fine-tuned model. You can’t execute a batch transform using a transformer object returned by Estimator.transformer(), which had worked in my previous projects. Instead, you need to first create a SageMaker model object, specifying the S3 location of your fine-tuned model as the model_data. From there, you can create the transformer object from this model object.

from sagemaker import image_uris, model_uris, script_uris
from sagemaker import model

def _get_model_uris(self, model_id, model_version, scope):
    # Retrieve the inference docker container uri
    image_uri = image_uris.retrieve(
        region=None,
        framework=None,  # automatically inferred from model_id
        image_scope=scope,
        model_id=model_id,
        model_version=model_version,
        instance_type=self._inference_instance_type,
    )
    # Retrieve the inference script uri. This includes scripts for model loading, inference handling etc.
    source_uri = script_uris.retrieve(
        model_id=model_id, model_version=model_version, script_scope=scope
    )
    if scope == "training":
        # Retrieve the pre-trained model tarball to further fine-tune
        model_uri = model_uris.retrieve(
            model_id=model_id, model_version=model_version, model_scope=scope
        )
    else:
        model_uri = None

    return image_uri, source_uri, model_uri


image_uri, source_uri, model_uri = self._get_model_uris(self._model_id, self._model_version, "inference")

# Get the model artifact location from estimator.model_data, or give an S3 key directly
model_artifact_s3_location = f"s3://{self._bucket}/output-model/{job_id}/{training_job_name}/output/model.tar.gz"

env = {
    "MMS_MAX_RESPONSE_SIZE": "20000000",
}

# Create a SageMaker model from the saved model artifact
sm_model = model.Model(
    model_data=model_artifact_s3_location,
    role=self._aws_role,
    entry_point="inference.py",  # entry point file in source_dir and present in deploy_source_uri
    image_uri=image_uri,
    source_dir=source_uri,
    env=env,
)

transformer = sm_model.transformer(
    instance_count=self._inference_instance_count,
    instance_type=self._inference_instance_type,
    output_path=f"s3://{self._bucket}/processing/{job_id}/output-images",
    accept='application/json',
)
transformer.transform(
    data=f"s3://{self._bucket}/processing/{job_id}/batch_transform_input/",
    content_type='application/json',
)

And with that, my customised image generator is all ready!

Advanced Stable Diffusion model fine-tuning

While it’s not essential for my comic generator project, I’d like to touch on some advanced fine-tuning techniques involving the max_steps, prior_preservation, and train_text_encoder hyperparameters, in case they come in handy for your projects.

Stable Diffusion model fine-tuning is highly susceptible to overfitting due to the vast difference between the number of training images you provide and those used in the base model. For example, you might only supply 10 images of Bob the penguin, while the base model’s training set contains thousands of penguin images. A larger number of images reduces the likelihood of overfitting and erroneous associations between the target object and other elements.

When prior_preservation is set to True, Stable Diffusion generates a default number of images (typically 100) using the class_prompt provided and combines them with your instance_images during fine-tuning. Alternatively, you can supply these images manually by placing them in the class_data_dir subfolder. In my experience, prior_preservation is often crucial when fine-tuning Stable Diffusion for human faces. When employing prior_preservation, ensure you provide a class_prompt that mentions the most suitable generic name or common object resembling your character. For Bob the penguin, this object is clearly a penguin, so your class prompt would be “a photo of a penguin”. This technique can also be used to generate a blend between two characters, which I will discuss later.

Another helpful parameter for advanced fine-tuning is train_text_encoder. Set it to True to enable text encoder training during the fine-tuning process. The resulting model will better understand more complex prompts and generate human faces with greater accuracy.

Depending on your specific use case, different hyperparameter values may yield better results. Additionally, you’ll need to adjust the max_steps parameter to control the number of fine-tuning steps required. Keep in mind that setting max_steps too high might lead to overfitting.
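As an illustration, a fine-tuning run that switches these options on might override the defaults along the following lines. The defaults come from hyperparameters.retrieve_default in the SageMaker SDK; num_class_images is the knob I believe controls how many class images get generated, and the values shown are purely illustrative:

from sagemaker import hyperparameters

# Retrieve the default hyperparameters for the JumpStart model,
# then override the ones discussed above (values are illustrative).
hyperparams = hyperparameters.retrieve_default(
    model_id="model-txt2img-stabilityai-stable-diffusion-v2-1-base",
    model_version="*",
)
hyperparams["max_steps"] = 400
hyperparams["with_prior_preservation"] = True  # generate class images from class_prompt
hyperparams["num_class_images"] = 100          # assumed knob; check the retrieved defaults
hyperparams["train_text_encoder"] = True       # better prompt understanding and faces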

By utilising Amazon Polly’s Neural Text-to-Speech (NTTS) feature, I was able to create audio narration for each paragraph of the story. The quality of the audio narration is exceptional; it sounds incredibly natural and human-like, making it an ideal storyteller.

To accommodate a younger audience, such as Dexie, I employed the SSML format and utilised the <prosody rate> tag to reduce the speaking speed to 90% of its normal rate, ensuring the content would not be delivered too quickly for them to follow.

import boto3

self._pollyClient = boto3.Session(region_name=aws_region).client('polly')
ftext = f"<speak><prosody rate=\"90%\">{text}</prosody></speak>"
response = self._pollyClient.synthesize_speech(VoiceId=self._speaker,
                                               OutputFormat='mp3',
                                               Engine='neural',
                                               Text=ftext,
                                               TextType='ssml')

with open(mp3_path, 'wb') as file:
    file.write(response['AudioStream'].read())

After all the hard work, I used MoviePy — a fantastic Python framework — to magically turn all the photos, audio narration, and music into an awesome mp4 video. Speaking of music, I gave my tech the power to choose the perfect soundtrack to match the video’s vibe. How, you ask? Well, I just modified my story script generator to return a music style from a pre-determined list using some clever prompts. How cool is that?

At the start of the story please suggest song style from the following list only which matches the story and put it within <>. Song style list are action, calm, dramatic, epic, happy and touching.

Once the music style is selected, the next step is to randomly pick an MP3 track from the relevant folder, which contains a handful of MP3 files. This helps to add a touch of unpredictability and excitement to the final product.
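Here is a rough sketch of how the music selection and the final MoviePy assembly could look. The folder layout, clip durations, and helper names are assumptions on my part, and the Ken Burns effect is reduced to a simple zoom; my actual Video Maker also adds smooth transitions between the clips.

import random
from pathlib import Path

from moviepy.editor import (ImageClip, AudioFileClip,
                            CompositeAudioClip, concatenate_videoclips)

def pick_music(style, music_root="music"):
    # e.g. music/happy/*.mp3: pick a random track matching the suggested style
    tracks = list(Path(music_root, style).glob("*.mp3"))
    return str(random.choice(tracks))

def build_video(image_paths, narration_paths, music_path, out_path="story.mp4"):
    clips = []
    for img, mp3 in zip(image_paths, narration_paths):
        narration = AudioFileClip(mp3)
        clip = (ImageClip(img)
                .set_duration(narration.duration + 1.0)  # short pause after each narration
                .set_audio(narration)
                .resize(lambda t: 1 + 0.02 * t))          # simple Ken Burns style zoom
        clips.append(clip)
    video = concatenate_videoclips(clips, method="compose")
    music = AudioFileClip(music_path).volumex(0.2).set_duration(video.duration)
    video = video.set_audio(CompositeAudioClip([video.audio, music]))
    video.write_videofile(out_path, fps=24)

build_video(
    ["scene_1.png", "scene_2.png"],
    ["narration_1.mp3", "narration_2.mp3"],
    pick_music("happy"),
)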

To orchestrate the entire system, I needed a controller module in the form of a Python script that could run each module seamlessly. But, of course, I needed a compute environment to execute this script. I had two options to explore, the first being my preferred one: a serverless architecture with AWS Lambda. This involved using several AWS Lambda functions paired with SQS. The first Lambda serves as a public API, with API Gateway as the entry point. This API would take in the training image URLs and story topic text, pre-process the data, and drop it into an SQS queue. Another Lambda would pick up the data from the queue and handle data preparation (think image resizing and creating dataset_info.json), then trigger the next Lambda to call Amazon SageMaker JumpStart to prepare the Stable Diffusion model and run a SageMaker training job to fine-tune it. Phew, that’s a mouthful. Finally, Amazon EventBridge would be used as an event bus to detect the completion of the training job and trigger the next Lambda to execute a SageMaker Batch Transform using the fine-tuned model to generate images.

But alas, this option was not possible because the AWS Lambda function has a maximum storage limit of 10GB. When executing a batch transform against the SageMaker model, the SageMaker Python SDK downloads and extracts the model.tar.gz file temporarily into the local /tmp before sending it to the managed system that runs the batch transform. Unfortunately, my model was a whopping 5GB compressed, so the SageMaker Python SDK threw an error saying “Out of disk space.” For most use cases, where the model is smaller, this would be the best and cleanest solution.

So, I had to resort to my second option: AWS Batch. It worked well, but it did cost a bit more, since the AWS Batch compute instance had to run throughout the entire process, even while the model fine-tuning and the batch transform were executing in separate compute environments within SageMaker. I could have split the process into several AWS Batch jobs and glued them together with Amazon EventBridge and SQS, just as I would have done with the serverless approach. But with AWS Batch’s longer startup time (around 5 minutes), it would have added too much latency to the overall process. So, I went with the all-in-one AWS Batch option instead.

Owly system architecture

Feast your eyes upon Owly’s majestic architecture diagram! Our adventure kicks off by launching AWS Batch through the AWS Console, equipping it with an S3 folder brimming with training images, a captivating story topic, and a delightful character, all supplied via AWS Batch environment variables.

# Basic settings
JOB_ID = "penguin-images"  # key to the S3 folder containing the training images
STORY_TOPIC = "bob the penguin who wants to travel to Europe"
STORY_CHARACTER = "bob the penguin"

# Advanced settings
TRAIN_TEXT_ENCODER = False
PRIOR_PRESERVATION = False
MAX_STEPS = 400
NUM_IMAGE_VARIATIONS = 5
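I kick the job off through the AWS Console, but the same settings can just as easily be passed programmatically as environment overrides when submitting the Batch job. A sketch with placeholder job queue and job definition names:

import boto3

batch = boto3.client("batch")
batch.submit_job(
    jobName="owly-penguin-story",
    jobQueue="owly-job-queue",            # placeholder
    jobDefinition="owly-job-definition",  # placeholder
    containerOverrides={
        "environment": [
            {"name": "JOB_ID", "value": "penguin-images"},
            {"name": "STORY_TOPIC", "value": "bob the penguin who wants to travel to Europe"},
            {"name": "STORY_CHARACTER", "value": "bob the penguin"},
            {"name": "MAX_STEPS", "value": "400"},
        ]
    },
)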

The AWS Batch job springs into action, retrieving the training images from the S3 folder specified by JOB_ID, resizing them to 768×768, and creating a dataset_info.json file before placing them in a staging S3 bucket.

Next up, we call up the OpenAI GPT3.5 model API to whip up an engaging story and a complementary song style in harmony with the chosen topic and character. We then summon Amazon SageMaker JumpStart to unleash the powerful Stable Diffusion 2.1 base model. With the model at our disposal, we initiate a SageMaker training job to fine-tune it on our carefully selected training images. After a brief 30-minute interlude, we forge image prompts for each story paragraph in the guise of text files, which are then dropped into an S3 bucket as input for the image generation extravaganza. Amazon SageMaker Batch Transform is unleashed on the fine-tuned model to produce these images in a batch, a process that lasts a mere 5 minutes.

Once complete, we enlist the help of Amazon Polly to craft audio narrations for each paragraph in the story, saving them as mp3 files in just 30 seconds. We then randomly pick an mp3 music file from libraries sorted by song style, based on the selection made by our masterful story generator.

The final act sees the resulting images, audio narration mp3s, and the music mp3 expertly woven together into a video slideshow with the help of MoviePy. Smooth transitions and the Ken Burns effect are added for that extra touch of elegance. The pièce de résistance, the finished video, is then hoisted up to the output S3 bucket, awaiting your eager download!

I must say, I’m rather chuffed with the results! The story script generator has truly outdone itself, performing far better than anticipated. Almost every story script crafted is not only well-written but also brimming with positive morals, showcasing the awe-inspiring prowess of Large Language Models (LLM). As for image generation, well, it’s a bit of a mixed bag.

With all the enhancements I’ve described earlier, one in five stories can be used in the final video right off the bat. The remaining four, however, usually have one or two images plagued by common issues.

  • First, we’ve got inconsistent characters, still. Sometimes the model conjures up a character that’s slightly different from the original in the training set, often opting for a photorealistic version rather than the toy counterpart. But fear not! Adding a desired photo style within the text prompt, like “A cartoon-style Rex the turtle swimming under the sea,” helps curb this issue. However, it does require manual intervention since certain characters may warrant a photorealistic style.
  • Then there’s the curious case of missing body parts. Occasionally, our generated characters appear with absent limbs or heads. Yikes! To mitigate this, we added negative prompts supported by the Stable Diffusion model, such as “missing limbs, missing head,” encouraging the generation of images that steer clear of these peculiar attributes (see the payload sketch after this list).
Rex the turtle in different styles (the bottom-right image is in a photorealistic style, the top-right in a mixed style, the rest in a toy style) and missing a head (top-left image) [AI generated image]
  • Bizarre images emerge when dealing with uncommon interactions between objects. Generating images of characters in specific locations typically produces satisfactory results. However, when it comes to illustrating characters interacting with other objects, especially in an uncommon way, the outcome is often less than ideal. For instance, attempting to depict Tom the hedgehog milking a cow results in a peculiar blend of hedgehog and cow. Meanwhile, crafting an image of Tom the hedgehog holding a flower bouquet leads to a person clutching both a hedgehog and a bouquet of flowers. Regrettably, I have yet to devise a strategy to remedy this issue, leading me to conclude that it’s simply a limitation of current image generation technology. If the object or activity in the image you’re trying to generate is highly unusual, the model lacks prior knowledge, as none of the training data has ever depicted such scenes or activities.
A mix of a hedgehog and a cow (top images) is generated from the “Tom the hedgehog is milking a cow” prompt. A person holding a hedgehog and a flower (bottom-left image) is generated from “Tom the hedgehog is holding a flower” [AI generated image]
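In the batch transform input, the negative prompt simply rides along in the same JSON payload as the prompt. A sketch, assuming the deployed model version accepts a negative_prompt key (check your model’s documentation for the exact field name):

import json

payload = {
    "prompt": "a cartoon-style Rex the turtle swimming under the sea",
    "negative_prompt": "missing limbs, missing head",  # attributes to steer away from
}

# Written as one file per prompt, exactly like the plain prompt files earlier.
with open("prompt_0.json", "w") as f:
    json.dump(payload, f)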

In the end, to boost the odds of success in story generation, I cleverly tweaked my story generator to produce three distinct scenes per paragraph. Moreover, for each scene, I instructed my image generator to create five image variations. With this approach, I increased the likelihood of obtaining at least one top-notch image from the fifteen available. Having three different prompt variations also aids in generating entirely unique scenes, especially when one scene proves too rare or complex to create. Below is my updated story generation prompt.

"Write me a {max_words} words story about a given character and a topic.\nPlease break the story down into " \
"seven to ten short sections with 30 maximum words per section. For each section please describe the scene in " \
"details and always include the location in one sentence within [] with the following format " \
"[a photo of character in the location], [a photo of character in front of an object], " \
"[a photo of character next to an object], [a photo of a location]. Please provide three different variations " \
"of the scene details separated by |\\nAt the start of the story please suggest song style from the following " \
"list only which matches the story and put it within <>. Song style list are action, calm, dramatic, epic, " \
"happy and touching."

The only additional cost is a bit of manual intervention after the image generation step is done, where I handpick the best image for each scene and then proceed with the comic generation process. This minor inconvenience aside, I now boast a remarkable success rate of 9 out of 10 in crafting splendid comics!

With the Owly system fully assembled, I decided to put this marvel of technology to the test one fine Saturday afternoon. I generated a handful of stories from his toy collection, ready to enhance bedtime storytelling for Dexie using a nifty portable projector I had purchased. That night, as I saw Dexie’s face light up and his eyes widen with excitement, the comic playing out on his bedroom wall, I knew all my efforts had been worth it.

Dexie is watching the comic on his bedroom wall [Image by Author]

The cherry on top is that it now takes me under two minutes to whip up a new story using photos of his toy characters I’ve already captured. Plus, I can seamlessly incorporate valuable morals I want him to learn from each story, such as not talking to strangers, being brave and adventurous, or being kind and helpful to others. Here are some of the delightful stories generated by this fantastic system.

Super Hedgehog Tom Saves His City From a Dragon — Generated by AI Comic Generator.
Bob the Brave Penguin: Adventures in Europe — Generated by AI Comic Generator.

As a curious tinkerer, I couldn’t help but fiddle with the image generation module to push Stable Diffusion’s boundaries and merge two characters into one magnificent hybrid. I fine-tuned the model with Kwazi Octonaut images, but I threw in a twist by assigning Zelda as both the unique and class character name. Setting prior_preservation to True, I ensured that Stable Diffusion would “octonaut-ify” Zelda while still keeping her distinct essence intact.

I cleverly utilised a modest max_steps of 400, just enough to preserve Zelda’s original charm without her being entirely consumed by Kwazi the Octonaut’s irresistible allure. Behold the glorious fusion of Zelda and Kwazi, united as one!

Dexie brimmed with excitement as he witnessed a fusion of his two favourite characters spearheading the action in his bedtime story. He embarked on thrilling adventures, combating extraterrestrial beings and hunting for hidden treasure chests!

Unfortunately, to protect the IP owner, I cannot show the resulting images.

Generative AI, particularly Large Language Models (LLMs), is here to stay and is set to become a powerful tool not only for software development but for many other industries as well. I’ve experienced the true power of LLMs firsthand in a few projects. Just last year, I built a robotic teddy bear called Ellie, capable of moving its head and engaging in conversations like a real human. While this technology is undeniably potent, it’s important to exercise caution to ensure the safety and quality of the outputs it generates, as it can be a double-edged sword.

And there you have it, folks! I hope you found this blog interesting. If so, please shower me with your claps. Feel free to connect with me on LinkedIn or check out my other AI endeavours on my Medium profile. Stay tuned, as I’ll be sharing the complete source code in the coming weeks!

Finally, I would like to say thanks to Mike Chambers from AWS who helped me troubleshoot my fine-tuned Stable Diffusion model batch transform code.


Owly the AI Comic Story Teller [AI Generated Image]

Every evening, it has become a cherished routine to share bedtime stories with my 4-year-old son Dexie, who absolutely adores them. His collection of books is impressive, but he’s especially captivated when I create tales from scratch. Crafting stories this way also allows me to incorporate moral values I want him to learn, which can be difficult to find in store-bought books. Over time, I’ve honed my skills in crafting personalised narratives that ignite his imagination — from dragons with fractured walls to a lonely sky lantern seeking companionship. Lately, I’ve been spinning yarns about fictional superheroes like Slow-Mo Man and Fart-Man, which have become his favourites.

While it’s been a delightful journey for me, after half a year of nightly storytelling, my creative reservoir is being tested. To keep him engaged with fresh and exciting stories without exhausting myself, I need a more sustainable solution — an AI technology that can generate captivating tales automatically! I named her Owly, after his favourite bird, an owl.

Pookie and the secret door to a magic forest — Generated by AI Comic Generator.

As I started assembling my wish list, it quickly ballooned, driven by my eagerness to test the frontiers of modern technology. No ordinary text-based story would do — I envisioned an AI crafting a full-blown comic with up to 10 panels. To amp up the excitement for Dexie, I aimed to customise the comic using characters he knew and loved, like Zelda and Mario, and maybe even toss in his toys for good measure. Frankly, the personalisation angle emerged from a need for visual consistency across the comic strips, which I will dive into later. But hold your horses, that’s not all — I also wanted the AI to narrate the story aloud, backed by a fitting soundtrack to set the mood. Tackling this project would be equal parts amusing and challenging for me, while Dexie would be treated to a tailor-made, interactive storytelling extravaganza.

Dexie’s toys as comic story’s leading characters [Image by Author]

To conquer the aforementioned requirements, I realised I needed to assemble five marvellous modules:

  1. The Story Script Generator, conjuring up a multi-paragraph story where each paragraph will be transformed into a comic strip section. Plus, it recommends a musical style to pluck a fitting tune from my library. To pull this off, I enlisted the mighty OpenAI GPT3.5 Large Language Model (LLM).
  2. The Comic Strip Image Generator, whipping up images for each story segment. Stable Diffusion 2.1 teamed up with Amazon SageMaker JumpStart, SageMaker Studio and Batch Transform to bring this to life.
  3. The Text-to-Speech Module, turning the written tale into an audio narration. Amazon Polly’s neural engine leaped to the rescue.
  4. The Video Maker, weaving the comic strips, audio narration, and music into a self-playing masterpiece. MoviePy was the star of this show.
  5. And finally, The Controller, orchestrating the grand symphony of all four modules, built on the mighty foundation of AWS Batch.

The game plan? Get the Story Script Generator to weave a 7–10 paragraph narrative, with each paragraph morphing into a comic strip section. The Comic Strip Image Generator then generates images for each segment, while the Text-to-Speech Module crafts the audio narration. A melodious tune will be selected based on the story generator’s recommendation. And finally, the Video Maker combines images, audio narration, and music to create a whimsical video. Dexie is in for a treat with this one-of-a-kind, interactive story-time adventure!

Before delving into the Story Script Generator, let’s first explore the image generator module to provide context for any references to the image generation process. There are numerous text-to-image AI models available, but I chose the Stable Diffusion 2.1 model for its popularity and ease of building, fine-tuning, and deployment using Amazon SageMaker and the broader AWS ecosystem.

Amazon SageMaker Studio is an integrated development environment (IDE) that offers a unified web-based interface for all machine learning (ML) tasks, streamlining data preparation, model building, training, and deployment. This boosts data science team productivity by up to 10x. Within SageMaker Studio, users can seamlessly upload data, create notebooks, train and tune models, adjust experiments, collaborate with their team, and deploy models to production.

Amazon SageMaker JumpStart, a valuable feature within SageMaker Studio, provides an extensive collection of widely-used pre-trained AI models. Some models, including Stable Diffusion 2.1 base, can be fine-tuned with your own training set and come with a sample Jupyter Notebook. This enables you to quickly and efficiently experiment with the model.

Launching Stable Diffusion 2.1 Notebook on Amazon SageMaker JumpStart [Image by Author]

I navigated to the Stable Diffusion 2.1 base view model page and launched the Jupyter notebook by clicking on the Open Notebook button.

Stable Diffusion 2.1 Base model card [Image by Author]

In a matter of seconds, Amazon SageMaker Studio presented the example notebook, complete with all the necessary code to load the text-to-image model from JumpStart, deploy the model, and even fine-tune it for personalised image generation.

Amazon SageMaker Studio IDE [Image by Author]

Numerous text-to-image models are available, with many tailored to specific styles by their creators. Utilising the JumpStart API, I filtered and listed all text-to-image models using the filter_value “task == txt2img” and displayed them in a dropdown menu for convenient selection.

from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Retrieves all Text-to-Image generation models.
filter_value = "task == txt2img"
txt2img_models = list_jumpstart_models(filter=filter_value)

# display the model-ids in a dropdown to select a model for inference.
model_dropdown = Dropdown(
options=txt2img_models,
value="model-txt2img-stabilityai-stable-diffusion-v2-1-base",
description="Select a model",
style={"description_width": "initial"},
layout={"width": "max-content"},
)
display(model_dropdown)

# Or just hard code the model id and version=*.
# Eg. if we want the latest 2.1 base model
self._model_id, self._model_version = (
"model-txt2img-stabilityai-stable-diffusion-v2-1-base",
"*",
)

The model I required was model-txt2img-stabilityai-stable-diffusion-v2–1-base which allow fine-tuning.

Huge selection of text-to-image models [Image by Author]

In under 5 minutes, utilising the provided code, I deployed the model to a SageMaker endpoint running a g4dn.2xlarge GPU instance. I swiftly generated my first image from my text prompts, which you can see showcased below.

My image generator crafts an image of turtle swimming underwater [Image by Author]

The Amazon SageMaker Studio streamlines my experimentation and prototyping process, allowing me to swiftly experiment with various image generation prompts and view the resulting images directly within the IDE using the file explorer and the preview window. Furthermore, I can upload images within the IDE, utilise the built-in terminal to launch AWS CLI for uploading and downloading images to and from an S3 bucket, and execute SageMaker batch transform jobs against my models to generate numerous images at once for a large scale testing.

The task of this module is quite straightforward: produce a story script given a story topic and a character name. Generating a story on a specific topic with GPT3.5 API is incredibly simple.

openai.api_key = self._api_key
prompt = "Write me a 1000-word story about Bob the penguin who wants to travel to Europe to see famous landmarks"
response = openai.Completion.create(
model="text-davinci-003",
prompt=prompt,
temperature=0.7,
max_tokens=2089,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)

For example, using the prompt “Write me a 1000-word story about Bob the penguin who wants to travel to Europe to see famous landmarks. He learns that his bravery and curiosity lead him to experience many exciting things.” GPT3.5 will effortlessly craft a captivating story on this topic as if it were penned by a professional storyteller, much like the example below.

Bob the penguin had always dreamed of traveling to Europe and visiting famous landmarks. He had heard stories from his friends about the Eiffel Tower in Paris, the Colosseum in Rome, and the Big Ben in London. He had grown tired of his routine life in Antarctica and yearned for adventure.

One day, Bob decided to take the plunge and begin planning his trip. He spent hours researching the best travel routes and the most affordable accommodations. After careful consideration, he decided to start his journey in Paris.

The boat ride was long and tiring, but he was excited to finally be in Europe. He checked into his hotel and immediately set off to see the Eiffel Tower. As he walked through the streets of Paris, he felt a sense of wonder and excitement that he had never felt before.

Over the next few days, he visited famous landmarks like the Louvre Museum, Notre Dame Cathedral, and the Palace of Versailles. He tried new foods and met new people, each experience adding to his adventure.

The story itself is fantastic, but to transform it into comic strips, I need to divide the story into sections and create an image for each one. The most logical approach would be to convert each paragraph into a section. However, as you can see, the images generated from those paragraphs present some significant challenges.

Our bob the penguin is portrayed as different characters [AI generated image]
  • Character chaos ensued! Each comic strip depicted Bob as an entirely different character. In the first strip, he’s a penguin doll; in the second, he’s a person’s leg; in the third, a suited-up gentleman; and in the fourth, a man in a blue shirt. This happens because only the first paragraph mentions “Bob the penguin,” while the second refers to him as just “Bob,” and the rest as “he.” Given this scant information, it’s no wonder the image generator portrayed Bob in so many different guises.
  • The scenes also lacked focus. The first comic strip showed a penguin doll sitting on a table instead of an Antarctic-dwelling penguin dreaming of European adventures. A similar issue arose in the second comic strip, which showed someone’s leg soaring high above a city. It appears the image generator interpreted “One day, Bob decided to take the plunge” as skydiving from an airplane, while the paragraph’s focus should have been on Bob planning his trip to Europe. Long paragraphs with multiple focal points often confuse the image generator, resulting in out-of-context images.

To tackle the scene’s out-of-focus issue and improve character consistency, I refined my prompt to include a request for a concise, one-sentence scene description for each paragraph wrapped in [] so they can be programatically extracted. This allowed me to provide detailed examples and guide the image generator in creating more focused and accurate images.

For each section please describe the scene in details and always include the location in one sentence within [] with the following format [a photo of character in the location], [a photo of character in front of an object], [a photo of character next to an object], [a photo of a location]

With the updated prompt, here’s the resulting story that was generated.

[a photo of Bob the penguin in Antarctica]
Bob the penguin was a happy and curious penguin who lived in the Antarctic. He was content with his life there, surrounded by his friends and family. But one day, he decided to take the plunge and explore the world beyond the icy continent. He had heard stories of the many beautiful and exotic places around the world, and he wanted to experience them for himself.

[a photo of Bob the penguin reading a book]
Bob the penguin started researching the world, looking at maps and reading up on different countries and cultures. He was particularly drawn to Europe, with its many famous landmarks and sights. He decided that Europe was the place he wanted to visit, so he began to plan his journey.

[a photo of Bob the penguin on a cruise ship]
He started to make the long journey by boat. He was excited and couldn’t wait to get there, and he was determined to make it to Europe. After a few weeks of travelling, he eventually arrived at his destination.

[a photo of Bob the penguin at Eiffel Tower]
Bob the penguin started exploring Europe and was amazed by all the different places he visited. He went to the Eiffel Tower in Paris, the Colosseum in Rome, and the Cliffs of Moher in Ireland. Everywhere he went he was filled with awe and delight.

As you can observe, the generated scene descriptions are considerably more focused. They mention a single scene, a location, and/or an activity being performed, often starting with the character’s name. These concise prompts prove to be much more effective for my image generator, as evidenced by the improved images generated below.

A more consistent look of our Bob the penguin [AI generated image]

Bob the penguin has made a triumphant return, but he’s still sporting a new look in each comic strip. Since the image generation process treats each image separately, and no information is provided about Bob’s colour, size, or type of penguin, consistency remains elusive.

I previously considered generating a detailed character description as part of the story generation to maintain character consistency across images. However, this approach proved to be impractical for two reasons:

  • Sometimes it’s nearly impossible to describe a character with enough detail without resorting to an overwhelming amount of text. While there may not be many types of penguins, consider birds in general — with countless shapes, colours, and species such as cockatoos, parrots, canaries, pelicans, and owls, the task becomes daunting.
  • The character generated doesn’t always adhere to the provided description within the prompt. For example, a prompt describing a green parrot with a red beak might result in an image of a green parrot with a yellow beak instead.

So, despite our best efforts, our penguin pal Bob continues to experience something of an identity crisis.

The solution to our penguin predicament lies in giving the Stable Diffusion model a visual cue of what our penguin character should look like to influence the image generation process and to maintain consistency across all generated images. In the world of Stable Diffusion, this process is known as fine-tuning, where you supply a handful (usually 5 to 15) of images containing the same object and a sentence describing it. These images shall henceforth be known as training images.

As it turns out, this added personalisation is not just a solution but also a mighty cool feature for my comic generator. Now, I can use many of Dexie’s toys as the main characters in the stories, such as his festive Christmas penguin, breathing new life into Bob the penguin, making them even more personalised and relatable for my young but tough audience. So, the quest for consistency turns into a triumph for tailor-made tales!

Dexie’s toy is now Bob the penguin [Image by Author]

During my exhilarating days of experimentation, I’ve discovered a few nuggets of wisdom to share for achieving the best results when fine-tuning the model to reduce the chance of overfitting:

  • Keep the backgrounds in your training images diverse. This way, the model won’t confuse the backdrop with the object, preventing unwanted background cameos in the generated images
  • Capture the target object from various angles. This helps provide more visual information, enabling the model to generate the object with a greater range of angles, thus better matching the scene.
  • Mix close-ups with full-body shots. This ensures the model doesn’t assume a specific pose is necessary, granting more flexibility for the generated object to harmonise with the scene.

To perform the Stable Diffusion model fine-tuning, I launched a SageMaker Estimator training job with Amazon SageMaker Python SDK on an ml.g5.2xlarge GPU instance and directed the training process to my collection of training images in an S3 bucket. A resulting fine-tuned model file will then be saved in s3_output_location. And, with just a few lines of code, the magic began to unfold!

# [Optional] Override default hyperparameters with custom values
hyperparams["max_steps"] = 400
hyperparams["with_prior_preservation"] = False
hyperparams["train_text_encoder"] = False

training_job_name = name_from_base(f"stable-diffusion-{self._model_id}-transfer-learning")

# Create SageMaker Estimator instance
sd_estimator = Estimator(
role=self._aws_role,
image_uri=image_uri,
source_dir=source_uri,
model_uri=model_uri,
entry_point="transfer_learning.py", # Entry-point file in source_dir and present in train_source_uri.
instance_count=self._training_instance_count,
instance_type=self._training_instance_type,
max_run=360000,
hyperparameters=hyperparams,
output_path=s3_output_location,
base_job_name=training_job_name,
sagemaker_session=session,
)

# Launch a SageMaker Training job by passing s3 path of the training data
sd_estimator.fit({"training": training_dataset_s3_path}, logs=True)

To prepare the training set, ensure it contains the following files:

  1. A series of images named instance_image_x.jpg, where x is a number from 1 to N. In this case, N represents the number of images, ideally more than 10.
  2. A dataset_info.json file that includes a mandatory field called instance_prompt. This field should provide a detailed description of the object, with a unique identifier preceding the object’s name. For example, “a photo of Bob the penguin,” where ‘Bob’ acts as the unique identifier. By using this identifier, you can direct your fine-tuned model to generate either a standard penguin (referred to as “penguin”) or the penguin from your training set (referred to as “Bob the penguin”). Some sources suggest using unique names such as sks or xyz, but I discovered that it’s not essential to do so.

The dataset_info.json file can also include an optional field called class_prompt, which offers a general description of the object without the unique identifier (e.g., “a photo of a penguin”). This field is utilised only when the prior_preservation parameter is set to True; otherwise, it will be disregarded. I will discuss more about it at the advanced fine-tuning section below.

{"instance_prompt": "a photo of bob penguin",
"class_prompt": "a photo of a penguin"
}
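
To make the preparation step concrete, here is a minimal sketch of how the training folder could be assembled and uploaded; the local paths and bucket name are hypothetical, and the resize to 768×768 simply mirrors what my controller does later on:

import json
from pathlib import Path

import boto3
from PIL import Image

source_dir = Path("toy_photos/penguin")            # hypothetical folder of raw photos
staging_dir = Path("training_data/penguin-images")
staging_dir.mkdir(parents=True, exist_ok=True)

# Naively resize each photo to 768x768 and rename it instance_image_x.jpg
for i, photo in enumerate(sorted(source_dir.glob("*.jpg")), start=1):
    img = Image.open(photo).convert("RGB").resize((768, 768))
    img.save(staging_dir / f"instance_image_{i}.jpg")

# dataset_info.json with the unique identifier "Bob" in the instance prompt
dataset_info = {
    "instance_prompt": "a photo of Bob the penguin",
    "class_prompt": "a photo of a penguin",  # only used when prior preservation is enabled
}
(staging_dir / "dataset_info.json").write_text(json.dumps(dataset_info))

# Upload the prepared training set to the S3 location the Estimator is pointed at
s3 = boto3.client("s3")
bucket = "my-owly-bucket"                          # hypothetical bucket name
for file in staging_dir.iterdir():
    s3.upload_file(str(file), bucket, f"training-images/penguin-images/{file.name}")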

After a few test runs with Dexie’s toys, the image generator delivered some truly impressive results. It brought Dexie’s kangaroo magnetic block creation to life, hopping its way into the virtual world. The generator also masterfully depicted his beloved shower turtle toy swimming underwater, surrounded by a vibrant school of fish. The image generator certainly captured the magic of Dexie’s playtime favourites!

Dexie’s toys are brought to life [AI generated image]

Batch Transform against fine-tuned Stable Diffusion model

Since I needed to generate over a hundred images for each comic strip, deploying a SageMaker endpoint (think of it as a Rest API) and generating one image at a time wasn’t the most efficient approach. Instead, I opted to run a batch transform against my model, supplying it with text files in an S3 bucket containing the prompts to generate the images.

I’ll provide more details about this process since I initially struggled with it, and I hope my explanation will save you some time. You’ll need to prepare one text file per image prompt with the following JSON content: {"prompt": "a photo of Bob the penguin in Antarctica"}. While it appears that there’s a way to combine multiple inputs into one file using the MultiRecord strategy, I was unable to figure out how it works.
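
As a rough illustration (the bucket, key names, and prompts below are just examples), the per-prompt input files might be generated and dropped into S3 like this:

import json

import boto3

s3 = boto3.client("s3")
bucket = "my-owly-bucket"                 # hypothetical bucket name
job_id = "penguin-images"                 # hypothetical job id

prompts = [
    "a photo of Bob the penguin in Antarctica",
    "a photo of Bob the penguin next to a ship",
]

# One small JSON file per prompt; the batch transform picks up the whole prefix
for i, prompt in enumerate(prompts):
    body = json.dumps({"prompt": prompt})
    key = f"processing/{job_id}/batch_transform_input/prompt_{i}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=body, ContentType="application/json")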

Another challenge I encountered was executing a batch transform against my fine-tuned model. You can’t execute a batch transform using a transformer object returned by Estimator.transformer(), which had worked in my previous projects. Instead, you need to first create a SageMaker model object, specifying the S3 location of your fine-tuned model as the model_data. From there, you can create the transformer object from this model object.

from sagemaker import image_uris, model, model_uris, script_uris

def _get_model_uris(self, model_id, model_version, scope):
    # Retrieve the inference docker container uri
    image_uri = image_uris.retrieve(
        region=None,
        framework=None,  # automatically inferred from model_id
        image_scope=scope,
        model_id=model_id,
        model_version=model_version,
        instance_type=self._inference_instance_type,
    )
    # Retrieve the inference script uri. This includes scripts for model loading, inference handling etc.
    source_uri = script_uris.retrieve(
        model_id=model_id, model_version=model_version, script_scope=scope
    )
    if scope == "training":
        # Retrieve the pre-trained model tarball to further fine-tune
        model_uri = model_uris.retrieve(
            model_id=model_id, model_version=model_version, model_scope=scope
        )
    else:
        model_uri = None

    return image_uri, source_uri, model_uri

image_uri, source_uri, model_uri = self._get_model_uris(self._model_id, self._model_version, "inference")

# Get model artifact location by estimator.model_data, or give an S3 key directly
model_artifact_s3_location = f"s3://{self._bucket}/output-model/{job_id}/{training_job_name}/output/model.tar.gz"

env = {
    "MMS_MAX_RESPONSE_SIZE": "20000000",
}

# Create a SageMaker model object from the saved fine-tuned model artifact
sm_model = model.Model(
    model_data=model_artifact_s3_location,
    role=self._aws_role,
    entry_point="inference.py",  # entry-point file in source_dir and present in deploy_source_uri
    image_uri=image_uri,
    source_dir=source_uri,
    env=env,
)

# Create the transformer from the model object and run the batch transform
transformer = sm_model.transformer(
    instance_count=self._inference_instance_count,
    instance_type=self._inference_instance_type,
    output_path=f"s3://{self._bucket}/processing/{job_id}/output-images",
    accept="application/json",
)
transformer.transform(
    data=f"s3://{self._bucket}/processing/{job_id}/batch_transform_input/",
    content_type="application/json",
)

And with that, my customised image generator is all ready!

Advanced Stable Diffusion model fine-tuning

While it’s not essential for my comic generator project, I’d like to touch on some advanced fine-tuning techniques involving the max_steps, prior_preservation, and train_text_encoder hyperparameters, in case they come in handy for your projects.

Stable Diffusion model fine-tuning is highly susceptible to overfitting due to the vast difference between the number of training images you provide and those used in the base model. For example, you might only supply 10 images of Bob the penguin, while the base model’s training set contains thousands of penguin images. A larger number of images reduces the likelihood of overfitting and erroneous associations between the target object and other elements.

When setting prior_preservation to True, Stable Diffusion generates a set of class images (100 by default) using the class_prompt provided and combines them with your instance_images during fine-tuning. Alternatively, you can supply these class images yourself by placing them in the class_data_dir subfolder. In my experience, prior_preservation is often crucial when fine-tuning Stable Diffusion for human faces. When employing prior_preservation, make sure you provide a class_prompt that mentions the most suitable generic name or common object resembling your character. For Bob the penguin, this object is clearly a penguin, so your class prompt would be “a photo of a penguin”. This technique can also be used to generate a blend between two characters, which I will discuss later.

Another helpful parameter for advanced fine-tuning is train_text_encoder. Set it to True to enable text encoder training during the fine-tuning process. The resulting model will better understand more complex prompts and generate human faces with greater accuracy.

Depending on your specific use case, different hyperparameter values may yield better results. Additionally, you’ll need to adjust the max_steps parameter to control the number of fine-tuning steps; keep in mind that fine-tuning for too many steps can lead to overfitting.
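
Putting these knobs together, an advanced fine-tuning run might override the defaults along these lines; the values here are purely illustrative:

# Illustrative hyperparameter overrides for an advanced fine-tuning run
hyperparams = {
    "max_steps": 800,                  # more steps, at the risk of overfitting
    "with_prior_preservation": True,   # mix in class images generated from class_prompt
    "train_text_encoder": True,        # also fine-tune the text encoder, helpful for faces
}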

By utilising Amazon Polly’s Neural Text To Speech (NTTS) feature, I was able to create audio narration for each paragraph of the story. The quality of the audio narration is exceptional, as it sounds incredibly natural and human-like, making it an ideal story-teller.

To accommodate a younger audience, such as Dexie, I employed the SSML format and utilised the <prosody rate> tag to reduce the speaking speed to 90% of its normal rate, ensuring the content would not be delivered too quickly for them to follow.

import boto3

self._pollyClient = boto3.Session(region_name=aws_region).client('polly')

# Wrap the text in SSML and slow the narration down to 90% of normal speed
ftext = f"<speak><prosody rate=\"90%\">{text}</prosody></speak>"
response = self._pollyClient.synthesize_speech(VoiceId=self._speaker,
                                               OutputFormat='mp3',
                                               Engine='neural',
                                               Text=ftext,
                                               TextType='ssml')

# Save the returned audio stream as an mp3 file
with open(mp3_path, 'wb') as file:
    file.write(response['AudioStream'].read())

After all the hard work, I used MoviePy — a fantastic Python framework — to magically turn all the photos, audio narration, and music into an awesome mp4 video. Speaking of music, I gave my tech the power to choose the perfect soundtrack to match the video’s vibe. How, you ask? Well, I just modified my story script generator to return a music style from a pre-determined list using some clever prompts. How cool is that?

At the start of the story please suggest song style from the following list only which matches the story and put it within <>. Song style list are action, calm, dramatic, epic, happy and touching.

Once the music style is selected, the next step is to randomly pick an MP3 track from the relevant folder, which contains a handful of MP3 files. This helps to add a touch of unpredictability and excitement to the final product.
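
For the curious, here is a stripped-down sketch of the kind of MoviePy assembly involved; the file names are hypothetical, and the real pipeline also layers in smooth transitions and the Ken Burns effect:

from moviepy.editor import (AudioFileClip, CompositeAudioClip, ImageClip,
                            concatenate_videoclips)

image_files = ["scene_1.jpg", "scene_2.jpg", "scene_3.jpg"]      # hypothetical comic panels
narration_files = ["scene_1.mp3", "scene_2.mp3", "scene_3.mp3"]  # hypothetical Polly narrations

clips = []
for image_file, narration_file in zip(image_files, narration_files):
    narration = AudioFileClip(narration_file)
    # Show each panel for as long as its narration runs, plus a short pause
    clip = ImageClip(image_file).set_duration(narration.duration + 1.0).set_audio(narration)
    clips.append(clip)

video = concatenate_videoclips(clips, method="compose")

# Mix in the background music quietly underneath the narration
# (assumes the chosen track is at least as long as the video)
music = AudioFileClip("music/happy/track_1.mp3").volumex(0.2).subclip(0, video.duration)
video = video.set_audio(CompositeAudioClip([video.audio, music]))

video.write_videofile("owly_story.mp4", fps=24)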

To orchestrate the entire system, I needed a controller module in the form of a Python script that could run each module seamlessly, plus a compute environment to execute it. I had two options to explore. The first, and my preferred option, was a server-less architecture built on AWS Lambda paired with SQS. The first Lambda would serve as a public API, with API Gateway as the entry point; it would take in the training image URLs and the story topic text, pre-process the data, and drop it into an SQS queue. Another Lambda would pick up the data from the queue and handle data preparation: resizing the images, creating dataset_info.json, and triggering the next Lambda to call Amazon SageMaker JumpStart to prepare the Stable Diffusion model and launch a SageMaker training job to fine-tune it. Phew, that’s a mouthful. Finally, Amazon EventBridge would act as an event bus, detecting the completion of the training job and triggering the next Lambda to execute a SageMaker Batch Transform against the fine-tuned model to generate the images.

But alas, this option was not possible because an AWS Lambda function has a maximum ephemeral storage limit of 10GB. When executing the batch transform against the SageMaker model, the SageMaker Python SDK downloads and extracts the model.tar.gz file temporarily into the local /tmp before sending it to the managed environment that runs the batch transform. Unfortunately, my model was a whopping 5GB compressed, so the SageMaker Python SDK threw an “Out of disk space” error. For most use cases, where the model size is smaller, this would be the best and cleanest solution.

So, I had to resort to my second option: AWS Batch. It worked well, but it did cost a bit more, since the AWS Batch compute instance had to keep running throughout the entire process, even while the model fine-tuning and the batch transform were being executed in separate SageMaker-managed compute environments. I could have split the process into several AWS Batch jobs and glued them together with Amazon EventBridge and SQS, just as I would have done with the server-less approach. But with AWS Batch’s longer startup time (around 5 minutes per job), that would have added far too much latency to the overall process. So, I went with the all-in-one AWS Batch option instead.

Owly system architecture

Feast your eyes upon Owly’s majestic architecture diagram! Our adventure kicks off by launching AWS Batch through the AWS Console, equipping it with an S3 folder brimming with training images, a captivating story topic, and a delightful character, all supplied via AWS Batch environment variables.

# Basic settings
JOB_ID = "penguin-images" # key to S3 folder containing the training images
STORY_TOPIC = "bob the penguin who wants to travel to Europe"
STORY_CHARACTER = "bob the penguin"

# Advanced settings
TRAIN_TEXT_ENCODER = False
PRIOR_RESERVATION = False
MAX_STEPS = 400
NUM_IMAGE_VARIATIONS = 5

The AWS Batch job springs into action, retrieving the training images from the S3 folder specified by JOB_ID, resizing them to 768×768, and creating a dataset_info.json file before placing everything in a staging S3 bucket.

Next up, we call up the OpenAI GPT3.5 model API to whip up an engaging story and a complementary song style in harmony with the chosen topic and character. We then summon Amazon SageMaker JumpStart to unleash the powerful Stable Diffusion 2.1 base model. With the model at our disposal, we initiate a SageMaker training job to fine-tune it to our carefully selected training images. After a brief 30-minute interlude, we forge image prompts for each story paragraph in the guise of text files, which are then dropped into an S3 bucket as input for the image generation extravaganza. Amazon SageMaker Batch Transform is unleashed on the fine-tuned model to produce these images in a batch, a process that lasts a mere 5 minutes.

Once complete, we enlist the help of Amazon Polly to craft audio narrations for each paragraph in the story, saving them as mp3 files in just 30 seconds. We then randomly pick an mp3 music file from libraries sorted by song style, based on the selection made by our masterful story generator.

The final act sees the resulting images, audio narration mp3s, and music.mp3 files expertly woven together into a video slideshow with the help of MoviePy. Smooth transitions and the Ken Burns effect are added for that extra touch of elegance. The pièce de résistance, the finished video, is then hoisted up to the output S3 bucket, awaiting your eager download!

I must say, I’m rather chuffed with the results! The story script generator has truly outdone itself, performing far better than anticipated. Almost every story script crafted is not only well-written but also brimming with positive morals, showcasing the awe-inspiring prowess of Large Language Models (LLM). As for image generation, well, it’s a bit of a mixed bag.

With all the enhancements I’ve described earlier, one in five stories can be used in the final video right off the bat. The remaining four, however, usually have one or two images plagued by common issues.

  • First, we’ve still got inconsistent characters. Sometimes the model conjures up a character that’s slightly different from the original in the training set, often opting for a photorealistic version rather than the toy counterpart. But fear not! Adding the desired photo style to the text prompt, like “A cartoon-style Rex the turtle swimming under the sea”, helps curb this issue. However, it does require manual intervention, since certain characters may warrant a photorealistic style.
  • Then there’s the curious case of missing body parts. Occasionally, our generated characters appear with absent limbs or heads. Yikes! To mitigate this, we’ve added negative prompts supported by the Stable Diffusion model, such as “missing limbs, missing head,” encouraging the generation of images that steer clear of these peculiar attributes.
Rex the turtle in different styles (the bottom-right image is in a photorealistic style, the top-right image is in a mixed style, the rest are in a toy style) and missing a head (top-left image) [AI generated image]
  • Bizarre images emerge when dealing with uncommon interactions between objects. Generating images of characters in specific locations typically produces satisfactory results. However, when it comes to illustrating characters interacting with other objects, especially in an uncommon way, the outcome is often less than ideal. For instance, attempting to depict Tom the hedgehog milking a cow results in a peculiar blend of hedgehog and cow. Meanwhile, crafting an image of Tom the hedgehog holding a flower bouquet leads to a person clutching both a hedgehog and a bouquet of flowers. Regrettably, I have yet to devise a strategy to remedy this issue, leading me to conclude that it’s simply a limitation of current image generation technology. If the object or activity in the image you’re trying to generate is highly unusual, the model lacks prior knowledge, as none of the training data has ever depicted such scenes or activities.
A mix of a hedgehog and a cow (top images) is generated from the “Tom the hedgehog is milking a cow” prompt. A person holding a hedgehog and a flower (bottom left image) is generated from “Tom the hedgehog is holding a flower” [AI generated image]

In the end, to boost the odds of success in story generation, I cleverly tweaked my story generator to produce three distinct scenes per paragraph. Moreover, for each scene, I instructed my image generator to create five image variations. With this approach, I increased the likelihood of obtaining at least one top-notch image from the fifteen available. Having three different prompt variations also aids in generating entirely unique scenes, especially when one scene proves too rare or complex to create. Below is my updated story generation prompt.

"Write me a {max_words} words story about a given character and a topic.\nPlease break the story down into " \
"seven to ten short sections with 30 maximum words per section. For each section please describe the scene in " \
"details and always include the location in one sentence within [] with the following format " \
"[a photo of character in the location], [a photo of character in front of an object], " \
"[a photo of character next to an object], [a photo of a location]. Please provide three different variations " \
"of the scene details separated by |\\nAt the start of the story please suggest song style from the following " \
"list only which matches the story and put it within <>. Song style list are action, calm, dramatic, epic, " \
"happy and touching."
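
Assuming the model follows the format this prompt asks for, the controller can pull the pieces apart with some simple string handling. The sketch below is mine rather than the real code, and the sample reply is made up; it extracts the song style and the three scene variations for each section:

import re

# A made-up reply in the format the prompt requests
reply = (
    "<happy>\n"
    "Bob the penguin dreamed of seeing Europe. "
    "[a photo of Bob the penguin in Antarctica|a photo of Bob the penguin next to an igloo|a photo of an icy shore]\n"
    "He packed his little suitcase and waddled to the harbour. "
    "[a photo of Bob the penguin in front of a ship|a photo of Bob the penguin next to a suitcase|a photo of a busy harbour]\n"
)

# Song style is wrapped in <>, one of: action, calm, dramatic, epic, happy, touching
song_style = re.search(r"<(\w+)>", reply).group(1)

sections = []
for match in re.finditer(r"\[([^\]]+)\]", reply):
    # Each [...] holds three scene variations separated by |
    variations = [scene.strip() for scene in match.group(1).split("|")]
    sections.append(variations)

print(song_style)   # happy
print(sections[0])  # three prompt variations for the first comic panel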

The only additional cost is a bit of manual intervention after the image generation step is done, where I handpick the best image for each scene and then proceed with the comic generation process. This minor inconvenience aside, I now boast a remarkable success rate of 9 out of 10 in crafting splendid comics!

With the Owly system fully assembled, I decided to put this marvel of technology to the test one fine Saturday afternoon. I generated a handful of stories from his toy collection, ready to enhance bedtime storytelling for Dexie using a nifty portable projector I had purchased. That night, as I saw Dexie’s face light up and his eyes widen with excitement, with the comic playing out on his bedroom wall, I knew all my efforts had been worth it.

Dexie is watching the comic on his bedroom wall [Image by Author]

The cherry on top is that it now takes me under two minutes to whip up a new story using photos of his toy characters I’ve already captured. Plus, I can seamlessly incorporate valuable morals I want him to learn from each story, such as not talking to strangers, being brave and adventurous, or being kind and helpful to others. Here are some of the delightful stories generated by this fantastic system.

Super Hedgehog Tom Saves His City From a Dragon — Generated by AI Comic Generator.
Bob the Brave Penguin: Adventures in Europe — Generated by AI Comic Generator.

As a curious tinkerer, I couldn’t help but fiddle with the image generation module to push Stable Diffusion’s boundaries and merge two characters into one magnificent hybrid. I fine-tuned the model with Kwazi Octonaut images, but I threw in a twist by assigning Zelda as both the unique and class character name. Setting prior_preservation to True, I ensured that Stable Diffusion would “octonaut-ify” Zelda while still keeping her distinct essence intact.

I cleverly utilised a modest max_steps of 400, just enough to preserve Zelda’s original charm without her being entirely consumed by Kwazi the Octonaut’s irresistible allure. Behold the glorious fusion of Zelda and Kwazi, united as one!
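
Concretely, the fusion experiment boils down to a dataset_info.json and a couple of hyperparameter overrides along these lines; this is a sketch of the setup with hypothetical paths, not my exact code:

import json
import os

# The training images in this folder are Kwazi the Octonaut, but both prompts say "Zelda",
# so prior preservation keeps generating the original Zelda as class images to blend with
os.makedirs("training_data/fusion", exist_ok=True)
dataset_info = {
    "instance_prompt": "a photo of Zelda",
    "class_prompt": "a photo of Zelda",
}
with open("training_data/fusion/dataset_info.json", "w") as f:
    json.dump(dataset_info, f)

hyperparams = {
    "with_prior_preservation": True,  # mix generated Zelda class images into fine-tuning
    "max_steps": 400,                 # modest step count so Zelda's look isn't fully overwritten
}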

Dexie brimmed with excitement as he witnessed a fusion of his two favourite characters spearheading the action in his bedtime story. He embarked on thrilling adventures, combating extraterrestrial beings and hunting for hidden treasure chests!

Unfortunately, to protect the IP owners, I cannot show the resulting images.

Generative AI, particularly Large Language Models (LLMs), is here to stay and is set to become a powerful tool not only for software development but for many other industries as well. I’ve experienced the true power of LLMs firsthand in a few projects. Just last year, I built a robotic teddy bear called Ellie, capable of moving its head and engaging in conversations like a real human. While this technology is undeniably potent, it’s important to exercise caution to ensure the safety and quality of the outputs it generates, as it can be a double-edged sword.

And there you have it, folks! I hope you found this blog interesting. If so, please shower me with your claps. Feel free to connect with me on LinkedIn or check out my other AI endeavours on my Medium profile. Stay tuned, as I’ll be sharing the complete source code in the coming weeks!

Finally, I would like to say thanks to Mike Chambers from AWS who helped me troubleshoot my fine-tuned Stable Diffusion model batch transform code.
