
Stable Diffusion 2 Is the First Artist-Friendly AI Art Model

By Alberto Romero | November 2022



Opinion

But that’s not what users wanted — are they right to be angry?

Who sits now on the generative AI throne? Credit: Author via Midjourney (oh, the irony)

I don’t usually cover model announcements. I make exceptions when a model has transcendental downstream implications (as with Galactica and BLOOM) or is especially interesting or useful.

Today’s topic probably checks both: Stability.ai, the king of open-source generative AI, has announced Stable Diffusion 2.

The new version of Stable Diffusion brings key improvements and updates. In a different world, every app, feature, or program that uses Stable Diffusion would likely adopt the new version right away.

However, that’s not going to happen. Stable Diffusion 2, despite its superior technical quality, is considered by many users (if not all) a step back.

In this article, I’m going to describe — as simply as possible — the main features of Stable Diffusion 2, how it compares to 1.x versions, why people think it’s a regression, and my take on all this.

To be clear, this isn’t just about Stable Diffusion 2. What’s happening goes beyond Stability.ai — it’s a sign of what’s coming and how generative AI is about to clash with the real world.

This article is a selection from The Algorithmic Bridge, an educational newsletter whose purpose is to bridge the gap between algorithms and people. It will help you understand the impact AI has on your life and develop the tools to better navigate the future.

Stable Diffusion 2: models and features

Let’s begin with the objective part of the story.

This section is slightly technical (although not difficult), so feel free to skim through it (still worth reading if you plan to use the model).

Stable Diffusion 2 is the generic name of an entire family of models that stem from a common baseline: Stable Diffusion 2.0-base (SD 2.0-base), a raw text-to-image model.

The baseline model is trained on an aesthetic subset of the open dataset LAION-5B (keep this in mind, it will be important later) and generates 512×512 images.
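As a taste of how the baseline model can be used, here’s a minimal sketch based on Hugging Face’s diffusers library. The model id and defaults are my assumptions from the Hugging Face release, not details taken from the announcement:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed model id for SD 2.0-base on the Hugging Face Hub.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base",
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
)
pipe = pipe.to("cuda")

# The base model generates 512x512 images by default.
image = pipe("a castle on a hill at sunset").images[0]
image.save("castle_512.png")
```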

On top of SD 2.0-base, Stability.ai trained a few more models with specific features (examples below).

SD 2.0-v is also a text-to-image model but defaults to a higher resolution (768×768):

Credit: Stability.ai
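Under the same assumptions as the sketch above, using the 768×768 checkpoint should come down to swapping the model id and requesting the native resolution explicitly:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed model id for the SD 2.0-v (768x768) checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

# Ask for the model's native resolution.
image = pipe("a castle on a hill at sunset", height=768, width=768).images[0]
```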

Depth2img is a depth-to-image model that builds on the classic img2img version to improve the model’s ability to preserve structure and coherence:

Credit: Stability.ai on GitHub
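Here’s what a depth-guided call could look like with diffusers’ dedicated pipeline, again assuming the model id from the Hugging Face release:

```python
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from PIL import Image

# Assumed model id for the depth-conditioned checkpoint.
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("photo.png").convert("RGB")

# The pipeline infers a depth map from the input image and uses it to
# preserve the scene's structure while restyling the content.
image = pipe(
    prompt="the same scene as an oil painting",
    image=init_image,
    strength=0.7,  # how much the output may deviate from the input
).images[0]
```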

The upscaler model takes the outputs from the others and enhances the resolution 4x (e.g. from 512×512 to 2048×2048):

Credit: Stability.ai on GitHub
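A hedged sketch of the upscaling step (the x4-upscaler model id is an assumption; note that the upscaler is itself text-guided, so it takes a prompt too):

```python
import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

# Assumed model id for the 4x upscaler.
pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("castle_512.png").convert("RGB")  # e.g. a 512x512 output

# Each side is multiplied by 4: 512x512 -> 2048x2048 (large inputs need
# a lot of VRAM; smaller crops or tiling are a common workaround).
upscaled = pipe(prompt="a castle on a hill at sunset", image=low_res).images[0]
upscaled.save("castle_2048.png")
```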

Finally, a text-guided inpainting model provides the tools to semantically replace parts of the original image (as you can do with DALL·E):

Credit: Stability.ai on GitHub
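And a sketch of the inpainting pipeline, with the model id once more assumed from the Hugging Face release:

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Assumed model id for the text-guided inpainting checkpoint.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("portrait.png").convert("RGB")
mask = Image.open("mask.png").convert("RGB")  # white pixels = region to replace

# Only the masked region is regenerated to match the prompt;
# the rest of the image is preserved.
result = pipe(prompt="a red scarf", image=image, mask_image=mask).images[0]
```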

To facilitate portability for existing users, Stability.ai optimized the models to run on a single GPU. As they explain in the blog post: “we wanted to make it accessible to as many people as possible from the very start.”

Like Stable Diffusion 1.x, the new version falls under permissive licenses. The code is MIT-licensed (on GitHub) and the weights (on Hugging Face) follow the CreativeML Open RAIL++-M License.

Stability.ai is also releasing the models on the API platform (for developers) and DreamStudio (for users).

The single most relevant change to SD 2: The OpenCLIP encoder

Now to the more consequential news.

Stable Diffusion 2 is architecturally similar to — though better than — its predecessor. No surprises there. However, Stability.ai has drastically changed the nature of one particular component: the text/image encoder (the inner model that transforms text-image pairs into vectors).

All publicly available text-to-image models — including DALL·E and Midjourney — use OpenAI’s CLIP as their encoder.

It’s not an exaggeration to say that CLIP is the most influential model in the 2022 wave of generative AI. Without OpenAI or CLIP, it wouldn’t have taken place at all.

That puts into perspective Stability.ai’s decision — breaking a two-year standard practice — to replace OpenAI’s CLIP in Stable Diffusion 2 with a new encoder.

LAION, with support from Stability.ai, has trained OpenCLIP-ViT/H (OpenCLIP), which reportedly sets a new state of the art: “[it] greatly improves the quality of the generated images compared to earlier V1 releases.”

Stable Diffusion 2 is the first — and only — model to integrate OpenCLIP instead of CLIP.

Why is this noteworthy? Because OpenCLIP isn’t just open-source, like the original CLIP — it was trained on a public dataset (LAION-5B).

As Emad Mostaque (Stability.ai CEO) explains, “[CLIP] was great, but nobody knew what was in it.”


The fact that OpenCLIP is trained on a publicly available dataset is significant (although not necessarily good) because now devs and users can know what it encodes (i.e. what it has learned and how).
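To make that concrete, here’s a minimal sketch of querying the encoder directly through LAION’s open_clip package. The checkpoint tag (“laion2b_s32b_b79k”) is my assumption about which published ViT-H/14 weights correspond to this encoder; the announcement doesn’t specify it:

```python
import torch
import open_clip

# Load LAION's ViT-H/14 OpenCLIP model and its tokenizer.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

tokens = tokenizer(["an epic medieval scene with dragons"])
with torch.no_grad():
    # The kind of text embedding a model like SD 2 conditions generation on.
    text_features = model.encode_text(tokens)

print(text_features.shape)  # torch.Size([1, 1024]) for ViT-H/14
```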

This has two immediate implications.

First, because OpenCLIP’s and CLIP’s training data are different, the things Stable Diffusion 2 “knows” are not the same as what Stable Diffusion 1, DALL·E, and Midjourney “know”.

Mostaque explains that the prompt techniques and heuristics that worked for earlier versions of Stable Diffusion may not work equally well for the new models: “[Stable Diffusion] V2 prompts different and will take a while for folk to get used to.”

However, even if Stable Diffusion 2 has learned things differently — and it’ll force users to rethink their prompt skills — it has learned them better, he explains (I’d say users have the final word here).

Second, because now we can find out exactly whose work is present in the dataset, Stability.ai could implement opt-in/opt-out features for artists in future versions (I don’t know if the company will do this, but Mostaque himself acknowledged this as an issue).

This means Stable Diffusion 2 is more respectful of artists’ work present in the training data. A notable improvement over Midjourney and DALL·E.

Why the AI community is angry

But, if we dig deeper, we find a very different view of this.

As it turns out, Stability.ai trained OpenCLIP (and the models) on a different subset of LAION than users would’ve wanted.

They removed most NSFW content and celebrity images and, what angered people the most, they completely eliminated famous (modern) artists’ names from the labels (though not their work).

This has serious (although not necessarily bad) second-order implications for Stable Diffusion 2 and the field of generative AI at large.

On the one hand, Stability.ai is clearly trying to comply with copyright laws by reducing its legally dubious practices, i.e. scraping living artists’ work from the internet to train its models without attribution, consent, or compensation.

On the other hand, Stable Diffusion users are understandably pissed off because much of what they could generate before with the only high-quality open-source model that exists (Stable Diffusion) is now impossible.

Mostaque said prompts work differently, but the new implicit restrictions won’t be solved with better prompt engineering.

For instance, you can no longer prompt “in the style of Greg Rutkowski” and get an epic medieval scene with magic and dragons, because Stable Diffusion 2 no longer recognizes “Greg Rutkowski” to be anything in particular.

That’s gone. And with him goes every other living — or late — artist you were using. Their artworks are still present in the data, but the encoder can no longer associate the images with the names.

I acknowledge Stable Diffusion 2 is objectively much more limited than its previous iteration in its ability to make art (Midjourney v4 is much better quality-wise, for instance).

Can the AI community bypass these limitations by tweaking OpenCLIP? Although Mostaque suggested this possibility on the Discord server, it’s not clear how they could do that (in the end, it’s Stability.ai that has 5,408 A100s), and fine-tuning the encoder is costly.

A regression for generative AI?

However, despite the ubiquitous disappointment among users, Stability.ai had a good reason to do this — if you live in society, you have to adapt to the boundaries society sets.

You shouldn’t simply stomp on others (artists whose work is in the data feel that way) just because technology allows you to. And if you say that’s what freedom means, let me tell you that, from that perspective, today’s “freedom” is tomorrow’s peril.

Regulation evolves slower than technology, true, but it eventually catches up. Arguing that “the genie is out of the bottle” or “progress is unstoppable” isn’t going to suffice when the law is set.

Right now, there’s a lawsuit ongoing against Microsoft, GitHub, and OpenAI for scraping the web to train Copilot (Codex). If it ends up favoring open-source devs, it could radically redefine the generative AI landscape.

What Stability.ai did to artists is no different than what those companies did to coders. They took, without permission, the work of thousands of people to create AI technology that anyone can now use to generate copycats of the artists’ creations.

That’s most likely why the company has done this. They’re taking measures to avoid potential lawsuits (it’s hard to argue they’re protecting artists; if that were the case, they’d have done this from the beginning).

But, regardless of their motives, the end result is what matters: AI people have their tech, and artists are more protected.

If the AI community now claims that Stable Diffusion is worthless because “in the style of…” prompts don’t work (even though the artists’ work is still present in the data), maybe the only reasonable conclusion is that artists were right all along: their explicit presence in the data was bearing most of the weight of creating great AI art.

Final thoughts

As I argued a few months ago, we should have open-minded and respectful conversations about this.

Sadly — and expectedly — it hasn’t happened. AI people have largely dismissed artists’ complaints and petitions. And artists, in most cases, weren’t open to adapting to new developments and sometimes even turned aggressive toward the AI community.

None of that is helpful.

I went into the r/StableDiffusion subreddit to get a sense of the general sentiment and it matches what I’m telling you here. The AI community is seriously at odds with Stability.ai’s decisions.

“A step back” and “a regression” are among the softest things users are calling Stable Diffusion 2.

Only one comment captured what I thought while reading all that anger and frustration:

“Clearly no one here thinks that copying an artist work without permission is wrong. i find all messages to suggest that copying the style of people is somehow a step back. I am no artist, but just imagine that someone copies your work, using a tool developed by someone, and leaves you unemployed, your work being undoubtedly unique. Would this be something anyone considers fairly?”

I think it’s paramount to consider “the other side” (whether you’re an artist, an AI user, or both) when thinking about Stable Diffusion 2 in particular and generative AI in general.

Users are mad at Stability.ai now — reasonably in some senses and unreasonably in others — but they shouldn’t forget that when regulation takes place — and it will — Midjourney and OpenAI (and Microsoft and Google) will also have to adapt and comply.

This goes way beyond any particular company. It’s a matter of the world readapting to new technologies without losing sight of the rights of people (as a side note, I may not agree with the specifics of AI regulation, but I strongly believe regulation shouldn’t be nonexistent).

This non-accountability gap that generative AI companies and users have been enjoying (some may call it freedom) is coming to an end.

And, in my opinion, it’s better this way.



