
OpenAI’s ChatGPT Is the World’s Best Chatbot
By Alberto Romero | December 2022



On the impressive abilities of a new model — and what might happen if AI models became very good at hiding their own imperfections

Credit: Author via Midjourney

OpenAI has released ChatGPT, a new dialogue language model (LM) from the GPT-3.5 family (trained on text and code) and similar to InstructGPT (aligned with reinforcement learning from human feedback). The company set up an online demo and people are losing their minds over it.
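
For readers unfamiliar with the technique, below is a minimal toy sketch of the idea behind reinforcement learning from human feedback: a reward signal standing in for human preferences scores candidate replies, and the policy’s probabilities are nudged toward the higher-scoring ones. This is purely illustrative and assumes nothing about OpenAI’s actual pipeline (which trains a reward model on human rankings and fine-tunes a large transformer); every name in the snippet is hypothetical.

```python
import math
import random

# Toy "policy": a categorical distribution over three canned replies.
# Real RLHF updates the weights of a large language model instead.
replies = ["helpful answer", "made-up answer", "refusal"]
logits = [0.0, 0.0, 0.0]

# Toy "reward": stands in for a reward model trained on human preference rankings.
human_preference = {"helpful answer": 1.0, "made-up answer": -1.0, "refusal": 0.2}

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

learning_rate = 0.5
for step in range(200):
    probs = softmax(logits)
    # Sample a reply from the current policy and score it with the reward signal.
    i = random.choices(range(len(replies)), weights=probs)[0]
    reward = human_preference[replies[i]]
    # REINFORCE-style update: raise the log-probability of rewarded replies.
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += learning_rate * reward * grad

print({r: round(p, 3) for r, p in zip(replies, softmax(logits))})
# After training, most of the probability mass sits on "helpful answer".
```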

In a nutshell, ChatGPT is a chatbot that can “answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.”

This nicely encapsulates why ChatGPT is so special: “admit”, “challenge”, and “reject” are unusual verbs to describe the behavior of an LM. Yet in ChatGPT’s case it isn’t an exaggeration (the countless examples I’ll share soon attest to it).

ChatGPT is, by far, the best chatbot in the world. It can write essays and poetry. It can find great prompts for AI art models. It can roleplay. It can write code, find a bug, explain it, solve it, and explain the solution. And it can combine ideas in the weirdest ways possible:

[Embedded tweet]

The model’s superior abilities and better alignment than, say, baseline GPT-3 make it feel more human. In turn, this makes it more believable — although that doesn’t necessarily mean it’s more reliable, and thus more trustworthy.

Like all other LMs (e.g. the GPT family, Galactica, LaMDA), it makes things up, can generate hurtful completions, and can produce misinformation. None of those deficiencies have changed significantly (ChatGPT is better, but it’s built on the same principles).

[Embedded tweet]

But that’s not my focus today. I won’t be annoying you with another cautionary tale on why we shouldn’t trust these models or a how-to piece on critical thinking.

In this article, I’ll share with you a compilation of the most interesting findings and implications people have dug out of ChatGPT (with my added commentary, of course).

And, to put the cherry on top, I’ll take you on a journey. I want to explore a hypothetical: what would happen if AI models became so good at hiding the imperfections that we could no longer find any shortcomings or deficiencies in them?

This article is a selection from The Algorithmic Bridge, an educational newsletter whose purpose is to bridge the gap between algorithms and people. It will help you understand the impact AI has on your life and develop the tools to better navigate the future.

In case you haven’t checked Twitter lately, people have spent the last two days talking to ChatGPT non-stop. I’m going to review their findings and conclusions. Once you see what I’m going to show you, you’ll understand why the above hypothetical isn’t so crazy after all.

Essays are dead

I agree 100% that essays, as a form of evaluation, will be dead soon. I’ve written about this before — and about how neither teachers nor the education system are ready for this. With ChatGPT this is now a generally accepted statement:

[Embedded tweet]

It became obvious this would happen after students started to cheat on their assignments with GPT-3 and teachers realized they had to prepare. Now it’s a tangible reality. I had ChatGPT write a 1,000-word essay (unpublished) on the top five AI predictions for 2023, and all of them seemed highly plausible.

I have to say, however, that insightful, engaging, innovative, or thought-provoking aren’t the best adjectives to describe ChatGPT’s creations. Much of its output is dull (which is inevitable unless you put real effort into coaxing out a memorable piece or two), repetitive, or inaccurate — when not utter nonsense.

What worries me — beyond the reforms the education system will need — is whether we’ll ever again be able to recognize human-made written work. LMs may get so good as to completely blur the gap between them and us. So much so that not even an AI discriminator (GAN-style) would be able to tell which is which, because there may not be a difference.

However, there’s another possibility: human writing has characteristics that can, using the right tools, reveal authorship. As LMs become masters of prose, they may develop some kind of writing idiosyncrasy (as a feature and not a bug).

Maybe we could find the AI’s styleme (like a fingerprint hidden in language) not simply to distinguish ChatGPT from a human, but to distinguish its style from all others.
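
The idea of a stylistic fingerprint can be made concrete with classic stylometry, where function-word frequencies alone often separate authors. Here is a minimal, standard-library-only sketch that turns a text into a small vector of function-word rates and compares two texts by cosine similarity. The word list and usage are illustrative assumptions on my part, not a validated detector of machine-written text.

```python
import math
import re

# A handful of English function words; real stylometry uses hundreds of features
# (punctuation habits, sentence lengths, character n-grams, and so on).
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "it",
                  "however", "therefore", "thus", "very", "quite"]

def style_vector(text: str) -> list[float]:
    """Rate of each function word per 1,000 tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)
    return [1000.0 * tokens.count(w) / n for w in FUNCTION_WORDS]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Usage: compare an unknown text against reference samples of known origin,
# e.g. cosine(style_vector(unknown_text), style_vector(chatgpt_sample)).
```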

Is Google dead?

The other grand implication of ChatGPT is that it will “kill” Google — the hegemon of internet search “is done”. No one implies it has already happened or is about to, but it’s clearly not a stretch given that people are already using the model to replace Google satisfactorily:

[Embedded tweet]

But there are a few caveats here.

Google is above OpenAI in terms of research capability, talent, and budget — if anyone can build this tech before OpenAI, it’s them. However, the juggernaut of internet ads is simply too big to adequately react and maneuver. Google’s AI research branch is arguably the best in the world, but they barely ship any products/services anymore.

Google is facing a case of “the innovator’s dilemma”: the company can’t jeopardize its core business model with risky innovations just because others might eventually dethrone it.

[Embedded tweet]

LMs could actually be the first real threat Google has faced in 20 years.

Yet, if we analyze the differences between search engines and LMs, we realize they don’t overlap perfectly.

On the one hand, search engines are rigid. They simply scour the web for pages and show you a list of links that approximately match what you’re looking for — that’s basically the simplest form of internet search. But, on the other hand, they’re reliable. You know they won’t make things up. (Google Search, like all others, is biased and may show you fake news, but you can check the sources, which is critical here.)
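
To make “the simplest form of internet search” concrete, here is a toy keyword-matching ranker over a tiny hand-made index: it scores each page by query-term overlap and returns links in order. It is a deliberately crude stand-in for what real engines do (crawling, indexing, link analysis, and much more), but it shows the key property the paragraph describes: a search engine only returns pointers to existing documents, it never writes new text. The URLs and corpus are invented for the example.

```python
# Toy index: in reality this would be built by crawling billions of pages.
index = {
    "https://example.org/chatgpt": "openai releases chatgpt a dialogue language model",
    "https://example.org/search":  "how web search engines crawl and rank pages",
    "https://example.org/cooking": "a simple recipe for tomato soup",
}

def search(query: str, k: int = 3) -> list[str]:
    terms = set(query.lower().split())
    scored = []
    for url, text in index.items():
        overlap = len(terms & set(text.split()))  # number of shared terms
        if overlap:
            scored.append((overlap, url))
    # Highest-overlap pages first; ties broken by URL for determinism.
    return [url for _, url in sorted(scored, key=lambda s: (-s[0], s[1]))][:k]

print(search("chatgpt language model"))
# ['https://example.org/chatgpt']
```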

ChatGPT is much more flexible but, because its objective isn’t to be factual or truthful, it can make up information as easily as it can give you an amazing, detailed, and precise answer. You never know a priori which one you’ll get, and you may have a hard time checking afterward (ChatGPT doesn’t give you sources and, if you ask, it could make those up anyway).

[Embedded tweet]

In short, search engines are much more limited but better equipped for the task.

That said, I don’t think the search engine will survive LMs. Time runs against them — while search engine tech isn’t advancing at all, LMs develop at the speed of light.

As soon as a more robust variation of the transformer architecture appears or companies implement “reliability modules” (whatever that means), LMs will automatically become super generative search engines.

No one would ever use Google again.
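
One plausible shape for such a “reliability module” is retrieval-augmented generation: retrieve supporting documents with a conventional search step first, then have the LM answer only from those documents and cite them. The sketch below is a hypothetical outline of that flow under my own assumptions; generate_answer is a stub, not a real API (ChatGPT had no public API at the time of writing), and a production system would be far more involved.

```python
def retrieve(question: str) -> list[str]:
    # Toy retrieval step; in practice this would be a real search backend.
    corpus = {
        "https://example.org/chatgpt": "openai releases chatgpt a dialogue language model",
        "https://example.org/cooking": "a simple recipe for tomato soup",
    }
    terms = set(question.lower().split())
    return [url for url, text in corpus.items() if terms & set(text.split())]

def generate_answer(question: str, sources: list[str]) -> str:
    # Hypothetical stand-in for a language-model call that is instructed to
    # answer *only* from the retrieved documents and to cite them.
    return f"(answer to {question!r}, grounded in {len(sources)} source(s))"

def generative_search(question: str) -> dict:
    sources = retrieve(question)
    if not sources:
        return {"answer": "I don't know.", "sources": []}  # refuse rather than invent
    return {"answer": generate_answer(question, sources), "sources": sources}

print(generative_search("what is chatgpt"))
# {'answer': "(answer to 'what is chatgpt', grounded in 1 source(s))",
#  'sources': ['https://example.org/chatgpt']}
```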

Now, I’ll attempt to explain why the hypothetical I raised in the intro is so important — and why it will be even more so in the near future.

You’ve already seen some of ChatGPT’s many impressive abilities, so now you understand why I take this seriously: ChatGPT is making it harder for the people fighting the hype to find deficiencies — which doesn’t mean they aren’t there.

[Embedded tweets]

It’s still quite apparent that ChatGPT lacks reasoning abilities and has a limited memory window (Gary Marcus wrote a great essay on why it “can seem so brilliant one minute and so breathtakingly dumb the next”).

Like Galactica, it makes nonsense sound plausible. People can “easily” get past its filters, and it’s susceptible to prompt injection. Obviously, it’s not perfect.

Yet ChatGPT is a jump forward — a jump toward us being unable to make it trip up through testing and sampling:

[Embedded tweet]

And this is a big deal.

I wrote an essay on AGI a while ago entitled “AGI will take everyone by surprise.” ChatGPT isn’t at that level or anywhere near it (it’s essentially GPT-3 on steroids), but it’s worth bringing up my arguments from that piece:

“Everything has limits. The Universe has limits — nothing outside the laws of physics can happen, no matter how much we try — and even the infinite — the set of natural numbers is infinite, but it doesn’t contain the set of real numbers.

GPT-3 has limits, and we, the ones trying to find them, also have limits. What Gwern proved [here] was that while looking for GPT-3’s limits, we found ours. It wasn’t GPT-3 that was failing to do some tasks, but us who were unable to find an adequate prompt. Our limits were preventing GPT-3 from performing a task. We were preventing GPT-3 from reaching its true potential.

This raises an immediate question: If the limitations of GPT-3 are often mistaken for ours, how could we precisely define the boundaries of what the system can or can’t do?

In the end, we are a limited system trying to evaluate another limited system. Who guarantees that our limits are beyond theirs in every sense? We have a very good example that this may not be the case: We are very bad at assessing our limitations. We keep surprising ourselves with the things we can do, individually and collectively. We keep breaking physical and cognitive limits. Thus, our measurement tools may very well fall short of the action capabilities of a powerful enough AI.”

In his essay, Gwern (a popular tech blogger) pointed out that “sampling can prove the presence of knowledge but not the absence.” He used this idea to defend his thesis that the cause of GPT-3’s failures could be bad prompting and not an inherent lack of “knowledge” on the model’s side.
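
Gwern’s point, that sampling can prove presence but not absence, can be illustrated with a toy simulation. Suppose a model “has” some capability but only a tiny fraction of possible prompt phrasings elicit it; the numbers below are invented purely for illustration. The simulation estimates how often a tester who tries a few hundred random prompts would see nothing but failures and might wrongly conclude the capability is absent.

```python
import random

# Invented assumption: only 0.1% of prompt phrasings elicit the latent capability.
ELICITING_FRACTION = 0.001
PROMPTS_PER_TESTER = 300     # how many prompts one person tries
TESTERS = 10_000

missed = 0
for _ in range(TESTERS):
    # Each prompt independently hits an eliciting phrasing with the given probability.
    found = any(random.random() < ELICITING_FRACTION
                for _ in range(PROMPTS_PER_TESTER))
    if not found:
        missed += 1

print(f"{missed / TESTERS:.1%} of testers saw only failures.")
# Analytically, (1 - 0.001) ** 300 is roughly 0.74: about three in four testers
# never see a success and could conclude the capability does not exist.
```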

What I want to underscore here is that the limitations of sampling as a testing methodology don’t apply only when we find our limits (Gwern’s argument) or the AI’s deficiencies (the anti-hype argument), but also when we don’t find any.

When people find deficiencies in ChatGPT’s responses, a common counterpoint is “you don’t know how to get the most out of the AI.” That’s fair — but insufficient — because once a systematic deficiency is found, we can conclude the system isn’t reliable.

But what will happen if, despite numerous attempts to make an AI model break character, beat its filters, and make it drop its facade of reasoning, people fail to do so?

This may seem like a philosophical thought experiment not grounded in reality, but I think it’s quite possible that — applying this reasoning to a future super AI model — we could find the upper limits of the methodologies at hand before finding the AI’s deficiencies.

(I’m not implying the model would actually be able to reason perfectly, but that a set of well-designed guardrails, filters, and intrinsic conservativeness, combined with our limitations as humans, would make it appear so.)

We would have no way to prove that the model can’t reason. No one would believe people who are now using sampling as a way to prove these limitations and everyone would eventually start to trust the system. If we don’t acknowledge this issue soon — and find a solution — it’ll be too late.

If there’s something we should take away from ChatGPT’s superb capabilities, it’s that we’re inevitably approaching this reality.



