The Valuable LLM Apps Excel At Being Wrong

First in my two-part guide: Build your LLM bullshit detector. Just as humans thrive by embracing our nature, successful LLM apps leverage hallucinations instead of fighting them.

This is the first in a two-part series where I separate valuable LLM applications from expensive distractions, culminating in why Meta is positioned to dominate the AI landscape.

Ever nodded along to LLM discussions while secretly having no clue what's happening under the hood? Or tried reading technical explanations that left you more confused than before? What's worse, not understanding these models directly impacts your ability to recognize which AI applications are worth a damn.

I'll demystify LLMs without making your brain hurt - imagine that!

Better yet, you'll walk away with a bullshit detector for LLM applications - saving you from expensive AI mistakes others will make.

This matters because the difference between revolutionary AI and expensive gimmicks isn't always obvious until you know what to look for.

Part two will expose why Meta's unique approach to AI could fundamentally reshape their business while others flounder. Follow both for an insider's view of the LLM landscape that you won't find in pitch decks or press releases.

You will never find an analysis this clear to help you get the big picture of the commercial world of LLMs.

Recap on LLMs

Imagine you had a massive hard drive with all the websites that exist. You also had transcriptions of all YouTube videos, all movie scripts, newspapers, and even digital copies of books. All of that organized and labeled and available to you as a giant set of text files.

A naive computer program could count how many times a word was followed by another one across all that text. That would give you a distribution of how likely words B, C, and D are to follow word A. Think of it as a hat (call it "words after A") with all possible words in there. But you've rigged the hat so that the odds of pulling out B, C, or D fall in line with that count you did earlier. And then you'd have similar hats for "words after B" and "words after C" and so on.

With those hats, you can now play a fun game. Whenever you see word A, you put your hand in "words after A" and pull out a word. Sometimes that'll be B, sometimes C or D. But whatever it was, you just continue pulling from the right hat (words after B, words after C, ... ) until you reach 100 words.

You'd get some funny results, but a lot of it would be garbage. Just think about all the words that could follow "the". It wouldn't take much for your 100 words to unravel into utter nonsense.
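
To make the hat game concrete, here's a minimal Python sketch of that naive counting program. The tiny corpus and the 10-word limit are placeholders of mine, not anything a real system uses; the idea is just counting word pairs and sampling from the resulting "hats".

```python
import random
from collections import defaultdict, Counter

# Stand-in corpus; the real version would be the giant pile of text described above.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Build one "hat" per word: a count of every word that followed it.
hats = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    hats[current_word][next_word] += 1

def pull_from_hat(word):
    """Draw a next word, with odds proportional to the counts (the rigged hat)."""
    words, counts = zip(*hats[word].items())
    return random.choices(words, weights=counts, k=1)[0]

# Play the game: start at "the" and keep pulling until we have 10 words.
word, output = "the", ["the"]
for _ in range(9):
    if word not in hats:          # dead end: nothing ever followed this word
        break
    word = pull_from_hat(word)
    output.append(word)

print(" ".join(output))           # e.g. "the dog sat on the mat and the cat sat"
```

With only word-pair counts to go on, each draw is locally plausible, but nothing keeps the sequence as a whole coherent.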

Attention Is All You Need

Then, in 2017, Transformers came and changed the game. Rather than just completing a sequence one word at a time, they introduced two huge differences.

Attention

The first key difference is called attention.

Or, more simply, the ability to look at all the input words simultaneously and understand how compatible they are with each other. It's how the model distinguishes the bat in "I swing my bat" from the bat in "The bat sleeps in a cave".

There are multiple of these attention mechanisms, each focused on a different kind of relationship (syntax, semantics, etc.). As context windows (i.e., how much input a model can take) grow, attention captures not just the words immediately surrounding a term but also references made between words far apart.
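
If you're curious what that compatibility scoring looks like mechanically, here's a toy Python sketch of scaled dot-product attention. It uses random vectors instead of learned embeddings and skips the separate query/key/value projections a real Transformer learns, so treat it as an illustration of the scoring step, not of an actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["the", "bat", "sleeps", "in", "a", "cave"]
d = 8                                   # embedding size, arbitrary for the sketch
X = rng.normal(size=(len(words), d))    # one stand-in vector per word

# Score how compatible every word is with every other word, all at once.
scores = X @ X.T / np.sqrt(d)

# Softmax each row so the scores become weights that sum to 1.
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

# Each word's new representation is a weighted mix of all the words around it.
context = weights @ X

print(np.round(weights[words.index("bat")], 2))
# how strongly "bat" attends to: the, bat, sleeps, in, a, cave
```

With trained embeddings, "bat" would attend heavily to "sleeps" and "cave", which is exactly what lets the model tell the animal from the baseball bat.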

Process input in parallel

The second big breakthrough is that Transformers process input in parallel.

So, rather than going forward one word at a time, the model looks at all words at the same time to work out the relationships between them. This has enormous benefits for processing speed and for how much input can be handled.

A side-effect of all this, particularly if you feed it enough input text, is that these statistical relationships between words start to represent real concepts. For instance, the statistical relationship between Paris and France is highly similar to that between Madrid and Spain, and both are closely linked to the term capital.
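
As a toy illustration of that point, here's a sketch with hand-made vectors I invented for the example; real embeddings are learned from text, but comparing the two relationships works the same way.

```python
import numpy as np

# Made-up 3-d vectors; the dimensions are labels I chose for readability.
vectors = {
    "Paris":  np.array([1.0, 1.0, 0.0]),   # [capital-ness, France-ness, Spain-ness]
    "France": np.array([0.0, 1.0, 0.0]),
    "Madrid": np.array([1.0, 0.0, 1.0]),
    "Spain":  np.array([0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "Paris is to France" compared with "Madrid is to Spain":
relation_fr = vectors["Paris"] - vectors["France"]
relation_es = vectors["Madrid"] - vectors["Spain"]

print(cosine(relation_fr, relation_es))   # 1.0 -- the two relationships line up
```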

All this is basic training for a language model.

It's really capturing the statistical relationship between relevant words based on loads and loads of examples.

This type of training is slow and expensive.

But you end up with a comprehensive dataset that recognizes relationships between concepts, from the obvious to obscure ones that no human would easily make. And that is a powerful engine.


Reinforcement Learning (Fine-Tuning)

The next step is turning that into a helpful assistant. This is the phase where humans get involved to reinforce the learning and separate good answers from bad. Labeling answers good or bad "teaches" the model the sort of conversations a helpful assistant would (or rather "should", in the opinion of the trainers) have.

If you boil it down to the mechanical side, all this achieves is altering the probabilities from the first training stage to make the output look more like a helpful assistant's response. This has an impact on the kind of text it generates, but it can also actively change how the model relates certain concepts, which is ultimately where the AI alignment and training bias problems come in.

For instance, have you noticed that ChatGPT often ends an answer by asking if it can do more? Or prompts you for a next step even when there really isn't anything else? Or that when it starts an answer, it echoes key terms from your initial question? That's the trained behavior from this stage.
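
To make the "altering the probabilities" point concrete, here's a toy sketch. The continuations, the thumbs-up/down scores, and the exponential reweighting rule are all stand-ins of mine, not the actual RLHF procedure; the point is only that probability mass shifts toward what the trainers preferred.

```python
import math

# Pretend base-model odds for four possible continuations of a prompt.
base_probs = {
    "Sure, here's how you can do that:": 0.20,
    "idk lol":                           0.35,
    "As an assistant, I'd suggest:":     0.15,
    "Buy my crypto course!!!":           0.30,
}

# Made-up human feedback: +1 for helpful-assistant answers, -1 otherwise.
human_score = {
    "Sure, here's how you can do that:": +1.0,
    "idk lol":                           -1.0,
    "As an assistant, I'd suggest:":     +1.0,
    "Buy my crypto course!!!":           -1.0,
}

# Shift probability mass toward the preferred continuations, then renormalize.
reweighted = {c: p * math.exp(human_score[c]) for c, p in base_probs.items()}
total = sum(reweighted.values())
tuned_probs = {c: p / total for c, p in reweighted.items()}

for continuation, p in sorted(tuned_probs.items(), key=lambda kv: -kv[1]):
    print(f"{p:.2f}  {continuation}")
# The helpful answers now dominate: nothing new was learned, the odds just moved.
```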

Compared to basic training, this type of training is relatively cheap and fast.

That makes sense since it's not really learning anything new but more "rewiring" itself. While the first step creates what's called a base model, the result of this stage is referred to as an "instructed" model.

In the end, what you're left with is something that will grab the next N words based on some amount of input, according to a fixed statistical representation, and with a complex mechanism to try and stay on topic.

In practice, that leads to more relevant words showing up more regularly.

However, statistics can only capture so much of this and are limited by what a model has been trained on. This means that inevitably, there are pathways in that representation that will make it go off the rails.

You could call that hallucination or genius.

That decision has more to do with whether it was a desired effect or not. But whichever side you're on, it's simply a feature of a statistical system.


Low-to-Zero Value LLM Applications

A "context-aware next word guesser" is, of course, technically very cool and impressive, but is it also valuable?
