Jan. 29, 2025 - Freedomain Radio - Stefan Molyneux
03:59
Why DeepSeek Is So Good at Learning the Language

So, if you've ever had those fridge magnets that have various words on them, you can just throw them all at the fridge and just end up with this word salad.
Well, you wouldn't train the AI in that.
You would train the AI on the most likely sequences of words.
In other words, you are not training the AI on sentence structures that will never or almost never exist.
That would be my guess.
How did they replicate it?
Oh, one, reinforcement learning: take complicated questions that can be easily verified, and update the model when it gets them correct.
So these are math or code questions and so on.
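To make that loop concrete, here is a minimal sketch of reinforcement learning on verifiable problems. The `model.generate` and `model.update` calls and the reward scheme are illustrative assumptions, not DeepSeek's actual code.

```python
# Illustrative sketch: RL on questions whose answers can be checked
# automatically (math or code), so no human grader is needed.

def verify_answer(problem, answer):
    """Return 1.0 if the answer matches the known solution, else 0.0."""
    return 1.0 if answer.strip() == problem["solution"].strip() else 0.0

def rl_step(model, problems):
    rewards = []
    for problem in problems:
        # The model attempts the question.
        answer = model.generate(problem["question"])
        # Because the question is verifiable, the reward is a simple
        # automatic check rather than a human judgment.
        rewards.append(verify_answer(problem, answer))
    # Update the model so that high-reward answers become more likely
    # (policy-gradient style update, details omitted).
    model.update(problems, rewards)
```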
Okay, key value cache compression.
Let's get into this as a whole.
Why is the DeepSeek model so good?
So here's the answer.
They made three cool innovations.
So a key-value cache is like the model's working memory.
So DeepSeek was able to find a way to compress this without losing quality of the model's output.
DeepSeek requires 93.3% less memory to store this information while working, which makes it much faster and more efficient at generating text.
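Here is a rough sketch of the general compression idea: instead of caching full keys and values for every token, cache a much smaller latent vector and reconstruct the keys and values from it when attention needs them. The class name and dimensions are illustrative assumptions, not DeepSeek's exact architecture.

```python
import torch
import torch.nn as nn

class CompressedKVCache(nn.Module):
    def __init__(self, hidden_dim=4096, latent_dim=512, kv_dim=4096):
        super().__init__()
        # Down-project each token's hidden state to a small latent vector;
        # only this latent is stored in the cache.
        self.down = nn.Linear(hidden_dim, latent_dim, bias=False)
        # Up-project the cached latent back to full keys and values on demand.
        self.up_k = nn.Linear(latent_dim, kv_dim, bias=False)
        self.up_v = nn.Linear(latent_dim, kv_dim, bias=False)
        self.cache = []  # stores latent_dim floats per token instead of 2 * kv_dim

    def append(self, hidden_state):
        # Cache the compressed latent, not the full keys/values.
        self.cache.append(self.down(hidden_state))

    def keys_values(self):
        latents = torch.stack(self.cache)          # (tokens, latent_dim)
        # Reconstruct keys and values only when attention actually needs them.
        return self.up_k(latents), self.up_v(latents)
```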
A mixture-of-experts FFN architecture.
So the model's processing is split into different expert components.
So the way DeepSeek does it is for each piece of text or token, the model always uses the shared experts and then picks the top few most relevant specialist experts from a larger pool.
But the clever newish part is that they make sure all specialists get used, preventing some from being ignored, distributing work evenly across computers, and they have some new ways to keep network communication efficient between computers.
This approach lets them build a much larger model, 236 billion parameters, while only using a small portion, 21 billion, for each task, making it more powerful and efficient.
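A toy sketch of that routing pattern is below: every token passes through the shared experts, and a router picks the top-k specialists from a larger pool. Sizes are toy values, the per-token Python loop is for clarity rather than speed, and the load-balancing and communication tricks mentioned above are omitted.

```python
import torch
import torch.nn as nn

class SimpleMoELayer(nn.Module):
    def __init__(self, dim=1024, n_shared=2, n_experts=64, top_k=6):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))   # always used
        self.experts = nn.ModuleList(expert() for _ in range(n_experts)) # larger pool
        self.router = nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        # Shared experts run for every token.
        shared_out = sum(e(x) for e in self.shared)
        # Router scores each specialist; only the top-k are used per token,
        # so most parameters stay idle for any given token.
        scores = torch.softmax(self.router(x), dim=-1)      # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)      # (tokens, top_k)
        routed = []
        for t in range(x.size(0)):
            token_out = sum(w * self.experts[int(e)](x[t])
                            for w, e in zip(weights[t], idx[t]))
            routed.append(token_out)
        return shared_out + torch.stack(routed)
```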
Multi-token prediction head.
That's the worst name for a porno ever.
So, when large language models like GPT-4 or DeepSeek, or whatever, generate text, they typically work by predicting one word or token at a time.
Think of it like playing a word game where you have to guess the next word in a sentence.
The model makes its best guess for what should come next based on what came before.
Multi-token prediction takes this a step further.
Instead of just predicting the next word, the model tries to predict several words ahead at once.
For example, now, I don't know if you've ever played this game.
If you've had kids, you probably have.
If you were a kid, and you had fun people around, you probably did.
So, the game is...
I played this a huge amount with my daughter and her friends.
You get, like, five kids around a table, and maybe an adult.
And somebody starts off a story, right?
And everyone gets one word to add, like once upon a time there was, and you know, it always ends up with poop jokes before the age of eight or over the age of 50. So, that's kind of what AI is trying to do.
It's trying to guess, to make up a next word that makes sense in the context of the story.
So, if you think of the text, the cat sat on the, right?
So, of course, the AI would say "mat."
And then it uses "mat" to predict the next word and so on.
But the MTP approach, the multi-token prediction, doesn't just predict "mat," but also predicts "in the sun."
So future words.
So it's much more efficient that way.
And faster, of course, right?
So that's a plus.
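Here is a minimal sketch of the multi-token prediction idea: on top of the usual next-token head, extra heads predict tokens further ahead from the same hidden state. This illustrates the general technique, not DeepSeek's exact design, and the names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultiTokenPredictor(nn.Module):
    def __init__(self, hidden_dim=1024, vocab_size=32000, n_future=3):
        super().__init__()
        # Head 0 predicts the next token ("mat"); heads 1..n_future-1
        # predict the tokens after that ("in", "the", ...).
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(n_future)
        )

    def forward(self, hidden_state):  # hidden_state: (batch, hidden_dim)
        # One set of vocabulary logits per future position.
        return [head(hidden_state) for head in self.heads]

# During training, each head gets its own loss against the token at its
# offset; at inference, drafting several tokens at once and then checking
# them is where the speedup comes from.
```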
Yeah, so the US banning chip exports forced China to innovate and become more efficient and so on.
This is what somebody wrote.
AI models are powered by advanced chips, and since 2021, the US government has restricted the sale of these to China in order to stunt progress.
To get around the supply problem, Chinese developers have been collaborating and experimenting with new approaches.