Vladimir Vapnik: Statistical Learning | Lex Fridman Podcast #5
The following is a conversation with Vladimir Vapnik.
He's the co-inventor of support vector machines, support vector clustering, VC theory, and many foundational ideas in statistical learning.
He was born in the Soviet Union and worked at the Institute of Control Sciences in Moscow.
Then, in the United States, he worked at AT&T, NEC Labs, Facebook Research, and now is a professor at Columbia University.
His work has been cited over 170,000 times.
He has some very interesting ideas about artificial intelligence and the nature of learning, especially on the limits of our current approaches and the open problems in the field.
This conversation is part of MIT's course on Artificial General Intelligence and the Artificial Intelligence Podcast.
If you enjoy it, please subscribe on YouTube or rate it on iTunes or your podcast provider of choice.
Or simply connect with me on Twitter or other social networks at Lex Fridman, spelled F-R-I-D. And now, here's my conversation with Vladimir Vapnik.
Einstein famously said that God doesn't play dice.
Yeah. You have studied the world through the eyes of statistics, so let me ask you in terms of the nature of reality, fundamental nature of reality, does God play dice?
We don't know some factors.
And because we don't know some factors which could be important, it looks like God plays dice. But we should describe it.
In philosophy, they distinguish between two positions.
The position of instrumentalism, where you create a theory for prediction, and the position of realism, where you try to understand what God did.
Can you describe instrumentalism and realism a little bit?
For example, if you have some mechanical laws, what is that?
Is it law which is true, always and everywhere?
Or is it law which allows you to predict the position of the moving element?
What do you believe?
Do you believe that it is God's law, that God created a world which obeys this physical law, or that it is just a law for prediction?
And which one is instrumentalism?
For predictions. If you believe that this is the law of God, and it's always true everywhere, that means that you're a realist.
You're trying to really understand God's thought.
So the way you see the world as an instrumentalist?
You know, I'm working with some models, models of machine learning.
So in these models, we consider a setting, and we try to resolve the setting, to solve the problem.
And you can do it in two different ways.
From the point of view of instrumentalists, and that's what everybody does now, they say the goal of machine learning is to find the rule for classification.
That is true, but it is an instrument for prediction.
But I can say the goal of machine learning is to learn about conditional probability.
That is, how God plays. What is the probability of one outcome, and what is the probability of another, in a given situation?
But for prediction I don't need this.
I need the rule.
But for understanding I need conditional probability.
So let me just step back a little bit first to talk about, you mentioned, which I read last night, parts of the 1960 paper by Eugene Wigner, "The Unreasonable Effectiveness of Mathematics in the Natural Sciences."
Such a beautiful paper, by the way.
Yeah, absolutely. It made me feel, to be honest, to confess my own work in the past few years on deep learning, heavily applied, made me feel that I was missing out on some of the beauty of nature in the way that math can uncover.
So let me just step away from the poetry of that for a second.
How do you see the role of math in your life?
Is it a tool? Is it poetry?
Where does it sit?
And does math for you have limits of what it can describe?
Some people say that math is a language which God uses.
So I believe...
A language to speak to God, or a language God uses?
A language God uses. Uses.
Yeah. So, I believe that what this article about the unreasonable effectiveness of math says is that if you're looking at mathematical structures, they know something about reality.
Scientists from natural science are looking at equations and trying to understand reality.
So the same in machine learning.
If you very carefully look at all the equations which define conditional probability, you can understand more about reality than from your fantasy.
So math can reveal the simple underlying principles of reality perhaps.
You know what "simple" means?
It is very hard to discover them.
But then when you discover them and look at them, you see how beautiful they are.
And it is surprising why people did not see that before. You're looking at equations and deriving it from equations.
For example, I talked yesterday about the least squares method.
And people had a lot of fantasies about how to improve the least squares method.
But if you go step by step, solving some equations, you suddenly get some term which, after thinking, you understand describes the position of the observation points.
In the least squares method, we throw out a lot of information.
We don't look at the configuration of the points, of the observations.
We look only at the residuals.
But when you understood that, it's a very simple idea, which is nevertheless not so simple to see.
You can derive this just from equations.
So some simple algebra, a few steps, will take you to something surprising, that when you think about it...
And that is proof that human intuition is not too rich and very primitive and it does not see very simple situations.
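Vapnik's least squares remark can be made concrete with a short sketch. This is an illustrative toy (the data, noise level, and setup are mine, not from the talk he mentions): ordinary least squares chooses coefficients using only the residuals, not the configuration of the observation points.

```python
import numpy as np

# Toy data: y = 2x + 1 plus noise (illustrative numbers only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(20)

# Ordinary least squares picks coefficients by minimizing the sum of
# squared residuals: it looks only at the residuals y - Xw, not at the
# positions of the observation points themselves.
X = np.column_stack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ w

print(w)                     # fitted slope and intercept, near (2, 1)
print(np.sum(residuals**2))  # the quantity least squares minimizes
```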
So let me take a step back.
In general, yes, right?
But what about human ingenuity, the moments of brilliance?
Do you have to be so hard on human intuition?
Are there moments of brilliance in human intuition that can leap ahead of math and then the math will catch up?
I don't think so.
I think that the best of human intuition is put into axioms.
And then it is technical.
See where the axioms take you.
Yeah. But only if the axioms are taken correctly.
But the axioms are polished over generations of scientists.
And this is integral wisdom.
So, that's beautifully put.
But if you maybe look at, when you think of Einstein and special relativity, what is the role of imagination coming first there in the moment of discovery of an idea?
So there's obviously a mix of math and out-of-the-box imagination there.
That I don't know.
Whatever I did, I exclude any imagination.
Because whatever I saw in machine learning that comes from imagination, like features, like deep learning, they are not relevant to the problem.
When you look very carefully at mathematical equations, you derive a very simple theory, which goes theoretically far beyond whatever people can imagine.
Because it is not good fantasy.
It is just interpretation, it is just fantasy, but it is not what you need.
You don't need any imagination to derive the main principle of machine learning.
When you think about learning and intelligence, maybe thinking about the human brain and trying to describe mathematically the process of learning, that is something like what happens in the human brain.
Do you think we have the tools currently?
Do you think we will ever have the tools to try to describe that process of learning?
It is not a description of what's going on.
It is an interpretation.
It is your interpretation.
Your vision can be wrong.
You know, when a guy invented the microscope for the first time, only he had this instrument, and he kept the microscope secret.
But he wrote a report to the London Academy of Science.
In his report, he described what he saw when he looked at blood; he looked everywhere, at water, at blood, at spirits.
But he described blood like a fight between a queen and a king.
He saw blood cells, red cells, and he imagined that it was armies fighting each other.
And it was his interpretation of the situation.
And he sent this report in the Academy of Science.
They looked very carefully, because they believed that he was right, he saw something, but he gave a wrong interpretation.
And I believe the same can happen with the brain.
With brain, yeah. Because the most important part...
You know, I believe in human language.
In some proverbs there is so much wisdom.
For example, people say that one day with a great teacher is better than a thousand days of diligent study.
But I will ask you what a teacher does.
Nobody knows. And that is intelligence.
But we know from history, and now from math and machine learning, that a teacher can do a lot.
So what, from a mathematical point of view, is the great teacher?
I don't know. That's an open question.
No, but we can say what teacher can do.
He can introduce some invariants, some predicates for creating invariants.
How he does it, I don't know, because the teacher knows reality and can describe, from this reality, predicates, invariants. But he knows that when you use invariants, you can decrease the number of observations a hundred times.
Maybe try to pull that apart a little bit.
I think you mentioned a piano teacher saying to the student, play like a butterfly.
I played piano, I played guitar for a long time.
Yeah, maybe it's romantic, poetic, but it feels like there's a lot of truth in that statement.
There is a lot of instruction in that statement.
And so, can you pull that apart?
What is that?
The language itself may not contain this information.
It's not blah, blah, blah. It is not blah, blah, yeah.
It affects you. It's what?
It affects you. It affects your playing.
Yes, it does, but it's not the language.
It feels like a...
What is the information being exchanged there?
What is the nature of information?
What is the representation of that information?
I believe that it is a sort of predicate, but I don't know.
That is exactly what intelligence and machine learning should be, because the rest is just mathematical technique.
I think that what was discovered recently is that there are two mechanisms of learning.
One is called the strong convergence mechanism, and the other the weak convergence mechanism.
Before, people used only one convergence.
In the weak convergence mechanism, you can use a predicate. That is what "play like a butterfly" is, and it will immediately affect your playing.
You know, there is a great English proverb.
If it looks like a duck, swims like a duck, and quacks like a duck, then it is probably a duck.
But this is exactly about the predicates.
"It looks like a duck," what does that mean?
You have so many ducks, and that is your training data.
So you have a description of how, integrally, ducks look.
Yeah, the visual characteristics of a duck, yeah.
Yeah, and you have a model for recognition now.
So you would like the theoretical description from the model to coincide with the empirical description which you see in the training data.
So about looks like a duck, it is general.
But what about swims like a duck?
You should know that a duck swims.
You can't say it plays chess like a duck.
A duck doesn't play chess, and it is a completely legal predicate, but it is useless.
So the teacher has to recognize which predicates are not useless.
So, up to now, we don't use this predicate in existing machine learning.
But in this English proverb, they use only three predicates.
Looks like a duck, swims like a duck, and quacks like a duck.
So you can't deny the fact that swims like a duck and quacks like a duck has humor in it, has ambiguity.
Let's talk about swim like a duck.
It does not say jump like a duck.
Why? It's not relevant.
But that means that you know ducks, you know different birds, you know animals, and you derive from this that it is relevant to say swim like a duck.
So underneath, in order for us to understand swim is like a duck, it feels like we need to know millions of other little pieces of information which we pick up along the way.
You don't think so? There doesn't need to be this knowledge base?
That those statements carry some rich information that helps us understand the essence of a duck?
How far are we from integrating predicates?
You know, when you consider the complete theory of machine learning, what it does is this: you have a lot of functions, and then you are told, "it looks like a duck."
You see your training data.
From the training data, you recognize how a duck is expected to look.
Then you remove all functions which do not look the way you think it should look from the training data.
So you decrease the set of functions from which you will pick one.
Then you are given a second predicate, and again you decrease the set of functions.
And after that, you pick up the best function you can find.
It is standard machine learning.
So why do you not need too many examples?
Because the predicates are very good?
That means the predicates are very good.
Because every predicate is invented to decrease admissible set of functions.
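This filtering can be sketched in a few lines. Everything here is a toy of my own making, not Vapnik's actual formulation: one-dimensional data, threshold classifiers as the function class, and a made-up "average position" predicate. The point is that a single predicate constraint shrinks the admissible set before any function is picked.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy training data: 1-D points in [0, 1], label 1 if x > 0.5.
x = rng.uniform(0.0, 1.0, 30)
y = (x > 0.5).astype(int)

# The candidate set: all threshold classifiers h_t(x) = [x > t].
thresholds = np.linspace(0.0, 1.0, 101)

# A predicate is a function of x.  The constraint: the average of the
# predicate over the points a hypothesis labels positive should match
# its average over the true positives ("looks like a duck").
def admissible(thresholds, predicate, tol=0.05):
    target = predicate(x[y == 1]).mean()
    keep = []
    for t in thresholds:
        pos = x[x > t]
        if len(pos) > 0 and abs(predicate(pos).mean() - target) < tol:
            keep.append(t)
    return np.array(keep)

# A single predicate already shrinks the candidate set substantially.
survivors = admissible(thresholds, predicate=lambda v: v)  # "average position"
print(len(thresholds), "->", len(survivors))
```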
So you talk about admissible set of functions and you talk about good functions.
So what makes a good function?
So an admissible set of functions is a set of functions which has small capacity, or small diversity, small VC dimension for example, and which contains good functions inside.
So by the way, for people who don't know, VC, you're the V in the VC. So how would you describe to a layperson what VC theory is?
How would you describe VC theory? So, you have a machine, and a machine is capable of picking up one function from the admissible set of functions.
But the set of admissible functions can be big.
It could contain all continuous functions, and that's useless; you would never have enough examples to pick up a function.
But it can be small.
We call the measure of its size capacity, but maybe it's better to call it diversity.
So, not very different functions in the set. It can be an infinite set of functions, but not very diverse.
Then it has small VC dimension.
When the VC dimension is small, you need a small amount of training data.
So the goal is to create admissible set of functions which have small VC dimensions and contain good functions.
Then you will be able to pick up the function using small amount of observations.
So that is the task of learning, is creating a set of admissible functions that has a small VC dimension.
And then you've figured out a clever way of picking up...
No, that is the goal of learning, which I formulated yesterday.
Statistical learning theory does not involve itself in creating the admissible set of functions.
In classical learning theory, everywhere, 100% of the textbook, the admissible set of functions is given.
But this is science about nothing, because the most difficult problem is to create the admissible set of functions.
Given, say, a continuum of functions, a continuous set of functions, create an admissible set of functions: that means one that has finite VC dimension, small VC dimension, and contains good functions.
So this was out of consideration.
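For readers unfamiliar with VC dimension, here is a brute-force sketch (a toy illustration of the definition, not Vapnik's construction): a class shatters a set of points if it can realize every possible labeling of them, and the VC dimension is the size of the largest shatterable set.

```python
# Brute-force check of shattering: a hypothesis class shatters a set of
# points if it can realize every possible labeling of them.  The class
# here is 1-D thresholds h_t(x) = [x > t], whose VC dimension is 1.
def shatters(points, hypotheses):
    labelings = {tuple(h(p) for p in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

thresholds = [lambda x, t=t: int(x > t) for t in (-0.5, 0.5, 1.5, 2.5)]

print(shatters([1.0], thresholds))       # True: one point is shattered
print(shatters([1.0, 2.0], thresholds))  # False: labeling (1, 0) is impossible
```

Because thresholds can never label a left point positive and a right point negative, no two-point set is shattered, so the VC dimension of this class is 1.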
So what's the process of doing that?
I mean, it's fascinating. What is the process of creating this admissible set of functions?
That is invariants. That's invariants.
Can you describe invariance?
Yeah. You look at properties of the training data. Properties means that you have some function, and you just compute the average value of that function on the training data.
You have a model, and there is the expectation of this function under the model.
And they should coincide.
So the problem is about how to pick up functions.
It can be any function.
In fact, it is true for all functions.
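A hedged toy example of such an invariant (the Gaussian model and the choice f(x) = x**2 are my own illustration, not from the conversation): fitting a model by matching moments makes the empirical average of f coincide with its expectation under the model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Draw training data from some source (here, secretly Gaussian).
sample = rng.normal(loc=1.0, scale=2.0, size=10_000)

# Fit a Gaussian model by matching the first and second moments.
mu, sigma = sample.mean(), sample.std()

# The invariant for f(x) = x**2: its average on the training data should
# coincide with its expectation under the model, E[x^2] = mu^2 + sigma^2.
empirical = np.mean(sample**2)
model = mu**2 + sigma**2

print(abs(empirical - model))  # essentially zero: the invariant holds
```

The interesting part, as Vapnik says, is not checking such an equality but knowing which functions f are worth matching.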
But when I say a duck does not jump, you don't ask the question "jumps like a duck."
Because it is trivial: the duck does not jump, so it doesn't help you to recognize the duck.
But you know something.
Which question to ask?
When you're asking, it swims like a duck.
But "looks like a duck" works in a general situation.
"Looks like," say, a guy who has this illness, this disease: it is a legal predicate.
So there is a general type of predicate, "looks like," and special types of predicates which are related to the specific problem.
And that is the intelligence part of all this business.
And that is where teachers are involved.
Incorporating the specialized predicates.
Okay. What do you think about deep learning as neural networks, these arbitrary architectures, as helping accomplish some of the tasks you're thinking about?
Their effectiveness or lack thereof?
What are the weaknesses and what are the possible strengths?
You know, I think that this is fantasy.
Everything which, like deep learning, like features.
Let me give you this example.
One of the greatest books is Churchill's book about the history of the Second World War.
And he starts this book by describing that in old times, when a war was over, the great kings gathered together. Almost all of them were relatives, and they discussed what should be done, how to create peace.
And they came to an agreement.
But when the First World War ended, the general public came to power.
And they were so greedy that they robbed Germany.
And it was clear to everybody that it was not peace.
That peace would last only twenty years, because they were not professionals.
The same I see in machine learning.
There are mathematicians who look at the problem from a very deep mathematical point of view.
And there are computer scientists who mostly do not know mathematics.
They just have an interpretation of it.
And they invented a lot of blah-blah-blah interpretations, like deep learning.
Why do you need deep learning?
Mathematics does not know deep learning.
Mathematics does not know neurons.
It is just function.
If you like piecewise linear functions, say that, and work in the class of piecewise linear functions.
But they invent something.
And then they try to prove the advantage of that through interpretations, which are mostly wrong.
And when that's not enough, they appeal to the brain, which they know nothing about.
Nobody knows what's going on in the brain.
So I think it is more reliable to look at the math.
This is a mathematical problem.
Do your best to solve this problem.
Try to understand that there is not only one mode of convergence, the strong mode of convergence.
There is a weak mode of convergence, which requires predicates.
And if you go through all this stuff, you will see that you don't need deep learning.
Even more, I would say that one of the theorems, which is called the representer theorem, says that the optimal solution of the mathematical problem which describes learning is on a shallow network, not on a deep one.
A shallow network, yeah.
The ultimate problem is there.
Absolutely. So, in the end, what you're saying is exactly right.
The question is, you see no value in throwing something on the table and playing with it, not math?
Like a neural network, where you throw something in the bucket, or the biological example of looking at kings and queens, or the cells, with the microscope.
You don't see value in imagining the cells are kings and queens and using that as inspiration and imagination for where the math will eventually lead you?
You think that interpretation basically deceives you in a way that's not productive?
I think that if you're trying to analyze this business of learning, and especially discussion about deep learning, it is discussion about interpretation, not about things, about what you can say about things.
That's right, but aren't you surprised by the beauty of it?
Not mathematical beauty, but the fact that it works at all.
Or are you criticizing that very beauty, our human desire to interpret, to find our silly interpretations in these constructs?
Like, let me ask you this.
Are you... Surprised?
Does it inspire you?
How do you feel about the success of a system like AlphaGo beating the game of Go?
Using neural networks to estimate the quality of a board?
That is your interpretation, quality of the board.
Yeah, yes.
So it's not our interpretation.
The fact is, a neural network system, it doesn't matter, a learning system, that we don't, I think, mathematically understand that well, beats the best human player, does something that was thought impossible.
That means that it's not a very difficult problem.
So we've empirically have discovered that this is not a very difficult problem.
It's true. So, maybe...
I can't argue.
So... even more, I would say that if they use deep learning, it is not the most effective way of learning.
And usually, when people use deep learning, they're using zillions of training data.
Yeah, but you don't need this.
So I described a challenge: can we solve some problem, which deep learning methods with deep nets do well, using 100 times less training data?
Even more, some problems deep learning cannot solve.
Because deep networks do not necessarily create a good admissible set of functions.
To create a deep architecture means to create an admissible set of functions.
But you cannot say that you're creating a good admissible set of functions.
It's just your fantasy.
It does not come from math.
But it is possible to create a good admissible set of functions, because you have your training data.
Actually, for mathematicians: when you consider invariants, you need to use the law of large numbers.
When you do training in existing algorithms, you need the uniform law of large numbers.
Which is much more difficult.
But nevertheless, if you use both the weak and the strong modes of convergence, you can decrease the amount of training data a lot.
Yeah, you could do the three, the swims like a duck and quacks like a duck.
So let's step back and...
Think about human intelligence in general.
Clearly that has evolved in a non-mathematical way.
It wasn't, as far as we know, that God or whoever came up with a model of admissible functions and placed it in our brain.
It kind of evolved. I don't know, maybe you have a view on this. So Alan Turing, in the 1950s, in his paper, asked and then rejected the question, can machines think?
It's not a very useful question, but can you briefly entertain this useless question?
Can machines think?
So talk about intelligence and your view of it.
I don't know that. I know that Turing described imitation.
If a computer can imitate a human being, let's call it intelligent.
And he understood that it is not a thinking computer.
He completely understood what he was doing.
But he set up a problem of imitation.
So now we understand that the problem is not in imitation.
I'm not sure that intelligence is just inside of us.
It may be also outside of us.
I have several observations.
So, when I prove some theorem, a very difficult theorem, within a couple of years, in several places, people prove the same theorem.
The Sauer lemma, for example, was done after us; another guy proved the same theorem.
In the history of science, this has happened all the time.
For example, geometry: it happened simultaneously.
First Lobachevsky did it, and then Gauss, and Bolyai, and another guy.
And it was approximately within a ten-year period of time.
And I saw a lot of examples like that.
And many mathematicians think that when they develop something, they develop something in general which affects everybody.
So, maybe our model that intelligence is only inside of us is incorrect.
It's our interpretation.
Maybe there exists some connection with world intelligence.
I don't know. You're almost like plugging in into...
Yeah, exactly. And contributing to this...
Into a big network. Maybe in your own network.
No, no, no. On the flip side of that, maybe you can comment on big O complexity and how you see classifying algorithms by worst case running time in relation to their input.
So that way of thinking about functions.
Do you think P equals NP? Do you think that's an interesting question?
Yeah, it is an interesting question.
But let me talk about the worst case; there, there is a mathematical setting.
When I came to the United States in 1995, people did not know statistical learning.
In Russia it was published in our monographs, but in America they did not know.
Then they learned.
And somebody told me that it is a worst-case theory and they would create a real-case theory, but till now they have not.
Because it is a mathematical tool.
You can do only what you can do using mathematics, which has a clear understanding and clear description.
And for this reason, we introduced complexity.
And we need this because, using it, actually it is diversity, you can prove some theorems.
But we also created a theory for the case when you know the probability measure.
And that is the best case which can happen; it is entropy theory.
So, from a mathematical point of view, you know the best possible case and the worst possible case.
You can derive different models in the middle.
But it's not so interesting.
You think the edges are interesting?
The edges are interesting.
Because it is not so easy to get a good bound, an exact bound.
There are not many cases where you have an exact bound, but these are interesting principles which the math discovers.
Do you think it's interesting because it's challenging and reveals interesting principles that allow you to get those bounds?
Or do you think it's interesting because it's actually very useful for understanding the essence of a function, of an algorithm?
So it's like me judging your life as a human being by the worst thing you did and the best thing you did, versus all the stuff in the middle.
It seems not productive.
I don't think so, because you cannot describe the situation in the middle, or it will not be general.
You can describe each individual case, and it is clear it has some model, but you cannot describe a model for every new case.
So you will never be accurate.
But from a statistical point of view, the way you've studied functions and the nature of learning and the world, don't you think that the real world has a very long tail?
That the edge cases are very far away from the mean?
The stuff in the middle?
Or no? I don't know that.
I think that, from my point of view, if you use formal statistics, you need the uniform law of large numbers.
If you use this invariants business, you need just the law of large numbers.
And there is a huge difference between the uniform law of large numbers and the law of large numbers.
Is it useful to describe that a little more?
Or should we just take it to...
For example, when I'm talking about the duck, I gave three predicates, and that was enough.
But if you tried to do it with formal distinguishing, you would need a lot of observations.
So that means that the information about what "looks like a duck" contains a lot of bits of information, formal bits of information.
So we don't know how many bits of information these things from artificial intelligence contain, and that is the subject of analysis.
Till now, in all this business, I don't like how people consider artificial intelligence.
They consider it as some code which imitates the activity of human beings.
It is not science.
It is applications. You would like to imitate? Go ahead.
It is very useful, and a good problem.
You need to learn something more.
How people can develop, say, the predicate "swims like a duck," or "play like a butterfly," or something like that.
Not what the teacher tells you, but how it came into his mind.
How he chose this image.
So that process...
That is the problem of intelligence.
That is the problem of intelligence.
And you see that connected to the problem of learning?
Absolutely. Because you immediately give this predicate, a specific predicate: swims like a duck, or quacks like a duck.
It was chosen somehow.
So what is the line of work, would you say, if you were to formulate it as a set of open problems, that will take us there, to "play like a butterfly"?
That will get a system to be able to...
Let's separate two stories.
One is the mathematical story: if you have predicates, you can do something.
And the other story is how to get predicates.
It is an intelligence problem, and people have not even started to understand intelligence.
Because to understand intelligence, first of all, try to understand what teachers do.
How teachers teach.
Why is one teacher better than another one?
Yeah, so you think we really even haven't started on the journey of generating the predicates?
No, we don't understand. We even don't understand that this problem exists.
You do.
No, I just have no name for it.
I want to understand why one teacher is better than another.
And how a teacher affects a student.
It is not because he's repeating the problem which is in the textbook.
He makes some remarks.
He makes some philosophy of reasoning.
It is a formulation of a question that is the open problem.
Why is one teacher better than another?
Right. What he does better.
Why at every level?
How do they get better?
What does it mean to be better?
From whatever model I have: one teacher can give a very good predicate.
One teacher can say "swims like a duck," and another can say "jumps like a duck."
And "jumps like a duck" carries zero information.
So what is the most exciting problem in statistical learning you've ever worked on or are working on now?
I just finished this invariants story.
And I'm happy; I believe that it is the ultimate learning story.
At least I can show that there is no other mechanism; there are only two mechanisms.
But it separates the statistical part from the intelligence part.
And I know nothing about the intelligence part.
And if we come to know this intelligence part, it will help us a lot in teaching.
In learning.
Would we know it when we see it?
So, for example, in my talk, the last slide was a challenge.
So you have, say, the MNIST digit recognition problem.
And deep learning claims that it does it very well, say, 99.5% correct answers.
But they use 60,000 observations.
Can you do the same using 100 times less?
But incorporating invariants. What does it mean? You know digits 1, 2, 3.
Just looking at them, explain to me which invariants I should use so that, with say 100 times fewer examples, I can do the same job.
Yeah, that last slide, unfortunately, your talk ended quickly, but that last slide was a powerful, open challenge and a formulation of the essence here.
That is the exact problem of intelligence.
Because... everybody, when machine learning started, and it was developed by mathematicians, immediately recognized that we use much more training data than humans need.
But now again we have come to the same story: we have to decrease it. That is the problem of learning.
It is not like in deep learning, where they use zillions of training data.
Because maybe zillions are not enough if you don't have good invariants.
Maybe you will never collect some number of observations.
But now it is a question for intelligence.
We have to do that, because the statistical part is ready.
As soon as you supply us with a predicate, we can do a good job with a small amount of observations.
And the very first challenge is well-known digit recognition.
And you know digits.
And please, tell me invariants.
I think about that.
I can say, for digit 3, I would introduce the concept of horizontal symmetry.
The digit 3 has more horizontal symmetry than, say, digit 2, or something like that.
But as soon as I get the idea of horizontal symmetry, I can mathematically invent a lot of measures of horizontal symmetry, or vertical symmetry, or diagonal symmetry, whatever, if I have the idea of symmetry.
But what else?
Looking at digits, I see that there are meta-predicates which are not about shape.
Something like symmetry, like how dark the whole picture is, something like that, which can give rise to a predicate.
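One way to make the symmetry idea concrete. This is a hypothetical measure of my own, not the one Vapnik has in mind, and the tiny 5x4 "digits" below are invented for illustration: compare an image with its mirror across the horizontal axis.

```python
import numpy as np

# A hypothetical measure of horizontal symmetry for a small binary
# image: the fraction of pixels that agree with the image flipped
# across its horizontal axis (top-bottom mirror).
def horizontal_symmetry(img):
    flipped = img[::-1, :]            # mirror across the horizontal axis
    return (img == flipped).mean()    # fraction of agreeing pixels

# Invented 5x4 stand-ins for "3" and "2" (not real MNIST data).
three = np.array([
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
], dtype=int)

two = np.array([
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
], dtype=int)

# The "3" is more symmetric about its horizontal axis than the "2".
print(horizontal_symmetry(three) > horizontal_symmetry(two))  # True
```

Many such measures could be invented once the idea of symmetry is given; choosing which one matters is exactly the teacher's part.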
You think such a predicate could arise out of something that's not general? Meaning, it feels like for me to be able to understand the difference between a 2 and a 3, I would need to have had a childhood of 10 to 15 years, playing with kids, going to school, being yelled at by parents.
All of that. Walking, jumping, looking at ducks.
And only then would I be able to generate the right predicate for telling the difference between a 2 and a 3.
Or do you think there's a more efficient way?
I don't know. I know for sure that you must know something more than digits.
Yes. And that's a powerful statement.
Yeah. But maybe there are several languages of description of these elements of digits.
So I'm talking about symmetry, about some properties of geometry.
I'm talking about something abstract.
I don't know that. But this is the problem of intelligence.
So, in one of our articles, it is trivial to show that every example can carry not more than one bit of information.
When you show an example and you say "this is a one," you can remove the functions which do not say it is a one.
The best strategy, if you can do it perfectly, is to remove half of the functions.
But when you use one predicate, like "looks like a duck," you can remove much more than half of the functions.
And that means that it contains a lot of information from a formal point of view.
But when you have a general picture of what you want to recognize, a general picture of the world, can you invent this predicate?
And that predicate carries a lot of information.
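This counting argument can be sketched numerically (a toy accounting of my own, not a formal claim from the article he mentions): if a constraint shrinks the set of candidate functions from n to k, it supplies log2(n / k) bits.

```python
import math

# Information supplied by a constraint that shrinks the candidate set
# of functions from n_before to n_after (a toy accounting).
def bits(n_before, n_after):
    return math.log2(n_before / n_after)

# A perfectly informative labeled example halves the set: 1 bit.
print(bits(1024, 512))  # 1.0

# A good predicate can remove far more than half at once.
print(bits(1024, 16))   # 6.0
```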
Beautifully put. Maybe it's just me, but in all the math you show, in your work, which is some of the most profound mathematical work in the field of learning, AI, and math in general, I hear a lot of poetry and philosophy.
You really kind of talk about the philosophy of science.
There's a poetry and music to a lot of the work you're doing, and the way you're thinking about it.
So where does that come from?
Do you escape to poetry?
Do you escape to music, or not?
I think that there exists ground truth.
There exists ground truth?
Yeah, and that can be seen everywhere.
The smart guys, the philosophers; sometimes I'm surprised how deeply they see.
Sometimes I see that some of them are completely out of the subject.
But the ground truth I see in music.
Music is the ground truth?
Yeah. And in poetry, many poets, they believe that they take dictation.
So what piece of music, as a piece of empirical evidence, gave you a sense that they are touching something in the ground truth?
It is structure.
Yeah, because when you're listening to Bach, you see this structure.
Very clear, very classic, very simple.
And the same in math: when you have the axioms of geometry, you have the same feeling.
And in poetry sometimes you see the same.
Yeah. And if you look back at your childhood, you grew up in Russia.
You maybe were born as a researcher in Russia.
You developed as a researcher in Russia.
And then you came to the United States and worked at a few places.
If you look back...
What were some of your happiest moments as a researcher?
Some of the most profound moments?
Not in terms of their impact on society, but in terms of how damn good you felt that day, such that you remember that moment.
You know, every time you find something, it is great.
Every simple thing.
But my general feeling is that most of the time I was wrong.
You should go again and again and again, and try to be honest in front of yourself, not to make interpretations.
To try to understand whether it is related to the ground truth.
Not my blah-blah-blah interpretation, or something like that.
But you're allowed to get excited at the possibility of discovery.
Oh, yeah. You have to double-check it.
But how is it related to the ground truth?
Is it just temporary, or is it forever?
You know, you always have a feeling when you find something.
How big is it? Twenty years ago, when we discovered statistical learning theory, nobody believed in it, except for one guy, Dudley from MIT. And then in twenty years it became fashion.
And the same with support vector machines.
That is, kernel machines.
So with support vector machines and learning theory, when you were working on it, you had a sense?
You had a sense of the profundity of it?
That this seems to be right?
This seems to be powerful?
Right. Absolutely.
Immediately. I recognized that it would last forever.
And now, when I found this invariants story, I have a feeling that it is complete learning, because I have proved that there are no other mechanisms.
You can have some cosmetic improvements, but in terms of invariants, we need both invariants and statistical learning, and they should work together.
But also, I'm happy that we can formulate from this what intelligence is, and separate it from the technical part.
And that is completely different.
Absolutely. Well, Vladimir, thank you so much for talking today.