Yixin Lin

Machine learning and humans

I’ve been thinking about what we can learn from deep learning about general intelligence, and specifically its connections with human intelligence, which has especially been on my mind after watching some of the lectures by Yoshua Bengio (1) and Yann LeCun (1, 2) on the foundations of deep learning. Because fundamentally “deep learning”, for all it’s been hyped by the media, is a pretty simple concept: get more layers. Why does it work so well? Is it just another trendy topic in this field, like SVMs were in the early 2000s? Or is it something that actually leads us to a better understanding of general intelligence?

A popular answer to this question, often touted in popular media, is the universal approximation theorem [1]. It certainly sounds fundamental: with some loose caveats, a neural network can approximate any function, which seems impressive. But this doesn’t actually say anything about why deep learning works: it’s a theorem about the shallowest possible network. Loosely, it says that a single-hidden-layer neural network with a finite number of neurons can approximate arbitrary continuous functions. Sure, but that says nothing about algorithmic learnability: if, in order to approximate f(x), you need a neuron for every real number x, you’re no better than a lookup table. And it says nothing about why depth helps at all.
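For concreteness, here is a hedged paraphrase of that single-hidden-layer statement (roughly the Cybenko/Hornik form; the conditions on the nonlinearity and the constants are simplified here):

```latex
% Universal approximation, loosely stated: for a suitable nonlinearity
% \sigma (e.g. a sigmoid), any continuous f on a compact K \subset \mathbb{R}^n,
% and any tolerance \varepsilon > 0, there exist N and weights \alpha_i, w_i, b_i
% such that a single-hidden-layer network is \varepsilon-close to f.
% Note that nothing bounds N, and nothing says the weights can be found by learning.
\[
  \sup_{x \in K} \,\Bigl|\, f(x) - \sum_{i=1}^{N} \alpha_i \,
    \sigma\!\bigl(w_i^{\top} x + b_i\bigr) \Bigr| \;<\; \varepsilon
\]
```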

So does deep learning actually get us closer to general intelligence?


The famous no free lunch theorem says that “for certain types of mathematical problems, the computational cost of finding a solution, averaged over all problems in the class, is the same for any solution method” (emphasis mine). So in order for any machine learning algorithm to perform better than another, it must exploit some sort of regular structure that occurs in natural problems.
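As a rough sketch of the supervised-learning version (paraphrasing Wolpert; the notation is mine, and the uniform averaging over all possible targets is exactly the unrealistic assumption the rest of this section is about):

```latex
% No free lunch for learning, loosely: if target functions f are averaged
% uniformly over the set \mathcal{F} of all possible targets, the expected
% off-training-set error of any learning algorithm A is the same constant c,
% i.e. no learner beats any other without assumptions about which f actually occur.
\[
  \frac{1}{|\mathcal{F}|} \sum_{f \in \mathcal{F}}
    \mathbb{E}\bigl[\mathrm{err}_{\mathrm{OTS}}(A, f)\bigr] \;=\; c
  \qquad \text{for every algorithm } A .
\]
```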

For example, we can tell that a typical image of the world somehow doesn’t look like an image of random noise; in the same way, the set of natural problems we face is large but has structure that makes it much narrower than the set of all possible problems. If deep learning is doing so much better empirically than other algorithms, it must be exploiting this natural structure; in a sense, it’s biased toward these problems, and would do worse on problems without this structure.

Two things actually make deep architectures surprisingly better than more “classical” machine learning models, and (in my opinion) a step on the path to general intelligence: they do representation learning, meaning you don’t handcraft the features but instead let the algorithm discover its own representation from the raw data, and they’re hierarchical models, meaning they form higher-level abstractions from the details.
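As a toy illustration (the framework, layer sizes, and names here are my own choices, not anything canonical), this is what “representation learning plus hierarchy” looks like in a few lines of PyTorch: no features are handcrafted, and each stage composes the output of the one below it.

```python
# A minimal sketch of a hierarchical, feature-learning model (illustrative
# sizes; assumes 32x32 RGB inputs and 10 classes, both arbitrary choices).
import torch
import torch.nn as nn

class DeepFeatureLearner(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Lowest stage: learns simple, local patterns directly from raw pixels.
        self.low = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # Middle stage: composes low-level patterns into larger parts.
        self.mid = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # Head: only ever sees the highest-level "chunks".
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 8 * 8, num_classes))

    def forward(self, x):
        return self.head(self.mid(self.low(x)))

model = DeepFeatureLearner()
dummy = torch.randn(1, 3, 32, 32)   # one fake 32x32 RGB image
print(model(dummy).shape)           # torch.Size([1, 10])
```

The point is only the shape of the computation: after training, the filters in the lowest stage typically end up playing the role that hand-engineered edge detectors used to play, and the head operates purely on abstractions it was never explicitly given.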


It’s pretty crucial for humans that internal representations of things are learned and not handcrafted, because this gives us the generality part of “general intelligence”. There was no concept of a keyboard in the ancestral environment, but that didn’t stop people from learning to type when the typewriter was invented; there was no touchscreen keyboard twenty years ago, but that didn’t stop millennials from learning to text when the iPhone was invented.

There is definitely something fundamentally important about learning features from raw data for general intelligence. After all, the same raw data means entirely different things depending on the context; there’s no way a fixed, handcrafted mapping from raw data to representation can be part of a general learning algorithm. Dependence on a human-constructed mapping from representation to reality is one of the reasons for the brittleness of purely symbolic AI. Representation learning is a necessary (if not sufficient) piece of generality.


Hierarchical modeling also seems to capture an important aspect of general intelligence. It has an obvious analogue in the study of human expertise, which is the concept of chunking in psychology. Human brains can only manipulate “seven plus-or-minus two” objects in their working memory, so the thing that separates experts (chessmasters, memory competitors, tennis players, pianists, etc.) is not some superhuman ability to notice everything and act upon that information, but a concise chunking of the relevant information and an ability to manipulate these chunks in meaningful ways. You can see this pretty obviously when you learn to drive a car: at first, you’re consciously thinking about how much to turn, which windows to notice, and which pedals to press, but you gradually become familiar with these lower-level operations until eventually you operate at the level of changing lanes and choosing routes. The fact that human brains work intelligently at all, despite this tiny “amount” of working memory, is a reflection that the highest level of abstraction encodes the most important information from the lower levels really, really well.

Yoshua Bengio cites the no free lunch theorem as evidence that we need a strong prior belief about what the world looks like (encoding some of the structure in the world), and points to hierarchical models (what he calls the “compositionality” of the world) as a reason why deep architectures work well. But even though they work well, the best machine learning algorithms generalize extremely slowly compared to human brains. For example, humans need just a few examples of an object to generalize well, as opposed to the gigabytes required by state-of-the-art systems that still only approach human performance. In other words, humans are extremely efficient with data [2].


Connection to humans

If we take this (very vague and hand-wavy) hierarchical, feature-learning metaphor for human intelligence seriously, then there are some things we can say about what it means to think differently from other people, and what it means to communicate different thoughts.

Somewhat uncontroversially, an expert understands her subject in a different way than a novice does, but it’s not just knowing more: it’s knowing differently. Given the same raw data (sensory inputs like vision, touch, sound), an expert extracts different features, features that hierarchically build into one of the very few chunks manipulated at a conscious level. This is why chess grandmasters often consider no more moves than amateurs do (they’re just better at subconsciously choosing which moves to consider), and it’s where the whole concept of intuition (unconscious, nearly indescribable knowledge) can surface. In The Art of Learning, Josh Waitzkin describes the experience of learning numbers to leave numbers, or form to leave form: low-level information first becomes conscious knowledge, which is repeated enough to become unconscious knowledge, which builds a strong foundation for higher-level knowledge to be learned in the same way. This is oddly reminiscent of the hierarchical feature learning in deep learning.

This means that there’s no one-to-one mapping between concepts in two different people’s brains. As an aside, Matrix-style knowledge transfer (“downloading kung fu”) is likely not so straightforward even if you had brain emulations, because it’s unlikely that knowledge is additive and compactly described in the way that data on a hard drive is. On the other hand, I think there’s likely a significant amount of interesting research to be done in transfer learning.

But what does this mean for arguments and communication? What does it mean for two people to communicate?


For a long time, it seemed very strange to me that two people can fail to ever agree. After all, there’s some sense in which two people who discuss something persistently should eventually agree, right? There’s even a formal statement of this: Aumann’s agreement theorem.

Aumann’s agreement theorem: Two people acting rationally (in a certain precise sense) and with common knowledge of each other’s beliefs cannot agree to disagree. More specifically, if two people are genuine Bayesian rationalists with common priors, and if they each have common knowledge of their individual posterior probabilities, then their posteriors must be equal.
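A hedged formalization of that statement (simplified; standard notation, where each agent’s private information is modeled as a partition of the state space):

```latex
% Aumann (1976), simplified: agents 1 and 2 share a common prior P, and
% agent i's information at the true state \omega is the partition cell
% \Pi_i(\omega). If both posteriors for an event E are common knowledge,
% they must coincide:
\[
  \text{common knowledge that }\
  P\bigl(E \mid \Pi_1(\omega)\bigr) = q_1
  \ \text{ and } \
  P\bigl(E \mid \Pi_2(\omega)\bigr) = q_2
  \;\Longrightarrow\; q_1 = q_2 .
\]
```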

And yet it is often the case that not only do two humans find it impossible to agree, but they find each other’s positions completely incomprehensible. The easy solution, of course, is that the priors aren’t common: they disagree on some important facts, which leads them to different conclusions even as they agree on the rules of logic. But I think it goes deeper than that: not only are the “facts” different, but their internal representations of the world are different. Not only do they disagree on which facts are right or wrong, but they don’t even share the same feature-extraction mechanism: the way they see the world, the high-level representation they extract from the raw data, tuned over their entire lives, creates “chunks” that operate with completely different dynamics. So when I say the word “liberal”, or “God”, or “capitalism”, entirely different constructs are conjured up in different people’s minds.


On the other hand, when we discuss very specific things, we can communicate incredibly quickly, almost instantaneously. How is it that we can communicate enormously complex emotions to each other if our internal representations are so unsynchronized? Read a sappy poem or watch a horror film, and somehow incredibly large amounts of information are transmitted almost instantly. The nuances of emotion must be enormously complex if we measure them by something like Kolmogorov complexity (i.e. the length of the smallest possible computer program that reproduces the concept). How is it possible that it takes months to communicate a relatively low-complexity concept like calculus, but we can communicate emotion almost instantaneously?
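For reference, the standard definition being invoked here (relative to some fixed universal machine U, which is the usual caveat):

```latex
% Kolmogorov complexity of an object x: the length of the shortest program p
% that makes a fixed universal machine U output x. The puzzle above is that
% emotions look high-complexity yet transmit in seconds, while calculus is
% comparatively low-complexity yet takes months to teach.
\[
  K_U(x) \;=\; \min\,\{\, |p| \;:\; U(p) = x \,\}
\]
```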

It’s because we don’t really need to communicate the concept at all: it’s basically compression, made possible because these specific representations are highly synchronized. Humans share an enormous amount of prior with each other, especially in the machinery of cognition, if not the resulting representations that the machinery produces. Evolution placed a highly complex black-box emotion machine in our brains, and if someone else wants to communicate one complex internal state of that machine (like happiness or love), they just tell you which lever to pull and that state is automatically generated within you. Empathy works because we share so much prior information on these matters, and it’s why we find these emotions “instinctive”, almost subconscious: we can literally feel other people’s emotions by leveraging our shared black-box machine. In some sense, poetry is just an extremely lossy compression format for complex, specific emotions.


The field of AI is useful for many reasons, both practical and philosophical, but it’s also fascinating from a humanities point of view. It shears away our anthropomorphic tendencies and exposes what, in some sense, is really “hard” or “interesting” about humans. I believe that as we get closer to a theoretical understanding of the phenomenon of intelligence, we’ll gather more insights into human thought and human nature.


Footnotes

  1. Here is a great intuitive introduction to this theorem. 

  2. Human data efficiency is probably due to better algorithms used by the brain, but it’s probably also caused by a strong prior evolved into us. I think it’s therefore really interesting to explore what the human brain’s priors are. Obvious examples are the capabilities for language and vision. I think it’s especially interesting that our “basest” or most instinctive desires, things like hunger, thirst, and sexual desire, actually operate at a very high level of abstraction.