
Towards Geometric Deep Learning II: The Perceptron Affair
by Michael Bronstein, July 2022



Geometric Deep Learning approaches a broad class of ML problems from the perspectives of symmetry and invariance, providing a common blueprint for neural network architectures as diverse as CNNs, GNNs, and Transformers. In a new series of posts, we study how geometric ideas dating back to ancient Greece have shaped modern deep learning.

Image: based on Shutterstock.

In the second post of the “Towards Geometric Deep Learning” series, we discuss early neural network models and how their criticism gave rise to a new field of computational geometry. This post is based on the introductory chapter of the book M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, Geometric Deep Learning (to appear with MIT Press upon completion) and accompanies our course at the African Masters in Machine Intelligence (AMMI). See Part I discussing symmetry and our previous post summarising the concept of Geometric Deep Learning.

While it is hard to agree on a specific point in time when ‘artificial intelligence’ was born as a scientific field (after all, humans have been obsessed with comprehending intelligence and learning since the dawn of civilisation), we will attempt the less risky task of looking at the precursors of deep learning — the main topic of our discussion. This history can be packed into less than a century.

By the 1930s, it had become clear that the mind resides in the brain, and research efforts turned to explaining brain functions such as memory, perception, and reasoning in terms of brain network structures. McCulloch and Pitts [1] are credited with the first mathematical abstraction of the neuron, showing its capability to compute logical functions. Just a year after the legendary Dartmouth College workshop that coined the very term ‘artificial intelligence’ [2], the American psychologist Frank Rosenblatt of the Cornell Aeronautical Laboratory proposed a class of neural networks he called ‘perceptrons’ [3].

Frank Rosenblatt and his Mark I perceptron neural network developed at the Cornell Aeronautical Laboratory for simple visual pattern recognition tasks. Portrait: Ihor Gorskiy.

Perceptrons, first implemented in software on a digital machine and then in dedicated hardware, managed to solve simple pattern recognition problems such as the classification of geometric shapes. However, the rapid rise of ‘connectionism’ (as AI researchers working on artificial neural networks labelled themselves) was met with a bucket of cold water in the form of the now-infamous book Perceptrons by Marvin Minsky and Seymour Papert [4].

In the deep learning community, it is common to retrospectively blame Minsky and Papert for the onset of the first ‘AI Winter,’ which made neural networks fall out of fashion for over a decade. A typical narrative cites the ‘XOR Affair,’ a proof that perceptrons are unable to learn even very simple logical functions, as evidence of their poor expressive power. Some sources add a pinch of drama, recalling that Rosenblatt and Minsky went to the same school, and even alleging that Rosenblatt’s premature death in a boating accident in 1971 was a suicide in the aftermath of the criticism of his work by colleagues.

Marvin Minsky and Seymour Papert, authors of the influential book “Perceptrons.” The two shapes on the cover of the book (only one of which is connected) allude to the connectedness problem. The book considered simple single-layer perceptrons (top left) and was perhaps the earliest geometric approach to learning, including the introduction of group invariance (bottom left).

The reality is probably both more mundane and more nuanced. First, a far more plausible reason for the ‘AI Winter’ in the USA was the 1969 Mansfield Amendment, which required the military to fund “mission-oriented direct research, rather than basic undirected research.” Since many artificial intelligence efforts at the time, including Rosenblatt’s research, were funded by military agencies and did not show immediate utility, the funding cuts had dramatic effects. Second, neural networks and artificial intelligence in general were over-hyped: it is enough to recall a 1958 New Yorker article calling perceptrons a

“first serious rival to the human brain ever devised” — The New Yorker (1958)

and “remarkable machines” that were “capable of what amounts to thought” [5], or the overconfident MIT Summer Vision Project, which expected the “construction of a significant part of a visual system” and the ability to perform “pattern recognition” [6] within one summer term of 1966. It was just a matter of time before the research community realised that the initial hopes to ‘solve intelligence’ had been overly optimistic.

Early hype: a 1958 New Yorker article praised perceptrons as “capable of thought” (left), and the over-optimistic “MIT Summer Vision Project” aimed to construct a “significant part of a visual system” within a few months in 1966.

If one looks, however, into the substance of the dispute, it is apparent that what Rosenblatt called a ‘perceptron’ is rather different from what Minsky and Papert understood by this term. Minsky and Papert focused their analysis and criticism on a narrow class of single-layer neural networks they called ‘simple perceptrons’ (which is what is typically associated with the term today), computing a weighted linear combination of the inputs followed by a nonlinear function [7]. Rosenblatt, on the other hand, considered a broader class of architectures that anticipated many ideas of what would now be considered ‘modern’ deep learning, including multi-layered networks with random and local connectivity [8].
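
To make the distinction concrete, here is a minimal sketch of such a ‘simple perceptron’: a linear threshold unit trained with a Rosenblatt-style error-correction rule (the learning rate, epoch budget, and toy datasets are our own illustrative choices, not anything from the book). It readily learns the linearly separable AND function, while no choice of weights lets it represent XOR.

```python
import numpy as np

def train_perceptron(X, y, epochs=50, lr=1.0):
    """Rosenblatt-style learning rule for a single linear threshold unit.
    X: (n, d) inputs, y: (n,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = float(w @ xi + b > 0)   # weighted sum + threshold nonlinearity
            w += lr * (yi - pred) * xi     # update only when a mistake is made
            b += lr * (yi - pred)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1], dtype=float)   # linearly separable
y_xor = np.array([0, 1, 1, 0], dtype=float)   # not linearly separable

for name, y in [("AND", y_and), ("XOR", y_xor)]:
    w, b = train_perceptron(X, y)
    preds = (X @ w + b > 0).astype(float)
    print(name, "learned correctly:", bool((preds == y).all()))
```

Running the sketch prints True for AND and False for XOR, which is precisely the kind of limitation Minsky and Papert analysed.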

Rosenblatt could probably have rebutted some of the criticism concerning the expressive power of perceptrons had he known about the solution of Hilbert’s Thirteenth Problem [9] by Andrey Kolmogorov and Vladimir Arnold [10–11], establishing that a continuous multivariate function can be written as a superposition of continuous functions of a single variable. The Arnold–Kolmogorov theorem was a precursor of a subsequent class of results known as ‘universal approximation theorems’ for multi-layer (or ‘deep’) neural networks, which put these concerns to rest.
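
In the form usually quoted today, the superposition theorem states that every continuous function on the unit cube can be built from sums and compositions of continuous functions of one variable; a standard rendering of the statement (our notation) is:

```latex
% Arnold-Kolmogorov superposition theorem (standard formulation):
% for every continuous f on the d-dimensional unit cube there exist
% continuous univariate functions \Phi_q and \phi_{q,p} such that
f(x_1,\dots,x_d) \;=\; \sum_{q=0}^{2d} \Phi_q\!\left( \sum_{p=1}^{d} \phi_{q,p}(x_p) \right),
\qquad f \in C\big([0,1]^d\big),
% where the inner functions \phi_{q,p} can be chosen independently of f.
```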

While most remember Minsky and Papert’s book for the role it played in clipping the wings of the early-day connectionists, and lament the lost opportunities, an important and overlooked aspect is that it presented, for the first time, a geometric analysis of learning problems. This fact is reflected in the very name of the book, subtitled ‘An Introduction to Computational Geometry.’ At the time this was a radically new idea, and one critical review of the book [12] (which essentially stood in defense of Rosenblatt) wondered:

“Will the new subject of “Computational Geometry” grow into an active field of mathematics; or will it peter out in a miscellany of dead ends?” — Block (1970)

The former happened: computational geometry is now a well-established field [13].

Furthermore, Minsky and Papert probably deserve credit for the first introduction of group theory into the realm of machine learning: their Group Invariance Theorem stated that if a neural network is invariant to some group, then its output can be expressed as a function of the orbits of the group. While they used this result to prove the limitations of what a perceptron can learn, similar approaches were subsequently used by Shun’ichi Amari [14] for the construction of invariant features in pattern recognition problems. An evolution of these ideas in the works of Terrence Sejnowski [15] and John Shawe-Taylor [16–17], unfortunately rarely cited today, provided the foundations of the Geometric Deep Learning Blueprint.
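
The idea behind the theorem can be stated schematically as follows (this is our informal paraphrase, not Minsky and Papert’s exact formulation): a function unchanged by the action of a group is constant on each group orbit, and therefore only ‘sees’ the orbits.

```latex
% Informal statement: G-invariance means the output depends only on orbits.
f(g \cdot x) = f(x) \;\; \forall g \in G
\quad\Longrightarrow\quad
f(x) = \tilde{f}\big([x]\big),
\qquad [x] := \{\, g \cdot x : g \in G \,\} \ \text{(the orbit of } x\text{)}.
```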

The aforementioned notion of universal approximation deserves further discussion. The term refers to the ability to approximate any continuous multivariate function to any desired accuracy; in the machine learning literature, results of this type are usually credited to Cybenko [18] and Hornik [19]. Unlike the ‘simple’ (single-layer) perceptrons criticised by Minsky and Papert, multilayer neural networks are universal approximators and are thus an appealing choice of architecture for learning problems. We can think of supervised machine learning as a function approximation problem: given the outputs of some unknown function (e.g., an image classifier) on a training set (e.g., images of cats and dogs), we try to find a function from some hypothesis class that fits the training data well and allows us to predict the outputs on previously unseen inputs (‘generalisation’).
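
As a toy illustration of this view (the target function, noise level, and polynomial hypothesis class are arbitrary choices made purely for the example), one can fit noisy samples of an ‘unknown’ function and measure how well the fitted function predicts outputs on inputs it has never seen:

```python
import numpy as np

rng = np.random.default_rng(0)

# The "unknown" function: we only observe its (noisy) outputs on a training set.
target = lambda x: np.sin(2.0 * x)

x_train = rng.uniform(-2.0, 2.0, size=40)
y_train = target(x_train) + 0.1 * rng.normal(size=40)   # noisy training labels
x_test = rng.uniform(-2.0, 2.0, size=1000)               # previously unseen inputs

# Hypothesis class: degree-5 polynomials, fitted by least squares.
coeffs = np.polyfit(x_train, y_train, deg=5)

train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - target(x_test)) ** 2)
print(f"train MSE: {train_mse:.4f}, test MSE on unseen inputs: {test_mse:.4f}")
```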

Towards universal approximation: David Hilbert’s Thirteenth Problem, resolved by Andrey Kolmogorov and Vladimir Arnold, led to one of the first results showing that a multivariate continuous function can be expressed as a composition and sum of simple one-dimensional functions. George Cybenko and Kurt Hornik proved results specific to neural networks, showing that a perceptron with one hidden layer can approximate any continuous function to any desired accuracy.

Universal approximation guarantees that we can express functions from a very broad regularity class (continuous functions) by means of a multi-layer neural network. In other words, there exists a neural network with a certain number of neurons and certain weights that approximates a given function mapping the input space to the output space (e.g., the space of images to the space of labels). However, universal approximation theorems do not tell us how to find such weights. In fact, learning (i.e., finding the weights) in neural networks was a major challenge in the early days.
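
A small numerical illustration of this existence statement (not a proof, and with arbitrary choices of width, activation, and target function on our part): fix random hidden-layer weights and solve only for the output weights by least squares. With enough hidden units, a one-hidden-layer network closely matches a continuous 1D function.

```python
import numpy as np

rng = np.random.default_rng(42)

# Target: a continuous 1D function we want to approximate on [-3, 3].
f = lambda x: np.sin(x) + 0.5 * np.cos(3 * x)
x = np.linspace(-3, 3, 500)

# One hidden layer with fixed random weights; only the output layer is fitted.
width = 200
W = rng.normal(scale=2.0, size=width)    # hidden weights (random, never trained)
b = rng.uniform(-3, 3, size=width)       # hidden biases
H = np.tanh(np.outer(x, W) + b)          # hidden activations, shape (500, width)

# Least-squares fit of the output weights c such that H @ c ≈ f(x).
c, *_ = np.linalg.lstsq(H, f(x), rcond=None)

max_err = np.max(np.abs(H @ c - f(x)))
print(f"max approximation error with {width} hidden units: {max_err:.5f}")
```

The point is only that suitable weights exist; nothing here says how one would find good weights for the hidden layer as well, which is exactly the learning problem discussed next.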

Rosenblatt showed a learning algorithm only for a single-layer perceptron; to train multi-layer neural networks, Aleksey Ivakhnenko and Valentin Lapa [20] used a layer-wise learning algorithm called ‘group method of data handling.’ This allowed Ivakhnenko [21] to go as deep as eight layers — a remarkable feat for the early 1970s!

How to train your neural network? The now-ubiquitous backpropagation became standard only in the 1980s, after the paper of David Rumelhart and co-authors (though it had been introduced earlier by Paul Werbos and Seppo Linnainmaa). Earlier approaches, such as Aleksey Ivakhnenko’s “group method of data handling,” allowed training deep neural networks already in the early 1970s.

A breakthrough came with the invention of backpropagation, an algorithm that uses the chain rule to compute the gradient of the loss function with respect to the weights, making it possible to train neural networks with gradient descent-based optimisation techniques. To this day, it remains the standard approach in deep learning. While the origins of backpropagation date back to at least 1960 [22], the first convincing demonstration of the method in neural networks was the widely cited Nature paper of Rumelhart, Hinton, and Williams [23]. The introduction of this simple and efficient learning method was a key contributing factor in the return of neural networks to the AI scene in the 1980s and 1990s.
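
A minimal NumPy sketch of the idea (our own toy example, with an arbitrary architecture, learning rate, and iteration budget): a one-hidden-layer network trained on XOR, with gradients computed layer by layer via the chain rule and applied in gradient-descent updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: not solvable by a single-layer perceptron, easy for one hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of tanh units, sigmoid output, binary cross-entropy loss.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for step in range(5000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # Backward pass: chain rule, layer by layer.
    dz2 = (p - y) / len(X)                 # dL/d(output pre-activation)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (1.0 - h ** 2)    # dL/d(hidden pre-activation)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # Gradient-descent update.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print("predictions:", np.round(p[:, 0], 3))   # should approach [0, 1, 1, 0]
```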

Looking at neural networks through the lens of approximation theory has led some cynics to call deep learning “glorified curve fitting.” We will let the reader judge how true this maxim is by trying to answer an important question: how many samples (training examples) are needed to approximate a function accurately? Approximation theorists will immediately retort that the class of continuous functions representable by multilayer perceptrons is obviously far too large: one can pass infinitely many different continuous functions through a finite collection of points [24]. It is necessary to impose additional regularity assumptions such as Lipschitz continuity [25], in which case one can provide a bound on the required number of samples.
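
A back-of-the-envelope version of such a bound, stated informally (a standard covering-number argument rather than a sharp result): to pin down an L-Lipschitz function on the unit cube to uniform accuracy ε, one essentially needs to observe it on a grid of spacing ε/L, and such a grid contains exponentially many points in the dimension d.

```latex
% Informal sample-complexity sketch for an L-Lipschitz f : [0,1]^d \to \mathbb{R}:
% knowing f on a grid of spacing \varepsilon/L determines it everywhere to
% accuracy O(\varepsilon), and such a grid has on the order of
N(\varepsilon) \;\sim\; \left( \frac{L}{\varepsilon} \right)^{d} \quad \text{points.}
```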

The curse of dimensionality is a geometric phenomenon occurring in high-dimensional spaces. One way of visualising it is to look at the fraction of the volume of a unit hypercube occupied by the inscribed ball (the cube represents the feature space, while the ball can be interpreted as a “nearest neighbour” classifier). This volume ratio decays exponentially with dimension: for d=2 it is ~0.785, for d=3 it drops to ~0.524, and for d=10 it is already ~0.0025. Figure: adapted from Vision Dummy.
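
The numbers in the caption follow from the closed-form volume of a d-dimensional ball of radius 1/2; a short check using only the standard library:

```python
from math import pi, gamma

def inscribed_ball_fraction(d: int) -> float:
    """Volume of the ball of radius 1/2 (inscribed in the unit cube [0,1]^d),
    divided by the cube's volume (which is 1)."""
    return pi ** (d / 2) / gamma(d / 2 + 1) * 0.5 ** d

for d in (2, 3, 10):
    print(d, round(inscribed_ball_fraction(d), 4))
# prints: 2 0.7854 / 3 0.5236 / 10 0.0025 -- the fraction vanishes rapidly with d
```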

Unfortunately, these bounds scale exponentially with dimension — a phenomenon colloquially known as the ‘curse of dimensionality’ [26] — which is unacceptable in machine learning problems: even small-scale pattern recognition problems, such as image classification, deal with input spaces of thousands of dimensions. If one had to rely only on classical results from approximation theory, machine learning would be impossible. In our illustration, the number of examples of cat and dog images that would be required in theory in order to learn to tell them apart would be way larger than the number of atoms in the universe [27] — there are simply not enough cats and dogs around to do it.

The curse of dimensionality: standard results from approximation theory do not scale well with dimension. As a result, even in simple machine learning tasks, one would predict the number of training samples to be significantly larger than practically possible.

The struggle of machine learning methods to scale to high dimensions was brought up by the British mathematician Sir James Lighthill in a paper that AI historians call the ‘Lighthill Report’ [28], in which he used the term ‘combinatorial explosion’ and claimed that existing AI methods could only work on toy problems and would become intractable in real-world applications.

“Most workers in AI research and in related fields confess to a pronounced feeling of disappointment in what has been achieved in the past twenty-five years. […] In no part of the field have the discoveries made so far produced the major impact that was then promised.” — Sir James Lighthill (1972)

Since the Lighthill Report was commissioned by the British Science Research Council to evaluate academic research in the field of artificial intelligence, its pessimistic conclusions resulted in funding cuts for AI research in the United Kingdom. Together with similar decisions by American funding agencies, this amounted to a wrecking ball for AI research in the 1970s.

For us, the realisation that classical functional analysis cannot provide an adequate framework for learning problems will be the motivation to seek stronger, geometric forms of regularity that can be implemented in the particular wiring of a neural network — such as the local connectivity of convolutional neural networks. It is fair to say that the triumphant re-emergence of deep learning we witnessed a decade ago owes, at least in part, to these insights.

[1] W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity (1943), The Bulletin of Mathematical Biophysics 5(4):115–133.

[2] The Dartmouth Summer Research Project on Artificial Intelligence was a 1956 summer workshop at Dartmouth College that is considered to be the founding event of the field of artificial intelligence.

[3] F. Rosenblatt, The perceptron, a perceiving and recognizing automaton (1957), Cornell Aeronautical Laboratory. The name is a portmanteau of ‘perception’ and the Greek suffix –τρον denoting an instrument.

[4] M. Minsky and S. A. Papert, Perceptrons: An introduction to computational geometry (1969), MIT Press.

[5] That is why we can only smile at similar recent claims about the ‘consciousness’ of deep neural networks: there is nothing new under the sun (אין כל חדש תחת השמש).

[6] Sic in quotes, as “pattern recognition” had not yet become an established term.

[7] Specifically, Minsky and Papert considered binary classification problems on a 2D grid (‘retina’ in their terminology) and a set of linear threshold functions. While the inability to compute the XOR function is always brought up as the main point of criticism in the book, much of the attention was dedicated to geometric predicates such as parity and connectedness. This problem is alluded to on the cover of the book, which is adorned by two patterns: one is connected, and another one is not. Even for a human, it is very difficult to determine which is which.

[8] E. Kussul, T. Baidyk, L. Kasatkina, and V. Lukovich, Rosenblatt perceptrons for handwritten digit recognition (2001), IJCNN, showed that Rosenblatt’s three-layer perceptrons implemented on 21st-century hardware achieve an accuracy of 99.2% on the MNIST digit recognition task, on par with modern models.

[9] Hilbert’s Thirteenth Problem is one of the 23 problems compiled by David Hilbert in 1900; it asks whether the solution of a general seventh-degree equation can be expressed as a superposition of continuous functions of two arguments. Kolmogorov and his student Arnold solved a generalised version of this problem, a result now known as the Arnold–Kolmogorov Superposition Theorem.

[10] A. N. Kolmogorov, On the representation of continuous functions of several variables by superpositions of continuous functions of a smaller number of variables (1956), Doklady Akademii Nauk SSSR 108:179–182.

[11] V. I. Arnold, On functions of three variables (1957), Doklady Akademii Nauk SSSR 114:679–681.

[12] H. D. Block, A review of “Perceptrons: An introduction to computational geometry” (1970), Information and Control 17(5): 501–522.

[13] Computational Geometry has subject code I.3.5 in the ACM Computing Classification System.

[14] S.-I. Amari, Feature spaces which admit and detect invariant signal transformations (1978), Joint Conference on Pattern Recognition.

[15] T. Sejnowski et al., Learning symmetry groups with hidden units: Beyond the perceptron (1986), Physica D: Nonlinear Phenomena 22(1–3):260–275.

[16] J. Shawe-Taylor, Building symmetries into feedforward networks (1989), ICANN.

[17] J. Shawe-Taylor, Symmetries and discriminability in feedforward network architectures (1993), IEEE Trans. Neural Networks 4(5): 816–826.

[18] G. Cybenko, Approximation by superpositions of a sigmoidal function (1989), Mathematics of Control, Signals and Systems 2(4):303–314.

[19] K. Hornik, Approximation capabilities of multilayer feedforward networks (1991) Neural Networks 4(2):251–257.

[20] A. G. Ivakhnenko and V. G. Lapa, Cybernetic Predicting Devices (1965), Naukova Dumka.

[21] A. Ivakhnenko, Polynomial theory of complex systems (1971), IEEE Trans. Systems, Man, and Cybernetics 4:364–378.

[22] Backpropagation is based on the chain rule of differentiation that itself goes back to the co-inventor of differential calculus Gottfried Wilhelm von Leibniz in 1676. A precursor of backpropagation was used by H. J. Kelley, Gradient theory of optimal flight paths (1960), Ars Journal 30(10): 947–954 to perform optimisation of complex nonlinear multi-stage systems. Efficient backpropagation that is still in use today was described in the Finnish master’s thesis of S. Linnainmaa, Algoritmin kumulatiivinen pyöristysvirhe yksittäisten pyöristysvirheiden Taylor-kehitelmänä (1970), University of Helsinki. The earliest use in neural networks is due to P. J. Werbos, Applications of advances in nonlinear sensitivity analysis (1982), System Modeling and Optimization 762–770, Springer, which is usually cited as the origin of the method. See J. Schmidhuber, Deep learning in neural networks: An overview (2015), Neural Networks 61:85–117.

[23] D. E. Rumelhart et al., Learning representations by back-propagating errors (1986), Nature 323(6088):533–536.

[24] There are even examples of continuous nowhere differentiable functions such as the construction of Weierstrass (1872).

[25] Roughly, Lipschitz-continuous functions do not arbitrarily shrink or expand the distance between points on the domain. For differentiable functions, Lipschitz continuity can be expressed as an upper bound on the norm of the gradient, implying that the function does not ‘jump’ too abruptly.

[26] The first to use the term was Richard Bellman in the preface of his 1957 book Dynamic Programming, where he refers to dimensionality as “a curse which has hung over the head of the physicist and astronomer for many a year.”

[27] The number of protons in the observable universe, known as the Eddington number, is estimated at 10⁸⁰.

[28] J. Lighthill, Artificial intelligence: A general survey (1973) Artificial Intelligence 1–21, Science Research Council London.

