Geometric Priors II

By Ahmed A. A. Elhag, June 2022

The GDL Blueprint

A series of blog posts summarizing the Geometric Deep Learning (GDL) Course at the African Master’s in Machine Intelligence (AMMI) program, taught by Michael Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković.

The curse of dimensionality is one of the most challenging problems in high-dimensional learning. In a previous post (Geometric Priors I), we discussed various concepts in machine learning and geometric deep learning, including symmetry and invariant and equivariant networks. In this post, we connect these mathematical concepts to the curse of dimensionality. Then, we introduce the second geometric prior, scale separation, which is an effective tool for defeating the curse of dimensionality. Finally, we will see how we can combine these two geometric priors (symmetry and scale separation) to construct the Geometric Deep Learning (GDL) Blueprint, which can serve as a general architecture for different geometric domains.

Scale Separation Prior (Image Classification). Image from GDL proto-book, section 3.4.

This post was co-authored with M. Elfatih Salah. See also the previous posts on the Erlangen Programme of ML, High-Dimensional Learning, and Geometric Priors I. These blog posts are based on the GDL proto-book by the four instructors and the GDL Course at AMMI.

Recalling the learning setup from previous posts: in statistical learning, we have a target function f*, a mapping from the input space 𝒳 to the output space ℝ, f* : 𝒳 → ℝ. We also write the input space 𝒳 as a set of signals 𝒳(Ω, 𝒞) = {𝓍 ∶ Ω → 𝒞}, where Ω is a geometric domain and 𝒞 is a vector space defined on Ω. In Geometric Priors I, we saw various examples of geometric domains, and how the domain Ω can be used to capture various transformations deeply related to the nature of signals on this domain. (This post is closely related to Geometric Priors I.)

Thus, from what was introduced, the target function f* is defined as f* : 𝒳(Ω, 𝒞) → ℝ. Our goal is to learn this function within a hypothesis class 𝓕 that we can parameterize however we want, for example using neural networks. Our assumption now is that the target f* is 𝔊-invariant (f* remains unchanged when any element of the transformation group 𝔊 is applied to any input signal). Mathematically,

𝔊-Invariant Function. GDL course, lecture 4.
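For reference, the invariance condition in the slide can be written in LaTeX as follows (a reconstruction from the definitions above, not the original rendering):

```latex
f^*(\mathfrak{g}.x) = f^*(x) \qquad \text{for all } \mathfrak{g} \in \mathfrak{G},\; x \in \mathcal{X}(\Omega, \mathcal{C}),
```

where 𝔤.x denotes the action of the group element 𝔤 on the signal x.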

Thus, natural questions are: how can we leverage this promise in our hypothesis class? And is it enough to break the curse of dimensionality?

To answer the first question, let’s define a group smoothing operator: an operator that takes any hypothesis f and replaces it with its average over all transformations in the group 𝔊. Formally, if 𝔊 is a discrete and finite group and f is our hypothesis, we define the 𝔊-smoothing operator of 𝑓 as,

𝔊-Smoothing Operator of the Function 𝑓. Equation from GDL course, lecture 4.
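In LaTeX, this operator can be reconstructed from the description above as:

```latex
(S_{\mathfrak{G}} f)(x) = \frac{1}{|\mathfrak{G}|} \sum_{\mathfrak{g} \in \mathfrak{G}} f(\mathfrak{g}.x).
```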

If we look closely, we can see that the smoothed function has an interesting property: it is 𝔊-invariant (and if 𝑓 is already 𝔊-invariant, applying the smoothing operator does not change it). Consequently, given a hypothesis class 𝓕, we can make it 𝔊-invariant by applying the smoothing operator, more formally as follows,

𝔊-Smoothing Operator of the Hypothesis Class 𝓕. Equation from GDL course, lecture 4.
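As a concrete toy illustration (our own example, not code from the course), the minimal NumPy sketch below applies the 𝔊-smoothing operator to an arbitrary hypothesis, with 𝔊 chosen as the group of cyclic shifts of a 1D signal; the smoothed hypothesis returns the same value on every shifted copy of the input.

```python
import numpy as np

def smooth_over_shifts(f, x):
    """G-smoothing of a hypothesis f over the cyclic-shift group acting on a 1D signal x.

    Returns (1/|G|) * sum over g of f(g.x), where g.x is a cyclic shift of x.
    """
    return np.mean([f(np.roll(x, g)) for g in range(len(x))])

# Example hypothesis: a fixed linear functional, which is not shift-invariant on its own.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
f = lambda x: float(w @ x)

x = rng.normal(size=8)
print(smooth_over_shifts(f, x))              # same value for x ...
print(smooth_over_shifts(f, np.roll(x, 3)))  # ... and for any shifted copy of x
```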

(Also, it seems intuitive that if we enlarge the invariance group 𝔊, the invariant class becomes smaller.)

Learning under Invariance

By applying the smoothing operator to our hypothesis class, we can prove that the approximation error is not affected by this smoothing operation, and we can formally put it this way:

The Approximation Error is not Affected by the Smoothing Operation. GDL course, lecture 4.

For the proof, notice that the smoothing operator is an orthogonal projection operator in L²(𝒳). So the L² error made in the prediction can be decomposed into the error in the image of this operator plus the error in the orthogonal complement, as shown below,

The Approximation Error is not Affected by the Smoothing Operation (Proof). GDL course, lecture 4.

Then, taking g to be f* and using the fact that f* is not affected by the smoothing operation, we get,

The Approximation Error is not Affected by the Smoothing Operation (Proof). GDL course, lecture 4.
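In symbols, with S denoting the 𝔊-smoothing operator (our notation; this is a sketch of the argument, not the original slide), orthogonality of the projection gives, for any hypothesis f and any target g,

```latex
\|f - g\|^2 = \|Sf - Sg\|^2 + \|(I - S)f - (I - S)g\|^2 .
```

Taking g = f* and using S f* = f*, we get ‖f − f*‖² = ‖Sf − f*‖² + ‖(I − S)f‖² ≥ ‖Sf − f*‖², so the smoothed hypothesis approximates f* at least as well as the original one.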

On the other hand, the statistical error is reduced since the hypothesis class is smaller after smoothing, but the question is: how much?

Consider, for example, the Lipschitz class as our hypothesis class; we know that it has the following property:

Lipschitz Function Class. GDL course, lecture 4.

Then, its smoothed version has the following property:

Lipschitz Function Class when applying the 𝔊-Smoothing Operator. GDL course, lecture 4.
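A plausible reconstruction of these two properties (our notation, assuming 𝔊 acts by isometries): the Lipschitz class satisfies

```latex
|f(x) - f(x')| \le \beta \, \|x - x'\|,
```

while its smoothed version, being 𝔊-invariant, satisfies the stronger condition

```latex
|(S_{\mathfrak{G}} f)(x) - (S_{\mathfrak{G}} f)(x')| \le \beta \, \min_{\mathfrak{g} \in \mathfrak{G}} \|x - \mathfrak{g}.x'\|,
```

i.e., smoothness with respect to the smaller distance between group orbits.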

The smoothed version makes the Lipschitz assumption stronger, since it holds with respect to a weaker (smaller) distance, and leads to the main result:

Theorem [Bietti, Venturi, B.’21]: Using a 𝔊-invariant kernel ridge regression, the generalization error of learning a Lipschitz, 𝔊-invariant function f* satisfies 𝔼𝓡(f̂) ≲ Θ((|𝔊| n)^(−1/d)), where n is the number of samples and d is the number of dimensions.

This theorem gives a quantitative gain in sample complexity proportional to the size of the group, as the generalization error is bounded by the number of samples times the size of the group, all raised to the power of −1/d.
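Rearranging the bound makes the gain explicit: to reach a target error ε we need roughly

```latex
n \gtrsim \frac{\varepsilon^{-d}}{|\mathfrak{G}|}
```

samples, so the group contributes a multiplicative saving of |𝔊| in sample complexity, while the exponential dependence on the dimension d remains.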

Additionally, the group size |𝔊| can be exponential in the dimension; for example, if we consider local translations, the number of possible local translations is exponential in the size of the domain. However, as we can see, the rate is still cursed by dimensionality, which indicates that incorporating group invariance alone is unlikely to break the curse of dimensionality.

Conclusion so far:

So far, we have seen that using the global symmetry or invariance prior reduces the statistical error and keeps the approximation error the same, but this is still not sufficient to break the curse of dimensionality. So the question is: what assumptions beyond symmetry are necessary to break the curse of dimensionality, and what are we missing?

Also, we have been discussing theoretical questions without discussing how to build invariant classes efficiently. The smoothing operator described above is not something we want to implement directly on a computer (calculating the average over the entire transformation group is quite impractical). How can we do this efficiently from first principles?

Deep Learning “Inductive Bias”: Compositionality

In talks by influential people in deep learning research, such as Yann LeCun and Yoshua Bengio, it is often noted that the reason deep learning works is the principle of compositionality. Furthermore, there is evidence that the brain is also organized hierarchically, and this idea was not developed by deep learning researchers; it has been around for a long time.

Why Does Deep Learning Work? Yoshua Bengio (top) and Yann LeCun (bottom). GDL course, lecture 4.

Since we seem to all agree that compositionality is critical to the success of deep learning, we should ask: how can we formalize this intuition?

Multiscale Structures

A way to formalize this intuition is to look at physics and different areas of computational science. Across diverse areas, we can see how multiscale structure is a fundamental principle and the foundation of many disciplines.

In computational biology, to understand how biological systems work, we have to study processes at different scales: some things can be understood using molecular dynamics, others using functional maps. In mathematical physics, turbulence exhibits communication between different scales. Percolation, another model, describes the movement of fluids through porous materials. Consequently, we have to think about different scales. You may also have your own examples of systems with different behaviors at different scales.

Percolation Model. Animation from Wikipedia.

Basics of Multiresolution Analysis

In our context, we want to relate the notion of multiscale structure to the geometric domains (grids, groups, graphs, and manifolds). Another closely related concept is multiresolution analysis (MRA). (For simplicity, we will assume a grid domain, i.e., Ω is a 2D grid, but most of these ideas can be described more generally and then applied to all geometric domains.)

Multiresolution analysis is a scheme that decomposes a signal living on a domain in terms of signals living on a smaller (coarser) domain. For example, in the case of the grid domain, we want a smaller number of pixels. In multiresolution analysis, we try to understand how to go from one resolution (a specific number of pixels) to the next (fewer pixels) in a way that preserves information. Generally, we can decompose an image into the same image at a lower resolution plus the detail needed to recover the original (see the figure below).

Image Decomposition. GDL course, lecture 4.

Here we should note that by reducing the resolution of the grid, we are attacking the curse of dimensionality, because of the exponential dependence on the number of pixels. At a sufficiently coarse resolution, the curse of dimensionality can be avoided, since the required number of samples scales with the new, smaller resolution. Thus, our question becomes: how can we combine this idea of multiscale structure with priors in learning?
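As a toy illustration of one coarsening step (our own example, not code from the course), the NumPy sketch below halves the resolution of an image by 2×2 average pooling and computes the detail needed to recover the original, mirroring the decomposition in the figure above:

```python
import numpy as np

def coarsen(x):
    """Halve the resolution of a 2D signal by averaging non-overlapping 2x2 blocks."""
    h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

x = np.random.rand(256, 256)                      # a 256x256 "image"
x_coarse = coarsen(x)                             # 128x128: the domain shrinks by a factor of 4
detail = x - np.kron(x_coarse, np.ones((2, 2)))   # what must be added back to recover x exactly
```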

Let’s discuss some examples of how we can combine the multiscale structure and priors in learning. Imagine that the target function f* can be defined on a coarser version of the image. In the classification problem shown in the figure below, we only want to know whether the image shows a “beach” or a “mountain”, which does not depend on the fine resolution.

Original Versions (left) and Coarser Versions (right) of two Images (Classification Problem). GDL course, lecture 4.

In this problem, we can reduce the resolution of the image to a level that still gives the correct classification. As previously mentioned, this decreases the size of the grid (the number of pixels). Thus, by coarsening we can avoid the curse of dimensionality. Nevertheless, this is not true in general, as we need to be careful about the coarsening level. As shown in the image below, too much coarsening can lose so much information that we are unable to make out what exactly is in the image.

An example of when too much coarsening can lose information in an image. GDL course, lecture 4.

Let’s see an alternative case in which the multiscale structure can also help. Assume the target function f* can be approximated by a sum of local terms,

Approximating the Function f* by the Sum of Local Terms. GDL course, lecture 4.

The local term x(u) refers to a patch of the image centered at pixel u. For example, consider recognizing a texture (the classification problem shown in the image below), such as deciding whether we are looking at a brick wall or a patch of grass; each of the local terms would fall into one of these categories. Therefore, if we want a good classification, we can simply average the local descriptors, because there is not much variability at the coarse scales (a texture is a structure with spatial homogeneity).
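In symbols, this locality assumption can be written as (a reconstruction in our notation):

```latex
f^*(x) \approx \frac{1}{|\Omega|} \sum_{u \in \Omega} \phi\big(x(u)\big),
```

where x(u) is the patch extracted around location u and φ is a local descriptor shared across patches.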

Examples of Local Terms: patch images shown by the yellow blocks. GDL course, lecture 4.

We can see that here the curse of dimensionality is also avoided, because the relevant dimension of the problem is reduced to the patch dimension.

We should also note that this is a strong assumption and is not sufficient in general; as shown in the figure below, all the local terms might give a wrong classification.

Examples of when the local terms might give a wrong classification. GDL course, lecture 4.

The important point is that we might have information at the original (fine) scales as well as the coarse scales, but they interact in a way that is not too extreme. Thus, in some sense, we have to combine these two scales.

Let’s consider a general compositional model that takes the form,

A general Compositional Model with two Operators (Nonlinear Local Coarsening and Nonlinear Hypothesis). GDL course, lecture 4.

The function f* is composed of two operators: a nonlinear local coarsening combined with a nonlinear hypothesis that extracts information at coarse scales (much more powerful than averaging the patches alone), both of which can be learned. You might ask why this composition, and when we can argue that this compositional model is more efficient. When we consider specific examples, such as dynamic programming and divide-and-conquer algorithms, where a problem is solved in terms of smaller subproblems, we can argue that this compositional model is more effective. However, this is still an open problem, and such compositional models are not fully understood from a theoretical perspective.
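Written out, the compositional model takes the form (a reconstruction consistent with the description above):

```latex
f^* \approx h \circ P, \qquad P : \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C}'),
```

where P is a nonlinear local coarsening onto a coarser domain Ω′ and h is a nonlinear hypothesis acting on the coarse scale; both can be learned.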

Next, we combine the two priors, scale separation and group invariance, to obtain a powerful model from first principles. Then, we introduce the full Geometric Deep Learning (GDL) Blueprint.

Combining Invariance with Scale Separation

We now have a function class that is invariant to the group action (𝔊-invariant functions), and we want to combine it with the multiscale structure described above (the scale separation prior) into an architecture, because the geometric priors we have discussed give us necessary conditions that our model should satisfy, not all the ingredients of a specific architecture.

If we start with a linear operator that is invariant to the group action (linear 𝔊-invariant function) and use the 𝔊-smoothing operator introduced before, we can write,

Linear 𝔊-Invariant Function with 𝔊-Smoothing Operator. GDL course, lecture 4.

where Ax denotes the average of x over its group orbit (we refer to the group average of x as Ax). (The first equality holds because f is 𝔊-invariant and we are applying the 𝔊-smoothing operator, and the second equality is due to the linearity of f.)
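Spelling this out (a reconstruction of the slide from the explanation above):

```latex
f(x) \;=\; \frac{1}{|\mathfrak{G}|} \sum_{\mathfrak{g} \in \mathfrak{G}} f(\mathfrak{g}.x) \;=\; f\Big(\frac{1}{|\mathfrak{G}|} \sum_{\mathfrak{g} \in \mathfrak{G}} \mathfrak{g}.x\Big) \;=\; f(Ax).
```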

However, if we rely only on this assumption, we can lose a lot of information: with the translation group, for example, the orbit of an image consists of all possible translations of the image, and we would simply average all these images into one.

To complement this, we can use a linear 𝔊-equivariant function, which we discussed in Geometric Priors I. (If 𝔊 acts on both 𝒳 and 𝒴, a mapping 𝐵 : 𝒳 → 𝒴 is 𝔊-equivariant if for all 𝔤 ∈ 𝔊 and 𝑥 ∈ 𝒳, 𝐵(𝔤.𝑥) = 𝔤.𝐵(𝑥).) To achieve a more powerful model, we can compose this equivariant function 𝐵 with an element-wise nonlinearity 𝜎 : 𝒳 → 𝒳, with (𝜎x)(u) = 𝜎(𝑥(u)), to give a nonlinear 𝔊-equivariant function U, i.e., U := 𝜎 ⚬ 𝐵 (since the nonlinearity is applied element-wise, U remains 𝔊-equivariant).

Then we can obtain a powerful 𝔊-invariant function by combining the group average A with the nonlinear 𝔊-equivariant function U.

An important question is whether this composition yields a rich class. In other words, can we approximate any 𝔊-invariant target function well with adequate choices of B and 𝜎?

Referring to universal approximation theorems, this composition of the group average with a nonlinear equivariant map gives rise to a universal approximator. For a more stable representation, however, we need to use a local equivariant map. Also, if we have a function with long-range interactions, we cannot approximate it with a single layer of local 𝔊-equivariant maps; we must instead compose several local equivariant functions that coarsen the domain. (For the proof, see the GDL proto-book, section 3.5.)

In the end, to build a rich 𝔊-invariant function class, we need a local equivariant map, a global invariant map, and a coarsening operator.

We can outline all the ingredients discussed above for the model required to learn the target function f* in the so-called GDL Blueprint. The GDL Blueprint is a constructive approach we can use to define a general architecture, which can be applied to various geometric domains (grids, groups, graphs, and manifolds).

Suppose that Ω is a domain, 𝔊 is a symmetry group over Ω, and Ω′ is a coarser domain obtained from Ω, i.e., Ω′ ⊆ Ω. The building blocks of the Geometric Deep Learning (GDL) Blueprint are:

  • Linear 𝔊-equivariant layer 𝐵 ∶ 𝒳(Ω, 𝒞) → 𝒳(Ω′, 𝒞′), satisfying 𝐵(𝔤.x) = 𝔤. 𝐵(𝑥) for all 𝔤 ∈ 𝔊 and 𝑥 ∈ 𝒳(Ω, 𝒞).
  • Nonlinearity 𝜎 ∶ 𝒞 → 𝒞′ applied element-wise as (𝝈(x))(u) = 𝜎(𝑥(u)).
  • Local pooling (coarsening) 𝑃 ∶ 𝒳(Ω, 𝒞) → 𝒳(Ω′, 𝒞), such that Ω′ ⊆ Ω.
  • 𝔊-invariant layer (global pooling) 𝐴 ∶ 𝒳(Ω, 𝒞) → 𝒴, satisfying 𝐴(𝔤. 𝑥) = 𝐴(𝑥) for all 𝔤 ∈ 𝔊 and 𝑥 ∈ 𝒳(Ω, 𝒞).
The Geometric Deep Learning (GDL) Blueprint (graph input). Image from GDL proto-book, section 3.5.

Putting these blocks together, we can create a rich 𝔊-invariant function f : 𝒳(Ω, 𝒞) → 𝒴 that takes the form,

Rich 𝔊-Invariant Function Class. GDL proto-book, section 3.5.
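In symbols, this composition reads (reconstructed; cf. GDL proto-book, section 3.5):

```latex
f \;=\; A \circ \sigma_J \circ B_J \circ P_{J-1} \circ \cdots \circ P_1 \circ \sigma_1 \circ B_1,
```

i.e., several equivariant layers with nonlinearities and coarsening steps, followed by a global invariant pooling A.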

where each block is chosen so that its output dimensions match the input dimensions of the next block. Additionally, each block might have a different symmetry group 𝔊.
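To make the blueprint concrete, below is a minimal, hypothetical PyTorch sketch of one instantiation on a 2D grid with the translation group: convolutions play the role of the linear 𝔊-equivariant layers B, ReLU is the element-wise nonlinearity σ, strided average pooling is the local coarsening P, and a global mean over the grid is the 𝔊-invariant layer A. This is our own illustrative mapping onto the blueprint, not code from the course or the proto-book.

```python
import torch
import torch.nn as nn

class BlueprintCNN(nn.Module):
    """A tiny grid-domain instance of the GDL blueprint (translation group)."""

    def __init__(self, in_channels: int = 3, num_classes: int = 2):
        super().__init__()
        self.equivariant = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # B1: linear, translation-equivariant
            nn.ReLU(),                                             # sigma1: element-wise nonlinearity
            nn.AvgPool2d(2),                                       # P1: local coarsening (halves resolution)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),           # B2
            nn.ReLU(),                                             # sigma2
            nn.AvgPool2d(2),                                       # P2
        )
        self.head = nn.Linear(32, num_classes)                     # acts on the invariant features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.equivariant(x)        # equivariant feature maps on a coarser grid
        h = h.mean(dim=(-2, -1))       # A: global average pooling -> (approximately) translation-invariant
        return self.head(h)

model = BlueprintCNN()
x = torch.randn(1, 3, 64, 64)          # a batch with one 64x64 RGB "image"
print(model(x).shape)                  # torch.Size([1, 2])
```

Note that exact translation equivariance holds only up to the boundary effects introduced by zero padding.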

