Hybrid Discrete-Continuous Geometric Deep Learning | by Jason McEwen

Scalable and equivariant spherical CNNs by DISCO convolutions

No existing spherical convolutional neural network (CNN) framework is both computationally scalable and rotationally equivariant. Continuous approaches capture rotational equivariance but are often prohibitively computationally demanding. Discrete approaches offer more favorable computational performance but at the cost of equivariance. We develop a hybrid discrete-continuous (DISCO) group convolution that is simultaneously equivariant and computationally scalable to high resolution. This approach achieves state-of-the-art (SOTA) performance on many benchmark dense-prediction tasks. (Further details can be found in our ICLR paper on Scalable and Equivariant Spherical CNNs by DISCO Convolutions.)


Geometric deep learning on groups has many applications, such as analysing observations over the Earth and panoramic 360° photos and videos, to name just a few. However, current approaches suffer from a dichotomy: they exhibit either good equivariance properties or good computational scalability, but not both simultaneously.

The key goals of geometric deep learning techniques on groups are to encode equivariance to various group transformations (which typically translates to very good performance), while also being highly computationally scalable.

As discussed in our previous TDS article, focusing on the group setting of homogeneous spaces with global symmetries, geometric deep learning on groups can be broadly classified into discrete and continuous approaches. Continuous approaches offer equivariance but with a large computational cost. Discrete approaches, on the other hand, are typically relatively computationally efficient but sacrifice equivariance.

At Copernic AI we have developed techniques that break this dichotomy (recently published at ICLR [1]). That is, we have developed geometric deep learning techniques on groups that provide excellent equivariance properties, while also being highly computationally efficient so that they can be effectively scaled to huge, high-resolution datasets.

The key to breaking the discrete versus continuous dichotomy is to take a hybrid approach, where some parts of the representation are discretized, to facilitate efficient computation, while other parts are left continuous, to facilitate equivariance. Due to its hybrid nature (as illustrated in the diagram below) we name this approach DISCO, for DIScrete-COntinuous.

While the DISCO approach is general, we focus on the sphere as the archetypical example of the group setting of homogeneous spaces with global symmetries.

Breaking the continuous vs discrete dichotomy through a hybrid discrete-continuous (DISCO) approach that is both rotationally equivariant and computationally scalable. [Original figure created by authors.]

The DISCO approach is based on convolutional layers, where the DISCO group convolution follows from a careful hybrid representation of the standard group convolution. Some components of the representation are left continuous, to facilitate accurate rotational equivariance, while other components are discretized, to yield scalable computation.

The DISCO group convolution of a signal (i.e. data, feature map) f defined over the group, with a filter 𝝭, is given by

(f ⋆ 𝝭)(g) = ∫_G f(u) 𝝭(g⁻¹u) dµ(u) ≈ Σᵢ f[uᵢ] 𝝭(g⁻¹uᵢ) q(uᵢ),

where g is an element of the group G, dµ(u) is the (Haar) measure of integration, and q(uᵢ) are quadrature weights. Square brackets and index subscripts denote discretized quantities, with i denoting sample index, and round brackets denote continuous quantities.

On the sphere we consider transformations given by 3D rotations and so the DISCO convolution of a signal on the sphere reads

(f ⋆ 𝝭)(R) = ∫_S² f(ω) 𝝭(R⁻¹ω) dµ(ω) ≈ Σᵢ f[ωᵢ] 𝝭(R⁻¹ωᵢ) q(ωᵢ),

where R denotes a rotation and ω spherical coordinates.

Focusing on the spherical case, clearly the signal of interest must be discretized at sample positions ωᵢ. Critically, however, in the DISCO approach the filter 𝝭 and the group action R remain continuous. This allows the filter to be transformed continuously by any R, keeping a coherent representation that avoids any discretization errors and, consequently, affords rotational equivariance, unlike a fully discrete method.

The integral with respect to ω must also be discretized. For bandlimited signals on compact homogeneous manifolds, such as the sphere, the existence of a sampling theorem ensures that the integral can be approximated very accurately using quadrature weights q(ωᵢ).
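To make the role of the quadrature weights concrete, here is a minimal sketch in Python. It assumes a Gauss-Legendre sampling of the sphere, which is one possible choice and not necessarily the sampling scheme used in [1]; the helper name `gauss_legendre_sphere` is hypothetical. With L nodes in cos θ and 2L − 1 equally spaced nodes in φ, the weighted sum integrates spherical polynomials of degree up to 2L − 2 exactly (up to round-off).

```python
import numpy as np

def gauss_legendre_sphere(L):
    """Sample positions and quadrature weights q(omega_i) on the sphere.

    Hypothetical helper (an assumption for illustration): L Gauss-Legendre
    nodes in cos(theta) and 2L - 1 equally spaced nodes in phi.
    """
    x, w = np.polynomial.legendre.leggauss(L)      # nodes and weights in cos(theta)
    n_phi = 2 * L - 1
    phi = 2 * np.pi * np.arange(n_phi) / n_phi     # equally spaced longitudes
    theta_grid, phi_grid = np.meshgrid(np.arccos(x), phi, indexing="ij")
    # Quadrature weight of each sample: Gauss-Legendre weight times phi spacing.
    q = np.repeat(w[:, None], n_phi, axis=1) * (2 * np.pi / n_phi)
    return theta_grid, phi_grid, q

L = 4
theta, phi, q = gauss_legendre_sphere(L)

# Integrate the bandlimited signal f(theta, phi) = cos^2(theta) over the sphere.
f = np.cos(theta) ** 2
integral = np.sum(f * q)

print(integral, 4 * np.pi / 3)   # quadrature value and analytic value
```

The analytic value of this integral is 4π/3, so the two printed numbers should agree to round-off, even though only 28 samples are used.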

The DISCO approximation of the group convolution is highly accurate for bandlimited signals, which approximate real-world signals well provided the bandlimit is sufficiently high. By appealing to a sampling theorem, all information content of a bandlimited signal can be captured in the finite set of samples {f[ωᵢ]}. The filter is represented continuously and so does not introduce any error. The only source of approximation error is thus the quadrature used to evaluate the integral. For a sufficiently dense sampling, the sampling theorem and its corresponding quadrature evaluate the integral exactly, so it is possible in principle to compute the DISCO group convolution without any approximation error.

Since the approximation is highly accurate (and can be made exact for a sufficiently dense sampling) and group actions are treated continuously, the DISCO group convolution exhibits excellent equivariance properties, as validated numerically [1].
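The equivariance argument can be checked numerically on a toy problem. The sketch below is illustrative only and is not the implementation of [1]: it evaluates the sum Σᵢ f[ωᵢ] 𝝭(R⁻¹ωᵢ) q(ωᵢ) directly, using an assumed Gauss-Legendre sampling, and the names `sphere_samples`, `rotation`, `signal`, `filt` and `disco_conv` are all hypothetical. The signal and filter are low-degree polynomials, so the quadrature is exact and rotating the signal by Q should give the same result as evaluating the original convolution at Q⁻¹R.

```python
import numpy as np

def sphere_samples(L):
    """Unit vectors omega_i and quadrature weights q(omega_i) (assumed sampling)."""
    x, w = np.polynomial.legendre.leggauss(L)
    n_phi = 2 * L - 1
    phi = 2 * np.pi * np.arange(n_phi) / n_phi
    ct, ph = np.meshgrid(x, phi, indexing="ij")
    st = np.sqrt(1 - ct**2)
    omega = np.stack([st * np.cos(ph), st * np.sin(ph), ct], axis=-1).reshape(-1, 3)
    q = (np.repeat(w[:, None], n_phi, axis=1) * 2 * np.pi / n_phi).ravel()
    return omega, q

def rotation(alpha, beta, gamma):
    """Rotation matrix from z-y-z Euler angles."""
    def rz(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    def ry(b):
        c, s = np.cos(b), np.sin(b)
        return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return rz(alpha) @ ry(beta) @ rz(gamma)

def signal(omega):
    """Toy bandlimited signal f (polynomial of degree 2 on the sphere)."""
    x, y, z = omega.T
    return 1.0 + x + 2.0 * y * z

def filt(omega):
    """Toy continuous, directional filter psi (polynomial of degree 2)."""
    x, y, z = omega.T
    return z + 0.5 * x * z

def disco_conv(f_samples, psi, R, omega, q):
    """Evaluate sum_i f[omega_i] psi(R^{-1} omega_i) q(omega_i) for one rotation R."""
    # Rows of omega are omega_i^T, and R^{-1} = R^T, so (R^{-1} omega_i)^T = omega_i^T R.
    return np.sum(f_samples * psi(omega @ R) * q)

omega, q = sphere_samples(4)
Q = rotation(0.3, 1.1, -0.7)   # rotation applied to the signal
R = rotation(1.9, 0.4, 2.5)    # rotation at which the convolution is evaluated

# Rotated signal sampled on the fixed grid: (Q f)(omega_i) = f(Q^{-1} omega_i).
lhs = disco_conv(signal(omega @ Q), filt, R, omega, q)
# Original signal, convolution evaluated at the rotation Q^{-1} R.
rhs = disco_conv(signal(omega), filt, Q.T @ R, omega, q)

print(abs(lhs - rhs))   # equivariance error
```

Note that the filter is never sampled on a fixed grid: it is a function that is evaluated at the rotated positions R⁻¹ωᵢ, which is the continuous part of the hybrid representation.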

The DISCO convolution affords a computationally scalable implementation through sparse tensor representations [1]. Specifically, we leverage sparse-dense tensor multiplication operators to compute the DISCO spherical convolution efficiently on hardware accelerators (e.g. GPUs, TPUs).
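As a rough illustration of why a sparse representation arises, the following sketch builds the convolution operator for a compactly supported filter and applies it with a sparse-dense product. It rests on several assumptions that are not taken from [1]: the filter is axisymmetric, so that 𝝭(Rⱼ⁻¹ωᵢ) depends only on the angle between ωᵢ and the output position ωⱼ; the sampling is Gauss-Legendre; and the sparse product is written by hand in NumPy rather than with an accelerator library. All function names are hypothetical.

```python
import numpy as np

def sphere_samples(L):
    """Unit vectors omega_i and quadrature weights q(omega_i) (assumed sampling)."""
    x, w = np.polynomial.legendre.leggauss(L)
    n_phi = 2 * L - 1
    phi = 2 * np.pi * np.arange(n_phi) / n_phi
    ct, ph = np.meshgrid(x, phi, indexing="ij")
    st = np.sqrt(1 - ct**2)
    omega = np.stack([st * np.cos(ph), st * np.sin(ph), ct], axis=-1).reshape(-1, 3)
    q = (np.repeat(w[:, None], n_phi, axis=1) * 2 * np.pi / n_phi).ravel()
    return omega, q

def build_sparse_operator(omega, q, cos_cut):
    """COO triplets of Psi[j, i] = psi(R_j^{-1} omega_i) q(omega_i).

    Assumed axisymmetric filter: psi = max(cos(angle) - cos_cut, 0), which
    vanishes outside a spherical cap, so most entries are exactly zero.
    """
    cos_angle = omega @ omega.T                        # cosine of angle between omega_j and omega_i
    dense = np.maximum(cos_angle - cos_cut, 0.0) * q[None, :]
    rows, cols = np.nonzero(dense)                     # keep only the filter support
    return rows, cols, dense[rows, cols]

def sparse_dense_matmul(rows, cols, vals, features, n_rows):
    """Multiply a COO sparse matrix by a dense (n_samples, n_channels) array."""
    out = np.zeros((n_rows, features.shape[1]))
    np.add.at(out, rows, vals[:, None] * features[cols])   # scatter-add over nonzeros
    return out

omega, q = sphere_samples(16)
n = omega.shape[0]
rows, cols, vals = build_sparse_operator(omega, q, cos_cut=np.cos(0.3))

rng = np.random.default_rng(0)
features = rng.standard_normal((n, 3))                 # 3-channel spherical signal
out = sparse_dense_matmul(rows, cols, vals, features, n)

print(out.shape, len(vals) / n**2)   # output shape and fraction of nonzero entries
```

For simplicity the operator is first formed densely and then sparsified, which is only sensible at toy resolutions; at high resolution one would construct the nonzero entries directly from the filter support.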

By restricting the space of rotations further (to the quotient space SO(3)/SO(2)) and exploiting symmetries of the sampling scheme, we achieve linear scaling in the number of pixels for both computational cost and memory requirements.

The plots below show the number of floating point operations (FLOPs) and memory requirements for the DISCO spherical convolution as a function of resolution/bandlimit, compared to the most efficient alternative spherical convolution that exhibits rotational equivariance.

Computational cost and memory requirements of the DISCO spherical convolution as a function of resolution/bandlimit, compared to the most efficient alternative spherical convolution that exhibits rotational equivariance. [Original figure created by authors.]

For 4k spherical images we achieve a saving of a factor of 10⁹ in computational cost and a factor of 10⁴ in memory usage.

A transpose DISCO convolution can also be constructed in an analogous way to the forward convolution discussed above, which can then be used to increase the resolution of internal feature representations for dense-prediction tasks.

Efficient spherical implementations of common CNN architectures can then be constructed by combining the DISCO forward and transpose spherical convolutions with pointwise non-linear activations and other common architectural features, such as skip connections, batch-normalization, multiple channels, etc.

We consider a number of dense-prediction tasks below, such as semantic segmentation and depth estimation, for which we adopt a common backbone of a residual UNet architecture with DISCO convolutions. Our resulting DISCO models achieve state-of-the-art (SOTA) performance on all of the benchmark problems considered to date.

We consider the dense-prediction problem of semantic segmentation of 360° photos.

For the 2D3DS dataset of indoor 360° photos, we show below examples of spherical RGB images, ground truth segmentations, and segmentations predicted by the DISCO model simply from the RGB image.

Example segmentation of 2D3DS data of indoor 360° photos. [Original figure created by authors.]

While the predicted segmentations aren’t perfect, they are generally highly accurate. In fact, our DISCO approach achieves SOTA performance compared to all other alternatives (see [1] for further details).

For the Omni-SYNTHIA dataset of outdoor 360° photos, we also show below examples of spherical RGB images, ground truth segmentations, and predicted segmentations.

Example segmentation of Omni-SYNTHIA data of outdoor 360° photos. [Original figure created by authors.]

Again, predicted segmentations are generally highly accurate and we achieve SOTA performance compared to all other alternatives (see [1] for further details).

Another common dense-prediction task is depth estimation. We consider the task of monocular depth estimation from 360° photos, tackling the Pano3D benchmark for the Matterport3D dataset.

We show below examples of spherical RGB images, ground truth depth, and depths predicted by the DISCO model simply from the RGB image.

Example depth estimation of Matterport3D data of indoor 360° photos. [Original figure created by authors.]

Predicted depths are generally highly accurate. Indeed, we again achieve SOTA performance compared to all other alternatives (see [1] for further details).

The problem of geometric deep learning on groups that is both equivariant and computationally scalable has now been cracked through the hybrid discrete-continuous (DISCO) representation. As we have seen on the benchmark tasks considered above, where we achieve SOTA performance, excellent equivariance properties translate to excellent performance.

We now have the underlying building blocks needed to extend modern deep learning architectures to the group setting of homogeneous spaces with global symmetries, such as the sphere. There is a vast number of such applications where we can now unlock the potential of modern deep learning.

[1] Ocampo, Price, McEwen, Scalable and equivariant spherical CNNs by discrete-continuous (DISCO) convolutions, ICLR (2023), arXiv:2209.13603