Google’s PaLI: language-image learning in 100 languages | by Salvatore Raieli | Sep, 2022
A new impressive model able to reach state-of-the-art in complex tasks
The trend in recent years has been to increase the capacity of neural networks: on the one hand, models with ever more parameters have been published; on the other, the amount of training data has grown. This has held true whether the inputs and tasks were textual or visual.
For textual models, increased capacity has brought improved performance and interesting emergent behaviors: GPT, GLaM, PaLM, T5, and so on. For images, the field was long dominated by convolutional neural networks, but in recent years the trend has shifted to vision transformers.
Similar advances in both fields (textual and visual) have opened a new area at their intersection: vision-language models.
What is a vision-language model for? A model that takes both images and text as input can be used for a variety of tasks. This year we were surprised by how models such as DALL-E or Stable Diffusion could create detailed images from textual descriptions. Yet text-to-image generation is just one of the tasks a vision-language model can address:
- Visual Captioning (VC): generating descriptions for a visual input (an image, a video). The model analyzes an image and provides a textual description representing it.
- Visual Question Answering (VQA): answering a question posed about a visual input.
- Visual Commonsense Reasoning (VCR): inferring commonsense information and cognitive understanding from a visual input.
- Visual Generation (VG): generating visual outputs from a textual input (a prompt).
- In-context optical-character-recognition (OCR): OCR is the conversion of images containing text (typed, handwritten, or printed) into textual output that is understandable by the computer.
- Object recognition: identifying the objects present in an image.
Also, many of these models are trained only in English, yet there are thousands of languages in the world (an estimated 7,000), and it is important that other languages be represented and included.
This week Google presented PaLI, a vision-language model that can perform tasks in 100 languages.
The model has 17 billion parameters in total (13 B for the language model and 4 B for the visual component). It consists of a Transformer encoder that processes text and an auto-regressive Transformer decoder that generates the text output. The image input is processed by a vision transformer (ViT) whose output is fed to the encoder. Interestingly, the components were taken from earlier models:
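The architecture described above reduces to a single image+text → text interface: the ViT turns the image into a sequence of embeddings, which join the text embeddings before the encoder-decoder. The sketch below is purely illustrative (the function and names are invented, not the real PaLI API):

```python
# Toy sketch of PaLI's single interface: (image, text) in, text out.
# vit_encode and embed_text stand in for ViT-e and the mT5-based
# embedding layer; none of these names come from a real API.

def vit_encode(image_patches):
    """Stand-in for the ViT: one embedding per image patch."""
    return [("img", p) for p in image_patches]

def embed_text(tokens):
    """Stand-in for the text embedding layer."""
    return [("txt", t) for t in tokens]

def pali_forward(image_patches, prompt_tokens):
    # The encoder consumes visual and text embeddings as one sequence;
    # the autoregressive decoder then generates text conditioned on it.
    encoder_input = vit_encode(image_patches) + embed_text(prompt_tokens)
    # Placeholder "decoder": report how many embeddings the encoder saw.
    return f"<decoded text conditioned on {len(encoder_input)} embeddings>"
```

Whatever the task, the call looks the same, e.g. `pali_forward(["p1", "p2"], ["Describe", "the", "image"])` — only the prompt changes.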
A key component of the PaLI model is reuse, in which we seed the model with weights from previously-trained uni-modal vision and language models, such as mT5-XXL and large ViTs. This reuse not only enables the transfer of capabilities from uni-modal training, but also saves computational cost. — Google AI
In fact, the authors created three models of increasing size but similar architecture (the final model was the one with the largest capacity).
The authors built a dataset specifically for this model: WebLI. It was constructed by collecting images and text from the web, not limited to English but covering 109 languages. In this way they gathered 10 billion images and 12 billion alt-texts (texts describing the images). Using the GCP Vision API, they also extracted the text appearing in the images (OCR annotations), yielding 29 billion image-OCR pairs. Ultimately, only the highest-quality billion images were used for training.
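Keeping only the best billion of the collected pairs amounts to scoring each image-text pair and retaining the top fraction. A minimal sketch of that filtering step (the scoring function is a placeholder; WebLI used a learned image-text alignment score):

```python
# Sketch of the quality-filtering step: keep only the top-scoring
# fraction of image-text pairs. score_fn is a stand-in for a learned
# image-text alignment score.

def filter_top_fraction(pairs, score_fn, keep_fraction=0.1):
    """Sort pairs by score (descending) and keep the best fraction."""
    ranked = sorted(pairs, key=score_fn, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]
```

With `keep_fraction=0.1`, ten billion collected pairs would shrink to roughly the one billion actually used for training.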
Vision-language tasks require different capabilities and sometimes have diverging goals. Some tasks inherently require localization of objects to solve the task accurately, whereas some other tasks might need a more global view. Similarly, different tasks might require either long or compact answers. — Google AI
For this very reason, they developed a training strategy in which the model always receives an input (image + text) and produces an output (text). The model was trained on a mixture of pre-training objectives (text-only corruption, captioning on native and translated alt-text data, split captioning, and so on) while keeping the same interface (input: image + text; output: text). This allows the model to be trained in a uniform way while remaining able to generalize and perform other tasks.
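Casting every pre-training objective into the same (image + text prompt → text target) format might look like the following. The prompt templates here are invented for illustration; the paper defines its own task prompts:

```python
# All tasks share one interface: an image plus a text prompt map to a
# text target. The templates below are illustrative stand-ins, not the
# actual prompts used for PaLI.

def make_training_example(task, image_id, text, lang="en"):
    if task == "caption":
        prompt = f"Generate the alt_text in {lang}."
        target = text
    elif task == "ocr":
        prompt = f"Generate the ocr_text in {lang}."
        target = text
    elif task == "vqa":
        question, answer = text
        prompt = f"Answer in {lang}: {question}"
        target = answer
    else:
        raise ValueError(f"unknown task: {task}")
    return {"image": image_id, "input_text": prompt, "target_text": target}
```

Whatever the objective, every example ends up with the same three fields, so one model and one training loop serve the whole mixture.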
The model is trained in JAX with Flax using the open-sourced T5X and Flaxformer framework. For the visual component, we introduce and train a large ViT architecture, named ViT-e, with 4B parameters using the open-sourced BigVision framework. — Google AI
Once the model was trained, the authors compared it with other state-of-the-art approaches (including SimVLM, CoCa, GIT2, Flamingo, BEiT-3) on multiple vision-and-language tasks:
- Image captioning. The authors tested the model’s capability on three datasets: COCO (the standard benchmark), NoCaps (similar to COCO, but with many more visual concepts in the targets), and TextCaps (where the images contain text). In addition, they used a fourth dataset, XM-3600, which unlike the other three is multilingual.
- Visual Question Answering. As the authors note, this is a difficult task because the answer must match the target exactly to be counted as correct, and PaLI’s vocabulary spans 100 languages. Despite this, the model achieved very good results on several datasets: VQAv2, OKVQA (where external knowledge is needed, since not all the information required to answer is present in the input image), and TextVQA & VizWiz-QA (where the model has to use the text present in the images to answer). They also used cross-lingual and multilingual datasets (xGQA and MaXM).
- Language-understanding capabilities. The authors wondered whether the model, once trained on multimodal tasks (image + text), might forget its language-modeling capability. They tested it on datasets in both English and several other languages. Despite a training setup favoring multimodality, the model maintains a high level of language understanding for English.
- Zero-shot Image Classification. Here the authors did not add a classifier on top of the original model; instead, they adapted the dataset. The model showed good capabilities, though still below the state-of-the-art.
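"Adapting the dataset" rather than adding a classifier head means treating classification as text generation: score each class name as a candidate text output and pick the most likely one. A sketch with a dummy scoring function (the names and the scorer are invented for illustration):

```python
# Zero-shot classification without a classifier head: rank class names
# by how likely the model is to generate them as the text output.
# score_fn(image, text) stands in for the model's log-likelihood.

def zero_shot_classify(image, class_names, score_fn):
    scores = {name: score_fn(image, name) for name in class_names}
    return max(scores, key=scores.get)

# Toy scorer: pretend the model prefers the label that appears in the
# image identifier.
def toy_score(image, name):
    return 1.0 if name in image else 0.0

zero_shot_classify("photo_of_a_dog", ["cat", "dog", "car"], toy_score)  # → "dog"
```

The class list can change freely at inference time, which is what makes the approach zero-shot.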
Although the results are impressive, the model is not perfect; there are some limitations:
- The model struggles to describe scenes containing many objects, because the training dataset does not contain such complex annotations.
- Some of the multilingual capability is lost when the model is fine-tuned for English-only data.
- There are inherent limitations to the evaluation as the authors state: “model might generate a correct response which is a synonym or a paraphrase of the target response and does not match the target exactly.”
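The exact-match criterion behind that last limitation is easy to make concrete: even with a light normalization pass, a synonym or paraphrase of the target is still counted as wrong. A minimal sketch of such a metric:

```python
import string

def normalize(answer):
    """Lowercase and strip punctuation/extra whitespace before comparing."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(answer.split())

def exact_match(prediction, target):
    """True only if the normalized strings are identical."""
    return normalize(prediction) == normalize(target)

exact_match("A sofa.", "a sofa")  # True: only casing/punctuation differ
exact_match("couch", "sofa")      # False: a correct synonym still scores 0
```

This is why the authors caution that a semantically correct answer can be scored as a failure.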
In addition, the authors have attempted to mitigate potential biases and present, in the appendix of the paper, detailed sheets on the model and the datasets used (model and data cards). Still, as they note in the paper:
Large models may have broader societal impact. While such models have demonstrated strong performance on public benchmarks, they might contain unknown biases or stereotypes, or propagate inaccurate or otherwise distorted information. While we have made efforts to measure some of these issues, such models need to be re-assessed carefully before being used for specific purposes.
Conclusions
PaLI achieves state-of-the-art results on several tasks that until now were considered challenging, and not only in English but in several languages.
Although PaLI has reached the state-of-the-art, more vision-language models will probably be released soon (hopefully open source). PaLI also illustrates some broader trends:
- Higher-performing but not necessarily bigger models (PaLI outperforms Flamingo with only 17 B parameters vs. 80 B)
- Models that are trained in several languages besides English (such as Meta’s NLLB or Bloom)
- Multimodality (image and text in this case, but also video, audio, and so on)
- A focus on avoiding bias
You can look for my other articles, you can also subscribe to get notified when I publish articles, and you can also connect or reach me on LinkedIn. Thanks for your support!
Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.
Or feel free to check out some of my other articles on Medium: