
Multimodal Chain of Thoughts: Solving Problems in a Multimodal World | by Salvatore Raieli | Mar, 2023



NLP | MULTIMODALITY | CHAIN OF THOUGHTS

The world is not only text: how can the chain of thought be extended to images and text?

photo by Giulio Magnifico on Unsplash

Sometimes getting to the answer is not easy, especially when the question requires reasoning. A model does not always have the answer hidden in its parameters, but it can get there with the right context and approach. What is the chain of thought? Why does this approach make it possible to solve multi-step reasoning tasks? Can it be extended to multimodal problems (i.e., problems with images and text)? Are only large models capable of this?

This article discusses how to answer these questions.

chain of thought
photo by Todd Cravens on Unsplash

In recent years we have seen the number of model parameters grow (to well over 100 billion). This has been motivated by scaling laws: as the number of parameters increased, the error decreased.

While this is true for tasks such as sentiment analysis and machine translation (even in the case of zero-shot or few-shot learning), even models with billions of parameters struggle with tasks that require multi-step reasoning (e.g., math problems or commonsense reasoning).

How can we enable a model to succeed at these tasks?

Large models can be fine-tuned for a specific task, and this was the first approach that was tried. As the authors of this idea explain, if you ask a model whether a whale has a belly button, the model will incorrectly answer no. This is because the model does not have this information stored in its parameters. The authors suggest that one can help the model by providing it with a hint of implicit knowledge: “A whale is a mammal”.

chain of thought
source: here

The idea of providing implicit knowledge has paved the way for the possibility that systems can improve themselves by interacting with users. The user can identify an error and provide the missing information to the model, allowing it to correct itself. Or, more precisely, as the authors put it:

This can be viewed as a form of “one-shot learning” that improves the model on-the-fly without further training, unlike most current work that relies on data collection and re-training for fixing model errors.

So conceptually the idea is that a model can solve a problem whose exact answer it does not directly know by exploiting intermediate steps.

As noted by Google, prompting enables in-context few-shot learning. In other words, instead of fine-tuning an LM on a particular task, one can prompt the LM with a few input-output exemplars demonstrating the task. This method has proven extremely effective, especially for question answering. Moreover, as has been demonstrated, in-context learning is particularly effective for large models.
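To make this concrete, here is a minimal sketch of in-context few-shot prompting for a toy sentiment task; the `complete` function is a hypothetical stand-in for whatever LLM client is used:

```python
# A minimal sketch of in-context few-shot prompting for a toy sentiment task.
# `complete` is a hypothetical stand-in for any text-completion LLM client.

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

# A few input-output exemplars demonstrating the task, followed by the new input.
few_shot_prompt = (
    'Review: "The battery lasts two days." Sentiment: positive\n'
    'Review: "The screen cracked after a week." Sentiment: negative\n'
    'Review: "Setup took five minutes and everything worked." Sentiment:'
)

# answer = complete(few_shot_prompt)  # expected continuation: " positive"
```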

chain of thought
“Aggregate performance for all 42 accuracy-denominated benchmarks. While zero-shot performance improves steadily with model size, few-shot performance increases more rapidly, demonstrating that larger models are more proficient at in-context learning.” source: here

Google then proposed that one can allow the model to solve multi-step reasoning problems by including a few examples of the chain of thought via prompting only. For better understanding, here is an example of what changes between a classic prompt and a chain of thought prompt:

chain of thought
“Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. Chain-of-thought reasoning processes are highlighted.” source: here
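To give the flavour in plain text, here is a small illustrative pair of prompts, paraphrasing the well-known arithmetic example: the only change in the CoT version is that the demonstration spells out the intermediate reasoning before the final answer.

```python
# Toy illustration: a standard few-shot prompt vs. a chain-of-thought prompt.
# The only difference is that the CoT demonstration spells out the reasoning.

standard_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many balls does he have now?
A: 11
Q: The cafeteria had 23 apples. They used 20 for lunch and bought 6 more. How many apples do they have?
A:"""

cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. They used 20 for lunch and bought 6 more. How many apples do they have?
A:"""

# With the CoT prompt, the model is expected to write out the reasoning
# (23 - 20 = 3, then 3 + 6 = 9) before stating the final answer, 9.
```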

The advantage of this method is that it requires neither changing the LM’s weights nor a large training dataset.

In short, we can say that the idea is that a complex problem can be decomposed into a series of intermediate steps that can be solved separately.

It may seem like a small thing, but it actually means that this method can be applied to any problem you can solve using language.

The Google authors say that this is an emergent property of the model that appears at a certain scale of model capacity (they estimate about 100 B parameters). The authors evaluated models of increasing size on math problems:

chain of thought
“Chain-of-thought prompting enables large language models to solve challenging math problems”. source: here

Furthermore, the authors note that the improvement does not come from parameter count alone: with “chain of thought prompting, increasing model scale leads to improved performance that substantially outperforms standard prompting for large model sizes.”

This is also true for commonsense reasoning (“reasoning about physical and human interactions under the presumption of general background knowledge”).

Multimodal chain of thought
“Examples of input, chain of thought, output triples for arithmetic, commonsense, and symbolic reasoning benchmark”. source: here

In this, too, the model showed the same behavior: “performance improved with model scale, and employing a chain of thought prompting led to additional small improvements.” The greatest improvement was seen in the area of sports understanding (surprisingly).

So, in general, we have seen that there are two techniques for CoT: fine-tuning and prompting (in-context learning). The second paradigm can be further subdivided into:

  • Zero-Shot-CoT. Kojima et al. showed that LLMs are decent zero-shot reasoners: simply appending “Let’s think step by step” to the prompt is enough to meaningfully improve zero-shot performance on complex reasoning (a minimal sketch follows the figure below).
  • Few-Shot-CoT. A few step-by-step reasoning demonstrations are used to condition the model at inference time. Each demonstration presents both a question and a reasoning chain that shows the model how to arrive at the final answer (these demonstrations can be either hand-written or automatically generated).
Multimodal chain of thought
source: here
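For Zero-Shot-CoT, the recipe is as simple as it sounds. Below is a minimal sketch of the two-step procedure used by Kojima et al. (first elicit the reasoning, then extract the answer), where `complete` is again a hypothetical stand-in for an LLM call:

```python
# Zero-Shot-CoT sketch: append the trigger phrase, then extract the final answer.
# `complete` is a hypothetical stand-in for any LLM completion call.

def zero_shot_cot(question: str, complete) -> str:
    # Step 1: elicit the reasoning chain with the trigger phrase.
    reasoning = complete(f"Q: {question}\nA: Let's think step by step.")
    # Step 2: ask for the final answer, conditioned on the generated reasoning
    # (Kojima et al. use a second answer-extraction prompt of this kind).
    return complete(
        f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
        "Therefore, the answer is"
    )
```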

Few-Shot-CoT has since been shown to be more effective and to give better results (provided the demonstrations are well written). Therefore, most subsequent studies have focused on this method.

chain of thought
“Typical CoT techniques (FT: fine-tuning; KD: knowledge distillation). Segment 1: in-context learning techniques; Segment 2: fine-tuning techniques. To the best of our knowledge, our work is the first to study CoT reasoning in different modalities. Besides, we focus on 1B-models, without relying on the outputs of LLMs”. source: here
Multimodal chain of thought
image by airfocus on Unsplash

As we saw above, the chain of thought (CoT) has proven very useful for problems requiring complex reasoning. Many problems, however, are not only textual but multimodal: to solve them we may need, for example, to look at a picture. As described so far, CoT works only for problems that can be expressed in textual form. How can we apply it to multimodal problems?

Imagine reading a textbook with no figures or tables. Our ability to knowledge acquisition is greatly strengthened by jointly modeling diverse data modalities, such as vision, language, and audio. (source)

A recent article posed exactly this problem and tried to extend CoT to multimodal problems as well.

As noted earlier, models under 100 billion parameters tend to produce illogical CoTs, leading to incorrect answers. A multimodal model must handle not only textual input but also other modalities, which makes it even harder to build a model smaller than 100 B parameters.

On the other hand, Meta’s LLaMA showed that models trained with fewer than 100 B parameters can achieve results comparable to much larger models.

In addition, as other studies have shown, a text-only model has not seen images during training and thus has no knowledge of visual elements or of how to exploit visual features.

CoT reasoning in a multimodal context requires the model to take the different modalities into account: given inputs in different modalities, the model decomposes a multi-step problem into a series of intermediate steps and can then infer the answer.

Multimodal chain of thought
Example of the multimodal CoT task. source: here

The most immediate way to perform Multimodal-CoT is to transform the input of different modalities into one modality and prompt LLMs to perform CoT. (source)

For example, one could feed an image to a captioning model, append the resulting caption to the textual prompt, and then provide the combined text to a large LM.
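A sketch of this caption-then-prompt pipeline, where `caption_image` and `complete` are hypothetical stand-ins for a captioning model and an LLM call:

```python
# Sketch of the caption-then-prompt approach to multimodal CoT.
# `caption_image` and `complete` are hypothetical stand-ins for a
# captioning model and an LLM completion call, respectively.

def caption_then_cot(image_path: str, question: str, caption_image, complete) -> str:
    caption = caption_image(image_path)  # e.g. "Two bar magnets lying on a table."
    prompt = (
        f"Context: {caption}\n"
        f"Question: {question}\n"
        "Answer: Let's think step by step."
    )
    return complete(prompt)
```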

However, this approach has a serious drawback: compared with the visual features, a caption loses a lot of information, so the mutual synergy between the information contained in the different modalities is lost.

In addition, previous studies have shown that cross-modal alignment of pre-trained uni-modal models is not easy. For example, in BLIP-2, letting a vision transformer and a language model talk to each other required an additional transformer (the Q-Former) in between.

Considering these challenges, the authors decided to investigate whether it is possible to train a 1 B-parameter model for multimodal CoT.

This work focuses on 1B-models as they can be fine-tuned and deployed with consumer-grade GPUs (e.g., 32G memory). In this section, we will investigate why 1B-models fail at CoT reasoning and study how to design an effective approach to overcome the challenge. (source)

image by Jason Leung on Unsplash

Actually, an approach to train small models to reason had already been tried. However, previous attempts had used a large model as the teacher and a small model as a student.

For example, the authors provided the teacher model with a prompt and used the “Let’s think step by step” method to obtain answers that explained the reasoning. The prompt plus the resulting demonstration was then provided to the smaller model.

Multimodal chain of thought
“We consider a method consisting of multiple stages. First, a large teacher model is prompted to answer questions using multi-step reasoning, without relying on correct examples. That is, the teacher employs zero-shot chain-of-thought reasoning to generate output. We then use the resulting reasoning samples (consisting of the question and teacher output) for fine-tuning a much smaller student model.” image source (here)

This approach, however, still requires the use of large LMs, with all their drawbacks.

The authors instead decided to explore the possibility that a small model could be fine-tuned for multimodal CoT. In short, fusing multimodal features allows the model architecture to be adjusted more flexibly than prompting does. However, the main problem remains: “The key challenge is that language models under 100 billion parameters tend to generate hallucinated rationales that mislead the answer inference”.

First, why do small models hallucinate with CoT?

This is exactly the question the authors asked themselves: why does a 1 B model fail at CoT reasoning? Once this is understood, an effective approach can be designed.

The authors started by fine-tuning a text-only baseline model for CoT reasoning, modeling the task as text generation. The baseline takes the question (Q), the context (C), and the multiple options (M) and must predict the answer (A), i.e., QCM→A. The authors compared this baseline with predicting the rationale (R) before the answer (QCM→RA) and with generating the rationale after the answer as an explanation (QCM→AR).
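Concretely, the three settings differ only in how the target text is assembled from the annotated answer and rationale. The sketch below shows one plausible way to build the training pairs; the templates and field names are illustrative assumptions, not the paper's exact preprocessing:

```python
# Illustrative construction of the three text-generation settings.
# The templates and field names are assumptions, not the paper's exact preprocessing.

def build_example(question, context, options, answer, rationale, setting="QCM->A"):
    source = f"Question: {question}\nContext: {context}\nOptions: {', '.join(options)}"
    if setting == "QCM->A":      # baseline: predict the answer directly
        target = f"Answer: {answer}"
    elif setting == "QCM->RA":   # generate the rationale, then the answer
        target = f"Solution: {rationale}\nAnswer: {answer}"
    elif setting == "QCM->AR":   # answer first, rationale as an explanation
        target = f"Answer: {answer}\nSolution: {rationale}"
    else:
        raise ValueError(f"unknown setting: {setting}")
    return source, target
```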

Multimodal chain of thought
(source)

The result is surprising: accuracy drops by more than 10% if the model predicts the rationale first. “The results imply that the rationales might not necessarily contribute to predicting the right answer.” In other words, it almost seems that reasoning harms the answer.

But why?

To try to understand this, the authors separated the problem into two stages: first generate the rationale, then use it to answer the question. The model succeeds in generating a high-quality rationale (measured with RougeL, a metric used for automatic summarization and machine translation), but at the same time the rationale seems to harm answer-inference accuracy.
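For reference, rationale quality is scored with RougeL; assuming the rouge-score package, it can be computed like this:

```python
# Scoring a generated rationale against the reference with RougeL,
# assuming the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "A magnet attracts the paper clip because opposite poles attract."
generated = "The magnet pulls the paper clip because opposite poles attract."
score = scorer.score(reference, generated)["rougeL"]
print(f"RougeL F1: {score.fmeasure:.3f}")
```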

Multimodal chain of thought
(source)

The rationale does not help improve answer accuracy. So the authors selected 50 random error cases and inspected them manually. They saw that, when generating rationales, the model often hallucinated because it lacked reference to the visual content.

Multimodal chain of thought
(source)

This was the most common error, more than 60 percent of the errors were attributable to this factor.

Multimodal chain of thought
(source)

So why not give the model information about what is inside the image? The authors used a pipeline to generate captions and append them to the model’s input. However, this produced only a marginal accuracy gain (+0.59%, Table 3).

The authors then tested another approach: they fed the image to a DETR model to extract vision features and combined these with the encoded language representation. In other words, the text is encoded by the LM’s encoder and the image by the vision model; the two outputs are combined and become the input to the LM’s decoder.
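Here is a minimal PyTorch sketch of this kind of fusion, using cross-attention from the text tokens to the vision features plus a learned gate; it is in the spirit of the paper's gated fusion, not its exact implementation:

```python
# Minimal sketch of fusing DETR-style vision features with the encoded text,
# in the spirit of the paper's gated cross-attention fusion (not its exact code).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text_hidden, vision_feats):
        # text_hidden:  (batch, text_len, d_model) from the LM encoder
        # vision_feats: (batch, num_regions, d_model) from the vision model (e.g. DETR)
        attended, _ = self.cross_attn(text_hidden, vision_feats, vision_feats)
        g = torch.sigmoid(self.gate(torch.cat([text_hidden, attended], dim=-1)))
        return (1 - g) * text_hidden + g * attended  # this goes to the LM decoder

fused = GatedFusion()(torch.randn(2, 40, 768), torch.randn(2, 100, 768))
```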

The result (Table 3, above) shows that this improves not only the generation of the rationale but also the accuracy of the answer. In other words, with a better rationale “the phenomenon of hallucination is mitigated.” Vision features are beneficial for a better answer, and this useful information is probably lost in the captioning process.

Having understood why the model hallucinated, what framework can we use for efficient multimodal-CoT?

The authors propose incorporating the language (text) and vision (image) modalities into a two-stage framework in which the rationale is generated first and the answer is generated afterwards.

The model architecture is the same for both stages; only the inputs and outputs change. In the first stage, the model is given the language and vision inputs and generates a rationale. In the second stage, the original language input is appended with the rationale generated in the first stage; this updated text goes through the second model’s encoder, the vision features are added, and the decoder produces the final answer.
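Put together, two-stage inference looks roughly like the sketch below, where `generate(text, vision_feats)` is a hypothetical wrapper around the shared encoder-decoder model:

```python
# Sketch of two-stage Multimodal-CoT inference.
# `generate(text, vision_feats)` is a hypothetical wrapper around the shared
# encoder-decoder model (same architecture in both stages, different inputs/outputs).

def multimodal_cot(question_text: str, vision_feats, generate) -> str:
    # Stage 1: rationale generation from the language and vision inputs.
    rationale = generate(question_text, vision_feats)
    # Stage 2: answer inference, with the rationale appended to the language input.
    return generate(f"{question_text}\nSolution: {rationale}", vision_feats)
```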

Multimodal chain of thought
“Overview of our Multimodal-CoT framework. Multimodal-CoT consists of two stages: (i) rationale generation and (ii) answer inference. Both stages share the same model architecture but differ in the input and output. In the first stage, we feed the model with language and vision inputs to generate rationales. In the second stage, we append the original language input with the rationale generated from the first stage. Then, we feed the updated language input with the original vision input to the model to infer the answer.” (source)
Multimodal chain of thought
photo by Steven Lelham on Unsplash

We have seen why small models hallucinate during CoT and how to solve the problem; it remains to be seen whether this approach is competitive with larger models and other approaches.

The authors decided to use the ScienceQA benchmark:

ScienceQA is the first large-scale multimodal science question dataset that annotates the answers with detailed lectures and explanations. It contains 21k multimodal multiple choice questions with rich domain diversity across 3 subjects, 26 topics, 127 categories, and 379 skills. (source)

In order to use vision features, they needed an encoder-decoder model, so they chose T5. In addition, to study whether the approach generalizes to other backbones, they also used FLAN-T5. They then compared their approach with a number of models and with humans.
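Loading such encoder-decoder backbones is straightforward with Hugging Face Transformers; in the sketch below the checkpoint names are illustrative and not necessarily the exact ones fine-tuned in the paper:

```python
# Loading encoder-decoder backbones with Hugging Face Transformers.
# The checkpoint names are illustrative, not necessarily those fine-tuned in the paper.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

t5_tok = AutoTokenizer.from_pretrained("t5-base")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
flan = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# The separate encoder is what allows vision features to be fused with the
# encoded text before decoding.
inputs = t5_tok("Question: Which magnet pair attracts?", return_tensors="pt")
encoder_out = t5.encoder(**inputs).last_hidden_state  # (1, seq_len, d_model)
```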

The result shows that their approach outperforms GPT-3.5 and also outperforms humans (both on average and in the various classes of questions). UnifiedQA and GPT-3.5 use captions; the result shows that vision features are more effective.

Multimodal chain of thought
(source)

Ablation studies show that the two-stage approach gets the most out of the vision features.

Multimodal chain of thought
(source)

In addition, the authors note that multimodality speeds up convergence: in practice, the two-stage model achieves higher accuracy from the beginning of training.

Multimodal chain of thought
“Accuracy curve of the No-CoT baseline and MultimodalCoT variants across epochs.” (source)

The authors say that the approach generalizes across different models for extracting vision features; they chose DETR because it gave the best accuracy.

Multimodal chain of thought
(source)

The chosen text model is also interchangeable: the approach works even with a different LM.

(source)

The authors then inspected 50 examples for which the answer was correct and 50 for which it was incorrect, to better understand the mechanism. The result shows that CoT is not always beneficial for the answer, but the model is robust and in some cases answers correctly even when the rationale is wrong. Moreover, when the answer is incorrect, most of the errors are commonsense mistakes.

Multimodal chain of thought
(source)

The model mostly makes commonsense errors when the question requires commonsense knowledge: for example, understanding a map, counting numbers in the image, or using the alphabet. An example of an error:

Multimodal chain of thought
(source)

The authors state that these results suggest how the model could be improved in the future:

It is possible to improve MultimodalCoT by (i) incorporating more informative vision features and improving language-vision interaction to be capable of understanding maps and counting numbers; (ii) injecting commonsense knowledge; (iii) applying a filtering mechanism, e.g., using only the effective CoT to infer the answer and get rid of irrelevant CoT. (source)

The authors have made the model, the code, and the dataset available on GitHub for those who want to test it or learn more.

In this study, the authors formally investigated multimodal CoT. They analyzed why a small model hallucinates during CoT and showed that a small model can outperform large models in multimodal CoT (even exceeding human performance). The key is combining the textual and visual modalities as effectively as possible.

This is achieved with a two-stage approach: in the first stage, the visual features are used to create the rationale; in the second, this improved rationale is exploited to obtain the answer. The authors’ analysis also gives suggestions on how to build even better models.

In short, the results of this paper show that even a small model can solve complex problems, and that providing the right multimodal features is essential. One does not need a giant LM with hundreds of billions of parameters: a caption-based pipeline works worse than a small model that has direct access to vision features.

You can look for my other articles, you can also subscribe to get notified when I publish articles, and you can also connect or reach me on LinkedIn.

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.

Or you may be interested in one of my recent articles:


