Watch a 2-hour movie in 4 seconds! Alibaba releases the universal multimodal large model mPLUG-Owl3
Watching a 2-hour movie in just 4 seconds: Alibaba's team has officially unveiled its latest work, the universal multimodal large model mPLUG-Owl3, built specifically to understand multiple images and long videos.
Specifically, with LLaVA-Next-Interleave as the baseline, mPLUG-Owl3 cuts First Token Latency by a factor of 6 and raises the number of images a single A100 can model by 8 times, up to 400 images. In tests, it can finish "watching" a 2-hour movie in just 4 seconds.
In other words, inference efficiency is greatly improved without sacrificing accuracy.
mPLUG-Owl3 has also achieved state-of-the-art (SOTA) results on numerous benchmarks spanning the scenarios multimodal large models are expected to handle, including single-image, multi-image, and video tasks.
The paper's authors are from Alibaba's mPLUG team, who have been deeply engaged in the development of multimodal large model foundations. Prior to this, they have also proposed:
- Efficient multimodal foundation mPLUG series
- Modular multimodal large language model foundation mPLUG-Owl series
- Document understanding foundation mPLUG-DocOwl series

Let's delve into the groundbreaking technology that the newly released mPLUG-Owl3 has brought to the table.
Use cases unlocked

The team demonstrated several common uses of mPLUG-Owl3.
Multimodal Retrieval Augmentation
Generally, for knowledge that the model has not learned, it is a common practice to supplement it through a retrieval system. This requires the model to accurately understand the incoming multimodal knowledge and use it to answer questions.
In the following example, mPLUG-Owl3 not only answers correctly based on the retrieved knowledge, but also states which piece of knowledge its judgment relies on and gives detailed reasoning. This is very helpful for downstream developers during development and debugging.
Retrieved knowledge: here are several pieces of multimodal knowledge for you to base your decision on.
Question: Based on your observation of the four photos of dogs above, can you classify the dog in image 5? Provide your reasoning.
Answer: Based on the observation of the four photos of dogs, the dog in image 5 can be classified as an Irish Terrier. The dog in image 1 is described as an Irish Terrier, and the dog in image 5 shares similar physical characteristics with the dog in image 1...
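For developers who want to reproduce this kind of retrieval-augmented usage, below is a minimal sketch of how such an interleaved prompt might be assembled. The retriever, the file names, and the `<|image|>` placeholder convention are illustrative assumptions; the actual mPLUG-Owl3 inference interface may differ.

```python
# Minimal sketch of assembling a retrieval-augmented multimodal prompt.
# The <|image|> placeholder convention and file names are assumptions for
# illustration; consult the official mPLUG-Owl3 usage docs for the exact API.
from dataclasses import dataclass

@dataclass
class RetrievedItem:
    image_path: str   # retrieved reference image
    caption: str      # textual knowledge attached to that image

def build_rag_prompt(retrieved: list, query_image: str, question: str) -> dict:
    """Interleave retrieved image-text knowledge with the user's query image."""
    images, text_parts = [], []
    for i, item in enumerate(retrieved, start=1):
        images.append(item.image_path)
        text_parts.append(f"Image {i}: <|image|> {item.caption}")
    images.append(query_image)
    text_parts.append(f"Image {len(images)}: <|image|> {question}")
    return {"images": images, "prompt": "\n".join(text_parts)}

# Example: a labeled dog photo retrieved as knowledge, plus the query photo.
prompt = build_rag_prompt(
    [RetrievedItem("irish_terrier.jpg", "This dog is an Irish Terrier.")],
    "query_dog.jpg",
    "Classify the dog in the last image and explain your reasoning.",
)
```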
Multi-image Reasoning

In everyday use, situations like this often arise: users do not just converse around a single image; they may have several pieces of material and need the model to correctly understand how the contents of the different materials relate to each other.
In the following example, mPLUG-Owl3 uses what it sees in the two images to activate the language model's common-sense knowledge and reason across the two pictures.
Question: Can the animal on the left survive in the environment on the right? Answer and explain why.
Answer (correct answer highlighted in green): The general idea is that it cannot survive because the environment is too cold.
In another example, mPLUG-Owl3 can not only perceive the stylistic differences between the two images but also realize that although the characters in the two pictures have different styles, they are actually the same character.
First question: What artistic styles are depicted in the images, and how do they differ?
Answer: The first is pixel art and the second is cartoon style... (reasons follow)
Second question: Who appears in both paintings? What are their characteristics?
Answer: In both paintings, a man wearing a cowboy hat appears, depicted in two different styles.

Long Video Understanding
Watching ultra-long videos is a task that most current models cannot complete end-to-end, and approaches built on an agent framework struggle to guarantee response speed.
mPLUG-Owl3 can watch a 2-hour movie and start answering user questions in just 4 seconds.
Whether the user asks very detailed questions about the beginning, middle, or end of the movie, mPLUG-Owl3 can answer fluently.
How is this achieved?
Unlike traditional models, mPLUG-Owl3 does not need to splice visual sequences into the text sequences of the language model in advance.
In other words, no matter what is fed in, whether dozens of images or hours of video, the visual input does not occupy the language model's sequence capacity, avoiding the huge computational cost and memory footprint that long visual sequences would otherwise bring.
Some may ask, how is visual information integrated into the language model?
To achieve this, the team proposed a lightweight Hyper Attention module, which can extend an existing Transformer Block that can only model text into a new module capable of both image-text feature interaction and text modeling.
By sparsely extending just 4 Transformer Blocks throughout the language model, mPLUG-Owl3 upgrades the LLM into a multimodal LLM at very low cost.

After being extracted by the visual encoder, visual features are aligned to the language model's dimensions through a simple linear mapping. These visual features then interact with the text only within these 4 Transformer Block layers; since the visual tokens undergo no compression, fine-grained information is preserved.
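To get an intuition for why keeping visual tokens out of the text sequence matters, here is a rough back-of-the-envelope comparison of attention cost. All numbers below (text length, tokens per image, layer count) are illustrative assumptions, not figures from the paper; only the 400-image scale comes from the reported results.

```python
# Rough, illustrative comparison: splicing visual tokens into the text sequence
# of every layer vs. cross-attending to them in only 4 layers. All quantities
# are assumptions for illustration, except the 400-image scale reported above.

text_tokens = 1_000          # length of the textual prompt (hypothetical)
tokens_per_image = 576       # visual tokens per image (hypothetical)
num_images = 400             # the scale reported on a single A100
num_layers = 32              # a typical LLM depth (hypothetical)
hyper_layers = 4             # layers that actually see visual features

visual_tokens = num_images * tokens_per_image

# Splicing: every layer runs self-attention over the concatenated sequence.
splice_cost = num_layers * (text_tokens + visual_tokens) ** 2

# Hyper Attention: text-only self-attention in every layer, plus cross-attention
# from text queries to visual keys/values in just 4 layers.
hyper_cost = num_layers * text_tokens ** 2 + hyper_layers * text_tokens * visual_tokens

print(f"spliced self-attention: {splice_cost:.2e} attention scores")
print(f"sparse hyper attention: {hyper_cost:.2e} attention scores")
print(f"ratio: {splice_cost / hyper_cost:.0f}x")
```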
Let's now examine the design within Hyper Attention.
Hyper Attention introduces a Cross-Attention operation so that the language model can perceive visual features: the visual features serve as Keys and Values, while the hidden states of the language model serve as Queries that extract the relevant visual information.
In recent years, other studies have also considered using Cross-Attention for multimodal fusion, such as Flamingo and IDEFICS, but these efforts have not achieved satisfactory performance.
In the technical report of mPLUG-Owl3, the team compared the design of Flamingo to further illustrate the key technical points of Hyper Attention:
Firstly, Hyper Attention does not adopt the cascade design of Cross-Attention and Self-Attention but is embedded within the Self-Attention block.
The advantage of this is a significant reduction in the additional parameters introduced, making the model easier to train and further enhancing the efficiency of both training and inference.
Secondly, Hyper Attention shares the language model's LayerNorm, because the distribution that LayerNorm outputs is exactly the distribution the Attention layers have already been trained to be stable on. Sharing this layer is crucial for stable learning of the newly introduced Cross-Attention.
In addition, Hyper Attention adopts a strategy of running Cross-Attention and Self-Attention in parallel, using a shared Query to interact with the visual features and merging the two branches through an Adaptive Gate. This allows each Query to selectively pick out the visual features that are relevant to its own semantics.
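Putting these design points together, here is a minimal PyTorch sketch of what a Hyper-Attention-style block could look like: a shared LayerNorm, a single Query feeding both the text and visual branches, parallel self- and cross-attention, and an adaptive gate that mixes in the visual branch. Dimensions, head count, and the exact gate form are assumptions for illustration, not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperAttentionSketch(nn.Module):
    """Illustrative sketch (not the official code): the text Query is shared by
    self-attention (over text) and cross-attention (over visual tokens), the
    LLM's LayerNorm is reused, and an adaptive gate mixes the two branches."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)        # the LLM's own LayerNorm, shared
        self.q_proj = nn.Linear(dim, dim)    # one Query used by both branches
        self.k_proj = nn.Linear(dim, dim)    # text Keys/Values (original block)
        self.v_proj = nn.Linear(dim, dim)
        self.vk_proj = nn.Linear(dim, dim)   # new: visual Keys/Values
        self.vv_proj = nn.Linear(dim, dim)
        self.o_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)      # adaptive gate, Query-conditioned
        self.num_heads = num_heads

    def _attend(self, q, k, v):
        b, n, d = q.shape
        h = self.num_heads
        q, k, v = (t.view(b, -1, h, d // h).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, d)

    def forward(self, text, visual):
        """text: (B, T, D) hidden states; visual: (B, V, D) projected visual tokens."""
        x = self.norm(text)                  # shared LayerNorm output
        q = self.q_proj(x)                   # shared Query for both branches
        text_out = self._attend(q, self.k_proj(x), self.v_proj(x))
        vis_out = self._attend(q, self.vk_proj(visual), self.vv_proj(visual))
        g = torch.sigmoid(self.gate(x))      # each position decides how much
        return text + self.o_proj(text_out + g * vis_out)   # visual signal to take

block = HyperAttentionSketch(dim=1024)
out = block(torch.randn(1, 128, 1024),       # 128 text hidden states
            torch.randn(1, 2048, 1024))      # projected visual tokens (count arbitrary)
```

Because the only new parameters are the visual Key/Value projections and the gate, the extra cost over the original text-only block stays small, which matches the efficiency argument above.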
The team found that the relative position of images to text in the original context is very important for the model to better understand multimodal inputs.
To model this property, they introduced a Multimodal-Interleaved Rotary Position Embedding (MI-Rope) to supply positional information for the visual Keys.
Specifically, they pre-recorded the position information of each image in the original text, and this position is used to calculate the corresponding Rope embedding, which is shared by all patches of the same image.
In addition, they introduced an Attention mask in the Cross-Attention so that text appearing before an image in the original context cannot see the features of images that come after it.
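Below is a small sketch of how the MI-Rope position sharing and this cross-attention mask could be constructed; the patch count and helper names are hypothetical, and only the position-sharing and masking logic follows the description above.

```python
import torch

# Illustrative sketch of MI-Rope position assignment and the cross-attention
# mask described above; shapes and helper names are assumptions, not official code.

def image_rope_positions(image_text_positions, patches_per_image):
    """Every patch of an image reuses the position that image occupies in the
    original interleaved text, so one rotary embedding is shared per image."""
    return torch.tensor(
        [pos for pos in image_text_positions for _ in range(patches_per_image)]
    )

def cross_attention_mask(text_len, image_text_positions, patches_per_image):
    """Text token i may attend to an image's patches only if the image appears
    at or before position i in the original context (True = attention allowed)."""
    mask = torch.zeros(text_len, len(image_text_positions) * patches_per_image,
                       dtype=torch.bool)
    for img_idx, img_pos in enumerate(image_text_positions):
        cols = slice(img_idx * patches_per_image, (img_idx + 1) * patches_per_image)
        mask[img_pos:, cols] = True     # visible only to text from img_pos onward
    return mask

# Example: two images inserted at text positions 5 and 40, 9 patches each.
pos = image_rope_positions([5, 40], patches_per_image=9)     # -> 18 positions
mask = cross_attention_mask(text_len=64, image_text_positions=[5, 40],
                            patches_per_image=9)             # -> (64, 18) bool mask
```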
In summary, these design points of Hyper Attention have brought further efficiency improvements to mPLUG-Owl3, and ensured that it still possesses top-tier multimodal capabilities.
Experimental results
Through experiments on a wide range of datasets, mPLUG-Owl3 achieves state-of-the-art (SOTA) results on most single-image multimodal benchmarks, and in many evaluations it even surpasses models of larger size.
At the same time, in multi-image evaluations, mPLUG-Owl3 also outperformed LLaVA-Next-Interleave and Mantis, which are specifically optimized for multi-image scenarios.
Furthermore, on LongVideoBench, a leaderboard specifically designed to evaluate models' understanding of long videos, it surpasses existing models with a score of 52.1.

The development team has also proposed an interesting evaluation method for long visual sequences.
As is well known, in real human-computer interaction scenarios, not all images are meant to serve the user's question. Historical context is filled with multimodal content that is irrelevant to the question, and the longer the sequence, the more severe this phenomenon becomes.
To assess the model's ability to resist interference in long visual sequence inputs, they built a new evaluation dataset based on MMBench-dev.
For each MMBench circular-evaluation sample, they introduce irrelevant images and scramble the image order, then ask questions about the original images to see whether the model can still answer correctly. (For each question, four samples with different option orders and different distractor images are constructed, and the question counts as correct only if all four are answered correctly.)
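A minimal sketch of this "all variants must be correct" scoring rule is shown below; the helper names and the way distractors are sampled are assumptions, and only the counting logic follows the description above.

```python
import random

# Illustrative sketch of the anti-interference evaluation described above: each
# sample is expanded into four variants with shuffled option order and injected
# distractor images, and it counts as correct only if all four are answered
# correctly. Helper names and sampling details are hypothetical.

def make_variants(sample, distractor_pool, num_distractors, num_variants=4, seed=0):
    rng = random.Random(seed)
    variants = []
    for _ in range(num_variants):
        images = sample["images"] + rng.sample(distractor_pool, num_distractors)
        rng.shuffle(images)                  # scramble image order
        options = sample["options"][:]
        rng.shuffle(options)                 # different option order per variant
        variants.append({"images": images, "question": sample["question"],
                         "options": options, "answer": sample["answer"]})
    return variants

def strict_accuracy(samples, distractor_pool, num_distractors, answer_fn):
    """answer_fn is a stand-in for querying the model on one variant."""
    correct = 0
    for sample in samples:
        variants = make_variants(sample, distractor_pool, num_distractors)
        # A sample scores only if every variant is answered correctly.
        if all(answer_fn(v) == v["answer"] for v in variants):
            correct += 1
    return correct / len(samples)
```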
In the experiment, several difficulty levels were defined according to the number of input images.
It can be seen that models without multi-image training, such as Qwen-VL and mPLUG-Owl2, quickly fell behind.
Models that have undergone multi-image training, such as LLaVA-Next-Interleave and Mantis, initially maintain a decay curve similar to mPLUG-Owl3's, but once the number of images reaches around 50, they can no longer answer correctly.
mPLUG-Owl3, however, holds out all the way to 400 images while still maintaining 40% accuracy.
That said, although mPLUG-Owl3 surpasses existing models, its absolute accuracy is still far from ideal. What this evaluation really reveals is that all models need to further improve their resistance to interference over long sequences.