NVIDIA AI Research Proposes Language Instructed Temporal-Localization Assistant (LITA), which Enables Accurate Temporal Localization Using Video LLMs

Large Language Models (LLMs) have demonstrated impressive instruction-following capabilities, and they can serve as a universal interface for tasks such as text generation and language translation. These models can be extended to multimodal LLMs that process language together with other modalities, such as images, video, and audio. Several recent works introduce models that specialize in processing videos. These Video LLMs preserve the instruction-following capabilities of LLMs and allow users to ask various questions about a given video. However, one important missing piece in these Video LLMs is temporal localization. When prompted with “When?” questions, these models cannot accurately localize time periods and often hallucinate irrelevant information.

Three key aspects limit the temporal localization capabilities of existing Video LLMs: time representation, architecture, and data. First, existing models often represent timestamps as plain text (e.g., 01:22 or 142sec). However, given a set of frames, the correct timestamp still depends on the frame rate, which the model cannot access. This makes learning temporal localization harder. Second, the architecture of existing Video LLMs may lack the temporal resolution needed to interpolate time information accurately. For example, Video-LLaMA uniformly samples only eight frames from the entire video, which is insufficient for accurate temporal localization. Finally, temporal localization is largely ignored in the data used by existing Video LLMs. Data with timestamps make up only a small subset of video instruction tuning data, and the accuracy of these timestamps is not verified.
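
The first two limitations can be made concrete with a short Python sketch. The helper names (frame_time, sample_times) and the example numbers are illustrative assumptions, not taken from the paper. The sketch shows that the same frame index corresponds to different absolute times at different frame rates, and that eight uniformly sampled frames leave wide gaps in a two-minute video.

```python
def frame_time(frame_index, fps):
    """Absolute time in seconds of a frame, which depends on the frame rate."""
    return frame_index / fps


def sample_times(duration_s, num_frames):
    """Timestamps (seconds) of frames sampled uniformly at segment centres.

    Sampling at segment centres is an assumption made for this example.
    """
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


if __name__ == "__main__":
    # The same frame index lands at different times for different frame rates.
    print(frame_time(300, fps=30))  # 10.0
    print(frame_time(300, fps=24))  # 12.5

    # Eight frames over a 120-second video are 15 seconds apart,
    # so shorter events can fall entirely between two sampled frames.
    times = sample_times(120, 8)
    print(times[1] - times[0])  # 15.0
```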

To address these limitations, NVIDIA researchers propose the Language Instructed Temporal-Localization Assistant (LITA). It has three key components: (1) Time Representation: time tokens represent relative timestamps and allow Video LLMs to communicate about time better than plain text does. (2) Architecture: SlowFast tokens capture temporal information at fine temporal resolution to enable accurate temporal localization. (3) Data: LITA's training emphasizes temporal localization data. The researchers also propose a new task, Reasoning Temporal Localization (RTL), along with the ActivityNet-RTL dataset for learning it.
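
As a rough illustration of relative time tokens, the sketch below maps a timestamp to one of a fixed number of tokens according to its position within the video, and back again. The token format "<t>", the default of 100 tokens, the rounding rule, and the function names are assumptions made for this example; the article does not spell out LITA's exact encoding.

```python
def timestamp_to_token(tau, duration, num_tokens=100):
    """Encode an absolute timestamp (seconds) as a relative time token.

    Token <1> marks the start of the video and <num_tokens> marks the end,
    so the token is independent of frame rate and of absolute video length.
    """
    t = round(tau * (num_tokens - 1) / duration) + 1
    return f"<{t}>"


def token_to_timestamp(token, duration, num_tokens=100):
    """Decode a relative time token back to seconds for a given video length."""
    t = int(token.strip("<>"))
    return duration * (t - 1) / (num_tokens - 1)


if __name__ == "__main__":
    # The halfway point of a 60 s video and of a 600 s video
    # map to the same token, because the token is relative.
    print(timestamp_to_token(30, 60), timestamp_to_token(300, 600))
    # Decoding needs the video length to recover seconds.
    print(token_to_timestamp("<100>", 60))
```

Because the tokens are relative, decoding back to seconds requires the video duration, which is known outside the model even though the model itself never sees the frame rate.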

LITA is built on the image-based LLaVA model because of its simplicity and effectiveness. LITA does not depend on the underlying image LLM architecture and can be easily adapted to other base architectures. Given a video, the researchers first uniformly select T frames and encode each frame into M tokens. T × M is usually too large a number of tokens for the LLM module to process directly, so they use SlowFast pooling to reduce the T × M tokens to T + M tokens. The text tokens (prompt) are processed to convert any referenced timestamps into specialized time tokens. All the input tokens are then processed jointly by the LLM module as a single sequence. The model is fine-tuned on RTL data and other video tasks, such as dense video captioning and event localization. In this way, LITA learns to use time tokens instead of absolute timestamps.
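
The reduction from T × M to T + M tokens can be sketched in NumPy. The split used here is an assumption for illustration and is not described in this article: "fast" tokens average all M tokens within each frame (T tokens, dense in time), while "slow" tokens take s × s uniformly chosen frames and average each s × s spatial window (M tokens in total, denser in space). The function name, the value s = 2, the array shapes, and the slow-then-fast ordering are all hypothetical.

```python
import numpy as np


def slowfast_pool(frames, s=2):
    """Reduce T*M visual tokens to T + M tokens (illustrative sketch).

    frames: array of shape (T, H, W, D), i.e. M = H * W tokens per frame,
            each of dimension D. H and W must be divisible by s.
    """
    T, H, W, D = frames.shape

    # Fast tokens: one token per frame, averaged over all spatial positions.
    fast = frames.reshape(T, H * W, D).mean(axis=1)            # (T, D)

    # Slow tokens: pick s*s frames uniformly across the video ...
    idx = np.linspace(0, T - 1, s * s).round().astype(int)
    slow_frames = frames[idx]                                   # (s*s, H, W, D)
    # ... and average every s x s spatial window within each chosen frame.
    pooled = slow_frames.reshape(s * s, H // s, s, W // s, s, D).mean(axis=(2, 4))
    slow = pooled.reshape(-1, D)                                # (H*W, D) = (M, D)

    # Ordering of slow and fast tokens is an arbitrary choice in this sketch.
    return np.concatenate([slow, fast], axis=0)                 # (M + T, D)


if __name__ == "__main__":
    T, H, W, D = 100, 16, 16, 8          # M = 256 tokens per frame
    tokens = np.random.rand(T, H, W, D)
    out = slowfast_pool(tokens)
    print(T * H * W, "->", out.shape[0])  # 25600 -> 356
```

With T = 100 and M = 256, the LLM receives 356 visual tokens instead of 25,600, while still seeing every sampled frame at least once.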

The researchers compare LITA with LLaMA-Adapter, Video-LLaMA, VideoChat, and Video-ChatGPT on video-based text generation. Among the baselines, Video-ChatGPT slightly outperforms the others, including VideoLLaMA-v2. LITA significantly outperforms both of these existing Video LLMs in all aspects. In particular, LITA achieves a 22% relative improvement in Correctness of Information (2.94 vs. 2.40) and a 36% relative improvement in Temporal Understanding (2.68 vs. 1.98). This shows that the emphasis on temporal understanding in training not only enables accurate temporal localization but also improves LITA’s overall video understanding.
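
The relative improvements can be recomputed from the scores quoted above. This is only a sanity check on the rounded two-decimal scores, so the result may differ by about a point from the reported percentages, which were presumably derived from unrounded values. The function name is hypothetical.

```python
def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    scores = [
        ("Correctness of Information", 2.94, 2.40),
        ("Temporal Understanding", 2.68, 1.98),
    ]
    for name, lita, baseline in scores:
        print(f"{name}: {relative_improvement(lita, baseline):.1f}%")
```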

In conclusion, NVIDIA researchers present LITA, a notable step forward in temporal localization with Video LLMs. Its model design introduces time tokens and SlowFast tokens, which significantly improve the representation of time and the processing of video inputs. LITA demonstrates promising capabilities in answering complex temporal localization questions and substantially enhances video-based text generation compared to existing Video LLMs, even for non-temporal questions.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.
