MLNews

Twelve Labs’ AI Models: Understanding Video Beyond Text

Twelve Labs is a San Francisco-based company that trains AI models on both text and video. These models match natural language to video content by detecting actions, objects, and ambient sounds. Developers can use them to build apps that identify scenes, extract topics, and summarize videos.
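As a sketch of the kind of app a developer might build on such a model, the snippet below sends a natural-language query to a video-search endpoint. The base URL, endpoint path, field names, and header are illustrative assumptions, not Twelve Labs’ documented API.

```python
import json
import urllib.request

# Assumed base URL for illustration only; check the vendor docs for the real one.
API_BASE = "https://api.twelvelabs.io/v1.2"

def build_search_payload(index_id, query):
    """Build a JSON payload for a hypothetical natural-language
    video-search endpoint: the query text is matched against the
    visual, audio, and speech content of indexed videos."""
    return {
        "index_id": index_id,
        "query": query,
        # Assumed field: which modalities to search across.
        "search_options": ["visual", "conversation"],
    }

def search_videos(api_key, index_id, query):
    """POST the query to the (assumed) /search endpoint and return
    the parsed JSON response."""
    req = urllib.request.Request(
        f"{API_BASE}/search",
        data=json.dumps(build_search_payload(index_id, query)).encode(),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A caller would pass a plain-English query such as "a goal celebration in the rain" and receive matching clips with relevance scores, which is the scene-identification use case described above.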

Twelve Labs’ models can power services such as ad insertion, content moderation, media analytics, and video highlight reel generation. Ahead of the model’s release, the company’s CEO, Jae Lee, said that they work to address bias and meet fairness criteria, and that they plan to release model ethics benchmarks and datasets in the future.

Asked how the product differs from large language models such as ChatGPT, Lee said their model was specifically trained and built to process video, fully integrating the visual, audio, and speech components within a video, and that it pushes the limits of video understanding.

Google is working on a similar video-comprehension model, dubbed MUM, which it plans to use to power video features across Google Search and YouTube. Google, Microsoft, and Amazon also offer API-level, AI-powered tools that recognize objects, places, and actions in videos.

Lee was also asked about the risks of such models, given that models amplify any biases in the data they are trained on. For example, training a video intelligence model on large volumes of local news clips, which often sensationalize and racialize crime coverage, could cause the model to learn racist and sexist tendencies.

Lee claims that Twelve Labs is distinguished by the strength of its models and by the platform’s tools, which let clients adapt its models with their own data for “domain-specific” video analysis.

Today, Twelve Labs is also releasing Pegasus-1, a new multimodal model that handles a range of prompts for whole-video analysis. Pegasus-1 can be instructed to write a lengthy, informative report about a video or to produce just a few highlights with timestamps.
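To illustrate the two modes described above, here is a minimal sketch of how a client might construct a whole-video analysis request for either a full report or timestamped highlights. The request shape and field names are assumptions for illustration, not the documented interface.

```python
def build_pegasus_request(video_id, mode="report", prompt=None):
    """Build a request body for whole-video analysis with Pegasus-1.

    mode: "report" asks for a long, informative write-up of the video;
          "highlights" asks for a few key moments with timestamps.
    All field names here are illustrative assumptions.
    """
    if mode not in ("report", "highlights"):
        raise ValueError(f"unsupported mode: {mode}")
    body = {"video_id": video_id, "type": mode}
    if prompt is not None:
        body["prompt"] = prompt  # optional extra instruction for the model
    return body

# Example: request a highlight reel with timestamps.
highlight_req = build_pegasus_request(
    "vid-001",
    mode="highlights",
    prompt="List the three most important moments.",
)
```

Keeping the mode explicit in the request body mirrors the article’s distinction between long-form reports and short timestamped highlights, so the same client code covers both use cases.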

Reference

https://twelvelabs.io/technology
