Meta has announced an open-source AI model that links multiple streams of data, including text, audio, visual data, temperature readings, and movement readings.
Meta is continuing to share AI research, even as rivals such as OpenAI and Google become more secretive.
The core idea of the research is to combine multiple types of data into a single multidimensional index (or "embedding space," as AI researchers call it). The idea may sound abstract, but it is the same concept underpinning the recent boom in generative AI.
Multimodal AI models are at the core of the AI boom
AI image generators such as DALL-E and Stable Diffusion rely on systems that link text and images during training. These systems look for patterns in visual data and connect that information with descriptions of the images, which is what lets them generate pictures that match users' text prompts. Many AI tools that generate audio or video work in the same way.
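The text-image linking described above can be pictured as a shared embedding space in which matching pairs of data land close together. Below is a toy illustration of that idea, using invented four-dimensional vectors rather than a real trained model, showing how cosine similarity in such a space lets a system match a caption to the right image:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy embeddings a trained model might produce: matching text and
# images sit near each other in the shared space (values invented).
text_dog = [0.9, 0.1, 0.0, 0.2]
image_dog = [0.8, 0.2, 0.1, 0.1]
image_car = [0.1, 0.9, 0.7, 0.0]

# The caption's vector is closer to the dog image than the car image,
# which is the property generators exploit to pair prompts with pixels.
assert cosine_similarity(text_dog, image_dog) > cosine_similarity(text_dog, image_car)
```

In a real system the vectors have hundreds of dimensions and are produced by neural network encoders trained on millions of text-image pairs, but the geometry is the same: similarity in the space stands in for similarity in meaning.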
Meta claims that ImageBind is the first model to combine six types of data in a single embedding space: visual (both image and video), thermal (infrared), text, audio, depth information and, perhaps most interestingly, movement readings from an inertial measurement unit, or IMU. IMUs are found in smartwatches and phones, where they perform a variety of tasks, such as switching a phone from landscape to portrait mode or distinguishing between types of physical activity.
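One way to picture a single embedding space spanning several modalities: each data type gets its own encoder, but every encoder outputs vectors of the same dimension, so any two modalities can be compared directly. The sketch below uses invented toy vectors (not Meta's actual encoders) to show how that makes cross-modal retrieval a simple nearest-neighbour search, here matching an audio clip to an image:

```python
import math

def cosine(a, b):
    """Cosine similarity between two same-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Pretend outputs of per-modality encoders that all share one
# 3-dimensional embedding space (values invented for illustration).
image_embeddings = {
    "beach.jpg": [0.9, 0.1, 0.3],
    "street.jpg": [0.1, 0.8, 0.5],
}
audio_waves = [0.85, 0.15, 0.25]  # embedding of a waves-crashing clip

# Because every modality lives in the same space, finding the image
# that "sounds like" the audio is just a maximum-similarity lookup.
best_match = max(image_embeddings,
                 key=lambda name: cosine(audio_waves, image_embeddings[name]))
assert best_match == "beach.jpg"
```

The same lookup works in any direction (thermal to text, IMU to video, and so on), which is what makes a single shared space more powerful than six separate pairwise ones.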
Future AI systems should be able to cross-reference this data the same way current systems do with text inputs. Imagine a futuristic virtual reality device that generates not only audio and visual input but also your environment and movement on a physical stage. You could ask it to simulate a long sea voyage, and it would conjure up the sound of waves along with the rocking of the deck beneath your feet.
Meta writes in a blog post that future models could include "touch, speech and smell signals, as well as brain fMRI signals." The company also claims the research brings machines "one step closer to humans' ability to learn simultaneously and holistically from different types of information." (Which is, sure, whatever. It depends on how small the steps are.)
It's important to note that this is all speculative research, and the immediate applications will likely be limited. Last year, for example, Meta demonstrated an AI model that generated short, blurry videos from text descriptions. ImageBind shows how future versions of such a system could fold in other streams of data, such as audio that matches the video output.
The research is also interesting to industry watchers because Meta has open-sourced the underlying model, a practice that is drawing increasing scrutiny in the world of AI.
Opponents of open-sourcing, such as OpenAI, say the practice can harm creators, since rivals can copy their work, and could even be dangerous, since malicious actors might exploit cutting-edge AI models. Supporters counter that open-sourcing lets third parties examine systems and fix some of their flaws, and argue it can even offer a commercial advantage by effectively letting companies enlist third-party developers to improve their work.
Meta has so far been firmly in the open-source camp, though not without challenges. (Its latest language model, LLaMA, was leaked online earlier this year.) The approach is enabled in part by the company's lack of commercial success in AI: it has no chatbot to compete with Bing, Bard or ChatGPT. ImageBind is a continuation of this strategy.