
Latest Vision, Image and Language Models: Pangea, Ferret, OmniParser, Granite, Pixtral, Aya, SD 3.5

2024-10-29


Recently, an influx of new model releases from all angles has felt like the floodgates of innovation bursting open.


Staying up-to-date might seem overwhelming — like drinking from a firehose — but rest assured, you’re in the right place.


I’ll guide you through the most groundbreaking developments you won’t want to miss:



Let’s GO!



Vision-Language Models


Pangea


Pangea-7B is an open-source multilingual multimodal large language model (MLLM) that delivers better performance than SoTA open-source models such as Llama 3.2 11B.



It’s designed to bridge not just language gaps but also cultural nuances in visual understanding tasks.


Pangea is powered by a massive instruction-tuning dataset called PangeaIns, containing 6 million samples across 39 languages.


PangeaIns includes general instructions, document and chart question answering, captioning, domain-specific, culturally relevant, and text-only instructions.



This also isn’t just machine-translated English data; the team focused on including culturally relevant content. They even developed a pipeline to generate multicultural instructions and captions, ensuring the data isn’t Anglo-centric.



These capabilities are important for global or multi-regional product development.


By leveraging Pangea-7B or its training techniques, you can significantly improve your product’s ability to handle multiple languages, both in text and image understanding.


This could be a differentiator in markets where cultural context is crucial for multimodal chat, captioning, multilingual VQA and multi-subject reasoning.



Here are the multilingual and multimodal benchmarks:



Pangea-7B, PangeaIns, and PangeaBench are all open-source, so you can experiment with them directly or integrate parts into your own models without starting from scratch.
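Since everything is on Hugging Face, a minimal sketch of trying the model with transformers could look like the following. This assumes the released checkpoint ships in a LLaVA-NeXT-compatible format; the repo ID and prompt format are assumptions, so verify them on the official page first.

```python
# Minimal sketch, assuming a LLaVA-NeXT-compatible Pangea checkpoint on Hugging Face.
# The repo ID and prompt format below are assumptions; verify them on the official page.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaNextForConditionalGeneration

model_id = "neulab/Pangea-7B-hf"  # assumed repo ID
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/street_sign.jpg", stream=True).raw)
prompt = "<image>\nDescribe this sign and translate it into Hindi."

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```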


Here’s the official page where you can find more information.


PUMA


PUMA stands for “Powering Unified MLLM with Multi-grAnular visual generation.” It aims to unify various visual generation and understanding tasks within a single multimodal large language model (MLLM).


The challenge it addresses is pretty significant: balancing diversity and controllability in image generation tasks.




PUMA introduces a multi-granular approach to visual feature representation to achieve this:




This is important because in previous models, there was often a trade-off:



PUMA overcomes this by handling multiple levels of detail simultaneously, allowing for both high diversity and precise controllability.


It provides diverse text-to-image generation, image editing, and conditional image generation.



Here are some of the technical highlights:



Refer to the official page and paper for more information.


Ferret-UI from Apple


Multimodal large language models (MLLMs) like GPT-4V (the vision-enabled version of GPT-4) have made huge strides recently.


But despite their prowess with natural images, they often stumble when it comes to understanding and interacting with mobile user interface (UI) screens.


The problem is that UI screens are quite different from natural images — they have elongated aspect ratios and contain lots of tiny elements like icons and text, which standard MLLMs aren’t optimized for.


Ferret-UI is tailored specifically for mobile UI understanding. It’s built on top of the original Ferret model, which is known for its strong referring and grounding capabilities in natural images.



Ferret-UI is able to perform referring tasks (e.g., widget classification, icon recognition, OCR) with flexible input formats (point, box, scribble) and grounding tasks (e.g., find widget, find icon, find text, widget listing) on mobile UI screens… Ferret-UI is able to not only discuss visual elements in detailed description and perception conversation, but also propose goal-oriented actions in interaction conversation and deduce the overall function of the screen via function inference.
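To make that task taxonomy concrete, here is a purely illustrative sketch (not Ferret-UI’s actual input format, which is defined in the paper) of how a referring query with a box region differs from a grounding query:

```python
# Illustrative only: Ferret-UI's real prompt and region encoding are defined in the
# paper; these dictionaries just show the shape of the two task families.

# Referring task: the region is given, the model describes or classifies it.
referring_query = {
    "task": "icon_recognition",
    "image": "home_screen.png",
    "region": {"type": "box", "coords": [24, 960, 120, 1056]},  # point / box / scribble
    "question": "What does the element in this region do?",
}

# Grounding task: the target is described in text, the model returns its location.
grounding_query = {
    "task": "find_text",
    "image": "home_screen.png",
    "question": "Where is the 'Sign in' button?",  # expected answer: a bounding box
}
```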


There are a few key innovations behind Ferret-UI:




Here’s why it’s important from a product development perspective:




There are also other potential applications:



I believe exploring this further could be really beneficial for your upcoming mobile projects.


You can find more information in Ferret-UI’s paper.


OmniParser from Microsoft


Coming back to the challenges that GPT-4V faces with user interfaces (UIs) across different platforms — like Windows, macOS, iOS, Android — and across various applications, there are 2 main things to understand:



Without these capabilities, GPT-4V can’t effectively act as a general agent that can perform tasks across different apps and platforms.


Most of the existing solutions rely on additional data like HTML code or view hierarchies, which aren’t always available, especially in non-web environments.


OmniParser from Microsoft is a screen-parsing tool for purely vision-based GUI agents, designed to bridge these gaps.


Here’s how it works:








By combining these three components, OmniParser effectively turns a raw screenshot into a structured, DOM-like representation of the UI, complete with bounding boxes and semantic labels.
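As a rough mental model, here is an illustrative sketch of that parsing step (not OmniParser’s real code or output schema): detector boxes and OCR spans get merged into a structured element list that an LLM agent can reason over.

```python
# Illustrative sketch of the screen-parsing idea: combine detected interactable
# regions and OCR results into a DOM-like element list. Not the real OmniParser API.
import json
from dataclasses import asdict, dataclass


@dataclass
class UIElement:
    element_id: int
    kind: str        # e.g. "icon", "text"
    bbox: tuple      # (x1, y1, x2, y2) in pixels
    caption: str     # semantic label from the icon-captioning model or OCR


def parse_screenshot(detections, ocr_spans):
    """Merge detector output and OCR spans into one structured element list."""
    merged = detections + ocr_spans
    return [UIElement(i, kind, bbox, caption) for i, (kind, bbox, caption) in enumerate(merged)]


elements = parse_screenshot(
    detections=[("icon", (12, 40, 60, 88), "settings gear")],
    ocr_spans=[("text", (80, 44, 300, 84), "Wi-Fi")],
)
print(json.dumps([asdict(e) for e in elements], indent=2))
```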


OmniParser was tested on several benchmarks like ScreenSpot, Mind2Web, and AITW, which cover various platforms and tasks. The results are pretty impressive:




For us, this means we can develop AI agents that are much more adept at interacting with UIs across different platforms without relying on extra data that’s not always accessible.


However, there are still potential challenges:



You can find more info on GitHub, and in the paper and blog post.




Pixtral 12B Base from Mistral


Mistral’s latest release, Pixtral 12B, is something you should definitely know about.


It’s their first-ever multimodal model that’s open-sourced under the Apache 2.0 license, which is awesome because we can adapt and integrate it freely.


It’s trained to handle both images and text simultaneously. They’ve interleaved image and text data during training, so it excels in tasks that require understanding and reasoning over visual content alongside text. Think of tasks like chart and figure interpretation, document question answering, and even converting images to code.



Performance-wise, it’s pretty impressive. On the MMMU reasoning benchmark, it scores 52.5%, outperforming some larger models. Plus, it maintains state-of-the-art performance on text-only benchmarks, so we’re not sacrificing text capabilities for the sake of multimodal prowess.



For starters, like other MLLMs, it could significantly enhance the customer experience across a variety of products.


For example:



Pixtral substantially outperforms all open models around its scale and, in many cases, outperforms closed models such as Claude 3 Haiku. Pixtral even outperforms or matches the performance of much larger models like LLaVa OneVision 72B on multimodal benchmarks. All prompts will be open-sourced.



Instruction following is another area where Pixtral shines.


It outperforms other open-source models like Qwen2-VL 7B and LLaVa-OneVision 7B by a significant margin in both text and multimodal instruction following benchmarks.


This means it’s better at understanding and executing complex instructions, which is crucial for creating a smooth user experience.
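Because the weights are Apache 2.0, you can also self-host it. Here is a minimal sketch with vLLM, following the pattern Mistral documents publicly; the model ID, message format, and parameters should still be double-checked against the official blog post.

```python
# Minimal self-hosting sketch for Pixtral with vLLM; model ID and message
# format follow public docs but should be verified before use.
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }
]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```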



Definitely read the official blog post and Hugging Face model card for more information.


Image and Video Generation Models


Stable Diffusion 3.5 from StabilityAI


No introduction needed for Stable Diffusion models!


Stability AI is releasing several models to cater to different needs:




Here’s why these models matter:






Stable Diffusion 3.5 Medium is set for release on October 29th, and shortly after, they’re launching ControlNets, which will offer advanced control features for a variety of professional use cases.
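Since the Large weights are already on Hugging Face, a minimal diffusers sketch might look like this; the repo ID, dtype, and sampling settings below are assumptions to verify against the model card.

```python
# Minimal text-to-image sketch for Stable Diffusion 3.5 Large with diffusers.
# Repo ID, dtype, and step/guidance settings are assumptions; check the model card.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "a watercolor illustration of a lighthouse at dawn",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("lighthouse.png")
```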


For more information, refer to the official blog post, Hugging Face model cards, and GitHub.


Mochi from GenmoAI


Genmo has released a research preview of Mochi 1, and it looks great!


Mochi 1 is being touted as a new state-of-the-art (SOTA) in open-source video generation. It dramatically improves on two fronts that have been challenging for us: motion quality and prompt adherence.


It generates smooth videos at 30 frames per second for up to 5.4 seconds. The motion dynamics are so realistic that it’s starting to cross the uncanny valley. Think about simulating fluid dynamics, fur, hair, and human actions with high temporal coherence.



The model also shows exceptional alignment with textual prompts. This means when you feed it a description, the output video closely matches what you asked for. This level of control over characters, settings, and actions is something we’ve been wanting for a while.



From a product point of view, this opens up a lot of possibilities:




Here are some limitations to keep in mind:



You can find more information in the official announcement.


Large Language Models


Granite 3.0 from IBM


Granite 3.0 is IBM’s latest generation of large language models, and they’re making a splash by releasing them under an Apache 2.0 license. That means they’re open-source and we can use them freely in our projects. They’ve trained these models on over 12 trillion tokens, covering 12 human languages and 116 programming languages. So, they’re pretty robust in terms of linguistic and coding capabilities.



Granite models support various languages such as English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese.


Users may also fine-tune Granite 3.0 models for languages beyond these 12.
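Because the instruct checkpoints are published on Hugging Face under Apache 2.0, trying one locally is straightforward. Here is a minimal sketch; the repo ID is an assumption, so check IBM’s model cards.

```python
# Minimal chat sketch for a Granite 3.0 instruct model via transformers.
# The repo ID below is an assumption; verify it on the Hugging Face model cards.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.0-8b-instruct"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

chat = [{"role": "user", "content": "Summarize the key risks in this contract clause: ..."}]
inputs = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```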


Granite models are designed to match top performance on general, enterprise, and safety benchmarks.



They can handle a variety of applications that are right up our alley:



Here’s how the Granite lineup breaks down:



Let’s also have a quick look at model hyperparameters:



By the end of the year, they’re planning some major updates:



You can find more information in the paper and blog, and on the Hugging Face model cards and GitHub.


Llama 3.2 1B & 3B from Meta


Meta has just released quantized versions of their Llama models — 3.2 1B and 3B — that are optimized for mobile and edge deployments.


Quantized models are significantly smaller and faster. We’re talking about a 2–4x speedup in inference time, an average 56% reduction in model size, and a 41% reduction in memory usage compared to the original models.


You can now deploy powerful LLMs directly on devices without hogging resources.
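Meta’s official route for these quantized checkpoints is the ExecuTorch / Llama Stack on-device runtime, but to get a feel for 4-bit memory savings on a workstation, a rough bitsandbytes sketch (explicitly not Meta’s SpinQuant or QLoRA pipeline) could look like this:

```python
# Rough sketch of 4-bit quantized inference for Llama 3.2 1B with bitsandbytes.
# This approximates the memory savings but is NOT Meta's SpinQuant/QLoRA +
# ExecuTorch deployment path; the gated repo ID is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed gated repo ID
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=60)[0], skip_special_tokens=True))
```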


There are 2 key techniques used:



Here are some of the benchmarks that Meta shared for QLoRA and SpinQuant:



These are definitely great developments because:



More information is available in the official blog post.


Aya Expanse


Big news from Cohere For AI!


They just released Aya Expanse, a family of top-performing multilingual models that’s raising the bar for language coverage.


Aya Expanse, available in both 8B and 32B parameter sizes, is open-weight and ready for you on Kaggle and Hugging Face.


The 8B model is designed to make advanced multilingual research more accessible, while the 32B model brings next-level capabilities across 23 languages.


The 8B version stands strong, outperforming its peers like Gemma 2 9B and Llama 3.1 8B with win rates ranging from 60.4% to 70.6%.



Aya Expanse isn’t just another model release.


Since the Aya initiative kicked off two years ago, Cohere has collaborated with 3,000+ researchers from 119 countries.


Along the way, they’ve created the Aya dataset collection (over 513 million examples!) and launched Aya-101, a multilingual powerhouse that supports 101 languages.


Cohere’s commitment to multilingual AI is serious, and Aya Expanse is the latest milestone.



Aya Expanse 32B isn’t just another large model — it beats out others like Gemma 2 27B, Mixtral 8x22B, and even the massive Llama 3.1 70B (that’s more than twice its size!) in multilingual tasks.



Here’s a sneak peek at the innovative approaches that power Aya Expanse:




Aya Expanse models are available now on the Cohere API, Kaggle, and Hugging Face, and you can already run them with Ollama!
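If you prefer transformers over Ollama, a minimal multilingual chat sketch for the 8B model might look like this; the repo ID is an assumption, so grab the exact one from Hugging Face.

```python
# Minimal multilingual chat sketch for Aya Expanse 8B via transformers.
# The repo ID is an assumption; check the Hugging Face model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-expanse-8b"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Écris un haïku sur la pluie."}]  # French prompt
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=80, do_sample=True, temperature=0.3)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```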



You can find more information in the official announcement.


That’s it, folks — I hope you enjoyed the read and that you’ll start experimenting with some of these models in your applications.


And don’t forget to have a look at some practitioner resources that we published recently: