
Over the past few years, we’ve seen the meteoric rise of large language models (LLMs). Now spanning billions of parameters, they have become powerful tools for tasks like analyzing, summarizing and generating text and images, or powering human-sounding chatbots.
Of course, all that power comes with significant limitations, especially for users who don’t have deep pockets or the hardware to meet these LLMs’ considerable computational demands. So it’s no wonder we’re witnessing the emergence of small language models (SLMs), which cater specifically to more resource-constrained users.
Now, with growing interest in multimodal AI systems that can process different types of data (images, text, audio and video) simultaneously, smaller versions of these versatile tools are emerging as well. In the rest of this article, we’ll cover five small multimodal AI models that have been getting a lot of attention lately.
1. TinyGPT-V
This powerful yet resource-efficient 2.8-billion-parameter multimodal model processes both text and image inputs, and maintains an impressive level of performance while using significantly fewer resources than its larger cousins.
TinyGPT-V’s scaled-down architecture features optimized transformer layers that strike a balance between size, performance and efficiency, along with a specialized mechanism that processes image inputs and integrates them with the text stream. It is built on the relatively small Phi-2 LLM, combined with pre-trained vision modules from BLIP-2 or CLIP.
It can be fine-tuned with smaller datasets, making it a good option for small- and medium-sized companies, or for those looking to locally deploy it in educational or research contexts (where funding and resources might be more limited).
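Conceptually, the recipe looks something like the sketch below: a frozen, pre-trained vision encoder feeds image features through a small trainable projector into a compact LLM. The model IDs and the single linear projector are illustrative assumptions, not TinyGPT-V’s actual code (which builds on the BLIP-2/CLIP vision modules mentioned above).

```python
# Illustrative sketch of the small-multimodal recipe TinyGPT-V follows:
# a frozen vision encoder, a small trainable projector and a compact LLM (Phi-2).
# Not the project's real code; model IDs and the linear projector are assumptions.
import torch
import torch.nn as nn
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPImageProcessor, CLIPVisionModel)

llm_id = "microsoft/phi-2"                   # compact LLM
vision_id = "openai/clip-vit-large-patch14"  # pre-trained vision encoder

tokenizer = AutoTokenizer.from_pretrained(llm_id)
llm = AutoModelForCausalLM.from_pretrained(llm_id, torch_dtype=torch.float16)
vision = CLIPVisionModel.from_pretrained(vision_id, torch_dtype=torch.float16)
image_processor = CLIPImageProcessor.from_pretrained(vision_id)

# Freeze the big pre-trained pieces; only the small projector gets trained.
for module in (llm, vision):
    for p in module.parameters():
        p.requires_grad = False

# Trainable connector mapping image patch features into the LLM's embedding space.
projector = nn.Linear(vision.config.hidden_size, llm.config.hidden_size,
                      dtype=torch.float16)

def encode_image(pil_image):
    """Turn an image into a sequence of pseudo-token embeddings for the LLM."""
    pixels = image_processor(images=pil_image, return_tensors="pt")["pixel_values"]
    features = vision(pixel_values=pixels.to(vision.dtype)).last_hidden_state
    return projector(features)  # shape: (1, num_patches, llm_hidden_size)
```

At inference time, these image embeddings are concatenated with the text token embeddings before generation; because only the connector (and, during fine-tuning, a small subset of the LLM) is updated, training and hardware costs stay low.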
2. TinyLLaVA
This novel framework combines vision encoders like CLIP-Large and SigLIP with a small-scale LLM decoder, an intermediary connector, and customized training pipelines, all in order to achieve strong performance while keeping computational use to a minimum.
TinyLLaVA is trained on two different datasets, LLaVA-1.5 and ShareGPT4V, and its supervised fine-tuning stage allows partial adjustment of the learnable parameters of both the LLM and the vision encoder (see the sketch below).
In benchmark tests, TinyLLaVA’s best-performing variant, TinyLLaVA-share-Sig-Phi at 3.1B parameters, outperforms 7B models like LLaVA-1.5 and Qwen-VL. The framework also offers a holistic analysis of how model selections, training recipes and data contribute to the performance of small-scale multimodal models. It’s a great example of how leveraging small-scale LLMs can provide significant gains in accessibility and efficiency without sacrificing performance.
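To make “partially learnable parameters” concrete, here is a minimal sketch of the general technique: freeze every parameter, then selectively re-enable training for the ones the supervised fine-tuning stage should update. The parameter-name substrings are hypothetical placeholders and will differ from TinyLLaVA’s actual module names.

```python
# Minimal sketch of partial fine-tuning: freeze the whole model, then re-enable
# gradients only for selected parameters (e.g., the connector and top LLM layers).
# The substrings below are hypothetical; real module names depend on the model.
from torch import nn

def partially_unfreeze(model: nn.Module,
                       trainable_substrings=("connector", "layers.30", "layers.31")):
    """Mark only parameters whose names contain one of the substrings as trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} of {total:,} parameters")
    return model
```

When handing the model to a training loop, the optimizer should only receive the parameters that still require gradients; everything else stays fixed, which is what keeps fine-tuning feasible on modest hardware.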
3. GPT-4o mini
Released as a smaller and cheaper version of OpenAI’s GPT-4o multimodal model, GPT-4o mini costs approximately 60 percent less to run than GPT-3.5 Turbo, previously the most affordable model in OpenAI’s lineup.
GPT-4o mini is derived from the larger GPT-4o via a distillation process, resulting in an excellent balance between performance and cost-efficiency. It features a large 128K-token context window and multimodal capabilities to process both text and images, with support for video and audio planned. It also features enhanced safeguards against jailbreaks, system prompt extraction, and prompt injection.
Use cases for GPT-4o mini might include rapid prototyping for new chatbots, on-device apps for language learning or personal assistants, interactive games, as well as applications in educational settings.
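Because GPT-4o mini is accessed through OpenAI’s standard API, trying it with a mixed text-and-image prompt takes only a few lines of Python. The image URL below is a placeholder, and the exact parameters are worth double-checking against OpenAI’s current documentation.

```python
# Minimal sketch: sending a text + image prompt to GPT-4o mini with the OpenAI SDK.
# Assumes the OPENAI_API_KEY environment variable is set; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Summarize what this chart shows in two sentences."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
    max_tokens=200,
)

print(response.choices[0].message.content)
```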
4. Phi-3 Vision
This powerful vision-language variant of Microsoft’s Phi-3 is a transformer-based model that contains an image encoder, connector, projector, and the Phi-3 Mini language model. At 4.2 billion parameters, Phi-3 Vision supports a context length of up to 128K tokens and “extensive multimodal reasoning” that permits it to understand and generate content based on charts, graphs and tables.
With performance that rivals that of larger models like OpenAI’s GPT-4V, Phi-3 Vision could be well-suited to resource-constrained environments and latency-bound scenarios, offering advantages for offline operation, cost, and user privacy.
Potential use cases include document and image analysis to improve customer support, social media content moderation, and video analysis for companies or educational institutions.
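Since the weights are openly available on Hugging Face, Phi-3 Vision can be run locally with the Transformers library. The sketch below follows the general pattern from the microsoft/Phi-3-vision-128k-instruct model card; the image file and prompt are placeholders, and details like the image tag and processor arguments should be verified against the current model card.

```python
# Rough sketch of local inference with Phi-3 Vision via Hugging Face Transformers,
# following the general pattern from the model card; verify details against it.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
    _attn_implementation="eager",  # use "flash_attention_2" if it's installed
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("quarterly_report_chart.png")  # placeholder local file
messages = [{"role": "user",
             "content": "<|image_1|>\nWhat trend does this chart show?"}]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=200)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```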
5. Mississippi 2B and Mississippi 0.8B
Recently released by H2O.ai, these two multimodal foundation models are designed specifically for OCR and Document AI use cases. Compact yet capable, these vision-language models offer businesses a scalable, cost-effective way to perform document analysis and image recognition in real time.
The models feature multi-stage training with fine-tuning of layers and minimal latency, making them a good fit for healthcare, banking, insurance and finance, where large volumes of documents need to be processed.
Both H2OVL Mississippi 2B and H2OVL Mississippi 0.8B are currently freely available on Hugging Face, making them an accessible option for developers, researchers, and enterprises to fine-tune and modify.
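Pulling the weights down for local experimentation follows the usual Hugging Face pattern, roughly as sketched below. The repository ID is assumed, and the models ship their own custom inference code, so check H2O.ai’s Hugging Face page and the model card for the exact names and the documented chat/OCR interface.

```python
# Hedged sketch of loading an H2OVL Mississippi checkpoint from Hugging Face.
# The repo ID is assumed; confirm the exact name and the model card's documented
# chat/OCR interface before relying on this.
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "h2oai/h2ovl-mississippi-2b"  # assumed ID; the 0.8B variant has its own repo

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # the checkpoint ships custom modeling code
    low_cpu_mem_usage=True,
).eval()

# From here, follow the model card's documented helpers for image pre-processing
# and generation rather than assuming a standard pipeline exists.
```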
Conclusion
Accessibility and cost-efficiency remain major issues with multimodal models, and with large language models in general. But as an increasing number of relatively lightweight yet powerful multimodal AI options become available, many more institutions and smaller businesses will be able to adopt AI into their workflows.