
The recent emergence of multimodal AI means that AI systems are becoming increasingly multipurpose, simultaneously processing and generating a variety of data modalities, including text, images, audio and video, in an integrated fashion.
One of the more versatile subsets of multimodal AI is the vision language model (VLM), which combines natural language processing (NLP) and computer vision (CV) capabilities to tackle advanced vision-language tasks — such as image captioning, visual question answering, and text-to-image search and generation.
Architecture of Vision Language Models
Vision language models can process both text- and image-based inputs: the computer vision portion of the model analyzes and interprets visual data, while the natural language processing portion analyzes and understands text. In a way, you can think of VLMs as versatile large language models (LLMs) that understand both words and images.
Generally speaking, VLMs consist of these main components (a minimal code sketch of the pipeline appears after the diagram below):
- Vision encoder: This part extracts visual cues like shapes, patterns and colors from visual inputs and converts them into vector embeddings (numerical representations of data points within a high-dimensional space) that the AI model can process. In the past, VLMs used convolutional neural networks to extract features from images; today, many VLMs use a vision transformer (ViT), which divides an image into “patches” of a fixed size and then processes them as tokens, much like how a transformer-based language model parses the words in a sentence.
- Language encoder: This component evaluates the semantic meaning and contextual associations between words and transforms that information into text embeddings.
- Projector/fusion mechanism: This vital element aligns the feature embeddings from the vision and language encoders into a shared multimodal space.
- Multimodal transformer: Operating over the combined vision and language embeddings, this integrated component typically uses self-attention within each modality to weigh the contextual importance of tokens in a sequence, cross-attention between modalities to learn the relationships between image regions and words, and positional encoding to preserve the order of image patches and text tokens.
- Task-specific heads: These adapt the final outputs for whatever specific tasks the model was designed to perform. Some examples of task-specific heads include classification heads, generation heads and question answering heads.

Diagram of a common VLM architecture (via NVIDIA).
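To make these pieces more concrete, here is a minimal, illustrative sketch of how they fit together, written in PyTorch. It is not the architecture of any specific VLM: the class names, dimensions and the simple concatenation-style fusion are assumptions for illustration, and details such as positional encodings, causal masking and pretrained weights are omitted.

```python
# A minimal, illustrative VLM forward pass (not any specific model).
# A ViT-style encoder turns image patches into embeddings, a linear
# projector maps them into the language model's embedding space, and
# the fused sequence is fed to a transformer with a generation head.
# Positional encodings and causal masking are omitted for brevity.
import torch
import torch.nn as nn


class ToyVisionEncoder(nn.Module):
    """Splits an image into fixed-size patches and embeds each one."""

    def __init__(self, patch_size=16, vision_dim=256):
        super().__init__()
        # Conv2d with stride == kernel_size is the standard patchifying trick.
        self.patch_embed = nn.Conv2d(3, vision_dim, kernel_size=patch_size, stride=patch_size)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, images):                       # (B, 3, H, W)
        patches = self.patch_embed(images)           # (B, D, H/16, W/16)
        tokens = patches.flatten(2).transpose(1, 2)  # (B, num_patches, D)
        return self.blocks(tokens)


class ToyVLM(nn.Module):
    """Vision encoder + projector + language model over the fused sequence."""

    def __init__(self, vocab_size=32000, vision_dim=256, text_dim=512):
        super().__init__()
        self.vision_encoder = ToyVisionEncoder(vision_dim=vision_dim)
        # Projector: aligns visual features with the text embedding space.
        self.projector = nn.Linear(vision_dim, text_dim)
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(text_dim, vocab_size)  # task-specific generation head

    def forward(self, images, input_ids):
        image_tokens = self.projector(self.vision_encoder(images))  # (B, P, text_dim)
        text_tokens = self.text_embed(input_ids)                    # (B, T, text_dim)
        fused = torch.cat([image_tokens, text_tokens], dim=1)       # image tokens prefix the text
        hidden = self.language_model(fused)
        return self.lm_head(hidden[:, image_tokens.size(1):])       # logits over text positions


model = ToyVLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 32000])
```

In many real VLMs, the vision encoder and the language model are large pretrained networks, and the projector is the main piece trained from scratch to connect them.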
Learning Techniques for Training VLMs
The strategies for training VLMs often involve a mix of techniques that help to align and fuse data from both the vision and language components.
- Contrastive learning: This approach trains the model to differentiate between similar and dissimilar pairs of data points by mapping image and text embeddings into a shared embedding space. As the model trains on datasets composed of paired images and text, it generates similarity scores, then learns to minimize the distance between matching embedding pairs while maximizing the distance between those that don’t match (a toy sketch of this loss follows the list). One example of a contrastive model is CLIP (Contrastive Language-Image Pretraining), which uses a three-step process to perform zero-shot predictions.
- PrefixLM: This is an NLP technique for pretraining language models, where part of the text (a prefix) is used as input and the model learns to predict the next part of the sequence. With VLMs, PrefixLM is often used in conjunction with the SimVLM (Simple Visual Language Model) architecture, where a vision transformer supplies image patches that, together with the text prefix, condition the model’s prediction of the rest of the text, giving it zero-shot learning capabilities.
- Frozen PrefixLM: This training technique builds on PrefixLM, but the language model’s parameters are frozen during training and only the vision components are updated, resulting in a more computationally efficient training process.
- Masked modeling: With this approach, parts of a text- or image-based input are randomly obscured. A VLM will then learn to predict and “fill in” the missing parts of the masked input, either by using masked language modeling for generating missing textual information when given an unmasked image, or by using masked image modeling to reconstruct the missing pixels of an image when given an unmasked text caption. FLAVA (Foundational Language And Vision Alignment) is one example of a model that employs this masking technique, along with contrastive learning.
- Generative model training: This method trains the VLM to produce new outputs, depending on the text and image inputs given. This could mean generating images based on the text inputs (text-to-image), or text captions or summaries related to an image (image-to-text). Examples of generative, text-to-image diffusion-based VLMs include Midjourney and Stable Diffusion.
- Pretrained models: To reduce the cost and time of training a VLM from scratch, it’s also possible to build one from a pretrained LLM and a pretrained vision encoder, adding extra mapping network layers to align the image and text representations. Knowledge distillation is one technique for transferring knowledge from a pretrained “teacher” model to a simpler, more lightweight “student” model. Alternatively, an existing VLM can be adapted and fine-tuned for a specific application with tools like Hugging Face’s Transformers library and the SFTTrainer class from its TRL library (a zero-shot example using a pretrained CLIP checkpoint follows the list).
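To make the contrastive objective more concrete, here is a toy, CLIP-style symmetric contrastive loss in PyTorch. This is a minimal sketch rather than CLIP’s actual training code: the function name, temperature value, batch size and embedding dimension are illustrative, and it assumes you already have image and text embeddings for a batch of matching pairs (the i-th image corresponds to the i-th caption).

```python
# Toy sketch of a CLIP-style symmetric contrastive loss. Assumes batched
# image and text embeddings for matching pairs: row i of each tensor
# corresponds to the same image-caption pair.
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine-similarity matrix: entry (i, j) scores image i against caption j.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal, so the "correct class" for row i is i.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric cross-entropy: pull matching pairs together (diagonal),
    # push mismatched pairs apart (off-diagonal), in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```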
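And as a sketch of the pretrained-models route, the snippet below runs zero-shot image classification with an off-the-shelf CLIP checkpoint through Hugging Face’s Transformers library. The file name photo.jpg and the candidate captions are placeholders; the example assumes the transformers and Pillow packages are installed and that the openai/clip-vit-base-patch32 checkpoint can be downloaded.

```python
# Zero-shot prediction with a pretrained CLIP model via Transformers.
# "photo.jpg" and the candidate captions are placeholders for illustration.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarity over the candidate captions

for label, prob in zip(labels, probs[0].tolist()):
    print(f"{label}: {prob:.3f}")
```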
How Vision Language Models Can Be Used
Vision language models can be used in a wide range of applications that require synthesizing visual and textual information, including:
- Image generation.
- Image captioning and summarization (see the example after this list).
- Image segmentation.
- Image retrieval.
- Object detection.
- Video understanding.
- Visual question answering (VQA).
- Text extraction for intelligent document understanding.
- Online content moderation and safety.
- Powering interactive systems, such as for education and healthcare.
- Telemedicine, automated diagnostic tools, and virtual health assistants.
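As a quick taste of one of these applications, image captioning, the sketch below uses the Transformers image-to-text pipeline with a pretrained BLIP checkpoint. The local file name is a placeholder, and the example assumes the transformers package is installed and the Salesforce/blip-image-captioning-base checkpoint is available to download.

```python
# Sketch: caption a local image with a pretrained BLIP model via the
# Transformers "image-to-text" pipeline. "photo.jpg" is a placeholder path.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
result = captioner("photo.jpg")  # also accepts a PIL image or a URL
print(result[0]["generated_text"])
```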
Conclusion
Vision language models are but one subtype of the growing number of versatile and powerful multimodal AI models now emerging. But as with developing and deploying any AI model, there are challenges around potential bias, cost, complexity and hallucinations. In an upcoming post, we’ll cover some of the datasets used to train VLMs, the benchmarks used to evaluate them, and some well-known VLMs and what they can do.