Guide to Vision-Language Models (VLMs)

This guide covers the major open-weight and OpenAI vision-language models as of early 2026, including their sizes, computational requirements, and what each version is specialized for.


Open-Weight Models

LLaVA (Large Language-and-Vision Assistant)

Developer: University of Wisconsin–Madison / Microsoft Research
License: Apache 2.0 (LLaVA 1.6); earlier versions restricted by LLaMA license
Architecture: CLIP ViT-L/14 vision encoder + MLP projector + LLM backbone

VersionLLM BackboneParametersOllama SizeMin VRAMContext
LLaVA 1.5 7BVicuna 7B (LLaMA)~7.3B total4.7 GB~6 GB32K
LLaVA 1.5 13BVicuna 13B~13.3B total8.0 GB~10 GB4K
LLaVA-NeXT 7BMistral 7B~7.3B total~5 GB~6 GB32K
LLaVA-NeXT 13BVicuna 13B~13.3B total~8 GB~10 GB4K
LLaVA-NeXT 34BYi-34B~34B total20 GB~24 GB4K

Version specializations:

Best for: Education and experimentation. Simple architecture makes internals easy to understand. Runs well on student laptops via Ollama. Primarily single-image conversations.


Llama 3.2 Vision (Meta)

Developer: Meta
License: Llama 3.2 Community License (commercial use permitted; EU restrictions on multimodal models)
Architecture: Llama 3.1 text model (frozen) + vision adapter with cross-attention layers

VersionText BackboneParametersMin VRAMContext
Llama 3.2 11B VisionLlama 3.1 8B11B8 GB128K
Llama 3.2 90B VisionLlama 3.1 70B90B64 GB128K

Both come in base and instruct variants.

Key characteristics:

Best for: Drop-in replacement for Llama 3.1 text models when you need vision capabilities. Strong OCR, document understanding, chart interpretation. Available through Ollama.


Qwen 2.5 VL (Alibaba)

Developer: Alibaba Cloud / Qwen Team
License: Qwen License (permissive, allows commercial use)
Architecture: ViT with window attention + Qwen 2.5 LLM

VersionParametersMin VRAM (FP16)Min VRAM (4-bit)Context
Qwen 2.5 VL 3B3B~8 GB~4 GB32K
Qwen 2.5 VL 7B7B~17 GB~6 GB32K
Qwen 2.5 VL 72B72B~144 GB~36 GB32K (YaRN to 128K)

Key characteristics:

Best for: Applications requiring video understanding, multilingual OCR, or object localization. The most feature-rich open VLM. The 7B variant offers an excellent quality-to-cost ratio.


Gemma 3 (Google)

Developer: Google DeepMind
License: Gemma License (permissive, commercial use permitted)
Architecture: Decoder-only transformer + frozen SigLIP vision encoder (~400M params)

VersionParametersVision SupportMin VRAM (FP16)Min VRAM (QAT/4-bit)Context
Gemma 3 270M270MNo<1 GB<1 GB32K
Gemma 3 1B1BNo~2 GB~1 GB32K
Gemma 3 4B4BYes~8 GB~3 GB128K
Gemma 3 12B12BYes~24 GB~8 GB128K
Gemma 3 27B27BYes~54 GB~18 GB128K

Key characteristics:

Best for: Long-context multimodal tasks (128K tokens ≈ 500 images or 8+ minutes of video at 1fps). Best open model for running on consumer hardware when using QAT variants. Strong multilingual capabilities.


Phi-4 Multimodal (Microsoft)

Developer: Microsoft
License: MIT
Architecture: Unified transformer with Mixture-of-LoRAs for text, vision, and speech

VersionParametersMin VRAMContext
Phi-4-mini3.8B (text only)~4 GB128K
Phi-4-multimodal5.6B (text + vision + audio)~8 GB128K
Phi-414B (text only)~10 GB (FP16)16K

Key characteristics:

Best for: Edge/mobile deployment where you need vision + audio in a tiny package. Excellent for educational settings due to low hardware requirements. Best bang-for-parameter on STEM visual reasoning.


DeepSeek-VL2 (DeepSeek)

Developer: DeepSeek AI
License: DeepSeek License (permissive)
Architecture: Mixture-of-Experts (MoE) with vision encoder

VersionTotal ParamsActive ParamsMin VRAM
DeepSeek-VL2-Tiny3.4B1.0B~4 GB
DeepSeek-VL2-Small16.1B2.8B~8 GB
DeepSeek-VL227.5B4.5B~12 GB

Key characteristics:

Best for: Cost-efficient inference where you need more capability than parameter count suggests. Good for scientific/technical image analysis.


Other Notable Open Models

InternVL 2.5 (Shanghai AI Lab): Available from 1B to 78B. Very strong benchmark performance, especially at larger sizes. Uses InternViT vision encoder.

MiniCPM-o 2.6 (OpenBMB): 8B parameters. Uniquely supports images, video, and audio input. Real-time speech conversation capability. Good all-rounder for multimodal tasks.

Janus-Pro (DeepSeek): Available in 1.5B and 7B. Can both understand and generate images (bidirectional), unlike most VLMs which only understand images.


OpenAI Models

OpenAI's models are proprietary and accessed only via API (or ChatGPT). Parameter counts are not publicly disclosed. All support vision (image input).

GPT-4o

Released: May 2024
Context Window: 128K tokens

VariantBest ForRelative CostRelative Speed
GPT-4oGeneral-purpose multimodal (text + vision + audio)MediumFast
GPT-4o-miniBudget-friendly general useLowVery fast

Key characteristics:

Best for: General-purpose multimodal applications. Real-time voice + vision interactions. Accessible default choice for most users.


GPT-4.1

Released: April 2025
Context Window: 1 million tokens

VariantBest ForRelative CostRelative Speed
GPT-4.1Coding, long-context, agentic tasks, visionHighMedium
GPT-4.1 miniSame strengths, lower costLowFast
GPT-4.1 nanoClassification, autocomplete, fast tasksVery lowVery fast

Key characteristics:

Best for: Developer/agentic applications requiring precise instruction following, long documents with images, code generation from visual specs. The mini variant is the practical sweet spot for most API use cases.


GPT-5

Released: August 2025
Context Window: 400K tokens

Key characteristics:

Best for: Highest-capability tasks where you want OpenAI to automatically select the right model for the job. Premium pricing.


Quick Comparison for Classroom Use

For an educational setting with limited budgets and student laptops, here are practical recommendations:

GoalRecommended ModelWhy
Simplest local setupLLaVA 7B via Ollama4.7 GB download, runs on most laptops, simple API
Best local qualityGemma 3 12B QAT via OllamaGood quality at ~8 GB VRAM with quantization
Smallest capable modelPhi-4-multimodal (5.6B)Vision + audio in one tiny model
Best free cloud APIGoogle Gemini Flash (free tier)No local hardware needed, fast
Understanding internalsLLaVA via HuggingFace TransformersComponents (CLIP, projector, LLM) are accessible individually
Multi-image / videoQwen 2.5 VL 7BNative video support, dynamic resolution
Comparing open vs. proprietaryGPT-4.1 mini via APILow cost, strong benchmarks, good baseline for comparison

Computational Requirements Summary

Rule of thumb for VRAM estimation:

Minimum hardware tiers:

HardwareCan Run
8 GB VRAM (e.g., laptop GPU, M1/M2 Mac)LLaVA 7B, Llama 3.2 11B (quantized), Phi-4-multimodal
16 GB VRAM (e.g., RTX 4060 Ti, M2 Pro Mac)Most 7B models at FP16, 12–13B quantized
24 GB VRAM (e.g., RTX 4090)7B at FP16 comfortably, Gemma 3 27B with QAT
48+ GB VRAM (e.g., A6000, dual GPU)34B models, larger Qwen variants
80+ GB VRAM (e.g., A100)70B+ models
Multi-GPU (e.g., 4–8× A100)Qwen 72B, Llama 90B

Note: Apple Silicon Macs (M1/M2/M3/M4) with unified memory can run larger models than their VRAM equivalent would suggest on discrete GPUs, since they share system RAM. An M2 Max with 32 GB can comfortably run 13B models.