Reference Guide - December 2024/January 2025
This document provides a comprehensive overview of Large Language Models (LLMs) and Vision-Language Models (VLMs) available on Hugging Face, along with the fully open OLMo models from the Allen Institute for AI (AI2) and NVIDIA's hybrid Mamba-Transformer Nemotron family.
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Llama 3.2 1B | 1B | ~2 GB (BF16) ~0.5-1 GB (4-bit) | BF16 / 4-bit (QAT+LoRA, SpinQuant) | Edge/mobile devices, on-device summarization, instruction following, tool calling |
| Llama 3.2 3B | 3B | ~6 GB (BF16) ~1.5 GB (4-bit) | BF16 / 4-bit (QAT+LoRA, SpinQuant) | Mobile/edge AI, multilingual dialog, agentic retrieval, local summarization |
| OLMo 2 1B | 1B | ~2 GB | BF16 | Small-scale research, educational use, efficient inference on constrained hardware |
| Qwen 2.5 0.5B | 0.5B | ~1 GB | FP16/BF16 | Ultra-lightweight applications, IoT devices, minimal compute requirements |
| Qwen 2.5 1.5B | 1.5B | ~3 GB | FP16/BF16 | Efficient text generation, lightweight chat applications |
| Qwen 2.5 3B | 3B | ~6 GB | FP16/BF16 | Balanced performance for resource-constrained environments |
| SmolLM2 135M/360M/1.7B | 135M-1.7B | 0.3-3.5 GB | FP16 | Edge deployment, educational purposes, lightweight NLP tasks |
| Gemma 2B | 2B | ~4 GB | FP16/BF16 | Lightweight chat, instruction following, research |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Nemotron-H-8B-Base | 8B | ~16 GB (BF16) ~4 GB (4-bit) | BF16 / 4-bit | Hybrid Mamba-Transformer, research baseline, 3× faster inference than pure Transformers |
| Nemotron-Nano-9B-v2 | 9B (from 12B) | ~18 GB (BF16) ~4.5 GB (4-bit) | BF16 / 4-bit | Hybrid reasoning model, 6× throughput vs Transformers, 128K context, configurable reasoning mode |
| Qwen 2.5 7B | 7B | ~14 GB | FP16/BF16 | Advanced reasoning, multilingual support, coding tasks |
| OLMo 2 7B | 7B | ~14 GB | BF16 | Fully open research, reproducible AI development, academic use |
| OLMo 3 Base 7B | 7B | ~14 GB | BF16 | State-of-the-art open base model, reasoning, tool use, 65K context |
| OLMo 3 Instruct 7B | 7B | ~14 GB | BF16 | Instruction following, multi-turn dialog, tool use, agentic workflows |
| OLMo 3 Think 7B | 7B | ~14 GB | BF16 | Reasoning model with explicit chain-of-thought, math, code reasoning |
| Llama 3.1 8B | 8B | ~16 GB (FP16) ~4 GB (4-bit) | FP16 / 4-bit PTQ | General-purpose chat, code generation, multilingual tasks |
| Mistral 7B | 7B | ~14 GB | FP16/BF16 | Efficient inference, strong reasoning, open commercial use |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Qwen 2.5 14B | 14B | ~28 GB | FP16/BF16 | Enhanced reasoning capabilities, complex task handling |
| OLMo 2 13B | 13B | ~26 GB | BF16 | Enhanced fully open model, outperforms larger models with fewer FLOPs |
| Nemotron-Nano-12B-v2-Base | 12B | ~24 GB (BF16) | BF16 (FP8 trained) | Hybrid base model for fine-tuning, efficient reasoning foundation |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Nemotron-3-Nano-30B-A3B | 30B (3B active) | ~60 GB total ~6 GB active (MoE) ~15 GB (FP8) | BF16 / FP8 | Hybrid Mamba-Transformer-MoE, 1M context, 3× throughput vs Transformers, agentic reasoning |
| OLMo 2 32B | 32B | ~64 GB | BF16 | Most capable fully open model, outperforms GPT-3.5-Turbo and GPT-4o-mini |
| OLMo 3 Base 32B | 32B | ~64 GB | BF16 | Flagship fully open base model, 65K context, trained on 6T tokens |
| OLMo 3 Instruct 32B | 32B | ~64 GB | BF16 | Competitive with Qwen 3, Gemma 3, Llama 3.1 at similar sizes |
| OLMo 3 Think 32B | 32B | ~64 GB | BF16 | Strongest fully open reasoning model, explicit thinking steps |
| Qwen 2.5 32B | 32B | ~64 GB | FP16/BF16 | High-performance reasoning, coding, multilingual capabilities |
| Nemotron-H-47B-Base | 47B (from 56B) | ~94 GB (BF16) ~24 GB (4-bit) | BF16 / FP4 (1M ctx) | Hybrid compressed model, 20% faster than 56B, 1M context in FP4 on RTX 5090 |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Nemotron-H-56B-Base | 56B | ~112 GB (BF16) ~28 GB (4-bit) | BF16 (FP8 trained) | Hybrid flagship, matches Llama-3.1-70B/Qwen-2.5-72B with 3× faster inference, FP8 pre-training |
| Llama 3.3 70B | 70B | ~140 GB (FP16) ~35 GB (4-bit) | FP16 / 4-bit | High-quality general assistants, complex reasoning, 128K context |
| Qwen 2.5 72B | 72B | ~144 GB | FP16/BF16 | Flagship text model, SOTA performance on many benchmarks |
| DeepSeek V3 | 671B (37B active) | ~671 GB (FP8 total) ~37 GB active | FP8 (MoE) | Mixture-of-experts, efficient large-scale inference |
| Llama 3.1 405B | 405B | ~810 GB (FP16) | FP16 / Quantized | Largest open-weight dense model, frontier capabilities, requires multi-GPU |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| SmolVLM 2B | 2B | ~4 GB | FP16 | Edge devices, browser deployment, efficient multimodal on mobile |
| Qwen2-VL 2B | 2B | ~4 GB | FP16/BF16 | Lightweight vision-language, mobile deployment, image understanding |
| Qwen2.5-VL 3B | 3B | ~6 GB | FP16/BF16 | Edge AI, image/video understanding, outperforms the earlier Qwen2-VL 7B |
| PaliGemma 3B | 3B | ~6 GB | FP16 | Efficient vision-language tasks, image captioning |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Qwen2-VL 7B | 7B | ~14 GB | FP16/BF16 | Video understanding (20+ min), image analysis, multimodal reasoning |
| Qwen2.5-VL 7B | 7B | ~14 GB (FP16) ~3.5 GB (AWQ) | FP16/BF16 / 4-bit AWQ | Enhanced visual understanding, agentic capabilities, 1hr+ video, temporal localization |
| Llama 3.2 11B Vision | 11B | ~22 GB | FP16 | Image understanding, vision Q&A, multimodal dialog |
| Molmo 7B | 7B | ~14 GB | FP16/BF16 | Pointing/tagging objects, visual understanding, open by AI2 |
| Molmo 2 8B | 8B | ~16 GB | FP16/BF16 | Video grounding, Q&A, improved over Molmo 72B on image tasks |
| LLaVA 1.5 7B | 7B | ~14 GB | FP16 | Visual question answering, image captioning, one of the first widely adopted open VLMs |
| Pixtral 12B | 12B | ~24 GB | FP16/BF16 | Multimodal understanding by Mistral AI |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Qwen2.5-VL 32B | 32B | ~64 GB | FP16/BF16 | Advanced vision-language, human preference alignment |
| Qwen2-VL 72B | 72B | ~144 GB (FP16) ~36 GB (4-bit) | FP16/BF16 / 4-bit (AWQ, GPTQ) | SOTA vision-language, complex visual reasoning, long video analysis |
| Qwen2.5-VL 72B | 72B | ~144 GB (FP16) ~36 GB (AWQ) | FP16/BF16 / 4-bit AWQ | Flagship VLM, competitive with GPT-4V/Claude 3.5 Sonnet, hour-long videos, computer/phone use |
| Llama 3.2 90B Vision | 90B | ~180 GB | FP16 | Advanced vision understanding, exceeds Claude 3 Haiku on image tasks |
| Molmo 72B | 72B | ~144 GB | FP16/BF16 | High-performance visual understanding, pointing, object recognition |
| Qwen3-VL 235B | 235B (22B active) | ~470 GB (BF16 total) ~44 GB active | BF16 (MoE) | Massive-scale multimodal MoE, visual coding, agent capabilities |
FP16 (Float16): 16-bit floating point - standard precision (2 bytes per parameter)
BF16 (BFloat16): Brain Float 16 - optimized for neural networks, same range as FP32 (2 bytes per parameter)
FP32: 32-bit floating point - full precision (4 bytes per parameter)
4-bit (QAT+LoRA): Quantization-Aware Training with Low-Rank Adaptation
Simulates quantization effects during training
Uses LoRA adapters to maintain quality
Achieves ~75% size reduction
4-bit (SpinQuant): Advanced post-training quantization
Optimized rotation-based quantization
Can be applied to fine-tuned models
Balances accuracy and performance
4-bit (AWQ): Activation-aware Weight Quantization
Protects salient weights based on activation patterns
Better accuracy retention than naive quantization
4-bit (GPTQ): Accurate post-training quantization for generative pre-trained Transformers
Layer-wise quantization with error compensation
Popular for large model compression
8-bit: 8-bit integer quantization (1 byte per parameter)
~50% size reduction
Minimal accuracy loss
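As an illustration of how the 4-bit and 8-bit formats above are typically applied at load time, the following minimal sketch uses Hugging Face Transformers with bitsandbytes NF4 quantization; the model ID is just an example, and any causal LM from the tables above can be substituted.

```python
# Minimal sketch: loading a model in 4-bit (NF4) with Transformers + bitsandbytes.
# Assumes `pip install transformers accelerate bitsandbytes` and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # example; substitute any causal LM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # ~0.5 bytes per parameter
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16 for accuracy
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # place layers on available devices
)

inputs = tokenizer("Summarize: hybrid Mamba-Transformer models ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For 8-bit loading, `load_in_8bit=True` replaces the 4-bit settings; checkpoints already quantized with AWQ or GPTQ load directly through `from_pretrained` when the matching backend library is installed.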
MoE (Mixture of Experts): Only a subset of model parameters is active for each token
Much smaller memory footprint during inference
Example: DeepSeek V3 has 671B total parameters but only 37B active
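The sparse activation described above can be illustrated with a toy top-k router. This is a simplified sketch of the general MoE pattern, with made-up sizes and names; it is not the routing used by any specific model, and real MoE layers add load balancing, capacity limits, and fused kernels.

```python
# Toy Mixture-of-Experts layer: route each token to the top-k of N expert MLPs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # one score per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                     # x: (tokens, d_model)
        scores = self.router(x)               # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):        # only the chosen experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

x = torch.randn(10, 256)
print(ToyMoE()(x).shape)  # torch.Size([10, 256])
```

Total parameter count grows with the number of experts, but per-token compute and activation memory scale only with the top-k experts actually selected, which is why a 671B-total model can run with a 37B-active footprint.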
Hybrid Mamba-Transformer: Combines Mamba-2 (State Space Model) layers with Transformer attention
Mamba-2 layers: Linear time complexity O(n), constant memory per token
Transformer attention layers: Strategic placement for in-context learning
Typical ratio: 1 attention layer per 7-8 Mamba layers
Key Advantages:
3-6× faster inference than pure Transformers (especially long outputs)
Constant memory during generation (no growing KV cache)
Linear scaling with sequence length vs quadratic
Matches or exceeds Transformer accuracy
Architecture Components (Nemotron example):
Mamba-2 layers: Efficient sequential processing
Attention layers: Handle copying and in-context learning
MLP layers: Standard feed-forward computation
Optional MoE: Sparse expert activation (Nemotron-3-Nano)
Example: Nemotron-H-8B
24 Mamba-2 layers + 4 attention layers + 24 MLP layers
Only ~8% of layers use attention; the rest are Mamba-2 and MLP layers
Achieves +2.65 points over pure Transformer baseline
3× faster inference, same training data
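To make the layer mixing concrete, the sketch below builds a layer-type schedule that reproduces the counts from the Nemotron-H-8B example above (24 Mamba-2, 4 attention, 24 MLP). The even placement of attention layers is an illustrative assumption; the actual ordering in released models comes from their published configs.

```python
# Illustrative schedule builder for a hybrid stack: mostly Mamba-2 + MLP blocks,
# with a handful of attention layers spread through the depth.
# Only the layer *counts* match the Nemotron-H-8B example; placement is assumed.

def hybrid_schedule(n_mamba=24, n_attention=4, n_mlp=24):
    mixers = ["mamba2"] * n_mamba
    # Insert attention layers at roughly even intervals among the mixer layers.
    step = (n_mamba + n_attention) // n_attention
    for i in range(n_attention):
        mixers.insert(i * step + step // 2, "attention")
    # Interleave an MLP after each mixer until the MLP budget is used up.
    schedule, mlp_left = [], n_mlp
    for layer in mixers:
        schedule.append(layer)
        if mlp_left > 0:
            schedule.append("mlp")
            mlp_left -= 1
    return schedule

layers = hybrid_schedule()
print(len(layers), "layers:",
      {t: layers.count(t) for t in ("mamba2", "attention", "mlp")})
# 52 layers: {'mamba2': 24, 'attention': 4, 'mlp': 24}
```

Because only the few attention layers keep a growing KV cache, memory during generation stays nearly constant while the Mamba-2 layers process the sequence in linear time.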
The NVIDIA Nemotron family represents the first production-grade hybrid Mamba-Transformer models at scale:
Hybrid Mamba-Transformer Architecture
Replaces 90%+ of attention layers with Mamba-2 (State Space Model)
Linear time complexity instead of quadratic
Constant memory per token during generation
3-6× faster inference than pure Transformers
Multiple Model Families:
Nemotron-H: Pure hybrid baseline models (8B, 47B, 56B)
Nemotron Nano 2: Reasoning models with ON/OFF modes (9B, 12B)
Nemotron 3 Nano: Hybrid + MoE architecture (30B total, 3B active)
Reasoning Capabilities:
Configurable reasoning mode (ON/OFF via chat template)
Explicit chain-of-thought generation when enabled
Competitive with or exceeding pure Transformer baselines
Extreme Context Support:
Nemotron Nano 2: 128K tokens
Nemotron 3 Nano: 1M tokens (1 million native support)
Nemotron-H-47B: 1M tokens in FP4 on RTX 5090
Training Innovations:
FP8 pre-training at scale (56B model on 20T tokens)
Efficient compression (56B → 47B with 63B tokens)
Warmup-Stable-Decay (WSD) learning rate schedules
Nemotron-H: Pure hybrid Mamba-Transformer without MoE
Apples-to-apples comparison with Transformers (same training data)
Research-focused baseline models
8B, 47B (compressed), 56B (flagship)
Nemotron Nano 2: Configurable reasoning with ON/OFF modes
6× throughput improvement over Qwen3-8B
128K context window
9B (compressed from 12B), 12B base
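A rough sketch of driving the ON/OFF reasoning mode from the chat template follows. The "/think" and "/no_think" system-prompt directives shown are assumed conventions, not verified here; the repo name is an example, and the model card documents the actual control mechanism, which should be checked before use.

```python
# Sketch: toggling "reasoning mode" for a hybrid reasoning model via the chat template.
# ASSUMPTION: the model accepts a "/think" / "/no_think" style system directive;
# consult the model card on Hugging Face for the documented toggle.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # example repo name; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Older transformers versions may additionally require trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

def ask(question, reasoning_on=True):
    system = "/think" if reasoning_on else "/no_think"   # assumed toggle directive
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=512)
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

print(ask("What is 17 * 24?", reasoning_on=False))
```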
Nemotron 3 Nano: Combines hybrid architecture with Mixture-of-Experts
30B total parameters, only 3B active per token
1M token native context window
128 experts with top-6 routing
23 Mamba-2 + 23 MoE + 6 attention layers
License: Apache 2.0 / NVIDIA Open Model License
Commercial Use: Permitted
Open Resources:
Model weights on Hugging Face
Training recipes in NeMo Framework
Majority of pre-training data released (6.6T tokens)
Evaluation code and benchmarks
Nemotron-H-8B: +2.65 points average over Transformer baseline
Nemotron Nano 2: 6× throughput vs Qwen3-8B on reasoning tasks
Nemotron 3 Nano: 3.3× throughput vs Qwen3-30B, matches GPT-OSS-20B
Nemotron-H-56B: Matches Llama-3.1-70B/Qwen-2.5-72B with 3× speed
The OLMo (Open Language Model) family from AI2 represents the gold standard for truly open AI:
Complete Model Flow: Not just weights, but the entire development pipeline
Pre-training data mixtures (fully documented and downloadable)
Training code and recipes
All intermediate checkpoints
Evaluation frameworks (OLMES)
Post-training recipes (Tülu 3)
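To illustrate the "all intermediate checkpoints" point above, a checkpoint from partway through training can typically be loaded by passing a branch name via `revision`. The repo name is an example and the branch name below is hypothetical; the real branch names are listed on each OLMo model's Hugging Face page.

```python
# Sketch: loading an intermediate OLMo training checkpoint from the Hugging Face Hub.
# Requires a recent transformers release with OLMo support.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "allenai/OLMo-2-1124-7B"          # OLMo 2 7B base model (example repo)
revision = "stage1-step10000-tokens42B"  # hypothetical branch; see the repo's branch list

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, revision=revision, torch_dtype="auto")
# Omitting `revision` loads the final released weights.
```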
OlmoTrace Integration:
Trace model outputs back to specific training data
Understand why models generate specific responses
Available in AI2 Playground
License: Apache 2.0 for all components
True open source
Commercial use permitted
No restrictions on model outputs
Reproducibility:
Everything needed to reproduce from scratch
Published training efficiency metrics
2.5x more efficient than comparable models (e.g., vs Llama 3.1)
OLMo 3 Base: Pre-trained foundation models
Ready for fine-tuning on specific tasks
65K token context window (16x larger than OLMo 2)
OLMo 3 Instruct: Instruction-tuned for dialog and tool use
Multi-turn conversation capabilities
Direct drop-in replacements for assistant applications
OLMo 3 Think: Reasoning-enhanced with explicit chain-of-thought
Strong performance on math, coding, and logic tasks
Shows intermediate reasoning steps
OLMo 3 RL Zero: Reinforcement learning from scratch
Research into training dynamics
Experimental release for community exploration
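For the instruct variants above, a standard chat-template generation loop applies. The sketch below uses the Transformers pipeline API; the repo id is illustrative, so check the allenai collection on Hugging Face for the exact OLMo instruct identifiers.

```python
# Sketch: multi-turn chat with an instruction-tuned OLMo checkpoint.
# The repo id is an example; see huggingface.co/allenai for current identifiers.
from transformers import pipeline

chat = pipeline("text-generation", model="allenai/OLMo-2-1124-7B-Instruct", device_map="auto")

messages = [
    {"role": "user", "content": "List two differences between BF16 and FP16."},
]
reply = chat(messages, max_new_tokens=200)[0]["generated_text"][-1]["content"]
print(reply)

messages.append({"role": "assistant", "content": reply})   # keep the conversation history
messages.append({"role": "user", "content": "Which format do most LLMs train in?"})
print(chat(messages, max_new_tokens=200)[0]["generated_text"][-1]["content"])
```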
To estimate model storage requirements:
Float16/BF16 size (GB) ≈ parameters (billions) × 2
4-bit quantized size (GB) ≈ parameters (billions) × 0.5
8-bit quantized size (GB) ≈ parameters (billions) × 1
Examples:
7B model in FP16: 7 × 2 = ~14 GB
7B model in 4-bit: 7 × 0.5 = ~3.5 GB
72B model in FP16: 72 × 2 = ~144 GB
72B model in 4-bit: 72 × 0.5 = ~36 GB
Note: Actual sizes may vary due to:
Embedding layers
Additional model components
Framework overhead
Attention cache requirements during inference
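The rules of thumb above can be wrapped in a small helper. The 10% overhead factor below is an assumed fudge for embeddings and framework overhead, not a published figure, and KV-cache growth during inference is not included.

```python
# Back-of-the-envelope weight-memory estimate from the rules of thumb above.
# The 10% overhead factor is an assumption; KV-cache memory is NOT included.
def estimate_weight_gb(params_billions: float, bits_per_param: int = 16,
                       overhead: float = 0.10) -> float:
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param * (1 + overhead)

for params, bits in [(7, 16), (7, 4), (72, 16), (72, 4)]:
    print(f"{params}B @ {bits}-bit ≈ {estimate_weight_gb(params, bits):.1f} GB")
# Prints approximately: 15.4 GB, 3.9 GB, 158.4 GB, 39.6 GB
```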
4GB-8GB VRAM: 1B-3B models (quantized)
12GB-16GB VRAM: 7B models (quantized), 3B models (full precision)
24GB VRAM: 7B models (full precision), 13B-30B models (quantized)
40GB-48GB VRAM: 30B-70B models (quantized)
80GB+ VRAM: 70B+ models (full precision)
Mobile/Edge: 1B-3B quantized models
Consumer Desktop (RTX 3090/4090, 24GB): Up to 7B full precision, 13B-30B quantized
Workstation (A5000, A6000): Up to 32B quantized, 13B full precision
Data Center (A100, H100): 70B+ models, multiple GPUs for 400B+ models
All models listed are available at: https://huggingface.co/
Popular Model Repositories:
Meta: meta-llama/
Qwen: Qwen/
AI2: allenai/
NVIDIA: nvidia/
Mistral: mistralai/
Google: google/
Playground: https://playground.allenai.org/
Model Downloads: https://allenai.org/olmo
Documentation: Available with each model release
Hugging Face Collection: https://huggingface.co/allenai
Developer Portal: https://developer.nvidia.com/nemotron
Hugging Face Models: https://huggingface.co/nvidia
NeMo Framework: https://github.com/NVIDIA/NeMo
Technical Reports: https://research.nvidia.com/labs/adlr/nemotronh/
Pre-training Datasets: https://huggingface.co/collections/nvidia/nemotron-pretraining-datasets
Transformers: Standard HuggingFace library
vLLM: High-throughput LLM serving
TGI (Text Generation Inference): HuggingFace inference server
Ollama: Local model running (Mac/Linux/Windows)
llama.cpp: CPU-optimized inference
ExecuTorch: Mobile deployment (iOS/Android)
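As one example of the serving options above, here is a minimal offline-inference sketch with vLLM; the model id is illustrative, and any Hub checkpoint that vLLM supports can be used.

```python
# Minimal vLLM offline-inference sketch (pip install vllm; requires a CUDA GPU).
# The model id is an example; pre-quantized checkpoints are also supported.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", dtype="bfloat16")  # example model id
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain BF16 vs FP16 in one sentence."], params)
print(outputs[0].outputs[0].text)
```

In recent vLLM versions, `vllm serve <model-id>` starts an OpenAI-compatible HTTP server for the same checkpoints.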
Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
LMSys Chatbot Arena: https://chat.lmsys.org/
OLMES (OLMo Evaluation): 20-benchmark suite for core capabilities
Open VLM Leaderboard: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Vision Arena: https://huggingface.co/spaces/WildVision/vision-arena
MMLU: Massive Multitask Language Understanding (knowledge)
HumanEval: Code generation
MATH: Mathematical reasoning
GSM8K: Grade school math
BBH: Big Bench Hard (reasoning)
GPQA: Graduate-level questions
IFEval: Instruction following
Efficiency Focus: Smaller models (1B-7B) achieving near-parity with larger predecessors
Hybrid Architectures: Mamba-Transformer models achieving 3-6× speedups with equal/better accuracy
Long Context: Models extending to 128K-1M token contexts
Multimodal Integration: VLMs becoming standard, not specialized
True Openness: More fully open releases (OLMo, Nemotron) with data and training recipes, vs. open-weights-only models
Reasoning Models: Explicit chain-of-thought becoming standard feature
Edge Deployment: Aggressive quantization enabling mobile/IoT deployment
Llama 3.2 (Sept 2024): First Llama with 1B/3B models and vision capabilities
OLMo 2 (Nov 2024): Outperforming larger models with full openness
OLMo 3 (Nov 2024): First fully open 32B reasoning model with OlmoTrace
Nemotron-H (March 2024): First large-scale hybrid Mamba-Transformer (up to 56B)
Nemotron Nano 2 (Aug 2024): Configurable reasoning modes, 6× throughput
Nemotron 3 Nano (Jan 2025): Hybrid MoE with 1M context window
Qwen2.5-VL (Jan 2025): Hour-long video understanding, computer use
DeepSeek V3 (Dec 2024): Efficient 671B MoE architecture
SmolVLM (2025): 2B VLM for edge deployment
Apache 2.0: Permissive open-source license, commercial use allowed (OLMo, Qwen)
Llama Community License: Restrictive at scale; commercial use allowed below a 700M monthly-active-user threshold
MIT: Permissive open source
CC BY-NC: Non-commercial use only
Always check the specific model card for license details before commercial deployment.
Hugging Face Docs: https://huggingface.co/docs
Transformers: https://huggingface.co/docs/transformers
PEFT (LoRA, QLoRA): https://huggingface.co/docs/peft
TRL (Training): https://huggingface.co/docs/trl
Hugging Face Forums: https://discuss.huggingface.co/
AI2 Discord: Via allenai.org
r/LocalLLaMA: Reddit community for local deployment
OLMo Technical Reports: Available at allenai.org/olmo
Qwen Papers: Available with model releases
Llama Papers: ai.meta.com
Last Updated: January 3, 2025
Data Sources: Hugging Face Hub, AI2 official releases, NVIDIA Nemotron documentation, model cards, technical reports
Maintained By: Community reference - verify current specs before deployment
Note: Model capabilities and sizes are rapidly evolving. This guide now includes NVIDIA's hybrid Mamba-Transformer models which represent a significant architectural advancement. Always check the official model card on Hugging Face or the provider's website for the most current information.
This document is provided as a reference guide. Model availability, performance characteristics, and specifications may change. Always verify against official sources before making deployment decisions.