Reference Guide - December 2024/January 2025
This document provides a comprehensive overview of Large Language Models (LLMs) and Vision-Language Models (VLMs) available on Hugging Face, along with the fully open OLMo models from the Allen Institute for AI (AI2) and NVIDIA's hybrid Mamba-Transformer Nemotron family.
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Llama 3.2 1B | 1B | ~2 GB (BF16) ~0.5-1 GB (4-bit) | BF16 / 4-bit (QAT+LoRA, SpinQuant) | Edge/mobile devices, on-device summarization, instruction following, tool calling |
| Llama 3.2 3B | 3B | ~6 GB (BF16) ~1.5 GB (4-bit) | BF16 / 4-bit (QAT+LoRA, SpinQuant) | Mobile/edge AI, multilingual dialog, agentic retrieval, local summarization |
| OLMo 2 1B | 1B | ~2 GB | BF16 | Small-scale research, educational use, efficient inference on constrained hardware |
| Qwen 2.5 0.5B | 0.5B | ~1 GB | FP16/BF16 | Ultra-lightweight applications, IoT devices, minimal compute requirements |
| Qwen 2.5 1.5B | 1.5B | ~3 GB | FP16/BF16 | Efficient text generation, lightweight chat applications |
| Qwen 2.5 3B | 3B | ~6 GB | FP16/BF16 | Balanced performance for resource-constrained environments |
| SmolLM2 135M/360M/1.7B | 135M-1.7B | 0.3-3.5 GB | FP16 | Edge deployment, educational purposes, lightweight NLP tasks |
| Gemma 2B | 2B | ~4 GB | FP16/BF16 | Lightweight chat, instruction following, research |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Nemotron-H-8B-Base | 8B | ~16 GB (BF16) ~4 GB (4-bit) | BF16 / 4-bit | Hybrid Mamba-Transformer, research baseline, 3× faster inference than pure Transformers |
| Nemotron-Nano-9B-v2 | 9B (from 12B) | ~18 GB (BF16) ~4.5 GB (4-bit) | BF16 / 4-bit | Hybrid reasoning model, 6× throughput vs Transformers, 128K context, configurable reasoning mode |
| Qwen 2.5 7B | 7B | ~14 GB | FP16/BF16 | Advanced reasoning, multilingual support, coding tasks |
| OLMo 2 7B | 7B | ~14 GB | BF16 | Fully open research, reproducible AI development, academic use |
| OLMo 3 Base 7B | 7B | ~14 GB | BF16 | State-of-the-art open base model, reasoning, tool use, 65K context |
| OLMo 3 Instruct 7B | 7B | ~14 GB | BF16 | Instruction following, multi-turn dialog, tool use, agentic workflows |
| OLMo 3 Think 7B | 7B | ~14 GB | BF16 | Reasoning model with explicit chain-of-thought, math, code reasoning |
| Llama 3.1 8B | 8B | ~16 GB (FP16) ~4 GB (4-bit) | FP16 / 4-bit PTQ | General-purpose chat, code generation, multilingual tasks |
| Mistral 7B | 7B | ~14 GB | FP16/BF16 | Efficient inference, strong reasoning, open commercial use |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Qwen 2.5 14B | 14B | ~28 GB | FP16/BF16 | Enhanced reasoning capabilities, complex task handling |
| OLMo 2 13B | 13B | ~26 GB | BF16 | Enhanced fully open model, outperforms larger models with fewer FLOPs |
| Nemotron-Nano-12B-v2-Base | 12B | ~24 GB (BF16) | BF16 (FP8 trained) | Hybrid base model for fine-tuning, efficient reasoning foundation |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Nemotron-3-Nano-30B-A3B | 30B (3B active) | ~60 GB total ~6 GB active (MoE) ~15 GB (FP8) | BF16 / FP8 | Hybrid Mamba-Transformer-MoE, 1M context, 3× throughput vs Transformers, agentic reasoning |
| OLMo 2 32B | 32B | ~64 GB | BF16 | Most capable fully open model, outperforms GPT-3.5-Turbo and GPT-4o-mini |
| OLMo 3 Base 32B | 32B | ~64 GB | BF16 | Flagship fully open base model, 65K context, trained on 6T tokens |
| OLMo 3 Instruct 32B | 32B | ~64 GB | BF16 | Competitive with Qwen 3, Gemma 3, Llama 3.1 at similar sizes |
| OLMo 3 Think 32B | 32B | ~64 GB | BF16 | Strongest fully open reasoning model, explicit thinking steps |
| Qwen 2.5 32B | 32B | ~64 GB | FP16/BF16 | High-performance reasoning, coding, multilingual capabilities |
| Nemotron-H-47B-Base | 47B (from 56B) | ~94 GB (BF16) ~24 GB (4-bit) | BF16 / FP4 (1M ctx) | Hybrid compressed model, 20% faster than 56B, 1M context in FP4 on RTX 5090 |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Nemotron-H-56B-Base | 56B | ~112 GB (BF16) ~28 GB (4-bit) | BF16 (FP8 trained) | Hybrid flagship, matches Llama-3.1-70B/Qwen-2.5-72B with 3× faster inference, FP8 pre-training |
| Llama 3.3 70B | 70B | ~140 GB (FP16) ~35 GB (4-bit) | FP16 / 4-bit | High-quality general assistants, complex reasoning, 128K context |
| Qwen 2.5 72B | 72B | ~144 GB | FP16/BF16 | Flagship text model, SOTA performance on many benchmarks |
| DeepSeek V3 | 671B (37B active) | ~671 GB (FP8 total) ~37 GB active | FP8 (MoE) | Mixture-of-experts, efficient large-scale inference |
| Llama 3.1 405B | 405B | ~810 GB (FP16) | FP16 / Quantized | Largest open-weight dense model, frontier capabilities, requires multi-GPU |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| SmolVLM 2B | 2B | ~4 GB | FP16 | Edge devices, browser deployment, efficient multimodal on mobile |
| Qwen2-VL 2B | 2B | ~4 GB | FP16/BF16 | Lightweight vision-language, mobile deployment, image understanding |
| Qwen2.5-VL 3B | 3B | ~6 GB | FP16/BF16 | Edge AI, image/video understanding, outperforms the earlier Qwen2-VL 7B |
| PaliGemma 3B | 3B | ~6 GB | FP16 | Efficient vision-language tasks, image captioning |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Qwen2-VL 7B | 7B | ~14 GB | FP16/BF16 | Video understanding (20+ min), image analysis, multimodal reasoning |
| Qwen2.5-VL 7B | 7B | ~14 GB (FP16) ~3.5 GB (AWQ) | FP16/BF16 / 4-bit AWQ | Enhanced visual understanding, agentic capabilities, 1hr+ video, temporal localization |
| Llama 3.2 11B Vision | 11B | ~22 GB | FP16 | Image understanding, vision Q&A, multimodal dialog |
| Molmo 7B | 7B | ~14 GB | FP16/BF16 | Pointing/tagging objects, visual understanding, open by AI2 |
| Molmo 2 8B | 8B | ~16 GB | FP16/BF16 | Video grounding, Q&A, improved over Molmo 72B on image tasks |
| LLaVA 1.5 7B | 7B | ~14 GB | FP16 | Visual question answering, image captioning, one of the first widely adopted open VLMs |
| Pixtral 12B | 12B | ~24 GB | FP16/BF16 | Multimodal understanding by Mistral AI |
| Model | Parameters | Size (Float/Quant) | Precision | Target Use |
|---|---|---|---|---|
| Qwen2.5-VL 32B | 32B | ~64 GB | FP16/BF16 | Advanced vision-language, human preference alignment |
| Qwen2-VL 72B | 72B | ~144 GB (FP16) ~36 GB (4-bit) | FP16/BF16 / 4-bit (AWQ, GPTQ) | SOTA vision-language, complex visual reasoning, long video analysis |
| Qwen2.5-VL 72B | 72B | ~144 GB (FP16) ~36 GB (AWQ) | FP16/BF16 / 4-bit AWQ | Flagship VLM, competitive with GPT-4V/Claude 3.5 Sonnet, hour-long videos, computer/phone use |
| Llama 3.2 90B Vision | 90B | ~180 GB | FP16 | Advanced vision understanding, exceeds Claude 3 Haiku on image tasks |
| Molmo 72B | 72B | ~144 GB | FP16/BF16 | High-performance visual understanding, pointing, object recognition |
| Qwen3-VL 235B | 235B (22B active) | ~470 GB (BF16 total) ~44 GB active | BF16 (MoE) | Massive-scale multimodal MoE, visual coding, agent capabilities |
FP16 (Float16): 16-bit floating point - standard precision (2 bytes per parameter)
BF16 (BFloat16): Brain Float 16 - optimized for neural networks, same range as FP32 (2 bytes per parameter)
FP32: 32-bit floating point - full precision (4 bytes per parameter)
4-bit (QAT+LoRA): Quantization-Aware Training with Low-Rank Adaptation
Simulates quantization effects during training
Uses LoRA adapters to maintain quality
Achieves ~75% size reduction
4-bit (SpinQuant): Advanced post-training quantization
Optimized rotation-based quantization
Can be applied to fine-tuned models
Balances accuracy and performance
4-bit (AWQ): Activation-aware Weight Quantization
Protects salient weights based on activation patterns
Better accuracy retention than naive quantization
4-bit (GPTQ): Accurate post-training quantization for generative pre-trained Transformers
Layer-wise quantization with error compensation
Popular for large model compression
8-bit: 8-bit integer quantization (1 byte per parameter)
~50% size reduction
Minimal accuracy loss
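As an illustration of how the 4-bit and 8-bit formats above are typically applied at load time, the following minimal sketch uses Hugging Face Transformers with bitsandbytes NF4 quantization; the model ID is just an example, and any causal LM from the tables above can be substituted.

```python
# Minimal sketch: loading a model in 4-bit (NF4) with Transformers + bitsandbytes.
# Assumes `pip install transformers accelerate bitsandbytes` and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # example; substitute any causal LM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # ~0.5 bytes per parameter
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16 for accuracy
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # place layers on available devices
)

inputs = tokenizer("Summarize: hybrid Mamba-Transformer models ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For 8-bit loading, `load_in_8bit=True` replaces the 4-bit settings; checkpoints already quantized with AWQ or GPTQ load directly through `from_pretrained` when the matching backend library is installed.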
MoE (Mixture of Experts): Only a subset of model parameters is active for each token
Much smaller memory footprint during inference
Example: DeepSeek V3 has 671B total parameters but only 37B active
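The sparse activation described above can be illustrated with a toy top-k router. This is a simplified sketch of the general MoE pattern, with made-up sizes and names; it is not the routing used by any specific model, and real MoE layers add load balancing, capacity limits, and fused kernels.

```python
# Toy Mixture-of-Experts layer: route each token to the top-k of N expert MLPs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # one score per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                     # x: (tokens, d_model)
        scores = self.router(x)               # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):        # only the chosen experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

x = torch.randn(10, 256)
print(ToyMoE()(x).shape)  # torch.Size([10, 256])
```

Total parameter count grows with the number of experts, but per-token compute and activation memory scale only with the top-k experts actually selected, which is why a 671B-total model can run with a 37B-active footprint.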
Hybrid Mamba-Transformer: Combines Mamba-2 (State Space Model) layers with Transformer attention
Mamba-2 layers: Linear time complexity O(n), constant memory per token
Transformer attention layers: Strategic placement for in-context learning
Typical ratio: 1 attention layer per 7-8 Mamba layers
Key Advantages:
3-6× faster inference than pure Transformers (especially long outputs)
Constant memory during generation (no growing KV cache)
Linear scaling with sequence length vs quadratic
Matches or exceeds Transformer accuracy
Architecture Components (Nemotron example):
Mamba-2 layers: Efficient sequential processing
Attention layers: Handle copying and in-context learning
MLP layers: Standard feed-forward computation
Optional MoE: Sparse expert activation (Nemotron-3-Nano)
Example: Nemotron-H-8B
24 Mamba-2 layers + 4 attention layers + 24 MLP layers
Only ~8% of layers use attention; the rest are Mamba-2 and MLP layers
Achieves +2.65 points over pure Transformer baseline
3× faster inference, same training data
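To make the layer mixing concrete, the sketch below builds a layer-type schedule that reproduces the counts from the Nemotron-H-8B example above (24 Mamba-2, 4 attention, 24 MLP). The even placement of attention layers is an illustrative assumption; the actual ordering in released models comes from their published configs.

```python
# Illustrative schedule builder for a hybrid stack: mostly Mamba-2 + MLP blocks,
# with a handful of attention layers spread through the depth.
# Only the layer *counts* match the Nemotron-H-8B example; placement is assumed.

def hybrid_schedule(n_mamba=24, n_attention=4, n_mlp=24):
    mixers = ["mamba2"] * n_mamba
    # Insert attention layers at roughly even intervals among the mixer layers.
    step = (n_mamba + n_attention) // n_attention
    for i in range(n_attention):
        mixers.insert(i * step + step // 2, "attention")
    # Interleave an MLP after each mixer until the MLP budget is used up.
    schedule, mlp_left = [], n_mlp
    for layer in mixers:
        schedule.append(layer)
        if mlp_left > 0:
            schedule.append("mlp")
            mlp_left -= 1
    return schedule

layers = hybrid_schedule()
print(len(layers), "layers:",
      {t: layers.count(t) for t in ("mamba2", "attention", "mlp")})
# 52 layers: {'mamba2': 24, 'attention': 4, 'mlp': 24}
```

Because only the few attention layers keep a growing KV cache, memory during generation stays nearly constant while the Mamba-2 layers process the sequence in linear time.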
The NVIDIA Nemotron family represents the first production-grade hybrid Mamba-Transformer models at scale:
Hybrid Mamba-Transformer Architecture
Replaces 90%+ of attention layers with Mamba-2 (State Space Model)
Linear time complexity instead of quadratic
Constant memory per token during generation
3-6× faster inference than pure Transformers
Multiple Model Families:
Nemotron-H: Pure hybrid baseline models (8B, 47B, 56B)
Nemotron Nano 2: Reasoning models with ON/OFF modes (9B, 12B)
Nemotron 3 Nano: Hybrid + MoE architecture (30B total, 3B active)
Reasoning Capabilities:
Configurable reasoning mode (ON/OFF via chat template)
Explicit chain-of-thought generation when enabled
Competitive with or exceeding pure Transformer baselines
Extreme Context Support:
Nemotron Nano 2: 128K tokens
Nemotron 3 Nano: 1M tokens (1 million native support)
Nemotron-H-47B: 1M tokens in FP4 on RTX 5090
Training Innovations:
FP8 pre-training at scale (56B model on 20T tokens)
Efficient compression (56B → 47B with 63B tokens)
Warmup-Stable-Decay (WSD) learning rate schedules
Nemotron-H: Pure hybrid Mamba-Transformer without MoE
Apples-to-apples comparison with Transformers (same training data)
Research-focused baseline models
8B, 47B (compressed), 56B (flagship)
Nemotron Nano 2: Configurable reasoning with ON/OFF modes
6× throughput improvement over Qwen3-8B
128K context window
9B (compressed from 12B), 12B base
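A rough sketch of driving the ON/OFF reasoning mode from the chat template follows. The "/think" and "/no_think" system-prompt directives shown are assumed conventions, not verified here; the repo name is an example, and the model card documents the actual control mechanism, which should be checked before use.

```python
# Sketch: toggling "reasoning mode" for a hybrid reasoning model via the chat template.
# ASSUMPTION: the model accepts a "/think" / "/no_think" style system directive;
# consult the model card on Hugging Face for the documented toggle.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # example repo name; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Older transformers versions may additionally require trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

def ask(question, reasoning_on=True):
    system = "/think" if reasoning_on else "/no_think"   # assumed toggle directive
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=512)
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

print(ask("What is 17 * 24?", reasoning_on=False))
```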
Nemotron 3 Nano: Combines hybrid architecture with Mixture-of-Experts
30B total parameters, only 3B active per token
1M token native context window
128 experts with top-6 routing
23 Mamba-2 + 23 MoE + 6 attention layers
License: Apache 2.0 / NVIDIA Open Model License
Commercial Use: Permitted
Open Resources:
Model weights on Hugging Face
Training recipes in NeMo Framework
Majority of pre-training data released (6.6T tokens)
Evaluation code and benchmarks
Nemotron-H-8B: +2.65 points average over Transformer baseline
Nemotron Nano 2: 6× throughput vs Qwen3-8B on reasoning tasks
Nemotron 3 Nano: 3.3× throughput vs Qwen3-30B, matches GPT-OSS-20B
Nemotron-H-56B: Matches Llama-3.1-70B/Qwen-2.5-72B with 3× speed
The OLMo (Open Language Model) family from AI2 represents the gold standard for truly open AI:
Complete Model Flow: Not just weights, but the entire development pipeline
Pre-training data mixtures (fully documented and downloadable)
Training code and recipes
All intermediate checkpoints
Evaluation frameworks (OLMES)
Post-training recipes (Tülu 3)
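To illustrate the "all intermediate checkpoints" point above, a checkpoint from partway through training can typically be loaded by passing a branch name via `revision`. The repo name is an example and the branch name below is hypothetical; the real branch names are listed on each OLMo model's Hugging Face page.

```python
# Sketch: loading an intermediate OLMo training checkpoint from the Hugging Face Hub.
# Requires a recent transformers release with OLMo support.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "allenai/OLMo-2-1124-7B"          # OLMo 2 7B base model (example repo)
revision = "stage1-step10000-tokens42B"  # hypothetical branch; see the repo's branch list

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, revision=revision, torch_dtype="auto")
# Omitting `revision` loads the final released weights.
```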
OlmoTrace Integration:
Trace model outputs back to specific training data
Understand why models generate specific responses
Available in AI2 Playground
License: Apache 2.0 for all components
True open source
Commercial use permitted
No restrictions on model outputs
Reproducibility:
Everything needed to reproduce from scratch
Published training efficiency metrics
2.5x more efficient than comparable models (e.g., vs Llama 3.1)
OLMo 3 Base: Pre-trained foundation models
Ready for fine-tuning on specific tasks
65K token context window (16x larger than OLMo 2)
OLMo 3 Instruct: Instruction-tuned for dialog and tool use
Multi-turn conversation capabilities
Direct drop-in replacements for assistant applications
OLMo 3 Think: Reasoning-enhanced with explicit chain-of-thought
Strong performance on math, coding, and logic tasks
Shows intermediate reasoning steps
OLMo 3 RL Zero: Reinforcement learning from scratch
Research into training dynamics
Experimental release for community exploration
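For the instruct variants above, a standard chat-template generation loop applies. The sketch below uses the Transformers pipeline API; the repo id is illustrative, so check the allenai collection on Hugging Face for the exact OLMo instruct identifiers.

```python
# Sketch: multi-turn chat with an instruction-tuned OLMo checkpoint.
# The repo id is an example; see huggingface.co/allenai for current identifiers.
from transformers import pipeline

chat = pipeline("text-generation", model="allenai/OLMo-2-1124-7B-Instruct", device_map="auto")

messages = [
    {"role": "user", "content": "List two differences between BF16 and FP16."},
]
reply = chat(messages, max_new_tokens=200)[0]["generated_text"][-1]["content"]
print(reply)

messages.append({"role": "assistant", "content": reply})   # keep the conversation history
messages.append({"role": "user", "content": "Which format do most LLMs train in?"})
print(chat(messages, max_new_tokens=200)[0]["generated_text"][-1]["content"])
```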
To estimate model storage requirements:
Float16/BF16 size (GB) ≈ parameters (billions) × 2
4-bit quantized size (GB) ≈ parameters (billions) × 0.5
8-bit quantized size (GB) ≈ parameters (billions) × 1
Examples:
7B model in FP16: 7 × 2 = ~14 GB
7B model in 4-bit: 7 × 0.5 = ~3.5 GB
72B model in FP16: 72 × 2 = ~144 GB
72B model in 4-bit: 72 × 0.5 = ~36 GB
Note: Actual sizes may vary due to:
Embedding layers
Additional model components
Framework overhead
Attention cache requirements during inference
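The rules of thumb above can be wrapped in a small helper. The 10% overhead factor below is an assumed fudge for embeddings and framework overhead, not a published figure, and KV-cache growth during inference is not included.

```python
# Back-of-the-envelope weight-memory estimate from the rules of thumb above.
# The 10% overhead factor is an assumption; KV-cache memory is NOT included.
def estimate_weight_gb(params_billions: float, bits_per_param: int = 16,
                       overhead: float = 0.10) -> float:
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param * (1 + overhead)

for params, bits in [(7, 16), (7, 4), (72, 16), (72, 4)]:
    print(f"{params}B @ {bits}-bit ≈ {estimate_weight_gb(params, bits):.1f} GB")
# Prints approximately: 15.4 GB, 3.9 GB, 158.4 GB, 39.6 GB
```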
4GB-8GB VRAM: 1B-3B models (quantized)
12GB-16GB VRAM: 7B models (quantized), 3B models (full precision)
24GB VRAM: 7B models (full precision), 13B-30B models (quantized)
40GB-48GB VRAM: 30B-70B models (quantized)
80GB+ VRAM: 70B+ models (full precision)
Mobile/Edge: 1B-3B quantized models
Consumer Desktop (RTX 3090/4090, 24GB): Up to 7B full precision, 13B-30B quantized
Workstation (A5000, A6000): Up to 32B quantized, 13B full precision
Data Center (A100, H100): 70B+ models, multiple GPUs for 400B+ models
All models listed are available at: https://huggingface.co/
Popular Model Repositories:
Meta: meta-llama/
Qwen: Qwen/
AI2: allenai/
NVIDIA: nvidia/
Mistral: mistralai/
Google: google/
Playground: https://playground.allenai.org/
Model Downloads: https://allenai.org/olmo
Documentation: Available with each model release
Hugging Face Collection: https://huggingface.co/allenai
Developer Portal: https://developer.nvidia.com/nemotron
Hugging Face Models: https://huggingface.co/nvidia
NeMo Framework: https://github.com/NVIDIA/NeMo
Technical Reports: https://research.nvidia.com/labs/adlr/nemotronh/
Pre-training Datasets: https://huggingface.co/collections/nvidia/nemotron-pretraining-datasets
Transformers: Standard HuggingFace library
vLLM: High-throughput LLM serving
TGI (Text Generation Inference): HuggingFace inference server
Ollama: Local model running (Mac/Linux/Windows)
llama.cpp: CPU-optimized inference
ExecuTorch: Mobile deployment (iOS/Android)
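As one example of the serving options above, here is a minimal offline-inference sketch with vLLM; the model id is illustrative, and any Hub checkpoint that vLLM supports can be used.

```python
# Minimal vLLM offline-inference sketch (pip install vllm; requires a CUDA GPU).
# The model id is an example; pre-quantized checkpoints are also supported.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", dtype="bfloat16")  # example model id
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain BF16 vs FP16 in one sentence."], params)
print(outputs[0].outputs[0].text)
```

In recent vLLM versions, `vllm serve <model-id>` starts an OpenAI-compatible HTTP server for the same checkpoints.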
Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
LMSys Chatbot Arena: https://chat.lmsys.org/
OLMES (OLMo Evaluation): 20-benchmark suite for core capabilities
Open VLM Leaderboard: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Vision Arena: https://huggingface.co/spaces/WildVision/vision-arena
MMLU: Massive Multitask Language Understanding (knowledge)
HumanEval: Code generation
MATH: Mathematical reasoning
GSM8K: Grade school math
BBH: Big Bench Hard (reasoning)
GPQA: Graduate-level questions
IFEval: Instruction following
Efficiency Focus: Smaller models (1B-7B) achieving near-parity with larger predecessors
Hybrid Architectures: Mamba-Transformer models achieving 3-6× speedups with equal/better accuracy
Long Context: Models extending to 128K-1M token contexts
Multimodal Integration: VLMs becoming standard, not specialized
True Openness: More fully open releases (OLMo, Nemotron) with data and training recipes, vs. open-weights-only models
Reasoning Models: Explicit chain-of-thought becoming standard feature
Edge Deployment: Aggressive quantization enabling mobile/IoT deployment
Llama 3.2 (Sept 2024): First Llama with 1B/3B models and vision capabilities
OLMo 2 (Nov 2024): Outperforming larger models with full openness
OLMo 3 (Nov 2024): First fully open 32B reasoning model with OlmoTrace
Nemotron-H (March 2024): First large-scale hybrid Mamba-Transformer (up to 56B)
Nemotron Nano 2 (Aug 2024): Configurable reasoning modes, 6× throughput
Nemotron 3 Nano (Jan 2025): Hybrid MoE with 1M context window
Qwen2.5-VL (Jan 2025): Hour-long video understanding, computer use
DeepSeek V3 (Dec 2024): Efficient 671B MoE architecture
SmolVLM (2025): 2B VLM for edge deployment
Apache 2.0: Permissive open-source license, commercial use allowed (OLMo, Qwen)
Llama Community License: Restrictive at scale; commercial use allowed below a 700M monthly-active-user threshold
MIT: Permissive open source
CC BY-NC: Non-commercial use only
Always check the specific model card for license details before commercial deployment.
Hugging Face Docs: https://huggingface.co/docs
Transformers: https://huggingface.co/docs/transformers
PEFT (LoRA, QLoRA): https://huggingface.co/docs/peft
TRL (Training): https://huggingface.co/docs/trl
Hugging Face Forums: https://discuss.huggingface.co/
AI2 Discord: Via allenai.org
r/LocalLLaMA: Reddit community for local deployment
OLMo Technical Reports: Available at allenai.org/olmo
Qwen Papers: Available with model releases
Llama Papers: ai.meta.com
Last Updated: January 3, 2025
Data Sources: Hugging Face Hub, AI2 official releases, NVIDIA Nemotron documentation, model cards, technical reports
Maintained By: Community reference - verify current specs before deployment
Note: Model capabilities and sizes are rapidly evolving. This guide now includes NVIDIA's hybrid Mamba-Transformer models which represent a significant architectural advancement. Always check the official model card on Hugging Face or the provider's website for the most current information.
This document is provided as a reference guide. Model availability, performance characteristics, and specifications may change. Always verify against official sources before making deployment decisions.