LLMs and VLMs on Hugging Face + AI2 OLMo Models

Reference Guide - December 2024/January 2025

This document provides a comprehensive overview of large language models (LLMs) and vision-language models (VLMs) available on Hugging Face, along with the fully open OLMo models from the Allen Institute for AI (AI2).


Large Language Models (LLMs)

Tiny Models (≤3B parameters)

| Model | Parameters | Size (full / quantized) | Precision | Target Use |
|---|---|---|---|---|
| Llama 3.2 1B | 1B | ~2 GB (BF16) / ~0.5-1 GB (4-bit) | BF16 / 4-bit (QAT+LoRA, SpinQuant) | Edge/mobile devices, on-device summarization, instruction following, tool calling |
| Llama 3.2 3B | 3B | ~6 GB (BF16) / ~1.5 GB (4-bit) | BF16 / 4-bit (QAT+LoRA, SpinQuant) | Mobile/edge AI, multilingual dialog, agentic retrieval, local summarization |
| OLMo 2 1B | 1B | ~2 GB | BF16 | Small-scale research, educational use, efficient inference on constrained hardware |
| Qwen 2.5 0.5B | 0.5B | ~1 GB | FP16/BF16 | Ultra-lightweight applications, IoT devices, minimal compute requirements |
| Qwen 2.5 1.5B | 1.5B | ~3 GB | FP16/BF16 | Efficient text generation, lightweight chat applications |
| Qwen 2.5 3B | 3B | ~6 GB | FP16/BF16 | Balanced performance for resource-constrained environments |
| SmolLM2 135M/360M/1.7B | 135M-1.7B | 0.3-3.5 GB | FP16 | Edge deployment, educational purposes, lightweight NLP tasks |
| Gemma 2B | 2B | ~4 GB | FP16/BF16 | Lightweight chat, instruction following, research |

Small Models (4B-10B parameters)

| Model | Parameters | Size (full / quantized) | Precision | Target Use |
|---|---|---|---|---|
| Nemotron-H-8B-Base | 8B | ~16 GB (BF16) / ~4 GB (4-bit) | BF16 / 4-bit | Hybrid Mamba-Transformer, research baseline, 3× faster inference than pure Transformers |
| Nemotron-Nano-9B-v2 | 9B (from 12B) | ~18 GB (BF16) / ~4.5 GB (4-bit) | BF16 / 4-bit | Hybrid reasoning model, 6× throughput vs Transformers, 128K context, configurable reasoning mode |
| Qwen 2.5 7B | 7B | ~14 GB | FP16/BF16 | Advanced reasoning, multilingual support, coding tasks |
| OLMo 2 7B | 7B | ~14 GB | BF16 | Fully open research, reproducible AI development, academic use |
| OLMo 3 Base 7B | 7B | ~14 GB | BF16 | State-of-the-art open base model, reasoning, tool use, 65K context |
| OLMo 3 Instruct 7B | 7B | ~14 GB | BF16 | Instruction following, multi-turn dialog, tool use, agentic workflows |
| OLMo 3 Think 7B | 7B | ~14 GB | BF16 | Reasoning model with explicit chain-of-thought, math, code reasoning |
| Llama 3.1 8B | 8B | ~16 GB (FP16) / ~4 GB (4-bit) | FP16 / 4-bit PTQ | General-purpose chat, code generation, multilingual tasks |
| Mistral 7B | 7B | ~14 GB | FP16/BF16 | Efficient inference, strong reasoning, open commercial use |

Medium Models (11B-20B parameters)

| Model | Parameters | Size (full / quantized) | Precision | Target Use |
|---|---|---|---|---|
| Qwen 2.5 14B | 14B | ~28 GB | FP16/BF16 | Enhanced reasoning capabilities, complex task handling |
| OLMo 2 13B | 13B | ~26 GB | BF16 | Enhanced fully open model, outperforms larger models with fewer FLOPs |
| Nemotron-Nano-12B-v2-Base | 12B | ~24 GB (BF16) | BF16 (FP8 trained) | Hybrid base model for fine-tuning, efficient reasoning foundation |

Large Models (21B-50B parameters)

| Model | Parameters | Size (full / quantized) | Precision | Target Use |
|---|---|---|---|---|
| Nemotron-3-Nano-30B-A3B | 30B (3B active) | ~60 GB total / ~6 GB active (MoE) / ~15 GB (FP8) | BF16 / FP8 | Hybrid Mamba-Transformer-MoE, 1M context, 3× throughput vs Transformers, agentic reasoning |
| OLMo 2 32B | 32B | ~64 GB | BF16 | Most capable fully open model, outperforms GPT-3.5-Turbo and GPT-4o-mini |
| OLMo 3 Base 32B | 32B | ~64 GB | BF16 | Flagship fully open base model, 65K context, trained on 6T tokens |
| OLMo 3 Instruct 32B | 32B | ~64 GB | BF16 | Competitive with Qwen 3, Gemma 3, Llama 3.1 at similar sizes |
| OLMo 3 Think 32B | 32B | ~64 GB | BF16 | Strongest fully open reasoning model, explicit thinking steps |
| Qwen 2.5 32B | 32B | ~64 GB | FP16/BF16 | High-performance reasoning, coding, multilingual capabilities |
| Nemotron-H-47B-Base | 47B (from 56B) | ~94 GB (BF16) / ~24 GB (4-bit) | BF16 / FP4 (1M ctx) | Hybrid compressed model, 20% faster than 56B, 1M context in FP4 on RTX 5090 |

Very Large Models (>50B parameters)

| Model | Parameters | Size (full / quantized) | Precision | Target Use |
|---|---|---|---|---|
| Nemotron-H-56B-Base | 56B | ~112 GB (BF16) / ~28 GB (4-bit) | BF16 (FP8 trained) | Hybrid flagship, matches Llama-3.1-70B/Qwen-2.5-72B with 3× faster inference, FP8 pre-training |
| Llama 3.3 70B | 70B | ~140 GB (FP16) / ~35 GB (4-bit) | FP16 / 4-bit | High-quality general assistants, complex reasoning, 128K context |
| Qwen 2.5 72B | 72B | ~144 GB | FP16/BF16 | Flagship text model, SOTA performance on many benchmarks |
| DeepSeek V3 | 671B (37B active) | ~37 GB active | MoE | Mixture-of-experts, efficient large-scale inference |
| Llama 3.1 405B | 405B | ~810 GB (FP16) | FP16 / quantized | Largest open model, frontier capabilities, requires multi-GPU |

Vision-Language Models (VLMs)

Tiny VLMs (≤3B parameters)

| Model | Parameters | Size (full / quantized) | Precision | Target Use |
|---|---|---|---|---|
| SmolVLM 2B | 2B | ~4 GB | FP16 | Edge devices, browser deployment, efficient multimodal on mobile |
| Qwen2-VL 2B | 2B | ~4 GB | FP16/BF16 | Lightweight vision-language, mobile deployment, image understanding |
| Qwen2.5-VL 3B | 3B | ~6 GB | FP16/BF16 | Edge AI, image/video understanding, outperforms previous 7B models |
| PaliGemma 3B | 3B | ~6 GB | FP16 | Efficient vision-language tasks, image captioning |

Medium VLMs (4B-20B parameters)

| Model | Parameters | Size (full / quantized) | Precision | Target Use |
|---|---|---|---|---|
| Qwen2-VL 7B | 7B | ~14 GB | FP16/BF16 | Video understanding (20+ min), image analysis, multimodal reasoning |
| Qwen2.5-VL 7B | 7B | ~14 GB (FP16) / ~3.5 GB (AWQ) | FP16/BF16 / 4-bit AWQ | Enhanced visual understanding, agentic capabilities, 1hr+ video, temporal localization |
| Llama 3.2 11B Vision | 11B | ~22 GB | FP16 | Image understanding, vision Q&A, multimodal dialog |
| Molmo 7B | 7B | ~14 GB | FP16/BF16 | Pointing/tagging objects, visual understanding, open by AI2 |
| Molmo 2 8B | 8B | ~16 GB | FP16/BF16 | Video grounding, Q&A, improved over Molmo 72B on image tasks |
| LLaVA 1.5 7B | 7B | ~14 GB | FP16 | Visual question answering, image captioning, first successful open VLM |
| Pixtral 12B | 12B | ~24 GB | FP16/BF16 | Multimodal understanding by Mistral AI |

Large VLMs (>20B parameters)

| Model | Parameters | Size (full / quantized) | Precision | Target Use |
|---|---|---|---|---|
| Qwen2.5-VL 32B | 32B | ~64 GB | FP16/BF16 | Advanced vision-language, human preference alignment |
| Qwen2-VL 72B | 72B | ~144 GB (FP16) / ~36 GB (4-bit) | FP16/BF16 / 4-bit (AWQ, GPTQ) | SOTA vision-language, complex visual reasoning, long video analysis |
| Qwen2.5-VL 72B | 72B | ~144 GB (FP16) / ~36 GB (AWQ) | FP16/BF16 / 4-bit AWQ | Flagship VLM, competitive with GPT-4V/Claude 3.5 Sonnet, hour-long videos, computer/phone use |
| Llama 3.2 90B Vision | 90B | ~180 GB | FP16 | Advanced vision understanding, exceeds Claude 3 Haiku on image tasks |
| Molmo 72B | 72B | ~144 GB | FP16/BF16 | High-performance visual understanding, pointing, object recognition |
| Qwen3-VL 235B | 235B (22B active) | ~44 GB active | MoE | Massive-scale multimodal MoE, visual coding, agent capabilities |

Quantization Types Explained

Standard Precision Formats

  • FP16 / BF16 – 16-bit half precision (2 bytes/param); BF16 keeps FP32's exponent range and is the usual training and serving format
  • FP8 – 8-bit floating point (1 byte/param); used for pre-training and inference on recent NVIDIA GPUs
  • FP4 / 4-bit – 4-bit formats (0.5 bytes/param), normally reached via quantization

Quantization Methods

  • PTQ (post-training quantization) – quantize finished weights without retraining
  • QAT (quantization-aware training) – train with quantization in the loop, often combined with LoRA adapters (QAT+LoRA, as in Llama 3.2)
  • GPTQ / AWQ – widely used post-training 4-bit weight-quantization methods
  • SpinQuant – rotation-based quantization used for the Llama 3.2 4-bit variants

Mixture of Experts (MoE)

  • Only a subset of parameters (the "active" experts) is used per token
  • Total parameter count drives storage; active parameter count drives per-token compute (e.g., DeepSeek V3: 671B total, 37B active)

Hybrid Mamba-Transformer Architecture

  • Replaces most attention layers with Mamba-2 state-space layers
  • Linear-time sequence processing with constant per-token generation memory (see the Nemotron section below)


Nemotron Models - Hybrid Architecture Pioneer

The NVIDIA Nemotron family comprises the first production-grade hybrid Mamba-Transformer models trained at scale:

What Makes Nemotron Unique

  1. Hybrid Mamba-Transformer Architecture

    • Replaces 90%+ of attention layers with Mamba-2 (State Space Model)

    • Linear time complexity instead of quadratic

    • Constant memory per token during generation

    • 3-6× faster inference than pure Transformers

  2. Multiple Model Families:

    • Nemotron-H: Pure hybrid baseline models (8B, 47B, 56B)

    • Nemotron Nano 2: Reasoning models with ON/OFF modes (9B, 12B)

    • Nemotron 3 Nano: Hybrid + MoE architecture (30B total, 3B active)

  3. Reasoning Capabilities:

    • Configurable reasoning mode (ON/OFF via chat template)

    • Explicit chain-of-thought generation when enabled

    • Competitive with or exceeding pure Transformer baselines

  4. Extreme Context Support:

    • Nemotron Nano 2: 128K tokens

    • Nemotron 3 Nano: 1M tokens of native context support

    • Nemotron-H-47B: 1M tokens in FP4 on RTX 5090

  5. Training Innovations:

    • FP8 pre-training at scale (56B model on 20T tokens)

    • Efficient compression (56B → 47B with 63B tokens)

    • Warmup-Stable-Decay (WSD) learning rate schedules
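As a concrete sketch of the configurable reasoning mode: the Nemotron Nano 2 model cards describe toggling reasoning through a system-prompt flag ("/think" / "/no_think"); the exact control string is model-specific, so treat the flag below as an assumption and verify it on the model card.

```python
def build_messages(user_prompt: str, reasoning: bool) -> list:
    """Chat messages with reasoning mode toggled via the system prompt.

    The "/think" / "/no_think" flags follow the convention described on
    the Nemotron Nano 2 model cards; other reasoning models use different
    mechanisms, so check the card before relying on this.
    """
    system_flag = "/think" if reasoning else "/no_think"
    return [
        {"role": "system", "content": system_flag},
        {"role": "user", "content": user_prompt},
    ]

# With reasoning ON the model emits an explicit chain of thought before the
# final answer; with it OFF it answers directly. These messages would then be
# passed to tokenizer.apply_chat_template(messages, add_generation_prompt=True).
messages = build_messages("What is 17 * 23?", reasoning=True)
```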

Nemotron Model Variants

Nemotron-H (Hybrid Base)

Nemotron Nano 2 (Reasoning)

Nemotron 3 Nano (MoE + Hybrid)

License & Availability

Performance Highlights


OLMo Models - Unique Features

The OLMo (Open Language Model) family from AI2 represents the gold standard for truly open AI:

What Makes OLMo "Fully Open"

  1. Complete Model Flow: Not just weights, but the entire development pipeline

    • Pre-training data mixtures (fully documented and downloadable)

    • Training code and recipes

    • All intermediate checkpoints

    • Evaluation frameworks (OLMES)

    • Post-training recipes (Tülu 3)

  2. OlmoTrace Integration:

    • Trace model outputs back to specific training data

    • Understand why models generate specific responses

    • Available in AI2 Playground

  3. License: Apache 2.0 for all components

    • True open source

    • Commercial use permitted

    • No restrictions on model outputs

  4. Reproducibility:

    • Everything needed to reproduce from scratch

    • Published training efficiency metrics

    • Roughly 2.5× more training-efficient than comparable models (e.g., Llama 3.1)
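The intermediate checkpoints mentioned above are published on the Hugging Face Hub as git revisions of the model repo, so any training step can be loaded by name. A minimal sketch; the repo id is my best guess for OLMo 2 7B and the branch naming varies by release, so inspect the repo's revision list before use.

```python
REPO_ID = "allenai/OLMo-2-1124-7B"  # assumed OLMo 2 7B repo id; verify on the Hub

def checkpoint_kwargs(revision=None) -> dict:
    """Build from_pretrained() kwargs, optionally pinned to a training step.

    OLMo repos expose intermediate checkpoints as git branches/revisions;
    passing a revision string selects that point in training. Branch names
    differ between releases, so list the repo's branches first.
    """
    kwargs = {"pretrained_model_name_or_path": REPO_ID}
    if revision is not None:
        kwargs["revision"] = revision
    return kwargs

def load_checkpoint(revision=None):
    """Download and load the weights (~14 GB in BF16; not executed here)."""
    from transformers import AutoModelForCausalLM
    return AutoModelForCausalLM.from_pretrained(**checkpoint_kwargs(revision))
```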

OLMo Model Variants

Base Models

Instruct Models

Think Models

RL Zero (Experimental)


Size Estimation Formula

To estimate model storage requirements:

Size (GB) ≈ parameters (billions) × bytes per parameter
(FP16/BF16 ≈ 2 bytes, FP8/INT8 ≈ 1 byte, 4-bit ≈ 0.5 bytes)

Examples:

  • 7B at BF16: 7 × 2 ≈ 14 GB
  • 8B at 4-bit: 8 × 0.5 ≈ 4 GB
  • 70B at FP16: 70 × 2 ≈ 140 GB

Note: Actual sizes may vary due to:

  • vocabulary/embedding size and whether embeddings are tied
  • file-format and metadata overhead
  • quantization overhead (scales, zero-points, outlier weights)
  • KV cache and activations, which add VRAM on top of weights at inference time
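The rule of thumb behind the table sizes is simply parameter count times bytes per parameter; a minimal helper, sanity-checked against the figures above:

```python
def model_size_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight-storage size: params × bytes per parameter.

    params_billions × 1e9 params × (bits/8) bytes / 1e9 → GB, so the
    1e9 factors cancel and the formula reduces to params × bits / 8.
    """
    return params_billions * bits_per_param / 8

# Sanity checks against the tables above:
assert model_size_gb(7, 16) == 14.0    # 7B in FP16/BF16   → ~14 GB
assert model_size_gb(8, 4) == 4.0      # Llama 3.1 8B, 4-bit → ~4 GB
assert model_size_gb(70, 16) == 140.0  # Llama 3.3 70B, FP16 → ~140 GB
```

Treat the result as a floor: real checkpoints add vocabulary, metadata, and quantization-scale overhead, and inference additionally needs VRAM for the KV cache.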


Hardware Requirements

Consumer GPUs (Inference)

Typical Inference Hardware


Access and Deployment

Hugging Face Hub

All models listed are available at: https://huggingface.co/
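Most entries in the tables can be pulled with the transformers library, and the 4-bit variants typically load via bitsandbytes. A minimal sketch under stated assumptions: the model id is an illustrative (gated) example, and any causal-LM repo from the tables works the same way.

```python
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # example repo; gated, needs HF access

def load_kwargs(four_bit: bool = True) -> dict:
    """from_pretrained() kwargs for full-precision or 4-bit loading."""
    kwargs = {"device_map": "auto"}  # spread layers across available GPUs/CPU
    if four_bit:
        import torch
        from transformers import BitsAndBytesConfig
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",             # NormalFloat4 weights
            bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
        )
    return kwargs

def load_model():
    """Download and load the model (several GB; needs bitsandbytes + a GPU)."""
    from transformers import AutoModelForCausalLM
    return AutoModelForCausalLM.from_pretrained(MODEL_ID, **load_kwargs())
```

With 4-bit loading an 8B model fits in roughly 4 GB of VRAM instead of ~16 GB, which is why quantized variants dominate the consumer-GPU deployments listed above.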

Popular Model Repositories:

AI2 Resources

NVIDIA Nemotron Resources

Deployment Tools


Evaluation Leaderboards

LLM Leaderboards

VLM Leaderboards

Key Benchmarks


Recent Developments (2024-2025)

  1. Efficiency Focus: Smaller models (1B-7B) achieving near-parity with larger predecessors

  2. Hybrid Architectures: Mamba-Transformer models achieving 3-6× speedups with equal/better accuracy

  3. Long Context: Models extending to 128K-1M token contexts

  4. Multimodal Integration: VLMs becoming standard, not specialized

  5. True Openness: More fully open releases (OLMo, Qwen, Nemotron), not just open weights

  6. Reasoning Models: Explicit chain-of-thought becoming standard feature

  7. Edge Deployment: Aggressive quantization enabling mobile/IoT deployment

Major Releases


License Information

Common Licenses

Always check the specific model card for license details before commercial deployment.


Additional Resources

Documentation

Communities

Papers and Technical Reports


Document Information

Last Updated: January 3, 2025
Data Sources: Hugging Face Hub, AI2 official releases, NVIDIA Nemotron documentation, model cards, technical reports
Maintained By: Community reference - verify current specs before deployment

Note: Model capabilities and sizes are rapidly evolving. This guide now includes NVIDIA's hybrid Mamba-Transformer models, which represent a significant architectural advancement. Always check the official model card on Hugging Face or the provider's website for the most current information.


This document is provided as a reference guide. Model availability, performance characteristics, and specifications may change. Always verify against official sources before making deployment decisions.