Guide to Image Generation Models

This guide covers the major open-weight and OpenAI text-to-image generation models as of early 2026, including their sizes, computational requirements, and what each version is specialized for.


How Image Generation Models Work

Unlike VLMs (which take images in and produce text out), image generation models take text in and produce images out. The dominant approaches are:

- Latent diffusion: iterative denoising in a VAE-compressed latent space, with a U-Net or transformer backbone (Stable Diffusion).
- Flow matching with diffusion transformers (DiT): the newer formulation used by FLUX and SD 3/3.5.
- Autoregressive token generation: predicting discrete image tokens from a VQ tokenizer (e.g., Janus-Pro).

All approaches use a text encoder (typically CLIP, T5, or more recently Mistral/Qwen) to convert the prompt into an embedding that guides generation.
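To make the pipeline concrete, here is a minimal sketch using the Hugging Face diffusers library; the checkpoint id is one example of a diffusers-compatible model, not a requirement.

```python
# Minimal text-to-image generation with diffusers.
# The checkpoint id is illustrative; any diffusers-compatible model works.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Internally the pipeline encodes the prompt with the model's text
# encoder(s), runs the iterative denoising loop, and decodes the final
# latents back to pixels with the VAE.
image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
image.save("lighthouse.png")
```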


Open-Weight Models

FLUX (Black Forest Labs)

Developer: Black Forest Labs (founded by the original creators of Stable Diffusion)
Architecture: Diffusion Transformer (DiT) + flow matching

FLUX.1 Family

| Variant | Parameters | License | Min VRAM (FP16) | Min VRAM (FP8/quantized) | Speed |
|---|---|---|---|---|---|
| FLUX.1 [schnell] | 12B | Apache 2.0 | ~33 GB | ~16 GB (FP8) | Very fast (1–4 steps) |
| FLUX.1 [dev] | 12B | Non-commercial | ~22 GB | ~12 GB (FP8) | Medium (20–50 steps) |
| FLUX.1 [pro] | 12B | API only | N/A (cloud) | N/A | Medium |

FLUX.1 specializations:

- [schnell]: timestep-distilled for 1–4 step generation; Apache 2.0, so free for commercial and local use.
- [dev]: higher quality at 20–50 steps; open weights under a non-commercial license.
- [pro]: the highest-quality variant, available only through the API.

FLUX.2 Family (November 2025)

| Variant | Parameters | License | Min VRAM (FP16) | Min VRAM (FP8/quantized) | Notes |
|---|---|---|---|---|---|
| FLUX.2 [dev] | 32B | Non-commercial | ~64 GB | ~18–20 GB (4-bit) | Open weights |
| FLUX.2 [pro] | 32B | API only | N/A | N/A | Highest quality |
| FLUX.2 [flex] | 32B | API only | N/A | N/A | Fine-grained control |
| FLUX.2 [klein] 9B | 9B | Non-commercial | ~20 GB | ~10 GB | Fast, unified gen+edit |
| FLUX.2 [klein] 4B | 4B | Apache 2.0 | ~13 GB | ~6 GB | Consumer GPU friendly |

FLUX.2 improvements over FLUX.1:

- Replaces the CLIP + T5 text encoders with an LLM-grade encoder (Mistral Small), improving prompt understanding and text rendering.
- Raises native resolution from 1024×1024 to up to 2048×2048 (4 MP).
- Unifies generation and editing in a single model (the [klein] variants).
- Scales the DiT backbone from 12B to 32B parameters for [dev]/[pro], with 9B and 4B [klein] variants for consumer hardware.

Best for: FLUX is currently the leading open-weight image generation family. FLUX.1 [schnell] for fast/free generation; FLUX.2 [klein] 4B for consumer-grade real-time use; FLUX.2 [dev] for highest open-weight quality.
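A minimal sketch of running FLUX.1 [schnell] locally with diffusers, assuming enough VRAM (or CPU offload) for the 12B model; the repo id matches the official Hub release.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # trades speed for fitting in less VRAM

# schnell is timestep-distilled: 1-4 steps, and guidance is baked in,
# so CFG is disabled with guidance_scale=0.0.
image = pipe(
    "a macro photo of a dewdrop on a fern leaf",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("flux_schnell.png")
```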


Stable Diffusion (Stability AI)

Developer: Stability AI
Architecture: Latent diffusion (U-Net or DiT based, depending on version) + VAE

| Version | Parameters | Architecture | Min VRAM | Resolution | License |
|---|---|---|---|---|---|
| SD 1.5 | ~860M | U-Net | ~4 GB | 512×512 | CreativeML Open RAIL-M |
| SDXL | ~3.5B (base + refiner) | U-Net | ~8 GB | 1024×1024 | Open RAIL++ |
| SDXL Turbo | ~3.5B | Distilled U-Net | ~8 GB | 512×512 | Research only |
| SDXL Lightning | ~3.5B | Distilled U-Net | ~8 GB | 1024×1024 | Open RAIL++ |
| SD 3 Medium | 2B | DiT (MMDiT) | ~6 GB | 1024×1024 | Community license |
| SD 3.5 Large | 8B | DiT (MMDiT) | ~12 GB | 1024×1024 | Community license |
| SD 3.5 Large Turbo | 8B | Distilled DiT | ~12 GB | 1024×1024 | Community license |

Version specializations:

- SD 1.5: the smallest and most widely fine-tuned version; runs on ~4 GB VRAM.
- SDXL: higher native resolution (1024×1024) and more detail; the largest LoRA and ControlNet ecosystem.
- SDXL Turbo / Lightning: distilled for 1–4 step generation at some cost in quality.
- SD 3 / 3.5: MMDiT architecture with a T5 encoder, adding reliable text-in-image rendering.

Best for: Stable Diffusion has the most mature ecosystem (AUTOMATIC1111, ComfyUI, thousands of fine-tunes and LoRAs). SD 1.5 and SDXL remain practical for anyone with existing workflows. SD 3.5 is the best choice for new projects needing text-in-image capability. The small model sizes make Stable Diffusion the most accessible family for limited hardware.
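As a sketch, SD 3.5 Large through diffusers; the repo id is from Stability AI's Hub organization, and access requires accepting the community license.

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

# A prompt with embedded text exercises SD 3.5's text-in-image strength.
image = pipe(
    'a neon storefront sign that reads "OPEN LATE", rainy night',
    num_inference_steps=28,
    guidance_scale=4.5,
).images[0]
image.save("sd35.png")
```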


DeepFloyd IF (Stability AI / DeepFloyd)

Developer: DeepFloyd (Stability AI research lab)
Architecture: Cascaded pixel-space diffusion (3 stages) + frozen T5-XXL text encoder
License: Research only

| Stage | Resolution | Parameters |
|---|---|---|
| Stage 1 | 64×64 | ~4.3B |
| Stage 2 (upscaler) | 256×256 | ~1.2B |
| Stage 3 (upscaler) | 1024×1024 | ~1.2B |
| T5-XXL text encoder | N/A | ~4.7B |

Total VRAM: ~40 GB for the full pipeline (can be run stage-by-stage with ~16 GB)

Key characteristics:

- Operates directly in pixel space rather than a VAE latent space, unlike SD and FLUX.
- Cascades three diffusion models, each conditioned on the previous stage's output.
- The frozen T5-XXL encoder gave it unusually strong text rendering for its era.

Best for: Academic research and understanding cascaded diffusion architectures. Largely superseded by FLUX and SD 3.5 for practical use.
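For those studying the cascade, a stage-by-stage sketch with diffusers follows; the DeepFloyd repos are gated, so the license must be accepted on the Hub first, and stage 3 (a separate x4 upscaler pipeline) is omitted for brevity.

```python
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None,  # reuses stage 1's T5
    variant="fp16", torch_dtype=torch.float16
)
stage_1.enable_model_cpu_offload()
stage_2.enable_model_cpu_offload()

# The T5-XXL embeddings are computed once and shared across stages.
prompt_embeds, negative_embeds = stage_1.encode_prompt(
    "a robot reading a newspaper"
)

# Stage 1 generates 64x64 pixels; stage 2 upscales to 256x256,
# conditioned on both the image and the prompt embeddings.
image = stage_1(prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                output_type="pt").images
image = stage_2(image=image, prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                output_type="pt").images
```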


Community & Emerging Open Models

Z-Image-Turbo — 6B parameters; among the most-downloaded image generation models on Hugging Face. Strong quality at a fraction of FLUX.2's size, with a good balance of speed and quality for consumer hardware.

LongCat-Image (Meituan) — 6B parameters. Bilingual Chinese-English text rendering. Strong photorealism. Has separate dev (for LoRA training) and edit (for image editing) variants.

Ovis-Image — 7B parameters. Specialized for high-quality text rendering in generated images (posters, logos, UI mockups). Less suited for photorealism.

Janus-Pro (DeepSeek) — Unique in that it can both understand and generate images using a unified architecture. Available in 1.5B and 7B sizes. Uses a VQ tokenizer for autoregressive image generation rather than diffusion.

Playground 2.5 — Based on SDXL, trained to mimic Midjourney's aesthetic. Produces polished, detailed images without complex prompting. Good for users who want a "just works" artistic style.


OpenAI Models

OpenAI's image generation models are proprietary and API-only. No local deployment is possible. Parameter counts are not disclosed.

GPT Image Family (Current)

| Model | Released | Price per Image (1024×1024) | Resolutions | Key Strength |
|---|---|---|---|---|
| gpt-image-1.5 | 2025 | ~$0.02–$0.19 (low–high quality) | 1024², 1024×1536, 1536×1024 | State-of-the-art quality, best text rendering |
| gpt-image-1 | April 2025 | ~$0.02–$0.19 (low–high quality) | 1024², 1024×1536, 1536×1024 | Professional grade, streaming support |
| gpt-image-1-mini | 2025 | Lower than gpt-image-1 | 1024² | Cost-effective, lower quality |

Key characteristics:

- API-only: no weights, no local deployment, no disclosed parameter counts.
- Quality tiers (low/medium/high) trade cost against fidelity within the same model.
- Square (1024²) plus portrait and landscape (1024×1536, 1536×1024) resolutions.
- gpt-image-1 supports streaming partial images during generation.
- Supports image editing in addition to pure text-to-image generation.

DALL-E 3 (Deprecated)

| Model | Price per Image | Resolutions |
|---|---|---|
| DALL-E 3 Standard | $0.040 (1024²), $0.080 (1792×1024) | 1024², 1024×1792, 1792×1024 |
| DALL-E 3 HD | $0.080 (1024²), $0.120 (1792×1024) | 1024², 1024×1792, 1792×1024 |

Note: DALL-E 3 is deprecated and will be removed from the API on May 12, 2026. It was replaced by GPT Image models in ChatGPT in March 2025. DALL-E 3 used a diffusion architecture and was notable for its strong prompt adherence (it automatically expanded brief prompts into detailed descriptions via an internal ChatGPT rewrite).

Best for: OpenAI's GPT Image models are the easiest way to get high-quality image generation via API. No hardware requirements, no setup. Best text rendering and prompt understanding of any available system. Cost adds up at scale.
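A minimal sketch with the official openai Python SDK; the quality tiers correspond to the low–high price range in the table above.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-1",
    prompt="a cutaway diagram of a geothermal power plant",
    size="1024x1024",
    quality="low",  # "low" | "medium" | "high"; higher tiers cost more
)

# GPT Image models return base64-encoded image data.
with open("plant.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```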


Key Technical Concepts

Guidance Scale (CFG)

Controls how closely the output follows the text prompt. Higher values = more literal prompt following but potentially less natural-looking. Typical range: 5–15 for most models.

Sampling Steps

The number of denoising iterations. More steps = higher quality but slower. Typical range: 20–50 for standard models, 1–4 for distilled/turbo variants.
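Both knobs surface as keyword arguments on a diffusers pipeline call; a sketch, reusing a pipe object like the ones constructed in the sections above:

```python
# Typical settings for a standard (non-distilled) model.
image = pipe(
    "an isometric illustration of a tiny island village",
    guidance_scale=7.5,      # CFG: higher = more literal prompt following
    num_inference_steps=30,  # more denoising iterations = slower, sharper
).images[0]
```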

LoRA (Low-Rank Adaptation)

A lightweight fine-tuning technique that trains a small adapter (typically 10–100 MB) to customize a model's style or teach it new concepts. Widely used with Stable Diffusion and FLUX. Much cheaper than full fine-tuning.
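In diffusers, attaching a LoRA to an existing pipeline is two calls; the adapter repo id and name below are placeholders.

```python
# Load a LoRA adapter and blend it into the base weights at 80% strength.
pipe.load_lora_weights("some-user/some-style-lora", adapter_name="style")
pipe.set_adapters(["style"], adapter_weights=[0.8])

image = pipe("a portrait in the adapter's style").images[0]
```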

ControlNet

An auxiliary network that adds spatial conditioning to diffusion models — you can guide generation using edge maps, depth maps, pose skeletons, etc. Available for SD 1.5, SDXL, and FLUX.
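A sketch of Canny-edge conditioning on SD 1.5 with diffusers; the ControlNet checkpoint is one of the widely used lllyasviel releases, and the base-model repo id reflects the current Hub mirror.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Extract an edge map from a reference photo; generation then follows
# the spatial layout of those edges.
ref = np.array(load_image("reference.jpg"))
edges = cv2.Canny(ref, 100, 200)
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe("a stained-glass rendition of the scene",
             image=edge_image).images[0]
```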

VAE (Variational Autoencoder)

Compresses images to/from latent space. All latent diffusion models (SD, FLUX) use a VAE. The diffusion process runs in the compressed latent space, then the VAE decoder converts back to pixels.
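The compression ratio is easy to see directly; a sketch using a standard SD 1.5 VAE checkpoint (a 512×512 RGB image becomes a 4×64×64 latent, roughly 48x fewer values):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

x = torch.randn(1, 3, 512, 512)  # stand-in for a normalized RGB image
with torch.no_grad():
    # Encode to the latent space the diffusion process actually runs in.
    latents = vae.encode(x).latent_dist.sample() * vae.config.scaling_factor
    # Decode back to pixel space, as the final step of generation.
    recon = vae.decode(latents / vae.config.scaling_factor).sample

print(latents.shape)  # torch.Size([1, 4, 64, 64])
print(recon.shape)    # torch.Size([1, 3, 512, 512])
```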

Text Encoders

The component that converts your text prompt into embeddings that guide generation:

- CLIP: compact and fast (SD 1.5, SDXL), but limited understanding of long, compositional prompts.
- T5: a larger language-model encoder added in SD 3 and FLUX.1; markedly better prompt comprehension and text rendering.
- LLM-based encoders (Mistral Small in FLUX.2, Qwen in some community models): the current trend, bringing LLM-grade language understanding to prompt conditioning.
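Concretely, encoding a prompt is a tokenizer plus a forward pass; a sketch with the CLIP ViT-L/14 encoder that SD 1.5 uses (77 tokens × 768 dimensions):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a red bicycle leaning against a brick wall",
    padding="max_length", max_length=77, return_tensors="pt",
)
with torch.no_grad():
    # One embedding per token position; this sequence conditions the
    # denoiser via cross-attention.
    embeddings = encoder(**tokens).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768])
```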

Computational Requirements Summary

Quick VRAM Reference

| Hardware | Can Run |
|---|---|
| 4 GB VRAM | SD 1.5 (basic) |
| 6–8 GB VRAM | SD 1.5, SDXL (with optimizations), SD 3 Medium, FLUX.2 [klein] 4B (quantized) |
| 12–16 GB VRAM | SDXL comfortably, SD 3.5 Large, FLUX.1 [schnell] (FP8), FLUX.2 [klein] 4B |
| 24 GB VRAM (RTX 4090) | FLUX.1 [dev] (FP16), FLUX.2 [dev] (4-bit quantized), all SD variants |
| 48+ GB VRAM | FLUX.2 [dev] (FP8), full pipeline with ControlNets |
| 64+ GB VRAM | FLUX.2 [dev] (FP16) |

Generation Speed (approximate, RTX 4090, 1024×1024)

| Model | Steps | Time |
|---|---|---|
| SD 1.5 | 20 | ~2 seconds |
| SDXL | 20 | ~4 seconds |
| SDXL Lightning | 4 | <1 second |
| SD 3.5 Large | 28 | ~6 seconds |
| FLUX.1 [schnell] | 4 | ~3 seconds |
| FLUX.1 [dev] | 20 | ~10 seconds |
| FLUX.2 [klein] 4B | 4 | <1 second |
| FLUX.2 [dev] (FP8) | 20 | ~15 seconds |

Quick Comparison for Classroom Use

| Goal | Recommended Model | Why |
|---|---|---|
| Smallest/fastest local setup | SD 1.5 or SD 3 Medium | Runs on 4–6 GB VRAM, huge ecosystem |
| Best free & open quality | FLUX.1 [schnell] | Apache 2.0, very fast, 12B params |
| Consumer GPU, high quality | FLUX.2 [klein] 4B | Apache 2.0, <1 second, ~13 GB VRAM |
| Understanding diffusion concepts | SD 1.5 via diffusers library | Simplest architecture, most tutorials available |
| No local hardware needed | OpenAI GPT Image API | API-only, ~$0.02–$0.19/image |
| Fine-tuning / LoRA training | SDXL or FLUX.1 [dev] | Best ecosystem for LoRA and ControlNet |
| Text rendering in images | SD 3.5 Large or FLUX.2 | Both handle text well due to T5/Mistral encoders |
| Image editing | FLUX.2 [klein] or GPT Image API | Both support unified generation + editing |

Architecture Comparison

| Feature | SD 1.5 / SDXL | SD 3 / 3.5 | FLUX.1 | FLUX.2 |
|---|---|---|---|---|
| Denoising backbone | U-Net | MMDiT (Transformer) | DiT (Transformer) | DiT (Transformer) |
| Text encoder(s) | CLIP | CLIP ×2 + T5 | CLIP + T5 | Mistral Small |
| Latent space | Yes (VAE) | Yes (VAE) | Yes (VAE) | Yes (VAE) |
| Approach | Diffusion | Flow matching (rectified flow) | Flow matching | Flow matching |
| Text rendering | Poor (SD 1.5), Fair (SDXL) | Good | Good | Excellent |
| Native resolution | 512 (1.5) / 1024 (XL) | 1024 | 1024 | Up to 2048 (4MP) |
| Typical parameters | 0.8B–3.5B | 2B–8B | 12B | 4B–32B |

The clear trend is toward transformer-based backbones (replacing U-Nets), flow matching (replacing classical diffusion), and LLM-grade text encoders (replacing CLIP alone). Each generation roughly doubles the parameter count while improving quality and text understanding; distillation keeps generation speed practical despite the growth.