60-second takeaway
FP8 on an RTX 3090 Ti is real, but it is mostly a VRAM-saving storage trick, not an FP8 acceleration path. The 3090 Ti is an Ampere GPU with compute capability 8.6. It can hold weights in torch.float8_e4m3fn, but native FP8 tensor-core compute is not the path you should count on. The reliable consumer-GPU recipe is to store large diffusion transformer weights in FP8, then compute in BF16. If a model almost fits in 24 GB, use diffusers layerwise casting first. If it still does not fit, combine FP8 storage on the diffusion transformer with NF4 on the text encoder.
Who this is for
This guide is for builders running image or video generation models on a single 24 GB NVIDIA GPU:
RTX 3090
RTX 3090 Ti
A10
RTX 4090
L40 / L40S
The most common reader has a model that almost fits. A README, issue comment, or benchmark says "use FP8", but the same recipe was probably written on a 4090, L40, H100, or newer card. On a 3090 Ti, that detail matters.
The question is not "does PyTorch expose FP8 dtypes?" It does. The question is:
Which FP8 paths actually help on Ampere, and which ones quietly assume newer hardware?
| Architecture | Example GPUs | Compute capability | FP8 status |
| --- | --- | --- | --- |
| Ampere | RTX 3090, RTX 3090 Ti, A10 | 8.0 / 8.6 | FP8 storage is useful; compute should stay BF16 or FP16 |
| Ada | RTX 4090, L4, L40, RTX 6000 Ada | 8.9 | FP8 tensor-core paths become hardware-relevant |
| Hopper | H100 | 9.0 | FP8 is a first-class datacenter training/inference path |
NVIDIA's CUDA tuning guide identifies Ampere as compute capability 8.0 and 8.6, and Ada as compute capability 8.9. NVIDIA's Ada architecture materials also call out fourth-generation Tensor Cores with FP8 support, while Ampere materials emphasize TF32, BF16, FP16, INT8, and INT4 rather than FP8.
That is the boundary. The 3090 Ti can be very useful for local AI production, but it is on the wrong side of the native FP8 compute line.
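If you are unsure which side of the line a card sits on, you can check at runtime instead of guessing from marketing names. This is a minimal sketch using PyTorch's device query; the 8.9 threshold mirrors the table above.

```python
import torch

# Ask the driver which compute capability the current CUDA device reports.
major, minor = torch.cuda.get_device_capability()

if (major, minor) >= (8, 9):
    print(f"sm_{major}{minor}: native FP8 tensor-core paths are at least plausible")
else:
    print(f"sm_{major}{minor}: treat FP8 as storage only; compute in BF16 or FP16")
```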
Why this confuses people
There are three different ideas hiding under the word "FP8":
| Phrase | What it means | 3090 Ti result |
| --- | --- | --- |
| FP8 dtype exists | PyTorch can represent tensors with a float8 dtype | True |
| FP8 storage | Weights are stored in FP8 and cast up for compute | Useful |
| FP8 compute | Matrix multiply uses FP8 tensor-core kernels | Not the reliable Ampere path |
Most blog posts and repo comments blur these together. That is how people end up trying FP8 activation quantization recipes on a 3090 Ti, then wondering why the result fails, falls back, or produces poor output.
The rule for Ampere is simple:
Use FP8 to store weights. Use BF16 or FP16 to compute.
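A toy PyTorch sketch of that rule, with arbitrary shapes; torch.float8_e4m3fn needs a reasonably recent PyTorch build:

```python
import torch

# FP8 storage: the weight sits in VRAM at one byte per parameter.
weight_fp8 = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda").to(torch.float8_e4m3fn)

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")

# BF16 compute: upcast just before the matmul so it runs on ordinary BF16 kernels.
# A plain `x @ weight_fp8` is not the supported path here; the upcast is the point.
y = x @ weight_fp8.to(torch.bfloat16)
print(y.dtype)  # torch.bfloat16
```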
The recipe that worked on RTX 3090 Ti
The cleanest path is diffusers layerwise casting. It is designed exactly for this use case: keep module weights in a low-memory storage dtype, then upcast layer by layer during the forward pass.
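In code this is a single call on the transformer component. A sketch, assuming a diffusers release that exposes enable_layerwise_casting on model components; the repo id is a placeholder:

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder repo id: point this at the complete diffusers repo for your model.
pipe = DiffusionPipeline.from_pretrained("org/your-model-repo", torch_dtype=torch.bfloat16)

# Weights stay in FP8 at rest; each layer is upcast to BF16 for its forward pass.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
pipe.to("cuda")
```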
For a 9B transformer plus an 8B text encoder, the rough 3090 Ti budget looks like this:
| Component | Technique | Approx. memory |
| --- | --- | --- |
| Diffusion transformer | FP8 storage, BF16 compute | ~9 GB |
| Text encoder | NF4 double quant | ~4 GB |
| VAE | BF16 | ~0.2 GB |
| Activations and overhead | depends on resolution | ~2-3 GB |
| Total | | ~15-16 GB |
That leaves real headroom on a 24 GB card.
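A loader sketch that matches that budget, combining NF4 on the text encoder with FP8 layerwise casting on the transformer. The repo id, encoder class, and subfolder name are placeholders; swap in whatever your pipeline actually uses:

```python
import torch
from transformers import BitsAndBytesConfig, T5EncoderModel
from diffusers import DiffusionPipeline

repo = "org/your-model-repo"  # placeholder

# Text encoder: NF4 with double quantization (the ~4 GB row above).
nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
text_encoder = T5EncoderModel.from_pretrained(
    repo, subfolder="text_encoder", quantization_config=nf4, torch_dtype=torch.bfloat16
)

# Remaining components in BF16, reusing the quantized encoder.
pipe = DiffusionPipeline.from_pretrained(repo, text_encoder=text_encoder, torch_dtype=torch.bfloat16)

# Diffusion transformer: FP8 storage, BF16 compute (the ~9 GB row above).
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
pipe.to("cuda")
```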
What not to do first
1. Do not start with FP8 activation quantization
Activation quantization is a compute-path feature, not just a storage feature. This is where the hardware boundary bites.
If you are on a 3090 Ti, start with layerwise casting. Keep the compute dtype BF16 or FP16.
2. Do not assume a single-image smoke test proves the batch run
We have seen this pattern repeatedly:
One image passes.
The second or third image OOMs.
The issue is not the model "not fitting" in the simple sense.
The issue is component swapping, allocator fragmentation, or repeated-pass overhead.
If your run depends on CPU offload, validate with a multi-image batch. A clean first forward pass is not enough.
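A minimal version of that check, assuming pipe is the pipeline from the loader sketch above; the prompts and step count are arbitrary:

```python
import torch

prompts = ["a lighthouse at dusk", "a foggy pine forest", "a city street in the rain"]

torch.cuda.reset_peak_memory_stats()
for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"generation {i}: peak VRAM so far {peak_gb:.1f} GB")  # watch for growth across iterations
    image.save(f"smoke_{i}.png")
```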
3. Do not assume a pre-quantized FP8 repo is a diffusers pipeline
Some FP8 model repos ship as a single safetensors file. That can be useful, but it is not the same as a complete diffusers repo with model_index.json, scheduler config, VAE config, and component subfolders.
The safer path is:
Load the complete BF16 or standard diffusers repo.
Apply quantization on load.
Save your working loader script.
Use a single-file FP8 checkpoint only when you are ready to handle conversion and state-dict loading details.
Technique comparison on RTX 3090 Ti
| Technique | Memory effect | Speed effect on 3090 Ti | Quality risk | Use when |
| --- | --- | --- | --- | --- |
| BF16 baseline | Highest memory | Reference | Lowest | Model already fits |
| FP8 layerwise casting | Cuts weight memory ~50% | Usually not faster | Low | Model almost fits |
| torchao float8 weight-only | Similar memory goal | Usually not faster | Low to medium | You need a torchao-specific flow |
| NF4 bitsandbytes | Cuts memory more aggressively | Can be slower | Medium | Text encoder is too large |
| INT8 weight-only | Moderate memory reduction | Usually not faster | Low to medium | FP8 path is awkward or unsupported |
| CPU offload | Reduces peak GPU residency | Slower | Low | Components fit one at a time |
Default recommendation:
Try BF16 first if it fits.
If it almost fits, apply FP8 layerwise casting to the transformer.
If the text encoder is the next bottleneck, use NF4 for that component.
If total component size exceeds VRAM by a lot, add CPU offload and test more than one generation.
A practical decision tree
If the model fits in BF16
Do not quantize by default. You will save engineering time and avoid surprising quality drift.
If the text encoder is the next bottleneck
Use transformers.BitsAndBytesConfig for the text encoder. Keep this separate from the diffusers quantization config classes.
from transformers import BitsAndBytesConfig
The text encoder is usually a better NF4 target than the diffusion transformer because a small quality loss in the text embedding path is often less visible than quantization damage in the denoising transformer.
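Continuing that import, a hedged sketch of the NF4 configuration for the text encoder; pass it as quantization_config= when loading the encoder component:

```python
import torch
from transformers import BitsAndBytesConfig

# NF4 with double quantization; the encoder still computes in BF16 when it runs.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```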
If both components individually fit, but the pipeline OOMs
Use component-level offload and validate with repeated generations. The trap is thinking "each component fits" means "the whole run is stable." It often does not.
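A sketch of that setup on a diffusers pipeline; enable_model_cpu_offload keeps whole components on the GPU only while they are needed, which is exactly why a single passing image proves little:

```python
# Swap components (text encoder, transformer, VAE) onto the GPU only while each one runs.
pipe.enable_model_cpu_offload()

# Re-run the multi-image check: offload problems usually surface on the second or third pass.
for i in range(3):
    pipe("offload stability check", num_inference_steps=20)
```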
If a README says "use FP8 for speed"
Check the GPU used in that benchmark. If it was Ada, Hopper, or Blackwell, do not transfer the speed claim to Ampere.
The two BitsAndBytesConfig classes trap
There are two similarly named config classes:
| Component | Correct config class |
| --- | --- |
| Transformers text encoder | transformers.BitsAndBytesConfig |
| Diffusers model component | diffusers.BitsAndBytesConfig |
They have overlapping parameter names, but they are not interchangeable. Mixing them is a fast way to get confusing loader errors.
For most 3090 Ti diffusion work, avoid this trap by using:
transformers.BitsAndBytesConfig for the text encoder
enable_layerwise_casting() for the diffusion transformer
That keeps the two worlds clean.
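One way to keep the separation visible in a loader script is to alias the imports so the two classes cannot be mixed up; a style suggestion, not a requirement:

```python
import torch
from transformers import BitsAndBytesConfig as TransformersBnbConfig  # text encoder side
# from diffusers import BitsAndBytesConfig as DiffusersBnbConfig      # only if you quantize diffusers components

text_encoder_config = TransformersBnbConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```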
What we validated
Our local validation was on an RTX 3090 Ti with 24 GB VRAM.
The durable findings:
FP8 layerwise casting is useful for large diffusion transformers.
On Ampere, the value is VRAM reduction, not FP8 compute acceleration.
A 9B transformer can become practical in a 24 GB pipeline when stored in FP8.
Combining FP8 transformer storage with NF4 text encoder quantization creates enough headroom for FLUX-style pipelines.
CPU offload can pass a smoke test and still fail in repeated runs if the component budget is too tight.
FP8 on a 3090 Ti is not fake. It is just narrower than the marketing phrase suggests.
Use it to make weights smaller. Do not expect Ampere to behave like Ada or Hopper. If you keep that boundary clear, FP8 storage is one of the simplest ways to make large image and video generation models practical on hardware many indie builders already own.