End-to-end recipes for optimizing diffusion models with torchao and diffusers (inference and FP8 training).
Updated Jan 8, 2026 - Python
A survey of modern quantization formats (e.g., MXFP8, NVFP4) and inference optimization tools (e.g., TorchAO, GemLite), illustrated through the example of Llama-3.1 inference.
FLUX.1-dev on AMD Radeon consumer GPUs — fast, low-VRAM, and shippable. Backport patches + benchmarks for torchao + diffusers group_offload on ROCm.
Flux.2-Klein-Small-Decoder-Only is an experimental, high-performance image generation and editing application built exclusively around the FLUX.2-klein-4B model, paired with the specialized FLUX.2-small-decoder Variational Autoencoder (VAE).
Deploy AI models with an API through quantization and containerization.
Run FLUX.1-dev on AMD Radeon GPUs using ROCm with backport patches, optimized scripts, and support for low-VRAM configurations.
Block-scaled FP8 / FP4 / INT4 tensor primitive with Triton scaled-matmul at FP32 parity on H100. NumPy / PyTorch / MLX / JAX backends.
Identity-preserving image-to-video generation: vision-grounded prompt simplification via Qwen3-VL, Lightning LoRA 4-step inference, and SAM3-masked DINOv3 candidate reranking for fluid 720p video from a single reference image.
Measuring what makes a VLA fast enough to run on the robot: a 5.9x CUDA-graph win, four experiments on why low-bit doesn't, a budget-driven deploy-compiler, and a runtime safety supervisor. Live demo: hf.co/spaces/LaelaZ/embodied-efficiency
This repository contains code for benchmarking ModernBERT, RoBERTa, and OPT-350m on multi-class emotion classification using 8-bit quantization, backbone freezing, and LoRA-based PEFT.