
Calculating LLM GPU Memory Requirements


One of the most common questions when pulling a model from Hugging Face is: “Will this model fit on my GPU?”

Most people look at the model size on Hugging Face, see a 14 GB download, and think: “Sure, it’ll fit on my 24 GB RTX 3090.” Then CUDA throws an Out Of Memory (OOM) error the moment you increase the context length or batch size.

In this post, let’s explore hf-mem, a tool that automates the memory calculation for LLMs without needing to download the model first.

Automating Memory Calculation

Calculating GPU memory requirements manually is tedious and error-prone. hf-mem is a CLI tool by Alvaro Bartolome that estimates inference memory usage by inspecting the model’s metadata on Hugging Face—without downloading the weights.
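This is possible because safetensors files begin with a small JSON header listing every tensor’s dtype and shape, so parameter counts can be read without fetching the weights. As a rough sketch of the same idea in Python (this is not hf-mem’s actual implementation, and the byte sizes and metadata fields below are my assumptions about the huggingface_hub API, so verify against your installed version):

# Sketch: estimate weight memory from safetensors metadata alone,
# without downloading the weights. Illustrative only; not hf-mem's code.
from huggingface_hub import HfApi

# Assumed bytes-per-parameter for common safetensors dtypes
BYTES_PER_PARAM = {"F32": 4, "F16": 2, "BF16": 2, "I8": 1}

def estimate_weight_memory(model_id: str) -> None:
    meta = HfApi().get_safetensors_metadata(model_id)
    # parameter_count maps dtype name -> number of parameters
    for dtype, count in meta.parameter_count.items():
        gib = count * BYTES_PER_PARAM.get(dtype, 4) / 2**30
        print(f"{dtype}: {count / 1e6:.2f}M params ~ {gib:.2f} GiB")

estimate_weight_memory("google/embeddinggemma-300m")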

Installation

The recommended way to run it is with uv (uvx runs the tool in a fast, isolated environment without a permanent install), but pip works too.

# Run hf-mem in an isolated, ephemeral environment with uv
uvx hf-mem

# Or install it with pip
pip install hf-mem

Basic Usage

To check the base memory requirements for a model, just run:

hf-mem --model-id google/embeddinggemma-300m

[Screenshot: hf-mem output for google/embeddinggemma-300m]

hf-mem --model-id Qwen/Qwen3.5-397B-A17B

[Screenshot: hf-mem output for Qwen/Qwen3.5-397B-A17B]

Breakdown of output:

TOTAL MEMORY REQUIREMENTS
1.15 GB (307.58M PARAMS)

The total parameter count is ~307 million, and the estimated inference memory is 1.15 GB.

This is model weight memory only, not full runtime memory, which also includes the KV cache.
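For intuition on why increasing the context length triggers OOM: the KV cache grows linearly with sequence length and batch size. A standard back-of-the-envelope formula for a decoder-only transformer, as a sketch (the architecture numbers below are illustrative assumptions, not values hf-mem reports):

# Rough KV-cache size for a decoder-only transformer.
# All architecture numbers here are illustrative assumptions.
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, batch, bytes_per_el=2):
    # Factor of 2 covers both the K and V tensors cached per layer
    total = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el
    return total / 2**30

# A hypothetical 32-layer model with 8 KV heads of dim 128, at 8k context
# in BF16, needs about 1 GiB of KV cache per sequence:
print(f"{kv_cache_gib(32, 8, 128, 8192, 1):.2f} GiB")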

0_TRANSFORMER (302.86M PARAMS, 1.13 GB)

This means nearly all parameters sit in the transformer stack. Doing the math: 302.86M / 307.58M params ≈ 98% of the model’s parameters live inside the transformer layers.

F32 - Float32 

This refers to Float32 precision, meaning each parameter in the model is stored as a 32-bit floating-point number. Precision directly determines memory usage.

Doing the math

A single Float32 parameter occupies 4 bytes of memory.

In the transformer block:

302.86M parameters
F32
1.13 GB

which means hf-mem is calculating: 302.86 million × 4 bytes = 1,211,440,000 bytes ≈ 1.13 GiB
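Spelled out in Python as a quick sanity check of that arithmetic (not hf-mem’s code):

# 302.86M F32 parameters at 4 bytes each, converted to GiB
print(302.86e6 * 4 / 2**30)  # ~1.13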

BF16 (Brain Float16) – 2 bytes per parameter
403.40B PARAMS
BF16
751.39 GB

which means hf-mem is calculating:

403.40 billion × 2 bytes = 806.8 billion bytes
1 GiB = 1,073,741,824 bytes
806,800,000,000 / 1,073,741,824 ≈ 751.39 GiB
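
The same sanity check in Python:

# 403.40B BF16 parameters at 2 bytes each, converted to GiB
print(403.40e9 * 2 / 2**30)  # ~751.39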