Dataclass in Python #
Dataclasses are one of those Python features that most developers use long before they fully understand them. You add @dataclass, write a few type annotations, and suddenly you get __init__, __repr__, and __eq__ for free.
It feels like magic until you hit an edge case involving inheritance, mutable defaults, ordering, or initialization logic. Then understanding what @dataclass actually generates becomes important.
This post breaks down how dataclasses work, what code they generate, and the common pitfalls worth knowing.
Dataclasses #
The Problem They Solve #
Before dataclasses (pre Python 3.7), creating a class that just holds data required a lot of boilerplate:
class Point:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def __repr__(self):
        return f"Point(x={self.x}, y={self.y}, z={self.z})"
    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return self.x == other.x and self.y == other.y and self.z == other.z
Dataclasses fix exactly this.
The Basics #
from dataclasses import dataclass
@dataclass
class Point:
    x: float
    y: float
    z: float
That generates __init__, __repr__, and __eq__ automatically, derived from the class annotations.
p1 = Point(1.0, 2.0, 3.0)
p2 = Point(1.0, 2.0, 3.0)
print(p1) # Point(x=1.0, y=2.0, z=3.0)
print(p1 == p2) # True (structural equality, not identity)
print(p1 is p2) # False (different objects)
The @dataclass decorator inspects the class at definition time, reads the annotated fields, and generates the method implementations. You can see the result in the class’s __dict__: the methods are there, just auto-written.
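As a quick check of that claim, the sketch below looks for the generated methods in the class namespace and lists the fields the decorator collected. It only uses `dataclasses.fields()` from the standard library; the variable names `generated` and `field_names` are just for illustration.

```python
from dataclasses import dataclass, fields

@dataclass
class Point:
    x: float
    y: float
    z: float

# Methods written by the decorator live directly in the class namespace.
generated = [name for name in ("__init__", "__repr__", "__eq__") if name in Point.__dict__]
print(generated)  # ['__init__', '__repr__', '__eq__']

# fields() returns one Field object per annotated attribute, in definition order.
field_names = [f.name for f in fields(Point)]
print(field_names)  # ['x', 'y', 'z']
```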
Default Values #
Fields can have defaults:
from dataclasses import dataclass
@dataclass
class Config:
    host: str = "localhost"
    port: int = 8080
    debug: bool = False
c = Config()
print(c) # Config(host='localhost', port=8080, debug=False)
c2 = Config(host="prod.thequietkernel.com", port=443)
print(c2) # Config(host='prod.thequietkernel.com', port=443, debug=False)
Important constraint: fields with defaults must come after fields without defaults. This is the same rule as Python function arguments. Violating it raises a TypeError at class definition time.
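To see the ordering rule fail, here is a small sketch that defines a deliberately broken class inside a try block. `BrokenConfig` is a hypothetical name used only for this demonstration, and the exact wording of the error message varies between Python versions, so it is printed rather than assumed.

```python
from dataclasses import dataclass

error_message = None
try:
    @dataclass
    class BrokenConfig:  # hypothetical class, for illustration only
        host: str = "localhost"
        port: int  # no default after a field that has one
except TypeError as exc:
    # Raised while the class is being defined, before any instance exists.
    error_message = str(exc)

print(error_message)
```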
field(): When Defaults Get Complicated #
What if you want a mutable default, like a list? This is a classic Python trap:
@dataclass
class BadConfig:
    tags: list = []  # ValueError: mutable default <class 'list'> for field tags is not allowed: use default_factory
The dataclasses module catches this at class definition time and raises a ValueError. The fix is field() with default_factory:
from dataclasses import dataclass, field
@dataclass
class Request:
    url: str
    headers: dict = field(default_factory=dict)
    tags: list = field(default_factory=list)
    timeout: float = field(default=30.0)
    _internal_id: str = field(default="", repr=False, compare=False)
field() gives you granular control:
- default_factory: a zero-argument callable called to produce a fresh default per instance
- repr=False: exclude this field from __repr__
- compare=False: exclude this field from __eq__ (and ordering comparisons)
- init=False: exclude from __init__; you set it yourself, typically in __post_init__
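A short sketch of these options in action, using a trimmed-down `Request` class. The URL and the `"abc123"` id are made-up values for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    url: str
    tags: list = field(default_factory=list)
    _internal_id: str = field(default="", repr=False, compare=False)

a = Request("https://example.com/a")
b = Request("https://example.com/a", _internal_id="abc123")

# compare=False: _internal_id is ignored by __eq__.
print(a == b)  # True
# repr=False: _internal_id does not appear in the repr.
print(repr(b))  # Request(url='https://example.com/a', tags=[])

# default_factory: each instance got its own list, so mutating one leaves the other alone.
a.tags.append("retry")
print(b.tags)  # []
```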
__post_init__: Computed Fields and Validation #
Sometimes you need to run initialization logic after the __init__ that @dataclass generates. That’s what __post_init__ is for:
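from dataclasses import InitVar, dataclass, field
import hashlib
@dataclass
class User:
    username: str
    email: str
    # InitVar: accepted by __init__ and passed to __post_init__, but never stored as a field,
    # so __repr__ and __eq__ do not try to read it afterwards
    password_raw: InitVar[str]
    password_hash: str = field(init=False, repr=False)
    def __post_init__(self, password_raw):
        if "@" not in self.email:
            raise ValueError(f"Invalid email: {self.email}")
        # Plain SHA-256 keeps the example short; real systems should use a salted password hash
        self.password_hash = hashlib.sha256(password_raw.encode()).hexdigest()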
user = User("vinay", "vinay@thequietkernel.com", "supersecret")
print(user)
# User(username='vinay', email='vinay@thequietkernel.com')
print(user.password_hash[:16]) # first 16 hex characters of the SHA-256 digest
__post_init__ is called at the end of the generated __init__, once the init fields have been assigned. It’s the standard place for validation logic, derived field computation, or any setup that depends on the initialized values.
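For a smaller illustration of a derived field combined with validation, here is a sketch built around a hypothetical `Rectangle` class (not part of the examples above). The `area` field is excluded from `__init__` and computed from the other two.

```python
from dataclasses import dataclass, field

@dataclass
class Rectangle:  # hypothetical class, for illustration only
    width: float
    height: float
    area: float = field(init=False)  # computed, not passed by the caller

    def __post_init__(self):
        # width and height are already assigned when this runs.
        if self.width <= 0 or self.height <= 0:
            raise ValueError("width and height must be positive")
        self.area = self.width * self.height

r = Rectangle(3.0, 4.0)
print(r)  # Rectangle(width=3.0, height=4.0, area=12.0)
```

Because `area` is a regular field, it also takes part in `__repr__` and `__eq__`.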
Frozen Dataclasses — Immutability #
Add frozen=True and instances become effectively immutable. Attempts to set attributes after creation raise a FrozenInstanceError.
from dataclasses import dataclass
@dataclass(frozen=True)
class Coordinate:
    lat: float
    lon: float
c = Coordinate(37.7749, -122.4194)
c.lat = 0.0 # FrozenInstanceError: cannot assign to field 'lat'
Frozen dataclasses are also hashable by default (because immutable objects can safely serve as dict keys), which plain dataclasses are not:
cache = {}
cache[c] = "San Francisco" # works because Coordinate is frozen and hashable
cache[Coordinate(37.7749, -122.4194)] # equal fields → equal objects with the same hash → cache hit
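The difference in hashability can be checked directly. The sketch below uses two hypothetical classes, `MutablePoint` and `FrozenPoint`, that differ only in the `frozen` flag.

```python
from dataclasses import dataclass

@dataclass
class MutablePoint:  # hypothetical, plain dataclass
    x: float
    y: float

@dataclass(frozen=True)
class FrozenPoint:  # hypothetical, frozen dataclass
    x: float
    y: float

# With eq=True (the default) and frozen=False, __hash__ is set to None.
print(MutablePoint.__hash__)  # None

try:
    hash(MutablePoint(1.0, 2.0))
    mutable_hashable = True
except TypeError:
    mutable_hashable = False
print(mutable_hashable)  # False

# Frozen instances hash by field values, so equal instances collapse in a set.
unique = {FrozenPoint(1.0, 2.0), FrozenPoint(1.0, 2.0)}
print(len(unique))  # 1
```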
Ordering #
By default, the only comparison method @dataclass generates is __eq__. To get <, <=, >, >=, add order=True:
from dataclasses import dataclass
@dataclass(order=True)
class Version:
    major: int
    minor: int
    patch: int
v1 = Version(1, 2, 0)
v2 = Version(1, 3, 0)
v3 = Version(2, 0, 0)
print(sorted([v3, v1, v2]))
# [Version(major=1, minor=2, patch=0), Version(major=1, minor=3, patch=0), Version(major=2, minor=0, patch=0)]
The comparison is field-by-field in the order they’re defined. Version(1, 2, 0) < Version(1, 3, 0) because major is equal (1 == 1), so it falls through to minor (2 < 3).
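If you want a field to be ignored by comparisons, mark it with field(compare=False). Note that this removes it from both __eq__ and the ordering methods; there is no per-field flag that drops a field from ordering alone, so that case needs hand-written comparison methods.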
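A minimal sketch of how compare=False behaves together with order=True. `Release` is a hypothetical class; its `label` field is skipped by both equality and ordering.

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Release:  # hypothetical class, for illustration only
    major: int
    minor: int
    label: str = field(default="", compare=False)  # ignored by ==, <, <=, >, >=

a = Release(1, 2, label="stable")
b = Release(1, 2, label="nightly")

print(a == b)  # True: labels differ, but label is not compared
print(a < b)   # False: (1, 2) is not less than (1, 2)
print(a <= b)  # True
print(Release(1, 2, "zzz") < Release(1, 3, "aaa"))  # True: decided by minor alone
```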
Inheritance #
Dataclasses support inheritance. Child classes get the parent’s fields first, then their own:
from dataclasses import dataclass
@dataclass
class Animal:
    name: str
    species: str
@dataclass
class Pet(Animal):
    owner: str
    vaccinated: bool = False
p = Pet(name="Rex", species="Dog", owner="Vinay")
print(p)
# Pet(name='Rex', species='Dog', owner='Vinay', vaccinated=False)
One gotcha: if a parent field has a default, child fields cannot be without defaults. This is the same “defaults must come after non-defaults” rule.
@dataclass
class Parent:
    x: int = 0  # has a default
@dataclass
class Child(Parent):
    y: int  # no default: TypeError!
The workaround is to give y a default too, restructure the inheritance chain, or, on Python 3.10 and later, make the child’s fields keyword-only with @dataclass(kw_only=True).
Serialization for dataclass #
The dataclasses module ships two conversion helpers, both of which recurse into nested dataclasses, lists, tuples, and dicts:
- asdict()
- astuple()
from dataclasses import dataclass, asdict, astuple
@dataclass
class Point:
    x: float
    y: float
p = Point(3.0, 4.0)
print(asdict(p)) # {'x': 3.0, 'y': 4.0}
print(astuple(p)) # (3.0, 4.0)
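asdict() also recurses into nested dataclasses, which makes it a convenient first step toward JSON. The sketch below assumes a hypothetical `Path` class holding a list of points; the round trip back to objects is written by hand, since the module does not provide an inverse of asdict().

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Path:  # hypothetical container class, for illustration only
    name: str
    points: list = field(default_factory=list)

path = Path("diagonal", [Point(0.0, 0.0), Point(3.0, 4.0)])

as_dict = asdict(path)  # nested Point instances become plain dicts
print(as_dict)
# {'name': 'diagonal', 'points': [{'x': 0.0, 'y': 0.0}, {'x': 3.0, 'y': 4.0}]}

payload = json.dumps(as_dict)  # plain dicts and lists serialize directly

# Manual reconstruction from the decoded JSON.
decoded = json.loads(payload)
restored = Path(decoded["name"], [Point(**p) for p in decoded["points"]])
print(restored == path)  # True
```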
Dataclass vs NamedTuple vs TypedDict #
These three are often used interchangeably, but they’re different tools:
| Feature | dataclass | NamedTuple | TypedDict |
|---|---|---|---|
| Mutable | Yes (unless frozen) | No | Yes (it’s a dict) |
| Hashable | By default, only if frozen | Yes | No |
| Inheritance | Full support | Limited | Yes |
| Method definitions | Yes | Yes | No |
| Default values | Yes | Yes | No |
| isinstance check | Yes | Yes | No (it’s a dict) |
| Memory | Object overhead | Tuple-level | Dict overhead |
| Best for | Business logic objects | Immutable records, tuple unpacking | JSON shapes, external API contracts |
NamedTuple is essentially a tuple with named access: it’s immutable and memory efficient. Best used when the data is truly a fixed-size record.
TypedDict is for annotating plain dictionaries. It adds no runtime enforcement and exists only for type checkers. Best used when dealing with JSON payloads or external data that is out of our control.
dataclass is the default choice when we want a proper object with methods and mutable state.
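To make the comparison concrete, here is the same two-field record written all three ways. The class names `PointDC`, `PointNT`, and `PointTD` are hypothetical and exist only for this side-by-side sketch.

```python
from dataclasses import dataclass
from typing import NamedTuple, TypedDict

@dataclass
class PointDC:
    x: float
    y: float

class PointNT(NamedTuple):
    x: float
    y: float

class PointTD(TypedDict):
    x: float
    y: float

dc = PointDC(1.0, 2.0)
nt = PointNT(1.0, 2.0)
td = PointTD(x=1.0, y=2.0)

dc.x = 5.0  # dataclass instances are mutable unless frozen

try:
    nt.x = 5.0  # NamedTuple fields cannot be reassigned
    nt_mutable = True
except AttributeError:
    nt_mutable = False
print(nt_mutable)  # False

x, y = nt  # a NamedTuple unpacks like any tuple
print(x, y)  # 1.0 2.0

# A TypedDict "instance" is an ordinary dict at runtime.
print(type(td) is dict)  # True
```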
Key Takeaway #
The @dataclass decorator is a code-generation tool. It inspects the class definition and generates methods such as __init__, __repr__, and __eq__, plus ordering and hashing methods when you ask for them. The real value is eliminating repetitive boilerplate while keeping the data model readable.