
Dataclass in Python

Python Software-Engineering Object-Oriented-Programming Dataclasses
Vinay V
Author
A photon with hope, bounded in space, striving to make a difference through time..

Dataclasses are one of those Python features that most developers use long before they fully understand them. You add @dataclass, write a few type annotations, and suddenly you get __init__, __repr__, and __eq__ for free.

It feels like magic until you hit an edge case involving inheritance, mutable defaults, ordering, or initialization logic. Then understanding what @dataclass actually generates becomes important.

This post breaks down how dataclasses work, what code they generate, and the common pitfalls worth knowing.


Dataclasses
#

The Problem They Solve
#

Before dataclasses arrived in Python 3.7, creating a class that just holds data required a lot of boilerplate:

class Point:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def __repr__(self):
        return f"Point(x={self.x}, y={self.y}, z={self.z})"

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return self.x == other.x and self.y == other.y and self.z == other.z

Dataclasses fix exactly this.


The Basics
#

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
    z: float

That generates __init__, __repr__, and __eq__ automatically, derived from the class annotations.

p1 = Point(1.0, 2.0, 3.0)
p2 = Point(1.0, 2.0, 3.0)

print(p1)         # Point(x=1.0, y=2.0, z=3.0)
print(p1 == p2)   # True  (structural equality, not identity)
print(p1 is p2)   # False (different objects)

The @dataclass decorator inspects the class body at definition time, reads all the annotated fields, and generates the method implementations. You can see what it generated by looking at the class’s __dict__: the methods are there, just auto-written.
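To make the code generation visible, here is a small sketch that looks for the generated methods in the class namespace and lists the collected fields. It redefines Point so it runs on its own; the variable names generated and field_names are only illustrative.

```python
import inspect
from dataclasses import dataclass, fields

@dataclass
class Point:
    x: float
    y: float
    z: float

# The generated methods live directly in the class namespace.
generated = [name for name in ("__init__", "__repr__", "__eq__") if name in Point.__dict__]
print(generated)  # ['__init__', '__repr__', '__eq__']

# The signature of the generated __init__ mirrors the annotated fields.
print(inspect.signature(Point.__init__))

# fields() returns the Field objects the decorator collected, in definition order.
field_names = [f.name for f in fields(Point)]
print(field_names)  # ['x', 'y', 'z']
```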


Default Values
#

Fields can have defaults:

from dataclasses import dataclass

@dataclass
class Config:
    host: str = "localhost"
    port: int = 8080
    debug: bool = False

c = Config()
print(c)  # Config(host='localhost', port=8080, debug=False)

c2 = Config(host="prod.thequietkernel.com", port=443)
print(c2)  # Config(host='prod.thequietkernel.com', port=443, debug=False)

Important constraint: fields with defaults must come after fields without defaults. This is the same rule as Python function arguments. Violating it raises a TypeError at class definition time.
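A quick way to see this rule in action is to define a class that breaks it and catch the error. BrokenConfig is a hypothetical class used only for this illustration, and the exact wording of the message can vary between Python versions.

```python
from dataclasses import dataclass

error_message = None
try:
    @dataclass
    class BrokenConfig:
        host: str = "localhost"
        port: int  # no default, but it follows a field that has one
except TypeError as exc:
    # Raised while the class is being defined, before any instance exists.
    error_message = str(exc)

print(error_message)
```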


field() — When Defaults Get Complicated
#

What if you want a mutable default, like a list? This is a classic Python trap:

@dataclass
class BadConfig:
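    tags: list = []  # ValueError: mutable default <class 'list'> for field tags is not allowed: use default_factory

The dataclass machinery catches this at class definition time and raises a ValueError for defaults such as list, dict, and set. The fix is field() with default_factory: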

from dataclasses import dataclass, field

@dataclass
class Request:
    url: str
    headers: dict = field(default_factory=dict)
    tags: list = field(default_factory=list)
    timeout: float = field(default=30.0)
    _internal_id: str = field(default="", repr=False, compare=False)

field() gives you granular control:

  • default_factory: a zero-argument callable called to produce a fresh default per instance
  • repr=False: exclude this field from __repr__
  • compare=False: exclude this field from __eq__ (and ordering comparisons)
  • init=False: exclude from __init__ — you set it yourself, typically in __post_init__
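The options above can be checked with a short sketch. Job and its fields are hypothetical names chosen for the example; the point is that default_factory gives each instance its own list, while repr=False and compare=False hide a field from __repr__ and __eq__.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    tags: list = field(default_factory=list)
    trace_id: str = field(default="", repr=False, compare=False)

a = Job("build")
b = Job("build", trace_id="abc123")

# trace_id is ignored by __eq__, so the two jobs compare equal.
equal_ignoring_trace = (a == b)
print(equal_ignoring_trace)  # True

# trace_id is also left out of the repr.
print(b)  # Job(name='build', tags=[])

# Each instance got its own list from default_factory.
a.tags.append("ci")
print(a.tags, b.tags)  # ['ci'] []
```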

__post_init__ : Computed Fields and Validation
#

Sometimes you need to run initialization logic after the __init__ that @dataclass generates. That’s what __post_init__ is for:

from dataclasses import dataclass, field, InitVar
import hashlib

@dataclass
class User:
    username: str
    email: str
    password_raw: InitVar[str]   # passed to __init__ and __post_init__, never stored as a field
    password_hash: str = field(init=False, repr=False)

    def __post_init__(self, password_raw):
        if "@" not in self.email:
            raise ValueError(f"Invalid email: {self.email}")
        # sha256 keeps the example short; real password storage needs a salted KDF
        self.password_hash = hashlib.sha256(password_raw.encode()).hexdigest()

user = User("vinay", "vinay@thequietkernel.com", "supersecret")
print(user)
# User(username='vinay', email='vinay@thequietkernel.com')
print(user.password_hash[:16])   # first 16 hex characters of the sha256 digest

__post_init__ is called by the generated __init__ as its final step. It’s the standard place for validation logic, derived field computation, or any setup that depends on the initialized values.
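As a smaller illustration of a derived field, the sketch below computes a value from the other fields and rejects invalid input. Rectangle is a hypothetical class, not part of the example above.

```python
from dataclasses import dataclass, field

@dataclass
class Rectangle:
    width: float
    height: float
    area: float = field(init=False)  # not an __init__ parameter; set in __post_init__

    def __post_init__(self):
        # Validation runs after width and height have been assigned.
        if self.width <= 0 or self.height <= 0:
            raise ValueError("width and height must be positive")
        self.area = self.width * self.height

r = Rectangle(3.0, 4.0)
print(r)  # Rectangle(width=3.0, height=4.0, area=12.0)
```

Because area is declared with init=False, calling Rectangle(3.0, 4.0, area=1.0) fails with a TypeError, so callers cannot override the derived value at construction time.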


Frozen Dataclasses — Immutability
#

Add frozen=True and instances become effectively immutable. Attempts to set attributes after creation raise a FrozenInstanceError.

from dataclasses import dataclass

@dataclass(frozen=True)
class Coordinate:
    lat: float
    lon: float

c = Coordinate(37.7749, -122.4194)
c.lat = 0.0   # FrozenInstanceError: cannot assign to field 'lat'

Frozen dataclasses are also hashable by default (because immutable objects can safely serve as dict keys), which plain dataclasses are not:

cache = {}
cache[c] = "San Francisco"   # works because Coordinate is frozen and hashable

cache[Coordinate(37.7749, -122.4194)]  # same coords → same hash → cache hit
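The difference in hashability can be verified directly. The sketch below uses two hypothetical classes, MutablePoint and FrozenPoint, and also shows dataclasses.replace(), which builds a modified copy when you need a "changed" version of a frozen instance.

```python
from dataclasses import dataclass, replace

@dataclass
class MutablePoint:
    x: int
    y: int

@dataclass(frozen=True)
class FrozenPoint:
    x: int
    y: int

# A plain dataclass defines __eq__ but no __hash__, so hash() fails.
try:
    hash(MutablePoint(1, 2))
    mutable_hashable = True
except TypeError:
    mutable_hashable = False
print(mutable_hashable)  # False

# Equal frozen instances hash to the same value.
print(hash(FrozenPoint(1, 2)) == hash(FrozenPoint(1, 2)))  # True

# replace() returns a new instance instead of mutating the original.
p = FrozenPoint(1, 2)
moved = replace(p, x=10)
print(p, moved)  # FrozenPoint(x=1, y=2) FrozenPoint(x=10, y=2)
```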

Ordering
#

By default, @dataclass only generates __eq__. To get <, <=, >, >=, add order=True:

from dataclasses import dataclass

@dataclass(order=True)
class Version:
    major: int
    minor: int
    patch: int

v1 = Version(1, 2, 0)
v2 = Version(1, 3, 0)
v3 = Version(2, 0, 0)

print(sorted([v3, v1, v2]))
# [Version(major=1, minor=2, patch=0), Version(major=1, minor=3, patch=0), Version(major=2, minor=0, patch=0)]

The comparison is field-by-field in the order they’re defined. Version(1, 2, 0) < Version(1, 3, 0) because major is equal (1 == 1), so it falls through to minor (2 < 3).

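If you want to exclude a field from comparisons, mark it with field(compare=False). Note that this removes the field from __eq__ as well as from the ordering methods; there is no per-field switch that affects ordering only. If equality and ordering need to use different fields, write the ordering methods by hand.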
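A short check of how field(compare=False) behaves under order=True. Release and its label field are hypothetical names for this sketch; the excluded field is ignored by both equality and ordering.

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Release:
    major: int
    minor: int
    label: str = field(default="", compare=False)

a = Release(1, 2, label="stable")
b = Release(1, 2, label="beta")

# label plays no part in any comparison, so only (major, minor) is compared.
print(a == b)  # True
print(a < b)   # False
print(a <= b)  # True
```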


Inheritance
#

Dataclasses support inheritance. Child classes get the parent’s fields first, then their own:

from dataclasses import dataclass

@dataclass
class Animal:
    name: str
    species: str

@dataclass
class Pet(Animal):
    owner: str
    vaccinated: bool = False

p = Pet(name="Rex", species="Dog", owner="Vinay")
print(p)
# Pet(name='Rex', species='Dog', owner='Vinay', vaccinated=False)

One gotcha: if a parent field has a default, child fields cannot be without defaults. This is the same “defaults must come after non-defaults” rule.

@dataclass
class Parent:
    x: int = 0      # has a default

@dataclass
class Child(Parent):
    y: int          # no default — TypeError!

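The workaround is to give y a default too, restructure the inheritance chain, or, on Python 3.10+, declare the child with @dataclass(kw_only=True) so its fields become keyword-only and the ordering rule no longer applies to them.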


Serialization for dataclass
#

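Two utilities help with serialization:

  • asdict(): recursively converts an instance into a dict keyed by field name
  • astuple(): recursively converts an instance into a tuple of field values

from dataclasses import dataclass, asdict, astuple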

@dataclass
class Point:
    x: float
    y: float

p = Point(3.0, 4.0)
print(asdict(p))    # {'x': 3.0, 'y': 4.0}
print(astuple(p))   # (3.0, 4.0)
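asdict() also recurses into nested dataclasses and containers, which makes it a convenient first step before json.dumps. The sketch below assumes two hypothetical classes, Address and Customer; note that rebuilding the objects from the parsed JSON is manual, since the dataclasses module has no inverse of asdict().

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Address:
    city: str
    country: str

@dataclass
class Customer:
    name: str
    address: Address
    tags: list = field(default_factory=list)

customer = Customer("Asha", Address("Pune", "IN"), tags=["vip"])

# Nested dataclasses become nested dicts; lists are copied, not shared.
data = asdict(customer)
print(data)
# {'name': 'Asha', 'address': {'city': 'Pune', 'country': 'IN'}, 'tags': ['vip']}

# Round trip through JSON, rebuilding the nested object by hand.
payload = json.dumps(data)
raw = json.loads(payload)
restored = Customer(raw["name"], Address(**raw["address"]), raw["tags"])
print(restored == customer)  # True
```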

Dataclass vs NamedTuple vs TypedDict
#

These three are often used interchangeably, but they’re different tools:

| Feature | dataclass | NamedTuple | TypedDict |
|---|---|---|---|
| Mutable | Yes (unless frozen) | No | Yes (it’s a dict) |
| Hashable | Only if frozen | Yes | No |
| Inheritance | Full support | Limited | Yes |
| Method definitions | Yes | Yes | No |
| Default values | Yes | Yes | No |
| isinstance check | Yes | Yes | No (it’s a dict) |
| Memory | Object overhead | Tuple-level | Dict overhead |
| Best for | Business logic objects | Immutable records, tuple unpacking | JSON shapes, external API contracts |

NamedTuple is essentially a tuple with named access: it’s immutable and memory efficient. It is best used when the data is truly a fixed-size record.

TypedDict is for annotating plain dictionaries. It’s invisible at runtime (no enforcement) and exists only for type checkers. It is best used for JSON payloads or external data whose shape is outside our control.

dataclass is the default choice when we want a proper object with methods and mutable state.
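To make the contrast concrete, here is the same two-field shape written all three ways. PointTuple, PointDict, and PointData are hypothetical names for this sketch.

```python
from dataclasses import dataclass
from typing import NamedTuple, TypedDict

class PointTuple(NamedTuple):
    x: float
    y: float

class PointDict(TypedDict):
    x: float
    y: float

@dataclass
class PointData:
    x: float
    y: float

    def scaled(self, factor: float) -> "PointData":
        return PointData(self.x * factor, self.y * factor)

# NamedTuple: a real tuple, so unpacking and tuple comparison work.
pt = PointTuple(1.0, 2.0)
x, y = pt
print(pt == (1.0, 2.0))  # True

# TypedDict: at runtime the value is just a plain dict.
pd_value = PointDict(x=1.0, y=2.0)
print(type(pd_value) is dict)  # True

# dataclass: a regular mutable object with its own methods.
p = PointData(1.0, 2.0)
p.x = 5.0
print(p.scaled(2.0))  # PointData(x=10.0, y=4.0)
```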


Key Takeaway
#

The @dataclass decorator is a code generator. It inspects the class definition and generates methods such as __init__, __repr__, and __eq__, and optionally the ordering and hashing methods. The real value is eliminating repetitive boilerplate while keeping the data model readable.