Data Validation: The Key to Building Sturdy Systems
*Download the data and code from my GitHub repo*

"Garbage in, garbage out" isn't just a saying; it's a reality that can wreck your day. One bad record (say, a string where an integer should be) can ripple through your system, break dashboards, and turn a routine run into hours of debugging chaos.
The question is: how do you stop these issues before they spread? To me, the answer is simple: validate early and validate often.
By enforcing data integrity through clear rules and constraints as data flows into your system, you create a protective layer around downstream processes. But those rules need structure, and that's where the data schema comes in.
The Schema as Your Source of Stability
For validation to be effective, it needs a blueprint to follow. A schema acts as that blueprint, defining what your data should look like and how it should behave. By defining a schema and then validating your data against it, you create a reliable contract, one that keeps your systems sturdy and your sanity intact.
Defining a schema involves establishing clear expectations for your data upfront. This includes specifying column names, data types (such as string or integer), and business rules (for example, age must be greater than 18). Being explicit removes ambiguity and ensures your pipeline knows exactly what to expect.
Validating against the schema involves checking incoming data against what is expected. This allows you to raise descriptive errors the moment a record fails to meet your criteria. If you want to keep things moving, you can even quarantine bad records, routing problematic data to a side area for manual review, while letting the clean data continue to flow uninterrupted.
Putting it into Practice
To see this in action, let's look at a practical example.
Suppose we are ingesting a JSON-based catalog of book metadata. Specifically, we are processing a catalog featuring 15 metadata records for works by Black and African authors. To maintain a sturdy pipeline, we must ensure each metadata object contains four essential fields:
- title: The title of the book.
- author: The book's author.
- year_published: The year the book was first published.
- genre: A collection of tags indicating the book's subject matter or category.
Let's load a few libraries that we'll use during this practical example.
import json
import polars as pl
import pandera.polars as pa
from pandera.polars import PolarsData
from pydantic import (
    BaseModel,
    TypeAdapter,
    ValidationError,
    Field
)
Next, import two catalogs of book metadata:
- books_pass: A catalog where the schema contract is fully honored; every record meets the defined rules.
- books_fail: A modified version of the catalog where the contract is broken; some records violate the schema checks.
files = {"pass": "./data/books.json", "fail": "./data/books_fail.json"}
data = {
    key: json.load(open(path, "r", encoding="utf-8")) for key, path in files.items()
}
books_pass = data["pass"]
books_fail = data["fail"]
There are many ways to define and validate data schemas in Python. I'll walk you through two approaches: one using Pydantic and another using Polars + Pandera.
Validating Data with Pydantic
Pydantic is one of the most popular libraries for data validation in Python. It's fast, flexible, and easy to use. It leverages Python type hints to define data models and enforce schemas at runtime.
Using the Book catalog definitions from above, we can create a simple BookMetadata class with Pydantic's BaseModel, which specifies the structure of the metadata we expect in the catalog.
class BookMetadata(BaseModel):
    title: str
    author: str
    year_published: int
    genre: list[str]
Let's create a dictionary called test_record and use Pydantic's .model_validate() method to validate the record.
test_record = {
    "title": "Book Title",
    "author": "Book author",
    "year_published": 1990,
    "genre": ["Fiction", "Magical Realism"]
}
BookMetadata.model_validate(test_record)
BookMetadata(title='Book Title', author='Book author', year_published=1990, genre=['Fiction', 'Magical Realism'])
Validating test_record with the BookMetadata model returns the data successfully, confirming that our dictionary meets every requirement defined in the schema.
Now, let's see what happens when validation fails: change year_published from 1990 to "ABC" and try validating the record again.
test_record2 = {
    "title": "Book Title",
    "author": "Book author",
    "year_published": "ABC",
    "genre": ["Fiction", "Magical Realism"]
}
BookMetadata.model_validate(test_record2)
Traceback (most recent call last):
File "", line 1, in
File "/Users/User/.venv/lib/python3.11/site-packages/pydantic/main.py", line 716, in model_validate
return cls.__pydantic_validator__.validate_python(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for BookMetadata
year_published
Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='ABC', input_type=str]
For further information visit https://errors.pydantic.dev/2.12/v/int_parsing
When we attempt to validate the data this time, Pydantic raises a ValidationError for year_published because it expects an integer, not the string "ABC" we provided.
Validating a single record is useful for learning, but real-world scenarios often involve processing thousands of records at once, such as a large .json file containing an array of objects.
[
    {
        "title": "Things Fall Apart",
        "author": "Chinua Achebe",
        "year_published": 1958,
        "genre": ["Fiction", "Classic", "African Literature"]
    },
    {
        "title": "I Know Why the Caged Bird Sings",
        "author": "Maya Angelou",
        "year_published": 1969,
        "genre": ["Autobiography", "Memoir", "Non-fiction"]
    },
    ...
]
To validate this collection as a single unit, you can define a wrapper model. In our example, this means treating the entire list as one cohesive structure where every entry must satisfy the BookMetadata schema:
class BookMetadataCatalog(BaseModel):
    records: list[BookMetadata]
Let's use the BookMetadataCatalog wrapper to validate every record in books_pass. Since our JSON data is represented in Python as a list of dictionaries, we can simply pass the books_pass list to our wrapper under the records key:
BookMetadataCatalog.model_validate({"records": books_pass})
# Output truncated for readability
BookMetadataCatalog(records=[BookMetadata(title='Things Fall Apart', author='Chinua Achebe', year_published=1958, genre=['Fiction', 'Classic', 'African Literature']), BookMetadata(title='I Know Why the Caged Bird Sings', author='Maya Angelou', year_published=1969, genre=['Autobiography', 'Memoir', 'Non-fiction'])...
The data is returned, confirming that every record has been successfully validated against the predefined schema.
As an alternative to the wrapper model, we can use a TypeAdapter along with the .validate_python() method to validate the list directly. This is a flexible way to handle data, like top-level JSON arrays, without needing to modify the data's shape or wrap it in an extra dictionary key just for validation:
BookMetadata_list_adapter = TypeAdapter(list[BookMetadata])
BookMetadata_list_adapter.validate_python(books_pass)
# Output truncated for readability
[BookMetadata(title='Things Fall Apart', author='Chinua Achebe', year_published=1958, genre=['Fiction', 'Classic', 'African Literature']), BookMetadata(title='I Know Why the Caged Bird Sings', author='Maya Angelou', year_published=1969, genre=['Autobiography', 'Memoir', 'Non-fiction'])...
You can add extra layers of validation using Pydantic's Field() function. This is perfect for capturing specific business logic, like ensuring the year_published field only accepts dates from 1900 or later. For example, using ge=1900 (greater than or equal to), you can constrain the input beyond just being an integer. As a bonus, the Field() function also lets you add descriptions, making your code self-documenting.
class BookMetadata(BaseModel):
    title: str = Field(description="The title of the book.")
    author: str = Field(description="The book's author.")
    year_published: int = Field(ge=1900, description="The year the book was first published. Must be from 1900 onward.")
    genre: list[str] = Field(description="A collection of tags indicating the book's subject matter or category.")
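Those descriptions aren't just decoration. If you want to see the self-documentation in action, one quick, optional check (a sketch, not part of the pipeline) is to print the JSON Schema that Pydantic generates from the model via model_json_schema(); the field descriptions and the ge=1900 constraint show up under each property.
# Optional: inspect the JSON Schema generated from the updated model.
# The Field descriptions and the ge=1900 constraint appear under each property.
print(json.dumps(BookMetadata.model_json_schema(), indent=2))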
To test our new constraints, let's create two records: one valid, where year_published is set to 1954, and one invalid, where year_published is set to 1880. We'll then use BookMetadataCatalog to validate them against the updated model.
year_pub_records = [
    {
        "title": "First title",
        "author": "First author",
        "year_published": 1954,
        "genre": ["genre 1", "genre 2"]
    },
    {
        "title": "Second title",
        "author": "Second author",
        "year_published": 1880,
        "genre": ["genre 1", "genre 2"]
    }
]
class BookMetadataCatalog(BaseModel):
    records: list[BookMetadata]

BookMetadataCatalog.model_validate({"records": year_pub_records})
Traceback (most recent call last):
File "", line 1, in
File "/Users/User/.venv/lib/python3.11/site-packages/pydantic/main.py", line 716, in model_validate
return cls.__pydantic_validator__.validate_python(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for BookMetadataCatalog
records.1.year_published
Input should be greater than or equal to 1900 [type=greater_than_equal, input_value=1880, input_type=int]
For further information visit https://errors.pydantic.dev/2.12/v/greater_than_equal
As expected, Pydantic raises a ValidationError because the second record violates our ge=1900 rule. While Pydantic successfully catches the error, the default traceback can be dense and difficult to read. To make debugging more efficient, we can transform these raw errors into a user-friendly format that explicitly calls out the record index, the faulty field, the invalid input, and the reason for failure.
The following snippet demonstrates how structured error reporting makes it easier to pinpoint invalid data within the books_fail object.
try:
    validated_data = BookMetadataCatalog.model_validate({"records": books_fail})
    print("✅ All book records successfully validated.")
except ValidationError as e:
    print("❌ Validation failed.")
    for error in e.errors():
        print(
            f"Record #{error['loc'][1]} | "
            f"Field: '{error['loc'][-1]}' | "
            f"Input: {error['input']} | "
            f"Error: {error['msg']}"
        )
❌ Validation failed.
Record #0 | Field: 'year_published' | Input: ABC | Error: Input should be a valid integer, unable to parse string as an integer
Record #1 | Field: 'year_published' | Input: 123 | Error: Input should be greater than or equal to 1900
Record #2 | Field: 'title' | Input: 123123 | Error: Input should be a valid string
Record #3 | Field: 'genre' | Input: None | Error: Input should be a valid list
Importantly, once you can pinpoint validation errors (where they occur and why), the next step is deciding how to handle them. Here are some common strategies:
- Log and alert: Capture detailed error information or send notifications for monitoring.
- Quarantine problematic records: Isolate bad data for later review without interrupting the entire process (a minimal sketch follows this list).
- Terminate the pipeline: Stop execution immediately if validation fails.
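To make the quarantine strategy concrete, here's a minimal sketch (illustrative, not production code) that validates books_fail record by record with the BookMetadata model, keeps the clean records, and sets the failures aside with their error details for manual review:
# Minimal quarantine sketch: validate record by record so one bad entry
# doesn't block the rest of the batch.
valid_records, quarantined = [], []
for i, record in enumerate(books_fail):
    try:
        valid_records.append(BookMetadata.model_validate(record))
    except ValidationError as e:
        # Keep the raw record plus the error details for manual review.
        quarantined.append({"index": i, "record": record, "errors": e.errors()})
print(f"{len(valid_records)} clean records, {len(quarantined)} quarantined.")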
Next, let's look at how we can validate the same data using Polars + Pandera.
Validating Data with Polars + Pandera
Polars is a high-performance DataFrame library, and Pandera is an open-source framework for validating DataFrame-like objects. When you put them together, they provide a powerful, declarative approach to validating DataFrames at scale, making schema enforcement both flexible and efficient.
Since the data we've been working with thus far is a list of dictionaries (JSON records), we first need to convert it into a DataFrame before validating it with Pandera.
We can convert our lists of dictionaries (i.e., books_pass and books_fail) to Polars DataFrames by passing them to the pl.DataFrame() constructor.
# convert list of dictionaries to Polars DataFrame
books_pass_df = pl.DataFrame(books_pass)
books_fail_df = pl.DataFrame(books_fail)
Alternatively, you can load the data directly from a JSON file into a DataFrame.
# read data into a DataFrame from a JSON file
books_pass_df = pl.read_json("./data/books.json")
books_fail_df = pl.read_json("./data/books_fail.json")
Now that the data are in a DataFrame format, we can create our Pandera schema:
class BookMetadataSchema(pa.DataFrameModel):
    title: str = pa.Field(nullable=False, unique=True, coerce=True)
    author: str = pa.Field(nullable=False, coerce=True)
    year_published: int = pa.Field(nullable=False, ge=1900, coerce=True)
    genre: list[str] = pa.Field(nullable=False, coerce=True)
Just like we did with our Pydantic model, the Pandera schema BookMetadataSchema enforces specific assumptions about the structure of the data we want to validate:
- The DataFrame must include at least four columns: title, author, year_published, and genre.
- title is a string column that cannot be empty, and each book title in the dataset must be unique.
- author is a non-empty string column.
- year_published is a non-empty integer column. Using pa.Field(), we also enforce a constraint: all values for year_published must be greater than or equal to 1900.
- genre is a column containing a list of strings and cannot be empty.
Once we have the schema, we can use it to validate the data within a DataFrame. Here's an example of a DataFrame that passes validation:
BookMetadataSchema.validate(books_pass_df, lazy=True)
shape: (15, 4)
┌────────────────────┬────────────────────┬────────────────┬───────────────────┐
│ title ┆ author ┆ year_published ┆ genre │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ list[str] │
╞════════════════════╪════════════════════╪════════════════╪═══════════════════╡
│ Things Fall Apart ┆ Chinua Achebe ┆ 1958 ┆ ["Fiction", │
│ ┆ ┆ ┆ "Classic", │
│ ┆ ┆ ┆ "Africa… │
│ I Know Why the ┆ Maya Angelou ┆ 1969 ┆ ["Autobiography", │
│ Caged Bird Sing… ┆ ┆ ┆ "Memoir", "N… │
│ Beloved ┆ Toni Morrison ┆ 1987 ┆ ["Fiction", │
│ ┆ ┆ ┆ "Historical │
│ ┆ ┆ ┆ Fictio… │
│ Their Eyes Were ┆ Zora Neale Hurston ┆ 1937 ┆ ["Fiction", │
│ Watching God ┆ ┆ ┆ "Classic", │
│ ┆ ┆ ┆ "Africa… │
│ Invisible Man ┆ Ralph Ellison ┆ 1952 ┆ ["Fiction", │
│ ┆ ┆ ┆ "Modern Classic"] │
│ … ┆ … ┆ … ┆ … │
│ A Brief History of ┆ Marlon James ┆ 2014 ┆ ["Fiction", │
│ Seven Killi… ┆ ┆ ┆ "Historical │
│ ┆ ┆ ┆ Fictio… │
│ Nervous Conditions ┆ Tsitsi Dangarembga ┆ 1988 ┆ ["Fiction", │
│ ┆ ┆ ┆ "African │
│ ┆ ┆ ┆ Literatur… │
│ Parable of the ┆ Octavia E. Butler ┆ 1993 ┆ ["Science │
│ Sower ┆ ┆ ┆ Fiction", │
│ ┆ ┆ ┆ "Dystopian… │
│ The Palm-Wine ┆ Amos Tutuola ┆ 1952 ┆ ["Fiction", │
│ Drinkard ┆ ┆ ┆ "Fantasy", │
│ ┆ ┆ ┆ "Africa… │
│ So Long a Letter ┆ Mariama Bâ ┆ 1979 ┆ ["Fiction", │
│ ┆ ┆ ┆ "Epistolary │
│ ┆ ┆ ┆ Novel"… │
└────────────────────┴────────────────────┴────────────────┴───────────────────┘
Here's an example of a DataFrame that does not pass validation:
BookMetadataSchema.validate(books_fail_df, lazy=True)
Traceback (most recent call last):
File "", line 1, in
File "/Users/User/.venv/lib/python3.11/site-packages/pandera/api/polars/model.py", line 153, in validate
result = cls.to_schema().validate(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/User/.venv/lib/python3.11/site-packages/pandera/api/polars/container.py", line 57, in validate
output = self.get_backend(check_obj).validate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/User/.venv/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 152, in validate
raise SchemaErrors(
pandera.errors.SchemaErrors: {
"DATA": {
"DATATYPE_COERCION": [
{
"schema": "BookMetadataSchema",
"column": "BookMetadataSchema",
"check": "coerce_dtype('{'title': DataType(String), 'author': DataType(String), 'year_published': DataType(Int64), 'genre': DataType(List(String))}')",
"error": "Could not coerce 'year_published' in LazyFrame with schema Schema([('title', String), ('author', String), ('year_published', String), ('genre', List(String))]) into type Int64"
}
],
"CHECK_ERROR": [
{
"schema": "BookMetadataSchema",
"column": "year_published",
"check": "greater_than_or_equal_to(1900)",
"error": "ComputeError(\"cannot compare string with numeric type (i32)\")"
}
]
},
"SCHEMA": {
"WRONG_DATATYPE": [
{
"schema": "BookMetadataSchema",
"column": "year_published",
"check": "dtype('Int64')",
"error": "expected column 'year_published' to have type Int64, got String"
}
],
"SERIES_CONTAINS_NULLS": [
{
"schema": "BookMetadataSchema",
"column": "genre",
"check": "not_nullable",
"error": "non-nullable column 'genre' contains null values"
}
]
}
}
As expected, books_pass_df passes all validation checks, while books_fail_df fails and Pandera raises a SchemaErrors exception.
Rather than relying on raw error outputs, we can use Pandera's built-in error reporting to print a structured summary of every schema error and failure case encountered during validation:
try:
    validated_df = BookMetadataSchema.validate(books_fail_df, lazy=True)
    print("✅ All schema checks passed!")
except pa.errors.SchemaErrors as exc:
    print("❌ Validation failed.\nDetails:")
    print(exc.failure_cases)
❌ Validation failed.
Details:
shape: (4, 6)
┌────────────────┬───────────────┬───────────────┬───────────────┬──────────────┬───────┐
│ failure_case ┆ schema_contex ┆ column ┆ check ┆ check_number ┆ index │
│ --- ┆ t ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ --- ┆ str ┆ str ┆ i32 ┆ i32 │
│ ┆ str ┆ ┆ ┆ ┆ │
╞════════════════╪═══════════════╪═══════════════╪═══════════════╪══════════════╪═══════╡
│ ABC ┆ DataFrameSche ┆ BookMetadataS ┆ coerce_dtype( ┆ null ┆ 0 │
│ ┆ ma ┆ chema ┆ '{'title': ┆ ┆ │
│ ┆ ┆ ┆ DataTy… ┆ ┆ │
│ String ┆ Column ┆ year_publishe ┆ dtype('Int64' ┆ null ┆ null │
│ ┆ ┆ d ┆ ) ┆ ┆ │
│ ComputeError(" ┆ Column ┆ year_publishe ┆ greater_than_ ┆ 0 ┆ null │
│ cannot compare ┆ ┆ d ┆ or_equal_to(1 ┆ ┆ │
│ s… ┆ ┆ ┆ 900) ┆ ┆ │
│ null ┆ Column ┆ genre ┆ not_nullable ┆ null ┆ 3 │
└────────────────┴───────────────┴───────────────┴───────────────┴──────────────┴───────┘
The structured summary is a DataFrame with six columns, each detailing the specific values and row indices that failed validation, along with the violated check and the corresponding column name.
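Because the summary carries row indices, you can also act on it programmatically. Here's a rough sketch of the quarantine strategy on the DataFrame side, assuming a recent Polars version where with_row_index is available; note that frame-level failures (such as dtype coercion) carry a null index, so only row-level failures get routed to quarantine here:
try:
    validated_df = BookMetadataSchema.validate(books_fail_df, lazy=True)
except pa.errors.SchemaErrors as exc:
    # Collect row indices of row-level failures (frame-level failures have a null index).
    bad_rows = (
        exc.failure_cases
        .filter(pl.col("index").is_not_null())
        .get_column("index")
        .unique()
        .to_list()
    )
    # Split the original DataFrame into clean and quarantined rows.
    df_indexed = books_fail_df.with_row_index("row_nr")
    quarantined_df = df_indexed.filter(pl.col("row_nr").is_in(bad_rows))
    clean_df = df_indexed.filter(~pl.col("row_nr").is_in(bad_rows))
    print(f"{clean_df.height} clean rows, {quarantined_df.height} quarantined.")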
Often, simple type hints and built-in constraints aren't enough to validate the nuanced "business rules" of your data. For these scenarios, Pandera allows you to implement complex, custom logic using the @pa.check method decorator. This provides the flexibility to define specialized column-level validation rules. (Note: In the schema below, we set coerce=False for year_published.)
class BookMetadataSchema(pa.DataFrameModel):
    title: str = pa.Field(nullable=False, unique=True, coerce=True)
    author: str = pa.Field(nullable=False, coerce=True)
    year_published: int = pa.Field(nullable=False, coerce=False)
    genre: list[str] = pa.Field(nullable=False, coerce=True)

    @pa.check("title")
    def valid_alphanumeric_title(cls, data: PolarsData) -> pl.LazyFrame:
        """Check that a title is not made up entirely of digits."""
        return data.lazyframe.select(
            ~pl.col(data.key).cast(pl.Utf8).str.contains(r"^\d+$")
        )

    @pa.check("year_published")
    def valid_year_pub_range(cls, data: PolarsData) -> pl.LazyFrame:
        """Check that year_published is an integer greater than 1900."""
        return (
            data.lazyframe.select(data.key)
            .cast(pl.Int64, strict=False)
            .select((pl.col(data.key).is_not_null()) & (pl.col(data.key).gt(1900)))
        )
try:
    validated_df = BookMetadataSchema.validate(books_fail_df, lazy=True)
    print("✅ All schema checks passed!")
except pa.errors.SchemaErrors as exc:
    print("❌ Validation failed.\nDetails:")
    print(exc.failure_cases)
❌ Validation failed.
Details:
shape: (5, 6)
┌──────────────┬──────────────┬──────────────┬──────────────┬─────────────┬───────┐
│ failure_case ┆ schema_conte ┆ column ┆ check ┆ check_numbe ┆ index │
│ --- ┆ xt ┆ --- ┆ --- ┆ r ┆ --- │
│ str ┆ --- ┆ str ┆ str ┆ --- ┆ i32 │
│ ┆ str ┆ ┆ ┆ i32 ┆ │
╞══════════════╪══════════════╪══════════════╪══════════════╪═════════════╪═══════╡
│ 123123 ┆ Column ┆ title ┆ valid_alphan ┆ 0 ┆ 2 │
│ ┆ ┆ ┆ umeric_title ┆ ┆ │
│ String ┆ Column ┆ year_publish ┆ dtype('Int64 ┆ null ┆ null │
│ ┆ ┆ ed ┆ ') ┆ ┆ │
│ ABC ┆ Column ┆ year_publish ┆ valid_year_p ┆ 0 ┆ 0 │
│ ┆ ┆ ed ┆ ub_range ┆ ┆ │
│ 123 ┆ Column ┆ year_publish ┆ valid_year_p ┆ 0 ┆ 1 │
│ ┆ ┆ ed ┆ ub_range ┆ ┆ │
│ null ┆ Column ┆ genre ┆ not_nullable ┆ null ┆ 3 │
└──────────────┴──────────────┴──────────────┴──────────────┴─────────────┴───────┘
In the example above, we have implemented two custom checks:
- valid_alphanumeric_title: This check ensures that book titles aren't composed entirely of digits. It uses a regular expression to flag any title that consists only of digits, helping to catch potential data entry errors.
- valid_year_pub_range: Rather than rely on a basic range constraint, this check adds a layer of robustness. It attempts to cast the year_published column to a 64-bit integer and verifies that the resulting value is both non-null and strictly greater than 1900, ensuring the column contains clean, logical dates before processing.
You might be thinking: why aren't the built-in constraints like ge=1900 enough? The answer lies in how Polars handles data types. Polars is notoriously strict when it comes to data types, which is generally a strength, but it can lead to "silent" side effects during data ingestion that might look like a non-event to the untrained eye.
In our previous example without custom checks, you might have spotted a cryptic ComputeError in the failure cases: ComputeError("cannot compare string with numeric type (i32)"). This happened because our year_published column contained mixed types (integers and strings). When Polars converted that data into a DataFrame, it chose a container type (in this case a String) to accommodate the messiness.
{name: dtype for name, dtype in books_fail_df.schema.items()}
{'title': String, 'author': String, 'year_published': String, 'genre': List(String)}
When Pandera tried to check whether all values in the column were greater than or equal to 1900, the process failed. Why? Even though we told Pandera to coerce the column to integer, one of the entries was "ABC". Spoiler: you can't mathematically compare the string "ABC" to the number 1900. In our updated schema, we turned off data type coercion for year_published and instead used our custom check logic to identify the specific records that are either not valid integers or not greater than 1900.
Also, did you notice how, in our original schema, Polars + Pandera didn't flag the title record that was just a string of numbers (123123)? That's because when converting the list of dictionaries to a DataFrame, Polars silently coerced all non-string values in that column into a common, compatible data type; in this case, a String (see schema for books_fail_df above). In our updated schema, we added custom checks to evaluate records against specific conditions, such as titles made up entirely of digits, so these cases don't slip through.
Validate Early, Validate Often
Growing up, my mom loved doing those silly voices when reading aloud to us. Maybe that's why I've always had a soft spot for alliteration, rhymes, and little jingles; they help make things stick and turn mundane or boring processes into something playful and memorable.
A few years back, while working with a team that was starting to strengthen their data management practices, someone asked, "When should you validate data?"
Out popped a jingle:
If your data aren't at rest, put them to the test.
It's silly and catchy, but it's also the truth, which is why I always recommend to teams that they validate their data early and validate it often.
Early data validation is not just a technical safeguard; it's a promise of integrity. Catching issues as close to the source as possible stops bad data before it can erode trust downstream. The schema is one vehicle for this promise, and it's most powerful when it enforces both structure and logic.
When we intentionally build schemas that define the structure and logic of data, they stop being a technical artifact and start becoming a framework for collaboration. Schema design sparks vital conversations between those who create the data, those who make it accessible, and those who use it. Use these conversations to turn unspoken assumptions into shared understanding and ensure that what we build truly matches what we expect from one another.