Recoding (Column) Values in Python

Jun 2

* Download the data and code from my GitHub repo *

Image of a process model recoding A,W,R to Accepted, Waitlisted, and Rejected, respectively.

Not too long ago, I wrote a post on renaming Pandas and Polars DataFrame Columns. In this post, I will show you a quick and easy way to recode variable (i.e., column) values in a DataFrame.

Spoiler: Yes, it uses a "codebook."

Is R more your thing? Check out my post on Recoding Variable Values in R.

Value Recoding

Data recoding is a dreaded task, but the results are well worth the effort. We often transform or modify data to make it usable for analysis. This can involve changing data types, grouping/binning values, or mapping old values to new ones. This post will focus on mapping old values to new ones.

When recoding variable values, it is important to have at least three pieces of information:

Names of the variables to recode
Original values/labels
New values/labels

The Value of a Codebook

Many people use a programming language like Python to automate data workflows, and a trusty codebook can help streamline those processes. A codebook acts as a map or rulebook for your data. It removes (some) ambiguity, supports automation, and enables collaboration, especially on large-scale projects with many moving parts.

The value of a codebook lies in its ability to provide an accounting of the variables in a dataset. Successful codebooks typically contain the following information about each variable in the dataset:

Variable Name: The name assigned to each variable
New Variable Name: A new name you wish to assign to each variable
Variable Label: A brief description outlining what the variable is measuring.
Level of Measurement: How the variable was measured (e.g., nominal, ordinal, interval, ratio scale)
Variable Values: List of the values a variable can take
Value Labels: Descriptions for each unique variable value

While many people choose to automate codebook creation, I usually build mine manually, at least in part. Doing it this way gives me more control over what information is used during the data processing phase.

Setup

I generated a fake dataset for this post, loosely based on some data about graduate applicants to a large university. The dataset contains demographic and educational information about the applicants, as well as information about the programs to which they applied.

The dataset has 5,000 rows and 14 columns:

Image of the fake dataset containing demographic and educational information about the applicants and information about the programs they applied to.

I also created a codebook similar to one I would use in a typical data cleaning workflow. The codebook has four columns:

column_name: The name assigned to each variable in the dataset
column_description: A description of the variable
old_values: Values each variable can have
new_labels: Descriptions for each unique variable value

One significant difference between this codebook and the codebooks I typically use is that this codebook presents data in the long format. In other words, there are multiple records for each column. For example, the school_decision column in our dataset has three possible values and therefore three entries in our codebook:

Image of the codebook with school_decision's three rows highlighted.

We will convert these codebook entries into a nested (i.e., two-level) dictionary.
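To make the target shape concrete, here is a small, library-free sketch (it is not the approach used in the rest of this post). The rows are copied from the school_decision entries in the codebook, minus the description column; the names codebook_rows and nested are placeholders I made up for illustration.

```python
# Long-format codebook: one row per (variable, old value) pair.
codebook_rows = [
    {"column_name": "school_decision", "old_values": "A", "new_labels": "Accepted"},
    {"column_name": "school_decision", "old_values": "W", "new_labels": "Waitlisted"},
    {"column_name": "school_decision", "old_values": "R", "new_labels": "Rejected"},
]

# Two-level dictionary: outer key = variable name, inner dict = old value -> new label.
nested = {}
for row in codebook_rows:
    inner = nested.setdefault(row["column_name"], {})
    inner[row["old_values"]] = row["new_labels"]

print(nested)
```

Three long-format rows collapse into a single outer key whose inner dictionary holds the three old-to-new pairs.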

Recoding column values in a Pandas DataFrame

The easiest way to recode column values in a Pandas DataFrame is to use the .replace method. To replace values in select columns, we are going to take advantage of the fact that you can pass a dictionary to .replace's first argument, to_replace. Specifically, we want to pass a nested dictionary to the to_replace argument, where column names from our dataset serve as outer keys, and inner dictionaries give the mapping of old values to new values. Remember, our codebook data is in the long format, so before we can use the .replace method, we need to group values/labels by unique variables (i.e., column_name) and then return a nested dictionary.
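Here is a minimal sketch of that nested-dictionary behavior on a toy DataFrame. The data are made up; only the column names and codes mirror the post's dataset. Notice that "W" means different things in the two columns, which is exactly why the outer keys matter.

```python
import pandas as pd

# Toy data (hypothetical rows; column names borrowed from the post's dataset).
toy = pd.DataFrame(
    {
        "applicant_id": [1, 2, 3],
        "school_decision": ["A", "W", "R"],
        "gender": ["W", "M", "W"],
    }
)

# Outer keys = column names, inner dicts = old value -> new value.
mapping = {
    "school_decision": {"A": "Accepted", "W": "Waitlisted", "R": "Rejected"},
    "gender": {"W": "Woman", "M": "Man"},
}

# Each inner dictionary is applied only to the column named by its outer key.
recoded = toy.replace(to_replace=mapping)
print(recoded)
```

Columns that are not outer keys (here, applicant_id) pass through untouched.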

First, import the codebook and dataset using the pd.read_csv function:

import pandas as pd

# codebook
csv_codebook_pd = pd.read_csv("./data/grad_app_codebook.csv")

# dataset
data_pd = pd.read_csv("./data/grad_app_data.csv")

Then create the nested dictionary. One way to create the nested dictionary we want is to use a regular for loop:

nested_dict_pd = {}
for var, col in csv_codebook_pd.groupby("column_name", sort=False):
    nested_dict_pd[var] = dict(zip(col["old_values"], col["new_labels"]))

Here's a breakdown of what the code does:

Starts with an empty dictionary.
Creates a nested dictionary from a DataFrame by:
- Looping through each group of rows that share the same value in the column column_name (which in this case is the variable name in our dataset).
  - var is the group key — a unique variable in our dataset.
  - col represents the set of rows (a subset of the codebook DataFrame) that match the var key.
  - For each group, it goes row by row, pairing values from the old_values column with values from the new_labels column to form key-value pairs in an inner dictionary.
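If you prefer comprehensions, the same loop can be sketched as a dict comprehension. Treat this as an equivalent alternative rather than a better one. The toy codebook below stands in for the CSV file, and the names codebook and rows_for_var are my own.

```python
import pandas as pd

# Toy long-format codebook (two variables only; a stand-in for the CSV file).
codebook = pd.DataFrame(
    {
        "column_name": ["school_decision"] * 3 + ["low_income"] * 2,
        "old_values": ["A", "W", "R", "N", "Y"],
        "new_labels": ["Accepted", "Waitlisted", "Rejected", "No", "Yes"],
    }
)

# groupby yields (group key, sub-DataFrame) pairs; sort=False keeps codebook order.
nested = {
    var: dict(zip(rows_for_var["old_values"], rows_for_var["new_labels"]))
    for var, rows_for_var in codebook.groupby("column_name", sort=False)
}

print(nested)
```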

The end result?

A two-level nested dictionary:

nested_dict_pd

{'school_decision': {'A': 'Accepted', 'W': 'Waitlisted', 'R': 'Rejected'},
 'student_decision': {'A': 'Accepted Offer', 'D': 'Declined Offer'},
 'ft_pt': {'pt': 'Part-time', 'ft': 'Full-time', 'sub': 'Submatriculate'},
 'applied_dual_prg': {'NODUAL': 'No', 'YES': 'Yes'},
 'gender': {'W': 'Woman', 'M': 'Man', 'NB': 'Non-Binary'},
 'ethnicity': {'AN': 'Indigenous',
  'HL': 'Hispanic/Latino',
  'ME': 'Middle Eastern/Arab',
  'W': 'White',
  'B': 'Black /African American',
  'AP': 'Asian or Pacific Islander'},
 'low_income': {'N': 'No', 'Y': 'Yes'}}

To use the dictionary, pass nested_dict_pd to the to_replace argument of the .replace method. Values are recoded in every column whose name appears both in the dataset and as an outer key in the nested dictionary. Note that .replace returns a new DataFrame rather than modifying data_pd in place, so assign the result if you want to keep it.

data_pd.replace(to_replace=nested_dict_pd)
      applicant_id                                    email  ...            ethnicity low_income
            401                  karin.walborn@gmail.com  ...           Indigenous         No
            386                yareli.granados@yahoo.com  ...      Hispanic/Latino         No
           2905  sulaimaan.al-demian@this.university.edu  ...  Middle Eastern/Arab         No
           3043            ryan.burt@this.university.edu  ...                White        Yes
           1546              ryan.villalobos@Outlook.com  ...      Hispanic/Latino         No
...            ...                                      ...  ...                  ...        ...
        2130   schyeler.martzloff@this.university.edu  ...           Indigenous         No
        4894                  kasey.schwarz@gmail.com  ...                White         No
        4592                   jerrod.apple@gmail.com  ...                White         No
        2684               natasha.fossum@Outlook.com  ...                White        Yes
        1345          jaimin.lott@this.university.edu  ...           Indigenous         No
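One thing worth knowing before you rely on this in a pipeline: values that have no entry in the codebook are left as they are, and so are missing values. The sketch below uses made-up data (the code "X" is hypothetical) to show that behavior, along with a quick way to list codes the codebook does not cover.

```python
import pandas as pd

# Toy column with one code ("X") that is not in the codebook and one missing value.
toy = pd.DataFrame({"school_decision": ["A", "W", "X", None]})
mapping = {
    "school_decision": {"A": "Accepted", "W": "Waitlisted", "R": "Rejected"}
}

# .replace returns a new DataFrame; unmapped and missing values pass through.
recoded = toy.replace(to_replace=mapping)

# Codes present in the data but absent from the codebook.
known_codes = set(mapping["school_decision"])
unmapped = set(toy["school_decision"].dropna()) - known_codes

print(recoded)
print(unmapped)
```

A check like this is cheap to run before recoding and can flag codebook gaps early.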

If this is a process you envision using across various workflows, consider wrapping it in a function:

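def create_dict_pd(df: pd.DataFrame,
                   col_name: str = "column_name",
                   old_val_col: str = "values",
                   new_val_col: str = "labels"
                   ) -> dict[str, dict[int | str, int | str]]:
    """Convert a long-format codebook into a {column: {old: new}} dictionary."""
    nested_dict = {}
    # One group per variable; sort=False keeps the codebook's original order
    for var, col in df.groupby(col_name, sort=False):
        nested_dict[var] = dict(zip(col[old_val_col], col[new_val_col]))

    return nested_dict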

Then build the nested dictionary:

nested_dict_pd = create_dict_pd(df = csv_codebook_pd,
                                col_name="column_name",
                                old_val_col="old_values",
                                new_val_col="new_labels")

And finally, use the dictionary:

data_pd.replace(to_replace=nested_dict_pd)
      applicant_id                                    email  ...            ethnicity low_income
            401                  karin.walborn@gmail.com  ...           Indigenous         No
            386                yareli.granados@yahoo.com  ...      Hispanic/Latino         No
           2905  sulaimaan.al-demian@this.university.edu  ...  Middle Eastern/Arab         No
           3043            ryan.burt@this.university.edu  ...                White        Yes
           1546              ryan.villalobos@Outlook.com  ...      Hispanic/Latino         No
...            ...                                      ...  ...                  ...        ...
        2130   schyeler.martzloff@this.university.edu  ...           Indigenous         No
        4894                  kasey.schwarz@gmail.com  ...                White         No
        4592                   jerrod.apple@gmail.com  ...                White         No
        2684               natasha.fossum@Outlook.com  ...                White        Yes
        1345          jaimin.lott@this.university.edu  ...           Indigenous         No

Recoding column values in a Polars DataFrame

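Much like when using Pandas, the easiest way to recode column values in a Polars DataFrame is to use a .replace method. In Polars, however, .replace is an expression method: you call it on a column expression such as pl.col("school_decision") inside .with_columns, rather than on the DataFrame itself.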

First, import the codebook and dataset using the pl.read_csv function:

import polars as pl

# codebook
csv_codebook_pl = pl.read_csv("./data/grad_app_codebook.csv")

# dataset
data_pl = pl.read_csv("./data/grad_app_data.csv")

Next, we need to create the nested dictionary. Here's sample code you can use to convert the codebook (as a Polars DataFrame) into a two-level nested dictionary:

nested_dict_pl = {
    row["column_name"]: dict(zip(row["old_values"], row["new_labels"]))
    for row in (csv_codebook_pl.group_by("column_name", maintain_order=True)
                    .agg(pl.col("old_values"),pl.col("new_labels"))
                    .iter_rows(named=True)
                )
        }

Let's rewrite the code in two parts so you can see what's happening underneath the hood:

In this first part, we are creating a grouped Polars DataFrame. Then, with that grouped DataFrame, we are aggregating the data in the old_values and new_labels columns into lists.

shape: (7, 3)
┌──────────────────┬──────────────────────┬─────────────────────────────────┐
│ column_name      ┆ old_values           ┆ new_labels                      │
│ ---              ┆ ---                  ┆ ---                             │
│ str              ┆ list[str]            ┆ list[str]                       │
╞══════════════════╪══════════════════════╪═════════════════════════════════╡
│ school_decision  ┆ ["A", "W", "R"]      ┆ ["Accepted", "Waitlisted", "Re… │
│ student_decision ┆ ["A", "D"]           ┆ ["Accepted Offer", "Declined O… │
│ ft_pt            ┆ ["pt", "ft", "sub"]  ┆ ["Part-time", "Full-time", "Su… │
│ applied_dual_prg ┆ ["NODUAL", "YES"]    ┆ ["No", "Yes"]                   │
│ gender           ┆ ["W", "M", "NB"]     ┆ ["Woman", "Man", "Non-Binary"]  │
│ ethnicity        ┆ ["AN", "HL", … "AP"] ┆ ["Indigenous", "Hispanic/Latin… │
│ low_income       ┆ ["N", "Y"]           ┆ ["No", "Yes"]                   │
└──────────────────┴──────────────────────┴─────────────────────────────────┘

However, we're not done yet; we need to convert this DataFrame into a nested dictionary, where the values under column_name will become our outer keys, and the latter two DataFrame columns will become our inner keys and values, respectively.

One way you can approach it is to use a dict(ionary) comprehension:

nested_dict_pl = {
                row["column_name"]: dict(zip(row["old_values"], row["new_labels"]))
                for row in csv_cb_pl.iter_rows(named=True)
                }

A dict comprehension needs two things to work:

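Two expressions separated by a colon
At least one "for" clause, optionally followed by more "for" or "if" clauses

In our code:

The two expressions separated by a colon are our key-value pairs.
The for clause is right beneath them; this comprehension does not need an if clause.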

If you take a closer look, however, you'll notice that:

The outer keys in our nested dictionary are represented by the expression to the left of the colon: row["column_name"].
The inner dictionaries are populated by the expression to the right of the colon dict(zip(row["old_values"], row["new_labels"])).
The for-loop loops through each row in our transformed Polars DataFrame, returning an iterator of dictionaries of row values. Importantly, we can access values in these dictionaries by column name.
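To see those pieces in isolation, here is a plain-Python sketch. The list rows is a hand-written stand-in for what iterating over named rows gives you (one dictionary per row, keyed by column name), and the filtered variant at the end is only there to show where an optional if clause would go.

```python
# Hand-written stand-in for named rows: one dict per row, keyed by column name.
rows = [
    {
        "column_name": "school_decision",
        "old_values": ["A", "W", "R"],
        "new_labels": ["Accepted", "Waitlisted", "Rejected"],
    },
    {
        "column_name": "low_income",
        "old_values": ["N", "Y"],
        "new_labels": ["No", "Yes"],
    },
]

# key expression : value expression, followed by a for clause.
nested = {
    row["column_name"]: dict(zip(row["old_values"], row["new_labels"]))
    for row in rows
}

# Same comprehension with an optional if clause that keeps only *_decision variables.
decisions_only = {
    row["column_name"]: dict(zip(row["old_values"], row["new_labels"]))
    for row in rows
    if row["column_name"].endswith("_decision")
}

print(nested)
print(decisions_only)
```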

Now, if you're not comfortable with comprehensions or if the logic you are using to populate the dictionary is more complex, feel free to convert the comprehension into a more expressive for loop:

nested_dict_pl = {}
for row in csv_cb_pl.iter_rows(named=True):
    nested_dict_pl[row["column_name"]] = dict(zip(row["old_values"], row["new_labels"]))

You might even consider using a dict comprehension, but instead of returning an iterator of dictionaries, return tuples, whose elements you access via index. (Granted, this option requires knowing each element's exact position.)

nested_dict_pl = {
                row[0]: dict(zip(row[1], row[2]))
                for row in csv_cb_pl.iter_rows()
                }

Whatever option you use, the result should be the same: A two-level nested dictionary.

nested_dict_pl

{'school_decision': {'A': 'Accepted', 'W': 'Waitlisted', 'R': 'Rejected'},
 'student_decision': {'A': 'Accepted Offer', 'D': 'Declined Offer'},
 'ft_pt': {'pt': 'Part-time', 'ft': 'Full-time', 'sub': 'Submatriculate'},
 'applied_dual_prg': {'NODUAL': 'No', 'YES': 'Yes'},
 'gender': {'W': 'Woman', 'M': 'Man', 'NB': 'Non-Binary'},
 'ethnicity': {'AN': 'Indigenous',
  'HL': 'Hispanic/Latino',
  'ME': 'Middle Eastern/Arab',
  'W': 'White',
  'B': 'Black /African American',
  'AP': 'Asian or Pacific Islander'},
 'low_income': {'N': 'No', 'Y': 'Yes'}}

Now let's put the dictionary to use. Say you wanted to recode the values in the school_decision column. Here's one way you could approach it (partial output shown for brevity):

(
    data_pl
        .with_columns(
            pl.col("school_decision")
                .replace(nested_dict_pl["school_decision"])
                .alias("school_decision")
                )
)
┌──────────────┬─────────────────┬──────────────────┬───────┬────────┐
│ applicant_id ┆ school_decision ┆ student_decision ┆ ft_pt ┆ gender │
│ ---          ┆ ---             ┆ ---              ┆ ---   ┆ ---    │
│ i64          ┆ str             ┆ str              ┆ str   ┆ str    │
╞══════════════╪═════════════════╪══════════════════╪═══════╪════════╡
│ 401          ┆ Accepted        ┆ A                ┆ pt    ┆ W      │
│ 386          ┆ Accepted        ┆ A                ┆ ft    ┆ M      │
│ 2905         ┆ Waitlisted      ┆ null             ┆ ft    ┆ W      │
│ 3043         ┆ Accepted        ┆ D                ┆ pt    ┆ NB     │
│ 1546         ┆ Waitlisted      ┆ null             ┆ ft    ┆ M      │
└──────────────┴─────────────────┴──────────────────┴───────┴────────┘

Great, only six more columns to go…

Kidding. No one has the time to sit and write all of that out. When you apply the same transformation to multiple columns, I recommend looping over the column names within the .with_columns method like so (partial output shown for brevity):

(
    data_pl
        .with_columns(
            pl.col(col)
                .replace(nested_dict_pl[col])
                .alias(col)
            for col in list(nested_dict_pl)
        )
)
┌──────────────┬─────────────────┬───────────┬──────────────────┬────────────┬────────────┐
│ applicant_id ┆ school_decision ┆ ft_pt     ┆ applied_dual_prg ┆ gender     ┆ low_income │
│ ---          ┆ ---             ┆ ---       ┆ ---              ┆ ---        ┆ ---        │
│ i64          ┆ str             ┆ str       ┆ str              ┆ str        ┆ str        │
╞══════════════╪═════════════════╪═══════════╪══════════════════╪════════════╪════════════╡
│ 401          ┆ Accepted        ┆ Part-time ┆ No               ┆ Woman      ┆ No         │
│ 386          ┆ Accepted        ┆ Full-time ┆ Yes              ┆ Man        ┆ No         │
│ 2905         ┆ Waitlisted      ┆ Full-time ┆ No               ┆ Woman      ┆ No         │
│ 3043         ┆ Accepted        ┆ Part-time ┆ Yes              ┆ Non-Binary ┆ Yes        │
│ 1546         ┆ Waitlisted      ┆ Full-time ┆ No               ┆ Man        ┆ No         │
└──────────────┴─────────────────┴───────────┴──────────────────┴────────────┴────────────┘

Note, we are using list(nested_dict_pl) to return a list of the (outer) keys of our nested dictionary. These are the only columns to which we want to apply this transformation.

Now, if you want to keep things a little neater, you may consider building lists of expressions:

recode_expr_list = [ pl.col(col)
                        .replace(nested_dict_pl[col])
                        .alias(col)
                    for col in list(nested_dict_pl)
                    ]

Then pass the expressions to the .with_columns method (partial output shown for brevity):

data_pl.with_columns(recode_expr_list)
┌──────────────┬─────────────────┬───────────┬──────────────────┬────────────┬────────────┐
│ applicant_id ┆ school_decision ┆ ft_pt     ┆ applied_dual_prg ┆ gender     ┆ low_income │
│ ---          ┆ ---             ┆ ---       ┆ ---              ┆ ---        ┆ ---        │
│ i64          ┆ str             ┆ str       ┆ str              ┆ str        ┆ str        │
╞══════════════╪═════════════════╪═══════════╪══════════════════╪════════════╪════════════╡
│ 401          ┆ Accepted        ┆ Part-time ┆ No               ┆ Woman      ┆ No         │
│ 386          ┆ Accepted        ┆ Full-time ┆ Yes              ┆ Man        ┆ No         │
│ 2905         ┆ Waitlisted      ┆ Full-time ┆ No               ┆ Woman      ┆ No         │
│ 3043         ┆ Accepted        ┆ Part-time ┆ Yes              ┆ Non-Binary ┆ Yes        │
│ 1546         ┆ Waitlisted      ┆ Full-time ┆ No               ┆ Man        ┆ No         │
└──────────────┴─────────────────┴───────────┴──────────────────┴────────────┴────────────┘

Again, if this is a process you envision using across various workflows, consider wrapping it in a function:

def create_dict_pl(df: pl.DataFrame,
                   col_name: str = "column_name",
                   old_val_col: str = "values",
                   new_val_col: str = "labels"
                   ) -> dict[str, dict[int | str, int | str]]:
    """Convert a long-format codebook into a {column: {old: new}} dictionary."""
    # Collect each variable's values/labels into lists, one row per variable
    grouped = (df.group_by(col_name, maintain_order=True)
                 .agg(pl.col(old_val_col), pl.col(new_val_col)))

    return {
        row[col_name]: dict(zip(row[old_val_col], row[new_val_col]))
        for row in grouped.iter_rows(named=True)
    }

Then build the nested dictionary:

nested_dict_pl = create_dict_pl(df = csv_codebook_pl,
                                col_name="column_name",
                                old_val_col="old_values",
                                new_val_col="new_labels")

And finally, use the dictionary (partial output shown for brevity):

recode_expr_list = [ pl.col(col)
                        .replace(nested_dict_pl[col])
                        .alias(col)
                    for col in list(nested_dict_pl)
                    ]

data_pl.with_columns(recode_expr_list)
┌──────────────┬─────────────────┬───────────┬──────────────────┬────────────┬────────────┐
│ applicant_id ┆ school_decision ┆ ft_pt     ┆ applied_dual_prg ┆ gender     ┆ low_income │
│ ---          ┆ ---             ┆ ---       ┆ ---              ┆ ---        ┆ ---        │
│ i64          ┆ str             ┆ str       ┆ str              ┆ str        ┆ str        │
╞══════════════╪═════════════════╪═══════════╪══════════════════╪════════════╪════════════╡
│ 401          ┆ Accepted        ┆ Part-time ┆ No               ┆ Woman      ┆ No         │
│ 386          ┆ Accepted        ┆ Full-time ┆ Yes              ┆ Man        ┆ No         │
│ 2905         ┆ Waitlisted      ┆ Full-time ┆ No               ┆ Woman      ┆ No         │
│ 3043         ┆ Accepted        ┆ Part-time ┆ Yes              ┆ Non-Binary ┆ Yes        │
│ 1546         ┆ Waitlisted      ┆ Full-time ┆ No               ┆ Man        ┆ No         │
└──────────────┴─────────────────┴───────────┴──────────────────┴────────────┴────────────┘

See, a codebook can make a world of difference when developing and implementing a data workflow.

What is your favorite method for recoding variable values in Python? Share your code in the comments below.

Need help thinking through the design and development of a data pipeline? At Analytics Made Accessible, we can help you turn messy data into streamlined systems and stories that stick—get in touch today!

Ama Nyame-Mensah https://www.anyamemensah.com
