May 3, 2025
Follow along with this post, which documents some common data preprocessing steps for machine learning using Python.
import pandas as pd
import numpy as np
For this example I will create my own table from a Python dictionary.
data_dict = {
'ProductID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115],
'Category': ['Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing',
'Electronics', 'Beauty', 'Home Goods', 'Clothing', np.nan,
'Electronics', 'Home Goods', 'Beauty', 'Clothing', 'Electronics'],
'SubCategory': ['Laptop', 'T-Shirt', 'Camera', 'Bedding', 'Jeans',
'Smartphone', 'Skincare', 'Kitchenware', np.nan,
'T-Shirt', 'Headphones', 'Decor', 'Makeup', 'Jacket', 'Laptop'],
'SaleDate': ['2023-01-15', '2023-01-16', '2023-01-17', '2023-02-10', '2023-02-11',
'2023-02-20', '2023-03-05', '2023-03-06', '2023-04-10', '2023-04-12',
'2023-04-15', '2023-05-01', '2023-05-04', '2023-05-10', '2023-05-12'],
'Quantity': [1, 3, 1, 2, 2, 1, 5, 1, 3, 1, 4, 1, 2, 1, 1],
'UnitPrice': [1200, 25, 800, 150, 70, 950, 30, np.nan, 65, 25,
150, 50, 45, 120, 1300],
'Discount': [0.1, 0.0, 0.15, 0.05, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0, 0.25, 0.0, 0.05, 0.1, 0.12],
'Region': ['North', 'South', 'West', 'East', 'South', 'West', 'North', 'East', 'South',
'West', 'North', 'East', 'South', 'West', 'North'],
'CustomerID': [1001, 1002, 1003, 1004, 1002, 1005, 1001, 1004, 1002, 1003, 1005, 1004, 1001, 1002, 1005],
'TotalSale': [1080, 75, 680, 285, 140, 760, 150, np.nan, 195, 75, 450, 50, 85.5, 108, 1144]
}
# Create a DataFrame from the Python dictionary
df = pd.DataFrame(data_dict)
# Copy of df for future examples
original_df = df.copy()
# Show the first 5 rows
df.head()
df.info()
# Output tells you: total entries, column names, non-null counts per column,
# and data types (Dtype).
# Look closely at the Dtypes: object usually means string/mixed types. SaleDate is
# an object column and will need conversion to datetime later.
| # | Column   | Non-Null Count | Dtype  |
|---|----------|----------------|--------|
| 3 | SaleDate | 15 non-null    | object |
df.describe()
# Gives count, mean, std dev, min, max, quartiles for numerical columns.
# Look at min/max: Are they reasonable?
# Look at mean vs median (50%): If they are very different, it suggests skewness.
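For example, comparing the mean and median of UnitPrice in this dataset (values approximate):
# A large gap between mean and median hints at a right-skewed column
print(df['UnitPrice'].mean())    # ~355.7, pulled up by the expensive laptops
print(df['UnitPrice'].median())  # 95.0, much lower than the mean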
df.isnull().sum()
# This is KEY! Shows exactly how many NaNs (Not a Number) are in each column.
# We see missing values in Category, SubCategory, UnitPrice, and TotalSale.
# isnull() marks each missing value as True (1), and .sum() counts them per column
| Column Name | isnull |
|-------------|--------|
| ProductID   | 0      |
| Category    | 1      |
| SubCategory | 1      |
| SaleDate    | 0      |
| Quantity    | 0      |
from sklearn.impute import SimpleImputer
# We'll use median because price distributions can be skewed.
num_imputer = SimpleImputer(strategy='median')
# Note: Imputers expect 2D input, so use [['UnitPrice']]
# Select the column(s) you want to replace NaN values
num_imputer.fit(df[['UnitPrice']])
# Transform the column (actually replaces NaNs)
df['UnitPrice'] = num_imputer.transform(df[['UnitPrice']])
df.isnull().sum() # UnitPrice should now have 0 missing
df[df['ProductID'] == 108] # Show the row (ProductID 108) where UnitPrice was NaN
| UnitPrice (before) | UnitPrice (after) |
|--------------------|-------------------|
| NaN                | 95.0              |
# Replace all missing values with the most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
# fit and transform in one step
df['Category'] = cat_imputer.fit_transform(df[['Category']])
# READ BELOW! ↓↓↓
When running the code, I initially received an unhelpful error: ValueError: 2
To investigate, I tried separating the fit and transform methods. This produced a different, more informative error message: ValueError: Expected a 2-dimensional container but got <class 'pandas.core.series.Series'> instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.
This second error shows that the transform step returns a 2D array, while the code tries to assign it to a 1D pandas Series (the 'Category' column). Using the flatten() method resolved the error.
This made me wonder why the same error didn't occur earlier when we imputed the 'UnitPrice' column. I found out that numeric data might be handled differently during the transformation and assignment than object (string) data.
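To make the shapes concrete, here is a small illustrative check (using the cat_imputer defined above):
# The imputer output is 2D: one column, one row per sample
imputed = cat_imputer.fit_transform(df[['Category']])
print(imputed.shape)            # (15, 1)
print(imputed.flatten().shape)  # (15,) - 1D, ready to assign back to a single column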
# fit and transform in one step and convert to 1D array
df['Category'] = cat_imputer.fit_transform(df[['Category']]).flatten()
# CHALLENGE!
# For SubCategory also replace all np.nan values with cat_imputer
# See if there are still any missing values in our data
# .------------------------------------------------------.
# | _____ _ ____ _ |
# || ___| | ___ | _ \ _____ _____| | ___ _ __ ___ |
# || |_ | |/ _ \| | | |/ _ \ \ / / _ \ |/ _ \| '_ \/ __||
# || _| | | (_) | |_| | __/\ V / __/ | (_) | |_) \__ \|
# ||_| |_|\___/|____/ \___| \_/ \___|_|\___/| .__/|___/|
# | |_| |
# '------------------------------------------------------'
df['SubCategory'] = cat_imputer.fit_transform(df[['SubCategory']]).flatten()
df.isnull().sum()
The TotalSale column still contains a missing value. Since this is our target (dependent) variable for training a supervised model, we cannot impute it using the median or average. Instead, we drop the entire row containing the missing TotalSale value.
# subset: List of columns when checking for NaN values to drop
# inplace: Modify the DataFrame directly
df.dropna(subset=['TotalSale'], inplace=True)
df.isnull().sum()
# Convert SaleDate (object) to datetime64[ns]
df['SaleDate'] = pd.to_datetime(df['SaleDate'])
# You can now extract components from the new Dtype
df['SaleYear'] = df['SaleDate'].dt.year
df['SaleMonth'] = df['SaleDate'].dt.month
df['SaleDay'] = df['SaleDate'].dt.day
df['SaleDayOfWeek'] = df['SaleDate'].dt.dayofweek
df['SaleDayOfYear'] = df['SaleDate'].dt.dayofyear
# Now that date components are extracted, we can perform time-based analysis
# Count the number of sales per month
df['SaleMonth'].value_counts().sort_index()
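If you want the actual revenue per month rather than a count of rows, a groupby sum would do it; a quick sketch:
# Sum TotalSale per month instead of counting rows
df.groupby('SaleMonth')['TotalSale'].sum()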
# CHALLENGE!
# Count the number of sales per day of the week
# Drop the original SaleDate column as it's no longer needed
# .------------------------------------------------------.
# | _____ _ ____ _ |
# || ___| | ___ | _ \ _____ _____| | ___ _ __ ___ |
# || |_ | |/ _ \| | | |/ _ \ \ / / _ \ |/ _ \| '_ \/ __||
# || _| | | (_) | |_| | __/\ V / __/ | (_) | |_) \__ \|
# ||_| |_|\___/|____/ \___| \_/ \___|_|\___/| .__/|___/|
# | |_| |
# '------------------------------------------------------'
df['SaleDayOfWeek'].value_counts().sort_index()
# This time we use drop() and specify what column(s) to drop
# Don't forget to use inplace=True, else it won't be modified directly
df.drop(columns=['SaleDate'], inplace=True)
Machine learning models need numerical input, so we need a way to convert categories into numerical values.
We will be discussing two ways to do this.
LabelEncoder works by assigning a unique integer to each category (East=0, North=1, South=2, West=3). However, this introduces a problem: it creates an artificial order. The encoding implies that West (3) is somehow 'greater' than East (0), which is likely not true. This unintended ordering can be a problem for machine learning models. As its name suggests, LabelEncoder is best suited for encoding the target variable (labels), not features.
For demonstration purposes I will use LabelEncoder on the feature 'Region'.
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['Region_LabelEncoded'] = label_encoder.fit_transform(df['Region'])
print(df[['Region', 'Region_LabelEncoded']].head(4))
# I will drop the Region_LabelEncoded, because we will use OneHotEncoding
# df.drop(columns=['Region_LabelEncoded'], inplace=True)
# A different way to drop the column
# axis=1 means "drop column" - axis=0 means "drop rows"
df = df.drop('Region_LabelEncoded', axis=1)
| Region | Region_LabelEncoded |
|--------|---------------------|
| North  | 1                   |
| South  | 2                   |
| West   | 3                   |
| East   | 0                   |
One-Hot Encoding is preferred for nominal features, i.e. categories without an inherent order. Examples of nominal features are region, color, and product category. Examples of ordinal features, which do have a natural order, are size (Small/Medium/Large) and education level.
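For truly ordinal features, scikit-learn also provides OrdinalEncoder, which lets you spell out the order explicitly. A small sketch with a hypothetical 'Size' feature (not part of our dataset):
from sklearn.preprocessing import OrdinalEncoder
# The categories parameter fixes the order: Small < Medium < Large
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
sizes = pd.DataFrame({'Size': ['Medium', 'Small', 'Large']})
print(ordinal_encoder.fit_transform(sizes))  # [[1.], [0.], [2.]]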
from sklearn.preprocessing import OneHotEncoder
# Select columns for One-Hot Encoding
cat_columns = ['Category', 'SubCategory', 'Region']
# handle_unknown='ignore' prevents errors if unseen categories are in the test data
# sparse_output=False returns a dense NumPy array instead of a sparse matrix, so the data is easier to view
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# Fit Transform selected columns
onehot_encoded_df = onehot_encoder.fit_transform(df[cat_columns])
# The result is a plain NumPy array without column names, which makes the features hard to distinguish
# Retrieve all feature names
onehot_feature_names = onehot_encoder.get_feature_names_out(cat_columns)
'''
array(['Category_Beauty', 'Category_Clothing', 'Category_Electronics',
'Category_Home Goods', 'SubCategory_Bedding', 'SubCategory_Camera',
'SubCategory_Decor', 'SubCategory_Headphones', ...
'''
# Now we create a new DataFrame with the new column data and names
# Keep df's index so pd.concat() aligns the rows correctly (remember we dropped a row earlier)
onehot_df = pd.DataFrame(onehot_encoded_df, columns=onehot_feature_names, index=df.index)
# Now we need to concat the onehot_df with our original df
df = pd.concat([df, onehot_df], axis=1)
# Drop the categorical columns that are now one-hot encoded
df.drop(columns=cat_columns, inplace=True)
| Region_East | Region_North | Region_South | Region_West |
|-------------|--------------|--------------|-------------|
| 1.0         | 0.0          | 0.0          | 0.0         |
| 0.0         | 0.0          | 1.0          | 0.0         |
| 0.0         | 0.0          | 0.0          | 1.0         |
That was quite a lot to take in. Let's quickly recap the steps for One-Hot Encoding: select the categorical columns, fit_transform them with OneHotEncoder, retrieve the new column names with get_feature_names_out(), build a DataFrame from the encoded array, concatenate it with the original DataFrame, and drop the original categorical columns.
An alternative is to use pandas' pd.get_dummies().
# A quick and simple alternative, but less flexible
# dummy_na=True also displays NaN values
pd.get_dummies(original_df['Category'], dummy_na=True)
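pd.get_dummies() can also encode several columns of a DataFrame in one call; an illustrative example (not part of the original walkthrough):
# One-hot encode multiple columns at once with pandas
pd.get_dummies(original_df, columns=['Category', 'Region'], dummy_na=True)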
Many algorithms work better when numerical features are on a similar scale. Feature scaling is used to bring features in your data to a comparable range. For example, if one feature ranges from 0 to 10 and another from 10,000 to 50,000, the feature with larger values can dominate the model. We will be using the StandardScaler from scikit-learn to fix this.
from sklearn.preprocessing import StandardScaler
# Select numerical columns to scale
# Columns that already have values between 0 and 1 do not have to be scaled
cols_to_scale = ['Quantity', 'UnitPrice', 'Discount', 'SaleYear',
'SaleMonth', 'SaleDay', 'SaleDayOfWeek', 'SaleDayOfYear']
scaler = StandardScaler()
# Fit Transform selected columns
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
All values are now scaled!
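To verify, a quick illustrative check of the scaled columns:
# Sanity check: means should be ~0 and standard deviations ~1
# (describe() uses the sample std, so it won't be exactly 1 on so few rows)
df[cols_to_scale].describe().loc[['mean', 'std']].round(2)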
It's important to remember to fit the scaler only on the training data after the split! Otherwise, you risk leaking knowledge of the test set into the training process. Always use .fit() on the training data only. Then use .transform() on both the training and test sets using the scaler fitted on the training data.
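A minimal sketch of that pattern (X_train and X_test here stand for a hypothetical train/test split; we create a real one in the next section):
scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those training statistics on the test data - no refitting
X_test_scaled = scaler.transform(X_test)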
Until now we've done all the steps manually. It's easy to make errors this way. A solution for this is ColumnTransformer. We can define a pipeline where we can apply transformers to specific columns to handle the preprocessing.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
# Create a new DataFrame
test_df = pd.DataFrame(data_dict)
# Drop NaN values
test_df.dropna(subset=['TotalSale'], inplace=True)
# Convert object to datetime
test_df['SaleDate'] = pd.to_datetime(test_df['SaleDate'])
# Define X and y
X = test_df.drop(['ProductID', 'CustomerID', 'TotalSale', 'SaleDate'], axis=1)
y = test_df['TotalSale']
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# A function to extract date features
def extract_date_features(dataframe):
    df_copy = dataframe.copy()
    # Recreate SaleDate
    if 'SaleDate' not in df_copy.columns:
        df_copy['SaleDate'] = test_df['SaleDate']
    if 'SaleDate' in df_copy.columns:
        df_copy['SaleYear'] = df_copy['SaleDate'].dt.year
        df_copy['SaleMonth'] = df_copy['SaleDate'].dt.month
        df_copy['SaleDay'] = df_copy['SaleDate'].dt.day
        df_copy['SaleDayOfWeek'] = df_copy['SaleDate'].dt.dayofweek
        df_copy.drop(columns=['SaleDate'], inplace=True)
    return df_copy
X_train = extract_date_features(X_train)
X_test = extract_date_features(X_test)
date_features = ['SaleYear', 'SaleMonth', 'SaleDay', 'SaleDayOfWeek']
# Define columns for pipelines
numerical_features = ['Quantity', 'UnitPrice', 'Discount'] + date_features
categorical_features = ['Category', 'SubCategory', 'Region']
# Creating pipelines
numerical_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# //////////////////////////////////////////////////////////////////////
# CHALLENGE
# After implementing the numerical_pipeline, can you think
# of how to implement the 'categorical_pipeline'?
# Hint: the process is the same
# //////////////////////////////////////////////////////////////////////
# .------------------------------------------------------.
# | _____ _ ____ _ |
# || ___| | ___ | _ \ _____ _____| | ___ _ __ ___ |
# || |_ | |/ _ \| | | |/ _ \ \ / / _ \ |/ _ \| '_ \/ __||
# || _| | | (_) | |_| | __/\ V / __/ | (_) | |_) \__ \|
# ||_| |_|\___/|____/ \___| \_/ \___|_|\___/| .__/|___/|
# | |_| |
# '------------------------------------------------------'
categorical_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_pipeline, numerical_features),
('cat', categorical_pipeline, categorical_features),
], remainder='passthrough' # Handles columns that weren't specified
)
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
# For display purposes - get the feature names of the transformed data
numerical_names = list(numerical_features)
categorical_names = list(preprocessor.named_transformers_['cat']
                         .named_steps['onehot'].get_feature_names_out(categorical_features))
So, as mentioned earlier, we split the data first before we apply the preprocessing steps.
We create a pipeline where we chain all preprocessing steps. This helps with preventing mistakes. It also ensures the same steps are always applied in sequence.
After we've created the pipelines, we apply them to specific columns in our dataset with the help of ColumnTransformer. Numerical and categorical features need distinct treatments. You pass in a name for each transformation ('num', 'cat'), the pipeline to apply (numerical_pipeline, categorical_pipeline) and the list of columns to apply it to (numerical_features, categorical_features). The parameter 'remainder' controls what happens to columns not listed in the transformers: 'passthrough' keeps them and 'drop' removes them.
Finally we fit_transform the training data (X_train) to learn the preprocessing parameters and apply them. Then, we use the same fitted preprocessor to only transform the test data (X_test), applying the parameters learned from the training data.
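If you want to go one step further (an extra sketch, not part of the steps above), the same preprocessor can be chained with a model inside a single Pipeline, so fitting and predicting run all preprocessing automatically. Here I assume a plain LinearRegression as a placeholder model:
from sklearn.linear_model import LinearRegression
# Chain the ColumnTransformer with a (placeholder) model
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())  # assumption: any estimator could go here
])
# fit() fits every preprocessing step on the training data, then the model
model_pipeline.fit(X_train, y_train)
# predict() only transforms the test data with the already-fitted steps
predictions = model_pipeline.predict(X_test)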
We're getting close to the end, and I just want to share a few more pandas operations that might be useful on your journey.
# Shows all unique values (using original_df, the untouched copy we saved earlier)
original_df['Category'].unique() # ['Electronics', 'Clothing', 'Home Goods', 'Beauty', nan]
# Shows the count of unique values (NaN is not counted)
print(original_df['Category'].nunique()) # 4
# Counts occurrences of each region
original_df['Region'].value_counts()
# Region
# North 4
# South 4
# West 4
# East 3
# Shows percentages instead of counts
original_df['Region'].value_counts(normalize=True)
# Region
# North 0.266667
# South 0.266667
# West 0.266667
# East 0.200000
# Drop columns that are not needed for the model
# With axis=1 you specifically remove COLUMNS, not rows
df_2 = df.drop(['ProductID', 'CustomerID', 'TotalSale'], axis=1, errors='ignore')
# Quickly display n (n=5 by default) entries
df.head(n=10)
# Apply a custom function to create a new feature (again on original_df)
original_df['IsDiscounted'] = original_df['Discount'].apply(lambda x: 1 if x > 0 else 0)
discount_df = original_df[['ProductID', 'IsDiscounted']]
print(discount_df.to_string(index=False))
# ProductID IsDiscounted
# 101 1
# 102 0
# 103 1
# 104 1
# 105 0
Preprocessing is the process of understanding your data, handling missing values, converting data types, scaling numerical features and possibly creating new features. We've used tools like SimpleImputer, OneHotEncoder, StandardScaler and ColumnTransformer. The end.
1. Which Pandas method is commonly used to get a quick overview of column data types and non-null counts in a DataFrame df?
df.info()
2. What function in Pandas is typically used to identify missing values (like np.nan) in a DataFrame?
df.isnull().sum()
3. You have a numerical column 'Age' with some missing values. If you suspect the column has outliers, which strategy in SimpleImputer is generally more robust: 'mean' or 'median'?
median. The median is less sensitive to extreme outliers
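A tiny illustration with made-up ages (not from the dataset above):
import numpy as np
ages = np.array([25, 30, 35, 40, 400])  # 400 is an outlier
print(np.mean(ages))    # 106.0 - pulled up heavily by the outlier
print(np.median(ages))  # 35.0  - barely affected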
4. You need to impute missing values in a categorical column 'Color' (e.g., 'Red', 'Blue', np.nan). Which SimpleImputer strategy is most appropriate?
category_imputer = SimpleImputer(strategy='most_frequent')
5. Which Pandas function converts a column containing date strings (like '2024-01-15') into proper datetime objects?
df['Date'] = pd.to_datetime(df['Date'])
6. After converting a 'Date' column to datetime objects, how can you easily extract the month as a new feature?
df['Month'] = df['Date'].dt.month
7. You have a categorical feature 'Size' with values ['Small', 'Medium', 'Large']. Which encoder could potentially be used if you want to represent this inherent order with integers (0, 1, 2)?
encoder = LabelEncoder() could be used, but note that it assigns integers alphabetically (Large=0, Medium=1, Small=2). OrdinalEncoder with an explicit category order is safer if you need Small=0, Medium=1, Large=2.
8. Why is OneHotEncoder generally preferred over LabelEncoder for nominal categorical features (like 'Region': ['North', 'South', 'West']) when preparing data for most machine learning models?
It prevents the model from mistakenly assuming that there's an order or hierarchy between categories when there isn't one
9. What does OneHotEncoder create for a categorical feature?
It creates a new binary (0/1) column for each category
10. Which Scikit-learn transformer scales features so they have a mean of 0 and a standard deviation of 1?
StandardScaler()
11. Why do you think feature scaling is often important for algorithms like Linear Regression?
Because these algorithms can be sensitive to the scale of input features. Features with larger values (e.g. 10,000-50,000 versus 1-10) might dominate the learning process unfairly.
12. What is the primary purpose of splitting your data into a training set and a testing set?
To evaluate the model's performance on unseen data.
13. What problem can occur if you fit your imputer or scaler on the entire dataset before splitting into train and test sets?
Data Leakage. Information from the test set leaks into the training process.
14. Which Scikit-learn tool is useful for applying different preprocessing steps (imputation, scaling, encoding) to different columns of your data within a single pipeline?
ColumnTransformer()
15. If 'TotalSale' is the variable (Y) you want your model to predict (the target variable), should you typically include it in the list of features being scaled by StandardScaler? Why or why not?
No! The model's goal is to predict the target in its original units. Scaling is applied only to the input features (X).