Streamlining Your Machine Learning Workflow with Scikit-Learn Pipelines — Pipeline Explained
By Chanaka Prasanna

When you’re diving into machine learning, there’s a crucial step you must take before training your model — splitting your data into a training set and a test set. Both datasets undergo data cleaning and preprocessing before you feed them into your machine-learning model. But let’s face it, writing repetitive code for these steps is far from efficient. This is where the Scikit-Learn pipeline shines.

What is a Scikit-Learn Pipeline?

Think of the Scikit-Learn pipeline as an elegant and efficient way to streamline your machine-learning workflow. Instead of manually coding each preprocessing and transformation step for your training and test datasets, the pipeline allows you to link these steps together seamlessly.

Imagine you have a single pipeline where you can input any data. This data will be automatically transformed into the appropriate format for model training or prediction. The Scikit-Learn pipeline does just that, making your code cleaner, more readable, and easier to adjust.
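
For example, here is a minimal sketch of the idea using a tiny made-up numeric dataset (the data, variable names, and step names below are placeholders, not the laptop example used later in this post):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X_demo = rng.rand(100, 3) * 100                      # toy features on very different scales
y_demo = X_demo @ np.array([0.5, -2.0, 1.0]) + 5.0   # toy target

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),   # step 1: a transformer
    ('model', LinearRegression())   # step 2: the final estimator
])

demo_pipe.fit(X_tr, y_tr)          # fits the scaler, then fits the model on the scaled data
print(demo_pipe.score(X_te, y_te)) # re-applies the same scaling, then scores the model

One object carries both the preprocessing and the model, so exactly the same transformation is applied at fit time and at prediction time.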

Breaking Down the Pipeline Components

In a typical machine learning task, you often perform a sequence of transformations on your raw dataset before applying a final estimator. Here’s how each component fits into the pipeline.

Transformers

These are steps that transform your data. Each transformer must implement fit and transform methods (most also provide a combined fit_transform shortcut). Examples include scaling, normalization, and feature extraction.

Didn’t get it?

Don’t worry. I will explain it.

In scikit-learn, transformers are a special type of class designed to preprocess and prepare your data for machine learning models. They play a crucial role in the machine learning pipeline, which is a sequence of steps that transforms raw data into a format suitable for training and using a model.

Key Points about Transformers

  1. Function

Unlike estimators and predictors (which often focus on model training and prediction), transformers focus solely on data transformation. They manipulate the features (columns) in your data to make them more suitable for modeling.

2. Tasks Performed

Transformers can perform various data preprocessing tasks, including:

  • Scaling — Standardizing or normalizing features to have a similar range.
  • Encoding — Converting categorical data into numerical representations.
  • Feature Selection — Choosing a subset of relevant features.
  • Imputation — Filling missing values in the data.
  • Custom Transformations — You can create custom transformers for specific data manipulation needs.

3. Methods

Transformers typically implement two core methods, plus a convenient shortcut:

  • fit(X) — This method analyzes the training data (X) to learn the parameters or statistics needed for the transformation. For example, a scaler might calculate the mean and standard deviation during fitting.
  • transform(X) — This method applies the learned transformation to new data (X). Using the learned parameters from fit, the transformer modifies the features in the data.
  • fit_transform(X) — This is a convenience method that combines fit and transform into one step. It is commonly used during preprocessing to fit the transformer to the training data and transform it in a single call.

Example

Imagine you have a dataset with features like age (numerical) and city (categorical). A pipeline might include the following transformers (a short code sketch follows this list):

  1. StandardScaler — This transformer would learn the mean and standard deviation of the age column during fit and then subtract the mean and divide by the standard deviation during transform.
  2. OneHotEncoder — This transformer would convert the city column into one-hot encoded features during fit (e.g., creating separate features for each unique city). During transform, it would apply the one-hot encoding to new data.
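
Here is a quick sketch of that example. The ages and city names below are invented purely for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

train = pd.DataFrame({"age": [25, 32, 47, 51],
                      "city": ["London", "Paris", "London", "Berlin"]})
new = pd.DataFrame({"age": [29], "city": ["Paris"]})

scaler = StandardScaler()
scaled_train_age = scaler.fit_transform(train[["age"]])  # fit learns the mean/std, transform scales
scaled_new_age = scaler.transform(new[["age"]])          # reuses the mean/std learned from the training data

encoder = OneHotEncoder()
encoded_train_city = encoder.fit_transform(train[["city"]]).toarray()  # fit learns the unique cities
encoded_new_city = encoder.transform(new[["city"]]).toarray()          # the same columns are applied to new data
print(encoder.get_feature_names_out())  # e.g. ['city_Berlin' 'city_London' 'city_Paris'] (scikit-learn 1.0+)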

Transformers are essential building blocks in scikit-learn pipelines. By effectively preprocessing your data, transformers enhance the performance and generalizability of your machine-learning models.

Predictors

These are the final step in your pipeline. They must implement the fit and predict methods, or a combined fit_predict method.

I don’t think you want an explanation for predictors.

Benefits of Using Pipelines

  1. Code Efficiency — No more repetitive code. The pipeline handles the sequence of transformations and predictions seamlessly.
  2. Readability — Your code is much cleaner and easier to understand.
  3. Parameter Tuning — Pipelines make it simple to perform hyperparameter tuning with GridSearchCV without leaking information from the validation data (see the sketch after this list).
  4. Modularity — Each step in your pipeline can be easily adjusted or replaced.
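
To make the parameter-tuning point concrete, here is a hedged sketch of tuning a whole pipeline with GridSearchCV. The toy data, step names, and parameter values below are purely illustrative.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X_toy, y_toy = rng.rand(60, 3), rng.rand(60)  # made-up data just for the sketch

tune_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(random_state=0))
])

# Pipeline parameters are addressed as <step_name>__<parameter_name>
param_grid = {
    "regressor__n_estimators": [50, 100],
    "regressor__max_depth": [None, 5],
}

grid = GridSearchCV(tune_pipe, param_grid, cv=5)
grid.fit(X_toy, y_toy)  # each CV fold re-fits the scaler on that fold's training split only, so nothing leaks into the validation split
print(grid.best_params_)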

Pipelines in Action

Let’s understand the data set first.

Note: The dataset below is not real. I generated it using ChatGPT just for this post.

Download the data set

Download the notebook

  • Make — Brand of the laptop (e.g., Dell, HP, Lenovo, Apple, Asus, Acer)
  • Colour — Color of the laptop (e.g., Black, Silver, Blue, White, Red)
  • Usage — Usage in hours (unit: hours)
  • USB Ports — Number of USB ports available on the laptop (unit: count)
  • Price — Price of the laptop in dollars (unit: dollars)

It’s time to code…

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Scikit-Learn imports
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer # To fill missing values
from sklearn.preprocessing import OneHotEncoder # To turn our categorical variables into numbers
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor # Our estimator/model
from sklearn.model_selection import train_test_split # To split the data into train and test sets

# Import our data set
data = pd.read_csv('./data/laptop_sales.csv') # Replace with your actual path

This is what our data looks like.

data.head()

The first five records are displayed.

# Check missing values
data.isna().sum()

This prints the null value count for each column.

# Drop the rows with no labels in our target variable
data.dropna(subset=["Price"], inplace=True)
data.isna().sum()

Now the null value count for the Price column is 0.
# Split data into X & y 
X = data.drop("Price", axis=1) 
y = data["Price"]

We’ve dropped the rows with no labels and split our data into X and y. Now let’s create a Pipeline() to fill the rest of the missing values, encode the categorical columns (turn them into numbers), and fit a model.

Let’s define the categorical, port, and numeric features, then build a transformer for each.

We’ll do the following with the Pipeline() class:

  • Categorical transformer — fill our missing categorical values with the value ‘missing’ and then one-hot encode them.
  • Ports transformer — fill the missing values in the USB Ports column with the value 3.
  • Numeric transformer — fill the missing values in the numeric column with the mean of the rest of the column.
# Define categorical columns 
categorical_features = ["Make", "Colour"] 
 
# Create categorical transformer (imputes missing values, then encodes them) 
categorical_transformer = Pipeline(steps=[ 
  ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), 
  ('onehot', OneHotEncoder(handle_unknown='ignore'))                                          
]) 
 
# Define port feature 
port_feature = ["USB Ports"] 
# Create port transformer (fills all USB Ports missing values with 3) 
port_transformer = Pipeline(steps=[ 
  ('imputer', SimpleImputer(strategy='constant', fill_value=3)), 
]) 
 
# Define numeric features 
numeric_features = ["Usage (Hours)"] 
# Create a transformer for filling all missing numeric values with the mean 
numeric_transformer = Pipeline(steps=[ 
  ('imputer', SimpleImputer(strategy='mean'))   
])

categorical_features — It specifies the columns in the dataset that contain categorical data, which are "Make" and "Colour".

categorical_transformer — This pipeline consists of two steps

  • SimpleImputer: It fills missing values in categorical features with a constant value, in this case, 'missing'.
  • OneHotEncoder: It encodes categorical features using a one-hot encoding scheme. The handle_unknown='ignore' parameter tells the encoder to ignore any categories it did not see during fitting instead of raising an error (a short sketch of this behaviour appears after these explanations).

port_feature — It specifies the column in the dataset that contains the port information, which is "USB Ports".

port_transformer — This pipeline consists of one step

  • SimpleImputer: It fills missing values in the "USB Ports" column with a constant value, in this case, 3.

numeric_features — It specifies the column in the dataset that contains numeric data, which is "Usage (Hours)".

numeric_transformer — This pipeline consists of one step

  • SimpleImputer: It fills missing values in numeric features with the mean of the respective column.
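
Before moving on, here is a quick sketch of what handle_unknown='ignore' does in practice. The colour values are made up for illustration.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(pd.DataFrame({"Colour": ["Black", "Silver"]}))

# "Gold" was never seen during fit, so its row becomes all zeros instead of raising an error
print(enc.transform(pd.DataFrame({"Colour": ["Black", "Gold"]})).toarray())
# [[1. 0.]
#  [0. 0.]]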

Let’s combine Pipelines with ColumnTransformer()

# Create a column transformer which combines all of the other transformers  
preprocessor = ColumnTransformer( 
    transformers=[ 
      # (name, transformer_to_use, features_to_transform) 
      ('categorical', categorical_transformer, categorical_features), 
      ('port', port_transformer, port_feature), 
      ('numerical', numeric_transformer, numeric_features) 
])

This code creates a ColumnTransformer named preprocessor that combines different preprocessing steps for categorical, port, and numeric features. Each step is specified with a name, a corresponding transformer pipeline, and the features it should operate on. This allows for tailored preprocessing treatments on specific columns of the input data. The preprocessor can then be seamlessly integrated into a machine-learning pipeline, ensuring that all necessary transformations are applied consistently before training the model.
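
If you like, you can sanity-check the preprocessor on its own before wiring it into a model. The exact number of output columns depends on how many unique makes and colours your data contains, and the result may be a sparse matrix.

# Optional sanity check: run the ColumnTransformer by itself
transformed = preprocessor.fit_transform(X)
print(transformed.shape)  # same number of rows; columns = one-hot Make/Colour + USB Ports + Usage (Hours)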

Let’s create a Pipeline() to preprocess and model our data with the ColumnTransformer() and RandomForestRegressor().

# Create the preprocessing and modelling pipeline 
 
model = Pipeline(steps=[('preprocessor', preprocessor), # fill our missing data and will make sure it's all numbers 
                        ('regressor', RandomForestRegressor())]) # this will model our data

It utilizes scikit-learn’s Pipeline class to sequentially apply a preprocessing step (preprocessor) followed by a regression model (RandomForestRegressor). The preprocessor object encompasses tasks such as handling missing data and ensuring all features are in numerical format. Once the data is preprocessed, it is passed to the random forest regressor, which builds a model based on the preprocessed data. This pipeline streamlines the process of data preparation and model training, providing a convenient and efficient way to handle both tasks in a coherent manner.

Let's make our training and test data, then fit the preprocessing and modelling Pipeline() on the training data.

# Split data into train and test sets 
np.random.seed(42) 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 
 
# Fit the model on the training data  
# (When fit() is called with a Pipeline(), fit_transform() is used for transformers) 
model.fit(X_train, y_train)

When you run this in a Jupyter notebook, the fitted Pipeline is typically rendered as a diagram showing each of its steps.

Let’s evaluate the model on the test data.

model.score(X_test, y_test)

I got a score of 0.9482058333233976 here (for a regressor, score returns the R² value rather than accuracy). You may get different values depending on your random state.
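
Because all of the preprocessing lives inside the pipeline, you can also call predict directly on raw rows; the imputers and the encoder run automatically before the regressor sees anything.

# Predict straight from raw rows of the test set (categories and missing values included)
model.predict(X_test.head())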

In conclusion, Scikit-Learn pipelines are a powerful tool to streamline your machine learning workflows. They help in keeping your code DRY (Don’t Repeat Yourself), improve readability, and simplify hyperparameter tuning. Whether you’re a beginner or an experienced data scientist, pipelines can significantly enhance your efficiency and model performance. So, next time you’re working on a machine learning project, give Scikit-Learn pipelines a try!

Follow me for more blogs

Follow me on LinkedIn