Unsure how to score your Machine Learning models?
By Chanaka

Before we delve deeper into model scoring methods, it’s crucial to understand the significance of model evaluation in the context of machine learning development. The ultimate goal of any machine learning model is to make accurate predictions on unseen data. However, a model’s performance on the data it was trained on (often referred to as the training data) does not necessarily reflect its ability to generalize to new, unseen data.

This is where model evaluation comes into play. By assessing a model’s performance on a separate dataset that it hasn’t been exposed to during training (commonly known as the test data or validation data), we can gauge its ability to generalize. A model that performs well on the test data is more likely to make accurate predictions on real-world, unseen data.

Therefore, model evaluation serves as a critical step in the machine learning pipeline, allowing developers to identify and address issues such as overfitting (where the model learns to memorize the training data rather than capturing underlying patterns) or underfitting (where the model fails to capture the underlying patterns in the data).

In this article, we’ll dive into the world of model scoring in scikit-learn, exploring three main methods: the Estimator score method, the Scoring parameter, and Metric functions. We’ll break down what they are, how they differ, and when to use each one. Buckle up, and get ready to evaluate your machine learning models effectively!

Metrics and Scoring

Before diving in, let’s clarify some key terms. A metric is a specific measure that quantifies how well your model performs on a particular task. It could be accuracy for classification, mean squared error for regression, or something else entirely. Think of it as a yardstick — it tells you the distance between your predictions and the actual values.

Scoring, on the other hand, is the process of applying a specific metric to your model’s predictions. It’s like using the yardstick to measure that distance. So, scoring methods are essentially different ways to calculate these metrics and assess your model’s performance.

There are three ways of evaluating the quality of a model’s predictions: the Estimator Score method, the Scoring Parameter, and Metric Functions.

Now, let’s meet the three main scoring methods in scikit-learn, a popular machine-learning library.

Estimator Score Method

This built-in method comes with most scikit-learn estimators (models) themselves. It provides a default way to evaluate the model’s performance on a specific task. Think of it as a quick and easy score the model gives itself!

For example, a DecisionTreeClassifier has a built-in score method that calculates mean accuracy by default. This is a convenient starting point, but accuracy might not always be the most appropriate metric for your problem.

There are two basic ways of calculating this score.

Method 1

y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)
  • Prediction: The code y_preds = clf.predict(X_test) makes predictions on the unseen test data (X_test) using the trained classifier (clf). This results in an array (y_preds) containing the predicted class labels for each sample in the test set.
  • Comparison: The core of this method lies in the expression y_preds == y_test. This performs element-wise comparison between the predicted labels (y_preds) and the actual labels (y_test). The result is a boolean array where True indicates a correct prediction and False indicates an incorrect prediction.
  • Accuracy Calculation: Finally, np.mean(y_preds == y_test) calculates the average (mean) of this boolean array. Since the array contains True for correct predictions and False for incorrect ones, the average directly gives you the accuracy of the model on the test set. Accuracy is the proportion of correct predictions made by the model.

Method 2

clf.score(X_test, y_test)
  • Direct Model Score: This method leverages the built-in score method of the classifier (clf). Scikit-learn estimators (models) come with a default scoring metric: mean accuracy for classifiers and R² (coefficient of determination) for regressors. This method assumes you’re interested in that default metric.
  • Automatic Calculation: When you call clf.score(X_test, y_test), the classifier makes its own predictions on X_test internally and computes the default metric against the actual labels (y_test). It directly returns the score, saving you the steps of predicting and calculating the mean yourself.

Key Differences and When to Use Which:

  • Control vs. Convenience: Method 1 (np.mean(y_preds == y_test)) gives you more control over the scoring metric. You can easily replace the comparison (==) with other operators (e.g., != for error rate, >= for a custom threshold) to calculate different metrics. However, it requires manual calculation of the mean.
  • Default vs. Specific: Method 2 (clf.score(X_test, y_test)) is more convenient as it uses the default scoring metric associated with the classifier, but it may not be ideal if you need a different metric. A short sketch comparing the two methods follows below.
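
To make the comparison concrete, here’s a minimal sketch of both approaches. It assumes scikit-learn and NumPy are installed; the breast_cancer dataset and DecisionTreeClassifier are only placeholders for your own data and model.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data and model, chosen purely for illustration
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Method 1: compute accuracy manually from the predictions
y_preds = clf.predict(X_test)
manual_accuracy = np.mean(y_preds == y_test)

# Method 2: let the estimator score itself (mean accuracy for classifiers)
default_accuracy = clf.score(X_test, y_test)

print(manual_accuracy, default_accuracy)  # the two values should match

Both numbers should be identical here, because a classifier’s default score is exactly the accuracy computed in Method 1.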

Scoring Parameter — Your Evaluation Ruler

This method is used with tools like cross_val_score and GridSearchCV for cross-validation and hyperparameter tuning. Here, you explicitly define the scoring metric you want to use using the scoring parameter. This gives you more control over the evaluation process.

Imagine you’re training different machine learning models. You want a way to compare them and pick the best one. This is where the “scoring parameter” comes in. It acts like a ruler to measure how well your models perform.

Predefined Rulers (scorer objects)

scikit-learn provides a set of pre-built rulers for common tasks like classification and regression. These are called “scorer objects”. You can choose one of these objects as the “scoring” parameter when using tools like GridSearchCV or cross_val_score.

The table in the documentation (3.3.1.1) lists all these predefined scorers. They’re designed so that higher scores mean better performance. For metrics that naturally measure error (like mean squared error), scikit-learn provides a negated version (e.g., neg_mean_squared_error) to follow this convention.

Imagine you’re comparing different decision tree models with GridSearchCV. You can set the scoring parameter to ‘f1’ to evaluate them based on the F1-score metric (a balance between precision and recall). This way, you can find the model that performs best according to your chosen metric, as sketched below.
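
Here’s a minimal sketch of the scoring parameter in action with cross_val_score and GridSearchCV. The dataset, classifier, and parameter grid are assumptions chosen purely for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=42)

# Cross-validation scored with F1 instead of the classifier's default accuracy
f1_scores = cross_val_score(clf, X, y, cv=5, scoring="f1")

# Hyperparameter search ranked by the same metric
grid = GridSearchCV(clf, param_grid={"max_depth": [2, 4, 6]}, scoring="f1", cv=5)
grid.fit(X, y)

print(f1_scores.mean(), grid.best_params_)

For error-style metrics you would pass the negated name instead, e.g. scoring="neg_mean_squared_error", so that higher scores still mean better performance.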

Metric Functions

Scikit-learn’s sklearn.metrics module provides a rich collection of functions to assess prediction errors for various machine learning tasks. These functions are more granular than the predefined scorers offered by the scoring parameter and cater to specific evaluation needs. Here’s a breakdown of the metric function categories:

Classification Metrics

  • Accuracy: Measures the overall proportion of correct predictions (True Positives + True Negatives) / Total Samples.
  • Precision: Measures the ratio of true positive predictions to all positive predictions (True Positives / (True Positives + False Positives)). Useful for imbalanced datasets where you care more about identifying actual positives and avoiding false positives.
  • Recall: Measures the ratio of true positive predictions to all actual positive cases (True Positives / (True Positives + False Negatives)). Useful for imbalanced datasets where missing true positives is costly.
  • F1-Score: A harmonic mean of precision and recall, combining both metrics into a single score.
  • ROC AUC (Area Under the ROC Curve): Evaluates a classifier’s ability to distinguish between positive and negative classes.
  • Confusion Matrix: Provides a table summarizing the number of correct and incorrect predictions for each class.
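
Here’s a minimal sketch using a few of the classification metric functions above. The dataset and classifier are placeholders; with your own fitted model you would only need the last block.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

y_preds = clf.predict(X_test)
y_scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class, for ROC AUC

print(accuracy_score(y_test, y_preds))
print(precision_score(y_test, y_preds))
print(recall_score(y_test, y_preds))
print(f1_score(y_test, y_preds))
print(roc_auc_score(y_test, y_scores))
print(confusion_matrix(y_test, y_preds))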

Multilabel Ranking Metrics

  • Label Ranking Average Precision: Used for problems where each sample can have multiple labels. For each ground-truth label, it looks at what fraction of the labels ranked above it are also true, averaged over labels and samples.

Regression Metrics

  • Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values. Sensitive to outliers.
  • Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values. Less sensitive to outliers than MSE.
  • R-Squared: Measures the proportion of variance in the target variable explained by the model.

Clustering Metrics

  • Silhouette Score: Measures how well data points are clustered within their assigned clusters.
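
A minimal sketch of the silhouette score on a toy clustering problem; make_blobs and KMeans are placeholders for your own data and clustering algorithm.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with three well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Ranges from -1 to 1; values closer to 1 mean tighter, better-separated clusters
print(silhouette_score(X, labels))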

Choosing the Right Metric

The best metric function depends on your problem type and what aspect of performance is most important. Here are some general guidelines:

  • Classification: Accuracy is a good starting point, but for imbalanced data, consider precision, recall, or F1-score.
  • Regression: MSE or MAE are common choices; MAE can be easier to interpret because it is expressed in the same units as the target.
  • Clustering: Silhouette score is a popular option for evaluating cluster quality.

Let’s say you’re working on a regression problem and want to go beyond the basic mean squared error (MSE). You can use the mean_absolute_error function from sklearn.metrics to assess the average absolute difference between predictions and actual values. This might be more informative if you’re concerned about the magnitude of errors.
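
For instance, here’s a minimal regression sketch comparing MSE, MAE, and R² on held-out data. The diabetes dataset and LinearRegression are placeholders for your own data and model.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
y_preds = reg.predict(X_test)

print(mean_squared_error(y_test, y_preds))   # penalizes large errors heavily
print(mean_absolute_error(y_test, y_preds))  # same units as the target, easier to interpret
print(r2_score(y_test, y_preds))             # proportion of variance explained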

Why Use All Three? Understanding the Differences

So, why do we need all three methods? Here’s a breakdown of their key differences:

  • Convenience vs. Control: Estimator score methods are convenient but offer limited control over the metric used. Scoring parameters and metric functions give you more flexibility to choose the most appropriate metric for your task.
  • Specificity vs. Generality: Metric functions provide the most specific options for calculating various metrics. Estimator scores tend to be more general-purpose, while scoring parameters offer a balance between control and common use cases.
  • Integration with Tools: Scoring parameters are seamlessly integrated with cross-validation and hyperparameter tuning tools like GridSearchCV. Metric functions are standalone functions you can use for various evaluation purposes.

When to Use Which Method?

Here’s a quick guide to choosing the right scoring method:

  • Use Estimator Score Method: As a starting point for basic evaluation, especially if you’re new to model scoring.
  • Use Scoring Parameter: For cross-validation and hyperparameter tuning when you want to control the evaluation metric.
  • Use Metric Functions: When you need a specific metric not offered by estimator score methods or scoring parameters, or for more detailed evaluation after training.

Building Your Own Scoring Methods: Going Beyond the Basics

While the methods above cover most common scenarios, scikit-learn also lets you create custom scoring functions. This is helpful if you have a unique evaluation criterion not addressed by the existing metrics.

However, building custom scorers requires writing a little Python code, typically by wrapping your metric function with sklearn.metrics.make_scorer.
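
As a rough sketch, here’s one way a custom scorer could look. The mean_relative_error metric below is purely hypothetical, invented for illustration, and the dataset and model are placeholders.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

# Hypothetical custom metric: average error relative to the true value
def mean_relative_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# greater_is_better=False marks this as an error to minimize,
# so scikit-learn negates the scores to keep "higher is better"
custom_scorer = make_scorer(mean_relative_error, greater_is_better=False)

X, y = load_diabetes(return_X_y=True)
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring=custom_scorer)
print(scores.mean())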

If you found this useful, follow me for future articles. It motivates me to write more for you.
