Auto-Sklearn: A Comprehensive Guide to Automated Machine Learning

Machine learning (ML) has permeated numerous sectors, including business, engineering, and research, to the point that it is difficult to find an area where it is not used. Using ML now means that for many useful tasks it’s no longer necessary to manually write a program, or even to know exactly how to solve the problem. Modern machine learning is sometimes referred to as “software 2.0,” a trend that has been super-charged by the effectiveness of, and ensuing research and development interest in, deep learning. Still, these projects almost always require substantial world-class engineering and research talent. That’s not entirely surprising, as coaxing a computer program, even one equipped with sophisticated state-of-the-art machine learning algorithms, to accomplish completely new objectives still requires human innovation. For now, while groundbreaking AI science is still difficult to automate, there’s an ever-growing volume of ML applications where a human engineer isn’t necessarily needed to optimize a model for a given task. In fact, for some tasks, leaving the choice of a specific model and the tuning of its hyperparameters up to human judgment might actually slow things down or lead to sub-par results. Instead, a good ML practitioner should take advantage of all the tools at their disposal, which now include open source off-the-shelf tools and best practices for applying ML to ML.

AutoML is a broad category of techniques and tools for applying automated search to your automated search and learning to your learning. The field is quite active and diverse, with a healthy ecosystem of contests, many of which are cataloged at automl.ai. Auto-Sklearn was developed by one of the most notable research groups pursuing Automated machine learning in the pre-eminent AutoML supergroup from Germany. This collaboration is made up of labs at the University of Freiburg and the University of Hannover. Other noteworthy contributors to the field include the scientists behind Auto-WEKA, one of the first popular AutoML toolkits, and its successor Auto-WEKA 2.0. These researchers are mostly scattered around North America but with a nucleus at the University of British Columbia in Canada. In addition to Auto-Sklearn, the Freiburg-Hannover AutoML group has also developed an Auto-PyTorch library.

Auto-Sklearn employs the well-known Scikit-Learn machine learning package for data processing and machine learning algorithms. It also includes a Bayesian Optimization search technique to find the best model pipeline for the given dataset quickly.

This article will explore Auto-Sklearn, an open-source AutoML framework, providing a comprehensive guide to its features, usage, and benefits.

What is AutoML?

AutoML is a young but fast-growing subset of machine learning that promises to improve the utility, performance, and efficiency of typical data science workflows. The main idea is to limit the hands-on involvement of data scientists by letting the tool handle the time-consuming stages of machine learning (data preprocessing, algorithm selection, hyperparameter tuning, and so on), saving setup time and speeding up model deployment. Experts can use AutoML to increase their productivity by focusing on the best-performing pipelines, and non-experts can use AutoML systems without a broad ML education.
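To make the idea concrete, here is a minimal sketch, using plain scikit-learn rather than an AutoML library, of the kind of preprocessing-plus-model-plus-hyperparameter search that AutoML tools automate end to end (the pipeline and grid below are illustrative choices, not anything prescribed by Auto-Sklearn):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A hand-written pipeline: AutoML tools search over many such
# pipelines (preprocessors, models, hyperparameters) automatically.
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.1]}
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

AutoML generalizes this pattern: instead of one hand-chosen pipeline and a small grid, the tool searches over many algorithm families and their hyperparameters at once.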


There are different types of AutoML frameworks, each with its own features. Each automates some steps of the full machine learning workflow, from preprocessing to model development.

Introducing Auto-Sklearn

Auto-Sklearn is a Python-based open-source AutoML toolkit built on top of scikit-learn. It was developed by Matthias Feurer et al. to tackle the CASH problem: Combined Algorithm Selection and Hyperparameter optimization. Put simply, we want to find the best ML model and its hyperparameters for a dataset in a vast search space containing many classifiers and many hyperparameters.
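Formally, CASH can be written as a joint minimization over algorithms and their hyperparameter spaces, with a loss averaged over K cross-validation folds (notation follows the Auto-Sklearn paper; the symbols here are illustrative):

```latex
A^*_{\lambda^*} \in \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}}
\; \frac{1}{K} \sum_{i=1}^{K} \mathcal{L}\!\left(A^{(j)}_{\lambda},\, D^{(i)}_{\mathrm{train}},\, D^{(i)}_{\mathrm{valid}}\right)
```

Here $\mathcal{A}$ is the set of candidate algorithms, $\Lambda^{(j)}$ the hyperparameter space of algorithm $A^{(j)}$, and $\mathcal{L}$ the validation loss of the model trained on the $i$-th fold.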

Key Features

Auto-Sklearn distinguishes itself through several key features:

  • Automated Algorithm Selection: Auto-Sklearn automatically searches for the best machine learning algorithm for a given dataset.
  • Hyperparameter Optimization: It automatically optimizes the hyperparameters of the selected algorithm.
  • Bayesian Optimization: Auto-Sklearn utilizes Bayesian Optimization to efficiently navigate the space of possible models and model configurations and quickly discover what works well for a given predictive modeling task.
  • Ensemble Building: By default, the search creates an ensemble of the top-performing models it discovers. To avoid overfitting, you can disable this behavior by setting ensemble_size=1 and initial_configurations_via_metalearning=0.
  • Scikit-Learn Compatibility: Auto-Sklearn employs the well-known scikit-learn machine learning package for data processing and machine learning algorithms. Scikit-learn’s friendly fit/predict API makes training models a snap, and Auto-Sklearn and Auto-PyTorch retain the same API.
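To see what “algorithm selection plus hyperparameter optimization” means in practice, here is a hand-rolled sketch in plain scikit-learn. It uses naive random search where Auto-Sklearn uses Bayesian optimization over a far larger space, so treat it as a conceptual illustration only:

```python
import random
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
random.seed(0)

# Candidate (algorithm, hyperparameter-sampler) pairs: a toy CASH search space.
space = [
    (RandomForestClassifier, lambda: {"n_estimators": random.choice([10, 50, 100])}),
    (LogisticRegression, lambda: {"C": random.choice([0.1, 1.0, 10.0]), "max_iter": 1000}),
]

best_score, best_model = -1.0, None
for _ in range(10):  # naive random search; BO would pick candidates adaptively
    algo, sample = random.choice(space)
    model = algo(**sample())
    score = cross_val_score(model, X, y, cv=3).mean()
    if score > best_score:
        best_score, best_model = score, model
print(type(best_model).__name__, round(best_score, 3))
```

Bayesian optimization improves on this loop by building a probabilistic model of how configurations perform and spending evaluations where improvement is most likely.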

What can Auto-Sklearn do for users?

By writing just five lines of Python code, beginners can get predictions and experts can boost their productivity. Auto-sklearn can solve classification and regression problems, but how? A lot goes into a machine learning pipeline. First, meta-learning shrinks the search space by warm-starting the search with configurations that worked well on similar datasets; then Bayesian optimization searches the reduced space to find and select the best-performing ML pipelines.

Auto-Sklearn V2

Recently the second version of auto-sklearn was released. Let’s review what’s changed in the new generation.


  • Improved model selection strategy: One vital step in auto-sklearn is how models are selected. Auto-sklearn V2 adds multi-fidelity optimization methods such as BOHB. The authors also showed that no single model selection strategy fits every type of problem, so several strategies are integrated and chosen between automatically.
  • Building a portfolio: Instead of using meta-features to find similar datasets in a knowledge base, auto-sklearn V2 warm-starts the search with a fixed portfolio of complementary configurations, built offline so that together they perform well across a wide range of datasets.
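The portfolio idea can be sketched with a toy example: given offline performance estimates of candidate configurations across many meta-datasets, configurations are added greedily so that the portfolio as a whole covers as many datasets as possible. This is a simplified illustration with synthetic numbers; the actual Auto-Sklearn 2 procedure is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic accuracy of 8 candidate configurations on 5 meta-datasets.
perf = rng.uniform(0.5, 1.0, size=(8, 5))

portfolio = []
for _ in range(3):  # build a portfolio of 3 configurations
    covered = perf[portfolio].max(axis=0) if portfolio else np.zeros(5)
    best_gain, best_c = -1.0, None
    for c in range(perf.shape[0]):
        if c in portfolio:
            continue
        # Gain = improvement in per-dataset best accuracy if c is added.
        gain = np.maximum(covered, perf[c]).sum() - covered.sum()
        if gain > best_gain:
            best_gain, best_c = gain, c
    portfolio.append(best_c)
print(portfolio)
```

At search time, the portfolio members are simply evaluated first, replacing the meta-feature lookup of version 1.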

Installation and Setup

First we’ll set up the needed packages and dependencies. The following commands set up your environment from the command line on a Unix-based system like Ubuntu, or from something like the Anaconda prompt if you happen to be using Windows. You’re likely to run into conflicts if you use the same environment for both AutoML libraries, so make a second environment for Auto-PyTorch. Note the extra install statements following pip install -e .: upgrading NumPy afterwards avoids the error ValueError: numpy.ndarray size changed, may indicate binary incompatibility.

First, you need to install auto-sklearn on your machine. If you get an error, you may need to install dependencies for that, so please check the official installation page.
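As a sketch, the setup might look like the following (the environment name is arbitrary, and the NumPy upgrade is only needed if you hit the binary-incompatibility error mentioned above; consult the official installation page for your platform’s exact requirements):

```shell
# Fresh environment recommended (conda shown; venv works too)
conda create -n autosklearn-env python=3.8 -y
conda activate autosklearn-env

# Upgrading NumPy avoids the binary-incompatibility ValueError
pip install --upgrade numpy
pip install auto-sklearn
```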

Using Auto-Sklearn

Depending on whether your prediction task is classification or regression, you create and configure an instance of the AutoSklearnClassifier or AutoSklearnRegressor class, fit it on your dataset, and that’s it.

Classification

Auto-sklearn can solve classification and regression problems. For a larger real-world classification exercise, try a well-known Kaggle competition such as Santander Customer Transaction Prediction: download the dataset and randomly sample 10,000 records to keep run times manageable.

Let's walk through a classification example using a standard machine learning dataset.


  1. Import necessary libraries:

    import autosklearn.classification
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris
    import sklearn.metrics
  2. Load the dataset:

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

    Other classification datasets from sklearn.datasets to try include the breast cancer dataset (load_breast_cancer) and the digits dataset (load_digits).

  3. Initialize and fit the AutoSklearnClassifier:

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,  # Time in seconds for the whole search
        per_run_time_limit=30,        # Time limit per model
        n_jobs=-1                     # Use all available cores
    )
    automl.fit(X_train, y_train, dataset_name='iris')

    Importantly, you can set the n_jobs argument to the number of cores in your system, or -1 to use all of them. The optimization process will run for as long as you allow, measured in seconds: set the time_left_for_this_task argument to the number of seconds you want the process to run. Here we use 120 seconds (2 minutes); for tougher datasets, something like 5 minutes (300 seconds) is a reasonable starting point. We also limit the time allocated to each model evaluation to 30 seconds via the per_run_time_limit argument.

  4. Make predictions:

    y_pred = automl.predict(X_test)
  5. Evaluate the model:

    print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, y_pred))

    Likewise, evaluating your models is simple. The code for using this basic AutoML class looks exactly like training a single model in the example above, but in fact it performs a hyperparameter search over multiple types of machine learning models and retains the best as an ensemble.

Regression

The second type of problem which auto-sklearn can solve is regression.

  1. Import necessary libraries:

    import autosklearn.regression
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_diabetes
    import sklearn.metrics
  2. Load the dataset:

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
  3. Initialize and fit the AutoSklearnRegressor:

    automl = autosklearn.regression.AutoSklearnRegressor(
        time_left_for_this_task=120,
        per_run_time_limit=30,
        n_jobs=-1
    )
    automl.fit(X_train, y_train, dataset_name='diabetes')
  4. Make predictions:

    y_pred = automl.predict(X_test)
  5. Evaluate the model:

    print("Mean Absolute Error:", sklearn.metrics.mean_absolute_error(y_test, y_pred))

Advanced Usage

Although Auto-sklearn can often find a well-performing pipeline without any manual settings, several parameters let you steer the search and boost your productivity.

  • time_left_for_this_task: Total time budget, in seconds, for the whole search.
  • initial_configurations_via_metalearning: Number of configurations suggested by meta-learning to warm-start hyperparameter optimization. Set it to 0 to disable meta-learning.
  • ensemble_size: The number of models in the final ensemble.
  • n_jobs: The number of parallel jobs.
  • ensemble_nbest: Number of best models considered when building the ensemble. If None, all evaluated estimators are considered.
  • exclude_estimators: Estimators to exclude from the search space.
  • metric: The metric to optimize. If you don’t define one, it is selected automatically based on the task.

Saving and Loading Models

The trained models above, for classification and regression alike, can be saved with the Python pickle module or the joblib package. The saved models can then be loaded later to make predictions directly on new data.

import pickle

# Save the model
filename = 'finalized_model.sav'
pickle.dump(automl, open(filename, 'wb'))

# Load the model and score it on the test set
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_test, y_test)
print(result)
Here the ‘wb’ argument means that we are writing the file to disk in binary mode.
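The same pattern works with joblib, which is installed alongside scikit-learn and is often preferred for models containing large NumPy arrays. A minimal sketch with a plain scikit-learn model (the same dump/load calls apply to a fitted Auto-Sklearn estimator):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Save, then reload and verify the restored model still scores.
joblib.dump(model, "finalized_model.joblib")
loaded = joblib.load("finalized_model.joblib")
print(round(loaded.score(X_test, y_test), 3))
```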

Practical Examples

Heart Disease Prediction

We will use the heart disease prediction dataset available in the UCI repository; for convenience, use the .csv version of this data from Kaggle. There are only two classes (0 = healthy, 1 = heart disease), so this is a binary classification problem. The two classes are not equally represented, which makes this an imbalanced dataset, and because of that the accuracy score will be a less reliable measure. Still, we will first test the imbalanced dataset by feeding it directly to the auto-sklearn classifier.

import autosklearn.classification
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
data = pd.read_csv('heart.csv')

# Split into features and target variable
X = data.drop('target', axis=1)
y = data['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the AutoSklearnClassifier
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # 5 minutes
    per_run_time_limit=30,
    n_jobs=-1
)
automl.fit(X_train, y_train)

# Make predictions
y_pred = automl.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")

Here, we set the maximum time for the task via the ‘time_left_for_this_task’ argument, assigning 5 * 60 seconds (5 minutes) to it. If nothing is specified, the process runs for an hour (60 minutes) by default. Other arguments, such as n_jobs (number of parallel jobs), ensemble_size, and initial_configurations_via_metalearning, can be used to fine-tune the classifier.

As the unhealthy samples outnumber the healthy ones, we will use a resampling technique (oversampling) to increase the number of healthy samples in the dataset. Techniques like SMOTE, ensemble learning (bagging, boosting), or the NearMiss algorithm can also address the imbalance. Once the skew is adjusted, we create the X and y sets again and repeat all the steps, from setting up the classifier to printing a confusion matrix, on the new X1 and y1. The model’s accuracy drops slightly after oversampling, but it is now less biased toward the majority class and therefore better optimized for both classes.
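The oversampling step can be sketched with sklearn.utils.resample. The column name target follows the heart-disease CSV used above, but synthetic data stands in for the file here, so the class counts are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic stand-in for the imbalanced heart-disease dataframe.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "feature": rng.normal(size=120),
    "target": [0] * 40 + [1] * 80,  # minority class 0, majority class 1
})

majority = data[data["target"] == 1]
minority = data[data["target"] == 0]

# Oversample the minority class up to the majority size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["target"].value_counts().to_dict())
```

Sampling with replacement duplicates minority rows; SMOTE (from the separate imbalanced-learn package) would instead synthesize new points between neighbors.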

Regression with Flights Dataset

For this task, let us use the simple ‘flights’ dataset from the seaborn datasets library.

import autosklearn.regression
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Load the dataset
data = sns.load_dataset('flights')
data = data.pivot(index='year', columns='month', values='passengers')
data = data.fillna(data.mean())
data = data.stack().reset_index(name='passengers')

# Split into features and target variable
X = data.drop('passengers', axis=1)
y = data['passengers']

# Convert year and month to numerical values
X['year'] = pd.to_numeric(X['year'])
X['month'] = pd.to_datetime(X['month'], format='%B').dt.month

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the AutoSklearnRegressor
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=300,  # 5 minutes
    per_run_time_limit=30,
    n_jobs=-1
)
automl.fit(X_train, y_train)

# Make predictions
y_pred = automl.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")

Auto-Sklearn vs. Other AutoML Tools

Auto-Sklearn is positioned among other AutoML tools like Auto-WEKA and Auto-PyTorch, both introduced earlier: Auto-WEKA from the UBC-centered group that built one of the first popular AutoML toolkits, and Auto-PyTorch from the same Freiburg-Hannover group behind Auto-Sklearn.

Auto-PyTorch, like Auto-Sklearn, is built to be extremely simple to use. In our experiments, Auto-PyTorch was efficient and effective at fitting the iris dataset, yielding training and test accuracy in the mid to high 90s. This is moderately better than the automatic scikit-learn classifier we trained earlier and much better than standard sklearn classifiers with default parameters.

Challenges and Considerations

For the packages we experimented with in today’s tutorial, we would describe their level of readiness as that of working research prototypes. There were a number of little fixes we had to go through to get everything working properly, like upgrading NumPy to 1.20.0 to fix a cryptic error message, not being able to set the run time limits as an input argument (as suggested in the documentation for Auto-Sklearn), and not being able to use a single virtual environment for both packages due to some cryptic conflicts.

  • Computational Cost: AutoML processes can be computationally expensive, especially with large datasets and complex models.
  • Interpretability: The automated nature of AutoML can sometimes lead to models that are difficult to interpret.
  • Data Preprocessing: While AutoML automates many steps, data preprocessing is still crucial for optimal performance.

tags: #auto-sklearn #tutorial
