AutoML

The purpose of the AutoML Challenge series is to promote research on reducing the need for human intervention in applying machine learning (ML) to practical problems. This refers to all aspects of automating the ML process beyond model selection, hyper-parameter optimization, and model search. Automation is desired for data loading and formatting, detection and handling of skewed data and missing values, selection of learning representation and feature extraction, matching algorithms to problems, acquisition of new data (active learning), creation of appropriately sized and stratified training, validation, and test sets, selection of algorithms that satisfy resource constraints at training and run time, the ability to generate and reuse workflows, meta-learning and learning transfer, and explanatory reports. Such automation is crucial for both robots and lifelong autonomous ML.
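One of the steps listed above, creating stratified training, validation, and test sets, can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `stratified_split`, the 60/20/20 fractions, and the toy labels are assumptions chosen for illustration and are not taken from the challenge itself.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return index lists (train, valid, test) that preserve class ratios.

    `fractions` and `seed` are illustrative parameters, not challenge settings.
    """
    rng = random.Random(seed)
    # Group example indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, valid, test = [], [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut each class in the same proportions.
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_valid = round(fractions[1] * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]
    return train, valid, test


# Imbalanced toy labels: 80 negatives, 20 positives.
labels = [0] * 80 + [1] * 20
train, valid, test = stratified_split(labels)
train_counts = Counter(labels[i] for i in train)
```

Because each class is split separately, the 4:1 class ratio of the toy labels is the same in all three subsets, which is the property that matters for imbalanced data.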

The first AutoML challenge (2015-2016) focused on mainstream ML problems, which make up the bulk of today’s industrial applications: classification and regression problems. The applications are taken from biology and medicine, ecology, energy and sustainability management, image, text, audio, speech, video and other sensor data processing, internet social media management and advertising, market analysis and financial prediction. The data present themselves as input-output pairs that are independent and identically distributed and are limited to fixed-length vectorial representations (no time series prediction). Text, speech, and video processing tasks included in the challenge are not presented in their native data representations; datasets have been preprocessed into suitable fixed-length vectorial representations.

The difficulty of the challenge lies in the complexity of the data (class imbalance, sparsity, missing values, categorical variables) and in the fact that the data come from a wide variety of domains and hence follow very different distributions. Although there exist ML toolkits that can tackle all these problems, it still requires considerable human effort to find, for a given dataset, task, evaluation metric, and available computational time, the methods and hyper-parameter settings that maximize performance. The participants’ challenge is to create the perfect black box that removes the need for human interaction.
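To make the idea of searching for good settings within the available computational time concrete, here is a minimal sketch of a time-budgeted random search. It is not the method of any participant: the tiny 1-D nearest-neighbour classifier, the synthetic data, and the names `budgeted_search`, `time_budget_s`, and `max_trials` are all assumptions made for illustration, using only the Python standard library.

```python
import random
import time


def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to `query` (1-D)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train, valid, k):
    """Fraction of validation points classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)


def budgeted_search(train, valid, time_budget_s, max_trials=50, seed=0):
    """Random search over k that stops when the time budget is spent."""
    rng = random.Random(seed)
    deadline = time.monotonic() + time_budget_s
    best_k, best_score, trials = None, -1.0, 0
    while trials < max_trials and time.monotonic() < deadline:
        k = rng.choice(range(1, 16, 2))  # odd k avoids tied votes
        score = accuracy(train, valid, k)
        if score > best_score:
            best_k, best_score = k, score
        trials += 1
    return best_k, best_score, trials


# Toy data: two overlapping 1-D Gaussian classes (purely synthetic).
rng = random.Random(1)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(2.0, 1.0), 1) for _ in range(100)]
rng.shuffle(data)
train, valid = data[:150], data[150:]
best_k, best_score, trials = budgeted_search(train, valid, time_budget_s=0.5)
print(best_k, best_score, trials)
```

The search keeps the best configuration seen so far, so it can return a usable answer whenever the budget runs out. A real AutoML system would search over many algorithms and hyper-parameters rather than a single `k`.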

After four phases of the AutoML challenge and three workshops co-organized by CDS members (ICML 2014, CiML 2015, ICML 2015), the state of the art has greatly improved. The current leaders are a group from the University of Freiburg, Germany, who developed Auto-sklearn, a wrapper around the scikit-learn Python library, in the course of the challenge. Novel methodologies in meta-learning and transfer learning have been developed. However, many participants have struggled to handle sparse data efficiently in recent phases. This is stimulating new research into making some of the favorite algorithms, such as Random Forests, efficient for sparse matrices.

A new edition of the AutoML challenge is in preparation, using new extensions of the CodaLab platform.
