The BioNLP challenge is a benchmarking challenge, organized by an interdisciplinary community around biology and text mining.
International competitions play an important role in advancing Information Extraction (IE) as a text-mining field, especially in the biomedical domain. They aim to provide sound frameworks for comparing and evaluating technologies on high-quality benchmarks. The BioNLP Shared Task (BioNLP-ST) is a community-wide effort in the biomedical field on fine-grained information extraction.
The research laboratories MaIAGE (INRA), LIMSI (CNRS), and IJPB (INRA) organized a task on information extraction in plant biology called SeeDev. The main issue in IE is the design of a reference corpus with curated annotations. MaIAGE and LIMSI both have strong experience and dedicated tools for organizing challenges. The goal of the SeeDev project is the extraction, from scientific papers, of complex events that are involved in the regulation networks of seed development for the model plant Arabidopsis thaliana. Seed development is an important issue in research, agriculture, and industry. It involves complex mechanisms at the molecular, tissue, physiological, phenotype, and environment levels.
The HiggsML challenge was a machine learning (ML) challenge to optimize the discovery potential for the Higgs boson. In high-energy physics (HEP), especially at the Large Hadron Collider (LHC), a complex software pipeline reduces petabytes of data to final measurements. Machine learning (neural nets, boosting; or multivariate analysis, as it is called within HEP) has been used within this pipeline since the nineties. However, it was realized that the tools used within the HEP community were obsolete and that machine learning was not exploited to its full potential.
The HiggsML challenge was organized by a collaboration of three physicists of the ATLAS experiment on the LHC at CERN and three machine learning specialists (five of them from Saclay). The challenge was funded largely by CDS, and to a lesser extent by Google and INRIA. It was run on Kaggle, the best-known data challenge platform. For the first time, simulated Higgs events (both signal and background) were released by the ATLAS collaboration. The challenge participants were asked to submit a classifier to maximize the significance of the Higgs boson search in the difficult tau-tau (τ+τ−) channel. The challenge ran from May to September 2014. It was a remarkable success, with more than 2000 participants in 1785 teams, making it the largest challenge on Kaggle at that time. The winner beat the significance of the HEP in-house tool (called TMVA) by 20%. We also awarded a special “HEP meets ML” prize to the author of the XGBoost library, which has since become a de facto industry standard in ML.
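The notion of "significance" that participants were asked to maximize can be illustrated with a short sketch. It assumes the approximate median significance (AMS) form, with a regularization term b_reg = 10 added to the background count; the function and variable names are illustrative, not taken from the challenge kit.

```python
import math


def ams(s: float, b: float, b_reg: float = 10.0) -> float:
    """Approximate median significance for weighted signal s and background b.

    Assumed form: sqrt(2 * ((s + b + b_reg) * ln(1 + s / (b + b_reg)) - s)).
    b_reg is a regularization term that stabilizes the score for small b.
    """
    if s < 0 or b < 0:
        raise ValueError("s and b must be non-negative")
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))


def selected_counts(labels, weights, selected):
    """Sum event weights of selected true-signal (label 1) and background (label 0) events."""
    s = sum(w for y, w, keep in zip(labels, weights, selected) if keep and y == 1)
    b = sum(w for y, w, keep in zip(labels, weights, selected) if keep and y == 0)
    return s, b
```

A classifier is then scored by the AMS of the events it selects as signal: `ams(*selected_counts(labels, weights, selected))`. For a small signal on a large background, the value approaches the familiar s / sqrt(b).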
The most important outcome of the challenge was the dynamics it generated, both in ML and in HEP, both locally and internationally.
The purpose of the AutoML Challenge series is to promote research on reducing the need for human intervention in applying machine learning (ML) to practical problems. This refers to all aspects of automating the ML process beyond model selection, hyper-parameter optimization, and model search. Automation is desired for data loading and formatting, detection and handling of skewed data and missing values, selection of learning representation and feature extraction, matching algorithms to problems, acquisition of new data (active learning), creation of appropriately sized and stratified training, validation, and test sets, selection of algorithms that satisfy resource constraints at training and run time, the ability to generate and reuse workflows, meta-learning and learning transfer, and explicative reports. Such automation is crucial for both robots and lifelong autonomous ML.
The first AutoML challenge (2015-2016) focused on mainstream ML problems, which make up the bulk of today’s industrial applications: classification and regression problems. The applications are taken from biology and medicine, ecology, energy and sustainability management, image, text, audio, speech, video and other sensor data processing, internet social media management and advertising, market analysis and financial prediction. The data are presented as input-output pairs that are independently and identically distributed and are limited to fixed-length vectorial representations (no time series prediction). Text, speech, and video processing tasks included in the challenge are not presented in their native data representations; the datasets have been preprocessed into suitable fixed-length vectorial representations.
The difficulty of the challenge lies in the data complexity (class imbalance, sparsity, missing values, categorical variables) and in the fact that the data come from a wide variety of domains and are therefore distributed very differently. Although there exist ML toolkits that can tackle all these problems, it still requires considerable human effort to find, for a given dataset, task, evaluation metric, and available computational time, the methods and hyper-parameter settings that maximize performance. The participants' challenge is to create the perfect black box that removes the need for human interaction.
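As a rough illustration of searching for hyper-parameter settings under a limited computational budget, the sketch below runs a plain random search until a wall-clock deadline. Random search is only a simple baseline chosen here for illustration, not the method used by any participant; `sample_config` and `toy_score` are hypothetical stand-ins for a real configuration space and a real validation score.

```python
import random
import time


def random_search(evaluate, sample_config, time_budget_s, seed=0, max_trials=None):
    """Return (best_config, best_score, trials) found before the budget runs out."""
    rng = random.Random(seed)
    deadline = time.monotonic() + time_budget_s
    best_config, best_score, trials = None, float("-inf"), 0
    while time.monotonic() < deadline and (max_trials is None or trials < max_trials):
        config = sample_config(rng)   # draw one candidate setting
        score = evaluate(config)      # higher is better
        trials += 1
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, trials


def sample_config(rng):
    """Hypothetical search space: a log-uniform learning rate and an integer depth."""
    return {"learning_rate": 10 ** rng.uniform(-4, 0), "depth": rng.randint(1, 10)}


def toy_score(config):
    """Hypothetical stand-in for a validation score; it peaks at rate 0.01, depth 5."""
    lr, depth = config["learning_rate"], config["depth"]
    return -((lr - 0.01) ** 2) - 0.01 * (depth - 5) ** 2
```

A call such as `random_search(toy_score, sample_config, time_budget_s=1.0)` returns the best setting seen within one second. An AutoML system would replace `toy_score` with model training and validation, and would typically use a smarter search strategy.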
After four phases of the AutoML challenge and three workshops co-organized by CDS members (ICML 2014, CiML 2015, ICML 2015), the state of the art has greatly improved. The current leaders are a group from the University of Freiburg, Germany, who developed Auto-sklearn, a wrapper around the scikit-learn Python library, in the course of the challenge. Novel methodologies in meta-learning and transfer learning have been developed. However, many participants have stumbled on handling sparse data efficiently in recent phases. This is stimulating new research on making some of the favorite algorithms, such as Random Forests, efficient for sparse matrices.
A new edition of the AutoML challenge is in preparation, using new extensions of the CodaLab platform.
NEXT DATA CHALLENGES
The instantaneous luminosity of the Large Hadron Collider is expected to increase in a few years' time, so that the number of charged particles per proton bunch collision is expected to increase by a factor of 10. In addition, the experiments plan a 10-fold increase of the readout rate. This will be a challenge for the ATLAS and CMS experiments, in particular for the tracking, which will be performed with a new all-silicon tracker in both experiments. Preliminary studies have shown that the CPU time needed to reconstruct an event increases many-fold, due to the combinatorial explosion at the pattern recognition stage, while the resource budget will be flat at best.
The TrackML challenge is being set up to engage computer scientists to tackle the problem with algorithms that are not standard in HEP, such as Convolutional Neural Networks, Deep Neural Networks, or Monte Carlo Tree Search. A large data set of the order of one million events, ten billion tracks, and one terabyte will be created, so that there will be no lack of training data. The participants will compete to invent the fastest algorithm for associating 3D points originating from the same charged particle while maintaining the highest efficiency. They will do so by logging in to a powerful platform, training on the data set, and submitting their solution with an evaluation of its speed. The emphasis is on exposing innovative approaches rather than super-optimizing classical ones. If successful, it would improve the LHC physics reach.
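To make the task of associating 3D points with particles concrete, here is a deliberately simplified sketch. It assumes noise-free hits lying on straight lines through the origin, so hits from the same particle share a direction; real tracks bend in the detector's magnetic field, and the greedy grouping below is a toy illustration, not a candidate solution. All names are hypothetical.

```python
import math


def unit(v):
    """Normalize a non-zero 3D point to a unit direction vector."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def group_hits_by_direction(hits, max_angle_rad=0.01):
    """Greedily group hit indices whose directions from the origin nearly coincide.

    Hits are visited from the innermost outwards; each hit joins the first
    track whose seed direction lies within max_angle_rad, else starts a new track.
    """
    tracks = []  # list of (seed_direction, [hit indices])
    order = sorted(range(len(hits)), key=lambda i: sum(c * c for c in hits[i]))
    for i in order:
        direction = unit(hits[i])
        for seed, members in tracks:
            cos_angle = sum(a * b for a, b in zip(seed, direction))
            cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
            if math.acos(cos_angle) <= max_angle_rad:
                members.append(i)
                break
        else:
            tracks.append((direction, [i]))
    return [members for _, members in tracks]
```

The inner loop compares every hit with every existing track. At the scale described above, this kind of combinatorial growth is what makes speed, and not only efficiency, part of the competition.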
The See.4C challenge is a “warm up challenge” whose objective is to test the protocol and platform of a large-scale upcoming EU challenge to predict electricity flows in the French power network.
We propose a different task, but one of real practical interest: predicting upcoming frames in video data. Applications may include replacing missing frames in a teleconference when data transmission is defective. For this benchmark, we are making available a dataset of thousands of videos of speakers facing a camera, sampled at 25 frames per second. We limit the resolution to small 32×32-pixel frames in black and white so that results can be obtained in a short time, in the context of a hackathon.
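A natural reference point for frame prediction is a persistence baseline, which simply repeats the last observed frame. The sketch below assumes frames are stored as nested lists of pixel intensities and scores predictions with mean squared error; both the data layout and the metric are assumptions made for illustration, not the challenge's actual protocol.

```python
def persistence_forecast(frames, horizon):
    """Predict the next `horizon` frames by repeating the last observed frame."""
    if not frames:
        raise ValueError("at least one observed frame is required")
    last = frames[-1]
    # Copy each row so that predictions do not alias the input frame.
    return [[row[:] for row in last] for _ in range(horizon)]


def mean_squared_error(predicted, actual):
    """Average squared pixel difference over all predicted frames."""
    total, count = 0.0, 0
    for p_frame, a_frame in zip(predicted, actual):
        for p_row, a_row in zip(p_frame, a_frame):
            for p, a in zip(p_row, a_row):
                total += (p - a) ** 2
                count += 1
    return total / count
```

At 25 frames per second, consecutive frames of a speaker facing a camera differ little, so this baseline is likely hard to beat over very short horizons. A learned predictor becomes interesting as the horizon grows.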
One-day hackathon:
The launching event will be held at La Paillasse, Paris. This event is co-sponsored by the Paris Machine Learning Meetups. The challenge will remain open until April 2, 2017.
Florin Popescu (Fraunhofer Institute, Berlin, Germany)
Sergio Escalera, Xavier Baro, and Julio Jacques Jr. (University of Barcelona, Spain)
Cecile Capponi, Stephane Ayache, and Isabelle Guyon (Aix Marseille University)
Committee and local arrangements:
Isabelle Guyon, Lisheng Sun, and Diviyan Kalainathan (UPsud Paris-Saclay and ChaLearn)
Igor Carron and Frank Bardol (Paris Machine-Learning Meetups)
Sebastien Treguer (La Paillasse and ChaLearn)
Balazs Kegl (CNRS, Paris-Saclay Center for Data Science)