The CDS contributes to open-source data science software by funding contributors to these projects through “doctoral missions”. In total, thirteen PhD students took part in doctoral missions, each developing new tools related to data science (machine learning, distributed computing, data visualisation, method documentation). Most of the projects had a noticeable impact on the users of these packages, and the new sphinx-gallery package has become a core package for making scientific Python projects more accessible to non-expert users. The projects are the following:
Anomaly detection in Scikit-Learn
Contributions to this project included the implementation of two state-of-the-art anomaly detection algorithms: Isolation Forest and Local Outlier Factor. The first has been merged into the scikit-learn development version, while the second still needs some work.
This contribution also included participation in scikit-learn maintenance and pull request review. The project was a success: it gives scikit-learn the tools to better address academic and industrial data challenges related to anomaly and novelty detection (predictive maintenance, device monitoring, etc.).
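As an illustration of the kind of use this enables, here is a minimal sketch of outlier detection with scikit-learn's `IsolationForest`. The synthetic data, the `contamination` value and the variable names are assumptions chosen for the example, not part of the project itself.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for this sketch): a tight cluster of
# 200 inliers around the origin plus five points placed far away.
rng = np.random.RandomState(0)
inliers = 0.5 * rng.randn(200, 2)
outliers = np.array(
    [[6.0, 6.0], [-6.0, 6.0], [6.0, -6.0], [-6.0, -6.0], [0.0, 8.0]]
)
X = np.vstack([inliers, outliers])

# contamination is the expected fraction of outliers; 0.05 is a guess
# that is deliberately a little larger than the true fraction here.
forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)

# fit_predict returns +1 for points considered inliers, -1 for outliers.
labels = forest.fit_predict(X)
outlier_labels = labels[-len(outliers):]

print("labels of the five distant points:", outlier_labels)
print("number of points flagged overall:", int((labels == -1).sum()))
```

Because `contamination` is set above the true outlier fraction, a few borderline inliers are flagged as well; in practice this threshold has to be tuned to the application.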
Contribution to Operalib
A new Python library, Operalib, was implemented. It is devoted to machine learning algorithms for operator-valued kernel regression. Operator-valued kernel regression provides a general framework for learning vectorial, functional and structured outputs, including multi-task regression and vector field learning. Operalib includes efficient implementations of the following algorithms: OKR-ridge regression, Naive Online OVKR, Quantile Regression.
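To make the idea of vector-valued outputs concrete, the following is a minimal NumPy sketch of ridge regression with a decomposable operator-valued kernel, K(x, x') = k(x, x') A, where k is a scalar Gaussian kernel and A is a positive semi-definite matrix coupling the outputs. It does not use Operalib's API: the function names, the choice of A and the regularisation convention are all assumptions made for this illustration.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma):
    """Scalar Gaussian kernel matrix between two sets of points."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def fit_ovk_ridge(X, Y, A, gamma, lam):
    """Solve (K kron A + lam * I) vec(C) = vec(Y) for the coefficients C.

    Hypothetical helper: some formulations scale lam by the sample size.
    """
    n, p = Y.shape
    K = rbf_kernel(X, X, gamma)
    # Block (i, j) of the operator-valued Gram matrix is k(x_i, x_j) * A,
    # which matches the row-major flattening of Y used below.
    G = np.kron(K, A)
    c = np.linalg.solve(G + lam * np.eye(n * p), Y.ravel())
    return c.reshape(n, p)

def predict_ovk_ridge(X_train, C, A, gamma, X_new):
    """Evaluate f(x) = sum_j k(x, x_j) A c_j at each row of X_new."""
    return rbf_kernel(X_new, X_train, gamma) @ C @ A.T

# Toy problem (assumed for the sketch): one input, two coupled outputs.
x = np.linspace(0.0, 1.0, 30)[:, None]
Y = np.hstack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])
A = np.array([[1.0, 0.5], [0.5, 1.0]])  # output coupling, positive definite

C = fit_ovk_ridge(x, Y, A, gamma=10.0, lam=1e-3)
Y_hat = predict_ovk_ridge(x, C, A, 10.0, x)
max_error = float(np.abs(Y_hat - Y).max())
print("largest training error:", max_error)
```

Forming the full Kronecker product costs memory of order (n p)^2, so this direct solve only suits small problems.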
Incorporating multivariate adaptive regression splines into Scikit-Learn
The purpose of the project was to incorporate MARS (Multivariate Adaptive Regression Splines) into scikit-learn by adapting and improving an existing implementation named py-earth. MARS is a non-parametric regression algorithm, and py-earth is an implementation of it in Python. During this doctoral mission, a number of contributions were made to the existing py-earth code. Part of the work was to clean the code, adapt it to the coding guidelines of scikit-learn (http://scikit-learn.org/stable/developers/), enhance the documentation and add more unit tests. New features were also added. First, py-earth lacked a way to deal with multiple outputs, so support for multiple outputs was added, with the possibility of weighting each output variable. Second, three different ways of estimating input variable importances were added, in order to assess the predictive power of each input variable. Finally, FastMARS, a way to speed up the original MARS algorithm, was implemented.
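For readers unfamiliar with MARS, the sketch below illustrates its basic building block: a pair of hinge functions, max(0, x - t) and max(0, t - x), whose knot t is chosen by least squares. This is a simplified single-variable, single-knot illustration written with NumPy only. It is not py-earth's API, and the helper names and toy data are assumptions.

```python
import numpy as np

def hinge_pair(x, knot):
    """Return the two mirrored hinge functions used by MARS at a given knot."""
    return np.maximum(0.0, x - knot), np.maximum(0.0, knot - x)

def fit_single_knot(x, y, candidate_knots):
    """Try each candidate knot and keep the one with the smallest residual.

    Hypothetical helper: real MARS repeats this search greedily over all
    variables and existing basis functions, then prunes the model.
    """
    best = None
    for knot in candidate_knots:
        right, left = hinge_pair(x, knot)
        B = np.column_stack([np.ones_like(x), right, left])
        coef, *_ = np.linalg.lstsq(B, y, rcond=None)
        rss = float(((B @ coef - y) ** 2).sum())
        if best is None or rss < best[0]:
            best = (rss, knot, coef)
    return best

# Noise-free piecewise-linear target with a kink at x = 0.5 (toy data).
x = np.linspace(0.0, 1.0, 101)
true_knot = x[50]
y = (1.0
     + 2.0 * np.maximum(0.0, x - true_knot)
     - 3.0 * np.maximum(0.0, true_knot - x))

rss, knot, coef = fit_single_knot(x, y, x[5:-5])
print("selected knot:", knot)
print("coefficients (intercept, right hinge, left hinge):", coef)
```

The exhaustive search over candidate knots is the expensive part of the algorithm, which is the cost that speed-ups such as FastMARS aim to reduce.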
Dipole fitting in MNE-Python
The aim of this project was to ease the use of a standard MEG/EEG technology called dipole fitting through a reimplementation in MNE-Python, which benefits from a large user base. The project was completed and released in September 2015. It now offers an open implementation of a technology that is routinely used in clinical practice.
This project contributed to the visibility of the MNE project and helped it obtain funding from the Concours Mondial de l’Innovation (CMI) 2015, in collaboration with the Bioserenity and Dataiku companies.
File IO for electrophysiology data
The MEG and EEG community is fragmented across a number of hardware vendors, each of which implements its own file format for the acquired data. The ambition of this project is to extend the list of file formats supported by the open-source project MNE. The list of EEG formats now supported is provided on the MNE IO manual page. The addition of the EEGLAB file format contributed to the adoption of the MNE project in the EEG community.
Tom Dupré la Tour is a core developer of the scikit-learn project. During this mission, he maintained the linear models for large-scale classification and improved the performance of a number of estimators (Logistic Regression, NMF).
Parallel computing in joblib
The goal of the project is to improve the Python multiprocessing backend of joblib, which is used extensively by scikit-learn. The technical challenge is that, to avoid locks, the parallel-computing strategy of the multiprocessing module is to spawn multiple processes. Error management and nested parallelism are difficult in such a setting. The project is still ramping up, but we have already identified and fixed many failure modes of the Python multiprocessing module that occur when a computation crashes in a worker. Fixes will first be integrated in joblib, and later contributed upstream to the Python standard library.
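For context, the sketch below shows the joblib pattern that this work supports: dispatching independent tasks to worker processes with `Parallel` and `delayed`, and observing how an exception raised in a worker surfaces in the parent process. The task functions are hypothetical and chosen only for the example; the exact exception class seen by the parent has varied across joblib versions, so the sketch catches `Exception` broadly.

```python
from joblib import Parallel, delayed

def square(i):
    """A trivial, independent task (hypothetical, for illustration)."""
    return i * i

def fragile(i):
    """A task that fails for one particular input."""
    if i == 3:
        raise ValueError(f"task {i} failed in a worker")
    return i

# Run six independent tasks on two workers; results keep the input order.
squares = Parallel(n_jobs=2)(delayed(square)(i) for i in range(6))
print(squares)

# An error raised inside a worker is reported back to the parent process
# instead of silently killing the computation.
caught = None
try:
    Parallel(n_jobs=2)(delayed(fragile)(i) for i in range(6))
except Exception as exc:
    caught = exc
print("error seen by the parent:", caught)
```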
Sphinx-gallery
The goal of this doctoral mission was to create a software tool that integrates example files into online documentation: Sphinx-gallery.
Starting from simple Python files, the resulting tool runs them, captures the output (figures and text), generates HTML renderings that weave these together, creates IPython notebooks from them, and exposes all the examples in a gallery. Finally, links are added to the HTML code listings to relate each symbol to the corresponding documentation, leveraging the intersphinx mechanism.
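A project typically enables this workflow from its Sphinx `conf.py`. The fragment below is a minimal sketch of such a configuration: the directory paths and the package name `mypackage` are hypothetical, and a real project would adapt them to its own layout.

```python
# conf.py (Sphinx configuration fragment); paths and names are assumptions.

# Enable intersphinx and the gallery generator.
extensions = [
    "sphinx.ext.intersphinx",
    "sphinx_gallery.gen_gallery",
]

# External documentation that symbols in the examples can link to.
intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
    "numpy": ("https://numpy.org/doc/stable", None),
}

sphinx_gallery_conf = {
    # Where the example scripts live, relative to conf.py (hypothetical).
    "examples_dirs": "../examples",
    # Where the generated gallery pages are written (hypothetical).
    "gallery_dirs": "auto_examples",
    # Modules whose symbols get links in code listings; None means the
    # links point to this project's own locally built documentation.
    "reference_url": {"mypackage": None},
}
```

With this in place, each script in the examples directory is executed during the documentation build and rendered as a gallery entry.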