CDS staff actively contribute to open-source data science software. In addition, the CDS funded no fewer than 13 “doctoral missions” to develop specific features that benefit the communities around these projects.
Several features have been implemented in scikit-learn and scikit-learn-contrib:
- Isolation forest for anomaly detection
- Multivariate adaptive regression splines used in regression tasks
- Categorical encoder used in pre-processing
- Memory caching in scikit-learn pipelines
- Quantile transformer used in pre-processing
- Transformed target regressor to ease target manipulation in regression tasks
- Imbalanced-learn to deal with classification of imbalanced data sets
- Column transformer to combine heterogeneous pre-processing steps
- (In progress) Optimization of the tree architecture
- (In progress) Maintenance and improvements: @TomDLT, @glemaitre, @jorisvandenbossche
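Several of the features listed above can be combined in a single model. The following is a minimal sketch rather than a reference implementation: the toy data set, the column names (`city`, `surface`) and the choice of `Ridge` are assumptions made for illustration, and the categorical encoder is assumed to correspond to what current scikit-learn releases expose as `OneHotEncoder`.

```python
import tempfile

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer

# Hypothetical toy data: one categorical and one skewed numerical column.
rng = np.random.RandomState(0)
n_samples = 200
X = pd.DataFrame({
    "city": rng.choice(["Paris", "Orsay", "Saclay"], size=n_samples),
    "surface": rng.lognormal(mean=4.0, sigma=0.5, size=n_samples),
})
# Strictly positive target, so that a log transform is well defined.
y = np.exp(0.01 * X["surface"].to_numpy() + rng.normal(scale=0.1, size=n_samples))

# Column transformer: a different pre-processing step per column type.
preprocessor = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("numerical",
     QuantileTransformer(n_quantiles=100, output_distribution="normal"),
     ["surface"]),
])

with tempfile.TemporaryDirectory() as cache_dir:
    # memory=... caches the fitted transformers of the pipeline on disk.
    pipeline = Pipeline(
        [("preprocess", preprocessor), ("ridge", Ridge())],
        memory=cache_dir,
    )
    # The regressor is trained on log(y); predictions are mapped back with exp.
    model = TransformedTargetRegressor(
        regressor=pipeline, func=np.log, inverse_func=np.exp
    )
    model.fit(X, y)
    predictions = model.predict(X)
```

Caching mostly pays off when the same pre-processing is refitted many times, for instance during a grid search over the parameters of the final estimator.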
dask-ml is a library for distributed and parallel machine learning using dask. The following algorithms have been implemented:
Joblib is a set of tools to provide lightweight pipelining in Python. The following contributions have been made:
- Development of loky backend used in joblib.Parallel
- Gallery of examples
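As an illustration of the loky backend, the sketch below runs a small computation in two worker processes. The workload (square roots of squares) and the number of jobs are arbitrary choices for the example.

```python
from math import sqrt

from joblib import Parallel, delayed

# Dispatch ten independent tasks to two worker processes managed by loky.
# Results come back in the order in which the tasks were submitted.
results = Parallel(n_jobs=2, backend="loky")(
    delayed(sqrt)(i ** 2) for i in range(10)
)
```

Naming the backend explicitly is only useful for clarity here; a different backend such as `"threading"` can be selected the same way.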
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. @jorisvandenbossche is actively maintaining and improving this project.
MNE-Python is the Python open source toolbox for processing and visualizing MEG and EEG data. The following contributions have been made:
scikit-image is a collection of algorithms for image processing. The following contributions have been made:
- Implementation of the Haar-like features
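The Haar-like features can be tried on a small image, as in the sketch below. The 5x5 uniform image and the `type-2-x` feature type are assumptions chosen to keep the example short; on a uniform image, the two adjacent rectangles of each `type-2-x` feature have equal sums, so every feature value is zero.

```python
import numpy as np
from skimage.feature import haar_like_feature
from skimage.transform import integral_image

# Haar-like features are computed from the integral image of the input.
image = np.ones((5, 5), dtype=np.uint8)
image_ii = integral_image(image)

# Extract all 'type-2-x' features inside the 5x5 window whose top-left
# corner is at row 0, column 0.
features = haar_like_feature(image_ii, 0, 0, 5, 5, "type-2-x")
```

On real images the feature vector grows quickly with the window size, so it is common to compute only a selected subset of features.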
Sphinx-gallery is a Sphinx extension that builds an HTML gallery of examples from any set of Python scripts.
Specio is a Python library that provides an easy interface to read hyperspectral data. It is cross-platform, runs on Python 2.x and 3.x, and is easy to install.