Scikit-learn, the popular Python-based machine learning (ML) library, has released version 1.0. Although the library has been stable for some time, and the release contains no breaking changes, the project maintainers opted for a major version revision to signal to users that the software is mature and production-ready.
The project team announced the release on Twitter. Comprising 2,100 merged pull requests since the previous 0.24 release, version 1.0 contains several new features, including spline transformers, quantile regression, online one-class support vector machines (SVM), and an improved plotting API. There are also many documentation improvements, representing nearly 800 of the merged pull requests. Although there are no breaking changes, apart from those in the project’s normal two-release deprecation cycle, the team decided to increment the library’s major version number from 0 to 1 in recognition of the code’s long-term stability and maturity. According to Adrin Jalali, a core developer on the project:
The library has been stable for a while, and we’d like to signal that by the versioning of the release….[It] includes some features which we’ve wanted to have for years, so it felt right to finally do it!
Scikit-learn, billed as “easy-to-use and general-purpose machine learning in Python,” is used by over 80% of data scientists, according to Kaggle’s 2020 survey. The library contains implementations of many common ML algorithms and models, including the widely used linear regression, decision tree, and gradient-boosting algorithms. Begun in 2007 as a Google Summer of Code project, it was originally conceived as an ML “toolkit” for the Python-based scientific computing library SciPy. Scikit-learn’s first public beta release was in early 2010, and in 2020 the library was accepted as a Sponsored Project by NumFOCUS, the non-profit foundation that funds SciPy and many other open-source scientific computing packages.
Several new features were included in the release. One important change is that most constructor and function parameters must now be passed as keyword arguments rather than positionally. Existing histogram-based gradient boosting models have moved from experimental to stable status, and there are also new models. First, the SGDOneClassSVM model is a linear version of the One-Class SVM that is fit using stochastic gradient descent (SGD). It can approximate the solution of a kernelized One-Class SVM with a fit time that is “several orders of magnitude faster.” Second, quantile regression models can estimate the median or other quantiles of a function; the model is fit by minimizing the pinball loss.
In a discussion about the release on Hacker News, some users noted that scikit-learn is still not a good choice for deep learning models:
– No saving checkpoints (can be crucial for large models who need a lot of compute and time)
– No way to assign different activation functions to different layers
– No complex nodes like LSTM, GRU
– No way to implement complex architectures like transformers, encoders etc
Other users also pointed out that scikit-learn does not support GPU hardware. However, most users praised the library for having good documentation and being easy to use:
scikit-learn (next to NumPy) is the one library I use in every single project at work. Every time I consider switching away from Python I am faced with the fact that I’d lose access to this workhorse of a library. Of course it’s not all sunshine and rainbows – I had my fair share of rummaging through its internals – but its API design is a de-facto standard for a reason.
The scikit-learn code is available on GitHub.