There’s a lot more to machine learning than implementing an ML algorithm — that is just the tip of the iceberg. Machine learning systems are part of a much larger ecosystem, and creating a well-performing model is only a small part of a real-world machine learning solution.
Image Source: ML Crash Course by Google
Let’s say you are on the verge of signing the first customer for your startup. Your startup has an amazing team of ML engineers, data analysts, and data scientists. They have been successful in creating state-of-the-art models with excellent results and metrics.
The real problem that arises here is deployment at the production level. The 2020 State of Enterprise Report, based on a survey of nearly 750 domain experts and practitioners in Machine Learning, drew the following conclusions:
- More than two-thirds of the subgroups interviewed about their budget reported increased spending on AI.
- 43% of respondents cited difficulty in scaling their ML projects, up from 30% in the previous year.
- Half of the respondents take between a week and 3 months to deploy their systems, while 18 percent require more than 3 months.
Machine Learning is evolving swiftly, expanding into new sectors and industries, yet building projects at scale remains difficult. This marks a huge gap between models generated in scripts and notebooks and their deployment in a production system at scale.
MLOps is to machine learning what DevOps is to traditional software, and it comes with its own set of challenges that need to be addressed.
As highlighted by Arnab Bose and Aditya Aggarwal in their blog, one such challenge is the role of data. Traditional software engineering and machine learning involve two different software paradigms: software developers write well-defined logic for their programs, whereas data scientists follow a parameterized problem-solving approach, in which the program's behavior is determined by parameters learned from data. Because those parameters change whenever the data changes, the behavior of the entire system changes with them. This irregular variation of data makes an ML system far harder to track than well-defined software.
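The contrast between the two paradigms can be made concrete with a toy example (the spam-filter scenario and all names here are hypothetical, chosen only for illustration):

```python
# Traditional software: behavior is fixed by explicit, hand-written logic.
def is_spam_rule_based(subject: str) -> bool:
    # The decision rule is fully specified in code and never changes
    # unless a developer edits it.
    return "free money" in subject.lower()

# Machine learning: behavior is fixed by parameters learned from data.
# Retrain on different data and the *same code* makes different decisions.
def train_spam_threshold(lengths: list, labels: list) -> float:
    # Toy "model": learn a subject-length threshold from labelled examples
    # (midpoint between the mean spam length and the mean ham length).
    spam = [n for n, y in zip(lengths, labels) if y]
    ham = [n for n, y in zip(lengths, labels) if not y]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2

def is_spam_learned(subject: str, threshold: float) -> bool:
    # The logic is trivial; the learned threshold carries the behavior.
    return len(subject) > threshold
```

In the first function a code review tells you everything about the program's behavior; in the second, you also need to know which data produced the threshold — which is exactly why data must be tracked as rigorously as code.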
List of challenges that make it difficult to deploy ML models to production:
- Data Management
  - Huge Datasets
  - Dataset Tracking
  - Data Privacy
- Trial and Error and Iterative Development
  - Tracking changes: hyper-parameter tuning, code changes, architecture changes
  - Code Quality: production-ready code, code optimizations
  - Model Evaluation
  - Training, Inference, and Retraining
- Production Deployment
  - Cloud / On-premise – batch and real-time predictions
  - Infrastructure Requirements
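To make the "tracking changes" challenge tangible, here is a minimal sketch of experiment logging — recording hyper-parameters, metrics, and a hash of the training code so a run can be compared and reproduced later (the function and file layout are hypothetical; dedicated tools like MLflow or DVC do this at scale):

```python
import hashlib
import json
import time
from pathlib import Path

def log_experiment(run_dir: str, params: dict, metrics: dict,
                   code_file: str) -> Path:
    """Write one JSON record per training run, keyed by a hash of the
    training code, so hyper-parameters and code changes stay traceable."""
    code_hash = hashlib.sha256(Path(code_file).read_bytes()).hexdigest()[:12]
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,       # e.g. learning rate, batch size
        "metrics": metrics,     # e.g. validation accuracy, loss
        "code_hash": code_hash, # ties the run to an exact code version
    }
    out = Path(run_dir) / f"run_{code_hash}_{int(time.time())}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(record, indent=2))
    return out
```

Even this crude scheme answers the question every iterative project eventually faces: "which code and which hyper-parameters produced that number?"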
Shout out to Andrej Karpathy for his wonderful blog post emphasizing Software 2.0 and the ongoing transition to the 2.0 stack.
Broadly, an end-to-end MLOps workflow covers:
- Data Engineering and Management
- Training / Modeling (Machine Learning Pipeline)
- Continuous Deployment
First, one needs to define a business problem and translate it into objectives that can be addressed through machine learning solutions.
Second, there should be a focus on collaboration between data engineers and data scientists to explore, create and manage dataset(s) for modeling.
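One concrete practice that supports this collaboration is versioning datasets the same way code is versioned. The sketch below fingerprints a dataset directory with a content hash (the function is illustrative; tools like DVC provide this as a first-class feature):

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(data_dir: str) -> str:
    """Hash every file in a dataset directory (in sorted order, so the
    result is deterministic) into a short dataset version identifier."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())   # include the file name
            digest.update(path.read_bytes())    # include the contents
    return digest.hexdigest()[:16]
```

Stamping each trained model with the fingerprint of the data it saw lets engineers and scientists agree on exactly which dataset a result refers to.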
Third, design a pipeline comprising operations such as Model Training, Model Evaluation, Model Testing, and Model Packaging, integrated with CI/CD for experimentation, tracking, validation, and testing.
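The train–evaluate–package sequence can be sketched with a deliberately trivial "model" (a constant mean predictor — purely illustrative) to show the shape such a pipeline takes, including the quality gate that stops a bad model from being packaged:

```python
import pickle
import statistics
from pathlib import Path

def train(values: list) -> dict:
    # Toy "model": predict the mean of the training targets.
    return {"mean": statistics.fmean(values)}

def evaluate(model: dict, test_values: list) -> float:
    # Mean absolute error of the constant prediction on held-out data.
    return statistics.fmean(abs(v - model["mean"]) for v in test_values)

def package(model: dict, path: str) -> None:
    # Serialize the validated model as the deployable artifact.
    Path(path).write_bytes(pickle.dumps(model))

def run_pipeline(train_values, test_values, artifact_path, max_mae):
    model = train(train_values)
    mae = evaluate(model, test_values)
    if mae > max_mae:
        # Quality gate: fail the pipeline rather than ship a bad model.
        raise ValueError(f"MAE {mae:.3f} exceeds threshold {max_mae}")
    package(model, artifact_path)
    return mae
```

In a CI/CD setup, each stage becomes a separately tracked, testable step, and the gate is what makes the pipeline trustworthy enough to automate.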
Fourth, deploy seamlessly into a production environment – cloud, on-premise, or hybrid.
Finally, monitor and manage both the model and the computing resources (infrastructure). Key Performance Indicators (KPIs) help track the changes.
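A minimal sketch of KPI-based monitoring, assuming a simple rolling-window scheme (the class and thresholds are hypothetical — production systems typically use a dedicated monitoring stack):

```python
from collections import deque

class KPIMonitor:
    """Track a rolling window of a production KPI (e.g. accuracy or
    latency) and flag when its rolling mean degrades past a threshold."""

    def __init__(self, window: int, threshold: float):
        self.values = deque(maxlen=window)  # keeps only the last `window` points
        self.threshold = threshold

    def record(self, value: float) -> bool:
        """Record a new observation; return True if the rolling mean
        has fallen below the alert threshold."""
        self.values.append(value)
        mean = sum(self.values) / len(self.values)
        return mean < self.threshold
```

Wiring such a check to an alert (or to automatic retraining) is what closes the loop between deployment and the rest of the pipeline.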
Thus, the aim is to provide an end-to-end machine learning pipeline for designing, building, and managing reproducible ML software alongside test-driven development.