Machine Learning (ML) has been in the spotlight since its resurgence with the advent of practical deep learning in 2014. The field is still immature compared to software engineering or industrial engineering, for instance, yet it is being applied everywhere. Many advocates and practitioners view ML as a dark art, or alchemy: a point of view that insinuates it cannot be tamed and should not be approached with normal software engineering (SWE) process and rigor.
For most of us in the field, this debate barely matters. As long as we can predict churn from usage characteristics, or build a model that identifies the most popular fashion styles, the org counts it as a success. It doesn’t really matter whether it takes one week or two to get there, whether we took a top-down or bottom-up approach, nailed it on the first try, went through three iterations, or just kept massaging and tweaking data and parameters until it worked.
The value delivered in those examples is insight, often as a presentation, PDF, or visualization, and there aren’t many interdependencies with the core product. At TWOSENSE.AI, however, our product lives and dies with the performance of the machine learning behind it, and with our ability to execute at top speed in a competitive market. Product and ML engineering must happen in parallel and are heavily interdependent, so if ML doesn’t execute on time and within tolerances, we can’t ship product. Not being able to wrangle ML into agile and test-driven development would cost us customers and cash, eat away at our competitive edge, and potentially put us out of business. For us, it’s a matter of life and death.
Ask anyone on the SWE side of our business (or any tech business for that matter) why they think agile SWE is important and they’ll answer with the following points or something similar:
Improve scalability: more people on the same problem means you can solve it faster.
Reduce ramp-up time for new hires by providing a solid framework.
Make projects (more) plannable and predictable, as well as making speed trackable.
Improve code quality and reduce bugs.
Increase productivity and get more done in the same amount of time.
Ship quickly and iteratively for fast feedback, so you arrive at something that actually solves a problem.
Be able to react quickly to changing requirements.
It’s just more fun!
Etc.
It might take a senior software engineer a while to answer, because the methodology is so established that many haven’t needed to think about the “why” in quite some time. Many engineers came up in a world of agile and can’t remember a time before Kanban boards, scrums, and sprints.
Now ask most senior machine learning engineers and data scientists whether those benefits would be advantageous, and they’ll emphatically answer ‘YES!’ But ask them whether an ambitious ML project can be broken down into bite-sized tickets and tackled in quick, focused iterations of step-wise improvement, and two-thirds will say no. And they’re not entirely wrong. Most of the arguments against agile ML circle around the fact that with every project, you don’t know what you don’t know. They’ll also say that the interplay between components is so tight and complex (e.g. model selection depends on feature engineering) that trying to break the work apart into smaller tasks is doomed to fail. Only a holistic approach will succeed, many will say, and indeed most of the great successes we know from academia, and previous ML products from large enterprises, were built holistically, without agile methodologies.
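To make that coupling concrete, here’s a minimal, hypothetical sketch (toy data and scikit-learn, not TWOSENSE.AI code) of the interplay skeptics point to: which model “wins” flips depending on how the features are prepared, which is why they argue the two decisions can’t be ticketed independently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data whose raw features live on wildly different scales.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X *= np.logspace(0, 4, X.shape[1])

candidates = {
    "svm_raw": SVC(),                                  # scale-sensitive model, raw features
    "svm_scaled": make_pipeline(StandardScaler(), SVC()),  # same model, engineered features
    "forest_raw": RandomForestClassifier(random_state=0),  # largely scale-invariant
}

for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")

# Typically the RBF SVM looks like the wrong model on raw features and a
# competitive one on scaled features, while the forest barely moves: you
# can't close the model-selection ticket before the feature-engineering one.
```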
Conjecture: ML engineering is a nascent domain, and the pushback against software engineering principles stems from that fact. I liken it to PHP development in the late ’90s. Developers were building massive monolithic sites in the run-up to the dot-com bubble, rampant with spaghetti code and waterfall methods. If you had proposed modern agile methods, they’d have said similar things: “How can you expect me to know in advance what the front end will look like when it’s so dependent on the back end?” These days we have standards, design patterns, and collective experience in what works and what doesn’t, which make the process much easier as domain expertise takes shape across the industry. Machine learning is in the same position: a nascent engineering domain where the standards and design patterns haven’t been established yet, and where what we know changes with every passing day. That, I believe, is why most practitioners think you can’t break machine learning down into component parts and tasks.
At TWOSENSE.AI, we’ve embarked on the journey of applying and adapting agile software engineering processes to data science, to create an internal standard for ML engineering, which we define as the combination of ML/DS and software engineering. The first step is being honest about what we’re up against (that’s this post right here). We do believe there are real differences between ML and general SWE, so finding what doesn’t work or can’t be adapted is as important as finding what does.
The good news is that we’re not alone, and there are some good resources out there. We’re sharing this in the hope that it will be valuable to others, and that maybe those who have gone through this before can help us out. If you have thoughts or comments, please reach out and share; we’d love to chat.