Reflections of an ML Engineer: From Hype to Bitter Lesson(s)

Muhtasham Oblokulov
8 min read · Sep 29, 2022

Retrospectives are an important process for any Machine Learning (ML) engineer. By looking back at past projects, we can learn from our mistakes and make better decisions in the future. In this blog post, I will share some of the lessons I have learned so far from a recent retrospective.

Danke schön

My ML journey has taken me through companies of different sizes, which has certainly shaped the skill set I have today. My coworkers were really helpful along the way and have influenced me in a number of ways, for which I am grateful to them. Hence I wanted to start this blog with some appreciation.

[Image: The journey and the company. Left: Original. Right: Reimagined with DALL·E 2]

Look at the data

Almost always, take your time before hastily feeding your data to GPUs to crunch and multiply your matrices. Exploratory Data Analysis (EDA) techniques alone are not enough: open the files and inspect them.
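To make this concrete, here is a minimal sketch of what "open the files" can look like, assuming a JSON Lines text dataset at a made-up path (data/train.jsonl is purely for illustration):

```python
import json
import random

# Hypothetical path; swap in whatever raw files your project actually uses.
PATH = "data/train.jsonl"

with open(PATH, encoding="utf-8") as f:
    lines = f.readlines()

# Print a handful of random raw records, not just summary statistics.
for line in random.sample(lines, k=5):
    record = json.loads(line)
    print(json.dumps(record, indent=2, ensure_ascii=False)[:500])
```

Five minutes of staring at raw records often reveals encoding issues, duplicated rows, or mislabeled samples that no aggregate plot will show you.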

Start with simple baselines

State of the art (SOTA)… I know it is hard to resist the urge to chase that shiny idea from the latest SOTA paper, only to find out that you need 80 A100 GPUs and a humongous amount of VRAM just to run it. One of the best pieces of advice I have picked up along the way is to

“Start without Machine Learning” -Eugene Yan

Sometimes a few simple heuristics embedded in the code solve the task at hand. But of course the hyper-hype of paper claims has done its job, and most of us have been fooled into thinking that SOTA is all we need.

Due to cherry-picked samples and shiny blog posts, people fall into the trap of SOTA and get fooled. And as the saying goes:

It is easier to fool people than to convince them they have been fooled.

Occam’s Razor states that the simplest explanation is usually the correct one. ML’s Occam’s Razor says: start with simple methods that have stood the test of time, the OGs (Original Gangsters) of this space. I can vividly recall my latest NLP training runs on a custom dataset, where a Naïve Bayes classifier performed on par with transformers but needed only one minute of CPU compute, whereas our transformer friends needed at least two hours of GPU compute to reach the same performance. And given that the life expectancy of SOTA models is shrinking every day, you had better start with the OGs.
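For reference, a baseline like that takes only a few lines with scikit-learn. This is a minimal sketch on placeholder data, not my actual dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder data: replace with your own texts and labels.
train_texts, train_labels = ["good service", "terrible bug"], [1, 0]
test_texts, test_labels = ["really good", "awful crash"], [1, 0]

# TF-IDF features + Naive Bayes: minutes of CPU, a surprisingly strong baseline.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
baseline.fit(train_texts, train_labels)

preds = baseline.predict(test_texts)
print("F1:", f1_score(test_labels, preds))
```

If a transformer cannot clearly beat a baseline like this on your own held-out data, the extra GPU hours are hard to justify.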

Also, I can’t count the number of times I tried a new paper that was supposedly “better”, only to find that not even the datasets needed to validate the paper’s claims were available.

This has led me to be critical of anything in the ML space unless I’ve tried it out myself, which brings us to the next point.

Reproducibility crisis

There is a growing reproducibility crisis in the field of ML. The prediction paradigm has been adopted by dozens of scientific fields, yet at least 20 reviews across 17 fields have found widespread errors in ML-based science. There are many reasons for caution:

  • Performance evaluation is notoriously tricky in machine learning.
  • ML code tends to be complex and as yet lacks “standardisation”.
  • Subtle pitfalls arise from the differences between explanatory and predictive modelling.
  • The hype and overoptimism about commercial AI may spill over into ML-based scientific research.
  • Pressures and publication biases that have led to past reproducibility crises are also present in ML-based science.

Bitter lesson

I have been steering the discussion towards this point. The bitter lesson for a naive ML engineer is that applied ML is very different from research ML: the data is not the same, the data splits are not the same, the interests are not the same, the incentives are not the same…

[Image: different hierarchy of needs (source)]

Applied ML engineers have opposite needs to those of researchers. When you do applied ML, you need a framework that’s feature-complete, reasonably prescriptive, high-level, that guides you towards industry best practices. And of course you want it to be production-ready.

François Chollet

Beware of statistical biases in your model

Have a solid understanding of your model’s assumptions. I can’t emphasise this point enough: you need to beware of the statistical biases of your model, which are always hard to debug.

Statistical learners such as standard neural network architectures are prone to adopting shallow heuristics that succeed for the majority of training examples, instead of learning the underlying generalisations that they are intended to capture.

This problem of learners solving a task by learning the “wrong” thing has been known for a long time and is called the Clever Hans effect, named after the eponymous horse that appeared to be able to perform simple intellectual tasks but in reality relied on involuntary cues given by its owner.

Here is one example of the above-mentioned effect. The tweet, translated, reads:

Data point on the SOTA technology in cars: the driving assistant recognises road markings in the gap of the shadow and constantly pushes me to the right.

Another prominent example of this is the neural network trained by the military to recognise tanks in images, which actually learned to recognise different levels of brightness, because one type of tank appeared only in bright photos and the other type only in darker ones (source).
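One cheap sanity check for this kind of shortcut learning in a text classifier is to destroy the signal you think the model should rely on and see whether performance actually drops. A minimal sketch, assuming a generic model object with a predict() method (all names here are placeholders, not from any specific library):

```python
import random

def shuffle_words(text: str) -> str:
    # Destroy word order while keeping the bag of words intact.
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

def shortcut_check(model, texts, labels) -> None:
    # `model` is any object with a .predict(list_of_texts) method (placeholder).
    original = model.predict(texts)
    shuffled = model.predict([shuffle_words(t) for t in texts])
    acc = lambda preds: sum(p == y for p, y in zip(preds, labels)) / len(labels)
    print(f"accuracy on original: {acc(original):.3f}")
    print(f"accuracy on shuffled: {acc(shuffled):.3f}")
```

If accuracy barely moves when the input is scrambled, the model is probably leaning on shallow cues rather than the structure you care about.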

For a broader view on this topic, also see:

On the Paradox of Learning to Reason from Data

NLP’s Generalization Problem

NLP’s Clever Hans Moment has Arrived

Accuracy is not enough

Also, in the applied ML world, improving accuracy is often not your top priority; most of the time you are worried about latency and throughput.
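In practice, that means benchmarking latency percentiles and throughput alongside accuracy. A minimal sketch, assuming some predict_fn you actually serve (the names are placeholders):

```python
import statistics
import time

def measure_latency(predict_fn, batch, n_runs: int = 100) -> None:
    # predict_fn is a placeholder for whatever inference call you ship.
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(batch)
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds
    latencies.sort()
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    throughput = len(batch) * n_runs / (sum(latencies) / 1000)  # items per second
    print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  throughput={throughput:.1f} items/s")
```

A model that is one point more accurate but three times slower is often a net loss in production.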

Supervised Learning >>> Unsupervised learning

Unsupervised learning is overrated, especially in the context of anomaly detection. Damn, I learned this the hard way: I spent a significant amount of time building unsupervised models for anomaly and out-of-distribution detection, which were really hard to get working on real-world data, and even when they worked they were not statistically significantly better than other methods. Eventually I spent the time to label data, and it was well worth the effort.

Here is evidence to back up my point from a new NeurIPS 2022 work, ADBench, the most comprehensive anomaly detection benchmark to date.

Through 98,436 experiments, we find (1) none of the unsupervised detection methods are actually statistically better than the rest, and we thus need AutoML for detection tasks other than blindly picking models. Also, we find semi-supervised methods appear to be a promising direction due to their efficiency in using labels and resistance to data noise.
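To make the trade-off concrete, here is a minimal sketch on synthetic data (not my actual setup): an unsupervised IsolationForest versus a plain supervised classifier once labels exist:

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in data: mostly normal points plus a small anomalous cluster.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (950, 8)), rng.normal(4, 1, (50, 8))])
y = np.array([0] * 950 + [1] * 50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Unsupervised: no labels, anomaly scores from IsolationForest.
iso = IsolationForest(random_state=0).fit(X_tr)
unsup_auc = roc_auc_score(y_te, -iso.score_samples(X_te))

# Supervised: once you bite the bullet and label, even a vanilla classifier competes.
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
sup_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"unsupervised AUC: {unsup_auc:.3f}  supervised AUC: {sup_auc:.3f}")
```

On real data the gap is usually less flattering for the unsupervised side than on a toy example like this, which is exactly the point.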

Scaling is not as easy as it seems

Let’s assume you have found a transformer (and friends) that performs best on your task and custom dataset, but you are still not satisfied with some metric you care about, and perhaps you want to play around and scale up a bit.

Practice has shown, 110%, that it is not as easy as just stacking more layers: you need extensive engineering, rigour in choosing the optimizer, and a sensible set of starting hyperparameters to test the waters without drowning.

Here are some lessons from “Scaling laws vs Model Architectures” from Google AI:

  • Not all architectures scale the same way.
  • Vanilla Transformer does pretty well.
  • Touching the attention too much is “dangerous”.
  • Performance at base scale may not translate to large scale and beyond.

Meta recently published the logbook of how they trained OPT-175B, their 175-billion-parameter Open Pre-trained Transformer model, titled “Chronicles of OPT development”.

It is truly captivating and reads like a gripping adventure of explorers reaching the North Pole for the first time.

Below are some more miscellaneous retrospectives.

Document as much as possible

Always keep your experiments reproducible (lineage, data, code, baselines). One of my favourite tools for this is Weights & Biases.
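A minimal sketch of what that looks like with the wandb Python client; the project name and hyperparameters are made up for illustration:

```python
import wandb

# Hypothetical project and config values, purely for illustration.
run = wandb.init(
    project="my-experiments",
    config={"lr": 3e-4, "batch_size": 32, "model": "naive-bayes-baseline"},
)

for epoch in range(3):
    # Replace with your real training loop and metrics.
    train_loss = 1.0 / (epoch + 1)
    wandb.log({"epoch": epoch, "train_loss": train_loss})

run.finish()
```

Logging the config alongside the metrics is what makes a run reproducible months later, when you no longer remember which hyperparameters produced which curve.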

You will thank yourself later, but try not to over-optimise with different trackers.

Over-optimising

I realised I was spending more time trying to increase productivity and efficiency than actually doing the things I was trying to make more efficient.

The same goes for software packages: keep them to the bare essentials so you don’t mess up your software environment.

Debugging Tactics

While debugging (neural networks are notoriously hard to debug), I often found that taking time away from the screen helped: possible solutions to the problem I was facing would come to mind :)

Data Engineering is underrated

Even if you mostly do modelling, this book will open your eyes; it also has a high-quality audiobook.

Most people think that Machine Learning Engineers (MLEs) and Data Scientists (DSs) are simply handed .csv and .json files to run EDA on and train models right away. In reality, the work Data Engineers do, setting up data pipelines and constantly dealing with schema changes, is paramount and forms the essential backbone of most companies whose bread and butter is ML-related work.

Outro

Thanks for reading this far! I hope you enjoyed it. As a takeaway, enjoy this poem:

The road to wisdom?
— Well, it’s plain
and simple to express:
Err
and err
and err again
but less
and less
and less.

- Piet Hein

I will conclude with this advice, which one of my seniors recently gave me and which I believe applies to almost any ML engineer:

ML is drastically overhyped, learn good coding practices and diversify in technologies and skills in the beginning. It will open a lot of doors later. You can still do all this ML/CV/NLP stuff as a hobby or home research kind of thing.


Muhtasham Oblokulov

BERT Engineer at Munich🥨NLP | Matrix Multiplier at TU Munich | GitHub Archeologist | https://www.linkedin.com/in/muhtasham/