
Data Science Anti-Patterns You Should Know

by Samuel Flender, December 2022



Eliminate your recurring pain points by understanding the underlying patterns

Photo by Jeremy Bishop on Unsplash

Anti-patterns are common yet counter-productive responses to recurring problems. Because they’re ineffective, they perpetuate recurring pain points without ever resolving the underlying, systemic issues. Anti-patterns exist pretty much anywhere people come together to solve problems: in software development, in project management, and yes, in data science too.

Knowing about anti-patterns is the best way to avoid them, so in this post I’ll break down 5 common data science anti-patterns that I’ve observed over the years in my own career, namely:

  1. ad hoc overload,
  2. meeting paralysis,
  3. pin factories,
  4. excessive resume-driven development, and
  5. HARKing.

(Of course, this list is not meant to be complete, nor are all items on the list exclusive to the domain of data science.)

Ready? Let’s dive in.

1. Ad hoc overload

Every data scientist knows ad hoc requests. They’re queries from collaborators and stakeholders, such as…

  • “Hey, can you quickly pull these numbers again?”,
  • “Hey, this metric suddenly spiked, can you find out why?”,
  • “Hey, that metric suddenly dropped, can you find out why?”,
  • “Hey, someone tweeted this weird thing about our product, can you pull the data and see if this is true?”,

and so on and on. In the worst case, ad hoc requests may completely fill a data scientist’s day, resulting in ad hoc overload.

👉 What to do instead:

Never do it immediately. Unless it’s an emergency, don’t take care of the request right away; instead, wait until you have collected a few such requests and then handle them all in one batch, perhaps during a dedicated time slot each week (I used to reserve such a slot on Fridays). This ‘interrupt coalescing’ has two advantages:

  • First, it reduces context switches for you. Context switches are more costly than they appear: ‘A 20-minute interruption while working on a project entails two context switches; realistically, this interruption results in a loss of a couple of hours of truly productive work’, writes Dave O’Connor in Site Reliability Engineering: How Google Runs Production Systems.
  • Second, it sends the right message. If you take care of ad hoc requests right away, people may see that as an invitation to come to you with more ad hoc requests in the future. That’s not a position you want to put yourself in.

Automate recurring ad hoc requests. A good rule of thumb: if you’ve done a small task twice, the third time try to write an automated script. For example, at Amazon our team was frequently asked to prepare a list of all models we owned, along with certain metadata such as the time since the last model update. After doing the task manually a few times, I scheduled an ETL job that would automatically pull the data each week and email it to the relevant stakeholders. Voilà, I never had to think about that task again. A minimal sketch of such a job is shown below.
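
Everything in the following sketch is a stand-in: a hypothetical SQLite metadata store with invented table and column names, hypothetical email addresses, and a local SMTP relay. In practice you’d run it on whatever scheduler your team already uses (cron, Airflow, and so on):

```python
# Minimal sketch of a weekly "model inventory" report (all names hypothetical).
import smtplib
import sqlite3
from email.message import EmailMessage

DB_PATH = "model_metadata.db"                    # hypothetical metadata store
STAKEHOLDERS = ["stakeholders@example.com"]      # hypothetical recipients

def build_report() -> str:
    """Pull all owned models and the days since their last update."""
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute(
        """
        SELECT model_name,
               CAST(julianday('now') - julianday(last_updated) AS INT) AS days
        FROM models
        ORDER BY days DESC
        """
    ).fetchall()
    conn.close()
    return "\n".join(f"{name}: {days} days since last update"
                     for name, days in rows)

def send_report(body: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = "Weekly model inventory"
    msg["From"] = "team-reports@example.com"     # hypothetical sender
    msg["To"] = ", ".join(STAKEHOLDERS)
    msg.set_content(body)
    with smtplib.SMTP("localhost") as server:    # assumes a local SMTP relay
        server.send_message(msg)

if __name__ == "__main__":
    send_report(build_report())                  # schedule weekly, e.g. via cron
```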

Beware of data insights that aren’t actionable. Before writing any SQL query, always ask what action could be taken based on the result. If the data point you’re pulling doesn’t inform any particular business action either way, do you really need it?

2. Meeting paralysis

You were hired to solve problems, not to sit in meetings. Yet, almost inevitably, meeting load tends to increase with your tenure inside an organization. If you feel that you’re not getting any actual work done because of meetings, you’re suffering from meeting paralysis.

👉 What to do instead:

Leave meetings if you aren’t contributing. If you are neither contributing your expertise to a decision-making process, nor presenting anything to others, nor learning anything that’s useful to you, then leave. It’s not rude to leave a meeting that’s wasting your time, but it is rude to waste company resources (and that includes your time).

Skip an instance of a recurring meeting and see what happens. Sometimes you find yourself on the invite list of a recurring meeting series even though you’re not sure whether your presence is actually required. Amazon principal engineer Steve Huynh has a useful tip for this situation: simply skip the next instance and see what happens. If no one asks where you were, you can permanently drop the series from your calendar. If people do ask about your absence, apologize and say you’ll be back next time.

Feel empowered to reschedule meetings. There is an enormous difference between having four 30-minute meetings spread throughout the day and having them all in one 2-hour chunk in the morning. In the former case you end up with four times as many context switches between ‘meeting mode’ and ‘deep work mode’ (eight instead of just two), which can easily cost an entire day of productive time. Batching your meetings so as to minimize context switches is therefore not only in your own best interest, but also in the company’s, because they pay for your time. Never hesitate to do so.

3. Pin factories

Here’s a recipe for disaster: have a team of model developers build model artifacts and hand them over to an engineering team for deployment. This ‘pin factory’ approach (named after Adam Smith’s famous example of the division of labor) doesn’t work well in practice because:

  • It introduces communication overhead. The back and forth between modeling and engineering teams adds friction and slows down iteration cycles. ML is an empirical discipline, and fast iteration cycles are key to success.
  • It falsely assumes that model development and deployment are independent of each other. In reality, infrastructure constraints inform modeling decisions, and vice versa. For example, a deep neural network may need to be tweaked to meet latency requirements.
  • It introduces the risk of finger-pointing when things break in production. For example, if a categorical feature suddenly takes a new value in production that breaks the model (see the sketch after this list), is that a modeling problem or an engineering problem? It’s both, but in a pin-factory setup there’s no clear owner for it.
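
To make that failure mode concrete, here is a minimal sketch, assuming scikit-learn (the feature values are invented), of how an unseen category breaks inference, plus one possible mitigation:

```python
# Minimal sketch (assuming scikit-learn) of the unseen-category failure mode.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["ios"], ["android"]])             # invented training values
encoder = OneHotEncoder(handle_unknown="error").fit(train)  # "error" is the default

serving = np.array([["web"]])                        # new category in production
try:
    encoder.transform(serving)
except ValueError as e:
    print(f"Inference failed: {e}")

# One possible mitigation: encode unseen categories as all-zeros. Whether that
# is acceptable is a modeling decision, yet the failure surfaces in serving,
# which is exactly why ownership gets murky in a pin factory.
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
print(tolerant.transform(serving).toarray())         # [[0. 0.]]
```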

👉 What to do instead:

The opposite of the pin factory approach is end-to-end ownership, where a model owner owns the entire model lifecycle, including data collection, model development, offline/online testing, and deployment.

For this approach to work, an organization needs two types of ML roles, namely ML infra engineers and model developers (aka data/applied/research scientists). ML infra engineers build, maintain, and own a set of services that abstract away the infrastructure around the model development process: one service for model training, one for feature engineering, one for inference calls, and so on. Model owners use these services to develop, test, and deploy their models end to end.

The former own the services, the latter own the models: this approach works well because it has clear ownership boundaries and little communication overhead.
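
Here is a minimal sketch of what that ownership boundary might look like in code. All interfaces, dataset names, and config values are hypothetical; the point is only the division of labor, with the infra team owning the service implementations and the model owner composing them:

```python
# Sketch of the ownership split (all names and interfaces hypothetical).
from typing import Any, Protocol

# --- Owned by ML infra engineers: the service interfaces and implementations.
class FeatureService(Protocol):
    def build_features(self, dataset: str) -> Any: ...

class TrainingService(Protocol):
    def train(self, features: Any, config: dict) -> str: ...  # returns a model id

class InferenceService(Protocol):
    def deploy(self, model_id: str) -> None: ...

# --- Owned by the model developer: the whole lifecycle of one model.
def ship_model(features: FeatureService,
               training: TrainingService,
               inference: InferenceService) -> None:
    training_data = features.build_features(dataset="clicks_2022_q4")
    model_id = training.train(training_data, config={"model": "gbdt"})
    inference.deploy(model_id)  # the owner ships their own model
```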

4. Excessive resume-driven development

ML model developers are often incentivized to build overly complex solutions where simple ones would do, just because complex solutions look better on a resume or promo doc, a phenomenon known as resume-driven development. This creates higher maintenance costs for the entire organization in the long run: overly complex models are harder to interpret, debug, and retrain, and they incur higher infrastructure costs for training and serving.

👉 What to do instead:

A good strategy, therefore, is to keep production systems as simple as possible and reserve wild ideas for research projects that can result in publications. LinkedIn Staff ML engineer Brandon Rohrer puts it this way:

“ML strategy tip: When you have a problem, build two solutions — a deep Bayesian transformer running on multicloud Kubernetes and a SQL query built on a stack of egregiously oversimplifying assumptions. Put one on your resume, the other in production. Everyone goes home happy.”

5. HARKing

As an empirical discipline, ML owes much of its progress to experimentation. However, things get problematic when model developers throw everything at a problem to “see what sticks”. This is HARKing (Hypothesizing After the Results are Known): the practice of formulating a hypothesis only after the results from a large suite of experiments are in.

Science vs HARKing. Image by the author.

HARKing is dangerous because of the statistical look-elsewhere effect: the more experiments you run, the higher the chance that one model will look better purely by chance. Needless to say, if the proposed model was HARKed, then the expected model improvement will not actually materialize in production: it was just a statistical fluke. The short simulation below makes this concrete.
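
In this simulation, fifty “variants” are all statistically identical to the baseline (every number here is an illustrative assumption), yet the best measured one still looks like a win:

```python
# Small simulation of the look-elsewhere effect; all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_test, true_accuracy, n_experiments = 2_000, 0.80, 50

# Every "variant" has the same true accuracy as the baseline; we only
# observe noisy measurements on a finite test set.
measured = rng.binomial(n_test, true_accuracy, size=n_experiments) / n_test

print(f"true accuracy of every variant: {true_accuracy:.3f}")
print(f"best measured accuracy out of {n_experiments}: {measured.max():.3f}")
# The gap between the two is pure noise and will not materialize in production.
```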

👉 What to do instead:

  • Always follow the scientific method. Prior to running any experiments, always explicitly formulate a hypothesis. This can be as simple as “I hypothesize that a BERT model is better than bag-of-words because in this problem the context of words matters, and not just their frequency.” Experimenting without a hypothesis is not science, it’s pseudo-science.
  • Estimate the variance. Instead of just measuring model accuracy on the test set, try to estimate its variance as well, so that you know how statistically meaningful a change is. You can estimate the variance, for example, by bootstrapping, i.e. creating multiple measurements from different random subsets (drawn with replacement) of the test set; see the sketch after this list.
  • Don’t rely on offline metrics, which can be easily gamed. Always make the final launch decision based on online model performance.
  • Document everything. Even if an experiment is unsuccessful (it doesn’t improve the model performance), document what you tried, why you tried it, and what you found. Leaving such a paper trail will help you understand and communicate later on how you arrived at the final modeling choices, and prevents you (and others) from trying the same things again.
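
Here is a minimal bootstrap sketch (assuming NumPy; the function and parameter names are mine) that turns a single test-set accuracy into a mean plus a standard deviation:

```python
# Minimal bootstrap sketch for test-set accuracy (names are hypothetical).
import numpy as np

def bootstrap_accuracy(y_true: np.ndarray, y_pred: np.ndarray,
                       n_boot: int = 1000, seed: int = 0):
    """Return (mean, std) of accuracy over bootstrap resamples of the test set."""
    rng = np.random.default_rng(seed)
    correct = (y_true == y_pred)
    n = len(correct)
    # Each resample draws n indices with replacement and re-measures accuracy.
    accs = np.array([correct[rng.integers(0, n, size=n)].mean()
                     for _ in range(n_boot)])
    return accs.mean(), accs.std()

# Usage: mean, std = bootstrap_accuracy(y_true, y_pred)
# An offline "improvement" much smaller than std is hard to
# distinguish from noise.
```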

Let’s summarize these 5 anti-patterns and their respective resolutions as follows:

  1. avoid ad hoc overload: practice interrupt coalescing,
  2. avoid meeting paralysis: leave meetings that don’t actually require your presence,
  3. avoid pin factories: work in organizations that support end-to-end model ownership,
  4. avoid excessive resume-driven development: reserve wild ideas for your research papers, not for production, and
  5. avoid HARKing: always follow the scientific method and formulate a hypothesis prior to running any experiments.

Of course, this list is by no means complete. Generally, a good way to spot anti-patterns in your day-to-day work is to watch out for recurring pain points. For example, the recurring pain point caused by ad hoc overload is frequent context switching that drags down your productivity. The recurring pain point caused by pin factories is having to send work back and forth between teams. Pay attention to recurring pain points, and you may discover the underlying anti-patterns.

Lastly, distinguish between anti-patterns you can break yourself and those you can’t. For example, you can probably break ad hoc overload and meeting paralysis yourself, at least to some degree, by following steps like the ones outlined here. But for organizational anti-patterns such as pin factories, there’s not much you can do as an individual contributor; in that case, the only real fix may be to leave the organization.


