Here’s My Data Science Workflow Template


  1. Introduction
  2. Data Exploration, Data Aggregation, and Feature Engineering
  3. Algorithm Comparison
  4. Summary
  5. References

While I have written articles covering the entire data science process, including facets like business understanding and stakeholder collaboration, here I want to focus purely on the data science work itself. In this article, I will provide a template for any data scientist to use and build on. I assume that mid- and senior-level data scientists already follow this general pattern, so if you are more junior, just starting out, or simply curious about data science, this article is for you. With that said, let’s dive into the main parts of the data science process: data exploration, data aggregation, feature engineering, and algorithm comparison.

Photo by Jason Briscoe on Unsplash [2].

Keep in mind that companies and roles differ, so you may need to adapt this template to your situation. The steps described below assume you have already defined the business problem and met with stakeholders. Depending on your role, you may also be responsible for parts of the process I attribute here to data engineering or data analytics.

Data Exploration

  • Define your initial dataset → you most likely already have assumptions about what data you want before even training your model, so start with those features and iterate from there
  • Define the sources of this dataset → is it from a current company database table, a one-off Google Sheet/Microsoft Excel Sheet, a third-party data platform, etc.?
  • Does the data currently exist? → If yes, move forward. If no, you will most likely work with data engineering, data analytics, etc. to build an accurate source for the data the model will ingest.

— You might also need to provide reasoning to the data engineer for requesting specific data points. For example, if you want population grouped by zip code in a table, group it yourself first (see the sketch below) and explain how the data is expected to improve accuracy and, ultimately, the business outcome, to justify the data engineering time.
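
To make that request concrete, here is a minimal sketch of the zip-code grouping in pandas. The table, column names, and values are all hypothetical stand-ins for whatever raw data you actually have.

```python
import pandas as pd

# Hypothetical raw table: one row per household, with a zip code and the
# number of people in the household. Column names are illustrative only.
raw = pd.DataFrame({
    "zip_code": ["10001", "10001", "10002", "10002", "10002"],
    "household_size": [2, 3, 4, 1, 5],
})

# Group it yourself first, so the data engineer sees exactly what you need.
population_by_zip = (
    raw.groupby("zip_code", as_index=False)["household_size"]
       .sum()
       .rename(columns={"household_size": "population"})
)
print(population_by_zip)
```

A quick prototype like this doubles as documentation: it shows the data engineer the exact grain and schema you are asking them to productionize.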

Data Aggregation and Feature Engineering

  • Is your data already aggregated correctly? → If yes, move forward. If not, you will need to aggregate the data and store it in that format. For example, if you want to predict county-level population, you might aggregate your training rows to the county level of a state (a geography grouping). You may be able to do this without data engineering’s help with storage, using grouping techniques in a library like pandas in Python (see the sketch after this list).
  • Do you want to create new model features? → If yes, identify and test features that can explain the variance in what you are predicting, using tools like correlation to the target, automatic feature selection, and SHAP feature importance (see the sketch after the reflection below). If not, move forward.
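
As a concrete illustration of the aggregation step, here is a minimal sketch in pandas. The city-level rows, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical city-level rows; all names and values are illustrative.
cities = pd.DataFrame({
    "state": ["NY", "NY", "NY", "CA"],
    "county": ["Kings", "Kings", "Queens", "Alameda"],
    "population": [1_000, 2_000, 1_500, 3_000],
    "median_income": [55_000, 60_000, 58_000, 72_000],
})

# Roll city rows up to the county level (a geography grouping), so each
# training row matches the county-level target you want to predict.
counties = cities.groupby(["state", "county"], as_index=False).agg(
    population=("population", "sum"),
    median_income=("median_income", "mean"),
)
print(counties)
```

Note the choice of aggregation per column: counts and populations usually sum, while rates and medians usually average; getting this wrong silently corrupts your training data.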

Reflection and What Might Surprise You:

The key takeaway from this section is that you might not have as much relevant data, or as many relevant features, as you would expect when starting a data science project. Keep this in mind as you begin your initial data exploration.
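
Tying back to the feature question above, here is a minimal sketch of testing how much each candidate feature explains the target, using correlation to the target and scikit-learn’s automatic feature selection. Every column name and value here is hypothetical. (SHAP feature importance, also mentioned above, is another option once you have a trained model.)

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical county-level training frame; names and values are made up.
df = pd.DataFrame({
    "median_income": [55, 60, 58, 72, 65, 48],
    "housing_units": [400, 800, 600, 1200, 900, 350],
    "avg_commute":   [31, 29, 33, 27, 30, 34],
    "population":    [1000, 2000, 1500, 3000, 2400, 900],
})
X = df.drop(columns="population")
y = df["population"]

# 1) Correlation of each candidate feature with the target.
print(X.corrwith(y))

# 2) Automatic feature selection: keep the two most predictive features.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(X.columns[selector.get_support()].tolist())
```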

Photo by Liz Sanchez-Vegas on Unsplash [3].

One thing to keep in mind: feature engineering, the last step of the previous section, can swap order with algorithm comparison, because you might find that different features are more useful, or easier to use, with certain algorithms.

Now that you have your features (or at least a starting point), you will want to pick a specific algorithm for your final model.

Algorithm Comparison

  • Is your target a continuous number or a category? → If it is a continuous number, you will compare regression algorithms; if it is a category, you will compare classification algorithms.
  • What algorithm do you choose? → It is easiest to start with as many algorithms as you understand and narrow down from there. There are plenty of ways to compare algorithms, like running each one separately or writing a loop over all the algorithms you have selected, but what I have found easiest is a library that compares algorithms side by side, like Moez Ali’s PyCaret library and its incredibly valuable compare_models() function [4] (a minimal sketch follows this list).
  • What do I look at when comparing different algorithms? → You can compare a few things, depending on your use case, but for a regression target you will usually look at an error metric such as MAE or RMSE. You can also look at how long training took, whether you needed hyperparameter tuning, and more.
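
Here is a minimal sketch of that side-by-side comparison with PyCaret’s regression module [4], assuming you have PyCaret installed and a pandas DataFrame df with a continuous target column; the column name here is hypothetical.

```python
# Assumes: pip install pycaret, plus a DataFrame `df` with a continuous
# target column (hypothetically named "population" here).
from pycaret.regression import setup, compare_models

s = setup(data=df, target="population", session_id=42)

# Trains the available regressors and ranks them side by side on metrics
# like MAE, RMSE, and R2; `sort` picks the metric used for ranking.
best_model = compare_models(sort="MAE")
print(best_model)
```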

Reflection and What Might Surprise You:

Algorithm comparison used to be a lot more tedious, and as the years pass, there are more and more tools you can utilize to easily compare algorithms, so you can spend more time on the one you end up choosing. You can even train and compare multiple algorithms in the same time window, optimizing for a specific loss function more easily on your training cadence so that you are not set to just one algorithm, as your underlying data and assumptions can change.
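
For comparison, here is a minimal sketch of the manual loop approach with scikit-learn, evaluating several regressors on the same folds and the same metric. The models chosen are just examples, and X and y are assumed to be your feature matrix and continuous target.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Candidate regressors to compare; swap in whatever models you understand.
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

for name, model in models.items():
    # scikit-learn reports negated MAE by convention; flip the sign back.
    scores = -cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error"
    )
    print(f"{name}: MAE = {scores.mean():.3f} (+/- {scores.std():.3f})")
```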

While companies and roles can and do differ, there are some key steps to the data science process that are widely practiced. We walked through the steps that make up the data science workflow for a typical project.

To summarize, here are those steps we discussed in our data science workflow template:

  • Define your initial dataset
  • Define the sources of this dataset
  • Does the data currently exist?
  • Is your data already aggregated correctly?
  • Do you want to create new model features?
  • Is your target a continuous number or a category?
  • What algorithm do you choose?
  • What do I look at when comparing different algorithms?

I hope you found this article both interesting and useful. Please feel free to comment below on whether your experiences as a data scientist have been similar or different, or on what you would expect coming into data science for the first time. What other topics, or additional steps, do you think deserve more discussion? These steps can certainly be elaborated further, but I hope I was able to shed some light on what to expect in a typical data science workflow.

I am not affiliated with any of these companies.

Please feel free to check out my profile, Matt Przybyla, and my other articles. You can subscribe to receive email notifications for my blogs via the link below, or by clicking the subscribe icon at the top of the screen next to the follow icon, and you can reach out to me on LinkedIn with any questions or comments.

Subscribe link: https://datascience2.medium.com/subscribe

Referral link: https://datascience2.medium.com/membership

(I will receive a commission if you sign up for a membership on Medium)

[1] Photo by Campaign Creators on Unsplash, (2018)

[2] Photo by Jason Briscoe on Unsplash, (2019)

[3] Photo by Liz Sanchez-Vegas on Unsplash, (2019)

[4] Moez Ali, PyCaret, PyCaret Regression Tutorial (REG101) — Level Beginner, (2023)

