How To Transform a Take-Home Assignment Into a Data Science Job
by Alex Vamvakaris, January 2023



A take-home assignment is a common step in many data science interviews, typically given in the later stages of the screening process. The first rounds assess your knowledge of statistics (hypothesis testing, etc.) and often include practice coding questions (SQL, R, etc.). If you need to brush up on your skills in these areas, please check the articles below.

In many ways, the take-home assignment is a simulation of what the company will expect from you if you are hired and given a similar problem. In this article, I have created a step-by-step guide on how to approach this challenge based on my experience as a candidate and as a hiring manager. The guide will be structured as follows:

  • The basics of take-home assignments
  • Quick QA of the data
  • The exploratory phase: Find patterns in the data
  • The explanatory phase: Build your story
1. The basics of take-home assignments

1.1 How are take-home assignments structured?

Take-home assignments are composed of two elements: a dataset and an overview of the assignment.

  • The dataset will usually be a fictional version of the company’s data. It is provided in a .csv format, with a description of all columns. Rarely, you might be given more than one dataset and be expected to combine them
  • The assignment overview provides some background on the task and a short description of what they expect you to do with the dataset. For example, in my take-home assignment for a data science position at Deliveroo, the task was to analyze the performance of their RGR (Rider Gets Rider) referral program compared to other marketing channels, then to assess whether it was successful and, if not (ha!), which important factors we should consider changing. The last step of the task involved presenting a summary of the findings to senior non-technical stakeholders and suggesting next steps, such as gathering additional data

1.2 What are the expected insights to pass the take-home assignment?

The take-home assignment puts you in a unique position of advantage and disadvantage at the same time. On the one hand, you are already given the problem definition and the dataset, which usually take up a hefty amount of time in a project pipeline. You are also given freedom in how to approach the analysis. On the other hand, you cannot communicate further with the business stakeholders, which is scary. This is exactly what the test evaluates: the ability to take such a vague assignment and, without overcomplicating it (in fact, this is one of the things you will be evaluated on), deliver the following:

A clear understanding of the dataset that leads you to identify the problem (or opportunity) hidden in it

Furthermore, the ability to pinpoint possible sources of the problem and provide clear recommendations and next steps for your intended audience

In other words, they want you to showcase that you can provide value from data!

2. Quick QA of the data

Before we proceed, we need to quickly check the health of our dataset. In computer science, garbage in, garbage out (GIGO) is the concept that flawed or nonsensical (garbage) input data produces nonsensical output. While it is tempting to skip this step and dive straight into the analysis, you do not want to trip up and fail the test over such a simple mistake. Run through the checklist below (an R sketch follows the list):

  • Identify the column that serves as the primary key (i.e., unique for each row of your dataset). This information will usually be provided
  • Check columns for missing values and label them accordingly (NA in R, NULL in SQL, etc.). Hidden missing values, such as a suspicious peak at 999, should also be flagged as NA
  • Make sure all columns are formatted appropriately. For example, a binary attribute (values of 0 and 1) should be stored as categorical rather than as an integer, and a date should not be mislabeled as a character
  • Check for values that don't make sense. A value of 1 billion when all other observations are between 0 and 1,000 can be corrected where possible or otherwise excluded (make sure you explain any exclusion decisions in your presentation)
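
To make this concrete, here is a minimal QA sketch in R. It assumes a hypothetical orders.csv with an order_id primary key, a binary is_referral flag, a signup_date column stored as text, and a revenue column where 999 was used as a missing-value sentinel; adapt the names and rules to your own dataset.

```r
library(dplyr)

orders <- read.csv("orders.csv", stringsAsFactors = FALSE)

# 1. Primary key: every order_id should appear exactly once
stopifnot(!any(duplicated(orders$order_id)))

# 2. Missing values: count NAs per column and flag hidden sentinels like 999
colSums(is.na(orders))
orders <- orders %>% mutate(revenue = ifelse(revenue == 999, NA, revenue))

# 3. Formats: store binary flags as factors and parse dates saved as text
orders <- orders %>%
  mutate(
    is_referral = factor(is_referral),
    signup_date = as.Date(signup_date, format = "%Y-%m-%d")
  )

# 4. Values that don't make sense: inspect anything far outside the plausible range
summary(orders$revenue)
orders %>% filter(revenue > 1000)  # review these rows before correcting or excluding
```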
3. The exploratory phase: Find patterns in the data

3.1 Creating KPIs

For most companies, the main interest lies in the numerical attributes. These can be anything, depending on the company: for our Deliveroo example, it would be the number of referrals, while for a company like Netflix, it could be the number of videos watched. Most often, they will be monetary (i.e., revenue based).

First, we want to better understand these numerical attributes. We can use descriptive statistics and plots to that end (a short R sketch follows the list).

  • Descriptive Statistics: Sum, Mean, Median, Mode, Interquartile Range (IQR), Standard Deviation
  • Visualization: Bar Plot for discrete and Histogram or Box Plot for continuous attributes
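
As a quick illustration, the sketch below computes these statistics and plots for the hypothetical orders data from the QA step (the revenue and is_referral columns are assumptions, not part of any real assignment):

```r
library(ggplot2)

# Descriptive statistics for a continuous attribute
summary(orders$revenue)            # min, quartiles, median, mean, max
sd(orders$revenue, na.rm = TRUE)   # standard deviation
IQR(orders$revenue, na.rm = TRUE)  # interquartile range

# Histogram for a continuous attribute
ggplot(orders, aes(x = revenue)) +
  geom_histogram(bins = 30) +
  labs(x = "Revenue", y = "Count")

# Bar plot for a discrete attribute
ggplot(orders, aes(x = is_referral)) +
  geom_bar() +
  labs(x = "Referral order?", y = "Count")
```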

We now want to identify our KPIs (Key Performance Indicators). KPIs are the subset of descriptive statistics that are meaningful to business stakeholders (the IQR, for example, would be a poor choice for a KPI). The most frequently used statistic for a KPI is the average (average number of referrals per rider or average number of videos watched per user). You will be expected to select the appropriate statistic for your KPI(s) based on the distribution of the data. So if there are outliers, you can choose the 95% trimmed mean, or if the distribution is highly skewed, you can select the median.

It is important to walk the business stakeholders through these decisions. Explaining that, because of extreme outliers, the mean comes out at $10,000, and that a much better choice is the 95% trimmed mean with a more representative value of $120, showcases that (a) you have identified the outliers and (b) you know how to deal with them!
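
In R, all three candidate statistics are one-liners, so it costs nothing to compare them side by side (the column is again the hypothetical revenue; trim = 0.025 drops the top and bottom 2.5% of values, which is what I mean by the 95% trimmed mean):

```r
mean(orders$revenue, na.rm = TRUE)                # pulled up by extreme outliers
mean(orders$revenue, trim = 0.025, na.rm = TRUE)  # 95% trimmed mean
median(orders$revenue, na.rm = TRUE)              # robust choice for skewed data
```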

3.2 Adding dimensions

Our analysis is still too high level. We need to add the categorical attributes to the mix. You can think of them as the dimensions of our KPIs. For our Deliveroo example, we can split the median referrals per rider by city or month. Dates are one of the most frequently used columns for dimensions.

Some categorical attributes may have many rare classes (a class is a unique value of a categorical attribute, such as “Boy” and “Girl”). These rare classes can be combined or reassigned. We can also group numerical attributes in a similar way to create new categorical dimensions. For example, using quantiles, we can group revenue into four buckets (or bins). But try not to overcomplicate the assignment; unless there is a good reason, refrain from binning numerical attributes.

We can now start making comparisons by splitting our KPIs by a categorical attribute or by dates (see the sketch after the list):

  • Did our KPIs perform better in one city than another?
  • Was total revenue down or up compared to previous months?
  • We can also combine dimensions. If revenue was down 15% year on year, was the difference caused by worse performance in one or two cities, or was it an almost uniform decline of 15% across all cities?
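
Here is a minimal sketch of both ideas, assuming a hypothetical riders data frame with city, signup_date, and referrals columns (plus the orders data from before for the binning example):

```r
library(dplyr)

# KPI split by two dimensions: median referrals per rider, by city and month
riders %>%
  mutate(month = format(signup_date, "%Y-%m")) %>%
  group_by(city, month) %>%
  summarise(median_referrals = median(referrals, na.rm = TRUE), .groups = "drop")

# Quantile binning: group revenue into four buckets (use sparingly)
orders <- orders %>% mutate(revenue_bucket = ntile(revenue, 4))
```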

3.3 Root Cause Analysis

Root cause analysis aims to identify the underlying factors that contribute to a problem. Let’s see how that works using an example. Imagine an assignment where Total Revenue was down by 25% compared to last month. You added the categorical attribute City as a dimension and observed that the difference came almost exclusively from New York and London; Total Revenue stayed flat month on month in the other cities. First, we decompose Total Revenue into its two components:

Total Revenue = Total Number of Orders * Average Order Value

Looking at the (fictional) data, for both cities the Total Number of Orders was up by 9%, but the Average Order Value was down by 38% (1.09 × 0.62 ≈ 0.68, i.e., roughly a 32% revenue drop in these two cities)! Okay, now we are getting somewhere. We can also examine the funnel that users go through. For example, if orders are down, we can examine how many visitors we had, how many of them added a product to the basket, how many proceeded to the checkout page, and finally, how many confirmed their order.
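
A sketch of the decomposition in R, assuming hypothetical order_date and order_value columns in the orders data; each component is compared to the previous month within each city:

```r
library(dplyr)

monthly <- orders %>%
  mutate(month = format(order_date, "%Y-%m")) %>%
  group_by(city, month) %>%
  summarise(
    n_orders = n(),
    aov      = mean(order_value, na.rm = TRUE),  # Average Order Value
    revenue  = sum(order_value, na.rm = TRUE),
    .groups  = "drop"
  ) %>%
  arrange(city, month) %>%
  group_by(city) %>%
  mutate(
    orders_mom  = n_orders / lag(n_orders) - 1,  # month-on-month change
    aov_mom     = aov / lag(aov) - 1,
    revenue_mom = revenue / lag(revenue) - 1
  ) %>%
  ungroup()
```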

Of course, the above depends on how much data is available. We can only work with what we have, so do not worry if some of these steps are not possible due to data limitations. Just make sure you list them as potential data sources when you go through your analysis’s recommendations and next steps.

🚀🚀 And with that last step, we have reached the end of the guide. Below you can also find a quick summary of the steps:

✅ Read the description carefully and understand what the task and objectives are

✅ QA your data before you start your analysis

✅ Create KPIs, and split them by dimensions to pinpoint the problem

✅ Conduct Root Cause Analysis to identify the factors that might explain the problem

✅ Communicate only the pearls of your analysis. Resist the urge to show all the oysters you opened

One last point I want to emphasize is the importance of doing your own assessment of the company based on the quality of the take-home assignment. Was the description clear? Was the dataset reasonably challenging? Was sufficient effort put into the problem and dataset, or did it look like a hasty effort?

If you have any questions or need further help, please feel free to comment below, and I will answer promptly.

