Techno Blender
Digitally Yours.

How to Use Pandas to Get Your Data in the Format You Need | by Murtaza Ali | Oct, 2022



Learn the difference between long-form and wide-form data, and how to transition between them in Pandas

Photo by David Becker on Unsplash

It’s a well-known fact among data scientists: your data will never be exactly the way you want it. You might get a somewhat organized spreadsheet or reasonably sensible tables, but there will always be some cleaning up to do before you’re ready for analysis.

As a result, it’s crucial to be able to transition between different forms of data. Sometimes, it’s simply a matter of readability and ease of interpretation. Other times, you’ll quite literally find that a software package or model you’re trying to use simply won’t work unless your data is in a specific format. Whatever the case may be, this is a good skill to have.

In this article, I’m going to discuss two common forms of data: long-form data and wide-form data. These are widely used paradigms in data science, and it is good to be familiar with them. We’ll look at some examples to see what exactly both data formats look like, and then we’ll see how to convert between them using Python (and, more specifically, Pandas).

Let’s get into it.

Long-Form vs. Wide-Form Data

The simplest way to begin is with straightforward definitions [1]:

  • Wide-form data has one row for each possible value of your independent variable, with all dependent variables recorded in the column labels. As a result, the label in each row (for the independent variable) will be unique.
  • Long-form data has one row for each observation, so each value of a dependent variable gets a row of its own instead of a column of its own. Hence, values for the independent variable repeat across rows.

Okay, cool — but what does that mean? It’ll be easier to understand if we look at an example. Say we have a data set of students, and we are storing their scores on a midterm exam, final exam, and class project. In wide form, our data would look like this:

Image by Author

Here, Student is the independent variable, and each of the scores is a dependent variable (because the score for a particular exam or project depends on the student). We can see that the value of Student is unique for each row, as we would expect for wide-form data.
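To make the wide-form layout concrete, here is a minimal sketch that builds such a table in Pandas. The student names and scores are invented for illustration and are not the values from the original table.

```python
import pandas as pd

# Hypothetical students and scores, made up for this sketch.
student_data = pd.DataFrame({
    "Student": ["Alice", "Bob", "Carla"],
    "Midterm": [85, 72, 90],
    "Final": [88, 79, 94],
    "Project": [92, 81, 87],
})

# Wide form: one row per student, one column per assignment.
print(student_data)

# Each value of the independent variable appears exactly once.
print(student_data["Student"].is_unique)
```

The uniqueness check on the Student column is the property that marks this table as wide-form.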

Now let’s look at the exact same data, but in long form:

Image by Author

This time around, we have a row for each observation. In this case, an observation corresponds to a score on a particular assignment. In the wide-form version of this data above, we recorded multiple observations (scores) in a row, whereas here every row has its own score.

Additionally, we can see that the values for our independent variable Student repeat in this data format, which is again what we expected.
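As a rough sketch of what long form looks like in code, the table below is typed out by hand with one row per score. The names and numbers are hypothetical, chosen only to show the shape of the data.

```python
import pandas as pd

# Hypothetical long-form data: one row per (student, assignment) observation.
student_data_long = pd.DataFrame({
    "Student": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"],
    "Assignment": ["Midterm", "Final", "Project"] * 2,
    "Score": [85, 88, 92, 72, 79, 81],
})

print(student_data_long)

# Each student now repeats once per assignment.
counts = student_data_long["Student"].value_counts()
print(counts)
```

Counting the occurrences of each student shows the repetition directly: every student appears once for each assignment.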

In a moment, we’ll talk about why you should even care about these different formats. But first, let’s take a quick look at how we can use Pandas to convert between these different data formats.

Wide-Form Data to Long-Form Data: The Melt Function

Once again, let’s take a look at the wide-form data from above. This time, we’ll give the DataFrame a name: student_data.

Image by Author

To convert student_data into long form, we use the following line of code:

student_data.melt('Student', var_name='Assignment', value_name='Score')
Image by Author

Here’s a step-by-step explanation:

  • The melt function is designed to convert wide-form data into long-form data [2].
  • The first argument, 'Student', is passed as id_vars: it names the column to keep as an identifier, so every other column gets melted into rows.
  • The var_name parameter specifies what we want to name the second column, the one that will hold the names of our dependent variables (the former column labels).
  • The value_name parameter specifies what we want to name the third column, the one containing the individual values we are observing (in this case, the scores).
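The call above can be tried end to end with a small sketch like the following. The data is hypothetical, and the first argument is written out by keyword as id_vars, which is the parameter that the positional 'Student' fills.

```python
import pandas as pd

# Hypothetical wide-form data, invented for illustration.
student_data = pd.DataFrame({
    "Student": ["Alice", "Bob", "Carla"],
    "Midterm": [85, 72, 90],
    "Final": [88, 79, 94],
    "Project": [92, 81, 87],
})

# Keep Student as the identifier; melt every other column into rows.
student_data_long = student_data.melt(
    id_vars="Student",
    var_name="Assignment",   # name for the column of former column labels
    value_name="Score",      # name for the column of observed values
)

print(student_data_long)
```

With three students and three assignments, the result has nine rows: the three score columns are stacked one after another, and each student’s name repeats once per assignment.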

Okay, so we have our long-form data now. But what if — for whatever reason — we need to go back to a wide format?

Long-Form Data to Wide-Form Data: The Pivot Function

Now, we’re starting with the long-form version of our data above, called student_data_long. The following line of code will convert it back to our original format:

student_data_long.pivot(index='Student', columns='Assignment', values='Score')
Image By Author

Except for the slightly updated labels (pivot shows the overall column label 'Assignment'), this is precisely the data we started with above.

Here’s a step-by-step explanation:

  • The pivot function is designed to convert long-form data into wide-form data [3], but can actually accomplish much more than what’s shown here [4].
  • The index parameter specifies which column’s values will label our unique rows (i.e. the independent variable).
  • The columns parameter specifies which column’s unique values (in long form) will become the unique column labels.
  • The values parameter specifies which column’s values will make up the actual data entries in our wide format.
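Here is a small sketch of the pivot call on hypothetical long-form data, followed by one possible way to tidy the labels so the result matches a plain wide table. The cleanup steps are my own suggestion, not part of the original walkthrough.

```python
import pandas as pd

# Hypothetical long-form data, invented for illustration.
student_data_long = pd.DataFrame({
    "Student": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"],
    "Assignment": ["Midterm", "Final", "Project"] * 2,
    "Score": [85, 88, 92, 72, 79, 81],
})

wide = student_data_long.pivot(
    index="Student", columns="Assignment", values="Score"
)
print(wide)

# pivot makes Student the index, names the column axis 'Assignment',
# and sorts the new column labels. Optional cleanup:
restored = wide.reset_index()                                     # Student back to a column
restored = restored[["Student", "Midterm", "Final", "Project"]]   # original column order
restored.columns.name = None                                      # drop the 'Assignment' label
print(restored)
```

The cleanup is only needed if later code expects Student as an ordinary column or relies on a particular column order.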

And that’s all there is to it!

Why Does it Matter?

Finally, I want to briefly emphasize that while the above might seem superficial at first glance, it’s actually a very useful skill to have. Many times, you’ll find that having your data in a certain format will make your life much, much easier.

I’ll illustrate with an example from my own work. I often need to make data visualizations in Python, and my module of choice is Altair. This led to an unanticipated issue: most spreadsheets tend to be in wide format, but Altair’s specifications are significantly easier to use in long format.

I struggled for quite a while to develop one particular visualization earlier this year. Upon resigning myself to Stack Overflow, I discovered that all I needed to do was convert my data into a long format. If you’re skeptical, feel free to check out the post yourself.

Now, you may not work in visualization, but if you’re reading this, it’s safe to assume that you do work with data. Therefore, you should know how to manipulate it, and this is just one more useful skill to keep in your toolbox.

Best of luck in your data science endeavors.


