5 Functions is All You Need to Manage Your Data with dplyr | by Soner Yıldırım | Mar, 2023


How to efficiently make your data ready-to-use

Photo by Kelly Sikkema on Unsplash

Those who have worked on real-life tasks with real-life data know that the biggest slice of the pie belongs to data wrangling.

What I mean by data wrangling covers operations done to make the data ready for other stakeholders or downstream processes.

Whether you are a data analyst, data scientist, or data engineer, you will need to perform one or more of the following operations as part of your daily routine:

  • Filtering (e.g. get me the sales data in Texas)
  • Sorting (e.g. I want to see the top 10 best sellers of last week)
  • Updating (e.g. change the category of these products)
  • Summarizing (e.g. I want to see the average revenue per category)

Data analysis and manipulation tools stay relevant in the data science ecosystem only as long as they provide efficient ways of performing these operations.

In this article, we will learn how to handle these tasks using one of the predominant tools in data science: dplyr.

It is a package for the R programming language, described as “a grammar of data manipulation”.

We can either install dplyr by itself or install tidyverse, which is a collection of R packages for data science. I prefer the latter because it also gives me functions from the other tidyverse packages (e.g. read_csv from readr).
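
If you are setting things up from scratch, both options are a one-line install; a minimal sketch:

# either install the full tidyverse (which includes dplyr, readr, and friends)
install.packages("tidyverse")

# ...or install dplyr on its own
# install.packages("dplyr")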

Let’s start with importing the libraries and reading a dataset. We will be using a sample dataset that I prepared with mock data. You can download it from my datasets repo.

library(tidyverse)

sales <- read_csv("sales_data_with_stores.csv")

# display the first 6 rows
head(sales)

# A tibble: 6 x 8
store product_group product_code stock_qty cost price last_week_sales last_month_sales
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Violet PG2 4187 498 421. 570. 13 58
2 Rose PG2 4195 473 546. 712. 16 58
3 Violet PG2 4204 968 640. 855. 22 88
4 Daisy PG2 4219 241 870. 1035. 14 45
5 Daisy PG2 4718 1401 12.5 26.6 50 285
6 Violet PG4 5630 287 5.85 7.59 24 116

The resulting object is a tibble, which is just like a DataFrame or table.
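
If you want a quick structural overview of this tibble, the glimpse function (loaded with the tidyverse) prints one row per column with its type and first few values; a small sketch:

# compact, transposed overview of the columns
glimpse(sales)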

All data manipulation challenges come down to using one or more of the following 5 functions:

  • select
  • mutate
  • filter
  • arrange
  • summarise

The select function can be used for selecting columns. It allows us to do so by keeping or dropping columns using their names and types.

Let’s go over several examples to see how select works. The following example shows how to select a particular column or columns.

# select product_code and price columns
select(sales, product_code, price)

# A tibble: 1,000 x 2
product_code price
<dbl> <dbl>
1 4187 570.
2 4195 712.
3 4204 855.
4 4219 1035.
5 4718 26.6

# the following does the same
select(sales, c(product_code, price))
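
As a side note, select can also rename columns on the fly with name = old_name pairs; a small sketch (the new name code is just an illustration):

# select product_code (renamed to code) and price
select(sales, code = product_code, price)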

We can also select by excluding (dropping) one or more columns with the “!” operator, as follows:

# select all but product_group column
select(sales, !product_group)

# A tibble: 1,000 x 7
store product_code stock_qty cost price last_week_sales last_month_sales
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Violet 4187 498 421. 570. 13 58
2 Rose 4195 473 546. 712. 16 58
3 Violet 4204 968 640. 855. 22 88
4 Daisy 4219 241 870. 1035. 14 45
5 Daisy 4718 1401 12.5 26.6 50 285

If we are selecting multiple consecutive columns, it’s easier to use “:” to define the range.

# select all columns from store to cost
select(sales, store:cost)

# A tibble: 1,000 x 5
store product_group product_code stock_qty cost
<chr> <chr> <dbl> <dbl> <dbl>
1 Violet PG2 4187 498 421.
2 Rose PG2 4195 473 546.
3 Violet PG2 4204 968 640.
4 Daisy PG2 4219 241 870.
5 Daisy PG2 4718 1401 12.5

We can also use column indices so the following does the same operation as the one above:

# select all columns from store to cost
select(sales, 1:5)

We can also select columns based on their data types. For instance, the following selects the numerical columns.

# select all numeric columns
select(sales, where(is.numeric))

# A tibble: 1,000 x 6
product_code stock_qty cost price last_week_sales last_month_sales
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4187 498 421. 570. 13 58
2 4195 473 546. 712. 16 58
3 4204 968 640. 855. 22 88
4 4219 241 870. 1035. 14 45
5 4718 1401 12.5 26.6 50 285

By using the same predicate function (i.e. where), we can select non-numeric columns as follows:

# select all non-numeric columns
select(sales, !where(is.numeric))

# A tibble: 1,000 x 2
store product_group
<chr> <chr>
1 Violet PG2
2 Rose PG2
3 Violet PG2
4 Daisy PG2
5 Daisy PG2
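
Besides names, indices, and types, select also accepts tidyselect helpers such as starts_with and ends_with. As a small sketch, the following keeps only the two sales columns:

# select columns whose names start with "last"
select(sales, starts_with("last"))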

The mutate function, as its name suggests, modifies the tibble by updating the existing columns or creating new ones.

For instance, we can increase the price values by 10 percent as follows:

# increase price by 10 percent
mutate(sales, price = price * 1.1)

# A tibble: 1,000 x 8
store product_group product_code stock_qty cost price last_week_sales last_month_sales
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Violet PG2 4187 498 421. 627. 13 58
2 Rose PG2 4195 473 546. 784. 16 58
3 Violet PG2 4204 968 640. 940. 22 88
4 Daisy PG2 4219 241 870. 1138. 14 45
5 Daisy PG2 4718 1401 12.5 29.2 50 285

We can also create new columns:

# create price_updated column by increasing price by 10 percent
mutate(sales, price_updated = price * 1.1)

# A tibble: 1,000 x 9
store product_group product_code stock_qty cost price last_week_sales last_month_sales price_updated
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Violet PG2 4187 498 421. 570. 13 58 627.
2 Rose PG2 4195 473 546. 712. 16 58 784.
3 Violet PG2 4204 968 640. 855. 22 88 940.
4 Daisy PG2 4219 241 870. 1035. 14 45 1138.
5 Daisy PG2 4718 1401 12.5 26.6 50 285 29.2

Multiple updates can be done in a single operation. Before working through an example that demonstrates this, let’s mention two very useful features:

  • We can combine different types of operations using the “%>%” operator to create pipes and solve complex tasks involving multiple steps.
  • The new columns are immediately available so that we can use them in the same mutate function.

Here is an example to show the two features mentioned above:

sales %>%
  select(cost, price) %>%
  mutate(
    price_updated = price * 1.1,
    profit_updated = price_updated - cost
  )

# A tibble: 1,000 x 4
cost price price_updated profit_updated
<dbl> <dbl> <dbl> <dbl>
1 421. 570. 627. 206.
2 546. 712. 784. 238.
3 640. 855. 940. 300.
4 870. 1035. 1138. 268.
5 12.5 26.6 29.2 16.7

In the example above, we first select the cost and price columns from sales, create the price_updated column from the price column, and then use this new column to create the profit_updated column.

The filter function lets us filter observations (i.e. rows) based on a condition or a set of conditions.

The following example filters rows with a price of more than 1000.

filter(sales, price > 1000)

# A tibble: 5 x 8
store product_group product_code stock_qty cost price last_week_sales last_month_sales
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Daisy PG2 4219 241 870. 1035. 14 45
2 Violet PG1 9692 68 1243 1500. 26 94
3 Violet PG1 7773 602 976. 1325. 19 60
4 Daisy PG1 1941 213 847 1177. 18 72
5 Daisy PG1 4140 92 803 1202. 12 24

Let’s use a more complex filter:

# rows with a price of more than 1000 and store is Daisy
filter(sales, price > 1000 & store == "Daisy")

# A tibble: 3 x 8
store product_group product_code stock_qty cost price last_week_sales last_month_sales
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Daisy PG2 4219 241 870. 1035. 14 45
2 Daisy PG1 1941 213 847 1177. 18 72
3 Daisy PG1 4140 92 803 1202. 12 24

If we have a condition with multiple values, we can use the “%in%” operator.

# rows with product group of PG3, PG4, or PG5 and store is Daisy
filter(sales, product_group %in% c("PG3", "PG4", "PG5") & store == "Daisy")

# A tibble: 302 x 8
store product_group product_code stock_qty cost price last_week_sales last_month_sales
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Daisy PG4 5634 205 14.2 18.0 14 53
2 Daisy PG4 2650 239 59.4 111. 15 38
3 Daisy PG4 5647 352 5.85 13.3 37 108
4 Daisy PG4 5693 260 7.62 13.3 19 74
5 Daisy PG4 5696 260 7.62 13.3 29 98
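
Conditions can also be combined with “|” (or) instead of “&” (and). As a quick sketch, the following keeps rows that are either priced above 1000 or have a negative stock quantity:

# rows where price is above 1000 or stock_qty is negative
filter(sales, price > 1000 | stock_qty < 0)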

The arrange function sorts the rows based on the values in one or more columns. It is similar to SQL’s ORDER BY and Pandas’ sort_values function.

# order rows by price
arrange(sales, price)

# A tibble: 1,000 x 8
store product_group product_code stock_qty cost price last_week_sales last_month_sales
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Violet PG4 5454 -22 0.570 0.66 31 28
2 Violet PG5 5621 -123 1.32 0.76 49 100
3 Daisy PG4 2279 525 1.34 1.23 14 18
4 Rose PG2 9372 350 1.41 1.42 16 40
5 Rose PG4 2138 482 1.63 1.61 13 23

By default, the sorting is done in ascending order. We can change it to descending by adding a “-” before the column name.

# order rows by price descending
arrange(sales, -price)

# A tibble: 1,000 x 8
store product_group product_code stock_qty cost price last_week_sales last_month_sales
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Violet PG1 9692 68 1243 1500. 26 94
2 Violet PG1 7773 602 976. 1325. 19 60
3 Daisy PG1 4140 92 803 1202. 12 24
4 Daisy PG1 1941 213 847 1177. 18 72
5 Daisy PG2 4219 241 870. 1035. 14 45
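
The “-” trick only works for numeric columns; the more general dplyr idiom is the desc helper, which also handles character columns. The following sketch is equivalent to the call above:

# order rows by price descending using desc
arrange(sales, desc(price))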

To sort by multiple columns, we write the column names separated by commas. The following example sorts the rows by store name and then by last_week_sales (descending).

arrange(sales, store, -last_week_sales)

# A tibble: 1,000 x 8
store product_group product_code stock_qty cost price last_week_sales last_month_sales
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Daisy PG6 856 52748 31.4 38.0 1883 6880
2 Daisy PG3 2481 10543 8.25 14.2 947 4100
3 Daisy PG6 3957 10090 26.9 31.3 867 2355
4 Daisy PG6 889 21569 13.0 16.1 808 2990
5 Daisy PG6 9635 26576 8.53 11.4 673 2484

The summarise function reduces each group down to one row. Together with the group_by function, we can use it to calculate aggregated values per group. This operation is similar to SQL’s GROUP BY and Pandas’ groupby function.

# group the rows in sales by store
by_store <- group_by(sales, store)

# calculate the avg price for each store
summarise(by_store, avg_price = mean(price))

# A tibble: 3 x 2
store avg_price
* <chr> <dbl>
1 Daisy 69.3
2 Rose 60.5
3 Violet 67.8

There are several aggregate functions for extracting insights from data and performing in-depth analysis. The official documentation lists them; common ones include mean, median, sd, min, max, n, and n_distinct.
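
For instance, a piped version of the grouping above can compute several aggregates at once; a minimal sketch:

# several aggregates per store in a single pipe
sales %>%
  group_by(store) %>%
  summarise(
    avg_price = mean(price),
    max_price = max(price),
    n_products = n()
  )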

The functions we have covered in this article help you get almost all your data manipulation and analysis work done. Being able to use them together in a pipe provides further flexibility.
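
As a closing illustration, here is a hypothetical pipe (not from the original example) that strings most of these functions together: filter the Daisy store, derive a profit column, summarise it per product group, and sort the result.

# average profit per product group in the Daisy store, highest first
sales %>%
  filter(store == "Daisy") %>%
  mutate(profit = price - cost) %>%
  group_by(product_group) %>%
  summarise(avg_profit = mean(profit)) %>%
  arrange(-avg_profit)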

You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to subscribe if you’d like to get an email whenever I publish a new article.

Thank you for reading. Please let me know if you have any feedback.


