Techno Blender
Digitally Yours.

How to Calculate Medians with Grouping in MySQL | by Lynn Kwong | Dec, 2022

0 41


Learn advanced MySQL queries to calculate medians on different occasions

Image by geralt in Pixabay

Calculating the median of an array of data is pretty straightforward in any programming language, even in Excel, where a built-in or third-party median function can be used directly. However, in MySQL, the median function is not natively supported. To get the median, we need to write some smart queries with subqueries.

In this post, we will demystify the queries to calculate medians in MySQL. Especially, we will demonstrate how to calculate the medians of a field with grouping. If a column has multiple categories whose medians should be calculated separately, it becomes cumbersome to calculate the media for each category one by one. With this post, you will be able to calculate the medians of all the categories with a single query. Besides, you will also learn how to calculate the medians of multiple columns at the same time.

Preparations

To get started, we need to have a MySQL server, create a database and table, and insert some dummy data to play with. We will use Docker to start a MySQL 8 server locally:

# Create a volume to persist the data.
$ docker volume create mysql8-data

# Create the container for MySQL.
$ docker run --name mysql8 -d -e MYSQL_ROOT_PASSWORD=root -p 13306:3306 -v mysql8-data:/var/lib/mysql mysql:8

# Connect to the local MySQL server in Docker.
$ docker exec -it mysql8 mysql -u root -proot

mysql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 8.0.31 |
+-----------+
1 row in set (0.00 sec)

To write complex SQL queries, it’s recommended to use an IDE rather than the console. The benefits of an IDE are code completion, code formatting, syntax highlighting, etc. We will use DBeaver in this post, which is a universal database administration tool that can be used to manage all kinds of relational and NoSQL databases. However, any IDE will work. You can even copy the queries demonstrated in this post and run them in the MySQL console directly.

As to DBeaver, if you encounter the “Public Key Retrieval is not allowed” error, you should edit the connection and add the following two user properties for the driver:

# User properties
useSSL: false
allowPublicKeyRetrieval: true

For more settings of DBeaver, please check this post.

Then run the following queries to create the database, create the table and insert some dummy data:

CREATE DATABASE products;

CREATE TABLE `products`.`prices` (
`pid` int(11) NOT NULL AUTO_INCREMENT,
`category` varchar(100) NOT NULL,
`price` float NOT NULL,
PRIMARY KEY (`pid`)
);

INSERT INTO products.prices
(pid, category, price)
VALUES
(1, 'A', 2),
(2, 'A', 1),
(3, 'A', 5),
(4, 'A', 4),
(5, 'A', 3),
(6, 'B', 6),
(7, 'B', 4),
(8, 'B', 3),
(9, 'B', 5),
(10, 'B', 2),
(11, 'B', 1)
;

Calculate the median in the “classical” way

Now that the database and data are set up, we can start to calculate medians. The classical solution is to use SQL variables:

SELECT AVG(sub.price) AS median
FROM (
SELECT @row_index := @row_index + 1 AS row_index, p.price
FROM products.prices p, (SELECT @row_index := -1) r
WHERE p.category = 'A'
ORDER BY p.price
) AS sub
WHERE sub.row_index IN (FLOOR(@row_index / 2), CEIL(@row_index / 2))
;

median|
------+
3.0|

Key points for this query:

  • @row_index is a SQL variable that is initiated in the FROM statement and updated for each row in the SELECT statement.
  • The column whose median will be calculated (the price column in this example) should be sorted. It doesn’t matter if it’s sorted in ascending or descending order.
  • According to the definition of median, the median is the value of the middle element (total count is odd) or the average value of the two middle elements (total count is even). In this example, category A has 5 rows and thus the median is the value of the third row after sorting. The values of both FLOOR(@row_index / 2) and CEIL(@row_index / 2) are 2 which is the third row. On the other hand, for category B which has 6 rows, the median is the average value of the third and fourth rows.

This solution is simple and easy to understand. However, if the table has many categories, we would need to run the query for each of them, which is cumbersome and the results are not easy to store and compare.

To solve this problem, we need to have a non-classical solution using GROUP BY, GROUP_CONCAT and SUBSTRING_INDEX.

Calculate the median in a flexible way

Let’s do it step by step. The final query may look complex at first sight. However, once you know how it works, it’s actually easier to understand and you can change it freely for your own use cases.

Let’s first get all the sorted prices for each category:

SELECT
category,
GROUP_CONCAT(price ORDER BY p.price) AS prices,
COUNT(*) AS total
FROM products.prices p
GROUP BY p.category
;

category|prices |total|
--------+-----------+-----+
A |1,2,3,4,5 | 5|
B |1,2,3,4,5,6| 6|

Note that if your table has a lot of data, GROUP_CONCAT would not contain all the data. In this case, you increase the limit for GROUP_CONCAT by:

SET GROUP_CONCAT_MAX_LEN = 100000;

You can set the limit to a number that suits your use case. However, if your table contains too much data, you may have memory issues. In this case, you would need to write some scripts to perform the data processing and calculation in a smarter way. Nonetheless, the solution provided in this post is applicable to most small- to medium-sized tables.

Then, let’s get the middle elements for each category, we need to check if the total count is an odd or even number and deal with it accordingly:

SELECT 
sub1.category,
sub1.total,
CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', CEIL(sub1.total/2)), ',', '-1')
WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', sub1.total/2 + 1), ',', '-2')
END AS mid_prices
FROM
(
SELECT
p.category,
GROUP_CONCAT(p.price ORDER BY p.price) AS prices,
COUNT(*) AS total
FROM products.prices p
GROUP BY p.category
) sub1
;

category|total|mid_prices|
--------+-----+----------+
A | 5|3 |
B | 6|3,4 |

We use the MOD function to check if the total count is an odd or even number. The SUBSTRING_INDEX function is used twice to extract the middle elements. Let’s demonstrate in a bit more detail how it works:

-- Categoy A, 5 rows:
SUBSTRING_INDEX('1,2,3,4,5', ',', CEIL(5/2)) => '1,2,3'
SUBSTRING_INDEX('1,2,3', ',', -1) => 3

-- Categoy B, 6 rows:
SUBSTRING_INDEX('1,2,3,4,5,6', ',', 6/2 + 1) => '1,2,3,4'
SUBSTRING_INDEX('1,2,3,4', ',', -2) => '3,4'

Finally, let’s calculate the average of the middle elements to get the median for each category:

SELECT
sub2.category,
CASE WHEN MOD(sub2.total, 2) = 1 THEN sub2.mid_prices
WHEN MOD(sub2.total, 2) = 0 THEN (SUBSTRING_INDEX(sub2.mid_prices, ',', 1) + SUBSTRING_INDEX(sub2.mid_prices, ',', -1)) / 2
END AS median
FROM
(
SELECT
sub1.category,
sub1.total,
CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', CEIL(sub1.total/2)), ',', '-1')
WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', sub1.total/2 + 1), ',', '-2')
END AS mid_prices
FROM
(
SELECT
p.category,
GROUP_CONCAT(p.price ORDER BY p.price) AS prices,
COUNT(*) AS total
FROM products.prices p
GROUP BY p.category
) sub1
) sub2
;

category|median|
--------+------+
A |3 |
B |3.5 |

Cheers! The median values are calculated correctly for both categories. It’s a bit more code than the classical solution. However, it’s more transparent and thus easier to understand. Besides, it’s way more flexible and you can easily adjust the queries for different use cases, not just to get the median values of different categories using grouping.

Bonus — Calculate the medians for multiple columns

Above the medians are only calculated for one column. With the new solution, we can easily calculate the medians of multiple columns. Let’s first create a new table with some dummy data to play with:

CREATE TABLE `products`.`orders` (
`order_id` int(11) NOT NULL AUTO_INCREMENT,
`price` float NOT NULL,
`quantity` float NOT NULL,
PRIMARY KEY (`order_id`)
);

INSERT INTO products.orders
(order_id, price, quantity)
VALUES
(1, 2, 50),
(2, 1, 40),
(3, 5, 10),
(4, 3, 30),
(5, 4, 20)
;

Note that the data is fake and just for demonstration purposes, which is thus the simpler the better.

The query for calculating the medians of multiple columns can be easily adapted from the one above. The difference is that we don’t need to group by categories anymore but instead need to repeat the queries for each column whose median needs to be calculated:

SELECT
CASE WHEN MOD(sub2.total, 2) = 1 THEN sub2.mid_prices
WHEN MOD(sub2.total, 2) = 0 THEN (SUBSTRING_INDEX(sub2.mid_prices, ',', 1) + SUBSTRING_INDEX(sub2.mid_prices, ',', -1)) / 2
END AS median_of_price,
CASE WHEN MOD(sub2.total, 2) = 1 THEN sub2.mid_quantities
WHEN MOD(sub2.total, 2) = 0 THEN (SUBSTRING_INDEX(sub2.mid_quantities, ',', 1) + SUBSTRING_INDEX(sub2.mid_prices, ',', -1)) / 2
END AS median_of_quantity
FROM
(
SELECT
sub1.total,
CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', CEIL(sub1.total/2)), ',', '-1')
WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', sub1.total/2 + 1), ',', '-2')
END AS mid_prices,
CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.quantities, ',', CEIL(sub1.total/2)), ',', '-1')
WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.quantities, ',', sub1.total/2 + 1), ',', '-2')
END AS mid_quantities
FROM
(
SELECT
COUNT(*) AS total,
GROUP_CONCAT(o.price ORDER BY o.price) AS prices,
GROUP_CONCAT(o.quantity ORDER BY o.quantity) AS quantities
FROM products.orders o
) sub1
) sub2
;

median_of_price|median_of_quantity|
---------------+------------------+
3 |30 |

Cheers again, it works as expected!

Currently, the function to calculate medians is not implemented in MySQL yet and thus we need to write some queries ourselves to calculate them. Two solutions are introduced in this post. The first one is a classical solution that uses SQL variables. The second one is a new one which is accomplished with GROUP_CONCAT and SUBSTRING_INDEX. The second one is a bit more code but is much more extensible. You can use it to calculate the medians of the same field with different categories and also the medians of multiple fields.


Learn advanced MySQL queries to calculate medians on different occasions

Image by geralt in Pixabay

Calculating the median of an array of data is pretty straightforward in any programming language, even in Excel, where a built-in or third-party median function can be used directly. However, in MySQL, the median function is not natively supported. To get the median, we need to write some smart queries with subqueries.

In this post, we will demystify the queries to calculate medians in MySQL. Especially, we will demonstrate how to calculate the medians of a field with grouping. If a column has multiple categories whose medians should be calculated separately, it becomes cumbersome to calculate the media for each category one by one. With this post, you will be able to calculate the medians of all the categories with a single query. Besides, you will also learn how to calculate the medians of multiple columns at the same time.

Preparations

To get started, we need to have a MySQL server, create a database and table, and insert some dummy data to play with. We will use Docker to start a MySQL 8 server locally:

# Create a volume to persist the data.
$ docker volume create mysql8-data

# Create the container for MySQL.
$ docker run --name mysql8 -d -e MYSQL_ROOT_PASSWORD=root -p 13306:3306 -v mysql8-data:/var/lib/mysql mysql:8

# Connect to the local MySQL server in Docker.
$ docker exec -it mysql8 mysql -u root -proot

mysql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 8.0.31 |
+-----------+
1 row in set (0.00 sec)

To write complex SQL queries, it’s recommended to use an IDE rather than the console. The benefits of an IDE are code completion, code formatting, syntax highlighting, etc. We will use DBeaver in this post, which is a universal database administration tool that can be used to manage all kinds of relational and NoSQL databases. However, any IDE will work. You can even copy the queries demonstrated in this post and run them in the MySQL console directly.

As to DBeaver, if you encounter the “Public Key Retrieval is not allowed” error, you should edit the connection and add the following two user properties for the driver:

# User properties
useSSL: false
allowPublicKeyRetrieval: true

For more settings of DBeaver, please check this post.

Then run the following queries to create the database, create the table and insert some dummy data:

CREATE DATABASE products;

CREATE TABLE `products`.`prices` (
`pid` int(11) NOT NULL AUTO_INCREMENT,
`category` varchar(100) NOT NULL,
`price` float NOT NULL,
PRIMARY KEY (`pid`)
);

INSERT INTO products.prices
(pid, category, price)
VALUES
(1, 'A', 2),
(2, 'A', 1),
(3, 'A', 5),
(4, 'A', 4),
(5, 'A', 3),
(6, 'B', 6),
(7, 'B', 4),
(8, 'B', 3),
(9, 'B', 5),
(10, 'B', 2),
(11, 'B', 1)
;

Calculate the median in the “classical” way

Now that the database and data are set up, we can start to calculate medians. The classical solution is to use SQL variables:

SELECT AVG(sub.price) AS median
FROM (
SELECT @row_index := @row_index + 1 AS row_index, p.price
FROM products.prices p, (SELECT @row_index := -1) r
WHERE p.category = 'A'
ORDER BY p.price
) AS sub
WHERE sub.row_index IN (FLOOR(@row_index / 2), CEIL(@row_index / 2))
;

median|
------+
3.0|

Key points for this query:

  • @row_index is a SQL variable that is initiated in the FROM statement and updated for each row in the SELECT statement.
  • The column whose median will be calculated (the price column in this example) should be sorted. It doesn’t matter if it’s sorted in ascending or descending order.
  • According to the definition of median, the median is the value of the middle element (total count is odd) or the average value of the two middle elements (total count is even). In this example, category A has 5 rows and thus the median is the value of the third row after sorting. The values of both FLOOR(@row_index / 2) and CEIL(@row_index / 2) are 2 which is the third row. On the other hand, for category B which has 6 rows, the median is the average value of the third and fourth rows.

This solution is simple and easy to understand. However, if the table has many categories, we would need to run the query for each of them, which is cumbersome and the results are not easy to store and compare.

To solve this problem, we need to have a non-classical solution using GROUP BY, GROUP_CONCAT and SUBSTRING_INDEX.

Calculate the median in a flexible way

Let’s do it step by step. The final query may look complex at first sight. However, once you know how it works, it’s actually easier to understand and you can change it freely for your own use cases.

Let’s first get all the sorted prices for each category:

SELECT
category,
GROUP_CONCAT(price ORDER BY p.price) AS prices,
COUNT(*) AS total
FROM products.prices p
GROUP BY p.category
;

category|prices |total|
--------+-----------+-----+
A |1,2,3,4,5 | 5|
B |1,2,3,4,5,6| 6|

Note that if your table has a lot of data, GROUP_CONCAT would not contain all the data. In this case, you increase the limit for GROUP_CONCAT by:

SET GROUP_CONCAT_MAX_LEN = 100000;

You can set the limit to a number that suits your use case. However, if your table contains too much data, you may have memory issues. In this case, you would need to write some scripts to perform the data processing and calculation in a smarter way. Nonetheless, the solution provided in this post is applicable to most small- to medium-sized tables.

Then, let’s get the middle elements for each category, we need to check if the total count is an odd or even number and deal with it accordingly:

SELECT 
sub1.category,
sub1.total,
CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', CEIL(sub1.total/2)), ',', '-1')
WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', sub1.total/2 + 1), ',', '-2')
END AS mid_prices
FROM
(
SELECT
p.category,
GROUP_CONCAT(p.price ORDER BY p.price) AS prices,
COUNT(*) AS total
FROM products.prices p
GROUP BY p.category
) sub1
;

category|total|mid_prices|
--------+-----+----------+
A | 5|3 |
B | 6|3,4 |

We use the MOD function to check if the total count is an odd or even number. The SUBSTRING_INDEX function is used twice to extract the middle elements. Let’s demonstrate in a bit more detail how it works:

-- Categoy A, 5 rows:
SUBSTRING_INDEX('1,2,3,4,5', ',', CEIL(5/2)) => '1,2,3'
SUBSTRING_INDEX('1,2,3', ',', -1) => 3

-- Categoy B, 6 rows:
SUBSTRING_INDEX('1,2,3,4,5,6', ',', 6/2 + 1) => '1,2,3,4'
SUBSTRING_INDEX('1,2,3,4', ',', -2) => '3,4'

Finally, let’s calculate the average of the middle elements to get the median for each category:

SELECT
sub2.category,
CASE WHEN MOD(sub2.total, 2) = 1 THEN sub2.mid_prices
WHEN MOD(sub2.total, 2) = 0 THEN (SUBSTRING_INDEX(sub2.mid_prices, ',', 1) + SUBSTRING_INDEX(sub2.mid_prices, ',', -1)) / 2
END AS median
FROM
(
SELECT
sub1.category,
sub1.total,
CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', CEIL(sub1.total/2)), ',', '-1')
WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', sub1.total/2 + 1), ',', '-2')
END AS mid_prices
FROM
(
SELECT
p.category,
GROUP_CONCAT(p.price ORDER BY p.price) AS prices,
COUNT(*) AS total
FROM products.prices p
GROUP BY p.category
) sub1
) sub2
;

category|median|
--------+------+
A |3 |
B |3.5 |

Cheers! The median values are calculated correctly for both categories. It’s a bit more code than the classical solution. However, it’s more transparent and thus easier to understand. Besides, it’s way more flexible and you can easily adjust the queries for different use cases, not just to get the median values of different categories using grouping.

Bonus — Calculate the medians for multiple columns

Above the medians are only calculated for one column. With the new solution, we can easily calculate the medians of multiple columns. Let’s first create a new table with some dummy data to play with:

CREATE TABLE `products`.`orders` (
`order_id` int(11) NOT NULL AUTO_INCREMENT,
`price` float NOT NULL,
`quantity` float NOT NULL,
PRIMARY KEY (`order_id`)
);

INSERT INTO products.orders
(order_id, price, quantity)
VALUES
(1, 2, 50),
(2, 1, 40),
(3, 5, 10),
(4, 3, 30),
(5, 4, 20)
;

Note that the data is fake and just for demonstration purposes, which is thus the simpler the better.

The query for calculating the medians of multiple columns can be easily adapted from the one above. The difference is that we don’t need to group by categories anymore but instead need to repeat the queries for each column whose median needs to be calculated:

SELECT
CASE WHEN MOD(sub2.total, 2) = 1 THEN sub2.mid_prices
WHEN MOD(sub2.total, 2) = 0 THEN (SUBSTRING_INDEX(sub2.mid_prices, ',', 1) + SUBSTRING_INDEX(sub2.mid_prices, ',', -1)) / 2
END AS median_of_price,
CASE WHEN MOD(sub2.total, 2) = 1 THEN sub2.mid_quantities
WHEN MOD(sub2.total, 2) = 0 THEN (SUBSTRING_INDEX(sub2.mid_quantities, ',', 1) + SUBSTRING_INDEX(sub2.mid_prices, ',', -1)) / 2
END AS median_of_quantity
FROM
(
SELECT
sub1.total,
CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', CEIL(sub1.total/2)), ',', '-1')
WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', sub1.total/2 + 1), ',', '-2')
END AS mid_prices,
CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.quantities, ',', CEIL(sub1.total/2)), ',', '-1')
WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.quantities, ',', sub1.total/2 + 1), ',', '-2')
END AS mid_quantities
FROM
(
SELECT
COUNT(*) AS total,
GROUP_CONCAT(o.price ORDER BY o.price) AS prices,
GROUP_CONCAT(o.quantity ORDER BY o.quantity) AS quantities
FROM products.orders o
) sub1
) sub2
;

median_of_price|median_of_quantity|
---------------+------------------+
3 |30 |

Cheers again, it works as expected!

Currently, the function to calculate medians is not implemented in MySQL yet and thus we need to write some queries ourselves to calculate them. Two solutions are introduced in this post. The first one is a classical solution that uses SQL variables. The second one is a new one which is accomplished with GROUP_CONCAT and SUBSTRING_INDEX. The second one is a bit more code but is much more extensible. You can use it to calculate the medians of the same field with different categories and also the medians of multiple fields.

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.

Leave a comment