SQL Riddles to Test Your Wits. Timestamps, dependent filters, and… | by Matt Sosna | Feb, 2023


Photo by Saffu on Unsplash

SQL is a deceptively simple language. Across its many dialects, users can query databases in a syntax similar to English. What you see is what you get… until you don’t.

Every now and then I come across a query that produces a result completely different from what I expected, teaching me little nuances about the language. I’ve compiled three recent head-scratchers in this post, and I’ve arranged them as riddles to make them more interesting. Try to figure out the answer before reading the end of each section!

I’ve also included quick common table expressions (CTEs) to generate the tables in each example, so you don’t need to try querying your company’s production tables! But to get really comfortable with SQL, I actually recommend creating your own database and tables to play with. Check out this post to learn how.

Note that all queries are in Postgres — you may get different results in a different dialect. Finally, an obligatory note that the actual data and topics in each query are just illustrative examples. 🙂

Photo by Akram Huseyn on Unsplash

Imagine we have a table called purchases with purchase IDs, amounts, and times the purchase were made. Let’s say it looks like this:

Image by author

As a CTE, this would look something like this. Note that we need to specify that the dt column is a timestamp so it isn’t interpreted as a string. We also only need to specify the data types for one of the rows; the rest are inferred.

WITH purchases(id, amount, dt) AS (
VALUES
(1::bigint, 0.99::float, '2023-02-15 00:00:00 GMT'::timestamp),
(2, 9.99, '2023-02-15 07:15:00 GMT'),
(3, 15.99, '2023-02-15 23:01:15 GMT'),
(4, 7.99, '2023-02-16 14:22:09 GMT')
)
...

Now let’s calculate the sum of purchases made on Feb 15. We can write a query like the one below:

...
SELECT
SUM(amount) AS sum
FROM purchases
WHERE
dt = '2023-02-15'

We mysteriously receive the following response.

Image by author

What happened? There were three purchases made on Feb 15: IDs 1, 2, and 3. The sum should therefore be $26.97. Instead, only the first purchase was counted.

If you change the filter to 2023-02-16, no rows are returned.

The dt column format is a timestamp that includes both date and time. Our WHERE filter only specifies the date. Rather than rejecting this query, Postgres automatically reformats the date string to 2023-02-15 00:00:00. This matches only the first transaction in the table, so we’re therefore taking only the sum of one row.

If we wanted to select all rows corresponding to Feb 15, we should first cast the timestamp to date.

SELECT
SUM(amount) AS sum
FROM purchases
WHERE
DATE(dt) = '2023-02-15'

We now get the expected result.

Image by author
Photo by Womanizer Toys on Unsplash

Alright, next riddle. We have a table called users, and our goal is to remove all rows that meet any one of three conditions. In the table below, for example, let’s say that we want to only return tenured and active users, i.e., ones who have logged in during the last 28 days, have posted before, and are not a new account.

Image by author

In other words, we want our query to only user 8, who has False values for no_login_l28, has_never_posted, and is_new_account.

Let’s start with the top of our query.

WITH users(id, no_login_l28, has_never_posted, is_new_account) AS (
VALUES
(1, True, True, True),
(2, True, True, False),
(3, True, False, True),
(4, True, False, False),
(5, False, True, True),
(6, False, True, False),
(7, False, False, True),
(8, False, False, False)
)
SELECT
id
FROM users
WHERE
...

How should we structure the WHERE clause of our query? Think for a minute – we need to be careful not to return rows where any of the columns is False.

When you’re ready, take a look at the options below. Two are correct and two are wrong.

Option 1: Multiple AND NOT

WHERE
NOT no_login_l28
AND NOT has_never_posted
AND NOT is_new_account

Option 2: Multiple OR NOT

WHERE
NOT no_login_l28
OR NOT has_never_posted
OR NOT is_new_account

Option 3: NOT + grouped OR

WHERE
NOT (
no_login_l28
OR has_never_posted
OR is_new_account
)

Option 4: NOT + grouped AND

WHERE
NOT (
no_login_l28
AND has_never_posted
AND is_new_account
)

When are conditions in a filter evaluated separately versus together? If they’re evaluated together, can we condense all conditions down to one True or False value?

Option 1. This one tripped me up a bit. A data scientist on my team submitted a PR with this filter, which I was convinced would pull in rows 2–7 because the query would only remove users with False values for all three columns. But to my surprise, Option 1 actually works because the three filters are evaluated independently.

Option 2. This was the filter I initially thought was correct, since I didn’t realize the filters would be evaluated independently. But this filter will actually return users 2–8, since anyone who has at least one True for no_login_l28, has_never_posted, and is_new_account will be allowed through. ❌

Option 3. This was how I initially thought the filter needed to be worded. If a user has True for any of no_login_l28, has_never_posted, or is_new_account, then lines 3-5 evaluate to True, the NOT flips this to False, and those rows are ultimately excluded. This indeed works, and I find this easier to understand than Option 1, but both are valid. ✅

Option 4. This returns the same incorrect result as Option 2. Lines 3–5 evaluate to True only for user 1, meaning that when we flip the boolean with NOT, all remaining users are pulled through. ❌

Photo by Nick Fewings on Unsplash

Take a look at the query below. We have two tables, customers and reviews. customers contains customer IDs and their lifetime dollars spent on the platform.

Image by author

reviews contains information about reviews left by customers: the review ID, customer ID, and whether the review was reported as spam.

Image by author

Here’s the subquery to generate the two CTEs:

WITH customers(id, total_spend) AS (
VALUES
(100, 1583.49),
(200, 8739.03),
(300, 431.00),
(400, 1.00),
(500, 22.27)
),
reviews(id, customer_id, reported_as_spam) AS (
VALUES
(1, 100, False),
(2, 100, False),
(3, 400, True),
(4, 400, True),
(5, 500, False)
)
...

Now let’s say we’re curious about the relationship between a customer’s total spend and the number of non-spam reviews they write. Since not each customer has left a review, we’ll want to left join reviews to customers. We can structure our query like this:

...
SELECT
c.id,
c.total_spend,
COALESCE(COUNT(r.id), 0) AS n_reviews
FROM customers c
LEFT JOIN reviews r
ON c.id = r.customer_id
WHERE
NOT r.reported_as_spam
GROUP BY
1, 2
ORDER BY
1

Ready? Here’s what comes out.

Image by author

Wait a minute. Where did users 200, 300, and 400 go? Why were they removed, and how can we bring them back in?

If you create a CTE for reviews with spam reviews filtered out, then join on this CTE, do we get the same result?

Looking closely, we can see that users 200 and 300 have never left any reviews. 400 only has spam reviews, but they were completely removed as well. Since we did a left join, these users should still be in the table and have a 0 for n_reviews. Instead, our left join behaved like an inner join.

The issue, it turns out, is that WHERE clauses are evaluated after joins. Our left join brings in null values for reported_as_spam for users 200 and 300. The WHERE filter then removes all rows where reported_as_spam is True, which removes user 400. However, this filter also removes null values, so users 200 and 300 are also removed.

To do this properly, we need to pre-filter reviews before joining with customers. As the hint states, we can create a CTE for reviews and perform the filtering there. But more efficiently, let’s perform the filtering within the join.

We can do this by adding AND NOT r.reported_as_spam to the LEFT JOIN block. See below:

...
SELECT
c.id,
c.total_spend,
COALESCE(COUNT(r.id), 0) AS n_reviews
FROM customers c
LEFT JOIN reviews r
ON c.id = r.customer_id
AND NOT r.reported_as_spam
GROUP BY
1, 2
ORDER BY
1

Now we get the expected result.

Image by author
Photo by Laura Chouette on Unsplash

This post shared three SQL wrinkles that can lead to unexpected results: timestamp specificity, dependent versus independent filters, and left joins acting like inner joins. I specifically provided simple examples to keep the focus on the syntax, but you’ll likely encounter SQL nuances like these nestled within large, complex queries.

These bugs can be incredibly challenging to identify, especially for queries with many components. Whenever I’m confused by a result, I try to break the query into its pieces and verify each component’s result. But when in doubt, write some simple CTEs with test data and confirm the results do what you expect.

Happy querying!
Matt


Photo by Saffu on Unsplash

SQL is a deceptively simple language. Across its many dialects, users can query databases in a syntax similar to English. What you see is what you get… until you don’t.

Every now and then I come across a query that produces a result completely different from what I expected, teaching me little nuances about the language. I’ve compiled three recent head-scratchers in this post, and I’ve arranged them as riddles to make them more interesting. Try to figure out the answer before reading the end of each section!

I’ve also included quick common table expressions (CTEs) to generate the tables in each example, so you don’t need to try querying your company’s production tables! But to get really comfortable with SQL, I actually recommend creating your own database and tables to play with. Check out this post to learn how.

Note that all queries are in Postgres — you may get different results in a different dialect. Finally, an obligatory note that the actual data and topics in each query are just illustrative examples. 🙂

Photo by Akram Huseyn on Unsplash

Imagine we have a table called purchases with purchase IDs, amounts, and times the purchase were made. Let’s say it looks like this:

Image by author

As a CTE, this would look something like this. Note that we need to specify that the dt column is a timestamp so it isn’t interpreted as a string. We also only need to specify the data types for one of the rows; the rest are inferred.

WITH purchases(id, amount, dt) AS (
VALUES
(1::bigint, 0.99::float, '2023-02-15 00:00:00 GMT'::timestamp),
(2, 9.99, '2023-02-15 07:15:00 GMT'),
(3, 15.99, '2023-02-15 23:01:15 GMT'),
(4, 7.99, '2023-02-16 14:22:09 GMT')
)
...

Now let’s calculate the sum of purchases made on Feb 15. We can write a query like the one below:

...
SELECT
SUM(amount) AS sum
FROM purchases
WHERE
dt = '2023-02-15'

We mysteriously receive the following response.

Image by author

What happened? There were three purchases made on Feb 15: IDs 1, 2, and 3. The sum should therefore be $26.97. Instead, only the first purchase was counted.

If you change the filter to 2023-02-16, no rows are returned.

The dt column format is a timestamp that includes both date and time. Our WHERE filter only specifies the date. Rather than rejecting this query, Postgres automatically reformats the date string to 2023-02-15 00:00:00. This matches only the first transaction in the table, so we’re therefore taking only the sum of one row.

If we wanted to select all rows corresponding to Feb 15, we should first cast the timestamp to date.

SELECT
SUM(amount) AS sum
FROM purchases
WHERE
DATE(dt) = '2023-02-15'

We now get the expected result.

Image by author
Photo by Womanizer Toys on Unsplash

Alright, next riddle. We have a table called users, and our goal is to remove all rows that meet any one of three conditions. In the table below, for example, let’s say that we want to only return tenured and active users, i.e., ones who have logged in during the last 28 days, have posted before, and are not a new account.

Image by author

In other words, we want our query to only user 8, who has False values for no_login_l28, has_never_posted, and is_new_account.

Let’s start with the top of our query.

WITH users(id, no_login_l28, has_never_posted, is_new_account) AS (
VALUES
(1, True, True, True),
(2, True, True, False),
(3, True, False, True),
(4, True, False, False),
(5, False, True, True),
(6, False, True, False),
(7, False, False, True),
(8, False, False, False)
)
SELECT
id
FROM users
WHERE
...

How should we structure the WHERE clause of our query? Think for a minute – we need to be careful not to return rows where any of the columns is False.

When you’re ready, take a look at the options below. Two are correct and two are wrong.

Option 1: Multiple AND NOT

WHERE
NOT no_login_l28
AND NOT has_never_posted
AND NOT is_new_account

Option 2: Multiple OR NOT

WHERE
NOT no_login_l28
OR NOT has_never_posted
OR NOT is_new_account

Option 3: NOT + grouped OR

WHERE
NOT (
no_login_l28
OR has_never_posted
OR is_new_account
)

Option 4: NOT + grouped AND

WHERE
NOT (
no_login_l28
AND has_never_posted
AND is_new_account
)

When are conditions in a filter evaluated separately versus together? If they’re evaluated together, can we condense all conditions down to one True or False value?

Option 1. This one tripped me up a bit. A data scientist on my team submitted a PR with this filter, which I was convinced would pull in rows 2–7 because the query would only remove users with False values for all three columns. But to my surprise, Option 1 actually works because the three filters are evaluated independently.

Option 2. This was the filter I initially thought was correct, since I didn’t realize the filters would be evaluated independently. But this filter will actually return users 2–8, since anyone who has at least one True for no_login_l28, has_never_posted, and is_new_account will be allowed through. ❌

Option 3. This was how I initially thought the filter needed to be worded. If a user has True for any of no_login_l28, has_never_posted, or is_new_account, then lines 3-5 evaluate to True, the NOT flips this to False, and those rows are ultimately excluded. This indeed works, and I find this easier to understand than Option 1, but both are valid. ✅

Option 4. This returns the same incorrect result as Option 2. Lines 3–5 evaluate to True only for user 1, meaning that when we flip the boolean with NOT, all remaining users are pulled through. ❌

Photo by Nick Fewings on Unsplash

Take a look at the query below. We have two tables, customers and reviews. customers contains customer IDs and their lifetime dollars spent on the platform.

Image by author

reviews contains information about reviews left by customers: the review ID, customer ID, and whether the review was reported as spam.

Image by author

Here’s the subquery to generate the two CTEs:

WITH customers(id, total_spend) AS (
VALUES
(100, 1583.49),
(200, 8739.03),
(300, 431.00),
(400, 1.00),
(500, 22.27)
),
reviews(id, customer_id, reported_as_spam) AS (
VALUES
(1, 100, False),
(2, 100, False),
(3, 400, True),
(4, 400, True),
(5, 500, False)
)
...

Now let’s say we’re curious about the relationship between a customer’s total spend and the number of non-spam reviews they write. Since not each customer has left a review, we’ll want to left join reviews to customers. We can structure our query like this:

...
SELECT
c.id,
c.total_spend,
COALESCE(COUNT(r.id), 0) AS n_reviews
FROM customers c
LEFT JOIN reviews r
ON c.id = r.customer_id
WHERE
NOT r.reported_as_spam
GROUP BY
1, 2
ORDER BY
1

Ready? Here’s what comes out.

Image by author

Wait a minute. Where did users 200, 300, and 400 go? Why were they removed, and how can we bring them back in?

If you create a CTE for reviews with spam reviews filtered out, then join on this CTE, do we get the same result?

Looking closely, we can see that users 200 and 300 have never left any reviews. 400 only has spam reviews, but they were completely removed as well. Since we did a left join, these users should still be in the table and have a 0 for n_reviews. Instead, our left join behaved like an inner join.

The issue, it turns out, is that WHERE clauses are evaluated after joins. Our left join brings in null values for reported_as_spam for users 200 and 300. The WHERE filter then removes all rows where reported_as_spam is True, which removes user 400. However, this filter also removes null values, so users 200 and 300 are also removed.

To do this properly, we need to pre-filter reviews before joining with customers. As the hint states, we can create a CTE for reviews and perform the filtering there. But more efficiently, let’s perform the filtering within the join.

We can do this by adding AND NOT r.reported_as_spam to the LEFT JOIN block. See below:

...
SELECT
c.id,
c.total_spend,
COALESCE(COUNT(r.id), 0) AS n_reviews
FROM customers c
LEFT JOIN reviews r
ON c.id = r.customer_id
AND NOT r.reported_as_spam
GROUP BY
1, 2
ORDER BY
1

Now we get the expected result.

Image by author
Photo by Laura Chouette on Unsplash

This post shared three SQL wrinkles that can lead to unexpected results: timestamp specificity, dependent versus independent filters, and left joins acting like inner joins. I specifically provided simple examples to keep the focus on the syntax, but you’ll likely encounter SQL nuances like these nestled within large, complex queries.

These bugs can be incredibly challenging to identify, especially for queries with many components. Whenever I’m confused by a result, I try to break the query into its pieces and verify each component’s result. But when in doubt, write some simple CTEs with test data and confirm the results do what you expect.

Happy querying!
Matt

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – admin@technoblender.com. The content will be deleted within 24 hours.
artificial intelligenceDependentFebfilterslatest newsmachine learningMattRiddlesSosnaSQLtestTimestampswits
Comments (0)
Add Comment