10-Minute Effortless SQL Tutorial For Die-hard Pandas Lovers | by Bex T. | Jun, 2022
There was a time it was the other way around
Motivation
When the Pandas package gained public exposure in 2009, SQL had already been dominating the data world since 1974. Pandas arrived with an attractive set of features, such as built-in visualization and flexible data handling, and became the go-to data exploration tool. As it gained popularity, many courses and resources emerged teaching Pandas and comparing it to SQL.
Fast-forward to 2021: people are now introduced to the Pandas package before the universal data language, SQL. Even though SQL is as popular as ever, the flexibility and multi-functionality of Pandas make it the first choice for beginner data scientists.
Then, why do you need SQL if you know Pandas?
Even though Pandas may seem the better choice, SQL still plays a crucial role in the day-to-day job of a data scientist. In fact, SQL is the second most in-demand and the third fastest-growing programming language for data science (see here). So adding SQL to your CV is a must if you want a job in the field. And if you already know Pandas, learning SQL should be a breeze, as you will see in this article.
Connecting to a Database
Setting up an SQL workspace and connecting to a sample database can be a real pain. First, you need to install your favorite SQL flavor (PostgreSQL, MySQL, etc.) and download an SQL IDE. Doing all that here would stray from the article’s purpose, so we will take a shortcut.
Specifically, we will run SQL queries directly in a Jupyter notebook without any additional setup. All we need to do is install the ipython-sql package using pip:
pip install ipython-sql
After the installation, start a new Jupyter session and run this command in the notebook:
%load_ext sql
and you are all set!
To illustrate how basic SQL statements work, we will be using the Chinook SQLite database, which you can download here. The database has 11 tables.
To retrieve the data stored in this database’s tables, run this command:
%sql sqlite:///data/chinook.db
The statement starts with the %sql inline magic command, which tells the notebook interpreter we will be running SQL commands. It is followed by the path to the downloaded Chinook database. Valid paths should always start with the sqlite:/// prefix for SQLite databases. Above, we are connecting to the database stored in the ‘data’ folder of the current directory. If you want to pass an absolute path, the prefix should start with four forward slashes: sqlite:////
If you wish to connect to a different database flavor, you can refer to this excellent article.
Taking a First Look at the Tables
The first thing we always do in Pandas is call the .head() method to take a first look at the data. Let’s learn how to do that in SQL:
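A minimal version of that query (the table name comes from the Chinook database; the LIMIT value is illustrative):

```sql
%%sql
SELECT * FROM tracks
LIMIT 5
```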
The first keyword in the above query is SELECT. It is equivalent to the brackets operator in Pandas, where we select specific columns. Here, SELECT is followed by a * (asterisk), an SQL operator that selects everything (all rows and columns) from the table specified after the FROM keyword. LIMIT caps the number of rows returned. So the above query is equivalent to the df.head() function.
If you don’t want to select all columns, you can specify one or more column names after the SELECT keyword:
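For example, a query along these lines (the column names come from the Chinook tracks table):

```sql
%%sql
SELECT Name, Composer, UnitPrice FROM tracks
LIMIT 10
```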
The equivalent Pandas operation is tracks[['Name', 'Composer', 'UnitPrice']].head(10)
Another useful keyword in SQL is DISTINCT. Adding this keyword before any column name returns its unique values:
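A sketch of such a query (the chosen column is just an example):

```sql
%%sql
-- double dashes start a comment in SQL
SELECT DISTINCT Composer FROM tracks
```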
Comments in SQL are written with double dashes.
Counting the Number of Rows
Just like Pandas has the .shape attribute on its DataFrames, SQL has a COUNT function to display the number of rows in a table:
%%sql
SELECT COUNT(*) FROM tracks
It is also possible to pass a column name to COUNT:
%sql SELECT COUNT(FirstName) FROM customers
But note that COUNT(column) counts only non-null values; here the output matches COUNT(*) only because FirstName has no missing values.
More helpful info would be counting the number of unique values in a particular column. We can do this by adding the DISTINCT keyword into COUNT:
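For instance (the column choice is illustrative):

```sql
%%sql
SELECT COUNT(DISTINCT Composer) FROM tracks
```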
Filtering Results With WHERE Clauses
Just looking and counting rows is pretty lame. Let’s see how we can filter rows based on conditions.
First, let’s look at the songs which cost more than a dollar:
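One way to write this query (the LIMIT is optional and illustrative):

```sql
%%sql
SELECT * FROM tracks
WHERE UnitPrice > 1.0
LIMIT 10
```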
Conditional statements are written in the WHERE clause, which always comes after FROM and before the LIMIT keywords. Using conditionals is pretty similar to how we do it in Pandas.
You can also use the COUNT function when using conditionals. For example, let’s see the number of songs that are priced between 1 and 10 dollars:
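A sketch of that count:

```sql
%%sql
SELECT COUNT(*) FROM tracks
WHERE UnitPrice > 1 AND UnitPrice < 10
```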
Above, we chained two conditions with the boolean operator AND. The other boolean operators, OR and NOT, work the same way in SQL.
Now, let’s see all the invoices that have Paris or Berlin as a billing city:
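Using the OR operator, the query might look like:

```sql
%%sql
SELECT * FROM invoices
WHERE BillingCity = 'Paris' OR BillingCity = 'Berlin'
```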
The equality operator in SQL requires only one ‘=’ (equals) sign. Inequality is represented with either the ‘!=’ or the ‘<>’ operator:
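For example, to exclude one city (either inequality operator works):

```sql
%%sql
SELECT * FROM invoices
WHERE BillingCity <> 'Paris'
LIMIT 10
```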
Easier Filtering With BETWEEN And IN
Similar conditionals come up very often, and writing them out with plain boolean operators becomes cumbersome. Pandas, for example, has the .isin() function, which checks whether a value belongs to a list of values. If we wanted to select all invoices for five cities, we would have to chain five conditions. Luckily, SQL supports an IN operator similar to .isin(), so we don’t have to:
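A sketch with IN (the city names are illustrative; swap in any five you like):

```sql
%%sql
SELECT * FROM invoices
WHERE BillingCity IN ('Paris', 'Berlin', 'London', 'Chicago', 'Dublin')
```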
The list of values after IN must be written in parentheses, like a Python tuple rather than a list. You can also negate the condition with the NOT keyword:
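For example:

```sql
%%sql
SELECT * FROM invoices
WHERE BillingCity NOT IN ('Paris', 'Berlin')
```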
Another common filtering operation on numeric columns is selecting values within a range. For this, the BETWEEN keyword can be used, which is equivalent to pd.Series.between():
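A sketch with BETWEEN (the bounds are inclusive; the values are illustrative):

```sql
%%sql
SELECT * FROM tracks
WHERE Milliseconds BETWEEN 200000 AND 300000
LIMIT 10
```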
Checking For Nulls
Every data source has missing values, and databases are no exception. Just like there are several ways to explore the missing values in Pandas, there are specific keywords that check the existence of null values in SQL. The below query counts the number of rows with missing values in BillingState:
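One way to write it:

```sql
%%sql
SELECT COUNT(*) FROM invoices
WHERE BillingState IS NULL
```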
You can add the NOT keyword between IS and NULL to filter out missing values of a particular column:
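For example:

```sql
%%sql
SELECT COUNT(*) FROM invoices
WHERE BillingState IS NOT NULL
```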
Better String Matching With LIKE
In the WHERE clause, we filtered columns based on exact text values. But often we want to filter textual columns based on a pattern. In Pandas and pure Python, we would use regular expressions for pattern matching, which are very powerful but require time to master.
As an alternative, SQL offers the ‘%’ wildcard as a placeholder that matches any sequence of zero or more characters. For example, ‘gr%’ matches ‘great’, ‘groom’, and ‘greed’, while ‘%ex%’ matches any text with ‘ex’ somewhere in the middle. Let’s see how to use it in SQL:
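A sketch of a LIKE query, matching names that start with ‘B’:

```sql
%%sql
SELECT Name FROM tracks
WHERE Name LIKE 'B%'
LIMIT 10
```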
The above query finds all songs that start with ‘B’. The string containing the wildcard comes after the LIKE keyword.
Now, let’s find all songs that contain the word ‘beautiful’ in their titles:
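A wildcard on both sides does the trick:

```sql
%%sql
SELECT Name FROM tracks
WHERE Name LIKE '%beautiful%'
```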
You can also use other boolean operators next to LIKE:
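For example, combining LIKE with AND (the price condition is illustrative):

```sql
%%sql
SELECT Name, UnitPrice FROM tracks
WHERE Name LIKE 'B%' AND UnitPrice > 0.99
LIMIT 10
```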
There are many other wildcards in SQL that have different functionalities. You can see the complete list and their usage here.
Aggregate Functions in SQL
It is also possible to perform basic arithmetic operations on columns. These operations are called aggregate functions in SQL, and the most common ones are AVG, SUM, MIN, and MAX. Their functionality should be clear from their names:
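A sketch using a few of them at once:

```sql
%%sql
SELECT AVG(Milliseconds), MIN(UnitPrice), MAX(UnitPrice) FROM tracks
```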
Aggregate functions produce only a single result for the column they are applied to. This means you cannot aggregate one column and select other, unaggregated columns alongside it; most databases reject such a query:
You can combine aggregate functions with conditionals using WHERE clauses just as easily:
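For example, the average duration of the more expensive tracks (the condition is illustrative):

```sql
%%sql
SELECT AVG(Milliseconds) FROM tracks
WHERE UnitPrice > 1.0
```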
It is also possible to use arithmetic operators such as +, -, *, / on columns, and simple numbers. When used on columns, the operation is performed element-wise:
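For instance, converting each track’s price to cents (a made-up example):

```sql
%%sql
SELECT Name, UnitPrice * 100 FROM tracks
LIMIT 10
```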
One thing to note about arithmetic operations: if you perform operations on integers only, SQL thinks that you are expecting an integer as the answer:
%%sql
SELECT 10 / 3
Instead of returning 3.33…, the result is 3. To get a float result, you should use at least one float number in the query or use all floats to be safe:
%%sql
SELECT 10.0 / 3.0
Using this knowledge, let’s calculate the average duration of a song in minutes:
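One way to write it, assuming the Milliseconds column holds each track’s duration:

```sql
%%sql
SELECT AVG(Milliseconds) / 1000.0 / 60.0 FROM tracks
```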
If you look at the above column, its name is literally the expression used to generate it. Because of this behavior, long calculations, such as finding the standard deviation or variance of a column, become unwieldy: the column name ends up as long as the query itself.
To avoid this, SQL allows aliasing, similar to aliasing import statements in Python. For example:
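The same average-minutes query with an alias (the alias name is arbitrary):

```sql
%%sql
SELECT AVG(Milliseconds) / 1000.0 / 60.0 AS avg_minutes FROM tracks
```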
Using the AS keyword after an item in a SELECT statement tells SQL that we are aliasing. Here are more examples:
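A couple more sketches (the alias names are arbitrary):

```sql
%%sql
SELECT Name AS song_title,
       UnitPrice AS price_usd,
       Milliseconds / 1000.0 AS duration_seconds
FROM tracks
LIMIT 10
```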
You can use aliasing just as easily for columns with long names.
Ordering Results in SQL
Just like Pandas has the sort_values method, SQL supports ordering via the ORDER BY clause. Passing a column name after the clause sorts the results in ascending order:
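For example:

```sql
%%sql
SELECT Name, Composer FROM tracks
ORDER BY Composer
LIMIT 10
```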
We order the tracks table in ascending order by the composer’s name. Note that the ORDER BY clause always comes after the WHERE clause (and before LIMIT). It is also possible to pass two or more columns to ORDER BY:
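For instance:

```sql
%%sql
SELECT Name, Composer, UnitPrice FROM tracks
ORDER BY Composer, Name
LIMIT 10
```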
You can also reverse the ordering by passing the DESC keyword after a column name:
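A sketch mixing both directions:

```sql
%%sql
SELECT Name, Composer, UnitPrice FROM tracks
ORDER BY UnitPrice DESC, Composer DESC, Name ASC
LIMIT 10
```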
The above query returns three columns after ordering UnitPrice and Composer in descending order and Name in ascending order (ASC is the default keyword).
Grouping in SQL
One of the most powerful features of Pandas is groupby. You can use it to transform a table into virtually any shape you want. Its very close cousin in SQL, the GROUP BY clause, can be used to achieve the same functionality. For example, the below query counts the number of songs in each genre:
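A sketch of that query:

```sql
%%sql
SELECT GenreId, COUNT(*) FROM tracks
GROUP BY GenreId
```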
The difference between GROUP BY in SQL and groupby in Pandas is that SQL does not allow selecting bare columns that weren’t given in the GROUP BY clause. For example, adding an extra free column to the above query generates an error:
However, you can choose as many columns in the SELECT statement as you like as long as you are using some type of aggregate function on them:
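One possible shape of such a query (the aggregates and aliases are illustrative):

```sql
%%sql
SELECT AlbumId,
       GenreId,
       AVG(Milliseconds) / 1000.0 / 60.0 AS avg_minutes,
       AVG(UnitPrice) AS avg_price
FROM tracks
GROUP BY AlbumId, GenreId
```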
The above query includes almost all the topics we have learned up to this point. We are grouping by both album ID and genre ID, and for each group, we calculate the average duration and price of a song. We are also making efficient use of aliasing.
We can make the query even more powerful by ordering by the average duration and genre count:
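For example (the sorting columns are illustrative):

```sql
%%sql
SELECT GenreId,
       COUNT(*) AS n_songs,
       AVG(Milliseconds) / 1000.0 / 60.0 AS avg_minutes
FROM tracks
GROUP BY GenreId
ORDER BY avg_minutes DESC, n_songs DESC
```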
Pay attention to how we use the alias names of the aggregate functions in the ORDER BY clause. Once you alias a column or an aggregate result, you can refer to it by that alias for the rest of the query.
Using conditionals with HAVING
By default, SQL does not allow conditional filtering using aggregate functions in the WHERE clause. For example, we want to select only the genres where the number of songs is greater than 100. Let’s try this with the WHERE clause:
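A sketch of the failing attempt (SQL engines reject aggregates inside WHERE):

```sql
%%sql
SELECT GenreId, COUNT(*) FROM tracks
WHERE COUNT(*) > 100  -- error: aggregates are not allowed in WHERE
GROUP BY GenreId
```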
The correct way of filtering rows based on the results of aggregate functions is using the HAVING clause:
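For example:

```sql
%%sql
SELECT GenreId, COUNT(*) AS n_songs FROM tracks
GROUP BY GenreId
HAVING COUNT(*) > 100
```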
The HAVING clause is usually used together with GROUP BY. Whenever you want to filter rows using aggregate functions, HAVING is the way to go!
Summary
By now, you should have realized how powerful SQL can be. Even though we learned a ton, we have barely scratched the surface. For more advanced topics, you can read the excellent guide on W3Schools and practice your querying skills by solving real-world SQL questions on HackerRank or LeetCode. Thank you for reading!