10-Minute Effortless SQL Tutorial For Die-hard Pandas Lovers | by Bex T. | Jun, 2022
There was a time it was the other way around
Motivation
When the Pandas package gained public exposure in 2009, SQL had already been dominating the data world since 1974. Pandas arrived with an attractive set of features, such as built-in visualization and flexible data handling, and became the go-to data exploration tool. As it gained popularity, many courses and resources emerged teaching Pandas and comparing it to SQL.
Fast-forward to 2021: people are now introduced to the Pandas package before the universal data language, SQL. Even though SQL is as popular as ever, the flexibility and multi-functionality of Pandas make it the first choice for beginner data scientists.
Then, why do you need SQL if you know Pandas?
Even though Pandas may seem the better choice, SQL still plays a crucial role in the day-to-day job of a data scientist. In fact, SQL is the second most in-demand and the third fastest-growing programming language for data science (see here). So adding SQL to your CV is a must if you want a job in the field. And if you already know Pandas, learning SQL should be a breeze, as you will see in this article.
Connecting to a Database
Setting up an SQL workspace and connecting to a sample database can be a real pain. First, you need to install your favorite SQL flavor (PostgreSQL, MySQL, etc.) and download an SQL IDE. Doing all that here would stray from the article’s purpose, so we will take a shortcut.
Specifically, we will run SQL queries directly in a Jupyter notebook without any additional setup. All we need to do is install the ipython-sql package using pip:
pip install ipython-sql
After the installation, start a new Jupyter session and run this command in the notebook:
%load_ext sql
and you are all set!
To illustrate how basic SQL statements work, we will be using the Chinook SQLite database, which you can download here. The database has 11 tables.
To retrieve the data stored in this database’s tables, run this command:
%sql sqlite:///data/chinook.db
The statement starts with the %sql inline magic command, which tells the notebook interpreter we will be running SQL commands. It is followed by the path to the downloaded Chinook database. Valid paths should always start with the sqlite:/// prefix for SQLite databases. Above, we are connecting to the database stored in the ‘data’ folder of the current directory. If you want to pass an absolute path, the prefix should start with four forward slashes: sqlite:////
If you wish to connect to a different database flavor, you can refer to this excellent article.
Taking a First Look at the Tables
The first thing we always do in Pandas is call the .head() method to take a first look at the data. Let’s learn how to do that in SQL:
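A minimal version of that query (the table name comes from the Chinook database; the LIMIT value is illustrative):

```sql
%%sql
SELECT * FROM tracks
LIMIT 5
```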
The first keyword in the above query is SELECT. It is equivalent to the brackets operator in Pandas, where we select specific columns. Here, SELECT is followed by a * (asterisk), an SQL operator that selects everything (all rows and columns) from the table specified after the FROM keyword. LIMIT caps the number of rows returned. So the above query is equivalent to the df.head() function.
If you don’t want to select all columns, you can specify one or more column names after the SELECT keyword:
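For example, a query along these lines (the column names come from the Chinook tracks table):

```sql
%%sql
SELECT Name, Composer, UnitPrice FROM tracks
LIMIT 10
```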
The equivalent Pandas operation is tracks[['Name', 'Composer', 'UnitPrice']].head(10)
Another useful keyword in SQL is DISTINCT. Adding this keyword before any column name returns its unique values:
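A sketch of such a query (the chosen column is just an example):

```sql
%%sql
-- double dashes start a comment in SQL
SELECT DISTINCT Composer FROM tracks
```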
Comments in SQL are written with double dashes.
Counting the Number of Rows
Just like Pandas has the .shape attribute on its DataFrames, SQL has a COUNT function to display the number of rows in a table:
%%sql
SELECT COUNT(*) FROM tracks
It is also possible to pass a column name to COUNT:
%sql SELECT COUNT(FirstName) FROM customers
But note that COUNT(column) counts only non-null values; here the output matches COUNT(*) only because FirstName has no missing values.
More helpful info would be counting the number of unique values in a particular column. We can do this by adding the DISTINCT keyword into COUNT:
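For instance (the column choice is illustrative):

```sql
%%sql
SELECT COUNT(DISTINCT Composer) FROM tracks
```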
Filtering Results With WHERE Clauses
Just looking and counting rows is pretty lame. Let’s see how we can filter rows based on conditions.
First, let’s look at the songs which cost more than a dollar:
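One way to write this query (the LIMIT is optional and illustrative):

```sql
%%sql
SELECT * FROM tracks
WHERE UnitPrice > 1.0
LIMIT 10
```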
Conditional statements are written in the WHERE clause, which always comes after FROM and before the LIMIT keywords. Using conditionals is pretty similar to how we do it in Pandas.
You can also use the COUNT function when using conditionals. For example, let’s see the number of songs that are priced between 1 and 10 dollars:
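A sketch of that count:

```sql
%%sql
SELECT COUNT(*) FROM tracks
WHERE UnitPrice > 1 AND UnitPrice < 10
```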
Above, we chained two conditions with the boolean operator AND. The other boolean operators, OR and NOT, work the same way in SQL.
Now, let’s see all the invoices that have Paris or Berlin as a billing city:
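Using the OR operator, the query might look like:

```sql
%%sql
SELECT * FROM invoices
WHERE BillingCity = 'Paris' OR BillingCity = 'Berlin'
```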
The equality operator in SQL requires only one ‘=’ (equals) sign. Inequality is represented with either the ‘!=’ or the ‘<>’ operator:
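For example, to exclude one city (either inequality operator works):

```sql
%%sql
SELECT * FROM invoices
WHERE BillingCity <> 'Paris'
LIMIT 10
```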
Easier Filtering With BETWEEN And IN
Similar conditionals come up very often, and writing them out with plain boolean operators becomes cumbersome. Pandas, for example, has the .isin() function, which checks whether a value belongs to a list of values. If we wanted to select all invoices for five cities, we would have to chain five conditions. Luckily, SQL supports an IN operator similar to .isin(), so we don’t have to:
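A sketch with IN (the city names are illustrative; swap in any five you like):

```sql
%%sql
SELECT * FROM invoices
WHERE BillingCity IN ('Paris', 'Berlin', 'London', 'Chicago', 'Dublin')
```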
The list of values after IN must be written in parentheses, like a Python tuple rather than a list. You can also negate the condition with the NOT keyword:
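For example:

```sql
%%sql
SELECT * FROM invoices
WHERE BillingCity NOT IN ('Paris', 'Berlin')
```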
Another common filtering operation on numeric columns is selecting values within a range. For this, the BETWEEN keyword can be used, which is equivalent to pd.Series.between():
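A sketch with BETWEEN (the bounds are inclusive; the values are illustrative):

```sql
%%sql
SELECT * FROM tracks
WHERE Milliseconds BETWEEN 200000 AND 300000
LIMIT 10
```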
Checking For Nulls
Every data source has missing values, and databases are no exception. Just like there are several ways to explore the missing values in Pandas, there are specific keywords that check the existence of null values in SQL. The below query counts the number of rows with missing values in BillingState:
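One way to write it:

```sql
%%sql
SELECT COUNT(*) FROM invoices
WHERE BillingState IS NULL
```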
You can add the NOT keyword between IS and NULL to filter out missing values of a particular column:
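For example:

```sql
%%sql
SELECT COUNT(*) FROM invoices
WHERE BillingState IS NOT NULL
```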
Better String Matching With LIKE
In the WHERE clause, we filtered columns based on exact text values. But often we want to filter textual columns based on a pattern. In Pandas and pure Python, we would use regular expressions for pattern matching, which are very powerful but require time to master.
As an alternative, SQL offers the ‘%’ wildcard as a placeholder that matches any sequence of zero or more characters. For example, ‘gr%’ matches ‘great’, ‘groom’, and ‘greed’, while ‘%ex%’ matches any text with ‘ex’ somewhere in the middle. Let’s see how to use it in SQL:
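A sketch of a LIKE query, matching names that start with ‘B’:

```sql
%%sql
SELECT Name FROM tracks
WHERE Name LIKE 'B%'
LIMIT 10
```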
The above query finds all songs that start with ‘B’. The string containing the wildcard comes after the LIKE keyword.
Now, let’s find all songs that contain the word ‘beautiful’ in their titles:
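A wildcard on both sides does the trick:

```sql
%%sql
SELECT Name FROM tracks
WHERE Name LIKE '%beautiful%'
```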
You can also use other boolean operators next to LIKE:
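For example, combining LIKE with AND (the price condition is illustrative):

```sql
%%sql
SELECT Name, UnitPrice FROM tracks
WHERE Name LIKE 'B%' AND UnitPrice > 0.99
LIMIT 10
```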
There are many other wildcards in SQL that have different functionalities. You can see the complete list and their usage here.
Aggregate Functions in SQL
It is also possible to perform basic arithmetic operations on columns. These operations are called aggregate functions in SQL, and the most common ones are AVG, SUM, MIN, and MAX. Their functionality should be clear from their names:
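A sketch using a few of them at once:

```sql
%%sql
SELECT AVG(Milliseconds), MIN(UnitPrice), MAX(UnitPrice) FROM tracks
```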
Aggregate functions produce only a single result for the column they are applied to. This means you cannot aggregate one column and select other, unaggregated columns alongside it; most databases reject such a query:
You can combine aggregate functions with conditionals using WHERE clauses just as easily:
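For example, the average duration of the more expensive tracks (the condition is illustrative):

```sql
%%sql
SELECT AVG(Milliseconds) FROM tracks
WHERE UnitPrice > 1.0
```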
It is also possible to use arithmetic operators such as +, -, *, / on columns, and simple numbers. When used on columns, the operation is performed element-wise:
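For instance, converting each track’s price to cents (a made-up example):

```sql
%%sql
SELECT Name, UnitPrice * 100 FROM tracks
LIMIT 10
```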
One thing to note about arithmetic operations: if you perform operations on integers only, SQL thinks that you are expecting an integer as the answer:
%%sql
SELECT 10 / 3
Instead of returning 3.33…, the result is 3. To get a float result, you should use at least one float number in the query or use all floats to be safe:
%%sql
SELECT 10.0 / 3.0
Using this knowledge, let’s calculate the average duration of a song in minutes:
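One way to write it, assuming the Milliseconds column holds each track’s duration:

```sql
%%sql
SELECT AVG(Milliseconds) / 1000.0 / 60.0 FROM tracks
```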
If you look at the above column, its name is literally the expression used to generate it. Because of this behavior, long calculations, such as finding the standard deviation or variance of a column, become unwieldy: the column name ends up as long as the query itself.
To avoid this, SQL allows aliasing, similar to aliasing import statements in Python. For example:
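The same average-minutes query with an alias (the alias name is arbitrary):

```sql
%%sql
SELECT AVG(Milliseconds) / 1000.0 / 60.0 AS avg_minutes FROM tracks
```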
Using the AS keyword after an item in a SELECT statement tells SQL that we are aliasing. Here are more examples:
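A couple more sketches (the alias names are arbitrary):

```sql
%%sql
SELECT Name AS song_title,
       UnitPrice AS price_usd,
       Milliseconds / 1000.0 AS duration_seconds
FROM tracks
LIMIT 10
```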
You can use aliasing just as easily for columns with long names.
Ordering Results in SQL
Just like Pandas has the sort_values method, SQL supports ordering via the ORDER BY clause. Passing a column name after the clause sorts the results in ascending order:
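For example:

```sql
%%sql
SELECT Name, Composer FROM tracks
ORDER BY Composer
LIMIT 10
```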
We order the tracks table in ascending order by the composer’s name. Note that the ORDER BY clause always comes after the WHERE clause (and before LIMIT). It is also possible to pass two or more columns to ORDER BY:
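For instance:

```sql
%%sql
SELECT Name, Composer, UnitPrice FROM tracks
ORDER BY Composer, Name
LIMIT 10
```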
You can also reverse the ordering by passing the DESC keyword after a column name:
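A sketch mixing both directions:

```sql
%%sql
SELECT Name, Composer, UnitPrice FROM tracks
ORDER BY UnitPrice DESC, Composer DESC, Name ASC
LIMIT 10
```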
The above query returns three columns after ordering UnitPrice and Composer in descending order and Name in ascending order (ASC is the default keyword).
Grouping in SQL
One of the most powerful features of Pandas is groupby. You can use it to transform a table into virtually any shape you want. Its very close cousin in SQL, the GROUP BY clause, can be used to achieve the same functionality. For example, the below query counts the number of songs in each genre:
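A sketch of that query:

```sql
%%sql
SELECT GenreId, COUNT(*) FROM tracks
GROUP BY GenreId
```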
The difference between GROUP BY in SQL and groupby in Pandas is that SQL does not allow selecting bare columns that weren’t given in the GROUP BY clause. For example, adding an extra free column to the above query generates an error:
However, you can choose as many columns in the SELECT statement as you like as long as you are using some type of aggregate function on them:
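One possible shape of such a query (the aggregates and aliases are illustrative):

```sql
%%sql
SELECT AlbumId,
       GenreId,
       AVG(Milliseconds) / 1000.0 / 60.0 AS avg_minutes,
       AVG(UnitPrice) AS avg_price
FROM tracks
GROUP BY AlbumId, GenreId
```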
The above query includes almost all the topics we have learned up to this point. We are grouping by both album ID and genre ID, and for each group, we calculate the average duration and price of a song. We are also making efficient use of aliasing.
We can make the query even more powerful by ordering by the average duration and genre count:
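For example (the sorting columns are illustrative):

```sql
%%sql
SELECT GenreId,
       COUNT(*) AS n_songs,
       AVG(Milliseconds) / 1000.0 / 60.0 AS avg_minutes
FROM tracks
GROUP BY GenreId
ORDER BY avg_minutes DESC, n_songs DESC
```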
Pay attention to how we use the alias names of the aggregate functions in the ORDER BY clause. Once you alias a column or an aggregate result, you can refer to it by that alias for the rest of the query.
Using conditionals with HAVING
By default, SQL does not allow conditional filtering using aggregate functions in the WHERE clause. For example, we want to select only the genres where the number of songs is greater than 100. Let’s try this with the WHERE clause:
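A sketch of the failing attempt (SQL engines reject aggregates inside WHERE):

```sql
%%sql
SELECT GenreId, COUNT(*) FROM tracks
WHERE COUNT(*) > 100  -- error: aggregates are not allowed in WHERE
GROUP BY GenreId
```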
The correct way of filtering rows based on the results of aggregate functions is using the HAVING clause:
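For example:

```sql
%%sql
SELECT GenreId, COUNT(*) AS n_songs FROM tracks
GROUP BY GenreId
HAVING COUNT(*) > 100
```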
The HAVING clause is usually used together with GROUP BY. Whenever you want to filter rows using aggregate functions, HAVING is the way to go!
Summary
By now, you should have realized how powerful SQL can be. Even though we learned a ton, we have barely scratched the surface. For more advanced topics, you can read the excellent guide on W3Schools and practice your querying skills by solving real-world SQL questions on HackerRank or LeetCode. Thank you for reading!