
10-Minute Effortless SQL Tutorial For Die-hard Pandas Lovers
by Bex T. | Jun 2022



There was a time it was the other way around

Photo by Fiona Art from Pexels

Motivation

When the Pandas package first gained public exposure in 2009, SQL had already been dominating the data world since 1974. Pandas arrived with an attractive set of features, such as built-in visualization and flexible data handling, and became the ultimate data exploration tool. As its popularity grew, many courses and resources emerged that taught Pandas and compared it to SQL.

Fast-forward to 2021: people are now introduced to the Pandas package before SQL, the universal data language. Even though SQL is as popular as ever, the flexibility and multi-functionality of Pandas make it the first choice for beginner data scientists.

Then, why do you need SQL if you know Pandas?

Even though Pandas may seem the better choice, SQL still plays a crucial role in the day-to-day job of a data scientist. In fact, SQL is the second most in-demand and the third fastest-growing programming language for data science (see here). So adding SQL to your CV is a must if you want a job in the field. And if you already know Pandas, learning SQL should be a breeze, as you will see in this article.

Connecting to a Database

Setting up an SQL workspace and connecting to a sample database can be a real pain. First, you need to install your favorite SQL flavor (PostgreSQL, MySQL, etc.) and download an SQL IDE. Doing all that here would stray from the article's purpose, so we will take a shortcut.

Specifically, we will run SQL queries directly inside a Jupyter notebook, with no extra steps. All we need to do is install the ipython-sql package with pip:

pip install ipython-sql

After the installation, start a new Jupyter session and run this command in the notebook:

%load_ext sql

and you are all set!

To illustrate how basic SQL statements work, we will be using the Chinook SQLite database, which you can download here. The database has 11 tables.

To connect to the database, run this command:

%sql sqlite:///data/chinook.db

The statement starts with the %sql inline magic command, which tells the notebook interpreter that we will be running SQL. It is followed by the path to the downloaded Chinook database. For SQLite databases, valid paths always start with the sqlite:/// prefix. Above, we connect to a database stored in the 'data' folder of the current directory. If you want to pass an absolute path instead, the prefix should start with four forward slashes: sqlite:////

If you wish to connect to a different database flavor, you can refer to this excellent article.

Taking a First Look at the Tables

The first thing we always do in Pandas is to use the .head() function to take a first look at the data. Let’s learn how to do that in SQL:

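The original article showed every query as a screenshot; here and throughout, the snippets are reconstructions based on the surrounding descriptions and the standard Chinook schema, so treat the details as plausible rather than exact. The first query is the SQL version of df.head():

%%sql
SELECT * FROM tracks
LIMIT 5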
The dataset is licensed for commercial use as well.

The first keyword in the above query is SELECT. It is equivalent to the bracket operator in Pandas, which we use to select specific columns. Here, SELECT is followed by a * (asterisk), the SQL operator that selects everything (all rows and columns) from the table named after the FROM keyword. LIMIT caps the number of rows returned. So the above query is equivalent to the df.head() function.

If you don’t want to select all columns, you can specify one or more column names after the SELECT keyword:

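A reconstruction of the screenshotted query:

%%sql
SELECT Name, Composer, UnitPrice FROM tracks
LIMIT 10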

The equivalent Pandas operation is

tracks[['Name', 'Composer', 'UnitPrice']].head(10)

Another useful keyword in SQL is DISTINCT. Adding this keyword before any column name returns its unique values:

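Reconstructed, with a comment included (the column and LIMIT are assumptions):

%%sql
-- returns the unique values of the Composer column
SELECT DISTINCT Composer FROM tracks
LIMIT 10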

Comments in SQL are written with double dashes.

Counting the Number of Rows

Just like Pandas has the .shape attribute on its DataFrames, SQL has a COUNT function to display the number of rows in a table:

%%sql
SELECT COUNT(*) FROM tracks

It is also possible to pass a column name to COUNT:

%sql SELECT COUNT(FirstName) FROM customers

Here, the output is the same as COUNT(*), because COUNT(column) counts only non-null values and FirstName has no missing entries.

It is more helpful to count the number of unique values in a particular column. We can do this by adding the DISTINCT keyword inside COUNT:

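Reconstructed:

%%sql
SELECT COUNT(DISTINCT Composer) FROM tracks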

Filtering Results With WHERE Clauses

Just looking at rows and counting them is pretty lame. Let's see how we can filter rows based on conditions.

First, let’s look at the songs which cost more than a dollar:

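The query likely read (the LIMIT is illustrative):

%%sql
SELECT * FROM tracks
WHERE UnitPrice > 1
LIMIT 10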

Conditional statements go inside the WHERE clause, which always comes after FROM and before the LIMIT keyword. Writing conditionals is quite similar to how we do it in Pandas.

You can also use the COUNT function when using conditionals. For example, let’s see the number of songs that are priced between 1 and 10 dollars:

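Reconstructed from the description:

%%sql
SELECT COUNT(*) FROM tracks
WHERE UnitPrice > 1 AND UnitPrice < 10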

Above, we chained two conditions with the boolean operator AND. SQL's other boolean operators, such as OR and NOT, work in the same fashion.

Now, let’s see all the invoices that have Paris or Berlin as a billing city:

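Reconstructed:

%%sql
SELECT * FROM invoices
WHERE BillingCity = 'Paris' OR BillingCity = 'Berlin'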

The equality operator in SQL is a single '=' (equals) sign. Inequality is written as either '!=' or '<>':

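An illustrative version (the filtered city is a guess):

%%sql
-- '<>' and '!=' are interchangeable
SELECT * FROM invoices
WHERE BillingCity <> 'Paris'
LIMIT 5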

Easier Filtering With BETWEEN And IN

Similar conditions come up very often, and writing them out as chains of simple booleans becomes cumbersome. Pandas, for example, has the .isin() function, which checks whether a value belongs to a list of values. If we wanted to select invoices from five cities, we would have to chain five conditions. Luckily, SQL has a similar IN operator that works like .isin(), so we don't have to:

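A sketch with an illustrative list of cities:

%%sql
SELECT * FROM invoices
WHERE BillingCity IN ('Paris', 'Berlin', 'London', 'Madrid', 'Rome')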

The list of values after IN is written like a tuple, not a list. You can also negate the condition with the NOT keyword:

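The negated version of the same sketch:

%%sql
SELECT * FROM invoices
WHERE BillingCity NOT IN ('Paris', 'Berlin', 'London', 'Madrid', 'Rome')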

Another common filtering operation on numeric columns is selecting values within a range. For this, the BETWEEN keyword can be used; it is equivalent to pd.Series.between():

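An illustrative range query (the column and bounds are guesses):

%%sql
-- BETWEEN is inclusive on both ends
SELECT * FROM invoices
WHERE Total BETWEEN 5 AND 15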

Checking For Nulls

Every data source has missing values, and databases are no exception. Just as there are several ways to explore missing values in Pandas, SQL has specific keywords for checking whether values are null. The query below counts the number of rows with a missing BillingState:

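Reconstructed:

%%sql
SELECT COUNT(*) FROM invoices
WHERE BillingState IS NULL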

You can add the NOT keyword between IS and NULL to filter out missing values of a particular column:

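Reconstructed:

%%sql
SELECT COUNT(*) FROM invoices
WHERE BillingState IS NOT NULL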

Better String Matching With LIKE

In the WHERE clause, we have so far filtered columns against exact text values. But often, we want to filter text columns based on a pattern. In Pandas and pure Python, we would use regular expressions for pattern matching; they are very powerful but take time to master.

As a lighter alternative, SQL offers the '%' wildcard, a placeholder that matches any character zero or more times. For example, 'gr%' matches 'great', 'groom', and 'greed', while '%ex%' matches any text with 'ex' in the middle. Let's see how to use it:

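The query likely read:

%%sql
SELECT Name FROM tracks
WHERE Name LIKE 'B%'
LIMIT 10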

The above query finds all songs whose names start with 'B'. The string that contains the wildcard must come after the LIKE keyword.

Now, let’s find all songs that contain the word ‘beautiful’ in their titles:

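Reconstructed:

%%sql
SELECT Name FROM tracks
WHERE Name LIKE '%beautiful%'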

You can also use other boolean operators next to LIKE:

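An illustrative combination (the exact conditions were not preserved):

%%sql
SELECT Name, UnitPrice FROM tracks
WHERE Name LIKE '%love%' AND UnitPrice > 0.99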

There are many other wildcards in SQL that have different functionalities. You can see the complete list and their usage here.

Aggregate Functions in SQL

It is also possible to perform basic computations over entire columns. These operations are called aggregate functions in SQL, and the most common ones are AVG, SUM, MIN, and MAX. Their behavior should be clear from their names:

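A sketch covering all four functions (the column choices are assumptions):

%%sql
SELECT AVG(Milliseconds), SUM(Bytes), MIN(UnitPrice), MAX(Milliseconds)
FROM tracks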

Aggregate functions give only a single result for the column you used them on. This means you cannot aggregate across one column and select other unaggregated columns:

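The offending pattern looks like this (illustrative):

%%sql
-- Name is unaggregated; mixing it with AVG is rejected in standard SQL
SELECT Name, AVG(Milliseconds) FROM tracks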

You can combine aggregate functions with conditionals using WHERE clauses just as easily:

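For example (the condition is illustrative):

%%sql
SELECT AVG(Milliseconds) FROM tracks
WHERE UnitPrice > 1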

It is also possible to use arithmetic operators such as +, -, *, and / on columns and plain numbers. When used on columns, the operation is performed element-wise:

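An illustrative element-wise calculation:

%%sql
-- converts each track's duration from milliseconds to seconds
SELECT Name, Milliseconds / 1000 FROM tracks
LIMIT 5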

One thing to note about arithmetic operations: if an operation involves only integers, SQL assumes you expect an integer answer:

%%sql
SELECT 10 / 3

Instead of returning 3.33…, the result is 3. To get a float result, at least one number in the query should be a float; to be safe, use floats everywhere:

%%sql
SELECT 10.0 / 3.0

Using this knowledge, let’s calculate the average duration of a song in minutes:

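Dividing milliseconds by 60,000 gives minutes, and the float literal forces float division:

%%sql
SELECT AVG(Milliseconds) / 60000.0 FROM tracks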

If you look closely at the column above, its name is literally the expression used to generate it. Because of this behavior, longer calculations, such as the standard deviation or variance of a column, become a problem: the column name grows as large as the query itself.

To avoid this, SQL allows aliasing, similar to aliasing import statements in Python. For example:

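A reconstruction, with an assumed alias name:

%%sql
SELECT AVG(Milliseconds) / 60000.0 AS avg_minutes FROM tracks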

Using the AS keyword after an item in a SELECT statement tells SQL that we are aliasing. Here are more examples:

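More illustrative aliases (names are assumptions):

%%sql
SELECT MAX(Milliseconds) / 60000.0 AS longest_song_minutes,
       MIN(UnitPrice) AS cheapest_track
FROM tracks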

You can use aliasing just as easily for columns with long names.

Ordering Results in SQL

Just like Pandas has the sort_values method, SQL supports ordering via the ORDER BY clause. Passing a column name after the clause sorts the results in ascending order:

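Reconstructed:

%%sql
SELECT * FROM tracks
ORDER BY Composer
LIMIT 10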

We order the tracks table in ascending order by the composer’s name. Note that the ORDER BY statement should always come after the WHERE clause. It is also possible to pass two or more columns to ORDER BY:

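With two columns (the second sort key is a guess):

%%sql
SELECT * FROM tracks
ORDER BY Composer, Name
LIMIT 10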

You can also reverse the ordering by passing the DESC keyword after each column name:

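Reconstructed from the description below:

%%sql
SELECT UnitPrice, Composer, Name FROM tracks
ORDER BY UnitPrice DESC, Composer DESC, Name ASC
LIMIT 10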

The above query returns three columns, ordering UnitPrice and Composer in descending order and Name in ascending order (ASC is the default).

Grouping in SQL

One of the most powerful features of Pandas is groupby. You can use it to transform a table into virtually any shape you want. Its very close cousin in SQL, the GROUP BY clause, can be used to achieve the same functionality. For example, the query below counts the number of songs in each genre:

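Reconstructed:

%%sql
SELECT GenreId, COUNT(*) FROM tracks
GROUP BY GenreId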

The difference between GROUP BY in SQL and groupby in Pandas is that SQL does not allow selecting columns that weren't given in the GROUP BY clause unless they are aggregated. For example, adding an extra, free-standing column to the above query generates an error, as in the sketch below:
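A sketch of such a failing query (illustrative):

%%sql
-- Composer is neither grouped by nor aggregated
SELECT GenreId, Composer, COUNT(*) FROM tracks
GROUP BY GenreId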

However, you can select as many columns in the SELECT statement as you like, as long as you apply some type of aggregate function to them:

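A reconstruction matching the description below (the alias names are assumptions):

%%sql
SELECT AlbumId, GenreId,
       AVG(Milliseconds) / 60000.0 AS avg_minutes,
       AVG(UnitPrice) AS avg_price
FROM tracks
GROUP BY AlbumId, GenreId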

The above query brings together almost everything we have learned up to this point. We group by both album ID and genre ID, and for each group we calculate the average duration and average price of a song. We also make good use of aliasing.

We can make the query even more powerful by ordering by the average duration and genre count:

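The same query, extended with ordering (the sort details are assumptions):

%%sql
SELECT AlbumId, GenreId,
       COUNT(*) AS n_songs,
       AVG(Milliseconds) / 60000.0 AS avg_minutes,
       AVG(UnitPrice) AS avg_price
FROM tracks
GROUP BY AlbumId, GenreId
ORDER BY avg_minutes DESC, n_songs DESC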

Pay attention to how we use the aliases of the aggregate functions in the ORDER BY clause. Once you alias a column or an aggregate result, you should refer to it by that alias for the rest of the query.

Using conditionals with HAVING

By default, SQL does not allow conditional filtering with aggregate functions in the WHERE clause. For example, suppose we want to select only the genres with more than 100 songs. Let's try it with the WHERE clause:
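A sketch of the doomed attempt:

%%sql
-- fails: aggregate functions are not allowed in WHERE
SELECT GenreId, COUNT(*) FROM tracks
WHERE COUNT(*) > 100
GROUP BY GenreId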

That attempt fails. The correct way to filter rows based on the results of aggregate functions is the HAVING clause:

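Reconstructed (the alias name is an assumption):

%%sql
SELECT GenreId, COUNT(*) AS n_songs
FROM tracks
GROUP BY GenreId
HAVING n_songs > 100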

The HAVING clause is usually used together with GROUP BY. Whenever you want to filter rows using aggregate functions, HAVING is the way to go!

Summary

By now, you should have realized how powerful SQL can be. Even though we covered a ton, we have barely scratched the surface. For more advanced topics, you can read the excellent guide on W3Schools and practice your querying skills by solving real-world SQL questions on HackerRank or LeetCode. Thank you for reading!

