Techno Blender
Digitally Yours.

How to Run SQL Queries On Your Pandas DataFrames With Python | by Zoumana Keita | Dec, 2022

0 35


Run SQL queries in your Python Pandas Dataframe

Image by Caspar Camille Rubin on Unsplash

Pandas is being increasingly used by Data Scientists and Data Analysts for data analysis purposes, and it has the advantage of being part of the wider Python universe, making it accessible to many people. SQL on the other hand is known for its performance, being human-readable, and can be easily understood even by non-technical people.

What if we could find a way to combine the benefits of both Pandas and SQL statements? Here is where pandasqlcomes in handy.

From this conceptual blog, you will understand what pandasqlis before diving deep into hands-on practices, which will be even much easier if you are already familiar with SQL.

Keep in mind that table and dataframe will be used interchangeably to mean the same thing.

So, what is pandasql?

It is basically used to query pandas DataFrames using SQL syntax. The same process can be performed using sqldf to interact with R DataFrames.

The installation is straightforward with the syntaxes below depending on your environment:

Before installing, it is important to note that installing libraries using conda, it is necessary to specify the --yesflag so that you don’t get stuck to the (Proceed ([y]/n)?) prompt.

pip install pandasql 
!conda install --yes pandasql

After successfully installing the library, you should be able to import the library using the following statement.

from pandasql import sqldf

Before diving deep, let’s start by creating the datasets StudentTable and TeachingAssistantTablethat will be used for hands-on practice.

# Import the pandas library for creating the dataframes
import pandas as pd
# Create the Student Data Frame

students= {
'Students':["Sira","Ibrahim","Moussa","Mamadou","Nabintou"],
'Gender':['Female','Male','Male', "Male", "Female"],
'Age':[18, 27, 19, 22, 21],
'Email': ["[email protected]", "[email protected]", "[email protected]",
"[email protected]", "[email protected]"]
}
students_df = pd.DataFrame(students)

students_df

The Student Data Frame (Image by Author)
# Create the Teaching Assistant Data Frame

teaching_assistant= {
'Teacher':["Ibrahim","Nabintou","Mamadou","Fatim","Aziz"],
'Email':['[email protected]','[email protected]','[email protected]',
"[email protected]", "[email protected]"],
'Degree':["M.S in Data Science", "B.S in Statistics",
"B. Comp Sc", "M.S. Architecture", "B.S in Accounting"],
'Department': ["Business", "Statistics", "Comp Sc",
"Engineering", "Business"]
}
teaching_assistant_df = pd.DataFrame(teaching_assistant)

teaching_assistant_df

The Teaching Assistants Data Frame (Image by Author)

These are the main concepts that will be covered in this section: (1) column section, (2) data filtering, (3) data aggregation, and finally, (4) data joining.

1- Column Selection

This corresponds to selecting part or all the columns of your database/data frame. It is performed with the keyword SELECT col_1, col_2,... col_X FROM tableName

1, 2, ...X correspond to the columns you are interested in having in the final result.

tableName is the name of the dataframe/table.

The result from the following code includes all the columns of the student dataframe.

all_students = sqldf("SELECT * FROM students_df")
all_students
SELECT * Statement Result (Image by Author)
  • all_students is also a DataFrame, which can be helpful if you want to perform quick Pandas tasks such as column types. Below is an illustration.
# Check the type of all_students
print(type(all_students))

# Run Pandas Statement to show the type of the columns
print("---"*10)
print(all_students.dtypes)

all_students type and column types (Image by Author)

Sometimes when you have complicated queries, you can proceed step by step as follow:

  • Define the query as a string. When doing so, make sure to use the triple quote sign ”””so that you can write the query on multiple lines.
  • Apply the sqldf function to the query to get the result.

Let’s say we want the student’s name, their email and limit the result to the first 3.

# Query definition
query = """ SELECT Students, Email
FROM students_df
LIMIT 3
"""
# Query execution
name_email = sqldf(query)
name_email
Students Name and Email limited to the first 3 rows (Image by Author)

2- Data Filtering

Data filtering is when the WHERE statement comes into the equation to perform custom filtering on the result of the SELECT statement.

Let’s imagine that we want all the female students.

# Define the query
query = """SELECT *
FROM students_df
WHERE Gender = 'Female'
"""

# Execute the query
female_student = sqldf(query)
female_student

Female Students (Image by Author)

Let’s have a look at this query that aims to retrieve the Name, Email, and Degree of all the Teaching Assistants with a Master’s Degree.

query = """ SELECT Teacher, Email, Degree 
FROM teaching_assistant_df
WHERE Degree LIKE 'M.S%'
"""
ms_students = sqldf(query)
ms_students
Teaching Assistants with a Master Degree (Image by Author)

I have decided to perform the same request, this time using Pandas statement, and this is what it looks like!

cols_to_get = ['Teacher', 'Email', 'Degree']
teaching_assistant_df[teaching_assistant_df.Degree.str.startswith('M.S')][cols_to_get]

This looks like. scary statement 🥺. Let’s understand what happened:

  • cols_to_get corresponds to the columns we want in the final dataframe. It’s basically equivalent to the columns we specify when using the SELECT statement.
  • teaching_assistant_df.Degree.str.startswith('M.S') returns True is the value in the Degreecolumn starts with 'M.S'. FalseOtherwise.
  • Finally, those boolean values are transmitted to the starting teaching_assistant_df to get rows where the result is True.

But… we performed the same task with SQL query, and was simple and easy to understand. Without going into too much comparative analysis, just wanted to emphasize how complicated can be some queries that are simple to write in SQL.

3- Data Aggregation

Aggregation in SQL is performed with the help of aggregation functions, and these are some of the most commonly used: COUNT , SUM , MAX & MIN , and AVG . For instance, you can get the age of students, based on their gender.

query = """ SELECT AVG(Age), Gender 
FROM students_df
GROUP BY Gender
"""
avg_age = sqldf(query)
avg_age
Average Age of Students Gender Wise (Image by Author)

4- Data Joining

The join concept becomes useful when we want to combine data from at least two tables. This section goes straight to the point with only one type of join. You can learn more about all the join concepts from my article titled 4 Types of SQL JOIN Every Data Scientist Should Know.

Imagine you want to know who is both a student and also a teaching assistant. The answer to this requires joining our two tables as follows, using an INNER JOIN and the final result contains the following information:

  • Student Name, Gender, Email, and Age from the students_df table.
  • Department from the teaching_assistant_df dataframe.
query = """ SELECT st.Students, st.Gender, st.Email, st.Age, tat.Department
FROM students_df st INNER JOIN teaching_assistant_df tat
ON st.Email = tat.Email;
"""

result = sqldf(query)
result

Students who are also teaching assistants (Image by Author)

Congratulations!🎉🍾 You have just learned how to leverage the power of pandasql, a great tool that allows you to apply both SQL and Pandas queries on your dataframes.

If you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $ 5-a-month commitment, you unlock unlimited access to stories on Medium.

Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!




Run SQL queries in your Python Pandas Dataframe

Image by Caspar Camille Rubin on Unsplash

Pandas is being increasingly used by Data Scientists and Data Analysts for data analysis purposes, and it has the advantage of being part of the wider Python universe, making it accessible to many people. SQL on the other hand is known for its performance, being human-readable, and can be easily understood even by non-technical people.

What if we could find a way to combine the benefits of both Pandas and SQL statements? Here is where pandasqlcomes in handy.

From this conceptual blog, you will understand what pandasqlis before diving deep into hands-on practices, which will be even much easier if you are already familiar with SQL.

Keep in mind that table and dataframe will be used interchangeably to mean the same thing.

So, what is pandasql?

It is basically used to query pandas DataFrames using SQL syntax. The same process can be performed using sqldf to interact with R DataFrames.

The installation is straightforward with the syntaxes below depending on your environment:

Before installing, it is important to note that installing libraries using conda, it is necessary to specify the --yesflag so that you don’t get stuck to the (Proceed ([y]/n)?) prompt.

pip install pandasql 
!conda install --yes pandasql

After successfully installing the library, you should be able to import the library using the following statement.

from pandasql import sqldf

Before diving deep, let’s start by creating the datasets StudentTable and TeachingAssistantTablethat will be used for hands-on practice.

# Import the pandas library for creating the dataframes
import pandas as pd
# Create the Student Data Frame

students= {
'Students':["Sira","Ibrahim","Moussa","Mamadou","Nabintou"],
'Gender':['Female','Male','Male', "Male", "Female"],
'Age':[18, 27, 19, 22, 21],
'Email': ["[email protected]", "[email protected]", "[email protected]",
"[email protected]", "[email protected]"]
}
students_df = pd.DataFrame(students)

students_df

The Student Data Frame (Image by Author)
# Create the Teaching Assistant Data Frame

teaching_assistant= {
'Teacher':["Ibrahim","Nabintou","Mamadou","Fatim","Aziz"],
'Email':['[email protected]','[email protected]','[email protected]',
"[email protected]", "[email protected]"],
'Degree':["M.S in Data Science", "B.S in Statistics",
"B. Comp Sc", "M.S. Architecture", "B.S in Accounting"],
'Department': ["Business", "Statistics", "Comp Sc",
"Engineering", "Business"]
}
teaching_assistant_df = pd.DataFrame(teaching_assistant)

teaching_assistant_df

The Teaching Assistants Data Frame (Image by Author)

These are the main concepts that will be covered in this section: (1) column section, (2) data filtering, (3) data aggregation, and finally, (4) data joining.

1- Column Selection

This corresponds to selecting part or all the columns of your database/data frame. It is performed with the keyword SELECT col_1, col_2,... col_X FROM tableName

1, 2, ...X correspond to the columns you are interested in having in the final result.

tableName is the name of the dataframe/table.

The result from the following code includes all the columns of the student dataframe.

all_students = sqldf("SELECT * FROM students_df")
all_students
SELECT * Statement Result (Image by Author)
  • all_students is also a DataFrame, which can be helpful if you want to perform quick Pandas tasks such as column types. Below is an illustration.
# Check the type of all_students
print(type(all_students))

# Run Pandas Statement to show the type of the columns
print("---"*10)
print(all_students.dtypes)

all_students type and column types (Image by Author)

Sometimes when you have complicated queries, you can proceed step by step as follow:

  • Define the query as a string. When doing so, make sure to use the triple quote sign ”””so that you can write the query on multiple lines.
  • Apply the sqldf function to the query to get the result.

Let’s say we want the student’s name, their email and limit the result to the first 3.

# Query definition
query = """ SELECT Students, Email
FROM students_df
LIMIT 3
"""
# Query execution
name_email = sqldf(query)
name_email
Students Name and Email limited to the first 3 rows (Image by Author)

2- Data Filtering

Data filtering is when the WHERE statement comes into the equation to perform custom filtering on the result of the SELECT statement.

Let’s imagine that we want all the female students.

# Define the query
query = """SELECT *
FROM students_df
WHERE Gender = 'Female'
"""

# Execute the query
female_student = sqldf(query)
female_student

Female Students (Image by Author)

Let’s have a look at this query that aims to retrieve the Name, Email, and Degree of all the Teaching Assistants with a Master’s Degree.

query = """ SELECT Teacher, Email, Degree 
FROM teaching_assistant_df
WHERE Degree LIKE 'M.S%'
"""
ms_students = sqldf(query)
ms_students
Teaching Assistants with a Master Degree (Image by Author)

I have decided to perform the same request, this time using Pandas statement, and this is what it looks like!

cols_to_get = ['Teacher', 'Email', 'Degree']
teaching_assistant_df[teaching_assistant_df.Degree.str.startswith('M.S')][cols_to_get]

This looks like. scary statement 🥺. Let’s understand what happened:

  • cols_to_get corresponds to the columns we want in the final dataframe. It’s basically equivalent to the columns we specify when using the SELECT statement.
  • teaching_assistant_df.Degree.str.startswith('M.S') returns True is the value in the Degreecolumn starts with 'M.S'. FalseOtherwise.
  • Finally, those boolean values are transmitted to the starting teaching_assistant_df to get rows where the result is True.

But… we performed the same task with SQL query, and was simple and easy to understand. Without going into too much comparative analysis, just wanted to emphasize how complicated can be some queries that are simple to write in SQL.

3- Data Aggregation

Aggregation in SQL is performed with the help of aggregation functions, and these are some of the most commonly used: COUNT , SUM , MAX & MIN , and AVG . For instance, you can get the age of students, based on their gender.

query = """ SELECT AVG(Age), Gender 
FROM students_df
GROUP BY Gender
"""
avg_age = sqldf(query)
avg_age
Average Age of Students Gender Wise (Image by Author)

4- Data Joining

The join concept becomes useful when we want to combine data from at least two tables. This section goes straight to the point with only one type of join. You can learn more about all the join concepts from my article titled 4 Types of SQL JOIN Every Data Scientist Should Know.

Imagine you want to know who is both a student and also a teaching assistant. The answer to this requires joining our two tables as follows, using an INNER JOIN and the final result contains the following information:

  • Student Name, Gender, Email, and Age from the students_df table.
  • Department from the teaching_assistant_df dataframe.
query = """ SELECT st.Students, st.Gender, st.Email, st.Age, tat.Department
FROM students_df st INNER JOIN teaching_assistant_df tat
ON st.Email = tat.Email;
"""

result = sqldf(query)
result

Students who are also teaching assistants (Image by Author)

Congratulations!🎉🍾 You have just learned how to leverage the power of pandasql, a great tool that allows you to apply both SQL and Pandas queries on your dataframes.

If you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $ 5-a-month commitment, you unlock unlimited access to stories on Medium.

Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.
Leave a comment