
Pandas vs. SQL — Part 2: Pandas Is More Concise | by Aditya Parameswaran



Co-authored with Mahesh Vashishtha

Photo by Prateek Katyal on Unsplash

tl;dr: In this post, we compare Pandas vs. SQL on the first of three axes: conciseness. We discuss 10 operations that are much more straightforward in Pandas than in SQL, spanning data cleaning, machine learning, feature engineering, linear algebra, and spreadsheet functions.

In our previous post we discussed how Pandas dataframes are much like a food court, while databases offering SQL are like Michelin-star restaurants, and how both dataframes and databases have been around in some form since the 1970s. In this post we’ll dive into the comparison between Pandas and SQL on conciseness: the ability to express sophisticated data transformation operations succinctly.

So, given that both Pandas and SQL support data transformation and analysis, one may ask: Is Pandas really more concise than SQL, the original database query language? In this post we show that there are many operations that are much easier to do in Pandas than in SQL. We’ll walk through ten examples, most of which use this Goodreads dataset from Kaggle:

  1. Normalizing columns uniformly
  2. One-hot encoding
  3. Cleaning columns with missing values
  4. Compute covariances
  5. Element-wise combination of dataframes
  6. Transpose
  7. Programmatically operate on column names
  8. Point updates
  9. Pivot & Melt
  10. Moving average of a time series

We use Pandas to load the Goodreads data from the Kaggle CSV file. The CSV doesn’t correctly escape commas in string values, e.g., the author "Sam Bass Warner, Jr./Sam B. Warner" on line 3350 is not surrounded by quotation marks. We can just have Pandas skip these malformed lines by passing read_csv the parameter on_bad_lines='skip'. Pandas will also try to infer data types from the CSV, and will by default make the bookID, isbn, and isbn13 columns numerical, whereas we prefer to treat them as strings, so we can override this step.
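A minimal sketch of that load (the file name books.csv is assumed from the Kaggle download):

```python
import pandas as pd

# Skip malformed lines instead of erroring out, and keep the ID columns as strings.
df = pd.read_csv(
    'books.csv',
    on_bad_lines='skip',
    dtype={'bookID': str, 'isbn': str, 'isbn13': str},
)
```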

Some of the steps we took here are not easy to do in most relational databases supporting SQL. For example, Postgres would not allow us to skip bad lines when reading the CSV and would throw a parse error.

Normalization is a common feature engineering operation; it involves scaling numerical values in a column to ensure that they are in a certain range, or that the resulting distribution of values has low variance. Normalization is key to preparing input features prior to machine learning model training, which in turn helps ensure that the eventual machine learning models have high accuracy. With Pandas, we can easily pick out only the numerical columns and normalize them all at once:
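One way to do this is min-max scaling, sketched here; select_dtypes picks out the numeric columns so we never have to name them explicitly:

```python
# Min-max scale every numeric column (a sketch; other normalizations work similarly).
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
```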

Now to do the same in SQL, we need to explicitly list out the names of the columns that are numerical:

Since we need to operate on each column separately, the size of the SQL query grows with the number of columns, and we end up with a lot of duplicate query phrases. Moreover, every time our columns change, we have to change our query. With Pandas, we are able to select all of the numeric columns at once, because Pandas lets us examine and manipulate metadata (in this case, column types) within operations. In SQL, we have to manually craft a clause for each numerical column, because the query itself can’t access column types.

Note that there is a way to filter columns by data type in SQL via auxiliary metadata tables (e.g., sys.tables, information_schema), but this approach is equally clunky, and it is not universally supported across SQL engines.

Most machine learning packages expect numeric or boolean matrices as input. One-hot encoding is a feature engineering operation that is commonly used to transform a single categorical attribute to a set of binary attributes, which machine learning packages like Scikit-learn can interpret more easily.

For example, we might want to one-hot encode language_code:

To one-hot encode the column with Pandas, we’d write:
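A sketch using get_dummies, which expands language_code into one binary column per distinct value:

```python
# One-hot encode language_code; each distinct value becomes its own 0/1 column.
df = pd.get_dummies(df, columns=['language_code'])
```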

Doing the same thing in SQL is much harder. For each column we want to one-hot encode, we have to write as many clauses as there are distinct values in that column. Imagine a column with thousands of distinct values (e.g., the more than 3,000 counties in the US) — the resulting SQL query would involve thousands of query phrases, which is practically impossible for a human being to write correctly.

There are other complications. Even before we can write the SQL query above, we would need to first identify the distinct values in the column we want to one-hot-encode. The distinct values have to be identified in a separate operation, such as:

Finally, whenever the set of distinct values in the column changes, the SQL query has to change. If we forget to update it, we may fail to encode our data correctly, impacting downstream applications and models.

Imagine that we want to discard columns where too many rows are missing a value for that column, as denoted by the value NULL or NA. Doing this is a common data cleaning step in machine learning, because such columns are unlikely to be valuable features for the eventual model.

In our example, we discard columns where more than 1% of the records are missing values. In Pandas, we can generate a boolean mask indicating whether each entry in the dataframe is null, calculate the mean of that mask for each column (the fraction of nulls), and use a boolean indexer to keep only the columns where that fraction is not too high:
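A sketch of that chain:

```python
# Keep only the columns where at most 1% of the entries are null.
df = df.loc[:, df.isna().mean() <= 0.01]
```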

We can't accomplish this in a single SQL statement. First, to get the fraction of rows that are null for each column, we can write something like:

We can then write another statement selecting the columns where the fraction is larger than .01. However, this is a painful process, and the time it takes to write this query is proportional to the number of columns. The extensive duplication of query logic across phrases also makes it more likely that errors may creep in.

The covariance matrix gives a sense of how each pair of variables varies together. It’s used in principal component analysis, a common technique in exploratory data analysis and machine learning. It’s also useful as a step in feature selection — removing extraneous features to help make machine learning algorithms run more efficiently and effectively. With Pandas we can get a matrix of covariances across all numerical columns with:
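A sketch (the scaling by len(df) - 1 is explained below):

```python
# Pairwise covariances across all numerical columns; df.cov() normalizes by (N - 1),
# so multiplying by len(df) - 1 recovers the unnormalized sums of products.
cov = df.cov(numeric_only=True) * (len(df) - 1)
```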

Note that we multiply df.cov() by len(df) - 1 because Pandas normalizes the covariance. This was so simple because Pandas lets you treat the dataframe as a matrix and apply linear algebraic operations to it.

With SQL we would first use a WITH clause to calculate the mean for each column, then sum the product (X - X_m) * (Y - Y_m) for each pair of variables X and Y, where X_m and Y_m are the corresponding means.

Again, this is a very long query, with a number of clauses proportional to the square of the number of attributes. Even then, the query does not reproduce the covariance matrix exactly.

Some databases, e.g., Redshift and BigQuery, have a built-in correlation function, and/or let us write a user-defined function to compute correlation instead of writing the formula out for each pair of columns. Even so, we would have to manually write a query computing the correlation for each pair of numerical columns, or join the entire table against itself.

Suppose we want to take a weighted average of the numerical columns of the Otherreads and Goodreads book review datasets. In Pandas, we can select all the numerical columns from each dataset and then add the two datasets element-wise; Pandas lets us operate on every column in an equivalent manner with minimal code.
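A sketch of what this might look like, assuming the two datasets are loaded as df_goodreads and df_otherreads with matching columns and row order (the equal 0.5 weights are illustrative):

```python
# Select the numeric columns once, then combine the two frames element-wise.
num_cols = df_goodreads.select_dtypes(include='number').columns
weighted = 0.5 * df_goodreads[num_cols] + 0.5 * df_otherreads[num_cols]
```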

In SQL, we’d have to write out the average for each numerical column, and we’d have to write out a separate SELECT subclause for each non-numerical column. In Pandas, we effectively joined on the row number, but in SQL, there’s no concept of row number, and we have to assume that a column like bookID is common between the datasets so that we can join them. Once again, the size of this query is proportional to the number of attributes.

Transpose is a standard linear algebra operation that is often useful when cleaning data. Most machine learning packages expect individual datapoints as rows and features as columns. If input datasets have the opposite orientation, a transpose can fix them. Here’s how we transpose in Pandas — it’s a single character: T!
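For example:

```python
# Transpose: rows become columns and columns become rows.
df_t = df.T
```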

Transpose is impossible in SQL: SQL doesn’t natively support most linear algebraic operators the way Pandas does.

Often, when a dataset has many columns, it is overwhelming to examine them all at once. You may want to focus your attention on a subset of columns with similar names; names are often similar because the columns serve similar functions. Pandas makes it easy to focus on just the columns whose names match certain criteria. For example, to select all columns of our dataset that contain the string isbn, we’d do:
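One way to write this (df.filter matches column labels containing the given substring):

```python
# Select just the columns whose names contain 'isbn'.
isbn_cols = df.filter(like='isbn')
```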

Selecting column names that match a pattern requires filtering based on the table’s schema, which isn’t easy to do in a SQL query (except via the tedious information_schema approach described in example 1 above).

Instead, we have to search on our own for columns whose names match our target string and manually select them, but that becomes difficult and error-prone when the number of columns is large. We’d then have to update the list of columns in our query whenever the table schema changes.

Often, some rows of a dataset have errors or anomalies that break subsequent data processing steps. In such cases, it’s useful to fix the errors and anomalies point by point, directly. For example, suppose that we find out that the average_rating value of just the first row is wrong, and we need to fix it. In Pandas we would write:
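A sketch (the replacement value 4.5 is illustrative):

```python
# Overwrite average_rating in the first row of the dataframe.
df.loc[df.index[0], 'average_rating'] = 4.5
```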

This is easy in Pandas because dataframes maintain order; so like in a spreadsheet, we can go in and edit cells on demand. Since relational databases don’t support a notion of order, this operation is impossible to do in SQL. Instead, we would need to fix the data source (in this case, the CSV) ourselves. Likewise, if we wanted to do an update on columns 4 through 6, that would be impossible in SQL, since queries cannot use the order of columns at all.

Pivoting lets you reshape a dataframe so that unique values from a specified index and columns serve as axes for the new dataframe. This operation comes from spreadsheet applications such as Excel. Here’s an example with a different, smaller dataset:
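A sketch on a small, made-up dataframe (the foo and baz names are illustrative; bar is the column whose values become the new columns, as referenced below):

```python
small = pd.DataFrame({
    'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
    'baz': [1, 2, 3, 4, 5, 6],
})

# Unique values of foo become the index, unique values of bar become the columns.
pivoted = small.pivot(index='foo', columns='bar', values='baz')
```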

To perform the same pivot on the above dataset in SQL, we’d have to write something like:

The above query has a separate clause in the SELECT list for each possible value of bar, the column supplying the new column names. Consequently, the length of the query is proportional to the number of distinct values in bar. Once again, if the distinct values change, the query has to change.

Pandas also supports melt, the inverse of pivot. Some SQL engines, such as Microsoft SQL Server and Spark SQL, support an equivalent unpivot operation, but others do not.
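Continuing the sketch above, melt undoes the pivot:

```python
# Turn the bar columns back into rows, recovering the original long format.
melted = pivoted.reset_index().melt(id_vars='foo', var_name='bar', value_name='baz')
```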

Suppose we have a table where each column has the number of visits to a website on a given day, as well as the number of people who bought a product:

We might want to get a rolling weekly average to smooth out noise in the data. To get a rolling weekly average in Pandas, we can do:
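A sketch, assuming the table is loaded as a dataframe with one column per date and one row per metric (the sample data below is illustrative):

```python
import numpy as np

# One column per date, one row per metric (illustrative data).
dates = pd.date_range('2015-01-01', periods=14).strftime('%Y-%m-%d')
traffic = pd.DataFrame(
    np.random.randint(0, 1000, size=(2, len(dates))),
    index=['visits', 'purchases'],
    columns=dates,
)

# Rolling weekly mean across the date columns; axis=1 slides the window over columns.
weekly_avg = traffic.rolling(window=7, axis=1).mean()
```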

In SQL, an equivalent query would look like:

We’ve abbreviated the query here, but the full SELECT clause would have 2513 lines! It would also grow as the number of dates in the dataset grew.

Most databases have a window operation for these kinds of queries, but those are for windows spanning rows, not columns. Your database might let you pivot the data so that the dates are rows instead of columns, but that would require an extra query. Pandas, on the other hand, lets you run such operations across either axis by choosing a single number, the axis argument to the window function.

While SQL is very concise, there are a number of types of operations for which Pandas is much more capable and succinct. In the examples above, we found:

  • Data cleaning and feature engineering: Pandas has convenient utilities for normalization, one-hot encoding, and dropping sparsely populated features. It’s possible to do these things in SQL, but with much more code.
  • Operations coming from linear algebra and spreadsheets: Pandas has utilities for spreadsheet-style operations like pivoting and unpivoting, while not all SQL engines do. Pandas also makes it convenient to perform matrix operations like addition, transposition, and point updates, while SQL does not, and some of these matrix operations are impossible in many SQL engines.
  • Working fluently with metadata: When working with Pandas dataframes, we can access and manipulate metadata as easily as data in our queries. Consequently, it was easy to select all numerical columns, or to select all columns whose names matched a pattern.

Are you aware of other examples where Pandas or SQL outshines the other in terms of conciseness? We’d love to hear about them! Feel free to follow us on Twitter for more content like this, and respond to our tweet here!

In the next post in our Pandas vs. SQL series (post 3 of 4), we argue Pandas is the more flexible language. Read more here!




