
Multivariate Analysis — Going Beyond One Variable At A Time | by Farzad Mahmoodinobar | Jan, 2023



an owl reflecting on the data, by DALL.E 2

It has become common practice for companies to collect as much information as reasonably possible, even when the use cases for the data are unknown at the time of collection, in the hope of understanding and using it at some point in the future. Once such data sets are available, data-driven individuals dive in, looking for hidden patterns and relationships. One tool for finding such patterns is multivariate analysis.

Multivariate analysis involves analyzing the relationships between multiple variables (i.e. multivariate data) and understanding how they influence each other. It is an important tool that helps us better understand complex data sets to make data-driven and informed decisions. If you are interested in only analyzing the impact of one variable at a time, that can be accomplished through a univariate analysis, which I covered in this post.

Now that we are familiar with multivariate data, we can define univariate data as a special case of multivariate data where data consists of only one variable. Similarly, bivariate data consists of two variables and so forth.

We will discuss bivariate/multivariate analysis of both numerical and categorical variables in this post. Therefore, let’s go over a quick refresher about the distinction between these two types of variables and then we can move on to the analysis.

  • Numerical Variables: Represent a measurable quantity, which can be either continuous or discrete variables. Continuous ones can take on any value within a certain range (e.g. height, weight, etc.), while discrete numerical variables can only take on specific values within a range (e.g. number of children, count of cars in a parking lot, etc.)
  • Categorical Variables: Represent a group (or category) and can take on a limited number of values, such as car brands, dog breeds, etc.
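In pandas, this distinction usually shows up in the column dtypes. The sketch below uses a small made-up dataframe (the values are illustrative, not taken from the car data set) and `select_dtypes` to split the columns into numerical and categorical groups. It assumes categorical columns are stored as strings, which may not hold for every data set.

```python
import pandas as pd

# Toy data for illustration only; not the article's data set
toy = pd.DataFrame({
    "price": [13495.0, 16500.0, 13950.0],                  # continuous numerical
    "num-of-doors": [2, 2, 4],                             # discrete numerical
    "body-style": ["convertible", "hatchback", "sedan"],   # categorical
})

# Columns with a numeric dtype (integers and floats)
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical in this sketch
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```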

Now that we understand the distinction between these two types of variables, we can move on to the analysis itself.

I have organized this post in the format of a series of questions and answers, which I personally find an effective method of learning. I have also included a link to the notebook that I used to create this exercise towards the end. Feel free to download and practice after reading this post!

Let’s get started!

(All images, unless otherwise noted, are by the author.)

In order to practice multivariate analysis, we will be using a data set from UCI Machine Learning Repository (CC BY 4.0), which includes car prices and a set of car properties associated with each car price. In order to simplify the process, I have cleaned up and filtered the data, which can be downloaded from this link.

Let’s start with importing the libraries we will be using today, then we will read the data set into a dataframe and look at the top 5 rows of the dataframe to familiarize ourselves with the data.

# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

# Show all columns/rows of the dataframe
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# Read the data
df = pd.read_csv('auto-cleaned.csv')

# Return top 5 rows of the dataframe
df.head()

Results:

The columns that we will be using in this post are self-explanatory, so do not worry about understanding all of the columns at this point.

Let’s move on to the analysis!

Let’s start with a case of bivariate data, consisting of only two variables. The goal of bivariate analysis is to understand the relationship between two variables. There are various statistical techniques that can be used to analyze bivariate data and using scatterplots is one of the most common ones. Let’s see how scatterplots work.

Question 1:

What is the relationship between price and engine size? One might intuitively expect cars with larger engines to have higher prices (all else equal), but let’s see if the data supports this. Create a scatterplot with price on the x-axis and engine-size on the y-axis.

Answer:

# Create the scatterplot
sns.regplot(data = df, x = 'price', y = 'engine-size', fit_reg = False)
plt.show()

Results:

Scatterplot of price vs. engine size

As we see, there seems to be a positive relationship between price and engine size in our data. Note that this does not imply causation (whether or not a causal link actually exists); it merely shows a positive correlation between the two. Let’s add in the correlation values to have a quantitative measure for our reference.

Question 2:

Return the correlation between price and other variables in a descending order.

Answer:

# Create the overall correlation
corr = np.round(df.corr(numeric_only = True), 2)

# Return correlation only with price
price_corr = corr['price'].sort_values(ascending = False)
price_corr

Results:

Results confirm the positive correlation that we observed in the scatterplot between price and engine size. Let’s try to go one layer deeper and look at the variation within the data.
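For reference, `DataFrame.corr` computes the Pearson correlation coefficient by default. A minimal sketch of the same calculation done by hand, on made-up numbers rather than the car data, might look like this:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations of each value from its own mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson r = sum of products of deviations
#             / sqrt(product of the sums of squared deviations)
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# pandas' built-in result for comparison
r_pandas = x.corr(y)

print(round(r_manual, 4), round(r_pandas, 4))
```

The two values should agree up to floating-point error, which is a quick way to check what the coefficient actually measures.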

Heterogeneity in the data refers to the variation within a data set. For example, our data set consists of different body styles, such as sedan, hatchback, wagon, convertible, etc. Do we expect the correlation between price and engine size to be similar in all of these body styles? For example, incremental willingness to pay of customers for larger engines in convertibles might be higher compared to wagons which are primarily used by families. Let’s look into that hypothesis and see if such a variation across body styles exists, by stratifying our data across body styles.

Question 3:

The data set includes car prices with various body styles, as indicated in column “body-style”. How many rows per class are there in the data set?

Answer:

# Apply value_counts to the df['body-style'] column
df['body-style'].value_counts()

Results:

According to the results, there are five classes.

Question 4:

Create a scatterplot per body style of price versus engine size to demonstrate whether there is a visual variance among the body styles.

Answer:

sns.FacetGrid(data = df, col = 'body-style').map(plt.scatter, 'price', 'engine-size').add_legend()
plt.show()

Results:

Scatterplots of price vs. engine size broken down by body style

Now that is interesting! The distributions are quite different from the overall distribution that we observed in Question 1 and show visual differences among these five body styles. All five body styles show a positive relationship between price and engine size, as expected, but the slope seems to be steepest for convertibles (despite the smaller number of data points), particularly when compared to wagons. Let’s look at correlation numbers to quantify these.

Question 5:

What are the correlations between price and engine size for each of the body styles?

Answer:

bodies = df['body-style'].unique()

for body in bodies:
    print(body)
    print(df.loc[df['body-style'] == body, ['price', 'engine-size']].corr())
    print()

Results confirm our visual inspection — correlation of price and engine size is positive for all body styles, with the highest correlation belonging to convertibles and the lowest to the wagons, as we intuitively expected. Next, we are going to look at a categorical bivariate analysis.
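Correlation and slope measure different things: correlation captures how tightly the points follow a line, while the slope captures how steep that line is. The sketch below, which uses a made-up two-group dataframe rather than the car data, collects both per group. Using `np.polyfit` for the least-squares slope is an illustrative choice and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "body-style": ["wagon"] * 4 + ["convertible"] * 4,
    "price": [10.0, 20.0, 30.0, 40.0, 10.0, 20.0, 30.0, 40.0],
    "engine-size": [100.0, 105.0, 110.0, 115.0, 100.0, 130.0, 150.0, 180.0],
})

rows = {}
for body, group in toy.groupby("body-style"):
    # Pearson correlation between the two columns within this group
    r = group["price"].corr(group["engine-size"])
    # Least-squares slope of engine-size as a function of price
    slope = np.polyfit(group["price"], group["engine-size"], deg=1)[0]
    rows[body] = {"corr": r, "slope": slope}

# One row per body style, with columns "corr" and "slope"
summary = pd.DataFrame(rows).T.round(3)
print(summary)
```

In this toy example the wagon group has the higher correlation but the lower slope, which is why the two should not be read interchangeably.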

In this section, we are going to create a similar bivariate analysis but for categorical variables. In statistics, this type of analysis is usually visualized through a “contingency table” (aka cross-tabulation or crosstab), which displays the frequency or count of observations for two (for bivariate) or more (for multivariate) categorical variables. Let’s look at an example to better understand contingency tables.

Question 6:

Create a contingency table of car’s body style and number of cylinders. Do you see a pattern in the results?

Answer:

crosstab = pd.crosstab(df['body-style'], df['num-of-cylinders'])
crosstab

Results:

If you are familiar with cars, you will know that the most common cylinder counts are 4, 6 and 8, which is where we see most of the frequency in the table as well. We can also see that the majority of the cars in our data set are four-cylinder sedans and hatchbacks, followed by wagons. Did you notice that we are doing mental math to calculate the percentage of total for each combination of cylinder count and body style? Contingency tables can be normalized to address exactly this. There are three approaches to normalizing such a table:

  1. Entries in each row sum to 1
  2. Entries of each column sum to 1
  3. Entries of the entire table sum to 1

Let’s try one of these in the next question.

Question 7:

Create a cross-tabulation table similar to the previous question, normalized so that the entries of each row sum to 1, rounded to 2 decimal places.

Answer:

I am going to demonstrate two different approaches here for learning purposes. The first approach uses pandas’ crosstab and the second one uses groupby.

# Approach 1

# Create the crosstab (similar to previous question)
crosstab = pd.crosstab(df['body-style'], df['num-of-cylinders'])

# Normalize the crosstab by row
crosstab_normalized = crosstab.apply(lambda x: x/x.sum(), axis = 1)

# Round the results to two decimal places
round(crosstab_normalized, 2)

Results:

# Approach 2

# Group by and count occurrences using size method
grouped_table = df.groupby(['body-style', 'num-of-cylinders']).size()

# Pivot the results using unstack and apply the row normalization
grouped_table_normalized = grouped_table.unstack().fillna(0).apply(lambda x: x/x.sum(), axis = 1)

# Round the results to two decimal places
round(grouped_table_normalized, 2)

Results:
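As a third option, `pd.crosstab` accepts a `normalize` argument that covers all three normalization schemes listed earlier. The sketch below uses a small made-up dataframe, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "body-style": ["sedan", "sedan", "sedan", "wagon", "wagon"],
    "num-of-cylinders": ["four", "four", "six", "four", "six"],
})

# 1. Entries in each row sum to 1
by_row = pd.crosstab(toy["body-style"], toy["num-of-cylinders"], normalize="index")

# 2. Entries in each column sum to 1
by_col = pd.crosstab(toy["body-style"], toy["num-of-cylinders"], normalize="columns")

# 3. Entries of the entire table sum to 1
overall = pd.crosstab(toy["body-style"], toy["num-of-cylinders"], normalize="all")

print(by_row.round(2))
```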

There are many instances where we need to analyze data that is a mix of numerical and categorical variables so let’s take a look at how that can be accomplished, now that we know how to tackle each data type independently.

Question 8:

Create a series of boxplots demonstrating the distribution of price (numerical variable in the y-axis) for different body styles (categorical variable in the x-axis).

Answer:

# Set the figure size
plt.figure(figsize = (10, 5))

# Create the boxplots
sns.boxplot(x = df['body-style'], y = df['price'])
plt.show()

Results:

Boxplot of car prices stratified by body style

I personally find this visualization very informative. For example, we can see that hatchbacks have a relatively smaller price range compared to hardtops or convertibles. Convertibles start from a higher price compared to other body styles and there seems to be a wide range of prices, based on various characteristics of the car.
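The numbers a boxplot draws (the median and the quartiles) can also be tabulated directly, which helps when comparing groups precisely. A minimal sketch on made-up prices, not the car data:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "body-style": ["hatchback"] * 5 + ["convertible"] * 5,
    "price": [6000, 7000, 8000, 9000, 10000,
              12000, 17000, 22000, 30000, 37000],
})

# Quartiles per group: the box spans q25..q75 and the line inside is the median
quartiles = (
    toy.groupby("body-style")["price"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
quartiles.columns = ["q25", "median", "q75"]

# Interquartile range, i.e. the height of the box
quartiles["iqr"] = quartiles["q75"] - quartiles["q25"]

print(quartiles)
```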

What if we wanted to just focus on sedans and see how the price range changes with the number of cylinders? Let’s create the boxplots.

# Set the figure size
plt.figure(figsize = (10, 5))

# Filter to sedans once, then create the boxplots
sedans = df[df['body-style'] == 'sedan']
sns.boxplot(data = sedans, x = 'num-of-cylinders', y = 'price')
plt.show()

Results:

Boxplot of sedans’ prices stratified by number of cylinders

As expected, the price increases as the number of cylinders increases.

Below is the notebook with both questions and answers for reference and practice.

In this post, we introduced multivariate analysis as a tool to find hidden patterns within the data and then worked through implementation of such analysis for numerical and categorical variables and the mix of the two. We utilized visualization tools such as scatterplots and boxplots to visualize the relationship among variables and quantified such correlations in some instances.

If you found this post helpful, please follow me on Medium and subscribe to receive my latest posts!


