
Brief Introduction to Correspondence Analysis | by Gustavo Santos | Jan, 2023



Datasets are made of numbers and/or text, so we should expect that not every variable will be numeric. Numeric variables, by the way, can count on many techniques for analysis and testing.

When we are working with numerical variables, there are tools like correlation, PCA, scaling, normalization, and a bunch of tests. On the other hand, if we’re working with text, and more specifically with categories, we need other techniques for our data analysis.

One of these tools is Correspondence Analysis [CA].

Correspondence analysis is a statistical technique that can show us the relationships between the categories of two variables, based on data given in a contingency table.

As seen in the definition, it is a statistical tool. Conceptually, it is similar to Principal Component Analysis [PCA], but applied to categorical data: it lets us display a dataset in a 2D graphic, showing which categories correspond (or relate) to which.

CA can be useful for a Data Scientist in many ways, like understanding how different types of customers buy a set of products, which types of movies are preferred by each age range or, as in the example for this tutorial, what types of products are bought at register 1 and register 2.

We’ll start by importing the necessary libraries and creating some data.

# Imports
library(tidyverse)
library(ggrepel)
library(sjPlot) # contingency tables
library(FactoMineR) # CA functions
library(ade4) # Create CA

Creating the data.

# Dataset

df <- data.frame(
  trans_id = 1:30,
  register = as.factor(c('rgs1', 'rgs1', 'rgs1', 'rgs2', 'rgs1', 'rgs2',
                         'rgs1', 'rgs2', 'rgs1', 'rgs2', 'rgs1', 'rgs2',
                         'rgs1', 'rgs2', 'rgs1', 'rgs2', 'rgs1', 'rgs2',
                         'rgs1', 'rgs2', 'rgs1', 'rgs2', 'rgs1', 'rgs2',
                         'rgs1', 'rgs2', 'rgs1', 'rgs2', 'rgs1', 'rgs2')),
  product1 = as.factor(c('banana', 'banana', 'pasta', 'milk', 'yogurt',
                         'milk', 'pasta', 'milk', 'pasta', 'milk', 'banana',
                         'milk', 'banana', 'banana', 'pasta', 'bread', 'bread',
                         'milk', 'yogurt', 'bread', 'banana', 'pasta', 'yogurt', 'milk',
                         'yogurt', 'bread', 'bread', 'pasta', 'milk', 'banana')),
  product2 = as.factor(c('strawberries', 'strawberries', 'sauce', 'bread', 'water',
                         'bread', 'sauce', 'bread', 'sauce', 'bread', 'strawberries',
                         'bread', 'strawberries', 'bread', 'water', 'bread', 'water',
                         'bread', 'bread', 'yogurt', 'strawberries', 'sauce',
                         'strawberries', 'bread', 'strawberries', 'milk', 'bread',
                         'sauce', 'bread', 'strawberries'))
)

This is a sample of the data. For each transaction we have the register that processed it and a pair of products.

Sample of the dataset created. Image by the author.
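CA works from a contingency table, so it may help to see one being built. The sketch below uses only the first six transactions of the dataset above (the first_six name is just for illustration) and counts how often each register and product1 category occur together.

```r
# First six transactions of the dataset above (illustrative subset)
first_six <- data.frame(
  register = c("rgs1", "rgs1", "rgs1", "rgs2", "rgs1", "rgs2"),
  product1 = c("banana", "banana", "pasta", "milk", "yogurt", "milk")
)

# Rows are registers, columns are products, cells are co-occurrence counts
tab <- table(first_six$register, first_six$product1)
print(tab)

# Margins add the totals per register and per product
print(addmargins(tab))
```

On the full dataset the same call would be table(df$register, df$product1).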

The first step to perform a CA is to run the statistical tests. As we’re working with more than a pair of variables, we have to perform Chi-squared tests for each pair of variables, and every variable must show a statistically significant association with at least one other variable. For example, product1 must pass the test with either product2 or register.

The test to be performed is a hypothesis test where:

Ho: the variables are not associated (we fail to reject it when p-value > 0.05) and

Ha: the variables are associated (we reject Ho in its favor when p-value ≤ 0.05).
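To make the test less of a black box, here is a rough sketch of the Chi-squared computation done by hand for register and product1. The counts in observed were tallied from the dataset above; the variable names are illustrative.

```r
# Observed counts for register x product1, tallied from the dataset above
observed <- matrix(
  c(5, 2, 1, 4, 4,
    2, 3, 7, 2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("rgs1", "rgs2"),
                  c("banana", "bread", "milk", "pasta", "yogurt"))
)

# Expected counts under Ho: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic and its degrees of freedom
chi2_stat <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: probability of a statistic at least this large if Ho were true
p_manual <- pchisq(chi2_stat, df = dof, lower.tail = FALSE)

# chisq.test() should agree; its small-count warning is silenced here
p_builtin <- suppressWarnings(chisq.test(observed)$p.value)

print(c(statistic = chi2_stat, df = dof,
        p_manual = p_manual, p_builtin = p_builtin))
```

With a sample this small every expected count is below 5, so chisq.test() warns that the approximation may be inaccurate. Passing simulate.p.value = TRUE is one way to cross-check the result.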

One way to quickly test the pairs of variables is using a for loop.

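# Columns 2 to 4 are register, product1 and product2
for (var1 in 2:4) {
  for (var2 in 4:2) {
    # Contingency table for the pair, then its Chi-squared test
    contingency <- table(df[, var1], df[, var2])
    chi2 <- chisq.test(contingency)
    writeLines(paste("p-Value for",
                     colnames(df)[var1], "and", colnames(df)[var2],
                     chi2$p.value))
  }
}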

p-Value for register and product2 0.0271823155904414
p-Value for register and product1 0.0318997966416755
p-Value for register and register 3.2139733725587e-07
p-Value for product1 and product2 9.51614574849618e-06
p-Value for product1 and product1 5.49284039685425e-18
p-Value for product1 and register 0.0318997966416755
p-Value for product2 and product2 8.43312760405718e-20
p-Value for product2 and product1 9.51614574849618e-06
p-Value for product2 and register 0.0271823155904414

The results show that all the Chi² tests have p-values under the 0.05 threshold, so we can reject the null hypothesis in favor of the alternative and conclude that there is a statistically significant association between the variables. (The loop also tests each variable against itself and visits each pair twice; those lines can be ignored.)
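Since the loop tests every pair twice and each variable against itself, a possible alternative is to let combn() enumerate each pair once. This is only a sketch on a small made-up data frame (toy and its values are hypothetical), not the tutorial's dataset.

```r
# Hypothetical toy data standing in for the three categorical columns
toy <- data.frame(
  register = c("rgs1", "rgs1", "rgs2", "rgs2", "rgs1", "rgs2", "rgs1", "rgs2"),
  product1 = c("banana", "banana", "milk", "milk", "pasta", "milk", "banana", "pasta"),
  product2 = c("water", "water", "bread", "bread", "sauce", "bread", "water", "sauce")
)

# Every unordered pair of column names, once each
pairs <- combn(names(toy), 2, simplify = FALSE)

# Chi-squared p-value per pair (warnings about small counts are silenced)
p_values <- sapply(pairs, function(p) {
  suppressWarnings(chisq.test(table(toy[[p[1]]], toy[[p[2]]]))$p.value)
})
names(p_values) <- sapply(pairs, paste, collapse = " x ")

print(p_values)
```

For the tutorial's data, names(toy) would become colnames(df)[2:4], which yields three tests instead of nine.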

Another option is to use the function sjt.xtab() from the library sjPlot.

# Register x product1
sjt.xtab(var.row = df$register,
         var.col = df$product1,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE)

It displays a nicely formatted table with the observed values, the expected values in green, the percentages for each category, the p-value and the Chi² statistic.

Result from the sjt.xtab() function. Image by the author.
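If sjPlot is not available, a rough base-R approximation of the same information (observed counts plus row and column percentages) could look like the sketch below. The counts are the register x product1 tallies from the dataset above, and the object names are illustrative.

```r
# Observed counts for register x product1, tallied from the dataset above
observed <- matrix(
  c(5, 2, 1, 4, 4,
    2, 3, 7, 2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("rgs1", "rgs2"),
                  c("banana", "bread", "milk", "pasta", "yogurt"))
)

# Share of each product within a register (rows sum to 100)
row_pct <- round(100 * prop.table(observed, margin = 1), 1)

# Share of each register within a product (columns sum to 100)
col_pct <- round(100 * prop.table(observed, margin = 2), 1)

print(row_pct)
print(col_pct)
```

Half of register 2's transactions have milk as product1, while every yogurt in product1 went through register 1.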

It is time to create our Multiple Correspondence Analysis, or just MCA. We can use the function dudi.acm() from the ade4 library. The scannf = FALSE argument just prevents it from showing the eigenvalues bar plot.

# Creating the Multiple Correspondence Analysis
ACM <- dudi.acm(df[, 2:4],
                scannf = FALSE)

Once we run this, the output is a List of 12 objects in R. If we run ACM$co, for example, we will see the coordinates of each category for the 2 Principal Components calculated. This means the X and Y coordinates, or where each point will be placed on a 2D graphic.

ACM$co

                           Comp1       Comp2
register.rgs1          0.7660067  0.05610284
register.rgs2         -0.8754363 -0.06411753
product1.banana        0.8060812  0.99732829
product1.bread        -0.6784873 -0.03550393
product1.milk         -1.2068148  0.23776669
product1.pasta         0.6008691 -1.82914217
product1.yogurt        0.9497931  0.56723529
product2.bread        -0.9315168  0.26905544
product2.milk         -1.1707641 -0.10768356
product2.sauce         0.5351564 -1.96850658
product2.strawberries  1.0569306  1.00621404
product2.water         0.7961676 -0.40682587
product2.yogurt       -1.1707641 -0.10768356
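For intuition about where such coordinates come from, the following is a minimal base-R sketch of MCA as a correspondence analysis of the indicator (one-hot) matrix, run on a small made-up data frame. It is not ade4's implementation, and toy, Z and the other names are illustrative.

```r
# Hypothetical toy data: 8 transactions, 2 categorical variables
toy <- data.frame(
  register = c("r1", "r1", "r1", "r2", "r2", "r2", "r1", "r2"),
  product  = c("banana", "banana", "pasta", "milk", "milk", "bread", "pasta", "bread")
)

# Indicator matrix: one 0/1 column per category of each variable
Z <- do.call(cbind, lapply(names(toy), function(v) {
  f <- factor(toy[[v]])
  m <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
  colnames(m) <- paste(v, levels(f), sep = ".")
  m
}))

P  <- Z / sum(Z)   # correspondence matrix
r  <- rowSums(P)   # row masses
cm <- colSums(P)   # column masses (category weights)

# Standardized residuals, then singular value decomposition
S  <- diag(1 / sqrt(r)) %*% (P - outer(r, cm)) %*% diag(1 / sqrt(cm))
sv <- svd(S)

# Squared singular values are the eigenvalues (principal inertias)
eig <- sv$d^2

# Category coordinates on the first two dimensions
coords <- (diag(1 / sqrt(cm)) %*% sv$v %*% diag(sv$d))[, 1:2]
dimnames(coords) <- list(colnames(Z), c("Comp1", "Comp2"))

print(round(eig, 4))
print(round(coords, 3))
```

The sign of each axis is arbitrary, so a mirrored map produced by another package is equivalent.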

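If we run ACM$cw, we can see the weight of each category, that is, its share of the dataset as a whole: its count divided by the number of transactions times the number of variables. These are proportions that add up to 1.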

ACM$cw

        register.rgs1         register.rgs2       product1.banana        product1.bread         product1.milk
           0.17777778            0.15555556            0.07777778            0.05555556            0.08888889
       product1.pasta       product1.yogurt        product2.bread         product2.milk        product2.sauce
           0.06666667            0.04444444            0.13333333            0.01111111            0.05555556
product2.strawberries        product2.water       product2.yogurt
           0.08888889            0.03333333            0.01111111
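These weights can be reproduced by hand, which helps when checking what ACM$cw reports. In the sketch below the counts were tallied from the dataset above, and counts and weights are illustrative names.

```r
# Category counts tallied from the dataset above
counts <- c(
  register.rgs1 = 16, register.rgs2 = 14,
  product1.banana = 7, product1.bread = 5, product1.milk = 8,
  product1.pasta = 6, product1.yogurt = 4,
  product2.bread = 12, product2.milk = 1, product2.sauce = 5,
  product2.strawberries = 8, product2.water = 3, product2.yogurt = 1
)

n_transactions <- 30
n_variables <- 3

# Each weight is the category count over (transactions * variables)
weights <- counts / (n_transactions * n_variables)

print(round(weights, 8))
print(sum(weights))  # the weights add up to 1
```

Rare categories such as product2.milk and product2.yogurt (one transaction each) get very small weights.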

In an MCA, we are able to extract n = num_categories - n_variables dimensions. In this exercise, we have 3 variables (register, product1 and product2) and 13 categories (2 registers, 5 categories in product1 and 6 in product2, with bread, milk and yogurt appearing in both product columns). So, 13 - 3 = 10 dimensions.

Therefore, we can see the 10 eigenvalues, one per dimension, with ACM$eig. Put simply, these values are the amount of variance captured by each dimension.

ACM$eig
[1] 0.77575767 0.64171051 0.54102510 0.44643851 0.33333333 0.25656245 0.15516469 0.10465009 0.05690406 0.02178693

# Variance from each dimension
perc_variance <- (ACM$eig / sum(ACM$eig)) * 100

[1] 23.272730 19.251315 16.230753 13.393155 10.000000 7.696873 4.654941 3.139503 1.707122 0.653608
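As a quick check on these numbers, the sketch below copies the eigenvalues printed above and accumulates the percentages, which shows how much of the total variance a 2D map keeps. It also relies on a general property of MCA: the eigenvalues add up to (categories - variables) / variables. The object names are illustrative.

```r
# Eigenvalues copied from the ACM$eig output above
eig <- c(0.77575767, 0.64171051, 0.54102510, 0.44643851, 0.33333333,
         0.25656245, 0.15516469, 0.10465009, 0.05690406, 0.02178693)

n_categories <- 13
n_variables <- 3

# Number of dimensions and total variance (inertia) implied by the counts
n_dims <- n_categories - n_variables
total_inertia <- n_dims / n_variables

# Percentage per dimension and running total
perc_variance <- eig / sum(eig) * 100
cum_variance <- cumsum(perc_variance)

print(c(n_dims = n_dims, total_inertia = total_inertia, sum_eig = sum(eig)))
print(round(cum_variance, 2))
```

The first two dimensions together keep roughly 42.5% of the variance, so the 2D map is a simplification of the full structure.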

The final step is creating the perceptual map, where we will see the categories plotted on a graphic. To do that, we must create a base data frame to hold the names of the categories and their respective X and Y coordinates. First, let’s check how many categories each variable holds.

# How many categories by variable
qty_categories <- apply( df[,2:4], 2, function(x) nlevels(as.factor(x)) )

register product1 product2
2 5 6

Great. Now we will create a data.frame object from the coordinates in ACM$co, which carries the category names and the X and Y coordinates, plus a Variable column with the name of the variable (register, product1 or product2) for labeling purposes.

# Create the df with coordinates
df_ACM <- data.frame(ACM$co,
                     Variable = rep(names(qty_categories),
                                    qty_categories))

The X and Y coordinates. Image by the author.
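The rep() call is what lines each coordinate row up with its variable: it repeats every variable name as many times as that variable has categories, in the same order as the rows of ACM$co. A standalone sketch using the counts printed above (variable_labels is an illustrative name):

```r
# Number of categories per variable, as printed above
qty_categories <- c(register = 2, product1 = 5, product2 = 6)

# Repeat each variable name once per category it holds
variable_labels <- rep(names(qty_categories), qty_categories)

print(variable_labels)
print(table(variable_labels))
```

The result has 13 entries, one for each row of the coordinates table.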

From here, it’s just a matter of creating the plot using ggplot2.

We will start with the df_ACM object, turn the row names into a column ( rownames_to_column() ), then rename that column to Category. Next, we mutate names like product1.banana to just banana. Then we pipe this new data frame into ggplot, providing x = Comp1, y = Comp2 and label = Category, with a different color for each variable. It will be a scatterplot ( geom_point ), and we use geom_label_repel so the names do not sit on top of the points. geom_vline and geom_hline create the reference lines at 0.

# Plotting the perceptual map

df_ACM %>%
  rownames_to_column() %>%
  rename(Category = 1) %>%
  # Strip the variable prefix; fixed = TRUE treats the dot literally
  mutate(Category = gsub("register.", "", Category, fixed = TRUE),
         Category = gsub("product1.", "", Category, fixed = TRUE),
         Category = gsub("product2.", "", Category, fixed = TRUE)) %>%
  ggplot(aes(x = Comp1, y = Comp2, label = Category, color = Variable)) +
  geom_point() +
  geom_label_repel() +
  geom_vline(aes(xintercept = 0), linetype = "longdash", color = "grey48") +
  geom_hline(aes(yintercept = 0), linetype = "longdash", color = "grey48") +
  # perc_variance was computed earlier from ACM$eig
  labs(x = paste("Dimension 1:", paste0(round(perc_variance[1], 2), "%")),
       y = paste("Dimension 2:", paste0(round(perc_variance[2], 2), "%"))) +
  theme_bw()
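The three gsub() calls strip one variable prefix each. A possible shortcut, shown here on a few of the row names from ACM$co, is a single anchored pattern that removes everything up to and including the first dot. The labels and clean names are illustrative.

```r
# A few row names as they appear in ACM$co
labels <- c("register.rgs1", "product1.banana", "product2.strawberries")

# "^[^.]+\\." matches the leading run of non-dot characters plus the dot
clean <- sub("^[^.]+\\.", "", labels)

print(clean)
```

Inside the pipeline this would replace the mutate() step with mutate(Category = sub("^[^.]+\\.", "", Category)).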

Here’s the result.

Perceptual Map for the MCA. Image by the author.

The result shows us some interesting insights:

  • Register 1 receives more fruit (strawberries and bananas), some water, and some pasta and sauce.
  • Register 2 processes many more bread and milk or yogurt transactions.
  • Notice that pasta and sauce are far from both registers. That’s because they were sold at both: there were 4 pasta transactions on register 1 and 2 on register 2 (3 and 2 of them with sauce). The other combinations were processed mostly by one register or the other.
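One way to cross-check this reading of the map is to look at the standardized residuals of the Chi-squared test, which flag the cells where the observed counts depart most from what independence would predict. A sketch for register x product1, with counts tallied from the dataset above and illustrative object names:

```r
# Observed counts for register x product1, tallied from the dataset above
observed <- matrix(
  c(5, 2, 1, 4, 4,
    2, 3, 7, 2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("rgs1", "rgs2"),
                  c("banana", "bread", "milk", "pasta", "yogurt"))
)

# The small-count warning is silenced; the residuals are still informative
test <- suppressWarnings(chisq.test(observed))

# Positive values: more transactions than expected under independence
std_res <- round(test$stdres, 2)
print(std_res)
```

Milk comes out positive for register 2 and negative for register 1, while yogurt as product1 leans toward register 1, in line with the map.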

MCA is a powerful tool. You should look into it and use it whenever possible to build good analyses if you’re dealing with categorical data. However, keep in mind that it becomes harder to apply as the number of variables and categories increases.

Imagine, for example, a dataset with 30 variables with 5 categories each! That’s a lot to test and analyze. In that case, other techniques may be better, or perhaps some transformations to decrease the number of categories, or a subset of the data to be analyzed.
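To put rough numbers on that scenario, and to illustrate one possible transformation, the sketch below counts the pairwise tests and dimensions involved and then merges rare categories into a single level. The lump_rare() helper is hypothetical (written for this example, not part of any package), and the product2 counts are tallied from the dataset above.

```r
# Scale of the problem for 30 variables with 5 categories each
n_variables <- 30
n_levels <- 5
n_pair_tests <- choose(n_variables, 2)                # pairwise Chi-squared tests
n_dimensions <- n_variables * n_levels - n_variables  # categories - variables

# Hypothetical helper: merge categories seen fewer than min_count times
lump_rare <- function(x, min_count = 2, other = "other") {
  x <- as.character(x)
  counts <- table(x)
  x[x %in% names(counts)[counts < min_count]] <- other
  factor(x)
}

# product2 counts from the dataset above (order does not matter here)
product2 <- rep(c("bread", "strawberries", "sauce", "water", "milk", "yogurt"),
                times = c(12, 8, 5, 3, 1, 1))
lumped <- lump_rare(product2)

print(c(pair_tests = n_pair_tests, dimensions = n_dimensions))
print(table(lumped))
```

Here milk and yogurt, seen once each in product2, collapse into "other", taking product2 from 6 categories to 5.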

If you liked this content, follow my blog for more.


