Exploratory Correlational Analysis in R

by Brendan Mapes | May 2023


Correlational analysis is one of the most basic and foundational ways to explore the relationship between two or more variables. You may have already performed correlational analysis at some point using R, and it might have looked something like this:

cor_results <- cor.test(my_data$x, my_data$y,
                        method = "pearson")

cor_results

With an output looking like:

[Image: example cor.test() output for the test data]

This is the base R way to run a simple correlation for two variables that you’ve already picked out in advance.

But what if you didn’t really know what you were looking for? If you’re just at the stage of exploratory data analysis, you might not know which variables you’re interested in, or where to look for associations. What would be useful in that case is the ability to pick one variable of interest and check it against a dataset with many, maybe even hundreds, of other variables, to find a good starting point for further analysis. Thanks to the rstatix package developed by kassambara, there is a quick and relatively painless way to do this.
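rstatix is on CRAN, as is the tidyverse, which the code below leans on for dplyr, tidyr, and ggplot2:

install.packages(c("rstatix", "tidyverse"))
library(rstatix)
library(tidyverse)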

Getting the Data

As an example, I’ll use data from the World Bank’s World Development Indicators (WDI), a repository of open-access data on global development indicators. We could download the data from the website linked above, but there’s also an R package for that:

install.packages("WDI")
library(WDI)

Specific data series can be imported from WDI using the WDI() function, but because we’re interested in exploratory analysis covering possible relationships among lots of variables, I’m going to bulk download the whole database…

bulk <- WDIbulk(timeout = 600)
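WDIbulk() hands back a list; the elements used below are bulk$Data (the observations), bulk$Series (indicator metadata), and bulk$Country (country metadata). A quick way to orient yourself:

# Peek at the top-level structure of the bulk download
str(bulk, max.level = 1)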

Let’s say we’re interested in trying to figure out what other country characteristics might be associated with countries that trade more, relative to the size of their economy, and we’re also interested in data from the year 2020.
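If you’re not sure of an indicator’s code, you can search the series metadata by name. Here’s a minimal sketch, assuming the series is labeled "Trade (% of GDP)" in bulk$Series (adjust the pattern if the search comes up empty):

# Look up the indicator code for trade as a share of GDP
bulk$Series %>%
  filter(grepl("Trade (% of GDP)", Indicator.Name, fixed = TRUE)) %>%
  select(Series.Code, Indicator.Name)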

Once we’ve identified the right variable (here I’m going to use Trade as a % of GDP, indicator code NE.TRD.GNFS.ZS), we need to clean up the data a bit. First we’ll build a list of series with annual periodicity to filter on, then apply a second filter so that we only keep variables with a reasonable number of observations for the analysis (arbitrarily, n > 100 in the example below).

# Create a filtered set with only annual variables
filtered <- bulk$Series %>% filter(Periodicity == "Annual")

# Create a list of variables to correlate against trade levels
bulk$Data %>%
  filter(Indicator.Code %in% filtered$Series.Code) %>%
  filter(year == 2020) %>%
  group_by(Indicator.Code) %>%
  filter(!is.na(value)) %>%
  count() %>%
  arrange(n) %>%
  filter(n > 100) -> vars_list
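As a quick sanity check, we can confirm how many candidate indicators survived the filters:

# Number of indicators that cleared the n > 100 threshold
nrow(vars_list)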

Running the Analysis

So now we have a list of about 790 variables to run through, to see which might be correlated with our trade-level variable. Checking them manually, or with a loop over cor.test() from base R, would take forever. This is where the cor_test() function in rstatix shines: it runs quickly, it returns the results as a tibble (which makes additional manipulation and analysis easy), and it is pipe-friendly, so filtering, mutating, and execution steps can be chained into a single piped workflow. rstatix also accepts grouped input and produces grouped output, which we’ll see an example of a little later.

So, to run the analysis:

# Because WDI contains regional aggregates as well, we'll create a list
# that only has country codes, and filter our input based on that list
countries <- bulk$Country %>% filter(!Region == "") %>% as_tibble()

bulk$Data %>%
  filter(Indicator.Code %in% vars_list$Indicator.Code) %>%
  filter(year == 2020) %>%
  filter(Country.Code %in% countries$Country.Code) %>%
  select(-Indicator.Name) %>%
  pivot_wider(names_from = Indicator.Code,
              values_from = value) %>%
  cor_test(NE.TRD.GNFS.ZS,
           method = "pearson",
           use = "complete.obs") -> results

results

This populates a tidy tibble with the variable pairings, the correlation coefficient (cor), the t-statistic, the p-value (p), and the low and high bounds of the confidence interval.
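A quick way to confirm the shape of the output (the column names below reflect rstatix’s tidy output and may vary slightly by version):

# Check the columns: var1, var2, cor, statistic, p, conf.low, conf.high, method
glimpse(results)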

Because the output is a tibble, we can sort and decompose it however we want. Let’s make a key with the variable names and descriptions, join that into our output data, filter for only the variable pairings that were significant at the p < 0.05 level, and check which variables had the highest r values:

indicator_explanation <- bulk$Series %>%
  select(Series.Code, Indicator.Name, Short.definition) %>%
  as_tibble()

results %>%
  left_join(indicator_explanation, by = c("var2" = "Series.Code")) %>%
  arrange(desc(cor)) %>%
  filter(p < 0.05) %>%
  View()

Some of the variables with the highest correlations won’t be surprising: overall trade is highly positively correlated across countries with trade in services and trade in merchandise, for example. Others might be more unexpected, like the moderately strong positive correlation (r = 0.43) between a country’s level of trade and the amount of Official Development Assistance (aid funding) it receives.
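To pull that specific pairing out of the results tibble (DT.ODA.ODAT.GI.ZS is the ODA indicator code we’ll reuse in the next section):

# Isolate the trade ~ ODA correlation row
results %>%
  filter(var2 == "DT.ODA.ODAT.GI.ZS")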

Grouped Analysis

So what if we wanted to look into this relationship more? For example — is the relationship still strong if we look at other years besides 2020? This is where the pipe-friendly nature of cor_test() again comes in handy.

Let’s filter our initial data to include just the two indicators we’re interested in, and then group the data by year before piping it into cor_test() this time:

bulk$Data %>%
  filter(Indicator.Code %in% c("NE.TRD.GNFS.ZS", "DT.ODA.ODAT.GI.ZS")) %>%
  filter(Country.Code %in% countries$Country.Code) %>%
  select(-Indicator.Name) %>%
  filter(year < 2021) %>%
  pivot_wider(names_from = Indicator.Code,
              values_from = value) %>%
  group_by(year) %>%
  cor_test(NE.TRD.GNFS.ZS, DT.ODA.ODAT.GI.ZS,
           method = "pearson",
           use = "complete.obs") -> results_time

This gives us the correlation between the two variables in each year with observations (I filtered to years before 2021 because the ODA data only runs through 2020). And because the correlation data is stored in a tidy way, we can easily run additional code to visualize the results:

results_time %>%
  mutate(`Significant?` = if_else(p < 0.05, "Yes", "No")) %>%
  ggplot(aes(x = year, y = cor)) +
  geom_hline(yintercept = 0,
             linetype = "dashed") +
  geom_line() +
  ylab("cor (r)") +
  geom_point(aes(color = `Significant?`)) +
  theme_minimal()

Here we can see that historically there wasn’t much of a relationship between these two variables at all, apart from a few scattered years with a weak negative relationship, but in the past few years the correlation has trended significant and positive.

