
Dimension Reduction: Facing the Curse of Dimensionality

by Victor Graff, April 2023



Photo by Kolleen Gladden on Unsplash

Many data scientists have to deal with the challenge of dimensionality. Data sets can contain huge numbers of variables, making them complex to understand and compute. For example, an asset manager can be overwhelmed by the many dynamic variables associated with a portfolio, and processing a large amount of data can lead to computational issues. Reducing the dimension is a way to compress the information from a large number of variables into a smaller set of reduced variables, without losing too much explanatory power. In other words, dimension reduction methods can be seen as the search for a sub-space that minimises the reconstruction error.

Several methods exist for this information extraction, each suited to different use cases. This article provides a detailed comparison of two of them: principal component analysis (PCA) and the dynamic factor model (DFM). PCA can be used on any type of structured dataset, while the dynamic factor model is designed for time series applications, as it embeds the evolution of the series over time.

The analysis is carried out on economic and financial data. The data used for this study is a replication of the data used in the article "Measuring Uncertainty and Its Impact on the Economy" by Clark, Carriero and Marcellino, available on the Harvard Dataverse. It consists of 18 macroeconomic variables and 12 financial variables, covering their evolution from 1960 to 2014. The data is transformed to ensure stationarity before being processed by the dimension reduction algorithms.
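
The exact transformation codes ship with the replication data; as a rough illustration only (an assumption, not the paper's exact recipe), a common way to obtain stationary series is to take (log-)differences:

import numpy as np
import pandas as pd

def to_stationary(series: pd.Series, use_log: bool = True) -> pd.Series:
    """Illustrative transform: (log-)difference a level series to make it stationary."""
    values = np.log(series) if use_log else series
    return values.diff().dropna()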

The entire code is available on GitHub.

The theory

PCA can be seen as an unsupervised method for dimension reduction. Let’s say we have a large number of variables. All of them seem of interest for the analysis, but there is no obvious way to aggregate them into categories. In this case, the algorithm is in charge of reducing the dimension without specific input from the modeller. In other words, the algorithm creates a smaller number of variables, called the reduced components, that are able to closely reproduce the initial variables.

PCA relies on the covariance of the variables. If two variables are highly covariant, they follow the same trend. The first one is then highly efficient in reproducing the second, making it possible to keep only the first variable without losing the ability to recreate the second if needed. PCA builds a reduced set of components that captures as much of the covariance of the initial variable set as possible, in order to store the maximum amount of information in a lower dimension.

The idea of the method is to compute an orthogonal basis of the space spanned by the original set of variables. The vectors forming this basis are the eigenvectors of the variance-covariance matrix. Reducing the dimension then simply means selecting the eigenvectors that are most representative of the initial data: those that carry most of the covariance. The amount of covariance carried by each vector is quantified by its eigenvalue: the larger the eigenvalue, the more informative its associated eigenvector.

The process of the PCA algorithm is as follows:

1. Computing the covariance matrix

2. Computing its eigenvectors and eigenvalues

3. Sorting the eigenvalues and keeping the vectors that embed the most information

The ratio of each eigenvalue to the sum of all eigenvalues represents the share of covariance contained in its associated eigenvector. The remaining task is to determine how many eigenvectors to keep. Different considerations come into play for this selection, as we will see in the next section.
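
To make these three steps concrete, here is a minimal NumPy sketch of the algorithm (illustrative only; the next section uses the ready-made sklearn implementation):

import numpy as np

def pca_by_eigendecomposition(X, n_components):
    """Reduce X (n_samples x n_features) to n_components using the covariance eigenbasis."""
    X_centered = X - X.mean(axis=0)                  # centre each variable
    cov = np.cov(X_centered, rowvar=False)           # 1. covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # 2. eigenvalues and eigenvectors
    order = np.argsort(eigenvalues)[::-1]            # 3. sort by decreasing eigenvalue
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    explained_ratio = eigenvalues / eigenvalues.sum()      # share of covariance per eigenvector
    scores = X_centered @ eigenvectors[:, :n_components]   # projection onto the reduced basis
    return scores, explained_ratio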

Application to the data

Python makes it straightforward to define the PCA model, as it is available ready to use in the sklearn library. The parameter n_components can first be set to a relatively large value in order to compare the eigenvalues, before selecting the right number of components to keep. Once the model is fitted, the explained variance ratios are available in descending order, to help us make the decision. The plot below shows the share of covariance embedded in each component.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# u_data holds the stationarised dataset (one column per variable)
pca = PCA(n_components=5).fit(u_data)
plt.plot(pca.explained_variance_ratio_)  # explained variance ratio of each component

The usual rule to select the right number of eigenvectors is to look for the “elbow” in the plot of these values. Taking the vectors up to the elbow gives a good trade-off between the information retained and the resulting dimension. In this case, we keep the first two components.

The explained variance ratios of the first two components are 69.6% and 9.7%. Thus, by keeping only two components, we retain almost 80% of the information contained in the initial data, while reducing the dimension from 30 to 2!
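
This figure can be checked directly from the fitted model defined above:

# Cumulative share of variance retained by the first two components (about 0.79 here)
print(pca.explained_variance_ratio_[:2].sum())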

To sum up, PCA is a great tool for dimension reduction. It is easy to deploy and it retains a large share of the information while seriously decreasing the dimension. Nonetheless, PCA works as a black box that prevents any meaningful interpretation of the resulting components. Also, PCA works on any type of structured data but does not capture the dynamics of the data when it comes in the form of a time series.

The next section covers the dynamic factor models, which represent a potential solution to counter these limitations.

The theory

The dynamic factor model is used to observe the evolution of N variables over time (assembled in a vector Xt) with a reduced number of dynamic common factors. The strength of this method is that it embeds the co-movement of a large number of variables into a smaller number of components.

This method is suited to time series applications. As a result, it is widely used in finance and economics, where many essential variables co-evolve over time.

The DFM defines the vector Xt as a linear combination of current and past values of the reduced factors (ft). The factors are themselves dynamic, i.e. defined in an autoregressive way. The number of reduced components is q, and the lag of the autoregression is p.
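
Written out, a standard formulation of the model, consistent with the dimensions described just below, is:

X_t = λ_0 f_t + λ_1 f_{t-1} + … + λ_p f_{t-p} + ε_t
f_t = ψ_1 f_{t-1} + … + ψ_p f_{t-p} + η_t

where ε_t and η_t are error terms.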

Each λ is an (N x q) matrix, with q the number of reduced components, each ft is a (q x 1) vector and each ψ is a (q x q) matrix. The “dynamic” part is captured by the fact that each reduced vector ft follows a vector autoregressive process, and is thus itself computed from past values of f. Moreover, the vector Xt depends on both present and past values of the factors.

A very important aspect of the DFM is that the number of components is defined ahead of the computation, based on qualitative knowledge of the data. This can be an interesting feature when the variables are easily categorised, but it can also be a challenge if no meaningful category emerges.

Once the number of factors is defined, the main method to estimate the components is Gaussian maximum likelihood estimation (MLE). ε and η are assumed to follow a Gaussian distribution, and the MLE maximises the likelihood of the observed sample (Xt, ft) by tuning the Gaussian parameters (the mean and standard deviation). Fortunately, this step is directly implemented in the Python libraries, making the computation easy.

Once estimated, each component represents the category it was assigned to. This results in as many components as categories defined, and allows us to reduce the dimension in an efficient and qualitatively meaningful way.

Application to the data

The DFM is applied to the same dataset as presented above. Good news here: we directly have two obvious categories, macroeconomics and finance.

Python provides a DFM implementation in the statsmodels library: DynamicFactorMQ. To build the model, a few parameters are needed. First, and obviously, the initial data that we aim to reduce. Second, a dictionary that associates each variable with its category (technically, each variable can belong to more than one category, but we won’t cover this case here).

factors = dict()
for macro_variable in macro_variables.values():
    factors[macro_variable] = ["Macro"]
for finance_variable in finance_variables.values():
    factors[finance_variable] = ["Finance"]

Then, we define the lag order of the VAR model associated with each factor ft, i.e. how many past time steps influence the current state of the factor. In our case, a lag of one seems sufficient. Increasing the lag obviously increases the computational cost, but it can significantly improve the model by providing longer-range time information at each step.

Finally, the idiosyncratic component needs to be defined. This component represents the part of the vector Xt that cannot be explained by current and past values of ft; it can be seen as the residuals in a linear regression. It can be fitted either as an AR(1) process or as white noise. From an economic point of view, this choice is significant: do we assume that the residuals of the model are autoregressive (i.e. present and past values are correlated) or independent and identically distributed? For economic studies, an uncorrelated idiosyncratic term is often unrealistic, because the measurement method usually induces correlated errors.

from statsmodels.tsa.statespace.dynamic_factor_mq import DynamicFactorMQ

factor_model = DynamicFactorMQ(u_data,
                               factors=factors,
                               factor_orders={'Macro': 1, 'Finance': 1},
                               idiosyncratic_ar1=True,
                               standardize=False)
model_results = factor_model.fit(disp=30)

The next question that clearly arises is: which method should be used? Well, as expected, it depends on what we are looking for.

Let’s summarise the pros and cons of each model.

PCA

  • Can be applied to any type of structured data
  • No ex ante knowledge of the data required for computation
  • Rule of thumb to select the number of reduced components
  • Unsupervised process

DFM

  • Application on time series data
  • Qualitative knowledge of the data to determine the categories embedded in the reduced factors
  • Pre-determined number of reduced components

At first sight, PCA seems to gather more interest than the DFM, but there is more to consider before making a decision. The main difference between the two is the ability of the DFM to provide a meaningful explanation of its results.

Readability

First, let’s have a look at the created components.
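
Both sets of factors can be extracted and plotted directly; the sketch below assumes the pca and model_results objects fitted above (the column names given to the PCA components are illustrative):

import pandas as pd
import matplotlib.pyplot as plt

# DFM: smoothed estimates of the two factors (one per category)
dfm_factors = model_results.factors.smoothed

# PCA: the first two components, obtained by projecting the original data
pca_factors = pd.DataFrame(pca.transform(u_data)[:, :2],
                           index=u_data.index,
                           columns=["Component 1", "Component 2"])

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
dfm_factors.plot(ax=axes[0], title="DFM factors")
pca_factors.plot(ax=axes[1], title="PCA components")
plt.show()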

These two plots show the evolution of the two selected factors for each model. Interestingly enough, both models seem to separate a trend component (blue) from a volatility component (orange). The DFM gives us the advantage of knowing the meaning behind this observation: it is no surprise to see macro variables (e.g. GDP, housing prices, etc.) increase over time, and financial variables are known to be much more volatile. PCA seems to capture the same kind of information, but again, we can do no more than speculate to explain this phenomenon. Advantage to the DFM on this point.

Accuracy

Let’s get back to the purpose of dimension reduction methods: being a good substitute for the original data in a lower dimension. We therefore need to make sure that the models can accurately reproduce the original data.

Python provides, for both algorithms, an accessible way to reproduce the initial variables. For PCA, after transforming the data into its reduced space, the inverse_transform method maps each reduced observation back to the original space. The DFM model exposes all reconstructed series in its fittedvalues attribute.

# PCA: project onto the reduced space, then map back to the original space
import pandas as pd

scores = pca.transform(u_data)
reconstruct = pd.DataFrame(pca.inverse_transform(scores),
                           index=u_data.index, columns=u_data.columns)
resid_pca = u_data - reconstruct

# DFM: the reconstructed series are available directly
reconstruct_dfm = model_results.fittedvalues

We can then easily plot these reconstructions for each model. The plot below shows an example for the unemployment rate variable.
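
A possible way to produce such a plot is sketched below; the column name used for the unemployment rate series is an assumption and should be adapted to the dataset’s labels:

col = "Unemployment rate"  # hypothetical column name, adjust to the dataset's labels

plt.figure(figsize=(10, 4))
plt.plot(u_data[col], label="Original data")
plt.plot(reconstruct[col], label="PCA reconstruction")
plt.plot(model_results.fittedvalues[col], label="DFM reconstruction")
plt.legend()
plt.title(f"Reconstruction of the {col.lower()} series")
plt.show()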

In this example, the DFM is obviously a better fit, as it stays consistently closer to the original data and shows far fewer deviations. To make a more global and quantitative assessment, let’s compute the residuals of both models on the entire dataset.

print(f"Residuals of DFM on global dataset: {np.round(np.abs(model_results.resid).sum().sum(), 2)}")
print(f"Residuals of PCA on global dataset: {np.round(np.abs(resid_pca).sum().sum(), 2)}")
Sum of residuals

The DFM clearly outperforms the PCA in reproducing the initial data. The categorisation and dynamics embedded in the model seem to have accurately captured the information in the initial variable set.

We have compared two methods for dimension reduction, each with its own advantages and drawbacks. We saw that, in the case presented here, the DFM is a better fit, but PCA is also of great interest. Let’s summarise:

When to prefer PCA?

  • There’s no time dynamic in the data
  • There’s no obvious categorisation of the initial data
  • There’s little qualitative knowledge of the initial data

When to prefer DFM?

  • Time dynamic is an important feature of the data
  • Understanding of the reduced components is required for analysis
  • Categorisation of the data is easily found

To sum up, neither algorithm surpasses the other in all contexts. It is the role of the modeller to assess what is best in each situation. Moreover, as we have seen, both models are easy to implement in Python. Implementing both helps increase the understanding of the data and leads to better solutions.

I hope this article was of interest and helped you understand the differences between these two models. Please feel free to give me any feedback or thoughts on this!

References

Clark, Todd; Carriero, Andrea; Marcellino, Massimiliano, 2017, “Replication Data for: “Measuring Uncertainty and Its Impact on the Economy””, https://doi.org/10.7910/DVN/ENTXDD, Harvard Dataverse, V3


