
Spectral Entropy — An Underestimated Time Series Feature | by Ning Jia | Dec, 2022



Time series are everywhere. As data scientists, we have various time series tasks, such as segmentation, classification, forecasting, clustering, anomaly detection, and pattern recognition.

Depending on the data and the approach, feature engineering can be a crucial step in solving those time series tasks. Well-engineered features help us understand the data better and boost models’ performance and interpretability. Feeding raw data to a black-box deep learning network may not work well, especially when data is limited or explainable models are preferred.

If you have worked on feature engineering for time series, you have probably tried building basic features such as the mean, the variance, lags, and statistics based on rolling windows.

In this post, I will introduce building features based on spectral entropy. I suggest including it as a must-try feature whenever frequency-domain analysis makes sense for your data. I will show how I tackled two time series classification problems using spectral entropy. I have previously shown how to apply spectral entropy to an anomaly detection problem; please refer to Anomaly Detection in Univariate Stochastic Time Series with Spectral Entropy.

My focus is on the application side, so I will skip some basic introductions and theories.

Frequency domain analysis and spectral entropy

Generally, time series data is saved in the time domain: the indexes are timestamps sampled at fixed intervals. Some time series show waveforms or seasonality, like sensor data (seismic vibrations, sound, etc.). We can think of something oscillating and generating the data as waves.

When strong wave patterns exist, transforming and analyzing the data in the frequency domain makes sense. The Fast Fourier Transform (FFT) is the classic way to move from the time domain to the frequency domain. Spectral entropy encodes the spectral density (the distribution of power across frequencies) into a single value based on Shannon entropy. If you are interested in a fundamental introduction, please check Anomaly Detection in Univariate Stochastic Time Series with Spectral Entropy.
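Here is a minimal sketch of how such a spectral entropy feature can be computed with NumPy. It is not the exact code behind the experiments below, and libraries such as antropy offer ready-made implementations with more options (Welch PSD estimation, windowing, etc.).

```python
import numpy as np

def spectral_entropy(x, normalize=True):
    """Shannon entropy of the normalized power spectrum of a 1-D signal."""
    psd = np.abs(np.fft.rfft(x)) ** 2          # power spectrum (positive frequencies only)
    psd = psd / psd.sum()                      # treat the spectrum as a probability distribution
    h = -np.sum(psd * np.log2(psd + 1e-12))    # Shannon entropy; epsilon avoids log(0)
    if normalize:
        h /= np.log2(len(psd))                 # scale to the range [0, 1]
    return h

# A pure tone concentrates power in one frequency bin (low entropy);
# white noise spreads power evenly across bins (entropy close to 1).
t = np.linspace(0, 1, 1024, endpoint=False)
print(spectral_entropy(np.sin(2 * np.pi * 50 * t)))   # close to 0
print(spectral_entropy(np.random.randn(1024)))        # close to 1
```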

Here is an analogy for a quick intro without going deep into the formulas.

Suppose we study how people spend their spare time. One person spends 90% on soccer. Another spends 90% on chess. Although their interests differ, it is almost certain that each will dedicate their spare time to their favourite hobby. In that sense they are similar, and entropy captures this similarity: both have the same low entropy, implying low uncertainty.

A third person spends 20% on hiking, 30% on reading, 20% on movies, and 30% on other activities. Clearly, this person differs from the first two: we cannot say which activity they are doing at any given moment. In this case, the entropy is high, meaning higher uncertainty.

Spectral entropy works the same way: how a person's time is distributed across hobbies corresponds to how a signal's power is distributed across frequencies.
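To make the analogy concrete, here is a toy calculation. The exact splits for the first two people are my own assumption, since only the 90% figure is given above.

```python
from scipy.stats import entropy

# Hypothetical spare-time distributions from the analogy above.
soccer_fan = [0.90, 0.05, 0.05]          # 90% on one hobby, remainder assumed
chess_fan  = [0.05, 0.90, 0.05]          # also 90% on one hobby
generalist = [0.20, 0.30, 0.20, 0.30]    # time spread across several activities

print(entropy(soccer_fan, base=2))   # ~0.57 bits: low uncertainty
print(entropy(chess_fan, base=2))    # same value, because the distribution shape is the same
print(entropy(generalist, base=2))   # ~1.97 bits: high uncertainty
```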

Next, let’s see two real-world examples of how spectral entropy works wonders. The datasets are not in the public domain, so please allow me to use vague descriptions and hide the details.

Signal selection

The goal for this dataset is to build a binary classifier from a few hundred samples. Each sample is a test result labelled pass or fail. One sample has close to 100 signals, and the length of the signals is constant. Figure 1 shows one example (each small plot has three signals).

Figure 1. One sample has around 100 signals (Image by Author).

If we extract N features from each signal, we will have 100*N features. Considering the small sample size, we will run into the “curse of dimensionality” problem.

Since we have a tremendous amount of data for each sample, let’s be selective: focus on the signals with the most predictive power and drop the irrelevant ones. I calculated the spectral entropy of each signal, giving only about 100 features, and then trained shallow trees on them.
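A sketch of this selection step, assuming the data is stacked into a 3-D array of shape (samples, signals, length). The variable names and the use of a single shallow decision tree are illustrative, not the exact pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def per_signal_spectral_entropy(samples):
    """One spectral entropy feature per signal, reusing the helper sketched earlier."""
    n_samples, n_signals, _ = samples.shape
    feats = np.empty((n_samples, n_signals))
    for i in range(n_samples):
        for j in range(n_signals):
            feats[i, j] = spectral_entropy(samples[i, j])
    return pd.DataFrame(feats, columns=[f"signal_{j}_se" for j in range(n_signals)])

# Hypothetical usage:
# X = per_signal_spectral_entropy(samples)                      # samples: (n, ~100, length)
# clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
# ranking = sorted(zip(X.columns, clf.feature_importances_), key=lambda p: -p[1])
# The top-ranked signals are the candidates for deeper, customized feature engineering.
```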

Among the most important features, only three signals showed a high correlation with the labels. After studying those signals, I constructed customized features accordingly. In the end, I built a model with high performance that requires only around ten input features, which gives us good generalization and interpretability.

Frequency band selection

This example is another binary classification problem. Each sample has only one time series, and the lengths vary between samples. The total sample size is less than 100.

The varying length is not a big deal. We can split the time series into smaller fixed-length segments and use the sample label as the segment label.
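For example, a simple non-overlapping split could look like the sketch below; the segment length and names are illustrative, and a sliding window with overlap would also work.

```python
import numpy as np

def segment(series, seg_len):
    """Split a 1-D series into non-overlapping fixed-length segments, dropping the tail."""
    n_segs = len(series) // seg_len
    return np.asarray(series[: n_segs * seg_len]).reshape(n_segs, seg_len)

# Hypothetical usage: every segment inherits its parent sample's label.
# segs = segment(sample_series, seg_len=4800)     # e.g. 0.1 s at 48 kHz
# seg_labels = np.full(len(segs), sample_label)
```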

The minor problem is that we have a relatively large range of frequencies. Because the sampling frequency is 48,000 Hz (which covers the range of sound the human ear can hear), the Nyquist theorem tells us the highest frequency in the frequency domain will be 24,000 Hz. Figure 2 is an example.

Figure 2. Example of raw signal and FFT result (Image by Author).

I tried spectral entropy on the whole spectrum first, but it couldn’t clearly distinguish positives from negatives. The main reason is that both classes have similar peak frequencies and harmonics, so their overall spectral distributions look nearly identical. As a result, spectral entropy over the entire frequency domain won’t show a significant difference.

Since the frequency resolution is high, let’s zoom into certain frequency bands instead of the whole spectrum. Hopefully, the subtle separations are hiding somewhere.

I split the spectrum into smaller bands, each spanning from X to X+100 Hz, so 24,000 Hz would give us 240 bands. The higher frequencies contained only low-power noise, so I ignored them, kept the lower frequencies from 0 to 3,000 Hz, and cut them into 30 bands. Then I calculated the spectral entropy of each band and trained a tree-based model on just those 30 features. This approach worked surprisingly well. Figure 3 shows the top two features (the spectral entropy of two bands). There is a reasonable decision boundary using only the bands 1,200 to 1,300 Hz and 2,200 to 2,300 Hz.
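A sketch of the band-wise computation, assuming the same 100 Hz bands from 0 to 3,000 Hz; the actual PSD estimation and windowing in the original work may differ.

```python
import numpy as np

def band_spectral_entropies(x, fs=48_000, f_max=3_000, band_width=100):
    """Normalized spectral entropy for each 100 Hz band between 0 and f_max."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2
    entropies = []
    for lo in range(0, f_max, band_width):
        band = psd[(freqs >= lo) & (freqs < lo + band_width)]
        p = band / (band.sum() + 1e-12)                          # band power as a distribution
        h = -np.sum(p * np.log2(p + 1e-12)) / np.log2(len(p))    # normalized to [0, 1]
        entropies.append(h)
    return np.array(entropies)   # 30 features with the defaults above

# These 30 values per segment feed the tree-based model; its feature importances
# then point to the most discriminative bands (1,200-1,300 Hz and 2,200-2,300 Hz here).
```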

Figure 3. Scatterplot on the spectral entropy of the top 2 frequency bands (Image by Author).

Figure 4 below shows the model’s performance on test data labelled positive. The top plot is the raw signal, the middle plot is the prediction for each segment, and the bottom plot is the entropy of the 1,200 to 1,300 Hz band for each segment. Most of the predictions are close to 1, and the entropy mostly falls between 0.9 and 0.93. Figure 5 shows test data labelled negative: the predictions drop close to 0, and the entropies vary between 0.8 and 0.9.

Figure 4. Example of test data labelled positive (Image by Author).
Figure 5. Example of test data labelled negative (Image by Author).

Conclusion

I have shown how spectral entropy helped me quickly find the most important signals and frequency bands for further feature engineering.

In those two examples, we didn’t have to use spectral entropy. For instance, we could build features like the peak frequency or the average magnitude of frequency bands, or even feed the entire frequency-domain data as one input vector. After all, the targets are separable in the frequency domain.

I like to explore features from the spectral entropy perspective because:

It’s easy to interpret and compute.

It significantly compresses the frequency-domain information while keeping its core.

The downside is that some information is lost. For instance, the absolute magnitude is not considered at all.

Furthermore, because we condense a list of frequency-domain values into a single value, different spectral distributions may end up with the same entropy, much like a hash collision. For example, a fair coin and a biased coin have different entropy, but a biased coin with probability X of heads and another biased coin with probability Y of tails have the same entropy whenever X equals Y.
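A quick numerical check of that coin example, with an illustrative bias of 0.7:

```python
from scipy.stats import entropy

coin_a = [0.7, 0.3]   # P(heads) = 0.7
coin_b = [0.3, 0.7]   # P(tails) = 0.7, the mirror image of coin_a
print(entropy(coin_a, base=2), entropy(coin_b, base=2))  # both ~0.881 bits
```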

I hope you see the benefits of spectral entropy and apply it to your own time series work in the future.

Thanks for reading.

Have fun with your time series.

