
Introducing PyCircular: A Python Library for Circular Data Analysis

By Alejandro Correa Bahnsen | Jan 2023



Photo by Patrick McManaman on Unsplash

In this post, I’m introducing PyCircular, a Python library designed specifically for analyzing circular data. As one of the authors, I am excited to share this tool with the community to help address the challenges of working with circular data.

Circular data, such as data representing angles, directions, or timestamps, presents unique challenges for analysis and modeling. Its periodic nature causes difficulties when applying traditional linear and kernel-based methods, which are not well suited to handle it. Circular data also raises issues when computing the mean and standard deviation, as these measures are not well defined on a circle.

PyCircular addresses these challenges by providing a set of tools and functionality tailored specifically for circular data. The library includes a variety of circular statistical methods, including distributions, kernels, and confidence intervals. It also includes visualization tools, such as circular histograms and distribution plots, to help you better understand your data.

https://github.com/albahnsen/pycircular

The remainder of this post dives deeper into the unique challenges of working with circular data and demonstrates how PyCircular can address them through a series of examples. You will see how PyCircular can be used to handle the periodic nature of circular data and to compute meaningful measures of central tendency and dispersion. You will also learn how the library’s visualization tools can help you better understand and interpret circular data. By the end of this post, you will have a solid understanding of how PyCircular can be used to effectively analyze and model circular data.

When training a machine learning model, we must have a dataset that includes input variables (features) and corresponding output variables (labels). The model learns to map the features to the labels, and the goal of training is to find the best set of parameters for this mapping.

In several applications, the features of the model consist of descriptive information about a user, a transaction, a login, and so on. In most of these scenarios, there is timestamp information available, such as the time of the event, the day of the week, or the day of the month. If the goal is to predict an event based on past events, you can use timestamps as features. For example, you could use the time of day, day of the week, or month of the year as features in a model that predicts traffic volume or energy consumption.

However, the best approach for dealing with time in a machine learning problem will depend on the specific problem you’re trying to solve and the structure of your data.

Let’s see how we can use PyCircular to analyze this complex behavior.

First, let’s install the library and load some sample synthetic data.

!pip install pycircular
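
Then we can load the synthetic transactions. A minimal sketch follows; the import path is my assumption about where load_transactions lives, so check the repository README for the exact location:

# Assumed import path for the sample-data loader; verify against the pycircular README.
from pycircular.datasets import load_transactions

# Assuming the loader returns a pandas DataFrame of transaction timestamps.
data = load_transactions()
print(data.head())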

Using the dataset from load_transactions, we can see that we have 349 observations (transactions) spanning January 1st, 2020 to July 29th, 2020.

Image by author

Then, plotting a histogram of the time of each observation, we see that most examples happened between 5pm and 7am, with few happening at noon. Moreover, when dealing with hour of the day as a scalar variable, there are a few issues that can arise.

  • One issue is that hour of the day is cyclical in nature, meaning that the value at the end of the day (24:00) is related to the value at the beginning of the day (00:00). However, when hour of the day is treated as a scalar variable, this cyclical relationship is not considered, which can lead to inaccurate or misleading results.
  • Another issue is that hour of the day is often correlated with other variables, such as day of the week or season. For example, there may be more traffic during rush hour on a weekday than on a weekend. However, when hour of the day is treated as a scalar variable, these correlations are not considered and can lead to biased or misleading results.
  • A third issue is that hour of the day can be affected by different factors such as season, day of the week, or even holidays. These factors can greatly impact the behavior and patterns of hour of the day. So, if this information is not taken into account when using hour of the day as a scalar variable, it can lead to inaccurate conclusions.

To overcome these issues, one solution is to use a cyclical encoding technique, such as sine and cosine encoding, to incorporate the cyclical nature of the data. Another solution is to include other relevant variables, such as day of the week or season, in the model to account for potential correlations. Additionally, it’s important to consider the impact of different factors on the hour of the day when analyzing data.
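
As a quick illustration of sine/cosine encoding (independent of pycircular), each hour is mapped to a point on the unit circle so that 23:00 and 01:00 end up close together:

import numpy as np

hours = np.array([0, 1, 11, 12, 23])
angles = 2 * np.pi * hours / 24              # map hours to [0, 2*pi)
hour_sin, hour_cos = np.sin(angles), np.cos(angles)

# 23:00 and 01:00 are now neighbors in (sin, cos) space, unlike on the 0-24 scale.
print(np.column_stack([hours, hour_sin.round(2), hour_cos.round(2)]))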

For our example, let’s begin by using a circular histogram plot to better understand our dataset.
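
pycircular provides this plot out of the box; as a generic stand-in, here is a minimal circular histogram using matplotlib’s polar projection, with synthetic data skewed toward night hours:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
hours = rng.uniform(17, 31, size=300) % 24        # synthetic: mostly 5pm-7am
angles = 2 * np.pi * hours / 24

counts, edges = np.histogram(angles, bins=24, range=(0, 2 * np.pi))
ax = plt.subplot(projection="polar")
ax.bar(edges[:-1], counts, width=2 * np.pi / 24, align="edge")
ax.set_theta_zero_location("N")   # midnight at the top
ax.set_theta_direction(-1)        # clockwise, like a clock face
plt.show()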

Image by author

Then, let’s calculate the scalar (arithmetic) mean of the observation times:

Image by author

The issue when dealing with time, specifically when analyzing a feature such as the mean time, is that it is easy to make the mistake of using the arithmetic mean. The arithmetic mean is not a correct way to average time because, as shown in the figure above, it does not consider the periodic behavior of the time feature. For example, the arithmetic mean of the times of four transactions made at 2:00, 3:00, 22:00 and 23:00 is 12:30, which is counter-intuitive since no transactions were made close to that time.
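
To see the difference, the circular mean of those four timestamps can be computed by averaging the corresponding points on the unit circle:

import numpy as np

hours = np.array([2, 3, 22, 23])
print(hours.mean())                               # 12.5 -> 12:30, misleading

angles = 2 * np.pi * hours / 24
mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
print((mean_angle * 24 / (2 * np.pi)) % 24)       # 0.5 -> 00:30, as expected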

We can overcome this limitation by modeling the time of the transaction as a periodic variable, in particular using the von Mises distribution (Fisher, 1996). The von Mises distribution, also known as the periodic normal distribution, is the distribution of a wrapped normally distributed variable across a circle. The von Mises distribution of a set of examples $D$ is defined as:

$$p(x \mid \mu, \sigma) = \frac{\exp\left(\tfrac{1}{\sigma}\cos(x - \mu)\right)}{2\pi I_0(1/\sigma)},$$

where $I_0$ is the modified Bessel function of order zero, $1/\sigma$ plays the role of the usual concentration parameter $\kappa$, and $\mu$ and $\sigma$ are the periodic mean and periodic standard deviation, respectively. The periodic mean can be computed as

$$\mu = \operatorname{atan2}\left(\sum_i \sin x_i,\; \sum_i \cos x_i\right),$$

and the periodic standard deviation is estimated from the sample’s mean resultant length (see Fisher, 1996).

Image by author

Now, having calculated the periodic mean and standard deviation, we can estimate the von Mises distribution.
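pycircular implements this estimation; as a library-agnostic sketch, scipy.stats.vonmises can also fit the distribution (the sigma conversion below simply inverts the parametrization above, kappa = 1/sigma):

import numpy as np
from scipy import stats

hours = np.array([2, 3, 22, 23])
angles = 2 * np.pi * hours / 24

# Fit a von Mises distribution on the unit circle; returns (kappa, loc, scale).
kappa, mu, _ = stats.vonmises.fit(angles, fscale=1)
periodic_mean_hour = (mu * 24 / (2 * np.pi)) % 24   # back to hours
periodic_std = 1.0 / kappa                          # sigma = 1/kappa in the form above
print(periodic_mean_hour, periodic_std)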

Image by author

This method gives us a good approximation of the distribution of the time of the events. However, when using a statistical distribution with only one mode, it may be difficult to accurately model the data if the distribution is not a good fit for the data set. Additionally, if the data set is multi-modal (i.e. has multiple peaks), a single mode distribution will not be able to capture all the variations in the data. This could lead to poor predictions or inferences based on the model.

This can be overcome with a kernel-based method, namely kernel density estimation (KDE) with a von Mises kernel, which does not force the fit into a single mode.

KDE is a non-parametric method for estimating the probability density function of a random variable. It works by replacing the point mass at each data point with a smooth, symmetric kernel function, such as the von Mises kernel. The resulting estimate of the PDF is a sum of the kernel functions centered at each data point.
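
As an illustration of this definition (a from-scratch sketch, not pycircular’s implementation), a von Mises KDE averages one kernel per training point; kappa acts as the inverse bandwidth:

import numpy as np
from scipy import stats

def vonmises_kde(train_angles, eval_angles, kappa):
    # One von Mises kernel centered at each training point, averaged.
    dens = stats.vonmises.pdf(eval_angles[:, None], kappa, loc=train_angles[None, :])
    return dens.mean(axis=1)

train = 2 * np.pi * np.array([2, 3, 22, 23]) / 24   # hours mapped onto the circle
grid = np.linspace(0, 2 * np.pi, 8, endpoint=False)
print(vonmises_kde(train, grid, kappa=10.0))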

By using a kernel function, KDE is not restricted to a single mode and can capture multiple peaks in the data, making it a more flexible method for modeling multi-modal data sets. Additionally, kernel density estimation is non-parametric, which means it does not impose strong assumptions about the underlying distribution of the data.

However, it’s worth noting that choosing the right kernel is important, and there are some challenges when working with KDE such as the choice of bandwidth and the curse of dimensionality.
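
pycircular ships its own bandwidth optimizers; a common library-agnostic alternative is to maximize the leave-one-out log-likelihood over a grid of kappa values, sketched below:

import numpy as np
from scipy import stats

def loo_log_likelihood(angles, kappa):
    # Density of each point under the KDE built from all *other* points.
    dens = stats.vonmises.pdf(angles[:, None], kappa, loc=angles[None, :])
    np.fill_diagonal(dens, 0.0)
    return np.log(dens.sum(axis=1) / (len(angles) - 1)).sum()

angles = 2 * np.pi * np.array([2, 3, 22, 23]) / 24
kappas = np.linspace(0.5, 50, 100)
best_kappa = max(kappas, key=lambda k: loo_log_likelihood(angles, k))
print(best_kappa)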

Image by author

In summary, using a kernel-based method such as KDE with von Mises can help overcome the issues of using a statistical distribution with only one mode by allowing for a more flexible and robust modeling of multi-modal data sets.

Finally, we can apply the kernel to new observations and create a new feature that can be used as an input for a machine learning model.
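
Using the vonmises_kde sketch from above (repeated so the snippet is self-contained), scoring a new timestamp looks like this; note that the 0.017 value in the figure below comes from pycircular’s kernel fitted on the full dataset, not from this toy example:

import numpy as np
from scipy import stats

def vonmises_kde(train_angles, eval_angles, kappa):
    dens = stats.vonmises.pdf(eval_angles[:, None], kappa, loc=train_angles[None, :])
    return dens.mean(axis=1)

train = 2 * np.pi * np.array([2, 3, 22, 23]) / 24
noon = np.array([2 * np.pi * 12 / 24])
print(vonmises_kde(train, noon, kappa=10.0))  # near zero: no training data at noon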

Image by author

We can see that an observation at noon has a very low probability (0.017) because, when the kernel was trained, there weren’t any observations at that time.

In conclusion, this methodology allows us to deal effectively with timestamps by creating more robust representations of the temporal information in the data. By using the von Mises kernel during feature engineering, we can generate new features that accurately capture the nuances of temporal patterns in the data. This approach overcomes the limitations of treating dates as scalar variables and can lead to improved performance in machine learning models. Some final considerations:

  • The selection of the bandwidth parameter (bw) is crucial for the performance of the model; the pycircular library offers a range of optimization methods to select the best bw for a given dataset.
  • To evaluate the effectiveness of the kernel, it is important to perform accuracy tests and compare the results with other methods.
  • While the time of day is a significant temporal feature, it is also worth investigating how other temporal variables, such as day of the week and day of the month, affect the model’s performance.
  • The kernel can be integrated into the feature engineering process of a machine learning model, where it is applied to the input data to create new features that better capture the temporal patterns in the data.

I will show how to deal with these issues in a follow-up post.


