
Introducing PyCircular: A Python Library for Circular Data Analysis

By Alejandro Correa Bahnsen | Jan 2023



Photo by Patrick McManaman on Unsplash

In this post, I’m introducing PyCircular, a Python library designed specifically for analyzing circular data. As one of the authors, I am excited to share this tool with the community to help address the challenges of working with circular data.

Circular data, such as data representing angles, directions, or timestamps, presents unique challenges for analysis and modeling. Its periodic nature causes difficulties when applying traditional linear and kernel-based methods, which are not well suited to handle it. Circular data also raises issues when computing the mean and standard deviation, as these measures are not well defined on a circle.

PyCircular addresses these challenges by providing a set of tools and functionality tailored specifically for circular data. The library includes a variety of circular statistical methods, including distributions, kernels, and confidence intervals. It also includes visualization tools, such as circular histograms and distribution plots, to help you better understand your data.

https://github.com/albahnsen/pycircular

The remainder of this post dives deeper into the unique challenges of working with circular data and demonstrates how PyCircular can address them through a series of examples. You will see how PyCircular can be used to handle the periodic nature of circular data and to compute meaningful measures of central tendency and dispersion. You will also learn how the library’s visualization tools can help you better understand and interpret circular data. By the end of this post, you will have a solid understanding of how PyCircular can be used to effectively analyze and model circular data.

When training a machine learning model, we must have a dataset that includes input variables (features) and corresponding output variables (labels). The model learns to map the features to the labels, and the goal of training is to find the best set of parameters for this mapping.

In several applications, the features of the model consist of descriptive information about a user, a transaction, a login, and so on. In most of these scenarios, there is timestamp information available, such as the time of the event, the day of the week, or the day of the month. If the goal is to predict an event based on past events, you can use timestamps as features. For example, you could use the time of day, day of the week, or month of the year as features in a model that predicts traffic volume or energy consumption.

However, the best approach for dealing with time in a machine learning problem will depend on the specific problem you’re trying to solve and the structure of your data.

Let’s see how we can use PyCircular to analyze this complex behavior.

First, let’s install the library and load some sample synthetic data.

!pip install pycircular
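
Then we can load the synthetic transactions. A minimal sketch follows; the import path is my assumption about where load_transactions lives, so check the repository README for the exact location:

# Assumed import path for the sample-data loader; verify against the pycircular README.
from pycircular.datasets import load_transactions

# Assuming the loader returns a pandas DataFrame of transaction timestamps.
data = load_transactions()
print(data.head())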

Using the dataset from load_transactions, we can see that we have 349 observations (transactions) spanning January 1st, 2020 to July 29th, 2020.

Image by author

Then, plotting a histogram of the time of each observation, we see that most examples happened between 5pm and 7am, with few happening at noon. Moreover, when dealing with hour of the day as a scalar variable, there are a few issues that can arise.

  • One issue is that hour of the day is cyclical in nature, meaning that the value at the end of the day (24:00) is related to the value at the beginning of the day (00:00). However, when hour of the day is treated as a scalar variable, this cyclical relationship is not considered, which can lead to inaccurate or misleading results.
  • Another issue is that hour of the day is often correlated with other variables, such as day of the week or season. For example, there may be more traffic during rush hour on a weekday than on a weekend. However, when hour of the day is treated as a scalar variable, these correlations are not considered and can lead to biased or misleading results.
  • A third issue is that hour of the day can be affected by different factors such as season, day of the week, or even holidays. These factors can greatly impact the behavior and patterns of hour of the day. So, if this information is not taken into account when using hour of the day as a scalar variable, it can lead to inaccurate conclusions.

To overcome these issues, one solution is to use a cyclical encoding technique, such as sine and cosine encoding, to incorporate the cyclical nature of the data. Another solution is to include other relevant variables, such as day of the week or season, in the model to account for potential correlations. Additionally, it’s important to consider the impact of different factors on the hour of the day when analyzing data.
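
As a quick illustration of sine/cosine encoding (independent of pycircular), each hour is mapped to a point on the unit circle so that 23:00 and 01:00 end up close together:

import numpy as np

hours = np.array([0, 1, 11, 12, 23])
angles = 2 * np.pi * hours / 24              # map hours to [0, 2*pi)
hour_sin, hour_cos = np.sin(angles), np.cos(angles)

# 23:00 and 01:00 are now neighbors in (sin, cos) space, unlike on the 0-24 scale.
print(np.column_stack([hours, hour_sin.round(2), hour_cos.round(2)]))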

For our example, let’s begin by using a circular histogram plot to better understand our dataset.
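
pycircular provides this plot out of the box; as a generic stand-in, here is a minimal circular histogram using matplotlib’s polar projection, with synthetic data skewed toward night hours:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
hours = rng.uniform(17, 31, size=300) % 24        # synthetic: mostly 5pm-7am
angles = 2 * np.pi * hours / 24

counts, edges = np.histogram(angles, bins=24, range=(0, 2 * np.pi))
ax = plt.subplot(projection="polar")
ax.bar(edges[:-1], counts, width=2 * np.pi / 24, align="edge")
ax.set_theta_zero_location("N")   # midnight at the top
ax.set_theta_direction(-1)        # clockwise, like a clock face
plt.show()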

Image by author

Then, let’s calculate the scalar (arithmetic) mean of the observation times:

Image by author

The issue when dealing with time, specifically when analyzing a feature such as the mean time, is that it is easy to make the mistake of using the arithmetic mean. The arithmetic mean is not a correct way to average time because, as shown in the figure above, it does not consider the periodic behavior of the time feature. For example, the arithmetic mean of the times of four transactions made at 2:00, 3:00, 22:00 and 23:00 is 12:30, which is counter-intuitive since no transactions were made close to that time.
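
To see the difference, the circular mean of those four timestamps can be computed by averaging the corresponding points on the unit circle:

import numpy as np

hours = np.array([2, 3, 22, 23])
print(hours.mean())                               # 12.5 -> 12:30, misleading

angles = 2 * np.pi * hours / 24
mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
print((mean_angle * 24 / (2 * np.pi)) % 24)       # 0.5 -> 00:30, as expected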

We can overcome this limitation by modeling the time of the transaction as a periodic variable, in particular using the von Mises distribution (Fisher, 1996). The von Mises distribution, also known as the periodic normal distribution, is the distribution of a wrapped normally distributed variable across a circle. The von Mises distribution of a set of examples $D$ is defined as:

$$p(x \mid \mu, \sigma) = \frac{\exp\left(\tfrac{1}{\sigma}\cos(x - \mu)\right)}{2\pi I_0(1/\sigma)},$$

where $I_0$ is the modified Bessel function of order zero, $1/\sigma$ plays the role of the usual concentration parameter $\kappa$, and $\mu$ and $\sigma$ are the periodic mean and periodic standard deviation, respectively. The periodic mean can be computed as

$$\mu = \operatorname{atan2}\left(\sum_i \sin x_i,\; \sum_i \cos x_i\right),$$

and the periodic standard deviation is estimated from the sample’s mean resultant length (see Fisher, 1996).

Image by author

Now, having calculated the periodic mean and standard deviation, we can estimate the von Mises distribution.
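pycircular implements this estimation; as a library-agnostic sketch, scipy.stats.vonmises can also fit the distribution (the sigma conversion below simply inverts the parametrization above, kappa = 1/sigma):

import numpy as np
from scipy import stats

hours = np.array([2, 3, 22, 23])
angles = 2 * np.pi * hours / 24

# Fit a von Mises distribution on the unit circle; returns (kappa, loc, scale).
kappa, mu, _ = stats.vonmises.fit(angles, fscale=1)
periodic_mean_hour = (mu * 24 / (2 * np.pi)) % 24   # back to hours
periodic_std = 1.0 / kappa                          # sigma = 1/kappa in the form above
print(periodic_mean_hour, periodic_std)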

Image by author

This method gives us a good approximation of the distribution of the time of the events. However, when using a statistical distribution with only one mode, it may be difficult to accurately model the data if the distribution is not a good fit for the data set. Additionally, if the data set is multi-modal (i.e. has multiple peaks), a single mode distribution will not be able to capture all the variations in the data. This could lead to poor predictions or inferences based on the model.

This can be overcome with a kernel-based method, namely kernel density estimation (KDE) with a von Mises kernel, which does not force the fit into a single mode.

KDE is a non-parametric method for estimating the probability density function of a random variable. It works by replacing the point mass at each data point with a smooth, symmetric kernel function, such as the von Mises kernel. The resulting estimate of the PDF is a sum of the kernel functions centered at each data point.
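
As an illustration of this definition (a from-scratch sketch, not pycircular’s implementation), a von Mises KDE averages one kernel per training point; kappa acts as the inverse bandwidth:

import numpy as np
from scipy import stats

def vonmises_kde(train_angles, eval_angles, kappa):
    # One von Mises kernel centered at each training point, averaged.
    dens = stats.vonmises.pdf(eval_angles[:, None], kappa, loc=train_angles[None, :])
    return dens.mean(axis=1)

train = 2 * np.pi * np.array([2, 3, 22, 23]) / 24   # hours mapped onto the circle
grid = np.linspace(0, 2 * np.pi, 8, endpoint=False)
print(vonmises_kde(train, grid, kappa=10.0))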

By using a kernel function, KDE is not restricted to a single mode and can capture multiple peaks in the data, making it a more flexible method for modeling multi-modal data sets. Additionally, kernel density estimation is non-parametric, which means it does not impose strong assumptions about the underlying distribution of the data.

However, it’s worth noting that choosing the right kernel is important, and there are some challenges when working with KDE such as the choice of bandwidth and the curse of dimensionality.
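
pycircular ships its own bandwidth optimizers; a common library-agnostic alternative is to maximize the leave-one-out log-likelihood over a grid of kappa values, sketched below:

import numpy as np
from scipy import stats

def loo_log_likelihood(angles, kappa):
    # Density of each point under the KDE built from all *other* points.
    dens = stats.vonmises.pdf(angles[:, None], kappa, loc=angles[None, :])
    np.fill_diagonal(dens, 0.0)
    return np.log(dens.sum(axis=1) / (len(angles) - 1)).sum()

angles = 2 * np.pi * np.array([2, 3, 22, 23]) / 24
kappas = np.linspace(0.5, 50, 100)
best_kappa = max(kappas, key=lambda k: loo_log_likelihood(angles, k))
print(best_kappa)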

Image by author

In summary, using a kernel-based method such as KDE with von Mises can help overcome the issues of using a statistical distribution with only one mode by allowing for a more flexible and robust modeling of multi-modal data sets.

Finally, we can apply the kernel to new observations and create a new feature that can be used as an input for a machine learning model.
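
Using the vonmises_kde sketch from above (repeated so the snippet is self-contained), scoring a new timestamp looks like this; note that the 0.017 value in the figure below comes from pycircular’s kernel fitted on the full dataset, not from this toy example:

import numpy as np
from scipy import stats

def vonmises_kde(train_angles, eval_angles, kappa):
    dens = stats.vonmises.pdf(eval_angles[:, None], kappa, loc=train_angles[None, :])
    return dens.mean(axis=1)

train = 2 * np.pi * np.array([2, 3, 22, 23]) / 24
noon = np.array([2 * np.pi * 12 / 24])
print(vonmises_kde(train, noon, kappa=10.0))  # near zero: no training data at noon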

Image by author

We can see that an observation at noon has a very low probability (0.017) because, when the kernel was trained, there weren’t any observations at that time.

In conclusion, this methodology allows us to deal effectively with timestamps by creating more robust representations of the temporal information in the data. By using the von Mises kernel during feature engineering, we can generate new features that accurately capture the nuances of temporal patterns in the data. This approach overcomes the limitations of treating dates as scalar variables and can lead to improved performance in machine learning models. Some final considerations:

  • The selection of the bandwidth parameter (bw) is crucial for the performance of the model; the pycircular library offers a range of optimization methods to select the best bw for a given dataset.
  • To evaluate the effectiveness of the kernel, it is important to perform accuracy tests and compare the results with other methods.
  • While the time of day is a significant temporal feature, it is also worth investigating how other temporal variables, such as day of the week and day of the month, affect the model’s performance.
  • The kernel can be integrated into the feature engineering process of a machine learning model, where it is applied to the input data to create new features that better capture the temporal patterns in the data.

I will show how to deal with these issues in a follow-up post.


