Rust Polars: Unlocking High-Performance Data Analysis — Part 1 | by Mahmoud Harmouch | May, 2023

Core Objects

Polars dataframe and series representations (Image by author)

In this section, we explore the fundamental concepts of Polars. The code snippets in this article were executed in a Jupyter Notebook, an interactive computing platform that runs in a web browser and lets us create and share documents containing live code, visualizations, and explanatory text.

Series Object

Series object representation (Image by author)

To understand data wrangling with Polars, we start with the basics: one-dimensional data, which Polars represents with Series objects.

The Series object is a core data structure in Polars, representing one-dimensional (1-D) data. It is an ordered sequence of values of a single data type, together with a name. A simple analogy is a single table column: the name is the column header and the values are the cells beneath it.

To create a Series object in Polars, use the Series::new method, which takes a name and a collection of values and infers the data type from those values. A Series can also be collected from an iterator. The following code shows both approaches, each creating a Series called series that holds the values [1, 2, 3].

use polars::prelude::*;

let series: Series = [1, 2, 3].iter().collect();

// or

let series: Series = Series::new("", &[1, 2, 3]);

println!("{:?}", series);

Running the above code in a Jupyter notebook cell will produce the following output:

shape: (3,)
Series: '' [i32]
[
1
2
3
]

The output shows how Polars represents one-dimensional data: the shape, the name (empty here), the data type, and the values in order. Unlike a pandas Series, a Polars Series carries no index labels; values are addressed by position, starting at 0.

Each Series also carries a name. When Series are combined into a DataFrame, these names become the column names, so think of them as labels that describe each column or feature.

Polars Series can hold various data types, such as integers, strings, booleans, or datetime values. To create a Series of strings, stored in the variable seasons_ser and named seasons, pass a slice of string values to Series::new.

let seasons_ser: Series = Series::new("seasons", &["Winter", "Spring", "Summer", "Fall"]);
println!("{:?}", seasons_ser);

Running this snippet will result in the following output:

shape: (4,)
Series: 'seasons' [str]
[
"Winter"
"Spring"
"Summer"
"Fall"
]

The result is a Series object, nicely rendered in the terminal. Polars has automatically identified the data type of this Series as str and set its dtype accordingly.

In Python, missing or null values are commonly denoted by None. Typed containers such as the pandas Series must handle these values specially. When a list of strings contains None, pandas stores it as an object-type array and keeps None as the placeholder.

To illustrate, consider a list of seasons in which one name is missing, represented by None.

>>> import pandas as pd
>>> seasons = ["Winter", "Spring", "Summer", None]
>>> pd.Series(seasons)

0 Winter
1 Spring
2 Summer
3 None
dtype: object

When a pandas Series is created from strings with at least one None, the result has dtype object and None is kept as the placeholder for the missing value.

The following example shows how pandas handles missing values in a list of integers. In this case, pandas converts the data to floating point and represents the missing value as NaN, which lets it keep the whole Series in a single numeric type.

>>> numbers = [1, 2, None]
>>> pd.Series(numbers)
0 1.0
1 2.0
2 NaN
dtype: float64

Note that NaN is a legitimate floating-point value defined by the IEEE 754 standard. It can therefore take part in arithmetic and comparisons without raising errors, although the result of such an operation is usually NaN or false.

In Rust Polars, however, None values in an integer Series become null, and the integer data type is preserved. Although this may seem a minor difference, it can matter when handling large datasets or running complex analyses that depend on the data type.

let s: Series = Series::new("seasons", &[None, Some(1), Some(2)]);
println!("{:?}", s);

// Output:

// shape: (3,)
// Series: 'seasons' [i32]
// [
// null
// 1
// 2
// ]

Comparing this with pandas reveals two differences. First, Rust Polars represents missing data with null rather than NaN. Second, Rust Polars keeps the Series as 32-bit integers instead of converting it to floating point. This follows from Rust's static type system: the literals 1 and 2 are inferred as i32, so the slice holds Option<i32> values and the Series dtype is i32. In pandas, by contrast, None is converted to NaN, a floating-point value, so the integers are cast to float as well.

It is important to distinguish None from NaN in scientific computing with Rust. Although data scientists may use the two interchangeably to denote missing data, they are represented differently under the hood. NaN is not equivalent to None, and an equality test between them always evaluates to false.

In Rust, NaN does not even compare equal to itself: the comparison evaluates to false. In other words, NaN is not equal to any value, including itself.
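To make the contrast concrete, here is a minimal sketch in plain Rust, without Polars, that models the two strategies. The function names promote_to_float and keep_as_integers are hypothetical and exist only for illustration; they are not part of Polars or pandas.

```rust
/// Pandas-style handling (illustrative only): every integer is promoted
/// to f64 so that a missing value can be stored as NaN.
fn promote_to_float(values: &[Option<i32>]) -> Vec<f64> {
    values
        .iter()
        .map(|v| match v {
            Some(x) => f64::from(*x),
            None => f64::NAN,
        })
        .collect()
}

/// Null-style handling (illustrative only): the integers stay integers
/// and a missing value stays `None`, so no type conversion is needed.
fn keep_as_integers(values: &[Option<i32>]) -> Vec<Option<i32>> {
    values.to_vec()
}
```

Calling promote_to_float on [None, Some(1), Some(2)] yields a NaN followed by 1.0 and 2.0, whereas keep_as_integers leaves the i32 values and the missing entry untouched.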

Some(f64::NAN) == None
// false
f64::NAN == f64::NAN
// false

As a result, when performing operations on data that includes NaN values, it is essential to handle them appropriately.
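As one possible way of handling NaN explicitly, the sketch below uses only the Rust standard library. The helper names naive_sum and mean_ignoring_nan are hypothetical; the point is that a single NaN contaminates an ordinary sum, so NaN values have to be filtered out with is_nan before aggregating.

```rust
/// Sums all values; a single NaN makes the whole result NaN.
fn naive_sum(values: &[f64]) -> f64 {
    values.iter().sum()
}

/// Computes the mean of the non-NaN values, or `None` if there are none.
fn mean_ignoring_nan(values: &[f64]) -> Option<f64> {
    let valid: Vec<f64> = values.iter().copied().filter(|v| !v.is_nan()).collect();
    if valid.is_empty() {
        None
    } else {
        Some(valid.iter().sum::<f64>() / valid.len() as f64)
    }
}
```

For the input [NaN, 1.0, 2.0], naive_sum returns NaN, while mean_ignoring_nan returns Some(1.5).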

Note that Rust Polars does not treat NaN as null: for a Series that contains NaN but no nulls, null_count returns zero and drop_nulls removes nothing. This is because null marks a missing value, whereas NaN is an ordinary floating-point value. Understanding how missing information is represented in your dataset is therefore crucial for accurate analysis and manipulation.

let s: Series = Series::new("numbers", &[Some(f64::NAN), Some(1.), Some(2.)]);

println!("{:?}", s.null_count());

// Output:

// 0

s.drop_nulls()

// Output:

// shape: (3,)
// Series: 'numbers' [f64]
// [
// NaN
// 1.0
// 2.0
// ]
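The distinction between null and NaN can be modelled in plain Rust with Option<f64>, where None plays the role of null. The sketch below is only a model of the behaviour described here, not how Polars is implemented. The free functions null_count and drop_nulls borrow the names of the Polars methods for readability, and drop_nulls_and_nans is a hypothetical helper.

```rust
/// Counts missing entries. Only `None` counts as null; NaN does not.
fn null_count(values: &[Option<f64>]) -> usize {
    values.iter().filter(|v| v.is_none()).count()
}

/// Removes missing entries. NaN values are kept because they are `Some`.
fn drop_nulls(values: &[Option<f64>]) -> Vec<f64> {
    values.iter().flatten().copied().collect()
}

/// Removes both missing entries and NaN values.
fn drop_nulls_and_nans(values: &[Option<f64>]) -> Vec<f64> {
    values
        .iter()
        .flatten()
        .copied()
        .filter(|v| !v.is_nan())
        .collect()
}
```

With the input [Some(NaN), Some(1.0), None, Some(2.0)], null_count reports 1, drop_nulls still returns the NaN, and only drop_nulls_and_nans reduces the data to [1.0, 2.0].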

It is also possible to convert the elements of a Series from one data type to another. For instance, the previous example can be converted to integer values, as the code below shows:

let s: Series = Series::new("numbers", &[Some(f64::NAN), Some(1.), Some(2.)]);
println!("{:?}", s.cast(&DataType::Int64).unwrap());

// Output:

// shape: (3,)
// Series: 'numbers' [i64]
// [
// null
// 1
// 2
// ]

The cast method converts the Series s into a new Series of 64-bit integers. Printing the result with the println! macro shows that the NaN value becomes null after the conversion.

Keep in mind that converting a Series from one data type to another can lose or alter values. For instance, casting a floating-point Series to an integer Series truncates the fractional part of each value. Likewise, values that cannot be represented in the target type, such as non-numeric strings cast to a numeric type, cannot be converted and may surface as nulls or errors. Weigh the consequences of a conversion carefully before applying it.
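The effect of truncation can be sketched in plain Rust, independently of Polars. The helper truncating_cast below is hypothetical: it returns None for values that have no integer representation, mirroring the way NaN became null in the cast above, but it is not Polars's actual implementation.

```rust
/// Converts a float to i64 by truncating toward zero.
/// Returns `None` for NaN, infinities, and values outside the i64 range.
fn truncating_cast(x: f64) -> Option<i64> {
    // 2^63 as f64; i64::MAX itself is not exactly representable in f64.
    const UPPER: f64 = 9_223_372_036_854_775_808.0;
    if x.is_nan() || x < -UPPER || x >= UPPER {
        None
    } else {
        Some(x.trunc() as i64)
    }
}
```

Here truncating_cast(2.9) gives Some(2) and truncating_cast(-2.9) gives Some(-2), so the fractional part is discarded rather than rounded. By comparison, Rust's built-in conversion f64::NAN as i64 silently evaluates to 0, which is one reason to make the missing case explicit.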

