Techno Blender
Digitally Yours.

Dealing with Dates in Python’s DataFrame Part 1 — Date Series Creation | by KahEm Chu | Jan, 2023

0 41


Data Processing in Python

Pandas methods for date series creation

Photo by Jon Tyson on Unsplash

Most of the time, the DateTime object represents a crucial element for drawing insights from the data. We can understand the trend, cycle, and seasonal pattern from the data with the date. From that, we can prepare the report based on the pattern found, and further study and analyze the data.

The importance of the DateTime object in analysis motivated me to study further what I can do with the DateTime object in the pandas module. Then, I jotted down the methods and properties I used frequently and those I felt I might need to use some days. Furthermore, I grouped them into parts according to my understanding. There are 2 parts created as below:

Image from Author.

For a better reading experience, I decided to split the group into 3 articles. Let us start with Part 1, The Basic to Deal with DateTime Series.

Part 1 — The Basic to Deal with DateTime Series

DateTime Series Creation

DateTime series creation is practical when you want to create a sample dataset to test out a couple of new functions that you are writing. Below are the four DateTime series creation methods from the pandas module.

  • pandas.date_range — Return a fixed frequency DatetimeIndex.
  • pandas.bdate_range — Return a fixed frequency DatetimeIndex, with the business day as the default frequency.
  • pandas.period_range — Return a fixed frequency PeriodIndex. The day (calendar) is the default frequency.
  • pandas.timedelta_range — Return a fixed frequency TimedeltaIndex, with the day as the default frequency.

The frequency mentioned above is referring to the interval between the dates generated, it could be hourly, daily, monthly, quarterly, yearly and more. You may learn about the frequency string aliases in this link [1].

Let’s see them one by one!

1. pandas.date_range

The pandas.date_range() method returns the DateTime series according to the combination of three parameters from the following four parameters:

  1. start — the start date of the date range generated
  2. end — the end date of the date range generated
  3. periods — the number of dates generated
  4. freq — default to “D”, the interval between dates generated, it could be hourly, monthly or yearly

Note: freq = “D” mean daily frequency.

At least three of the 4 parameters above must be stated to generate a DateTime series. Since the freq is defaulted to be “D”, if you are using freq=D, you need to specify the other two parameters only. If freq is omitted, which means you specify only the start, the end and the period parameters, and the date created will have periods of linearly spaced elements between the start and the end date. There are other parameters in the method, but in this article, we will focus on these 4 main parameters only.

For the first example, the dates are generated by specifying the start date and the period. As mentioned above, the frequency is set to be daily by default. Hence, there will be 10 dates generated at a daily frequency.

import pandas as pd
df = pd.DataFrame()
df["date_range"] = pd.date_range(start="2022/1/1", periods=10)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the second example, the start date, the periods, and the frequency are specified. The following example is creating a date series starting from 2020/1/1, 10 dates with 3-month intervals between each date.

import pandas as pd
df = pd.DataFrame()
df["date_range"] = pd.date_range(start="2022/1/1", periods=10, freq="3M")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

Why are the dates start from month’s end? 🤨

Well, it’s actually because the “M” frequency refers to the month-end frequency, while the “MS” refers to the month-start frequency [1].

For the third example, the start and end date, and the frequency are provided. As mentioned, the dates created will be linearly spaced when you omitted the frequency. Then, if the periods are omitted, the dates created will be the dates within the start and end date at the frequency interval specified.

import pandas as pd
df = pd.DataFrame()
df["date_range"] = pd.date_range(start="2022/1/1", end="2022-12-31", freq="3M")
print(df.head(10))
print("Data Type: ", df.dtypes)
Image by Author.
Image from Author. Created with Excalidraw.

As the next period will be Jan 31, 2023, hence there are only 4 dates created in the third example 😉.

A simple guide here:

When you are certain about the number of dates you want to create, you use the period parameter.

When you are not sure about how many exact dates you will have but know when should it end or it should not exceed, use the end parameter instead.

2. pandas.bdate_range

Similar to pandas.date_range() method, pandas.bdate_range() also have 4 main parameters, which are the start, the end, the periods and the freq, except the freq is defaulted to “B” in pandas.bdate_range(). The “B” refers to business day frequency, which the weekend like Saturday and Sunday will be skipped.

Let’s see the first example! So, in the following example, the start date and the periods are specified, and as mentioned, the frequency is defaulted to “B”.

import pandas as pd
df = pd.DataFrame()
# frequency is default to B, the weekend will be skipped
df["bdate_range"] = pd.bdate_range(start="2022/1/1", periods=10)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

The two dates which are skipped, “2022–01–08” and “2022–01–09” are Saturday and Sunday respectively.

You might notice that the pandas.date_range() method can return workdays only as well when you set the freq= “B”, then why do we need to use pandas.bdate_range()? 🤷‍♀️

This is because of the pandas.bdate_range() return business day by default and pandas.bdate_range() have weekmask and holidays parameters.

Note: To use the holidays or weekmask parameter, custom business day frequency need to be used, where freq= “C”. [2]

Now, let us see what is the holidays parameter. Holidays referring to the list of dates to exclude from the set of valid business days.

For the second example, the start date, periods, frequency and holidays parameter are specified.

import pandas as pd
df = pd.DataFrame()
# frequency is set to C, the weekend and holidays will be skipped
# only can set holiday when freq is set to "C"
holidays = [pd.datetime(2022,1,7)]
df["bdate_range"] = pd.bdate_range(start="2022/1/1", periods=10, freq="C", holidays=holidays)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

The holiday date specified is not on the list of dates generated, and as “C” refers to the custom business day frequency, so the weekend is still skipped in the date range created.

Note: The Holidays parameter only takes a list of the datetime objects.

Now, let’s see the weekmask parameter. Weekmask refers to the valid business days for a business that does not follow traditional business like Mon to Fri. Also, the default value for weekmask is equivalent to ‘Mon Tue Wed Thu Fri’.

For the third example, we specified the start date, the custom business day with weekmask = “Tue Wed Thu Fri Sat Sun”.

import pandas as pd
df = pd.DataFrame()
df["bdate_range"] = pd.bdate_range(start="2022/1/1", periods=10, freq="C", weekmask="Tue Wed Thu Fri Sat Sun")
print(df.head(10))
print("Data Type: ", df.dtypes)
Image by Author.

The Monday date (2022–01–10) will not be included in the dates created. This parameter is useful when the business not running according to a normal weekday.

Combining these two parameters, you can generate the DateTime series according to your business operating day as in the example below.

import pandas as pd
df = pd.DataFrame()
df["bdate_range"] = pd.bdate_range(start="2022/1/1", periods=10, freq="C", weekmask="Tue Wed Thu Fri Sat Sun", holidays=[pd.datetime(2022,1,7)])
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

As seen from the output, the Monday date (2022–01–10) and holiday date (2022–01–07) are not included in the list generated.

3. pandas.period range

There are some similarities and differences between pandas.period_range() method and the two methods earlier, the pandas.date_range() and pandas.bdate_range() .

Similar to the two methods earlier, pandas.period_range() can generate the date series by specifying the 3 out of the 4 main parameters, start, end, periods and freq. Also, the frequency is still defaulted to daily.

One difference to take note of is that the pandas.period_range() generate period object instead of a DateTime object.

For the first example, we are generating a series of 5 periods in daily frequency by default, starting from 2022–01–01.

import pandas as pd
df = pd.DataFrame()
df["period_range"] = pd.period_range(start="2022/1/1", periods=5)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the second example, we are generating a series of 5 periods in monthly frequency, starting from 2022–01–01.

import pandas as pd
df = pd.DataFrame()
df["period_range"] = pd.period_range(start="2022/1/1", periods=5, freq="M")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the third example, we are generating a series of 5 periods in yearly frequency, starting from 2022–01–01.

import pandas as pd
df = pd.DataFrame()
df["period_range"] = pd.period_range(start="2022/1/1", periods=5, freq="Y")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the last example, we are generating a series in yearly frequency, starting from 2022–01–01 to 2027–01–01.

import pandas as pd
df = pd.DataFrame()
df["period_range"] = pd.period_range(start="2022/1/1", end="2027/1/1", freq="Y")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

The period_range method work in the same way with pandas.date_range(), just it returns the period instead of the date. So, if the periods parameter is omitted, the periods created will be the periods within the start and end date with the frequency interval specified.

4. pandas.timedelta_range

Similar to the three methods above, pandas.timedelta_range() method returns the date series according to the combination of three parameters from the four main parameters, start, end, periods and frequency. The frequency is still defaulted to daily. There is one difference between this method with the three examples earlier, which can be explained with the example below.

The example below is from a mistake I made during running the script, and then the errors that occurred.

import pandas as pd
df = pd.DataFrame()
df["timedelta_range"] = pd.timedelta_range(start="2022/1/1", periods=5, freq="Y")
print(df.head(10))
print("Data Type: ", df.dtypes)

The script above returns a Key Error and a Value Error as below.

Key Error. Image by Author.
Value Error. Image by Author.

From the error script, we can see the error comes from the value we put for the “start” parameter. As we are generating a time delta object, the value we put for the “start” parameter should be in timedelta format too.

So, the correct example should be as below, where the start is specified in timedelta format, the number of periods is specified, and the default daily frequency is used.

import pandas as pd
df = pd.DataFrame()
df["timedelta_range"] = pd.timedelta_range(start="1 days", periods=5)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the second example, the starting time delta, the periods and the frequency are specified.

import pandas as pd
df = pd.DataFrame()
df["timedelta_range"] = pd.timedelta_range(start="1 day", periods=5, freq="6H")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the third example, the starting time delta, the end time delta and the frequency are specified.

import pandas as pd
df = pd.DataFrame()
df["timedelta_range"] = pd.timedelta_range(start="1 day", end="5days", freq="8H")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the fourth example, the starting time delta, the end time delta and the periods are specified. The time delta series generated will be linearly spaced when the frequency is not set.

import pandas as pd
df = pd.DataFrame()
df["timedelta_range"] = pd.timedelta_range(start="1 day", end="5days", periods=3)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

Note: For the pandas.timedelta_range() method, the “start” parameter accepts only the time delta object, while for the other three methods, the “start” parameter takes the Datetime object as input.

5. Create DateTime with a Timestamp

In the pandas module, we can also create the datetime object with the timestamp method.

There are two ways to create a DateTime object with a timestamp, the first way is with the datetime parameters as below.

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html
import pandas as pd
timestampsample = pd.Timestamp(year=2022,month=12,day=13,hour=21,minute=48, second=23, microsecond=35, nanosecond=58)
timestampsample

Output:

Image by Author.

The second way is to create the timestamp from the DateTime string.

import pandas as pd
str_timestamp = '2022-12-13 21:48:23.000035058'
timestampsample2 = pd.Timestamp(str_timestamp)
timestampsample2
Image by Author.

Okay, so the above is a demonstration of the usage of timestamp methods to create a DateTime object.


Data Processing in Python

Pandas methods for date series creation

Photo by Jon Tyson on Unsplash

Most of the time, the DateTime object represents a crucial element for drawing insights from the data. We can understand the trend, cycle, and seasonal pattern from the data with the date. From that, we can prepare the report based on the pattern found, and further study and analyze the data.

The importance of the DateTime object in analysis motivated me to study further what I can do with the DateTime object in the pandas module. Then, I jotted down the methods and properties I used frequently and those I felt I might need to use some days. Furthermore, I grouped them into parts according to my understanding. There are 2 parts created as below:

Image from Author.

For a better reading experience, I decided to split the group into 3 articles. Let us start with Part 1, The Basic to Deal with DateTime Series.

Part 1 — The Basic to Deal with DateTime Series

DateTime Series Creation

DateTime series creation is practical when you want to create a sample dataset to test out a couple of new functions that you are writing. Below are the four DateTime series creation methods from the pandas module.

  • pandas.date_range — Return a fixed frequency DatetimeIndex.
  • pandas.bdate_range — Return a fixed frequency DatetimeIndex, with the business day as the default frequency.
  • pandas.period_range — Return a fixed frequency PeriodIndex. The day (calendar) is the default frequency.
  • pandas.timedelta_range — Return a fixed frequency TimedeltaIndex, with the day as the default frequency.

The frequency mentioned above is referring to the interval between the dates generated, it could be hourly, daily, monthly, quarterly, yearly and more. You may learn about the frequency string aliases in this link [1].

Let’s see them one by one!

1. pandas.date_range

The pandas.date_range() method returns the DateTime series according to the combination of three parameters from the following four parameters:

  1. start — the start date of the date range generated
  2. end — the end date of the date range generated
  3. periods — the number of dates generated
  4. freq — default to “D”, the interval between dates generated, it could be hourly, monthly or yearly

Note: freq = “D” mean daily frequency.

At least three of the 4 parameters above must be stated to generate a DateTime series. Since the freq is defaulted to be “D”, if you are using freq=D, you need to specify the other two parameters only. If freq is omitted, which means you specify only the start, the end and the period parameters, and the date created will have periods of linearly spaced elements between the start and the end date. There are other parameters in the method, but in this article, we will focus on these 4 main parameters only.

For the first example, the dates are generated by specifying the start date and the period. As mentioned above, the frequency is set to be daily by default. Hence, there will be 10 dates generated at a daily frequency.

import pandas as pd
df = pd.DataFrame()
df["date_range"] = pd.date_range(start="2022/1/1", periods=10)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the second example, the start date, the periods, and the frequency are specified. The following example is creating a date series starting from 2020/1/1, 10 dates with 3-month intervals between each date.

import pandas as pd
df = pd.DataFrame()
df["date_range"] = pd.date_range(start="2022/1/1", periods=10, freq="3M")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

Why are the dates start from month’s end? 🤨

Well, it’s actually because the “M” frequency refers to the month-end frequency, while the “MS” refers to the month-start frequency [1].

For the third example, the start and end date, and the frequency are provided. As mentioned, the dates created will be linearly spaced when you omitted the frequency. Then, if the periods are omitted, the dates created will be the dates within the start and end date at the frequency interval specified.

import pandas as pd
df = pd.DataFrame()
df["date_range"] = pd.date_range(start="2022/1/1", end="2022-12-31", freq="3M")
print(df.head(10))
print("Data Type: ", df.dtypes)
Image by Author.
Image from Author. Created with Excalidraw.

As the next period will be Jan 31, 2023, hence there are only 4 dates created in the third example 😉.

A simple guide here:

When you are certain about the number of dates you want to create, you use the period parameter.

When you are not sure about how many exact dates you will have but know when should it end or it should not exceed, use the end parameter instead.

2. pandas.bdate_range

Similar to pandas.date_range() method, pandas.bdate_range() also have 4 main parameters, which are the start, the end, the periods and the freq, except the freq is defaulted to “B” in pandas.bdate_range(). The “B” refers to business day frequency, which the weekend like Saturday and Sunday will be skipped.

Let’s see the first example! So, in the following example, the start date and the periods are specified, and as mentioned, the frequency is defaulted to “B”.

import pandas as pd
df = pd.DataFrame()
# frequency is default to B, the weekend will be skipped
df["bdate_range"] = pd.bdate_range(start="2022/1/1", periods=10)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

The two dates which are skipped, “2022–01–08” and “2022–01–09” are Saturday and Sunday respectively.

You might notice that the pandas.date_range() method can return workdays only as well when you set the freq= “B”, then why do we need to use pandas.bdate_range()? 🤷‍♀️

This is because of the pandas.bdate_range() return business day by default and pandas.bdate_range() have weekmask and holidays parameters.

Note: To use the holidays or weekmask parameter, custom business day frequency need to be used, where freq= “C”. [2]

Now, let us see what is the holidays parameter. Holidays referring to the list of dates to exclude from the set of valid business days.

For the second example, the start date, periods, frequency and holidays parameter are specified.

import pandas as pd
df = pd.DataFrame()
# frequency is set to C, the weekend and holidays will be skipped
# only can set holiday when freq is set to "C"
holidays = [pd.datetime(2022,1,7)]
df["bdate_range"] = pd.bdate_range(start="2022/1/1", periods=10, freq="C", holidays=holidays)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

The holiday date specified is not on the list of dates generated, and as “C” refers to the custom business day frequency, so the weekend is still skipped in the date range created.

Note: The Holidays parameter only takes a list of the datetime objects.

Now, let’s see the weekmask parameter. Weekmask refers to the valid business days for a business that does not follow traditional business like Mon to Fri. Also, the default value for weekmask is equivalent to ‘Mon Tue Wed Thu Fri’.

For the third example, we specified the start date, the custom business day with weekmask = “Tue Wed Thu Fri Sat Sun”.

import pandas as pd
df = pd.DataFrame()
df["bdate_range"] = pd.bdate_range(start="2022/1/1", periods=10, freq="C", weekmask="Tue Wed Thu Fri Sat Sun")
print(df.head(10))
print("Data Type: ", df.dtypes)
Image by Author.

The Monday date (2022–01–10) will not be included in the dates created. This parameter is useful when the business not running according to a normal weekday.

Combining these two parameters, you can generate the DateTime series according to your business operating day as in the example below.

import pandas as pd
df = pd.DataFrame()
df["bdate_range"] = pd.bdate_range(start="2022/1/1", periods=10, freq="C", weekmask="Tue Wed Thu Fri Sat Sun", holidays=[pd.datetime(2022,1,7)])
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

As seen from the output, the Monday date (2022–01–10) and holiday date (2022–01–07) are not included in the list generated.

3. pandas.period range

There are some similarities and differences between pandas.period_range() method and the two methods earlier, the pandas.date_range() and pandas.bdate_range() .

Similar to the two methods earlier, pandas.period_range() can generate the date series by specifying the 3 out of the 4 main parameters, start, end, periods and freq. Also, the frequency is still defaulted to daily.

One difference to take note of is that the pandas.period_range() generate period object instead of a DateTime object.

For the first example, we are generating a series of 5 periods in daily frequency by default, starting from 2022–01–01.

import pandas as pd
df = pd.DataFrame()
df["period_range"] = pd.period_range(start="2022/1/1", periods=5)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the second example, we are generating a series of 5 periods in monthly frequency, starting from 2022–01–01.

import pandas as pd
df = pd.DataFrame()
df["period_range"] = pd.period_range(start="2022/1/1", periods=5, freq="M")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the third example, we are generating a series of 5 periods in yearly frequency, starting from 2022–01–01.

import pandas as pd
df = pd.DataFrame()
df["period_range"] = pd.period_range(start="2022/1/1", periods=5, freq="Y")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the last example, we are generating a series in yearly frequency, starting from 2022–01–01 to 2027–01–01.

import pandas as pd
df = pd.DataFrame()
df["period_range"] = pd.period_range(start="2022/1/1", end="2027/1/1", freq="Y")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

The period_range method work in the same way with pandas.date_range(), just it returns the period instead of the date. So, if the periods parameter is omitted, the periods created will be the periods within the start and end date with the frequency interval specified.

4. pandas.timedelta_range

Similar to the three methods above, pandas.timedelta_range() method returns the date series according to the combination of three parameters from the four main parameters, start, end, periods and frequency. The frequency is still defaulted to daily. There is one difference between this method with the three examples earlier, which can be explained with the example below.

The example below is from a mistake I made during running the script, and then the errors that occurred.

import pandas as pd
df = pd.DataFrame()
df["timedelta_range"] = pd.timedelta_range(start="2022/1/1", periods=5, freq="Y")
print(df.head(10))
print("Data Type: ", df.dtypes)

The script above returns a Key Error and a Value Error as below.

Key Error. Image by Author.
Value Error. Image by Author.

From the error script, we can see the error comes from the value we put for the “start” parameter. As we are generating a time delta object, the value we put for the “start” parameter should be in timedelta format too.

So, the correct example should be as below, where the start is specified in timedelta format, the number of periods is specified, and the default daily frequency is used.

import pandas as pd
df = pd.DataFrame()
df["timedelta_range"] = pd.timedelta_range(start="1 days", periods=5)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the second example, the starting time delta, the periods and the frequency are specified.

import pandas as pd
df = pd.DataFrame()
df["timedelta_range"] = pd.timedelta_range(start="1 day", periods=5, freq="6H")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the third example, the starting time delta, the end time delta and the frequency are specified.

import pandas as pd
df = pd.DataFrame()
df["timedelta_range"] = pd.timedelta_range(start="1 day", end="5days", freq="8H")
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

For the fourth example, the starting time delta, the end time delta and the periods are specified. The time delta series generated will be linearly spaced when the frequency is not set.

import pandas as pd
df = pd.DataFrame()
df["timedelta_range"] = pd.timedelta_range(start="1 day", end="5days", periods=3)
print(df.head(10))
print("Data Type: ", df.dtypes)

Output:

Image by Author.

Note: For the pandas.timedelta_range() method, the “start” parameter accepts only the time delta object, while for the other three methods, the “start” parameter takes the Datetime object as input.

5. Create DateTime with a Timestamp

In the pandas module, we can also create the datetime object with the timestamp method.

There are two ways to create a DateTime object with a timestamp, the first way is with the datetime parameters as below.

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html
import pandas as pd
timestampsample = pd.Timestamp(year=2022,month=12,day=13,hour=21,minute=48, second=23, microsecond=35, nanosecond=58)
timestampsample

Output:

Image by Author.

The second way is to create the timestamp from the DateTime string.

import pandas as pd
str_timestamp = '2022-12-13 21:48:23.000035058'
timestampsample2 = pd.Timestamp(str_timestamp)
timestampsample2
Image by Author.

Okay, so the above is a demonstration of the usage of timestamp methods to create a DateTime object.

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.
Leave a comment