
Generating Fake Data for Data Analytics | by Wei-Meng Lee | Mar, 2023



If you don’t have real data, you got to fake it!

Photo by Leif Christoph Gottwald on Unsplash

In the world of data analytics, getting your hands on a good dataset is of paramount importance. In the real world, you probably have access to a lot of raw data that you need to spend some time cleaning. But what if you do not have the required data and want to hack something out quickly for a proof-of-concept demo? In this situation, you often have to cook up your own data, and that data still needs some degree of realism. So what do you do? Do you painstakingly make up the data manually, or is there an automated way of doing things?

In this article, I will show you some cool ways to fake your data, and make them look real!

To generate some fake names, you can use the names package. To use it, first you need to install it:

!pip install names

You can now use the various functions in the package to generate gender-specific names:

import names

display(names.get_full_name(gender='male'))
display(names.get_first_name(gender='male'))
display(names.get_last_name())    # last names are not gender-specific

display(names.get_full_name(gender='female'))
display(names.get_first_name(gender='female'))
display(names.get_last_name())

Here are some names generated:

'Gerald Paez'
'Matthew'
'Wiese'
'Dana Mcmullen'
'Heather'
'Oxley'

'Walter Walters'
'Connie'
'Vildosola'
'Nancy Correra'
'Aaron'
'Dawes'

'Randy Meli'
'Yvonne'
'Owen'
'Loretta Patague'
'Sidney'
'Oliver'

Besides names, another type of data that you might want to generate is UUIDs. A UUID (Universally Unique Identifier) is a 128-bit value used to uniquely identify an object or entity. In the mobile world, UUIDs are often used to identify apps installed on devices.

To generate sample UUIDs, you can use the uuid module. It ships with the Python standard library, so there is nothing to install.

You can convert the UUID generated to a string:

import uuid

str(uuid.uuid4())

Here is a sample UUID generated:

'54487fd7-0632-450e-b6e3-bcc54bc83133'
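As a small sketch of what the uuid module gives you, the snippet below generates a random UUID, converts it to a string, and parses that string back. It uses only the standard library; the variable names are just for illustration.

```python
import uuid

# Generate a random (version 4) UUID.
u = uuid.uuid4()

# The canonical string form is 36 characters: 32 hex digits plus 4 hyphens.
s = str(u)

# Parsing the string gives back an equal UUID object.
parsed = uuid.UUID(s)

print(s)
print(u.version)          # the UUID version number
print(len(u.bytes) * 8)   # size of the value in bits
```

Storing the string form is usually the most convenient choice for a DataFrame or a CSV file, since it can always be parsed back into a UUID object later.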

When generating fake data using Python, the faker package is definitely worth mentioning. It generates all sorts of fake data, and you can install it with pip install faker. The data that you can generate includes:

  • address
  • barcode
  • credit card information
  • ISBN
  • phone number
  • and more!

In the following sections, I will show you how to generate some commonly needed data.

Generating User Profile

The faker package can generate user profiles, such as username, sex, address, email, and date of birth. The following code snippet creates a simple profile for a male person:

from faker import Faker

fake = Faker()
fake.simple_profile(sex='M') # use 'F' for female

The output is a dictionary containing the various details of a male person:

{'username': 'lisa38',
'name': 'Brandon Gibson',
'sex': 'M',
'address': '406 Brandi Inlet\nWest Christopherville, PR 41632',
'mail': '[email protected]',
'birthdate': datetime.date(2008, 9, 10)}

Generating Dates

One particular type of data I want to generate is the date of birth (DOB) of a person. When storing details of a person, it is always recommended to store the DOB rather than the age: a stored age goes stale as time passes, while a DOB lets you recompute the age at any time.

Using the faker package, you can generate the birth date of a person who is between 18 and 60 years old:

fake.date_between(start_date='-60y', end_date='-18y')

The data returned is a date object:

datetime.date(1963, 4, 18)

If you want to convert the result to a string, you can use the strftime() method of the date object:

fake.date_between(start_date='-60y', end_date='-18y').strftime('%Y-%m-%d')
# '1973-07-16'

Note that every time you call a function from the Faker object, a new set of data is generated. If you want the data generated to be deterministic (i.e., always the same on every run), you can seed the generator before generating any data, like this: Faker.seed(0).

Generating Locations

The next type of data I want to generate is location data. For example, suppose you want the latitude and longitude of a location in the US. You can use the local_latlng() function and specify the country_code parameter:

fake.local_latlng(country_code = 'US')

The function returns a location known to exist on land in the country specified by country_code. The information is returned as a tuple that looks like this:

('33.72255', '-116.37697', 'Palm Desert', 'US', 'America/Los_Angeles')

If you only want the latitude and longitude and not the rest, set coords_only to True:

fake.local_latlng(country_code = 'US', coords_only=True)

The country_code parameter accepts values from the land_coords constant, such as AU for Australia:

fake.local_latlng(country_code = 'AU')
# ('-25.54073', '152.70493', 'Maryborough', 'AU', 'Australia/Brisbane')

I couldn’t find the definition for the land_coords constant from the Faker documentation, but you can reference the land_coords variable defined in https://rdrr.io/github/LuYang19/faker/src/R/init.R.

If you want a pair of coordinates that is guaranteed to exist on land, use the location_on_land() function:

fake.location_on_land(coords_only=True)
# ('54.58048', '16.86194')

Generating Addresses

If you want to generate some sample addresses, use the address(), current_country(), city(), country(), and country_code() functions:

display(fake.address())
# '910 Jason Green Apt. 954\nJonesland, IL 76881'

display(fake.current_country()) # the country of the Faker locale in use (en_US by default)
# 'United States'

display(fake.city())
# 'North Carolyn'

display(fake.country())
# 'Holy See (Vatican City State)'

display(fake.country_code())
# 'MU'

Locales Support in Faker

So far all the names and addresses generated are in English. However, the faker package also supports different locales. The list of supported locales can be found at: https://faker.readthedocs.io/en/master/locales.html.

The following figure shows an example locale — zh_CN:

All images by author

For example, in the zh_CN locale, you can find the following providers:

  • faker.providers.address
  • faker.providers.company
  • faker.providers.date_time
  • faker.providers.internet
  • faker.providers.job
  • faker.providers.lorem
  • faker.providers.person
  • faker.providers.phone_number
  • faker.providers.ssn

This means that all the above listed providers support the zh_CN locale. Take the faker.providers.address (https://faker.readthedocs.io/en/master/locales/zh_CN.html#faker-providers-address) as an example. When instantiating a Faker object, you can pass in one or more locales:

fake = Faker(['zh_CN'])   # Chinese in China locale
fake.address()

The above address() function returns the address in Chinese:

'内蒙古自治区飞市兴山深圳路b座 104347'

If you use the zh_CN locale, some functions will be tied to this locale, such as:

  • fake.name()
  • fake.address()
  • fake.current_country()

Here are some examples:

'洪凤兰'
'辽宁省波县永川王路s座 292815'
"People's Republic of China"

'吕峰'
'广东省凤英市吉区李路t座 385879'
"People's Republic of China"

'何秀梅'
'浙江省齐齐哈尔市上街潮州路M座 218662'
"People's Republic of China"

The address results will be those locations in China.

Calling other functions such as fake.country() will return other countries, but the result will be in Chinese (based on the zh_CN locale):

'越南'

越南 is Vietnam.

You can also generate Chinese names using the zh_CN locale:

fake = Faker(['zh_CN'])
display(fake.first_name_male())
display(fake.last_name_male())
display(fake.name())

Here is a sample output of the above code snippet:

'龙'
'马'
'雷春梅'

With all these ways to generate different types of fake data, I want to put them all together so that I can perform some data analytics on them.

The following code snippet generates 1000 sets of the following data:

  • UUID
  • User name
  • Latitude, longitude, and country from one of the seven countries
  • Gender
  • Date of birth

from faker import Faker
import random
import uuid

uuids = []
usernames = []
latitudes = []
longitudes = []
genders = []
countries = []
dobs = []
n = 1000

fake = Faker()
country_codes = ['US','GB','AU','CN','FR','CH','DE']
for gender in ['M','F']:
    for i in range(n // 2): # 500 males and 500 females
        # uuids
        uuids.append(str(uuid.uuid4()))

        # username and sex
        profile = fake.simple_profile(sex=gender)
        usernames.append(profile['username'])
        genders.append(profile['sex'])

        # dob
        dobs.append(fake.date_between(start_date='-78y', end_date='-18y'))

        # lat and lng, and country (picked at random from the seven codes)
        location = fake.local_latlng(country_code = random.choice(country_codes))
        latitudes.append(location[0])
        longitudes.append(location[1])
        countries.append(location[3])

I then combined the 1000 sets of data into a Pandas DataFrame:

import pandas as pd
df = pd.DataFrame({'uuid': uuids, 'username': usernames, 'gender': genders,
                   'country': countries, 'latitude': latitudes,
                   'longitude': longitudes, 'dob': dobs})
df

The dataframe now contains 1000 fictitious user accounts and their personal details, such as UUID, location information, gender, and DOB:

Plotting a map

With the latitude and longitude, it would be interesting to plot the geographical locations of my users. For this I used Folium:

import folium                    # pip install folium

mymap = folium.Map(location = [22.827806844385826, 4.363328554220703],
                   width = 950,
                   height = 600,
                   zoom_start = 2,
                   tiles = 'openstreetmap')

# extra tilesets; newer folium releases may need a tile URL and
# attribution for the Stamen layers instead of these names
folium.TileLayer('Stamen Terrain').add_to(mymap)
folium.TileLayer('Stamen Toner').add_to(mymap)
folium.TileLayer('Stamen Water Color').add_to(mymap)
folium.TileLayer('cartodbpositron').add_to(mymap)
folium.TileLayer('cartodbdark_matter').add_to(mymap)
folium.LayerControl().add_to(mymap)

for lat, lng in zip(df['latitude'], df['longitude']):
    station = folium.CircleMarker(
        # Faker returns the coordinates as strings
        location = [float(lat), float(lng)],
        radius = 5,
        color = 'red',
        fill = True,
        fill_color = 'yellow',
        fill_opacity = 0.3)

    # add the circle marker to the map
    station.add_to(mymap)
mymap

Here’s the map showing the distribution of my users:

I can zoom into the map:

I can also change the tilesets:

Plotting pie chart

I can visualize where my users are from:

df.groupby('country').count().plot.pie(y='username')

I could also make the pie chart more descriptive:

total = df.shape[0]
def fmt(x):
    # x is the percentage of the slice; convert it back to a user count
    return '{:.2f}%\n({:.0f})'.format(x, total * x / 100)

df.groupby('country').count().plot.pie(y='username', autopct=fmt)

Plotting bar chart

The total users from each country can also be plotted using a bar chart:

from matplotlib import cm
import numpy as np

color = cm.inferno_r(np.linspace(.4, .8, len(country_codes)))

df.groupby('country').count().plot.bar(y = 'username',
                                       color = color,
                                       legend = False)

From the chart you can see that Great Britain has the most users while China has the fewest:

Plotting histogram

I can also find out about the age distribution of my users. For this, I need to first calculate their current age based on their DOB:

from datetime import date
from dateutil import relativedelta

def cal_age(born):
    # number of full years between the DOB and today
    return relativedelta.relativedelta(date.today(), born).years

df['age'] = df['dob'].apply(cal_age)
df
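If you prefer not to depend on dateutil, the same age calculation can be sketched with only the standard library. The age_on() helper below is a hypothetical name of my own; it takes the reference day as a parameter so that the result is easy to check.

```python
from datetime import date

def age_on(born, today):
    """Return the age in whole years on the given day."""
    # Subtract one year if the birthday has not yet occurred this year.
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - before_birthday

print(age_on(date(1990, 6, 15), date(2023, 6, 14)))  # 32, the day before the birthday
print(age_on(date(1990, 6, 15), date(2023, 6, 15)))  # 33, on the birthday
```

To use it on the dataframe, you could write df['dob'].apply(lambda d: age_on(d, date.today())). For ordinary dates it should agree with the relativedelta version, though I have not compared the two for birthdays on 29 February.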

The dataframe now has an additional column showing the age of each user:

You can now plot a histogram showing the age distribution:

ax = df['age'].hist(bins=15, edgecolor='black', linewidth=1.2, color='yellow')
ax.set_xlabel("Age")
ax.set_ylabel("Total")
ax.set_xticks(range(18,80,5))
ax.set_title("Users age distribution")


I hope you are now better equipped to generate any additional data that your projects need. Generating realistic demo data not only allows you to test your algorithms more accurately, it also provides more realism when using them for demos. Let me know in the comments what other types of data you usually need to generate!

