Issue
I would like to generate a dummy dataset composed of a fake first name and a last name for 40 milion records using multiple processor n cores.
Below is a single task loop that generates a first name and a last name and appends them to a list:
import pandas as pd
from faker import Faker
def fake_data_generation(records):
fake = Faker(['en_US','en_GB'])
person = []
for i in range(records):
first_name = fake.first_name()
last_name = fake.last_name()
person.append({"First_Name": first_name,
"Last_Name": last_name}
)
return person
Output:
for i in range(5):
df = pd.DataFrame(fake_data_generation(i))
>>> df
First_Name Last_Name
0 Colin Stewart
1 Barbara Rios
2 Victor Green
3 Stephanie Booth
Solution
Maybe you can use providers
directly:
import pandas as pd
import numpy as np
from faker.providers.person.en_US import Provider as us
from faker.providers.person.en_GB import Provider as gb
first_names = list(set(us.first_names).union(gb.first_names))
last_names = list(set(us.last_names).union(gb.last_names))
N = 40_000_000
df = pd.DataFrame({'First_Name': np.random.choice(first_names, N),
'Last_Name': np.random.choice(last_names, N)})
Output:
>>> df
First_Name Last_Name
0 Kayla Tran
1 Gary Bates
2 Daisy Leblanc
3 Tiffany Ahmed
4 Kellie May
... ... ...
39999995 Kristine Collier
39999996 Joyce Mccoy
39999997 Paul Padilla
39999998 Tonya Bevan
39999999 Julie Bright
[40000000 rows x 2 columns]
Answered By - Corralien
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.