Issue
df -> ["user_id", "num_posts", "posts" ...]
My df is made of rows containing data for Reddit user accounts; for each row, "posts" contains a series of separate posts by that user.
The number of posts ranges up to 6000 for certain users.
import pandas as pd

data = pd.DataFrame(columns=["user_id", "posts"])
for row in df.itertuples():
    # row[0] is the index, row[1] is user_id, row[3] is posts
    for post in row[3]:
        new_row = [row[1], post]
        data.loc[len(data)] = new_row  # appends one row at a time
It seems the inner for-loop, which iterates over the results from itertuples, makes this terribly slow!
Even if I cap the number of posts grabbed for a single user at 100, the code doesn't return for hours, even when running on a high-powered remote server!
Any thoughts on how to improve the runtime?
Solution
I tested your code against a 'concat' approach with a list comprehension, and the list-comprehension version ran 12 times faster:
data = pd.concat([pd.DataFrame([[row[1], post] for post in row[3]], columns=["user_id", "posts"])
                  for row in df.itertuples()], ignore_index=True)
Answered By - Ze'ev Ben-Tsvi
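A possible alternative, not part of the answer above: if each "posts" cell holds a list-like of posts, pandas' DataFrame.explode (available since pandas 0.25) can produce one row per post without any Python-level loop. The sketch below uses a small made-up df in place of the real data; the sample values are assumptions for illustration only, and it has not been timed against the approaches above.

```python
import pandas as pd

# Hypothetical toy data standing in for the real df (values are made up).
df = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "num_posts": [2, 1],
    "posts": [["first post", "second post"], ["only post"]],
})

# Keep only the needed columns, then emit one row per element of "posts".
# reset_index(drop=True) replaces the repeated original index with 0..n-1.
data = df[["user_id", "posts"]].explode("posts").reset_index(drop=True)

print(data)
```

To cap the number of posts per user, one option is to truncate each list before exploding, for example with df["posts"].str[:100], which slices list elements as well as strings.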