Issue
I am trying to count the number of books in a dataset whose publication year is equal to or greater than 2000. Here is the format of the column: publication_date = "dd/mm/yyyy"
Here is my code:
df[int(df["publication_date"][-4: 0]) >= 2000]["publication_date"].count()
I am receiving the error below:
TypeError Traceback (most recent call last)
<ipython-input-31-ed1072acfb26> in <module>
----> 1 df[int(df["publication_date"][-4: 0]) >= 2000]["publication_date"].count()
/opt/conda/lib/python3.8/site-packages/pandas/core/series.py in wrapper(self)
127 if len(self) == 1:
128 return converter(self.iloc[0])
--> 129 raise TypeError(f"cannot convert the series to {converter}")
130
131 wrapper.__name__ = f"__{converter.__name__}__"
TypeError: cannot convert the series to <class 'int'>
What should I do to fix it?
Solution
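The error occurs because int() cannot convert a whole Series, and df["publication_date"][-4: 0] slices rows of the column rather than the characters of each string. To process the dates efficiently, convert the column to datetime, then extract the year and compare it.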
import pandas as pd
data = {'publication_date': ['10/05/1999', '15/12/2005', '23/09/2002', '05/03/2000', '18/07/2008']}
df = pd.DataFrame(data)
df['publication_date'] = pd.to_datetime(df['publication_date'], format='%d/%m/%Y')
# Fastest: builds a boolean mask (publication year >= 2000) for the column and sums the True values
print(df["publication_date"].dt.year.ge(2000).sum())
# Slightly slower: filters the DataFrame to rows with publication year >= 2000, then counts the non-null values in the column
print(df[df['publication_date'].dt.year >= 2000]['publication_date'].count())
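If you would rather stay close to the original attempt and not convert to datetime, a minimal sketch along these lines should also work. It assumes every value is a well-formed "dd/mm/yyyy" string with no missing entries. The sample values and the names years and count_2000_or_later are only for illustration.

```python
import pandas as pd

# Sample data in the question's dd/mm/yyyy string format (illustrative values).
df = pd.DataFrame({
    "publication_date": ["10/05/1999", "15/12/2005", "23/09/2002",
                         "05/03/2000", "18/07/2008"]
})

# .str[-4:] takes the last four characters of every string in the column,
# whereas a plain [-4:0] on the Series slices rows, not characters.
# Assumption: all values are valid dd/mm/yyyy strings (no NaN, no other formats).
years = df["publication_date"].str[-4:].astype(int)

# Summing a boolean mask counts the True values.
count_2000_or_later = int(years.ge(2000).sum())
print(count_2000_or_later)  # 4
```

This skips date validation entirely, so the datetime conversion is the safer choice when the column may contain malformed or missing dates.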
Performance measurement (500,000 rows; each printed time is the total in seconds for 500 runs):
import pandas as pd
import timeit as t
data = {'publication_date': ['10/05/1999', '15/12/2005', '23/09/2002', '05/03/2000', '18/07/2008']*100000}
df = pd.DataFrame(data)
df['publication_date'] = pd.to_datetime(df['publication_date'], format='%d/%m/%Y')
time = t.timeit(stmt='df["publication_date"].dt.year.ge(2000).sum()', number=500, globals=globals())
print(time) # 13.602070399967488
time = t.timeit(stmt='df[df["publication_date"].dt.year > 2000].astype(bool).sum()', number=500, globals=globals())
print(time) # 16.904740899975877
time = t.timeit(stmt='df[df["publication_date"].dt.year > 2000].count()', number=500, globals=globals())
print(time) # 17.05563960003201
Answered By - Tấn Nguyên