Thursday, November 16, 2023

[FIXED] Pandas data manipulation and counting on the same line

Issue

I am trying to count the number of books in a dataset whose publication year is equal to or greater than 2000. Here is the format of the column: publication_date = "dd/mm/yyyy"

Here is my code:

df[int(df["publication_date"][-4: 0]) >= 2000]["publication_date"].count()

I am receiving an error like the one below:

TypeError                                 Traceback (most recent call last)
<ipython-input-31-ed1072acfb26> in <module>
----> 1 df[int(df["publication_date"][-4: 0]) >= 2000]["publication_date"].count()

/opt/conda/lib/python3.8/site-packages/pandas/core/series.py in wrapper(self)
    127         if len(self) == 1:
    128             return converter(self.iloc[0])
--> 129         raise TypeError(f"cannot convert the series to {converter}")
    130 
    131     wrapper.__name__ = f"__{converter.__name__}__"

TypeError: cannot convert the series to <class 'int'>

What should I do to fix it?

Solution

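The error occurs because int() cannot be applied to a whole Series (it only works when the Series holds exactly one element), and df["publication_date"][-4: 0] slices the rows of the Series rather than the characters of each string. To process the dates quickly, convert the column to datetime first, then extract the year and compare it.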

import pandas as pd

data = {'publication_date': ['10/05/1999', '15/12/2005', '23/09/2002', '05/03/2000', '18/07/2008']}
df = pd.DataFrame(data)

df['publication_date'] = pd.to_datetime(df['publication_date'], format='%d/%m/%Y')
# Fastest: builds a boolean mask (year >= 2000) for the column and sums the True values.
print(df["publication_date"].dt.year.ge(2000).sum())

# Slightly slower: filters the DataFrame on the same condition, then counts the non-null values in the column.
print(df[df['publication_date'].dt.year >= 2000]['publication_date'].count())
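If the column only ever holds well-formed "dd/mm/yyyy" strings, the string slicing attempted in the question can also be made to work with the .str accessor. The following is a minimal sketch under that assumption, not part of the original answer; the names years and count_since_2000 are illustrative.

```python
import pandas as pd

# Sample data in the dd/mm/yyyy layout described in the question.
df = pd.DataFrame({
    "publication_date": ["10/05/1999", "15/12/2005", "23/09/2002", "05/03/2000", "18/07/2008"]
})

# .str[-4:] slices every string element-wise (the last four characters),
# and astype(int) converts the whole column at once instead of calling int() on the Series.
years = df["publication_date"].str[-4:].astype(int)

# Summing a boolean Series counts its True values.
count_since_2000 = int((years >= 2000).sum())
print(count_since_2000)  # 4
```

This skips date validation entirely, so a malformed value would either raise in astype(int) or be silently misread; the pd.to_datetime route is the safer choice when the data is not known to be clean.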

Performance measurement:

import pandas as pd
import timeit as t

data = {'publication_date': ['10/05/1999', '15/12/2005', '23/09/2002', '05/03/2000', '18/07/2008']*100000}

df = pd.DataFrame(data)

df['publication_date'] = pd.to_datetime(df['publication_date'], format='%d/%m/%Y')

time = t.timeit(stmt='df["publication_date"].dt.year.ge(2000).sum()', number=500, globals=globals())
print(time) # 13.602070399967488 (total seconds for 500 runs)

time = t.timeit(stmt='df[df["publication_date"].dt.year > 2000].astype(bool).sum()', number=500, globals=globals())
print(time) # 16.904740899975877 (total seconds for 500 runs)

time = t.timeit(stmt='df[df["publication_date"].dt.year > 2000].count()', number=500, globals=globals())
print(time) # 17.05563960003201 (total seconds for 500 runs)
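Note that the timed statements above compare ge(2000) against > 2000, so they do not count the same rows. A like-for-like rerun might look like the sketch below, which uses >= 2000 in both approaches and first checks that they agree. The helper names count_with_mask and count_with_filter, the smaller data size, and the run count are assumptions chosen for illustration; timings depend on the machine, so none are shown.

```python
import timeit

import pandas as pd

# Smaller data set than in the measurement above, so the sketch runs quickly.
dates = ["10/05/1999", "15/12/2005", "23/09/2002", "05/03/2000", "18/07/2008"] * 1000
df = pd.DataFrame({"publication_date": pd.to_datetime(dates, format="%d/%m/%Y")})


def count_with_mask(frame):
    # Build a boolean mask and sum its True values.
    return int(frame["publication_date"].dt.year.ge(2000).sum())


def count_with_filter(frame):
    # Filter the rows first, then count the non-null values that remain.
    year = frame["publication_date"].dt.year
    return int(frame.loc[year >= 2000, "publication_date"].count())


# Both approaches must agree before their timings are worth comparing.
assert count_with_mask(df) == count_with_filter(df) == 4000

for func in (count_with_mask, count_with_filter):
    elapsed = timeit.timeit(lambda: func(df), number=20)
    print(f"{func.__name__}: {elapsed:.4f} s for 20 runs")
```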

Answered By - Tấn Nguyên

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
