Issue
Let's say I have a dataframe like this:
       date_1      date_2
0  2022-08-01  2022-08-05
1  2022-08-20         NaN
2         NaN         NaN
I want to add another column that holds the difference in business days between the two dates. If date_2 is empty, date_1 is compared to today's date (2022-08-28). The resulting dataframe should look like this:
       date_1      date_2   diff
0  2022-08-01  2022-08-05      4
1  2022-08-20         NaN      5
2         NaN         NaN  Empty
I tried to use this one:
df["diff"] = df.apply(
lambda x: np.busday_count(x.date_1, x.date_2) if (x.date_1 != '' and x.date_2 != '') else (np.busday_count(x.date_1, np.datetime64('today')) if (x.date_1 != '' and x.date_2 == '') else ''), axis=1)
but I'm getting this error:
Iterator operand 0 dtype could not be cast from dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'
Any idea how to get the desired dataframe?
Solution
I think you just need to coerce the types: np.busday_count expects day-resolution (datetime64[D]) values. It is also better to avoid lambdas when you have more than one condition to check. The code below is runnable as-is, though the second diff value will change if you run it tomorrow :)
import numpy as np
import pandas as pd

def busday_diff(x):
    # No start date: nothing to compute
    if pd.isna(x.date_1):
        return ""
    # Fall back to today's date when date_2 is missing
    date2_to_use = pd.Timestamp("today") if pd.isna(x.date_2) else x.date_2
    # Coerce both dates to day resolution, as np.busday_count requires
    return np.busday_count(np.datetime64(x.date_1, "D"), np.datetime64(date2_to_use, "D"))

df = pd.DataFrame(
    {"date_1": ["2022-08-01", "2022-08-20", np.nan], "date_2": ["2022-08-05", np.nan, np.nan]}
)
df["diff"] = df.apply(busday_diff, axis=1)
print(df)
#        date_1      date_2 diff
# 0  2022-08-01  2022-08-05    4
# 1  2022-08-20         NaN    5
# 2         NaN         NaN
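To see why the coercion matters, here is a minimal sketch that reproduces the kind of error from the question with a single pair of dates. The variable names (start_us, end_day, raised, coerced) are made up for illustration, and the microsecond-resolution start date stands in for whatever timestamp the original dataframe held.

```python
import numpy as np

# A start date at microsecond resolution, like the '<M8[us]' in the error message
start_us = np.datetime64("2022-08-01", "us")
end_day = np.datetime64("2022-08-05", "D")

# np.busday_count only accepts day resolution, and casting us -> D is not "safe"
try:
    np.busday_count(start_us, end_day)
    raised = False
except TypeError:
    raised = True

# Explicitly casting to day resolution makes the call succeed
coerced = np.busday_count(start_us.astype("datetime64[D]"), end_day)
print(raised, coerced)
```

The end date is exclusive, so Monday 2022-08-01 through Thursday 2022-08-04 are the days counted.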
If you have to do this for more than a handful of rows, you will probably want to vectorize it. Pandas and NumPy are much faster when operations are vectorized:
df = pd.DataFrame(
    {
        "date_1": ["2022-08-01", "2022-08-20", np.nan, np.nan],
        "date_2": ["2022-08-05", np.nan, np.nan, "2022-08-10"],
    }
)
calcable = df[~df.date_1.isnull()].fillna(pd.Timestamp("today").date())[["date_1", "date_2"]]
df["diff"] = pd.Series(
    np.busday_count(
        calcable.date_1.values.astype("datetime64[D]"),
        calcable.date_2.values.astype("datetime64[D]"),
    ),
    index=calcable.index,
)
Interestingly, the cast to "D" resolution must be called on the underlying NumPy array (.values). Otherwise the result reverts to "ns" resolution. This is probably the origin of the confusion behind this question. It seems a strange design decision on the part of pandas:
calcable.date_1.values.astype("datetime64[D]")
# array(['2022-08-01', '2022-08-20'], dtype='datetime64[D]')
calcable.date_1.astype("datetime64[D]").values
# array(['2022-08-01T00:00:00.000000000', '2022-08-20T00:00:00.000000000'],
#       dtype='datetime64[ns]')
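As a variation on the vectorized approach, the sketch below first parses both columns with pd.to_datetime and then casts the NumPy arrays to day resolution. This is one possible arrangement, not the only one. The fixed reference date stored in today is an assumption, used to keep the result reproducible, and the nullable "Int64" dtype for the result is a choice made here, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "date_1": ["2022-08-01", "2022-08-20", np.nan],
        "date_2": ["2022-08-05", np.nan, np.nan],
    }
)

# Assumption: a fixed "today" so the example is reproducible;
# use pd.Timestamp("today").normalize() for real data.
today = pd.Timestamp("2022-08-28")

start = pd.to_datetime(df["date_1"])
end = pd.to_datetime(df["date_2"]).fillna(today)

# Only rows with a start date can be computed
valid = start.notna()
counts = np.busday_count(
    start[valid].to_numpy().astype("datetime64[D]"),
    end[valid].to_numpy().astype("datetime64[D]"),
)

# Align the counts back to the full index; rows without date_1 become <NA>
df["diff"] = (
    pd.Series(counts, index=start[valid].index).reindex(df.index).astype("Int64")
)
print(df)
```

With the fixed reference date, the diff column holds 4, 5 and a missing value, matching the desired output in the question.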
Answered By - mmdanziger