Issue
I am performing the following operation using Dask.
import dask.dataframe as dd
import pandas as pd
salary_df = pd.DataFrame({"Salary":[10000, 50000, 25000, 30000, 7000]})
salary_category = pd.DataFrame({"Hi":[5000, 20000, 25000, 30000, 90000],
                                "Low":[0, 5001, 20001, 25001, 30001],
                                "category":["Very Poor", "Poor", "Medium", "Rich", "Super Rich"]
                               })
sal_ddf = dd.from_pandas(salary_df, npartitions=10)
salary_category.index = pd.IntervalIndex.from_arrays(salary_category['Low'],salary_category['Hi'],closed='both')
sal_ddf['Category'] = sal_ddf['Salary'].apply(lambda x : salary_category.iloc[salary_category.index.get_loc(x)]['category'])
I do get the expected results, but the line below triggers a warning:
sal_ddf['Category'] = sal_ddf['Salary'].apply(lambda x : salary_category.iloc[salary_category.index.get_loc(x)]['category'])
You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
Before: .apply(func)
After: .apply(func, meta=('Salary', 'object'))
What am I missing here?
Solution
The missing keyword argument here is meta. Dask generates an automatic suggestion (in the warning message):
After: .apply(func, meta=('Salary', 'object'))
Because this is only a warning, specifying meta is optional for many use cases. Without it, Dask runs the function on a small dataset to guess the output type. Providing meta is useful if you want to be explicit about the dtype of the calculated variables.
Running the snippet below should not generate the warning message:
# extracted your code into `func` for readability only
func = lambda x: salary_category.iloc[salary_category.index.get_loc(x)]['category']
sal_ddf['Category'] = sal_ddf['Salary'].apply(func, meta=('Salary', 'object'))
For more details, the Dask documentation on the meta keyword might be useful.
Answered By - SultanOrazbayev