Issue
I have a column in a dask data frame that contains comma-separated lists of different categories. I'm looking to replicate the functionality of sklearn's MultiLabelBinarizer or the pandas function Series.str.get_dummies(','), exactly as this thread describes: Create dummies from column with multiple values in dask
Is there absolutely no way to do this as the one answer there states? Is there a way to implement this if I got a list of all of the values?
Solution
If the list of all classes is known, then it's an easy task for dask:
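import dask.dataframe as dd
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"col_a": ["c, d", "e", "g", "e, g", "d, e"]})
all_classes = ["c", "d", "e", "g"]
mlb = MultiLabelBinarizer(classes=all_classes)

def myfunc(df):
    # Split each "c, d" string into a list of labels; passing raw strings
    # would make MultiLabelBinarizer treat every character as a label.
    labels = [[v.strip() for v in s.split(",")] for s in df["col_a"]]
    # Keep the partition's index so rows stay aligned with the input.
    return pd.DataFrame(mlb.fit_transform(labels), columns=all_classes, index=df.index)

ddf = dd.from_pandas(df, npartitions=2)
meta = pd.DataFrame(columns=all_classes, dtype=int)
ddf.map_partitions(myfunc, meta=meta).compute()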
If the list is not known, then one option is to do a first pass through the dataframe to collect all unique values, then plug these classes into a snippet similar to the one above.
Answered By - SultanOrazbayev