Issue
I am using sklearn.preprocessing.FunctionTransformer with some custom functions.
import re

import pandas as pd


def enumerate_virus_scanned(virus_scanned: str) -> int:
    return 1 if not pd.isnull(virus_scanned) else 0


def enumerate_priority(priority: str) -> int:
    try:
        return int(re.search(r'\d+', priority).group(0))
    except (AttributeError, TypeError):
        return 0


def enumerate_encoding(encoding: str) -> int:
    content_transfer_encoding = {
        "na": 0,
        "base64": 1,
        "quoted-printable": 2,
        "8bit": 3,
        "7bit": 4,
        "binary": 5
    }
    try:
        return content_transfer_encoding[encoding.lower()]
    except (AttributeError, KeyError):
        return 0
As you may notice, these functions take a scalar as input, but FunctionTransformer passes the whole DataFrame to the function. Thus, I need to use the pd.DataFrame.applymap() method for each transformer.
virus_scanned_transformer, priority_transformer, encoding_transformer = (
    FunctionTransformer(lambda df: df.applymap(func)) for func in
    [enumerate_virus_scanned, enumerate_priority, enumerate_encoding]
)
However, this does not work. I do not want to rewrite the functions to call df.applymap() inside, like this:
def enumerate_virus_scanned(df: pd.DataFrame) -> pd.DataFrame:
    return df.applymap(lambda x: 1 if not pd.isnull(x) else 0)
Is it possible to write a decorator that automatically calls df.applymap() on the DataFrame, while the decorated function itself still transforms a single scalar?
def transformer_wrapper(func):
    def wrap(*args, **kwargs):
        return df.applymap(func)
    return wrap


@transformer_wrapper
def enumerate_virus_scanned(virus_scanned: str) -> int:
    return 1 if not pd.isnull(virus_scanned) else 0
Maybe there is a better solution for that?
Solution
Your decorator is fairly close; the wrapper just needs to accept df as its first positional argument:
from functools import wraps

import pandas as pd
from numpy import nan


def applymap_wrap(func):
    """Turn a scalar function into one that is applied to every DataFrame cell."""
    @wraps(func)
    def wrapper(df, *args, **kwargs):
        # On pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap.
        return df.applymap(func, *args, **kwargs)
    return wrapper


@applymap_wrap
def enumerate_virus_scanned(virus_scanned: str) -> int:
    return 1 if not pd.isnull(virus_scanned) else 0

# ---
df = pd.DataFrame({
    "x": [ 10, nan,  20,  30, nan],
    "y": [nan, nan,   1,   2,   3]
})

print(enumerate_virus_scanned(df))
x y
0 1 0
1 0 0
2 1 1
3 1 1
4 0 1
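As a side note, the generator of lambdas in the question probably fails for a separate reason: Python closures look up func when the lambda is called, not when it is created, so after unpacking all three transformers would end up applying the last function in the list. The sketch below illustrates this with plain pandas. The helper elementwise and the functions is_present and is_missing are hypothetical names made up for the illustration, not part of the question or of any library.

```python
import pandas as pd


def elementwise(df, func):
    # Hypothetical helper: DataFrame.map replaced applymap in pandas 2.1,
    # so fall back to applymap on older versions.
    apply_fn = df.map if hasattr(df, "map") else df.applymap
    return apply_fn(func)


def is_present(x):
    return 0 if pd.isnull(x) else 1


def is_missing(x):
    return 1 if pd.isnull(x) else 0


funcs = [is_present, is_missing]

# Late binding: each lambda looks up `func` only when it is called, and by
# then the generator is exhausted, so both lambdas see is_missing.
late_a, late_b = (lambda df: elementwise(df, func) for func in funcs)

# A default argument is evaluated when the lambda is created, which freezes
# the current function for each iteration.
bound_a, bound_b = (lambda df, func=func: elementwise(df, func) for func in funcs)

df = pd.DataFrame({"x": [10, None]})
print(late_a(df)["x"].tolist())   # applies is_missing, not is_present
print(bound_a(df)["x"].tolist())  # applies is_present as intended
```

The decorator above avoids this problem because each decorated function captures its own func at decoration time, so the decorated functions can be handed to FunctionTransformer directly.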
But on the other hand, why not use DataFrame-level methods? Approaches like DataFrame.isnull() (or DataFrame.notnull(), used here) are much faster than DataFrame.applymap(lambda x: …):
print(df.notnull().astype(int))
x y
0 1 0
1 0 0
2 1 1
3 1 1
4 0 1
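The same idea can probably be extended to the other two functions from the question. The following is a minimal sketch, not a drop-in replacement: it assumes every cell is either a string or missing (the .str accessor raises on purely numeric columns, whereas the original functions return 0 for such values). The names priority_vectorized, encoding_vectorized, and the sample emails frame are made up for the illustration.

```python
import pandas as pd

# Mapping copied from enumerate_encoding in the question.
CONTENT_TRANSFER_ENCODING = {
    "na": 0, "base64": 1, "quoted-printable": 2,
    "8bit": 3, "7bit": 4, "binary": 5,
}


def priority_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    """First run of digits in each cell as an int; 0 where there is none."""
    def per_column(col: pd.Series) -> pd.Series:
        # Assumes string-or-missing cells; .str fails on numeric columns.
        digits = col.str.extract(r"(\d+)", expand=False)
        return pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)
    return df.apply(per_column)


def encoding_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    """Look up each lowercased cell in the mapping; 0 for unknown or missing."""
    def per_column(col: pd.Series) -> pd.Series:
        codes = col.str.lower().map(CONTENT_TRANSFER_ENCODING)
        return codes.fillna(0).astype(int)
    return df.apply(per_column)


# Made-up sample data for illustration only.
emails = pd.DataFrame({
    "priority": ["1 (Highest)", "normal", None],
    "encoding": ["Base64", "7BIT", None],
})
priority_codes = priority_vectorized(emails[["priority"]])
encoding_codes = encoding_vectorized(emails[["encoding"]])
print(priority_codes)
print(encoding_codes)
```

Both functions take and return a DataFrame, so they could be passed to FunctionTransformer without any wrapper.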
Answered By - Cameron Riddell