Issue
I am writing a Python script to validate columns in Pandas DataFrame using Pandera library. My issue is that I am OK validating one column at a time, but I was unable to find out how I can validate values in one column taking into consideration values from another column in the dataframe.
Specifically, I want to check if values exist in a column, and if they do I also want to check if values in another column for that row also contain certain text.
import numpy as np
import pandas as pd
import pandera as pa
dataframe = pd.DataFrame({'column_A': ['ABC company', 'BBB company', 'ABC company', 'CCC company'],
'column_B': ['1000', np.NaN, '2000', np.NaN]
})
schema = pa.DataFrameSchema(
columns={
'column_A': pa.Column(pa.String),
'column_B': pa.Column(pa.String, nullable=True,
checks=pa.Check(
lambda df: (df['column_A'].str.contains('ABC')) & (~df['column_B'].isna())))
}
)
schema.validate(dataframe)
This was my attempt at solving the issue but Pandera is not aware of the fact that I want to examine multiple series during validation of column_B.
Solution
If you want to check (look for # define a wide DataFrame-level check in the documentation) at the dataframe level, you need to do your test outside of a column:
import numpy as np
import pandas as pd
import pandera as pa
dataframe = pd.DataFrame({'column_A': ['ABC company', 'BBB company', 'ABC company', 'CCC company'],
'column_B': ['1000', np.NaN, '2000', np.NaN]
})
# define your dataframe level test
check_AB = pa.Check(
lambda df: (df['column_A'].str.contains('ABC')) & (~df['column_B'].isna()),
name='check_AB'
)
schema = pa.DataFrameSchema(
columns={
'column_A': pa.Column(pa.String),
'column_B': pa.Column(pa.String, nullable=True)
},
checks=check_AB # <- dataframe wide check
)
Output:
>>> schema.validate(dataframe)
...
<Check check_AB>
failure cases:
column index failure_case
0 column_A 1 BBB company
1 column_A 3 CCC company
Answered By - Corralien
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.