Issue
I've been looking for robust type hints for a pandas DataFrame, but cannot seem to find anything useful. The question "Pythonic type hints with pandas?" barely scratches the surface.
Normally, if I want to hint the type of a function that has a DataFrame as an input argument, I would do:
import pandas as pd

def func(arg: pd.DataFrame) -> int:
    return 1
What I cannot seem to find is how to type hint a DataFrame with mixed dtypes. The DataFrame constructor only accepts a single dtype for the complete DataFrame, so to my knowledge per-column dtypes can only be set afterwards with the pd.DataFrame().astype(dtype={}) method.
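As a minimal sketch of that two-step approach: the column names and sample values below are made up for illustration, and `datetime64[ns]` is an assumption standing in for a date column, since pandas has no native date-only dtype.

```python
import pandas as pd

# Illustrative data: every column starts out holding strings.
raw = pd.DataFrame({
    "integer": ["1", "2", "3"],
    "date": ["2021-01-01", "2021-02-01", "2021-03-01"],
})

# Passing a dict to astype converts each listed column separately.
typed = raw.astype({"integer": "int64", "date": "datetime64[ns]"})

print(typed.dtypes)
```

This sets the dtypes on the data itself, but it still says nothing to a reader or a type checker about what a function expects to receive.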
The following works, but doesn't seem very Pythonic to me:
import datetime

def func(arg: pd.DataFrame(columns=['integer', 'date']).astype(dtype={'integer': int, 'date': datetime.date})) -> int:
    return 1
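One reason this feels off: an annotation can be any expression, and Python stores the result without ever comparing arguments against it. The sketch below uses a shortened version of the annotation above and a made-up `unrelated` column to show that a mismatching frame is accepted without complaint.

```python
import pandas as pd


def func(arg: pd.DataFrame(columns=["integer", "date"])) -> int:
    return 1


# The annotation is just an (empty) DataFrame object attached to the function.
annotation = func.__annotations__["arg"]
print(type(annotation).__name__, list(annotation.columns))

# Nothing checks the argument against it: unrelated columns pass through.
result = func(pd.DataFrame({"unrelated": ["a", "b"]}))
print(result)
```

So an annotation like this documents intent at best; any actual checking has to come from a separate tool or library.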
I came across this package: https://pypi.org/project/dataenforce/ with examples such as this one:
from dataenforce import Dataset

def process_data(data: Dataset["id": int, "name": object, "latitude": float, "longitude": float]):
    pass
This looks somewhat promising, but sadly the project is old and buggy.
As a data scientist building a machine learning application with long ETL processes, I believe that type hints are important.
What do you use? Does anybody type hint their DataFrames in pandas?
Solution
I have now found the pandera library, which seems very promising:
https://github.com/pandera-dev/pandera
It lets you define schemas and use them to run verbose checks on your DataFrames. From their docs:
https://pandera.readthedocs.io/en/stable/schema_models.html
import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series

class InputSchema(pa.SchemaModel):
    # coerce=True casts each column to int before the checks run
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

class OutputSchema(InputSchema):
    revenue: Series[float]

# check_types validates the argument and the return value against the schemas
@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=100.0)

df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})
transform(df)  # passes: every value satisfies the schema

invalid_df = pd.DataFrame({
    "year": ["2001", "2002", "1999"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})
transform(invalid_df)  # raises a SchemaError: year 1999 fails the gt=2000 check
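To make the idea of run-time validation concrete without depending on pandera, here is a rough pandas-only sketch. `require_dtypes` and `total_revenue` are hypothetical names invented for this example; this is not how pandera is implemented, and it only compares dtype names rather than running value checks or coercion.

```python
import functools

import pandas as pd


def require_dtypes(expected):
    """Hypothetical helper: reject frames whose columns do not match `expected`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(df, *args, **kwargs):
            for column, dtype in expected.items():
                if column not in df.columns:
                    raise TypeError(f"missing column {column!r}")
                if str(df[column].dtype) != dtype:
                    raise TypeError(
                        f"column {column!r} has dtype {df[column].dtype}, "
                        f"expected {dtype}"
                    )
            return fn(df, *args, **kwargs)
        return wrapper
    return decorator


@require_dtypes({"year": "int64", "revenue": "float64"})
def total_revenue(df: pd.DataFrame) -> float:
    return float(df["revenue"].sum())


good = pd.DataFrame({"year": [2001, 2002], "revenue": [100.0, 50.0]})
total = total_revenue(good)

# Here "year" holds strings, so the check fails before the function body runs.
bad = pd.DataFrame({"year": ["2001", "2002"], "revenue": [100.0, 50.0]})
try:
    total_revenue(bad)
    rejected = False
except TypeError as exc:
    rejected = True
    print(exc)
```

A library such as pandera covers far more ground (value checks, coercion, schema inheritance), so a hand-rolled check like this is probably only worth it for very small projects.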
The pandera documentation also includes this note:
Due to current limitations in the pandas library (see discussion here), pandera annotations are only used for run-time validation and cannot be leveraged by static-type checkers like mypy. See the discussion here for more details.
Still, even though there is no static type checking, I think this is going in a very good direction.
Answered By - borisdonchev