Issue
I have a pyarrow.Table that's created from a pandas DataFrame:
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [2.3, 2.4]})
df.columns = pd.MultiIndex.from_tuples([('a', 100), ('b', 200)], names=('name', 'number'))
df.index = pd.MultiIndex.from_tuples([('a', 100), ('b', 200)], names=('name', 'number'))
table = pa.Table.from_pandas(df)
The original df has thousands of columns and rows, and the values are all float64, so they become double when I convert to a pyarrow Table. How can I change them all to float32?
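For reference, printing the schema confirms the inference:

print(table.schema)
# each of the value columns is reported as double; the index levels
# ("name", "number") are carried along as extra columns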
I tried the following:
schema = pa.schema([pa.field("('a',100)", pa.float32()),
                    pa.field("('b',200)", pa.float32())])
table = pa.Table.from_pandas(df, schema=schema)

but that complains that the schema and the dataframe don't match:

KeyError: "name '('a',100)' present in the specified schema is not found in the columns or index"
Solution
You can cast the table to the types you need:

table = pa.Table.from_pandas(df)
table = table.cast(pa.schema([("('a', '100')", pa.float32()),
                              ("('b', '200')", pa.float32()),
                              ("name", pa.string()),
                              ("number", pa.string())]))
I doubt you will find a way to provide a working schema to Table.from_pandas when the columns are a pandas MultiIndex. The name of a column in that case is a tuple, ('a', 100), but Arrow schema column names can only be strings, so you can never write a schema whose names match the column names the dataframe has.

That's why casting afterward works: once you have made an Arrow table (and thus all column names have become strings), you can finally pass the string version of each column name to the cast function.
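You can see the stringified names (and thus the exact keys to use in the cast schema) by checking the converted table first; the output in the comment is what I'd expect from recent pyarrow versions, so verify it on yours:

table = pa.Table.from_pandas(df)
print(table.column_names)
# ["('a', '100')", "('b', '200')", 'name', 'number']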
Answered By - amol