Issue
- I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(data={"col1": [1, 2], "col2": [3.0, 4.0], "col3": ["foo", "bar"]})
- Using s3fs:
from s3fs import S3FileSystem
s3fs = S3FileSystem(**kwargs)
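For reference, a minimal sketch of what those kwargs might look like; the key and secret values are hypothetical placeholders, and S3FileSystem can also pick up credentials from the usual AWS environment variables or config files if none are passed:
from s3fs import S3FileSystem
s3fs = S3FileSystem(
    key="MY_ACCESS_KEY",  # hypothetical placeholder
    secret="MY_SECRET_KEY",  # hypothetical placeholder
)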
- I can write this as a partitioned parquet dataset:
import pyarrow as pa
import pyarrow.parquet as pq
tbl = pa.Table.from_pandas(df)
root_path = "../parquet_dataset/foo"
pq.write_to_dataset(
    table=tbl,
    root_path=root_path,
    filesystem=s3fs,
    partition_cols=["col3"],
    partition_filename_cb=lambda _: "data.parquet",
)
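For reference, with partition_filename_cb pinned to "data.parquet", write_to_dataset lays the data out in Hive-style partition directories, roughly:
../parquet_dataset/foo/col3=foo/data.parquet
../parquet_dataset/foo/col3=bar/data.parquet
The partition column col3 is encoded only in the directory names, not inside the data files themselves, which is why it goes missing from the file-level schema below.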
- Later, I need the pq.ParquetSchema for the dumped DataFrame.
import pyarrow as pa
import pyarrow.parquet as pq
dataset = pq.ParquetDataset(root_path, filesystem=s3fs)
schema = dataset.schema
However, the dataset's schema does not include the partition columns; only the columns physically stored in the data files appear.
How do I get a schema that also covers the partition columns?
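What the symptom looks like (output abridged; the exact rendering depends on the pyarrow version):
print(dataset.schema)
# col1: INT64
# col2: DOUBLE
# -> no col3: it exists only in the directory names (col3=foo, col3=bar)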
Solution
It turns out I have to explicitly dump the metadata alongside the dataset:
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table=table,
    root_path=root_path,
    filesystem=s3fs,
    partition_cols=["col3"],
    partition_filename_cb=lambda _: "data.parquet",
)
# Write a metadata-only Parquet file from the full schema,
# which still includes the partition columns
pq.write_metadata(
    schema=table.schema,
    where=root_path + "/_common_metadata",
    filesystem=s3fs,
)
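With the "_common_metadata" file in place, the full schema (partition columns included) can be read back later. A minimal sketch; newer pyarrow versions let pq.read_schema take a filesystem argument, while older ones need a file-like object opened via s3fs:
import pyarrow.parquet as pq
# Reads only the footer metadata, not the data itself
schema = pq.read_schema(root_path + "/_common_metadata", filesystem=s3fs)
# Equivalent fallback for older pyarrow:
# with s3fs.open(root_path + "/_common_metadata", "rb") as f:
#     schema = pq.read_schema(f)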
Docs: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files
I only care about the "common metadata" here, but you can also collect per-file row group statistics and dump them to a "_metadata" file, as sketched below.
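A sketch of that variant, following the linked Arrow docs (same table, root_path, and s3fs as above): collect per-file metadata while writing, then combine it into a "_metadata" sidecar that also carries the row group statistics.
import pyarrow.parquet as pq
metadata_collector = []
pq.write_to_dataset(
    table=table,
    root_path=root_path,
    filesystem=s3fs,
    partition_cols=["col3"],
    metadata_collector=metadata_collector,
)
# Combine the collected per-file metadata into a single "_metadata" file
pq.write_metadata(
    schema=table.schema,
    where=root_path + "/_metadata",
    metadata_collector=metadata_collector,
    filesystem=s3fs,
)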
Answered By - mishbah