Issue
I am trying to write a parquet file which contains one date column having logical type in parquet as DATE and physical type as INT32. I am writing the parquet file using pandas and using fastparquet as the engine since I need to stream the data from database and append to the same parquet file. Here is my code
import os
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import text
sql = "SELECT TO_VARCHAR(CURRENT_DATE, 'YYYYMMDD') AS "REPORT_DATE" FROM DUMMY;"
def create_stream_enabled_connection(username, password, host, port):
conn_str = f"mydb://{username}:{password}@{host}:{port}"
engine = create_engine(conn_str, connect_args={'encrypt': 'True', 'sslValidateCertificate':'False'})
connection = engine.connect().execution_options(stream_results=True)
return connection
connection = create_stream_enabled_connection(username, password, host, port)
ROWS_IN_CHUNK = 500000
# Stream chunks from database
for dataframe_chunk in pd.read_sql(text(sql), connection, chunksize=ROWS_IN_CHUNK):
if os.stat(local_path).st_size == 0: # If file is empty
# write parquet file
dataframe_chunk.to_parquet(local_path, index=False, engine='fastparquet')
else:
# write parquet file
dataframe_chunk.to_parquet(local_path, index=False, engine='fastparquet', append=True)
Problem:
I am unable to get the logical type to be DATE and physical type to be INT32 in the output parquet from pandas to_parquet
function using fastparquet
as engine.
A few things that I have tried:
- If I take the REPORT_DATE coming from database as string and put it to parquet, then the logical type is STRING. Doesn't work for me.
- If I take the REPORT_DATE coming from database as string and convert it to datetime using the following code, then the dtype is
datetime64[ns]
and logical type in parquet isTimestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false)
and physical type isINT64
. Doesn't work for me.
dataframe_chunk['report_date'] = pd.to_datetime(dataframe_chunk['report_date'], format='%Y%m%d')
I need the parquet logical type to be DATE
and physical type to be INT32
since this parquet will be loaded directly to bigquery and the REPORT_DATE column will go to a DATE
type column in bigquery. See the bigquery documentation here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#type_conversions
If I try to store STRING or DATETIME column in parquet and load it in bigquery, bigquery fails by saying that it expected the column with another type.
Solution
Answered in https://github.com/dask/fastparquet/issues/880#issuecomment-1697436417
The short story is: fastparquet cannot do this right now. I don't know how you can persuade bigquery to accept the current style of output, but I would think there must be a way, since the 64-bit format is valid parquet.
Answered By - mdurant
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.