Issue
I have multiple folders, each containing several thousand files. I need to filter each file in each folder by date, and once all the files in a folder are filtered, I need to create a zip file of each folder containing all the filtered data CSVs.
The function below compresses each file individually (one .csv.gz per file), which is not really what I want.
One solution that works is writing a new file for every input file into a new folder with the filtered data and then creating a zip of that new folder. I don't want to do this because the data files are repeated across multiple folders, I would run out of disk space, and it is massively time consuming given that I am dealing with several terabytes of data.
Is there a way I can filter out data from all files at once into its own CSV in memory and zip it all up?
# 'five_years_ago' and 'data_store' are defined elsewhere in my script
def filter_data(data, file_name):
    # keep only rows from the last five years and write a gzipped CSV per input file
    data[pd.to_datetime(data['Date'], format='%Y-%m') >= five_years_ago].to_csv(
        data_store + file_name + '.csv.gz',
        index=False, compression="gzip")
Solution
This code answers your last question,
Is there a way I can filter out data from all files at once into its own CSV in memory and zip it all up?
import pandas as pd
import dask.dataframe as dd
import shutil
# i refers to the number of the folder
# folderInput_i is the folder containing thousands of files,
# and the command below reads all CSVs in the folder at once (lazily)
dataPara = dd.read_csv("C:\\______\\______\\Desktop\\folderInput_i\\*.csv")
five_years_ago = pd.to_datetime("2016-12", format='%Y-%m')
# filter all files at once
dataPara = dataPara[dd.to_datetime(dataPara['Date'], format='%Y-%m') >= five_years_ago]
# Dask writes one CSV per partition, substituting the * in the pattern
dataPara.to_csv("C:\\______\\______\\Desktop\\folderOutput_i\\*.csv", index=False)
# zip the new folder that contains the filtered data files
# (make_archive appends the .zip extension to the base name itself)
shutil.make_archive("C:\\______\\______\\Desktop\\folderOutput_i", 'zip',
                    "C:\\______\\______\\Desktop\\folderOutput_i")
Answered By - Safouane Labbad