Issue
I've been dealing with a lot of 4-5 GB CSV files at work over the last few days, and to keep track of how far reading/writing has progressed I wrote a couple of wrapper functions on top of pandas' methods. It all seems to work very well; there is a bit of overhead, but the convenience outweighs most issues.
However, when reading a CSV, the progress bar can only display a correct percentage if I know the number of rows in advance, since that determines how many chunks there will be. The simplest solution I came up with is to load just the 0th column of the CSV before loading the rest and take its size, but this takes a noticeable amount of time on files with millions of rows.
Reading a single column also takes an unreasonably high proportion of the total time: reading one column of a CSV with 125 columns and a few million rows took ~24 seconds, while reading the whole file takes 63 seconds.
This is the function I've been using to read CSVs:
import sys

import pandas as pd
from tqdm import tqdm


def read_csv_with_progressbar(filename: str,
                              chunksize: int = 50000) -> pd.DataFrame:
    # Read only the first column to find out how many rows (and chunks) there are.
    length = pd.read_csv(filename, usecols=[0])
    length = length.values.shape[0]
    total = length // chunksize

    chunk_list = []
    chunks = pd.read_csv(filename, chunksize=chunksize)
    with tqdm(total=total, file=sys.stdout) as pbar:
        for chunk in chunks:
            chunk_list.append(chunk)
            pbar.set_description('Reading source csv file')
            pbar.update(1)

    df = pd.concat(chunk_list, axis=0)
    return df
Is there any way to get the number of rows in a CSV faster than my flawed method?
Solution
Assuming there are no quoted strings (with newlines in them) or other shenanigans in your CSV file, an accurate (but hacky) solution is to not even parse the file but simply count the number of newlines in it:
import numpy as np

chunk = 1024 * 1024  # Process 1 MB at a time.
f = np.memmap("test.csv", mode="r")  # Map the file as raw bytes (uint8), read-only.
num_newlines = sum(np.sum(f[i:i + chunk] == ord('\n'))
                   for i in range(0, len(f), chunk))
del f
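To plug this into the progress bar, remember that the newline count includes the header line, so the number of data rows is one less (assuming a single header row and a trailing newline at the end of the file). Below is a minimal sketch under those assumptions; the count_rows() helper name and the chunk arithmetic are illustrative, not part of the original answer:

import math
import sys

import numpy as np
import pandas as pd
from tqdm import tqdm


def count_rows(filename: str, block: int = 1024 * 1024) -> int:
    # Count newline bytes in 1 MB blocks, then drop one for the header row.
    f = np.memmap(filename, mode="r")
    newlines = sum(np.sum(f[i:i + block] == ord('\n'))
                   for i in range(0, len(f), block))
    del f
    return int(newlines) - 1


def read_csv_with_progressbar(filename: str,
                              chunksize: int = 50000) -> pd.DataFrame:
    # Total number of chunks, rounded up so the bar can reach 100%.
    total = math.ceil(count_rows(filename) / chunksize)
    chunk_list = []
    with tqdm(total=total, file=sys.stdout) as pbar:
        pbar.set_description('Reading source csv file')
        for chunk in pd.read_csv(filename, chunksize=chunksize):
            chunk_list.append(chunk)
            pbar.update(1)
    return pd.concat(chunk_list, axis=0)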
Answered By - orlp