Issue
Using the zipfile module to unzip a large data file in Python works correctly on Python 2 but produces the following error on Python 3.6.0:
BadZipFile: Bad CRC-32 for file 'myfile.csv'
I traced this to error handling code checking the CRC values.
Using ZipFile.testzip() on Python 2 returns None (all files are fine). Running it on Python 3 returns 'myfile.csv', indicating a problem with that file.
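For readers unfamiliar with testzip(), here is a minimal sketch of its two outcomes on a tiny in-memory archive. The entry name data.txt, the payload, and the build_archive helper are made up for illustration and have nothing to do with the real download.

```python
import io
import zipfile


def build_archive(payload):
    """Return the bytes of a one-entry, uncompressed zip archive (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("data.txt", payload)
    return buf.getvalue()


good = build_archive(b"hello world")
# Change one byte of the stored payload so it no longer matches the recorded CRC-32.
bad = good.replace(b"hello world", b"jello world", 1)

good_result = zipfile.ZipFile(io.BytesIO(good)).testzip()
bad_result = zipfile.ZipFile(io.BytesIO(bad)).testzip()

print(good_result)  # None: every entry passed its check
print(bad_result)   # data.txt: name of the first entry that failed
```

testzip() reads every entry and reports only the first one that fails, so a None result means all entries were read without error.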
Code to reproduce on both Python 2 and Python 3 (involves a 300 MB download, sorry):
import sys
import zipfile

url = "https://de.iplantcollaborative.org/anon-files//iplant/home/shared/commons_repo/curated/Vertnet_Amphibia_Sep2016/VertNet_Amphibia_Sept2016.zip"
# urlretrieve lives in a different module on Python 2 and Python 3
if sys.version_info >= (3, 0, 0):
    from urllib.request import urlretrieve
else:
    from urllib import urlretrieve
urlretrieve(url, "vertnet_latest_amphibians.zip")
archive = zipfile.ZipFile("vertnet_latest_amphibians.zip")
print(archive.testzip())  # None if every file passes, else the first bad file name
Does anyone understand why this difference exists and if there's a way to get Python 3 to properly extract the file using:
archive.extract("vertnet_latest_amphibians.csv")
Solution
The CRC value is OK. The CRC of 'vertnet_latest_amphibians.csv' recorded in the zip is 0x87203305. After extraction, this is indeed the CRC of the file.
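One way to run this kind of check yourself is to compute the CRC-32 of the data in chunks and compare it with the CRC attribute of the entry's ZipInfo. The sketch below uses a small in-memory archive; the entry name sample.csv, the payload, and the crc32_of_stream helper are illustrative assumptions, not part of the original archive.

```python
import io
import zipfile
import zlib


def crc32_of_stream(stream, chunk_size=1 << 20):
    """Compute the CRC-32 of a binary stream without loading it all into memory."""
    crc = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        crc = zlib.crc32(chunk, crc)
    # Mask so the result is an unsigned 32-bit value on Python 2 as well.
    return crc & 0xFFFFFFFF


payload = b"id,name\n1,frog\n2,toad\n"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("sample.csv", payload)

with zipfile.ZipFile(buf) as zf:
    recorded_crc = zf.getinfo("sample.csv").CRC

computed_crc = crc32_of_stream(io.BytesIO(payload))
print(hex(recorded_crc), hex(computed_crc))
```

For a file already extracted to disk, the same helper can be given a file object opened with mode "rb".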
However, the recorded uncompressed size is incorrect. The zip file records a compressed size of 309,723,024 bytes and an uncompressed size of 292,198,614 bytes (that's smaller!). In reality, the uncompressed file is 4,587,165,910 bytes (4.3 GiB). That exceeds 4 GiB (2**32 bytes), the largest value a 32-bit size field can hold, so the recorded size is the real size with 2**32 dropped.
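The relationship between the two sizes can be checked with a few lines of arithmetic. This uses only the numbers quoted above; the variable names are mine.

```python
recorded_size = 292198614   # uncompressed size stored in the zip
actual_size = 4587165910    # real size of the extracted file

# A 32-bit field keeps only the low 32 bits of the true size.
wrapped = actual_size % 2**32

print(wrapped == recorded_size)               # True
print(actual_size - recorded_size == 2**32)   # True
```

The difference is exactly 2**32, which is consistent with the size having overflowed a 32-bit field exactly once.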
You can fix it like this (this worked in Python 3.5.2, at least):
import zipfile
archive = zipfile.ZipFile("vertnet_latest_amphibians.zip")
# Add back the 2**32 bytes missing from the recorded 32-bit size
archive.getinfo("vertnet_latest_amphibians.csv").file_size += 2**32
archive.testzip() # now passes
archive.extract("vertnet_latest_amphibians.csv") # now works
Answered By - Nick Matteo