Issue
I'm trying to read a large (~850 MB) .csv file from a URL.
The problem is that the .csv file is inside a .zip file that also contains a .pdf file, so when I try to read it with pandas:
df = pd.read_csv('link', encoding='latin1', sep=';')
It fails with this error:
ValueError: Multiple files found in ZIP file. Only one file per ZIP: ['LEIAME.pdf', 'perfil_eleitorado_2018.csv']
I'm working with a collaborative notebook, so the best solution would be just to open the .zip file directly from the link or to upload the .csv file somewhere that won't ask for permissions, log-ins, or anything like that to open it directly in the notebook.
Note: This is just one of the large .csv databases I'm working with; there are others of similar size, or even slightly bigger.
Solution
The pd.read_csv() function accepts a .zip file path or URL as its first argument, but it supports only one file per ZIP archive. The posted ZIP file contains multiple files.
Instead, you can iterate over the members of the ZIP file and pass the CSV member to pandas as a file-like object.
import pandas as pd
import zipfile

with zipfile.ZipFile("perfil_eleitorado_2018.zip", "r") as f:
    for name in f.namelist():
        # Skip non-CSV members such as LEIAME.pdf
        if name.endswith('.csv'):
            with f.open(name) as zd:
                df = pd.read_csv(zd, encoding='latin1', sep=';')
            print(df)
            break
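The loop above stops at the first .csv member it finds. If an archive might hold more than one CSV, it can help to list the candidates first and pick one explicitly. A minimal sketch using only the standard library follows; the helper name csv_members and the in-memory demo archive are made up for illustration and are not part of the original answer.

```python
import io
import zipfile


def csv_members(zip_source):
    """Return the names of all .csv members in a ZIP archive.

    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source, "r") as archive:
        return [n for n in archive.namelist() if n.lower().endswith(".csv")]


# Tiny in-memory archive that mimics the layout from the question.
# The member contents are placeholders, not the real data.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("LEIAME.pdf", b"placeholder")
    archive.writestr("perfil_eleitorado_2018.csv", "A;B\n1;2\n")

names = csv_members(demo)
print(names)
```

Any name returned this way can then be passed to f.open(name) and on to pd.read_csv(), exactly as in the loop above.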
If you want to work with the URL directly, without first saving the file to disk, you can use the requests library. Note that the whole archive is still downloaded and held in memory.
import pandas as pd
import zipfile
from io import BytesIO
import requests

url = 'https://cdn.tse.jus.br/estatistica/sead/odsele/perfil_eleitorado/perfil_eleitorado_2018.zip'
r = requests.get(url)
r.raise_for_status()  # stop early on an HTTP error
# Wrap the downloaded bytes so zipfile can treat them like a file
buf1 = BytesIO(r.content)
with zipfile.ZipFile(buf1, "r") as f:
    for name in f.namelist():
        # Skip non-CSV members such as LEIAME.pdf
        if name.endswith('.csv'):
            with f.open(name) as zd:
                df = pd.read_csv(zd, encoding='latin1', sep=';')
            print(df)
            break
Output:
    DT_GERACAO  HH_GERACAO  ANO_ELEICAO  ...  QT_ELEITORES_DEFICIENCIA  QT_ELEITORES_INC_NM_SOCIAL
0   12/04/2021    13:55:01         2018  ...                         1                           0
1   12/04/2021    13:55:01         2018  ...                         2                           0
2   12/04/2021    13:55:01         2018  ...                         4                           0
3   12/04/2021    13:55:01         2018  ...                         2                           0
4   12/04/2021    13:55:01         2018  ...                        25                           0
..         ...         ...          ...  ...                       ...                         ...
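Since the question mentions files of around 850 MB and larger, loading the whole CSV into a single DataFrame may be heavy for a shared notebook. One possible variation, sketched here under the assumption that the data can be processed piece by piece, is to pass chunksize to pd.read_csv() so it yields smaller DataFrames. The function name iter_csv_chunks, the chunk size, and the demo archive are illustrative choices, not part of the original answer.

```python
import io
import zipfile

import pandas as pd


def iter_csv_chunks(zip_source, chunksize=100_000, **read_csv_kwargs):
    """Yield DataFrame chunks from the first .csv member of a ZIP archive.

    zip_source may be a file path or a binary file-like object.
    Extra keyword arguments are forwarded to pd.read_csv().
    """
    with zipfile.ZipFile(zip_source, "r") as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                with archive.open(name) as member:
                    # With chunksize set, read_csv returns an iterator of DataFrames
                    yield from pd.read_csv(member, chunksize=chunksize, **read_csv_kwargs)
                return
    raise FileNotFoundError("no .csv member found in archive")


# Small in-memory archive with made-up data, standing in for the real download.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("LEIAME.pdf", b"placeholder")
    archive.writestr("sample.csv", "a;b\n1;2\n3;4\n5;6\n".encode("latin1"))

total_rows = 0
for chunk in iter_csv_chunks(buf, chunksize=2, encoding="latin1", sep=";"):
    total_rows += len(chunk)  # replace with real per-chunk processing
print(total_rows)
```

For the real file, the same function could be called with BytesIO(r.content) or a local path plus encoding='latin1' and sep=';'. The archive itself is still fully downloaded; only the parsing is done in pieces.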
Answered By - CodeMonkey