Issue
My Scrapy crawler correctly reads all fields, as the debug output shows:
2017-01-29 02:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.willhaben.at/iad/immobilien/mietwohnungen/niederoesterreich/krems-an-der-donau/altbauwohnung-wg-geeignet-donaublick-189058451/>
{'Heizung': 'Gasheizung', 'whCode': '189058451', 'Teilmöbliert / Möbliert': True, 'Wohnfläche': '105', 'Objekttyp': 'Zimmer/WG', 'Preis': 1050.0, 'Miete (inkl. MWSt)': 890.0, 'Stockwerk(e)': '2', 'Böden': 'Laminat', 'Bautyp': 'Altbau', 'Zustand': 'Sehr gut/gut', 'Einbauküche': True, 'Zimmer': 3.0, 'Miete (exkl. MWSt)': 810.0, 'Befristung': 'nein', 'Verfügbar': 'ab sofort', 'zipcode': 3500, 'Gesamtbelastung': 1150.0}
but when I export the CSV using the command-line option
scrapy crawl mietwohnungen -o mietwohnungen.csv --logfile=mietwohnungen.log
some of the fields are missing, as the corresponding line from the output file shows:
Keller,whCode,Garten,Zimmer,Terrasse,Wohnfläche,Parkplatz,Objekttyp,Befristung,zipcode,Preis
,189058451,,3.0,,105,,Zimmer/WG,nein,3500,1050.0
The fields missing in the example are: Heizung, Teilmöbliert / Möbliert, Miete (inkl. MWSt), Stockwerk(e), Böden, Bautyp, Zustand, Einbauküche, Miete (exkl. MWSt), Verfügbar, Gesamtbelastung
This happens with a few of the values that I scrape. One thing to note is that not every page contains the same fields, so I generate the field names depending on the page: I create a dict containing all the fields present and yield it at the end. This works, as the DEBUG output shows; however, some CSV columns are never written.
As you can see, some columns are blank because other pages contain those fields ('Keller' in the example).
The scraper works if I use a smaller list to scrape (e.g. if I refine my initial search selection while keeping some of the problematic pages in the results):
Heizung,Zimmer,Bautyp,Gesamtbelastung,Einbauküche,Miete (exkl. MWSt),Zustand,Miete (inkl. MWSt),zipcode,Teilmöbliert / Möbliert,Objekttyp,Stockwerk(e),Böden,Befristung,Wohnfläche,whCode,Preis,Verfügbar
Gasheizung,3.0,Altbau,1150.0,True,810.0,Sehr gut/gut,890.0,3500,True,Zimmer/WG,2,Laminat,nein,105,189058451,1050.0,ab sofort
I have already switched to Python 3 to avoid any Unicode string problems.
Is this a bug? It also seems to affect only the CSV output; if I export to XML, all fields are printed.
I don't understand why it doesn't work with the full list. Is the only solution really to write a CSV exporter manually?
Solution
If you yield scraped results as dicts, the CSV columns are populated from the keys of the first yielded dict:
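The pitfall can be reproduced with just the standard library (a minimal sketch, not Scrapy itself): the header, and therefore the column set, is fixed from the first dict's keys, so keys that first appear in a later item are silently dropped.

```python
import csv
import io

rows = [
    {'whCode': '189058451', 'Preis': 1050.0},                   # first item seen
    {'whCode': '189058452', 'Preis': 990.0, 'Heizung': 'Gas'},  # has an extra key
]

buf = io.StringIO()
# fieldnames come from the first item only; extrasaction='ignore'
# drops any key not in that list, mirroring the exporter's behaviour
writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()),
                        extrasaction='ignore')
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())  # 'Heizung' never appears in the output
```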
def _write_headers_and_set_fields_to_export(self, item):
    if self.include_headers_line:
        if not self.fields_to_export:
            if isinstance(item, dict):
                # for dicts try using fields of the first item
                self.fields_to_export = list(item.keys())
            else:
                # use fields declared in Item
                self.fields_to_export = list(item.fields.keys())
        row = list(self._build_row(self.fields_to_export))
        self.csv_writer.writerow(row)
So you should either define an Item with all the fields declared explicitly and populate that, or write a custom CSVItemExporter.
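A manual exporter along those lines could work as follows (a sketch, not Scrapy's implementation): buffer all items first, build the header from the union of every key, then write, filling missing values with empty strings.

```python
import csv
import io

items = [
    {'whCode': '189058451', 'Preis': 1050.0},
    {'whCode': '189058452', 'Preis': 990.0, 'Heizung': 'Gas'},
]

# collect every key across all items, preserving first-seen order
fieldnames = []
for item in items:
    for key in item:
        if key not in fieldnames:
            fieldnames.append(key)

buf = io.StringIO()
# restval='' leaves a blank cell where an item lacks a field
writer = csv.DictWriter(buf, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(items)
print(buf.getvalue())
```

Note that this requires holding all items in memory before writing, since the full header is only known once every item has been seen.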
Answered By - mizhgun