Issue
My Scrapy crawler correctly reads all fields, as the debug output shows:
2017-01-29 02:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.willhaben.at/iad/immobilien/mietwohnungen/niederoesterreich/krems-an-der-donau/altbauwohnung-wg-geeignet-donaublick-189058451/>
{'Heizung': 'Gasheizung', 'whCode': '189058451', 'Teilmöbliert / Möbliert': True, 'Wohnfläche': '105', 'Objekttyp': 'Zimmer/WG', 'Preis': 1050.0, 'Miete (inkl. MWSt)': 890.0, 'Stockwerk(e)': '2', 'Böden': 'Laminat', 'Bautyp': 'Altbau', 'Zustand': 'Sehr gut/gut', 'Einbauküche': True, 'Zimmer': 3.0, 'Miete (exkl. MWSt)': 810.0, 'Befristung': 'nein', 'Verfügbar': 'ab sofort', 'zipcode': 3500, 'Gesamtbelastung': 1150.0}
but when I export the CSV using the command-line option
scrapy crawl mietwohnungen -o mietwohnungen.csv --logfile=mietwohnungen.log
some of the fields are missing, as the corresponding line from the output file shows:
Keller,whCode,Garten,Zimmer,Terrasse,Wohnfläche,Parkplatz,Objekttyp,Befristung,zipcode,Preis
,189058451,,3.0,,105,,Zimmer/WG,nein,3500,1050.0
The fields missing in the example are: Heizung, Teilmöbliert / Möbliert, Miete (inkl. MWSt), Stockwerk(e), Böden, Bautyp, Zustand, Einbauküche, Miete (exkl. MWSt), Verfügbar, Gesamtbelastung
This happens with a few of the values that I scrape. One thing to note is that not every page contains the same fields, so I generate the field names depending on the page: I create a dict containing all the fields present and yield it at the end. This works, as the DEBUG output shows; however, some CSV columns are never written.
As you can see, some columns are blank because other pages contain those fields ('Keller' in the example).
The scraper works if I use a smaller list to scrape (e.g. if I refine my initial search selection while keeping some of the problematic pages in the results):
Heizung,Zimmer,Bautyp,Gesamtbelastung,Einbauküche,Miete (exkl. MWSt),Zustand,Miete (inkl. MWSt),zipcode,Teilmöbliert / Möbliert,Objekttyp,Stockwerk(e),Böden,Befristung,Wohnfläche,whCode,Preis,Verfügbar
Gasheizung,3.0,Altbau,1150.0,True,810.0,Sehr gut/gut,890.0,3500,True,Zimmer/WG,2,Laminat,nein,105,189058451,1050.0,ab sofort
I have already switched to Python 3 to avoid any Unicode string problems.
Is this a bug? It also seems to affect only the CSV output; if I export to XML, all fields are printed.
I don't understand why it doesn't work with the full list. Is the only solution really to write a CSV exporter manually?
Solution
If you yield scraped results as dicts, the CSV columns are populated from the keys of the first yielded dict:
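The pitfall can be reproduced with just the standard library (a minimal sketch, not Scrapy itself): the header, and therefore the column set, is fixed from the first dict's keys, so keys that first appear in a later item are silently dropped.

```python
import csv
import io

rows = [
    {'whCode': '189058451', 'Preis': 1050.0},                   # first item seen
    {'whCode': '189058452', 'Preis': 990.0, 'Heizung': 'Gas'},  # has an extra key
]

buf = io.StringIO()
# fieldnames come from the first item only; extrasaction='ignore'
# drops any key not in that list, mirroring the exporter's behaviour
writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()),
                        extrasaction='ignore')
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())  # 'Heizung' never appears in the output
```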
def _write_headers_and_set_fields_to_export(self, item):
    if self.include_headers_line:
        if not self.fields_to_export:
            if isinstance(item, dict):
                # for dicts try using fields of the first item
                self.fields_to_export = list(item.keys())
            else:
                # use fields declared in Item
                self.fields_to_export = list(item.fields.keys())
        row = list(self._build_row(self.fields_to_export))
        self.csv_writer.writerow(row)
So you should either define an Item with all the fields declared explicitly and populate that, or write a custom CSVItemExporter.
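A manual exporter along those lines could work as follows (a sketch, not Scrapy's implementation): buffer all items first, build the header from the union of every key, then write, filling missing values with empty strings.

```python
import csv
import io

items = [
    {'whCode': '189058451', 'Preis': 1050.0},
    {'whCode': '189058452', 'Preis': 990.0, 'Heizung': 'Gas'},
]

# collect every key across all items, preserving first-seen order
fieldnames = []
for item in items:
    for key in item:
        if key not in fieldnames:
            fieldnames.append(key)

buf = io.StringIO()
# restval='' leaves a blank cell where an item lacks a field
writer = csv.DictWriter(buf, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(items)
print(buf.getvalue())
```

Note that this requires holding all items in memory before writing, since the full header is only known once every item has been seen.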
Answered By - mizhgun