Monday, January 17, 2022

[FIXED] Splitting a Scrapy element among multiple CSV rows

csv, python, scrapy, scrapy-spider

Issue

I've been working on something that I think should be relatively easy, but I keep hitting my head against a wall. I've tried multiple similar solutions from Stack Overflow and I've improved my code, but I'm still stuck on the basic functionality.

I am scraping a web page that returns an element (genre) that is essentially a list of genres:

Mystery, Comedy, Horror, Drama

The xpath returns perfectly. I'm using a Scrapy pipeline to output to a CSV file. What I'd like to do is create a separate row for each item in the above list along with the page url:

"Mystery", "http:domain.com/page1.html"
"Comedy", "http:domain.com/page1.html"

No matter what I try I can only output:

"Mystery, Comedy, Horror, Drama", "http:domain.com/page1.html"

Here's my code:

def parse_genre (self, response):
    for item in [i.split (',') for i in response.xpath ('//span [contains (@class, "genre")]/text()').extract()]:
        sg = ItemLoader (item=ItemGenre (), response=response)
        sg.add_value ('url', response.url)
        sg.add_value ('genre', item, MapCompose(str.strip))
        yield sg.load_item ()

This is called from the main parse routine for the spider, and that part all works correctly. (I have two items on each web page. The main spider gathers the "parent" information and this function is attempting to gather the "child" information. Technically it's not a child record, but it is definitely a one-to-many relationship.)

I've tried a number of possible solutions. This is the only version that makes sense to me and seems like it should work. I'm sure I'm just not splitting the genre string correctly.

Solution

You are very close. The culprit seems to be the way you are building your items:

[i.split(',') for i in response.xpath('//span[contains(@class, "genre")]/text()').extract()]

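Without the page source I can't give you a fully tested fix, but this comprehension clearly returns a list of lists. Each item in your loop is therefore the entire list of genres from one span, so the loader receives all of the genres at once and they end up in a single row.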
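To make the problem visible, here is a minimal sketch that runs without Scrapy. The hard-coded list stands in for what the XPath call presumably returns; that sample value is an assumption based on the question, since the real page isn't available.

```python
# Stand-in for response.xpath(...).extract(). The real page is not available,
# so this sample value is an assumption taken from the question text.
extracted = ["Mystery, Comedy, Horror, Drama"]

# The comprehension from the question: one inner list per extracted string.
nested = [text.split(",") for text in extracted]

for item in nested:
    # item is the whole list of genres, not a single genre
    print(type(item).__name__, len(item), item)
```

The loop body runs only once here, which is why a single row comes out the other end.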

You should either flatten this list of lists into a list of strings or iterate through it with a nested loop:

items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
for item in items:
    # each item is one comma-separated string, e.g. "Mystery, Comedy"
    for category in item.split(','):
        # build one item (one CSV row) per genre
        sg = ItemLoader(item=ItemGenre(), response=response)
        sg.add_value('url', response.url)
        sg.add_value('genre', category, MapCompose(str.strip))
        yield sg.load_item()

An alternative, more advanced technique is to flatten the list with a nested list comprehension:

items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
# good cheatsheet to remember this: [leaf for tree in forest for leaf in tree]
categories = [cat for item in items for cat in item.split(',')]
for category in categories:
    sg = ItemLoader(item=ItemGenre(), response=response)
    sg.add_value('url', response.url)
    sg.add_value('genre', category, MapCompose(str.strip))
    yield sg.load_item()
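If you want to check the splitting logic on its own, outside of a spider, something like the following sketch may help. The helper name iter_genre_rows and the sample inputs are hypothetical, not part of Scrapy or of the original code; it only uses plain Python and yields (genre, url) pairs in place of loaded items.

```python
def iter_genre_rows(extracted, url):
    """Yield one (genre, url) pair per comma-separated genre.

    Hypothetical helper for illustration; `extracted` plays the role of
    the list of strings returned by the XPath query.
    """
    for text in extracted:
        for genre in text.split(","):
            genre = genre.strip()
            if genre:  # skip empty pieces, e.g. from a trailing comma
                yield genre, url


# Sample values are assumptions modelled on the question.
sample = ["Mystery, Comedy, Horror, Drama"]
page_url = "http://domain.com/page1.html"

rows = list(iter_genre_rows(sample, page_url))
for row in rows:
    print(row)
```

Stripping inside the helper does the same job as MapCompose(str.strip) in the loader, so in a real spider you would likely keep only one of the two.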

Answered By - Granitosaurus

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
