Issue
I'm trying to scrape a webpage that has an unknown number of <p> tags inside a known div class. Some pages have only one <p> tag, while others have ten or more. How can I extract them all, preferably into one variable, so I can store it in a CSV like all the other data I'm scraping? :)
The HTML structure looks like the following example:
<div class="div_name">
<h2 class="h5">title text</h2>
<p> </p>
<p>text text text...</p>
<p>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p>text text text...</p>
<p>text text text...</p>
</div>
I'm using Python and the Scrapy framework to achieve this.
Currently I have:
divs = response.xpath('/html/body/div[6]/div/section[2]/article/div/div/div')
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print(p.get())
    story = p
yield {
    'story': story
}
It does print the text values of all the various <p> tags, but when the data is stored to the CSV file, only the last <p> is inserted into the *.csv.
To store the scraped data in the *.csv, I have the following inside my settings.py:
# Depth of crawler
DEPTH_LIMIT = 0  # 0 = infinite depth
# Feed export settings
FEED_FORMAT = "csv"
FEED_URI = "output_%(name)s.csv"
The yield part above defines the fields that go into the *.csv.
Kindest regards,
Solution
You could do it in one line, really:
story = ' '.join([x.get().strip() for x in response.xpath('//div[6]/div/section[2]/article/div/div/div//p')])
If you could confirm the page URL, I could probably improve that long, fragile XPath. Nonetheless, the above should work.
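For context, here is how that could look inside a full spider; a minimal sketch, assuming the div_name class from your HTML sample is a reliable anchor (the spider name and start URL are placeholders), and selecting text() nodes instead of full <p> markup so the <br> tags and empty paragraphs drop out:

import scrapy

class StorySpider(scrapy.Spider):
    # name and start_urls are hypothetical, for illustration only
    name = "stories"
    start_urls = ["https://example.com/some-page"]

    def parse(self, response):
        # grab every text node under any <p> inside the known div;
        # //text() also reaches text that follows a <br>
        fragments = response.xpath(
            '//div[@class="div_name"]//p//text()').getall()
        # join the non-empty fragments into one string, so the whole
        # story lands in a single CSV cell
        story = ' '.join(f.strip() for f in fragments if f.strip())
        yield {'story': story}

Because a single item is yielded per page, the feed exporter writes one CSV row per page instead of one per <p>, which is what was causing your missing-data symptom.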
Scrapy documentation can be found at https://docs.scrapy.org/en/latest/
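As a side note, newer Scrapy releases (2.1+) deprecate FEED_FORMAT and FEED_URI in favour of the FEEDS setting; an equivalent for your settings.py, assuming a recent Scrapy version, would be:

# settings.py -- equivalent of FEED_FORMAT / FEED_URI on Scrapy 2.1+
FEEDS = {
    "output_%(name)s.csv": {"format": "csv"},
}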
Answered By - Barry the Platipus