Issue
I'm trying to scrape a webpage that has an unknown number of <p> tags inside a known div class. Some pages have only one <p> tag, while others have ten or more. How can I extract them all, preferably into one variable, so I can store it in a CSV like all the other data I'm scraping? :)
The HTML structure looks like the following example:
<div class="div_name">
<h2 class="h5">title text</h2>
<p> </p>
<p>text text text...</p>
<p>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p>text text text...</p>
<p>text text text...</p>
</div>
I'm using Python and the Scrapy framework to achieve this.
Currently I have:
divs = response.xpath('/html/body/div[6]/div/section[2]/article/div/div/div')
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print(p.get())
    story = p
yield {
    'story': story
}
It does print the text values of all the various <p> tags, but when the data is stored to the CSV file, only the last <p> is inserted into the *.csv.
To store the scraped data in the *.csv, I have the following inside my settings.py:
# Depth of crawler
DEPTH_LIMIT = 0  # 0 = infinite depth
# Feed export settings
FEED_FORMAT = "csv"
FEED_URI = "output_%(name)s.csv"
The yield part above defines the fields that go into the *.csv.
Kindest regards,
Solution
You could do it in one line, really:
story = ' '.join([x.get().strip() for x in response.xpath('//div[6]/div/section[2]/article/div/div/div//p')])
If you could confirm the page URL, I could probably improve that long, fragile XPath. Nonetheless, the above should work.
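For context, here is how that could look inside a full spider; a minimal sketch, assuming the div_name class from your HTML sample is a reliable anchor (the spider name and start URL are placeholders), and selecting text() nodes instead of full <p> markup so the <br> tags and empty paragraphs drop out:

import scrapy

class StorySpider(scrapy.Spider):
    # name and start_urls are hypothetical, for illustration only
    name = "stories"
    start_urls = ["https://example.com/some-page"]

    def parse(self, response):
        # grab every text node under any <p> inside the known div;
        # //text() also reaches text that follows a <br>
        fragments = response.xpath(
            '//div[@class="div_name"]//p//text()').getall()
        # join the non-empty fragments into one string, so the whole
        # story lands in a single CSV cell
        story = ' '.join(f.strip() for f in fragments if f.strip())
        yield {'story': story}

Because a single item is yielded per page, the feed exporter writes one CSV row per page instead of one per <p>, which is what was causing your missing-data symptom.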
Scrapy documentation can be found at https://docs.scrapy.org/en/latest/
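As a side note, newer Scrapy releases (2.1+) deprecate FEED_FORMAT and FEED_URI in favour of the FEEDS setting; an equivalent for your settings.py, assuming a recent Scrapy version, would be:

# settings.py -- equivalent of FEED_FORMAT / FEED_URI on Scrapy 2.1+
FEEDS = {
    "output_%(name)s.csv": {"format": "csv"},
}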
Answered By - Barry the Platipus