Issue
I am learning web scraping with Python, XPath, and Scrapy, and I am stuck on the following problem. I would be grateful for any help.
This is the HTML code:
<div class="discussionpost">
“This is paragraph one.”
<br>
<br>
“This is paragraph two."'
<br>
<br>
"This is paragraph three.”
</div>
This is the output I would like to get: "This is paragraph one. This is paragraph two. This is paragraph three." I would like to combine all the paragraphs separated by the <br> tags into one string. There is no <p> tag.
However, the output I am getting is a list of separate strings: "This is paragraph one.", "This is paragraph two.", "This is paragraph three."
This is the code I am using:
sentences = response.xpath('//div[@class="discussionpost"]/text()').extract()
I understand why the code above behaves the way it does, but I could not change it to do what I need. Any help is greatly appreciated.
Solution
To get the value of every text node, use //text() instead of /text(), then join the resulting list into a single string:
sentences = ' '.join(response.xpath('//div[@class="discussionpost"]//text()').extract()).strip()
Demonstrated in the Scrapy shell:
>>> from scrapy import Selector
>>> html_doc = '''
... <html>
... <body>
... <div class="discussionpost">
... “This is paragraph one.”
... <br/>
... <br/>
... “This is paragraph two."'
... <br/>
... <br/>
... "This is paragraph three.”
... </div>
... </body>
... </html>
...
... '''
>>> res = Selector(text=html_doc)
>>> res
<Selector xpath=None data='<html>\n <body>\n <div class="discussi...'>
>>> sentences = ''.join(res.xpath('//div[@class="discussionpost"]//text()').extract())
>>> sentences
'\n “This is paragraph one.”\n \n \n “This is paragraph two."\'\n \n \n "This is paragraph three.”\n '
>>> txt = sentences
>>> txt
'\n “This is paragraph one.”\n \n \n “This is paragraph two."\'\n \n \n "This is paragraph three.”\n '
>>> txt = ' '.join(sentences.replace("'", '').replace('“', '').replace('”', '').replace('"', '').split())
>>> txt
'This is paragraph one. This is paragraph two. This is paragraph three.'
>>>
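The cleanup step above can also be written without a long chain of replace() calls. Here is a minimal sketch that works on the plain list of strings that .extract() or .getall() returns, so it runs without Scrapy. The helper name join_text_nodes and the QUOTES constant are hypothetical names chosen for illustration, and the sample list only approximates what //text() yields for the div in the question.

```python
# Quote characters that wrap each paragraph in the sample HTML (assumption:
# these are the only characters that need to be trimmed from each node).
QUOTES = '“”"\''


def join_text_nodes(nodes):
    """Join non-empty text nodes with single spaces.

    `nodes` is a list of strings such as the one returned by
    .extract() / .getall(). The function name is hypothetical.
    """
    # Trim surrounding whitespace first, then surrounding quote characters.
    cleaned = (node.strip().strip(QUOTES) for node in nodes)
    # Drop nodes that were whitespace only (the gaps between <br> tags).
    return ' '.join(text for text in cleaned if text)


# Text nodes similar to what //text() yields for the sample div.
nodes = [
    '\n“This is paragraph one.”\n',
    '\n',
    '\n“This is paragraph two."\'\n',
    '\n',
    '\n"This is paragraph three.”\n',
]
result = join_text_nodes(nodes)
print(result)
```

In a spider, the same helper could be called as join_text_nodes(response.xpath('//div[@class="discussionpost"]//text()').getall()). Note that it only trims quote characters at the edges of each text node, so quotes inside a paragraph are left alone.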
Update:
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ibsgroup.org/threads/hemorrhoids-as-cause-of-pain.363290/']

    def parse(self, response):
        for p in response.xpath('//*[@class="bbWrapper"]'):
            yield {
                'comment': ''.join(p.xpath(".//text()").getall()).strip()
            }
Answered By - F.Hoque