Saturday, August 13, 2022

[FIXED] XPath Scrapy Join Text Nodes Separated by br tags in a class

Tags: python, scrapy, xpath

Issue

I am learning web scraping with Python, XPath, and Scrapy, and I am stuck on the following problem. I would be grateful for any help.

This is the HTML:

<div class="discussionpost">
“This is paragraph one.”
<br>
<br>
“This is paragraph two."'
<br>
<br>
"This is paragraph three.”
</div>

This is the output I would like to get: "This is paragraph one. This is paragraph two. This is paragraph three." I would like to combine all the paragraphs, which are separated by <br> tags, into a single string. There are no <p> tags.

However, the output I am getting is a list of separate strings: "This is paragraph one.", "This is paragraph two.", "This is paragraph three."

This is the code I am using:

sentences = response.xpath('//div[@class="discussionpost"]/text()').extract()

I understand why the code above behaves this way, but I have not been able to change it to do what I need. Any help is greatly appreciated.

Solution

To get every text node under the div, including text nested inside child elements, use //text() instead of /text(), then join the resulting list into a single string:

sentences = ' '.join(response.xpath('//div[@class="discussionpost"]//text()').extract()).strip()

Verified in the Scrapy shell:

>>> from scrapy import Selector
>>> html_doc = '''
... <html>
...  <body>
...   <div class="discussionpost">
...    “This is paragraph one.”
...    <br/>
...    <br/>
...    “This is paragraph two."'
...    <br/>
...    <br/>
...    "This is paragraph three.”
...   </div>
...  </body>
... </html>
...
... '''
>>> res = Selector(text=html_doc)
>>> res
<Selector xpath=None data='<html>\n <body>\n  <div class="discussi...'>
>>> sentences = ''.join(res.xpath('//div[@class="discussionpost"]//text()').extract())
>>> sentences
'\n   “This is paragraph one.”\n   \n   \n   “This is paragraph two."\'\n   \n   \n   "This is paragraph three.”\n  '
>>> txt = sentences
>>> txt
'\n   “This is paragraph one.”\n   \n   \n   “This is paragraph two."\'\n   \n   \n   "This is paragraph three.”\n  '
>>> txt = sentences.replace('\n','').replace("\'",'').replace('    ','').replace("“",'').replace('”','').replace('"','').strip()
>>> txt
'This is paragraph one. This is paragraph two. This is paragraph three.'
>>>
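
The chain of replace() calls in the session above depends on the exact indentation of the sample HTML. A less fragile variant, sketched below, removes the quote characters and then collapses every run of whitespace with split(). The helper name join_text_nodes and the QUOTE_CHARS constant are made up for this illustration, and the nodes list is a hand-written reconstruction of what the //text() query returns for the sample document:

```python
# Characters that the question's desired output drops: straight and curly quotes.
QUOTE_CHARS = "\"'“”"


def join_text_nodes(nodes, drop_chars=QUOTE_CHARS):
    """Join text nodes into one line, collapsing all whitespace runs."""
    joined = " ".join(nodes)
    # Delete the unwanted characters, then let split() discard
    # newlines and repeated spaces regardless of how many there are.
    cleaned = joined.translate(str.maketrans("", "", drop_chars))
    return " ".join(cleaned.split())


# Reconstructed text nodes (an assumption based on the session output above).
nodes = [
    "\n   “This is paragraph one.”\n   ",
    "\n   ",
    "\n   “This is paragraph two.\"'\n   ",
    "\n   ",
    "\n   \"This is paragraph three.”\n  ",
]

print(join_text_nodes(nodes))
```

Because split() with no arguments treats any whitespace run as one separator, the result does not change if the HTML is indented differently.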

Update:

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ibsgroup.org/threads/hemorrhoids-as-cause-of-pain.363290/']

    def parse(self, response):
        # Select every element whose class attribute is exactly "bbWrapper".
        for p in response.xpath('//*[@class="bbWrapper"]'):
            yield {
                # Join all descendant text nodes of this element into one string.
                'comment': ''.join(p.xpath('.//text()').getall()).strip(),
            }

Answered By - F.Hoque

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
