Wednesday, August 24, 2022

[FIXED] accessing li and ul elements in html


Issue

I want to extract text from HTML with the structure below.

import scrapy

selector = scrapy.Selector(text="""
<li>Text1
  <ul>
    <li>Text2</li>
    <li>Text3</li>
  </ul>
</li>
""")

The following options give me all the text, including new lines, but lose all of the structure of the html.

selector.xpath('/descendant-or-self::*/text()').extract() 
selector.xpath('//li/text()').extract()

Is there a way of accessing these elements via some path? I would expect to be able to select the first text (Text1) via something like this,

selector.xpath('//li/text()').extract()

as the rest (Text2 and Text3) can be selected via,

selector.xpath('//li/ul/li/text()').extract()

Solution

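Your expected output is not entirely clear, so I assume you want to extract the text nodes from the top-level li tag and from the nested ul/li tags. Note that the expressions below match on class="a", which appears in the fuller HTML used in the scrapy shell session further down, not in the minimal snippet from the question.

The following XPath expression selects the text nodes that are direct children of every li containing a class="a" element (the top-level li and any nested li that wraps such an element):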

import re

el = ''.join(selector.xpath('//*[@class="a"]/ancestor::li/text()').extract()).replace('\n', '').strip()
txt = re.sub(r'\s+', ' ', el)

The second expression selects every text node, at any depth, under the li tags of the enclosing ul:

sel = ' '.join(selector.xpath('//*[@class="a"]/ancestor::ul//li//text()').extract()).replace('\n', '').strip()
txt2 = re.sub(r'\s+', ' ', sel)

P.S. I use the re module only to collapse the extra whitespace.
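The whitespace cleanup on its own can be isolated in a small helper. This is a minimal sketch using only the standard library; the function name normalize_whitespace and the sample string are illustrative assumptions, not part of the original answer.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # \s+ also matches '\n', so no separate .replace('\n', '') step is needed here.
    return re.sub(r'\s+', ' ', text).strip()


# Hypothetical raw text, shaped like joined text() nodes with indentation left in.
raw = 'Text1\n  \n        Text7\n      '
print(normalize_whitespace(raw))  # prints: Text1 Text7
```

For ordinary text, ' '.join(text.split()) should give the same result without importing re.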

Verified in the scrapy shell:

 %paste
selector = scrapy.Selector(text="""

<li>Text1
  <ul>
    <li>Text2</li>
    <li>Text3</li>
    <li><class="a">
      <i>Text4</i>
        Text5 
        <cite style="Style2" class="a">
        <a href="href1" title="Title1"> Text6</a>.
      </cite>
      <span class="b" title="Title2">
        <span style="Style1"></span>
      </span>
    </li>
    <li>
      Text7 
      <cite style="Style2" class="a">
        <i>Text8</i>
        Text9
        <a href="href2" title="Title2">Text10</a>.
      </cite>
      <span class="b" title="Title3">
        <span style="Style3"></span>
      </span>
    </li>
  </ul>
</li>
""")

   
       
    In [2]: el = ''.join(selector.xpath('//*[@class="a"]/ancestor::li/text()').extract()).replace('\n','').strip()
    
    In [3]: el
    Out[3]: 'Text1        Text7'
    
    In [4]: import re
    
    In [5]: txt = re.sub(r'\s+',' ',el)
    
    In [6]: txt
    Out[6]: 'Text1 Text7'
    
    In [7]: sel =' '.join(selector.xpath('//*[@class="a"]/ancestor::ul//li//text()').extract()).replace('\n','').strip( 
       ...: )
    
    In [8]: sel
    Out[8]: 'Text2 Text3        Text4         Text5                    Text6 .                                         Text7                 Text8         Text9         Text10 .'
    
    In [9]: txt2 = re.sub(r'\s+',' ',sel)
    
    In [10]: txt2
    Out[10]: 'Text2 Text3 Text4 Text5 Text6 . Text7 Text8 Text9 Text10 .'
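
To keep the parent/child structure that the question asks about, one option is to query each level relative to its own element instead of joining all text nodes into one string. The sketch below is an illustration under stated assumptions rather than the answer's method: it swaps in the standard library's xml.etree.ElementTree for Scrapy's selector (which works only because the question's minimal snippet happens to be well-formed XML), and the names SNIPPET, own_text and child_texts are made up for the example.

```python
import xml.etree.ElementTree as ET

# The minimal snippet from the question; it is well-formed, so an XML parser accepts it.
SNIPPET = """
<li>Text1
  <ul>
    <li>Text2</li>
    <li>Text3</li>
  </ul>
</li>
"""

root = ET.fromstring(SNIPPET.strip())

# Text that sits directly inside the outer <li>, before its first child element.
own_text = (root.text or '').strip()

# Text of each <li> one level down, selected relative to the outer <li>.
child_texts = [(li.text or '').strip() for li in root.findall('./ul/li')]

print(own_text)      # prints: Text1
print(child_texts)   # prints: ['Text2', 'Text3']
```

With the Scrapy selector from the question, the corresponding queries would likely be selector.xpath('//li[ul]/text()') for the outer text and selector.xpath('//li/ul/li/text()') for the nested items; the first still returns whitespace-only nodes that need stripping.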

Answered By - F.Hoque

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
