Sunday, October 17, 2021

[FIXED] scrapy able to check if only next sibling has expected tag?

Labels: html, nextsibling, python, scrapy, web-scraping

Issue

Here is part of the HTML I want to scrape:

<div id="hello">
  <p>abc</p>
  <center><img src="image_url"></center>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <center><img src="image_url"></center>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <p>abc</p>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <center><img src="image_url"></center>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <p>abc</p>
  <center><img src="image_url"></center>
</div>

I am trying to scrape, in order, the text in each p and the src of each img (the image_url). The HTML shown above is not static: every page has a different structure, so sometimes there are more p tags before a center tag that contains an img src.

Since the p and center tags are arranged differently on each page, I was thinking of getting all the p tags with, for example, response.css('#hello p'), then looping through them to get the text. While getting the text from the current p tag, I would also check whether its next sibling is a center tag and, if so, get the src and append it.

I found something close to that: p.xpath('following-sibling::center[1]/img/@src').get(), where p is each paragraph during the iteration.

But that does not work. If I have 4 p tags before a center, I get the same img src 4 times, because p.xpath('following-sibling::center[1]/img/@src').get() does not look only at the next sibling; it goes through all the following siblings and returns the first center tag it finds.

I tried googling, but I do not see anything about checking only whether the next sibling is a particular tag. Does anyone have an idea how I can get this to work so I can save the data in sequence?

Hopefully my explanation makes sense.

Thanks in advance for any help and suggestions

Solution

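Try the XPath below to get the required output. following-sibling::*[1] selects only the element immediately after the current p, and [name()="center"] keeps it only if that element is a center tag: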

p.xpath('following-sibling::*[1][name()="center"]/img/@src')

Answered By - JaSON

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
