Issue
Here is part of the HTML I want to scrape:
<div id="hello">
<p>abc</p>
<center><img src="image_url"></center>
<p align="center" style="text-align: center;"><b>def</b></p>
<center><img src="image_url"></center>
<p align="center" style="text-align: center;"><b>def</b></p>
<p>abc</p>
<p align="center" style="text-align: center;"><b>def</b></p>
<center><img src="image_url"></center>
<p align="center" style="text-align: center;"><b>def</b></p>
<p>abc</p>
<center><img src="image_url"></center>
</div>
I am trying to scrape, in document order, the text of each p tag and the src of each img (the image_url).
The HTML shown above is not static. Every page has a different structure, so sometimes there are more p tags before a center tag that contains the img src.
Since the p and center tags are arranged differently on each page, I was thinking of getting all the p tags, for example with response.css('#hello p'), and then looping through them to get the text. While getting the text from the current p tag, I would also check whether its next sibling is a center tag and, if so, get the src and append it.
I found something like that: p.xpath('following-sibling::center[1]/img/@src').get(), where p is each paragraph during the iteration.
But that does not work. If I have 4 p tags before a center, I get the same img src 4 times, because p.xpath('following-sibling::center[1]/img/@src').get() does not look only at the immediately following sibling; it searches all the following siblings and returns the first center tag it finds.
I tried googling, but I could not find anything about checking only whether the next sibling is a particular tag. Does anyone have an idea how I can get this to work so I can save the data in sequence?
Hopefully my explanation makes sense.
Thanks in advance for any help and suggestions
Solution
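Try the XPath below to get the required output. It takes the first following sibling of any type (*[1]) and keeps it only if that element is a center tag: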
p.xpath('following-sibling::*[1][name()="center"]/img/@src')
Answered By - JaSON