Issue
I have the following problem scraping a site. I have 3700 pages, each with a person's email, and I need to retrieve them. The problem is that the elements do not have any class name, and the XPath can differ between pages, because sometimes there is a phone number before the email and it breaks everything. I tried several solutions with Selenium, but they don't work. Can you please give me some advice on how to deal with this and how I can scrape them? Below are some examples of pages with different HTML structures. Thanks!
<div>
<div><i class="fa fa-envelope" style="margin-right: 0.5rem;"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.ttobbanaej@naej</span></div>
<div><a href="http://JeanAbbott.com" target="_blank" class="websiteLink" rel="noopener noreferrer" style="overflow-wrap: normal; text-overflow: ellipsis; overflow: hidden;">JeanAbbott.com</a></div>
<div id="contactInfoWrap" style="margin-top: 10px;">
<div>Jean Abbott</div>
<div>
<div>5 Colonial Circle</div>
<div>Medicine Lake, MN 55441</div>
<div>US</div>
</div>
</div>
</div>
And another one
<div>
<div><i class="fa fa-phone" style="margin-right: 0.5rem;"></i>202-800-7057</div>
<div><i class="fa fa-envelope" style="margin-right: 0.5rem;"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.tsiugnilde@ahsynal</span></div>
<div><a href="http://edlinguist.com/" target="_blank" class="websiteLink" rel="noopener noreferrer" style="overflow-wrap: normal; text-overflow: ellipsis; overflow: hidden;">edlinguist.com/</a></div>
<div id="contactInfoWrap" style="margin-top: 10px;">
<div>LaNysha Adams</div>
<div>
<div>80 M St SE</div>
<div>1st Floor</div>
<div>Washington, DC 20003</div>
<div>US</div>
</div>
</div>
</div>
The element that I need looks like this
<span style="unicode-bidi: bidi-override; direction: rtl;"> moc.ttobbanaej@naej</span>
Solution
//div[contains(.,"@")]/span
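The XPath expression above selects the desired element on both page layouts, because it matches on the "@" in the text rather than on the element's position:
<span style="unicode-bidi: bidi-override; direction: rtl;"> moc.tsiugnilde@ahsynal</span>
The text node value is moc.tsiugnilde@ahsynal. Note that the address is stored reversed in the markup (the direction: rtl style flips it for display), so reverse the string to get lanysha@edlinguist.com.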
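As a rough illustration of the same idea outside Selenium, here is a minimal sketch that uses only Python's standard-library html.parser. It does not evaluate the XPath itself; it applies the equivalent rule (take any span whose text contains "@") and then un-reverses the string. The names EmailSpanParser and extract_emails are made up for this example, and the sample markup is the second page from the question.

```python
from html.parser import HTMLParser


class EmailSpanParser(HTMLParser):
    """Collects the text of every <span> whose content contains '@'."""

    def __init__(self):
        super().__init__()
        self._depth = 0     # how many <span> elements we are currently inside
        self._chunks = []   # text pieces of the current outermost span
        self.emails = []    # decoded addresses found so far

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._chunks).strip()
                self._chunks = []
                if "@" in text:
                    # The markup stores the address reversed, so flip it back.
                    self.emails.append(text[::-1])

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def extract_emails(html):
    """Return the un-reversed email addresses found in <span> elements."""
    parser = EmailSpanParser()
    parser.feed(html)
    parser.close()
    return parser.emails


sample = """
<div>
<div><i class="fa fa-phone" style="margin-right: 0.5rem;"></i>202-800-7057</div>
<div><i class="fa fa-envelope" style="margin-right: 0.5rem;"></i><span style="unicode-bidi: bidi-override; direction: rtl;"> moc.tsiugnilde@ahsynal</span></div>
<div><a href="http://edlinguist.com/" target="_blank" class="websiteLink">edlinguist.com/</a></div>
</div>
"""

print(extract_emails(sample))  # ['lanysha@edlinguist.com']
```

With Selenium, the same post-processing would apply to the text of the element returned by the XPath: strip the whitespace and reverse the string before storing it.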
Answered By - F.Hoque