Issue
Why does this XPath expression not return a value?
XPath: //p[@class="email"]/text()
When I run this code, it doesn't print any value.
Website: https://codewithawais.com/test
# -*- coding: utf-8 -*-
import scrapy


class MainSpider(scrapy.Spider):
    name = 'main'
    start_urls = ['https://codewithawais.com/test']

    def parse(self, response):
        box = response.xpath('//div[@class="all_listing_details"]')
        for each in box:
            email = each.xpath('.//p[@class="email"]/text()').get()
            yield {
                "email": email,
            }
Solution
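It looks like the emails are obfuscated by Cloudflare's email protection, and email-decode.min.js only decodes them in the browser. The raw HTML that Scrapy receives contains this in place of each address: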
'<p class="email"><a href="/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="f49597979b819a8087b4839b9b90969186868d9a81868791868dda979bda819f">[email\xa0protected]</a></p>'
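If you look at the response in Chrome DevTools, you see the same markup. The <p> has no text node of its own, only an <a> child holding the placeholder, so /text() on the paragraph returns nothing. In the Scrapy shell, selecting just the paragraph tag with class email gives this kind of response: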
>>> box = response.xpath('//div[@class="all_listing_details"]')
>>> for each in box:
...     each.xpath('.//p[@class="email"]').get()
...
'<p class="email"><a href="/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="f194909c92b19385929e9f9f949285df929e9c">[email\xa0protected]</a></p>'
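To make the empty result concrete, here is a small sketch that inspects the same markup. It uses the standard library's ElementTree as a stand-in for Scrapy's selectors (an assumption, chosen so it runs without Scrapy installed), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Markup copied from the shell output above; this snippet happens to be
# well-formed, so an XML parser can read it.
SNIPPET = (
    '<p class="email"><a href="/cdn-cgi/l/email-protection" '
    'class="__cf_email__" '
    'data-cfemail="f194909c92b19385929e9f9f949285df929e9c">'
    '[email\xa0protected]</a></p>'
)

p = ET.fromstring(SNIPPET)
link = p.find("a")

# Rough equivalent of p/text(): text directly inside <p>, before any child.
direct_text = p.text
# Rough equivalent of p/a/text(): the placeholder left in the raw HTML.
placeholder = link.text
# Rough equivalent of p/a/@data-cfemail: the hex-encoded address.
encoded = link.get("data-cfemail")

print(repr(direct_text))
print(repr(placeholder))
print(encoded)
```

In Scrapy terms, './/p[@class="email"]//text()' would reach the placeholder and './/p[@class="email"]/a/@data-cfemail' the encoded value, while './/p[@class="email"]/text()' has nothing to match.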
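Requests/BS4 receive the same obfuscated markup, so I wasn't able to get the address text that way.
Here's a workaround using Selenium, which runs the page's JavaScript so the decoded addresses end up in the page source: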
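from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path via Service instead of executable_path
driver = webdriver.Chrome(service=Service(r'c:\users\aaron\chromedriver.exe'))
driver.get('https://codewithawais.com/test/')
# page_source is the DOM after the page's JavaScript has decoded the emails
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
emails = [p.get_text() for p in soup.select('p.email')]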
You could use Selenium directly in your Scrapy script, or use the scrapy-selenium downloader middleware, or use scrapy-splash.
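A possible browser-free alternative, which is not part of the answer above: the data-cfemail attribute appears to carry the address itself. The sketch below assumes the commonly observed scheme, in which the value is a hex string whose first byte is an XOR key for the remaining bytes. That scheme is an assumption (it may change without notice), and decode_cfemail is a hypothetical helper name.

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a data-cfemail hex string.

    Assumed scheme: the first byte is an XOR key, and every following
    byte is one character of the address XORed with that key.
    """
    data = bytes.fromhex(encoded)
    key, payload = data[0], data[1:]
    return bytes(b ^ key for b in payload).decode("utf-8")


# Value taken from the data-cfemail attribute in the shell output above.
sample = "f194909c92b19385929e9f9f949285df929e9c"
address = decode_cfemail(sample)
print(address)
```

In the spider, the attribute could be read with each.xpath('.//p[@class="email"]/a/@data-cfemail').get() and passed to the helper once it has been checked for None.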
Answered By - AaronS