Issue
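I am trying to scrape an email address, but my spider returns None.
This is the page link: https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry
I opened the network tab and checked the HTML code, but the email does not exist in it. There is only a placeholder saying that the address is protected against spam bots and that JavaScript must be enabled to view it: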
<div class="contact"><p>Contacter par email : <span id="cloak65106">Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.</span><script type='text/javascript'>
Code:
import scrapy
from scrapy.http import Request


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry']
    page_number = 1

    def parse(self, response):
        mail = response.xpath("//span//a[starts-with(@href, 'mailto')]/@href").get()
        yield {
            'email': mail
        }
Solution
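The webpage is static except for the email portion, which is filled in by JavaScript after the page loads. That is why you are getting None: Scrapy only sees the raw HTML, where the mailto link does not exist yet. To grab the email, you can use Scrapy with SeleniumRequest from the scrapy-selenium package: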
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry',
            callback=self.parse,
        )

    def parse(self, response):
        # The Selenium driver holds the page after JavaScript has run
        driver = response.meta['driver']
        r = Selector(text=driver.page_source)
        yield {
            'mail_link': r.xpath('//*[@class="contact"]/following-sibling::div[1]/p/span/a/@href').get(),
            'mail': r.xpath('//*[@class="contact"]/following-sibling::div[1]/p/span/a/text()').get()
        }
Output:
{'mail_link': 'mailto:[email protected]', 'mail': '[email protected]'}
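If only the bare address is needed from 'mail_link', the mailto: prefix can be stripped with the standard library. This is a minimal sketch, not part of the original answer: the helper name email_from_mailto is hypothetical, and the example address is made up. It also returns None when the XPath matched nothing, which can happen if the page has not finished rendering.

```python
from urllib.parse import unquote, urlsplit


def email_from_mailto(href):
    """Return the address part of a mailto: link, or None if it is not one."""
    if not href:
        # .get() returns None when the XPath matched nothing
        return None
    parts = urlsplit(href.strip())
    if parts.scheme != "mailto":
        return None
    # For mailto links the address is in the path; ?subject=... ends up in the query
    address = unquote(parts.path).strip()
    return address or None


# Made-up address, for illustration only
print(email_from_mailto("mailto:someone@example.com?subject=Hello"))
```

In the spider, this could be applied to the value extracted for 'mail_link' before yielding the item.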
You have to add the following code to the settings.py file of your Scrapy project:
# Middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
# Selenium
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# '--headless' if using chrome instead of firefox
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
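Before adding Selenium to a project, it can help to confirm that the address really is missing from the static HTML. The sketch below uses only the standard library and is an illustration, not part of the original answer: the names MailtoFinder and find_mailto_links are hypothetical, the static snippet is shortened from the one quoted in the question, and the "rendered" snippet with its address is made up to show what a JavaScript-filled page might look like.

```python
from html.parser import HTMLParser


class MailtoFinder(HTMLParser):
    """Collect the href of every <a> tag whose link starts with 'mailto:'."""

    def __init__(self):
        super().__init__()
        self.mailto_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                self.mailto_links.append(value)


def find_mailto_links(html):
    """Return all mailto hrefs found in the given HTML string."""
    finder = MailtoFinder()
    finder.feed(html)
    finder.close()
    return finder.mailto_links


# Static HTML as in the question (shortened): a placeholder <span>, no <a>
static_html = (
    '<div class="contact"><p>Contacter par email : '
    '<span id="cloak65106">Cette adresse e-mail est protégée contre les '
    'robots spammeurs.</span></p></div>'
)

# Hypothetical HTML after JavaScript has run; the address is made up
rendered_html = (
    '<div class="contact"><p>Contacter par email : '
    '<span id="cloak65106"><a href="mailto:someone@example.com">'
    'someone@example.com</a></span></p></div>'
)

print(find_mailto_links(static_html))    # []
print(find_mailto_links(rendered_html))  # ['mailto:someone@example.com']
```

An empty list for the raw response body, combined with a visible address in the browser, suggests the link is injected by JavaScript and that a rendering step such as SeleniumRequest is needed.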
Answered By - F.Hoque