Issue
The spider scrapes the first page, but as soon as it moves to the second page it raises KeyError: 'driver'.
Is there any solution for this? I want to create a web crawler using scrapy-selenium. This is the page link: https://barreau-montpellier.com/annuaire-professionnel/?cn-s My code looks like this:
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest


class TestSpider(scrapy.Spider):
    name = 'test'
    page_number = 1

    def start_requests(self):
        yield SeleniumRequest(url='https://barreau-montpellier.com/annuaire-professionnel/?cn-s=', callback=self.parse)

    def parse(self, response):
        driver = response.meta['driver']
        r = Selector(text=driver.page_source)
        details = r.xpath("//div[@class='cn-entry cn-background-gradient']")
        for detail in details:
            email = detail.xpath(".//span[@class='email cn-email-address']//a//@href").get()
            try:
                email = email.replace("mailto:", "")
            except:
                email = ''
            n1 = detail.xpath(".//span[@class='given-name']//text()").get()
            n2 = detail.xpath(".//span[@class='family-name']//text()").get()
            name = n1 + n2
            telephone = detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workphone']//a//text()").get()
            fax = detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workfax']//a//text()").get()
            street = detail.xpath(".//span[@class='adr cn-address']//span[@class='street-address notranslate']//text()").get()
            locality = detail.xpath(".//span[@class='adr cn-address']//span[@class='locality']//text()").get()
            code = detail.xpath(".//span[@class='adr cn-address']//span[@class='postal-code']//text()").get()
            address = street + locality + code
            yield {
                'name': name,
                'mail': email,
                'telephone': telephone,
                'Fax': fax,
                'address': address
            }

        next_page = 'https://barreau-montpellier.com/annuaire-professionnel/pg/' + str(TestSpider.page_number) + '/?cn-s'
        if TestSpider.page_number <= 155:
            TestSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)
In settings.py I have added this:
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('C:\Program Files (x86)\chromedriver.exe')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
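As a side note on those settings: backslashes in a Windows path are easy to get wrong inside a normal string literal, because sequences such as \t or \n would be interpreted as escapes, so it is safer to write the path as a raw string. The snippet below is only a defensive variant of the same settings, not something the question or the answer uses; the r-prefix is the only change:

# settings.py -- same Selenium settings, with the chromedriver path written as
# a raw string so none of the backslashes can be treated as escape sequences.
# The r-prefix is an illustrative assumption; the path itself is unchanged.
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which(r'C:\Program Files (x86)\chromedriver.exe')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}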
Solution
Why are you getting KeyError: 'driver'? After testing your code a few times, the cause is clear. Have you tried running it without the pagination part? I got the same KeyError: 'driver', and it disappeared as soon as I removed the pagination. The culprit is the next-page handling: response.follow() sends a plain Scrapy Request, which does not go through the SeleniumMiddleware, so the next response has no 'driver' in its meta and parse() fails with KeyError: 'driver'. I've moved the pagination into def start_requests(self) using range(), yielding a SeleniumRequest for every page. It works without any issues, and in my run this style of pagination was also about twice as fast as following the pages one after another.
Full working code:
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        # Build all 155 page URLs up front and request each one as a
        # SeleniumRequest, so every response carries a driver in its meta.
        urls = ['https://barreau-montpellier.com/annuaire-professionnel/pg/' + str(x) + '/?cn-s' for x in range(1, 156)]
        for url in urls:
            yield SeleniumRequest(
                url=url,
                callback=self.parse,
                wait_time=3)

    def parse(self, response):
        driver = response.meta['driver']
        r = Selector(text=driver.page_source)
        details = r.xpath("//div[@class='cn-entry cn-background-gradient']")
        for detail in details:
            email = detail.xpath(".//span[@class='email cn-email-address']//a//@href").get()
            try:
                email = email.replace("mailto:", "")
            except:
                email = ''
            n1 = detail.xpath(".//span[@class='given-name']//text()").get()
            n2 = detail.xpath(".//span[@class='family-name']//text()").get()
            name = n1 + n2
            telephone = detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workphone']//a//text()").get()
            fax = detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workfax']//a//text()").get()
            street = detail.xpath(".//span[@class='adr cn-address']//span[@class='street-address notranslate']//text()").get()
            locality = detail.xpath(".//span[@class='adr cn-address']//span[@class='locality']//text()").get()
            code = detail.xpath(".//span[@class='adr cn-address']//span[@class='postal-code']//text()").get()
            address = street + locality + code
            yield {
                'name': name,
                'mail': email,
                'telephone': telephone,
                'Fax': fax,
                'address': address
            }
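If you would rather keep the original page-by-page flow than generate all the URLs in start_requests(), the same diagnosis points at another possible fix: yield the next page as a SeleniumRequest instead of response.follow(), so the following response also passes through the Selenium middleware and keeps 'driver' in its meta. The sketch below is untested and only illustrates that alternative; the spider name, the starting page number of 2 (assuming the landing URL already shows page 1) and the wait_time are assumptions:

import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest


class PaginatedTestSpider(scrapy.Spider):
    # Hypothetical name, only to avoid clashing with the spider above.
    name = 'test_follow_pages'
    # Start at 2 on the assumption that the landing URL already shows page 1.
    page_number = 2

    def start_requests(self):
        yield SeleniumRequest(
            url='https://barreau-montpellier.com/annuaire-professionnel/?cn-s=',
            callback=self.parse,
            wait_time=3)

    def parse(self, response):
        driver = response.meta['driver']
        r = Selector(text=driver.page_source)
        # ... extract the entries exactly as in the spider above ...
        # Pagination: yield a SeleniumRequest (not response.follow) so the next
        # response also goes through SeleniumMiddleware and keeps 'driver'.
        if self.page_number <= 155:
            next_page = ('https://barreau-montpellier.com/annuaire-professionnel/pg/'
                         + str(self.page_number) + '/?cn-s')
            self.page_number += 1
            yield SeleniumRequest(url=next_page, callback=self.parse, wait_time=3)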
Output:
{'name': 'CharlesZWILLER', 'mail': '[email protected]', 'telephone': '04 67 60 24 56', 'Fax': '04 67 60 00 58', 'address': '24 Bd du Jeu de PaumeMONTPELLIER34000'}
2022-08-15 11:56:31 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-15 11:56:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 29687144,
'downloader/response_count': 155,
'downloader/response_status_count/200': 155,
'elapsed_time_seconds': 2230.899805,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 8, 15, 18, 56, 31, 850294),
'item_scraped_count': 1219,
'log_count/DEBUG': 3864,
'log_count/INFO': 37,
'response_received_count': 155,
'scheduler/dequeued': 155,
Answered By - F.Hoque