Issue
I want to restrict some XPaths using LinkExtractor, but I get this error: got multiple values for argument. Kindly give me a suggestion as to what mistake I am making.
import scrapy
from scrapy.http import Request
from selenium import webdriver
from scrapy.http import HtmlResponse
import time
from scrapy_selenium import SeleniumRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BarSpider(scrapy.Spider):
    name = 'bar'
    start_urls = ["https://www.veteranownedbusiness.com/?mode=geo#BrowseByState"]

    def parse(self, response):
        books = response.xpath('//table[@class="categories"]//tr//td//a[@class="category"]//@href').extract()
        for book in books:
            url = response.urljoin(book)
            rules = (Rule(LinkExtractor(restrict_xpaths=('//table[@class="categories"]//tr//td[1]//a[@class="category"]//@href'))))
            yield Request(url, rules, callback='base_url')

    def base_url(self, response):
        links = response.xpath('//table[@class="listings"]//a//@href').extract()
        for link in links:
            b_link = response.urljoin(link)
            yield {
                'url': b_link,
            }
Solution
There are a few issues with your spider:

- A Rule object is useless unless it is an attribute of a CrawlSpider. If you simply want to use a LinkExtractor, you can do so without wrapping it in a Rule object.
- A LinkExtractor extracts links from the selected elements themselves, so you should not include @href at the end of your restrict_xpaths selectors.
- This is the cause of the error you are receiving: a Request expects only one positional argument, the URL. If it receives a second positional argument, it assumes that value is the callback. In your example the URL is the first positional argument, the Rule is the second, and the callback is also passed as a keyword argument, so Python complains that it received multiple values for the callback parameter (see the short sketch after this list). Request objects don't accept Rule objects as a parameter anyway.
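For instance, here is a minimal sketch in plain Python (not Scrapy itself, just an illustrative function) of how that TypeError arises when a positional argument and a keyword argument both target the same parameter:

def request(url, callback=None):
    pass

# The second positional argument already fills the callback parameter,
# so the keyword argument supplies callback a second time:
request('https://www.example.com', 'something', callback='base_url')
# TypeError: request() got multiple values for argument 'callback'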
What you can do to address these issues is instantiate a LinkExtractor directly, remove the @href part of your XPath, and then iterate over the extracted links, yielding a Request for each one.
For example:
def parse(self, response):
    for link in LinkExtractor(restrict_xpaths=[
        '//table[@class="categories"]//tr//td[1]//a[@class="category"]'
    ]).extract_links(response):
        url = response.urljoin(link.url)
        yield Request(url, callback=self.base_url)
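Alternatively, if you would rather keep the Rule, it has to be a class attribute of a CrawlSpider instead of being passed to a Request. A rough, untested sketch of that approach, reusing the XPaths from your spider, could look like this:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BarSpider(CrawlSpider):
    name = 'bar'
    start_urls = ['https://www.veteranownedbusiness.com/?mode=geo#BrowseByState']

    # On a CrawlSpider, rules is a class attribute; each Rule pairs a
    # LinkExtractor with the callback that handles the followed pages.
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths=[
                '//table[@class="categories"]//tr//td[1]//a[@class="category"]'
            ]),
            callback='base_url',
        ),
    )

    def base_url(self, response):
        for link in response.xpath('//table[@class="listings"]//a/@href').extract():
            yield {'url': response.urljoin(link)}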
Answered By - Alexander