Issue
I am trying to run my first CrawlSpider, but the program terminates without any errors and without returning anything - zero results. What's wrong with my code?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FagorelectrodomesticoSpider(CrawlSpider):
    name = 'fagorelectrodomestico.com'
    allowed_domains = ['fagorelectrodomestico.com']
    start_urls = ['https://fagorelectrodomestico.com']

    rules = (
        Rule(LinkExtractor(allow='product/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for doc in response.css('a.file'):
            doclink = doc.css('::attr("href")').get()
            product = Product()
            product['model'] = response.css('h2.data__symbol::text').get()
            product['brand'] = 'Fagor'
            product['file_urls'] = [doclink]
            yield product
Solution
The main problem is that this page uses JavaScript to add all elements to the HTML, but Scrapy can't run JavaScript. If you turn off JavaScript in your browser and reload the page, you should see an empty white page. There is, however, the module scrapy_selenium, which uses Selenium to control a real web browser that can run JavaScript (but it will run slower).
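You can confirm this yourself without a browser - a minimal sketch using only the standard library (the URL comes from the question; the User-Agent header is my assumption, added only to avoid a trivial block):

# Minimal check (a sketch): fetch the raw HTML without running JavaScript.
from urllib.request import Request, urlopen

req = Request('https://fagorelectrodomestico.com/en/',
              headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read().decode('utf-8', errors='replace')

# If this prints False, the product links are added later by JavaScript.
print('product/' in html)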
The other problem: your rule searches for links with product/, which I don't see on the main page but can see on the category pages. You don't have a rule to load those other pages, so the spider can't reach the product/ links on the subpages - it needs a second rule that follows the other links and sends them to the callback parse (which, in the spider, loads the page, extracts all links and checks the rules against them). You may also need to add /en/ to the start URL to get the English version, which has links with product/; the Spanish version has links with productos/ instead.
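To see which links each rule would actually pick up, you can run the extractors by hand, e.g. inside scrapy shell <category-url> (a sketch; response is the object the shell provides, and with a plain request the lists may well be empty because of the JavaScript problem described above):

from scrapy.linkextractors import LinkExtractor

# `response` is provided by `scrapy shell <url>`
product_links  = LinkExtractor(allow='/en/product/').extract_links(response)
category_links = LinkExtractor(allow='/en/', deny='/en/product/').extract_links(response)

print([link.url for link in product_links])   # what the first rule would follow
print([link.url for link in category_links])  # what the second rule would follow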
Some extra code is needed to use SeleniumRequest instead of the standard Request - I took some code from the source of CrawlSpider and added it with changes.
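For reference, the adapted method is CrawlSpider._requests_to_follow; in recent Scrapy versions it looks roughly like this (a simplified sketch, not the exact source - the real version also deduplicates links and applies process_links/process_request). It yields plain Requests, which is why the spider below re-implements the loop with SeleniumRequest:

# Simplified sketch of CrawlSpider._requests_to_follow (Scrapy 2.x)
def _requests_to_follow(self, response):
    for rule_index, rule in enumerate(self._rules):
        for link in rule.link_extractor.extract_links(response):
            yield Request(
                url=link.url,
                callback=rule.callback,
                errback=rule.errback,
                meta=dict(rule=rule_index, link_text=link.text),
            )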
I also used CrawlerProcess to run the code without creating a project, so everyone can simply copy it and run it with python script.py. It downloads the files to the folder full (the standard FilesPipeline creates it inside FILES_STORE). I tested it only without the option -headless, so I could see what the browser gets. You may want to test it with -headless: it may work faster, but sometimes it behaves differently.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import scrapy_selenium


class FagorelectrodomesticoSpider(CrawlSpider):
    name = 'fagorelectrodomestico.com'
    allowed_domains = ['fagorelectrodomestico.com']
    start_urls = ['https://fagorelectrodomestico.com/en/']

    rules = (
        Rule(LinkExtractor(allow='/en/product/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow='/en/', deny='/en/product/'), callback='parse', follow=True),
    )

    def start_requests(self):
        print('[start_requests]')
        for url in self.start_urls:
            print('[start_requests] url:', url)
            yield scrapy_selenium.SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        print('[parse] url:', response.url)
        for rule_index, rule in enumerate(self._rules):
            #print(rule.callback)
            for link in rule.link_extractor.extract_links(response):
                yield scrapy_selenium.SeleniumRequest(
                    url=link.url,
                    callback=rule.callback,
                    errback=rule.errback,
                    meta=dict(rule=rule_index, link_text=link.text),
                )

    def parse_item(self, response):
        print('[parse_item] url:', response.url)
        for doc in response.css('a.file'):
            doclink = doc.css('::attr("href")').get()
            product = {
                'model': response.css('h2.data__symbol::text').get(),
                'brand': 'Fagor',
                'file_urls': [doclink],
            }
            yield product


# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',

    # save in CSV, JSON or XML file
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in Scrapy 2.1

    # standard FilesPipeline (downloads to FILES_STORE/full)
    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
    #'FILES_STORE': '/path/to/valid/dir',  # this folder has to exist before downloading
    'FILES_STORE': '.',  # this folder has to exist before downloading

    'SELENIUM_DRIVER_NAME': 'firefox',
    'SELENIUM_DRIVER_EXECUTABLE_PATH': '/home/furas/bin/geckodriver',
    #'SELENIUM_DRIVER_ARGUMENTS': ['-headless'],  # '--headless' if using chrome instead of firefox
    'SELENIUM_DRIVER_ARGUMENTS': [],
    #'SELENIUM_BROWSER_EXECUTABLE_PATH': '',
    #'SELENIUM_COMMAND_EXECUTOR': '',

    'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800},
})

c.crawl(FagorelectrodomesticoSpider)
c.start()
Answered By - furas