Issue
Introduction
Since my previous crawler is more or less finished, I need to build a new crawler that only crawls a whole domain for links; I need this for my work. The spider that crawls every link should run once per month.
I'm running Scrapy 2.4.0 and my OS is Ubuntu Server 18.04 LTS.
Problem
The website I have to crawl changed its privacy settings, so you now have to be logged in before you can see the products, which is why my link crawler won't work anymore. I already managed to log in and scrape everything I need, but in that spider the start_urls were given in a CSV file.
Code
import scrapy
from ..items import DuifItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import FormRequest, Request
from scrapy_splash import SplashRequest

class DuifLinkSpider(CrawlSpider):
    name = 'duiflink'
    allowed_domains = ['duif.nl']
    login_page = 'https://www.duif.nl/login'
    start_urls = ['https://www.duif.nl']
    custom_settings = {'FEED_EXPORT_FIELDS': ['Link']}

    def start_requests(self):
        yield SplashRequest(
            url=self.login_page,
            callback=self.parse_login,
            args={'wait': 3},
            dont_filter=True
        )

    rules = (
        Rule(LinkExtractor(deny='https://www.duif.nl/nl/'), callback='parse_login', follow=True),
    )

    def parse_login(self, response):
        return FormRequest.from_response(
            response,
            formid='login-form',
            formdata={
                'username': 'not real',
                'password': 'login data'},
            clickdata={'type': 'submit'},
            callback=self.after_login)

    def after_login(self, response):
        accview = response.xpath('//ul[@class="nav navbar-nav navbar-secondary navbar-right"]//a/@href')[13]
        if accview:
            print('success')
        else:
            print(':(')
        for url in self.start_urls:
            yield response.follow(url=url, callback=self.search_links)

    def search_links(self, response):
        link = response.xpath('//ul[@class="nav navbar-nav navbar-secondary navbar-right"]/li/a/@href').get()
        for a in link:
            link = response.url
            yield response.follow(url=link, callback=self.parse_page)

    def parse_page(self, response):
        productpage = response.xpath('//div[@class="product-details col-md-12"]')
        if not productpage:
            print('No productlink', response.url)
        for a in productpage:
            items = DuifItem()
            items['Link'] = response.url
            yield items
Unfortunately I can't provide a dummy account where you could try the login yourself, because it's a B2B-service website.
I suspect that my search_links method is wrong.
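For what it's worth, one thing that stands out in search_links is that .get() returns the first matching href as a plain string, so the for loop iterates over its characters rather than over links. A quick pure-Python illustration (the URLs are made up):

```python
# .get() returns a single string (the first match), so looping over it
# walks the string character by character.
link = "/nl/products"          # what response.xpath('...').get() would return
pieces = [a for a in link]     # iterates characters, not links
print(pieces[:4])              # ['/', 'n', 'l', '/']

# Looping over the full list of matches (e.g. what .getall() would return)
# yields one entry per matching <a> element instead.
links = ["/nl/products", "/nl/login"]
print([a for a in links])      # ['/nl/products', '/nl/login']
```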
My planned structure is:
- visit login_page and pass my login credentials
- check whether I am logged in via an XPath expression that tests whether the logout button is present
- if logged in, print 'success'
- starting from an XPath expression, follow links
- for every visited link, check via an XPath expression whether a specific container is present, so the spider knows whether it is a product page or not
- if it is a product page, save the visited link; if it is not a product page, take the next link
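The product-page check in the last two steps can be sketched with a stdlib-only helper (the class string is the one from the spider's XPath; the sample HTML snippets are made up):

```python
from html.parser import HTMLParser

class ProductDetector(HTMLParser):
    """Flags pages containing <div class="product-details col-md-12">."""
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'div' and dict(attrs).get('class') == 'product-details col-md-12':
            self.found = True

def is_product_page(html):
    detector = ProductDetector()
    detector.feed(html)
    return detector.found

print(is_product_page('<div class="product-details col-md-12">...</div>'))  # True
print(is_product_page('<div class="category-list">...</div>'))              # False
```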
Console output
As you can see, the authentication is working, but the spider doesn't do anything afterwards.
Update
I reworked my code a bit:
import scrapy
from ..items import DuifItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import FormRequest, Request
from scrapy_splash import SplashRequest

class DuifLinkSpider(CrawlSpider):
    name = 'duiflink'
    allowed_domains = ['duif.nl']
    login_page = 'https://www.duif.nl/login'
    start_urls = ['https://www.duif.nl/']
    custom_settings = {'FEED_EXPORT_FIELDS': ['Link']}

    def start_requests(self):
        yield SplashRequest(
            url=self.login_page,
            callback=self.parse_login,
            args={'wait': 3},
            dont_filter=True
        )

    rules = (
        Rule(LinkExtractor(), callback='parse_login', follow=True),
    )

    def parse_login(self, response):
        return FormRequest.from_response(
            response,
            formid='login-form',
            formdata={
                'username' : 'not real',
                'password' : 'login data'},
            clickdata={'type' : 'submit'},
            callback=self.after_login)

    def after_login(self, response):
        accview = response.xpath('//ul[@class="nav navbar-nav navbar-secondary navbar-right"]//a/@href')[13]
        if accview:
            print('success')
        else:
            print(':(')
        for url in self.start_urls:
            yield response.follow(url=url, callback=self.search_links, dont_filter=True)

    def search_links(self, response):
        # link = response.xpath('//ul[@class="nav navbar-nav navbar-secondary navbar-right"]/li/a/@href')
        link = response.xpath('//a/@href')
        for a in link:
            link = a.get()
            link = 'https://www.duif.nl' + link if link else link
            yield response.follow(url=link, callback=self.parse_page, dont_filter=True)

    def parse_page(self, response):
        productpage = response.xpath('//div[@class="product-details col-md-12"]')
        if not productpage:
            print('No productlink', response.url)
        for a in productpage:
            items = DuifItem()
            items['Link'] = response.url
            yield items
Now I know that I am definitely logged in, but it doesn't follow the "sub"-links. I thought that if I use response.xpath('//a/@href'), it would automatically search the whole DOM for every link.
Below is my new console output.
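That expectation about the XPath itself is right: //a/@href matches every anchor in the document, however deeply nested. A stdlib-only sketch of what such an extraction sees (the sample HTML is made up):

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag, at any nesting depth."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)

collector = HrefCollector()
collector.feed('<ul><li><a href="/nl/a">A</a></li>'
               '<li><div><a href="/nl/b">B</a></div></li></ul>')
print(collector.hrefs)  # ['/nl/a', '/nl/b'] -- both anchors found
```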
Solution
After you log in, you go back to parsing your start URL. Scrapy filters out duplicate requests by default, so in your case it stops there. You can avoid this by passing dont_filter=True in your request, like this:
yield response.follow(url=url, callback=self.search_links, dont_filter=True)
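To illustrate why this matters: Scrapy's default dupefilter keeps a set of fingerprints of requests it has already scheduled and silently drops anything it has seen before, unless dont_filter=True is set. A toy sketch of that behavior (real fingerprinting hashes the full request; here it is simplified to the bare URL):

```python
class ToyDupeFilter:
    """Simplified stand-in for Scrapy's default dupefilter: drop repeated URLs."""
    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        if dont_filter:
            return True          # bypass filtering entirely
        if url in self.seen:
            return False         # duplicate: silently dropped
        self.seen.add(url)
        return True

f = ToyDupeFilter()
print(f.should_schedule('https://www.duif.nl/'))                    # True: first time seen
print(f.should_schedule('https://www.duif.nl/'))                    # False: duplicate, dropped
print(f.should_schedule('https://www.duif.nl/', dont_filter=True))  # True: filter bypassed
```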
Answered By - Wim Hermans