Issue
I'm stuck trying to get my spider to work. This is the scenario: I'm trying to find all the URLs of a specific domain that are contained in a particular target website. For this, I've defined a couple of rules so I can crawl the site and pick out the links I'm interested in.
The thing is that it doesn't seem to work, even though I know there are links with the proper format on the site.
This is my spider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class sp(CrawlSpider):
    name = 'sp'
    start_urls = ['https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/']
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'DEPTH_LIMIT': 4
    }

    rules = (
        Rule(LinkExtractor(unique=True, allow_domains='a2zinc.net'), callback='parse_item'),
        Rule(LinkExtractor(unique=True, canonicalize=True, allow_domains='nationalpavementexpo.com'))
    )

    def parse_item(self, response):
        print(response.request.url)
        yield {'link': response.request.url}
So, in summary, I'm trying to find all the links from 'a2zinc.net' contained inside https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/ and its subsections.
As you can see, there are at least three occurrences of the desired links on the target site.
The funny thing is that when I test the spider against another target site (like this one) that also contains links of interest, it works as expected, and I can't really see the difference.
Also, if I create a LinkExtractor instance inside a parsing method (as in the snippet below), it is also able to find the desired links, but I don't think that's the best way to use CrawlSpider + Rules.
    def parse_item(self, response):
        le = LinkExtractor(allow_domains='a2zinc.net')
        links = le.extract_links(response)
        for link in links:
            yield {'link': link.url}
Any idea what the cause of the problem could be?
Thanks a lot.
Solution
Your code works. The only issue is that you have set the logging level to INFO, while the links that are being extracted return status code 403, which is only visible at the DEBUG level. Comment out your custom settings and you will see that the links are being extracted.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class sp(CrawlSpider):
    name = 'sp'
    start_urls = ['https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/']
    custom_settings = {
        # 'LOG_LEVEL': 'INFO',
        # 'DEPTH_LIMIT': 4
    }

    rules = (
        Rule(LinkExtractor(allow_domains='a2zinc.net'), callback='parse_item'),
        Rule(LinkExtractor(unique=True, canonicalize=True, allow_domains='nationalpavementexpo.com'))
    )

    def parse_item(self, response):
        print(response.request.url)
        yield {'link': response.request.url}
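Note that Scrapy's HttpError spider middleware drops those 403 responses before they reach parse_item, so at DEBUG level you will see the requests being crawled and filtered, but you still won't get items for them. If you also want to record the a2zinc.net URLs as items despite the 403, one option is to let the spider handle that status code itself. This is only a minimal sketch of that idea, not part of the original answer:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class sp(CrawlSpider):
    name = 'sp'
    start_urls = ['https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/']

    # Let 403 responses through to the callback instead of having
    # HttpErrorMiddleware drop them (HTTPERROR_ALLOWED_CODES = [403]
    # in the settings would have the same effect).
    handle_httpstatus_list = [403]

    rules = (
        Rule(LinkExtractor(allow_domains='a2zinc.net'), callback='parse_item'),
        Rule(LinkExtractor(unique=True, canonicalize=True, allow_domains='nationalpavementexpo.com'))
    )

    def parse_item(self, response):
        # The requested URL is available even when the server answers 403.
        yield {'link': response.request.url, 'status': response.status}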
Answered By - msenior_