Issue
I noticed that some of the websites I am trying to scrape redirect me to a different hostname: for example, https://www.citibank.com.au/ redirects to https://www1.citibank.com.au/. While Scrapy does scrape regular subdomains (www.subdomain.example.com), it skips hosts like www2.example.com.
This is apparently how Scrapy is supposed to work; the documentation (https://doc.ebichu.cc/scrapy/) states:
OffsiteMiddleware (class scrapy.spidermiddlewares.offsite.OffsiteMiddleware): Filters out Requests for URLs outside the domains covered by the spider.
This middleware filters out every request whose host names aren’t in the spider’s allowed_domains attribute. All subdomains of any domain in the list are also allowed. E.g. the rule www.example.org will also allow bob.www.example.org but not www2.example.com nor example.com.
My question is: how would I make sure that all subdomains that have a different hostname (e.g. www2.example.com) are scraped?
The solution I could think of is to populate the allowed_domains list with all variations of a URL (e.g. [www.example.com, www1.example.com, www2.example.com, etc.]). Is this the way to go, or is there an option I have overlooked in the Scrapy documentation that could fix this in a nicer way?
Solution
This is the way the allowed_domains attribute is supposed to work in Scrapy: it's meant to filter out any requests that go offsite, i.e. outside the domains you allowed.
If you want your spider to reach several offsite domains, you don't need the allowed_domains attribute at all; just remove it from your spider (or leave it empty) and requests will get through.
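As a rough sketch (the spider name and start URL here are just placeholders), a spider without allowed_domains lets the redirected requests through, since the offsite filter has nothing to check against:

import scrapy

class BankSpider(scrapy.Spider):
    name = "bank"
    # no allowed_domains attribute, so offsite filtering is effectively disabled
    start_urls = ["https://www.citibank.com.au/"]

    def parse(self, response):
        # a redirect to https://www1.citibank.com.au/ is followed as well
        yield {"url": response.url}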
In the specific case you mentioned, where the hosts are all part of the same domain and you are just dealing with mirrors ("www1...", "www2..."), use only the actual domain:
allowed_domains = ['example.com']
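To see why this is enough, here is a rough illustration of the subdomain rule the middleware applies (this is not Scrapy's actual code, just the matching logic in plain Python): a host passes if it equals an allowed domain or ends with "." plus that domain.

def is_allowed(hostname, allowed_domains):
    # a hostname passes if it is an allowed domain or a subdomain of one
    return any(
        hostname == domain or hostname.endswith("." + domain)
        for domain in allowed_domains
    )

print(is_allowed("www1.example.com", ["example.com"]))      # True
print(is_allowed("www2.example.com", ["example.com"]))      # True
print(is_allowed("www2.example.com", ["www.example.com"]))  # False

So with allowed_domains = ['example.com'], the www, www1 and www2 hosts are all in scope.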
Answered By - renatodvc