Issue
I want to know if there is a better way to crawl multiple URL patterns on the same website with a single spider. I have several URLs that I want to access using a page index.
The code would be:
import scrapy
from random import shuffle

class MySpider(scrapy.Spider):
    limit = 5
    pages = list(range(1, limit))
    shuffle(pages)

    cat_a = 'http://example.com/a?page={}'
    cat_b = 'http://example.com/b?page={}'

    def parse(self, response):
        for i in self.pages:
            page_cat_a = self.cat_a.format(i)
            page_cat_b = self.cat_b.format(i)
            yield response.follow(page_cat_a, self.parse_page)
            yield response.follow(page_cat_b, self.parse_page)
The parse_page function continues to crawl for other data within these pages.
In my output file I can see the data is gathered in repeating sequences, so I get 10 web pages from category a, then 10 web pages from category b, and so on. I wonder if the web server I am crawling would notice this sequential behaviour and ban me.
Also, I have 8 URL patterns on the same website that I want to crawl, all using page indexes, so instead of the 2 categories in the example there would be 8. Thanks.
Solution
You can use the start_requests spider method instead of building the requests inside the parse method.
import scrapy
from random import shuffle

class MySpider(scrapy.Spider):
    name = 'myspider'  # a spider needs a name to be runnable
    categories = ('a', 'b')
    limit = 5
    pages = list(range(1, limit))
    base_url = 'http://example.com/{category}?page={page}'

    def start_requests(self):
        # Shuffle pages to try to avoid bans
        shuffle(self.pages)
        for category in self.categories:
            for page in self.pages:
                url = self.base_url.format(category=category, page=page)
                yield scrapy.Request(url)

    def parse(self, response):
        # Parse the page
        pass
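Since you actually have 8 URL patterns rather than 2, you can simply list all of them in the categories tuple. If you also want to avoid the output coming in per-category blocks, one option is to shuffle every (category, page) combination together rather than only the pages; here is a minimal sketch of an alternative start_requests, assuming the same imports and class attributes as the spider above:

    def start_requests(self):
        # Build every (category, page) combination and shuffle them all together,
        # so requests are not grouped by category
        combos = [(c, p) for c in self.categories for p in self.pages]
        shuffle(combos)
        for category, page in combos:
            url = self.base_url.format(category=category, page=page)
            yield scrapy.Request(url)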
Another thing you can try is to search for the category URLs from within the site itself.
Let's say you want to get information from the tags shown in the sidebar of http://quotes.toscrape.com/.
You could manually copy the links and use them the way you are doing, or you could do this:
import scrapy

class MySpider(scrapy.Spider):
    name = 'quotes_tags'  # a spider needs a name to be runnable
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Follow every tag link found in the sidebar tags box
        for tag in response.css('div.col-md-4.tags-box a.tag::attr(href)').getall():
            yield response.follow(tag, callback=self.parse_tag)

    def parse_tag(self, response):
        # Print the url we are parsing
        print(response.url)
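If you save this spider in a standalone file, you can run it with the scrapy runspider command even without a full Scrapy project. Also note that response.follow resolves the relative tag hrefs against the current page URL for you, so there is no need to build absolute URLs manually.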
As for "I wonder if the web server I am crawling would notice this sequential behaviour and ban me":
Yes, the site could notice. Just so you know, there are no guarantees that the requests will be processed in the order you yield them.
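If you are worried about bans, shuffling alone may not be enough. Scrapy also has built-in throttling settings that you can attach to the spider through custom_settings; the values below are only illustrative and would need tuning for the site you are crawling:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    # Illustrative values only; tune them for the target site
    custom_settings = {
        'DOWNLOAD_DELAY': 2,                  # base delay between requests, in seconds
        'RANDOMIZE_DOWNLOAD_DELAY': True,     # vary the delay between 0.5x and 1.5x
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # one request at a time per domain
        'AUTOTHROTTLE_ENABLED': True,         # adapt the delay to server response times
    }

    # ... start_requests / parse as shown above ...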
Answered By - Luiz Rodrigues da Silva