Issue
I want to know if there is a better way to crawl multiple URL patterns on the same website with a single spider. I have several URLs that I want to access using a page index.
The code would be:
import scrapy
from random import shuffle

class MySpider(scrapy.Spider):
    limit = 5
    pages = list(range(1, limit))
    shuffle(pages)

    cat_a = 'http://example.com/a?page={}'
    cat_b = 'http://example.com/b?page={}'

    def parse(self, response):
        for i in self.pages:
            page_cat_a = self.cat_a.format(i)
            page_cat_b = self.cat_b.format(i)
            yield response.follow(page_cat_a, self.parse_page)
            yield response.follow(page_cat_b, self.parse_page)
The parse_page function continues to crawl for other data within these pages.
In my output file I can see the data is gathered in repeating sequences, so I get 10 web pages from category a, then 10 web pages from category b, and so on. I wonder if the web server I am crawling would notice this sequential behaviour and ban me.
Also, I have 8 URL patterns on the same website that I want to crawl, all using page indexes, so instead of the 2 categories in the example there would be 8. Thanks.
Solution
You can use the start_requests spider method instead of building the requests inside the parse method.
import scrapy
from random import shuffle

class MySpider(scrapy.Spider):
    name = 'myspider'  # a spider needs a name to be runnable
    categories = ('a', 'b')
    limit = 5
    pages = list(range(1, limit))
    base_url = 'http://example.com/{category}?page={page}'

    def start_requests(self):
        # Shuffle pages to try to avoid bans
        shuffle(self.pages)
        for category in self.categories:
            for page in self.pages:
                url = self.base_url.format(category=category, page=page)
                yield scrapy.Request(url)

    def parse(self, response):
        # Parse the page
        pass
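Since you actually have 8 URL patterns rather than 2, you can simply list all of them in the categories tuple. If you also want to avoid the output coming in per-category blocks, one option is to shuffle every (category, page) combination together rather than only the pages; here is a minimal sketch of an alternative start_requests, assuming the same imports and class attributes as the spider above:

    def start_requests(self):
        # Build every (category, page) combination and shuffle them all together,
        # so requests are not grouped by category
        combos = [(c, p) for c in self.categories for p in self.pages]
        shuffle(combos)
        for category, page in combos:
            url = self.base_url.format(category=category, page=page)
            yield scrapy.Request(url)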
Another thing you can try is to search for the category URLs from within the site itself.
Let's say you want to get information from the tags shown in the sidebar of http://quotes.toscrape.com/.
You could manually copy the links and use them the way you are doing, or you could do this:
import scrapy

class MySpider(scrapy.Spider):
    name = 'quotes_tags'  # a spider needs a name to be runnable
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Follow every tag link found in the sidebar tags box
        for tag in response.css('div.col-md-4.tags-box a.tag::attr(href)').getall():
            yield response.follow(tag, callback=self.parse_tag)

    def parse_tag(self, response):
        # Print the url we are parsing
        print(response.url)
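If you save this spider in a standalone file, you can run it with the scrapy runspider command even without a full Scrapy project. Also note that response.follow resolves the relative tag hrefs against the current page URL for you, so there is no need to build absolute URLs manually.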
As for "I wonder if the web server I am crawling would notice this sequential behaviour and ban me":
Yes, the site could notice. Just so you know, there are no guarantees that the requests will be processed in the order you yield them.
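If you are worried about bans, shuffling alone may not be enough. Scrapy also has built-in throttling settings that you can attach to the spider through custom_settings; the values below are only illustrative and would need tuning for the site you are crawling:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    # Illustrative values only; tune them for the target site
    custom_settings = {
        'DOWNLOAD_DELAY': 2,                  # base delay between requests, in seconds
        'RANDOMIZE_DOWNLOAD_DELAY': True,     # vary the delay between 0.5x and 1.5x
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # one request at a time per domain
        'AUTOTHROTTLE_ENABLED': True,         # adapt the delay to server response times
    }

    # ... start_requests / parse as shown above ...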
Answered By - Luiz Rodrigues da Silva