Issue
I have the following Scrapy CrawlSpider:
import logger as lg
from scrapy.crawler import CrawlerProcess
from scrapy.http import Response
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashTextResponse
from urllib.parse import urlencode
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse

logger = lg.get_logger("oddsportal_spider")


class SeleniumScraper(CrawlSpider):
    name = "selenium"

    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "DOWNLOADER_MIDDLEWARES": {
            'scraper_scrapy.odds.middlewares.SeleniumMiddleware': 543,
        },
    }

    httperror_allowed_codes = [301]
    start_urls = ["https://www.oddsportal.com/tennis/results/"]

    rules = (
        # Follow tournament results pages and parse them.
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            follow=True,
        ),
        # Parse individual match pages linked from the results table.
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths="//td[@class='name table-participant']//a",
            ),
            callback="parse_match",
        ),
    )

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")

    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")


process = CrawlerProcess()
process.crawl(SeleniumScraper)
process.start()
The Selenium middleware is as follows:
# middlewares.py
import logger as lg
from pathlib import Path

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

logger = lg.get_logger("oddsportal_spider")  # same logging helper as the spider


class SeleniumMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Fetch the page with Selenium and hand Scrapy the rendered HTML.
        logger.debug(f"Selenium processing request - {request.url}")
        self.driver.get(request.url)
        return HtmlResponse(
            request.url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

    def spider_opened(self, spider):
        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Firefox(
            options=options,
            executable_path=Path("/opt/geckodriver/geckodriver"),
        )

    def spider_closed(self, spider):
        self.driver.close()
End to end, this takes around a minute for roughly 50 pages. To try to speed things up and take advantage of multiple threads and JavaScript, I've implemented the following scrapy_splash spider:
class SplashScraper(CrawlSpider):
    name = "splash"

    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        "SPIDER_MIDDLEWARES": {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100},
        "DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage',
    }

    httperror_allowed_codes = [301]
    start_urls = ["https://www.oddsportal.com/tennis/results/"]

    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            process_request="use_splash",
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths="//td[@class='name table-participant']//a",
            ),
            callback="parse_match",
            process_request="use_splash",
        ),
    )

    def process_links(self, links):
        # Rewrite extracted links to go through the Splash render endpoint.
        for link in links:
            link.url = "http://localhost:8050/render.html?" + urlencode({'url': link.url})
        return links

    def _requests_to_follow(self, response):
        # Overridden so Splash responses are also followed, not just HtmlResponse.
        if not isinstance(response, (HtmlResponse, SplashTextResponse)):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def use_splash(self, request, response):
        request.meta.update(splash={'endpoint': 'render.html'})
        return request

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")

    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")
However, this takes about the same amount of time. I was hoping to see a big increase in speed :(
I've tried playing around with different DOWNLOAD_DELAY settings, but that hasn't made things any faster.
All the concurrency settings are left at their defaults.
Any ideas on if/how I'm going wrong?
Solution
Taking a stab at an answer here with no experience of the libraries.
It looks like Scrapy crawlers themselves are single-threaded. To get multi-threaded behavior you need to configure your application differently or write code that makes it behave multi-threaded. It sounds like you've already tried this, so this is probably not news to you, but make sure you have configured the CONCURRENT_REQUESTS and REACTOR_THREADPOOL_MAXSIZE settings.
https://docs.scrapy.org/en/latest/topics/settings.html?highlight=thread#reactor-threadpool-maxsize
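For example, a sketch of what raising those knobs could look like in the spider from the question (the values here are illustrative guesses to experiment with, not tuned recommendations):

# Illustrative concurrency settings; merge into SplashScraper.custom_settings.
# Scrapy's defaults are shown in the comments. These numbers are assumptions
# to experiment with, not recommendations.
custom_settings = {
    "CONCURRENT_REQUESTS": 32,             # default: 16
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,  # default: 8
    "REACTOR_THREADPOOL_MAXSIZE": 20,      # default: 10
}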
I can't imagine there is much CPU work going on in the crawling process, so I doubt it's a GIL issue.
Excluding the GIL as an option, there are two possibilities here:
- Your crawler is not actually multi-threaded. This may be because you are missing some setup or configuration that would make it so, i.e. you may have the settings set correctly, but your crawler is written in a way that processes requests for URLs synchronously instead of submitting them to a queue.
To test this, create a global object and store a counter on it. Each time your crawler starts a request, increment the counter. Each time your crawler finishes a request, decrement the counter. Then run a thread that prints the counter every second. If the counter value is always 1, then you are still running synchronously.
# global_state.py
GLOBAL_STATE = {"counter": 0}


# middleware.py
from global_state import GLOBAL_STATE

class SeleniumMiddleware:

    def process_request(self, request, spider):
        GLOBAL_STATE["counter"] += 1
        self.driver.get(request.url)
        GLOBAL_STATE["counter"] -= 1
        ...


# main.py
from global_state import GLOBAL_STATE
import threading
import time

def main():
    gst = threading.Thread(target=gs_watcher)
    gst.start()

    # Start your app here

def gs_watcher():
    while True:
        print(f"Concurrent requests: {GLOBAL_STATE['counter']}")
        time.sleep(1)
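To actually see the output, the entry point could be wired up roughly like this (a sketch; it assumes main.py above is importable, and that the SeleniumScraper from the question lives in a hypothetical spiders module):

# run_counter_test.py (sketch)
import threading

from scrapy.crawler import CrawlerProcess

from main import gs_watcher          # the watcher defined above
from spiders import SeleniumScraper  # hypothetical import path


if __name__ == "__main__":
    # daemon=True so the watcher thread doesn't keep the process
    # alive once the crawl has finished.
    threading.Thread(target=gs_watcher, daemon=True).start()
    process = CrawlerProcess()
    process.crawl(SeleniumScraper)
    process.start()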
- The site you are crawling is rate limiting you.
To test this, run the application multiple times. If you go from 50 req/s to 25 req/s per application, then you are being rate limited. To skirt around this, use a VPN to hop around IP addresses. One way to run this test is sketched below.
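A rough way to do the parallel-run comparison from Python (a sketch; it assumes the SplashScraper from the question is importable from a hypothetical spiders module, and that each child process can reach both the site and the Splash instance):

# rate_limit_test.py - launch two identical crawls in parallel and compare
# their combined throughput against a single run. multiprocessing is used
# rather than threads so each child gets its own Twisted reactor.
import multiprocessing
import time

from scrapy.crawler import CrawlerProcess

from spiders import SplashScraper  # hypothetical import path


def run_crawl():
    process = CrawlerProcess()
    process.crawl(SplashScraper)
    process.start()


if __name__ == "__main__":
    start = time.time()
    workers = [multiprocessing.Process(target=run_crawl) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # If two parallel runs take roughly twice as long per page as one run,
    # the site is likely throttling you.
    print(f"Two parallel crawls finished in {time.time() - start:.1f}s")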
If, after that, you find that you are running concurrent requests and you are not being rate limited, then there is something funky going on in the libraries. Try removing chunks of code until you get to the bare minimum of what you need to crawl. If you've gotten to the absolute bare-minimum implementation and it's still slow, you now have a minimal reproducible example and can get much better, more informed help. A stripped-down baseline might look like the sketch below.
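As a starting point for that stripping-down exercise, here is a bare-bones spider with no Splash or Selenium in the pipeline (a sketch; it only fetches and logs pages, reusing the start URL and link pattern from the question):

# baseline.py - minimal CrawlSpider to benchmark raw Scrapy throughput
# with no rendering middleware involved.
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BaselineScraper(CrawlSpider):
    name = "baseline"
    custom_settings = {"LOG_LEVEL": "WARNING"}
    start_urls = ["https://www.oddsportal.com/tennis/results/"]
    rules = (
        Rule(LinkExtractor(allow="/tennis/"), callback="parse_page"),
    )

    def parse_page(self, response):
        # No parsing work - just confirm the page came back.
        self.logger.info(f"Fetched {response.url}")


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(BaselineScraper)
    process.start()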
Answered By - micah