Issue
I wrote the following code to scrape for email addresses (for testing purposes):
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector

from crawler.items import EmailItem


class LinkExtractorSpider(CrawlSpider):
    name = 'emailextractor'
    start_urls = ['http://news.google.com']

    rules = (Rule(LinkExtractor(), callback='process_item', follow=True),)

    def process_item(self, response):
        refer = response.url
        items = list()
        for email in Selector(response).re("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
            emailitem = EmailItem()
            emailitem['email'] = email
            emailitem['refer'] = refer
            items.append(emailitem)
        return items
Unfortunately, it seems that references to the Requests are not released properly: according to the Scrapy telnet console, the number of live Requests increases by about 5k/s. After ~3 min and 10k scraped pages my system starts swapping (8 GB RAM). Does anyone have an idea what is wrong? I already tried removing the refer field and "copying" the string using
emailitem['email'] = ''.join(email)
without success. After scraping, the items are saved into a BerkeleyDB that counts their occurrences (via an item pipeline), so the references should be gone after that.
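For context, a counting pipeline of this kind looks roughly like the sketch below; the class name, the database filename, and the standard-library dbm module (standing in here for BerkeleyDB) are illustrative assumptions, not the actual code:

import dbm

class EmailCountPipeline(object):
    """Sketch of a pipeline that counts how often each email address is seen."""

    def open_spider(self, spider):
        # 'c' opens the database file, creating it if it does not exist yet
        self.db = dbm.open('email_counts', 'c')

    def process_item(self, item, spider):
        key = item['email'].encode('utf-8')
        try:
            count = int(self.db[key]) + 1
        except KeyError:
            count = 1
        self.db[key] = str(count).encode('utf-8')
        return item

    def close_spider(self, spider):
        self.db.close()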
What would be the difference between returning a set of items and yielding each item separately?
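For comparison, the yielding variant of the callback would look like the following sketch (same extraction logic, just emitting each item as soon as it is built instead of collecting them in a list first):

def process_item(self, response):
    # Yield items one at a time; Scrapy consumes the generator lazily,
    # so no per-response list has to be kept alive
    for email in Selector(response).re("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
        emailitem = EmailItem()
        emailitem['email'] = email
        emailitem['refer'] = response.url
        yield emailitem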
EDIT:
After quite a while of debugging I found out that the Requests are not freed, so I end up with:
$> nc localhost 6023
>>> prefs()
Live References
Request 10344 oldest: 536s ago
>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('Request')
>>> r.url
<GET http://news.google.com>
which is in fact the start URL. Does anybody know what the problem is? Where is the reference that is keeping the Request object alive?
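For anyone debugging the same thing: besides get_oldest, scrapy.utils.trackref also provides iter_all, so the telnet console can show whether the live Requests cluster around particular URLs. A small sketch, pasted into the same telnet session:

from collections import Counter
from scrapy.utils.trackref import iter_all

# Tally the URLs of all Request objects that are still alive
counts = Counter(r.url for r in iter_all('Request'))
print(counts.most_common(10))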
EDIT2:
After running for ~12 hours on a server with 64 GB RAM, the memory used is ~16 GB (measured with ps, even though ps is not the right tool for this). The problem is that the number of crawled pages per minute is dropping significantly and the number of scraped items has stayed at 0 for hours:
INFO: Crawled 122902 pages (at 82 pages/min), scraped 3354 items (at 0 items/min)
EDIT3: I did the objgraph analysis, which results in the following back-reference graph (thanks @Artur Gaspar):
It does not seem like something I can influence, does it?
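For reference, a back-reference graph like that can be produced from the telnet console roughly as follows (a sketch: objgraph needs Graphviz installed to render the image, and the output filename is only an example):

import objgraph
from scrapy.utils.trackref import get_oldest

# Draw the chain of back-references that keeps the oldest live Request alive
r = get_oldest('Request')
objgraph.show_backrefs([r], max_depth=5, filename='request_backrefs.png')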
Solution
The final answer for me was to use a disk-based queue in conjunction with a working directory passed as a runtime parameter.
This means adding the following settings to settings.py:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
Afterwards, starting the crawler with the following command line makes the crawl state persistent in the given directory:
scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}
(see the Scrapy docs for details)
The additional benefit of this approach is that the crawl can be paused and resumed at any time. My spider has now been running for more than 11 days and uses ~15 GB of memory (file cache memory for the disk FIFO queues).
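Pausing and resuming works as described in the Scrapy docs on persistent jobs: stop the spider gracefully (a single Ctrl-C or a SIGTERM) and later run the same command again with the same JOBDIR, for example:
scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}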
Answered By - Robin