Issue
I wrote the following code to scrape for email addresses (for testing purposes):
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector

from crawler.items import EmailItem


class LinkExtractorSpider(CrawlSpider):
    name = 'emailextractor'
    start_urls = ['http://news.google.com']

    rules = (Rule(LinkExtractor(), callback='process_item', follow=True),)

    def process_item(self, response):
        refer = response.url
        items = list()
        for email in Selector(response).re("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
            emailitem = EmailItem()
            emailitem['email'] = email
            emailitem['refer'] = refer
            items.append(emailitem)
        return items
Unfortunately, it seems that references to the Requests are not released properly: according to the Scrapy telnet console, the number of live Requests increases by about 5k/s. After ~3 min and 10k scraped pages my system starts swapping (8 GB RAM). Does anyone have an idea what is wrong? I already tried removing the refer field and "copying" the string using
emailitem['email'] = ''.join(email)
without success. After scraping, the items are saved into a BerkeleyDB that counts their occurrences (via an item pipeline), so the references should be gone after that.
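For context, a counting pipeline of this kind looks roughly like the sketch below; the class name, the database filename, and the standard-library dbm module (standing in here for BerkeleyDB) are illustrative assumptions, not the actual code:

import dbm

class EmailCountPipeline(object):
    """Sketch of a pipeline that counts how often each email address is seen."""

    def open_spider(self, spider):
        # 'c' opens the database file, creating it if it does not exist yet
        self.db = dbm.open('email_counts', 'c')

    def process_item(self, item, spider):
        key = item['email'].encode('utf-8')
        try:
            count = int(self.db[key]) + 1
        except KeyError:
            count = 1
        self.db[key] = str(count).encode('utf-8')
        return item

    def close_spider(self, spider):
        self.db.close()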
What would be the difference between returning a set of items and yielding each item separately?
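For comparison, the yielding variant of the callback would look like the following sketch (same extraction logic, just emitting each item as soon as it is built instead of collecting them in a list first):

def process_item(self, response):
    # Yield items one at a time; Scrapy consumes the generator lazily,
    # so no per-response list has to be kept alive
    for email in Selector(response).re("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
        emailitem = EmailItem()
        emailitem['email'] = email
        emailitem['refer'] = response.url
        yield emailitem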
EDIT:
After quite a while of debugging I found out that the Requests are not freed, so I end up with:
$> nc localhost 6023
>>> prefs()
Live References
Request 10344 oldest: 536s ago
>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('Request')
>>> r.url
<GET http://news.google.com>
which is in fact the start URL. Does anybody know what the problem is? Where is the reference that is keeping the Request object alive?
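For anyone debugging the same thing: besides get_oldest, scrapy.utils.trackref also provides iter_all, so the telnet console can show whether the live Requests cluster around particular URLs. A small sketch, pasted into the same telnet session:

from collections import Counter
from scrapy.utils.trackref import iter_all

# Tally the URLs of all Request objects that are still alive
counts = Counter(r.url for r in iter_all('Request'))
print(counts.most_common(10))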
EDIT2:
After running for ~12 hours on a server with 64 GB RAM, the memory used is ~16 GB (measured with ps, even though ps is not the right tool for this). The problem is that the number of crawled pages per minute is dropping significantly and the number of scraped items has stayed at 0 for hours:
INFO: Crawled 122902 pages (at 82 pages/min), scraped 3354 items (at 0 items/min)
EDIT3: I did the objgraph analysis, which results in the following back-reference graph (thanks @Artur Gaspar):
It does not seem like something I can influence, does it?
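For reference, a back-reference graph like that can be produced from the telnet console roughly as follows (a sketch: objgraph needs Graphviz installed to render the image, and the output filename is only an example):

import objgraph
from scrapy.utils.trackref import get_oldest

# Draw the chain of back-references that keeps the oldest live Request alive
r = get_oldest('Request')
objgraph.show_backrefs([r], max_depth=5, filename='request_backrefs.png')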
Solution
The final answer for me was to use a disk-based queue in conjunction with a working directory passed as a runtime parameter.
This means adding the following settings to settings.py:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
Afterwards, starting the crawler with the following command line makes the crawl state persistent in the given directory:
scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}
(see the Scrapy docs for details)
The additional benefit of this approach is that the crawl can be paused and resumed at any time. My spider has now been running for more than 11 days and uses ~15 GB of memory (file cache memory for the disk FIFO queues).
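Pausing and resuming works as described in the Scrapy docs on persistent jobs: stop the spider gracefully (a single Ctrl-C or a SIGTERM) and later run the same command again with the same JOBDIR, for example:
scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}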
Answered By - Robin