Issue
Apologies if this is a Scrapy noob question, but I have spent ages looking for the answer to this:
I want to store the raw data from each and every URL I crawl in my local filesystem as a separate file (i.e. response.body -> /files/page123.html), ideally with the filename being a hash of the URL. This is so I can do further processing of the HTML (i.e. further parsing, indexing in Solr/ElasticSearch, etc.).
I've read the docs and I'm not sure if there's a built-in way of doing this. Since the pages are being downloaded by the system by default anyway, it doesn't seem to make sense to be writing custom pipelines, etc.
Solution
As paul t said, the HttpCache middleware might work for you, but I'd advise writing your own custom pipeline.
Scrapy has built-in ways of exporting data to files, but they're for JSON, XML and CSV, not raw HTML. Don't worry though, it's not too hard!
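If you do want to try the cache route first, it's just a couple of settings. Here's a minimal sketch, assuming the default filesystem storage, which keeps the raw responses on disk inside your project's .scrapy directory:

# settings.py -- sketch: turn on Scrapy's built-in HTTP cache
HTTPCACHE_ENABLED = True
# subdirectory (relative to the project data dir) where responses are stored
HTTPCACHE_DIR = 'httpcache'

The downside is that the cached files are named and organised for Scrapy's benefit, not yours, which is why a small pipeline is usually the nicer option.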
Provided your items.py looks something like:
from scrapy.item import Item, Field

class Listing(Item):
    url = Field()
    html = Field()
and you've been saving your scraped data to those items in your spider like so:
item['url'] = response.url
item['html'] = response.body
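If it helps, here's a minimal sketch of what such a spider might look like; the spider name, start URL and import path are just placeholders for your own:

# spiders/listing_spider.py -- sketch, not part of the original answer
import scrapy

from myproject.items import Listing  # assumes the Listing item above lives in myproject/items.py

class ListingSpider(scrapy.Spider):
    name = 'listings'                      # placeholder spider name
    start_urls = ['http://example.com/']   # placeholder start URL

    def parse(self, response):
        item = Listing()
        item['url'] = response.url    # the page's URL
        item['html'] = response.body  # the raw bytes of the page
        yield item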
your pipelines.py would just be:
import hashlib

class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        # choose whatever hashing func works for you;
        # encode the URL so this also works on Python 3
        file_name = hashlib.sha224(item['url'].encode('utf-8')).hexdigest()
        # response.body is bytes, so write in binary mode
        with open('files/%s.html' % file_name, 'w+b') as f:
            f.write(item['html'])
        return item
Hope that helps. Oh, and don't forget to put a files/ directory in your project root (or see the sketch below) and add this to your settings.py:
ITEM_PIPELINES = {
    'myproject.pipelines.HtmlFilePipeline': 300,
}
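If you'd rather not create the files/ directory by hand, one small tweak on top of the pipeline above (just a sketch) is to create it when the spider opens:

import hashlib
import os

class HtmlFilePipeline(object):
    def open_spider(self, spider):
        # make sure the output directory exists before any items arrive
        os.makedirs('files', exist_ok=True)

    def process_item(self, item, spider):
        file_name = hashlib.sha224(item['url'].encode('utf-8')).hexdigest()
        with open(os.path.join('files', '%s.html' % file_name), 'w+b') as f:
            f.write(item['html'])
        return item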
source: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
Answered By - NKelner