Issue
I have a Scrapy project that I use to scrape some websites. When I try to save the scraped data into a MySQL database, the error in the title pops up. From what I have read everywhere, it seems to be a "list" problem, probably connected to the items[] list... Can you please help me understand what this error means and where I should fix the code? Please also explain why, because I want to understand. Thank you so much.
Spider Code:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders.crawl import Rule, CrawlSpider
from scrapy.selector import HtmlXPathSelector
from gscrape.items import GscrapeItem
class GuideSpider(CrawlSpider):
    name = "Dplay"
    allowed_domains = ['www.example.com']
    start_urls = [
        "http://www.examplea.com/forums/forumdisplay.php?f=108&order=desc&page=1"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=")), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select('//div')
        for site in sites:
            item = GscrapeItem()
            item['title'] = site.select('a[@class="threadcolor"]/text()').extract()
            item['guide_url'] = site.select('a[@class="threadcolor"]/@href').extract()
            item['subject'] = site.select('./text()[1]').extract()
            items.append(item)
        return items
Pipeline Code:
from scrapy.exceptions import DropItem
from string import join
from scrapy import log
from twisted.enterprise import adbapi
import MySQLdb.cursors
class GscrapePipeline(object):
    def process_item(self, item, spider):
        if item['guide_url']:
            item['guide_url'] = "http://www.example.com/forums/" + join(item['guide_url'])
            return item
        else:
            raise DropItem()

class MySQLStorePipeline(object):
    def __init__(self):
        # @@@ hardcoded db settings
        # TODO: make settings configurable through settings
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            db='prova',
            host='127.0.0.1',
            user='root',
            passwd='',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=True
        )

    def process_item(self, spider, item):
        # run db query in thread pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # create record if it doesn't exist.
        # this whole block runs on its own thread
        tx.execute("select * from prova where guide_url = %s", item['guide_url'])
        result = tx.fetchone()
        if result:
            log.msg("Item already stored in db: %s" % item, level=log.DEBUG)
        else:
            tx.execute(
                "insert into prova (title, guide_url, subject) "
                "values (%s, %s, %s)",
                (item['title'],
                 item['guide_url'],
                 item['subject']
                ))
            log.msg("Item stored in db: %s" % item, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)
Error: exceptions.TypeError: 'GuideSpider' object is not subscriptable (line 47) pipelines.py
Solution
According to the docs:
process_item(item, spider)
I mean, in your pipeline:
def process_item(self, spider, item):
you have the wrong order of parameters, which means you are passing your spider, not the item, to _conditional_insert.
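A minimal sketch of the fix, using hypothetical stand-in classes rather than the real Scrapy code: Scrapy calls a pipeline as process_item(item, spider) positionally, so the item must be declared first.

```python
class GuideSpider(object):
    """Stand-in for the spider object Scrapy passes to the pipeline."""

class MySQLStorePipeline(object):
    def process_item(self, item, spider):  # item first, spider second
        # 'item' is now really the item, so subscripting it works
        return item['guide_url']

pipeline = MySQLStorePipeline()
result = pipeline.process_item({'guide_url': 'forumdisplay.php?f=108'}, GuideSpider())
print(result)  # forumdisplay.php?f=108
```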
Learn to use a debugger. Install ipdb and put this on line 47 (before the offending line):
import ipdb; ipdb.set_trace()
When the program reaches the breakpoint, you will be able to inspect variable values, call methods manually, and see the backtrace.
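To see why the swapped order produces exactly this error, here is a minimal reproduction with a stand-in class (not Scrapy itself): when the caller passes (item, spider) but the signature declares (spider, item), the spider lands in the 'item' parameter and subscripting it fails.

```python
class GuideSpider(object):
    """Stand-in for the spider Scrapy passes to the pipeline."""

def process_item(spider, item):   # swapped order, as in the question
    return item['guide_url']      # 'item' here is actually the spider

try:
    # Scrapy passes (item, spider); with the swapped signature
    # the spider ends up bound to the 'item' parameter
    process_item({'guide_url': 'x'}, GuideSpider())
except TypeError as e:
    print(e)  # 'GuideSpider' object is not subscriptable
```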
Answered By - warvariuc