Issue
I have one spider that crawls one website, but I want to store the results in two different tables in my PostgreSQL DB: races in a "races" table and participants in a "participants" table.
If I only fill one table, it works fine, but how do I get the Scrapy pipeline to fill both tables in one go?
I tried making two classes in my pipelines.py, but that did not work out. I guess I am just missing something here.
Here is my code:
import logging
import psycopg2
from scrapy.loader import ItemLoader
class RacesPipeline(object):
    def open_spider(self, spider):
        hostname = 'localhost'
        username = 'postgres'
        password = '****!'
        database = 'horseracing'
        port = "***"
        self.connection = psycopg2.connect(host=hostname, user=username, password=password,
                                           dbname=database, port=port)
        self.cur = self.connection.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.connection.close()

    def process_item(self, item, spider):
        self.cur.execute("insert into races(track, date, racename, racetype, distancefinal, minalter, maxalter, raceclass, classrating, going, finalhurdle, anzahlstarter, winningtimecombined, pricemoney1, pricemoney2, pricemoney3, pricemoney4, pricemoney5, pricemoney6, pricemoney7, pricemoney8) values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
                         (
                             item['track'][0],
                             item['date'],
                             item['racename'][0],
                             item['racetype'],
                             item['distancefinal'],
                             item['minalter'],
                             item['maxalter'],
                             item['raceclass'],
                             item['classrating'],
                             item['going'][0],
                             item['finalhurlde'],
                             item['anzahlstarter'],
                             item['winningtimecombined'],
                             item['pricemoney1'],
                             item['pricemoney2'],
                             item['pricemoney3'],
                             item['pricemoney4'],
                             item['pricemoney5'],
                             item['pricemoney6'],
                             item['pricemoney7'],
                             item['pricemoney8']
                         ))
        self.connection.commit()
        return item
class HorsesPipeline(object):
    def open_spider(self, spider):
        hostname = 'localhost'
        username = 'postgres'
        password = '********'
        database = 'horseracing'
        port = "****"
        self.connection = psycopg2.connect(host=hostname, user=username, password=password,
                                           dbname=database, port=port)
        self.cur = self.connection.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.connection.close()

    def process_item(self, item, spider):
        self.cur.execute("insert into participants(pos, draw, dwinner, dnext, startnumber, pferde, horsecountry, odd, jockey, trainer, weightkg, alter, headgear, officalrating, rp, ts, rprc) values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
                         (
                             item['pos'][0],
                             item['draw'],
                             item['dwinner'],
                             item['dnext'],
                             item['startnumber'],
                             item['pferde'],
                             item['horsecountry'],
                             item['odd'],
                             item['jockey'],
                             item['trainer'],
                             item['weightkg'],
                             item['alter'],
                             item['headgear'],
                             item['officalrating'],
                             item['rp'],
                             item['ts'],
                             item['rprc']
                         ))
        self.connection.commit()
        return item
And the pipeline settings:
ITEM_PIPELINES = {
    'results.pipelines.RacesPipeline': 100,
    'results.pipelines.HorsesPipeline': 200,
}
If I run the code, I get this error: line 33, in process_item: item['track'][0], KeyError: 'track'.
But the inserts run just fine when I don't try to string the two table inserts together and test them individually. And the first table gets filled just fine, even though the error above suggests otherwise.
I know I am just missing something to tie them together, but I can't figure it out.
Solution
I see you have TWO different item types. You need to check the item's type in your process_item and run a different insert for each:

from your_spider.items import RaceItem, ParticipantItem  # use your actual item names here

if isinstance(item, RaceItem):
    # insert into races
else:
    # insert into participants

A single process_item will work for the two different tables; there is no need to create a second pipeline class.
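Putting that together, here is a minimal sketch of a single pipeline that routes both item types to their tables. The names RaceItem, ParticipantItem, and PostgresPipeline are assumptions (use whatever your items.py defines); for illustration the items are plain dict subclasses and the psycopg2 inserts are stubbed out with comments showing where the real calls from the question would go.

```python
# Minimal sketch, assuming RaceItem/ParticipantItem are your two item classes.
# Here they are plain dict subclasses so the sketch is self-contained; in a
# real project they would be scrapy.Item subclasses imported from items.py.

class RaceItem(dict):
    """Stand-in for the item class holding the race fields."""

class ParticipantItem(dict):
    """Stand-in for the item class holding the participant fields."""

class PostgresPipeline:
    def open_spider(self, spider):
        # One connection serves both tables:
        # self.connection = psycopg2.connect(host=..., user=..., ...)
        # self.cur = self.connection.cursor()
        self.inserted = []  # illustration only: records what would be inserted

    def close_spider(self, spider):
        # self.cur.close()
        # self.connection.close()
        pass

    def process_item(self, item, spider):
        # Route each item to the right table based on its type.
        if isinstance(item, RaceItem):
            self.insert_race(item)
        elif isinstance(item, ParticipantItem):
            self.insert_participant(item)
        return item

    def insert_race(self, item):
        # self.cur.execute("insert into races(...) values (%s, ...)", (...))
        # self.connection.commit()
        self.inserted.append(("races", dict(item)))

    def insert_participant(self, item):
        # self.cur.execute("insert into participants(...) values (%s, ...)", (...))
        # self.connection.commit()
        self.inserted.append(("participants", dict(item)))
```

In settings.py you would then register only this one pipeline, e.g. ITEM_PIPELINES = {'results.pipelines.PostgresPipeline': 100}.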
Answered By - gangabass