Issue
I have one spider that crawls one website, but I want to store the results in two different tables in my PostgreSQL DB: races in a "races" table and participants in a "participants" table.
If I only fill one table, it works fine, but how do I get the Scrapy pipeline to fill both tables in one go?
I tried making two classes in my pipelines.py, but that did not work out. I guess I am just missing something here.
Here is my code:
import logging
import psycopg2
from scrapy.loader import ItemLoader
class RacesPipeline(object):
    def open_spider(self, spider):
        hostname = 'localhost'
        username = 'postgres'
        password = '****!'
        database = 'horseracing'
        port = "***"
        self.connection = psycopg2.connect(host=hostname, user=username, password=password,
                                           dbname=database, port=port)
        self.cur = self.connection.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.connection.close()

    def process_item(self, item, spider):
        self.cur.execute("insert into races(track, date, racename, racetype, distancefinal, minalter, maxalter, raceclass, classrating, going, finalhurdle, anzahlstarter, winningtimecombined, pricemoney1, pricemoney2, pricemoney3, pricemoney4, pricemoney5, pricemoney6, pricemoney7, pricemoney8) values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
                         (
                             item['track'][0],
                             item['date'],
                             item['racename'][0],
                             item['racetype'],
                             item['distancefinal'],
                             item['minalter'],
                             item['maxalter'],
                             item['raceclass'],
                             item['classrating'],
                             item['going'][0],
                             item['finalhurlde'],
                             item['anzahlstarter'],
                             item['winningtimecombined'],
                             item['pricemoney1'],
                             item['pricemoney2'],
                             item['pricemoney3'],
                             item['pricemoney4'],
                             item['pricemoney5'],
                             item['pricemoney6'],
                             item['pricemoney7'],
                             item['pricemoney8']
                         ))
        self.connection.commit()
        return item
class HorsesPipeline(object):
    def open_spider(self, spider):
        hostname = 'localhost'
        username = 'postgres'
        password = '********'
        database = 'horseracing'
        port = "****"
        self.connection = psycopg2.connect(host=hostname, user=username, password=password,
                                           dbname=database, port=port)
        self.cur = self.connection.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.connection.close()

    def process_item(self, item, spider):
        self.cur.execute("insert into participants(pos, draw, dwinner, dnext, startnumber, pferde, horsecountry, odd, jockey, trainer, weightkg, alter, headgear, officalrating, rp, ts, rprc) values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
                         (
                             item['pos'][0],
                             item['draw'],
                             item['dwinner'],
                             item['dnext'],
                             item['startnumber'],
                             item['pferde'],
                             item['horsecountry'],
                             item['odd'],
                             item['jockey'],
                             item['trainer'],
                             item['weightkg'],
                             item['alter'],
                             item['headgear'],
                             item['officalrating'],
                             item['rp'],
                             item['ts'],
                             item['rprc']
                         ))
        self.connection.commit()
        return item
And the pipeline settings:
ITEM_PIPELINES = {
    'results.pipelines.RacesPipeline': 100,
    'results.pipelines.HorsesPipeline': 200,
}
If I run the code, I get this error: line 33, in process_item: item['track'][0], KeyError: 'track'.
But the inserts run just fine when I don't try to string the two table inserts together and test them individually. And the first table gets filled just fine, even though the error above suggests otherwise.
I know I am just missing something to tie them together, but I can't figure it out.
Solution
I see you have TWO different item types. You need to check the item's type in your process_item and run a different insert for each:

from your_spider.items import RaceItem, ParticipantItem  # use your actual item names here

if isinstance(item, RaceItem):
    # insert into races
else:
    # insert into participants

A single process_item will work for the two different tables; there is no need to create a second pipeline class.
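Putting that together, here is a minimal sketch of a single pipeline that routes both item types to their tables. The names RaceItem, ParticipantItem, and PostgresPipeline are assumptions (use whatever your items.py defines); for illustration the items are plain dict subclasses and the psycopg2 inserts are stubbed out with comments showing where the real calls from the question would go.

```python
# Minimal sketch, assuming RaceItem/ParticipantItem are your two item classes.
# Here they are plain dict subclasses so the sketch is self-contained; in a
# real project they would be scrapy.Item subclasses imported from items.py.

class RaceItem(dict):
    """Stand-in for the item class holding the race fields."""

class ParticipantItem(dict):
    """Stand-in for the item class holding the participant fields."""

class PostgresPipeline:
    def open_spider(self, spider):
        # One connection serves both tables:
        # self.connection = psycopg2.connect(host=..., user=..., ...)
        # self.cur = self.connection.cursor()
        self.inserted = []  # illustration only: records what would be inserted

    def close_spider(self, spider):
        # self.cur.close()
        # self.connection.close()
        pass

    def process_item(self, item, spider):
        # Route each item to the right table based on its type.
        if isinstance(item, RaceItem):
            self.insert_race(item)
        elif isinstance(item, ParticipantItem):
            self.insert_participant(item)
        return item

    def insert_race(self, item):
        # self.cur.execute("insert into races(...) values (%s, ...)", (...))
        # self.connection.commit()
        self.inserted.append(("races", dict(item)))

    def insert_participant(self, item):
        # self.cur.execute("insert into participants(...) values (%s, ...)", (...))
        # self.connection.commit()
        self.inserted.append(("participants", dict(item)))
```

In settings.py you would then register only this one pipeline, e.g. ITEM_PIPELINES = {'results.pipelines.PostgresPipeline': 100}.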
Answered By - gangabass