Issue
When scraping a website with Scrapy to create a database of the following form (as defined in models.py in the tutorial structure):
from sqlalchemy import create_engine, Column, Table, ForeignKey, MetaData
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import (Integer, String, Date, DateTime, Float, Boolean, Text)
from scrapy.utils.project import get_project_settings

Base = declarative_base()

def db_connect():
    return create_engine(get_project_settings().get("CONNECTION_STRING"))

def create_table(engine):
    Base.metadata.create_all(engine)

Article_author = Table('article_author', Base.metadata,
    Column('article_id', Integer, ForeignKey('article.article_id'), primary_key=True),
    Column('author_id', Integer, ForeignKey('author.author_id'), primary_key=True),
    Column('author_number', Integer)
)

class Article(Base):
    __tablename__ = "article"
    article_id = Column(Integer, primary_key=True)
    article_title = Column('name', String(50), unique=True)
    authors = relationship('Author', secondary='article_author', lazy='dynamic', backref="article")

class Author(Base):
    __tablename__ = "author"
    author_id = Column(Integer, primary_key=True)
    author_name = Column('name', String(50), unique=True)
    articles = relationship('Article', secondary='article_author', lazy='dynamic', backref="article")
an error occurs when I try to add an author number (e.g. first or second author) to the automatically created association table 'article_author', because I don't know how to access that table from the pipelines.py script. There is a many-to-many relation between the article and author tables: an author can write multiple articles, and an article can have multiple authors. The article table has a unique article_id and the author table has a unique author_id. The association table uses the unique pair (article_id, author_id) as its primary key. In pipelines.py there is a function process_item in which an instance of Article is created, after which the author and association tables are updated accordingly. The question is how the author number can be inserted as well.
Is there a relation that should be added in models.py?
The script pipelines.py reads:
from sqlalchemy.orm import sessionmaker
from scrapy.exceptions import DropItem
from tutorial.models import Article, Author, Article_author, db_connect, create_table

class SavePipeline(object):
    def __init__(self):
        """
        Initializes database connection and sessionmaker
        Creates tables
        """
        engine = db_connect()
        create_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        article = Article()
        #article_author = Article_author()

        # check whether the current article has authors or not
        if 'author' in item:
            for author, n in zip(item["author"], item["n"]):
                writer = Author(author_name=author)
                # check whether author already exists in the database
                exist = session.query(Author).filter_by(author_name=writer.author_name).first()
                if exist is not None:
                    # the current author exists
                    writer = exist
                article.authors.append(writer)
                # attempts to store the author number (these are the lines that fail)
                nr = article_author(author_number =n)
                article.article_author.append(nr)
                #article_author.append(nr)
                #article.authors.append(pag)
                #article_author.author_number = n

        try:
            session.add(article)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()

        return item
The resulting error in the terminal is an IntegrityError, because the row inserted into the association table has no author_id:
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) NOT NULL constraint failed: article_author.author_id
[SQL: INSERT INTO proverb_source (article_id, author_number) VALUES (?, ?)]
[parameters: (30, 2]
When I create an instance of Article_author in process_item and append it via
nr = Article_author(author_number =n)
article_author.append(nr)
it results in an attribute error:
article_author.append(nr)
AttributeError: 'Article_author' object has no attribute 'append'
When adding it via the authors member of article
article.authors.append(pag)
it gives a ValueError:
ValueError: Bidirectional attribute conflict detected: Passing object <Article_author at 0x7f9007276c70> to attribute "Article.authors" triggers a modify event on attribute "Article.article_author" via the backref "Article_author.article".
Setting the attribute directly gives no error, but leaves the column empty:
article_author.author_number = n
Solution
I solved this by defining the relationships on the association table itself and appending through that table; see the "association relationship" entry in the SQLAlchemy glossary: https://docs.sqlalchemy.org/en/14/glossary.html#term-association-relationship
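A minimal sketch of what that can look like, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database instead of the project's CONNECTION_STRING. The association table becomes a mapped class that carries author_number, and each author is attached to an article through a row of that class. The names ArticleAuthor, author_links, article_links and save_article are hypothetical and chosen for illustration only; they are not the code from the original project.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()


class ArticleAuthor(Base):
    """Association object: one row per (article, author) pair, plus the author number."""
    __tablename__ = "article_author"
    article_id = Column(Integer, ForeignKey("article.article_id"), primary_key=True)
    author_id = Column(Integer, ForeignKey("author.author_id"), primary_key=True)
    author_number = Column(Integer)

    # The relationships are defined on the association class itself.
    article = relationship("Article", back_populates="author_links")
    author = relationship("Author", back_populates="article_links")


class Article(Base):
    __tablename__ = "article"
    article_id = Column(Integer, primary_key=True)
    article_title = Column("name", String(50), unique=True)
    author_links = relationship("ArticleAuthor", back_populates="article")


class Author(Base):
    __tablename__ = "author"
    author_id = Column(Integer, primary_key=True)
    author_name = Column("name", String(50), unique=True)
    article_links = relationship("ArticleAuthor", back_populates="author")


def save_article(session, title, names, numbers):
    """Hypothetical helper mirroring process_item: store one article with numbered authors."""
    article = Article(article_title=title)
    for name, number in zip(names, numbers):
        # Reuse the author if it is already in the database.
        author = session.query(Author).filter_by(author_name=name).first()
        if author is None:
            author = Author(author_name=name)
        # Append an association row that links the author and holds the number.
        article.author_links.append(ArticleAuthor(author=author, author_number=number))
    session.add(article)
    session.commit()
    return article


# Illustration only: an in-memory SQLite database.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

save_article(session, "Some title", ["First Author", "Second Author"], [1, 2])
for link in session.query(ArticleAuthor).order_by(ArticleAuthor.author_number):
    print(link.author.author_name, link.author_number)
```

Because every ArticleAuthor row is created with its author already set and is then appended to the article, both foreign keys are filled in when the session flushes, so the NOT NULL constraint on author_id is no longer violated. In process_item, the names and numbers would come from item["author"] and item["n"].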
Answered By - Josh