Issue
Using scrapy in a Python 2 environment, I want to use sqlalchemy to query a database for a list of URLs, and then send that URL list to scrapy, to be used as its list of start_urls.
The filename is betsy.py and I execute this whole affair by typing:
scrapy runspider betsy.py
This is supposed to be a fairly simple program to double-check for 404s, etc. I don't need to do any further crawling once I reach these URLs.
Here's what I think is the relevant code:
class LandingPages(Base):
    __tablename__ = 'landingpages_programmatic'
    id = Column(Integer, primary_key=True)
    client_id = Column(Integer, nullable=True)
    campaign_id = Column(Integer, nullable=True)
    ad_id = Column(Integer, nullable=True)
    ad_url = Column(String(512), nullable=True)
    ad_url_utm = Column(String(512), nullable=True)
    created_on = Column(DateTime(), default=datetime.now)

    def __repr__(self):
        return "'{self.ad_url}'".format(self=self)
todaysdate = str(datetime.now().year) + '-' + str(datetime.now().month) + '-' + str(datetime.now().day)
unique_landingpages = session.query(LandingPages).filter(LandingPages.created_on.startswith(todaysdate)).limit(2).all()
class BetsySpider(scrapy.Spider):
    name = 'BetsySpider'
    start_urls = [unique_landingpages]

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
If I add this line just after the unique_landingpages variable is set:
print unique_landingpages
Then I see the seemingly usable results:
['https://www.google.com', 'https://www.bing.com/']
However, I have had no success passing these results to scrapy's start_urls attribute.
If I try start_urls = unique_landingpages, I get this error:
File "/Users/chris/Desktop/Banff Experiments/banff/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 56, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got LandingPages:
If I try start_urls = [unique_landingpages], I get this error:
File "/Users/chris/Desktop/Banff Experiments/banff/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 56, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got list:
However, when I run this in Mac Terminal and then copy the output of print unique_landingpages, as shown above, and then directly paste that into betsy.py, so that it says:
start_urls = ['https://www.google.com', 'https://www.bing.com/']
it runs perfectly fine.
I've searched a dozen or so articles on here but don't see anyone else with this same situation. Can anyone kindly shed some light on where I've gone wrong?
(Incidentally, there's probably a much cleaner way to filter by today's date.)
Solution
Your query is returning two LandingPages instances, not strings. You can either change the query so that only the ad_url values are returned, or extract the ad_url attribute of each LandingPages instance that is returned.
Option 1:
unique_landingpage_rows = session.query(LandingPages.ad_url).filter(LandingPages.created_on.startswith(todaysdate)).limit(2).all()
# Querying a single column returns one-element row tuples, so unpack them:
unique_landingpages = [url for (url,) in unique_landingpage_rows]
Option 2:
unique_landingpage_records = session.query(LandingPages).filter(LandingPages.created_on.startswith(todaysdate)).limit(2).all()
unique_landingpages = [u.ad_url for u in unique_landingpage_records]
If you only need the ad_url field of each record, use option 1, as the query will be less expensive: it selects a single column instead of loading full LandingPages objects.
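As a minimal end-to-end sketch of option 1 (the sample rows and in-memory SQLite database are stand-ins, not the asker's real data, and the import paths assume SQLAlchemy 1.4+ is installed), unpacking the row tuples yields plain strings that scrapy will accept:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class LandingPages(Base):
    __tablename__ = 'landingpages_programmatic'
    id = Column(Integer, primary_key=True)
    ad_url = Column(String(512), nullable=True)

# An in-memory SQLite database stands in for the real one here.
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add_all([LandingPages(ad_url='https://www.google.com'),
                 LandingPages(ad_url='https://www.bing.com/')])
session.commit()

# query(LandingPages.ad_url) returns one-element row tuples; unpack them.
rows = session.query(LandingPages.ad_url).limit(2).all()
start_urls = [url for (url,) in rows]

print(start_urls)  # a list of plain strings, safe to hand to scrapy
```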
The print statement is misleading because you have defined a __repr__ method on the LandingPages class. print looks for that method and uses it to decide how to display what you give it, so a list of LandingPages instances prints exactly like a list of URL strings.
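The effect is easy to reproduce without a database; this stand-in class (hypothetical, not from the question) defines the same __repr__ as LandingPages:

```python
class FakeLandingPage(object):
    """Stand-in for a LandingPages row with the same __repr__."""
    def __init__(self, ad_url):
        self.ad_url = ad_url

    def __repr__(self):
        return "'{self.ad_url}'".format(self=self)

rows = [FakeLandingPage('https://www.google.com'),
        FakeLandingPage('https://www.bing.com/')]

# Prints ['https://www.google.com', 'https://www.bing.com/'] -- it looks
# like a list of strings, but the elements are FakeLandingPage objects.
print(rows)
print(type(rows[0]).__name__)  # FakeLandingPage, the type name scrapy reported
```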
Answered By - c0lon