Issue
Using scrapy in a Python 2 environment, I want to use sqlalchemy to query a database for a list of URLs, and then send that URL list to scrapy, to be used as its list of start_urls.
The filename is betsy.py and I execute this whole affair by typing:
scrapy runspider betsy.py
This is supposed to be a fairly simple program to double-check for 404s, etc. I don't need to do any further crawling once I reach these URLs.
Here's what I think is the relevant code:
class LandingPages(Base):
    __tablename__ = 'landingpages_programmatic'
    id = Column(Integer, primary_key=True)
    client_id = Column(Integer, nullable=True)
    campaign_id = Column(Integer, nullable=True)
    ad_id = Column(Integer, nullable=True)
    ad_url = Column(String(512), nullable=True)
    ad_url_utm = Column(String(512), nullable=True)
    created_on = Column(DateTime(), default=datetime.now)

    def __repr__(self):
        return "'{self.ad_url}'".format(self=self)
todaysdate = str(datetime.now().year) + '-' + str(datetime.now().month) + '-' + str(datetime.now().day)
unique_landingpages = session.query(LandingPages).filter(LandingPages.created_on.startswith(todaysdate)).limit(2).all()
class BetsySpider(scrapy.Spider):
    name = 'BetsySpider'
    start_urls = [unique_landingpages]

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
If I add this line just after the unique_landingpages variable is set:
print unique_landingpages
Then I see the seemingly usable results:
['https://www.google.com', 'https://www.bing.com/']
However, I have had no success passing these results to scrapy's start_urls attribute.
If I try start_urls = unique_landingpages, I get this error:
File "/Users/chris/Desktop/Banff Experiments/banff/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 56, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got LandingPages:
If I try start_urls = [unique_landingpages], I get this error:
File "/Users/chris/Desktop/Banff Experiments/banff/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 56, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got list:
However, when I run this in Mac Terminal and then copy the output of print unique_landingpages, as shown above, and then directly paste that into betsy.py, so that it says:
start_urls = ['https://www.google.com', 'https://www.bing.com/']
it runs perfectly fine.
I've searched a dozen or so articles on here but don't see anyone else with this same situation. Can anyone kindly shed some light on where I've gone wrong?
(Incidentally, there's probably a much cleaner way to filter by today's date.)
Solution
Your query is returning two LandingPages instances, not strings. You can either change the query so that only the ad_url values are returned, or extract the ad_url attribute of each LandingPages instance that is returned.
Option 1:
unique_landingpage_rows = session.query(LandingPages.ad_url).filter(LandingPages.created_on.startswith(todaysdate)).limit(2).all()
# Querying a single column returns one-element row tuples, so unpack them:
unique_landingpages = [url for (url,) in unique_landingpage_rows]
Option 2:
unique_landingpage_records = session.query(LandingPages).filter(LandingPages.created_on.startswith(todaysdate)).limit(2).all()
unique_landingpages = [u.ad_url for u in unique_landingpage_records]
If you only need the ad_url field of each record, use option 1, as the query will be less expensive: it selects a single column instead of loading full LandingPages objects.
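As a minimal end-to-end sketch of option 1 (the sample rows and in-memory SQLite database are stand-ins, not the asker's real data, and the import paths assume SQLAlchemy 1.4+ is installed), unpacking the row tuples yields plain strings that scrapy will accept:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class LandingPages(Base):
    __tablename__ = 'landingpages_programmatic'
    id = Column(Integer, primary_key=True)
    ad_url = Column(String(512), nullable=True)

# An in-memory SQLite database stands in for the real one here.
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add_all([LandingPages(ad_url='https://www.google.com'),
                 LandingPages(ad_url='https://www.bing.com/')])
session.commit()

# query(LandingPages.ad_url) returns one-element row tuples; unpack them.
rows = session.query(LandingPages.ad_url).limit(2).all()
start_urls = [url for (url,) in rows]

print(start_urls)  # a list of plain strings, safe to hand to scrapy
```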
The print statement is misleading because you have defined a __repr__ method on the LandingPages class. print looks for that method and uses it to decide how to display what you give it, so a list of LandingPages instances prints exactly like a list of URL strings.
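The effect is easy to reproduce without a database; this stand-in class (hypothetical, not from the question) defines the same __repr__ as LandingPages:

```python
class FakeLandingPage(object):
    """Stand-in for a LandingPages row with the same __repr__."""
    def __init__(self, ad_url):
        self.ad_url = ad_url

    def __repr__(self):
        return "'{self.ad_url}'".format(self=self)

rows = [FakeLandingPage('https://www.google.com'),
        FakeLandingPage('https://www.bing.com/')]

# Prints ['https://www.google.com', 'https://www.bing.com/'] -- it looks
# like a list of strings, but the elements are FakeLandingPage objects.
print(rows)
print(type(rows[0]).__name__)  # FakeLandingPage, the type name scrapy reported
```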
Answered By - c0lon