Issue
I have a pandas DataFrame that I want to use as Scrapy start URLs. The function get_links opens an xlsx file into a DataFrame, which has a column LINK that I want to run the spider on.
I convert it to a list of dicts using:
dictdf = df.to_dict(orient='records')
I know each link can be accessed with url['LINK'], but what I want is to pass the whole dict through into the Scrapy output:
dictdf = {'Data1':'1','Data2':'2','LINK':'www.link.com',.....,'Datan':'n'}
# start urls
def start_requests(self):
    urls = get_links()
    for url in urls:
        yield scrapy.Request(url=url['LINK'], callback=self.parse)
My question is: is there any way to pass the whole dict into parse(), so that dictdf is yielded in the output as well and the output of Scrapy becomes:
{'ScrapedData1':'d1','Data1':'1','Data2':'2','LINK':'www.link.com',.....,'Datan':'n'}
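For context, to_dict(orient='records') returns a list with one dict per row, so each element already bundles the LINK with the rest of the row's data. A minimal sketch, with made-up row values standing in for the real spreadsheet:

```python
# Hypothetical rows, standing in for df.to_dict(orient='records') output:
rows = [
    {'Data1': '1', 'Data2': '2', 'LINK': 'www.link.com/a'},
    {'Data1': '3', 'Data2': '4', 'LINK': 'www.link.com/b'},
]

# Each element is a complete row dict, so the URL and the
# extra columns travel together:
urls = [row['LINK'] for row in rows]
```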
Solution
If I understand you correctly, you want to carry over some data from the start_requests method. To do that you can use the Request.meta attribute:
def start_requests(self):
    data = [{
        'url': 'http://httpbin.org',
        'extra_data': 'extra',
    }]
    for item in data:
        yield Request(item['url'], meta={'item': item})

def parse(self, response):
    item = response.meta['item']
    # {'url': 'http://httpbin.org', 'extra_data': 'extra'}
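To get the combined output shown in the question, parse can then merge the carried row with the newly scraped fields before yielding. A sketch of the merge step alone (the field names are the question's placeholders, not a real spider):

```python
def merge_item(carried, scraped):
    # Combine the row dict carried via Request.meta with freshly
    # scraped fields; scraped values win on key collisions.
    merged = dict(carried)
    merged.update(scraped)
    return merged

row = {'Data1': '1', 'Data2': '2', 'LINK': 'www.link.com'}
item = merge_item(row, {'ScrapedData1': 'd1'})
# item now holds both the original columns and the scraped data
```

Inside the spider this would be yield merge_item(response.meta['item'], scraped). Note that Scrapy 1.7+ also offers Request.cb_kwargs, which passes such data to the callback as keyword arguments and is the recommended alternative to meta for this purpose.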
Answered By - Granitosaurus