Issue
I have been reading a lot about Scrapy and have written a spider that scrapes a printer's web page for the information I want.
Currently I run it with -o data.json to save the output to a file.
What I am looking for is one of two things.
1) Instead of saving to a file, send the JSON to an API endpoint as a POST request. I have read about item pipelines and know I can set a number to batch things (I don't fully understand it), but I just want to send all the JSON at once when the scrape is over.
2) If 1 is not possible, can I run Scrapy from another Python script and get the data back there? From there I can do whatever I need with it.
Solution
Have you tried storing the data in MySQL instead, and syncing it to the other server(s) later?
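If that route appeals to you, a minimal sketch of such a pipeline might look like the following, assuming the pymysql driver and a hypothetical scraped_items table with a single data column (swap in your own credentials and schema):

import json

import pymysql

class MySQLExportPipeline(object):
    def open_spider(self, spider):
        # connection details here are placeholders -- use your own credentials
        self.conn = pymysql.connect(host='localhost', user='user',
                                    password='password', database='scrapydb')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # store each item as a JSON string in the (hypothetical) scraped_items table
        self.cursor.execute('INSERT INTO scraped_items (data) VALUES (%s)',
                            (json.dumps(dict(item)),))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()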
Here is a tweak, just in case you'd still like to use your idea:
First, enable the pipeline in the spider's custom_settings (or project-wide in settings.py):
custom_settings = {
    'ITEM_PIPELINES': {
        'yourproject.pipelines.YourProjectPipeline': 300,
    },
}
Then add this (pseudo) code to the item pipeline in pipelines.py:
class YourProjectPipeline(object):
    def __init__(self):
        # this list will collect every scraped item
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item  # return the item so any later pipelines still see it

    def close_spider(self, spider):
        # Scrapy calls this once when the crawl ends, therefore
        # this is the place where you need to send the data to the API
        pass
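Putting it together, here is a minimal sketch of what the finished pipeline could look like, assuming the requests library is available and using a placeholder endpoint URL (replace it with your real API):

import json

import requests

class YourProjectPipeline(object):
    def __init__(self):
        # collect every scraped item here
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # send all collected items in one POST after the crawl finishes;
        # the URL below is a placeholder, not a real endpoint
        requests.post(
            'https://example.com/api/items',
            data=json.dumps(self.items),
            headers={'Content-Type': 'application/json'},
        )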
Answered By - Janib Soomro