Issue
I need to iterate over a form, filling it out with different sets of options. I can already crawl/scrape data using Scrapy and Python for one set of variables, but now I need to iterate through a list of them.
Currently, my spider can log in, fill out the form, and scrape the data.
To log in and complete the form I use:
from scrapy.spiders import CrawlSpider
from scrapy.http import FormRequest

class FormSpider(CrawlSpider):
    name = 'formSpider'
    allowed_domains = ['example.org']
    start_urls = ['https://www.example.org/en-en/']

    age = '35'
    days = '21'
    S1 = 'abc'
    S2 = 'cde'
    S3 = 'efg'
    S4 = 'hij'

    def parse(self, response):
        token = response.xpath('//*[@name="__VIEWSTATE"]/@value').extract_first()
        return FormRequest.from_response(response,
                                         formdata={'__VIEWSTATE': token,
                                                   'Password': 'XXXXX',
                                                   'UserName': 'XXXXX'},
                                         callback=self.scrape_main)
And I use this code to complete the form:
    def parse_transfer(self, response):
        return FormRequest.from_response(response,
                                         formdata={"Age": self.age,
                                                   "Days": self.days,
                                                   "Skill_1": self.S1,
                                                   "Skill_2": self.S2,
                                                   "Skill_3": self.S2,
                                                   "Skill4": self.S3,
                                                   "butSearch": "Search"},
                                         callback=self.parse_item)
Then, I scrape the data and export it as CSV.
What I need now is to iterate over the form inputs. I was thinking of using a list for each variable, changing the form on each pass (I only need a certain number of combinations).
age = ['35','36','37','38']
days = ['10','20','30','40']
S1 = ['abc','def','ghi','jkl']
S2 = ['cde','qwe','rty','yui']
S3 = ['efg','asd','dfg','ghj']
S4 = ['hij','bgt','nhy','mju']
So I can iterate the form in a way like:
age[0],days[0],S1[0],S2[0],S3[0],S4[0]... age[1],days[1]... and so on
Any recommendations? I am open to different options (not just lists) to avoid creating multiple spiders.
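One built-in way to get exactly the lockstep pairing described above (index 0 with index 0, index 1 with index 1, and so on) is Python's zip(). A minimal sketch, using the placeholder lists from the question to build one formdata dict per combination:

```python
# Parallel lists of form values; zip() pairs them element-by-element,
# so the first dict uses age[0], days[0], S1[0], ..., the second uses
# age[1], days[1], S1[1], ..., and so on.
age = ['35', '36', '37', '38']
days = ['10', '20', '30', '40']
S1 = ['abc', 'def', 'ghi', 'jkl']
S2 = ['cde', 'qwe', 'rty', 'yui']
S3 = ['efg', 'asd', 'dfg', 'ghj']
S4 = ['hij', 'bgt', 'nhy', 'mju']

forms = [
    {"Age": a, "Days": d, "Skill_1": s1, "Skill_2": s2,
     "Skill_3": s3, "Skill_4": s4, "butSearch": "Search"}
    for a, d, s1, s2, s3, s4 in zip(age, days, S1, S2, S3, S4)
]

print(len(forms))   # one dict per index
```

Each dict in forms could then be passed as the formdata argument of a FormRequest.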
UPDATE
This is the final code:
from scrapy.utils.response import open_in_browser

    def parse_transfer(self, response):
        return FormRequest.from_response(response,
                                         formdata={"Age": self.age,
                                                   "Days": self.days,
                                                   "Skill_1": self.S1,
                                                   "Skill_2": self.S2,
                                                   "Skill_3": self.S2,
                                                   "Skill4": self.S3,
                                                   "butSearch": "Search"},
                                         dont_filter=True,
                                         callback=self.parse_item)

    def parse_item(self, response):
        open_in_browser(response)
        # it opens all the websites after submitting the form :)
Solution
It's hard to tell what your current parse_transfer() is meant to be doing, because your FormSpider doesn't have a self.skill_1 that we can see. You also may not need to inherit from CrawlSpider here, and you should change the returns to yields.
To iterate over the form, I recommend replacing the spider attributes you currently have with the lists you will use for iteration, then looping in parse_transfer():
    def parse_transfer(self, response):
        for i in range(len(self.age)):
            yield FormRequest.from_response(response,
                                            formdata={"Age": self.age[i],
                                                      "Days": self.days[i],
                                                      "Skill_1": self.S1[i],
                                                      "Skill_2": self.S2[i],
                                                      "Skill_3": self.S3[i],
                                                      "Skill_4": self.S4[i],
                                                      "butSearch": "Search"},
                                            callback=self.parse_item)
This may not be a viable solution based on the way the website accepts requests, though.
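A side note on the iteration itself: zip() gives the lockstep pairing the question asks for, while itertools.product would generate every cross-combination if that were ever needed instead. A short sketch contrasting the two, with shortened placeholder lists:

```python
from itertools import product

age = ['35', '36']
days = ['10', '20']

# Lockstep pairs, matching the indexed loop above:
# ('35', '10'), ('36', '20')
lockstep = list(zip(age, days))

# Full cross-product, if every combination were wanted:
# ('35', '10'), ('35', '20'), ('36', '10'), ('36', '20')
cross = list(product(age, days))

print(len(lockstep), len(cross))
```

Also note that since each FormRequest targets the same URL, Scrapy's duplicate filter can silently drop all but the first request; that is why dont_filter=True (as in the asker's final code) is needed when submitting the same form repeatedly.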
Answered By - pwinz