Issue
I'm using Python, Scrapy, Splash, and the scrapy_splash package to scrape a website.
I'm able to log in using the SplashRequest object in scrapy_splash. Logging in creates a cookie that gives me access to a portal page. Up to this point everything works.
On the portal page, there is a form element wrapping a number of buttons. When a button is clicked, the form's action URL is updated and a form submission is triggered. The form submission results in a 302 redirect.
I tried the same approach with SplashRequest, but I'm unable to capture the SSO query parameter that is returned with the redirect. I've tried to read the Location header without success.
I've also tried using Lua scripts in combination with SplashRequest, but I'm still unable to access the redirect's Location header.
Any guidance would be greatly appreciated.
I realize there are other solutions available (e.g. Selenium); however, the stack above is what we use across a large number of other scripts, and I hesitate to add new tech for this specific use case.
# Lua script to capture cookies and SSO query parameter from 302 Redirect
lua_script = """
function main(splash)
    if splash.args.cookies then
        splash:init_cookies(splash.args.cookies)
    end
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
        formdata=splash.args.formdata
    })
    assert(splash:wait(0))
    local entries = splash:history()
    local last_response = entries[#entries].response
    return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""
# Requires: from scrapy_splash import SplashRequest

def parse(self, response):
    # Log in through Splash; the Lua script returns the session cookies.
    yield SplashRequest(
        url='https://members.example.com/login',
        callback=self.portal_page,
        method='POST',
        endpoint='execute',
        args={
            'wait': 0.5,
            'lua_source': self.lua_script,
            'formdata': {
                'username': self.login,
                'password': self.password
            },
        }
    )

def portal_page(self, response):
    # Request the portal page with the cookies obtained at login.
    yield SplashRequest(
        url='https://data.example.com/portal',
        callback=self.data_download,
        args={
            'wait': 0.5,
            'lua_source': self.lua_script,
            'formdata': {}
        },
    )

def data_download(self, response):
    print(response.body.decode('utf8'))
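Since the goal is to read the SSO query parameter from the redirect, here is a minimal sketch of how the value could be pulled out of what the Lua script returns. It assumes the headers arrive as a HAR-style list of name/value entries (the format splash:history() uses) and that the parameter is literally named "sso"; both the helper names and the parameter name are hypothetical. It checks the Location header first and falls back to the final URL, because Splash may already have followed the redirect by the time the script returns.

```python
from urllib.parse import parse_qs, urlsplit


def header_value(har_headers, name):
    """Return the first value of a header in a HAR-style list, or None."""
    wanted = name.lower()
    for header in har_headers:
        if header.get("name", "").lower() == wanted:
            return header.get("value")
    return None


def query_param(url, param):
    """Return the first value of a query parameter in url, or None."""
    if not url:
        return None
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None


def find_sso_token(final_url, har_headers, param="sso"):
    """Look for the token in the Location header, then in the final URL.

    The parameter name "sso" is an assumption; use whatever the site sends.
    """
    location = header_value(har_headers, "Location")
    return query_param(location, param) or query_param(final_url, param)


# Made-up sample data shaped like the table returned by the Lua script.
example_headers = [
    {"name": "Content-Type", "value": "text/html"},
    {"name": "Location",
     "value": "https://data.example.com/landing?sso=abc123&lang=en"},
]
token = find_sso_token("https://data.example.com/portal", example_headers)
print(token)
```

In a scrapy_splash callback, the inputs would presumably come from response.data["url"] and response.data["headers"], which hold the fields of the table returned by the Lua script.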
Solution
I updated the question above with a working example.
I changed a few things; however, the problem I was having was directly related to the missing splash:init_cookies(splash.args.cookies) call.
I also converted from using SplashFormRequest to SplashRequest, refactored the splash:go block, and removed a reference to the specific form.
Thanks again @MikhailKorobov for your help.
Answered By - Charles Green