Issue
I'm new to Scrapy and Splash, and I need to scrape data from single-page and regular web apps.
One caveat: I'm mostly scraping data from internal tools and applications, so some require authentication, and all of them need at least a couple of seconds of loading time before the page fully renders.
I naively tried a Python time.sleep(seconds), and it didn't help; sleeping in the spider doesn't delay the rendering, since SplashRequest and scrapy.Request have already run and yielded their results by then. I then learned about Lua scripts as arguments to these requests and attempted a Lua script with various forms of wait(), but it looks like the requests never actually run the Lua scripts: everything finishes right away, and my HTML selectors don't find anything I'm looking for.
I'm following the directions from https://github.com/scrapy-plugins/scrapy-splash, have their Docker instance running on localhost:8050, and have created a settings.py.
Anyone with experience here know what I might be missing?
Thanks!
spider.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest
import logging
import base64
import time
# from selenium import webdriver

# First attempt: a fixed wait plus a repeat/until poll
# lua_script = """
# function main(splash)
#     splash:set_user_agent(splash.args.ua)
#     assert(splash:go(splash.args.url))
#     splash:wait(5)
#     -- requires Splash 2.3
#     -- while not splash:select('#user-form') do
#     --     splash:wait(5)
#     -- end
#     repeat
#         splash:wait(5)
#     until( splash:select('#user-form') ~= nil )
#     return {html=splash:html()}
# end
# """
load_page_script = """
function main(splash)
    splash:set_user_agent(splash.args.ua)
    assert(splash:go(splash.args.url))
    splash:wait(5)

    -- helper: poll until a condition function returns true
    function wait_for(splash, condition)
        while not condition() do
            splash:wait(0.5)
        end
    end

    -- pause for 5 seconds in JavaScript before resuming the Lua script
    local result, error = splash:wait_for_resume([[
        function main(splash) {
            setTimeout(function () {
                splash.resume();
            }, 5000);
        }
    ]])

    -- wait until the login form is present in the DOM
    wait_for(splash, function()
        return splash:evaljs("document.querySelector('#user-form') != null")
    end)

    -- repeat
    --     splash:wait(5)
    -- until( splash:select('#user-form') ~= nil )

    return {html=splash:html()}
end
"""
class HelpSpider(scrapy.Spider):
    name = "help"
    allowed_domains = ["secet_internal_url.com"]
    start_urls = ['https://secet_internal_url.com']
    # http_user = 'splash-user'
    # http_pass = 'splash-password'

    def start_requests(self):
        logger = logging.getLogger()
        login_page = 'https://secet_internal_url.com/#/auth'
        splash_args = {
            'html': 1,
            'png': 1,
            'width': 600,
            'render_all': 1,
            'lua_source': load_page_script  # swap in lua_script above to try the first attempt
        }
        yield SplashRequest(login_page, self.parse, endpoint='execute', magic_response=True, args=splash_args)

    def parse(self, response):
        # time.sleep(10)  # naive attempt; has no effect on rendering
        logger = logging.getLogger()
        html = response.body.decode("utf-8")
        # Looking for a form with the ID 'user-form'
        form = response.css('#user-form')
        logger.info("####################")
        logger.info(form)
        logger.info("####################")
Solution
I figured it out!
Short Answer
My spider class wasn't configured correctly for using Splash with Scrapy.
Long Answer
Part of running Splash with Scrapy, in my case, is running a local Docker instance that Splash uses to load my requests and run the Lua scripts. An important caveat: the Splash settings described on the GitHub page had to be a property of the spider class itself (at least in my setup), so I added this code to my spider:
custom_settings = {
    'SPLASH_URL': 'http://localhost:8050',
    # if installed via Docker Toolbox:
    # 'SPLASH_URL': 'http://192.168.99.100:8050',
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    },
    'SPIDER_MIDDLEWARES': {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    },
    'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
}
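For reference, the scrapy-splash README configures these same keys project-wide; if you don't need to scope them to one spider, the equivalent entries can live in the project's settings.py instead. A sketch, assuming a standard Scrapy project layout:

# settings.py -- project-wide equivalent of the custom_settings above
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'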
With those settings in place, I could see my Lua code running and the Docker container logs showing the interactions. Once I fixed the errors around splash:select(), my login script worked, as did my waits:

splash:wait(seconds_to_wait)
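For instance, a pared-down version of the polling approach from the question becomes workable once the settings above are in place. This exact script isn't from the original answer; it's a minimal sketch that assumes Splash 2.3+ for splash:select() and reuses the question's #user-form selector:

wait_script = """
function main(splash)
    splash:set_user_agent(splash.args.ua)
    assert(splash:go(splash.args.url))
    -- poll every half second until the element exists (Splash 2.3+)
    repeat
        splash:wait(0.5)
    until splash:select('#user-form') ~= nil
    return {html=splash:html()}
end
"""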
Lastly, I created a Lua script to handle logging in, redirecting, and gathering links and text from pages. My application is an AngularJS app, so links aren't plain hrefs I can extract and follow; I have to click them. The script let me run through every link, click it, and gather the content; a sketch of the idea follows.
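The original script isn't reproduced here, but a minimal sketch of the pattern might look like this. The form, field names, and link selector are hypothetical placeholders, and it assumes Splash 2.3+ element support (splash:select_all, element:mouse_click):

click_through_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(3)

    -- log in through the rendered form (hypothetical selector and field names)
    local form = splash:select('#user-form')
    assert(form:fill({username = splash.args.user, password = splash.args.pass}))
    assert(form:submit())
    splash:wait(3)

    -- click each link in the SPA and capture the HTML rendered after the click
    -- note: in a real app you may need to re-select links after each click
    local pages = {}
    for i, link in ipairs(splash:select_all('a')) do
        link:mouse_click()
        splash:wait(2)
        pages[i] = splash:html()
    end
    return {pages = pages}
end
"""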
I suppose an alternative would have been an end-to-end testing tool such as Selenium/WebDriver or Cypress, but I prefer to use Scrapy to scrape and testing tools to test. To each their own (Python or NodeJS tools), I suppose.
Neat Trick
Another thing that's really helpful for debugging: while the Scrapy-Splash Docker instance is running, you can visit its URL in your browser and use the interactive request tester to try out Lua scripts and see the rendered HTML results (for example, to verify logins or page visits). For me this URL was http://0.0.0.0:8050; it's the same URL you set as SPLASH_URL in your settings, so the two should match your Docker container.
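You can also hit Splash's HTTP API directly from Python to sanity-check rendering outside of Scrapy. A small sketch using the standard render.html endpoint (the target URL and wait value here are just illustrative):

import requests

# Ask the local Splash instance to render a page, waiting 5 seconds
# for scripts to run before returning the HTML.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 5},
)
print(resp.text[:500])  # rendered HTML, not the raw page source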
Cheers!
Answered By - RoboBear