Issue
I am trying to use PySide to render a webpage's JavaScript-generated HTML, then use that HTML for web scraping. I started off using this quick example, but the results are very inconsistent.
The problem is that some pages work perfectly fine, but others hang indefinitely. I'm not talking about giving up after a few seconds: I've let my script run for hours at various times and it makes no progress.
My current code is as follows:
import sys
from PySide.QtCore import *
from PySide.QtGui import *
from PySide.QtWebKit import *

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished[bool].connect(self.end)
        self.mainFrame().load(url)
        self.app.exec_()

    def end(self, result):
        print 'end'
        self.finalFrame = self.mainFrame()
        self.app.quit()

r = Render('http://pyside.github.io/docs/pyside/PySide/QtWebKit/index.html')
print r.finalFrame.toHtml().encode('ascii', 'ignore')
print 'done'
This page works, as do the pages given in this answer, but most others ('https://www.google.ca/', 'https://webscraping.com') do not.
How do I get these pages to load?
Solution
The problem seems to be SSL-related. I'm still not sure exactly what the problem was, but it was fixed by:
1. Uninstalling the Anaconda version (1.2.1) of PySide and installing it with pip (1.2.4). The Anaconda build seems to be fundamentally broken: various class attributes don't exist when they should, and there are unresolvable circular dependencies.
2. Downloading OpenSSL (lite) and placing the 2 DLLs (ssleay.dll and libeay.dll) in both the directory where the program is run and the environment/Library/bin directory. Either one on its own did not work. Credit for this part goes to this question.
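To check whether the DLLs ended up where they need to be, something like the following sketch might help. It is only a diagnostic and not part of the original fix. The helper names `missing_dlls` and `qt_supports_ssl` are made up for illustration, the DLL names are taken as written above (your OpenSSL build may name them differently), and treating `sys.prefix` as the environment directory is an assumption about a typical Anaconda layout on Windows. `QSslSocket.supportsSsl()` is the Qt call that reports whether Qt managed to load its SSL libraries.

```python
import os
import sys

# DLL names as given in the answer above; adjust if your build names them differently.
DLL_NAMES = ("ssleay.dll", "libeay.dll")


def missing_dlls(directories, names=DLL_NAMES):
    """Map each directory to the list of DLL names that are not present in it."""
    report = {}
    for directory in directories:
        report[directory] = [
            name for name in names
            if not os.path.isfile(os.path.join(directory, name))
        ]
    return report


def qt_supports_ssl():
    """Return QSslSocket.supportsSsl(), or None if PySide cannot be imported."""
    try:
        from PySide.QtNetwork import QSslSocket
    except ImportError:
        return None
    return QSslSocket.supportsSsl()


if __name__ == "__main__":
    # Assumption: the script runs from the program directory, and sys.prefix
    # is the root of the active environment (so <prefix>/Library/bin applies).
    candidates = [os.getcwd(), os.path.join(sys.prefix, "Library", "bin")]
    for directory, missing in missing_dlls(candidates).items():
        print("%s: missing %s" % (directory, missing or "nothing"))
    print("Qt SSL support: %s" % qt_supports_ssl())
```

If `qt_supports_ssl()` returns `False`, HTTPS pages such as the ones in the question are unlikely to load, which would be consistent with the hang described. A result of `None` only means PySide itself could not be imported.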
Answered By - GreySage