Issue
I wrote a Python script that processes a large number of downloaded HTML pages (120K pages). I need to parse them and extract some information. I tried BeautifulSoup, which is easy and intuitive, but it runs very slowly. Since this will have to run routinely on a weak machine (on Amazon), speed is important. Is there an HTML/XML parser in Python that works much faster than BeautifulSoup, or must I resort to regex parsing?
Solution
lxml is a fast XML and HTML parser: http://lxml.de/parsing.html
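As a rough illustration of what parsing one page with lxml might look like, here is a minimal sketch. The function name `extract_page_info`, the sample markup, and the choice of fields (page title and link targets) are assumptions made for the example, since the question does not say what needs to be extracted. It requires the third-party `lxml` package to be installed.

```python
import lxml.html  # third-party package: pip install lxml


def extract_page_info(html_text):
    """Return the title and all link targets found in one HTML page.

    Hypothetical helper: the fields extracted here are only examples.
    """
    # Parse the HTML string into an element tree.
    doc = lxml.html.fromstring(html_text)
    # findtext returns the text of the first matching element, or None.
    title = doc.findtext(".//title")
    # XPath attribute selection returns a list of strings.
    links = doc.xpath("//a/@href")
    return {"title": title, "links": links}


# Made-up sample page, standing in for one of the downloaded files.
sample = (
    "<html><head><title>Example</title></head>"
    "<body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
)
info = extract_page_info(sample)
print(info)
```

If most of the existing code is already written against BeautifulSoup, another option may be to keep it and pass "lxml" as the parser name, as in BeautifulSoup(html_text, "lxml"), which uses lxml for the parsing step. How much either approach helps on 120K pages would need to be measured on the actual data.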
Answered By - Marcin