Issue
I want to extract some data from HTML and then be able to highlight extracted elements on client side without modifying source html. And XPath or CSS Path looks great for this. Is that possible to extract XPATH or CSS Path directly from BeautifulSoup?
Right now I use marking of target element and then lxml lib to extract xpath, which is very bad for performance. I know about BSXPath.py
-- it's does not work with BS4.
Solution with rewriting everything to use native lxml lib is not acceptable due to complexity.
import bs4
import cStringIO
import random
from lxml import etree
def get_xpath(soup, element):
_id = random.getrandbits(32)
for e in soup():
if e == element:
e['data-xpath'] = _id
break
else:
raise LookupError('Cannot find {} in {}'.format(element, soup))
content = unicode(soup)
doc = etree.parse(cStringIO.StringIO(content), etree.HTMLParser())
element = doc.xpath('//*[@data-xpath="{}"]'.format(_id))
assert len(element) == 1
element = element[0]
xpath = doc.getpath(element)
return xpath
soup = bs4.BeautifulSoup('<div id=i>hello, <b id=i test=t>world!</b></div>')
xpath = get_xpath(soup, soup.div.b)
assert '//html/bodydiv/b' == xpath
Solution
It's actually pretty easy to extract simple CSS/XPath. This is the same lxml lib gives you.
def get_element(node):
# for XPATH we have to count only for nodes with same type!
length = len(list(node.previous_siblings)) + 1
if (length) > 1:
return '%s:nth-child(%s)' % (node.name, length)
else:
return node.name
def get_css_path(node):
path = [get_element(node)]
for parent in node.parents:
if parent.name == 'body':
break
path.insert(0, get_element(parent))
return ' > '.join(path)
soup = bs4.BeautifulSoup('<div></div><div><strong><i>bla</i></strong></div>')
assert get_css_path(soup.i) == 'div:nth-child(2) > strong > i'
Answered By - Dmytro Sadovnychyi
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.