Issue
I am trying to scrape a website, but I am having problems with the XPath expressions I am using on Scrapy's response objects.
From what I have learned about XPath, I believe my expressions are correct.
So I used a web browser to load the web page, then downloaded it and saved it as an HTML file.
Then I tried the XPath expressions two different ways.
The first way was to use Python's lxml.html module to open the file and load it as an HTMLParser object.
The second way was to use Scrapy and point it to the saved HTML file.
In both cases I used the same XPath expressions, but I got different results.
The sample HTML code is something like this (not exactly but I didn't want to post a huge chunk of code verbatim):
<html>
  <body>
    <div>
      <table type="games">
        <tbody>
          <tr row="1">
            <th data="week_number">1</th>
            <td data="date">"9/13/2020"</td>
          </tr>
        </tbody>
      </table>
    </div>
  </body>
</html>
For example, I'm trying to scrape the week number in the <th> element under the <tr> element in the <table>.
I double-checked the content by using Chrome, instead of Firefox, to inspect the file (Firefox's inspector adds <tbody> elements to tables, according to this post: Parsing HTML with XPath, Python and Scrapy). The <tbody> element is in the file, according to Chrome's Inspect.
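A quick way to confirm the same thing from Python (a sketch, using the saved file path from the script below):

from lxml import html

# Parse the saved page and query for the <tbody> directly;
# a non-empty list means the element really is in the file
tree = html.parse("/home/foo.html")
print(tree.xpath("//table[@type = 'games']/tbody"))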
The first way was to open the HTML file using the lxml.html module:
import sys
from StringIO import StringIO

from lxml import etree, html

if __name__ == '__main__':
    filename_04 = "/home/foo.html"

    # Try opening the file
    try:
        fh_04 = open(filename_04, "r")
    except IOError:
        print "Error opening %s. Exiting" % filename_04
        sys.exit(1)

    # Try reading the contents of the HTML file,
    # then close the file
    try:
        content_04 = fh_04.read().decode('utf-8')
    except UnicodeDecodeError:
        print "Error trying to read as UTF-8. Exiting."
        sys.exit(1)
    fh_04.close()

    # Define an HTML parser object
    parser_04 = html.HTMLParser()

    # Parse the HTML content into a logical element tree
    tree_04 = html.parse(StringIO(content_04), parser_04)

    # Get all the <tr> elements from the <table type="games">
    game_elements_list = tree_04.xpath("//table[@type = 'games']/tbody/tr")
    num_games = len(game_elements_list)

    # Now loop thru each of the <tr> element objects of game_elements_list
    for x in range(num_games):
        # Parse the week number using xpath()
        # *** NOTE: this expression returns a list
        parsed_week_number = game_elements_list[x].xpath(".//th[@data = 'week_number']/text()")
        print ":: parsed_week_number: ", str(parsed_week_number)
        p_type = type(parsed_week_number)
        print ":: p_type: ", str(p_type)
Using the XPath expressions via the lxml.html module returns this output:
:: parsed_week_number: ['1']
:: p_type: <type 'list'>
This is what I expect from the XPath expressions, so the expressions themselves appear to be correct.
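(Worth noting: the items lxml returns for text() are technically "smart strings", a unicode subclass that remembers its parent element, although they print and compare like plain strings. A small sketch:)

from lxml import html

doc = html.fromstring("<table><tr><th data='week_number'>1</th></tr></table>")
result = doc.xpath("//th/text()")
print(result)                     # ['1'] - looks like a plain list of strings
print(result[0].getparent().tag)  # 'th' - lxml "smart strings" remember their parent element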
However, when I point the Scrapy spider to the local file, I get different results:
# I'm only posting the callback method, not the
# method that makes the actual request, because
# the request() call works
def parse_schedule_page(self, response):
    # The XPath expression is the same as the one used
    # in the script that uses the lxml.html module
    game_elements_list = response.xpath("//table[@type = 'games']/tbody/tr")
    num_game_elements = len(game_elements_list)

    for i in range(num_game_elements):
        # Again, the XPath expression is the same as the one
        # used in the script that uses the lxml.html module
        parsed_week_number = game_elements_list[i].xpath(".//th[@data = 'week_number']/text()")
        stmt = ":: parsed_week_number: " + str(parsed_week_number)
        self.log(stmt)
        p_type = type(parsed_week_number)
        stmt = "p_type: " + str(p_type)
        self.log(stmt)

    """
    To get the week number, I have to add the following line:
        week_number = parsed_week_number.extract()
    """
But in the case of the Spider, the output is different:
2020-07-17 21:22:30 [test_schedule] DEBUG: :: parsed_week_number: [<Selector xpath=".//th[@data = 'week_number']/text()" data=u'1'>]
2020-07-17 21:22:30 [test_schedule] DEBUG: p_type: <class 'scrapy.selector.unified.SelectorList'>
The same XPath expression doesn't return the text content of <th data="week_number">1</th> as a plain string; it returns a list of Selector objects instead.
I know Scrapy uses a different extraction mechanism than lxml's HTMLParser. But no matter how the HTML data is stored, shouldn't the XPath expressions work the same even if the extraction mechanisms are different?
Does Scrapy's response.xpath() method evaluate XPath expressions differently than lxml.html's xpath() method?
Solution
To answer your question: Scrapy uses lxml internally, and the XML Path Language is standardised (albeit not updated in a while), so your XPath expressions should evaluate the same way in both cases. The difference you are seeing is not in the XPath evaluation but in what is returned: lxml's xpath() gives you plain strings for text nodes, while Scrapy wraps every match in a Selector and returns a SelectorList, which is why you need that extra .extract() call to get the underlying strings.
To help you further, a URL would be good for the specific XPath selector you're struggling with.
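A minimal sketch of that wrapping difference, using the sample markup from the question:

from lxml import html
from scrapy.selector import Selector

sample = "<table type='games'><tbody><tr><th data='week_number'>1</th></tr></tbody></table>"

# lxml returns the matched text nodes as plain strings
print(html.fromstring(sample).xpath("//th[@data = 'week_number']/text()"))
# ['1']

# Scrapy wraps each match in a Selector and returns a SelectorList;
# .extract() unwraps back to the plain strings lxml would give you
sel = Selector(text=sample)
print(sel.xpath("//th[@data = 'week_number']/text()"))
print(sel.xpath("//th[@data = 'week_number']/text()").extract())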
Tips
As a general rule, if I can't get an XPath selector to work when running a script, I go to the Scrapy shell and work it out there. Generally speaking, I work in the Scrapy shell with a list of the data I want, and try out the XPath there to confirm it will be picked up in the script, before writing my Scrapy spiders. A sketch of that workflow is below.
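For example, the shell can be pointed at the saved local file from the question (the session below is a sketch: the file path and attribute names are the question's, and the output is what Scrapy would log for the sample markup):

$ scrapy shell file:///home/foo.html
>>> response.xpath("//table[@type = 'games']//th[@data = 'week_number']/text()")
[<Selector xpath="//table[@type = 'games']//th[@data = 'week_number']/text()" data=u'1'>]
>>> response.xpath("//table[@type = 'games']//th[@data = 'week_number']/text()").extract()
[u'1']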
Additional Information
For more information on XPath, see here.
It's worth looking at the Scrapy codebase if you have questions like this about the internals, even if you don't think you'll understand a lot of it.
The Scrapy docs reference the response.xpath method here, and you can also get to the underlying code by clicking the source link.
Below is the relevant code for the xpath method, including the imports.
response.xpath imports
"""
XPath selectors based on lxml
"""
import sys
import six
from lxml import etree, html
response.xpath method
def xpath(self, query, namespaces=None, **kwargs):
    """
    Find nodes matching the xpath ``query`` and return the result as a
    :class:`SelectorList` instance with all elements flattened. List
    elements implement :class:`Selector` interface too.

    ``query`` is a string containing the XPATH query to apply.

    ``namespaces`` is an optional ``prefix: namespace-uri`` mapping (dict)
    for additional prefixes to those registered with ``register_namespace(prefix, uri)``.
    Contrary to ``register_namespace()``, these prefixes are not
    saved for future calls.

    Any additional named arguments can be used to pass values for XPath
    variables in the XPath expression, e.g.::

        selector.xpath('//a[href=$url]', url="http://www.example.com")
    """
    try:
        xpathev = self.root.xpath
    except AttributeError:
        return self.selectorlist_cls([])

    nsp = dict(self.namespaces)
    if namespaces is not None:
        nsp.update(namespaces)
    try:
        result = xpathev(query, namespaces=nsp,
                         smart_strings=self._lxml_smart_strings,
                         **kwargs)
    except etree.XPathError as exc:
        msg = u"XPath error: %s in %s" % (exc, query)
        msg = msg if six.PY3 else msg.encode('unicode_escape')
        six.reraise(ValueError, ValueError(msg), sys.exc_info()[2])

    if type(result) is not list:
        result = [result]

    result = [self.__class__(root=x, _expr=query,
                             namespaces=self.namespaces,
                             type=self.type)
              for x in result]
    return self.selectorlist_cls(result)
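Note the xpathev = self.root.xpath line above: the query is handed straight to the underlying lxml node, and each raw result is then wrapped back into a Selector. A quick sketch to confirm this from the outside (assuming a Scrapy version where Selector.root exposes the lxml node, as in the code above):

from scrapy.selector import Selector

sel = Selector(text="<p>hi</p>")
print(type(sel.root))                # the underlying lxml HtmlElement
print(sel.root.xpath("//p/text()"))  # ['hi'] - raw lxml result, no Selector wrapping
print(sel.xpath("//p/text()"))       # the same match wrapped in a SelectorList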
Answered By - AaronS