Issue
I am trying to scrape a website, but I am having problems with the XPath expressions I am using on Scrapy's response objects.
From what I have learned about XPath, I believe my expressions are correct.
So I used a web browser to load the web page, then downloaded it and saved it as an HTML file.
Then I tried the XPath expressions two different ways.
The first way was to use Python's lxml.html module to open the file and load it as an HTMLParser object.
The second way was to use Scrapy and point it to the saved HTML file.
In both cases I used the same XPath expressions, but I got different results.
The sample HTML code is something like this (not exactly but I didn't want to post a huge chunk of code verbatim):
<html>
  <body>
    <div>
      <table type="games">
        <tbody>
          <tr row="1">
            <th data="week_number">1</th>
            <td data="date">"9/13/2020"</td>
          </tr>
        </tbody>
      </table>
    </div>
  </body>
</html>
For example, I'm trying to scrape the week number in the <th> element under the <tr> element in the <table>.
I double-checked the content by using Chrome, instead of Firefox, to inspect the file (Firefox's inspector adds <tbody> elements to tables, according to this post: Parsing HTML with XPath, Python and Scrapy). The <tbody> element is in the file, according to Chrome's Inspect.
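A quick way to confirm the same thing from Python (a sketch, using the saved file path from the script below):

from lxml import html

# Parse the saved page and query for the <tbody> directly;
# a non-empty list means the element really is in the file
tree = html.parse("/home/foo.html")
print(tree.xpath("//table[@type = 'games']/tbody"))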
The first way was to open the HTML file using the lxml.html module:
import sys
from StringIO import StringIO

from lxml import etree, html

if __name__ == '__main__':
    filename_04 = "/home/foo.html"

    # Try opening the file
    try:
        fh_04 = open(filename_04, "r")
    except IOError:
        print "Error opening %s. Exiting" % filename_04
        sys.exit(1)

    # Try reading the contents of the HTML file,
    # then close the file
    try:
        content_04 = fh_04.read().decode('utf-8')
    except UnicodeDecodeError:
        print "Error trying to read as UTF-8. Exiting."
        sys.exit(1)
    fh_04.close()

    # Define an HTML parser object
    parser_04 = html.HTMLParser()

    # Parse the HTML content into a logical element tree
    tree_04 = html.parse(StringIO(content_04), parser_04)

    # Get all the <tr> elements from the <table type="games">
    game_elements_list = tree_04.xpath("//table[@type = 'games']/tbody/tr")
    num_games = len(game_elements_list)

    # Now loop thru each of the <tr> element objects of game_elements_list
    for x in range(num_games):
        # Parse the week number using xpath()
        # *** NOTE: this expression returns a list
        parsed_week_number = game_elements_list[x].xpath(".//th[@data = 'week_number']/text()")
        print ":: parsed_week_number: ", str(parsed_week_number)
        p_type = type(parsed_week_number)
        print ":: p_type: ", str(p_type)
Using the XPath expressions via the lxml.html module returns this output:
:: parsed_week_number: ['1']
:: p_type: <type 'list'>
This is what I expect from the XPath expressions, so the expressions themselves appear to be correct.
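(Worth noting: the items lxml returns for text() are technically "smart strings", a unicode subclass that remembers its parent element, although they print and compare like plain strings. A small sketch:)

from lxml import html

doc = html.fromstring("<table><tr><th data='week_number'>1</th></tr></table>")
result = doc.xpath("//th/text()")
print(result)                     # ['1'] - looks like a plain list of strings
print(result[0].getparent().tag)  # 'th' - lxml "smart strings" remember their parent element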
However, when I point the Scrapy spider to the local file, I get different results:
# I'm only posting the callback method, not the
# method that makes the actual request, because
# the request() call works
def parse_schedule_page(self, response):
    # The XPath expression is the same as the one used
    # in the script that uses the lxml.html module
    game_elements_list = response.xpath("//table[@type = 'games']/tbody/tr")
    num_game_elements = len(game_elements_list)

    for i in range(num_game_elements):
        # Again, the XPath expression is the same as the one
        # used in the script that uses the lxml.html module
        parsed_week_number = game_elements_list[i].xpath(".//th[@data = 'week_number']/text()")
        stmt = ":: parsed_week_number: " + str(parsed_week_number)
        self.log(stmt)
        p_type = type(parsed_week_number)
        stmt = "p_type: " + str(p_type)
        self.log(stmt)

    """
    To get the week number, I have to add the following line:
        week_number = parsed_week_number.extract()
    """
But in the case of the Spider, the output is different:
2020-07-17 21:22:30 [test_schedule] DEBUG: :: parsed_week_number: [<Selector xpath=".//th[@data = 'week_number']/text()" data=u'1'>]
2020-07-17 21:22:30 [test_schedule] DEBUG: p_type: <class 'scrapy.selector.unified.SelectorList'>
The same XPath expression doesn't return the text content of <th data="week_number">1</th> as a plain string; it returns a list of Selector objects instead.
I know Scrapy uses a different extraction mechanism than lxml's HTMLParser. But no matter how the HTML data is stored, shouldn't the XPath expressions work the same even if the extraction mechanisms are different?
Does Scrapy's response.xpath() method evaluate XPath expressions differently than lxml.html's xpath() method?
Solution
To answer your question: Scrapy uses lxml internally, and the XML Path Language is standardised (albeit not updated in a while), so your XPath expressions should evaluate the same way in both cases. The difference you are seeing is not in the XPath evaluation but in what is returned: lxml's xpath() gives you plain strings for text nodes, while Scrapy wraps every match in a Selector and returns a SelectorList, which is why you need that extra .extract() call to get the underlying strings.
To help you further, a URL would be good for the specific XPath selector you're struggling with.
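A minimal sketch of that wrapping difference, using the sample markup from the question:

from lxml import html
from scrapy.selector import Selector

sample = "<table type='games'><tbody><tr><th data='week_number'>1</th></tr></tbody></table>"

# lxml returns the matched text nodes as plain strings
print(html.fromstring(sample).xpath("//th[@data = 'week_number']/text()"))
# ['1']

# Scrapy wraps each match in a Selector and returns a SelectorList;
# .extract() unwraps back to the plain strings lxml would give you
sel = Selector(text=sample)
print(sel.xpath("//th[@data = 'week_number']/text()"))
print(sel.xpath("//th[@data = 'week_number']/text()").extract())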
Tips
As a general rule, if I can't get an XPath selector to work when running a script, I go to the Scrapy shell and work it out there. Generally speaking, I work in the Scrapy shell with a list of the data I want, and try out the XPath there to confirm it will be picked up in the script, before writing my Scrapy spiders. A sketch of that workflow is below.
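For example, the shell can be pointed at the saved local file from the question (the session below is a sketch: the file path and attribute names are the question's, and the output is what Scrapy would log for the sample markup):

$ scrapy shell file:///home/foo.html
>>> response.xpath("//table[@type = 'games']//th[@data = 'week_number']/text()")
[<Selector xpath="//table[@type = 'games']//th[@data = 'week_number']/text()" data=u'1'>]
>>> response.xpath("//table[@type = 'games']//th[@data = 'week_number']/text()").extract()
[u'1']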
Additional Information
For more information on XPath, see here.
It's worth looking at the Scrapy codebase if you have questions like this about the internals, even if you don't think you'll understand a lot of it.
The Scrapy docs reference the response.xpath method here, and you can also get to the underlying code by clicking the source link.
Below is the relevant code for the xpath method, including the imports.
response.xpath imports
"""
XPath selectors based on lxml
"""
import sys
import six
from lxml import etree, html
response.xpath method
def xpath(self, query, namespaces=None, **kwargs):
    """
    Find nodes matching the xpath ``query`` and return the result as a
    :class:`SelectorList` instance with all elements flattened. List
    elements implement :class:`Selector` interface too.

    ``query`` is a string containing the XPATH query to apply.

    ``namespaces`` is an optional ``prefix: namespace-uri`` mapping (dict)
    for additional prefixes to those registered with ``register_namespace(prefix, uri)``.
    Contrary to ``register_namespace()``, these prefixes are not
    saved for future calls.

    Any additional named arguments can be used to pass values for XPath
    variables in the XPath expression, e.g.::

        selector.xpath('//a[href=$url]', url="http://www.example.com")
    """
    try:
        xpathev = self.root.xpath
    except AttributeError:
        return self.selectorlist_cls([])

    nsp = dict(self.namespaces)
    if namespaces is not None:
        nsp.update(namespaces)
    try:
        result = xpathev(query, namespaces=nsp,
                         smart_strings=self._lxml_smart_strings,
                         **kwargs)
    except etree.XPathError as exc:
        msg = u"XPath error: %s in %s" % (exc, query)
        msg = msg if six.PY3 else msg.encode('unicode_escape')
        six.reraise(ValueError, ValueError(msg), sys.exc_info()[2])

    if type(result) is not list:
        result = [result]

    result = [self.__class__(root=x, _expr=query,
                             namespaces=self.namespaces,
                             type=self.type)
              for x in result]
    return self.selectorlist_cls(result)
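Note the xpathev = self.root.xpath line above: the query is handed straight to the underlying lxml node, and each raw result is then wrapped back into a Selector. A quick sketch to confirm this from the outside (assuming a Scrapy version where Selector.root exposes the lxml node, as in the code above):

from scrapy.selector import Selector

sel = Selector(text="<p>hi</p>")
print(type(sel.root))                # the underlying lxml HtmlElement
print(sel.root.xpath("//p/text()"))  # ['hi'] - raw lxml result, no Selector wrapping
print(sel.xpath("//p/text()"))       # the same match wrapped in a SelectorList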
Answered By - AaronS