Wednesday, July 13, 2022

[FIXED] How to scrape the whole information while using xpath selector

Tags: python, scrapy, selector, web-scraping, xpath

Issue

I cannot get all of the text of an element when using an XPath selector. The element, as shown in the browser's developer tools, is this:

<address class="location-row-address" data-qa-target="provider-office-address">
230 W 13th St Ste 1b<!-- 
--> <!-- 
-->New York<!-- 
-->, <!--
-->NY<!-- 
--> <!-- 
-->10011<!--
--> 
</address>

The XPath selector that I use is

response.xpath('//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address/text()').get()

The result I am getting is

230 W 13th St Ste 1b

The result I am expecting is

230 W 13th St Ste 1b New York, NY 10011

I am using Scrapy for scraping. Thank you, your help is appreciated.

Edit: The problem above is solved. I wrapped the path in the XPath string() function and called get() to retrieve all of the text from the element node.

response.xpath('string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)').get()

Solution

Your XPath expression returns all the text nodes which are children of the address element. There are several text nodes, with comment nodes separating them!

Back in Python land, you are calling the get() method on the result, which returns only the first node of the node-set.

.get() always returns a single result; if there are several matches, content of a first match is returned; if there are no matches, None is returned. .getall() returns a list with all results. https://docs.scrapy.org/en/latest/topics/selectors.html

If you called the getall() method instead, you would retrieve a list of strings, which you could concatenate to produce the text you want. A simpler method is to use the XPath function string() to get the "string value" of the address element. The XPath 1.0 spec defines the string value of an element node this way:

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
https://www.w3.org/TR/1999/REC-xpath-19991116/#element-nodes

Applying this function to the address element returns a single string value, which you can then access using the get() method in Scrapy:

response.xpath(
   'string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)'
).get()

Answered By - Conal Tuohy

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
