Issue
I ran into a problem where my XPath selector does not return all of the text in an element. The markup, as shown in the browser's developer tools, is this:
<address class="location-row-address" data-qa-target="provider-office-address">
230 W 13th St Ste 1b<!--
--> <!--
-->New York<!--
-->, <!--
-->NY<!--
--> <!--
-->10011<!--
-->
</address>
The XPath selector that I use is
response.xpath('//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address/text()').get()
The result I am getting is
230 W 13th St Ste 1b
The result I am expecting is
230 W 13th St Ste 1b New York, NY 10011
I am using Scrapy for scraping. Thank you. Your help is appreciated.
Edit: The problem above is solved. I wrapped the path in the XPath string() function and called get() to retrieve the full text of the element node.
response.xpath('string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)').get()
Solution
Your XPath expression returns all the text nodes which are children of the address element. There are several text nodes, with comment nodes separating them!
Back in Python land, you are calling the get() method on the result, which returns only the first node of the node-set.
.get() always returns a single result; if there are several matches, content of a first match is returned; if there are no matches, None is returned. .getall() returns a list with all results. https://docs.scrapy.org/en/latest/topics/selectors.html
If you called the getall() method, you would retrieve a list of strings, which you could concatenate to produce the text you want. But a simpler method is to use the XPath function string() to get the "string value" of the address element. The XPath 1.0 spec defines the string value of an element node this way:
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
https://www.w3.org/TR/1999/REC-xpath-19991116/#element-nodes
Applying this function to the address element returns a single string value, which you can then access using the get() method in Scrapy:
response.xpath(
'string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)'
).get()
Answered By - Conal Tuohy