Issue
I'm running scrapy 0.20.2.
$ scrapy shell "http://newyork.craigslist.org/ata/"
I would like to build a list of all links to advertisement pages, excluding index.html.
$ sel.xpath('//a[contains(@href,"html")]')
...
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atq/4243973984.html">Wicke'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atd/4257230057.html" class'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atd/4257230057.html">Recla'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/ata/index100.html" class="butt'>]
I would like to use the XPath matches() function to match links of the form given by the regex [0-9]+.html.
$ sel.xpath('//a[matches(@href,"[0-9]+.html")]')
...
ValueError: Invalid XPath: //a[matches(@href,"[0-9]+.html")]
What's wrong?
Solution
matches is an XPath 2.0 function, and Scrapy only supports XPath 1.0 (which does not have any regular expression support built in). You'll have to extract all the links using a Scrapy selector and then do the regex filtering at the Python level rather than within the XPath.
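As a rough sketch of the Python-level filtering, assuming the href values have already been pulled out as plain strings (in the shell, something like sel.xpath('//a/@href').extract() should give you that list), the standard re module can do the matching. The sample hrefs are copied from the output in the question; the names hrefs, AD_LINK and ad_links are just illustrative. Note that the unanchored pattern [0-9]+.html from the question would also match /ata/index100.html, so this sketch anchors on the preceding slash and escapes the dot.

```python
import re

# Sample hrefs copied from the shell output in the question. In the Scrapy
# shell they would presumably come from sel.xpath('//a/@href').extract().
hrefs = [
    "/mnh/atq/4243973984.html",
    "/mnh/atd/4257230057.html",
    "/mnh/atd/4257230057.html",
    "/ata/index100.html",
]

# Require the file name to be digits only: a slash, one or more digits,
# a literal dot, then "html" at the end of the string.
AD_LINK = re.compile(r"/[0-9]+\.html$")

# Keep only the hrefs that look like advertisement pages.
ad_links = [href for href in hrefs if AD_LINK.search(href)]

for link in ad_links:
    print(link)
```

The same link can appear more than once on the page (here 4257230057.html does), so wrap the result in set() if you only want unique URLs.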
Answered By - Ian Roberts