Issue
I'm 99% sure something is going wrong with my hxs.select on this website. I cannot extract anything. When I run the following code, I don't get any error feedback, but title and link don't get populated. Any help?
def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class=\'footer\']')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        items.append(item)
    return items
Is there a way I can debug this? I also tried the scrapy shell command with a URL, but when I enter view(response) in the shell it simply returns True and a text file opens instead of my web browser.
>>> response.url
'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
>>> hxs.select('//div')
Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'NoneType' object has no attribute 'select'
>>> view(response)
True
>>> hxs.select('//body')
Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'NoneType' object has no attribute 'select'
Solution
Scrapy shell is indeed a good tool for that. If your document has an XML stylesheet, it's probably an XML document, so in scrapy shell you can use xxs instead of hxs, as in this Scrapy documentation example about removing namespaces:
http://doc.scrapy.org/en/latest/topics/selectors.html#removing-namespaces
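To illustrate why a namespace can make a plain //div query come back empty, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree rather than Scrapy's selectors. The sample document, its namespace URI, and the helper name strip_namespaces are made up for illustration; this is not how Scrapy itself removes namespaces, just the same idea on a small scale.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an XHTML-style document with a default namespace.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div class="footer"><a href="/about">About</a></div></body>'
    '</html>'
)


def strip_namespaces(root):
    """Drop the '{uri}' prefix from every element tag, in place."""
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


root = ET.fromstring(SAMPLE)

# With the namespace in place, an unqualified lookup matches nothing.
before = root.findall(".//div")

strip_namespaces(root)

# After stripping, the same kind of lookup finds the footer div and its link.
after = root.findall(".//div[@class='footer']")
links = [(a.text, a.get("href")) for div in after for a in div.findall(".//a")]

print(len(before), len(after), links)
```

If the first lookup is empty and the second is not, the selectors were fine and the namespace was the problem.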
When that doesn't work, I tend to go back to pure lxml.etree and dump the whole document's elements:
import lxml.etree
import lxml.html
from scrapy.spider import BaseSpider  # import path used by old (pre-1.0) Scrapy releases

class myspider(BaseSpider):
    ...
    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        # fromstring() already returns the root element, so there is no .getroot() to call
        root = lxml.etree.fromstring(response.body)
        # or for broken XML docs:
        # root = lxml.etree.fromstring(response.body, parser=lxml.etree.XMLParser(recover=True))
        # or for HTML:
        # root = lxml.etree.fromstring(response.body, parser=lxml.html.HTMLParser())
        # and then look up which elements I can actually select
        print(list(root.iter()))  # this could be very big, but at least you see everything inside: the element tags and namespaces
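Since the full list(root.iter()) dump can get very long, one option is to summarise it by tag name instead. The sketch below uses the standard library's xml.etree.ElementTree and collections.Counter as a stand-in for lxml (both offer Element.iter()); the sample markup and the function name tag_summary are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical sample document with a default namespace.
SAMPLE = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><title>one</title></entry>'
    '<entry><title>two</title></entry>'
    '</feed>'
)


def tag_summary(root):
    """Count how often each element tag occurs under root (root included)."""
    return Counter(el.tag for el in root.iter() if isinstance(el.tag, str))


summary = tag_summary(ET.fromstring(SAMPLE))

# Namespaced tags show up as '{namespace-uri}localname', which is exactly
# the form an XPath query has to account for.
for tag, count in summary.most_common():
    print(count, tag)
```

A tag printed with a leading {uri} is a hint that an unqualified query such as //entry will not match until the namespace is declared or removed.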
Answered By - paul trmbrth