Friday, June 3, 2022

[FIXED] How can I debug Scrapy?

June 03, 2022 python, scrapy, web-scraping No comments

Issue

I'm 99% sure something is going on with my hxs.select on this website. I cannot extract anything. When I run the following code, I don't get any error feedback. title or link doesn't get populated. Any help?

def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class=\'footer\']')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        items.append(item)
    return items

Is there a way I can debug this? I also tried to use the scrapy shell command with an url but when I input view(response) in the shell it simply returns True and a text file opens instead of my Web Browser.

>>> response.url
'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'

>>> hxs.select('//div')
Traceback (most recent call last):
    File "", line 1, in 
AttributeError: 'NoneType' object has no attribute 'select'

>>> view(response)
True

>>> hxs.select('//body')
Traceback (most recent call last):
    File "", line 1, in 
AttributeError: 'NoneType' object has no attribute 'select'

Solution

Scrapy shell is a good tool for that indeed. And if your document has an XML stylesheet, it's probably an XML document. So you can use scrapy shell with xxs instead of hxs as in this Scrapy documentation example about removing namespaces: http://doc.scrapy.org/en/latest/topics/selectors.html#removing-namespaces

When that doesn't work, I tend to go back to pure lxml.etree and dump the whole document's elements:

import lxml.etree
import lxml.html

class myspider(BaseSpider):
    ...
    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        root = lxml.etree.fromstring(response.body).getroot()
        # or for broken XML docs:
        # root = lxml.etree.fromstring(response.body, parser = lxml.etree.XMLParser(recover=True)).getroot()
        # or for HTML:
        # root = lxml.etree.fromstring(response.body, parser=lxml.html.HTMLParser()).getroot()

        # and then lookup what are the actual elements I can select
        print list(root.iter()) # this could be very big, but at least you all what's inside, the element tags and namespaces

Answered By - paul trmbrth

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, June 3, 2022

[FIXED] How can I debug Scrapy?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels