Wednesday, January 12, 2022

[FIXED] DEBUG: Crawled (404) when crawling table with Scrapy


Issue

I am quite new to Scrapy, and I am trying to get the table data from every page of this website.

But first, I just want to get the table data from page 1.

This is my code:

import scrapy

class UAESpider(scrapy.Spider):
    name = 'uae_free'

    allowed_domains = ['https://www.uaeonlinedirectory.com']

    start_urls = [
        'https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A'
    ]

    def parse(self, response):
        zones = response.xpath('//table[@class="GridViewStyle"]/tbody/tr')
        for zone in zones[1:]:
            yield {
                'company_name': zone.xpath('.//td[1]//text()').get(),
                'zone': zone.xpath('.//td[2]//text()').get(),
                'category': zone.xpath('.//td[4]//text()').get()
            }

On the terminal, I get this message:

2020-07-01 08:41:07 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:41:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:41:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:41:09 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.uaeonlinedirectory.com/robots.txt> (referer: None)
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2020-07-01 08:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:41:14 [scrapy.core.engine] INFO: Closing spider (finished)

Does anyone know what this message means and what is wrong with my code?

Update:

I found this answer, and after I set ROBOTSTXT_OBEY = False, I don't receive the message above anymore. But I still cannot get the data.

The terminal message after I set ROBOTSTXT_OBEY = False:

2020-07-01 08:56:03 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:56:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:56:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:56:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:56:07 [scrapy.core.engine] INFO: Closing spider (finished)

Update 2:

I opened a terminal and ran scrapy shell https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A to check my XPath:

>>> response.xpath('//table[@class="GridViewStyle"]')
[<Selector xpath='//table[@class="GridViewStyle"]' data='<table class="GridViewStyle" cellspac...'>]
>>> response.xpath('//table[@class="GridViewStyle"]/tbody')
[]

So is my XPath wrong?

Solution

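Your XPath doesn't find the table body. Most likely the HTML that Scrapy receives contains no <tbody> element at all, which is what your scrapy shell check suggests. Browsers often insert <tbody> when they build the DOM, so an XPath copied from the browser's developer tools can fail against the raw response. I changed it to this and it seems to work now: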

//table[@class="GridViewStyle"]//tr
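
To illustrate the difference between the two expressions without hitting the site, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree (which supports a subset of XPath) instead of Scrapy. The markup, the company names, and the extract_rows helper are all hypothetical; the sketch only assumes, as the scrapy shell output in the question suggests, that the rows sit directly under the table with no <tbody>.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw page: rows sit directly under <table>,
# with no <tbody> wrapper. Company names and values are made up.
SAMPLE = """
<html><body>
<table class="GridViewStyle">
  <tr><th>Company</th><th>Zone</th><th>Phone</th><th>Category</th></tr>
  <tr><td>Acme LLC</td><td>Zone A</td><td>000</td><td>Trading</td></tr>
  <tr><td>Beta FZE</td><td>Zone B</td><td>111</td><td>Services</td></tr>
</table>
</body></html>
"""

# The path from the question (requires a <tbody>) and the path from the answer.
WITH_TBODY = './/table[@class="GridViewStyle"]/tbody/tr'
ANY_DEPTH = './/table[@class="GridViewStyle"]//tr'


def extract_rows(root, path):
    """Return one dict per data row matched by path, skipping the header row."""
    items = []
    for row in root.findall(path)[1:]:
        cells = [td.text for td in row.findall('td')]
        items.append({
            'company_name': cells[0],
            'zone': cells[1],
            'category': cells[3],
        })
    return items


root = ET.fromstring(SAMPLE.strip())
print(extract_rows(root, WITH_TBODY))  # no <tbody> in the markup, so nothing matches
print(extract_rows(root, ANY_DEPTH))   # rows are found at any depth under the table
```

In the spider itself, the corresponding change is to the expression passed to response.xpath in parse. Note that //tr matches rows at any depth, so if the table ever contains nested tables, their rows would be picked up as well.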

Answered By - Hubert Grzeskowiak

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
