Issue
I am quite new to Scrapy and am trying to get the table data from every page of this website. To start, I just want to get the table data from page 1.
This is my code:
import scrapy


class UAESpider(scrapy.Spider):
    name = 'uae_free'
    allowed_domains = ['https://www.uaeonlinedirectory.com']
    start_urls = [
        'https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A'
    ]

    def parse(self, response):
        zones = response.xpath('//table[@class="GridViewStyle"]/tbody/tr')
        for zone in zones[1:]:
            yield {
                'company_name': zone.xpath('.//td[1]//text()').get(),
                'zone': zone.xpath('.//td[2]//text()').get(),
                'category': zone.xpath('.//td[4]//text()').get()
            }
On the terminal, I get this message:
2020-07-01 08:41:07 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:41:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:41:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:41:09 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.uaeonlinedirectory.com/robots.txt> (referer: None)
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2020-07-01 08:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:41:14 [scrapy.core.engine] INFO: Closing spider (finished)
Do you know what this message means and what is wrong with my code?
Update:
I found this answer, and after I set ROBOTSTXT_OBEY = False, I no longer receive the message above. But I still cannot get the data.
The terminal output after I set ROBOTSTXT_OBEY = False:
2020-07-01 08:56:03 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:56:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:56:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:56:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:56:07 [scrapy.core.engine] INFO: Closing spider (finished)
Update 2:
I opened a terminal and ran scrapy shell https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A to check my XPath:
>>> response.xpath('//table[@class="GridViewStyle"]')
[<Selector xpath='//table[@class="GridViewStyle"]' data='<table class="GridViewStyle" cellspac...'>]
>>> response.xpath('//table[@class="GridViewStyle"]/tbody')
[]
So is my XPath wrong?
Solution
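Your XPath doesn't find the table body, most likely because the raw HTML that Scrapy downloads contains no <tbody> element. Browsers insert one when they build the DOM, so it shows up in the developer tools but not in the response, which matches the empty result in your scrapy shell session. I changed the expression to this and it seems to work now:
//table[@class="GridViewStyle"]//tr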
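A minimal sketch of why the /tbody/tr form comes back empty while //tr does not. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors so that it runs without a network request. The markup, company names, and column layout are made up for illustration and are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real page source).
# Like the raw response, the table has no <tbody>; browsers add that element
# only when they build the DOM.
SAMPLE = """
<html><body>
<table class="GridViewStyle">
  <tr><th>Company</th><th>Zone</th><th>Phone</th><th>Category</th></tr>
  <tr><td>Acme LLC</td><td>Zone A</td><td>000</td><td>Trading</td></tr>
  <tr><td>Beta FZE</td><td>Zone B</td><td>111</td><td>Services</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# The original expression requires a <tbody> child, so it matches nothing.
with_tbody = root.findall('.//table[@class="GridViewStyle"]/tbody/tr')

# The descendant step (//) finds rows at any depth, with or without <tbody>.
rows = root.findall('.//table[@class="GridViewStyle"]//tr')

# Skip the header row, as the spider does with zones[1:].
items = [
    {
        "company_name": row.findtext("td[1]"),
        "zone": row.findtext("td[2]"),
        "category": row.findtext("td[4]"),
    }
    for row in rows[1:]
]

print(len(with_tbody), len(rows))
print(items)
```

In the spider, the equivalent change is to pass '//table[@class="GridViewStyle"]//tr' to response.xpath() in parse; the zones[1:] slice still skips the header row.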
Answered By - Hubert Grzeskowiak