Issue
I am trying to get a list of movie theaters in the US from http://cinematreasures.org/ as part of learning Python and Scrapy.
I have written a spider to crawl the site, but I don't get any response when I run it. Please find attached pictures of the HTML tree, my spider, the response when I run the spider, and the changes I made to settings.py.
I was thinking of trying proxy IPs, but I don't know how to use them with Scrapy. Please help.
I have tried the code in the Scrapy shell and it works fine.
When I try to run it via scrapy crawl listall, I get nothing!
I just want to be able to export to CSV via pandas if possible.
This is my code:
    name = 'listall'
    allowed_domains = ['cinematreasures.org']
    start_urls = ['http://cinematreasures.org/theaters/united-states?page=1&status=all']
    #url = 'http://cinematreasures.org/theaters/united-states?page={}&status=all'

    def parse(self, response):
        for row in response.xpath('//table//tr')[1:]:
            name = row.xpath('td//text()')[2].get()
            address = row.xpath('td//text()')[4].get()
            yield {
                'Name': name,
                'Address': address,
            }
        next_page = response.xpath("//a[@class='next_page']").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
Solution
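Your XPath expressions aren't correct. When you evaluate an expression against a row selector, start it with "./" so that it is explicitly relative to that row, and select cells by their class attribute rather than by position, which in my opinion is much easier than indexing. The pagination selector needs two fixes as well: the class is "next-page" (with a hyphen, not an underscore), and the expression must end in /@href so that it returns the link's URL rather than the whole <a> element.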
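As a rough illustration of row-relative, class-based selection, here is a small sketch that uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors. ElementTree supports only a subset of XPath (there is no text() step, for example), so the text is read from the .text attribute instead. The SAMPLE_TABLE markup is a simplified assumption built from the class names used in this answer, not a copy of the real page, and extract_rows is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed stand-in for the listing table; the real page is larger
# and is not guaranteed to be well-formed XML.
SAMPLE_TABLE = """
<table class="list">
  <tr><th>Name</th><th>Location</th></tr>
  <tr>
    <td class="name"><a>Liberty Theatre</a></td>
    <td class="location">Chickamauga, GA, United States</td>
  </tr>
  <tr>
    <td class="name"><a>Route 54 Drive-In</a></td>
    <td class="location">Tularosa, NM, United States</td>
  </tr>
</table>
"""


def extract_rows(markup):
    """Return one dict per <tr>, selecting cells by class relative to the row."""
    table = ET.fromstring(markup)
    results = []
    for row in table.findall("./tr"):
        # Both paths start with "./", so they are resolved against this row only.
        link = row.find('./td[@class="name"]/a')
        cell = row.find('./td[@class="location"]')
        name = link.text if link is not None else None
        address = cell.text if cell is not None else None
        results.append({"Name": name, "Address": address})
    return results


rows = extract_rows(SAMPLE_TABLE)
```

The header row has no td cells, so both lookups come back empty and it yields None for both fields. The same idea applies to the spider's parse method: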
    def parse(self, response):
        for row in response.xpath('//table[@class="list"]//tr'):
            name = row.xpath('./td[@class="name"]/a/text()').get()
            address = row.xpath('./td[@class="location"]/text()').get()
            yield {
                'Name': name,
                'Address': address,
            }
        next_page = response.xpath("//a[@class='next-page']/@href").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
OUTPUT
...
...
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': None, 'Address': None}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Airdome', 'Address': '\n Ardmore, OK, United States\n '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Liberty Theatre', 'Address': '\n Chickamauga, GA, United States\n '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Route 54 Drive-In', 'Address': '\n Tularosa, NM, United States\n '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '#1 Auto Theatre', 'Address': '\n Daytona Beach, FL, United States\n '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '#1 Drive-In', 'Address': '\n Apalachicola, FL, United States\n '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '$1.00 Cinema', 'Address': '\n Sherman, TX, United States\n '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '$uper Cinemas', 'Address': '\n East Lansing, MI, United States\n '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '0only Outdoor Theatre', 'Address': '\n Little Chute, WI, United States\n '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '10 Hi Drive-In', 'Address': '\n St. Cloud, MN, United States\n '}
...
...
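The scraped items still carry surrounding whitespace, and the first one is empty because the header row has no matching cells. Since the goal was a CSV export, here is a minimal sketch of one way to tidy the items and serialize them, using only the standard library. The helper names clean_items and to_csv_text are hypothetical, and the sample items are copied from the log output above.

```python
import csv
import io


def clean_items(items):
    """Drop items with a missing field and strip surrounding whitespace."""
    cleaned = []
    for item in items:
        name, address = item.get("Name"), item.get("Address")
        if name is None or address is None:
            continue  # e.g. the header row, which yields None for both fields
        cleaned.append({"Name": name.strip(), "Address": address.strip()})
    return cleaned


def to_csv_text(items):
    """Serialize the items as CSV text with a Name,Address header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Name", "Address"])
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


raw_items = [
    {"Name": None, "Address": None},
    {"Name": " Airdome", "Address": "\n Ardmore, OK, United States\n "},
    {"Name": "#1 Drive-In", "Address": "\n Apalachicola, FL, United States\n "},
]
cleaned = clean_items(raw_items)
csv_text = to_csv_text(cleaned)
```

If no post-processing is needed, Scrapy's built-in feed export may be simpler: running scrapy crawl listall -o theaters.csv writes the yielded items to a CSV file (the file name here is only an example). With pandas installed, pandas.DataFrame(cleaned).to_csv("theaters.csv", index=False) should give a similar result.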
Answered By - Alexander