Wednesday, September 28, 2022

[FIXED] Scrapy keeps getting blocked

Labels: python, scrapy

Issue

I am trying to get a list of movie theaters in the US from http://cinematreasures.org/ as part of learning Python and Scrapy.

I have written a spider to crawl the site, but I don't get any response when I run it. Please find attached pictures of the HTML tree, my spider, the response when I run the spider, and the changes I made to settings.py.

I was thinking of trying proxy IPs, but I don't know how to use them with Scrapy. Please help.

DOM Tree

My Default Headers

Terminal Output

I have tried the code in scrapy shell and it works fine.

When I try to run it via scrapy crawl listall I get nothing!

I just want to be able to export to csv via pandas if possible.

This is my code:

    name = 'listall'
    allowed_domains = ['cinematreasures.org']
    start_urls = ['http://cinematreasures.org/theaters/united-states?page=1&status=all']
    #url = 'http://cinematreasures.org/theaters/united-states?page={}&status=all'

    def parse(self, response):
        for row in response.xpath('//table//tr')[1:]:
            name = row.xpath('td//text()')[2].get()
            address = row.xpath('td//text()')[4].get()
            yield {
                'Name': name,
                'Address': address,
            }
        next_page = response.xpath("//a[@class='next_page']").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

Solution

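Your XPath expressions aren't correct. Start relative expressions with "./" so that they are clearly evaluated against the current row, and select cells by their class rather than by index, which in my opinion is much easier. The next-page selector also needs "/@href": without it, .get() returns the whole <a> element as a string instead of a URL to follow.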

    def parse(self, response):
        for row in response.xpath('//table[@class="list"]//tr'):
            name =  row.xpath('./td[@class="name"]/a/text()').get()
            address = row.xpath('./td[@class="location"]/text()').get()
            yield {
                'Name':name,
                'Address':address,
            }
        next_page = response.xpath("//a[@class='next-page']/@href").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
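
To see the row-relative, class-based selection in isolation, here is a minimal sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) instead of Scrapy's selectors, and the SAMPLE markup and the extract_rows helper are hypothetical stand-ins rather than the site's real HTML. Real pages are rarely well-formed XML, so treat this only as an illustration of the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one page of the listing table.
SAMPLE = """
<html><body>
<table class="list">
  <tr><th>Name</th><th>Location</th></tr>
  <tr>
    <td class="name"><a href="/theaters/1">Airdome</a></td>
    <td class="location">Ardmore, OK, United States</td>
  </tr>
  <tr>
    <td class="name"><a href="/theaters/2">Liberty Theatre</a></td>
    <td class="location">Chickamauga, GA, United States</td>
  </tr>
</table>
</body></html>
"""


def extract_rows(markup):
    """Return one dict per data row, selecting cells by class instead of by index."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.findall(".//table[@class='list']//tr"):
        # Both lookups start with "./", so they only search inside this row.
        link = tr.find("./td[@class='name']/a")
        location = tr.find("./td[@class='location']")
        if link is None or location is None:
            continue  # the header row has <th> cells, so nothing matches
        rows.append({"Name": link.text.strip(), "Address": location.text.strip()})
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

Because each cell is found by its class, the result does not depend on how many text nodes happen to sit in a row, which is what made the index-based version fragile.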

OUTPUT

...
...
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': None, 'Address': None}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Airdome', 'Address': '\n                Ardmore, OK, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Liberty Theatre', 'Address': '\n                Chickamauga, GA, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Route 54 Drive-In', 'Address': '\n                Tularosa, NM, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '#1 Auto Theatre', 'Address': '\n                Daytona Beach, FL, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '#1 Drive-In', 'Address': '\n                Apalachicola, FL, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '$1.00 Cinema', 'Address': '\n                Sherman, TX, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '$uper Cinemas', 'Address': '\n                East Lansing, MI, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '0only Outdoor Theatre', 'Address': '\n                Little Chute, WI, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '10 Hi Drive-In', 'Address': '\n                St. Cloud, MN, United States\n              '}
...
...
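
The output above contains one item with None values (presumably the table's header row, which has no matching cells) and addresses padded with whitespace. The following is a minimal sketch of tidying such items and writing them as CSV. The clean_items and to_csv_text helpers are hypothetical names, and the sketch uses the standard library's csv module rather than pandas to stay dependency-free.

```python
import csv
import io


def clean_items(items):
    """Drop items with a missing field and strip surrounding whitespace."""
    cleaned = []
    for item in items:
        name, address = item.get("Name"), item.get("Address")
        if name is None or address is None:
            continue  # e.g. a row where neither selector matched
        cleaned.append({"Name": name.strip(), "Address": address.strip()})
    return cleaned


def to_csv_text(items):
    """Serialize the items to CSV text with a Name,Address header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["Name", "Address"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


# A few items copied from the log output above.
scraped = [
    {"Name": None, "Address": None},
    {"Name": " Airdome",
     "Address": "\n                Ardmore, OK, United States\n              "},
    {"Name": "#1 Drive-In",
     "Address": "\n                Apalachicola, FL, United States\n              "},
]

rows = clean_items(scraped)
csv_text = to_csv_text(rows)

if __name__ == "__main__":
    print(csv_text)
```

The same stripping could instead be done inside parse before yielding each item. Scrapy can also write the yielded items to a file directly through its feed export option, for example scrapy crawl listall -o theaters.csv, where the file name is only an illustration.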

Answered By - Alexander

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
