Issue
I have multiple URLs to scrape stored in a CSV file, where each row is a separate URL, and I'm using the following code to run it:
def start_requests(self):
    with open('csvfile', 'rb') as f:
        list = []
        for line in f.readlines():
            array = line.split(',')
            url = array[9]
            list.append(url)
        list.pop(0)
        for url in list:
            if url != "":
                yield scrapy.Request(url=url, callback=self.parse)
It gives me the following error: IndexError: list index out of range. Can anyone help me correct this or suggest another way to use the csv file?
Edit: the csv file looks like this:
http://example.org/page1
http://example.org/page2
There are 9 such rows.
Solution
You should be able to do this by reading the csv file into a list with the csv module, without most of the code above; there is no need to split, pop, or append.
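For context, the IndexError in the original code comes from array[9]: since each row of the file holds a single URL with no commas, line.split(',') returns a one-element list, so index 9 is out of range. A minimal illustration of the failure:

    line = 'http://example.org/page1'
    array = line.split(',')  # ['http://example.org/page1'] -- only one element
    url = array[9]           # IndexError: list index out of range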
Working example
import csv
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        with open('websites.csv') as csv_file:
            data = csv.reader(csv_file)
            for row in data:
                # Skip blank lines, which csv.reader returns as empty lists
                # (indexing into them would raise the same IndexError)
                if not row:
                    continue
                # Supposing that the data is in the first column
                url = row[0]
                if url != "":
                    # We need to check this has the http(s) prefix,
                    # or Scrapy raises a missing scheme error
                    if not url.startswith('http://') and not url.startswith('https://'):
                        url = 'https://' + url
                    yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Do my data extraction
        print("test")

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    })
    c.crawl(QuotesSpider)
    c.start()
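If your real file has a header row and the URL actually sits in a later column (the original array[9] hints at a tenth column), csv.DictReader is a safer variant: it looks fields up by header name rather than by position and consumes the header row for you, so no pop(0) is needed. A sketch under that assumption, using a hypothetical column header named url:

    def start_requests(self):
        with open('websites.csv', newline='') as csv_file:
            for row in csv.DictReader(csv_file):
                # 'url' is an assumed header name -- change it to match your file
                url = (row.get('url') or '').strip()
                if url:
                    yield scrapy.Request(url=url, callback=self.parse)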
Answered By - Ryan