Issue
I have multiple URLs to scrape stored in a CSV file, where each row is a separate URL, and I'm using the following code to run it:
def start_requests(self):
    with open('csvfile', 'rb') as f:
        list = []
        for line in f.readlines():
            array = line.split(',')
            url = array[9]
            list.append(url)
        list.pop(0)
        for url in list:
            if url != "":
                yield scrapy.Request(url=url, callback=self.parse)
It gives me the following error: IndexError: list index out of range. Can anyone help me correct this or suggest another way to use the csv file?
Edit: the csv file looks like this:
http://example.org/page1
http://example.org/page2
There are 9 such rows.
Solution
You should be able to do this by reading the csv file into a list with the csv module, without most of the code above; there is no need to split, pop, or append.
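For context, the IndexError in the original code comes from array[9]: since each row of the file holds a single URL with no commas, line.split(',') returns a one-element list, so index 9 is out of range. A minimal illustration of the failure:

    line = 'http://example.org/page1'
    array = line.split(',')  # ['http://example.org/page1'] -- only one element
    url = array[9]           # IndexError: list index out of range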
Working example
import csv
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        with open('websites.csv') as csv_file:
            data = csv.reader(csv_file)
            for row in data:
                # Skip blank lines, which csv.reader returns as empty lists
                # (indexing into them would raise the same IndexError)
                if not row:
                    continue
                # Supposing that the data is in the first column
                url = row[0]
                if url != "":
                    # We need to check this has the http(s) prefix,
                    # or Scrapy raises a missing scheme error
                    if not url.startswith('http://') and not url.startswith('https://'):
                        url = 'https://' + url
                    yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Do my data extraction
        print("test")

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    })
    c.crawl(QuotesSpider)
    c.start()
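If your real file has a header row and the URL actually sits in a later column (the original array[9] hints at a tenth column), csv.DictReader is a safer variant: it looks fields up by header name rather than by position and consumes the header row for you, so no pop(0) is needed. A sketch under that assumption, using a hypothetical column header named url:

    def start_requests(self):
        with open('websites.csv', newline='') as csv_file:
            for row in csv.DictReader(csv_file):
                # 'url' is an assumed header name -- change it to match your file
                url = (row.get('url') or '').strip()
                if url:
                    yield scrapy.Request(url=url, callback=self.parse)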
Answered By - Ryan