Issue
I ran a crawler last week and produced a CSV file that lists all the image URLs I need for my project. After reading the CSV into a Python list, I wasn't sure how to use Scrapy to simply download them through a pipeline. I've tried many things and recently got it working, but it's ugly and not quite right: for my list of 10 image URLs, Scrapy finishes the scrape having made 20 requests, even though 10 images were correctly stored. I'm probably doing something stupid because I'm fairly new to Scrapy, but I've read through most of Scrapy's documentation and done quite a bit of trial and error with Google results.
I simply want Scrapy to send one request per URL and download the corresponding image. Any help would be appreciated; I've banged my head against this for three days. My code:
spider.py
import scrapy
import csv
import itertools
from ..items import ImgItem

urls = []
with open('E:/Chris/imgUrls.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for elem in itertools.islice(csvReader, 0, 10):
        urls.append(elem[0])  # just doing the first 10 for testing
# My CSV file is not the problem: one URL per row.

class DwImgSpider(scrapy.Spider):
    name = 'dw-img'
    start_urls = urls

    def parse(self, response):
        item = ImgItem()
        item['image_urls'] = urls
        return item
If you want to see additional files, I can edit this post to add them. I just figured this was where the problem came from, since it does technically work. Thanks again; I appreciate any help or redirects.
Solution
Here is another method, using the simplified_scrapy framework instead of Scrapy. It reads the same CSV, downloads each URL once, and saves every response as an image file:
import csv, os
import itertools
from simplified_scrapy import Spider, SimplifiedMain, utils

class ImageSpider(Spider):
    name = 'images'
    start_urls = []

    def __init__(self):
        # Load the first 10 URLs from the CSV, one URL per row
        with open('E:/Chris/imgUrls.csv') as csvDataFile:
            csvReader = csv.reader(csvDataFile)
            for elem in itertools.islice(csvReader, 0, 10):
                self.start_urls.append(elem[0])
        Spider.__init__(self, self.name)  # Necessary
        if not os.path.exists('images/'):
            os.mkdir('images/')

    def afterResponse(self, response, url, error=None, extra=None):
        try:
            # Save the downloaded bytes under images/ as an image file
            utils.saveResponseAsFile(response, 'images/', 'image')
        except Exception as err:
            print(err)
        return None

SimplifiedMain.startThread(ImageSpider())  # Start download
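For completeness, here is a sketch of how the original Scrapy spider could be fixed. The doubled request count most likely comes from listing every image URL in start_urls and again in image_urls: Scrapy fetches each of the 10 start URLs once, and the ImagesPipeline then downloads each image_urls entry a second time, for 20 requests in total. Issuing a single bootstrap request and handing all the URLs to the pipeline avoids this. This is only a sketch: it assumes the standard ImagesPipeline is enabled in settings.py, and https://example.com is a placeholder for any reachable page, not part of the original code.

import csv
import itertools
import scrapy
from ..items import ImgItem

class DwImgSpider(scrapy.Spider):
    name = 'dw-img'

    def start_requests(self):
        # One bootstrap request just to reach parse(); the images
        # themselves are downloaded by the ImagesPipeline.
        # https://example.com is a placeholder URL (an assumption).
        yield scrapy.Request('https://example.com', callback=self.parse)

    def parse(self, response):
        urls = []
        with open('E:/Chris/imgUrls.csv') as csvDataFile:
            csvReader = csv.reader(csvDataFile)
            for elem in itertools.islice(csvReader, 0, 10):
                urls.append(elem[0])  # one URL per row
        item = ImgItem()
        item['image_urls'] = urls  # the pipeline fetches each URL once
        yield item

The corresponding settings would look something like this (the store path is an assumption; adjust it as needed):

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'E:/Chris/images'

With this layout Scrapy makes one download per image plus the single bootstrap request, instead of fetching every URL twice.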
Answered By - dabingsou