Issue
I'm just starting to learn Scrapy and have a question. My spider has to take its list of URLs (start_urls) from a Google Sheets table, and I have this code:
import gspread
from oauth2client.service_account import ServiceAccountCredentials
scope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)
client = gspread.authorize(creds)
sheet = client.open('Sheet_1')
sheet_instance = sheet.get_worksheet(0)
records_data = sheet_instance.col_values(col=2)
for link in records_data:
    print(link)
........
How do I configure a middleware so that when the spider is launched (scrapy crawl my_spider), the links from this code are automatically substituted into start_urls? Perhaps I need to create a class in middlewares.py?
I will be grateful for any help, with examples.
This rule must apply to all new spiders. Generating the list from a file in start_requests (for example start_urls = [l.strip() for l in open('urls.txt').readlines()]) is not convenient...
Solution
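Read about spider middlewares in the Scrapy documentation. The process_start_requests method lets a middleware supply the spider's start requests: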
spider.py:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            'tempbuffer.middlewares.ExampleMiddleware': 543,
        }
    }

    def parse(self, response):
        print(response.url)
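Since the question asks for the rule to apply to every spider, the same mapping could probably go in the project-wide settings.py instead of each spider's custom_settings. A minimal sketch, assuming the project package is named tempbuffer as in the path used above:

```python
# settings.py (project-wide): enables the middleware for every spider
# in the project, so individual spiders need no custom_settings entry.
# 'tempbuffer' is the package name taken from the answer; adjust it to
# your own project.
SPIDER_MIDDLEWARES = {
    'tempbuffer.middlewares.ExampleMiddleware': 543,
}
```

A spider can still override this through its own custom_settings if it needs different behaviour.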
middlewares.py:
import scrapy

class ExampleMiddleware(object):
    def process_start_requests(self, start_requests, spider):
        # change this to your needs:
        with open('urls.txt', 'r') as f:
            for line in f:
                url = line.strip()  # drop the trailing newline
                if url:  # skip blank lines
                    yield scrapy.Request(url=url)
urls.txt:
https://example.com
https://example1.com
https://example2.org
output:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example2.org> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example1.com> (referer: None)
https://example2.org
https://example.com
https://example1.com
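To read the links from Google Sheets rather than urls.txt, the gspread code from the question could be moved into the middleware. The following is only a sketch under assumptions: clean_urls, load_sheet_urls and SheetStartUrlsMiddleware are hypothetical names, and the sheet name, key file and column number are copied from the question.

```python
def clean_urls(values):
    """Strip whitespace and keep only cells that look like URLs.

    This drops empty cells and things like a header row.
    """
    urls = []
    for value in values:
        url = value.strip()
        if url.startswith(("http://", "https://")):
            urls.append(url)
    return urls


def load_sheet_urls():
    """Fetch column 2 of the first worksheet, as in the question's code."""
    # Imported here so clean_urls() can be used without these packages.
    import gspread
    from oauth2client.service_account import ServiceAccountCredentials

    scope = ['https://spreadsheets.google.com/feeds',
             'https://www.googleapis.com/auth/drive']
    creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)
    client = gspread.authorize(creds)
    worksheet = client.open('Sheet_1').get_worksheet(0)
    return clean_urls(worksheet.col_values(2))


class SheetStartUrlsMiddleware:
    """Hypothetical spider middleware: start requests come from the sheet."""

    def process_start_requests(self, start_requests, spider):
        import scrapy  # deferred for the same reason as above

        # The spider's own start requests are ignored, as in ExampleMiddleware.
        for url in load_sheet_urls():
            yield scrapy.Request(url=url)
```

It would be registered under SPIDER_MIDDLEWARES in the same way as ExampleMiddleware. Note that the sheet is read once per crawl, when the engine starts consuming the start requests.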
Answered By - SuperUser