Issue
I have created a script that scrapes some elements from a webpage and then follows the link attached to each listing to grab additional info from that page. However, it scrapes relatively slowly: I get ~300 items/min, and my guess is that the cause is the structure of my scraper — how it gathers the requests, follows the URLs, and scrapes the info. Might this be the case, and how can I improve the speed?
import scrapy
from scrapy.item import Field
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.crawler import CrawlerProcess
from price_parser import Price


def get_price(price_raw):
    price_object = Price.fromstring(price_raw)
    return price_object.amount_float


def get_currency(price_raw):
    price_object = Price.fromstring(price_raw)
    return price_object.currency


class VinylItem(scrapy.Item):
    title = Field(output_processor=TakeFirst())
    label = Field()
    media_condition = Field(input_processor=MapCompose(str.strip),
                            output_processor=TakeFirst())
    sleeve_condition = Field(output_processor=TakeFirst())
    location = Field(input_processor=MapCompose(str.strip),
                     output_processor=Join())
    price = Field(input_processor=MapCompose(get_price),
                  output_processor=TakeFirst())
    currency = Field(input_processor=MapCompose(get_currency),
                     output_processor=TakeFirst())
    rated = Field(input_processor=MapCompose(str.strip),
                  output_processor=Join())
    have_vinyl = Field(output_processor=TakeFirst())
    want_vinyl = Field(output_processor=TakeFirst())
    format = Field(input_processor=MapCompose(str.strip),
                   output_processor=Join())
    released = Field(input_processor=MapCompose(str.strip),
                     output_processor=Join())
    genre = Field(input_processor=MapCompose(str.strip),
                  output_processor=Join())
    style = Field(input_processor=MapCompose(str.strip),
                  output_processor=Join())


class VinylSpider(scrapy.Spider):
    name = 'vinyl'
    # allowed_domains = ['x']
    start_urls = ['https://www.discogs.com/sell/list?format=Vinyl']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        content = response.xpath("//table[@class='table_block mpitems push_down table_responsive']//tbody//tr")
        for items in content:
            loader = ItemLoader(VinylItem(), selector=items)
            loader.add_xpath('title', "(.//strong//a)[position() mod 2=1]//text()")
            loader.add_xpath('label', './/p[@class="hide_mobile label_and_cat"]//a//text()')
            loader.add_xpath('media_condition', '(.//p[@class="item_condition"]//span)[position() mod 3=0]//text()')
            loader.add_xpath('sleeve_condition', './/p[@class="item_condition"]//span[@class="item_sleeve_condition"]//text()')
            loader.add_xpath('location', '(.//td[@class="seller_info"]//li)[position() mod 3=0]//text()')
            loader.add_xpath('price', '(//tbody//tr//td//span[@class="price"])[position() mod 2=0]//text()')
            loader.add_xpath('currency', '(//tbody//tr//td//span[@class="price"])[position() mod 2=0]//text()')
            loader.add_xpath('rated', './/td//div[@class="community_rating"]//text()')
            loader.add_xpath('have_vinyl', '(.//td//div[@class="community_result"]//span[@class="community_label"])[contains(text(),"have")]//text()')
            loader.add_xpath('want_vinyl', '(.//td//div[@class="community_result"]//span[@class="community_label"])[contains(text(),"want")]//text()')
            links = items.xpath('.//td[@class="item_description"]//strong//@href').get()
            yield response.follow(
                response.urljoin(links),
                callback=self.parse_vinyls,
                cb_kwargs={'loader': loader}
            )
        next_page = response.xpath('(//ul[@class="pagination_page_links"]//a)[last()]//@href').get()
        if next_page:
            yield response.follow(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_vinyls(self, response, loader):
        # loader = ItemLoader(VinylItem(), selector=response)
        loader.add_value('format', response.xpath("(.//div[@id='page_content']//div[5])[1]//text()").get())
        loader.add_value('released', response.xpath("(.//div[@id='page_content']//div[9])[1]//text()").get())
        loader.add_value('genre', response.xpath("(.//div[@id='page_content']//div[11])[1]//text()").get())
        loader.add_value('style', response.xpath("(.//div[@id='page_content']//div[13])[1]//text()").get())
        yield loader.load_item()


process = CrawlerProcess(
    settings={
        'FEED_URI': 'vinyl.jl',
        'FEED_FORMAT': 'jsonlines'
    }
)
process.crawl(VinylSpider)
process.start()
Solution
From the code snippet you have provided, your scraper is already set up efficiently: it yields many requests at a go, which lets Scrapy handle the concurrency.

There are a couple of settings you can tweak to increase the scraping speed. However, note that the first rule of scraping is to do no harm to the website you are scraping. Below is a sample of the settings you can tweak:
- Increase the value of CONCURRENT_REQUESTS (defaults to 16 in Scrapy).
- Increase the value of CONCURRENT_REQUESTS_PER_DOMAIN (defaults to 8 in Scrapy).
- Increase REACTOR_THREADPOOL_MAXSIZE (the Twisted IO thread pool maximum size) so that DNS resolution is faster.
- Reduce the log level: LOG_LEVEL = 'INFO'.
- Disable cookies if you do not require them: COOKIES_ENABLED = False.
- Reduce the download timeout: DOWNLOAD_TIMEOUT = 15.
- Reduce the value of DOWNLOAD_DELAY if your internet speed is fast and you are sure the website you are targeting can handle it. This is not recommended.
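To make this concrete, here is a minimal sketch of how those settings could be passed to the existing CrawlerProcess. The specific values are illustrative assumptions, not tuned for Discogs — raise them gradually and watch for bans or errors:

```python
# Illustrative values only -- tune gradually and respect the target site.
speed_settings = {
    'FEED_URI': 'vinyl.jl',
    'FEED_FORMAT': 'jsonlines',
    'CONCURRENT_REQUESTS': 32,             # Scrapy default: 16
    'CONCURRENT_REQUESTS_PER_DOMAIN': 16,  # Scrapy default: 8
    'REACTOR_THREADPOOL_MAXSIZE': 20,      # bigger Twisted thread pool -> faster DNS
    'LOG_LEVEL': 'INFO',                   # less logging overhead than the default DEBUG
    'COOKIES_ENABLED': False,              # skip cookie processing entirely
    'DOWNLOAD_TIMEOUT': 15,                # give up on very slow responses sooner
}

# Pass the dict to the existing CrawlerProcess in place of the current settings:
# process = CrawlerProcess(settings=speed_settings)
```

Doubling the concurrency settings while keeping everything else unchanged is usually the first experiment worth running, since the bottleneck for a crawl like this is almost always waiting on network responses rather than parsing.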
Read more about these settings in the Scrapy docs.

If the above settings do not solve your problem, you may need to look into distributed crawling.
Answered By - msenior_