Issue
I'm attempting to collect information about all of the products sold on groceries.aldi.co.uk. I have some experience scraping similar websites and have used the CrawlSpider to do so.
When I run the spider it seems to crawl throughout the website, but does not return any of the items. I've tried multiple different rule combinations as I suspect the issue is linked to these, but I haven't been able to fix it.
Any help would be really appreciated.
Here's my spider code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from aldiscraper.items import AldiscraperItem
from scrapy.loader import ItemLoader
from datetime import datetime
import re


class AldiSpider(CrawlSpider):
    name = 'aldi'
    start_urls = ['https://groceries.aldi.co.uk/']

    rules = (
        Rule(LinkExtractor(allow='en-GB/', deny=r'/\d\d\d\d\d\d\d\d\d\d\d\d\d')),
        Rule(LinkExtractor(allow=r'/\d\d\d\d\d\d\d\d\d\d\d\d\d'), callback='parse_products')
    )

    custom_settings = {
        'FEED_EXPORT_FIELDS': [
            'prod_id',
            'name',
            'size',
            'price',
            'scrape_date',
        ],
    }

    def parse_products(self, response):
        item = AldiscraperItem()
        item['prod_id'] = response.css('span.sku.small::text').get()
        item['name'] = response.css('h1.my-0::text').get()
        item['size'] = response.css('span.text-black-50.font-weight-bold::text').get()
        item['price'] = response.css('span.product-price.h4.m-0.font-weight-bold::text').get()
        item['scrape_date'] = datetime.now().strftime('%d/%m/%Y')
        yield item
I originally tried to run the spider using the following rules, with the same results:
class AldiSpider(CrawlSpider):
    name = 'aldi'
    start_urls = ['https://groceries.aldi.co.uk/']

    rules = (
        Rule(LinkExtractor(allow='en-GB/', deny='en-GB/p-')),
        Rule(LinkExtractor(allow='en-GB/p-'), callback='parse_products')
    )
I'm using this command to run the spider:
scrapy crawl aldi -O aldi.csv
And here's an extract from the logs
2022-04-11 19:32:04 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: aldi)
2022-04-11 19:32:04 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.19044-SP0
2022-04-11 19:32:04 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'aldi',
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 0.75,
'FEED_EXPORT_FIELDS': ['prod_id', 'name', 'size', 'price', 'scrape_date'],
'NEWSPIDER_MODULE': 'aldiscraper.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['aldiscraper.spiders']}
2022-04-11 19:32:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-11 19:32:04 [scrapy.extensions.telnet] INFO: Telnet Password: 2072eb077b8dcc04
2022-04-11 19:32:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-04-11 19:32:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-11 19:32:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-11 19:32:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-11 19:32:06 [scrapy.core.engine] INFO: Spider opened
2022-04-11 19:32:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-11 19:32:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-11 19:32:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/robots.txt> (referer: None)
2022-04-11 19:32:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://groceries.aldi.co.uk/en-GB/> from <GET https://groceries.aldi.co.uk/>
2022-04-11 19:32:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/> (referer: None)
2022-04-11 19:32:09 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://groceries.aldi.co.uk/en-GB/#footer-collapse-0> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2022-04-11 19:32:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/forgot-password> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=sliced-bread&clickedon=sliced-bread> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&clickedon=bread> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=shopall-bakery&clickedon=shopall-bakery> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&clickedon=bakery> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/vegan-alcohol?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=vegan-alcohol&clickedon=vegan-alcohol> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/milk-alternatives?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=milk-alternatives&clickedon=milk-alternatives> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=shopall-vegan-drinks&clickedon=shopall-vegan-drinks> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&clickedon=vegan-drinks> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-sides-snacks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=vegan-sides-snacks&clickedon=vegan-sides-snacks> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-meat-alternatives?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=meat-alternatives&clickedon=meat-alternatives> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=shopall-vegan-food&clickedon=shopall-vegan-food> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&clickedon=vegan-food> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=shopall-vegan-plant-based&clickedon=shopall-vegan-plant-based> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range?origin=dropdown&c1=groceries&c2=vegan-plant-based&clickedon=vegan-plant-based> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/Grocery-Click-and-Collect/Notify-Me> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/Modern-Slavery-Act?origin=footer&c1=about-aldi&c2=modern-slavery-act&clickedon=modern-slavery-act> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:32:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?sortDirection=asc&page=3> (referer: https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=sliced-bread&clickedon=sliced-bread)
2022-04-11 19:32:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=7> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:32:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=6> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:32:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=5> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:32:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=4> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:32:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/vegan-alcohol?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/vegan-alcohol?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=vegan-alcohol&clickedon=vegan-alcohol)
2022-04-11 19:32:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/milk-alternatives?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/milk-alternatives?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=milk-alternatives&clickedon=milk-alternatives)
2022-04-11 19:33:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/vegan-alcohol?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=vegan-alcohol&clickedon=vegan-alcohol)
2022-04-11 19:33:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=shopall-vegan-drinks&clickedon=shopall-vegan-drinks)
2022-04-11 19:33:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-sides-snacks?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-sides-snacks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=vegan-sides-snacks&clickedon=vegan-sides-snacks)
2022-04-11 19:33:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-meat-alternatives?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-meat-alternatives?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=meat-alternatives&clickedon=meat-alternatives)
2022-04-11 19:33:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=shopall-vegan-food&clickedon=shopall-vegan-food)
2022-04-11 19:33:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-food/vegan-sides-snacks?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-food&c4=vegan-sides-snacks&clickedon=vegan-sides-snacks)
2022-04-11 19:33:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range?sortDirection=asc&page=3> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=shopall-vegan-plant-based&clickedon=shopall-vegan-plant-based)
2022-04-11 19:33:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range?sortDirection=asc&page=2> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=shopall-vegan-plant-based&clickedon=shopall-vegan-plant-based)
2022-04-11 19:33:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/Grocery-Click-and-Collect> (referer: https://groceries.aldi.co.uk/en-GB/Grocery-Click-and-Collect/Notify-Me)
2022-04-11 19:33:06 [scrapy.extensions.logstats] INFO: Crawled 36 pages (at 36 pages/min), scraped 0 items (at 0 items/min)
2022-04-11 19:33:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/Login> (referer: https://groceries.aldi.co.uk/en-GB/Grocery-Click-and-Collect/Notify-Me)
2022-04-11 19:33:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=shopall-vegan-plant-based&clickedon=shopall-vegan-plant-based)
2022-04-11 19:33:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/vegan-range> (referer: https://groceries.aldi.co.uk/en-GB/vegan-range/vegan-drinks/vegan-alcohol?origin=dropdown&c1=groceries&c2=vegan-plant-based&c3=vegan-drinks&c4=vegan-alcohol&clickedon=vegan-alcohol)
2022-04-11 19:33:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=3> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=2> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/easter/hot-cross-buns> (referer: https://groceries.aldi.co.uk/en-GB/bakery?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=shopall-bread&clickedon=shopall-bread)
2022-04-11 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?sortDirection=asc&page=2> (referer: https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=sliced-bread&clickedon=sliced-bread)
2022-04-11 19:33:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/About-Click--Collect?origin=footer&c1=about-aldi&c2=covid-19&clickedon=covid-19> (referer: https://groceries.aldi.co.uk/en-GB/)
2022-04-11 19:33:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/bakery/bread?sortDirection=asc&page=1> (referer: https://groceries.aldi.co.uk/en-GB/bakery/bread?origin=dropdown&c1=groceries&c2=bakery&c3=bread&c4=sliced-bread&clickedon=sliced-bread)
2022-04-11 19:33:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://groceries.aldi.co.uk/en-GB/Privacy-Notice?origin=footer&c1=help&c2=privacy-notice&clickedon=privacy-notice> (referer: https://groceries.aldi.co.uk/en-GB/)
And finally, here are the stats:
2022-04-11 19:49:55 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-11 19:49:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 5,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 5,
'downloader/request_bytes': 539252,
'downloader/request_count': 1095,
'downloader/request_method_count/GET': 1095,
'downloader/response_bytes': 56319721,
'downloader/response_count': 1095,
'downloader/response_status_count/200': 1087,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 3,
'downloader/response_status_count/404': 4,
'dupefilter/filtered': 455167,
'elapsed_time_seconds': 1068.912414,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 11, 18, 49, 55, 791572),
'httpcompression/response_bytes': 370815991,
'httpcompression/response_count': 1091,
'log_count/DEBUG': 1102,
'log_count/INFO': 27,
'request_depth_max': 4,
'response_received_count': 1091,
'robotstxt/forbidden': 5,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1099,
'scheduler/dequeued/memory': 1099,
'scheduler/enqueued': 1099,
'scheduler/enqueued/memory': 1099,
'start_time': datetime.datetime(2022, 4, 11, 18, 32, 6, 879158)}
2022-04-11 19:49:55 [scrapy.core.engine] INFO: Spider closed (finished)
The only other output is a completely blank CSV file.
I can't understand why it is scraping the pages but not returning any items. Thanks in advance for any help you can give me!
Thanks,
Chris
Solution
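The product URLs are populated dynamically by JavaScript, and CrawlSpider (like Scrapy in general) cannot render JavaScript. The links that your second rule looks for are therefore presumably never present in the HTML the spider downloads, so parse_products is never called. You have two options: use a browser automation tool such as Selenium together with Scrapy, which is somewhat complex, or fetch the data directly from the site's hidden API, if one exists. Here is an example of how to extract data from the API; note that the request URL contains "api".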
import json

import scrapy
from scrapy.crawler import CrawlerProcess


class AldiSpider(scrapy.Spider):
    name = 'aldi'

    def start_requests(self):
        # API endpoint that returns the product set as JSON
        api_url = 'https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter'
        headers = {
            "x-requested-with": "XMLHttpRequest",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36",
        }
        yield scrapy.Request(
            url=api_url,
            method="GET",
            headers=headers,
            callback=self.parse,
        )

    def parse(self, response):
        # The body is JSON; each entry is expected to carry a 'DisplayName' key
        resp = json.loads(response.body)
        for item in resp:
            yield {'DisplayName': item.get('DisplayName')}


if __name__ == "__main__":
    # Run the spider as a plain script, without the scrapy CLI
    process = CrawlerProcess()
    process.crawl(AldiSpider)
    process.start()
Output:
{'DisplayName': 'Dairyfine Sparkle The Unicorn Milk Chocolate Egg & Toy 100g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Dairyfine Bumper Egg Hunt Pack 700g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Moser Roth Belgian Milk & White Chocolate Caramel Egg Slab 120g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Moser Roth Belgian Dark Chocolate & Raspberry Egg Slab 120g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Dairyfine Giant Milk Chocolate Bunny 300g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Kit Kat Chunky Milk Chocolate Large Easter Egg 230g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Dairyfine Milk Chocolate Bunnies 125g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Moser Roth Milk Chocolate Truffle Luxury Filled Eggs 150g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Dairyfine Mini Mix Ups 212g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Dairyfine Eggjoyables Cookies & Cream 144g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Dairyfine White Hot Chocolate Melting Unicorn 65g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Dairyfine Milk Hot Chocolate Melting Chick 65g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Dairyfine Milk Chocolate Bunny 125g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Dairyfine Milk Chocolate Easter Bunny Lollies 10 Pack'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Dairyfine Milk Chocolate Eggs 100g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Maltesers Chocolate Mini Bunnies Bag 58g'}
2022-04-12 03:56:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://groceries.aldi.co.uk/api/aldisearchquery/productset?productSetName=Easter>
{'DisplayName': 'Dairyfine Mini Chocolate Eggs 80g'}
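If you also want the scrape_date column from the question, the JSON handling can be pulled out into a small helper that is easy to test without hitting the site. This is only a minimal sketch: the function name parse_product_set and its today parameter are made up for illustration, and it assumes (as the spider above does) that the body is a JSON list of objects with a 'DisplayName' key. No other API fields are assumed.

```python
import json
from datetime import datetime


def parse_product_set(body, today=None):
    """Turn the API's JSON body into rows with a name and a scrape date.

    Hypothetical helper: assumes the body is a JSON list of objects that
    carry a 'DisplayName' key, as in the spider above.
    """
    today = today or datetime.now()
    scrape_date = today.strftime('%d/%m/%Y')  # same date format as in the question
    rows = []
    for product in json.loads(body):
        name = product.get('DisplayName')
        if name is None:
            continue  # skip entries without a display name
        rows.append({'name': name.strip(), 'scrape_date': scrape_date})
    return rows


if __name__ == '__main__':
    # Small offline sample; the second entry has no DisplayName and is skipped
    sample = b'[{"DisplayName": "Dairyfine Mini Mix Ups 212g"}, {"Other": 1}]'
    for row in parse_product_set(sample, today=datetime(2022, 4, 12)):
        print(row)
```

Inside the spider, the parse method could then be reduced to something like "yield from parse_product_set(response.body)", and running with -O aldi.csv should produce name and scrape_date columns.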
Answered By - F.Hoque