Issue
I'm trying to extract the titles of some products, but my spider yields an empty list every time. I tried grabbing the CSS and XPath of the title with the SelectorGadget extension, and then by inspecting the element, but neither approach worked.
These are the CSS and XPath selectors (from SelectorGadget) and the inspect-element selector that I tried, none of which worked:
css:
response.css('.eyNLqb > span > span > span:nth-child(1)').css('::text').extract()
xpath:
response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "eyNLqb", " " ))]//>//span//>//span//>//span[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]/text()').extract()
inspectingElement:
response.css('div.sc-9d1cc060-20.eyNLqb span span span').css('::text').extract()
here is the full code:
import scrapy
from ..items import NoonItem

class NoonspiderSpider(scrapy.Spider):
    name = 'noonspider'
    allowed_domains = ['noon.com']
    start_urls = ['https://www.noon.com/uae-en/search/?q=figurine']

    def parse(self, response):
        items = NoonItem()
        items['title'] = response.css('.eyNLqb span').css('::text').extract()
        yield items
here is items.py
import scrapy

class NoonItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
here is the log
2022-11-27 02:10:27 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: noon)
2022-11-27 02:10:27 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.0.1, Twisted 22.10.0, Python 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.3, Platform Windows-8.1-6.3.9600-SP0
2022-11-27 02:10:27 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'noon',
'NEWSPIDER_MODULE': 'noon.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['noon.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-11-27 02:10:27 [asyncio] DEBUG: Using selector: SelectSelector
2022-11-27 02:10:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-11-27 02:10:27 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2022-11-27 02:10:27 [scrapy.extensions.telnet] INFO: Telnet Password: d24399c47a8a1a1f
2022-11-27 02:10:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-11-27 02:10:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-11-27 02:10:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-27 02:10:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-11-27 02:10:28 [scrapy.core.engine] INFO: Spider opened
2022-11-27 02:10:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-27 02:10:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-11-27 02:10:28 [filelock] DEBUG: Attempting to acquire lock 354146442064 on C:\Users\Mohamed.aldhuhoori\.cache\python-tldextract\3.9.5.final__ScrapyTutorial__4afa8a__tldextract-3.4.0\publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-27 02:10:28 [filelock] DEBUG: Lock 354146442064 acquired on C:\Users\Mohamed.aldhuhoori\.cache\python-tldextract\3.9.5.final__ScrapyTutorial__4afa8a__tldextract-3.4.0\publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-27 02:10:29 [filelock] DEBUG: Attempting to release lock 354146442064 on C:\Users\Mohamed.aldhuhoori\.cache\python-tldextract\3.9.5.final__ScrapyTutorial__4afa8a__tldextract-3.4.0\publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-27 02:10:29 [filelock] DEBUG: Lock 354146442064 released on C:\Users\Mohamed.aldhuhoori\.cache\python-tldextract\3.9.5.final__ScrapyTutorial__4afa8a__tldextract-3.4.0\publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-27 02:10:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.noon.com/robots.txt> (referer: None)
2022-11-27 02:10:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.noon.com/uae-en/search/?q=figurine> (referer: None)
2022-11-27 02:10:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/uae-en/search/?q=figurine>
{'title': []}
2022-11-27 02:10:29 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-27 02:10:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 477,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 71960,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 1.162434,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 26, 22, 10, 29, 841064),
'httpcompression/response_bytes': 285255,
'httpcompression/response_count': 2,
'item_scraped_count': 1,
'log_count/DEBUG': 10,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 11, 26, 22, 10, 28, 678630)}
2022-11-27 02:10:29 [scrapy.core.engine] INFO: Spider closed (finished)
Solution
They are using JavaScript to load the page dynamically, so the product titles are not in the HTML that Scrapy downloads and the selectors match nothing. Fortunately, their search API is fairly straightforward and most likely provides all of the information you are looking for.
import scrapy

class NoonspiderSpider(scrapy.Spider):
    name = 'noonspider'
    allowed_domains = ['noon.com']
    start_urls = ['https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine']

    def parse(self, response):
        for i in response.json()["hits"]:
            yield {'title': i['name']}
{'title': 'Gold Plated Attractive Jewelry Box Multicolour '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Football World Cup Trophy Gold '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Sitting Camel Figurine Multicolour '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Gold Plated Attractive Decorative Aftaba Set Golden '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Merry Go Round Carousel Music Box Pink 110x190x110millimeter '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Astronaut Moon Lamp Spaceman Night Light Battery Operated Space Figurine Desktop Lamp Gifts for Outerspace Party Favors Bedroom Decor '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': "Arts & Crafts Toys,Two Boxes Pack Children's Plaster Painting Set,DIY Graffiti Toys,Paint Your Own Figurines,STEAM Creative DIY Toys,Ceramics Plaster Painting Set Gift Toys For 6+ Year Old Boys & Girl "}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Christmas Ornaments Santa Claus Figurine '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Harry Potter Black Edition Classic Mini Music Box '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Home Decor Sculptures Collectible Figurines Stand Artwork Modern Graffiti Art for Home Decor Living Room Bedroom Office Retail Decoration Gift Bulldog Statue Resin L25xH16xW10 cm '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Standing Camel Figurine Brown '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Feng Shui Natural Citrine Gem Money Tree Yellow 470g '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Trivia Metal Abstract Flute Man Figurine Silver 20cm '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Fuse Face Changer Figurine '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': "Music Box Love Engraved Vintage Music Box Best Gift for Girlfriend Valentine's Day to Girlfriend "}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Tacto Dino by PlayShifu - Interactive Dinosaur Figurines | Explore 100+ Facts | Works with iPads, Android tablets, Amazon Fire tablets | Gift for Boys & Girls, Ages 4-8 (Tablet Not Included) 31 x 26 x 6cm '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Grendizer SFC Collectible PVC Figure 33cm Tall Statue Anime Manga Figurine Home Room Office Décor Gift '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'You are My Sunshine Wood Music Box for Wife Daughter Son Laser Engraved Vintage Wooden Hand Crank Music Box Gifts '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Bell with Doll Figurine 2.5inch Assorted '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': '6pcs/set Jujutsu Kaisen Anime Figure Itadori Yuji Fushiguro Megumi Action Figure Gojo Satoru Kugisaki Nobara Figurine Model Toys '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Black Panther Figurine '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Trivia Metal Kick In Progress Figurine Silver 24cm '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Modern Cute Coin Bank Box Resin KAWS Figurine Home Decorations Coin Storage Box Holder Toy Child Gift Organizer Money Box KAWS Blue Type B 10x25x9cm '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Christmas Ornaments Santa Claus Figurine '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Vintage Gramophone Shaped Music Box Gold/Red 130x225x107millimeter '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Standing Animal Collection Figurine Fox DY1940-3 Multicolour '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Wooden Music Box Hand Crank Carved Vintage Mechanism Music Box for Home Decor Gifts '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Bald Eagle Figurine '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Lighted Nativity Crèche Figurine Gold/Brown/Red 2.99inch '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'New Pink Wooden Merry Go Round Carousel Classic Music Box Gift Toy '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Burj Khalifa And Sitting Camel With Waterball On Top Figurine Multicolour '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Trivia Metal Men With Log Gold 14cm '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Inaara Metal Hanging Man Figurine Gold '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': "Vintage Merry-Go-Round Horse Valentine's Birthday Gift Carousel Music Box Pink 11 x 18cm "}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Girl Dress Figurine For Home Décor Gold/Grey 36.6x23.2x52cm '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'My Daughter You Are Wood Music Box for Wife Daughter Son Dad Laser Engraved Vintage Wooden Hand Crank Music Box Gifts '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Electroplating Trophy Gold 52cm '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Roblox Game Zombie Attack Playset 7cm PVC Suite Dolls Action Figures Boys Toys Model Figurines for Collection Birthday Gifts for Kids (21 pcs) '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Merry Go Round Carousel Music Box Blue 110x190x110millimeter '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Attractive Jewelry Box Golden/White '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Scout Wooden Horse Bust Figurine Brown '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Astronaut Figurines Cake Topper Outer Space Birthday Decoration Spaceman Model Display Miniature Toys Set Planet Rocket Pearl Balls and Star DIY Toppers for Kids Party (4Pcs) '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Snoopy With Woodstock In Nest Collectible Figure Beige/Green/Brown 6.75inch '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Electronic Burj Al Arab Showpiece with USB Cable Silver/White '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Romantic Christmas Trees Music Box White/Red/Blue 21x11centimeter '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Tumbler 3D Figurine Avengers Comic Heroes Iron Man 360ml '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Cinderella Ball Dress Figurine '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Double Sided Camel With Waterball on Top Figurines Multicolour '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Midnight Dragon Water Snow Globe Figurine Grey/Blue/Green 3.5inch '}
2022-11-26 15:03:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.noon.com/_svc/catalog/api/v3/u/search/?q=figurine>
{'title': 'Sheep Shape Creative Hanging Double sided Black White Message Board Hanging Black 36 x 1 x 28cm '}
2022-11-26 15:03:14 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-26 15:03:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 329,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 79492,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.721553,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 26, 23, 3, 14, 214060),
'item_scraped_count': 50,
'log_count/DEBUG': 56,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 11, 26, 23, 3, 13, 492507)}
2022-11-26 15:03:14 [scrapy.core.engine] INFO: Spider closed (finished)
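The titles in the output above all carry a trailing space, and the spider would raise a KeyError if a hit ever lacked a "name". As a rough sketch only, a small helper could tidy that up. The "hits" and "name" keys come from the spider above; the helper name extract_titles and the sample payload are made up for illustration, and nothing else about the API's response shape is assumed to be known.

```python
from typing import Any, Dict, Iterator


def extract_titles(payload: Dict[str, Any]) -> Iterator[Dict[str, str]]:
    """Yield one {'title': ...} item per search hit that has a usable name."""
    # "hits" and "name" are the keys used in the spider above; the rest of
    # the payload's shape is unknown, so every lookup here is defensive.
    for hit in payload.get("hits") or []:
        name = hit.get("name") if isinstance(hit, dict) else None
        if isinstance(name, str) and name.strip():
            # The sample output shows trailing spaces in the names.
            yield {"title": name.strip()}


# Hypothetical payload standing in for response.json().
sample = {
    "hits": [
        {"name": "Bald Eagle Figurine "},
        {"sku": "no-name-here"},
        {"name": "   "},
        {"name": "Black Panther Figurine "},
    ]
}
titles = list(extract_titles(sample))
print(titles)
```

Inside the spider, parse could then be reduced to `yield from extract_titles(response.json())`.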
Answered By - Alexander