Issue
I am doing a practice project about scraping dynamically loaded content with scrapy-playwright, but I have hit a wall and cannot figure out what the issue is. The spider simply refuses to start the crawling process and gets stuck at the "Telnet console listening on 127.0.0.1:6023" part.
I set up the project as recommended in the tutorial.
This is what the relevant part of my settings.py looks like (I also played around with other settings, such as CONCURRENT_REQUESTS and COOKIES_ENABLED, to try to fix it, but nothing changed):
import asyncio
from scrapy.utils.reactor import install_reactor
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
And this is the spider itself:
import scrapy
from scrapy import Request
from scrapy_playwright.page import PageMethod


class roksh_crawler(scrapy.Spider):
    name = "roksh_crawler"

    def start_requests(self):
        yield Request(
            url="https://www.roksh.com/",
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("screenshot", path="example.png", full_page=True),
                ],
            },
        )

    def parse(self, response):
        screenshot = response.meta["playwright_page_methods"][0]
        # screenshot.result contains the image's bytes
I tried to take a screenshot of the page, but nothing else works either, so I assume this is not the issue.
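For completeness, my understanding from the scrapy-playwright docs is that each PageMethod's return value lands on its result attribute once the request completes, so if the crawl ever got that far, the callback could save the bytes itself. Something like this sketch (the output filename and yielded item are just examples, not part of my actual project):

    def parse(self, response):
        # The PageMethod objects from the request meta come back in response.meta,
        # with the return value of the Playwright call available on .result.
        screenshot = response.meta["playwright_page_methods"][0]
        with open("example_from_callback.png", "wb") as f:
            f.write(screenshot.result)
        yield {"url": response.url}

But the spider never even reaches the callback.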
And here is the log I am getting:
2022-11-24 09:54:19 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: roksh_crawler)
2022-11-24 09:54:19 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.0.1, Twisted 21.7.0, Python 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.3, Platform Windows-10-10.0.19045-SP0
2022-11-24 09:54:19 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'roksh_crawler', 'CONCURRENT_REQUESTS': 32, 'NEWSPIDER_MODULE': 'roksh.spiders', 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['roksh.spiders'], 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-11-24 09:54:19 [asyncio] DEBUG: Using selector: SelectSelector
2022-11-24 09:54:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-11-24 09:54:19 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2022-11-24 09:54:19 [scrapy.extensions.telnet] INFO: Telnet Password: 7aad12ee78cfff92
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled item pipelines: []
2022-11-24 09:54:19 [scrapy.core.engine] INFO: Spider opened
2022-11-24 09:54:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:54:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-11-24 09:55:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:56:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:57:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:58:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:59:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:00:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:01:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:02:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:03:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
And this goes on indefinitely.
I also tried different URLs but got the same result, so I assume the problem is on my end, not the server's. Also, if I run the spider without Playwright (i.e. if I take the DOWNLOAD_HANDLERS out of the settings), then it works, although it only returns the source HTML, which is not my desired result.
Solution
It works fine for me.
Just remove or comment out these lines in your settings.py file:
# import asyncio
# from scrapy.utils.reactor import install_reactor
# install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
# asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
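Scrapy installs the reactor named in the TWISTED_REACTOR setting on its own, so the manual install_reactor() call is redundant, and forcing WindowsSelectorEventLoopPolicy is the likely cause of the hang: on Windows the selector event loop cannot spawn subprocesses, which Playwright needs in order to launch the browser, so every playwright request just sits there. After removing those lines, the relevant part of settings.py would look roughly like this (a minimal sketch; the non-Playwright values are simply the ones from your own settings and log):

# settings.py -- minimal scrapy-playwright setup
BOT_NAME = "roksh_crawler"
SPIDER_MODULES = ["roksh.spiders"]
NEWSPIDER_MODULE = "roksh.spiders"
ROBOTSTXT_OBEY = True

# Scrapy installs this reactor itself at startup; no install_reactor() call
# and no custom asyncio event loop policy are needed.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}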
Answered By - Alexander