Issue
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import protego


class NirsoftSpider(CrawlSpider):
    name = 'sotw'
    allowed_domains = ['www.shadowofthewyrm.org']
    start_urls = ['https://www.shadowofthewyrm.org/downloads.html']

    rules = (
        Rule(LinkExtractor(allow=r'/releases/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        file_url = response.css('attr(href)').get()
        file_url = response.urljoin(file_url)
        yield {'file_url': file_url}
Output from crawl:
2022-04-08 14:43:47 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: SotW)
2022-04-08 14:43:47 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 36.0.2, Platform Windows-10-10.0.19043-SP0
2022-04-08 14:43:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'SotW',
'NEWSPIDER_MODULE': 'SotW.spiders',
'SPIDER_MODULES': ['SotW.spiders']}
2022-04-08 14:43:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-08 14:43:47 [scrapy.extensions.telnet] INFO: Telnet Password: 1a18143a1c486859
2022-04-08 14:43:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-04-08 14:43:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-08 14:43:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-08 14:43:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-08 14:43:47 [scrapy.core.engine] INFO: Spider opened
2022-04-08 14:43:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-08 14:43:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-08 14:43:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.shadowofthewyrm.org/downloads.html> (referer: None)
2022-04-08 14:43:47 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-08 14:43:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 292,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1571,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.282146,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 8, 19, 43, 47, 996453),
'httpcompression/response_bytes': 2629,
'httpcompression/response_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 4, 8, 19, 43, 47, 714307)}
2022-04-08 14:43:47 [scrapy.core.engine] INFO: Spider closed (finished)
Specifically, what I'm trying to pull is the Windows download link from this website, as a way to "auto-update" the program. The download link itself is https://www.shadowofthewyrm.org/releases/ShadowOfTheWyrm-Win-1.4.3.zip and the base URL is https://www.shadowofthewyrm.org/downloads.html, but I'm unsure how to pull it.
I tried removing the "allow" rule of the LinkExtractor to see if the zip download shows up, but still no luck there either. Any help would be appreciated, thank you.
Solution
Your LinkExtractor doesn't work because the href contains only releases/ (check the HTML source code). In fact, you don't need to follow this link at all if you don't want to download the file. But if you do follow it, after the download you'll have everything you need in response.url:
def parse_item(self, response):
    yield {'file_url': response.url}
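If you only want the URL itself and would rather not request the zip at all, you can also read the href straight off downloads.html with a plain Spider and urljoin it. This is just a minimal sketch of that idea; the spider name and the endswith('.zip') filter are my own assumptions, not something from the original post:

import scrapy


class SotwLinkSpider(scrapy.Spider):
    # Hypothetical spider: pulls the .zip link off downloads.html
    # without ever requesting the zip file itself.
    name = 'sotw_link'
    allowed_domains = ['www.shadowofthewyrm.org']
    start_urls = ['https://www.shadowofthewyrm.org/downloads.html']

    def parse(self, response):
        # Grab every href on the page and keep the ones pointing at a .zip.
        for href in response.css('a::attr(href)').getall():
            if href.endswith('.zip'):
                # urljoin turns a relative href such as releases/... into
                # https://www.shadowofthewyrm.org/releases/...
                yield {'file_url': response.urljoin(href)}

Run it with something like scrapy crawl sotw_link -O link.json and the yielded file_url can then be handed to whatever actually performs the update/download.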
Answered By - gangabass