Issue
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import protego


class NirsoftSpider(CrawlSpider):
    name = 'sotw'
    allowed_domains = ['www.shadowofthewyrm.org']
    start_urls = ['https://www.shadowofthewyrm.org/downloads.html']

    rules = (
        Rule(LinkExtractor(allow=r'/releases/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        file_url = response.css('attr(href)').get()
        file_url = response.urljoin(file_url)
        yield {'file_url': file_url}
Output from crawl:
2022-04-08 14:43:47 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: SotW)
2022-04-08 14:43:47 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 36.0.2, Platform Windows-10-10.0.19043-SP0
2022-04-08 14:43:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'SotW',
'NEWSPIDER_MODULE': 'SotW.spiders',
'SPIDER_MODULES': ['SotW.spiders']}
2022-04-08 14:43:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-08 14:43:47 [scrapy.extensions.telnet] INFO: Telnet Password: 1a18143a1c486859
2022-04-08 14:43:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-04-08 14:43:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-08 14:43:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-08 14:43:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-08 14:43:47 [scrapy.core.engine] INFO: Spider opened
2022-04-08 14:43:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-08 14:43:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-08 14:43:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.shadowofthewyrm.org/downloads.html> (referer: None)
2022-04-08 14:43:47 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-08 14:43:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 292,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1571,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.282146,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 8, 19, 43, 47, 996453),
'httpcompression/response_bytes': 2629,
'httpcompression/response_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 4, 8, 19, 43, 47, 714307)}
2022-04-08 14:43:47 [scrapy.core.engine] INFO: Spider closed (finished)
Specifically, what I'm trying to pull is the Windows download link from this website, as a way to "auto-update" the program. The download link itself is https://www.shadowofthewyrm.org/releases/ShadowOfTheWyrm-Win-1.4.3.zip and the base URL is https://www.shadowofthewyrm.org/downloads.html, but I'm unsure how to pull it.
I tried removing the "allow" rule of the LinkExtractor to see if the zip download shows up, but still no luck there either. Any help would be appreciated, thank you.
Solution
Your LinkExtractor doesn't work because the href contains only releases/ (check the HTML source code). In fact, you don't need to follow this link at all if you don't want to download the file. But if you do follow it, after the download you'll have everything you need in response.url:
def parse_item(self, response):
    yield {'file_url': response.url}
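If you only want the URL itself and would rather not request the zip at all, you can also read the href straight off downloads.html with a plain Spider and urljoin it. This is just a minimal sketch of that idea; the spider name and the endswith('.zip') filter are my own assumptions, not something from the original post:

import scrapy


class SotwLinkSpider(scrapy.Spider):
    # Hypothetical spider: pulls the .zip link off downloads.html
    # without ever requesting the zip file itself.
    name = 'sotw_link'
    allowed_domains = ['www.shadowofthewyrm.org']
    start_urls = ['https://www.shadowofthewyrm.org/downloads.html']

    def parse(self, response):
        # Grab every href on the page and keep the ones pointing at a .zip.
        for href in response.css('a::attr(href)').getall():
            if href.endswith('.zip'):
                # urljoin turns a relative href such as releases/... into
                # https://www.shadowofthewyrm.org/releases/...
                yield {'file_url': response.urljoin(href)}

Run it with something like scrapy crawl sotw_link -O link.json and the yielded file_url can then be handed to whatever actually performs the update/download.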
Answered By - gangabass