Issue
I'm a Scrapy enthusiast and have been scraping for about 3 months. Because I really enjoy scraping but eventually got frustrated, I excitedly purchased a proxy package from Leafproxies.
Unfortunately, when I loaded the proxies into my Scrapy spider, I received a ValueError.
I used scrapy-rotating-proxies to integrate the proxies. I added them as strings in the provider's host:port:username:password format, like below:
ROTATING_PROXY_LIST = [
    "us-retail-fast.resdleafproxies.com:5000:ksre9jXXXXXXXXI38HJg5:XXX9nh",
    "us-retail-fast.resdleafproxies.com:5000:ksre9jvXXXXXXXXk+zHtjyZRG:XXXXtf9nh",
    # ...
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 800,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 800,
}
Scrapy logs:
draco@draco:~/docs/scraping/scrapyyy/thomas$ scrapy crawl home2 -o all_np4.csv
/home/draco/.local/lib/python3.8/site-packages/scrapy/spiderloader.py:37: UserWarning: There are several spiders with the same name:
HomeSpider named 'home' (in thomas.spiders.home)
HomeSpider named 'home' (in thomas.spiders.home3)
This can cause unexpected behavior.
warnings.warn(
2022-02-21 00:16:51 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: thomas)
2022-02-21 00:16:51 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.10 (default, Nov 26 2021, 20:14:08) - [GCC 9.3.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.13.0-30-generic-x86_64-with-glibc2.29
2022-02-21 00:16:51 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-02-21 00:16:51 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'thomas',
'CLOSESPIDER_ERRORCOUNT': 10,
'CONCURRENT_REQUESTS': 3,
'CONCURRENT_REQUESTS_PER_DOMAIN': 3,
'CONCURRENT_REQUESTS_PER_IP': 5,
'COOKIES_ENABLED': False,
'DNS_TIMEOUT': 10,
'DOWNLOAD_DELAY': 2,
'DOWNLOAD_TIMEOUT': 200,
'NEWSPIDER_MODULE': 'thomas.spiders',
'SPIDER_MODULES': ['thomas.spiders']}
2022-02-21 00:16:51 [scrapy.extensions.telnet] INFO: Telnet Password: 536c802b585074b3
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'rotating_proxies.middlewares.RotatingProxyMiddleware',
'rotating_proxies.middlewares.BanDetectionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'thomas.middlewares.UserAgentRotatorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'thomas.middlewares.ThomasSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-21 00:16:51 [scrapy.core.engine] INFO: Spider opened
2022-02-21 00:16:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-21 00:16:51 [home2] INFO: Spider opened: home2
2022-02-21 00:16:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-21 00:16:51 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 30, reanimated: 0, mean backoff time: 0s)
INITIAL REQUEST
OPENING LIST https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden
OPENING LIST https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau
OPENING LIST https://www.homegate.ch/buy/apartment/canton-zurich/matching-list
2022-02-21 00:16:51 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5006:XXXXXj: XXXXXXXtf9nh> is DEAD
#....
2022-02-21 00:17:02 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> with another proxy (failed 2 times, max retries: 5)
esdleafproxies.com:5005:ksre9jva95etajxxaoll9k+cw17qdyl:xxxx9nh> is DEAD
2022-02-21 00:17:21 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau> with another proxy (failed 5 times, max retries: 5)
2022-02-21 00:17:23 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5001:XXXXXXjxxaoll9k+ZcGvdwJf:XXXXXXXtf9nh> is DEAD
2022-02-21 00:17:23 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> with another proxy (failed 5 times, max retries: 5)
2022-02-21 00:17:25 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproXXXXXXXsre9jva95etajxxaoll9k+oFx6kEXE:xxxxxxxtf9nh> is DEAD
2022-02-21 00:17:25 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden> (failed 6 times with different proxies)
OPENING LIST https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400
2022-02-21 00:17:25 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden>
Traceback (most recent call last):
File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
result = current_context.run(
File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
return handler.download_request(request, spider)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
return agent.download_request(request)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
agent = self._get_agent(request, timeout)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
_, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
return _parsed_url_args(parsed)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
port = parsed.port
File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
raise ValueError(message) from None
ValueError: Port could not be cast to integer value as '5007:ksre9jva95etajxxaoll9k+oFx6kEXE:XXXXtf9nh'
2022-02-21 00:17:28 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5006xxxxxxxxetajxxaoll9k+V2UowimU:XXXXXXf9nh> is DEAD
2022-02-21 00:17:28 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau> (failed 6 times with different proxies)
2022-02-21 00:17:28 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau>
Traceback (most recent call last):
File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
result = current_context.run(
File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
return handler.download_request(request, spider)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
return agent.download_request(request)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
agent = self._get_agent(request, timeout)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
_, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
return _parsed_url_args(parsed)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
port = parsed.port
File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
raise ValueError(message) from None
ValueError: Port could not be cast to integer value as '5006:ksre9jva95etajxxaoll9k+XXXXXX'
2022-02-21 00:17:30 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5004:XXXXXXX5etajxxaoll9k+fbg56Ioj:XXXXf9nh> is DEAD
2022-02-21 00:17:30 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> (failed 6 times with different proxies)
2022-02-21 00:17:30 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list>
Traceback (most recent call last):
File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
result = current_context.run(
File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
return handler.download_request(request, spider)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
return agent.download_request(request)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
agent = self._get_agent(request, timeout)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
_, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
return _parsed_url_args(parsed)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
port = parsed.port
File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
raise ValueError(message) from None
ValueError: Port could not be cast to integer value as '5004:XXXXXva95etajxxaoll9k+fbg56Ioj:XXXXXtf9nh'
2022-02-21 00:17:31 [rotating_proxies.middlewares] DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2022-02-21 00:17:33 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5000:XXXXXajxxaoll9k+zHtjyZRG:XXXX9nh> is DEAD
2022-02-21 00:17:33 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400> with another proxy (failed 1 times, max retries: 5)
2022-02-21 00:17:36 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5001:XXXXXXXXXetajxxaoll9k+uSsCeYH5:lXXXXXXmtf9nh> is DEAD
2022-02-21 00:17:36 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400> with another proxy (failed 2 times, max retries: 5)
ValueError: Port could not be cast to integer value as '5009:ksre9jva95etajxxaoll9k+HOggeKA3:XXXXXh'
2022-02-21 00:17:47 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-21 00:17:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/error/builtins.ValueError': 24,
'downloader/exception_count': 24,
'downloader/exception_type_count/builtins.ValueError': 24,
'downloader/request_bytes': 7158,
'downloader/request_count': 24,
'downloader/request_method_count/GET': 24,
'elapsed_time_seconds': 55.895942,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 2, 20, 21, 17, 47, 135433),
'log_count/DEBUG': 50,
'log_count/ERROR': 4,
'log_count/INFO': 13,
'memusage/max': 65073152,
'memusage/startup': 65073152,
'proxies/dead': 21,
'proxies/mean_backoff': 196.90260209397636,
'proxies/reanimated': 1,
'proxies/unchecked': 9,
'scheduler/dequeued': 24,
'scheduler/dequeued/memory': 24,
'scheduler/enqueued': 24,
'scheduler/enqueued/memory': 24,
'start_time': datetime.datetime(2022, 2, 20, 21, 16, 51, 239491)}
2022-02-21 00:17:47 [scrapy.core.engine] INFO: Spider closed (finished)
What could the problem be?
My proxy plan with Leafproxies is "Residential Proxies". Leafproxies doesn't provide any information about the details of the plan or how the proxies should be used. As far as I can tell, there is no real customer support, only a Discord channel.
Here is the panel that Leafproxies provides; I get the proxies from the list shown there. No data usage has been recorded.
Solution
The way you have defined your proxy list is not correct. You need to use the format username:password@server:port (prefixed with a scheme such as http:// or https://), not server:port:username:password. Try the definition below:
ROTATING_PROXY_LIST = [
    "https://ksre9jva95etajxxaoll9k+JI38HJg5:[email protected]:5000",
    "https://ksre9jva95etajxxaoll9k+zHtjyZRG:[email protected]:5001",
]

DOWNLOADER_MIDDLEWARES = {
    # ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 800,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 810,
    # ...
}
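If you have a long list in the provider's host:port:user:pass format, you can convert it programmatically instead of rewriting each entry by hand. The following is a minimal sketch, assuming every entry has exactly that four-field, colon-separated layout; the helper name to_proxy_url and the RAW_PROXIES placeholder values are illustrative, and whether to use http or https depends on what your provider supports:

# settings.py -- sketch: rebuild the provider's "host:port:user:pass" strings
# into the "scheme://user:pass@host:port" form that Scrapy can parse.
RAW_PROXIES = [
    "us-retail-fast.resdleafproxies.com:5000:USERNAME:PASSWORD",
    # ...
]

def to_proxy_url(entry, scheme="https"):
    # host, port, user; everything after the third colon is treated as the password
    host, port, user, password = entry.split(":", 3)
    return f"{scheme}://{user}:{password}@{host}:{port}"

ROTATING_PROXY_LIST = [to_proxy_url(p) for p in RAW_PROXIES]

Note that usernames or passwords containing characters such as "/" or "@" would additionally need URL-encoding (e.g. with urllib.parse.quote), which this sketch does not handle.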
NOTE: You have exposed your credentials to the internet, so anyone seeing this question can use your proxy service for free. Consider revoking the credentials ASAP.
The second issue you might be facing is that some of the proxies may already be banned by the site you are scraping, so you will still receive failed responses. In that case, increase the retry settings (Scrapy's RETRY_TIMES and scrapy-rotating-proxies' ROTATING_PROXY_PAGE_RETRY_TIMES) when using proxies.
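For example, in settings.py you could raise both Scrapy's built-in retry count and the per-page retry limit that scrapy-rotating-proxies uses when switching proxies; the values below are only illustrative, not recommendations:

# settings.py -- illustrative values, tune to your needs
RETRY_TIMES = 10                      # Scrapy's RetryMiddleware retries per request
ROTATING_PROXY_PAGE_RETRY_TIMES = 10  # retries with a different proxy; the default
                                      # of 5 matches "max retries: 5" in the log above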
Answered By - msenior_