Issue
I'm using Scrapy to crawl a website whose server has a faulty SSL configuration (which I can't control). Every connection attempt ends in an SSL handshake failure from Scrapy (or perhaps Twisted?), even when I use custom_settings with the same parameters that work for the OpenSSL CLI and for a basic proof of concept using Python's ssl module. (See below.)
What am I doing wrong? Scrapy's STDOUT shows that the setting overrides are taking effect, but the handshake fails every time.
Details about the root cause of the server SSL issue are here. In summary, the server only accepts TLS 1.2 and requires the client to offer SHA-1 as a signature algorithm, hence the need for SECLEVEL=0 in the client context.
Scrapy Output
(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# scrapy crawl badsslconfig
2023-08-06 05:40:22 [scrapy.utils.log] INFO: Scrapy 2.10.0 started (bot: ssl_test)
2023-08-06 05:40:22 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.4 (main, Jul 28 2023, 05:02:22) [GCC 12.2.0], pyOpenSSL 23.2.0 (OpenSSL 3.1.2 1 Aug 2023), cryptography 41.0.3, Platform Linux-5.15.49-linuxkit-x86_64-with-glibc2.36
2023-08-06 05:40:22 [scrapy.addons] INFO: Enabled addons:
[]
2023-08-06 05:40:22 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'ssl_test',
'DOWNLOADER_CLIENT_TLS_CIPHERS': 'DEFAULT:@SECLEVEL=0',
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'ssl_test.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['ssl_test.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-08-06 05:40:22 [asyncio] DEBUG: Using selector: EpollSelector
2023-08-06 05:40:22 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-08-06 05:40:22 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2023-08-06 05:40:22 [scrapy.extensions.telnet] INFO: Telnet Password: 52cbcfbfdbe0e1e7
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-08-06 05:40:23 [scrapy.core.engine] INFO: Spider opened
2023-08-06 05:40:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-06 05:40:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au/robots.txt> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au/robots.txt> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.legislation.gov.au/robots.txt> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www.legislation.gov.au/robots.txt>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
Traceback (most recent call last):
File "/.venv/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.legislation.gov.au> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.legislation.gov.au>
Traceback (most recent call last):
File "/.venv/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.core.engine] INFO: Closing spider (finished)
2023-08-06 05:40:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 6,
'downloader/request_bytes': 1368,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'elapsed_time_seconds': 0.612751,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 8, 6, 5, 40, 23, 702263),
'log_count/DEBUG': 7,
'log_count/ERROR': 4,
'log_count/INFO': 10,
'memusage/max': 64425984,
'memusage/startup': 64425984,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4,
"robotstxt/exception_count/<class 'twisted.web._newclient.ResponseNeverReceived'>": 1,
'robotstxt/request_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2023, 8, 6, 5, 40, 23, 89512)}
2023-08-06 05:40:23 [scrapy.core.engine] INFO: Spider closed (finished)
Version Information:
root@348980730ce9:/ssl_test/ssl_test/spiders# scrapy version -v
Scrapy : 2.10.0
lxml : 4.9.3.0
libxml2 : 2.10.3
cssselect : 1.2.0
parsel : 1.8.1
w3lib : 2.1.2
Twisted : 22.10.0
Python : 3.11.4 (main, Jul 28 2023, 05:02:22) [GCC 12.2.0]
pyOpenSSL : 23.2.0 (OpenSSL 3.1.2 1 Aug 2023)
cryptography : 41.0.3
Platform : Linux-5.15.49-linuxkit-x86_64-with-glibc2.36
(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# which openssl
/usr/bin/openssl
(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# openssl version -v
Successful OpenSSL Handshake
(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# openssl s_client -connect 54.66.220.183:443 -cipher 'DEFAULT:@SECLEVEL=0'
CONNECTED(00000003)
Can't use SSL_get_servername
depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
verify return:1
depth=1 C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
verify return:1
depth=0 CN = *.legislation.gov.au
verify return:1
---
Certificate chain
0 s:CN = *.legislation.gov.au
i:C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
v:NotBefore: Jan 30 00:00:00 2023 GMT; NotAfter: Feb 11 23:59:59 2024 GMT
1 s:C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
i:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
a:PKEY: rsaEncryption, 4096 (bit); sigalg: RSA-SHA256
v:NotBefore: May 4 00:00:00 2022 GMT; NotAfter: Nov 9 23:59:59 2031 GMT
2 s:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
i:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA1
v:NotBefore: Nov 10 00:00:00 2006 GMT; NotAfter: Nov 10 00:00:00 2031 GMT
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIHqDCCBZCgAwIBAgIQCD51iO5LTch3btyrzg48bzANBgkqhkiG9w0BAQsFADBc
MQswCQYDVQQGEwJVUzEXMBUGA1UEChMORGlnaUNlcnQsIEluYy4xNDAyBgNVBAMT
K1JhcGlkU1NMIEdsb2JhbCBUTFMgUlNBNDA5NiBTSEEyNTYgMjAyMiBDQTEwHhcN
MjMwMTMwMDAwMDAwWhcNMjQwMjExMjM1OTU5WjAfMR0wGwYDVQQDDBQqLmxlZ2lz
bGF0aW9uLmdvdi5hdTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBANrx
FvQbBE9bnuXZiHrdR7mB1tkiWLTHhoAq00uAffKkS6bkM1Gs7OuO5XKBP0LlBPll
bgn/DJ5pXlZKX3nqhjV3x/nJRRqAf3EdvrDMTRbj4zyxQ+4zQ0V8sOVcU5HJddcu
yNQek1LLhXf5tpWpd+RsP5V7CZlIHLl3PyrCuCsugv4SKnGh1Xm0QrHB/NrpNz8w
J1hTQTP6NlO7KiVs92BQ6ZXTl1ZD5mmgg5muDo0kpNN2inzv2BJvdH4KCEw5bTAq
EmcWXM+vHoQA0acFEMwwxr8iT/1keaKAwRabg9PiWqDdA13egKNQAqUIDK1dF/eM
pf8X75arHZxkk2+CMjMCAwEAAaOCA6EwggOdMB8GA1UdIwQYMBaAFPCchf2in32P
yWi71dSJTR2+05D/MB0GA1UdDgQWBBRcWwBAEE3RJ6flW6Mf5kraJAjWRDAzBgNV
HREELDAqghQqLmxlZ2lzbGF0aW9uLmdvdi5hdYISbGVnaXNsYXRpb24uZ292LmF1
MA4GA1UdDwEB/wQEAwIFoDAdBgNVHSUEFjAUBggrBgEFBQcDAQYIKwYBBQUHAwIw
gZ8GA1UdHwSBlzCBlDBIoEagRIZCaHR0cDovL2NybDMuZGlnaWNlcnQuY29tL1Jh
cGlkU1NMR2xvYmFsVExTUlNBNDA5NlNIQTI1NjIwMjJDQTEuY3JsMEigRqBEhkJo
dHRwOi8vY3JsNC5kaWdpY2VydC5jb20vUmFwaWRTU0xHbG9iYWxUTFNSU0E0MDk2
U0hBMjU2MjAyMkNBMS5jcmwwPgYDVR0gBDcwNTAzBgZngQwBAgEwKTAnBggrBgEF
BQcCARYbaHR0cDovL3d3dy5kaWdpY2VydC5jb20vQ1BTMIGHBggrBgEFBQcBAQR7
MHkwJAYIKwYBBQUHMAGGGGh0dHA6Ly9vY3NwLmRpZ2ljZXJ0LmNvbTBRBggrBgEF
BQcwAoZFaHR0cDovL2NhY2VydHMuZGlnaWNlcnQuY29tL1JhcGlkU1NMR2xvYmFs
VExTUlNBNDA5NlNIQTI1NjIwMjJDQTEuY3J0MAkGA1UdEwQCMAAwggF+BgorBgEE
AdZ5AgQCBIIBbgSCAWoBaAB2AO7N0GTV2xrOxVy3nbTNE6Iyh0Z8vOzew1FIWUZx
H7WbAAABhgUN3XoAAAQDAEcwRQIhAIuzKlDiXLZitacpPcnjPr+ivxEwoh3PVaSm
6cSs0ufWAiAeCWS3fTLXwi9X1BFpZqGlyUVwo+GGsBVf48TtfRTrcgB2AHPZnokb
TJZ4oCB9R53mssYc0FFecRkqjGuAEHrBd3K1AAABhgUN3Z8AAAQDAEcwRQIhAKOm
Ht0FHIjxWfNvxQ5hsAxAhnMD+E6vN+VtOItO+JMIAiBKKW5bNxkrTVH8UJmo688w
Nzq6mifm0HpqA7zcX3W8MAB2AEiw42vapkc0D+VqAvqdMOscUgHLVt0sgdm7v6s5
2IRzAAABhgUN3WUAAAQDAEcwRQIhAPt7qx6WI7D2Ohuiw12Y6Wdak9SyfP47tDXF
ygquEtgeAiA7DSooWXRKaVjCWX75kDCt70PoA6MJd2xb6qZyTfV0DDANBgkqhkiG
9w0BAQsFAAOCAgEAHdZISuK409QEVnClR0w3Hwkeca/uoRADtvNUg69Ei6oHhEZw
tb1FvXPxhdXEU6409a9mNdjcmLDg+5Cfo9zVWpneL2vg+qcbbsq7W31WjA7DWoHV
HjRSzoYzd9SGsGGOMmqXlOFtLVhkBJTdxb7DyVMTZxZoKIzL5EXqj9VykYB+nAm2
Xv8+xcTBzoaF5OhvVQ78K2I1X5rjDwIsrbpCBpB6MUAiLsmBDY5F+mXnFIG+8Jxk
OLmJ88pQWblLRub59xBC5i2+qXSNqyAJKcIY3HUGpA+f/KT5f7K5DMMlecxPpBJW
eLzlXzOXE8vYezKtazhMdi8eO2zEVedAY8BmvGcoHFMFIcfZ9Bbno5qSiGb5WIfw
oxupuQtvtTg6oBtN7vHanBtc4+EVaQrKmQ2VnTRug4PTGTUcRaFmWY0d5+pfiSbo
v7zW5tVOl6Whu9+alcAAl5L1kZwrGPwWYXazDf4Q6lh2mLToA/b4AFQRmKDCpa1X
HIXNpAHbBKBNXGUfK1Ky9ZEtJpOAi0fPRwVGRwR2mzAdE+rzz6ARSWn5+xaStqtm
ImflxSVn2YI041tBguWayCw4du+iOFVBpdPzEiMOyJ95L+XngAZCwc296hnkljiL
8wRteqCkwMMXpVfHSTDopMKPndZ3k99Hv/XSHAqQ0xXYspoLNlhjtNf0ELA=
-----END CERTIFICATE-----
subject=CN = *.legislation.gov.au
issuer=C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
---
No client certificate CA names sent
---
SSL handshake has read 4565 bytes and written 621 bytes
Verification: OK
---
New, TLSv1.2, Cipher is AES128-GCM-SHA256
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
Protocol : TLSv1.2
Cipher : AES128-GCM-SHA256
Session-ID: 35C8A1175ABF47501236C0C9B171BCD21F973C8C745E5D2377851B53DE62ED60
Session-ID-ctx:
Master-Key: DB0F145BB6A858F762CE4ED39E19F77C531B91A41CDED14E8A96377F688A9BA1A5B3386FE83017A83F4B99CEDBFEDDCD
PSK identity: None
PSK identity hint: None
SRP username: None
Start Time: 1691299616
Timeout : 7200 (sec)
Verify return code: 0 (ok)
Extended master secret: no
---
closed
Note: I deliberately used an IP address instead of hostname because there are some IPv6 servers sharing the same name that seem to be configured OK.
Steps to Recreate:
deploy fresh default Python container from Docker Hub
pip install scrapy
scrapy startproject ssl_test
scrapy genspider <name> https://www.legislation.gov.au
Add custom_settings to the <name>.py spider class definition:
custom_settings = { 'DOWNLOADER_CLIENT_TLS_METHOD' : 'TLSv1.2', 'DOWNLOADER_CLIENT_TLS_CIPHERS' : 'DEFAULT:@SECLEVEL=0'}
Other Troubleshooting Steps Tried:
Downgrade to OpenSSL 1.1.1
Test proof of concept using python & SSL (i.e. bypass Scrapy dependencies):
import ssl, socket

hostname = 'legislation.gov.au'
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
context.set_ciphers('DEFAULT:@SECLEVEL=0')
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE  # It's not important to authenticate the server for the moment.

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ssl_sock = context.wrap_socket(s, server_hostname=hostname)
ssl_sock.connect((hostname, 443))
This works as expected, suggesting the problem lies somewhere in Scrapy or its dependencies (a pyOpenSSL-level variant of the same check is sketched after this list).
Test on another platform (macOS): same error
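A natural next step down the stack (not in the original post) would be the same check one layer lower, through pyOpenSSL, which is the library Twisted drives. A rough, untested sketch, again with verification disabled:

import socket
from OpenSSL import SSL

hostname = 'legislation.gov.au'

ctx = SSL.Context(SSL.TLSv1_2_METHOD)
ctx.set_cipher_list(b'DEFAULT:@SECLEVEL=0')      # SECLEVEL applied directly to the real context

sock = socket.create_connection((hostname, 443))
conn = SSL.Connection(ctx, sock)
conn.set_tlsext_host_name(hostname.encode('ascii'))
conn.set_connect_state()
conn.do_handshake()                              # expected to succeed if SECLEVEL=0 takes effect
print(conn.get_protocol_version_name(), conn.get_cipher_name())

If this also succeeds while Scrapy still fails, the problem is narrowed down to Twisted's TLS layer.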
Scrapy Spider Definition (all other files are default):
import scrapy


class BadsslconfigSpider(scrapy.Spider):
    name = "badsslconfig"
    allowed_domains = ["www.legislation.gov.au"]
    start_urls = ["https://www.legislation.gov.au"]
    custom_settings = {
        'DOWNLOADER_CLIENT_TLS_CIPHERS': 'DEFAULT:@SECLEVEL=0',
    }

    def parse(self, response):
        pass
Solution
TL;DR: it looks like the SECLEVEL information gets thrown away by Twisted, the library Scrapy uses to handle I/O, including TLS.
In detail:
Based on some debugging in the code, it looks like Twisted expands the cipher string before setting the ciphers: it writes the cipher string into a temporary SSL context using set_cipher_list and then reads the ciphers back from that context using get_cipher_list. Since SECLEVEL is not an actual cipher, it gets thrown away in this round trip. The SECLEVEL information is still applied to the SSLContext used for the expansion, but unfortunately that context is only used temporarily to obtain the expanded cipher list and is not the one actually used when making the connection.
See _expandCipherString for more.
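For illustration, a minimal sketch of that round trip (assuming pyOpenSSL is installed; this mimics the mechanism rather than quoting Twisted's code):

from OpenSSL import SSL

# Expand a cipher string the way Twisted does: set it on a throwaway context,
# then read the resulting cipher names back from a connection on that context.
ctx = SSL.Context(SSL.TLS_METHOD)
ctx.set_cipher_list(b"DEFAULT:@SECLEVEL=0")          # SECLEVEL applies to this context only

expanded = SSL.Connection(ctx, None).get_cipher_list()
print(expanded[:3])                                   # plain cipher names only
print(any("SECLEVEL" in name for name in expanded))   # False: the "@SECLEVEL=0" annotation is gone

Only the plain cipher names survive; the context that carried the security level is then discarded.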
The result of this can also be observed by taking a packet capture and examining the signature_algorithms extension in the ClientHello. Because of SECLEVEL=0 one would expect SHA-1 to be offered there, which is what the (broken) server requires in order to work properly. But SHA-1 is not in there, i.e. SECLEVEL was ignored.
I cannot see a workaround except for digging into Twisted itself. A quick-but-dirty hack would be to append @SECLEVEL=0 when calling set_cipher_list. So instead of this in _sslverify.py:
ctx.set_cipher_list(self._cipherString.encode("ascii"))
do this:
ctx.set_cipher_list(self._cipherString.encode("ascii") + b":@SECLEVEL=0")
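For illustration only, a hedged sketch of applying that same append-SECLEVEL hack at runtime instead of editing the installed Twisted source, e.g. placed at the top of the Scrapy project's settings.py so it runs before any connection is made; the wrapper name is made up and this is untested against this server:

from OpenSSL import SSL

_original_set_cipher_list = SSL.Context.set_cipher_list

def _set_cipher_list_with_seclevel0(self, cipher_list):
    # Re-append the SECLEVEL annotation that Twisted's cipher expansion drops.
    # Twisted passes the cipher string as ASCII bytes; handle str just in case.
    if isinstance(cipher_list, str):
        cipher_list = cipher_list.encode("ascii")
    return _original_set_cipher_list(self, cipher_list + b":@SECLEVEL=0")

SSL.Context.set_cipher_list = _set_cipher_list_with_seclevel0

This is as dirty as editing _sslverify.py, since it lowers the security level for every TLS context the process creates, but it avoids patching installed packages.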
Submitted as a bug to Twisted: https://github.com/twisted/twisted/issues/11903
Answered By - Steffen Ullrich