Issue
I have a situation where a website fingerprints requests based on header order and casing.
I've been able to specify the header order with the correct casing using this spider:
import json

from scrapy.spiders import Spider
from scrapy.http import Request
from twisted.web.http_headers import Headers as TwistedHeaders


class Test(Spider):
    name = 'test'
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'aA': 'a',
            'Bb': 'b',
            'CC': 'c',
            'Content-Length': '14',
            'dD': 'd',
        },
    }

    # Preserve casing of headers
    TwistedHeaders._caseMappings[b'aa'] = b'aA'
    TwistedHeaders._caseMappings[b'bb'] = b'Bb'
    TwistedHeaders._caseMappings[b'cc'] = b'CC'
    TwistedHeaders._caseMappings[b'dd'] = b'dD'

    def start_requests(self):
        yield Request(
            'https://httpbin.org/post',
            body=json.dumps({'foo': 'bar'}),
            method='POST',
            # Sniff with Fiddler
            # meta={'proxy': 'https://127.0.0.1:8866'}
        )

    def parse(self, response):
        pass
When I run the spider, I notice in Fiddler that a second Content-Length header is present at the start of the request headers.
I've tried to find where in Scrapy/Twisted this is being set, but as I am pretty new it is a lot to read through. As a result, I am having a hard time understanding why this is happening.
Is there any way to instruct Scrapy not to add Content-Length automatically if it's already present? Or, if it is added automatically, to make Content-Length respect the header order?
I know that if I remove Content-Length, the request works; however, it is still unordered (Content-Length occurs as the first key in the headers). For my use case, I think Content-Length must occur in the right spot. In this example, that's between CC and dD.
I would appreciate any steps in the right direction. Thank you!
Solution
I was able to sort the Scrapy headers alphabetically and keep them case sensitive (including Content-Length) by:
- ORDER: Creating a custom downloader that sends the headers sorted alphabetically
- CASE SENSITIVE: Modifying _caseMappings of the internal TwistedHeaders class to allow case-sensitive headers
- Two "Content-Length" headers: Modifying the _writeToBodyProducerContentLength method in Twisted's web/_newclient.py (found here) as shown in this diff:
def _writeToBodyProducerContentLength(self, transport):
-    self._writeHeaders(
-        transport,
-        networkString("Content-Length: %d\r\n" % (self.bodyProducer.length,)),
-    )
+    self._writeHeaders(transport, None)
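To make the target wire format concrete, here is a minimal, standard-library-only sketch of the end result these steps aim for: header names written in alphabetical order, casing untouched, and exactly one Content-Length line. It is not Scrapy's or Twisted's code. `build_header_block` is a hypothetical helper, and sorting case-insensitively is an assumption about what "alphabetically" means here.

```python
import json


def build_header_block(headers, body):
    """Serialize headers as HTTP/1.1 header lines (hypothetical helper).

    Names are sorted alphabetically (case-insensitively, an assumption),
    casing is preserved, and Content-Length appears exactly once.
    """
    merged = {}
    for name, value in headers.items():
        # Drop any caller-supplied Content-Length; it is recomputed below
        # so the header can never be emitted twice.
        if name.lower() != "content-length":
            merged[name] = value
    merged["Content-Length"] = str(len(body))

    lines = [
        f"{name}: {merged[name]}\r\n"
        for name in sorted(merged, key=str.lower)
    ]
    return "".join(lines).encode("ascii")


body = json.dumps({"foo": "bar"}).encode("utf-8")
headers = {"dD": "d", "CC": "c", "Bb": "b", "aA": "a", "Content-Length": "14"}
block = build_header_block(headers, body)
print(block.decode("ascii"))
```

With the headers from the question, Content-Length lands between CC and dD, which is the position the fingerprinting site is assumed to expect.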
My GitHub repository code can be found here
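For the case-sensitivity step, it may help to see why the _caseMappings entries matter. The sketch below is a simplified, standard-library-only model of the behavior the workaround relies on: a header name is looked up by its lowercase form in a mapping and, if no entry exists, falls back to dash-capitalization. The names `CASE_MAPPINGS`, `dash_capitalize`, and `canonical_name` are hypothetical. They only approximate what Twisted does internally, and the details may differ between Twisted versions.

```python
# Simplified model: lowercase header name -> exact casing to send on the wire.
CASE_MAPPINGS = {}


def dash_capitalize(name: bytes) -> bytes:
    """Capitalize each dash-separated part of a header name."""
    return b"-".join(part.capitalize() for part in name.split(b"-"))


def canonical_name(name: bytes) -> bytes:
    """Return the registered casing for a header, else the dash-capitalized form."""
    lowered = name.lower()
    return CASE_MAPPINGS.get(lowered, dash_capitalize(lowered))


# Without a mapping entry, the original casing is lost.
before = canonical_name(b"aA")

# Registering the lowercase key keeps the custom casing.
CASE_MAPPINGS[b"aa"] = b"aA"
after = canonical_name(b"aA")

print(before, after)
```

This is why the spider registers one lowercase key per custom header: any header without an entry would be re-cased before it reaches the wire.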
Answered By - yeqiuuu