Issue
Please forgive me if this question is too stupid. We know that in the browser it is possible to go to Inspect -> Network -> XHR -> Headers and copy the request headers. These headers can then be added to a Scrapy request.
However, is there a way to get these request headers automatically from the Scrapy request itself, rather than copying them manually?
I tried to use response.request.headers, but the information it returns is not enough:
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}
The browser shows many more request headers than this. How can I get that information?
Solution
Scrapy sends these headers when it scrapes a webpage. If a website requires special header keys (as many APIs do), Scrapy may fail to scrape the page without them.
However, there is a workaround: in a downloader middleware, you can plug in Selenium, so the requested page is downloaded by a Selenium-automated browser. You can then extract the complete headers, because Selenium drives a real browser.
## Import webdriver from Selenium Wire instead of Selenium
from seleniumwire import webdriver

## Configure and start the browser
options = webdriver.ChromeOptions()
driver = webdriver.Chrome("my/path/to/driver", options=options)

## Get the URL
driver.get("https://my.test.url.com")

## Print request headers
for request in driver.requests:
    print(request.url)      # <--------------- Request url
    print(request.headers)  # <--------------- Request headers
    if request.response:    # <--------------- Response may not have arrived yet
        print(request.response.headers)  # <-- Response headers
You can use the code above to get the request headers. Place it inside a Scrapy downloader middleware so both can work together.
Answered By - Akhlaq Ahmed