Friday, November 26, 2021

[FIXED] How to save all the network traffic (both request and response headers) from a website using python

November 26, 2021 json, python, selenium, web-scraping No comments

Issue

I am trying to find an object that is downloaded into the browser during the loading of a website.

This is the website, https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en,

I'm not very good with web technology and such.

I am trying to save the request and response headers and the actual response using only the link to the website.

If you look at the network traffic, you can see an object jobsearch.ftl?lang=en that loads towards the end and you can see the reponse and headers.

Here are the screenshots, of the network event log showing the request and response headers.

And the actual response.

These are the objects that I want to save. How can I do that?

I have tried

import json
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, TimeoutException, StaleElementReferenceException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

chromepath = "~/chromedriver/chromedriver"

caps = DesiredCapabilities.CHROME
caps['goog:loggingPrefs'] = {'performance': 'ALL'}
driver = webdriver.Chrome(executable_path=chromepath, desired_capabilities=caps)
driver.get('https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en')

def process_browser_log_entry(entry):
    response = json.loads(entry['message'])['message']
    return response

browser_log = driver.get_log('performance') 
events = [process_browser_log_entry(entry) for entry in browser_log]
events = [event for event in events if 'Network.response' in event['method']]

But I only get some of the headers, they look like this,


{'method': 'Network.responseReceivedExtraInfo',
  'params': {'blockedCookies': [],
   'headers': {'Cache-Control': 'private',
    'Connection': 'Keep-Alive',
    'Content-Encoding': 'gzip',
    'Content-Security-Policy': "frame-ancestors 'self'",
    'Content-Type': 'text/html;charset=UTF-8',
    'Date': 'Mon, 27 Sep 2021 18:18:10 GMT',
    'Keep-Alive': 'timeout=5, max=100',
    'P3P': 'CP="CAO PSA OUR"',
    'Server': 'Taleo Web Server 8',
    'Set-Cookie': 'locale=en; path=/careersection/; secure; HttpOnly',
    'Transfer-Encoding': 'chunked',
    'Vary': 'Accept-Encoding',
    'X-Content-Type-Options': 'nosniff',
    'X-UA-Compatible': 'IE=edge',
    'X-XSS-Protection': '1'},
   'headersText': 'HTTP/1.1 200 OK\r\nDate: Mon, 27 Sep 2021 18:18:10 GMT\r\nServer: Taleo Web Server 8\r\nCache-Control: private\r\nP3P: CP="CAO PSA OUR"\r\nContent-Encoding: gzip\r\nVary: Accept-Encoding\r\nX-Content-Type-Options: nosniff\r\nSet-Cookie: locale=en; path=/careersection/; secure; HttpOnly\r\nContent-Security-Policy: frame-ancestors \'self\'\r\nX-XSS-Protection: 1\r\nX-UA-Compatible: IE=edge\r\nKeep-Alive: timeout=5, max=100\r\nConnection: Keep-Alive\r\nTransfer-Encoding: chunked\r\nContent-Type: text/html;charset=UTF-8\r\n\r\n',
   'requestId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'resourceIPAddressSpace': 'Public'}},
 {'method': 'Network.responseReceived',
  'params': {'frameId': '1624E6F3E724CA508A6D55D556CBE198',
   'loaderId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'requestId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'response': {'connectionId': 26,

They don't contain all the information I can see from the web inspector in chrome.

I want to get the whole response and request headers and the actual response. Is this the correct way? Is there another better way which doesn't use selenium and only requests instead?

Solution

You can use the selenium-wire library if you want to use Selenium to work with this. However, if you're only concerned for a specific API, then rather than using Selenium, you can use the requests library for hitting the API and then print the results of the request and response headers.

Considering you're looking for the earlier, using the Selenium way, one way to achieve this is using selenium-wire library. However, it will give the result for all the background API's/requests being hit - which you can then easily filter after either piping the result to a text file or in terminal itself

Install using pip install selenium-wire

Install webdriver-manager using pip install webdriver-manager

Install Selenium 4 using pip install selenium==4.0.0.b4

Use this code

from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

from selenium.webdriver.chrome.service import Service

svc=  Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=svc)
driver.maximize_window()
# To use firefox browser
driver.get("https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en")
for request in driver.requests:
    if request.response:
        print(
            request.url,
            request.response.status_code,
            request.headers,
            request.response.headers

        )

which gives a detailed output of all the requests - copying the relavent one -

https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en 200 


Host: epco.taleo.net
Connection: keep-alive
sec-ch-ua: "Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8


Date: Tue, 28 Sep 2021 11:14:14 GMT
Server: Taleo Web Server 8
Cache-Control: private
P3P: CP="CAO PSA OUR"
Content-Encoding: gzip
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
Set-Cookie: locale=en; path=/careersection/; secure; HttpOnly
Content-Security-Policy: frame-ancestors 'self'
X-XSS-Protection: 1
X-UA-Compatible: IE=edge
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html;charset=UTF-8

Answered By - demouser123

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, November 26, 2021

[FIXED] How to save all the network traffic (both request and response headers) from a website using python

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels