Issue
The code below is supposed to retrieve links from Google's search results page. Without the header, linkedElems has 0 elements, but with the header, linkedElems has 44 elements, which means that select('.r a') found 44 elements in the page only once the header was added. Does the HTML code of a page change when a header is used?
I inspected the page's HTML code using Firefox's developer tools to find the links and select them, so select('.r a') isn't supposed to return 0.
Code:
import bs4
import requests

print("Search something in google:")
searchKeyword = input()
print("Googling.... " + searchKeyword)
# Browser-like User-Agent; without it, select('.r a') below matches 0 elements.
head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'}
# Passing the query through params lets requests URL-encode it (spaces, '&', etc.).
responseObj = requests.get("https://www.google.com/search", params={'q': searchKeyword}, headers=head)
responseObj.raise_for_status()
print("Status code: " + str(responseObj.status_code))
soupObj = bs4.BeautifulSoup(responseObj.text, features='html.parser')
# Select every <a> element nested inside an element with class "r".
linkedElems = soupObj.select('.r a')
print(len(linkedElems))
Result (With header):
Search something in google:
test
Googling.... test
Status code: 200
44
Process finished with exit code 0
Result (Without header):
Search something in google:
test
Googling.... test
Status code: 200
0
Process finished with exit code 0
Solution
The User-Agent header is specifically designed to let the server know the browser/OS/hardware of the client that issued the request, so that it can build the proper response for that specific client:
"The User-Agent request header is a characteristic string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent."
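If Google's server was designed to return different HTML to different clients (spoiler alert: it was), then the answer is "yes, the HTML will be different for different values of User-Agent", as you discovered yourself. When no header is passed, requests identifies itself with its default python-requests/<version> User-Agent instead of a browser string, so the page it gets back need not contain the elements you saw in Firefox's developer tools.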
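To see the effect without depending on Google's actual markup, here is a minimal sketch using only the standard library. The toy server, the UserAgentHandler class, and the fetch and compare_responses helpers are all hypothetical names made up for this illustration. It is not how Google's server works, only a demonstration that a server can choose its HTML based on the User-Agent header:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# The same browser-like User-Agent string used in the question.
FIREFOX_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) "
              "Gecko/20100101 Firefox/71.0")


class UserAgentHandler(BaseHTTPRequestHandler):
    """Hypothetical toy server: picks its HTML based on the User-Agent."""

    def do_GET(self):
        user_agent = self.headers.get("User-Agent", "")
        if "Firefox" in user_agent:
            # Markup served to a recognised browser: the link sits in class "r".
            body = b'<div class="r"><a href="https://example.com">result</a></div>'
        else:
            # Markup served to any other client: same link, no "r" class.
            body = b'<div><a href="https://example.com">result</a></div>'
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output free of per-request log lines


def fetch(url, headers=None):
    """GET the URL with optional extra headers and return the decoded body."""
    # An empty ProxyHandler makes sure the local request bypasses any proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers=headers or {})
    with opener.open(request) as response:
        return response.read().decode("utf-8")


def compare_responses():
    """Request the same URL twice, without and with a browser User-Agent."""
    server = HTTPServer(("127.0.0.1", 0), UserAgentHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/search?q=test"
    try:
        plain = fetch(url)  # urllib sends its own default "Python-urllib/x.y"
        browser = fetch(url, {"User-Agent": FIREFOX_UA})
    finally:
        server.shutdown()
        server.server_close()
    return plain, browser


if __name__ == "__main__":
    plain, browser = compare_responses()
    print("Without header:", plain)
    print("With header:   ", browser)
```

Both requests hit the same URL and both get status 200, but only the response to the Firefox User-Agent contains an element with class "r", so a selector like '.r a' would match in one response and not in the other. To check what Google actually returned to your script, inspect responseObj.text itself rather than the HTML shown in the browser's developer tools.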
Answered By - DeepSpace