Issue
When I run the following code from my console, I cannot get the HTML of the Google News page. The response is instead the HTML of Google's privacy notice (the one that starts with "Before you continue").
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get("https://www.google.com/news", headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())
Is there a way to prevent the privacy notice from popping up at all?
A snippet of what I get instead:
<title>
Before you continue
</title>
<meta content="initial-scale=1, maximum-scale=5, width=device-width" name="viewport"/>
<link href="//www.google.com/favicon.ico" rel="shortcut icon"/>
</head>
<body>
<div class="signin">
<a class="button" href="https://accounts.google.com/ServiceLogin?hl=en-US&continue=https://news.google.com/topics/CAAqBwgKMKHQ9Qowlc7cAg&gae=cb-">
Sign in
</a>
</div>
<div class="box">
<img alt="Google" height="28" src="//www.gstatic.com/images/branding/googlelogo/1x/googlelogo_color_68x28dp.png" srcset="//www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_68x28dp.png 2x" width="68"/>
<div class="productLogoContainer">
<img alt="" aria-hidden="true" class="image" height="100%" src="https://www.gstatic.com/ac/cb/scene_cookie_wall_search_v2.svg" width="100%"/>
</div>
Solution
You can set the CONSENT cookie to avoid getting the "Before you continue" page:
EDIT 10-10-2023: Updated headers/cookies.
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0"
}
# Pre-set the consent cookie so Google skips the "Before you continue" page.
cookies = {"CONSENT": "YES+cb.20220419-08-p0.cs+FX+111"}
r = requests.get("https://www.google.com/news", headers=headers, cookies=cookies)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
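Since the cookie value above has already needed one update, it may help to check whether the response is still the consent page before parsing it. The following is only a minimal sketch: the helper name is_consent_page is hypothetical (it is not part of requests or bs4), and it assumes the notice keeps the English title "Before you continue" shown in the question's snippet. It uses only the standard library, so it can be tested without a network request.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def is_consent_page(html: str) -> bool:
    """Return True if the page title looks like Google's consent notice.

    Assumption: the notice is titled "Before you continue", as in the
    snippet from the question; other locales would need other strings.
    """
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return "before you continue" in parser.title.strip().lower()
```

With the code from the answer, this could be called as is_consent_page(r.text) right after the request. A True result would suggest that the CONSENT cookie value is no longer accepted and needs updating.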
Answered By - Andrej Kesely