Saturday, November 11, 2023

[FIXED] Can I accept or ignore the Google privacy notice when webscraping with BeautifulSoup?

Labels: beautifulsoup, python, web-scraping

Issue

I am unable to view the HTML of the Google News page when running the following code from my console. The HTML I see instead is that of the Google privacy notice (the one that starts with "Before you continue").

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get("https://www.google.com/news", headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

Is there a way to prevent the privacy notice from popping up at all?

A snippet of what I get instead:

  <title>
   Before you continue
  </title>
  <meta content="initial-scale=1, maximum-scale=5, width=device-width" name="viewport"/>
  <link href="//www.google.com/favicon.ico" rel="shortcut icon"/>
 </head>
 <body>
  <div class="signin">
   <a class="button" href="https://accounts.google.com/ServiceLogin?hl=en-US&amp;continue=https://news.google.com/topics/CAAqBwgKMKHQ9Qowlc7cAg&amp;gae=cb-">
    Sign in
   </a>
  </div>
  <div class="box">
   <img alt="Google" height="28" src="//www.gstatic.com/images/branding/googlelogo/1x/googlelogo_color_68x28dp.png" srcset="//www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_68x28dp.png 2x" width="68"/>
   <div class="productLogoContainer">
    <img alt="" aria-hidden="true" class="image" height="100%" src="https://www.gstatic.com/ac/cb/scene_cookie_wall_search_v2.svg" width="100%"/>
   </div>

Solution

You can set the CONSENT cookie to avoid getting the "Before you continue" page:

EDIT 10-10-2023: Updated headers/cookies.

import requests
from bs4 import BeautifulSoup

# Send a full browser User-Agent string instead of the bare "Mozilla/5.0".
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0"
}
# Pre-set the CONSENT cookie so Google skips the "Before you continue" page.
cookies = {"CONSENT": "YES+cb.20220419-08-p0.cs+FX+111"}
r = requests.get("https://www.google.com/news", headers=headers, cookies=cookies)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
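
To confirm that the cookie actually took effect, it may help to check whether the response is still the privacy notice. The following is a minimal sketch, assuming the notice can be recognized by the "Before you continue" title shown in the question's snippet. The helper name `is_consent_page` and the constant `CONSENT_TITLE` are hypothetical, and Google may change the title text at any time.

```python
from bs4 import BeautifulSoup

# Assumption: the privacy notice is identified by this <title> text,
# taken from the snippet in the question. Google may change it.
CONSENT_TITLE = "Before you continue"


def is_consent_page(html):
    """Return True if the HTML looks like the "Before you continue" notice."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> element.
    title = soup.title.get_text(strip=True) if soup.title else ""
    return title.startswith(CONSENT_TITLE)
```

With the response from the code above, `is_consent_page(r.text)` returning True would suggest that the cookie value is no longer accepted and needs updating.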

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
