Monday, July 18, 2022

[FIXED] Python Scrape links from google result

Tags: beautifulsoup, python

Issue

Is there any way to scrape only certain links from a Google results page, namely those whose URL contains specific words, using BeautifulSoup or Selenium?

import requests 
from bs4 import BeautifulSoup 
import csv 

URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups"
r = requests.get(URL) 

soup = BeautifulSoup(r.content, 'html5lib')

I want to extract the links that point to groups.

Solution

It is not entirely clear what you want to do, but if the goal is to extract Facebook links from the returned content, you can simply check whether facebook.com appears in each link's URL:

import requests
from bs4 import BeautifulSoup

URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups"
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html5lib')  # the html5lib package must be installed

# Print every anchor whose href mentions facebook.com
for link in soup.find_all('a', href=True):
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))
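
The question asks specifically for group links, while the loop above keeps anything that mentions facebook.com. The following is a minimal sketch of a stricter filter, not part of the original answer. The helper names `unwrap_google_href` and `select_links` are hypothetical, and the sketch assumes that result links may arrive wrapped as `/url?q=<target>&...`, which depends on the markup Google actually returns. It works on a plain list of href strings, such as the values collected from `soup.find_all('a', href=True)`.

```python
from urllib.parse import urlparse, parse_qs


def unwrap_google_href(href):
    """Return the target of a '/url?q=<target>&...' href, else the href unchanged.

    Assumption: redirect-style result links carry the real URL in the 'q' parameter.
    """
    parsed = urlparse(href)
    if parsed.path == '/url':
        target = parse_qs(parsed.query).get('q')
        if target:
            return target[0]
    return href


def select_links(hrefs, domain='facebook.com', keyword='/groups/'):
    """Keep URLs hosted on `domain` (or a subdomain) whose path contains `keyword`."""
    selected = []
    for href in hrefs:
        url = unwrap_google_href(href)
        parsed = urlparse(url)
        host = parsed.netloc.lower()
        on_domain = host == domain or host.endswith('.' + domain)
        if on_domain and keyword in parsed.path:
            selected.append(url)
    return selected


# Hypothetical hrefs, for illustration only
hrefs = [
    '/url?q=https://www.facebook.com/groups/12345/&sa=U&ved=abc',
    'https://www.facebook.com/friends',
    '/search?q=site:facebook.com',
    'https://m.facebook.com/groups/python/',
]
for url in select_links(hrefs):
    print(url)
```

Checking the host name rather than the whole string avoids matching Google's own links, whose query string contains `site:facebook.com`.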

Update: There is another workaround: set a legitimate User-Agent, that is, add headers that emulate a browser:

# A typical User-Agent string for the Chrome browser running on Windows 10
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}

Example:

from bs4 import BeautifulSoup
import requests

URL = 'https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

# Send the request with the browser-like User-Agent and parse the HTML
resp = requests.get(URL, headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')

# Print every anchor whose href mentions facebook.com
for link in soup.find_all('a', href=True):
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))

Additionally, you can send a fuller set of headers to look more like a legitimate browser:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
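
The search URL in the examples above is percent-encoded by hand (`site%3Afacebook.com+friends+groups`). As a small sketch, not part of the original answer, the same query string can be produced from readable text with the standard library. The helper name `build_search_url` is hypothetical, and the `oq` parameter from the original URL is left out here on the assumption that it is optional.

```python
from urllib.parse import urlencode


def build_search_url(query, base='https://www.google.co.in/search'):
    """Return a search URL with `query` percent-encoded into the 'q' parameter."""
    return base + '?' + urlencode({'q': query})


url = build_search_url('site:facebook.com friends groups')
print(url)
# → https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups
```

The resulting `url` can be passed to `requests.get(url, headers=headers)` in place of the hard-coded `URL`.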

Answered By - 0xInfection

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
