Issue
I am trying to parse the first page of Google search results, specifically the title and the small summary that is provided for each result. Here is what I have so far:
import urllib.parse
from bs4 import BeautifulSoup
import requests
address = 'https://google.com/#q='
# Default Google search address start
# Open the text document that contains the question
with open("OCR.txt", "rt") as file:
    word = file.read()
newString = ' '.join(word.split('\n'))
# The question spans multiple lines, so this joins them together with proper spacing
print(newString)
qstr = urllib.parse.quote_plus(newString)
# Encode the string
newWord = address + qstr
# Combine the base and the encoded query
print(newWord)
source = requests.get(newWord)
soup = BeautifulSoup(source.text, 'lxml')
The part I am stuck on now is navigating the HTML to parse the specific data that I want. Everything I have tried so far has either thrown an error saying the object has no attribute, or just returned "[]".
I am new to Python and BeautifulSoup, so I am not sure of the syntax for getting to where I want. I have found that these are the individual search results on the page:
Any help on what to add to parse the title and summary of each search result would be MASSIVELY appreciated.
Thank you!
Solution
EDIT (2023.09): I added headers because the code stopped working. Google now requires a User-Agent from a real web browser; a short value like 'Mozilla/5.0' is not enough. You can check your User-Agent using pages like httpbin.org/user-agent.
Your URL doesn't work for me, but with https://google.com/search?q= I get results.
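As a quick sanity check on the encoding step, the working URL can be built with the standard library alone; `quote_plus` (used in the script below) and `urlencode` are two equivalent ways to do it:

```python
from urllib.parse import quote_plus, urlencode

query = 'hello world'

# quote_plus encodes just the value; spaces become '+'
print('https://google.com/search?q=' + quote_plus(query))
# → https://google.com/search?q=hello+world

# urlencode builds the whole key=value query string
print('https://google.com/search?' + urlencode({'q': query}))
# → https://google.com/search?q=hello+world
```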
import urllib.parse
from bs4 import BeautifulSoup
import requests
import webbrowser

# Google blocks requests with a short or default User-Agent, so send a real one
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/117.0'
}

text = 'hello world'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url, headers=headers)

# Uncomment to save the raw HTML and inspect it in a browser
#with open('output.html', 'wb') as f:
#    f.write(response.content)
#webbrowser.open('output.html')

soup = BeautifulSoup(response.text, 'lxml')

# Each search result sits in an element with class 'g'
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')
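To pull out just the title and summary instead of the whole result text, you can look inside each 'g' element for the heading and the snippet. Google's real markup changes often, so the HTML below is a hand-written sketch of the structure (the 'snippet' class name is an assumption for illustration); verify the actual class names against the saved output.html:

```python
from bs4 import BeautifulSoup

# Simplified, hand-written sketch of Google's result markup; the real
# class names and nesting change often, so treat this as an assumption.
html = '''
<div class="g">
  <h3>First result title</h3>
  <div class="snippet">First result summary text.</div>
</div>
<div class="g">
  <h3>Second result title</h3>
  <div class="snippet">Second result summary text.</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

results = []
for g in soup.find_all(class_='g'):
    # Titles are usually in an <h3>; guard against results without one
    title = g.h3.get_text(strip=True) if g.h3 else None
    snippet = g.find(class_='snippet')
    summary = snippet.get_text(strip=True) if snippet else None
    results.append((title, summary))

for title, summary in results:
    print(title, '->', summary)
```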
Read the Beautiful Soup documentation.
Answered By - furas