Issue
So I want to find all the search results and store them in a list. Analysing the Google results page suggests that all results technically sit inside the g class.
So, technically, extracting a URL (for example) from the search results page should be as easy as:
import urllib.parse
from bs4 import BeautifulSoup
import requests

text = 'cyber security'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text

response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)
And yet, I have no output. Why?
Edit: Even manually parsing the stored page doesn't help:
import webbrowser

with open('output.html', 'wb') as f:
    f.write(response.content)
webbrowser.open('output.html')

url = "output.html"
page = open(url)
soup = BeautifulSoup(page.read(), features="lxml")
#soup = BeautifulSoup(response.content, 'lxml')
for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)
Solution
The following approach should fetch a few of the result links from Google's landing page; you may need to filter out some links ending with dots. Grabbing links from a Google search with plain requests is genuinely difficult, largely because the markup Google serves depends on the request's User-Agent header: with the default python-requests User-Agent, the response most likely doesn't contain the g-class containers your code is looking for, which is why you got no output.
import requests
from bs4 import BeautifulSoup

url = "http://www.google.com/search?q={}&hl=en"

def scrape_google_links(query):
    # A browser-like User-Agent is essential; Google serves different
    # markup to the default python-requests User-Agent.
    res = requests.get(url.format(query.replace(" ", "+")),
                       headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    # .kCrYT and .BNeawe are classes from the lightweight (no-JavaScript)
    # results page; the second .BNeawe under each anchor holds the
    # displayed URL breadcrumb.
    for result in soup.select(".kCrYT > a > .BNeawe:nth-of-type(2)"):
        print(result.text.replace(" › ", "/"))

if __name__ == '__main__':
    scrape_google_links('cyber security')
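If you want the actual target URLs rather than the displayed breadcrumb text, a variation on the same idea is to collect the anchors whose href starts with /url?q= and parse the q parameter out of each one. Below is a minimal sketch under that assumption; the scrape_google_hrefs name is just illustrative, and since Google changes its markup frequently, treat the /url?q= pattern as a best guess rather than a stable interface:

import urllib.parse

import requests
from bs4 import BeautifulSoup

def scrape_google_hrefs(query):
    # Same User-Agent trick as above; without it, Google serves a page
    # that lacks the expected result markup.
    res = requests.get(
        "http://www.google.com/search?q={}&hl=en".format(urllib.parse.quote_plus(query)),
        headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    hrefs = []
    # On the lightweight results page, result links are redirects of the
    # form /url?q=<target>&sa=...; pull out the q parameter to get the URL.
    for a in soup.select("a[href^='/url?q=']"):
        query_string = urllib.parse.urlparse(a["href"]).query
        target = urllib.parse.parse_qs(query_string).get("q", [None])[0]
        if target and target.startswith("http"):
            hrefs.append(target)
    return hrefs

if __name__ == '__main__':
    print(scrape_google_hrefs('cyber security'))

The startswith("http") check drops Google-internal links (image search, maps, and so on) that reuse the same redirect pattern.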
Answered By - SIM