Issue
I'm trying to use BeautifulSoup to find the birth years of different authors. I'm working in VS Code, if that's relevant. This is my first attempt at web scraping, so please explain things as clearly as possible.
For authors with Wikipedia pages, I can successfully find birth years using the following code:
source_code = requests.get("a_wikipedia_url")
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="html.parser")
finder = soup.find("span", {"class": "bday"})
if finder is not None:
    birth_year = finder.string[0:4]
    return birth_year
However, when I try the same thing with a Google search for authors with no (English) Wikipedia page, I just get None.
After reading this question https://stackoverflow.com/questions/62466340/cant-scrape-google-search-results-with-beautifulsoup I added a User-Agent request header to requests.get (I'm using Chrome Version 114.0.5735.134 (Official Build) (64-bit) and Windows 11 Home), but all it did was print None instead of giving me AttributeError: 'NoneType' object has no attribute 'string', which is what I was getting before adding the header.
This is my code:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.134 Safari/537.36"}
source_code = requests.get("https://www.google.com/search?q=Guillermo+Saccomanno", headers=headers)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="html.parser")
google_finder = soup.find("span", {"class": "LrzXr kno-fv wHYlTd z8gr9e"})
print(google_finder.string)
The result is just None - no error message, but no text.
I also tried setting the Chrome version in the header to Chrome/114.0.0.0, which is what I found online. It still gives None.
I'm not sure where I'm going wrong, as the syntax is identical and I copied the class name from the page source. For this particular author, I would expect google_finder.string to be "9 June 1948 (age 75 years)".
Solution
To parse the birth date, I'd choose a different strategy: find a <span> tag with the text "Born:" and then take its next sibling. Also add the hl=en parameter to the URL to get English results:
import requests
from bs4 import BeautifulSoup
url = 'https://www.google.com/search?q=Guillermo+Saccomanno&hl=en'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
born = soup.select_one('span:-soup-contains("Born:") + span')
print(born.text)
Prints:
June 9, 1948 (age 75 years), Buenos Aires, Argentina
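The sibling-selector approach can be tried offline before sending any requests. The sketch below is only an illustration: SAMPLE_HTML is hypothetical, simplified markup (Google's real page is more complex and changes over time), and extract_birth_year is a helper name introduced here, not part of the original answer. It also shows one possible reason a found tag can print None: in BeautifulSoup, .string is None whenever a tag has more than one child, while get_text() joins all the nested text.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the search result panel.
# The real page structure differs, so treat this purely as a test fixture.
SAMPLE_HTML = (
    '<div><span>Born: </span>'
    '<span>June 9, 1948 (age 75 years), <a href="#">Buenos Aires</a>, Argentina</span>'
    '</div>'
)


def extract_birth_year(html):
    """Return the first four-digit year after a "Born:" label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Select the <span> that immediately follows a <span> containing "Born:".
    born = soup.select_one('span:-soup-contains("Born:") + span')
    if born is None:
        # Label not present (different language, layout change, blocked request).
        return None
    match = re.search(r"\b(\d{4})\b", born.get_text())
    return match.group(1) if match else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
born = soup.select_one('span:-soup-contains("Born:") + span')

# The tag has three children (text, <a>, text), so .string is None.
print(born.string)
# get_text() concatenates the text of all descendants instead.
print(born.get_text())
print(extract_birth_year(SAMPLE_HTML))
```

Checking for None before reading the text keeps the script from crashing with an AttributeError for authors whose results have no "Born:" entry.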
Answered By - Andrej Kesely