Issue
I am trying to scrape the Abstract of the research paper on IEEE Xplore website, link :. For this I used urllib library and Beautifulsoup in Python(3.10.9). Below is the code i have used:
`
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
url = 'https://ieeexplore.ieee.org/document/8480057'
headers = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}
req = Request(url, headers=headers)
response = urlopen(req, timeout=10)
html_text = response.read()
soup = BeautifulSoup(html_text, "html.parser")
# Find the element containing the abstract
abstract_element = soup.find("div", class_="u-pb-1")
# Extract the text from the abstract element
abstract = abstract_element.text.strip()
# Print the abstract
print(abstract)
`
Here i have attached the screenshot of html part having Abstract.
I am getting AttributeError: 'NoneType' object has no attribute 'text' for abstract.
I got the value of soup. But I don't know how to get the Abstract text. I am new to web scraping. I have tried a lot but didn't able to solve it. Please help me to solve this problem. Thanks.
Solution
In this case it's actually possible to get the abstract without the need for Selenium. I generally use the requests
library and css selectors in BeautifulSoup and that's what I did here:
import requests
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.text, "html.parser")
and then simply:
print(soup.select_one('meta[property="og:description"]')['content'])
The output should be the abstract.
Answered By - Jack Fleeting
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.