Issue
I am scraping the MIT News website. My aim is to scrape all the article headers and their links and store them in a data frame. I am using bs4 for this, but I cannot find a way to determine how many pages there are to scrape.
The website shows this information: "Displaying 1 - 15 of 1019 news articles related to this topic." But I see no pattern through which I can extract the number of articles on one page and the total number of articles.
I can do the calculation by reading the numbers on the website myself, but I want a way to get this information while my code is running.
Solution
I would approach this problem by directly targeting the "Displaying xx - xx of yy news articles related to this topic" section, using the code below:
import requests
from bs4 import BeautifulSoup
from math import ceil

url = "https://news.mit.edu/topic/artificial-intelligence2"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Grab the "Displaying 1 - 15 of 1019 news articles ..." banner and keep
# only the part before "news articles", i.e. "Displaying 1 - 15 of 1019".
text = soup.find('div', class_='page-term--views--header').get_text(strip=True).split('news articles')[0]

# The number after "of" is the total article count.
total_articles = int(text.split('of')[1].split()[0])
# The number between "-" and "of" is the per-page count.
articles_per_page = int(text.split('-')[1].split('of')[0])
# Round up so a final partial page is still counted.
total_pages = ceil(total_articles / articles_per_page)

print(f"Total number of articles: {total_articles}")
print(f"Articles per page: {articles_per_page}")
print(f"Total pages: {total_pages}")
Which outputs:
Total number of articles: 1019
Articles per page: 15
Total pages: 68
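If you prefer not to rely on chained str.split() calls, a single regular expression can capture all three numbers from the banner at once. The sketch below is self-contained (it parses a hard-coded copy of the banner text rather than fetching the page), and the `?page=N` query parameter used to build the page URLs is an assumption about how the site paginates, not something confirmed by the answer above:

```python
import re
from math import ceil

# Sample banner text as rendered on the page (assumed to match the site's wording).
banner = "Displaying 1 - 15 of 1019 news articles related to this topic."

# One regex captures the first item, last item, and total in a single pass.
match = re.search(r"Displaying\s+(\d+)\s*-\s*(\d+)\s+of\s+(\d+)", banner)
first, last, total_articles = map(int, match.groups())

articles_per_page = last - first + 1                    # 15
total_pages = ceil(total_articles / articles_per_page)  # 68

# Hypothetical page URLs, assuming a 0-indexed `?page=N` parameter where
# the first page has no query string and page 2 is `?page=1`, and so on.
base = "https://news.mit.edu/topic/artificial-intelligence2"
page_urls = [base] + [f"{base}?page={n}" for n in range(1, total_pages)]

print(f"Total pages: {total_pages}, URLs built: {len(page_urls)}")
```

Each URL in page_urls could then be fetched and parsed with the same BeautifulSoup approach to collect the headers and links for the data frame.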
Answered By - Marc