Issue
I am trying to scrape the Medium website. Here is my code.
import requests
from bs4 import BeautifulSoup as bs

class Publication:
    def __init__(self, publication):
        self.publication = publication
        self.headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'}  # mimics a browser's request

    def get_articles(self):
        "Get the articles of the user/publication which was given as input"
        publication = self.publication
        r = requests.get(f"https://{publication}.com/", headers=self.headers)
        soup = bs(r.text, 'lxml')
        elements = soup.find_all('h2')
        for x in elements:
            print(x.text)

publication = Publication('towardsdatascience')
publication.get_articles()
It works somewhat, but it is not scraping all the titles; it only picks up some of the articles from the top of the page. I want it to get all the article names from the page. It is also picking up the sidebar content, such as the "who to follow" section, which I don't want. How do I do that?
Here is the output of my code:
How to Rewrite and Optimize Your SQL Queries to Pandas in 5 Simple Examples
Storytelling with Charts
Simplify Your Data Preparation with These Four Lesser-Known Scikit-Learn Classes
Non-Parametric Tests for Beginners (Part 1: Rank and Sign Tests)
BigQuery Best Practices: Unleash the Full Potential of Your Data Warehouse
How to Test Your Python Code with Pytest
7 Signs You’ve Become an Advanced Sklearn User Without Even Realizing It
How Data Scientists Save Time
MLOps: What is Operational Tempo?
Finding Your Dream Master’s Program in AI
Editors
TDS Editors
Ben Huberman
Caitlin Kindig
Sign up for The Variable
Solution
As Barry the Platipus mentions in a comment, the content you want is loaded via JavaScript. A complicating factor is that this content is only loaded as you scroll the page, so even a naive Selenium-based solution like this one will still return only the same set of results as your existing code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

class Publication:
    def __init__(self, publication):
        self.publication = publication

    def get_articles(self):
        "Get the articles of the user/publication which was given as input"
        publication = self.publication
        driver.get(f"https://{publication}.com/")
        elements = driver.find_elements(By.CSS_SELECTOR, "h2")
        for x in elements:
            print(x.text)

publication = Publication("towardsdatascience")
publication.get_articles()
To get more than the initial set of articles, we need to scroll the page. For example, if we add a simple loop to scroll the page a few times before querying for h2 elements (note that this also requires import time at the top of the script), like this:
def get_articles(self):
    "Get the articles of the user/publication which was given as input"
    publication = self.publication
    driver.get(f"https://{publication}.com/")
    for x in range(3):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # The sleep here is to give the page time to respond.
        time.sleep(0.2)
    elements = driver.find_elements(By.CSS_SELECTOR, "h2")
    for x in elements:
        print(x.text)
Then the output of the code is:
Large Language Models in Molecular Biology
How to Rewrite and Optimize Your SQL Queries to Pandas in 5 Simple Examples
Storytelling with Charts
Simplify Your Data Preparation with These Four Lesser-Known Scikit-Learn Classes
Non-Parametric Tests for Beginners (Part 1: Rank and Sign Tests)
BigQuery Best Practices: Unleash the Full Potential of Your Data Warehouse
How to Test Your Python Code with Pytest
7 Signs You’ve Become an Advanced Sklearn User Without Even Realizing It
How Data Scientists Save Time
MLOps: What is Operational Tempo?
Finding Your Dream Master’s Program in AI
Temporary Variables in Python: Readability versus Performance
Naive Bayes Classification
Predicting the Functionality of Water Pumps with XGBoost
Detection of Credit Card Fraud with an Autoencoder
4 Reasons Why I Won’t Sign the “Existential Risk” New Statement
The Data-centric AI Concepts in Segment Anything
3D Deep Learning Python Tutorial: PointNet Data Preparation
Why Trust and Safety in Enterprise AI Is (Relatively) Easy
The Principles of a Modern Computer Scientist
That page appears to use an "infinite scroll" design, so you will probably want to set a limit on how many times you scroll to look for new content.
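For example, here is a rough sketch of a bounded scroll loop: it stops after a maximum number of scrolls, or earlier if the page height stops growing (a common way to detect that nothing new was loaded). The max_scrolls and pause values, and the "article h2" selector, are assumptions you may need to adjust to Medium's current markup; the selector is intended to skip sidebar headings such as "Editors" on the assumption that each story card is wrapped in an article element.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)


def get_article_titles(publication, max_scrolls=10, pause=0.5):
    """Scroll the publication's front page until no new content loads
    (or until max_scrolls is reached), then return the article titles."""
    driver.get(f"https://{publication}.com/")
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop scrolling
        last_height = new_height
    # "article h2" assumes each story card sits inside an <article> element,
    # which should exclude sidebar headings like "Editors" -- adjust if needed.
    return [el.text for el in driver.find_elements(By.CSS_SELECTOR, "article h2")]


print("\n".join(get_article_titles("towardsdatascience")))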
Answered By - larsks