Issue
I am trying to scrape the Medium website. Here is my code.
import requests
from bs4 import BeautifulSoup as bs

class Publication:
    def __init__(self, publication):
        self.publication = publication
        self.headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'}  # mimics a browser's request

    def get_articles(self):
        "Get the articles of the user/publication which was given as input"
        publication = self.publication
        r = requests.get(f"https://{publication}.com/", headers=self.headers)
        soup = bs(r.text, 'lxml')
        elements = soup.find_all('h2')
        for x in elements:
            print(x.text)

publication = Publication('towardsdatascience')
publication.get_articles()
It works somewhat, but it is not scraping all the titles; it only picks up some of the articles from the top of the page. I want it to get all the article names from the page. It is also picking up the sidebar content, such as the "who to follow" section, which I don't want. How do I do that?
Here is the output of my code:
How to Rewrite and Optimize Your SQL Queries to Pandas in 5 Simple Examples
Storytelling with Charts
Simplify Your Data Preparation with These Four Lesser-Known Scikit-Learn Classes
Non-Parametric Tests for Beginners (Part 1: Rank and Sign Tests)
BigQuery Best Practices: Unleash the Full Potential of Your Data Warehouse
How to Test Your Python Code with Pytest
7 Signs You’ve Become an Advanced Sklearn User Without Even Realizing It
How Data Scientists Save Time
MLOps: What is Operational Tempo?
Finding Your Dream Master’s Program in AI
Editors
TDS Editors
Ben Huberman
Caitlin Kindig
Sign up for The Variable
Solution
As Barry the Platipus mentions in a comment, the content you want is loaded via JavaScript. A complicating factor is that this content is only loaded as you scroll the page, so even a naive Selenium-based solution like this one will still return only the same set of results as your existing code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

class Publication:
    def __init__(self, publication):
        self.publication = publication

    def get_articles(self):
        "Get the articles of the user/publication which was given as input"
        publication = self.publication
        driver.get(f"https://{publication}.com/")
        elements = driver.find_elements(By.CSS_SELECTOR, "h2")
        for x in elements:
            print(x.text)

publication = Publication("towardsdatascience")
publication.get_articles()
To get more than the initial set of articles, we need to scroll the page. For example, if we add a simple loop to scroll the page a few times before querying for h2 elements (note that this also requires import time at the top of the script), like this:
def get_articles(self):
    "Get the articles of the user/publication which was given as input"
    publication = self.publication
    driver.get(f"https://{publication}.com/")
    for x in range(3):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # The sleep here is to give the page time to respond.
        time.sleep(0.2)
    elements = driver.find_elements(By.CSS_SELECTOR, "h2")
    for x in elements:
        print(x.text)
Then the output of the code is:
Large Language Models in Molecular Biology
How to Rewrite and Optimize Your SQL Queries to Pandas in 5 Simple Examples
Storytelling with Charts
Simplify Your Data Preparation with These Four Lesser-Known Scikit-Learn Classes
Non-Parametric Tests for Beginners (Part 1: Rank and Sign Tests)
BigQuery Best Practices: Unleash the Full Potential of Your Data Warehouse
How to Test Your Python Code with Pytest
7 Signs You’ve Become an Advanced Sklearn User Without Even Realizing It
How Data Scientists Save Time
MLOps: What is Operational Tempo?
Finding Your Dream Master’s Program in AI
Temporary Variables in Python: Readability versus Performance
Naive Bayes Classification
Predicting the Functionality of Water Pumps with XGBoost
Detection of Credit Card Fraud with an Autoencoder
4 Reasons Why I Won’t Sign the “Existential Risk” New Statement
The Data-centric AI Concepts in Segment Anything
3D Deep Learning Python Tutorial: PointNet Data Preparation
Why Trust and Safety in Enterprise AI Is (Relatively) Easy
The Principles of a Modern Computer Scientist
That page appears to use an "infinite scroll" design, so you will probably want to set a limit on how many times you scroll to look for new content.
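For example, here is a rough sketch of a bounded scroll loop: it stops after a maximum number of scrolls, or earlier if the page height stops growing (a common way to detect that nothing new was loaded). The max_scrolls and pause values, and the "article h2" selector, are assumptions you may need to adjust to Medium's current markup; the selector is intended to skip sidebar headings such as "Editors" on the assumption that each story card is wrapped in an article element.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)


def get_article_titles(publication, max_scrolls=10, pause=0.5):
    """Scroll the publication's front page until no new content loads
    (or until max_scrolls is reached), then return the article titles."""
    driver.get(f"https://{publication}.com/")
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop scrolling
        last_height = new_height
    # "article h2" assumes each story card sits inside an <article> element,
    # which should exclude sidebar headings like "Editors" -- adjust if needed.
    return [el.text for el in driver.find_elements(By.CSS_SELECTOR, "article h2")]


print("\n".join(get_article_titles("towardsdatascience")))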
Answered By - larsks