Issue
I want to get all article links from "https://www.cnnindonesia.com/search?query=covid". Here is my code:
links = []
base_url = requests.get(f"https://www.cnnindonesia.com/search?query=covid")
soup = bs(base_url.text, 'html.parser')
cont = soup.find_all('div', class_='container')
for l in cont:
    l_cont = l.find_all('div', class_='l_content')
    for bf in l_cont:
        bf_cont = bf.find_all('div', class_='box feed')
        for lm in bf_cont:
            lm_cont = lm.find('div', class_='list media_rows middle')
            for article in lm_cont.find_all('article'):
                a_cont = article.find('a', href=True)
                if url:
                    link = a['href']
                    links.append(link)
The result is as follows:
links
[]
Solution
Each article has this structure:
<article class="col_4">
<a href="https://www.cnnindonesia.com/...">
<span>...</span>
<h2 class="title">...</h2>
</a>
</article>
It is simpler to iterate over the article elements and then look for the a element within each one. Try:
from bs4 import BeautifulSoup
import requests

links = []
response = requests.get("https://www.cnnindonesia.com/search?query=covid")
soup = BeautifulSoup(response.text, 'html.parser')
for article in soup.find_all('article'):
    url = article.find('a', href=True)
    if url:
        link = url['href']
        print(link)
        links.append(link)
print(links)
Output:
https://www.cnnindonesia.com/nasional/...pola-sawah-di-laut-natuna-utara
...
['https://www.cnnindonesia.com/nasional/...pola-sawah-di-laut-natuna-utara', ...
'https://www.cnnindonesia.com/gaya-hidup/...ikut-penerbangan-gravitasi-nol']
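As a side note, the same extraction can also be written with a CSS selector via soup.select. A minimal sketch against a static snippet that mirrors the article structure shown above (the URLs below are illustrative placeholders, not real results):

```python
from bs4 import BeautifulSoup

# Illustrative markup mirroring the <article><a href=...> structure above
html = """
<article class="col_4">
  <a href="https://www.cnnindonesia.com/example-1"><h2 class="title">One</h2></a>
</article>
<article class="col_4">
  <a href="https://www.cnnindonesia.com/example-2"><h2 class="title">Two</h2></a>
</article>
"""

soup = BeautifulSoup(html, 'html.parser')
# 'article a[href]' matches every <a> carrying an href inside an <article>
links = [a['href'] for a in soup.select('article a[href]')]
print(links)
```

This collapses the find_all/find/if-check pattern into one list comprehension, since the `a[href]` selector already filters out anchors without an href attribute.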
Update:
If you want to extract the URLs that are dynamically added by JavaScript inside the <div class="list media_rows middle"> element, then you must use something like Selenium, which can extract the content after the full page has been rendered in the web browser.
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.cnnindonesia.com/search?query=covid'
links = []
options = webdriver.ChromeOptions()
pathToChromeDriver = "chromedriver.exe"
browser = webdriver.Chrome(executable_path=pathToChromeDriver,
                           options=options)
try:
    browser.get(url)
    browser.implicitly_wait(10)
    html = browser.page_source
    content = browser.find_element(By.CLASS_NAME, 'media_rows')
    for elt in content.find_elements(By.TAG_NAME, 'article'):
        link = elt.find_element(By.TAG_NAME, 'a')
        href = link.get_attribute('href')
        if href:
            print(href)
            links.append(href)
finally:
    browser.quit()
Answered By - CodeMonkey