Monday, January 31, 2022

[FIXED] Getting news from WSJ using BeautifulSoup


Issue

I'm trying to scrape the main news from the WSJ website (https://www.wsj.com/news/economy specifically). However, I don't understand why running the code below returns the news from a side column titled "Most Popular News". I'm selecting a div with the class "WSJTheme--headline--7VCzo7Ay", which seems to refer to the main news when I inspect the site.

I would very much appreciate any help getting the news from the main section of the page linked above. For example, two of the headlines I'm after are (right now) "Powell Says Low-Income Lending Rules Should Apply to All Firms" and "Treasury Expects to Borrow $1.3 Trillion in Second Half of Fiscal 2021". Thank you! Code below.

from bs4 import BeautifulSoup
import requests
from datetime import date, time, datetime, timedelta

url='https://www.wsj.com/news/economy'
response=requests.get(url)

soup=BeautifulSoup(response.content,'lxml')

for item in soup.select('.WSJTheme--headline--7VCzo7Ay '):
    headline = item.find('h2').get_text()
    link = item.find('a')['href']

    noticia = headline + ' - ' + link

    print(noticia)

Solution

There are two problems with your current code:

You need to specify the HTTP User-Agent header; otherwise, the website thinks that you're a bot and will block you.
You are looking for article headlines by searching for an <h2> tag. However, only the first article is under an <h2>; the others are under an <h3> tag. To select both <h2> and <h3>, you can use a CSS selector: .select_one("h2, h3").

from bs4 import BeautifulSoup
import requests

url = "https://www.wsj.com/news/economy"
# Specify the `user-agent` header in order not to be blocked
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, "lxml")

for item in soup.select(".WSJTheme--headline--7VCzo7Ay"):
    # Articles might be under an `h2` or `h3`, use a CSS selector to select both
    headline = item.select_one("h2, h3").get_text()
    link = item.find("a")["href"]
    noticia = headline + " - " + link

    print(noticia)

Output (truncated):

Consumer Demand Drives U.S. Imports to Record High - https://www.wsj.com/articles/u-s-trade-deficit-widened-to-74-4-billion-in-march-11620132526
Who Would Pay Biden’s Corporate Tax Hike Is Key to Policy Debate - https://www.wsj.com/articles/who-would-pay-bidens-corporate-tax-increase-is-key-question-in-policy-debate-11620130284
Powell Says Low-Income Lending Rules Should Apply to All Firms - https://www.wsj.com/articles/powell-highlights-slower-recovery-for-low-wage-and-minority-workers-11620065926
Treasury Expects to Borrow $1.3 Trillion in Second Half of Fiscal 2021 - https://www.wsj.com/articles/treasury-expects-to-borrow-1-3-trillion-over-second-half-of-fiscal-2021-11620068646
Yellen to Appoint Senior Fed Official to Run Top Bank Regulator - https://www.wsj.com/articles/yellen-to-appoint-senior-fed-official-to-run-occ-11620057637
...
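
The <h2>/<h3> point can be checked without hitting the site. The sketch below is only an illustration: the SAMPLE_HTML markup, the example.com links, and the extract_headlines helper are hypothetical stand-ins for the structure described above, and it uses the built-in "html.parser" instead of "lxml" so that no extra parser is needed. It also skips items that have no heading or link, an assumption about how one might want to handle incomplete entries.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above:
# the first headline sits in an <h2>, the later one in an <h3>.
SAMPLE_HTML = """
<div class="WSJTheme--headline--7VCzo7Ay">
  <h2><a href="https://example.com/first">First headline</a></h2>
</div>
<div class="WSJTheme--headline--7VCzo7Ay">
  <h3><a href="https://example.com/second">Second headline</a></h3>
</div>
<div class="WSJTheme--headline--7VCzo7Ay">
  <span>No heading or link here</span>
</div>
"""


def extract_headlines(html):
    """Return (headline, link) pairs, skipping items without a heading or link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select(".WSJTheme--headline--7VCzo7Ay"):
        # Match either heading level with a single CSS selector
        heading = item.select_one("h2, h3")
        # Only accept <a> tags that actually carry an href attribute
        anchor = item.find("a", href=True)
        if heading is None or anchor is None:
            continue
        results.append((heading.get_text(strip=True), anchor["href"]))
    return results


for headline, link in extract_headlines(SAMPLE_HTML):
    print(headline + " - " + link)
```

With item.find("h2") alone, the second item would return None and the following .get_text() call would raise an AttributeError, which is why the combined selector matters here.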

Answered By - MendelG

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
