Issue
I'm trying to scrape the main news from the WSJ website (https://www.wsj.com/news/economy specifically). However, I don't understand why the code below returns the news from a side column titled "Most Popular News": I'm selecting the div class "WSJTheme--headline--7VCzo7Ay", which seems to refer to the main news when I inspect the site.
I would very much appreciate any help getting the news from the main section of the page linked above. For example, two of the headlines I'm after are (right now) "Powell Says Low-Income Lending Rules Should Apply to All Firms" and "Treasury Expects to Borrow $1.3 Trillion in Second Half of Fiscal 2021". Thank you! Code below.
from bs4 import BeautifulSoup
import requests
from datetime import date, time, datetime, timedelta
url='https://www.wsj.com/news/economy'
response=requests.get(url)
soup=BeautifulSoup(response.content,'lxml')
for item in soup.select('.WSJTheme--headline--7VCzo7Ay '):
    headline = item.find('h2').get_text()
    link = item.find('a')['href']
    noticia = headline + ' - ' + link
    print(noticia)
Solution
There are two problems with your current code:
1. You need to specify the HTTP User-Agent header; otherwise, the website thinks that you're a bot and blocks you.
2. You are looking for article headlines by searching for an <h2> tag. However, only the first article is under an <h2>; the others are under an <h3> tag. To select both <h2> and <h3>, you can use a CSS selector: .select_one("h2, h3").
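The second point can be checked offline. The sketch below is only an illustration: SAMPLE_HTML is a made-up fragment that mimics the layout described above (it is not real WSJ markup), and it uses the built-in "html.parser" instead of "lxml" so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Made-up fragment (an assumption, not real WSJ markup): the first
# headline sits in an <h2>, the second one in an <h3>.
SAMPLE_HTML = """
<div class="WSJTheme--headline--7VCzo7Ay">
  <h2><a href="https://example.com/first">First headline</a></h2>
</div>
<div class="WSJTheme--headline--7VCzo7Ay">
  <h3><a href="https://example.com/second">Second headline</a></h3>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
blocks = soup.select(".WSJTheme--headline--7VCzo7Ay")

# find("h2") only matches the first block; for the second it returns None,
# so calling .get_text() on it would raise an AttributeError.
h2_only = [block.find("h2") for block in blocks]

# select_one("h2, h3") matches whichever of the two heading tags is present.
titles = [block.select_one("h2, h3").get_text(strip=True) for block in blocks]

print(h2_only[1])  # None
print(titles)      # ['First headline', 'Second headline']
```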
from bs4 import BeautifulSoup
import requests
url = "https://www.wsj.com/news/economy"
# Specify the `user-agent` in order not to be blocked
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "lxml")
for item in soup.select(".WSJTheme--headline--7VCzo7Ay"):
    # Articles might be under an `h2` or `h3`, use a CSS selector to select both
    headline = item.select_one("h2, h3").get_text()
    link = item.find("a")["href"]
    noticia = headline + " - " + link
    print(noticia)
Output (truncated):
Consumer Demand Drives U.S. Imports to Record High - https://www.wsj.com/articles/u-s-trade-deficit-widened-to-74-4-billion-in-march-11620132526
Who Would Pay Biden’s Corporate Tax Hike Is Key to Policy Debate - https://www.wsj.com/articles/who-would-pay-bidens-corporate-tax-increase-is-key-question-in-policy-debate-11620130284
Powell Says Low-Income Lending Rules Should Apply to All Firms - https://www.wsj.com/articles/powell-highlights-slower-recovery-for-low-wage-and-minority-workers-11620065926
Treasury Expects to Borrow $1.3 Trillion in Second Half of Fiscal 2021 - https://www.wsj.com/articles/treasury-expects-to-borrow-1-3-trillion-over-second-half-of-fiscal-2021-11620068646
Yellen to Appoint Senior Fed Official to Run Top Bank Regulator - https://www.wsj.com/articles/yellen-to-appoint-senior-fed-official-to-run-occ-11620057637
...
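If a matched block has no heading or no link, select_one() and find() return None and the loop above stops with an error. A slightly more defensive variant might look like the sketch below. The helper name extract_headlines and the sample fragment are hypothetical, and the class name is simply the one copied from the question, which may no longer match the live page.

```python
from bs4 import BeautifulSoup


def extract_headlines(html, selector=".WSJTheme--headline--7VCzo7Ay"):
    """Return (headline, link) pairs, skipping blocks that lack either part.

    Hypothetical helper for illustration; the default selector is the
    class name taken from the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select(selector):
        heading = item.select_one("h2, h3")
        anchor = item.find("a", href=True)
        # Skip incomplete blocks instead of raising on None
        if heading is None or anchor is None:
            continue
        results.append((heading.get_text(strip=True), anchor["href"]))
    return results


# Made-up fragment: one complete block and one block without a link
sample = """
<div class="WSJTheme--headline--7VCzo7Ay">
  <h3><a href="https://example.com/a">Complete block</a></h3>
</div>
<div class="WSJTheme--headline--7VCzo7Ay">
  <h3>Heading without a link</h3>
</div>
"""

for headline, link in extract_headlines(sample):
    print(headline + " - " + link)
```

With the live page, the same function could be called as extract_headlines(response.content). An empty result would then suggest that the request was blocked or that the class name has changed.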
Answered By - MendelG