Issue
I am trying to scrape data from https://www.wsj.com/market-data/bonds/treasuries. There are two tables on this website which get switched when we select the options:
1. Treasury Notes and Bond
2. Treasury Bills
I want to scrape the data for Treasury bills. But there is no change in the link and attributes or anything when i click that option. I have tried a lot of things but every time, i am able to scrape the data for Treasury Notes and Bond. Can someone help me with that? Following the my code:
import re
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup
mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('tr')
list_rows = []
for row in rows:
cells = row.find_all('td')
str_cells = str(cells)
clean = re.compile('<.*?>')
clean2 = (re.sub(clean, '',str_cells))
list_rows.append(clean2)
df = pd.DataFrame(list_rows)
df1 = df[0].str.split(',', expand=True)
Solution
All the data in the site is loaded once and then js is used to update the values in the table
Here is a working quickly written code:
import requests
from bs4 import BeautifulSoup
import json
mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('script') # we get all the script tags
importantJson = ''
for r in rows:
text = r.text
if 'NOTES_AND_BONDS' in text: # the scirpt tags containing the date, probably you can do this better
importantJson = text
break
# remove the non json stuff
importantJson = importantJson\
.replace('window.__STATE__ =', '')\
.replace(';', '')\
.strip()
#parse the json
jsn = json.loads(importantJson)
print(jsn) #json object containing all the data you need
How did I got to this conclusion? First I noticed that switching between the two tables makes no http requests to the server, meaning the data is already there. Then I inspected the table html and noticed that there is only one table and its contents are dynamically changing, which lead me to the conclusion that this data is already on the page. Then with simple search in the source I found the script tag containing the json.
Answered By - Borislav Stoilov
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.