Sunday, January 30, 2022

[FIXED] Web Scraping Python For Two Different Buttons

Labels: beautifulsoup, python, web-scraping

Issue

I am trying to scrape data from https://www.wsj.com/market-data/bonds/treasuries. The page has two tables, and it switches between them when one of these options is selected:

     1. Treasury Notes and Bonds
     2. Treasury Bills

I want to scrape the data for Treasury Bills, but nothing in the URL or the element attributes changes when I click that option. I have tried many things, but every time I only get the data for Treasury Notes and Bonds. Can someone help me with that? Here is my code:

   import re
   import csv
   import requests
   import pandas as pd
   from bs4 import BeautifulSoup


   mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
   page = requests.get(mostActiveStocksUrl)
   data = page.text
   soup = BeautifulSoup(page.content, 'html.parser')
   rows = soup.find_all('tr')


   list_rows = []
   for row in rows:
       cells = row.find_all('td')
       str_cells = str(cells)
       clean = re.compile('<.*?>')
       clean2 = (re.sub(clean, '',str_cells))
       list_rows.append(clean2)


   df = pd.DataFrame(list_rows)
   df1 = df[0].str.split(',', expand=True)

Solution

All of the data on the page is loaded once, and JavaScript then updates the values shown in the table.

Here is a quickly written working example:

import json

import requests
from bs4 import BeautifulSoup

mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
soup = BeautifulSoup(page.content, 'html.parser')
scripts = soup.find_all('script')  # get all the script tags

importantJson = ''

for s in scripts:
    text = s.text
    # the script tag containing the data; this check could probably be more precise
    if 'NOTES_AND_BONDS' in text:
        importantJson = text
        break

# remove the non-JSON wrapper: the assignment prefix and the trailing semicolon
# (only the trailing ';' is stripped, so semicolons inside the data are kept)
importantJson = importantJson\
    .replace('window.__STATE__ =', '')\
    .strip()\
    .rstrip(';')

# parse the JSON
jsn = json.loads(importantJson)
print(jsn)  # JSON object containing all the data you need
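
The string replacements above assume the script tag contains nothing but the `window.__STATE__ =` assignment. A slightly more defensive sketch is shown here, using `json.JSONDecoder.raw_decode` from the standard library, which parses a single JSON value starting at a given index and ignores whatever follows it. The helper name `extract_state` and the sample script text are made up for illustration; only the `window.__STATE__` marker comes from the code above.

```python
import json


def extract_state(script_text, marker="window.__STATE__"):
    """Parse the JSON object assigned to `marker` inside a script's text.

    Hypothetical helper: assumes the assignment looks like `marker = {...};`.
    """
    start = script_text.index(marker) + len(marker)
    # Skip ahead to the first "{" after the marker (past the " = ").
    start = script_text.index("{", start)
    # raw_decode parses one JSON value and returns (object, end_index),
    # so trailing characters such as ";" or further statements do not matter.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Made-up sample, not the real WSJ payload.
sample = 'window.__STATE__ = {"note": "a;b", "rows": [1, 2]}; console.log("done");'
state = extract_state(sample)
print(state["note"])
```

With this approach, semicolons inside string values survive untouched, and `str.index` raises a `ValueError` if the marker is missing, which makes a changed page layout easier to notice.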

How did I get to this conclusion? First, I noticed that switching between the two tables makes no HTTP requests to the server, meaning the data is already there. Then I inspected the table HTML and noticed that there is only one table whose contents change dynamically, which led me to conclude that the data is already on the page. A simple search in the page source then turned up the script tag containing the JSON.
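
The same "search for a keyword" idea can help locate the Treasury Bills data inside the parsed JSON. The sketch below walks a nested dict/list structure and yields the path to every key that contains a given substring. The function name `find_key_paths`, the search term `"BILLS"`, and the sample dictionary are assumptions for illustration; the real key names in the WSJ payload may differ, so it is worth inspecting the printed paths before relying on them.

```python
def find_key_paths(obj, needle, path=()):
    """Yield paths (tuples of keys/indexes) to dict keys containing `needle`."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if needle in str(key):
                yield path + (key,)
            # Keep descending so nested matches are found as well.
            yield from find_key_paths(value, needle, path + (key,))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from find_key_paths(item, needle, path + (index,))


# Made-up structure standing in for the parsed JSON object; the key names
# here are hypothetical and only mimic the NOTES_AND_BONDS marker.
sample = {
    "data": {
        "NOTES_AND_BONDS": {"rows": []},
        "BILLS": {"rows": [{"maturity": "example"}]},
    }
}

for p in find_key_paths(sample, "BILLS"):
    print(p)
```

Each yielded path can then be followed key by key (for example `sample["data"]["BILLS"]`) to reach the rows for that table.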

Answered By - Borislav Stoilov

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
