Issue
I'm trying to write some code that scrapes data from a table on a stock screener website and saves it to Excel. The problem is that some of the values I want to pull from the table don't have a distinct class. I tried the code below for just the first column I wanted, the ticker, but it pulls all of the tab-links on the page. Any help would be appreciated.
from bs4 import BeautifulSoup
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'}
df_headers = ['Ticker', 'Owner', 'Relationship', 'Date', 'Transaction', 'Total Shares', 'SEC Form']
url= "https://finviz.com/insidertrading.ashx"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
Ticker = [item.text for item in soup.select('.tab-link:nth-of-type(1):not([id])')]
print(Ticker)
I also tried this code:
Ticker = [item.text for item in soup.select('.insider-buy-row-2 .tab-link')]
It did pull the ticker I wanted, but it also included the person's name and other rows.
Solution
Use a combination of pandas and BeautifulSoup:
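from io import StringIO
from bs4 import BeautifulSoup
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'}
df_headers = ['Ticker', 'Owner', 'Relationship', 'Date', 'Transaction', 'Total Shares', 'SEC Form']
url = "https://finviz.com/insidertrading.ashx"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
# Collect every <table> on the page and let pandas parse them all at once
tbl = soup.find_all("table")
tbls = pd.read_html(StringIO(str(tbl)))
# Index 4 was the insider-trading table here; it may differ if the page layout changes
df = tbls[4]
# Use the first row as the column names and drop it from the data
df, df.columns = df[1:], df.iloc[0]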
The important part here is that pd.read_html can read multiple DataFrames from <table> tags. You just have to grab the right table from the output and set the header properly.
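To make the "grab the right table and set the header" step concrete, here is a minimal offline sketch. The HTML string, the table index 1, and the helper name pick_table are all made up for illustration and are not taken from the finviz page; the real page will need its own index.

```python
from io import StringIO

import pandas as pd

# Hypothetical page: a navigation table followed by the data table we want.
HTML = """
<table><tr><td>Home</td><td>News</td></tr></table>
<table>
  <tr><td>Ticker</td><td>Owner</td><td>Transaction</td></tr>
  <tr><td>AAA</td><td>Jane Doe</td><td>Buy</td></tr>
  <tr><td>BBB</td><td>John Roe</td><td>Sale</td></tr>
</table>
"""


def pick_table(html: str, index: int) -> pd.DataFrame:
    """Parse every <table> in html and return the one at `index`,
    using its first row as the column names (hypothetical helper)."""
    tables = pd.read_html(StringIO(html))  # one DataFrame per <table>
    df = tables[index]
    df.columns = df.iloc[0]                # first row becomes the header
    return df[1:].reset_index(drop=True)   # drop that row from the data


df = pick_table(HTML, 1)
tickers = df["Ticker"].tolist()  # a single column, with no CSS selector needed
print(tickers)
```

Once the header is set, a single column such as the ticker can be pulled by name instead of by CSS class. pd.read_html also accepts a header argument (for example header=0), which may make the manual header step unnecessary, depending on how the page marks up its first row.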
Answered By - Vivek Kalyanarangan