Issue
I have been trying to extract a table, but my code retrieves only the table's headings. This is my first attempt:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = r"https://www.sec.gov/edgar/search/#/q=Women&dateRange=custom&entityName=Infosys&startdt=2010-03-01&enddt=2020-03-01"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find_all("table")[1]

# Extract the column headings of the table.
rows = table.find_all('tr')
columns = []
headings = rows[0].find_all('th')
for col in headings:
    columns.append(col.text.strip())
print(columns)

# Extract all the data of the table, row by row.
all_data = []
for row in rows[1:]:
    data = row.find_all('td')
    lst = []
    for d in data:
        lst.append(d.text.strip())
    all_data.append(lst)

# Create the DataFrame from the extracted data.
ds = pd.DataFrame(all_data, columns=columns)
ds
Second way:
ds1 = pd.read_html(url)[0]
ds1
When I inspect the table, I get all the column headings in the thead tag, but the tbody is empty.
table = soup.find_all("table", class_='table')
table
Output:
[<table class="table table-hover entity-hints" id="asdf"></table>,
<table class="table">
<thead>
<tr>
<th class="filetype" id="filetype">Form & File</th>
<th class="filed">Filed</th>
<th class="enddate">Reporting for</th>
<th class="entity-name">Filing entity/person</th>
<th class="cik">CIK</th>
<th class="located">Located</th>
<th class="incorporated">Incorporated</th>
<th class="file-num">File number</th>
<th class="film-num">Film number</th>
</tr>
</thead>
<tbody>
</tbody>
</table>]
Why is the tbody tag empty?
Solution
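The tbody is empty because the rows are not part of the HTML that requests.get receives. The page fills the table after it loads, by sending a POST request to https://efts.sec.gov/LATEST/search-index. You can scrape the data by sending that request yourself: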
import json

import requests

URL = "https://efts.sec.gov/LATEST/search-index"

# Same search parameters as in the question's URL.
payload = {
    "q": "Women",
    "dateRange": "custom",
    "entityName": "Infosys",
    "startdt": "2010-03-01",
    "enddt": "2020-03-01",
}

# The endpoint returns JSON, so parse the response body directly
# instead of passing it through an HTML parser.
response = requests.post(URL, data=json.dumps(payload))
json_data = response.json()

fmt_string = "{:<25} {:<20} {:<20} {:<20}"
print(
    fmt_string.format("Form & File", "Filed", "Reporting for", "Filing/entity person")
)
print("-" * 100)

# Each hit keeps the filing's fields under the "_source" key.
for hit in json_data["hits"]["hits"]:
    source = hit["_source"]
    form = source["root_form"] + source["file_type"]
    filed = source["file_date"]
    reporting_for = source["period_ending"]
    entity = source["display_names"][0].split("(CIK")[0]
    print(fmt_string.format(form, filed, reporting_for, entity))
Output:
Form & File               Filed                Reporting for        Filing/entity person
----------------------------------------------------------------------------------------------------
6-KEX-99.1 CHARTER        2016-01-14           2015-12-31           Infosys Ltd (INFY)
6-KEX-99.3 VOTING TRUST   2016-07-20           2016-06-30           Infosys Ltd (INFY)
6-KEX-99.1 CHARTER        2014-01-15           2013-12-31           Infosys Ltd (INFY)
6-KEX-99.1                2014-01-10           2013-12-31           Infosys Ltd (INFY)
6-KEX-99.1 CHARTER        2019-10-11           2019-09-30           Infosys Ltd (INFY)
6-KEX-99.2 BYLAWS         2019-10-16           2019-09-30           Infosys Ltd (INFY)
20-F20-F                  2016-05-18           2016-03-31           Infosys Ltd (INFY)
6-KEX-99.2                2016-01-19           2015-12-31           Infosys Ltd (INFY)
20-F20-F                  2019-06-19           2019-03-31           Infosys Ltd (INFY)
6-KEX-99.1 CHARTER        2013-12-20           2013-12-20           Infosys Ltd (INFY)
20-F20-F                  2017-06-12           2017-03-31           Infosys Ltd (INFY)
20-F20-F                  2014-05-09           2014-03-31           Infosys Ltd (INFY)
6-KEX-99.2 BYLAWS         2014-01-15           2013-12-31           Infosys Ltd (INFY)
6-KEX-99.1 CHARTER        2019-10-16           2019-09-30           Infosys Ltd (INFY)
20-F20-F                  2018-07-19           2018-03-31           Infosys Ltd (INFY)
6-K6-K                    2013-12-20           2013-12-20           Infosys Ltd (INFY)
6-KEX-99.1                2016-01-19           2015-12-31           Infosys Ltd (INFY)
6-K6-K                    2014-03-28           2014-03-28           Infosys Ltd (INFY)
20-F20-F                  2015-05-20           2015-03-31           Infosys Ltd (INFY)
6-KEX-99.3 VOTING TRUST   2010-07-16           2010-06-30           INFOSYS TECHNOLOGIES LTD (INFY)
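The answer prints the results, while the question was after a DataFrame. As a rough sketch (not part of the original answer), the same fields could be collected into one dict per filing. The helper name hits_to_rows, the extra .strip() on the entity name, and the sample dict (including its placeholder CIK) are assumptions made for illustration; only the field names come from the answer's code.

```python
def hits_to_rows(json_data):
    """Flatten a search response into one dict per filing."""
    rows = []
    for hit in json_data["hits"]["hits"]:
        source = hit["_source"]
        rows.append(
            {
                "Form & File": source["root_form"] + source["file_type"],
                "Filed": source["file_date"],
                "Reporting for": source["period_ending"],
                # Drop the "(CIK ...)" suffix and surrounding whitespace.
                "Filing entity/person": source["display_names"][0]
                .split("(CIK")[0]
                .strip(),
            }
        )
    return rows


# Hypothetical response fragment, shaped like the fields the answer reads.
# The CIK value is a placeholder, not real data.
sample = {
    "hits": {
        "hits": [
            {
                "_source": {
                    "root_form": "20-F",
                    "file_type": "20-F",
                    "file_date": "2016-05-18",
                    "period_ending": "2016-03-31",
                    "display_names": ["Infosys Ltd (INFY) (CIK 0000000000)"],
                }
            }
        ]
    }
}

rows = hits_to_rows(sample)
print(rows)
```

With pandas installed, pd.DataFrame(rows) should then give a table whose columns follow the dict keys.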
Answered By - MendelG
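A note on the question's URL: everything after the # is a fragment, which is handled in the browser and is not sent to the server, so a plain GET never carries the search terms. The sketch below, using only the standard library, shows that the fragment holds the same keys as the POST payload in the answer. The variable names search_url and payload are illustrative, and treating the fragment as a query string is an assumption based on how this particular URL looks.

```python
from urllib.parse import parse_qsl, urlsplit

search_url = (
    "https://www.sec.gov/edgar/search/#/q=Women&dateRange=custom"
    "&entityName=Infosys&startdt=2010-03-01&enddt=2020-03-01"
)

parts = urlsplit(search_url)

# The server only sees the path; the query string is empty.
print(parts.path, repr(parts.query))

# The search parameters sit in the fragment, after a leading "/".
payload = dict(parse_qsl(parts.fragment.lstrip("/")))
print(payload)
```

The resulting dict could be passed to the POST request in place of a hand-written payload, which avoids copying the parameters by hand when the search changes.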