Issue
I am learning Python and trying to scrape a table from https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-1-company.html. The table has 4 columns: "CIN", "Company Name", "Roc" and "Status". Because "Company Name" is a hyperlink, I want 5 columns: "CIN", "Company Name", "Company Link", "Roc" and "Status". I wrote the code below, but I still get only 4 columns, and instead of "Company Link" I get something else entirely. A screenshot of my output CSV file is linked after the code.
How can I scrape this table into the 5 columns "CIN", "Company Name", "Company Link", "Roc" and "Status"? Here is my code, followed by the screenshot of my output CSV file.
import csv
import requests
from bs4 import BeautifulSoup
import re
import html5lib

def find_between(s, first, last):
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""

loop = 1
while True:
    try:
        URL = "https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-" + str(loop) + "-company.html"
        loop = loop + 1
        r = requests.get(URL)
        soup = BeautifulSoup(r.content, 'html5lib')
        tbody = soup.find('tbody')
        rows = tbody.find_all('tr')
        row_list = list()
        for tr in rows:
            row = []
            td = tr.find_all('td')
            for a in td:
                href = a.find('a', href=True)
                if href is None:
                    row.append(a.text.strip())
                    print(a.text.strip())
                else:
                    linktext = href.__getitem__
                    row.append(linktext)
            row_list.append(row)
        with open('zaubadata.csv', 'a') as csvFile:
            writer = csv.writer(csvFile)
            for r in row_list:
                writer.writerow(r)
    except Exception as obj:
        print(obj)
        csvFile.close()
        break
Screenshot of my output CSV (only 4 columns): https://i.stack.imgur.com/oUVLK.png
Solution
The problem in the original code is the line linktext = href.__getitem__: __getitem__ is never called, so a bound-method object ends up in the row instead of the URL, and the company name itself is never appended at all. The fix is to use href['href'] (or href.get('href')) for the link and href.text for the name.
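A minimal sketch of that fix, keeping the variable names from the question (td, a, href, row), changes only the inner loop:

for a in td:
    href = a.find('a', href=True)
    if href is None:
        row.append(a.text.strip())
    else:
        # keep the company name and add the link URL as an extra column
        row.append(href.text.strip())
        row.append(href['href'])  # href.get('href') works as well

The complete script below does the same thing with CSS selectors, stops cleanly when a page has no more rows, and writes the columns "CIN", "Company Name", "Company Link", "Roc" and "Status" into data.csv: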
import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-{}-company.html'

page = 1
all_data = []
while True:
    soup = BeautifulSoup(requests.get(url.format(page)).content, 'html.parser')

    # select only data rows (the header row has no <td>)
    rows = soup.select('#table tr:has(td)')
    if not rows:
        break  # no rows on this page -> past the last page

    for tr in rows:
        all_data.append([td.get_text(strip=True) for td in tr.select('td')])
        # the company name cell contains the link; insert its URL as the third column
        all_data[-1].insert(2, tr.a['href'])
        print(all_data[-1])

    page += 1

with open('data.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(["CIN", "Company Name", "Company Link", "Roc", "Status"])
    for row in all_data:
        csv_writer.writerow(row)
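Since the listing spans a large number of pages, it can be worth being gentle with the server. One possible variation, purely a sketch (the fetch_page helper, the User-Agent value and the one-second delay are choices of this example, not requirements of the site), reuses a single requests.Session, fails fast on HTTP errors and pauses between requests:

import time
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # browser-like UA; the value is arbitrary

def fetch_page(page):
    # fetch one listing page, raise on HTTP errors, and pause briefly afterwards
    url = 'https://www.zaubacorp.com/company-list/city-DELHI/status-Active/p-{}-company.html'
    response = session.get(url.format(page), timeout=30)
    response.raise_for_status()
    time.sleep(1)  # small delay between page requests
    return response.content

In the loop above, requests.get(url.format(page)).content could then be replaced by fetch_page(page); everything else stays the same.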
Outputs data.csv (screenshot from LibreOffice omitted).
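To spot-check the result without opening a spreadsheet, one option, just a sketch using the standard library and assuming data.csv sits in the working directory, is to print the header and the first few rows:

import csv

# print the header row plus the first three data rows of data.csv
with open('data.csv', newline='') as csvfile:
    for i, row in enumerate(csv.reader(csvfile)):
        print(row)
        if i >= 3:
            break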
Answered By - Andrej Kesely