Issue
I am trying to extract information from tables in an HTML file. I can only reach the original pages through a VPN, so I have downloaded all the HTML files I need and want to parse them locally.
I specifically want the contents of several tables that share the same table class, but my attempt returns nothing. The code I tried is included below.
Here is the HTML I am trying to parse. It is quite long, so I hope that is not a problem.
<table class="region-table">
<thead>
<tr>
<th>Region</th>
<th>Type</th>
<th>From</th>
<th>To</th>
<th colspan="2">Most similar known cluster</th>
<th>Similarity</th>
</tr>
</thead>
<tbody>
<tr class="linked-row odd" data-anchor="#r1c1">
<td class="regbutton NRPS-like r1c1">
<a href="#r1c1">Region 1.1</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps-like" target="_blank">NRPS-like</a>
</td>
<td class="digits">21,469</td>
<td class="digits table-split-left">62,957</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001740/1" target="_blank">phthoxazolin</a></td>
<td>NRP + Polyketide</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 4%, #ffffff00 4%)">4%</td>
</tr>
<tr class="linked-row even" data-anchor="#r1c2">
<td class="regbutton NRPS r1c2">
<a href="#r1c2">Region 1.2</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>
</td>
<td class="digits">74,163</td>
<td class="digits table-split-left">124,963</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001709/1" target="_blank">nystatin</a></td>
<td>Polyketide</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 10%, #ffffff00 10%)">10%</td>
</tr>
</tbody>
</table>
<table class="region-table">
<thead>
<tr>
<th>Region</th>
<th>Type</th>
<th>From</th>
<th>To</th>
<th colspan="2">Most similar known cluster</th>
<th>Similarity</th>
</tr>
</thead>
<tbody>
<tr class="linked-row odd" data-anchor="#r2c1">
<td class="regbutton terpene r2c1">
<a href="#r2c1">Region 2.1</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#terpene" target="_blank">terpene</a>
</td>
<td class="digits">3,800</td>
<td class="digits table-split-left">23,263</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001580/1" target="_blank">ebelactone</a></td>
<td>Polyketide</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 5%, #ffffff00 5%)">5%</td>
</tr>
<tr class="linked-row even" data-anchor="#r2c2">
<td class="regbutton NRPS-like r2c2">
<a href="#r2c2">Region 2.2</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps-like" target="_blank">NRPS-like</a>
</td>
<td class="digits">55,320</td>
<td class="digits table-split-left">97,088</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0000727/1" target="_blank">indigoidine</a></td>
<td>Saccharide</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 17%, #ffffff00 17%)">17%</td>
</tr>
<tr class="linked-row odd" data-anchor="#r2c3">
<td class="regbutton NRPS r2c3">
<a href="#r2c3">Region 2.3</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>
</td>
<td class="digits">144,740</td>
<td class="digits table-split-left">193,599</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0000368/1" target="_blank">streptobactin</a></td>
<td>NRP</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(210, 105, 30, 0.3), rgba(210, 105, 30, 0.3) 70%, #ffffff00 70%)">70%</td>
</tr>
<tr class="linked-row even" data-anchor="#r2c4">
<td class="regbutton siderophore r2c4">
<a href="#r2c4">Region 2.4</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#siderophore" target="_blank">siderophore</a>
</td>
<td class="digits">347,862</td>
<td class="digits table-split-left">362,833</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001593/1" target="_blank">ficellomycin</a></td>
<td>NRP</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 3%, #ffffff00 3%)">3%</td>
</tr>
<tr class="linked-row odd" data-anchor="#r2c5">
<td class="regbutton lassopeptide r2c5">
<a href="#r2c5">Region 2.5</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#lassopeptide" target="_blank">lassopeptide</a>
</td>
<td class="digits">548,017</td>
<td class="digits table-split-left">570,561</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001435/1" target="_blank">ikarugamycin</a></td>
<td>NRP + Polyketide:Iterative type I</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 12%, #ffffff00 12%)">12%</td>
</tr>
<tr class="linked-row even" data-anchor="#r2c6">
<td class="regbutton NRPS r2c6">
<a href="#r2c6">Region 2.6</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>
</td>
<td class="digits">628,834</td>
<td class="digits table-split-left">683,050</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001117/1" target="_blank">himastatin</a></td>
<td>NRP</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 12%, #ffffff00 12%)">12%</td>
</tr>
<tr class="linked-row odd" data-anchor="#r2c7">
<td class="regbutton NRPS,terpene hybrid r2c7">
<a href="#r2c7">Region 2.7</a>
</td>
<td>
<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>,<a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#terpene" target="_blank">terpene</a>
</td>
<td class="digits">1,043,511</td>
<td class="digits table-split-left">1,104,786</td>
<td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0002024/1" target="_blank">nargenicin</a></td>
<td>Polyketide</td>
<td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 11%, #ffffff00 11%)">11%</td>
</tr>
</tbody>
</table>
Code Snippet
from bs4 import BeautifulSoup

# Read the downloaded page from disk (the file name is an example)
with open("test.html", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "lxml")
# find() returns only the first matching table
gdp_table = soup.find("table", attrs={"class": "region-table"})
gdp_table_data = gdp_table.tbody.find_all("tr")  # rows of that first table
print("Extracted {num} rows".format(num=len(gdp_table_data)))
print(gdp_table_data[0])  # print first row
print(gdp_table_data[1])  # print second row
Ideally, I would like to read in the HTML file, extract the data from all of the tables, merge it into one big table, and write that out as a CSV file.
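For reference, BeautifulSoup's `find` returns only the first matching element, while `find_all` returns every match. Below is a minimal sketch of collecting the rows of all `region-table` tables that way, assuming the `bs4` package is installed. `SAMPLE_HTML`, `extract_rows`, and `rows_to_csv` are illustrative names that do not appear in the original post, and the sample markup is a cut-down stand-in for the real page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Tiny stand-in for the downloaded page; a real run would read the file instead.
SAMPLE_HTML = """
<table class="region-table"><tbody>
<tr><td><a href="#r1c1">Region 1.1</a></td><td class="digits">21,469</td></tr>
</tbody></table>
<table class="region-table"><tbody>
<tr><td><a href="#r2c1">Region 2.1</a></td><td class="digits">3,800</td></tr>
<tr><td><a href="#r2c2">Region 2.2</a></td><td class="digits">55,320</td></tr>
</tbody></table>
"""


def extract_rows(html):
    """Return one list of cell texts per <tr> across every region-table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="region-table"):
        body = table.find("tbody") or table  # tolerate a missing <tbody>
        for tr in body.find_all("tr"):
            cells = tr.find_all("td")
            if cells:  # skip header rows, which only hold <th> cells
                rows.append([td.get_text(strip=True) for td in cells])
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text; csv quotes cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

Because the rows from every table end up in one list, the merged output needs no extra step. Writing `rows_to_csv(rows)` to a file would give the single CSV asked for.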
Solution
Read the HTML from the file and write the rows of all the tables to one CSV file.
import csv
from simplified_scrapy import SimplifiedDoc, utils

name = 'test.html'
html = utils.getFileContent(name)  # Get data from file
doc = SimplifiedDoc(html)
rows = []
# Select every table with the class "region-table"
tables = doc.selects('table.region-table')
for table in tables:
    trs = table.tbody.trs
    for tr in trs:
        # One CSV row per <tr>, one cell per <td>
        rows.append([td.text for td in tr.tds])
# newline='' stops the csv module from writing blank lines on Windows
with open(name + '.csv', 'w', encoding='utf-8', newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows(rows)
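The rows written this way carry no header line. Each header in the sample markup has a "Most similar known cluster" cell with `colspan="2"`, so it holds six `<th>` cells against seven `<td>` cells per row. Below is a rough sketch of building an aligned header with the standard library's `html.parser` rather than simplified_scrapy. `HeaderParser` and `SAMPLE_THEAD` are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser

# Header markup copied from the sample tables in the question.
SAMPLE_THEAD = """
<thead><tr>
<th>Region</th><th>Type</th><th>From</th><th>To</th>
<th colspan="2">Most similar known cluster</th><th>Similarity</th>
</tr></thead>
"""


class HeaderParser(HTMLParser):
    """Collect <th> texts, repeating a label once per spanned column."""

    def __init__(self):
        super().__init__()
        self.header = []
        self._span = 0   # colspan of the <th> currently open, 0 if none
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._span = int(dict(attrs).get("colspan") or 1)
            self._text = []

    def handle_data(self, data):
        if self._span:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "th" and self._span:
            label = "".join(self._text).strip()
            self.header.extend([label] * self._span)
            self._span = 0


parser = HeaderParser()
parser.feed(SAMPLE_THEAD)
parser.close()
header = parser.header
print(header)
```

The resulting list has seven entries and could be passed to `csv_writer.writerow(header)` before the data rows are written.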
If you want to keep one file per table:
doc = SimplifiedDoc(html)
tables = doc.selects('table.region-table')
# Number the output files starting from 1
for i, table in enumerate(tables, start=1):
    rows = []
    trs = table.tbody.trs
    for tr in trs:
        rows.append([td.text for td in tr.tds])
    # Write this table's rows to its own file, e.g. test.html1.csv
    with open(name + str(i) + '.csv', 'w', encoding='utf-8', newline='') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerows(rows)
The original version, with the HTML pasted inline, is kept here for comparison.
import csv
from simplified_scrapy import SimplifiedDoc

html = ''''''  # Your HTML
doc = SimplifiedDoc(html)
rows = []
tables = doc.selects('table.region-table')
for table in tables:
    trs = table.tbody.trs
    for tr in trs:
        rows.append([td.text for td in tr.tds])
# If you have '>Region.*?</a>' in each row, you can get all the rows directly in the following way
# trs = doc.getElementsByReg('>Region.*?</a>',tag='tr')
# for tr in trs:
#     rows.append([td.text for td in tr.tds])
with open('test.csv', 'w', encoding='utf-8', newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows(rows)
Result:
Region 1.1,NRPS-like,"21,469","62,957",phthoxazolin,NRP + Polyketide,4%
Region 1.2,NRPS,"74,163","124,963",nystatin,Polyketide,10%
Region 2.1,terpene,"3,800","23,263",ebelactone,Polyketide,5%
Region 2.2,NRPS-like,"55,320","97,088",indigoidine,Saccharide,17%
Region 2.3,NRPS,"144,740","193,599",streptobactin,NRP,70%
Region 2.4,siderophore,"347,862","362,833",ficellomycin,NRP,3%
Region 2.5,lassopeptide,"548,017","570,561",ikarugamycin,NRP + Polyketide:Iterative type I,12%
Region 2.6,NRPS,"628,834","683,050",himastatin,NRP,12%
Region 2.7,"NRPS,terpene","1,043,511","1,104,786",nargenicin,Polyketide,11%
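The From, To, and Similarity columns come out as text with thousands separators and a percent sign. If numeric values are needed later, they could be converted when the CSV is read back, as in the sketch below. It uses only the standard library; `parse_row` and the dictionary keys are illustrative choices, not part of the answer.

```python
import csv
import io

# Two lines copied from the result above, standing in for the CSV file.
SAMPLE_CSV = (
    'Region 1.1,NRPS-like,"21,469","62,957",phthoxazolin,NRP + Polyketide,4%\n'
    'Region 2.7,"NRPS,terpene","1,043,511","1,104,786",nargenicin,Polyketide,11%\n'
)


def parse_row(row):
    """Turn one CSV row into a dict with numeric start, end and similarity."""
    region, kind, start, end, cluster, cluster_type, similarity = row
    return {
        "region": region,
        "type": kind,
        "from": int(start.replace(",", "")),   # "21,469" -> 21469
        "to": int(end.replace(",", "")),
        "cluster": cluster,
        "cluster_type": cluster_type,
        "similarity_pct": int(similarity.rstrip("%")),  # "4%" -> 4
    }


records = [parse_row(row) for row in csv.reader(io.StringIO(SAMPLE_CSV))]
for record in records:
    # Print each region with the length of its span
    print(record["region"], record["to"] - record["from"])
```

The `csv` reader handles the quoted cells, so a value such as `NRPS,terpene` stays in one column.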
Answered By - dabingsou