Issue
I am trying to scrape the 'File UUID' column of the dynamic table on the HMP website using Python (Beautiful Soup and Selenium). I can pull every other column from the table, but the one I need does not show up. The Python code I am running is below. Please let me know what the issue might be, or whether there is a better approach to getting this data.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import numpy as np
import pandas as pd
# establishing connection to hmp main website and parsing hmp data table information
url = 'https://portal.hmpdacc.org/query/f?query=file.matrix_type%20in%20%5B%22wgs_community%22,%2216s_community%22%5D%20and%20sample.body_site%20in%20%5B%22feces%22%5D&filters=%7B%22op%22:%22and%22,%22content%22:%5B%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22file.matrix_type%22,%22value%22:%5B%22wgs_community%22,%2216s_community%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22sample.body_site%22,%22value%22:%5B%22feces%22%5D%7D%7D%5D%7D#:~:text=Samples%20(3%2C452)-,Files%20(5%2C181),-files'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(3)
html = browser.page_source
hmp_parsed_page = BeautifulSoup(html, "lxml")
hmp_files_table = hmp_parsed_page.find('table', id='files-table')
# gathering hmp meta datatable column headers
hmp_metadata_fields = []
for th in hmp_files_table.find_all('th'):
    col_header = th.text
    hmp_metadata_fields.append(col_header)
# creating dataframe of hmp information scraped
hmp_metadata_df = pd.DataFrame(columns = hmp_metadata_fields)
# appending hmp row data to dataframe
for tr in hmp_files_table.find_all('tr')[1:]:
    row_data = tr.find_all('td')
    row = [data_point.text for data_point in row_data]
    hmp_metadata_df.loc[len(hmp_metadata_df.index)] = row
# dropping unneeded columns
hmp_metadata_df = hmp_metadata_df.drop(hmp_metadata_df.columns[[0,1]], axis = 1)
# adding hmp indicator to front of dataframe
hmp_metadata_df['Data Source'] = 'HMP'
print(hmp_metadata_df)
# closing hmp website connection
browser.close()
browser.quit()
I have tried several different ways of scraping this table from HMP with no luck. I expect the output to be a table with all the columns and rows shown on the website, but the 'File UUID' column does not appear. When I inspect the table elements in the browser, 'File UUID' is there under 'files-table':
<th title="File UUID" ng-repeat="h in tsc.headings" ng-class="{
'sortable': h['sortable'],
'sort-asc': tsc.tableParams.sorting()[h['sortable']]=='asc',
'sort-desc': tsc.tableParams.sorting()[h['sortable']]=='desc'
}" ng-click="tsc.sortByCol(h, $event)" ng-if="h.show" class="header ng-scope sortable" role="button" tabindex="0" style=""><div class="ng-table-header " ng-class="{'sort-indicator': tsc.tableParams.defaultSettings.sortingIndicator == 'div'}"><span data-cell="tsc.getHeaderCell(h)" data-data="data" data-paging="paging" ng-class="{'sort-indicator': tsc.tableParams.defaultSettings.sortingIndicator == 'span'}" class="ng-isolate-scope sort-indicator">File UUID</span></div></th>
Solution
You can query their Ajax API directly to obtain the data (I believe the UUID is in the id column):
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://portal.hmpdacc.org/api/files"
params = {
"fields": "file_format,file_type,file_annotation_pipeline,file_matrix_type",
"filters": '{"op":"and","content":[{"op":"in","content":{"field":"file.matrix_type","value":["wgs_community","16s_community"]}},{"op":"in","content":{"field":"sample.body_site","value":["feces"]}}]}',
"from": 0,
"save": "",
"size": "20",
"sort": "file_id:asc",
}
all_dfs = []
for params['from'] in range(0, 40, 20):  # <--- increase the range for next pages
    data = requests.get(url, params=params).json()
    all_dfs.append(pd.DataFrame([h['file'] for h in data['data']['hits']]))
df = pd.concat(all_dfs).reset_index(drop=True)
print(df.tail())
Prints:
format_doc study ver organism_type format data_modality node_type size subtype fasp data_type matrix_type abundance_type https id md5 file_name access comment
35 http://biom-format.org/ prediabetes NaN bacterial Biological Observation Matrix marker sequence abundance_matrix 196000 16s_community fasp://aspera.ihmpdcc.org/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J45372_1_ST_T0_B0_0120_ZY39SN0-02_APB4D.biom abundance 16s_community community https://downloads.hmpdacc.org/ihmp/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J45372_1_ST_T0_B0_0120_ZY39SN0-02_APB4D.biom 76612bd9a41885add4f6b0b7683a65da 70600351056001048c1d42d7268cc6b7 https://downloads.hmpdacc.org/ihmp/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J45372_1_ST_T0_B0_0120_ZY39SN0-02_APB4D.biom open Qiime output upload from DCC for HMP2_J45372_1_ST_T0_B0_0120_ZY39SN0-02_APB4D.clean.dehost.fastq.gz
36 http://biom-format.org/ prediabetes NaN bacterial Biological Observation Matrix marker sequence abundance_matrix 196000 16s_community fasp://aspera.ihmpdcc.org/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J45281_1_ST_T0_B0_0120_ZRB0F6P-6021_APATM.biom abundance 16s_community community https://downloads.hmpdacc.org/ihmp/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J45281_1_ST_T0_B0_0120_ZRB0F6P-6021_APATM.biom 76612bd9a41885add4f6b0b76836df9b 39643700bd4bcf040064c12f1d2b644c https://downloads.hmpdacc.org/ihmp/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J45281_1_ST_T0_B0_0120_ZRB0F6P-6021_APATM.biom open Qiime output upload from DCC for HMP2_J45281_1_ST_T0_B0_0120_ZRB0F6P-6021_APATM.clean.dehost.fastq.gz
37 http://biom-format.org/ prediabetes NaN bacterial Biological Observation Matrix marker sequence abundance_matrix 81000 16s_community fasp://aspera.ihmpdcc.org/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J04182_1_ST_T0_B0_0122_ZN0JE53-04_AAH7B.biom abundance 16s_community community https://downloads.hmpdacc.org/ihmp/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J04182_1_ST_T0_B0_0122_ZN0JE53-04_AAH7B.biom 6cca313bce90a4392c3d5cf23fdb7ca8 7a33c9809cb98fac4e89aa2d3c151597 https://downloads.hmpdacc.org/ihmp/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J04182_1_ST_T0_B0_0122_ZN0JE53-04_AAH7B.biom open Qiime output upload from DCC for HMP2_J04182_1_ST_T0_B0_0122_ZN0JE53-04_AAH7B.clean.dehost.fastq.gz
38 http://biom-format.org/ prediabetes NaN bacterial Biological Observation Matrix marker sequence abundance_matrix 204000 16s_community fasp://aspera.ihmpdcc.org/t2d/genome/microbiome/16s/analysis/hmqcp/otu_table.biom abundance 16s_community community https://downloads.hmpdacc.org/ihmp/t2d/genome/microbiome/16s/analysis/hmqcp/otu_table.biom 76612bd9a41885add4f6b0b7681567ac 7a33c9809cb98fac4e89aa2d3c151597 https://downloads.hmpdacc.org/ihmp/t2d/genome/microbiome/16s/analysis/hmqcp/otu_table.biom open Qiime output upload from DCC for HMP2_J04182_1_ST_T0_B0_0122_ZN0JE53-04_AAH7B.clean.dehost.fastq.gz
39 http://biom-format.org/ prediabetes NaN bacterial Biological Observation Matrix marker sequence abundance_matrix 120000 16s_community fasp://aspera.ihmpdcc.org/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J00840_1_ST_T0_B0_0120_ZLZNCLZ-01_AA31J.biom abundance 16s_community community https://downloads.hmpdacc.org/ihmp/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J00840_1_ST_T0_B0_0120_ZLZNCLZ-01_AA31J.biom 6cca313bce90a4392c3d5cf23fdafbcc 9757b64815cbfee3ba188e80b69a023e https://downloads.hmpdacc.org/ihmp/t2d/genome/microbiome/16s/analysis/hmqcp/HMP2_J00840_1_ST_T0_B0_0120_ZLZNCLZ-01_AA31J.biom open Qiime output upload from DCC for HMP2_J00840_1_ST_T0_B0_0120_ZLZNCLZ-01_AA31J.clean.dehost.fastq.gz
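Rather than widening the range by hand, the loop above could be generalized to keep requesting pages until the API returns a short or empty page. The following is only a sketch under assumptions: `collect_file_records` and `make_http_fetcher` are hypothetical helper names, the payload is assumed to keep the `data -> hits -> file` shape used above, and the stop condition assumes the API returns fewer than `size` hits on the last page, which has not been verified against the portal.

```python
import pandas as pd


def collect_file_records(fetch_page, page_size=20, max_pages=None):
    """Collect 'file' records page by page into one DataFrame.

    fetch_page(offset, size) is a caller-supplied function (hypothetical name)
    returning the decoded JSON payload, assumed to look like
    {"data": {"hits": [{"file": {...}}, ...]}}.
    """
    frames = []
    offset = 0
    pages = 0
    while max_pages is None or pages < max_pages:
        hits = fetch_page(offset, page_size)["data"]["hits"]
        if not hits:
            break  # nothing left to read
        frames.append(pd.DataFrame([hit["file"] for hit in hits]))
        pages += 1
        if len(hits) < page_size:
            break  # assumed: a short page means this was the last one
        offset += page_size
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames).reset_index(drop=True)


def make_http_fetcher(url, base_params):
    """Build a fetch_page function that calls the API with requests."""
    import requests  # imported lazily so collect_file_records works without it

    def fetch_page(offset, size):
        # Override only the paging parameters; keep fields/filters/sort as given.
        params = {**base_params, "from": offset, "size": size}
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        return response.json()

    return fetch_page
```

With the `url` and `params` defined above, `df = collect_file_records(make_http_fetcher(url, params))` would then gather every page. If the `id` column does hold the File UUID, as assumed in the answer, `df["id"]` gives the values the question asks for.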
Answered By - Andrej Kesely