Issue
I want to scrape the information from the table on this page, which spans many pages.
I wrote the following code:
import pandas as pd

url = 'https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value='
pep_table = pd.read_html(url)
But the output was this:
pep_table
[Empty DataFrame
Columns: [ID, Name, N terminus, Sequence, C terminus, View]
Index: []]
I also tried to get it through the Selenium webdriver:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chromedriver = '/usr/local/bin/chromedriver'
driver = webdriver.Chrome(chromedriver)
driver.get(url)
table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#DataTables_Table_0_info")))
tableRows = table.get_attribute("outerHTML")
df = pd.read_html(tableRows)[0]
But it raises a Selenium TimeoutException:
File "/home/es/anaconda3/envs/pyg-env/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
- Am I using the wrong selector?
- This page shows search results; do I need to add more selectors?
- How can I solve this issue?
Solution
Your table locator was wrong: #DataTables_Table_0_info is the "Showing 1 to 30 of 142 entries" summary element that DataTables renders below the table, not the table itself, so the wait times out. The table's id is DataTables_Table_0; I have corrected that below. (Your pd.read_html attempt came back empty because the rows are filled in by JavaScript, so the static HTML contains only the header row.)
The easiest way to cover every page, without clicking the pagination buttons, is to navigate directly to each page's URL. The site accepts limit and offset query parameters, so you only have to change the offset value (0, 30, 60, ...):
url="https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value=&limit=30&offset={}"
Create an empty DataFrame before the loop and concat each page's table onto it.
Use time.sleep() between page loads; otherwise the loop moves on before the page has rendered and you will not capture all pages.
Code:
url="https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value=&limit=30&offset={}"
counter=0
df=pd.DataFrame()
while counter <150:
driver.get(url.format(counter))
time.sleep(2)
table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#DataTables_Table_0")))
tableRows = table.get_attribute("outerHTML")
df1 = pd.read_html(tableRows)[0]
df = pd.concat([df,df1], ignore_index=True)
counter=counter+30
print(df)
Output:
ID Name N terminus Sequence C terminus View
0 1688 Gramicidin S, GS NaN VXLfPVXLfP NaN View
1 3314 Gratisin, GR NaN VXLfPyVXLfPy NaN View
2 3316 Tyrocidine A, TA NaN fPFfNQYVXL NaN View
3 4876 Trichogin GA IV C8 XGLXGGLXGIX NaN View
4 5374 Baceridin NaN WaXVlL NaN View
.. ... ... ... ... ... ...
137 19210 Burkholdine-1215 NaN xxGNSXXs NaN View
138 19212 Burkholdine-1213 NaN xnGNSNXs NaN View
139 19548 Hirsutatin A NaN XTSXXF NaN View
140 19549 Hirsutatin B NaN XTSXXX NaN View
141 19554 Hirsutellide NaN XxIXxI NaN View
[142 rows x 6 columns]
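If you would rather not hard-code the 150 cut-off, you can keep paging until a page returns fewer than limit rows. This is a minimal sketch of that idea, reusing the imports, driver, and url from the code above; the CSV filename is just an example:
offset = 0
limit = 30
frames = []
while True:
    driver.get(url.format(offset))
    time.sleep(2)  # give the JavaScript time to render the rows
    table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#DataTables_Table_0")))
    page = pd.read_html(table.get_attribute("outerHTML"))[0]
    frames.append(page)
    if len(page) < limit:  # a short page means we reached the last one
        break
    offset += limit
df = pd.concat(frames, ignore_index=True)
df.to_csv('dbaasp_nonribosomal.csv', index=False)  # example filename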
Answered By - KunduK