Issue
I want to scrape the information from the table on this page, which spans many pages.
I wrote the following code:
import pandas as pd

url = 'https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value='
pep_table = pd.read_html(url)
But the output was this:
pep_table
[Empty DataFrame
Columns: [ID, Name, N terminus, Sequence, C terminus, View]
Index: []]
I also tried to get it through the Selenium webdriver:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chromedriver = '/usr/local/bin/chromedriver'
driver = webdriver.Chrome(chromedriver)
driver.get(url)
table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#DataTables_Table_0_info")))
tableRows = table.get_attribute("outerHTML")
df = pd.read_html(tableRows)[0]
But it raises a Selenium TimeoutException:
File "/home/es/anaconda3/envs/pyg-env/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
- Am I using the wrong selector?
- This page shows search results; do I need to add more selectors?
- How can I solve this issue?
Solution
Your table locator was wrong: #DataTables_Table_0_info is the "Showing 1 to 30 of 142 entries" summary element that DataTables renders below the table, not the table itself, so the wait times out. The table's id is DataTables_Table_0; I have corrected that below. (Your pd.read_html attempt came back empty because the rows are filled in by JavaScript, so the static HTML contains only the header row.)
The easiest way to cover every page, without clicking the pagination buttons, is to navigate directly to each page's URL. The site accepts limit and offset query parameters, so you only have to change the offset value (0, 30, 60, ...):
url="https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value=&limit=30&offset={}"
Create an empty DataFrame before the loop and concat each page's table onto it.
Use time.sleep() between page loads; otherwise the loop moves on before the page has rendered and you will not capture all pages.
Code:
url="https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value=&limit=30&offset={}"
counter=0
df=pd.DataFrame()
while counter <150:
driver.get(url.format(counter))
time.sleep(2)
table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#DataTables_Table_0")))
tableRows = table.get_attribute("outerHTML")
df1 = pd.read_html(tableRows)[0]
df = pd.concat([df,df1], ignore_index=True)
counter=counter+30
print(df)
Output:
ID Name N terminus Sequence C terminus View
0 1688 Gramicidin S, GS NaN VXLfPVXLfP NaN View
1 3314 Gratisin, GR NaN VXLfPyVXLfPy NaN View
2 3316 Tyrocidine A, TA NaN fPFfNQYVXL NaN View
3 4876 Trichogin GA IV C8 XGLXGGLXGIX NaN View
4 5374 Baceridin NaN WaXVlL NaN View
.. ... ... ... ... ... ...
137 19210 Burkholdine-1215 NaN xxGNSXXs NaN View
138 19212 Burkholdine-1213 NaN xnGNSNXs NaN View
139 19548 Hirsutatin A NaN XTSXXF NaN View
140 19549 Hirsutatin B NaN XTSXXX NaN View
141 19554 Hirsutellide NaN XxIXxI NaN View
[142 rows x 6 columns]
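If you would rather not hard-code the 150 cut-off, you can keep paging until a page returns fewer than limit rows. This is a minimal sketch of that idea, reusing the imports, driver, and url from the code above; the CSV filename is just an example:
offset = 0
limit = 30
frames = []
while True:
    driver.get(url.format(offset))
    time.sleep(2)  # give the JavaScript time to render the rows
    table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#DataTables_Table_0")))
    page = pd.read_html(table.get_attribute("outerHTML"))[0]
    frames.append(page)
    if len(page) < limit:  # a short page means we reached the last one
        break
    offset += limit
df = pd.concat(frames, ignore_index=True)
df.to_csv('dbaasp_nonribosomal.csv', index=False)  # example filename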
Answered By - KunduK