Friday, November 11, 2022

[FIXED] Pagination with BeautifulSoup

beautifulsoup, pagination, python

Issue

I am trying to scrape some data from the following website: https://www.drugbank.ca/drugs

For every drug in the table, I need to follow the link to its detail page and collect the name and some other specific features, such as categories and structured indications (click on a drug name to see the features I will use).

I wrote the following code, but I can't make it handle pagination (as you can see, there are more than 2000 pages!).

import requests
from bs4 import BeautifulSoup


def drug_data():
    url = 'https://www.drugbank.ca/drugs/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    for link in soup.select('name-head a'):
        href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        pages_data(href)


def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text, "lxml")
    g_data = soup.select('div.content-container')

    for item in g_data:
        print(item.contents[1].text)
        print(item.contents[3].findAll('td')[1].text)
        try:
            print(item.contents[5].findAll('td', {'class': 'col-md-2 col-sm-4'})[0].text)
        except:
            pass
        print(item_url)


drug_data()

How can I scrape all of the data and handle pagination properly?

Solution

All listing pages share almost the same URL and differ only in the page parameter, so you can generate the URLs in a for loop:

def drug_data(page_number):
    url = 'https://www.drugbank.ca/drugs/?page=' + str(page_number)
    #... rest ...

# --- later ---

for x in range(1, 2001):
    drug_data(x)
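
As a minimal sketch of the same idea, the page URLs can also be generated up front and inspected before any request is sent. The helper name page_urls is hypothetical (it is not part of the original answer), and the base URL is the one from the question.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.drugbank.ca/drugs/'  # listing URL from the question


def page_urls(first_page, last_page, base_url=BASE_URL):
    """Yield one listing URL per page number, first_page..last_page inclusive.

    Hypothetical helper, for illustration only.
    """
    for page_number in range(first_page, last_page + 1):
        # urlencode builds the query string, e.g. "page=3"
        yield base_url + '?' + urlencode({'page': page_number})


urls = list(page_urls(1, 3))
print(urls)
```

Each generated URL can then be passed to requests.get() inside drug_data().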

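Alternatively, use a while loop with try/except when you don't know the number of pages or there are more than 2000. Note that this stops only if drug_data() raises an exception for a page that doesn't exist; requests.get() alone does not raise on an HTTP error status such as 404, so call r.raise_for_status() or check the result yourself.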

def drug_data(page_number):
    url = 'https://www.drugbank.ca/drugs/?page=' + str(page_number)
    #... rest ...

# --- later ---

page = 0

while True:
    try:
        page += 1
        drug_data(page)
    except Exception as ex:
        print(ex)
        print("probably last page:", page)
        break # exit `while` loop
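
The loop above can be sketched in a form that runs without network access. Here scrape_all_pages, fake_fetch and the max_pages cap are assumptions added for illustration; fake_fetch stands in for drug_data() and pretends the site has only 3 pages.

```python
def scrape_all_pages(fetch_page, first_page=1, max_pages=10000):
    """Call fetch_page(n) for n = first_page, first_page + 1, ... until it raises.

    Returns the number of the last page fetched successfully.
    """
    page = first_page - 1
    while page < max_pages:      # hard cap so the loop cannot run forever
        try:
            fetch_page(page + 1)
        except Exception as ex:  # e.g. an HTTPError raised by r.raise_for_status()
            print('stopping at page', page + 1, '-', ex)
            break
        page += 1
    return page


# Stand-in for drug_data(): pretends the site has only 3 pages.
def fake_fetch(page_number):
    if page_number > 3:
        raise ValueError('no such page')


last_page = scrape_all_pages(fake_fetch)
print('last page:', last_page)
```

The cap is a design choice: if the site answers every page number without an error, the loop still terminates.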

You can also find the URL of the next page in the HTML:

<a rel="next" class="page-link" href="/drugs?approved=1&amp;c=name&amp;d=up&amp;page=2">›</a>

so you can use BeautifulSoup to extract this link and follow it.

The code below prints the current URL, finds the link to the next page (the <a> tag with class="page-link" and rel="next"), and loads it, repeating until no such link is found:

import requests
from bs4 import BeautifulSoup

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text ,"lxml")
        
        #data = soup.select('name-head a')
        #for link in data:
        #    href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        #    pages_data(href)

        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break
        
drug_data()
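
The code above builds the next URL by string concatenation, which works because the href starts with "/". As a hedged alternative, urljoin from the standard library resolves the href against the current URL and also handles relative or already absolute links. The href value below is copied from the HTML snippet shown earlier; the variable names are illustrative.

```python
from urllib.parse import urljoin

current_url = 'https://www.drugbank.ca/drugs/'
# href of the rel="next" link, after the HTML parser has decoded &amp; to &
next_href = '/drugs?approved=1&c=name&d=up&page=2'

# Resolve the (possibly relative) href against the page it was found on.
next_url = urljoin(current_url, next_href)
print(next_url)
```

In the loop, this would replace the line that prepends 'https://www.drugbank.ca' to the href.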

BTW: never use except: pass, because it hides errors you didn't expect and you won't know why the code doesn't work. It is better to display the error:

except Exception as ex:
    print('Error:', ex)
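
A step further, sketched here as an assumption rather than as part of the original answer, is to catch only the exception you expect. In the question's code the likely failure is an empty result list, which raises IndexError on [0]. The helper name first_or_none is hypothetical.

```python
def first_or_none(items):
    """Return the first element, or None when the list is empty."""
    try:
        return items[0]
    except IndexError as ex:
        # Only the expected failure is handled, and it is reported, not hidden.
        print('Error:', ex)
        return None


print(first_or_none(['Lepirudin']))  # first element is returned
print(first_or_none([]))             # empty list: error is printed, None returned
```

Any other exception, such as a TypeError from passing None, still propagates, so unexpected bugs stay visible.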

Answered By - furas

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
