Issue
At the top of this site are 17 ID tags:
1. Boxed warning
2. Indications
3. Dosage/Administration
4. Dosage forms
5. Contraindications
6. Warnings/Precautions
7. Adverse reactions
8. Drug interactions
9. Specific populations
10. Overdosage
11. Description
12. Clinical pharmacology
13. Nonclinical toxicology
14. Clinical studies
15. How supplied
16. Patient counseling
17. Medication guide
I want to scrape the page and make a dictionary with those tags as the keys. How can I do this? Here's what I've tried so far:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

urls = "https://www.drugs.com/pro/abacavir-lamivudine-and-zidovudine-tablets.html"
response = requests.get(urls)
soup = BeautifulSoup(response.text, 'html.parser')
data3 = soup.findAll('h2')
out = {}
y1 = []
y2 = []
for header in data3:
    x0 = header.get('id')
    y1.append(x0)
    nextNode = header
    while True:
        nextNode = nextNode.nextSibling
        if nextNode is None:
            break
        if isinstance(nextNode, NavigableString):
            x1 = nextNode.strip()
        if isinstance(nextNode, Tag):
            if nextNode.name == "h2":
                break
            x2 = nextNode.get_text(strip=True).strip()
            x3 = x1 + " " + x2
            y2.append(x3)
print(y1, y2)
Here's the output I'm getting:
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None] [content]
Desired Output: ['boxed warning', 'indications', 'dosage/administration', 'dosage forms', 'contraindications', 'warnings/precautions', 'adverse reactions', 'drug interactions', 'specific populations', 'overdosage', 'description', 'clinical pharmacology', 'nonclinical toxicology', 'clinical studies', 'how supplied', 'patient counseling', 'medication guide'] ['content present under boxed warning', 'content present under indications']
How can I get a dictionary or list that replaces all the Nones with the list of tags? I'm struggling to work with the structure of the webpage. Thank you!
Solution
I'm not 100% sure what you need, but based on the comments I think this is what you are looking for. You can easily add the output to a list or a dictionary.
import requests
from bs4 import BeautifulSoup

urls = "https://www.drugs.com/pro/abacavir-lamivudine-and-zidovudine-tablets.html"
response = requests.get(urls)
soup = BeautifulSoup(response.text, 'html.parser')

tags = soup.find('div', {'class': 'ddc-anchor-links'})
available_information = []
for tag in tags.find_all('a'):
    available_information.append(tag.text)
print(available_information)
# output
['Boxed Warning', 'Indications and Usage', 'Dosage and Administration', 'Dosage Forms and Strengths', 'Contraindications', 'Warnings and Precautions', 'Adverse Reactions/Side Effects', 'Drug Interactions', 'Use In Specific Populations', 'Overdosage', 'Description', 'Clinical Pharmacology', 'Nonclinical Toxicology', 'Clinical Studies', 'How Supplied/Storage and Handling', 'Patient Counseling Information', 'Medication Guide']
You can obtain the content for each table-of-contents (TOC) entry using this code:
anchor_tags = []
soup = BeautifulSoup(response.text, 'html.parser')
tags = soup.find('div', {'class': 'ddc-toc-content'})
for tag in tags.find_all('a'):
    anchor_tag = str(tag['href']).replace('#', '')
    anchor_tags.append(anchor_tag)

for tag in anchor_tags:
    anchor_tag = soup.find("a", {"id": tag})
    header_tag = anchor_tag.find_next_sibling('h2')
    # Now you need to decide how you want to store the information being extracted.
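Since the question asked for a dictionary keyed by the section tags, here is one way to store the extracted sections. This is a minimal sketch that runs against a small local HTML fragment mimicking the drugs.com layout (an `<a id>` anchor immediately before each `<h2>` heading — the structure the anchor-lookup code above assumes); the fragment's section text is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like the drugs.com markup:
# a TOC div, then an <a id> anchor before each <h2> section heading.
html = """
<div class="ddc-toc-content">
  <a href="#boxed-warning">Boxed Warning</a>
  <a href="#indications">Indications and Usage</a>
</div>
<a id="boxed-warning"></a>
<h2>Boxed Warning</h2>
<p>Risk of hypersensitivity reactions.</p>
<a id="indications"></a>
<h2>Indications and Usage</h2>
<p>Indicated for the treatment of HIV-1 infection.</p>
"""

soup = BeautifulSoup(html, 'html.parser')
sections = {}
toc = soup.find('div', {'class': 'ddc-toc-content'})
for link in toc.find_all('a'):
    anchor_id = link['href'].lstrip('#')
    anchor = soup.find('a', {'id': anchor_id})
    header = anchor.find_next_sibling('h2')
    # Collect sibling tags until the next section anchor or heading.
    parts = []
    node = header.find_next_sibling()
    while node is not None and node.name not in ('a', 'h2'):
        parts.append(node.get_text(strip=True))
        node = node.find_next_sibling()
    sections[header.get_text(strip=True)] = ' '.join(parts)

print(sections)
```

On the real page you would replace `html` with `response.text`; the stopping condition (`'a'` or `'h2'`) may need adjusting if the live markup nests sections differently.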
Based on our chat conversation, you can query multiple pages that have different structures this way. You will have to modify search_terms and known_tags as you scrape more pages with different structures.
import requests
from bs4 import BeautifulSoup

def get_soup(target_url):
    response = requests.get(target_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def obtain_toc_content(soup):
    available_information = []
    anchor_tags = []
    known_tags = ['div', 'ul']
    search_terms = ['ddc-toc-content', 'ddc-anchor-links']
    for tag, search_string in zip(known_tags, search_terms):
        tag_found = bool(soup.find(tag, {'class': search_string}))
        if tag_found:
            toc = soup.find(tag, {'class': search_string})
            for toc_tag in toc.find_all('a'):
                available_information.append(toc_tag.text)
                anchor_tag = str(toc_tag['href'])
                anchor_tags.append(anchor_tag)
    return available_information, anchor_tags

urls = ['https://www.drugs.com/pro/abacavir-lamivudine-and-zidovudine-tablets.html',
        'https://www.drugs.com/ajovy.html',
        'https://www.drugs.com/cons/a-b-otic.html']

for url in urls:
    make_soup = get_soup(url)
    results = obtain_toc_content(make_soup)
    table_of_content = results[0]
    toc_tags = results[1]
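To keep the results from several pages together, the two lists returned by obtain_toc_content can be zipped into a dictionary keyed by section title. A short sketch with placeholder data standing in for one page's results:

```python
# Placeholder data standing in for one page's (titles, anchors) result
# from obtain_toc_content.
available_information = ['Boxed Warning', 'Indications and Usage']
anchor_tags = ['#boxed-warning', '#indications']

# Map each section title to its anchor href for later lookup.
toc_map = dict(zip(available_information, anchor_tags))
print(toc_map)
# {'Boxed Warning': '#boxed-warning', 'Indications and Usage': '#indications'}
```

In the loop above you could collect these per-URL, e.g. `pages[url] = dict(zip(*results))`.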
Answered By - Life is complex