Issue
At the top of this site are 17 ID tags:
1. Boxed warning
2. Indications
3. Dosage/Administration
4. Dosage forms
5. Contraindications
6. Warnings/Precautions
7. Adverse reactions
8. Drug interactions
9. Specific populations
10. Overdosage
11. Description
12. Clinical pharmacology
13. Nonclinical toxicology
14. Clinical studies
15. How supplied
16. Patient counseling
17. Medication guide
I want to scrape the page and make a dictionary with those tags as the keys. How can I do this? Here's what I've tried so far:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

urls = "https://www.drugs.com/pro/abacavir-lamivudine-and-zidovudine-tablets.html"
response = requests.get(urls)
soup = BeautifulSoup(response.text, 'html.parser')
data3 = soup.findAll('h2')
out = {}
y1 = []
y2 = []
for header in data3:
    x0 = header.get('id')
    y1.append(x0)
    nextNode = header
    while True:
        nextNode = nextNode.nextSibling
        if nextNode is None:
            break
        if isinstance(nextNode, NavigableString):
            x1 = nextNode.strip()
        if isinstance(nextNode, Tag):
            if nextNode.name == "h2":
                break
            x2 = nextNode.get_text(strip=True).strip()
            x3 = x1 + " " + x2
            y2.append(x3)
print(y1, y2)
Here's the output I'm getting:
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None] [content]
Desired Output: ['boxed warning', 'indications', 'dosage/administration', 'dosage forms', 'contraindications', 'warnings/precautions', 'adverse reactions', 'drug interactions', 'specific populations', 'overdosage', 'description', 'clinical pharmacology', 'nonclinical toxicology', 'clinical studies', 'how supplied', 'patient counseling', 'medication guide'] ['content present under boxed warning', 'content present under indications']
How can I get a dictionary or list that replaces all the Nones with the list of tags? I'm struggling to work with the structure of the webpage. Thank you!
Solution
I'm not 100% sure what you need, but based on the comments I think this is what you are looking for. You can easily add the output to a list or a dictionary.
import requests
from bs4 import BeautifulSoup

urls = "https://www.drugs.com/pro/abacavir-lamivudine-and-zidovudine-tablets.html"
response = requests.get(urls)
soup = BeautifulSoup(response.text, 'html.parser')

tags = soup.find('div', {'class': 'ddc-anchor-links'})
available_information = []
for tag in tags.find_all('a'):
    available_information.append(tag.text)
print(available_information)
# output
['Boxed Warning', 'Indications and Usage', 'Dosage and Administration', 'Dosage Forms and Strengths', 'Contraindications', 'Warnings and Precautions', 'Adverse Reactions/Side Effects', 'Drug Interactions', 'Use In Specific Populations', 'Overdosage', 'Description', 'Clinical Pharmacology', 'Nonclinical Toxicology', 'Clinical Studies', 'How Supplied/Storage and Handling', 'Patient Counseling Information', 'Medication Guide']
You can obtain the content for each table-of-contents (TOC) entry using this code:
anchor_tags = []
soup = BeautifulSoup(response.text, 'html.parser')
tags = soup.find('div', {'class': 'ddc-toc-content'})
for tag in tags.find_all('a'):
    anchor_tag = str(tag['href']).replace('#', '')
    anchor_tags.append(anchor_tag)

for tag in anchor_tags:
    anchor_tag = soup.find("a", {"id": tag})
    header_tag = anchor_tag.find_next_sibling('h2')
    # Now you need to decide how you want to store the information being extracted.
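Since the question asked for a dictionary keyed by the section tags, here is one way to store the extracted sections. This is a minimal sketch that runs against a small local HTML fragment mimicking the drugs.com layout (an `<a id>` anchor immediately before each `<h2>` heading — the structure the anchor-lookup code above assumes); the fragment's section text is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like the drugs.com markup:
# a TOC div, then an <a id> anchor before each <h2> section heading.
html = """
<div class="ddc-toc-content">
  <a href="#boxed-warning">Boxed Warning</a>
  <a href="#indications">Indications and Usage</a>
</div>
<a id="boxed-warning"></a>
<h2>Boxed Warning</h2>
<p>Risk of hypersensitivity reactions.</p>
<a id="indications"></a>
<h2>Indications and Usage</h2>
<p>Indicated for the treatment of HIV-1 infection.</p>
"""

soup = BeautifulSoup(html, 'html.parser')
sections = {}
toc = soup.find('div', {'class': 'ddc-toc-content'})
for link in toc.find_all('a'):
    anchor_id = link['href'].lstrip('#')
    anchor = soup.find('a', {'id': anchor_id})
    header = anchor.find_next_sibling('h2')
    # Collect sibling tags until the next section anchor or heading.
    parts = []
    node = header.find_next_sibling()
    while node is not None and node.name not in ('a', 'h2'):
        parts.append(node.get_text(strip=True))
        node = node.find_next_sibling()
    sections[header.get_text(strip=True)] = ' '.join(parts)

print(sections)
```

On the real page you would replace `html` with `response.text`; the stopping condition (`'a'` or `'h2'`) may need adjusting if the live markup nests sections differently.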
Based on our chat conversation, you can query multiple pages that have different structures this way. You will have to modify search_terms and known_tags as you scrape more pages with different structures.
import requests
from bs4 import BeautifulSoup

def get_soup(target_url):
    response = requests.get(target_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def obtain_toc_content(soup):
    available_information = []
    anchor_tags = []
    known_tags = ['div', 'ul']
    search_terms = ['ddc-toc-content', 'ddc-anchor-links']
    for tag, search_string in zip(known_tags, search_terms):
        tag_found = bool(soup.find(tag, {'class': search_string}))
        if tag_found:
            toc = soup.find(tag, {'class': search_string})
            for toc_tag in toc.find_all('a'):
                available_information.append(toc_tag.text)
                anchor_tag = str(toc_tag['href'])
                anchor_tags.append(anchor_tag)
    return available_information, anchor_tags

urls = ['https://www.drugs.com/pro/abacavir-lamivudine-and-zidovudine-tablets.html',
        'https://www.drugs.com/ajovy.html',
        'https://www.drugs.com/cons/a-b-otic.html']

for url in urls:
    make_soup = get_soup(url)
    results = obtain_toc_content(make_soup)
    table_of_content = results[0]
    toc_tags = results[1]
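To keep the results from several pages together, the two lists returned by obtain_toc_content can be zipped into a dictionary keyed by section title. A short sketch with placeholder data standing in for one page's results:

```python
# Placeholder data standing in for one page's (titles, anchors) result
# from obtain_toc_content.
available_information = ['Boxed Warning', 'Indications and Usage']
anchor_tags = ['#boxed-warning', '#indications']

# Map each section title to its anchor href for later lookup.
toc_map = dict(zip(available_information, anchor_tags))
print(toc_map)
# {'Boxed Warning': '#boxed-warning', 'Indications and Usage': '#indications'}
```

In the loop above you could collect these per-URL, e.g. `pages[url] = dict(zip(*results))`.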
Answered By - Life is complex