Issue
How do I extract the link from the following HTML:
<li><i class="fas fa-file-pdf"></i> <a href="https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf" rel="noopener" target="_blank">2021-2022 Common Data Set</a></li>
Solution
Use a list comprehension with CSS selectors to get a list of links. To select all links whose href ends with .pdf:
[a['href'] for a in soup.select('a[href$=".pdf"]')]
Or, to be more specific, select each <a> with an href that is the immediately following sibling of an <i> with class fa-file-pdf:
[a['href'] for a in soup.select('li i.fa-file-pdf + a[href]')]
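The two selectors do not always return the same links. As a minimal sketch, assuming some hypothetical markup (the paths below are made up for illustration and are not from the original page), the suffix selector matches on the URL while the sibling selector matches on the icon that precedes the link:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = '''
<ul>
<li><i class="fas fa-file-pdf"></i> <a href="/docs/report.pdf">Report</a></li>
<li><a href="/docs/plain.pdf">Plain PDF link</a></li>
<li><i class="fas fa-file-pdf"></i> <a href="/docs/viewer?id=7">Viewer</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# Matches any <a> whose href ends with ".pdf", icon or not
by_suffix = [a['href'] for a in soup.select('a[href$=".pdf"]')]

# Matches any <a href> directly following an <i class="... fa-file-pdf">,
# regardless of what the URL looks like
by_icon = [a['href'] for a in soup.select('li i.fa-file-pdf + a[href]')]

print(by_suffix)  # ['/docs/report.pdf', '/docs/plain.pdf']
print(by_icon)    # ['/docs/report.pdf', '/docs/viewer?id=7']
```

Which one fits depends on whether the page marks PDFs reliably by file extension or by icon.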
If the goal is to extract only the first link:
link = [a['href'] for a in soup.select('a[href$=".pdf"]')][0]
or
link = soup.select_one('a[href$=".pdf"]')['href']
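Both one-liners raise an exception when the page contains no matching link: the list version fails with IndexError, and select_one returns None, so subscripting it fails with TypeError. A hedged sketch of a guard, where first_pdf_link is a hypothetical helper name and not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

def first_pdf_link(html):
    """Return the first href ending in .pdf, or None if there is none.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.select_one('a[href$=".pdf"]')
    # select_one returns None when nothing matches the selector
    return a['href'] if a is not None else None

print(first_pdf_link('<a href="/files/a.pdf">A</a>'))   # /files/a.pdf
print(first_pdf_link('<a href="/files/a.html">A</a>'))  # None
```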
Example
from bs4 import BeautifulSoup
html = '''
<li><i class="fas fa-file-pdf"></i> <a href="https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf" rel="noopener" target="_blank">2021-2022 Common Data Set</a></li>
<li><i class="fas fa-file-pdf"></i> <a href="https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf" rel="noopener" target="_blank">2021-2022 Common Data Set</a></li>
<li><i class="fas fa-file-pdf"></i> <a href="https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf" rel="noopener" target="_blank">2021-2022 Common Data Set</a></li>
<li><i class="fas fa-file-pdf"></i> <a href="https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf" rel="noopener" target="_blank">2021-2022 Common Data Set</a></li>
'''
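# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')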
urlList = [a['href'] for a in soup.select('a[href$=".pdf"]')]
Output
['https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf',
'https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf',
'https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf',
'https://manoa.hawaii.edu/miro/wp-content/uploads/2022/03/AnalysisBrief_CommonDataSet_2021.docx.pdf']
Answered By - HedgeHog