Wednesday, September 21, 2022

[FIXED] bs4 - how to use find or find_all to get specific content from a URL

Labels: beautifulsoup, python

Issue

I need to get the URL for each country from the "a href=" inside each "div" with the class "well span4". For example, I need https://www.rulac.org/browse/countries/myanmar and https://www.rulac.org/browse/countries/the-netherlands, and every other URL after "a href=" (as shown in the partial HTML structure below).

Since the "a href=" is not under any class, how do I search for it and get all the country URLs?

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup


url = "https://www.rulac.org/browse/countries/P36"  
resp = requests.get(url)

soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all("div", class_="well span4")

# Partial HTML structure of res, shown below
[<div class="well span4">
 <a href="https://www.rulac.org/browse/countries/myanmar">
 <div class="map-wrap">
 <img alt="Myanmar" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&amp;zoom=5&amp;center=19.7633057,96.07851040000003&amp;format=png&amp;style=feature:administrative.locality%7Celement:all%7Cvisibility:off&amp;style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&amp;style=feature:road%7Celement:all%7Cvisibility:off&amp;style=feature:landscape%7Celement:all%7Chue:0xE0EADC&amp;key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Myanmar"/>
 <img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
 </div>
 </a>
 <h2>Myanmar</h2>
 <a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on <i class="icon-caret-right"></i></a>
 </div>,
 <div class="well span4">
 <a href="https://www.rulac.org/browse/countries/the-netherlands">
 <div class="map-wrap">
 <img alt="Netherlands" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&amp;zoom=5&amp;center=52.203566364441,5.7275408506393&amp;format=png&amp;style=feature:administrative.locality%7Celement:all%7Cvisibility:off&amp;style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&amp;style=feature:road%7Celement:all%7Cvisibility:off&amp;style=feature:landscape%7Celement:all%7Chue:0xE0EADC&amp;key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Netherlands"/>
 <img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
 </div>
 </a>
 <h2>Netherlands</h2>
 <a class="btn" href="https://www.rulac.org/browse/countries/the-netherlands">Read on <i class="icon-caret-right"></i></a>
 </div>,
 <div class="well span4">
 <a href="https://www.rulac.org/browse/countries/niger">
 <div class="map-wrap">
 <img alt="Niger" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&amp;zoom=5&amp;center=13.5115963,2.1253854000000274&amp;format=png&amp;style=feature:administrative.locality%7Celement:all%7Cvisibility:off&amp;style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&amp;style=feature:road%7Celement:all%7Cvisibility:off&amp;style=feature:landscape%7Celement:all%7Chue:0xE0EADC&amp;key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Niger"/>
 <img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
 </div>
 </a>
 <h2>Niger</h2>
 <a class="btn" href="https://www.rulac.org/browse/countries/niger">Read on <i class="icon-caret-right"></i></a>
 </div>,

Solution

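You can use soup.select() with a CSS selector to get every <a> element with class btn that is a direct child of a <div> with classes well and span4. Each of those divs contains two links to the same country URL, and the "Read on" link is the one that carries a class, so it is the easier one to target: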

import requests
from bs4 import BeautifulSoup


url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)

soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.select("div.well.span4 > a.btn")

# get all hrefs in a list and print it
hrefs = [el['href'] for el in res]
for href in hrefs:
    print(href)
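
Since the question asks about find and find_all specifically, here is a rough equivalent that does not use CSS selectors. It is only a sketch: the html string is a trimmed-down copy of the markup from the question (two entries, images removed) so it runs without a network request, and the name card is just an illustrative loop variable. It picks the first <a> inside each div, which is the link without a class.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question (two entries only).
html = """
<div class="well span4">
 <a href="https://www.rulac.org/browse/countries/myanmar">
  <div class="map-wrap"></div>
 </a>
 <h2>Myanmar</h2>
 <a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on</a>
</div>
<div class="well span4">
 <a href="https://www.rulac.org/browse/countries/the-netherlands">
  <div class="map-wrap"></div>
 </a>
 <h2>Netherlands</h2>
 <a class="btn" href="https://www.rulac.org/browse/countries/the-netherlands">Read on</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
# class_="well span4" matches the class attribute as an exact string.
for card in soup.find_all("div", class_="well span4"):
    # find() returns the first <a> with an href inside this div,
    # or None if there is no such tag.
    link = card.find("a", href=True)
    if link is not None:
        hrefs.append(link["href"])

for href in hrefs:
    print(href)
```

Note that passing class_="well span4" only matches when the attribute is written in exactly that order; if the site ever emits class="span4 well", the CSS selector div.well.span4 from the code above is the safer choice.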

Answered By - Michael M.

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
