Issue
I need to get the individual URL for each country, i.e. the value after "a href=" inside each "div" of class "well span4". For example, I need https://www.rulac.org/browse/countries/myanmar and https://www.rulac.org/browse/countries/the-netherlands, and every other URL after "a href=" (as shown in the partial HTML structure below).
Since the "a href=" is not under any class, how do I search for and collect all the country URLs?
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all("div", class_="well span4")
# Partial html structure shown as below
[<div class="well span4">
<a href="https://www.rulac.org/browse/countries/myanmar">
<div class="map-wrap">
<img alt="Myanmar" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=19.7633057,96.07851040000003&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Myanmar"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Myanmar</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on <i class="icon-caret-right"></i></a>
</div>,
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/the-netherlands">
<div class="map-wrap">
<img alt="Netherlands" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=52.203566364441,5.7275408506393&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Netherlands"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Netherlands</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/the-netherlands">Read on <i class="icon-caret-right"></i></a>
</div>,
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/niger">
<div class="map-wrap">
<img alt="Niger" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=13.5115963,2.1253854000000274&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Niger"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Niger</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/niger">Read on <i class="icon-caret-right"></i></a>
</div>,
Solution
You can use soup.select() with a CSS selector to get all <a> elements of class btn that are direct children of <div>s with classes well and span4, like this:
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.select("div.well.span4 > a.btn")
# collect every href into a list, then print each one
hrefs = [el['href'] for el in res]
for href in hrefs:
    print(href)
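The question also asks about the first <a>, which has no class. As a minimal sketch that runs without network access, the snippet below parses a trimmed copy of the markup from the question (the image tags are left out, and the `html` string and variable names are illustrative only) and shows three ways to collect the links. It assumes the live page keeps the same structure as the excerpt.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (image tags omitted).
html = """
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/myanmar"><div class="map-wrap"></div></a>
<h2>Myanmar</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on</a>
</div>
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/the-netherlands"><div class="map-wrap"></div></a>
<h2>Netherlands</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/the-netherlands">Read on</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: the "Read on" link, selected by its btn class.
btn_hrefs = [a["href"] for a in soup.select("div.well.span4 > a.btn")]

# Option 2: the first <a> carrying an href inside each card, with or without a class.
first_hrefs = [
    div.find("a", href=True)["href"] for div in soup.select("div.well.span4")
]

# Option 3: every link inside the cards, de-duplicated while keeping page order.
unique_hrefs = list(
    dict.fromkeys(a["href"] for a in soup.select("div.well.span4 a[href]"))
)

for href in unique_hrefs:
    print(href)
```

On this excerpt all three options give the same two URLs, because both links in a card point to the same country page. Option 3 is probably the most forgiving if the site drops the btn class or reorders the links.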
Answered By - Michael M.