Issue
I'm a beginner and this is my first question on this forum. As the title says, my goal is to scrape the links from only one column of the table on this wiki page: https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain
I've already looked at several related questions on this forum (especially this one: How do I extract text data in first column from Wikipedia table?), but none of them seems to answer mine (and from what I understand, using a DataFrame is not a solution, since it is essentially a copy/paste of the table's text, while I want to get the links).
Here is my code so far:
import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain")
soup = bs(res.text, "html.parser")
table = soup.find('table', 'wikitable')
links = table.findAll('a')

communes = {}
for link in links:
    url = link.get("href", "")
    communes[link.text.strip()] = url
print(communes)
Thanks in advance for your answers!
Solution
To scrape a specific column, you can use the nth-of-type(n) CSS selector. To use a CSS selector, call the select() method instead of find_all().
For example, to scrape only the sixth column, select the sixth <td> of each row with soup.select("td:nth-of-type(6)").
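Note that select() also works on a single tag, so you can restrict the search to the table you already located. A quick sketch, reusing the table found with soup.find('table', 'wikitable') in the question's code:

# Sixth <td> of each row of that table only; cells from any other tables on the page are ignored
sixth_column_cells = table.select("td:nth-of-type(6)")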
Here's a complete example that prints all the links from only the fifth column:
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://fr.wikipedia.org"
URL = "https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain"

soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# The following finds all `a` tags under the fifth `td` of its type, i.e. the fifth column
for tag in soup.select("td:nth-of-type(5) a"):
    print(BASE_URL + tag["href"])
Output:
https://fr.wikipedia.org/wiki/Canton_de_Bourg-en-Bresse-1
https://fr.wikipedia.org/wiki/Canton_de_Bourg-en-Bresse-2
https://fr.wikipedia.org/wiki/Canton_d%27Amb%C3%A9rieu-en-Bugey
https://fr.wikipedia.org/wiki/Canton_de_Villars-les-Dombes
https://fr.wikipedia.org/wiki/Canton_de_Belley
...
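If you want the name-to-link dictionary from the question rather than just printed URLs, the same selector can be combined with the original loop. Here's a minimal sketch (the column index 1 is an assumption; change nth-of-type to whichever column you actually need):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://fr.wikipedia.org"
URL = "https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain"

soup = BeautifulSoup(requests.get(URL).content, "html.parser")
# Limit the search to the main table, as in the question's code
table = soup.find("table", "wikitable")

communes = {}
# Assumption: the links you want are in the first column; adjust the index if not
for tag in table.select("td:nth-of-type(1) a"):
    communes[tag.text.strip()] = urljoin(BASE_URL, tag["href"])

print(communes)

urljoin builds absolute URLs from the relative hrefs, which is slightly safer than plain string concatenation.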
Answered By - MendelG