Issue
I'm a beginner and this is my first question on this forum. As the title says, my goal is to scrape the links from only one column of the table on this wiki page: https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain
I've already looked at several related questions on this forum (especially this one: How do I extract text data in first column from Wikipedia table?), but none of them seems to answer mine (and from what I understand, using a DataFrame is not a solution, since it is essentially a copy/paste of the table's text, while I want to get the links).
Here is my code so far:
import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain")
soup = bs(res.text, "html.parser")
table = soup.find('table', 'wikitable')
links = table.findAll('a')

communes = {}
for link in links:
    url = link.get("href", "")
    communes[link.text.strip()] = url
print(communes)
Thanks in advance for your answers!
Solution
To scrape a specific column, you can use the nth-of-type(n) CSS selector. To use a CSS selector, call the select() method instead of find_all().
For example, to scrape only the sixth column, select the sixth <td> of each row with soup.select("td:nth-of-type(6)").
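Note that select() also works on a single tag, so you can restrict the search to the table you already located. A quick sketch, reusing the table found with soup.find('table', 'wikitable') in the question's code:

# Sixth <td> of each row of that table only; cells from any other tables on the page are ignored
sixth_column_cells = table.select("td:nth-of-type(6)")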
Here's a complete example that prints all the links from only the fifth column:
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://fr.wikipedia.org"
URL = "https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain"

soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# The following finds all `a` tags under the fifth `td` of its type, i.e. the fifth column
for tag in soup.select("td:nth-of-type(5) a"):
    print(BASE_URL + tag["href"])
Output:
https://fr.wikipedia.org/wiki/Canton_de_Bourg-en-Bresse-1
https://fr.wikipedia.org/wiki/Canton_de_Bourg-en-Bresse-2
https://fr.wikipedia.org/wiki/Canton_d%27Amb%C3%A9rieu-en-Bugey
https://fr.wikipedia.org/wiki/Canton_de_Villars-les-Dombes
https://fr.wikipedia.org/wiki/Canton_de_Belley
...
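If you want the name-to-link dictionary from the question rather than just printed URLs, the same selector can be combined with the original loop. Here's a minimal sketch (the column index 1 is an assumption; change nth-of-type to whichever column you actually need):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://fr.wikipedia.org"
URL = "https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain"

soup = BeautifulSoup(requests.get(URL).content, "html.parser")
# Limit the search to the main table, as in the question's code
table = soup.find("table", "wikitable")

communes = {}
# Assumption: the links you want are in the first column; adjust the index if not
for tag in table.select("td:nth-of-type(1) a"):
    communes[tag.text.strip()] = urljoin(BASE_URL, tag["href"])

print(communes)

urljoin builds absolute URLs from the relative hrefs, which is slightly safer than plain string concatenation.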
Answered By - MendelG