Issue
I'm trying to extract data from a website with BeautifulSoup.
I'm currently stuck on this:
"Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a>"
I want to get the translators' names, but the href of each link contains the translator's ID, so it is different for every link.
My code is:
translater = soup.find_all("a", href="/searchinternet/advanced?all_authors_id=")
I tried using str.startswith, but it doesn't work. Can someone help me, please?
Solution
Provided your HTML is correct and static (not loaded by JavaScript after the initial page load), this is one way to select those links, using a CSS attribute selector in which ^= matches the start of the href value:
from bs4 import BeautifulSoup as bs
html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a></p>'''
soup = bs(html, 'html.parser')
a = soup.select('a[href^="/searchinternet/advanced?all_authors_id="]')
print(a[0])
print(a[0].get_text(strip=True))
print(a[0].get('href'))
Result in terminal:
<a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a>
Camille Fabien
/searchinternet/advanced?all_authors_id=35534&SearchAction=1
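Since the question mentions str.startswith, here is a minimal sketch of how that idea can be made to work with find_all. A plain string passed as href is compared against the whole attribute value, which is why the original call returned nothing. find_all also accepts a function or a compiled regular expression as an attribute filter. The second link and the variable names below are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample markup based on the question; the "/other" link is a made-up
# extra to show that unrelated links are filtered out.
html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a> <a href="/other">Other</a></p>'''
soup = BeautifulSoup(html, "html.parser")
prefix = "/searchinternet/advanced?all_authors_id="

# A plain string must equal the whole href, so this finds nothing.
exact = soup.find_all("a", href=prefix)

# A function filter receives the attribute value (None when the attribute is absent).
by_func = soup.find_all(
    "a", href=lambda value: value is not None and value.startswith(prefix)
)

# A compiled regex anchored at the start of the value does the same job.
by_regex = soup.find_all("a", href=re.compile("^" + re.escape(prefix)))

names = [a.get_text(strip=True) for a in by_func]
print(exact)
print(names)
```

Both filters select the same links as the CSS selector a[href^="..."], so the choice between them is mostly a matter of taste.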
EDIT: Who doesn't like a challenge? Based on further comments from the OP, here is a way of obtaining the titles, authors, translators and illustrators from that page, allowing for one or more translators and one or more illustrators per book:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
url = 'https://www.gallimard.fr/searchinternet/advanced/(editor_brand_id)/1/(fserie)/FOLIO-JUNIOR+LIVRE+HEROS%3A%3AFolio+Junior+-+Un+Livre+dont+Vous+%C3%AAtes+le+H%C3%A9ros+%40+DEFIS+FANTASTIQ%3A%3AS%C3%A9rie+D%C3%A9fis+Fantastiques/(limit)/3?date%5Bfrom%5D=1980-01-01&date%5Bto%5D=1995-01-01&SearchAction=OK'
big_list = []
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
items = soup.select('div[class="results bg_white"] > table div[class="item"]')
print()
for i in items:
    title = i.select_one('div[class="title"] h3')
    author = i.select_one('div[class="author"] a')
    history = i.select_one('p[class="collective_work_entries"]')
    # Links before the "Illustrations" text node are translators; links after it are illustrators.
    translators = [[y.get_text() for y in x.find_previous_siblings('a')] for x in history.contents if "Illustrations" in x]
    illustrators = [[y.get_text() for y in x.find_next_siblings('a')] for x in history.contents if "Illustrations" in x]
    # Flatten the nested lists and join multiple names with commas.
    big_list.append((title.text.strip(), author.text.strip(), ', '.join([x for y in translators for x in y]), ', '.join([x for y in illustrators for x in y])))
df = pd.DataFrame(big_list, columns = ['Title', 'Author', 'Translator(s)', 'Illustrator(s)'])
print(df)
Result in terminal:
Index | Title | Author | Translator(s) | Illustrator(s) |
---|---|---|---|---|
0 | Le Sépulcre des Ombres | Jonathan Green | Noël Chassériau | Alan Langford |
1 | La Légende de Zagor | Ian Livingstone | Pascale Houssin | Martin McKenna |
2 | Les Mages de Solani | Keith Martin | Noël Chassériau | Russ Nicholson |
3 | Le Siège de Sardath | Keith P. Phillips | Yannick Surcouf | Pete Knifton |
4 | Retour à la Montagne de Feu | Ian Livingstone | Yannick Surcouf | Martin McKenna |
5 | Les Mondes de l'Aleph | Peter Darvill-Evans | Yannick Surcouf | Tony Hough |
6 | Les Mercenaires du Levant | Paul Mason | Mona de Pracontal | Terry Oakes |
7 | L'Arpenteur de la Lune | Stephen Hand | Pierre de Laubier | Martin McKenna, Terry Oakes |
8 | La Tour de la Destruction | Keith Martin | Mona de Pracontal | Pete Knifton |
9 | La Légende des Guerriers Fantômes | Stephen Hand | Alexis Galmot | Martin McKenna |
10 | Le Repaire des Morts-Vivants | Dave Morris | Nicolas Grenier | David Gallagher |
11 | L'Ancienne Prophétie | Paul Mason | Mona de Pracontal | Terry Oakes |
12 | La Vengeance des Démons | Jim Bambra | Mona de Pracontal | Martin McKenna |
13 | Le Sceptre Noir | Keith Martin | Camille Fabien | David Gallagher |
14 | La Nuit des Mutants | Peter Darvill-Evans | Anne Collas | Alan Langford |
15 | L'Élu des Six Clans | Luke Sharp | Noël Chassériau | Martin Mac Kenna, Martin McKenna |
16 | Le Volcan de Zamarra | Luke Sharp | Olivier Meyer | David Gallagher |
17 | Les Sombres Cohortes | Ian Livingstone | Noël Chassériau | Nik William |
18 | Le Vampire du Château Noir | Keith Martin | Mona de Pracontal | Martin McKenna |
19 | Le Voleur d'Âmes | Keith Martin | Mona de Pracontal | Russ Nicholson |
20 | Le Justicier de l'Univers | Martin Allen | Mona de Pracontal | Tim Sell |
21 | Les Esclaves de l'Eternité | Paul Mason | Sylvie Bonnet | Bob Harvey |
22 | La Créature venue du Chaos | Steve Jackson | Noël Chassériau | Alan Langford |
23 | Les Rôdeurs de la Nuit | Graeme Davis | Nicolas Grenier | John Sibbick |
24 | L'Empire des Hommes-Lézards | Marc Gascoigne | Jean Lacroix | David Gallagher |
25 | Les Gouffres de la Cruauté | Luke Sharp | Sylvie Bonnet | Russ Nicholson |
26 | Les Spectres de l'Angoisse | Robin Waterfield | Mona de Pracontal | Ian Miller |
27 | Le Chasseur des Étoiles | Luke Sharp | Arnaud Dupin de Beyssat | Cary Mayes, Gary Mayes |
28 | Les Sceaux de la Destruction | Robin Waterfield | Sylvie Bonnet | Russ Nicholson |
29 | La Crypte du Sorcier | Ian Livingstone | Noël Chassériau | John Sibbick |
30 | La Forteresse du Cauchemar | Peter Darvill-Evans | Mona de Pracontal | Dave Carson |
31 | La Grande Menace des Robots | Steve Jackson | Danielle Plociennik | Gary Mayes |
32 | L'Épée du Samouraï | Mark Smith | Pascale Jusforgues | Alan Langford |
33 | L'Épreuve des Champions | Ian Livingstone | Alain Vaulont, Pascale Jusforgues | Brian Williams |
34 | Défis Sanglants sur l'Océan | Andrew Chapman | Jean Walter | Bob Harvey |
35 | Les Démons des Profondeurs | Steve Jackson | Noël Chassériau | Bob Harvey |
36 | Rendez-vous avec la M.O.R.T. | Steve Jackson | Arnaud Dupin de Beyssat | Declan Considine |
37 | La Planète Rebelle | Robin Waterfield | C. Degolf | Gary Mayes |
38 | Les Trafiquants de Kelter | Andrew Chapman | Anne Blanchet | Nik Spender |
39 | Le Combattant de l'Autoroute | Ian Livingstone | Alain Vaulont, Pascale Jusforgues | Kevin Bulmer |
40 | Le Mercenaire de l'Espace | Andrew Chapman | Jean Walthers | Geoffroy Senior |
41 | Le Temple de la Terreur | Ian Livingstone | Denise May | Bill Houston |
42 | Le Manoir de l'Enfer | Steve Jackson | | |
43 | Le Marais aux Scorpions | Steve Jackson | Camille Fabien | Duncan Smith |
44 | Le Talisman de la Mort | Steve Jackson | Camille Fabien | Bob Harvey |
45 | La Sorcière des Neiges | Ian Livingstone | Michel Zénon | Edward Crosby, Gary Ward |
46 | La Citadelle du Chaos | Steve Jackson | Marie-Raymond Farré | Russ Nicholson |
47 | La Galaxie Tragique | Steve Jackson | Camille Fabien | Peter Jones |
48 | La Forêt de la Malédiction | Ian Livingstone | Camille Fabien | Malcolm Barter |
49 | La Cité des Voleurs | Ian Livingstone | Henri Robillot | Iain McCaig |
50 | Le Labyrinthe de la Mort | Ian Livingstone | Patricia Marais | Iain McCaig |
51 | L'Île du Roi Lézard | Ian Livingstone | Fabienne Vimereu | Alan Langford |
52 | Le Sorcier de la Montagne de Feu | Steve Jackson | Camille Fabien | Russ Nicholson |
Bear in mind that this method fails for Le Manoir de l'Enfer, because the word 'Illustrations' is not found in its text, so both its translator and illustrator columns are left empty. It's down to the OP to find a solution for that one.
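One possible way around the missing 'Illustrations' marker is sketched below, under stated assumptions. Instead of anchoring everything on the 'Illustrations' text node, it walks the children of the paragraph in order and assigns each link to the role announced by the most recent text node. The helper name split_roles is hypothetical. The markers "Trad." and "Illustrations" are assumed from the snippet in the question, and the sample markup is invented for the demonstration rather than copied from the site, so the real page may need different markers.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def split_roles(paragraph):
    """Assign each <a> to the role announced by the nearest preceding text node."""
    roles = {"translators": [], "illustrators": []}
    if paragraph is None:  # the entry has no credits paragraph at all
        return roles
    current = None
    for node in paragraph.children:
        if isinstance(node, Tag):
            # Only links that follow a recognised marker are collected.
            if node.name == "a" and current is not None:
                roles[current].append(node.get_text(strip=True))
        elif isinstance(node, NavigableString):
            text = str(node)
            # Assumed markers: "Trad." introduces translators,
            # "Illustrations" introduces illustrators.
            if "Illustrations" in text:
                current = "illustrators"
            elif "Trad." in text:
                current = "translators"
    return roles

# Hypothetical markup modelled on the snippet in the question.
with_both = '<p>Trad. de l\'anglais par <a href="#">Alain Vaulont</a> et <a href="#">Pascale Jusforgues</a>. Illustrations de <a href="#">Brian Williams</a></p>'
translator_only = '<p>Trad. de l\'anglais par <a href="#">Camille Fabien</a></p>'

result_both = split_roles(BeautifulSoup(with_both, "html.parser").p)
result_translator_only = split_roles(BeautifulSoup(translator_only, "html.parser").p)
print(result_both)
print(result_translator_only)
```

In the loop above, the two list comprehensions could then be replaced by a single call such as split_roles(history). A paragraph without the word 'Illustrations' would still yield its translators.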
BeautifulSoup documentation can be found at https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Also, Pandas docs can be found here: https://pandas.pydata.org/pandas-docs/stable/index.html
Answered By - Barry the Platipus