Issue
Hello everyone, I'm trying to get all href links with Python using the following:
import requests
from bs4 import BeautifulSoup

url = 'https://rappel.conso.gouv.fr'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

# Collecting links on rappel.conso.gouv.fr
def get_url(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract(soup):
    results = soup.find_all('div', {'class': 'product-content'})
    for item in results:
        item.find('a', {'class': 'product-link'}).text.replace('', '').strip()
        links = url + item.find('a', {'class': 'product-link'})['href']
    return links

soup = get_url(url)
print(extract(soup))
I'm supposed to get 10 HTML links, as follows:
https://rappel.conso.gouv.fr/fiche-rappel/4571/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4572/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4573/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4575/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4569/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4565/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4568/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4570/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4567/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4558/Interne
It actually works when I write print into the code, as follows:
def extract(soup):
    results = soup.find_all('div', {'class': 'product-content'})
    for item in results:
        item.find('a', {'class': 'product-link'}).text.replace('', '').strip()
        links = url + item.find('a', {'class': 'product-link'})['href']
        print(links)
    return
But I'm supposed to take all the links I get from this request and put them into a loop, so that I can get data from each of those 10 pages and store them in a database (which means there is more code to write after def extract(soup)).
I have tried to follow many tutorials, but I only ever get one link or None.
Solution
You just need to build a list of links; in your code the variable links gets overwritten on each iteration of the loop, so only the last value survives. Try this:
def extract(soup):
    results = soup.find_all('div', {'class': 'product-content'})
    links = []
    for item in results:
        # Append each absolute URL instead of overwriting a single variable.
        links.append(url + item.find('a', {'class': 'product-link'})['href'])
    return links
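The fix can be checked offline against a small static HTML snippet; the markup below is a hypothetical fragment mirroring the structure of the real listing page, not the site's actual HTML:

    from bs4 import BeautifulSoup

    url = 'https://rappel.conso.gouv.fr'

    # Hypothetical snippet shaped like the listing page's product blocks.
    sample_html = """
    <div class="product-content">
      <a class="product-link" href="/fiche-rappel/4571/Interne">Produit A</a>
    </div>
    <div class="product-content">
      <a class="product-link" href="/fiche-rappel/4572/Interne">Produit B</a>
    </div>
    """

    def extract(soup):
        results = soup.find_all('div', {'class': 'product-content'})
        links = []
        for item in results:
            links.append(url + item.find('a', {'class': 'product-link'})['href'])
        return links

    links = extract(BeautifulSoup(sample_html, 'html.parser'))
    print(links)

Running this prints both absolute URLs, confirming the list accumulates one entry per product block.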
To print each link in the main code after the functions (note the loop variable is renamed to link so it does not shadow the global url):

soup = get_url(url)
linklist = extract(soup)
for link in linklist:
    print(link)
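From there, the follow-up step the question mentions (looping over the 10 links to scrape each page) might be sketched as below. This is only an outline under assumptions: the fetch_detail and parse_detail helpers are invented names, and the h1 selector is a guess at the detail pages' markup, which you would need to inspect and adjust.

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0'}

    def fetch_detail(link):
        # Download one recall page and parse it (this makes a network call).
        r = requests.get(link, headers=headers)
        return BeautifulSoup(r.text, 'html.parser')

    def parse_detail(link, page):
        # Pull one field out of a parsed detail page; the <h1> selector
        # is an assumption about the page structure, not the real markup.
        title = page.find('h1')
        return {'url': link,
                'title': title.get_text(strip=True) if title else None}

    # records = [parse_detail(link, fetch_detail(link)) for link in linklist]
    # ... then insert the records into the database.

Keeping the parsing separate from the fetching makes parse_detail easy to test against static HTML, without hitting the site.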
Answered By - Alecx