Issue
Hello everyone, I'm trying to get all href links with Python using the following:
import requests
from bs4 import BeautifulSoup

url = 'https://rappel.conso.gouv.fr'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

# Collecting links on rappel.conso.gouv.fr
def get_url(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract(soup):
    results = soup.find_all('div', {'class': 'product-content'})
    for item in results:
        item.find('a', {'class': 'product-link'}).text.replace('', '').strip()
        links = url + item.find('a', {'class': 'product-link'})['href']
    return links

soup = get_url(url)
print(extract(soup))
I'm supposed to get 10 HTML links, as follows:
https://rappel.conso.gouv.fr/fiche-rappel/4571/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4572/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4573/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4575/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4569/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4565/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4568/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4570/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4567/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4558/Interne
It actually works when I write print into the code, as follows:
def extract(soup):
    results = soup.find_all('div', {'class': 'product-content'})
    for item in results:
        item.find('a', {'class': 'product-link'}).text.replace('', '').strip()
        links = url + item.find('a', {'class': 'product-link'})['href']
        print(links)
    return
But I'm supposed to take all the links I get from this request and put them into a loop, so that I can get data from each of those 10 pages and store them in a database (which means there is more code to write after def extract(soup)).
I have tried to follow many tutorials, but I only ever get one link or None.
Solution
You just need to build a list of links; in your code the variable links gets overwritten on each iteration of the loop, so only the last value survives. Try this:
def extract(soup):
    results = soup.find_all('div', {'class': 'product-content'})
    links = []
    for item in results:
        # Append each absolute URL instead of overwriting a single variable.
        links.append(url + item.find('a', {'class': 'product-link'})['href'])
    return links
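The fix can be checked offline against a small static HTML snippet; the markup below is a hypothetical fragment mirroring the structure of the real listing page, not the site's actual HTML:

    from bs4 import BeautifulSoup

    url = 'https://rappel.conso.gouv.fr'

    # Hypothetical snippet shaped like the listing page's product blocks.
    sample_html = """
    <div class="product-content">
      <a class="product-link" href="/fiche-rappel/4571/Interne">Produit A</a>
    </div>
    <div class="product-content">
      <a class="product-link" href="/fiche-rappel/4572/Interne">Produit B</a>
    </div>
    """

    def extract(soup):
        results = soup.find_all('div', {'class': 'product-content'})
        links = []
        for item in results:
            links.append(url + item.find('a', {'class': 'product-link'})['href'])
        return links

    links = extract(BeautifulSoup(sample_html, 'html.parser'))
    print(links)

Running this prints both absolute URLs, confirming the list accumulates one entry per product block.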
To print each link in the main code after the functions (note the loop variable is renamed to link so it does not shadow the global url):

soup = get_url(url)
linklist = extract(soup)
for link in linklist:
    print(link)
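From there, the follow-up step the question mentions (looping over the 10 links to scrape each page) might be sketched as below. This is only an outline under assumptions: the fetch_detail and parse_detail helpers are invented names, and the h1 selector is a guess at the detail pages' markup, which you would need to inspect and adjust.

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0'}

    def fetch_detail(link):
        # Download one recall page and parse it (this makes a network call).
        r = requests.get(link, headers=headers)
        return BeautifulSoup(r.text, 'html.parser')

    def parse_detail(link, page):
        # Pull one field out of a parsed detail page; the <h1> selector
        # is an assumption about the page structure, not the real markup.
        title = page.find('h1')
        return {'url': link,
                'title': title.get_text(strip=True) if title else None}

    # records = [parse_detail(link, fetch_detail(link)) for link in linklist]
    # ... then insert the records into the database.

Keeping the parsing separate from the fetching makes parse_detail easy to test against static HTML, without hitting the site.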
Answered By - Alecx