Issue
I'm trying to parse iPhone data. The code works almost fine, but it only retrieves the data of the first iPhone on each page. I looked through several other topics but was not able to solve my problem. The issue is addressed along with the code below.
Here's the code:
from bs4 import BeautifulSoup
import pandas as pd
import time as tm
from selenium import webdriver
path = "insert your driver path here"
termo_busca = 'iphone'
url = f'https://www.magazineluiza.com.br/busca/{termo_busca}/'
# I'm using Selenium to access the website and then parse the HTML
# with bs4, because the website recognized both requests_html and
# urllib requests as bots.
# This function fetches and parses the HTML.
def extrator_html_selenium(url):
    navegador = webdriver.Chrome(path)
    navegador.get(url)
    tm.sleep(5)
    html = navegador.page_source
    soup = BeautifulSoup(html, 'html5lib')
    navegador.quit()
    return soup
The find_all method parses all of the info that I want, but the find method inside the for loop parses only the first set of elements on each page. What is wrong with it?
lista_dados = []
def extrator_dados(soup):
    links = soup.find_all('section', {'style': 'grid-area:content'})  # gets the product link and name
    preco_normal = 0
    preco_promocao_avista = 0
    preco_promocao_parcelado = 0
    num_avaliacoes = 0
    for item in links:
        nome_produto = item.find('h2', {'class': 'sc-bHXGc hsFKpx'}).text.strip()
        link_produto = 'https://www.magazineluiza.com.br' + str(item.find('a', {'class': 'sc-kfzBvY kHKKYz sc-kNMMVl fJHpec'})['href'])
        try:
            preco_normal = item.find('p', {'class': 'sc-hKgJUU eKvUCv sc-jYCGPb kAAMBY'}).text.replace('R$', '').strip()
            preco_promocao_avista = item.find('p', {'class': 'sc-hKgJUU kegCEa sc-bxnjHY cTpdOW'}).text.replace('R$', '').strip()
            preco_promocao_parcelado = item.find('p', {'class': 'sc-hKgJUU nIoWN sc-gyUflj dQzJJE'}).text.replace('R$', '').strip()  # .replace('ou', '').replace('x de', '').replace('sem juros', '')
        except AttributeError:  # catch only the "tag not found" case rather than a bare except
            preco_normal = 0
            preco_promocao_avista = 0
            preco_promocao_parcelado = 0
        try:
            num_avaliacoes = item.find('span', {'class': 'sc-irOPex eZnwGI'}).text.strip()
        except AttributeError:
            num_avaliacoes = 0
        # numero_avaliacoes = [item.get_text(strip=True) for item in links3.find('span', {'class': 'sc-irOPex eZnwGI'})]
        dados_dict = {
            'nome_produto': nome_produto,
            'link_produto': link_produto,
            'preco_normal': preco_normal,
            'preco_promocao_avista': preco_promocao_avista,
            'preco_promocao_parcelado': preco_promocao_parcelado,
            'num_avaliacoes': num_avaliacoes
        }
        lista_dados.append(dados_dict)
    return
The other problem is that the code doesn't extract the text element (the number of reviews) inside the following tag, even though both the find_all and find methods used above parse it correctly:
num_avaliacoes = item.find('span', {'class': 'sc-irOPex eZnwGI'}).text.strip()
# loop through the 17 pages of the website with iPhone data
for url2 in range(1, 18):  # range(1, 17) would stop at page 16
    url2 = f'https://www.magazineluiza.com.br/busca/iphone/?page={url2}'
    soup = extrator_html_selenium(url2)
    extrator_dados(soup)
print(len(lista_dados))
# create a DataFrame with the parsed data
df = pd.DataFrame(lista_dados)
df.to_csv('iphones_magalu_final.csv', index=False)
Thank you in advance!
Solution
What happens?
You are selecting the outer container that holds the items you would like to iterate over. In this case, find_all() gives you a list with exactly one element, and that is the reason why you only get the first item.
How to fix?
It is as simple as that: select all <li> in the <div> with data-testid="product-list" and it will work. Change from:
links = soup.find_all('section', {'style' : 'grid-area:content'})
to:
links = soup.select('div[data-testid="product-list"] li')
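To see the difference, here is a minimal, self-contained sketch. The HTML below is a simplified stand-in for the real page (the long styled-component class names are omitted), but the two selectors are the ones discussed above:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the result page: one outer <section>
# wrapping a product list with one <li> per product.
html = """
<section style="grid-area:content">
  <div data-testid="product-list">
    <ul>
      <li><h2>iPhone 11 64GB</h2></li>
      <li><h2>iPhone XR 64GB</h2></li>
      <li><h2>iPhone 12 128GB</h2></li>
    </ul>
  </div>
</section>
"""
soup = BeautifulSoup(html, 'html.parser')

# The original selector matches only the single outer container,
# so the for loop runs exactly once.
outer = soup.find_all('section', {'style': 'grid-area:content'})
print(len(outer))  # 1

# Selecting the <li> elements yields one entry per product.
items = soup.select('div[data-testid="product-list"] li')
print(len(items))  # 3
for li in items:
    print(li.find('h2').text)
```

Inside the loop, item.find(...) then searches within one product instead of within the whole page, so each product's name, price, and review count come out separately.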
Output
nome_produto link_produto preco_normal preco_promocao_avista preco_promocao_parcelado num_avaliacoes
0 iPhone 11 Apple 64GB Roxo 6,1” 12MP iOS https://www.magazineluiza.com.br/iphone-11-app... 5.699,00 3.999,00 ou 10x de 399,90 sem juros 1
1 iPhone 11 Apple 64GB Preto 6,1” 12MP iOS https://www.magazineluiza.com.br/iphone-11-app... 5.699,00 3.999,00 ou 10x de 399,90 sem juros 80
2 iPhone XR Apple 64GB Preto 6,1” 12MP iOS https://www.magazineluiza.com.br/iphone-xr-app... 4.999,00 3.399,00 ou 10x de 339,90 sem juros 88
Hint
The site uses bot detection, so be restrained when scraping ;)
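One way to stay restrained is to pause a random interval between page requests instead of hitting every page back to back. A minimal sketch (the helper name polite_fetch and the delay bounds are illustrative assumptions, not part of the original answer):

```python
import random
import time

def polite_fetch(fetch, urls, min_delay=2.0, max_delay=5.0):
    """Call fetch(url) for each URL, sleeping a random interval
    between requests to keep the request rate low.
    Illustrative helper; tune the delay bounds to the site."""
    pages = []
    for i, url in enumerate(urls):
        if i:  # no need to sleep before the very first request
            time.sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages
```

It could then be wired to the question's page loop, e.g. `polite_fetch(extrator_html_selenium, urls)` with the list of search-page URLs; randomized gaps look less mechanical than a fixed sleep(5).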
Answered By - HedgeHog