Issue
I'm trying to parse iPhone data. The code works almost fine, but it only retrieves the data of the first iPhone on each page. I looked through several other topics but was not able to solve my problem. The issue is addressed along with the code below.
Here's the code:
from bs4 import BeautifulSoup
import pandas as pd
import time as tm
from selenium import webdriver
path = "insert your driver path here"
termo_busca = 'iphone'
url = f'https://www.magazineluiza.com.br/busca/{termo_busca}/'
# I'm using Selenium to access the website and then parse the HTML
# with bs4, because the website recognized both requests_html and
# urllib requests as bots.
# This function fetches and parses the HTML.
def extrator_html_selenium(url):
    navegador = webdriver.Chrome(path)
    navegador.get(url)
    tm.sleep(5)
    html = navegador.page_source
    soup = BeautifulSoup(html, 'html5lib')
    navegador.quit()
    return soup
The find_all method parses all of the info that I want, but the find method inside the for loop parses only the first set of elements on each page. What is wrong with it?
lista_dados = []
def extrator_dados(soup):
    links = soup.find_all('section', {'style': 'grid-area:content'})  # gets the product link and name
    preco_normal = 0
    preco_promocao_avista = 0
    preco_promocao_parcelado = 0
    num_avaliacoes = 0
    for item in links:
        nome_produto = item.find('h2', {'class': 'sc-bHXGc hsFKpx'}).text.strip()
        link_produto = 'https://www.magazineluiza.com.br' + str(item.find('a', {'class': 'sc-kfzBvY kHKKYz sc-kNMMVl fJHpec'})['href'])
        try:
            preco_normal = item.find('p', {'class': 'sc-hKgJUU eKvUCv sc-jYCGPb kAAMBY'}).text.replace('R$', '').strip()
            preco_promocao_avista = item.find('p', {'class': 'sc-hKgJUU kegCEa sc-bxnjHY cTpdOW'}).text.replace('R$', '').strip()
            preco_promocao_parcelado = item.find('p', {'class': 'sc-hKgJUU nIoWN sc-gyUflj dQzJJE'}).text.replace('R$', '').strip()  # .replace('ou', '').replace('x de', '').replace('sem juros', '')
        except AttributeError:  # catch only the "tag not found" case rather than a bare except
            preco_normal = 0
            preco_promocao_avista = 0
            preco_promocao_parcelado = 0
        try:
            num_avaliacoes = item.find('span', {'class': 'sc-irOPex eZnwGI'}).text.strip()
        except AttributeError:
            num_avaliacoes = 0
        # numero_avaliacoes = [item.get_text(strip=True) for item in links3.find('span', {'class': 'sc-irOPex eZnwGI'})]
        dados_dict = {
            'nome_produto': nome_produto,
            'link_produto': link_produto,
            'preco_normal': preco_normal,
            'preco_promocao_avista': preco_promocao_avista,
            'preco_promocao_parcelado': preco_promocao_parcelado,
            'num_avaliacoes': num_avaliacoes
        }
        lista_dados.append(dados_dict)
    return
The other problem is that the code doesn't extract the text element (the number of reviews) inside the following tag, even though both the find_all and find methods used above parse it correctly:
num_avaliacoes = item.find('span', {'class': 'sc-irOPex eZnwGI'}).text.strip()
# loop through the 17 pages of the website with iPhone data
for url2 in range(1, 18):  # range(1, 17) would stop at page 16
    url2 = f'https://www.magazineluiza.com.br/busca/iphone/?page={url2}'
    soup = extrator_html_selenium(url2)
    extrator_dados(soup)
print(len(lista_dados))
# create a DataFrame with the parsed data
df = pd.DataFrame(lista_dados)
df.to_csv('iphones_magalu_final.csv', index=False)
Thank you in advance!
Solution
What happens?
You are selecting the outer container that holds the items you would like to iterate over. In this case, find_all() gives you a list with exactly one element, and that is the reason why you only get the first item.
How to fix?
It is as simple as that: select all <li> in the <div> with data-testid="product-list" and it will work. Change from:
links = soup.find_all('section', {'style' : 'grid-area:content'})
to:
links = soup.select('div[data-testid="product-list"] li')
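To see the difference, here is a minimal, self-contained sketch. The HTML below is a simplified stand-in for the real page (the long styled-component class names are omitted), but the two selectors are the ones discussed above:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the result page: one outer <section>
# wrapping a product list with one <li> per product.
html = """
<section style="grid-area:content">
  <div data-testid="product-list">
    <ul>
      <li><h2>iPhone 11 64GB</h2></li>
      <li><h2>iPhone XR 64GB</h2></li>
      <li><h2>iPhone 12 128GB</h2></li>
    </ul>
  </div>
</section>
"""
soup = BeautifulSoup(html, 'html.parser')

# The original selector matches only the single outer container,
# so the for loop runs exactly once.
outer = soup.find_all('section', {'style': 'grid-area:content'})
print(len(outer))  # 1

# Selecting the <li> elements yields one entry per product.
items = soup.select('div[data-testid="product-list"] li')
print(len(items))  # 3
for li in items:
    print(li.find('h2').text)
```

Inside the loop, item.find(...) then searches within one product instead of within the whole page, so each product's name, price, and review count come out separately.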
Output
nome_produto link_produto preco_normal preco_promocao_avista preco_promocao_parcelado num_avaliacoes
0 iPhone 11 Apple 64GB Roxo 6,1” 12MP iOS https://www.magazineluiza.com.br/iphone-11-app... 5.699,00 3.999,00 ou 10x de 399,90 sem juros 1
1 iPhone 11 Apple 64GB Preto 6,1” 12MP iOS https://www.magazineluiza.com.br/iphone-11-app... 5.699,00 3.999,00 ou 10x de 399,90 sem juros 80
2 iPhone XR Apple 64GB Preto 6,1” 12MP iOS https://www.magazineluiza.com.br/iphone-xr-app... 4.999,00 3.399,00 ou 10x de 339,90 sem juros 88
Hint
The site uses bot detection, so be restrained when scraping ;)
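One way to stay restrained is to pause a random interval between page requests instead of hitting every page back to back. A minimal sketch (the helper name polite_fetch and the delay bounds are illustrative assumptions, not part of the original answer):

```python
import random
import time

def polite_fetch(fetch, urls, min_delay=2.0, max_delay=5.0):
    """Call fetch(url) for each URL, sleeping a random interval
    between requests to keep the request rate low.
    Illustrative helper; tune the delay bounds to the site."""
    pages = []
    for i, url in enumerate(urls):
        if i:  # no need to sleep before the very first request
            time.sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages
```

It could then be wired to the question's page loop, e.g. `polite_fetch(extrator_html_selenium, urls)` with the list of search-page URLs; randomized gaps look less mechanical than a fixed sleep(5).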
Answered By - HedgeHog