Saturday, November 20, 2021

[FIXED] How to extract content from tags with Beautiful soup

Labels: beautifulsoup, html, python, web-scraping

Issue

I have been practicing web scraping with Beautiful Soup, but every time I switch to a different website, the tag structure is so different that it confuses me. This time I am trying to scrape the Amazon best sellers page (https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1) for the ranking, name, rating, and number of reviews of each item.

My idea is to find the "main" tag for each item and then dig into the child tags that hold the information I want. So I used .select() and started with the li class. But when I add more tags after "span.a-list-item", I get an empty result with the following code:

container = page.select('li.zg-item-immersion > span.a-list-item > div.a-section a-spacing-none aok-relative' )

Is there a limit on how many tags I can chain in .select(), or am I doing something wrong?

So I stopped at "span.a-list-item" and tried the following approach, but I don't understand why my code sometimes returns an empty result and sometimes returns what I want. I guess this has something to do with the connection to the website?

from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1"
page = BeautifulSoup(requests.get(url).content,'lxml')    
containers = page.select('li.zg-item-immersion > span.a-list-item')
ranking = (containers[1].find("span",class_="zg-badge-text").text)[1:]

The last line gets the ranking number successfully, but when I try to append the rankings to a list in a loop,

for item in range(50):
   ranking.append((containers[item].find("span",class_="zg-badge-text").text)[1:])

I keep getting a "list index out of range" error, and I don't understand why the index is out of range, since there are 50 items on a single page.

Last but not least, could I please get some advice on learning to scrape different websites? I have also read the Beautiful Soup documentation and followed its instructions on using the different functions to get to a specific tag, but I am still not getting what I want.

Solution

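The loop fails because the response does not always contain all 50 items. Without a browser-like User-Agent header, the site likely does not return the full listing, so containers can hold fewer elements than expected (or none), and indexing it with range(50) runs past the end of the list. Send a User-Agent header with the request, and iterate over the containers directly instead of over a fixed range.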

Code:

from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent header with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
url = "https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1"
r = requests.get(url, headers=headers)
page = BeautifulSoup(r.content, 'lxml')

# Iterate over the items that were actually found instead of a fixed range(50).
containers = page.select('li.zg-item-immersion > span.a-list-item')
for container in containers:
    ranking = container.find("span", class_="zg-badge-text").text
    print(ranking)

Output:

#1
#2 
#3 
#4 
#5 
#6 
#7 
#8 
#9 
#10
#11
#12
#13
#14
#15
#16
#17
#18
#19
#20
#21
#22
#23
#24
#25
#26
#27
#28
#29
#30
#31
#32
#33
#34
#35
#36
#37
#38
#39
#40
#41
#42
#43
#44
#45
#46
#47
#48
#49
#50
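
The "list index out of range" error from the question can be reproduced without touching the network. The sketch below is only an illustration, not the answerer's code: it parses a small hand-written HTML snippet (an assumption standing in for the real page, reusing the class names from the question) in which one item has no ranking badge. The helper name extract_rankings is hypothetical, and html.parser is used so that lxml is not required. Because the loop walks over whatever .select() returned and skips items without a badge, it cannot run past the end of the list.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page (assumption): two items,
# the second of which has no ranking badge.
SAMPLE_HTML = """
<ol>
  <li class="zg-item-immersion"><span class="a-list-item">
    <span class="zg-badge-text">#1</span></span></li>
  <li class="zg-item-immersion"><span class="a-list-item">
    <span class="other">no badge here</span></span></li>
</ol>
"""


def extract_rankings(html):
    """Return ranking numbers as strings, skipping items without a badge."""
    page = BeautifulSoup(html, "html.parser")
    containers = page.select("li.zg-item-immersion > span.a-list-item")
    rankings = []
    for container in containers:  # no fixed range, so no IndexError
        badge = container.find("span", class_="zg-badge-text")
        if badge is None:  # find() returns None when nothing matches
            continue
        # Drop the leading "#" so "#1" becomes "1".
        rankings.append(badge.get_text(strip=True).lstrip("#"))
    return rankings


rankings = extract_rankings(SAMPLE_HTML)
print(rankings)
```

If the site returns a page without any of the expected items, the same function simply returns an empty list, which is easier to detect and handle than an exception raised in the middle of a loop.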

Answered By - Fazlul
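
On the .select() question raised in the issue: there is no limit on how many tags a selector can chain. In a CSS selector a space is a descendant combinator, so "div.a-section a-spacing-none aok-relative" asks for an aok-relative tag inside an a-spacing-none tag inside the div, which matches nothing. Several classes on the same element are chained with dots instead. The following sketch shows the difference on a hand-written snippet (an assumption modeled on the class names in the question, not the real Amazon markup).

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one list item (assumption).
SAMPLE_HTML = """
<li class="zg-item-immersion"><span class="a-list-item">
  <div class="a-section a-spacing-none aok-relative">App name</div>
</span></li>
"""

page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector from the question: the spaces turn a-spacing-none and
# aok-relative into tag names that must be nested inside the div.
broken = page.select(
    "li.zg-item-immersion > span.a-list-item > "
    "div.a-section a-spacing-none aok-relative"
)

# Chaining the classes with dots matches one div carrying all three classes.
fixed = page.select(
    "li.zg-item-immersion > span.a-list-item > "
    "div.a-section.a-spacing-none.aok-relative"
)

print(len(broken))  # 0
print(len(fixed))   # 1
print(fixed[0].get_text(strip=True))
```

Usually one distinctive class is enough, so a shorter selector such as "div.aok-relative" would also work here and is less likely to break when the page's styling classes change.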

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
