Tuesday, November 1, 2022

[FIXED] Scraping More than rendered Data with Beautiful Soup

November 01, 2022 beautifulsoup, python, selenium

Issue

I'm scraping app names from the Google Play Store, and for each input URL I get only 60 apps (because the website renders only 60 apps until the user scrolls down). How does this work, and how can I scrape all the apps from a page using BeautifulSoup and/or Selenium?

Thank you

Here is my code:

from requests import get
from bs4 import BeautifulSoup

urls = ["https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid"]

# Open the output file once so that later URLs do not overwrite earlier results.
with open("./InputFiles/applications.txt", "w+") as file:
    for url in urls:
        response = get(url)
        html_soup = BeautifulSoup(response.text, 'html.parser')
        # Only the cards present in the initial HTML are found here (about 60).
        app_container = html_soup.find_all('div', class_="card no-rationale square-cover apps small")
        for card in app_container:
            file.write(card.div['data-docid'] + "\n")

with open('./InputFiles/applications.txt') as file:
    num_lines = sum(1 for line in file)
print("Applications : " + str(num_lines))

Solution

In this case you need to use Selenium. I tried it and got all the apps. I will try to explain how.

Selenium drives a real browser, so it can scroll the page and let the page's JavaScript load the remaining apps, which a plain HTTP request cannot do. I used ChromeDriver; if you have not installed it yet, you can get it from

http://chromedriver.chromium.org/

from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style: the driver path goes into a Service object and
# elements are located with find_elements(By.<strategy>, selector).
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=Service(executable_path=r'This part is your Driver path'),
                          options=options)
driver.get('https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid')

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # Scroll to the bottom of the page
sleep(5)  # Wait for the newly loaded apps to render; without this delay only the first 60 elements are found
x = driver.find_elements(By.CSS_SELECTOR, "div[class='card-content id-track-click id-track-impression']")  # One element per app card

for a in x:
    print(a.text)
driver.close()

OUTPUT :

1. Pocket Casts
Podcast Media LLC
₺24,99
2. Broadcastify Police Scanner Pro
RadioReference.com LLC
₺18,99
3. Relay for reddit (Pro)
DBrady
₺8,00
4. Sync for reddit (Pro)
Red Apps LTD
₺15,00
5. reddit is fun golden platinum (unofficial)
TalkLittle
₺9,99
... UP TO 75

Note:

Don't mind the prices. They are shown in my country's currency, so they will differ in yours.
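
The script above scrolls once and waits five seconds, which was enough to reach 75 apps here. On a longer list a single scroll may not load everything. A minimal sketch, assuming the page keeps appending items as you scroll: repeat the scroll until document.body.scrollHeight stops growing. The name scroll_until_loaded and its pause and max_rounds parameters are hypothetical and not part of Selenium; the only WebDriver method used is execute_script.

```python
from time import sleep


def scroll_until_loaded(driver, pause=2.0, max_rounds=30):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object that offers Selenium's `execute_script` method.
    `pause` and `max_rounds` are illustrative defaults, not tuned values.
    Returns the number of scrolls that were performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        sleep(pause)  # give the page time to fetch and render the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the page did not grow, so nothing more was loaded
        last_height = new_height
    return rounds


# Possible usage, in place of the single scroll and sleep(5):
# scroll_until_loaded(driver)
```

If the page loads further items only after a button click instead of on scroll, this loop stops early and the button would have to be clicked as well.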

UPDATE ACCORDING TO YOUR COMMENT:

The same data-docid is also available on a span tag. You can read it with get_attribute. Just add the code below to your project.

from selenium.webdriver.common.by import By

y = driver.find_elements(By.CSS_SELECTOR, "span[class=preview-overlay-container]")

for b in y:
    print(b.get_attribute('data-docid'))

OUTPUT

au.com.shiftyjelly.pocketcasts
com.radioreference.broadcastifyPro
reddit.news
com.laurencedawson.reddit_sync.pro
com.andrewshu.android.redditdonation
com.finazzi.distquakenoads
com.twitpane.premium
org.fivefilters.kindleit
.... UP TO 75
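
The question writes each data-docid to ./InputFiles/applications.txt and then counts the lines. One possible way to do the same with the IDs collected through Selenium is sketched below. save_doc_ids is a hypothetical helper, not part of Selenium or BeautifulSoup. It skips empty values because get_attribute returns None when an element has no such attribute, and it drops duplicates in case the same app appears twice.

```python
def save_doc_ids(doc_ids, path):
    """Write unique, non-empty IDs to `path`, one per line, and return the count."""
    seen = set()
    unique_ids = []
    for doc_id in doc_ids:
        if doc_id and doc_id not in seen:  # skip None, "" and duplicates
            seen.add(doc_id)
            unique_ids.append(doc_id)
    with open(path, "w", encoding="utf-8") as file:
        for doc_id in unique_ids:
            file.write(doc_id + "\n")
    return len(unique_ids)


# Possible usage with the span elements `y` found above:
# count = save_doc_ids((b.get_attribute("data-docid") for b in y),
#                      "./InputFiles/applications.txt")
# print("Applications : " + str(count))
```

Returning the count avoids reopening the file just to count its lines.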

Answered By - Omer Tekbiyik

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
