Issue
I'm scraping app names from the Google Play Store, and for each input URL I get only 60 apps (because the site renders only 60 apps until the user scrolls down). How does this work, and how can I scrape all the apps from a page using BeautifulSoup and/or Selenium?
Thank you
Here is my code:
from requests import get
from bs4 import BeautifulSoup

urls = []
urls.extend(["https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid"])

for i in urls:
    response = get(i)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    app_container = html_soup.find_all('div', class_="card no-rationale square-cover apps small")
    file = open("./InputFiles/applications.txt", "w+")
    for i in range(0, len(app_container)):
        # print(app_container[i].div['data-docid'])
        file.write(app_container[i].div['data-docid'] + "\n")
    file.close()

num_lines = sum(1 for line in open('./InputFiles/applications.txt'))
print("Applications : " + str(num_lines))
Solution
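In this case you need to use Selenium. I tried it for you and got all the apps. I will try to explain; I hope it is clear.
Selenium is more powerful here than fetching the page with plain Python functions, because it drives a real browser that runs the page's scripts, so the apps that only appear after scrolling can be loaded. I used ChromeDriver, so if you have not installed it yet, you will need to install it first.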
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
# Selenium 4 style; the original used chrome_options=, executable_path= and
# find_elements_by_css_selector, which newer Selenium releases no longer accept
driver = webdriver.Chrome(service=Service(r'This part is your Driver path'), options=options)
driver.get('https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid')
# Scroll to the bottom of the page through the driver
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Give the page time to load after the scroll; without the delay we would read only the first 60 elements
sleep(5)
# Select the elements by class
x = driver.find_elements(By.CSS_SELECTOR, "div[class='card-content id-track-click id-track-impression']")
for a in x:
    print(a.text)
driver.close()
Output:
1. Pocket Casts
Podcast Media LLC
₺24,99
2. Broadcastify Police Scanner Pro
RadioReference.com LLC
₺18,99
3. Relay for reddit (Pro)
DBrady
₺8,00
4. Sync for reddit (Pro)
Red Apps LTD
₺15,00
5. reddit is fun golden platinum (unofficial)
TalkLittle
₺9,99
... (up to 75)
Note:
Don't mind the prices. They are shown in my country's currency, so they will look different in yours.
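The code above scrolls once, which loads one more batch of apps. If the page keeps loading more items on each scroll, the same idea can be repeated until the page height stops growing. The following is only a sketch under that assumption: `scroll_to_end`, `pause` and `max_rounds` are hypothetical names, not part of Selenium, and the function only relies on the driver's `execute_script` method.

```python
import time


def scroll_to_end(driver, pause=2.0, max_rounds=30):
    """Scroll down until the page height stops growing.

    Hypothetical helper: `driver` is any object with Selenium's
    `execute_script` method. Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom so the page can request more items
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Assumed delay; the page needs some time to append new content
        time.sleep(pause)
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was added, so we assume the end of the list
            break
        last_height = new_height
    return rounds
```

It would be called as `scroll_to_end(driver)` in place of the single `execute_script` and `sleep(5)` pair, before collecting the elements. The `max_rounds` cap is there so the loop still ends on a page that never stops growing.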
Update according to your comment:
The same data-docid is also available on a span tag. You can read it with get_attribute. Just add the code below to your project.
from selenium.webdriver.common.by import By

y = driver.find_elements(By.CSS_SELECTOR, "span[class=preview-overlay-container]")
for b in y:
    print(b.get_attribute('data-docid'))
Output:
au.com.shiftyjelly.pocketcasts
com.radioreference.broadcastifyPro
reddit.news
com.laurencedawson.reddit_sync.pro
com.andrewshu.android.redditdonation
com.finazzi.distquakenoads
com.twitpane.premium
org.fivefilters.kindleit
... (up to 75)
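Since the question writes the ids to a text file, here is a minimal sketch of that last step. `save_doc_ids` is a hypothetical helper, and dropping empty values and duplicates is my own assumption about what is wanted, not something the original code did.

```python
def save_doc_ids(doc_ids, path):
    """Write one id per line to `path` and return how many were written.

    Hypothetical helper: skips empty values (get_attribute can return None)
    and repeated ids, keeping the first occurrence in its original order.
    """
    seen = set()
    unique_ids = []
    for doc_id in doc_ids:
        if doc_id and doc_id not in seen:
            seen.add(doc_id)
            unique_ids.append(doc_id)
    with open(path, "w") as f:
        for doc_id in unique_ids:
            f.write(doc_id + "\n")
    return len(unique_ids)
```

With the elements collected above, it could be used as `save_doc_ids([b.get_attribute('data-docid') for b in y], "./InputFiles/applications.txt")`, and the return value replaces the separate line count.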
Answered By - Omer Tekbiyik