Issue
I am using BeautifulSoup in Python to extract data from https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page=3, and so far I can pull the title, publisher, and rating data without trouble.
The code that does this is:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import re

# Base URL with a placeholder for the page number
base_url = 'https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page={}'

all_data = {'Title': [], 'Owner': [], 'Ratings': [], 'Count of Rates': [], 'Page': []}

try:
    # Loop through the pages (adjust the range as needed)
    for page_num in range(1, 2):
        url = base_url.format(page_num)
        # Send a GET request to the URL
        response = requests.get(url)
        # Delay between requests to avoid hammering the server
        time.sleep(5)  # Adjust the delay as needed
        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Parse the HTML content using BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')
            # Find all elements with the specified classes
            elementsTitle = soup.find_all('span', attrs={"class": 'title'})
            elementsOwner = soup.find_all('span', attrs={"class": 'publisher'})
            elementsRatings = soup.find_all('label', attrs={"class": 'detailsRatingAvgNumOfStars'})
            elementsCountOfRates = soup.find_all('span', attrs={"class": 'detailsRatingNumOfRatingText'})
            # Extract text content from the elements and append to all_data
            for title, owner, rating, count_of_rates in zip(elementsTitle, elementsOwner, elementsRatings, elementsCountOfRates):
                all_data['Title'].append(title.text.strip())
                all_data['Owner'].append(owner.text.strip())
                all_data['Ratings'].append(rating.text.strip())
                all_data['Count of Rates'].append(int(re.search(r'\((\d+) ratings\)', count_of_rates.text).group(1)))
                all_data['Page'].append(page_num)
            print(f'Page {page_num} content processed')
        else:
            print(f'Failed to fetch page {page_num}:', response.status_code)
except Exception as e:
    print("An error occurred:", str(e))

# Convert all_data to a pandas DataFrame
df = pd.DataFrame(all_data)
# Export the DataFrame to an Excel file
df.to_excel('xyz.xlsx', index=False)
print('Data written to xyz.xlsx')
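(As an aside, the `re.search(...).group(1)` call in the loop above raises `AttributeError` for any tile whose rating text does not match the pattern. A small guarded helper, hypothetical and not part of the original code, avoids that:

```python
import re

def parse_rating_count(text):
    # Return the number inside "(N ratings)", or None when absent,
    # instead of raising AttributeError on a failed match.
    m = re.search(r"\((\d+) ratings?\)", text)
    return int(m.group(1)) if m else None

print(parse_rating_count("(273 ratings)"))   # 273
print(parse_rating_count("no ratings yet"))  # None
```
)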
My problem is that I want more data. When I inspected the full page in the browser, I found values labeled priceModel
that indicate whether a visual is free or paid. But whenever I try to pull that price data, nothing is returned. I tried selecting the price div by its class, and I tried matching the text inside its span, but both return nothing.
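(The likely reason the price lookup returns nothing is that the priceModel markup is injected client-side by JavaScript, so it never appears in the HTML that requests receives. A minimal sketch of the symptom, using made-up markup that mimics the server response:

```python
from bs4 import BeautifulSoup

# Simplified server response: the listing markup is present, but the
# price element is rendered later by JavaScript, so it is absent here.
server_html = """
<div class="tile">
  <span class="title">Sample Visual</span>
  <span class="publisher">Sample Publisher</span>
</div>
"""

soup = BeautifulSoup(server_html, "html.parser")
print(soup.find_all("span", attrs={"class": "title"}))      # one match
print(soup.find_all("div", attrs={"class": "priceModel"}))  # [] -> nothing to extract
```

Hence the answer below bypasses the HTML entirely and queries the data endpoint instead.)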
Solution
Here's how you can get the data without bs4.
There's an endpoint you can query that has all the data you need.
Here's a working example:
import pandas as pd
import requests
from tabulate import tabulate

endpoint = "https://appsource.microsoft.com/view/tiledata?"

payload = {
    "ReviewsMyCommentsFilter": "true",
    "country": "US",
    "entityType": "App",
    "page": "NONE",
    "product": "power-bi-visuals",
    "region": "ALL",
}

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0"
}

app_data = []
with requests.Session() as s:
    response = s.get(endpoint, headers=headers, params=payload)
    apps = response.json()["apps"]
    # 60 apps per page + 1 to account for the remainder
    total_pages = apps["count"] // 60 + 1
    for page in range(1, total_pages + 1):
        payload["page"] = page
        response = s.get(endpoint, headers=headers, params=payload)
        apps = response.json()["apps"]
        app_data += [
            {
                "name": app["title"].strip(),
                "owner": app["publisher"].strip(),
                "price": "Free" if not app["hasPrices"] else "Paid",
                "certified": "Yes" if app["tags"] else "No",
                "rating": app["AverageRating"],
                "ratingCount": app["NumberOfRatings"],
            }
            for app in apps["dataList"]
        ]

df = pd.DataFrame(app_data, columns=app_data[0].keys())
df = df.sort_values(by=["ratingCount"], ascending=False)
df.to_csv("powerbi_visuals.csv", index=False)
print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))
This should output all the data for all the pages.
Sample output:
+-------------------------------------------------------------------------+-----------------------------------------------------------------------+---------+-------------+----------+---------------+
| name | owner | price | certified | rating | ratingCount |
|-------------------------------------------------------------------------+-----------------------------------------------------------------------+---------+-------------+----------+---------------|
| HierarchySlicer | DataScenarios | Free | Yes | 3.89 | 273 |
| Chiclet Slicer | Microsoft Corporation | Free | Yes | 3.907 | 247 |
| Timeline Slicer | Microsoft Corporation | Free | Yes | 3.336 | 241 |
| Visio Visual | Microsoft Corporation | Free | Yes | 2.964 | 192 |
| Text Filter | Microsoft Corporation | Free | Yes | 4.124 | 177 |
| Gantt | Microsoft Corporation | Free | Yes | 3.13 | 161 |
| Word Cloud | Microsoft Corporation | Free | Yes | 4.229 | 140 |
| Power KPI Matrix | Microsoft Corporation | Free | Yes | 3.728 | 136 |
| Tachometer | Annik Inc | Free | Yes | 4.326 | 129 |
| Radar Chart | Microsoft Corporation | Free | Yes | 3.574 | 129 |
and many more...
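One small caveat in the pagination above: `count // 60 + 1` requests one extra (empty) page whenever the count is an exact multiple of 60. That is harmless here, but ceiling division avoids it entirely; a hypothetical helper to illustrate:

```python
def total_pages(count, per_page=60):
    # Ceiling division: equivalent to math.ceil(count / per_page)
    # without floating point.
    return -(-count // per_page)

print(total_pages(120))  # 2 (floor division + 1 would give 3)
print(total_pages(121))  # 3
print(total_pages(1))    # 1
```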
Answered By - baduker