Issue
I am using BeautifulSoup in Python to extract data from https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page=3, and so far I can pull the title, publisher, and rating data without trouble.
The code that does this is:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import re

# Base URL with a placeholder for the page number
base_url = 'https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page={}'

all_data = {'Title': [], 'Owner': [], 'Ratings': [], 'Count of Rates': [], 'Page': []}

try:
    # Loop through the pages (adjust the range as needed)
    for page_num in range(1, 2):
        url = base_url.format(page_num)
        # Send a GET request to the URL
        response = requests.get(url)
        # Delay between requests to avoid hammering the server
        time.sleep(5)  # Adjust the delay as needed
        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Parse the HTML content using BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')
            # Find all elements with the specified classes
            elementsTitle = soup.find_all('span', attrs={"class": 'title'})
            elementsOwner = soup.find_all('span', attrs={"class": 'publisher'})
            elementsRatings = soup.find_all('label', attrs={"class": 'detailsRatingAvgNumOfStars'})
            elementsCountOfRates = soup.find_all('span', attrs={"class": 'detailsRatingNumOfRatingText'})
            # Extract text content from the elements and append to all_data
            for title, owner, rating, count_of_rates in zip(elementsTitle, elementsOwner, elementsRatings, elementsCountOfRates):
                all_data['Title'].append(title.text.strip())
                all_data['Owner'].append(owner.text.strip())
                all_data['Ratings'].append(rating.text.strip())
                all_data['Count of Rates'].append(int(re.search(r'\((\d+) ratings\)', count_of_rates.text).group(1)))
                all_data['Page'].append(page_num)
            print(f'Page {page_num} content processed')
        else:
            print(f'Failed to fetch page {page_num}:', response.status_code)
except Exception as e:
    print("An error occurred:", str(e))

# Convert all_data to a pandas DataFrame
df = pd.DataFrame(all_data)
# Export the DataFrame to an Excel file
df.to_excel('xyz.xlsx', index=False)
print('Data written to xyz.xlsx')
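(As an aside, the `re.search(...).group(1)` call in the loop above raises `AttributeError` for any tile whose rating text does not match the pattern. A small guarded helper, hypothetical and not part of the original code, avoids that:

```python
import re

def parse_rating_count(text):
    # Return the number inside "(N ratings)", or None when absent,
    # instead of raising AttributeError on a failed match.
    m = re.search(r"\((\d+) ratings?\)", text)
    return int(m.group(1)) if m else None

print(parse_rating_count("(273 ratings)"))   # 273
print(parse_rating_count("no ratings yet"))  # None
```
)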
My problem is that I want more data. When I inspected the full page in the browser, I found values labeled priceModel
that indicate whether a visual is free or paid. But whenever I try to pull that price data, nothing is returned. I tried selecting the price div by its class, and I tried matching the text inside its span, but both return nothing.
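(The likely reason the price lookup returns nothing is that the priceModel markup is injected client-side by JavaScript, so it never appears in the HTML that requests receives. A minimal sketch of the symptom, using made-up markup that mimics the server response:

```python
from bs4 import BeautifulSoup

# Simplified server response: the listing markup is present, but the
# price element is rendered later by JavaScript, so it is absent here.
server_html = """
<div class="tile">
  <span class="title">Sample Visual</span>
  <span class="publisher">Sample Publisher</span>
</div>
"""

soup = BeautifulSoup(server_html, "html.parser")
print(soup.find_all("span", attrs={"class": "title"}))      # one match
print(soup.find_all("div", attrs={"class": "priceModel"}))  # [] -> nothing to extract
```

Hence the answer below bypasses the HTML entirely and queries the data endpoint instead.)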
Solution
Here's how you can get the data without bs4.
There's an endpoint you can query that has all the data you need.
Here's a working example:
import pandas as pd
import requests
from tabulate import tabulate

endpoint = "https://appsource.microsoft.com/view/tiledata?"

payload = {
    "ReviewsMyCommentsFilter": "true",
    "country": "US",
    "entityType": "App",
    "page": "NONE",
    "product": "power-bi-visuals",
    "region": "ALL",
}

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0"
}

app_data = []
with requests.Session() as s:
    response = s.get(endpoint, headers=headers, params=payload)
    apps = response.json()["apps"]
    # 60 apps per page + 1 to account for the remainder
    total_pages = apps["count"] // 60 + 1
    for page in range(1, total_pages + 1):
        payload["page"] = page
        response = s.get(endpoint, headers=headers, params=payload)
        apps = response.json()["apps"]
        app_data += [
            {
                "name": app["title"].strip(),
                "owner": app["publisher"].strip(),
                "price": "Free" if not app["hasPrices"] else "Paid",
                "certified": "Yes" if app["tags"] else "No",
                "rating": app["AverageRating"],
                "ratingCount": app["NumberOfRatings"],
            }
            for app in apps["dataList"]
        ]

df = pd.DataFrame(app_data, columns=app_data[0].keys())
df = df.sort_values(by=["ratingCount"], ascending=False)
df.to_csv("powerbi_visuals.csv", index=False)
print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))
This should output all the data for all the pages.
Sample output:
+-------------------------------------------------------------------------+-----------------------------------------------------------------------+---------+-------------+----------+---------------+
| name | owner | price | certified | rating | ratingCount |
|-------------------------------------------------------------------------+-----------------------------------------------------------------------+---------+-------------+----------+---------------|
| HierarchySlicer | DataScenarios | Free | Yes | 3.89 | 273 |
| Chiclet Slicer | Microsoft Corporation | Free | Yes | 3.907 | 247 |
| Timeline Slicer | Microsoft Corporation | Free | Yes | 3.336 | 241 |
| Visio Visual | Microsoft Corporation | Free | Yes | 2.964 | 192 |
| Text Filter | Microsoft Corporation | Free | Yes | 4.124 | 177 |
| Gantt | Microsoft Corporation | Free | Yes | 3.13 | 161 |
| Word Cloud | Microsoft Corporation | Free | Yes | 4.229 | 140 |
| Power KPI Matrix | Microsoft Corporation | Free | Yes | 3.728 | 136 |
| Tachometer | Annik Inc | Free | Yes | 4.326 | 129 |
| Radar Chart | Microsoft Corporation | Free | Yes | 3.574 | 129 |
and many more...
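One small caveat in the pagination above: `count // 60 + 1` requests one extra (empty) page whenever the count is an exact multiple of 60. That is harmless here, but ceiling division avoids it entirely; a hypothetical helper to illustrate:

```python
def total_pages(count, per_page=60):
    # Ceiling division: equivalent to math.ceil(count / per_page)
    # without floating point.
    return -(-count // per_page)

print(total_pages(120))  # 2 (floor division + 1 would give 3)
print(total_pages(121))  # 3
print(total_pages(1))    # 1
```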
Answered By - baduker