Wednesday, April 27, 2022

[FIXED] Scraping multiple pages in Steam with BeautifulSoup


Issue

My goal is to scrape information about Action games on Steam, such as each game's name, tags, and price, using the requests and BeautifulSoup libraries. URL: https://store.steampowered.com/tags/en/Action/#p=0&tab=ConcurrentUsers

I got it working for the first page and then tried to scrape 15 pages. My plan was to replace "/Action/#p=0" with "/Action/#p=1" in the URL and send a GET request, expecting the HTML response to contain the games from the next page. This did not work: even with "#p=15" I get the HTML for the first page. I then inspected the pagination elements (1, 2, 3, 4, ...), but they do not contain any links. Next, I looked in the browser's "Inspect > Network" tab for a request that returns the next page's content, and I found one; on inspection, its response did contain the games from the next page. URL for the second page: https://store.steampowered.com/contenthub/querypaginated/tags/ConcurrentUsers/render/?query=&start=15&count=15&cc=BG&l=english&v=4&tag=Action&tagid=19

Here the page index (the "#p=" value) is the "start" value divided by 15, so start=15 is the second page. Unfortunately, the content is unusable because the tag hierarchy is broken. For example:

           <span class="top_tag">
            FPS
           </span>
           <span class="top_tag">
            , Shooter
           </span>

comes back as:

       <span class='\"top_tag\"'>
        FPS&lt;\/span&gt;
        <span class='\"top_tag\"'>
         , Shooter&lt;\/span&gt;

The second span is parsed as a child of the first, when it should be its sibling. Both examples were printed with BeautifulSoup's prettify() method using UTF-8 encoding.
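The stray backslashes and the unclosed spans are what one would expect if the body is JSON with the HTML stored inside a string and it is handed to an HTML parser without decoding. A minimal sketch of that effect, using a hand-written sample string (an assumption about the response shape, not captured Steam output) and only the standard library:

```python
import json

# Hand-written sample imitating the shape of the response body (an assumption,
# not real Steam output): HTML stored as a JSON string, so the quotes and
# slashes inside it arrive backslash-escaped.
raw_body = r'{"results_html": "<span class=\"top_tag\">FPS<\/span><span class=\"top_tag\">, Shooter<\/span>"}'

# In the raw text the closing tags read "<\/span>", which an HTML parser does
# not recognise as closing tags, so the spans appear to nest.
# Decoding the JSON first removes the escapes and restores proper closing tags.
decoded_html = json.loads(raw_body)["results_html"]

print(decoded_html)
```

After decoding, the two spans are siblings again and the string can be passed to an HTML parser as usual.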

Is there a better way to do this? I am aware I could do it with regex or Selenium, but I wonder whether this task can be done with just BeautifulSoup and requests.

Solution

The server responds with JSON, so use the response's .json() method to parse it, then pass the HTML stored under the "results_html" key to BeautifulSoup. For example:

import requests
from bs4 import BeautifulSoup

url = "https://store.steampowered.com/contenthub/querypaginated/tags/ConcurrentUsers/render/"

params = {
    "query": "",
    "start": 0,
    "count": 15,
    "cc": "US",
    "l": "english",
    "v": "4",
    "tag": "Action",
    "tagid": "19",
}


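for page in range(5):  # <-- increase number of pages here
    params["start"] = 15 * page  # each page holds 15 results
    # The body is JSON; the HTML fragment is stored under "results_html"
    data = requests.get(url, params=params).json()
    soup = BeautifulSoup(data["results_html"], "html.parser")
    # Each game is one ".tab_item_content" element
    for item in soup.select(".tab_item_content"):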
        print(
            "{:<40} {}".format(
                item.select_one(".tab_item_name").text,
                item.select_one(".tab_item_top_tags").text,
            )
        )

Prints:

Counter-Strike: Global Offensive         FPS, Shooter, Multiplayer, Competitive
Grand Theft Auto V                       Open World, Action, Multiplayer, Automobile Sim
Lost Ark                                 MMORPG, Free to Play, Action RPG, Hack and Slash
Apex Legends™                            Free to Play, Battle Royale, Multiplayer, Shooter
PUBG: BATTLEGROUNDS                      Survival, Shooter, Multiplayer, Battle Royale
Dota 2                                   Free to Play, MOBA, Multiplayer, Strategy
ELDEN RING                               Souls-like, Relaxing, Dark Fantasy, RPG
Tom Clancy's Rainbow Six® Siege          FPS, Hero Shooter, Multiplayer, Tactical
Vampire Survivors                        Action Roguelike, Pixel Graphics, Bullet Hell, Casual
NARAKA: BLADEPOINT                       Battle Royale, Sexual Content, Multiplayer, Martial Arts
Warframe                                 Free to Play, Action RPG, RPG, Action
Destiny 2                                Free to Play, Open World, Looter Shooter, FPS
Wallpaper Engine                         Mature, Utilities, Software, Anime
Rust                                     Survival, Crafting, Multiplayer, Open World
Dead by Daylight                         Horror, Survival Horror, Multiplayer, Online Co-Op
Brawlhalla                               Free to Play, Multiplayer, Fighting, Casual
Dread Hunger                             Multiplayer, Survival, Online Co-Op, Social Deduction
Stumble Guys                             Action, Casual, 3D, 3D Platformer
ARK: Survival Evolved                    Open World Survival Craft, Survival, Open World, Multiplayer
LEGO® Star Wars™: The Skywalker Saga     LEGO, Adventure, Open World, Multiplayer

...and so on.
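The relationship between page index and the "start" parameter used in the loop above can be written out as a pair of small helpers. This is only a sketch: the names start_offset and page_number are hypothetical, and the page size of 15 is taken from the "count" value in the request, not from any documented Steam limit.

```python
def start_offset(page: int, page_size: int = 15) -> int:
    """Return the `start` value for a zero-based page index."""
    if page < 0:
        raise ValueError("page must be non-negative")
    return page * page_size


def page_number(start: int, page_size: int = 15) -> int:
    """Return the one-based page number that a `start` value corresponds to."""
    return start // page_size + 1


# Offsets for the first five pages, matching the loop in the answer.
offsets = [start_offset(page) for page in range(5)]
print(offsets)  # [0, 15, 30, 45, 60]
```

With these helpers, the URL from the question (start=15) maps to page_number(15) == 2, which agrees with it being the second page.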

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
