Issue
I am trying to get the links from all the pages on https://apexranked.com/. I tried using
url = 'https://apexranked.com/'
page = 1
while page != 121:
    url = f'https://apexranked.com/?page={page}'
    print(url)
    page = page + 1
However, when you click the page numbers, the URL doesn't change to something like https://apexranked.com/?page=number, the way it does on https://www.mlb.com/stats/?page=2. How would I go about accessing and getting the links from all pages if the site doesn't append ?page=number to the URL?
Solution
The page is not reloading when you click on page 2. Instead, it is firing a GET request to the website's backend.
The request is being sent to: https://apexranked.com/wp-admin/admin-ajax.php
Several parameters are passed as a query string appended to that URL:
?action=get_player_data&page=3&total_pages=195&_=1657230896643
Parameters:
- action: the endpoint can serve several purposes, so you must indicate which action to perform. Almost certainly a mandatory parameter; don't omit it.
- page: indicates the requested page (i.e. the index you're iterating over).
- total_pages: indicates the total number of pages (it may be possible to omit this; otherwise you can scrape it from the main page).
- _: a Unix timestamp in milliseconds, most likely a cache-buster. Same idea as above: try omitting it and see what happens. Otherwise you can generate one easily with time.time(), as shown below.
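For reference, a value in the same format as the _ parameter can be reproduced like this (a minimal sketch; any recent millisecond timestamp should do for a cache-buster):

import time

# Millisecond Unix timestamp, matching the format of the "_" parameter
cache_buster = round(time.time() * 1000)
print(cache_buster)  # e.g. 1657230896643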
Once you get a response, it yields rendered HTML. You could also try setting an Accept: application/json field in the request headers to get JSON back, but that's just a detail.
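For example (a sketch; whether this endpoint actually honors content negotiation is an assumption, not something verified here):

headers = {
    # Assumption: the backend may return JSON when asked; expect HTML otherwise
    "Accept": "application/json",
}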
All this information wrapped up:
import requests
import time

# The AJAX endpoint the site queries in the background
url = 'https://apexranked.com/wp-admin/admin-ajax.php'

# Issued from a previous scraping of the main page
total_pages = 195

params = {
    "action": "get_player_data",
    "total_pages": total_pages,
    "_": round(time.time() * 1000),
}

# Make sure to include all mandatory fields; add entries such as a
# User-Agent here if the server rejects bare requests
headers = {}

for k in range(1, total_pages + 1):
    params['page'] = k
    res = requests.get(url, headers=headers, params=params)
    # Do your thing :)
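Since the original goal was collecting links, here is one possible way to pull them out of each response (a sketch under assumptions the answer doesn't confirm: that the endpoint returns HTML fragments containing anchor tags, and that BeautifulSoup is available):

import requests
import time
from bs4 import BeautifulSoup

url = 'https://apexranked.com/wp-admin/admin-ajax.php'
total_pages = 195
params = {
    "action": "get_player_data",
    "total_pages": total_pages,
    "_": round(time.time() * 1000),
}

links = []
for k in range(1, total_pages + 1):
    params['page'] = k
    res = requests.get(url, params=params)
    # Assumption: the response body is an HTML fragment with <a href="..."> tags
    soup = BeautifulSoup(res.text, 'html.parser')
    links.extend(a['href'] for a in soup.find_all('a', href=True))

print(len(links), 'links collected')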
Answered By - Bil11