Issue
I'm trying to use Python 3's requests.get
to retrieve data from this page via its API. I want to retrieve the data found there and save the entire table into my own JSON file.
Here's my attempt so far:

import json
import requests

source = requests.get("https://www.mwebexplorer.com/api/mwebblocks").json()

with open('mweb.json', 'w') as json_file:
    json.dump(source, json_file)
I've looked through other questions about pagination, and in all of them a for loop can iterate through the pages; in my specific case, however, the link does not change when clicking Next to go to the next page of data. I also can't use Scrapy's XPath method to click Next, because the table and its pagination aren't accessible through the HTML or XML.
Is there something I can add to my requests.get call to retrieve the entire JSON for all pages of the table?
Solution
Depending on what browser you're using it might be different, but in Chrome I can go to the Network tab in DevTools and view the full details of the request. This reveals that it's actually a POST request, not a GET request. If you look at the payload, you can see a bunch of key-value pairs, including a start and a length.
So, try something like:
requests.post("https://www.mwebexplorer.com/api/mwebblocks", data={"start": "50", "length": "50"})
or similar. You might need to include the other parts of the form data, depending on the response you get.
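To collect every page, you can step the start offset by length until you've seen the whole table. Here's a minimal sketch; the response field names "data" and "recordsTotal" are assumptions (common in DataTables-style APIs) and may not match what this site actually returns, so check the real payload in DevTools first.

import json

def fetch_all_pages(post_page, page_size=50):
    # Walk the paginated API by incrementing the 'start' offset.
    # post_page(start, length) must return the decoded JSON for one page.
    rows = []
    start = 0
    while True:
        page = post_page(start, page_size)
        rows.extend(page["data"])          # assumed field holding the rows
        start += page_size
        if start >= page["recordsTotal"]:  # assumed field with the total count
            break
    return rows

def post_page(start, length):
    # requests is imported here so the paging logic above is testable offline.
    import requests
    resp = requests.post(
        "https://www.mwebexplorer.com/api/mwebblocks",
        # Hypothetical form data; the real request may need the other
        # key-value pairs visible in the DevTools payload.
        data={"start": str(start), "length": str(length)},
    )
    resp.raise_for_status()
    return resp.json()

Once that works against the real response, dumping the combined result is the same as in your attempt: json.dump(fetch_all_pages(post_page), json_file).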
Keep in mind that sites frequently don't like it when you try to scrape them like this.
Answered By - The Guy with The Hat