Issue
I am trying to extract data from this website. It is difficult to scrape because the URL does not change after a search.
I want to search by PUBLISHER IPI '00144443097' and extract all the data they have inside class="items-container".
My code
import urllib.request
from bs4 import BeautifulSoup

quote_page = 'https://portal.themlc.com/search'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
# Look for the section that should hold the search results
name_box = soup.find('section', attrs={'class': 'items-container'})
name = name_box.text
print(name)
Because the URL doesn't change after the search, this code doesn't give me any value.
After extracting the values, I want to sort them in pandas.
Solution
When the URL doesn't change, you can use the browser's developer tools to see whether an API is being called. In this case there are two APIs: one gives basic information about the writer and the other gives information on the works. You can parse the JSON response however you wish from here.
Note: this is a POST, not a GET.
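To make the POST-versus-GET point concrete, here is a minimal sketch that builds (but does not send) one of these requests, so you can inspect the method, URL, and JSON body. The helper name build_search_request is hypothetical; the endpoint path and the page/limit query parameters are simply the ones visible in the requests shown in this answer, and nothing here is confirmed by any official API documentation.

```python
import json

import requests

# Base path as seen in developer tools; treat it as an observation, not a documented API
BASE_URL = "https://api.ptl.themlc.com/api/search"


def build_search_request(kind, payload, page=1, limit=10):
    """Prepare a POST request for a search endpoint without sending it.

    `kind` is the last path segment (for example "writer", "work" or "publisher").
    """
    request = requests.Request(
        "POST",
        f"{BASE_URL}/{kind}",
        params={"page": page, "limit": limit},  # query string: ?page=1&limit=10
        json=payload,                           # search criteria go in the JSON body
    )
    return request.prepare()


prepared = build_search_request("publisher", {"publisherIpi": "00144443097"})
print(prepared.method, prepared.url)
print(json.loads(prepared.body))
```

The prepared request could then be sent with requests.Session().send(prepared), which should be equivalent to calling requests.post(url, json=payload) directly.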
import requests

# Basic information about the writer
url = 'https://api.ptl.themlc.com/api/search/writer?page=1&limit=10'
payload = {'writerIpi': "00144443097"}
requests.post(url, json=payload).json()

# Works for that writer IPI
url = 'https://api.ptl.themlc.com/api/search/work?page=1&limit=10'
payload = {'writerIpi': "00144443097"}
requests.post(url, json=payload).json()

# Publisher search by publisher IPI
url = 'https://api.ptl.themlc.com/api/search/publisher?page=1&limit=10'
payload = {"publisherIpi": "00144443097"}
requests.post(url, json=payload).json()

# This request gets the 161 works for the publisherIpId you want. It's convoluted, and you may
# be able to automate it, but I used developer tools to find the right publisherIpId.
url = 'https://api.ptl.themlc.com/api/search/work?page=1&limit=10'
payload = {'publisherIpId': "7305902"}
requests.post(url, json=payload).json()
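Since the goal is to sort the results in pandas, here is a rough sketch of that last step. It assumes you have already pulled a list of record dictionaries out of the JSON response; the exact shape of the response is not shown here, so the sample records and their field names (title, writers) are made up for illustration, and works_to_frame is a hypothetical helper name.

```python
import pandas as pd


def works_to_frame(records, sort_by):
    """Flatten a list of JSON-like dicts into a DataFrame and sort it by one column."""
    frame = pd.json_normalize(records)  # nested dicts become dotted column names
    return frame.sort_values(sort_by, ignore_index=True)


# Hypothetical records standing in for the parsed API response; real field names may differ
sample = [
    {"title": "Song B", "writers": 2},
    {"title": "Song A", "writers": 1},
]

frame = works_to_frame(sample, "title")
print(frame)
```

With the real data you would pass whichever part of the parsed JSON holds the list of works, and choose the sort column after inspecting frame.columns.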
Answered By - Jonathan Leon