Issue
I am scraping a website whose page contains window.__INITIAL_STATE__, which is assigned a huge JSON string. I am looking for the stock info ("This item is currently out of stock"), which looks like this in the JSON:
{
    "slotType": "WIDGET",
    "id": 11,
    "parentId": 10002,
    "layoutParams": {
        "margin": "0,24,0,0",
        "orientation": "",
        "widgetHeight": 150,
        "widgetWidth": 12
    },
    "dataId": "1230886539",
    "elementId": "11-AVAILABILITY",
    "hasWidgetDataChanged": true,
    "ttl": 3000,
    "widget": {
        "type": "AVAILABILITY",
        "viewType": "brand",
        "data": {
            "announcementComponent": {
                "action": null,
                "metaData": null,
                "tracking": null,
                "trackingData": null,
                "value": {
                    "type": "AnnouncementValue",
                    "subTitle": "This item is currently out of stock",
                    "title": "Sold Out"
                }
            }
        }
    }
},
I tried the following, but it does not work:
soup = BeautifulSoup(page.content, features="lxml")
print(soup.find(elementID='11-AVAILABILITY').get_text().strip())
Solution
To parse window.__INITIAL_STATE__ out of the HTML, you can use this example:
import re
import json
import requests
url = 'https://www.flipkart.com/sony-310ap-wired-headset/p/itm0527f8b27c68f'
html_data = requests.get(url).text
data = re.search(r'window\.__INITIAL_STATE__ = ({.*});', html_data).group(1)
data = json.loads(data)
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for w in data['pageDataV4']['page']['data']['10002']:
    if w.get("elementId") == "11-AVAILABILITY":
        print(json.dumps(w, indent=4))
        break
Prints:
{
    "slotType": "WIDGET",
    "id": 11,
    "parentId": 10002,
    "layoutParams": {
        "margin": "0,24,0,0",
        "orientation": "",
        "widgetHeight": 150,
        "widgetWidth": 12
    },
    "dataId": "1230886539",
    "elementId": "11-AVAILABILITY",
    "hasWidgetDataChanged": true,
    "ttl": 3000,
    "widget": {
        "type": "AVAILABILITY",
        "viewType": "brand",
        "data": {
            "announcementComponent": {
                "action": null,
                "metaData": null,
                "tracking": null,
                "trackingData": null,
                "value": {
                    "type": "AnnouncementValue",
                    "subTitle": "This item is currently out of stock",
                    "title": "Sold Out"
                }
            }
        }
    }
}
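Once the slot has been located, the stock message itself sits under widget, data, announcementComponent, value. The following is a minimal sketch of reading it, assuming the structure matches the printed output; the helper name stock_message and the trimmed sample dict are illustrative and not part of the original answer, and the site's layout may change at any time:

```python
def stock_message(widget_slot):
    """Return (title, subTitle) from an AVAILABILITY slot, or None if absent."""
    # Walk the nested dicts defensively: any level may be missing or null.
    node = widget_slot
    for key in ("widget", "data", "announcementComponent", "value"):
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    if not isinstance(node, dict):
        return None
    return node.get("title"), node.get("subTitle")


# Trimmed-down copy of the slot shown in the output (sample data for illustration).
sample = {
    "elementId": "11-AVAILABILITY",
    "widget": {
        "type": "AVAILABILITY",
        "data": {
            "announcementComponent": {
                "value": {
                    "type": "AnnouncementValue",
                    "subTitle": "This item is currently out of stock",
                    "title": "Sold Out",
                }
            }
        },
    },
}

print(stock_message(sample))
# → ('Sold Out', 'This item is currently out of stock')
```

In the loop from the answer, the same helper could be called on w in place of the json.dumps print. It returns None instead of raising when a key is missing, which may be preferable if the widget is laid out differently for items that are in stock.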
Answered By - Andrej Kesely