Issue
I am trying to scrape data from this webpage, and I can successfully extract the details I need.
The problem is that the page downloaded with requests
contains only 45 product details, while the live page actually has more than 4,000 products. This happens because the remaining data is not present in the initial HTML; it only appears as you scroll down the page.
I would like to scrape all products that are available on the page.
CODE
import requests
from bs4 import BeautifulSoup
import json
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
base_url = "link that i provided"
r = requests.get(base_url,headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
scripts = soup.find_all('script')[11].text
script = scripts.split('=', 1)[1]
script = script.rstrip()
script = script[:-1]
data = json.loads(script)
skus = list(data['grid']['entities'].keys())
prodpage = []
for sku in skus:
    prodpage.append('https://www.ajio.com{}'.format(data['grid']['entities'][sku]['url']))
print(len(prodpage))
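The split-and-strip step above (dropping the JavaScript variable name, then the trailing semicolon, before calling `json.loads`) can be illustrated on a toy example. The variable name and JSON payload here are made up; only the shape mirrors the real page's inline state object:

```python
import json

# Toy <script> body mimicking the page's inline state assignment
# (variable name and payload are hypothetical).
script_text = 'window.__PRELOADED_STATE__ = {"grid": {"entities": {"sku1": {"url": "/p/sku1"}}}};'

# Split once on '=' to drop the variable name, then strip the trailing ';'
payload = script_text.split('=', 1)[1].rstrip().rstrip(';')
data = json.loads(payload)

skus = list(data['grid']['entities'].keys())
print(skus)  # ['sku1']
```

This is the same technique the code uses, just without depending on `find_all('script')[11]`, which is fragile if the page adds or removes script tags.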
Solution
Data that appears only when you scroll down is generated by JavaScript, so you have two options here: the first is to use Selenium to drive a real browser and scroll the page; the second is to send the same Ajax request the website itself uses, as follows:
def get_source(page_num=1):
    url = 'https://www.ajio.com/api/category/830216001?fields=SITE&currentPage={}&pageSize=45&format=json&query=%3Arelevance%3Abrickpattern%3AWashed&sortBy=relevance&gridColumns=3&facets=brickpattern%3AWashed&advfilter=true'
    res = requests.get(url.format(page_num), headers={'User-Agent': 'Mozilla/5.0'})
    if res.status_code == 200:
        return res.json()
# data = get_source(page_num=1)
# total_pages = data['pagination']['totalPages']  # total pages are 111

prodpage = []
for i in range(1, 112):
    print(f'Getting page {i}')
    data = get_source(page_num=i)['products']
    for item in data:
        prodpage.append('https://www.ajio.com{}'.format(item['url']))
    if i == 3:
        break
print(len(prodpage))  # output 135 for 3 pages
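Rather than hard-coding 111 pages, the loop can read totalPages from the first response and stop on its own. The sketch below takes the fetch function as a parameter so it can be exercised without network access; the response shape (`pagination`/`totalPages`, `products`) matches the JSON the code above relies on, but `collect_product_urls` and `fake_fetch` are illustrative names, not part of the site's API:

```python
def collect_product_urls(fetch_page, base='https://www.ajio.com', max_pages=None):
    """Walk paginated JSON responses and collect absolute product URLs.

    fetch_page(page_num) must return a dict shaped like the site's API
    response: {'pagination': {'totalPages': N}, 'products': [{'url': ...}]}.
    """
    first = fetch_page(1)
    total = first['pagination']['totalPages']
    if max_pages is not None:
        total = min(total, max_pages)
    urls = ['{}{}'.format(base, p['url']) for p in first['products']]
    for page in range(2, total + 1):
        urls.extend('{}{}'.format(base, p['url'])
                    for p in fetch_page(page)['products'])
    return urls

# Offline demo with a stubbed fetcher (no network): 3 pages, 2 products each.
def fake_fetch(page_num):
    return {'pagination': {'totalPages': 3},
            'products': [{'url': '/p/{}-{}'.format(page_num, i)} for i in range(2)]}

urls = collect_product_urls(fake_fetch)
print(len(urls))  # 6
```

In real use you would pass `get_source` as `fetch_page`; note that `get_source` returns None on a non-200 status, so production code should handle that before indexing into the result.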
Answered By - Ahmed Soliman