Issue
I need to scrape a webpage the link to which is here In this webpage there is a Cross Reference section that I want to scrape But when I use the python requests to collect the content of the page by below code:
url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
The resultant content does not have that cross reference part maybe bcz its not loaded.I can scrape the rest of the html content but not the cross reference part. Now when I did the same thing with selenium it worked fine which means selenium is able to find this element after its loaded. Can Someone guide me how should I be able to get this done using python requests and beautifulsoup instead of selenium?
Solution
The data is loaded through Javascript, but you can extract it with requests
, BeautifulSoup
and json
module:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
t = soup.select_one('#arrow-state').text
t = t.replace('&q;', '"').replace('&g;', ">").replace('&l;', "<").replace('&a;', "&")
data = json.loads( t )
d = None
for item in data['jss']['sitecore']['route']['placeholders']['arrow-main']:
if item['componentName'] == 'PdpWrapper':
d = item
break
if d:
cross_reverence_product_tiles = d['placeholders']['product-details'][0]['fields']['crossReferenceProductTilesCollection']['crossReverenceProductTiles']['productTiles']
print(json.dumps(cross_reverence_product_tiles, indent=4))
Prints:
[
{
"partId": "16571604",
"partNumber": "CGB3B1X5R1A475M055AC",
"productDetailUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
"productDetailShareUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
"productImage": "https://static5.arrow.com/pdfs/2017/4/18/7/26/14/813/tdk_/manual/010101_lowprofile_pi0402.jpg",
"manufacturerName": "TDK",
"productLineTitle": "Capacitor Ceramic Multilayer",
"productDescription": "Cap Ceramic 4.7uF 10V X5R 20% Pad SMD 0603 85\u00b0C T/R",
"datasheetUrl": "",
"lowestPrice": 0.0645,
"lowestPriceFormatted": "$0.0645",
"highestPrice": 0.3133,
"highestPriceFormatted": "$0.3133",
"stockFormatted": "1,875",
"stock": 1875,
"attributes": [],
"buyingOptionType": "AddToCart",
"numberOfAttributesToShow": 1,
"rrClickTrackingUrl": null,
"pricingDataPopulated": true,
"sourcePartId": "V72:2272_06586404",
"sourceCode": "ACNA",
"packagingType": "Cut Strip",
"unitOfMeasure": "",
"isDiscontinued": false,
"productTileHint": null,
"tileSize": 1,
"tileType": "1x1",
"suplementaryClasses": "u-height"
},
...and so on.
Answered By - Andrej Kesely
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.