Issue
Problem
I'm scraping a dynamic page, one that appears to load its results from a database for display, but I'm only getting the placeholders for the text elements rather than the text itself. The page I'm loading is:
https://www.bricklink.com/v2/search.page?q=8084#T=S
Expected / Actual
Expected:
<table>
  <tr>
    <td class="pspItemClick">
      <a class="pspItemNameLink" href="www.some-url.com">The Name</a>
      <br/>
      <span class="pspItemCateAndNo">
        <span class="blcatList">Catalog Num</span> : 1111
      </span>
    </td>
  </tr>
</table>
Actual:
<table>
  <tr>
    <td class="pspItemClick">
      <a class="pspItemNameLink" href="[%catalogUrl%]">[%strItemName%]</a>
      <br/>
      <span class="pspItemCateAndNo">
        <span class="blcatList">[%strCategory%]</span> : [%strItemNo%]
      </span>
    </td>
  </tr>
</table>
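A quick way to confirm that scraped markup is still an unrendered template is to scan it for [%...%] tokens like the ones above. This is only a sketch: the helper name find_placeholders and the regular expression are illustrative assumptions, not part of the original code.

```python
import re
from typing import List

# Matches template tokens such as [%strItemName%] (pattern is an assumption
# based on the placeholders visible in the "Actual" markup).
PLACEHOLDER_RE = re.compile(r"\[%(\w+)%\]")


def find_placeholders(html: str) -> List[str]:
    """Return unique placeholder names in order of first appearance."""
    seen: List[str] = []
    for name in PLACEHOLDER_RE.findall(html):
        if name not in seen:
            seen.append(name)
    return seen


sample = '<a class="pspItemNameLink" href="[%catalogUrl%]">[%strItemName%]</a>'
print(find_placeholders(sample))  # ['catalogUrl', 'strItemName']
```

An empty list suggests the template was filled in; a non-empty one suggests the data is loaded separately from the HTML.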
Attempted Solutions
- I first just tried loading the site using the requests library which, of course, didn't work since it's not a static page.
import requests
from bs4 import BeautifulSoup

def load_page(url: str) -> BeautifulSoup:
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    req = requests.get(url, headers=headers)
    return BeautifulSoup(req.content, 'html.parser')
- I then tried Selenium's webdriver to load the dynamic content:
def html_source_from_webdriver(url: str, wait: int = 0) -> BeautifulSoup:
    browser = webdriver.Chrome(service=selenium_chrome_service, chrome_options=options)
    browser.implicitly_wait(wait)
    browser.get(urljoin(ROOT_URL, url))
    page_source = browser.page_source
    return BeautifulSoup(page_source, features="html.parser")
Both attempts yield the same results. I haven't used the implicitly_wait feature much, so I was just experimenting with different values (0-15), none of which worked. I've also tried browser.set_script_timeout(<timeout>), which also did not work.
Any thoughts on where to go from here would be greatly appreciated.
Update
I appreciate those of you providing suggestions. I've also tried the following with no luck:
- using time.sleep(), added after the browser.get(...) call.
- using browser.set_page_load_timeout(); I didn't expect this one to work, but tried anyway.
Solution
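Here is one way of getting that information. Start by opening the Network tab in your browser's developer tools while the page loads, and look for calls made to backend APIs (XHR or WebSocket requests). The search results can then be requested directly from the JSON endpoint used below: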
import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
}
url = 'https://www.bricklink.com/ajax/clone/search/searchproduct.ajax?q=8084&st=0&cond=&type=&cat=&yf=0&yt=0&loc=&reg=0&ca=0&ss=&pmt=&nmp=0&color=-1&min=0&max=0&minqty=0&nosuperlot=1&incomplete=0&showempty=1&rpp=25&pi=1&ci=0'
r = requests.get(url, headers=headers)
df = pd.json_normalize(r.json()['result']['typeList'], record_path=['items'])
print(df)
Result in terminal:
idItem typeItem strItemNo strItemName idColor idColorImg cItemImgTypeS bHasLargeImg n4NewQty n4NewSellerCnt mNewMinPrice mNewMaxPrice n4UsedQty n4UsedSellerCnt mUsedMinPrice mUsedMaxPrice strCategory strPCC
0 95924 S 66364-1 Star Wars Bundle Pack, Super Pack 3 in 1 (Sets... -1 0 J True 3 3 CZK 3,839.17 CZK 4,145.02 0 0 CZK 0.00 CZK 0.00 65.258 None
1 95927 S 66368-1 Star Wars Bundle Pack, Super Pack 3 in 1 (Sets... -1 0 J True 7 4 CZK 2,889.67 CZK 3,974.78 0 0 CZK 0.00 CZK 0.00 65.258 None
2 88129 S 8084-1 Snowtrooper Battle Pack -1 0 G True 157 51 CZK 473.72 CZK 3,552.88 109 79 CZK 231.09 CZK 884.62 65.258 None
3 95085 C c09se2 2009 Large Swedish July - December (456.8084-SV) -1 -1 None False 0 0 CZK 0.00 CZK 0.00 1 1 CZK 117.24 CZK 117.24 647 None
4 210835 G SW4AM2 Display Assembled Set, Star Wars Sets 8083, 80... -1 11 J True 0 0 CZK 0.00 CZK 0.00 0 0 CZK 0.00 CZK 0.00 848.65.258 None
5 88128 I 8084-1 Snowtrooper Battle Pack -1 0 J True 590 74 CZK 0.27 CZK 84.42 605 302 CZK 0.22 CZK 221.21 65.258 None
6 95922 O 66364-1 Star Wars Bundle Pack, Super Pack 3 in 1 (Sets... -1 0 J True 0 0 CZK 0.00 CZK 0.00 1 1 CZK 473.72 CZK 473.72 65.258 None
7 95925 O 66368-1 Star Wars Bundle Pack, Super Pack 3 in 1 (Sets... -1 0 J True 0 0 CZK 0.00 CZK 0.00 0 0 CZK 0.00 CZK 0.00 65.258 None
8 88127 O 8084-1 Snowtrooper Battle Pack -1 0 G True 3 3 CZK 60.55 CZK 236.86 9
See relevant documentation for packages used: pandas and requests
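If the hard-coded URL is awkward to reuse, the query string can be assembled from a dictionary instead. The sketch below is an assumption-laden variation on the answer, not a documented API: the helper names build_search_url and flatten_items are hypothetical, the parameter values are copied from the URL above, and treating rpp as "results per page" and pi as "page index" is a guess based on their names only.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.bricklink.com/ajax/clone/search/searchproduct.ajax"


def build_search_url(query: str, page: int = 1, per_page: int = 25) -> str:
    """Rebuild the search URL from the answer with a different query or page."""
    # All keys and defaults are taken verbatim from the URL in the answer.
    # The meaning of rpp/pi is assumed, not confirmed.
    params = {
        "q": query, "st": 0, "cond": "", "type": "", "cat": "",
        "yf": 0, "yt": 0, "loc": "", "reg": 0, "ca": 0, "ss": "",
        "pmt": "", "nmp": 0, "color": -1, "min": 0, "max": 0,
        "minqty": 0, "nosuperlot": 1, "incomplete": 0, "showempty": 1,
        "rpp": per_page, "pi": page, "ci": 0,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def flatten_items(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect every item from result -> typeList -> items into one list.

    This mirrors what pd.json_normalize(..., record_path=['items']) walks
    in the answer, without requiring pandas.
    """
    return [
        item
        for group in payload["result"]["typeList"]
        for item in group["items"]
    ]
```

The returned URL can be passed to requests.get(url, headers=headers) exactly as in the answer, and r.json() handed to flatten_items to get a plain list of dictionaries with keys such as strItemNo and strItemName.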
Answered By - Barry the Platipus