Issue
I am scraping some data from this URL:
https://www.degruyter.com/search?query=*&startItem=0&pageSize=10&sortBy=relevance&documentTypeFacet=journal
When I try to get the journal names, I get nothing back. Some tags return a response, but the tags holding the journal names return nothing. The div with class "resultTitle" contains the journal names, but when I try the following in Scrapy
response.css("div.resultTitle").get()
it returns nothing.
I have also tried BeautifulSoup.
Solution
It seems that the block containing what you want ("resultTitle") is loaded by JavaScript, from a script named xxxxxxxx-main.js:
...
a.loginContentPromise.then((() => {
    const e = document.querySelector("#session-redirect");
    if (e) {
        const t = e.dataset.destination || "/";
        window.location.replace(t)
    }
})),
...
If you fetch the page with the "wget" command instead of a web browser, you will find a block like the one below in the response:
...
<main id="main" class='language_en px-0 min-vh-100 container-fluid'>
<div id="session-redirect" data-destination='/search?query=*&startItem=0&pageSize=10&sortBy=relevance&documentTypeFacet=journal'></div>
</main>
...
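The same check can be done from Python instead of reading the wget output by eye. This is a minimal sketch using only the standard library; the names RedirectPlaceholderParser and find_redirect_destination are hypothetical, and the sample markup is the excerpt quoted above. It looks for the #session-redirect placeholder in raw HTML and, like the JS snippet, falls back to "/" when no data-destination is set.

```python
from html.parser import HTMLParser


class RedirectPlaceholderParser(HTMLParser):
    """Records the data-destination of <div id="session-redirect">, if any."""

    def __init__(self):
        super().__init__()
        self.destination = None  # stays None when the placeholder is absent

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "div" and attr_map.get("id") == "session-redirect":
            # Mirrors the JS fallback: e.dataset.destination || "/"
            self.destination = attr_map.get("data-destination") or "/"


def find_redirect_destination(html):
    """Return the redirect target if html is the placeholder page, else None."""
    parser = RedirectPlaceholderParser()
    parser.feed(html)
    return parser.destination


# The excerpt from the wget response shown above.
SAMPLE = """
<main id="main" class='language_en px-0 min-vh-100 container-fluid'>
<div id="session-redirect" data-destination='/search?query=*&startItem=0&pageSize=10&sortBy=relevance&documentTypeFacet=journal'></div>
</main>
"""

destination = find_redirect_destination(SAMPLE)
print(destination is not None)
```

If the function returns a value for a response body, that response is the placeholder page, so selectors such as div.resultTitle will find nothing in it.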
You can read the "xxxxxxxx-main.js" code and reimplement its logic yourself, or simply use Splash to render the page.
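For the Splash route, here is a rough sketch that goes through Splash's render.html HTTP endpoint with only the standard library. It assumes a Splash instance is already running at http://localhost:8050; the helper names and the wait value of 2 seconds are hypothetical, and whether that wait is long enough for the redirect and the results to load has not been verified.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Splash instance is listening locally on port 8050.
SPLASH_BASE = "http://localhost:8050"

SEARCH_URL = (
    "https://www.degruyter.com/search?query=*&startItem=0&pageSize=10"
    "&sortBy=relevance&documentTypeFacet=journal"
)


def build_splash_render_url(target_url, wait=2.0, splash_base=SPLASH_BASE):
    """Build a render.html URL asking Splash to load target_url and wait for JS."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


def fetch_rendered_html(target_url, wait=2.0, timeout=30):
    """Fetch the JS-rendered HTML through Splash (requires a running instance)."""
    render_url = build_splash_render_url(target_url, wait)
    with urlopen(render_url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Usage, once Splash is running:
#   html = fetch_rendered_html(SEARCH_URL)
#   print("resultTitle" in html)  # class name taken from the question
```

The rendered HTML can then be handed to whatever parser you were already using. Inside a Scrapy project, the scrapy-splash package is another way to route requests through Splash.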
P.S. This is the wget command I used:
wget -O search_result.html https://www.degruyter.com/search\?query\=\*\&startItem\=0\&pageSize\=10\&sortBy\=relevance\&documentTypeFacet\=journal
Answered By - marco