Issue
I want to find all article pages on IEEE Xplore and scrape the title and description tags from each one. My first problem is finding the articles: normally you have to type a query into the search box, but I want to do this automatically. I think a sitemap might be one approach, but I'm not sure. Please help me find a way.
I know how to scrape a website; my problem here is how to find all the articles on IEEE Xplore (e.g. https://ieeexplore.ieee.org/) automatically, without using the search box.
I want to get all article pages on https://ieeexplore.ieee.org (just the article page URLs).
Solution
When you click the search button, you are redirected to a URL like this:
https://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=smart%20grid
The queryText parameter is your search term. The content is loaded using JavaScript, so you cannot just send a request to the link and then parse the response. You can either
- Use Selenium to open the URL, or
- Emulate the XHR request that the page itself sends
With Selenium, go to the URL (with your preferred search term), click the "load more" button until enough articles have loaded, and then read the page source.
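For either approach, the search URL for a given term can be built programmatically. This is a minimal sketch using only the standard library; build_search_url is a hypothetical helper name, and the only parameters used are the two visible in the URL shown earlier.

```python
from urllib.parse import quote, urlencode

SEARCH_PAGE = "https://ieeexplore.ieee.org/search/searchresult.jsp"


def build_search_url(search_term: str) -> str:
    """Build the search-results URL for a search term (hypothetical helper)."""
    params = {"newsearch": "true", "queryText": search_term}
    # quote_via=quote encodes spaces as %20; the default (quote_plus) would use "+"
    return SEARCH_PAGE + "?" + urlencode(params, quote_via=quote)


print(build_search_url("smart grid"))
```

For "smart grid" this reproduces the URL shown earlier, with the space encoded as %20.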
I prefer to emulate the XHR request because it is faster.
import requests

# Change this for a different search term
search_term = "smart grid"
# Change this for a different page number
page_no = 1

headers = {
    "Accept": "application/json, text/plain, */*",
    "Origin": "https://ieeexplore.ieee.org",
    "Content-Type": "application/json",
}
payload = {
    "newsearch": True,
    "queryText": search_term,
    "highlight": True,
    "returnFacets": ["ALL"],
    "returnType": "SEARCH",
    "pageNumber": page_no,
}

# POST the JSON payload to the endpoint the search page calls via XHR
r = requests.post(
    "https://ieeexplore.ieee.org/rest/search",
    json=payload,
    headers=headers,
)
page_data = r.json()

# Print the title and absolute URL of each record on this page
for record in page_data["records"]:
    print(record["articleTitle"])
    print("https://ieeexplore.ieee.org" + record["documentLink"], end="\n----\n")
Output
IEEE Vision for Smart Grid Controls: 2030 and Beyond Reference Model
https://ieeexplore.ieee.org/document/6598993/
----
IEEE Vision for Smart Grid Control: 2030 and Beyond Roadmap
https://ieeexplore.ieee.org/document/6648362/
----
Software models for Smart Grid
https://ieeexplore.ieee.org/document/6225717/
----
IEEE Vision for Smart Grid Communications: 2030 and Beyond Roadmap
https://ieeexplore.ieee.org/document/6690098/
----
IEEE Smart Grid Vision for Computing: 2030 and Beyond Roadmap
https://ieeexplore.ieee.org/document/7376995/
----
...
The response is in JSON format, which we can parse in Python to get the output you want. The URL https://ieeexplore.ieee.org/rest/search can be found in the Network tab of your browser's developer tools.
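To collect article URLs beyond a single page, the request above can be repeated with increasing pageNumber values. The following is a sketch under stated assumptions, not a tested crawler: extract_links and iter_article_urls are hypothetical helper names, fetch_page stands for any function that returns the parsed JSON for a given page number (for example, one wrapping the requests.post call), and stopping at the first page with no records is an assumption about how the endpoint behaves past the last page. A fake fetcher stands in for the network call here.

```python
from typing import Callable, Iterator, List

BASE_URL = "https://ieeexplore.ieee.org"


def extract_links(page_data: dict) -> List[str]:
    """Return absolute article URLs from one parsed search response."""
    return [
        BASE_URL + record["documentLink"]
        for record in page_data.get("records", [])
        if "documentLink" in record
    ]


def iter_article_urls(
    fetch_page: Callable[[int], dict], max_pages: int = 5
) -> Iterator[str]:
    """Yield article URLs page by page, starting at page 1."""
    for page_no in range(1, max_pages + 1):
        links = extract_links(fetch_page(page_no))
        if not links:  # assumption: an empty page means there are no more results
            break
        yield from links


# Fake fetcher for illustration only; swap in a function that POSTs to /rest/search.
def fake_fetch(page_no: int) -> dict:
    pages = {
        1: {"records": [{"documentLink": "/document/6598993/"}]},
        2: {"records": [{"documentLink": "/document/6648362/"}]},
    }
    return pages.get(page_no, {"records": []})


urls = list(iter_article_urls(fake_fetch))
print(urls)
```

The max_pages cap keeps the loop bounded even if the stop condition never triggers; raise it once the behavior on the last page has been checked against real responses.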
Answered By - Bitto Bennichan