Issue
I have recently been studying web scraping and I am stuck. I need to scrape data from the next page, but there is only a clickable button and the URL stays the same. So my problem is: how can I reach the next page if the URL does not change? The site I am scraping is http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp
My code so far:
import scrapy
import json


class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    allowed_domains = ['esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp/']

    def start_requests(self):
        # send a POST request to the site's JSON endpoint
        return [scrapy.FormRequest("http://esg.krx.co.kr/contents/99/ESG99000001.jspx",
                                   formdata={'sch_com_nm': '',
                                             'sch_yy': '2021',
                                             'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                                             'code': '02/02020000/esg02020000',
                                             'pageFirstCall': 'Y'},
                                   callback=self.parse)]

    def parse(self, response):
        dict_data = json.loads(response.text)
        # loop over the result and print each company's name and share id
        for i in dict_data['result']:
            company_name = i['com_abbrv']
            company_share_id = i['isu_cd']
            print(company_name, company_share_id)
So far this only gets the information from the first page. Now I have to move to the next page. Could someone please explain how I do this?
Solution
The website you are scraping exposes an API that you can call directly instead of using Splash. If you examine the network tab in your browser's developer tools, you will see the POST request being sent to the server.
See the sample code below. I have hardcoded the total number of pages, but you could determine the total automatically instead of hardcoding the value.
Note the use of response.follow. It takes care of cookies and other headers automatically.
import scrapy


class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    allowed_domains = ['esg.krx.co.kr']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    custom_settings = {
        "USER_AGENT": 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
    }

    def parse(self, response):
        # send one POST request to the API per results page
        url = "http://esg.krx.co.kr/contents/99/ESG99000001.jspx"
        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }
        total_pages = 77
        for page in range(total_pages):
            payload = f"sch_com_nm=&sch_yy=2021&pagePath=%2Fcontents%2F02%2F02020000%2FESG02020000.jsp&code=02%2F02020000%2Fesg02020000&curPage={page+1}"
            yield response.follow(url=url, method='POST', callback=self.parse_result, headers=headers, body=payload)

    def parse_result(self, response):
        # loop over the results and yield one item per company
        for item in response.json().get('result'):
            yield {
                'company_name': item.get('com_abbrv'),
                'company_share_id': item.get('isu_cd')
            }
Answered By - msenior_