Issue
I have recently been studying web scraping and I am stuck. I need to scrape data from the next page, but there is only a clickable button and the URL stays the same. So my problem is: how can I reach the next page if the URL does not change? The site I am scraping is http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp
My code so far:
import scrapy
import json


class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    allowed_domains = ['esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp/']

    def start_requests(self):
        # send a POST request to the site's JSON endpoint
        return [scrapy.FormRequest("http://esg.krx.co.kr/contents/99/ESG99000001.jspx",
                                   formdata={'sch_com_nm': '',
                                             'sch_yy': '2021',
                                             'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                                             'code': '02/02020000/esg02020000',
                                             'pageFirstCall': 'Y'},
                                   callback=self.parse)]

    def parse(self, response):
        dict_data = json.loads(response.text)
        # loop over the result and print each company's name and share id
        for i in dict_data['result']:
            company_name = i['com_abbrv']
            company_share_id = i['isu_cd']
            print(company_name, company_share_id)
So far this only gets the information from the first page. Now I have to move to the next page. Could someone please explain how I do this?
Solution
The website you are scraping exposes an API that you can call directly instead of using Splash. If you examine the network tab in your browser's developer tools, you will see the POST request being sent to the server.
See the sample code below. I have hardcoded the total number of pages, but you could determine the total automatically instead of hardcoding the value.
Note the use of response.follow. It takes care of cookies and other headers automatically.
import scrapy


class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    allowed_domains = ['esg.krx.co.kr']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    custom_settings = {
        "USER_AGENT": 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
    }

    def parse(self, response):
        # send one POST request to the API per results page
        url = "http://esg.krx.co.kr/contents/99/ESG99000001.jspx"
        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }
        total_pages = 77
        for page in range(total_pages):
            payload = f"sch_com_nm=&sch_yy=2021&pagePath=%2Fcontents%2F02%2F02020000%2FESG02020000.jsp&code=02%2F02020000%2Fesg02020000&curPage={page+1}"
            yield response.follow(url=url, method='POST', callback=self.parse_result, headers=headers, body=payload)

    def parse_result(self, response):
        # loop over the results and yield one item per company
        for item in response.json().get('result'):
            yield {
                'company_name': item.get('com_abbrv'),
                'company_share_id': item.get('isu_cd')
            }
Answered By - msenior_