Issue
I have a problem with a very simple custom spider, but I can't figure it out. Scrapy is redirected to the consent.yahoo.com page when trying to scrape a page on Yahoo Finance.
The spider looks like this:
import scrapy


class CompanyDetailsSpider(scrapy.Spider):
    name = 'company_details'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['https://finance.yahoo.com/screener/predefined/ms_technology']

    def parse(self, response):
        company_names_list = response.xpath(
            '//*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[2]/text()').extract()
        company_price_list = response.xpath(
            '//*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[3]/span/text()').extract()
        count = len(company_names_list)
        for i in range(0, count):
            print(company_names_list[i], company_price_list[i])
This code was taken from a Scrapy course, where it worked. The problem appears when I try to run it. It shows me:
2022-02-01 15:29:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (307) to <GET https://guce.yahoo.com/consent?brandType=nonEu&gcrumb=TEYoGM4&done=https%3A%2F%2Ffinance.yahoo.com%2Fscreener%2Fpredefined%2Fms_technology> from <GET https://finance.yahoo.com/screener/predefined/ms_technology>
2022-02-01 15:29:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_4eb5a247-c8c1-47f7-b860-1b593d8ad1ef> from <GET https://guce.yahoo.com/consent?brandType=nonEu&gcrumb=TEYoGM4&done=https%3A%2F%2Ffinance.yahoo.com%2Fscreener%2Fpredefined%2Fms_technology>
2022-02-01 15:29:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_4eb5a247-c8c1-47f7-b860-1b593d8ad1ef> (referer: None)
And when I view the response after fetching the page in a Scrapy shell, it shows that the request is redirected to a (cookie?) consent page.
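For reference, the redirect can be confirmed in a Scrapy shell session along these lines (the session ID is abbreviated here):

$ scrapy shell 'https://finance.yahoo.com/screener/predefined/ms_technology'
>>> response.url
'https://consent.yahoo.com/v2/collectConsent?sessionId=...'
>>> view(response)  # opens the consent page, not the screener, in a browser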
I can't find a solution to this anywhere, since I can't find anyone reporting the same issue. However, answers to other cookie-related issues say that cookies should be enabled, which they are, and ROBOTSTXT_OBEY is set to False. My settings look like this:
BOT_NAME = 'SimpleSpider'
SPIDER_MODULES = ['SimpleSpider.spiders']
NEWSPIDER_MODULE = 'SimpleSpider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'SimpleSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'SimpleSpider.middlewares.SimplespiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'SimpleSpider.middlewares.SimplespiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'SimpleSpider.pipelines.SimplespiderPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
I hope someone can help with this!
Solution
The issue is that you need to include the cookies in start_requests. There is also a problem with how you're indexing the values: it's better to yield the data with Scrapy than to print it. Finally, you don't need span in your XPath for the prices.
Here's a working solution:
import scrapy

# Consent/session cookies (e.g. EuConsent, GUC) that satisfy Yahoo's consent
# check, so the request is no longer redirected to consent.yahoo.com.
cookies = {
    'B': '7t389hlgv4sqv&b=3&s=gb',
    'GUCS': 'AU8-5cgT',
    'EuConsent': 'CPTv0BMPTv0BMAOACBENB-CoAP_AAH_AACiQIJNe_X__bX9n-_59__t0eY1f9_r3v-QzjhfNt-8F2L_W_L0H_2E7NB36pq4KuR4ku3bBIQFtHMnUTUmxaolVrzHsak2MpyNKJ7LkmnsZe2dYGHtPn9lD-YKZ7_7___f73z___9_-39z3_9f___d9_-__-vjfV_993________9nd____BBIAkw1LyALsSxwJNo0qhRAjCsJCoBQAUUAwtEVgAwOCnZWAT6ghYAITUBGBECDEFGDAIAAAIAkIiAkALBAIgCIBAACAFCAhAARMAgsALAwCAAUA0LEAKAAQJCDI4KjlMCAiRaKCWysQSgr2NMIAyywAoFEZFQgIlCCBYGQkLBzHAEgJYAYaADAAEEEhEAGAAIIJCoAMAAQQSA',
    'A1': 'd=AQABBF9z8mECELBiwNCF9soE8MMAyI0JjX4FEgABBgHX-mHJYvbPb2UB9iMAAAcIX3PyYY0JjX4&S=AQAAAjnkhOf_LxrMMNCN1-BYfEY',
    'A3': 'd=AQABBF9z8mECELBiwNCF9soE8MMAyI0JjX4FEgABBgHX-mHJYvbPb2UB9iMAAAcIX3PyYY0JjX4&S=AQAAAjnkhOf_LxrMMNCN1-BYfEY',
    'A1S': 'd=AQABBF9z8mECELBiwNCF9soE8MMAyI0JjX4FEgABBgHX-mHJYvbPb2UB9iMAAAcIX3PyYY0JjX4&S=AQAAAjnkhOf_LxrMMNCN1-BYfEY&j=GDPR',
    'GUC': 'AQABBgFh-tdiyUIdFwSP',
    'cmp': 'v=22&t=1643742832&j=1',
}


class CompanyDetailsSpider(scrapy.Spider):
    name = 'company_details'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['https://finance.yahoo.com/screener/predefined/ms_technology']

    def start_requests(self):
        # Attach the cookies to every start request.
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                cookies=cookies,
                callback=self.parse,
            )

    def parse(self, response):
        company_names_list = response.xpath(
            '//*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[2]/text()').extract()
        # td[3]//text() picks up the price whether or not it is wrapped in a <span>.
        company_price_list = response.xpath(
            './/*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[3]//text()').extract()
        yield {
            'company_names_list': company_names_list,
            'company_price_list': company_price_list,
        }
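To try it, run the spider from the project directory; with Scrapy 2.0+ you can export the yielded items directly to a file (companies.json is just an example filename):

scrapy crawl company_details -O companies.json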
Output:
{'company_names_list': ['Apple Inc.', 'Microsoft Corporation', 'Taiwan Semiconductor Manufacturing Company Limited', 'NVIDIA Corporation', 'ASML Holding N.V.', 'Adobe Inc.', 'Broadcom Inc.', 'Cisco Systems, Inc.', 'salesforce.com, inc.', 'Accenture plc', 'Oracle Corporation', 'Intel Corporation', 'QUALCOMM Incorporated', 'Texas Instruments Incorporated', 'Intuit Inc.', 'SAP SE', 'Sony Group Corporation', 'Advanced Micro Devices, Inc.', 'Applied Materials, Inc.', 'Shopify Inc.', 'International Business Machines Corporation', 'ServiceNow, Inc.', 'Infosys Limited', 'Micron Technology, Inc.', 'Snowflake Inc.'], 'company_price_list': ['172.95', '305.85', '121.92', '241.58', '675.08', '532.16', '585.24', '55.24', '230.07', '350.62', '81.11', '48.63', '175.59', '179.27', '557.11', '127.09', '111.82', '114.18', '137.63', '958.72', '134.34', '582.21', '23.36', '80.83', '282.22']}
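As a follow-up, since the two lists are parallel (one entry per table row), a variant of parse could zip them and yield one item per company, which is usually easier to post-process. This is a sketch, not part of the original answer; .getall() is simply the modern alias for .extract():

    def parse(self, response):
        names = response.xpath(
            '//*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[2]/text()').getall()
        prices = response.xpath(
            '//*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[3]//text()').getall()
        # Pair each company name with its price, one item per row.
        for name, price in zip(names, prices):
            yield {'name': name, 'price': price}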
Answered By - Stackbeans