Issue
I am trying to extract links from a webpage, but I have to go through a proxy service. When I use the proxy, the links are not extracted correctly: they are missing the https://www.homeadvisor.com part and instead use api.scraperapi.com as the domain. How can I fix this problem?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scraper_api import ScraperAPIClient

client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")

class Sip2Spider(CrawlSpider):
    name = 'sip2'
    # allowed_domains = ['homeadvisor.com']
    # start_urls = ['https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html']
    start_urls = [client.scrapyGet(url='https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]
    rules = [
        Rule(LinkExtractor(allow="/rated"), callback="parse_page", follow=True)
    ]

    def parse_page(self, response):
        company_name = response.css("h1.\@w-full.\@text-3xl").css("::text").get().strip()
        yield {
            "company_name": company_name
        }
2022-10-24 18:13:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python> (referer: None)
2022-10-24 18:13:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.MarkAllenContracting.117262730.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:51 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.FireSignDBAdditionsand.123758218.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:52 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.DCEnclosuresInc.16852798.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.G3BuildersLLC.71091804.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:54 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.FletcherCooleyInc.43547458.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
Solution
It looks like the ScraperAPIClient requires you to use that specific syntax, client.scrapyGet(url=...), for each and every request. However, since you are using a CrawlSpider with a LinkExtractor set to follow, Scrapy sends the follow-up requests in its usual way: each relative href is resolved against the URL of the page it just fetched, which is now api.scraperapi.com, so those requests come back as 404s.
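To see the resolution problem in isolation, here is a minimal, standalone sketch using urllib.parse.urljoin (the proxy URL is shortened for readability; it is not part of the spider):

from urllib.parse import urljoin

# The page was fetched through the proxy, so its base URL is the proxy's:
proxy_page = "https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2F..."

# A relative href found in the page body:
href = "/rated.MarkAllenContracting.117262730.html"

# Scrapy resolves relative hrefs the same way urljoin does, which produces
# the broken URLs seen in the 404 log lines above:
print(urljoin(proxy_page, href))
# https://api.scraperapi.com/rated.MarkAllenContracting.117262730.html

With that in mind, you might be better off extracting all of the links yourself and then filtering the ones you want to follow.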
For example:
import scrapy
from scraper_api import ScraperAPIClient

client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")

class Sip2Spider(scrapy.Spider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    start_urls = [client.scrapyGet(url='https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]

    def parse(self, response):
        print(response)
        # re-attach the real site domain to any href that isn't already absolute
        links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
        yield {"links": list(set(links))}
This will yield:
[
    {
        "links": [
            "https://www.homeadvisor.com/rated.TapConstructionLLC.42214874.html",
            "https://www.homeadvisor.com#quote=42214874",
            "https://www.homeadvisor.com/emc.Drywall-Plaster-directory.-12025.html",
            "https://www.linkedin.com/company/homeadvisor/",
            "https://www.homeadvisor.com/c.Additions-Remodeling.Philadelphia.PA.-12001.html",
            "https://www.homeadvisor.com/login",
            "https://www.homeadvisor.com/task.Major-Home-Repairs-General-Contractor.40062.html",
            "https://www.homeadvisor.com/near-me/home-addition-builders/",
            "https://www.homeadvisor.com/c.Additions-Remodeling.Lawrenceville.GA.-12001.html",
            "https://www.homeadvisor.com/near-me/carpentry-contractors/",
            "https://www.homeadvisor.com/emc.Roofing-directory.-12061.html",
            "https://www.homeadvisor.com/c.Doors.Atlanta.GA.-12024.html",
            "https://www.homeadvisor.com#quote=20057351",
            "https://www.homeadvisor.com/near-me/deck-companies/",
            "https://www.homeadvisor.com/tloc/Atlanta-GA/Bathroom-Remodel/",
            "https://www.homeadvisor.com/c.Additions-Remodeling.Knoxville.TN.-12001.html",
            "https://www.homeadvisor.com/xm/35317287/task-selection/-12001?postalCode=30301",
            "https://www.homeadvisor.com/category.Additions-Remodeling.12001.html",
            "https://www.homeadvisor.comtel:4042672949",
            "https://www.homeadvisor.com/rated.DCEnclosuresInc.16852798.html",
            "https://www.homeadvisor.com#quote=16721785",
            "https://www.homeadvisor.com/near-me/bathroom-remodeling/",
            "https://www.homeadvisor.com/near-me",
            "https://www.homeadvisor.com/emc.Heating-Furnace-Systems-directory.-12040.html",
            "https://pro.homeadvisor.com/r/?m=sp_pro_center&entry_point_id=33522463",
            "https://www.homeadvisor.com/r/hiring-a-home-architect/",
            "https://www.homeadvisor.com#quote=119074241",
            "https://www.homeadvisor.comtel:8669030759",
            "https://www.homeadvisor.com/rated.SilverOakRemodel.78475581.html#ratings-reviews",
            "https://www.homeadvisor.com/emc.Tree-Service-directory.-12074.html",
            "https://www.homeadvisor.com/task.Bathroom-Remodel.40129.html",
            "https://www.homeadvisor.com/rated.G3BuildersLLC.71091804.html",
            "https://www.homeadvisor.com/sp/horizon-remodeling-construction",
            "https://www.homeadvisor.com/near-me/fence-companies/",
            "https://www.homeadvisor.com/emc.Gutters-directory.-12038.html",
            "https://www.homeadvisor.com/c.GA.html#topcontractors",
            ...
        ]
    }
]
The actual output is almost 400 links...
Then you can apply some kind of filtering to decide which links you want to follow, and use the same API SDK syntax to request them. Filtering also cuts down on the number of requests sent, which conserves API calls and saves you money as well.
For example:
def parse(self, response):
    print(response)
    links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
    yield {"links": list(set(links))}
    # some filtering process
    for link in links:
        # route each follow-up request back through the proxy
        yield scrapy.Request(client.scrapyGet(url=link))
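As one possible filtering process (purely a sketch: the is_profile_link helper is made up here, and parse_page is assumed to be a callback like the one in your original spider), keep only the /rated company profile pages:

def is_profile_link(url):
    # keep only homeadvisor.com profile pages like /rated.SomeCompany.123.html;
    # this implicitly drops anchors, tel: links and third-party domains
    return url.startswith("https://www.homeadvisor.com/rated")

def parse(self, response):
    links = [self.domain + i if not i.startswith('https://') else i
             for i in response.xpath("//a/@href").getall()]
    for link in set(filter(is_profile_link, links)):
        # route each kept link back through the proxy
        yield scrapy.Request(client.scrapyGet(url=link), callback=self.parse_page)

This alone reduces the ~400 raw links to a much smaller set of profile pages you actually care about.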
UPDATE:
Try this...
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlencode

APIKEY = "67e5e7755771b9abf8062e595dd5cc2a"  # <- your api key
APIDOMAIN = "http://api.scraperapi.com/"
DOMAIN = 'https://www.homeadvisor.com/'

def get_scraperapi_url(url):
    # wrap a target url in a proxied request url
    payload = {'api_key': APIKEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

def process_links(links):
    # the extractor resolved these links against api.scraperapi.com, so strip
    # everything before "rated", re-attach the real domain, then re-wrap the
    # fixed url in a proxied request url
    for link in links:
        i = link.url.index('rated')
        link.url = DOMAIN + link.url[i:]
        link.url = get_scraperapi_url(link.url)
    return links

class Sip2Spider(CrawlSpider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    start_urls = [get_scraperapi_url('https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]
    rules = [
        Rule(LinkExtractor(allow="/rated"), callback="parse_page", follow=True, process_links=process_links)
    ]

    def parse_page(self, response):
        company_name = response.xpath("//h1[contains(@class,'@w-full @text-3xl')]/text()").get()
        yield {
            "company_name": company_name
        }
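You can sanity-check what process_links does by tracing a single link by hand; this is a hypothetical snippet using scrapy.link.Link, the object type the link extractor passes in:

from scrapy.link import Link

# a broken link, as the extractor would emit it after resolving against the proxy
broken = Link(url="https://api.scraperapi.com/rated.DCEnclosuresInc.16852798.html")

fixed = process_links([broken])[0]
print(fixed.url)
# http://api.scraperapi.com/?api_key=...&url=https%3A%2F%2Fwww.homeadvisor.com%2Frated.DCEnclosuresInc.16852798.html

This way every request, including the follow-ups generated by the rule, goes through ScraperAPI while still pointing at the real homeadvisor.com pages.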
Answered By - Alexander