Issue
I am trying to extract links from a webpage, but I have to go through a proxy service. When I use the proxy, the links are not extracted correctly: they are missing the https://www.homeadvisor.com part and instead use api.scraperapi.com as the domain. How can I fix this problem?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scraper_api import ScraperAPIClient

client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")

class Sip2Spider(CrawlSpider):
    name = 'sip2'
    # allowed_domains = ['homeadvisor.com']
    # start_urls = ['https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html']
    start_urls = [client.scrapyGet(url='https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]
    rules = [
        Rule(LinkExtractor(allow="/rated"), callback="parse_page", follow=True)
    ]

    def parse_page(self, response):
        company_name = response.css("h1.\@w-full.\@text-3xl").css("::text").get().strip()
        yield {
            "company_name": company_name
        }
2022-10-24 18:13:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python> (referer: None)
2022-10-24 18:13:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.MarkAllenContracting.117262730.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:51 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.FireSignDBAdditionsand.123758218.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:52 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.DCEnclosuresInc.16852798.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.G3BuildersLLC.71091804.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:54 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.FletcherCooleyInc.43547458.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
Solution
It looks like the ScraperAPIClient requires you to use that specific syntax, client.scrapyGet(url=...), for each and every request. However, since you are using a CrawlSpider with a LinkExtractor set to follow, Scrapy sends the follow-up requests in its usual way: each relative href is resolved against the URL of the page it just fetched, which is now api.scraperapi.com, so those requests come back as 404s.
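To see the resolution problem in isolation, here is a minimal, standalone sketch using urllib.parse.urljoin (the proxy URL is shortened for readability; it is not part of the spider):

from urllib.parse import urljoin

# The page was fetched through the proxy, so its base URL is the proxy's:
proxy_page = "https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2F..."

# A relative href found in the page body:
href = "/rated.MarkAllenContracting.117262730.html"

# Scrapy resolves relative hrefs the same way urljoin does, which produces
# the broken URLs seen in the 404 log lines above:
print(urljoin(proxy_page, href))
# https://api.scraperapi.com/rated.MarkAllenContracting.117262730.html

With that in mind, you might be better off extracting all of the links yourself and then filtering the ones you want to follow.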
For example:
import scrapy
from scraper_api import ScraperAPIClient

client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")

class Sip2Spider(scrapy.Spider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    start_urls = [client.scrapyGet(url='https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]

    def parse(self, response):
        print(response)
        # re-attach the real site domain to any href that isn't already absolute
        links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
        yield {"links": list(set(links))}
This will yield:
[
    {
        "links": [
            "https://www.homeadvisor.com/rated.TapConstructionLLC.42214874.html",
            "https://www.homeadvisor.com#quote=42214874",
            "https://www.homeadvisor.com/emc.Drywall-Plaster-directory.-12025.html",
            "https://www.linkedin.com/company/homeadvisor/",
            "https://www.homeadvisor.com/c.Additions-Remodeling.Philadelphia.PA.-12001.html",
            "https://www.homeadvisor.com/login",
            "https://www.homeadvisor.com/task.Major-Home-Repairs-General-Contractor.40062.html",
            "https://www.homeadvisor.com/near-me/home-addition-builders/",
            "https://www.homeadvisor.com/c.Additions-Remodeling.Lawrenceville.GA.-12001.html",
            "https://www.homeadvisor.com/near-me/carpentry-contractors/",
            "https://www.homeadvisor.com/emc.Roofing-directory.-12061.html",
            "https://www.homeadvisor.com/c.Doors.Atlanta.GA.-12024.html",
            "https://www.homeadvisor.com#quote=20057351",
            "https://www.homeadvisor.com/near-me/deck-companies/",
            "https://www.homeadvisor.com/tloc/Atlanta-GA/Bathroom-Remodel/",
            "https://www.homeadvisor.com/c.Additions-Remodeling.Knoxville.TN.-12001.html",
            "https://www.homeadvisor.com/xm/35317287/task-selection/-12001?postalCode=30301",
            "https://www.homeadvisor.com/category.Additions-Remodeling.12001.html",
            "https://www.homeadvisor.comtel:4042672949",
            "https://www.homeadvisor.com/rated.DCEnclosuresInc.16852798.html",
            "https://www.homeadvisor.com#quote=16721785",
            "https://www.homeadvisor.com/near-me/bathroom-remodeling/",
            "https://www.homeadvisor.com/near-me",
            "https://www.homeadvisor.com/emc.Heating-Furnace-Systems-directory.-12040.html",
            "https://pro.homeadvisor.com/r/?m=sp_pro_center&entry_point_id=33522463",
            "https://www.homeadvisor.com/r/hiring-a-home-architect/",
            "https://www.homeadvisor.com#quote=119074241",
            "https://www.homeadvisor.comtel:8669030759",
            "https://www.homeadvisor.com/rated.SilverOakRemodel.78475581.html#ratings-reviews",
            "https://www.homeadvisor.com/emc.Tree-Service-directory.-12074.html",
            "https://www.homeadvisor.com/task.Bathroom-Remodel.40129.html",
            "https://www.homeadvisor.com/rated.G3BuildersLLC.71091804.html",
            "https://www.homeadvisor.com/sp/horizon-remodeling-construction",
            "https://www.homeadvisor.com/near-me/fence-companies/",
            "https://www.homeadvisor.com/emc.Gutters-directory.-12038.html",
            "https://www.homeadvisor.com/c.GA.html#topcontractors",
            ...
        ]
    }
]
The actual output is almost 400 links...
Then you can apply some kind of filtering to decide which links you want to follow, and use the same API SDK syntax to request them. Filtering also cuts down on the number of requests sent, which conserves API calls and saves you money as well.
For example:
def parse(self, response):
    print(response)
    links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
    yield {"links": list(set(links))}
    # some filtering process
    for link in links:
        # route each follow-up request back through the proxy
        yield scrapy.Request(client.scrapyGet(url=link))
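As one possible filtering process (purely a sketch: the is_profile_link helper is made up here, and parse_page is assumed to be a callback like the one in your original spider), keep only the /rated company profile pages:

def is_profile_link(url):
    # keep only homeadvisor.com profile pages like /rated.SomeCompany.123.html;
    # this implicitly drops anchors, tel: links and third-party domains
    return url.startswith("https://www.homeadvisor.com/rated")

def parse(self, response):
    links = [self.domain + i if not i.startswith('https://') else i
             for i in response.xpath("//a/@href").getall()]
    for link in set(filter(is_profile_link, links)):
        # route each kept link back through the proxy
        yield scrapy.Request(client.scrapyGet(url=link), callback=self.parse_page)

This alone reduces the ~400 raw links to a much smaller set of profile pages you actually care about.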
UPDATE:
Try this...
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlencode

APIKEY = "67e5e7755771b9abf8062e595dd5cc2a"  # <- your api key
APIDOMAIN = "http://api.scraperapi.com/"
DOMAIN = 'https://www.homeadvisor.com/'

def get_scraperapi_url(url):
    # wrap a target url in a proxied request url
    payload = {'api_key': APIKEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

def process_links(links):
    # the extractor resolved these links against api.scraperapi.com, so strip
    # everything before "rated", re-attach the real domain, then re-wrap the
    # fixed url in a proxied request url
    for link in links:
        i = link.url.index('rated')
        link.url = DOMAIN + link.url[i:]
        link.url = get_scraperapi_url(link.url)
    return links

class Sip2Spider(CrawlSpider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    start_urls = [get_scraperapi_url('https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]
    rules = [
        Rule(LinkExtractor(allow="/rated"), callback="parse_page", follow=True, process_links=process_links)
    ]

    def parse_page(self, response):
        company_name = response.xpath("//h1[contains(@class,'@w-full @text-3xl')]/text()").get()
        yield {
            "company_name": company_name
        }
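You can sanity-check what process_links does by tracing a single link by hand; this is a hypothetical snippet using scrapy.link.Link, the object type the link extractor passes in:

from scrapy.link import Link

# a broken link, as the extractor would emit it after resolving against the proxy
broken = Link(url="https://api.scraperapi.com/rated.DCEnclosuresInc.16852798.html")

fixed = process_links([broken])[0]
print(fixed.url)
# http://api.scraperapi.com/?api_key=...&url=https%3A%2F%2Fwww.homeadvisor.com%2Frated.DCEnclosuresInc.16852798.html

This way every request, including the follow-ups generated by the rule, goes through ScraperAPI while still pointing at the real homeadvisor.com pages.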
Answered By - Alexander