Issue
I want to get some information from each child (profile) page. However, the spider does not seem to enter the child pages and simply moves on to the next listing page. What it should do is collect the data from each child page, work through every element down to the end of the listing page, and only then move to the next page.
import scrapy
from ukparl.items import UkparlItem


class UkparlSpider(scrapy.Spider):
    name = 'ukparldata'
    # allowed_domains = ["https://members.parliament.uk/"]
    start_urls = ['https://members.parliament.uk/members/commons?page=1']

    def parse(self, response):
        nextpageurl = response.xpath('//a[@title="Go to next page"]/@href')
        yield from self.scrape(response)
        if nextpageurl:
            path = nextpageurl.extract_first()
            nextpage = response.urljoin(path)
            print("Found url: {}".format(nextpage))
            yield scrapy.Request(nextpage, callback=self.parse)

    def scrape(self, response):
        for resource in response.xpath('//div[@class="primary-info"]/..'):
            item = UkparlItem()
            item['name'] = resource.xpath('div[@class="primary-info"]/text()').extract_first()
            profilepage = response.urljoin(resource.xpath('//a[@class="card card-member"]/@href').extract_first())
            item['link'] = profilepage
            item['party'] = resource.xpath('div[@class="secondary-info"]/text()').extract_first()
            item['region'] = resource.xpath('//div[@class="indicator indicator-label"]/text()').extract_first()
            request = scrapy.Request(profilepage, callback=self.get_data)
            request.meta['item'] = item
            yield request

    def get_data(self, response):
        item = response.meta['item']
        item['phonenumber'] = response.xpath('//div[@class="contact-line"]/a/text()').extract_first()
        item['twitter'] = response.xpath('//a[@class="card card-contact-info"][2]/@href').extract()
        yield item
Solution
The spider does enter the child page; you can add print(item) to the get_data function to confirm it.
The problem is that you request the same child page over and over:
<GET https://members.parliament.uk/members/commons?page=2> (referer: https://members.parliament.uk/members/commons?page=1)
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
<GET https://members.parliament.uk/member/4362/contact>
And so on
You can change:
profilepage = response.urljoin(resource.xpath('//a[@class="card card-member"]/@href').extract_first())
to
profilepage = response.urljoin(resource.xpath('../../@href').extract_first())
(I think it would be better to replace response.xpath('//div[@class="primary-info"]/..') with a more direct selector, but that is just my opinion.)
As you can see, it works:
DEBUG: Crawled (200) <GET https://members.parliament.uk/member/3922/contact> (referer: https://members.parliament.uk/members/commons?page=5)
<GET https://members.parliament.uk/member/3986/contact>
<GET https://members.parliament.uk/member/4769/contact>
<GET https://members.parliament.uk/member/1538/contact>
<GET https://members.parliament.uk/member/420/contact>
<GET https://members.parliament.uk/member/185/contact>
<GET https://members.parliament.uk/member/4439/contact>
<GET https://members.parliament.uk/member/4589/contact>
<GET https://members.parliament.uk/member/4806/contact>
<GET https://members.parliament.uk/member/4465/contact>
<GET https://members.parliament.uk/member/1508/contact>
<GET https://members.parliament.uk/member/4368/contact>
<GET https://members.parliament.uk/member/1554/contact>
<GET https://members.parliament.uk/member/4469/contact>
<GET https://members.parliament.uk/member/4088/contact>
<GET https://members.parliament.uk/member/4859/contact>
<GET https://members.parliament.uk/member/3950/contact>
<GET https://members.parliament.uk/member/1406/contact>
Answered By - SuperUser