Monday, January 3, 2022

[FIXED] Can't yield parallel requests conducted by items pipeline

Labels: containers, pipeline, scrapy

Issue

In my Scrapy code I'm trying to yield the following figures from the parliament's website, where all the members of parliament (MPs) are listed. After opening the link for each MP, I make parallel requests to get the figures I'm trying to count. I didn't use meta here because my code doesn't just make consecutive requests; it makes parallel requests for the figures after the individual page of the MP is requested. So I thought item containers would fit my purpose better.

Here are the figures I'm trying to scrape:

How many bill proposals each MP has their signature on
How many question proposals each MP has their signature on
How many times each MP spoke in parliament

To count and yield how many bills each MP has their signature on, I'm trying to write a scraper that works in 3 layers:

1) Starting with the link where all MPs are listed
2) From (1), accessing the individual page of each MP, where the three figures defined above are displayed
3a) Requesting the page with bill proposals and counting them with the len function
3b) Requesting the page with question proposals and counting them with the len function
3c) Requesting the page with speeches and counting them with the len function

What I want: I want to yield the results of 3a, 3b and 3c together with the name and the party of the MP.

Problem: My code below yields nothing but empty dictionaries for each request.

Note: Because my parse functions don't run as parse => parse2 => parse3, but rather as 3 parallel parse functions after parse2, I couldn't use meta, since I'm not yielding all the values in the third parse function. Therefore I preferred using the pipelines, which apparently don't work.

Main code:

'''

from scrapy import Spider        
from scrapy.http import Request
from ..items import MeclisItem
import logging

class MvSpider(Spider):
    name = 'mv'
    allowed_domains = ['tbmm.gov.tr']
    start_urls = ['https://www.tbmm.gov.tr/Milletvekilleri/liste']

    def parse(self, response):

        items = MeclisItem()


        mv_list = response.xpath("//ul[@class='list-group list-group-flush']") #taking all MPs listed


        for  mv in mv_list:

            items['name'] = mv.xpath("./li/div/div/a/text()").get() # MP's name taken


            items['party'] = mv.xpath("./li/div/div[@class='col-md-4 text-right']/text()").get().strip() #MP's party name taken


            partial_link = mv.xpath('.//div[@class="col-md-8"]/a/@href').get()
            full_link = response.urljoin(partial_link)


            yield Request(full_link, callback = self.mv_analysis)

        pass

    def mv_analysis(self, response):

        items = MeclisItem()

        billprop_link_path = response.xpath(".//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()
        billprop_link = response.urljoin(billprop_link_path)


        questionprop_link_path = response.xpath(".//a[contains(text(),'Sahibi Olduğu Yazılı Soru Önergeleri')]/@href").get()
        questionprop_link = response.urljoin(questionprop_link_path)

        speech_link_path = response.xpath(".//a[contains(text(),'Genel Kurul Konuşmaları')]/@href").get()
        speech_link = response.urljoin(speech_link_path)


        yield Request(billprop_link, callback = self.bill_prop_counter)  #number of bill proposals to be requested
        yield Request(questionprop_link, callback = self.quest_prop_counter) #number of question proposals to be requested
        yield Request(speech_link, callback = self.speech_counter)  #number of speeches to be requested

        yield items

# COUNTING FUNCTIONS


    def bill_prop_counter(self,response):

        items = MeclisItem()

        billproposals = response.xpath("//tr[@valign='TOP']")

        items['bill_prop_count'] = len(billproposals)

        pass

    def quest_prop_counter(self, response):

        items = MeclisItem()

        questionproposals = response.xpath("//tr[@valign='TOP']")
        items['res_prop_count'] = len(questionproposals)

        pass

    def speech_counter(self, response):

        items = MeclisItem()

        speeches = response.xpath("//tr[@valign='TOP']")
        items['speech_count'] = len(speeches)

        pass

'''

items.py code:

import scrapy


class MeclisItem(scrapy.Item):
    name = scrapy.Field()
    party = scrapy.Field()
    bill_prop_count = scrapy.Field()
    res_prop_count = scrapy.Field()
    speech_count = scrapy.Field()

    pass

What's displayed at scrapy:

I checked many questions on Stack Overflow but still couldn't figure a way out. Thanks in advance.

PS: I spent ten minutes separately trying to colour the code above and couldn't manage that either :(

Solution

Note: Because my parse functions don't run as parse => parse2 => parse3, but rather as 3 parallel parse functions after parse2, I couldn't use meta, since I'm not yielding all the values in the third parse function.
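The empty output can be traced to the fact that every callback in the question creates a fresh MeclisItem(), and the counting callbacks never yield theirs, so nothing connects the name and party to the counts. Below is a minimal plain-Python sketch of that difference, with no Scrapy involved. A dict stands in for the item, and the function names and values are hypothetical:

```python
# Plain-Python illustration; a dict stands in for MeclisItem.

def fresh_item_callback(count):
    """Mimics the question's counters: a new item per call, never handed back."""
    item = {}  # new container, unrelated to the one holding name/party
    item['bill_prop_count'] = count
    # Nothing is returned or yielded, so the count is lost.


def shared_item_callback(count, item):
    """Mimics the fixed counters: the same item is passed in and handed back."""
    item['bill_prop_count'] = count
    return item


mp_item = {'name': 'Example MP', 'party': 'Example Party'}  # made-up values

lost = fresh_item_callback(12)
kept = shared_item_callback(12, mp_item)

print(lost)  # None
print(kept)
```

The second function keeps the name, party and count together because it works on the object it was given instead of creating its own.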

You can chain the three requests and pass the same item along through cb_kwargs, yielding it only in the last callback:

Edit:

import scrapy
from scrapy import Spider
from scrapy.http import Request
# from ..items import MeclisItem
import logging


class MeclisItem(scrapy.Item):
    name = scrapy.Field()
    party = scrapy.Field()
    bill_prop_count = scrapy.Field()
    res_prop_count = scrapy.Field()
    speech_count = scrapy.Field()


class MvSpider(Spider):
    name = 'mv'
    allowed_domains = ['tbmm.gov.tr']
    start_urls = ['https://www.tbmm.gov.tr/Milletvekilleri/liste']

    def parse(self, response):
        mv_list = response.xpath("//ul[@class='list-group list-group-flush']") #taking all MPs listed
        for mv in mv_list:
            item = MeclisItem()
            item['name'] = mv.xpath("./li/div/div/a/text()").get() # MP's name taken
            item['party'] = mv.xpath("./li/div/div[@class='col-md-4 text-right']/text()").get().strip() #MP's party name taken
            partial_link = mv.xpath('.//div[@class="col-md-8"]/a/@href').get()
            full_link = response.urljoin(partial_link)
            yield Request(full_link, callback=self.mv_analysis, cb_kwargs={'item': item})

    def mv_analysis(self, response, item):
        billprop_link_path = response.xpath(".//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()
        billprop_link = response.urljoin(billprop_link_path)
        questionprop_link_path = response.xpath(".//a[contains(text(),'Sahibi Olduğu Yazılı Soru Önergeleri')]/@href").get()
        questionprop_link = response.urljoin(questionprop_link_path)
        speech_link_path = response.xpath(".//a[contains(text(),'Genel Kurul Konuşmaları')]/@href").get()
        speech_link = response.urljoin(speech_link_path)

        yield Request(billprop_link,
                      callback=self.bill_prop_counter,
                      cb_kwargs={'item': item, 'questionprop_link': questionprop_link, 'speech_link': speech_link})  #number of bill proposals to be requested

    # COUNTING FUNCTIONS
    def bill_prop_counter(self, response, item, questionprop_link, speech_link):
        billproposals = response.xpath("//tr[@valign='TOP']")
        item['bill_prop_count'] = len(billproposals)
        yield Request(questionprop_link,
                      callback=self.quest_prop_counter,
                      cb_kwargs={'item': item, 'speech_link': speech_link}) #number of question proposals to be requested

    def quest_prop_counter(self, response, item, speech_link):
        questionproposals = response.xpath("//tr[@valign='TOP']")
        item['res_prop_count'] = len(questionproposals)
        yield Request(speech_link,
                      callback=self.speech_counter,
                      cb_kwargs={'item': item})  #number of speeches to be requested

    def speech_counter(self, response, item):
        speeches = response.xpath("//tr[@valign='TOP']")
        item['speech_count'] = len(speeches)
        yield item
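
The chain above runs the three count requests one after another for each MP. If the three requests should stay parallel, as in the question, one possible alternative is to pass the same item to all three callbacks and yield it only from whichever callback finishes last. The sketch below shows just that bookkeeping in plain Python. REQUIRED_FIELDS and store_count are hypothetical names, not part of Scrapy, and a dict stands in for the item:

```python
# Hypothetical helper: decide when a shared item is complete.
REQUIRED_FIELDS = ('bill_prop_count', 'res_prop_count', 'speech_count')


def store_count(item, field, value):
    """Record one count on the shared item.

    Returns the item once every required field is present, otherwise None.
    """
    item[field] = value
    if all(name in item for name in REQUIRED_FIELDS):
        return item
    return None


# Simulate three responses arriving in arbitrary order for one MP.
item = {'name': 'Example MP', 'party': 'Example Party'}  # made-up values
results = [
    store_count(item, 'speech_count', 40),
    store_count(item, 'bill_prop_count', 12),
    store_count(item, 'res_prop_count', 7),
]

print(results[:2])  # [None, None]
print(results[2] is item)  # True
```

In a spider, each counter callback would receive the item through cb_kwargs={'item': item}, call the helper with its own field, and yield the result only when it is not None. One caveat of this approach: if any of the three requests fails, the item is never yielded unless an errback handles that case.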

Answered By - SuperUser

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
