Thursday, December 2, 2021

Scrapy

Labels: python, scrapy, selenium-webdriver

Issue

I create a Scrapy Selector object from driver.page_source. I had trouble iterating over and accessing the page data directly, so I wrapped it in a Selector to be able to loop over the rows with a for loop. If this is avoidable, please let me know.

The issue is that I need to access some of the data inside that Selector object, specifically the id attribute of the element. Whenever I call a method such as get_attribute or data on the object, it fails with an error like: 'Selector' object has no attribute 'data'

I tried several other ways of accessing it, such as subscripting the value directly with ['id']. That did not work either.

Does anyone have any idea how to access this data or perhaps rework my code to make it accessible?

    import scrapy
    from selenium import webdriver

    # email_spiderPerson is the project's scrapy.Item subclass (its import is not shown here)


    class emails_spider(scrapy.Spider):
        name = 'emails'
        allowed_domains = ["example.com"]
        start_urls = ['example']

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # set up the driver and browser emulation
            self.driver = webdriver.Firefox()

        # start firefox emulator
        def parse(self, response):
            self.driver.get(response.url)
            search = True  # search condition boolean
            iteration = 0  # while loop iteration counter

            # while there is a next page to click on
            while True:
                # try to get the next page content
                # yield { 'person': self.driver.page_source }
                # create a Selector object for easy access in the for loop
                sel = scrapy.Selector(text=self.driver.page_source)
                # iterate over each tr element in the path
                for person in sel.xpath("//table[@class='rgMasterTable rgClipCells']/tbody/tr"):

                    # instantiate an email_spiderPerson object and set all values from person
                    item = email_spiderPerson()
                    item['name'] = person.xpath("td[1]/text()").extract()
                    item['city'] = person.xpath("td[2]/text()").extract()
                    item['state'] = person.xpath("td[3]/text()").extract()
                    item['country'] = person.xpath("td[4]/text()").extract()
                    item['phone'] = person.xpath("td[5]/text()").extract()
                    item['website'] = person.xpath("td[6]/text()").extract()
                    item['cred'] = person.xpath("td[7]/text()").extract()

                    # code chunk below: click on the current tr element to go to its page,
                    # retrieve the email, then return and continue the loop
                    # This part is a problem: it must not use a manual index ([1]); it should
                    # pick the current row automatically. Use the person object?
                    email_path = self.driver.find_element_by_xpath("//table[@class='rgMasterTable rgClipCells']/tbody/tr[1]")
                    #WebDriverWait(self.driver, 1000)
                    self.driver.execute_script("arguments[0].setAttribute('class','rgRow rgHoveredRow')", email_path)
                    div_click = self.driver.find_element_by_xpath("//div[@class='RadGrid RadGrid_MXDefault']")
                    #self.driver.execute_script("arguments[0].scrollIntoView();", email_path2)
                    div_click.click()

                    email = scrapy.Selector(text=self.driver.page_source)
                    email_value = email.xpath("//div[@class='GlobalFindAccountTemplate_MXDefault']/a").extract()
                    # this is the line that fails: Selector has no attribute 'data'
                    item['email'] = person.data('id')
                    self.driver.execute_script("window.history.go(-1)")

                    yield item

                # if first time then click search / else click next button
                if search:
                    next_url = self.driver.find_element_by_xpath("//fieldset[@class='buttons']/input[@value='Search']")
                    search = False
                else:
                    next_url = self.driver.find_element_by_xpath("//ul[@class='pagination']/li[@class='next']/a")
                try:
                    next_url.click()
                    iteration += 1
                except Exception:
                    # stop paging if the click fails
                    break
                if iteration >= 3:
                    break
            self.driver.close()

Also, you may notice that I set item['email'] to person.data('id'). I was only trying to get the id. When I set it to person instead, the XML output is the following:

<email>&lt;Selector xpath="//table[@class='rgMasterTable rgClipCells']/tbody/tr" data='&lt;tr class="rgRow" id="dnn_ctr1604_Fin...'&gt;</email>

That is the printed form of the Selector object person, escaped for XML.

Solution

If you want an id attribute of the person Selector:

    item['email'] = person.xpath('./@id').extract_first()

Answered By - gangabass

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.


[FIXED] Accessing attributes / data of Selenium "Selector" object, Python / Scrapy
