Issue
I made a simple Scrapy spider that scrapes data from https://www.jobs2careers.com using items and exports the data to a CSV file. But I have one problem: when I put in more than one URL, the data appear to be overwritten.
I tried some other Python libraries, like openpyxl. Maybe there is a problem with running multiple spiders.
import scrapy
from scrapy.selector import Selector
from ..items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    n = 1
    start_urls = ['https://www.jobs2careers.com/results3.php?q=Fashion&l=Miami%2C+FL&s=00']

    def parse(self, response):
        items = QuotetutorialItem()
        s = Selector(response)
        quote = s.xpath("//div[@id='jobs_container']/div")
        for q in quote:
            url = response.url
            industry = q.xpath("//input[@id='search_q']/@value").get()
            state = q.xpath("//*[@id='results_controller']/div[1]/div[1]/div/div/div/div/a[3]/text()").get()
            company_name = q.xpath(".//div[@class='companyname']/span[@class='company']/text()").get()
            job_title = q.xpath(".//div[@class='title title1 hidden-xs']/text()").get()
            items['url'] = url
            items['industry'] = industry
            items['state'] = state
            items['company_name'] = company_name
            items['job_title'] = job_title
            yield items

        num = int(response.xpath("//h1[@class='result2']//text()").get().split("\n")[0].replace(',', ''))
        if num > 1000:
            num = 1000
        total = num // 10 + 1
        np = response.url
        np = np[np.rfind('=') + 1:]
        next_page = response.url.replace(np, str(self.n * 10))
        if self.n < total:
            self.n += 1
            yield response.follow(next_page, callback=self.parse)
Solution
Data are not being overwritten here. You are getting about 1,000 items in total, rather than 1,000 per URL, because you're using the spider-level attribute self.n to limit pagination.

Scrapy schedules a parse callback for each start_url, and those callbacks run concurrently, all incrementing the same self.n attribute. The first URL moves self.n from 1 to 2, then the second moves it from 2 to 3, then the first from 3 to 4, and so on. Because the crawl is asynchronous, this exact interleaving isn't guaranteed, but something like it happens every time: the shared counter reaches total roughly twice as fast, so each URL stops paginating about halfway through its results.
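One way to avoid this is to drop the shared counter entirely and derive the next offset from each response's own URL, so every start URL paginates independently. The following is a minimal sketch, not part of the original answer; it assumes the s query parameter is the result offset and advances in steps of 10, as the original code implies.

import scrapy
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # Multiple start URLs are now safe: there is no spider-level page counter.
    start_urls = [
        'https://www.jobs2careers.com/results3.php?q=Fashion&l=Miami%2C+FL&s=00',
    ]

    def parse(self, response):
        # ... extract and yield items exactly as before ...

        num = int(response.xpath("//h1[@class='result2']//text()").get()
                  .split("\n")[0].replace(',', ''))
        num = min(num, 1000)

        # Read the current offset from this response's own URL instead of
        # shared spider state, so each URL's pagination is independent.
        parts = urlsplit(response.url)
        query = parse_qs(parts.query)
        offset = int(query.get('s', ['0'])[0])
        next_offset = offset + 10

        if next_offset < num:
            query['s'] = [str(next_offset)]
            next_page = urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
            yield response.follow(next_page, callback=self.parse)

Alternatively, the current page number could be passed along explicitly via cb_kwargs in response.follow, which likewise keeps pagination state per-request rather than per-spider.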
Answered By - pwinz