Wednesday, February 7, 2024

[FIXED] Spider only crawling the last url, but not all

Tags: scrapy, url, web-scraping

Issue

I want to scrape multiple URLs stored in a CSV file using Scrapy. My code runs without errors, but it only scrapes the last URL instead of all of them. Please tell me what I'm doing wrong. I want to scrape all the URLs and save the scraped text together. I have already tried many of the suggestions found on Stack Overflow. My code:

import scrapy
from scrapy import Request
from ..items import personalprojectItem


class ArticleSpider(scrapy.Spider):
    name = 'articles'
    with open('C:\\Users\\Admin\\Documents\\Bhavya\\input_urls.csv') as file:
        for line in file:
            start_urls = line

            def start_requests(self):
                request = Request(url=self.start_urls)
                yield request

        def parse(self, response):
            item = personalprojectItem()
            article = response.css('div p::text').extract()
            item['article'] = article
            yield item

Solution
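
For context, the spider in the question most likely stops at the last URL because of how its class body is laid out: the for loop rebinds start_urls to a single string on every pass, so only the final line of the file survives, and start_requests and parse are indented inside the loop and the with block rather than at class level. Scrapy expects start_urls to be a list of URLs. The sketch below reproduces the effect in plain Python without Scrapy; the class names and example.com URLs are hypothetical stand-ins, not code from the question.

```python
import io

# Stand-in for the CSV file in the question (hypothetical URLs).
SAMPLE = "https://example.com/a\nhttps://example.com/b\nhttps://example.com/c\n"


class BrokenSpider:
    # Mirrors the question: each pass rebinds the same class attribute,
    # so only the value from the final line is left when the loop ends.
    for line in io.StringIO(SAMPLE):
        start_urls = line


class FixedSpider:
    # Collect every line into a list, stripping the trailing newline
    # and skipping blank lines.
    start_urls = [line.strip() for line in io.StringIO(SAMPLE) if line.strip()]


print(repr(BrokenSpider.start_urls))  # a single string: the last line only
print(FixedSpider.start_urls)         # a list with all three URLs
```

The broken variant ends up with one string (including its trailing newline), while the fixed variant holds every URL in a list.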

Below is a minimal example of how to load a list of URLs from a file in a Scrapy project.

We have a text file, urls_list.txt, inside the Scrapy project folder, containing the following links:

https://www.theguardian.com/technology/2022/nov/18/elon-musk-twitter-engineers-workers-mass-resignation
https://www.theguardian.com/world/2022/nov/18/iranian-protesters-set-fire-to-ayatollah-khomeinis-ancestral-home
https://www.theguardian.com/world/2022/nov/18/canada-safari-park-shooting-animals-two-charged

The spider code looks like this (again, a minimal example):

import scrapy


class GuardianSpider(scrapy.Spider):
    name = 'guardian'
    allowed_domains = ['theguardian.com']
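    # Read one URL per line; strip the trailing newline and skip blank lines.
    with open('urls_list.txt', 'r') as f:
        start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):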
        title = response.xpath('//h1/text()').get()
        header = response.xpath('//div[@data-gu-name="standfirst"]//p/text()').get()
        yield {
            'title': title,
            'header': header
        }

If we run the spider with scrapy crawl guardian -o guardian_news.json, we get a JSON file looking like this:

[
{"title": "Elon Musk summons Twitter engineers amid mass resignations and puts up poll on Trump ban", "header": "Reports show nearly 1,200 workers left company after demand for \u2018long hours at high intensity\u2019, while Musk starts poll on whether to reinstate Donald Trump"},
{"title": "Iranian protesters set fire to Ayatollah Khomeini\u2019s ancestral home", "header": "Social media images show what is now a museum commemorating the Islamic Republic founder ablaze as protests continue"},
{"title": "Two Canadian men charged with shooting animals at safari park", "header": "Mathieu Godard and Jeremiah Mathias-Polson accused of breaking into Parc Omega in Quebec and killing three wild boar and an elk"}
]
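
The \u2018 and \u2019 sequences in this output are JSON escapes for curly quotation marks, not corrupted text. A small sketch, using one title copied from the output above, shows that they decode back to the original characters when the file is parsed:

```python
import json

# One entry copied from the output above, kept as raw text so the
# backslash escape reaches the JSON parser unchanged.
raw = r'[{"title": "Iranian protesters set fire to Ayatollah Khomeini\u2019s ancestral home"}]'

items = json.loads(raw)
title = items[0]["title"]

# The escape is now a real right single quotation mark (U+2019).
print(title)
```

Scrapy also has a FEED_EXPORT_ENCODING setting; setting it to 'utf-8' should write such characters unescaped, though it is worth checking the documentation for the Scrapy version in use.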

Scrapy documentation can be found here: https://docs.scrapy.org/en/latest/
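
The question keeps its URLs in a CSV file rather than a plain text file. A minimal sketch of a loader for that case follows, under the assumption that each row holds the URL in its first column; load_urls is a hypothetical helper name, not part of Scrapy, and the header and blank-row handling are guesses about what such a file might contain.

```python
import csv


def load_urls(path):
    """Return the URLs found in the first column of a CSV file.

    Assumption: one URL per row, in the first column. Rows that are
    empty or do not start with http:// or https:// (for example a
    header row) are skipped.
    """
    urls = []
    # utf-8-sig also accepts files saved with a byte order mark,
    # which some spreadsheet tools add.
    with open(path, newline='', encoding='utf-8-sig') as f:
        for row in csv.reader(f):
            if not row:
                continue
            url = row[0].strip()
            if url.startswith(('http://', 'https://')):
                urls.append(url)
    return urls
```

In a spider, the result could then be assigned in the class body, for example start_urls = load_urls('input_urls.csv'), so that Scrapy schedules a request for every URL in the file.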

Answered By - Barry the Platipus

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
