Friday, December 1, 2023

[FIXED] Scrapy not saving scraped items

Labels: python, python-3.x, scrapy

Issue

Here is how I start my spider:

    with resources.path(SCRAPING_MANIFESTS['systems'], 'manifest.jl') as path:
        process = CrawlerProcess({
            'FEEDS': {
                path: {
                    'format': 'jsonlines',
                    'overwrite': True,
                    'indent': 4
                }
            }
        })

        process.crawl(SystemsManifestSpider, prev_hashes=prev_list)
        process.start()

I verified in the debugger that the path is correct and points to an existing local file, which I have set to be overwritten.

When I debug my spider, I see my item is being successfully populated. I populate it as such:

class MyCustomItem(scrapy.Item):
    my_field = scrapy.Field()


# inside my spider class defined as SystemsManifestSpider(scrapy.Spider):
def parse(self, response, **kwargs):
    while True:
    
        ...

        my_item = MyCustomItem()
        my_item['my_field'] = "test"

        ...

        print(my_item)  # prints the dictionary with 'my_field' correctly populated
        yield my_item

And the debug output after letting it run a while: 'item_scraped_count': 9

The spider correctly accesses the webpage and scrapes the data (verified in the debugger), but after the item is yielded, nothing is saved to my manifest.jl.

However, when I modify it to this:

process = CrawlerProcess({
    'FEEDS': {
        Path("./manifest.jsonlines"): {
            'format': 'jsonlines',
            'overwrite': True,
            'indent': 4
        }
    }
})

It correctly creates a new file and saves it in the directory where I run the code. Both paths lead to a real file and directory.

But again when I specify an absolute path, it doesn't work:

process = CrawlerProcess({
    'FEEDS': {
        Path("C:\\Users\\xxxx\\PycharmProjects\\xxxx\\src\\python_scrapper_nf\\xxxx\\manifest.jsonlines"): {
            'format': 'jsonlines',
            'overwrite': True,
            'indent': 4
        }
    }
})

The absolute path above points to the same file as Path("./manifest.jsonlines").

Note: It seems all relative paths work, but absolute paths never do.

Solution

Yes, @Alexander is completely right.

A fix may land in a future Scrapy version; you can follow it at https://github.com/scrapy/scrapy/pull/5971. With that change, Windows paths are handled better, and if Scrapy cannot determine the correct storage, it falls back to file storage, which would work in your case.
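A rough way to see why only the absolute path might be affected, assuming Scrapy picks the feed storage from the URI scheme of each FEEDS key (an assumption about the affected versions, not something stated above): Python's standard URL parser reads a Windows drive letter as a scheme. The paths below are hypothetical stand-ins, since the real one is redacted in the question.

```python
from urllib.parse import urlparse

# Hypothetical example paths; the real absolute path in the question is redacted.
relative = "./manifest.jsonlines"
absolute = r"C:\Users\me\project\manifest.jsonlines"

# A relative path has no scheme, so it can be treated as a plain local file.
relative_scheme = urlparse(relative).scheme
# The drive letter before the colon is parsed as a one-letter scheme.
absolute_scheme = urlparse(absolute).scheme

print(repr(relative_scheme))
print(repr(absolute_scheme))
```

If the storage lookup is keyed on that scheme, the relative path maps to local file storage, while the drive letter matches no known storage.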

Also, if you prepend file:/// to your path, I think it might work.
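A minimal sketch of that workaround, not tested against a real Scrapy run: build a file:/// URI from the Windows path and use it as the FEEDS key. The helper name to_file_uri and the example path are hypothetical; the feed options are copied from the question.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse


def to_file_uri(windows_path: str) -> str:
    """Turn an absolute Windows path into a file:/// URI (hypothetical helper).

    Backslashes become forward slashes; special characters such as spaces
    are NOT percent-encoded here.
    """
    return "file:///" + PureWindowsPath(windows_path).as_posix()


# Hypothetical path; the real one in the question is redacted.
feed_uri = to_file_uri(r"C:\Users\me\project\manifest.jsonlines")

# Same feed options as in the question, keyed by the URI instead of a Path.
feeds_setting = {
    feed_uri: {
        "format": "jsonlines",
        "overwrite": True,
        "indent": 4,
    }
}

print(feed_uri)
print(urlparse(feed_uri).scheme)
```

The resulting dictionary would be passed as the 'FEEDS' value to CrawlerProcess, as in the question. On Windows, calling as_uri() on an absolute pathlib.Path yields the same kind of URI and also percent-encodes special characters.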

Answered By - Leandro Rodrigues de Souza

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
