Issue
Here is how I start my spider:
with resources.path(SCRAPING_MANIFESTS['systems'], 'manifest.jl') as path:
    process = CrawlerProcess({
        'FEEDS': {
            path: {
                'format': 'jsonlines',
                'overwrite': True,
                'indent': 4
            }
        }
    })
    process.crawl(SystemsManifestSpider, prev_hashes=prev_list)
    process.start()
I verified in the debugger that the path is correct and points to an existing local file, which I have set to be overwritten.
When I debug my spider, I can see that my item is populated successfully. I populate it like this:
import scrapy

class MyCustomItem(scrapy.Item):
    my_field = scrapy.Field()

# inside my spider class defined as SystemsManifestSpider(scrapy.Spider):
def parse(self, response, **kwargs):
    while True:
        ...
        my_item = MyCustomItem()
        my_item['my_field'] = "test"
        ...
        print(my_item)  # prints the dictionary with 'my_field' correctly populated
        yield my_item
And the debug output after letting it run a while:
'item_scraped_count': 9
It correctly accesses the webpage and scrapes the data (verified in debug), but after the item is yielded it doesn't save anything to my manifest.jl.
However, when I modify it to this:
process = CrawlerProcess({
    'FEEDS': {
        Path("./manifest.jsonlines"): {
            'format': 'jsonlines',
            'overwrite': True,
            'indent': 4
        }
    }
})
It correctly creates a new file and saves it in the directory where I run the code. Both paths lead to a real file and directory.
But again, when I specify an absolute path, it doesn't work:
process = CrawlerProcess({
    'FEEDS': {
        Path("C:\\Users\\xxxx\\PycharmProjects\\xxxx\\src\\python_scrapper_nf\\xxxx\\manifest.jsonlines"): {
            'format': 'jsonlines',
            'overwrite': True,
            'indent': 4
        }
    }
})
The path above points to the same file as Path("./manifest.jsonlines").
Note: It seems all relative paths work, but absolute paths never do.
Solution
Yes, @Alexander is completely right.
A fix may land in a future version of Scrapy; you can follow it here: https://github.com/scrapy/scrapy/pull/5971. With it, Windows paths will be handled better, and if Scrapy cannot determine the correct storage, it will default to file storage, which will work in your case.
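To illustrate why the storage might not be determined for an absolute Windows path: as far as I know, Scrapy picks the feed storage backend from the URI scheme of the feed key, and a standard URL parser reads the drive letter as a scheme. This is only a sketch of that parsing step using the standard library, not Scrapy's own code, and the paths are hypothetical stand-ins shaped like the ones in the question.

```python
from urllib.parse import urlparse

# Hypothetical paths for illustration only; the absolute one mirrors
# the shape of the Windows path in the question.
relative_path = "./manifest.jsonlines"
absolute_path = "C:\\Users\\xxxx\\manifest.jsonlines"

# A relative path has no scheme, so it can fall through to plain file storage.
relative_scheme = urlparse(relative_path).scheme

# The drive letter "C:" is parsed as the URI scheme "c".
absolute_scheme = urlparse(absolute_path).scheme

print(repr(relative_scheme))  # ''
print(repr(absolute_scheme))  # 'c'
```

If the scheme lookup works the way I assume, there is no storage registered for a scheme named "c", which would match the symptom that relative paths work while absolute ones do not.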
Also, if you prepend file:/// to your path, I think it might work. :)
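A minimal sketch of the file:/// idea, assuming the feed key can be given as a URI string. The helper name build_feeds and the shortened path are hypothetical; PureWindowsPath is used here only so the snippet behaves the same on any operating system, and I have not verified the result against every Scrapy version.

```python
from pathlib import PureWindowsPath


def build_feeds(path_str):
    """Return a FEEDS-style dict keyed by a file:/// URI (hypothetical helper)."""
    # as_uri() turns an absolute Windows path into a file:/// URI
    # with forward slashes, so the drive letter is no longer the scheme.
    uri = PureWindowsPath(path_str).as_uri()
    return {
        uri: {
            "format": "jsonlines",
            "overwrite": True,
            "indent": 4,
        }
    }


# Hypothetical absolute path, shaped like the one in the question.
feeds = build_feeds("C:\\Users\\xxxx\\manifest.jsonlines")
feed_uri = next(iter(feeds))
print(feed_uri)  # file:///C:/Users/xxxx/manifest.jsonlines
```

The resulting dict could then be passed as CrawlerProcess({'FEEDS': feeds}). On Windows itself, Path(...).as_uri() should produce the same kind of URI.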
Answered By - Leandro Rodrigues de Souza