Issue
I am trying to set up a Scrapy spider inside a Django app that reads info from a page and posts it to Django's SQLite database using DjangoItems. Right now the scraper itself seems to be working, but nothing is added to the database. My guess is that this happens because Scrapy is not enabling any item pipelines. Here is the log:
2019-10-05 15:23:07 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: scrapybot)
2019-10-05 15:23:07 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 19:29:22) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2019-10-05 15:23:07 [scrapy.crawler] INFO: Overridden settings: {}
2019-10-05 15:23:07 [scrapy.extensions.telnet] INFO: Telnet Password: 6e614667b3cf5a1a
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-10-05 15:23:07 [scrapy.core.engine] INFO: Spider opened
2019-10-05 15:23:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-10-05 15:23:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-10-05 15:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.barbora.lv/produkti/biezpiena-sierins-karums-vanilas-45-g> (referer: None)
2019-10-05 15:23:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.barbora.lv/produkti/biezpiena-sierins-karums-vanilas-45-g>
{'product_title': ['Biezpiena sieriņš KĀRUMS vaniļas 45g']}
2019-10-05 15:23:08 [scrapy.core.engine] INFO: Closing spider (finished)
2019-10-05 15:23:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 259,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 15402,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.418066,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 10, 5, 12, 23, 8, 417204),
'item_scraped_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 10, 5, 12, 23, 7, 999138)}
2019-10-05 15:23:08 [scrapy.core.engine] INFO: Spider closed (finished)
As you can see, the scraper returns the expected value, "{'product_title': ['Biezpiena sieriņš KĀRUMS vaniļas 45g']}", but it is apparently not passed into the pipeline because no pipelines are loaded (note the empty "Enabled item pipelines: []" line, and also "Overridden settings: {}").
I have spent several hours looking at different tutorials and trying to fix the issue, but have had no luck so far. Is there anything else I might have forgotten when setting up the scraper? Maybe it has something to do with the file structure of the project.
Here are relevant files.
items.py
from scrapy_djangoitem import DjangoItem
from product_scraper.models import Scrapelog


class ScrapelogItem(DjangoItem):
    django_model = Scrapelog
pipelines.py
class ProductInfoPipeline(object):
    def process_item(self, item, spider):
        item.save()  # DjangoItem.save() writes the item to the Django database
        return item  # process_item should return the item, not yield it
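For context, Scrapy passes each scraped item through every enabled pipeline in priority order, and each process_item must return the item (or raise DropItem) so that the next pipeline and the engine's bookkeeping receive it. A minimal pure-Python sketch of that flow (no Scrapy imports; the class and function names here are illustrative, not Scrapy API):

```python
class CollectPipeline:
    """Stand-in for ProductInfoPipeline: records items instead of saving to Django."""

    def __init__(self):
        self.saved = []

    def process_item(self, item, spider):
        self.saved.append(item)  # stands in for item.save()
        return item  # returning the item lets the next pipeline receive it


def run_pipelines(item, pipelines, spider=None):
    """Simplified version of how the engine threads one item through the pipelines."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


pipe = CollectPipeline()
result = run_pipelines({"product_title": ["Biezpiena sieriņš KĀRUMS vaniļas 45g"]}, [pipe])
```

With an empty ITEM_PIPELINES (as in the log above), the loop simply has nothing to run, so the item is scraped but never saved.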
settings.py
BOT_NAME = 'scraper'

SPIDER_MODULES = ['scraper.spiders']
NEWSPIDER_MODULE = 'scraper.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scraper.pipelines.ProductInfoPipeline': 300,
}
spider product_info.py:
import scrapy
from product_scraper.scraper.scraper.items import ScrapelogItem


class ProductInfoSpider(scrapy.Spider):
    name = 'product_info'
    allowed_domains = ['www.barbora.lv']
    start_urls = ['https://www.barbora.lv/produkti/biezpiena-sierins-karums-vanilas-45-g']

    def parse(self, response):
        item = ScrapelogItem()
        item['product_title'] = response.xpath('//h1[@itemprop="name"]/text()').extract()
        return item  # return the populated item directly
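As an aside, the XPath in parse just pulls the h1 product name by its itemprop attribute. A rough stdlib-only illustration of the same idea (xml.etree handles only well-formed XML, unlike Scrapy's selectors, so the snippet uses a simplified, hypothetical fragment in place of the real page):

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed fragment standing in for the real product page
html = """
<html><body>
  <h1 itemprop="name">Biezpiena sieriņš KĀRUMS vaniļas 45g</h1>
</body></html>
"""

root = ET.fromstring(html)
# Same idea as response.xpath('//h1[@itemprop="name"]/text()')
titles = [el.text for el in root.iter("h1") if el.get("itemprop") == "name"]
```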
Project file structure:
Solution
After further tinkering and research I found out that my settings file was not properly configured (although that was only part of the problem). Based on other resources, I added these lines, which link the Django project settings with the scraper's settings:
import os
import sys

import django

sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".."))
os.environ['DJANGO_SETTINGS_MODULE'] = 'broccoli.settings'
django.setup()
After that, the spider still did not run, but this time it at least gave an error message about not finding the Django settings module. I don't remember the exact wording, but it was something like: "broccoli.settings MODULE NOT FOUND".
After some experimenting I found that moving the scraper directory "scraper" from inside the app "product_scraper" to the same level as the other apps dealt with this issue, and everything worked.
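Since the original file-structure screenshot is missing, here is one plausible final layout after the move; the directory names are taken from the code above, but the exact nesting (and the root folder name) is an assumption:

```
broccoli_project/
├── manage.py
├── broccoli/              # Django project package (name from DJANGO_SETTINGS_MODULE)
│   └── settings.py
├── product_scraper/       # Django app with the Scrapelog model
│   └── models.py
└── scraper/               # Scrapy project, now at the same level as the apps
    ├── scrapy.cfg
    └── scraper/
        ├── settings.py    # contains the sys.path / django.setup() lines
        ├── pipelines.py
        ├── items.py
        └── spiders/
            └── product_info.py
```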
Answered By - Elvijs