Issue
I created a simple Scrapy project that scrapes a web page and saves the data to PostgreSQL. I can get all the scraped data in my parse method, but the pipeline does not get called to save the data to the database. Here is my spider's parse method:
def parse(self, response):
    links = response.css('a::attr(href)').getall()
    if links is not None:
        for link in links:
            yield response.follow(link, callback=self.parse)
    else:
        loader = ItemLoader(item=TestItem(), selector=response)
        quote = response.css('div.quote p::text').get()
        loader.add_value('quote', title)
        yield loader.load_item()
Here is the TestItem:
class TestItem(scrapy.Item):
    quote = scrapy.Field()
Here is the pipeline:
class TestPipeline:
    def process_item(self, item, spider):
        logging.log(logging.INFO, item)
        print(item)
        quote = Quote(text=item.quote)
        db.session.add(quote)
        db.session.commit()
        return item
Finally, the pipeline is registered in settings:
ITEM_PIPELINES = {
    'Test.pipelines.TestPipeline': 300,
}
Any help is welcome.
I printed the item in the pipeline. Here is the output:
2021-07-16 09:54:33 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: Quote)
2021-07-16 09:54:33 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-16 09:54:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-07-16 09:54:33 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'Quote',
'NEWSPIDER_MODULE': 'Quote.spiders',
'SPIDER_MODULES': ['Quote.spiders']}
2021-07-16 09:54:33 [scrapy.extensions.telnet] INFO: Telnet Password: 88264e27a8108b1f
2021-07-16 09:54:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2021-07-16 09:54:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-07-16 09:54:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-07-16 09:54:34 [scrapy.middleware] INFO: Enabled item pipelines:
['Quote.pipelines.QuotePipeline']
2021-07-16 09:54:34 [scrapy.core.engine] INFO: Spider opened
2021-07-16 09:54:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-07-16 09:54:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2021-07-16 09:54:34 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.azquotes.com/quotes/topics/inspirational.html/> from <GET http://azquotes.com/quotes/topics/inspirational.html/>
2021-07-16 09:54:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.azquotes.com/quotes/topics/inspirational.html/> (referer: None)
2021-07-16 09:54:36 [scrapy.core.engine] INFO: Closing spider (finished)
2021-07-16 09:54:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 494,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 23417,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 2.771838,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 7, 16, 0, 54, 36, 828890),
'httpcompression/response_bytes': 107443,
'httpcompression/response_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2021, 7, 16, 0, 54, 34, 57052)}
2021-07-16 09:54:36 [scrapy.core.engine] INFO: Spider closed (finished)
Solution
It is being called.
import scrapy
from scrapy.loader import ItemLoader

from ..items import TestItem


class Something(scrapy.Spider):
    name = "something"
    start_urls = ['http://azquotes.com/quotes/topics/inspirational.html']

    def parse(self, response):
        links = response.css('a::attr(href)').getall()
        # if links is not None:
        #     for link in links:
        #         yield response.follow(link, callback=self.parse)
        # else:
        loader = ItemLoader(item=TestItem(), selector=response)
        quote = response.css('div.quote p::text').get()
        title = "some_title"
        loader.add_value('quote', title)
        yield loader.load_item()
This will print the following (I added print("called") to the process_item function):
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.azquotes.com/quotes/topics/inspirational.html> (referer: None)
[root] INFO: {'quote': ['some_title']}
{'quote': ['some_title']}
called
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.azquotes.com/quotes/topics/inspirational.html>
{'quote': ['some_title']}
When you yield an item, the process_item function is called, but you're getting some errors because you try to follow some non-existing pages (like <a href="#">).
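If you do want to keep following links from parse yourself, one way to avoid those errors is to skip hrefs that don't point at real pages before calling response.follow. This is only a sketch, not part of the original answer, and the filter conditions are assumptions you may need to adjust for the site:
import scrapy


class Something(scrapy.Spider):
    name = "something"
    start_urls = ['https://www.azquotes.com/quotes/topics/inspirational.html']

    def parse(self, response):
        for link in response.css('a::attr(href)').getall():
            # Skip "#" anchors and javascript:/mailto: hrefs that are not
            # real pages (assumed conditions; adjust them for your site).
            if not link or link.startswith(('#', 'javascript:', 'mailto:')):
                continue
            yield response.follow(link, callback=self.parse)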
From what I can see, you're trying to scrape all the quotes from the site. If that's correct, then this is what you need to do:
- Make a function that will scrape all the quotes from just one of the pages.
- Create crawling rules for the links you want to follow (see the Scrapy documentation on crawling rules).
- Change to a crawl spider:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import TestItem


class Something(CrawlSpider):
    name = "something"
    start_urls = ['http://azquotes.com/quotes/topics/inspirational.html']

    rules = (Rule(LinkExtractor(........), callback='parse_item'),)

    def parse_item(self, response):
        ........
        ........
        ........
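For example, here is a minimal sketch of how that skeleton could be filled in. It is only an illustration: the allowed_domains value, the LinkExtractor allow pattern and the div.quote selector are assumptions based on the selectors in the question, so check them against the actual page markup:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader

from ..items import TestItem


class Something(CrawlSpider):
    name = "something"
    allowed_domains = ['azquotes.com']
    start_urls = ['https://www.azquotes.com/quotes/topics/inspirational.html']

    # Assumption: the topic and pagination pages live under /quotes/topics/;
    # adjust the allow pattern to the links you actually want to follow.
    rules = (
        Rule(LinkExtractor(allow=r'/quotes/topics/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Assumption: each quote sits in a div.quote element, as in the
        # selector from the question.
        for quote in response.css('div.quote'):
            loader = ItemLoader(item=TestItem(), selector=quote)
            loader.add_css('quote', 'p::text')
            yield loader.load_item()
With follow=True the CrawlSpider keeps following the extracted links for you, so parse_item only has to extract the quotes on each page and yield the items, and every yielded item goes through process_item in the pipeline.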
Answered By - SuperUser