Issue
I am practicing with the Scrapy web-crawling package and have a two-part question, because I am struggling a bit with what to do next:
I have a script called spider4Techcrunch.py which contains the following code:

import scrapy
from scrapy import cmdline

class TCSpider(scrapy.Spider):
    name = "techcrunch"

    def start_requests(self):
        urls = [
            "https://techcrunch.com/"
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        SET_SELECTOR = ".post-block__title"
        output = "--BEGIN OUTPUT--"
        print(output)
        for data in response.css(SET_SELECTOR):
            print('--BEGIN DATA--')
            print(data)
            TITLE_SELECTOR = "a ::text"
            URL_SELECTOR = "a ::attr(href)"
            yield {
                'title': data.css(TITLE_SELECTOR).extract_first(),
                'url': data.css(URL_SELECTOR).extract_first(),
            }

scrapy.cmdline.execute("scrapy runspider spider4Techcrunch.py".split())
When I execute the code, everything works and returns results. Where I am struggling is with the yield statement:

1. How do I extract the text results of the key: value pairs for 'title' and 'url' from the yield statement, so that I can save the results into a text file line by line?

2. Is the last line in my code a proper way to execute it? It just looks weird to me that I am executing the Scrapy code by calling the file from within itself. Is there a best practice/better way? (Keeping in mind, I would like this class to be reusable for multiple URLs.)

scrapy.cmdline.execute("scrapy runspider spider4Techcrunch.py".split())
Solution
The preferred way to run a Scrapy application as a script is described in the docs. You can use one of the built-in feed exporters to write the yielded items to a file. In your case, the solution will look like this (for Scrapy version 2.1):
import scrapy
from scrapy.crawler import CrawlerProcess

class TCSpider(scrapy.Spider):
    name = "techcrunch"

    def start_requests(self):
        urls = [
            "https://techcrunch.com/"
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        SET_SELECTOR = ".post-block__title"
        output = "--BEGIN OUTPUT--"
        print(output)
        for data in response.css(SET_SELECTOR):
            print('--BEGIN DATA--')
            print(data)
            TITLE_SELECTOR = "a ::text"
            URL_SELECTOR = "a ::attr(href)"
            yield {
                'title': data.css(TITLE_SELECTOR).extract_first(),
                'url': data.css(URL_SELECTOR).extract_first(),
            }

process = CrawlerProcess(settings={
    "FEEDS": {
        # the feed exporter writes every yielded item to this file
        "items.json": {"format": "json"},
        # "items.jl": {"format": "jsonlines"},  # alternative: one JSON object per line
    },
})
process.crawl(TCSpider)
process.start()
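If the goal is a plain text file with one record per line, the commented-out jsonlines feed is the closer fit: each yielded item is written as a single JSON object on its own line. A minimal sketch of reading such a feed back, assuming it was written to items.jl:

import json

# read the jsonlines feed produced by the spider, one item per line
with open("items.jl", encoding="utf-8") as feed:
    for line in feed:
        item = json.loads(line)
        print(item["title"], item["url"])

Alternatively, you can keep the run logic out of the file entirely and export the feed from the shell with scrapy runspider spider4Techcrunch.py -o items.jl.

As for reusing the class for multiple URLs: process.crawl() forwards extra keyword arguments to the spider's __init__, so the start URLs can be passed in from the outside instead of being hard-coded. A sketch of that approach; the urls argument name is illustrative, not a Scrapy built-in:

import scrapy
from scrapy.crawler import CrawlerProcess

class TCSpider(scrapy.Spider):
    name = "techcrunch"

    def __init__(self, urls=None, **kwargs):
        super().__init__(**kwargs)
        # "urls" is a hypothetical spider argument, not part of Scrapy itself
        self.urls = urls or []

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for data in response.css(".post-block__title"):
            yield {
                'title': data.css("a ::text").extract_first(),
                'url': data.css("a ::attr(href)").extract_first(),
            }

process = CrawlerProcess(settings={
    "FEEDS": {"items.jl": {"format": "jsonlines"}},
})
process.crawl(TCSpider, urls=["https://techcrunch.com/"])
process.start()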
Answered By - Georgiy