Issue
I'm using scrapy to extract target text in html like this below:
my scrapy code is:
import scrapy
from scrapy.crawler import CrawlerProcess
class MmSpider(scrapy.Spider):
name = 'name'
start_urls = ['file:///Users/saihhold/Desktop/maimai.mht']
def parse(self, response):
for title in response.xpath('//div[@class="media-body"]/div/div[1]'):
yield {
title.xpath('.//text()').getall()
}
if __name__ == "__main__":
process = CrawlerProcess()
process.crawl(MmSpider)
process.start()
then use this command to run it:
scrapy runspider mmspider.py -o mm.jl
but mm.jl file is empty, is there any problem with my code or xpath?
Solution
Your code is okey but xpath selection was incorrect.You can follow the next example how to grab title using xpath.
import scrapy
from scrapy.crawler import CrawlerProcess
class MmSpider(scrapy.Spider):
name = 'name'
start_urls = ['https://www.timeout.com/film/best-movies-of-all-time']
def parse(self, response):
for title in response.xpath('//h3[@class="_h3_cuogz_1"]'):
yield {
'title':title.xpath('.//text()').getall()[-1].replace('\xa0','')
}
if __name__ == "__main__":
process = CrawlerProcess()
process.crawl(MmSpider)
process.start()
Output:
{'title': '2001: A Space Odyssey (1968)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'The Godfather (1972)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Citizen Kane (1941)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Jeanne Dielman, 23, Quai du Commerce, 1080 Bruxelles (1975)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Raiders of the Lost Ark (1981)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'La Dolce Vita (1960)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Seven Samurai (1954)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'In the Mood for Love (2000)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'There Will Be Blood (2007)'}
... so on
Answered By - F.Hoque
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.