Wednesday, June 8, 2022

[FIXED] How to extract specific text with no tag using Python Scrapy? (new problem)

Labels: python, scrapy, web-scraping

Issue

I'm using Scrapy to extract target text from HTML like the following:

My Scrapy code is:

import scrapy
from scrapy.crawler import CrawlerProcess


class MmSpider(scrapy.Spider):
    name = 'name'
    start_urls = ['file:///Users/saihhold/Desktop/maimai.mht']

    def parse(self, response):
        for title in response.xpath('//div[@class="media-body"]/div/div[1]'):
            yield {
                title.xpath('.//text()').getall()
            }

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(MmSpider)
    process.start()

Then I use this command to run it:

scrapy runspider mmspider.py -o mm.jl

However, the mm.jl file is empty. Is there a problem with my code or my XPath?

Solution

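Your code is mostly fine, but the XPath selection is incorrect. Note also that the braces in your yield contain no key, so Python builds a set rather than a dict; give the extracted text a key such as 'title'. The following example shows how to grab a title using XPath.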

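import scrapy
from scrapy.crawler import CrawlerProcess


class MmSpider(scrapy.Spider):
    name = 'name'
    start_urls = ['https://www.timeout.com/film/best-movies-of-all-time']

    def parse(self, response):
        # Select every <h3> element that carries the title class.
        for title in response.xpath('//h3[@class="_h3_cuogz_1"]'):
            # Collect all text nodes under the <h3>; the last one is the title.
            texts = title.xpath('.//text()').getall()
            if not texts:
                continue  # skip headings that contain no text
            yield {
                # Store the text under a key and drop non-breaking spaces.
                'title': texts[-1].replace('\xa0', '')
            }


if __name__ == "__main__":
    # Lets the spider run as a plain Python script.
    process = CrawlerProcess()
    process.crawl(MmSpider)
    process.start()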

Output:

{'title': '2001: A Space Odyssey (1968)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'The Godfather (1972)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Citizen Kane (1941)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Jeanne Dielman, 23, Quai du Commerce, 1080 Bruxelles (1975)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Raiders of the Lost Ark (1981)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'La Dolce Vita (1960)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'Seven Samurai (1954)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'In the Mood for Love (2000)'}
2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
{'title': 'There Will Be Blood (2007)'}

... and so on

Answered By - F.Hoque

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
