Issue
I am setting up my first Scrapy spider, and I'm having some difficulty using XPath to extract certain elements.
My target is http://www.cbooo.cn/m/641515 (a Chinese website similar to Box Office Mojo). I can extract the Chinese name of the film, 阿龙浴血记, with no problem, but I can't figure out how to get the information below it. I believe this is because the HTML is not standard, as discussed here. There are several paragraph elements nested beneath the header.
I have tried the solution in the link above, and also here, to no avail.
def parse(self, response):
    chinesetitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/text()').extract()
    englishtitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/p').extract()
    chinesereleasedate = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[4]').extract()
    productionregions = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[6]').extract()
    chineseboxoffice = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[1]/span/text()[2]').extract()
    yield {
        'chinesetitle': chinesetitle,
        'englishtitle': englishtitle,
        'chinesereleasedate': chinesereleasedate,
        'productionregions': productionregions,
        'chineseboxoffice': chineseboxoffice
    }
When I run the spider in the Scrapy shell, it finds the Chinese title as expected. However, the remaining items return either an empty list ([]) or a jumbled mish-mash of text from the page.
Any advice? This is my first amateur programming project, so I appreciate your patience with my ignorance and your help. Thank you!
EDIT
I tried to implement the text-cleaning method from the comments. The example in the comments worked, but when I tried to reimplement it I got an "AttributeError: 'list' object has no attribute 'split'" (please see the China box office, country of origin, and genre examples below).
def parse(self, response):
    chinesetitle = response.css('.cont h2::text').extract_first()
    englishtitle = response.css('.cont h2 + p::text').extract_first()
    chinaboxoffice = response.xpath('//span[@class="m-span"]/text()[2]').extract_first()
    chinaboxoffice = chinaboxoffice.split('万')[0]
    chinareleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
    chinareleasedate = chinareleasedate.split(':')[1].split('(')[0]
    countryoforigin = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()
    countryoforigin = countryoforigin.split(':')[1]
    genre = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"类型")]/text()').extract_first()
    genre = genre.split(':')[1]
    director = response.xpath('//*[@id="tabcont1"]/dl/dd[1]/p/a/text()').extract()
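For reference, here is a minimal snippet (outside the spider, using a made-up HTML fragment) that I think reproduces the error: .extract() returns a list, while .extract_first() returns a single string.

from scrapy.selector import Selector

# Made-up fragment just to reproduce the error in isolation
sel = Selector(text='<p>类型:剧情</p>')

genre_list = sel.xpath('//p/text()').extract()   # .extract() returns a list of strings
# genre_list.split(':')                          # AttributeError: 'list' object has no attribute 'split'

genre = sel.xpath('//p/text()').extract_first()  # .extract_first() returns a single string (or None)
if genre is not None:
    print(genre.split(':')[1])                   # prints '剧情'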
Solution
Here are some examples from which you can infer the last one. Remember to always use a class or id attribute to identify the HTML element. /div[3]/div[2]/div/div[1]/.. is not good practice.
chinesetitle = response.xpath('//div[@class="ziliaofr"]/div/h2/text()').extract_first()
englishtitle = response.xpath('//div[@class="ziliaofr"]/div/p/text()').extract_first()
chinesereleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
productionregions = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()
To find chinesereleasedate, I took the p element whose text contains '上映时间'. You have to parse this to get the exact value.
To find productionregions, I took the 7th selector from the list, response.xpath('//div[@class="ziliaofr"]/div/p')[6], and selected its text. A better method would be to check whether the text contains '国家及地区', just like above.
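For example, a quick sketch of that contains() approach (the exact label text '国家及地区' and the raw whitespace in the comment are assumptions based on the page):

productionregions = response.xpath(
    '//div[@class="ziliaofr"]/div/p[contains(text(),"国家及地区")]/text()').extract_first()
if productionregions is not None:
    # e.g. '\r\n 国家及地区:中国\r\n ' -> '中国'
    productionregions = productionregions.split(':')[1].strip()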
Edit: To answer the question in the comments,
response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
returns a string like '\r\n 上映时间:2017-7-27(中国)\r\n ', which is not what you are looking for. You can clean it up like:
chinesereleasedate = chinesereleasedate.split(':')[1].split('(')[0]
This gives us the correct date.
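If you also want to drop the surrounding '\r\n ' whitespace, one slightly more defensive variant of the same cleanup (just a sketch, not the only way) is:

chinesereleasedate = response.xpath(
    '//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
if chinesereleasedate is not None:
    # '\r\n 上映时间:2017-7-27(中国)\r\n ' -> '2017-7-27'
    chinesereleasedate = chinesereleasedate.split(':')[1].split('(')[0].strip()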
Answered By - Nihal Sangeeth