Issue
I am setting up my first Scrapy spider, and I'm having some difficulty using XPath to extract certain elements.
My target is http://www.cbooo.cn/m/641515 (a Chinese website similar to Box Office Mojo). I can extract the Chinese name of the film, 阿龙浴血记, with no problem, but I can't figure out how to get the information below it. I believe this is because the HTML is not standard, as discussed here. There are several paragraph elements nested beneath the header.
I have tried the solution in the link above, and also here, to no avail.
def parse(self, response):
    chinesetitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/text()').extract()
    englishtitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/p').extract()
    chinesereleasedate = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[4]').extract()
    productionregions = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[6]').extract()
    chineseboxoffice = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[1]/span/text()[2]').extract()
    yield {
        'chinesetitle': chinesetitle,
        'englishtitle': englishtitle,
        'chinesereleasedate': chinesereleasedate,
        'productionregions': productionregions,
        'chineseboxoffice': chineseboxoffice
    }
When I run the spider in the Scrapy shell, it finds the Chinese title as expected. However, the remaining items return either an empty list ([]) or a jumbled mish-mash of text from the page.
Any advice? This is my first amateur programming project, so I appreciate your patience with my ignorance and your help. Thank you!
EDIT
I tried to implement the text-cleaning method from the comments. The example in the comments worked, but when I tried to reimplement it I got an "AttributeError: 'list' object has no attribute 'split'" (please see the China box office, country of origin, and genre examples below).
def parse(self, response):
    chinesetitle = response.css('.cont h2::text').extract_first()
    englishtitle = response.css('.cont h2 + p::text').extract_first()
    chinaboxoffice = response.xpath('//span[@class="m-span"]/text()[2]').extract_first()
    chinaboxoffice = chinaboxoffice.split('万')[0]
    chinareleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
    chinareleasedate = chinareleasedate.split(':')[1].split('(')[0]
    countryoforigin = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()
    countryoforigin = countryoforigin.split(':')[1]
    genre = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"类型")]/text()').extract_first()
    genre = genre.split(':')[1]
    director = response.xpath('//*[@id="tabcont1"]/dl/dd[1]/p/a/text()').extract()
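For reference, here is a minimal snippet (outside the spider, using a made-up HTML fragment) that I think reproduces the error: .extract() returns a list, while .extract_first() returns a single string.

from scrapy.selector import Selector

# Made-up fragment just to reproduce the error in isolation
sel = Selector(text='<p>类型:剧情</p>')

genre_list = sel.xpath('//p/text()').extract()   # .extract() returns a list of strings
# genre_list.split(':')                          # AttributeError: 'list' object has no attribute 'split'

genre = sel.xpath('//p/text()').extract_first()  # .extract_first() returns a single string (or None)
if genre is not None:
    print(genre.split(':')[1])                   # prints '剧情'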
Solution
Here are some examples from which you can infer the last one. Remember to always use a class or id attribute to identify the HTML element. /div[3]/div[2]/div/div[1]/.. is not good practice.
chinesetitle = response.xpath('//div[@class="ziliaofr"]/div/h2/text()').extract_first()
englishtitle = response.xpath('//div[@class="ziliaofr"]/div/p/text()').extract_first()
chinesereleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
productionregions = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()
To find chinesereleasedate, I took the p element whose text contains '上映时间'. You have to parse this to get the exact value.
To find productionregions, I took the 7th selector from the list, response.xpath('//div[@class="ziliaofr"]/div/p')[6], and selected its text. A better method would be to check whether the text contains '国家及地区', just like above.
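For example, a quick sketch of that contains() approach (the exact label text '国家及地区' and the raw whitespace in the comment are assumptions based on the page):

productionregions = response.xpath(
    '//div[@class="ziliaofr"]/div/p[contains(text(),"国家及地区")]/text()').extract_first()
if productionregions is not None:
    # e.g. '\r\n 国家及地区:中国\r\n ' -> '中国'
    productionregions = productionregions.split(':')[1].strip()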
Edit: To answer the question in the comments,
response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
returns a string like '\r\n 上映时间:2017-7-27(中国)\r\n ', which is not what you are looking for. You can clean it up like:
chinesereleasedate = chinesereleasedate.split(':')[1].split('(')[0]
This gives us the correct date.
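If you also want to drop the surrounding '\r\n ' whitespace, one slightly more defensive variant of the same cleanup (just a sketch, not the only way) is:

chinesereleasedate = response.xpath(
    '//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
if chinesereleasedate is not None:
    # '\r\n 上映时间:2017-7-27(中国)\r\n ' -> '2017-7-27'
    chinesereleasedate = chinesereleasedate.split(':')[1].split('(')[0].strip()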
Answered By - Nihal Sangeeth