Issue
I'm trying to scrape the comments of a video from this site: https://tamasha.com/v/KGbXY. Using Scrapy, I can easily get everything except the body of each individual comment.
from scrapy.selector import Selector

def crawl_comment(self, video_id):
    video_url = f"https://www.tamasha.com/v/{video_id}"
    # response = self.request.get_request(video_url, proxy=self.proxy, timeout=30, headers=None)
    response = self.request.get_request(video_url, timeout=30, headers=None)
    if response.status_code == 404:
        raise VideoNotFoundException()
    comment_information = Selector(text=response.text).xpath(
        '//*[@class="comment-item"]').getall()
    comment_data_list = []
    for comment_info in comment_information:
        video_id = video_id
        author_username = None
        try:
            author_username = Selector(text=comment_info).xpath('string(//*[@class="user-name"])').get()
        except:
            pass
        author_id = None
        try:
            author_id = Selector(text=comment_info).xpath('//*[@class="user-name"]/@href').get()
            author_id = author_id.split('/')[-1]
        except:
            pass
        date = Selector(text=response.text).xpath('//*[@class="comment-time"]/text()').get()
        body = Selector(text=response.text).css('#commentBox > div:nth-child(2) > div.more-comment > p').get
        id = Selector(text=response.text).xpath('//*[@class="comment-item"]/@data-comment-id').get()
        comment_data_list.append({
            'author_username': author_username,
            'author_id': author_id,
            'date': date,
            'body': body,
            'id': id,
            'video_id': video_id
        })
    print(comment_data_list)
I want to get the text of each comment but can't. The code that is supposed to extract it is the one assigned to the body field.
Solution
I don't know the language used on the page, but here's a way to extract the comment text with a CSS selector.
# selector is `Selector(text=response.text)`
selector.css('div.comment-item > .comment .comment-header + p::text').getall()
# explanation of css selector
# >, direct child
# space, descendant
# +, adjacent sibling
Here's the output:
# the second comment seems to be a reply to another comment
['چه خوب بودن همشون', 'عالی بودن', 'nice']
BTW, are you using only the parser rather than Scrapy itself? In Scrapy, Response.xpath() automatically dispatches the call to Selector(response_text).xpath().
If you are not using Scrapy and only want the parser, run pip install parsel and use parsel.Selector. parsel is the parser integrated in Scrapy, and it can be used independently.
Answered By - Simba