Sunday, January 23, 2022

[FIXED] Can't get each <p> tag for every comment

Labels: python, scrapy, web-scraping

Issue

I'm trying to scrape the comments of a video from this site: https://tamasha.com/v/KGbXY. Using Scrapy, I can easily get everything except the body of each individual comment.

from scrapy.selector import Selector

    def crawl_comment(self, video_id):
        video_url = f"https://www.tamasha.com/v/{video_id}"
        # response = self.request.get_request(video_url, proxy=self.proxy, timeout=30, headers=None)
        response = self.request.get_request(video_url, timeout=30, headers=None)
        if response.status_code == 404:
            raise VideoNotFoundException()
        comment_information = Selector(text=response.text).xpath(
            '//*[@class="comment-item"]').getall()
        comment_data_list = []
        for comment_info in comment_information:
            video_id = video_id
            author_username = None
            try:
                author_username = Selector(text=comment_info).xpath('string(//*[@class="user-name"])').get()
            except:
                pass
            author_id = None
            try:
                author_id = Selector(text=comment_info).xpath('//*[@class="user-name"]/@href').get()
                author_id = author_id.split('/')[-1]
            except:
                pass
            date = Selector(text=response.text).xpath('//*[@class="comment-time"]/text()').get()
            body = Selector(text=response.text).css('#commentBox > div:nth-child(2) > div.more-comment > p').get
            id = Selector(text=response.text).xpath('//*[@class="comment-item"]/@data-comment-id').get()
            comment_data_list.append({
                'author_username': author_username,
                'author_id': author_id,
                'date': date,
                'body': body,
                'id': id,
                'video_id': video_id
            })
        print(comment_data_list)

I want to get the text of each comment but can't. The code that is supposed to extract it is the body field.

Solution

I don't know the language used on the page, but here is a way to extract the comment text with a CSS selector.

# selector is `Selector(text=response.text)`
selector.css('div.comment-item > .comment .comment-header + p::text').getall()

# explanation of the CSS selector
# >, direct child
# space, descendant
# +, adjacent sibling
# ::text, the text nodes of the matched element (a Scrapy/parsel extension)

Here's the output

# the second comment seems to be a reply to another comment
['چه خوب بودن همشون', 'عالی بودن', 'nice']

BTW, are you using only the parser rather than Scrapy itself? In Scrapy, response.xpath() is a shortcut for response.selector.xpath(), so you don't need to build a Selector from the response text yourself.

If you only want the parser and not Scrapy, run pip install parsel and use parsel.Selector. parsel is the selector library that Scrapy uses internally, and it can be used on its own.

Answered By - Simba

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
