Issue
I am trying to scrape data from this URL: https://eksisozluk.com/mortingen-sitraze--1277239. I want to scrape the title and then all the comments beneath it. If you open the website you will see that the first comment under the title is (bkz: mortingen). The problem is that (bkz: sits in a div while mortingen sits inside an anchor link within that div, which makes it difficult to scrape the comment as it appears on the website. Can anyone help me with a CSS selector or XPath that can scrape each comment whole? My code is written below, but it gives me (bkz: in one column, then akhisar, and then ) in three separate columns instead of one.
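For context, here is a minimal sketch of why the text fragments (the HTML snippet is an assumption based on the description above, not copied from the live page):

from scrapy.selector import Selector

# hypothetical fragment mirroring the markup described in the question
html = '<div class="content">(bkz: <a href="/mortingen">mortingen</a>)</div>'
sel = Selector(text=html)

# '::text' matches every text node separately, so one comment yields three strings
print(sel.css('.content ::text').getall())   # ['(bkz: ', 'mortingen', ')']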
def parse(self, response):
    data = {}

    title = response.css('[itemprop="name"]::text').get()
    data["title"] = title

    count = 0
    for content in response.css('li .content ::text'):
        text = content.get()
        text = text.strip()
        content = "content" + str(count)
        data[content] = text
        count = count + 1

    yield data
Solution
You should first get all .content elements without ::text and use a for loop to work with every .content separately. For every .content you should then run ::text to get all the text inside that element only, put the pieces in a list, and later join them into a single string.
for count, content in enumerate(response.css('li .content')):
    text = []

    # get all `::text` in the current `.content`
    for item in content.css('::text'):
        item = item.get()  # do not strip single fragments, it would drop spaces between them
        # put it on the list
        text.append(item)

    # join all items into a single string
    text = "".join(text)
    text = text.strip()

    print(count, '|', text)
    data[f"content {count}"] = text
Minimal working code. You can put all the code in one file and run python script.py without creating a Scrapy project.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://eksisozluk.com/mortingen-sitraze--1277239']

    def parse(self, response):
        print('url:', response.url)

        data = {}  # PEP8: spaces around `=`

        title = response.css('[itemprop="name"]::text').get()
        data["title"] = title

        for count, content in enumerate(response.css('li .content')):
            text = []
            for item in content.css('::text'):
                item = item.get()  # do not strip single fragments, it would drop spaces between them
                text.append(item)
            text = "".join(text)
            text = text.strip()
            print(count, '|', text)
            data[f"content {count}"] = text

        yield data

# --- run without creating a project and save results in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save to a CSV, JSON or XML file
    'FEEDS': {'output.csv': {'format': 'csv'}},  # 'FEEDS' is new in Scrapy 2.1
})
c.crawl(MySpider)
c.start()
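A note on the output shape: the spider above yields a single item whose keys are content 0, content 1, and so on, so every comment becomes its own CSV column. If one row per comment is preferred instead, a common Scrapy pattern (a sketch, not part of the original answer) is to yield one item per comment:

def parse(self, response):
    title = response.css('[itemprop="name"]::text').get()
    for count, content in enumerate(response.css('li .content')):
        text = "".join(content.css('::text').getall()).strip()
        # one row per comment: the title is repeated, the comment stays in a single column
        yield {'title': title, 'number': count, 'comment': text}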
EDIT: A little shorter with getall():
for count, content in enumerate(response.css('li .content')):
    text = content.css('::text').getall()
    text = "".join(text)
    text = text.strip()
    print(count, '|', text)
    data[f"content {count}"] = text
Answered By - furas