Issue
I am learning web scraping with Scrapy and I am stuck on the following issue:
Why does the CSV file put all of the data in one row? It should have 8 rows and 4 columns. The columns are fine, but I can't understand why all of the data ends up in a single row.
import scrapy

class MyredditSpider(scrapy.Spider):
    name = 'myreddit'
    allowed_domains = ['reddit.com']
    start_urls = ['http://www.reddit.com/']
    # custom_settings = {
    #     "FEEDS": {"result.csv": {"format": "csv"}}
    # }

    def parse(self, response):
        all_var = response.xpath("//div[@class='rpBJOHq2PR60pnwJlUyP0']")
        for variable in all_var:
            post = variable.xpath("//h3[@class='_eYtD2XCVieq6emjKBH3m']/text()").extract()
            vote = variable.xpath("//div[@class='_1rZYMD_4xY3gRcSS3p8ODO _3a2ZHWaih05DgAOtvu6cIo ']/text()").extract()
            time = variable.xpath("//span[@class='_2VF2J19pUIMSLJFky-7PEI']/text()").extract()
            links = variable.xpath("//a[@data-click-id='body']/@href").extract()
            yield {"Posts": post, "Votes": vote, "Time": time, "Links": links}
I used scrapy crawl myreddit -o items.csv to save the data to CSV. I want a CSV where every value is in its own row, almost like in the image.
Solution
That happens because of the way you are extracting the information. Each of your extract() calls pulls every matching element on the page at once: an XPath that starts with // searches the whole document, even when it is called on a sub-selector, so every yielded item contains the full list of posts, votes, times, and links. If you want the items listed row by row, you need to iterate through the HTML elements row by row as well, and use relative XPaths (starting with .//) inside the loop.

For example, it should look closer to this: it iterates through each of the rows, extracts the information from that single row, yields it, and then moves on to the next one.
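To see why that produces a single CSV row, here is a rough stand-alone illustration using the standard library's csv module (not Scrapy's actual feed exporter, which formats list values slightly differently, but the row-count behavior is the same idea): yielding one dict whose values are whole lists gives one data row, while yielding one dict per item gives one row per item.

```python
import csv
import io

# The original symptom: one dict whose values are entire lists.
single = [{"Posts": ["post a", "post b"], "Votes": ["10", "20"]}]

# The fix: one dict per scraped item.
per_item = [
    {"Posts": "post a", "Votes": "10"},
    {"Posts": "post b", "Votes": "20"},
]

def to_csv(rows):
    """Write dicts to CSV text and return it."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["Posts", "Votes"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(single))    # header plus only 1 data row
print(to_csv(per_item))  # header plus 2 data rows
```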
import scrapy

class MyredditSpider(scrapy.Spider):
    name = 'myreddit'
    allowed_domains = ['reddit.com']
    start_urls = ['http://www.reddit.com/']

    def parse(self, response):
        for row in response.xpath('//div[@class="rpBJOHq2PR60pnwJlUyP0"]/div'):
            post = row.xpath(".//h3[@class='_eYtD2XCVieq6emjKBH3m']/text()").get()
            vote = row.xpath(".//div[@class='_1rZYMD_4xY3gRcSS3p8ODO _3a2ZHWaih05DgAOtvu6cIo ']/text()").get()
            time = row.xpath(".//span[@class='_2VF2J19pUIMSLJFky-7PEI']/text()").get()
            links = row.xpath(".//a[@data-click-id='body']/@href").get()
            yield {"Posts": post, "Votes": vote, "Time": time, "Links": links}
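As an aside, the commented-out FEEDS block from the question can be un-commented to export the results automatically, without passing -o on the command line (the FEEDS setting is available in Scrapy 2.1 and later). Inside the spider class it would look like this:

```python
    custom_settings = {
        "FEEDS": {
            "result.csv": {"format": "csv"},
        },
    }
```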
Answered By - Alexander