Issue
I am learning web scraping with Scrapy and I am stuck on the following issue:
Why does the CSV file put all of the data in one row? It should have 8 rows and 4 columns. The columns are fine, but I can't understand why all of the data ends up in a single row.
import scrapy

class MyredditSpider(scrapy.Spider):
    name = 'myreddit'
    allowed_domains = ['reddit.com']
    start_urls = ['http://www.reddit.com/']
    # custom_settings = {
    #     "FEEDS": {"result.csv": {"format": "csv"}}
    # }

    def parse(self, response):
        all_var = response.xpath("//div[@class='rpBJOHq2PR60pnwJlUyP0']")
        for variable in all_var:
            post = variable.xpath("//h3[@class='_eYtD2XCVieq6emjKBH3m']/text()").extract()
            vote = variable.xpath("//div[@class='_1rZYMD_4xY3gRcSS3p8ODO _3a2ZHWaih05DgAOtvu6cIo ']/text()").extract()
            time = variable.xpath("//span[@class='_2VF2J19pUIMSLJFky-7PEI']/text()").extract()
            links = variable.xpath("//a[@data-click-id='body']/@href").extract()
            yield {"Posts": post, "Votes": vote, "Time": time, "Links": links}
I used scrapy crawl myreddit -o items.csv to save the data to CSV. I want a CSV where every value is in its own row, almost like in the image.
Solution
That happens because of the way you are extracting the information. Each of your extract() calls pulls every matching element on the page at once: an XPath that starts with // searches the whole document, even when it is called on a sub-selector, so every yielded item contains the full list of posts, votes, times, and links. If you want the items listed row by row, you need to iterate through the HTML elements row by row as well, and use relative XPaths (starting with .//) inside the loop.

For example, it should look closer to this: it iterates through each of the rows, extracts the information from that single row, yields it, and then moves on to the next one.
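To see why that produces a single CSV row, here is a rough stand-alone illustration using the standard library's csv module (not Scrapy's actual feed exporter, which formats list values slightly differently, but the row-count behavior is the same idea): yielding one dict whose values are whole lists gives one data row, while yielding one dict per item gives one row per item.

```python
import csv
import io

# The original symptom: one dict whose values are entire lists.
single = [{"Posts": ["post a", "post b"], "Votes": ["10", "20"]}]

# The fix: one dict per scraped item.
per_item = [
    {"Posts": "post a", "Votes": "10"},
    {"Posts": "post b", "Votes": "20"},
]

def to_csv(rows):
    """Write dicts to CSV text and return it."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["Posts", "Votes"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(single))    # header plus only 1 data row
print(to_csv(per_item))  # header plus 2 data rows
```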
import scrapy

class MyredditSpider(scrapy.Spider):
    name = 'myreddit'
    allowed_domains = ['reddit.com']
    start_urls = ['http://www.reddit.com/']

    def parse(self, response):
        for row in response.xpath('//div[@class="rpBJOHq2PR60pnwJlUyP0"]/div'):
            post = row.xpath(".//h3[@class='_eYtD2XCVieq6emjKBH3m']/text()").get()
            vote = row.xpath(".//div[@class='_1rZYMD_4xY3gRcSS3p8ODO _3a2ZHWaih05DgAOtvu6cIo ']/text()").get()
            time = row.xpath(".//span[@class='_2VF2J19pUIMSLJFky-7PEI']/text()").get()
            links = row.xpath(".//a[@data-click-id='body']/@href").get()
            yield {"Posts": post, "Votes": vote, "Time": time, "Links": links}
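As an aside, the commented-out FEEDS block from the question can be un-commented to export the results automatically, without passing -o on the command line (the FEEDS setting is available in Scrapy 2.1 and later). Inside the spider class it would look like this:

```python
    custom_settings = {
        "FEEDS": {
            "result.csv": {"format": "csv"},
        },
    }
```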
Answered By - Alexander