Issue
I am trying to scrape data from this URL: https://eksisozluk.com/mortingen-sitraze--1277239. I want to scrape the title and then all the comments beneath it. If you open the website you will see that the first comment under the title is (bkz: mortingen). The problem is that (bkz: sits in a div while mortingen sits inside an anchor link within that div, which makes it difficult to scrape the comment as it appears on the website. Can anyone help me with a CSS selector or XPath that can scrape each comment whole? My code is written below, but it gives me (bkz: in one column, then akhisar, and then ) in three separate columns instead of one.
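For context, here is a minimal sketch of why the text fragments (the HTML snippet is an assumption based on the description above, not copied from the live page):

from scrapy.selector import Selector

# hypothetical fragment mirroring the markup described in the question
html = '<div class="content">(bkz: <a href="/mortingen">mortingen</a>)</div>'
sel = Selector(text=html)

# '::text' matches every text node separately, so one comment yields three strings
print(sel.css('.content ::text').getall())   # ['(bkz: ', 'mortingen', ')']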
def parse(self, response):
    data = {}

    title = response.css('[itemprop="name"]::text').get()
    data["title"] = title

    count = 0
    for content in response.css('li .content ::text'):
        text = content.get()
        text = text.strip()
        content = "content" + str(count)
        data[content] = text
        count = count + 1

    yield data
Solution
You should first get all .content elements without ::text and use a for loop to work with every .content separately. For every .content you should then run ::text to get all the text inside that element only, put the pieces in a list, and later join them into a single string.
for count, content in enumerate(response.css('li .content')):
    text = []

    # get all `::text` in the current `.content`
    for item in content.css('::text'):
        item = item.get()  # do not strip single fragments, it would drop spaces between them
        # put it on the list
        text.append(item)

    # join all items into a single string
    text = "".join(text)
    text = text.strip()

    print(count, '|', text)
    data[f"content {count}"] = text
Minimal working code. You can put all the code in one file and run python script.py without creating a Scrapy project.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://eksisozluk.com/mortingen-sitraze--1277239']

    def parse(self, response):
        print('url:', response.url)

        data = {}  # PEP8: spaces around `=`

        title = response.css('[itemprop="name"]::text').get()
        data["title"] = title

        for count, content in enumerate(response.css('li .content')):
            text = []
            for item in content.css('::text'):
                item = item.get()  # do not strip single fragments, it would drop spaces between them
                text.append(item)
            text = "".join(text)
            text = text.strip()
            print(count, '|', text)
            data[f"content {count}"] = text

        yield data

# --- run without creating a project and save results in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save to a CSV, JSON or XML file
    'FEEDS': {'output.csv': {'format': 'csv'}},  # 'FEEDS' is new in Scrapy 2.1
})
c.crawl(MySpider)
c.start()
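A note on the output shape: the spider above yields a single item whose keys are content 0, content 1, and so on, so every comment becomes its own CSV column. If one row per comment is preferred instead, a common Scrapy pattern (a sketch, not part of the original answer) is to yield one item per comment:

def parse(self, response):
    title = response.css('[itemprop="name"]::text').get()
    for count, content in enumerate(response.css('li .content')):
        text = "".join(content.css('::text').getall()).strip()
        # one row per comment: the title is repeated, the comment stays in a single column
        yield {'title': title, 'number': count, 'comment': text}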
EDIT: A little shorter with getall():
for count, content in enumerate(response.css('li .content')):
    text = content.css('::text').getall()
    text = "".join(text)
    text = text.strip()
    print(count, '|', text)
    data[f"content {count}"] = text
Answered By - furas