Issue
I want to extract info from this website using Scrapy. The info I need is in a JSON block, but that JSON contains unwanted literal newline characters, and only in the description field.
Here is an example page; the JSON element I want to scrape is this:
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Product",
"description": "Hamster ve Guinea Pig için tasarlanmış temizliği kolay mama kabıdır.
Hamster motifleriyle süslü ve son derece sevimlidir.
Ürün seramikten yapılmıştır
Ürün ölçüleri
Hacim: 100 ml
Çap: 8 cm",
"name": "Karlie Seramik Hamster ve Guinea Pigler İçin Yemlik 100ml 8cm",
"image": "https://www.petlebi.com/up/ecommerce/product/lg_karlie-hamster-mama-kaplari-359657192.jpg",
"brand": {
"@type": "Brand",
"name": "Karlie"
},
"category": "Guinea Pig Yemlikleri",
"sku": "4016598440834",
"gtin13": "4016598440834",
"offers": {
"@type": "Offer",
"availability": "http://schema.org/InStock",
"price": "149.00",
"priceCurrency": "TRY",
"itemCondition": "http://schema.org/NewCondition",
"url": "https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html"
},
"review": [
]
}
</script>
As you can see there are literal newline characters in the description, which are not allowed in JSON. Here is the code I was trying but it didn't work:
import scrapy
import json
import re

class JsonSpider(scrapy.Spider):
    name = 'json_spider'
    start_urls = ['https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html']

    def parse(self, response):
        # Extract the script content containing the JSON data
        script_content = response.xpath('/html/body/script[12]').get()
        if not script_content:
            self.logger.warning("Script content not found.")
            return
        json_data_match = re.search(r'<script type="application/ld\+json">(.*?)<\/script>', script_content, re.DOTALL)
        if json_data_match:
            json_data_str = json_data_match.group(1)
            try:
                json_obj = json.loads(json_data_str)
                product_info = {
                    "name": json_obj.get("name"),
                    "description": json_obj.get("description"),
                    "image": json_obj.get("image"),
                    "brand": json_obj.get("brand", {}).get("name"),
                    "category": json_obj.get("category"),
                    "sku": json_obj.get("sku"),
                    "price": json_obj.get("offers", {}).get("price"),
                    "url": json_obj.get("offers", {}).get("url")
                }
                self.logger.info("Extracted Product Information: %s", product_info)
                with open('product_info.json', 'w', encoding='utf-8') as json_file:
                    json.dump(product_info, json_file, ensure_ascii=False, indent=2)
            except json.JSONDecodeError as e:
                self.logger.error("Error decoding JSON: %s", e)

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html',
            callback=self.parse,
        )
I want the code to be dynamic so that it works for every product.
I used https://jsonlint.com/ to see the unwanted characters, and when I delete the escape characters in the description it says the JSON is valid. I tried html.unescape but it didn't work. The code stops working at this line:
json_obj = json.loads(json_data_str)
How can I do it?
Solution
Replacing newlines only within the "description": value is a little bit more involved than I'd like, but try this.
json_data_str_fixed = re.sub(
    r'"description": "[^"]*(\n[^"]*)*"',
    lambda x: re.sub(r"\n", r"\\n", x.group(0)),
    json_data_str)
json_obj = json.loads(json_data_str_fixed)
In so many words, the outer re.sub selects the "description": key and value, including any newlines, and replaces it with ... the same string with the newlines replaced with escaped newlines by the inner re.sub.
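As a self-contained illustration of the idea above, here is a minimal sketch. The `raw` payload and the helper name `escape_description_newlines` are made up for the demo, and the pattern is slightly simplified compared to the one above, since `[^"]*` already matches newlines. It assumes the description contains no escaped double quotes (`\"`), which would end the match early.

```python
import json
import re

# Hypothetical, shortened stand-in for the page's ld+json payload.
# The description value contains raw newlines, which strict JSON forbids.
raw = '''{
"@type": "Product",
"description": "Line one.
Line two.
Line three.",
"name": "Example product"
}'''


def escape_description_newlines(text: str) -> str:
    """Escape raw newlines inside the "description" value only.

    Assumes the value contains no escaped double quotes.
    """
    return re.sub(
        r'"description": "[^"]*"',
        # Replace each real newline with a backslash followed by "n".
        lambda m: m.group(0).replace("\n", "\\n"),
        text,
    )


fixed = escape_description_newlines(raw)
data = json.loads(fixed)
print(data["description"])
```

Newlines between the other keys are left untouched, because they are ordinary JSON whitespace and already valid.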
If you don't want to preserve the newlines at all, of course, this is much simpler; just
json_obj = json.loads(json_data_str.replace("\n", ""))
but understand that this will turn e.g. yapılmıştır(newline)(newline)Ürün into yapılmıştırÜrün which probably isn't what you want.
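A small sketch of that effect, using a made-up one-field payload. The variant that substitutes a space is only one possible workaround if you go this route, not something the page or the other answer prescribes.

```python
import json

# Hypothetical payload: a raw newline sits between two words of the value.
raw = '{"description": "yapılmıştır\nÜrün ölçüleri"}'

# Dropping newlines makes the JSON valid but glues the words together.
joined = json.loads(raw.replace("\n", ""))

# Substituting a space keeps the words apart, at the cost of losing
# the original line structure.
spaced = json.loads(raw.replace("\n", " "))

print(joined["description"])
print(spaced["description"])
```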
Using json.loads(..., strict=False) as suggested in the other answer is probably easier in your scenario; but I wanted to provide an answer which can be adapted to scenarios where this doesn't work. (I would upvote, and suggest you accept, the other answer if it didn't suggest munging the text as its primary solution.)
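For comparison, a minimal sketch of the strict=False route on a made-up payload. With `strict=False`, the standard library's `json.loads` accepts control characters such as raw newlines inside string values, so no pre-processing is needed.

```python
import json

# Hypothetical payload with a raw newline inside a string value.
raw = '{"description": "Line one.\nLine two.", "name": "Example product"}'

# The default (strict) parser rejects control characters inside strings.
try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError as exc:
    strict_ok = False
    print("strict parse failed:", exc)

# strict=False tolerates them and keeps the newline in the parsed value.
data = json.loads(raw, strict=False)
print(repr(data["description"]))
```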
Demo: https://ideone.com/okXYok
Answered By - tripleee