Friday, January 19, 2024

[FIXED] Unwanted newline characters in JSON while web scraping

Labels: json, python, scrapy, web-crawler, web-scraping

Issue

I want to extract info from this website using Scrapy. The info I need is in a JSON block, and that JSON has unwanted literal newline characters, but only in the description field.

Here is an example page; the JSON element I want to scrape is this:

<script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Product",
            "description": "Hamster ve Guinea Pig için tasarlanmış temizliği kolay mama kabıdır.

Hamster motifleriyle süslü ve son derece sevimlidir.

Ürün seramikten yapılmıştır 

Ürün ölçüleri 


    Hacim: 100 ml
    Çap: 8 cm",
      "name": "Karlie Seramik Hamster ve Guinea Pigler İçin Yemlik 100ml 8cm",
      "image": "https://www.petlebi.com/up/ecommerce/product/lg_karlie-hamster-mama-kaplari-359657192.jpg",
      "brand": {
        "@type": "Brand",
        "name": "Karlie"
      },
      "category": "Guinea Pig Yemlikleri",
      "sku": "4016598440834",
      "gtin13": "4016598440834",
      "offers": {
        "@type": "Offer",
         "availability": "http://schema.org/InStock",
         "price": "149.00",
        "priceCurrency": "TRY",
        "itemCondition": "http://schema.org/NewCondition",
        "url": "https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html"
      },
      "review": [
            ]
    }
    </script>

As you can see there are literal newline characters in the description, which are not allowed in JSON. Here is the code I was trying but it didn't work:

import scrapy
import json
import re

class JsonSpider(scrapy.Spider):
    name = 'json_spider'
    start_urls = ['https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html']

    def parse(self, response):
        # Extract the script content containing the JSON data
        script_content = response.xpath('/html/body/script[12]').get()

        if not script_content:
            self.logger.warning("Script content not found.")
            return

        json_data_match = re.search(r'<script type="application/ld\+json">(.*?)<\/script>', script_content, re.DOTALL)
        if json_data_match:
            json_data_str = json_data_match.group(1)
            try:
                json_obj = json.loads(json_data_str)

                product_info = {
                    "name": json_obj.get("name"),
                    "description": json_obj.get("description"),
                    "image": json_obj.get("image"),
                    "brand": json_obj.get("brand", {}).get("name"),
                    "category": json_obj.get("category"),
                    "sku": json_obj.get("sku"),
                    "price": json_obj.get("offers", {}).get("price"),
                    "url": json_obj.get("offers", {}).get("url")
                }

                self.logger.info("Extracted Product Information: %s", product_info)

                with open('product_info.json', 'w', encoding='utf-8') as json_file:
                    json.dump(product_info, json_file, ensure_ascii=False, indent=2)

            except json.JSONDecodeError as e:
                self.logger.error("Error decoding JSON: %s", e)

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html',
            callback=self.parse,
        )

I want this to be a dynamic code so it works for every product.

I used https://jsonlint.com/ to see the unwanted characters, and when I delete the newline characters in the description it says the JSON is valid. I tried html.unescape but it didn't work. The code fails on this line: json_obj = json.loads(json_data_str). How can I fix it?

Solution

Replacing newlines only within the "description": value is a little bit more involved than I'd like, but try this.

                json_data_str_fixed = re.sub(
                    r'"description": "[^"]*(\n[^"]*)*"',
                    lambda x: re.sub(r"\n", r"\\n", x.group(0)),
                    json_data_str)
                json_obj = json.loads(json_data_str_fixed)

In so many words, the outer re.sub selects the "description": key and its value, including any newlines, and replaces it with ... the same string, in which the inner re.sub has replaced each literal newline with an escaped newline.
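
As a self-contained illustration of that nested substitution, here is a minimal sketch. The sample string and the helper name escape_description_newlines are made up for the demo and are not part of the original spider. The sketch assumes the description value contains no escaped double quotes (\"), since the pattern stops at the first quote it meets.

```python
import json
import re

# Hypothetical sample: a JSON-like string with raw newlines inside "description".
raw = '{\n  "description": "Line one.\n\nLine two.",\n  "name": "Bowl"\n}'


def escape_description_newlines(text):
    """Escape raw newlines inside the "description" value only.

    Assumption: the value contains no escaped double quotes.
    """
    return re.sub(
        r'"description": "[^"]*"',                  # key plus value, newlines included
        lambda m: m.group(0).replace("\n", "\\n"),  # newline -> backslash + n
        text,
    )


fixed = escape_description_newlines(raw)
data = json.loads(fixed)
print(repr(data["description"]))
```

Because the replacement is a function rather than a template string, re.sub inserts its return value verbatim, so the backslash needs no extra escaping beyond the Python string literal.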

If you don't want to preserve the newlines at all, of course, this is much simpler; just

                json_obj = json.loads(json_data_str.replace("\n", ""))

but understand that this will turn e.g. yapılmıştır(newline)(newline)Ürün into yapılmıştırÜrün which probably isn't what you want.
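
To make the difference concrete, here is a small sketch comparing outright deletion with one possible variation that is not in the original answer: replacing each newline with a space and then collapsing runs of whitespace. The sample string is made up for the demo.

```python
import json

# Hypothetical sample with two raw newlines between two words.
raw = '{"description": "yapılmıştır\n\nÜrün"}'

# Deleting newlines outright fuses the neighbouring words.
joined = json.loads(raw.replace("\n", ""))["description"]

# Swapping each newline for a space, then collapsing whitespace runs,
# keeps the words apart (at the cost of losing the paragraph breaks).
spaced = " ".join(json.loads(raw.replace("\n", " "))["description"].split())

print(joined)
print(spaced)
```

Newlines between JSON tokens are ordinary whitespace, so replacing them globally does not change the structure of the document.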

Using json.loads(..., strict=False) as suggested in the other answer is probably easier in your scenario; but I wanted to provide an answer which can be adapted to scenarios where this doesn't work. (I would upvote, and suggest you accept, the other answer if it didn't suggest munging the text as its primary solution.)
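
For comparison, a minimal sketch of the strict=False route, using a made-up sample string. With strict=False, the standard library's JSON decoder accepts control characters, including literal newlines, inside string values.

```python
import json

# Hypothetical sample: raw newlines inside the "description" string.
raw = '{\n  "description": "Line one.\n\nLine two.",\n  "name": "Bowl"\n}'

# The default (strict=True) rejects control characters inside strings.
try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# strict=False tolerates them and keeps the newlines in the parsed value.
data = json.loads(raw, strict=False)
print(strict_ok, repr(data["description"]))
```

This needs no preprocessing at all, but it only helps with control characters; other defects, such as unescaped double quotes in the value, would still need a text-level fix.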

Demo: https://ideone.com/okXYok

Answered By - tripleee

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
