Issue
I have a txt file that contains two columns (filename and text); the separator used when generating the txt file is a tab.
Example of the input file, text.txt:
23.jpg még
24.jpg több
The expected output_file.jsonl, in JSON Lines format:
{"file_name": "23.jpg", "text": "még"}
{"file_name": "24.jpg", "text": "több"}
But I got an issue with the Unicode/encoding format:
{"file_name": "23.jpg", "text": "m\u00c3\u00a9g"}
{"file_name": "24.jpg", "text": "t\u00c3\u00b6bb"}
It seems that the Hungarian special characters áéíöóőúüű (both lowercase and uppercase) are not recognized; for example, the resulting *.jsonl file contains escape sequences such as \u00c3\u00a9 instead of the letter é.
I wrote this small script to convert the *.txt file with Hungarian text to a *.jsonl file, keeping the Hungarian text:
import pandas as pd
import json

train_text = 'text.txt'
# the file is tab-separated, so use a tab delimiter
df = pd.read_csv(train_text, header=None, delimiter='\t', encoding="utf8")
df.rename(columns={0: "file_name", 1: "text"}, inplace=True)

# convert the txt file to JSON Lines
reddit = df.to_dict(orient="records")
with open("output_file.jsonl", "w") as f:
    for line in reddit:
        f.write(json.dumps(line) + "\n")
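For reference, a quick way to confirm the Hungarian characters survive the read step (a minimal sketch, assuming the same tab-separated text.txt shown above):

import pandas as pd

# Sanity check: the text column should already contain 'még' and 'több' here.
df = pd.read_csv("text.txt", header=None, delimiter="\t", encoding="utf8",
                 names=["file_name", "text"])
print(df["text"].tolist())  # expected: ['még', 'több']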
My expected output_file.jsonl, in JSON Lines format:
{"file_name": "23.jpg", "text": "még"}
{"file_name": "24.jpg", "text": "több"}
Solution
From the docs, json.dumps accepts an ensure_ascii flag. It has always puzzled me why it defaults to True, but that is what inserts Unicode escape sequences instead of writing multibyte UTF-8. It should be fine, since other parsers will figure the escapes out, but to fix the problem do:
f.write(json.dumps(line, ensure_ascii=False) + "\n")
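Putting it together, a minimal sketch of the write step with that flag. Passing encoding="utf-8" to open() is an addition of mine rather than part of the original answer, but it avoids falling back to a non-UTF-8 platform default when writing the accented characters:

import json

# reddit is the list of record dicts produced by df.to_dict(orient="records")
with open("output_file.jsonl", "w", encoding="utf-8") as f:
    for line in reddit:
        # ensure_ascii=False keeps 'é', 'ö', etc. as literal characters
        f.write(json.dumps(line, ensure_ascii=False) + "\n")

With the sample rows above, output_file.jsonl should then contain the expected lines shown in the question, e.g. {"file_name": "23.jpg", "text": "még"}.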
Answered By - tdelaney