Wednesday, July 27, 2022

[FIXED] How to check if text is Japanese Hiragana in Python?

Issue

I'm building a web crawler with Python and Scrapy to collect text from websites.

I only want to collect Japanese Hiragana text. Is there a way to detect whether a piece of text is Hiragana?

Solution

Assuming you only need Hiragana, and your text is already a Python str (in Python 3 every str is Unicode, so decode any UTF-8 bytes first):

The Unicode Hiragana block covers U+3040 to U+309F, so you could test for it with:

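def char_is_hiragana(c: str) -> bool:
    # True if c falls in the Unicode Hiragana block (U+3040 to U+309F).
    return '\u3040' <= c <= '\u309F'

def string_is_hiragana(s: str) -> bool:
    # True only if every character of s is in the Hiragana block.
    return all(char_is_hiragana(c) for c in s)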

print('ぁ', string_is_hiragana('ぁ'))
print('ひらがな', string_is_hiragana('ひらがな'))
print('a', string_is_hiragana('a'))
print('english', string_is_hiragana('english'))

ぁ True
ひらがな True
a False
english False
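
Two edge cases of the range check above are worth knowing about: all() is True for an empty iterable, so an empty string passes, and a few code points inside the block (such as U+3040) are unassigned. Here is a minimal sketch of a stricter variant. The function names are hypothetical and not part of the original answer, and the result depends on the Unicode database bundled with your Python:

```python
import unicodedata


def char_is_assigned_hiragana(c: str) -> bool:
    # Inside the Hiragana block AND assigned in this Python's Unicode
    # database (category 'Cn' means "unassigned").
    return '\u3040' <= c <= '\u309F' and unicodedata.category(c) != 'Cn'


def string_is_assigned_hiragana(s: str) -> bool:
    # Reject the empty string explicitly, because all() of nothing is True.
    return bool(s) and all(char_is_assigned_hiragana(c) for c in s)


print('ひらがな', string_is_assigned_hiragana('ひらがな'))
print('(empty)', string_is_assigned_hiragana(''))
```

Whether an empty string should count as Hiragana depends on your crawler, so treat the explicit rejection as one possible choice.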

But note that this excludes historic and non-standard hiragana (hentaigana), whitespace, punctuation, Katakana and Kanji:

# hiragana
print('ひらがな', string_is_hiragana('ひらがな'))
# katakana
print('カタカナ', string_is_hiragana('カタカナ'))
# kanji
print('漢字', string_is_hiragana('漢字'))
# punctuation
print('ひらがなもじ「ゆ」', string_is_hiragana('ひらがなもじ「ゆ」'))
print('いいひと。', string_is_hiragana('いいひと。'))

ひらがな True
カタカナ False
漢字 False
ひらがなもじ「ゆ」 False
いいひと。 False
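
If an all-or-nothing check is too strict for scraped text, one alternative is to pull out the Hiragana runs and ignore everything else. This is only a sketch: extract_hiragana and HIRAGANA_RUN are hypothetical names, and the pattern uses the same U+3040 to U+309F range as above:

```python
import re
from typing import List

# One or more consecutive characters from the Hiragana block.
HIRAGANA_RUN = re.compile(r'[\u3040-\u309F]+')


def extract_hiragana(text: str) -> List[str]:
    # Return each maximal run of Hiragana, skipping all other characters.
    return HIRAGANA_RUN.findall(text)


print(extract_hiragana('ひらがなもじ「ゆ」'))
print(extract_hiragana('いいひと。'))
```

For the two punctuated strings above this gives ['ひらがなもじ', 'ゆ'] and ['いいひと'], so the brackets and the full stop are dropped instead of failing the whole string.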

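You could allow whitespace. Note that string.whitespace only contains ASCII whitespace, so str.isspace() is used here to also accept the ideographic space (U+3000) that is common in Japanese text:

def string_is_hiragana_or_whitespace(s: str) -> bool:
    # str.isspace() covers ASCII whitespace and Unicode spaces such as U+3000.
    return all(c.isspace() or char_is_hiragana(c) for c in s)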

print('ひらがな  ひらがな', string_is_hiragana_or_whitespace('ひらがな  ひらがな'))

ひらがな  ひらがな True

But I would avoid going down this path of being too specific. There are a lot of difficult problems, such as encoding, half-width characters, emoji, CJK code blocks, loan words, etc.
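
One less specific approach, sketched here under my own assumptions rather than taken from the original answer, is to score how much of a text is Hiragana and accept it above a threshold. The names hiragana_ratio and looks_mostly_hiragana are hypothetical, and the 0.8 default is an arbitrary starting point you would need to tune:

```python
def hiragana_ratio(text: str) -> float:
    # Ignore whitespace so that spacing does not dilute the score.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Count characters inside the Hiragana block (U+3040 to U+309F).
    hits = sum(1 for c in chars if '\u3040' <= c <= '\u309F')
    return hits / len(chars)


def looks_mostly_hiragana(text: str, threshold: float = 0.8) -> bool:
    # The threshold is an assumed value, not a recommendation.
    return hiragana_ratio(text) >= threshold


print('いいひと。', hiragana_ratio('いいひと。'))
print('漢字', looks_mostly_hiragana('漢字'))
```

This tolerates the odd punctuation mark or stray character, at the cost of letting some mixed text through.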

Answered By - Shameen

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
