Add Korean TN support for cardinal numbers #280

bbae0312 · 2025-05-08T15:18:46Z

What does this PR do?

Adds support for Korean cardinal number text normalization (TN), including:

Classify and verbalize grammars
Unit tests with coverage up to 17-digit numbers
Support for spacing around units (억, 만, 조, 경)
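For readers unfamiliar with how Korean groups large numbers, here is a minimal plain-Python sketch of the reading scheme the grammars target. It is not the pynini grammar from this PR. The names `read_group` and `read_cardinal`, the `sep` parameter, and the conventions noted in the comments (dropping a leading 일 before 십/백/천 and before 만) are assumptions made for illustration only.

```python
# Illustrative sketch only: plain Python, not the WFST grammar in this PR.
DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
SMALL_UNITS = ["", "십", "백", "천"]        # positions inside a 4-digit group
LARGE_UNITS = ["", "만", "억", "조", "경"]  # one unit per 4-digit group


def read_group(group: int) -> str:
    """Read a value in 1..9999 using Sino-Korean numerals."""
    out = []
    for pos in range(3, -1, -1):
        digit = (group // 10**pos) % 10
        if digit == 0:
            continue
        if digit == 1 and pos > 0:
            # Assumed convention: 10 -> 십, 100 -> 백, 1000 -> 천 (no leading 일).
            out.append(SMALL_UNITS[pos])
        else:
            out.append(DIGITS[digit] + SMALL_UNITS[pos])
    return "".join(out)


def read_cardinal(number: int, sep: str = "") -> str:
    """Read a non-negative integer below 10**20, joining 4-digit groups with `sep`."""
    if number == 0:
        return "영"
    if not 0 < number < 10**20:
        raise ValueError("supported range is 0 <= number < 10**20")
    parts = []
    idx = 0
    while number > 0:
        number, group = divmod(number, 10_000)
        if group:
            text = read_group(group)
            if idx == 1 and group == 1:
                # Assumed convention: 10000 is read 만 rather than 일만.
                text = ""
            parts.append(text + LARGE_UNITS[idx])
        idx += 1
    return sep.join(reversed(parts))


print(read_cardinal(12345))                       # 5-digit example
print(read_cardinal(12345678901234567, sep=" "))  # 17-digit example with spaced units
```

Passing `sep=" "` places a space after each large unit (경, 조, 억, 만), which is one possible reading of the spacing support described above.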

Notes

Deterministic TN support only

Before your PR is "Ready for review"

Pre checks:

[x] Have you signed your commits? Use git commit -s to sign.
[x] Do all unit tests finish successfully before sending the PR?
1. pytest or (if your machine does not have a GPU) pytest --cpu from the root folder (given you marked your test cases accordingly with @pytest.mark.run_only_on('CPU')).
2. Sparrowhawk tests: bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
[x] Have you added __init__.py for every folder and subfolder, including the data folder that holds .TSV files?
[x] Have you followed the CodeQL results and removed unused variables and imports (the report is at the bottom of the PR in the GitHub review box)?
[x] Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
[x] If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
[x] Remove import guards (try import: ... except: ...) if not already done.
If you added a new language or a new feature, please update the NeMo documentation (it lives in a different repo).
[x] Have you added your language support to tools/text_processing_deployment/pynini_export.py?

PR Type:

[x] New Feature
Bugfix
Documentation
Test

If you haven't finished some of the above items you can still open "Draft" PR.

Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>

for more information, see https://pre-commit.ci

mgrafu · 2025-05-12T19:08:48Z

nemo_text_processing/text_normalization/ko/taggers/cardinal.py

+
+        # 1-99 reading
+        read_1_to_99 = pynini.union(read_1, read_10_to_19, read_20_to_99).optimize()
+        read_100_to_999 = (NEMO_DIGIT**3) @ graph_hundred_component


Is this different from line 66?

bbae0312 · 2025-05-19

Removed this line.

mgrafu · 2025-05-12T19:09:18Z

nemo_text_processing/text_normalization/ko/taggers/cardinal.py

+        graph_thousand = thousands @ graph_thousand_component
+
+        # 1-99 reading
+        read_1_to_99 = pynini.union(read_1, read_10_to_19, read_20_to_99).optimize()


Is this different from line 44?

bbae0312 · 2025-05-19

Removed this line as well.

mgrafu · 2025-05-12T19:09:31Z

nemo_text_processing/text_normalization/ko/taggers/cardinal.py

+        graph_10_to_19 = graph_teen
+        graph_20_to_99 = graph_ty
+
+        graph_all = pynini.union(


Let's use a more descriptive name for this variable.

bbae0312 · 2025-05-19

I updated the variable names and rewrote the construction logic.

mgrafu · 2025-05-12T19:09:54Z

nemo_text_processing/text_normalization/ko/taggers/cardinal.py

+        # 1-99 reading
+        read_1_to_99 = pynini.union(read_1, read_10_to_19, read_20_to_99).optimize()
+        read_100_to_999 = (NEMO_DIGIT**3) @ graph_hundred_component
+        read_1000_to_9999 = (NEMO_DIGIT**4) @ graph_thousand_component


Is this different from line 78?

bbae0312 · 2025-05-19

Removed this line.

bbae0312 · 2025-05-21T22:23:55Z

Superseded by #285. Closing this PR.

bbae0312 added 2 commits May 8, 2025 14:34

Add Korean TN test files for cardinal

d9adadc

Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>

Add test files and updates for Korean TN

e606249

Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>

bbae0312 force-pushed the ko-cardinal branch from 70baa6c to e606249 Compare May 8, 2025 21:58

[pre-commit.ci] auto fixes from pre-commit.com hooks

d94c723

for more information, see https://pre-commit.ci

mgrafu reviewed May 12, 2025

Remove .far files from PR

e7888d4

Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>

bbae0312 force-pushed the ko-cardinal branch from f1edfca to e7888d4 Compare May 12, 2025 20:20

bbae0312 and others added 2 commits May 15, 2025 11:36

Refactor Korean TN: remove .far and unused data files, update cardina…

16347ec

…l.py Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

f1ad01d

for more information, see https://pre-commit.ci

bbae0312 closed this May 21, 2025
