Add Korean TN support for cardinal numbers #280

Closed: bbae0312 wants to merge 6 commits

bbae0312 commented May 8, 2025

What does this PR do?

Adds support for Korean cardinal number text normalization (TN), including:

  • Classify and verbalize grammars
  • Unit tests with coverage up to 17-digit numbers
  • Support for spacing around units (억, 만, 조, 경)
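As a rough illustration of the reading scheme these grammars target, here is a minimal pure-Python sketch. It is not the pynini code in this PR. The names `read_group` and `read_cardinal` and the `sep` parameter are hypothetical. It assumes digits are grouped in fours, a leading 1 is silent before 십/백/천 but spoken before 만/억/조/경 (everyday speech often says 만 rather than 일만), and zero is read 영; the actual grammars may differ on each point.

```python
DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
SMALL_UNITS = ["", "십", "백", "천"]        # 10^0 .. 10^3 inside a four-digit group
LARGE_UNITS = ["", "만", "억", "조", "경"]  # 10^0, 10^4, 10^8, 10^12, 10^16


def read_group(group: int) -> str:
    """Read a value in 0..9999; returns '' for 0."""
    out = []
    for power in range(3, -1, -1):
        digit = (group // 10**power) % 10
        if digit == 0:
            continue
        if digit == 1 and power > 0:
            # Assumption: 10/100/1000 are read 십/백/천, without a leading 일.
            out.append(SMALL_UNITS[power])
        else:
            out.append(DIGITS[digit] + SMALL_UNITS[power])
    return "".join(out)


def read_cardinal(number: int, sep: str = " ") -> str:
    """Read a non-negative integer below 10^20, joining unit groups with `sep`."""
    if not 0 <= number < 10**20:
        raise ValueError("supported range is 0 <= number < 10**20")
    if number == 0:
        return "영"  # assumption: zero is read 영
    parts = []
    unit_index = 0
    while number > 0:
        number, group = divmod(number, 10_000)
        if group:
            # Assumption: 일 is kept before large units (일만, 일억, ...).
            parts.append(read_group(group) + LARGE_UNITS[unit_index])
        unit_index += 1
    return sep.join(reversed(parts))


print(read_cardinal(120_000_000))          # 일억 이천만
print(read_cardinal(120_000_000, sep=""))  # 일억이천만
print(read_cardinal(10**16))               # 일경 (a 17-digit input)
```

The `sep` argument stands in for the spacing-around-units option: a space after each of 만, 억, 조, 경 by default, or an empty string for unspaced output.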

Notes

  • Deterministic TN support only

Before your PR is "Ready for review"

Pre checks:

  • [x] Have you signed your commits? Use git commit -s to sign.
  • [x] Do all unit tests finish successfully before sending the PR?
    1. pytest or (if your machine does not have a GPU) pytest --cpu from the root folder (given you marked your test cases accordingly with @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests: bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • [ ] If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here?
  • [x] Have you added __init__.py for every folder and subfolder, including the data folder which has .TSV files?
  • [x] Have you followed the CodeQL results and removed unused variables and imports (the report is at the bottom of the PR in the GitHub review box)?
  • [x] Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • [x] If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • [x] Remove import guards (try import: ... except: ...) if not already done.
  • [ ] If you added a new language or a new feature please update the NeMo documentation (lives in a different repo).
  • [x] Have you added your language support to tools/text_processing_deployment/pynini_export.py?

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation
  • [ ] Test

If you haven't finished some of the above items you can still open "Draft" PR.

bbae0312 added 2 commits May 8, 2025 14:34
Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>

# 1-99 reading
read_1_to_99 = pynini.union(read_1, read_10_to_19, read_20_to_99).optimize()
read_100_to_999 = (NEMO_DIGIT**3) @ graph_hundred_component
Collaborator

Is this different from line 66?

Author

Removed this line.

graph_thousand = thousands @ graph_thousand_component

# 1-99 reading
read_1_to_99 = pynini.union(read_1, read_10_to_19, read_20_to_99).optimize()
Collaborator

Is this different from line 44?

Author

Removed this line as well.

graph_10_to_19 = graph_teen
graph_20_to_99 = graph_ty

graph_all = pynini.union(
Collaborator

Let's use a more descriptive name for this variable.

Author

I updated the variable names and used new logic for the creation part.

# 1-99 reading
read_1_to_99 = pynini.union(read_1, read_10_to_19, read_20_to_99).optimize()
read_100_to_999 = (NEMO_DIGIT**3) @ graph_hundred_component
read_1000_to_9999 = (NEMO_DIGIT**4) @ graph_thousand_component
Collaborator

Is this different from line 78?

Author

Removed this line.

Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
bbae0312 and others added 2 commits May 15, 2025 11:36
…l.py

Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
@bbae0312
Author

Superseded by #285. Closing this PR.

bbae0312 closed this May 21, 2025
2 participants