Skip to content

Add Korean TN support for cardinal numbers and postprocessing #285

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 10, 2025

Conversation

bbae0312
Copy link

What does this PR do ?

Classify and verbalize grammars
Unit tests with coverage up to 17-digit numbers
Support for spacing around units (억, 만, 조, 경)
Post-processing logic for formatting
Unit tests and Sparrowhawk tests

Before your PR is "Ready for review"

Pre checks:

  • [x ] Have you signed your commits? Use git commit -s to sign.
  • [x ] Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • [x ] If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • [x ] Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • [x ] Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • [x ] Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • [x ] If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • [x ] Remove import guards (try import: ... except: ...) if not already done.
  • [x ] If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • [x ] Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • [x ] New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

digit_except_one = pynini.difference(NEMO_DIGIT, "1")
digit_except_zero_one = pynini.difference(digit_except_one, "0")

graph_digit_alt = digit_except_zero_one @ graph_digit
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's call this something like graph_digit_no_zero_one to make it more explicit in its 'alt'-ness

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed graph_digit_alt to graph_digit_no_zero_one for clarity as suggested.


graph_digit_alt = digit_except_zero_one @ graph_digit
graph_ty = pynini.string_file(get_abs_path("data/number/ty.tsv"))
graph_teen = pynini.string_file(get_abs_path("data/number/teen.tsv"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the rules for creating teens any different from the ones to create ties? does it make sense to have a separate file for them, or would it be possible to just add 1 십 to the ties file?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! Since both follow the pattern of combining a base digit with “십”, it makes sense to consolidate them. I’ll go ahead and remove the teen.tsv file and just add 1 십 to the ty.tsv file.

graph_teen = pynini.string_file(get_abs_path("data/number/teen.tsv"))

# Compose all basic number forms
graph_all = (graph_ty + (graph_digit | pynutil.delete('0'))) | graph_teen | graph_digit
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this covers numbers from 1 to 99, right? can we rename the variable for more clarity?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll rename it to graph_1_to_99 for clarity.

graph_hundred_thousand = hundred_thousands @ graph_hundred_thousand_component

millions = NEMO_DIGIT**7
graph_million_component = ((NEMO_DIGIT**3 @ graph_hundred_component) + pynutil.insert('만')) + pynini.union(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(NEMO_DIGIT**3 @ graph_hundred_component) == graph_hundred, let's try to use the same variable intead of redefining the rule (this applies to other _component graphs too)

Copy link
Author

@bbae0312 bbae0312 May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code to use graph_hundred and graph_thousand instead of (NEMO_DIGIT3 @ graph_hundred_component) and (NEMO_DIGIT4 @ graph_thousand_component).


# FST
graph_num = pynini.union(
graph_thousand_trillions,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a test case in test_cases_cardinal.txt for each of these rules

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! I’ll add one test case for each of these rules in test_cases_cardinal.txt

).optimize()

# Sign and final formatting
optional_sign = pynini.closure(pynutil.insert('negative: "true" ') + pynini.cross("-", ""), 0, 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also add a few test cases in test_cases_cardinal.txt for negative numbers

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add some negative cases as well.

runtest $input
}

#testTNSpecialText() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove the commented out tests for now and add them as we develop each class

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted all the commented out tests

…feedback

Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
pre-commit-ci bot and others added 2 commits May 24, 2025 03:36
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
@mgrafu
Copy link
Collaborator

mgrafu commented Jun 3, 2025

LGTM -- let's just get CI running properly

Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
@bbae0312 bbae0312 marked this pull request as ready for review June 3, 2025 18:29
@mgrafu mgrafu changed the base branch from main to ko_tn_staging_v1 June 10, 2025 12:03
@mgrafu mgrafu merged commit e7e78c6 into NVIDIA:ko_tn_staging_v1 Jun 10, 2025
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants