-
Notifications
You must be signed in to change notification settings - Fork 116
Add Korean TN support for cardinal numbers and postprocessing #285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
for more information, see https://pre-commit.ci
nemo_text_processing/text_normalization/ko/taggers/tokenize_and_classify.py
Fixed
Show fixed
Hide fixed
digit_except_one = pynini.difference(NEMO_DIGIT, "1") | ||
digit_except_zero_one = pynini.difference(digit_except_one, "0") | ||
|
||
graph_digit_alt = digit_except_zero_one @ graph_digit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's call this something like graph_digit_no_zero_one
to make it more explicit in its 'alt'-ness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed graph_digit_alt to graph_digit_no_zero_one for clarity as suggested.
|
||
graph_digit_alt = digit_except_zero_one @ graph_digit | ||
graph_ty = pynini.string_file(get_abs_path("data/number/ty.tsv")) | ||
graph_teen = pynini.string_file(get_abs_path("data/number/teen.tsv")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are the rules for creating teens any different from the ones to create ties? does it make sense to have a separate file for them, or would it be possible to just add 1 십
to the ties file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right! Since both follow the pattern of combining a base digit with “십”, it makes sense to consolidate them. I’ll go ahead and remove the teen.tsv file and just add 1 십 to the ty.tsv file.
graph_teen = pynini.string_file(get_abs_path("data/number/teen.tsv")) | ||
|
||
# Compose all basic number forms | ||
graph_all = (graph_ty + (graph_digit | pynutil.delete('0'))) | graph_teen | graph_digit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this covers numbers from 1 to 99, right? can we rename the variable for more clarity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ll rename it to graph_1_to_99 for clarity.
graph_hundred_thousand = hundred_thousands @ graph_hundred_thousand_component | ||
|
||
millions = NEMO_DIGIT**7 | ||
graph_million_component = ((NEMO_DIGIT**3 @ graph_hundred_component) + pynutil.insert('만')) + pynini.union( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(NEMO_DIGIT**3 @ graph_hundred_component) == graph_hundred
, let's try to use the same variable intead of redefining the rule (this applies to other _component
graphs too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the code to use graph_hundred and graph_thousand instead of (NEMO_DIGIT3 @ graph_hundred_component) and (NEMO_DIGIT4 @ graph_thousand_component).
|
||
# FST | ||
graph_num = pynini.union( | ||
graph_thousand_trillions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a test case in test_cases_cardinal.txt
for each of these rules
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! I’ll add one test case for each of these rules in test_cases_cardinal.txt
).optimize() | ||
|
||
# Sign and final formatting | ||
optional_sign = pynini.closure(pynutil.insert('negative: "true" ') + pynini.cross("-", ""), 0, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also add a few test cases in test_cases_cardinal.txt
for negative numbers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add some negative cases as well.
runtest $input | ||
} | ||
|
||
#testTNSpecialText() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's remove the commented out tests for now and add them as we develop each class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted all the commented out tests
…feedback Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
for more information, see https://pre-commit.ci
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
LGTM -- let's just get CI running properly |
Signed-off-by: Jinwoo Bae <bbae7050@gmail.com>
What does this PR do ?
Classify and verbalize grammars
Unit tests with coverage up to 17-digit numbers
Support for spacing around units (억, 만, 조, 경)
Post-processing logic for formatting
Unit tests and Sparrowhawk tests
Before your PR is "Ready for review"
Pre checks:
git commit -s
to sign.pytest
or (if your machine does not have GPU)pytest --cpu
from the root folder (given you marked your test cases accordingly@pytest.mark.run_only_on('CPU')
).bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
pytest
and Sparrowhawk here.__init__.py
for every folder and subfolder, includingdata
folder which has .TSV files?Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
to all newly added Python files?Copyright 2015 and onwards Google, Inc.
. See an example here.try import: ... except: ...
) if not already done.PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.