Commit 57d47b7

anand-nv, tarushi2k2, and pre-commit-ci[bot] authored
Future implementations to date.py - Hindi ITN (#265) (#266)
* Future implementations to date.py - Hindi ITN (#265)
* Addition of whitelist and word classes
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Updation of Jenkins date
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* Cleanup
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* Updation
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* Updation
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* Future implementations for date
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* pushing rough date code for ref
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* Future implementations date.py
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* Cleanup
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Updation of Jenkinsfile
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* Telephone.py-hindi itn
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Telephone.py - Hindi ITN
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Telephone modified tagger and verbalizer
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* telephone tagger with 3,4,5 digit std codes
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* Further additions - telephone.py
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Jenkins update
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* Telephone.py
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Updated tagger-telephone.py
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* Telephone and Jenkinsfile cleanup
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Update Jenkins
  Signed-off-by: Tarushi V <tarushiv@nvidia.com>
---------
Signed-off-by: Tarushi V <tarushiv@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Anand Joseph <anajoseph@nvidia.com>
* Add missing __init__.py file
  Signed-off-by: Anand Joseph <anajoseph@nvidia.com>
---------
Signed-off-by: Tarushi V <tarushiv@nvidia.com>
Signed-off-by: Anand Joseph <anajoseph@nvidia.com>
Co-authored-by: tarushi2k2 <tarushiv@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 48ca992 commit 57d47b7

15 files changed: +490 -6 lines changed

Jenkinsfile

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ pipeline {
     HY_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/03-12-24-0'
     MR_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/03-12-24-1'
     JA_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/10-17-24-1'
-    HI_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/11-29-24-1'
+    HI_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/04-03-25-1'
     DEFAULT_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'
   }
   stages {
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+ई.पू. ईसा पूर्व
+ई. ईस्वी
+ई. ईसवी
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+zero
+one
+two
+three
+four
+five
+six
+seven
+eight
+nine
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+१० ten
+११ eleven
+१२ twelve
+१३ thirteen
+१४ fourteen
+१५ fifteen
+१६ sixteen
+१७ seventeen
+१८ eighteen
+१९ nineteen
+२० twenty
+२१ twenty one
+२२ twenty two
+२३ twenty three
+२४ twenty four
+२५ twenty five
+२६ twenty six
+२७ twenty seven
+२८ twenty eight
+२९ twenty nine
+३० thirty
+३१ thirty one
+३२ thirty two
+३३ thirty three
+३४ thirty four
+३५ thirty five
+३६ thirty six
+३७ thirty seven
+३८ thirty eight
+३९ thirty nine
+४० forty
+४१ forty one
+४२ forty two
+४३ forty three
+४४ forty four
+४५ forty five
+४६ forty six
+४७ forty seven
+४८ forty eight
+४९ forty nine
+५० fifty
+५१ fifty one
+५२ fifty two
+५३ fifty three
+५४ fifty four
+५५ fifty five
+५६ fifty six
+५७ fifty seven
+५८ fifty eight
+५९ fifty nine
+६० sixty
+६१ sixty one
+६२ sixty two
+६३ sixty three
+६४ sixty four
+६५ sixty five
+६६ sixty six
+६७ sixty seven
+६८ sixty eight
+६९ sixty nine
+७० seventy
+७१ seventy one
+७२ seventy two
+७३ seventy three
+७४ seventy four
+७५ seventy five
+७६ seventy six
+७७ seventy seven
+७८ seventy eight
+७९ seventy nine
+८० eighty
+८१ eighty one
+८२ eighty two
+८३ eighty three
+८४ eighty four
+८५ eighty five
+८६ eighty six
+८७ eighty seven
+८८ eighty eight
+८९ eighty nine
+९० ninety
+९१ ninety one
+९२ ninety two
+९३ ninety three
+९४ ninety four
+९५ ninety five
+९६ ninety six
+९७ ninety seven
+९८ ninety eight
+९९ ninety nine
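
The 90 rows above follow a regular pattern (a two-digit number in Devanagari digits paired with its English reading), so a table of this shape can be regenerated or checked programmatically. The following is a minimal sketch, not part of the commit: the helper names `english_words` and `table_rows` are hypothetical, and the tab separator is an assumption based on the `.tsv` files the tagger loads.

```python
# Hypothetical helper (not in the commit): rebuild rows shaped like
# "<number in Devanagari digits><sep><English words>" for 10..99.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}

# Translation table from ASCII digits to Devanagari digits.
DEVANAGARI = str.maketrans("0123456789", "०१२३४५६७८९")


def english_words(n: int) -> str:
    """Return the English reading of a two-digit number, e.g. 21 -> 'twenty one'."""
    if not 10 <= n <= 99:
        raise ValueError("only two-digit numbers are covered")
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def table_rows(sep: str = "\t") -> list[str]:
    """Build all 90 rows; the separator is assumed to be a tab."""
    return [f"{str(n).translate(DEVANAGARI)}{sep}{english_words(n)}" for n in range(10, 100)]
```

Comparing `table_rows()` against the checked-in file would be one way to catch a mistyped row.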

nemo_text_processing/inverse_text_normalization/hi/taggers/date.py

Lines changed: 33 additions & 2 deletions
@@ -44,10 +44,22 @@ def __init__(self, cardinal: GraphFst):
 
         month_graph = pynini.string_file(get_abs_path("data/date/months.tsv"))
         graph_date_days = pynini.string_file(get_abs_path("data/date/date_days.tsv")).invert()
+        graph_century = pynini.string_file(get_abs_path("data/date/century.tsv")).invert()
 
         self.day = pynutil.insert("day: \"") + graph_date_days + pynutil.insert("\" ")
         self.month = pynutil.insert("month: \"") + month_graph + pynutil.insert("\" ")
         self.year = pynutil.insert("year: \"") + graph_year + pynutil.insert("\" ")
+        self.year_range = (
+            pynutil.insert("year: \"")
+            + graph_year
+            + delete_space
+            + pynini.cross("से", "-")
+            + delete_space
+            + graph_year
+            + delete_space
+            + pynutil.insert("\" ")
+        )
+        self.century = pynutil.insert("text: \"") + graph_century + pynutil.insert("\" ")
         insert_comma = pynutil.insert(", ")
 
         graph_day_month = self.day + delete_space + self.month
@@ -58,9 +70,28 @@ def __init__(self, cardinal: GraphFst):
         graph_month_day_year += pynutil.insert(" preserve_order: true")
         graph_month_year = self.month + delete_space + self.year
         graph_saal = self.year
+        graph_AD_BC = self.year + delete_space + self.century
+        graph_day_month_year_century = (
+            self.day + delete_space + self.month + delete_space + self.year + delete_space + self.century
+        )
+        graph_month_year_century = self.month + delete_space + self.year + delete_space + self.century
+        graph_year_range = self.year_range
 
-        graph = graph_day_month | graph_month_day | graph_day_month_year | graph_month_day_year | graph_month_year
-        self.graph = graph.optimize()
+        graph_date_exceptions = self.month + delete_space + pynutil.delete("की") + delete_space + self.day
+        graph_date_exceptions += pynutil.insert("preserve_order: true")
 
+        graph = (
+            graph_day_month
+            | graph_month_day
+            | graph_day_month_year
+            | graph_month_day_year
+            | graph_month_year
+            | graph_saal
+            | graph_AD_BC
+            | graph_day_month_year_century
+            | graph_month_year_century
+            | graph_year_range
+            | graph_date_exceptions
+        )
         final_graph = self.add_tokens(graph)
         self.fst = final_graph
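
The `year_range` rule added in this file reads a year, the word `से`, and a second year, and emits the two years joined by a hyphen inside a single `year` field. The following is a minimal plain-Python sketch of that rewrite, not the pynini grammar itself. It assumes both years are already written as four Devanagari digits, whereas the real tagger converts spoken year words through `graph_year`; the name `join_year_range` is hypothetical.

```python
import re

# Assumption: years are already four Devanagari digits (U+0966..U+096F).
# The real tagger accepts spoken year words via graph_year instead.
_YEAR_RANGE = re.compile(r"([०-९]{4})\s*से\s*([०-९]{4})")


def join_year_range(text: str) -> str:
    """Rewrite '<year> से <year>' as '<year>-<year>', dropping the surrounding spaces."""
    return _YEAR_RANGE.sub(r"\1-\2", text)
```

Text without the `से` connective between two years passes through unchanged, which mirrors how the range rule is only one alternative in the final `graph` union.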
Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import pynini
+from pynini.lib import pynutil
+
+from nemo_text_processing.inverse_text_normalization.hi.graph_utils import GraphFst, delete_space
+from nemo_text_processing.inverse_text_normalization.hi.utils import get_abs_path
+
+
+class TelephoneFst(GraphFst):
+    """
+    Finite state transducer for classifying telephone numbers, e.g.
+        e.g. प्लस इक्यानवे नौ आठ सात छह पांच चार तीन दो एक शून्य => tokens { name: "+९१ ९८७६५ ४३२१०" }
+
+    Args:
+        Cardinal: CardinalFst
+    """
+
+    def __init__(self, cardinal: GraphFst):
+        super().__init__(name="telephone", kind="classify")
+
+        hindi_digit_graph = pynini.string_file(get_abs_path("data/numbers/digit.tsv")).invert()
+        hindi_digit_graph |= pynini.string_file(get_abs_path("data/numbers/zero.tsv")).invert()
+
+        english_digit_graph = pynini.string_file(get_abs_path("data/telephone/eng_to_hindi_digit.tsv")).invert()
+
+        country_code_graph_single_digits = pynini.string_file(get_abs_path("data/numbers/digit.tsv")).invert()
+        country_code_graph_single_digits |= pynini.string_file(get_abs_path("data/numbers/zero.tsv")).invert()
+        country_code_graph_single_digits |= pynini.string_file(
+            get_abs_path("data/telephone/eng_to_hindi_digit.tsv")
+        ).invert()
+
+        country_code_graph_double_digits = pynini.string_file(get_abs_path("data/numbers/teens_and_ties.tsv")).invert()
+        country_code_graph_double_digits |= pynini.string_file(
+            get_abs_path("data/telephone/teens_and_ties_eng_to_hin.tsv")
+        ).invert()
+
+        self.hindi_digit = (
+            pynutil.insert("number_part: \"")
+            + pynini.closure(hindi_digit_graph + delete_space, 0, 9)
+            + hindi_digit_graph
+            + pynutil.insert("\" ")
+        )
+        self.english_digit = (
+            pynutil.insert("number_part: \"")
+            + pynini.closure(english_digit_graph + delete_space, 0, 9)
+            + english_digit_graph
+            + delete_space
+            + pynutil.insert("\" ")
+        )
+
+        self.country_code_with_single_digits = (
+            pynutil.insert("country_code: \"")
+            + pynini.closure(country_code_graph_single_digits + delete_space, 0, 2)
+            + pynutil.insert("\" ")
+        )
+        self.country_code_with_double_digits = (
+            pynutil.insert("country_code: \"")
+            + pynini.closure(country_code_graph_double_digits + delete_space, 0, 1)
+            + pynutil.insert("\" ")
+        )
+        self.country_code = self.country_code_with_single_digits | self.country_code_with_double_digits
+
+        # two, three, four-digit extension code with zero
+        self.city_code_hindi = (
+            pynutil.insert("extension: \"")
+            + pynini.closure(hindi_digit_graph + delete_space, 2, 5)
+            + pynutil.insert("\" ")
+        )
+        self.city_code_english = (
+            pynutil.insert("extension: \"")
+            + pynini.closure(english_digit_graph + delete_space, 2, 5)
+            + pynutil.insert("\" ")
+        )
+
+        self.city_extension = self.city_code_hindi | self.city_code_english
+
+        # 7-digit landline graph in hindi and english digits
+        self.landline_hindi = (
+            pynutil.insert("number_part: \"")
+            + pynini.closure(hindi_digit_graph + delete_space, 7, 7)
+            + pynutil.insert("\" ")
+        )
+        self.landline_english = (
+            pynutil.insert("number_part: \"")
+            + pynini.closure(english_digit_graph + delete_space, 7, 7)
+            + pynutil.insert("\" ")
+        )
+
+        self.landline = self.landline_hindi | self.landline_english
+
+        self.pincode_in_hindi = (
+            pynutil.insert("number_part: \"")
+            + pynini.closure(hindi_digit_graph + delete_space, 0, 5)
+            + hindi_digit_graph
+            + pynutil.insert("\" ")
+        )
+        self.pincode_in_english = (
+            pynutil.insert("number_part: \"")
+            + pynini.closure(english_digit_graph + delete_space, 0, 5)
+            + english_digit_graph
+            + pynutil.insert("\" ")
+        )
+
+        self.credit_card_last_digits_hindi = (
+            pynutil.insert("number_part: \"")
+            + pynini.closure(hindi_digit_graph + delete_space, 0, 3)
+            + hindi_digit_graph
+            + pynutil.insert("\" ")
+        )
+        self.credit_card_last_digits_english = (
+            pynutil.insert("number_part: \"")
+            + pynini.closure(english_digit_graph + delete_space, 0, 3)
+            + english_digit_graph
+            + pynutil.insert("\" ")
+        )
+
+        delete_plus = pynini.union(
+            pynutil.delete("प्लस") | pynutil.delete("plus") | pynutil.delete("Plus") | pynutil.delete("PLUS")
+        )
+
+        delete_zero = pynini.union(
+            pynutil.delete("शून्य") | pynutil.delete("zero") | pynutil.delete("Zero") | pynutil.delete("ZERO")
+        )
+
+        graph_number_with_hindi_digit = (
+            delete_plus + delete_space + self.country_code + delete_space + self.hindi_digit
+        )
+        graph_number_with_english_digit = delete_plus + delete_space + self.country_code + self.english_digit
+
+        graph_landline_with_extension = delete_zero + delete_space + self.city_extension + delete_space + self.landline
+
+        graph_pincode = self.pincode_in_hindi | self.pincode_in_english
+
+        graph_credit_card_last_digits = self.credit_card_last_digits_hindi | self.credit_card_last_digits_english
+
+        graph = (
+            graph_number_with_hindi_digit
+            | graph_number_with_english_digit
+            | graph_landline_with_extension
+            | graph_pincode
+            | graph_credit_card_last_digits
+        )
+
+        final_graph = self.add_tokens(graph)
+        self.fst = final_graph
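
For the mobile-number path (`graph_number_with_hindi_digit`), the tagger drops the spoken "plus", reads a country code, then reads one to ten digit words and concatenates them into `number_part`. The following plain-Python sketch mimics only that one path and is not the pynini grammar. The word-to-digit table is an assumption (the real entries live in `data/numbers/digit.tsv` and `zero.tsv`, which this diff does not show), the country-code table holds a single illustrative entry taken from the docstring, and `tag_mobile` is a hypothetical name.

```python
from typing import Optional

# Assumed mapping; the commit loads the real one from TSV files not shown here.
HINDI_DIGITS = {
    "शून्य": "०", "एक": "१", "दो": "२", "तीन": "३", "चार": "४",
    "पांच": "५", "छह": "६", "सात": "७", "आठ": "८", "नौ": "९",
}
# Single illustrative two-digit country code, taken from the docstring example.
TWO_DIGIT_CODES = {"इक्यानवे": "९१"}


def tag_mobile(spoken: str) -> Optional[dict]:
    """Return country_code and number_part for 'प्लस <code> <1-10 digit words>', else None."""
    words = spoken.split()
    # The tagger requires and deletes the leading "plus" word.
    if len(words) < 3 or words[0] != "प्लस":
        return None
    code = TWO_DIGIT_CODES.get(words[1])
    digit_words = words[2:]
    # closure(digit, 0, 9) + digit accepts between one and ten digit words.
    if code is None or not 1 <= len(digit_words) <= 10:
        return None
    if any(w not in HINDI_DIGITS for w in digit_words):
        return None
    # Spaces between digit words are deleted, so the digits are concatenated.
    return {"country_code": code, "number_part": "".join(HINDI_DIGITS[w] for w in digit_words)}
```

The sketch leaves out the single-digit country-code alternative, the English-word variants, and the landline, pincode, and card-digit branches that the commit also adds.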

nemo_text_processing/inverse_text_normalization/hi/taggers/tokenize_and_classify.py

Lines changed: 4 additions & 0 deletions
@@ -33,6 +33,7 @@
 from nemo_text_processing.inverse_text_normalization.hi.taggers.money import MoneyFst
 from nemo_text_processing.inverse_text_normalization.hi.taggers.ordinal import OrdinalFst
 from nemo_text_processing.inverse_text_normalization.hi.taggers.punctuation import PunctuationFst
+from nemo_text_processing.inverse_text_normalization.hi.taggers.telephone import TelephoneFst
 from nemo_text_processing.inverse_text_normalization.hi.taggers.time import TimeFst
 from nemo_text_processing.inverse_text_normalization.hi.taggers.whitelist import WhiteListFst
 from nemo_text_processing.inverse_text_normalization.hi.taggers.word import WordFst
@@ -82,6 +83,8 @@ def __init__(
             measure_graph = measure.fst
             money = MoneyFst(cardinal, decimal)
             money_graph = money.fst
+            telephone = TelephoneFst(cardinal)
+            telephone_graph = telephone.fst
             punct_graph = PunctuationFst().fst
             whitelist_graph = WhiteListFst().fst
             word_graph = WordFst().fst
@@ -95,6 +98,7 @@ def __init__(
                 | pynutil.add_weight(time_graph, 1.1)
                 | pynutil.add_weight(measure_graph, 1.1)
                 | pynutil.add_weight(money_graph, 1.1)
+                | pynutil.add_weight(telephone_graph, 1.1)
                 | pynutil.add_weight(word_graph, 100)
                 | pynutil.add_weight(whitelist_graph, 1.01)
             )
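
The weights in this union matter because, when several taggers accept the same input, the path with the lowest total weight is the one selected. The telephone tagger joins at 1.1, alongside time, measure, and money, while the plain word tagger sits at 100 as a fallback. The following sketch illustrates only that selection rule in plain Python, using the weights visible in this hunk; `Candidate` and `pick_analysis` are hypothetical names, and real weights accumulate along FST paths rather than being a single number per tagger.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    tagger: str
    weight: float


def pick_analysis(candidates: list[Candidate]) -> Candidate:
    """Choose the lowest-weight candidate, mirroring shortest-path selection."""
    return min(candidates, key=lambda c: c.weight)


# Weights copied from the diff hunk above.
competing = [
    Candidate("word", 100),
    Candidate("telephone", 1.1),
    Candidate("whitelist", 1.01),
]
best = pick_analysis(competing)
```

With these numbers, a whitelist match would win over a telephone match, and either would win over the generic word fallback.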
