Commit ac07488

mgrafu, ngachchi, pre-commit-ci[bot], and github-advanced-security[bot] authored
Staging hi tn (#271)
* Future Implementations for classes - Measure, Money, and Date (#258)
  * Future Implementations for classes - Measure, Money, and Date Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * Resolved the conflicts with mm_yyyy and date ranges and added the previously removed failing test cases. Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * removed the unused empty string implementation Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * minor fixes for the tagger files Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * reformatted decimal final graph Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * incorporated the suggestion for decimal graph Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * Century implementations Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * Working on the yyyy format for the date class Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * reverted yyyy code Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * working on future implementations Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * working on improving the date class accuracy Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * added year prefix for the date class Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * working on the commma cases for date class Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * minor fixes Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * implemented mixed fractions Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * rectified the test case Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * working on quarterly measurements Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * reformatted the prefixes and suffixes for date tagger class Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * replaced text tag with era tag for the date class Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  * Removed the text tag reference from date class verbalizer Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  ---------
  Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* update jenkins cache Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Potential fix for code scanning alert no. 821: Unused local variable Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>
---------
Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>
Co-authored-by: Namrata Gachchi <ngachchi@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
1 parent 3e4ac3e commit ac07488

24 files changed, +334 -70 lines changed

Jenkinsfile

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ pipeline {
 HY_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/03-12-24-0'
 MR_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/03-12-24-1'
 JA_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/10-17-24-1'
-HI_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/04-03-25-1'
+HI_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/04-22-25-0'
 DEFAULT_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'
 }
 stages {
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+सन्
+सन
+साल
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+में
+का
+की
+के
+से
+तक
+ईस्वी
+शताब्दी
+दशक
+सदी
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+ई. पू. ईसा पूर्व
+ई. ईसवी
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+s सेकंड
+hr घंटा
+h घंटे
+min मिनट
+doz दर्जन
+yr साल
+yr वर्ष
+hp हॉर्सपॉवर
+d दिन
+month महीना
+months महीने
+हफ़्ते हफ़्ते

nemo_text_processing/text_normalization/hi/data/measure/unit.tsv

Lines changed: 3 additions & 1 deletion
@@ -141,14 +141,16 @@ month महीना
 months महीने
 ct कैरेट
 pH पीएच
+km/h किलोमीटर प्रति घंटा
 km/hr किलोमीटर प्रति घंटा
 km/min किलोमीटर प्रति मिनट
+m/h मीटर प्रति घंटा
 m/hr मीटर प्रति घंटा
 mi/s मील प्रति सेकंड
+mi/h मील प्रति घंटा
 mi/hr मील प्रति घंटा
 mi/min मील प्रति मिनट
 ₹/ac रुपए प्रति एकड़
 x बाई
 X बाई
 * बाई
-- से
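
The three `/h` rows added in this hunk give the same spoken form as the existing `/hr` rows, so both abbreviations verbalize identically. A minimal sketch of that lookup in plain Python, assuming each row is `written<TAB>spoken` (the format `pynini.string_file` expects); `load_units` and `SAMPLE_TSV` are hypothetical names used only for illustration, and the repository itself builds an FST from the file rather than a dict.

```python
import io

# Three rows copied from the unit.tsv hunk; tab separation is an assumption
# because the page scrape does not preserve the column delimiter.
SAMPLE_TSV = (
    "km/h\tकिलोमीटर प्रति घंटा\n"
    "km/hr\tकिलोमीटर प्रति घंटा\n"
    "m/h\tमीटर प्रति घंटा\n"
)


def load_units(handle):
    """Hypothetical helper: parse 'written<TAB>spoken' rows into a dict."""
    units = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank rows
        written, spoken = line.split("\t", 1)
        units[written] = spoken
    return units


units = load_units(io.StringIO(SAMPLE_TSV))
```

With this table, `units["km/h"]` and `units["km/hr"]` return the same string, which is the effect the added rows are meant to have.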
Lines changed: 1 addition & 2 deletions
@@ -1,10 +1,9 @@
 रुपए
-P पैसे
 £ पाउंड
 वॉन
 $ डॉलर
 लीरा
 टका
 ¥ येन
 नाइरा
-यूरो
+यूरो
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+रुपए पैसे
+पाउंड पेंस
+वॉन जिओन
+डॉलर सेंट
+लीरा कुरस
+टका पैसे
+येन सेन
+नाइरा कोबो
+यूरो सेंट

nemo_text_processing/text_normalization/hi/data/numbers/teens_and_ties.tsv

Lines changed: 8 additions & 8 deletions
@@ -79,12 +79,12 @@
 ८८ अट्ठासी
 ८९ नवासी
 ९० नब्बे
-९१ इक्यानबे
-९२ बानबे
-९३ तिरानबे
-९४ चौरानबे
-९५ पंचानबे
-९६ छियानबे
-९७ सत्तानबे
-९८ अट्ठानबे
+९१ इक्यानबे
+९२ बानबे
+९३ तिरानबे
+९४ चौरानबे
+९५ पंचानबे
+९६ छियानबे
+९७ सत्तानबे
+९८ अट्ठानबे
 ९९ निन्यानबे

nemo_text_processing/text_normalization/hi/data/time/hours.tsv

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+शून्य
 एक
 दो
 तीन

nemo_text_processing/text_normalization/hi/taggers/cardinal.py

Lines changed: 3 additions & 0 deletions
@@ -80,6 +80,7 @@ def create_larger_number_graph(digit_graph, suffix, zeros_counts, sub_graph):
         graph_ten_thousands |= create_larger_number_graph(teens_and_ties, suffix_thousands, 1, teens_ties)
         graph_ten_thousands |= create_larger_number_graph(teens_and_ties, suffix_thousands, 0, graph_hundreds)
         graph_ten_thousands.optimize()
+        self.graph_ten_thousands = graph_ten_thousands

         # Lakhs graph and ten lakhs graph
         suffix_lakhs = pynutil.insert(" लाख")
@@ -90,6 +91,7 @@ def create_larger_number_graph(digit_graph, suffix, zeros_counts, sub_graph):
         graph_lakhs |= create_larger_number_graph(digit, suffix_lakhs, 1, graph_thousands)
         graph_lakhs |= create_larger_number_graph(digit, suffix_lakhs, 0, graph_ten_thousands)
         graph_lakhs.optimize()
+        self.graph_lakhs = graph_lakhs

         graph_ten_lakhs = create_graph_suffix(teens_and_ties, suffix_lakhs, 5)
         graph_ten_lakhs |= create_larger_number_graph(teens_and_ties, suffix_lakhs, 4, digit)
@@ -98,6 +100,7 @@ def create_larger_number_graph(digit_graph, suffix, zeros_counts, sub_graph):
         graph_ten_lakhs |= create_larger_number_graph(teens_and_ties, suffix_lakhs, 1, graph_thousands)
         graph_ten_lakhs |= create_larger_number_graph(teens_and_ties, suffix_lakhs, 0, graph_ten_thousands)
         graph_ten_lakhs.optimize()
+        self.graph_ten_lakhs = graph_ten_lakhs

         # Crores graph ten crores graph
         suffix_crores = pynutil.insert(" करोड़")

nemo_text_processing/text_normalization/hi/taggers/date.py

Lines changed: 54 additions & 2 deletions
@@ -26,6 +26,20 @@

 days = pynini.string_file(get_abs_path("data/date/days.tsv"))
 months = pynini.string_file(get_abs_path("data/date/months.tsv"))
+year_suffix = pynini.string_file(get_abs_path("data/date/year_suffix.tsv"))
+digit = pynini.string_file(get_abs_path("data/numbers/digit.tsv"))
+teens_ties = pynini.string_file(get_abs_path("data/numbers/teens_and_ties.tsv"))
+teens_and_ties = pynutil.add_weight(teens_ties, -0.1)
+
+# Read suffixes from file into a list
+with open(get_abs_path("data/date/suffixes.tsv"), "r", encoding="utf-8") as f:
+    suffixes_list = f.read().splitlines()
+with open(get_abs_path("data/date/prefixes.tsv"), "r", encoding="utf-8") as f:
+    prefixes_list = f.read().splitlines()
+
+# Create union of suffixes and prefixes
+suffix_union = pynini.union(*suffixes_list)
+prefix_union = pynini.union(*prefixes_list)


 class DateFst(GraphFst):
@@ -51,6 +65,10 @@ def __init__(self, cardinal: GraphFst):
             (NEMO_HI_DIGIT + NEMO_HI_NON_ZERO + NEMO_HI_DIGIT + NEMO_HI_DIGIT), cardinal.graph_hundreds_as_thousand
         )

+        cardinal_graph = (
+            digit | teens_and_ties | cardinal.graph_hundreds | graph_year_thousands | graph_year_hundreds_as_thousands
+        )
+
         graph_year = graph_year_thousands | graph_year_hundreds_as_thousands

         delete_dash = pynutil.delete("-")
@@ -68,6 +86,22 @@ def __init__(self, cardinal: GraphFst):

         graph_mm_dd += pynutil.insert(" preserve_order: true ")

+        # Graph for era
+        era_graph = pynutil.insert("era: \"") + year_suffix + pynutil.insert("\"") + insert_space
+
+        range_graph = pynini.cross("-", "से")
+
+        # Graph for year
+        century_number = pynini.compose(pynini.closure(NEMO_HI_DIGIT, 1), cardinal_graph) + pynini.accep("वीं")
+        century_text = pynutil.insert("era: \"") + century_number + pynutil.insert("\"") + insert_space
+
+        # Updated logic to use suffix_union
+        year_number = graph_year + suffix_union
+        year_text = pynutil.insert("era: \"") + year_number + pynutil.insert("\"") + insert_space
+
+        # Updated logic to use prefix_union
+        year_prefix = pynutil.insert("era: \"") + prefix_union + insert_space + graph_year + pynutil.insert("\"")
+
         graph_dd_mm_yyyy = (
             days_graph + (delete_dash | delete_slash) + months_graph + (delete_dash | delete_slash) + years_graph
         )
@@ -78,7 +112,20 @@ def __init__(self, cardinal: GraphFst):

         graph_mm_dd_yyyy += pynutil.insert(" preserve_order: true ")

-        graph_mm_yyyy = months_graph + delete_dash + years_graph
+        graph_mm_yyyy = months_graph + delete_dash + insert_space + years_graph
+
+        graph_year_suffix = era_graph
+
+        graph_range = (
+            pynutil.insert("era: \"")
+            + cardinal_graph
+            + insert_space
+            + range_graph
+            + insert_space
+            + cardinal_graph
+            + pynutil.insert("\"")
+            + pynutil.insert(" preserve_order: true ")
+        )

         # default assume dd_mm_yyyy

@@ -87,7 +134,12 @@ def __init__(self, cardinal: GraphFst):
             | graph_mm_dd
             | pynutil.add_weight(graph_dd_mm_yyyy, -0.001)
             | graph_mm_dd_yyyy
-            | graph_mm_yyyy
+            | pynutil.add_weight(graph_mm_yyyy, -0.2)
+            | pynutil.add_weight(graph_year_suffix, -0.001)
+            | pynutil.add_weight(graph_range, -0.005)
+            | pynutil.add_weight(century_text, -0.001)
+            | pynutil.add_weight(year_text, -0.001)
+            | pynutil.add_weight(year_prefix, -0.009)
         )

         self.final_graph = final_graph.optimize()
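
The new `graph_range` branch verbalizes both numbers, rewrites the hyphen between them as `से`, and wraps the result in an `era` field. A rough plain-Python sketch of that string transformation follows, only to make the expected tagger output concrete. `NUMBER_WORDS` and `tag_range` are hypothetical names; the table holds just three entries copied from `teens_and_ties.tsv`, whereas the real `cardinal_graph` covers digits, hundreds, and years as well.

```python
# Three entries copied from teens_and_ties.tsv as shown in this commit.
NUMBER_WORDS = {"८९": "नवासी", "९०": "नब्बे", "९९": "निन्यानबे"}


def tag_range(text):
    """Hypothetical stand-in for graph_range: 'A-B' becomes an era field joined by 'से'."""
    left, sep, right = text.partition("-")
    if not sep or left not in NUMBER_WORDS or right not in NUMBER_WORDS:
        return None  # the FST would have no accepting path for this input
    # Mirrors the inserted pieces: era: " + number + space + से + space + number + " + preserve_order
    return f'era: "{NUMBER_WORDS[left]} से {NUMBER_WORDS[right]}" preserve_order: true '


tagged = tag_range("९०-९९")
```

Under these assumptions, `tagged` holds the `era` field for the range, and an input with no hyphen returns `None`.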

nemo_text_processing/text_normalization/hi/taggers/measure.py

Lines changed: 79 additions & 8 deletions
@@ -19,6 +19,11 @@
 from nemo_text_processing.text_normalization.hi.utils import get_abs_path


+digit = pynini.string_file(get_abs_path("data/numbers/digit.tsv"))
+teens_ties = pynini.string_file(get_abs_path("data/numbers/teens_and_ties.tsv"))
+teens_and_ties = pynutil.add_weight(teens_ties, -0.1)
+
+
 class MeasureFst(GraphFst):
     """
     Finite state transducer for classifying measure, suppletive aware, e.g.
@@ -35,39 +40,105 @@ class MeasureFst(GraphFst):
     def __init__(self, cardinal: GraphFst, decimal: GraphFst):
         super().__init__(name="measure", kind="classify")

-        cardinal_graph = cardinal.final_graph
-        decimal_graph = decimal.final_graph_wo_negative
+        cardinal_graph = (
+            digit
+            | teens_and_ties
+            | cardinal.graph_hundreds
+            | cardinal.graph_thousands
+            | cardinal.graph_ten_thousands
+            | cardinal.graph_lakhs
+            | cardinal.graph_ten_lakhs
+        )
+        point = pynutil.delete(".")
+        decimal_integers = pynutil.insert("integer_part: \"") + cardinal_graph + pynutil.insert("\"")
+        decimal_graph = decimal_integers + point + insert_space + decimal.graph_fractional
         unit_graph = pynini.string_file(get_abs_path("data/measure/unit.tsv"))
+        quarterly_units_graph = pynini.string_file(get_abs_path("data/measure/quarterly_units.tsv"))

         optional_graph_negative = pynini.closure(
             pynutil.insert("negative: ") + pynini.cross("-", "\"true\"") + insert_space,
             0,
             1,
         )

+        # Define the quarterly measurements
+        quarter = pynini.string_map(
+            [
+                (".५", "साढ़े"),
+                ("१.५", "डेढ़"),
+                ("२.५", "ढाई"),
+            ]
+        )
+        quarter_graph = pynutil.insert("integer_part: \"") + quarter + pynutil.insert("\"")
+
         # Define the unit handling
-        self.unit = pynutil.insert("units: \"") + unit_graph + pynutil.insert("\" ")
+        unit = pynutil.insert(" units: \"") + unit_graph + pynutil.insert("\" ")
+        units = pynutil.insert(" units: \"") + quarterly_units_graph + pynutil.insert("\" ")
+
+        # Handling symbols like x, X, *
+        symbol_graph = pynini.string_map(
+            [
+                ("x", "बाई"),
+                ("X", "बाई"),
+                ("*", "बाई"),
+            ]
+        )

-        graph_measurements = (
+        graph_decimal = (
             pynutil.insert("decimal { ")
             + optional_graph_negative
             + decimal_graph
             + pynutil.insert(" }")
             + delete_space
-            + self.unit
+            + unit
         )
-        graph_measurements |= (
+
+        graph_quarter = (
+            pynutil.insert("cardinal { ")
+            + optional_graph_negative
+            + quarter_graph
+            + pynutil.insert(" }")
+            + delete_space
+            + units
+        )
+
+        graph_cardinal = (
             pynutil.insert("cardinal { ")
             + optional_graph_negative
             + pynutil.insert("integer: \"")
             + cardinal_graph
             + pynutil.insert("\"")
             + pynutil.insert(" }")
             + delete_space
-            + self.unit
+            + unit
         )

-        graph = graph_measurements
+        # Handling cardinal clubbed with symbol as single token
+        graph_exceptions = (
+            pynutil.insert("cardinal { ")
+            + optional_graph_negative
+            + pynutil.insert("integer: \"")
+            + cardinal_graph
+            + pynutil.insert("\"")
+            + pynutil.insert(" }")
+            + pynutil.insert(" units: \"")
+            + symbol_graph
+            + pynutil.insert("\" ")
+            + pynutil.insert("} }")
+            + insert_space
+            + pynutil.insert("tokens { cardinal { ")
+            + optional_graph_negative
+            + pynutil.insert("integer: \"")
+            + cardinal_graph
+            + pynutil.insert("\"")
+        )
+
+        graph = (
+            pynutil.add_weight(graph_decimal, 0.01)
+            | pynutil.add_weight(graph_quarter, 0.005)
+            | pynutil.add_weight(graph_cardinal, 0.01)
+            | pynutil.add_weight(graph_exceptions, 0.01)
+        )
         self.graph = graph.optimize()

         final_graph = self.add_tokens(graph)
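
The final union gives `graph_quarter` a weight of 0.005 and the other branches 0.01. Pynini's default arcs use the tropical semiring, where the lowest total weight wins, so the quarter reading is preferred when an input such as `२.५` followed by a unit is accepted by more than one branch. The sketch below illustrates only that tie-break in plain Python. `BRANCH_WEIGHTS`, `QUARTER`, and `pick_branch` are hypothetical names, and the sketch deliberately ignores the weights inside the sub-graphs (for example the -0.1 on `teens_and_ties`), which also contribute to the real path cost.

```python
# Top-level weights copied from the final union in measure.py.
BRANCH_WEIGHTS = {
    "graph_decimal": 0.01,
    "graph_quarter": 0.005,
    "graph_cardinal": 0.01,
    "graph_exceptions": 0.01,
}

# Quarter forms copied from the string_map in the diff.
QUARTER = {".५": "साढ़े", "१.५": "डेढ़", "२.५": "ढाई"}


def pick_branch(matching):
    """Hypothetical helper: among branches that accept the input, return the cheapest one."""
    return min(matching, key=lambda name: BRANCH_WEIGHTS[name])


# Assumption: '२.५' plus a unit is accepted by both the decimal and the quarter branch.
winner = pick_branch(["graph_decimal", "graph_quarter"])
```

Here `winner` is `"graph_quarter"`, so the number would be read with `QUARTER["२.५"]` rather than as a generic decimal.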
