- 🚨 The `IndicProcessor` class has been rewritten in Cython for a faster implementation. This gives us at least `+10 lines/s`.
- A new `visualize` argument has been added to `preprocess_batch` to track the processing with a `tqdm` bar.
- The repository has been renamed to `IndicTransToolkit`.
- 🚨 The custom tokenizer has been removed from the repository. To use it, revert to a previous commit (v1.0.1), though this is strongly discouraged. The official (and only) tokenizer is available on HF along with the models.
- The PreTrainedTokenizer for IndicTrans2 is now available on HF 🎉🎉 Note that you still need the `IndicProcessor` to pre-process the sentences before tokenization.
- 🚨 In favor of the standard PreTrainedTokenizer, we have deprecated the custom tokenizer. It will remain available here for backward compatibility, but no further updates or bug fixes will be provided.
- The `indic_evaluate` function is now consolidated into a concrete `IndicEvaluator` class.
- The data collation function for training is consolidated into a concrete `IndicDataCollator` class.
- A simple batching method is now available in the `IndicProcessor`.
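
The simple batching mentioned in the last item amounts to splitting a list of sentences into fixed-size chunks before pre-processing. The sketch below is a generic illustration only: the name and signature of the actual `IndicProcessor` batching method are not given here, so `make_batches` and `batch_size` are hypothetical names and not part of the toolkit.

```python
from typing import Iterator, List


def make_batches(sentences: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `sentences`, each holding at most `batch_size` items.

    Hypothetical helper for illustration; not the IndicProcessor API.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


sentences = ["sentence one", "sentence two", "sentence three"]
batches = list(make_batches(sentences, batch_size=2))
```

Each batch could then be passed through pre-processing and tokenization in turn; the final batch may be shorter than `batch_size`.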