
Pre-compute Coefficients for Common Languages in CharAugmenter #10

Open
3 tasks
LSanselme opened this issue Jan 16, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@LSanselme
Collaborator

Issue description

Textnoisr uses a correction coefficient to account for repeated consecutive letters in natural language.

As @felix-martel-prl said in #7 (review):

The next step would be to pre-compute this coefficient for a range of common languages
CharAugmenter(language="en") is better than CharAugmenter(natural_language_swap_correction=1.052).

This would indeed improve readability and make the code easier to use for non-English languages.

Suggested Implementation Steps:

  • Identify a set of common languages for pre-computation.
  • Implement a mechanism to store and retrieve pre-computed coefficients.
  • Update the CharAugmenter module to use pre-computed coefficients when available.
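
The steps above could be sketched roughly as follows. This is only an illustration of one possible storage and lookup mechanism, not textnoisr's actual implementation: the names `PRECOMPUTED_SWAP_CORRECTIONS` and `resolve_swap_correction` are hypothetical, the fallback of 1.0 (taken here to mean "no correction") is an assumption, and only the English value 1.052 comes from the discussion quoted above.

```python
from __future__ import annotations

# Hypothetical lookup table. Only the English value (1.052) is taken from the
# issue text; other languages would be added once their coefficients are computed.
PRECOMPUTED_SWAP_CORRECTIONS: dict[str, float] = {
    "en": 1.052,
}


def resolve_swap_correction(
    language: str | None = None,
    natural_language_swap_correction: float | None = None,
) -> float:
    """Pick the swap correction, preferring an explicitly given value."""
    # An explicit coefficient always wins, so existing callers keep working.
    if natural_language_swap_correction is not None:
        return natural_language_swap_correction
    # Assumption: with neither argument given, 1.0 stands for "no correction".
    if language is None:
        return 1.0
    try:
        return PRECOMPUTED_SWAP_CORRECTIONS[language.lower()]
    except KeyError:
        supported = ", ".join(sorted(PRECOMPUTED_SWAP_CORRECTIONS))
        raise ValueError(
            f"No pre-computed coefficient for language {language!r}; "
            f"supported languages: {supported}"
        ) from None
```

With a helper like this, a call such as `CharAugmenter(language="en")` could resolve the coefficient internally, while an unknown language code would fail early with an explicit error instead of silently falling back to a wrong value.
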
LSanselme added the enhancement label on Jan 16, 2024