This project implements a transformer-based neural network architecture from scratch in PyTorch for English-to-Hindi machine translation. The model is trained on the Helsinki-NLP/opus-100 English-Hindi parallel corpus to translate English sentences into Hindi.
Machine Translation (MT) is the task of automatically converting text from one language to another. This project focuses on building a transformer-based model to translate English sentences into Hindi. Transformers have become the standard architecture for most NLP tasks, including machine translation, because self-attention lets them capture long-range dependencies in sequential data efficiently.
We use the Helsinki-NLP/opus-100 en-hi dataset, which contains around 538K English-Hindi sentence pairs. The dataset consists of formal translations from various domains such as health, tourism, and general topics.
- Source Language: English
- Target Language: Hindi
- Dataset Link: https://huggingface.co/datasets/Helsinki-NLP/opus-100
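A minimal sketch of loading the dataset with the Hugging Face `datasets` library listed in the requirements:

```python
from datasets import load_dataset

# Load the OPUS-100 English-Hindi pairs from the Hugging Face Hub.
dataset = load_dataset("Helsinki-NLP/opus-100", "en-hi")

# Each example carries a "translation" dict keyed by language code.
print(dataset["train"][0]["translation"])  # {'en': '...', 'hi': '...'}
```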
- Tokenization of both English and Hindi sentences using word-level tokenizers (see the sketch after this list).
- Sequence padding and truncation to handle varying sentence lengths.
- Vocabulary creation and encoding for both languages.
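A minimal sketch of training a word-level tokenizer with the Hugging Face `tokenizers` library; the special tokens and `min_frequency` shown here are assumptions, not values confirmed by this repo:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

def build_tokenizer(sentences):
    # Word-level model with an explicit unknown token.
    tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordLevelTrainer(
        special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"],  # assumed special tokens
        min_frequency=2,                                      # assumed threshold
    )
    # Builds the vocabulary and token-to-id mapping from raw sentences.
    tokenizer.train_from_iterator(sentences, trainer=trainer)
    return tokenizer
```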
The project implements the Transformer model from scratch in PyTorch. Key components of the transformer architecture include:
- Multi-Head Attention: Enables the model to attend to different parts of the sequence simultaneously. The decoder uses two slightly modified variants: masked multi-head self-attention and cross-attention over the encoder output.
- Positional Encoding: Adds information about the position of each token in the sequence (a PyTorch sketch follows this list).
- Feed-Forward Networks: Apply a position-wise non-linear transformation to the attention output.
- Layer Normalization: Stabilizes training and speeds up convergence.
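A minimal PyTorch sketch of the sinusoidal positional encoding used by transformers (the `max_len` default here is an assumption, not a value from this repo):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding, as in 'Attention Is All You Need'."""

    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the encodings for the first seq_len positions.
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)
```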
- Number of Layers: 6 for both encoder and decoder.
- Heads in Attention Mechanism: 8
- Hidden Size: 512
- Dropout: 0.1 for regularization.
- Optimizer: Adam
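For reference, the built-in `torch.nn.Transformer` accepts exactly these hyperparameters; the sketch below is illustrative, and the learning rate is an assumed value rather than one taken from this repo:

```python
import torch.nn as nn
import torch.optim as optim

# Reference model with the hyperparameters listed above; the from-scratch
# implementation in this repo is assumed to take equivalent arguments.
model = nn.Transformer(
    d_model=512,            # hidden size
    nhead=8,                # attention heads
    num_encoder_layers=6,
    num_decoder_layers=6,
    dropout=0.1,
)

# Adam optimizer; the learning rate is an assumption, not from the repo.
optimizer = optim.Adam(model.parameters(), lr=1e-4)
```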
- Python 3.8+
- PyTorch 1.8+
- TorchText
- NumPy
- TensorBoard
- Tokenizers (Hugging Face)
- Datasets (Hugging Face)
To install the necessary packages, run:

```bash
pip install -r requirements.txt
```
The model was evaluated using Character Error Rate (CER), Word Error Rate (WER), and BLEU score to measure translation quality.
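A minimal sketch of computing these metrics, assuming the `torchmetrics` package (an assumption; it is not in the requirements list above). The strings are taken from the example translations below:

```python
from torchmetrics.text import BLEUScore, CharErrorRate, WordErrorRate

predictions = ["देखे गये के रूप में संदेश चिह्नित करने के लिए समय समाप्ति ."]  # decoded outputs
references = ["देखे गये के रूप में संदेश चिह्नित करने के लिए समय समाप्ति."]   # ground truth

print(CharErrorRate()(predictions, references).item())
print(WordErrorRate()(predictions, references).item())
# BLEUScore expects a list of reference lists, one list per prediction.
print(BLEUScore()(predictions, [[r] for r in references]).item())
```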
After training, the model achieved:
- Character Error Rate: 0.2505
- Word Error Rate: 0.4615
- BLEU Score: 0.0744
Example translations:
SOURCE: A significant number of people have been blaming two main political leaders for all the vice of Bangladesh.
TARGET: बांग्लादेश की सारी समस्या के लिए बहुत से लोग दो मुख्य राजनीतिक दल के नेताओं पर दोषारोपण करते रहे हैं.
PREDICTED: बांग्लादेश की सारी समस्या के लिए बहुत से लोग दो मुख्य राजनीतिक दल के नेताओं पर नज़र रखी गई .
SOURCE: Allah originates the creation, then He will bring it back, then you will be brought back to Him.
TARGET: ख़ुदा ही ने मख़लूकात को पहली बार पैदा किया फिर वही दुबारा (पैदा करेगा) फिर तुम सब लोग उसी की तरफ लौटाए जाओगे
PREDICTED: ख़ुदा ही ने मख़लूकात को पहली बार पैदा किया फिर वही दुबारा ( पैदा करेगा ) फिर तुम सब लोग उसी की तरफ लौटाए जाओगे
SOURCE: Was it not a proof for them that the learned men of Israel knew about this?
TARGET: क्या यह उनके लिए कोई निशानी नहीं है कि इसे बनी इसराईल के विद्वान जानते है?
PREDICTED: क्या यह उनके लिए कोई निशानी नहीं है कि इसे बनी इसराईल के विद्वान जानते है ?
SOURCE: Timeout for marking message as seen.
TARGET: देखे गये के रूप में संदेश चिह्नित करने के लिए समय समाप्ति.
PREDICTED: देखे गये के रूप में संदेश चिह्नित करने के लिए समय समाप्ति .
- The model was trained for 20 epochs. Performance could be improved by training for more epochs and by introducing early stopping so that the model does not overfit (a minimal sketch follows these notes).
- Because the dataset is large, only 20% of the data (~58K pairs) was used for training and validation due to limited resources. Performance could be further improved by training on the entire dataset.
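A minimal early-stopping sketch; `train_one_epoch` and `evaluate` are hypothetical placeholders for the project's training and validation routines, and the patience value is an assumption:

```python
import torch

num_epochs = 40                      # assumed upper bound on epochs
best_val_loss = float("inf")
patience, bad_epochs = 3, 0          # assumed patience value

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)  # placeholder training step
    val_loss = evaluate(model, val_loader)           # placeholder validation step
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation loss stopped improving; stop before overfitting
```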