My second project in Natural Language Processing (NLP), where I fine-tuned a bert-base-uncased model to classify spam SMS. This is a huge improvement over https://github.com/fzn0x/bert-indonesian-english-hate-comments.

fzn0x/bert-sms-classification

Fine-tuned BERT-base-uncased pre-trained model to classify spam SMS.

How to use this model?

from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('fzn0x/bert-spam-classification-model')
model = BertForSequenceClassification.from_pretrained('fzn0x/bert-spam-classification-model')

See scripts/predict.py for a full example. You only need to change the argument passed to from_pretrained.
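As a rough illustration of how the loaded tokenizer and model could be used for a single prediction, here is a minimal sketch. It is not the contents of scripts/predict.py. The label mapping (index 0 = "ham", index 1 = "spam") and the helper names `softmax`, `label_from_logits`, and `classify` are assumptions made for this example; check the model's `config.id2label` for the real mapping.

```python
import math

# Assumption: index 0 means "ham" (not spam) and index 1 means "spam".
# Verify against model.config.id2label before relying on this.
ID2LABEL = {0: "ham", 1: "spam"}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


def classify(text, model_name="fzn0x/bert-spam-classification-model"):
    """Hypothetical helper: download the model and classify one message."""
    # Imported lazily so the helpers above work without torch installed.
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return label_from_logits(logits)


# Offline demonstration of the post-processing step with made-up logits.
label, confidence = label_from_logits([-1.2, 2.3])
print(label, round(confidence, 3))
```

Calling `classify("Congratulations! You won a free prize, reply now.")` would download the model on first use, so it needs network access and the packages from requirements.txt.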

✅ Install requirements

Install the required dependencies:

pip install --upgrade pip
pip install -r requirements.txt

✅ Add BERT virtual env

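Run the commands below, ideally before installing the requirements so that the packages are installed into the environment: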

# ✅ Create and activate a virtual environment
python -m venv bert-env
source bert-env/bin/activate    # On Windows use: bert-env\Scripts\activate

✅ Install PyTorch with CUDA

Check if your GPU supports CUDA:

nvidia-smi

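Then install the CUDA 12.1 build of PyTorch and set the allocator option:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False    # On Windows use: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False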

🔧 How to use

  • Check your device and CUDA availability:
python check_device.py

⚠️ Using the CPU is not advisable; check that CUDA is available first.

  • Train the model:
python scripts/train.py

⚠️ After training, remove unneeded checkpoints in models/pretrained to save storage.

  • Run prediction:
python scripts/predict.py
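
The device check in the first step above can be approximated with a few lines of PyTorch. This is a minimal sketch, not the actual contents of check_device.py; the function name `pick_device` is hypothetical.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch installed, report "cpu".
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If this reports `cpu` on a machine with an NVIDIA GPU, the installed PyTorch build is probably CPU-only and should be reinstalled from the CUDA index URL given in the install step.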

✅ Dataset location: data/spam.csv. Modify the dataset to adapt the model to your needs.
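
Before modifying the dataset, it can help to check how balanced the labels are. The sketch below uses only the standard library and assumes the column names of the Kaggle release of the SMS Spam Collection (`v1` for the label, `v2` for the message text). The actual layout of data/spam.csv may differ, so adjust `label_column` accordingly. The function name `count_labels` is hypothetical.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column="v1"):
    """Count how many rows carry each label in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column].strip().lower() for row in reader)


# Small in-memory sample standing in for data/spam.csv.
sample = io.StringIO(
    "v1,v2\n"
    "ham,See you at 5\n"
    "spam,WIN a free prize now\n"
    "ham,Ok\n"
)
counts = count_labels(sample)
print(counts)
```

For the real file, something like `with open("data/spam.csv", encoding="latin-1", newline="") as f: print(count_labels(f))` should work, where the `latin-1` encoding is also an assumption based on the Kaggle release.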

📚 Citations

If you use this repository or its ideas, please cite the following (see citations.bib for full BibTeX entries):

  • Wolf et al., Transformers: State-of-the-Art Natural Language Processing, EMNLP 2020. ACL Anthology
  • Pedregosa et al., Scikit-learn: Machine Learning in Python, JMLR 2011.
  • Almeida & Gómez Hidalgo, SMS Spam Collection v.1, UCI Machine Learning Repository (2011). Kaggle Link

🧠 Credits and Libraries Used

License and Usage

Licensed under the MIT license.


Leave a ⭐ if you find this project helpful. Contributions are welcome.

