This repository contains the code and documentation for a sentiment analysis project on Twitter data. The primary objective is to classify text documents from the Twitter dataset by the sentiment they express: positive, negative, neutral, or irrelevant.
The goal is to use classification models to analyze and predict the sentiment expressed in Twitter posts. The project follows a structured pipeline: preprocess the data, train sentiment classifiers, and evaluate their performance.
The project covers two areas:
- Classification
- Text mining
The dataset, referred to as the "Ayaz dataset," comprises text documents, each tagged with a sentiment label. It is divided into three parts:
- Twitter-training - Used for training the models.
- Twitter-test - Used for preliminary testing and model tuning.
- Twitter-validation - Used for final model validation and performance assessment.
Each record has four fields:
- Tweet ID: Unique identifier for each text document.
- Entity: The entity in the text toward which the sentiment is expressed.
- Sentiment: Sentiment expressed in the text (positive, negative, neutral, or irrelevant).
- Tweet content: Actual text content of the Twitter post.
The raw dataset undergoes several preprocessing steps to make it suitable for model training:
- Removal of special characters.
- Conversion of text to lowercase.
- Elimination of stop words.
- Stemming and lemmatization.
- Any additional necessary preprocessing steps.
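The first three steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function name `preprocess` and the tiny `STOP_WORDS` set are assumptions made for the example. Stemming and lemmatization are left out because they usually rely on a third-party library such as NLTK.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real run would use a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "it", "this"}


def preprocess(text: str) -> list[str]:
    """Strip special characters, lowercase, and drop stop words."""
    # Replace anything that is not an ASCII letter, digit, or whitespace.
    # Note that this also discards emoji and accented characters.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("This game is AMAZING!!! #love it"))
# ['game', 'amazing', 'love']
```

Returning a token list rather than a joined string keeps the output directly usable by a feature extractor that accepts pre-tokenized input.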
Features are extracted from the preprocessed text documents using common methods such as:
- Bag-of-Words
- TF-IDF
- Word embeddings (Word2Vec, GloVe)
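To make the first two methods concrete, here is a small sketch of Bag-of-Words counts and a plain textbook TF-IDF weighting (term count times `log(N / document frequency)`). The function names and toy documents are assumptions for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ. Word embeddings are not covered here because they require pretrained vectors.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return one term-count mapping per tokenized document."""
    return [Counter(doc) for doc in docs]


def tf_idf(docs):
    """Weight each term count by log(N / document frequency)."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for counts in bag_of_words(docs):
        weighted.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Toy, already-tokenized documents (illustrative only).
docs = [["great", "game"], ["bad", "game"], ["great", "great", "update"]]
bow = bag_of_words(docs)
weights = tf_idf(docs)
print(bow[2])
print(weights[1])
```

In the second document, "bad" receives a higher weight than "game" because it appears in only one of the three documents.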
At least two classification models are trained on the processed data:
- Naïve Bayes
- Support Vector Machine (SVM)
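One possible way to train both models is with scikit-learn pipelines, as sketched below. Several choices here are assumptions rather than details from this project: the TF-IDF features, `MultinomialNB` as the Naïve Bayes variant, `LinearSVC` as the SVM (the kernel is not specified above), and the toy training sentences, which only stand in for the Twitter-training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the Twitter-training split (illustrative only).
train_texts = [
    "love this game", "great update today",
    "hate this bug", "terrible servers again",
    "patch notes released", "maintenance scheduled tonight",
    "check out my channel", "follow for giveaways",
]
train_labels = [
    "positive", "positive",
    "negative", "negative",
    "neutral", "neutral",
    "irrelevant", "irrelevant",
]

# Each pipeline vectorizes the text and then fits one classifier.
models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

for name, model in models.items():
    model.fit(train_texts, train_labels)

predictions = {
    name: list(model.predict(["the update is great"]))
    for name, model in models.items()
}
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline means the same fitted vocabulary is reused at prediction time, which avoids a train/test feature mismatch.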
Models are evaluated using the Twitter-test dataset with metrics like:
- Confusion Matrix
- Precision
- Recall
- F1-Score
- Accuracy
Performance results are documented and compared.
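The metrics listed above can be computed directly from true and predicted labels. The sketch below uses only the standard library; the function names and the toy labels are assumptions for illustration, and scores are per class, since the task has four sentiment classes.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (illustrative only).
y_true = ["positive", "positive", "negative", "neutral", "irrelevant", "negative"]
y_pred = ["positive", "negative", "negative", "neutral", "neutral", "negative"]

matrix = confusion_counts(y_true, y_pred)
print(matrix[("positive", "negative")])  # positives misclassified as negative
print(per_class_scores(y_true, y_pred, "positive"))
print(accuracy(y_true, y_pred))
```

Reporting per-class scores alongside accuracy makes it easier to see when a model does well overall but poorly on a less frequent class.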
Final model validation is performed on the Twitter-validation dataset. For each tested model, an Excel file is created containing the validation data with an additional column named "PredictedSentiment" that stores the predicted sentiment labels.
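A minimal sketch of this export step with pandas follows. The helper names, the `text` and `sentiment` column names, and the output file naming are hypothetical; only the "PredictedSentiment" column name comes from the description above. Writing `.xlsx` files with `DataFrame.to_excel` also requires an Excel engine such as openpyxl to be installed.

```python
import pandas as pd


def add_predictions(validation_df, predictions):
    """Return a copy of the data with a PredictedSentiment column."""
    out = validation_df.copy()
    out["PredictedSentiment"] = list(predictions)
    return out


def save_predictions(validation_df, predictions, model_name):
    """Write one Excel file per model; the file name is a hypothetical choice."""
    path = f"{model_name}_validation_predictions.xlsx"
    add_predictions(validation_df, predictions).to_excel(path, index=False)
    return path


# Toy stand-in for the Twitter-validation split (column names are assumed).
validation_df = pd.DataFrame({
    "text": ["love it", "hate it"],
    "sentiment": ["positive", "negative"],
})
result = add_predictions(validation_df, ["positive", "neutral"])
print(result)
```

Calling `save_predictions(validation_df, preds, "svm")` once per model would then produce one file per model, each with the original columns plus the predictions.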