Welcome to the NLP Chatbot Project! This first project demonstrates how to create a simple chatbot that uses cosine similarity for question answering.
- Purpose: A chatbot that matches user queries to predefined questions and returns the corresponding answers.
- Technologies: Python, NLTK, NumPy, scikit-learn
This chatbot:
- Tokenizes and removes stopwords from user input.
- Matches the input to a list of predefined questions using cosine similarity.
- Returns the corresponding answer if a match is found.
- Responds with "I can't answer this question." if no match is found.
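The matching steps above can be sketched without any third-party libraries. This is a minimal illustration, not the project's code: it uses raw word counts rather than TF-IDF, and the tiny stopword set, the two placeholder question/answer pairs, and the names tokenize, cosine, and answer are assumptions made up for the example. The 0.6 threshold mirrors the one used in the project's get_response function.

```python
import math
import re
from collections import Counter

# Hypothetical mini stopword list and Q&A pairs, for illustration only.
STOPWORDS = {"is", "the", "a", "an", "of", "what"}
QUESTIONS = ["What is data analytics?", "What is a pivot table?"]
ANSWERS = ["Placeholder answer about data analytics.",
           "Placeholder answer about pivot tables."]
FALLBACK = "I can't answer this question."

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two bags of words (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def answer(query, threshold=0.6):
    """Return the answer of the most similar question, or the fallback."""
    query_bag = Counter(tokenize(query))
    scores = [cosine(query_bag, Counter(tokenize(q))) for q in QUESTIONS]
    best = max(range(len(scores)), key=scores.__getitem__)
    return ANSWERS[best] if scores[best] > threshold else FALLBACK

print(answer("what is data analytics"))  # matches the first question
print(answer("Who is MS Dhoni?"))        # no shared words, so the fallback
```

Because the off-topic query shares no words with any stored question, every similarity score is 0 and the fallback message is returned.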
- Python 3.x
- Google Colab or Jupyter Notebook
- Libraries: nltk, numpy, scikit-learn, pandas
- Source: CSV file containing questions and answers regarding data analytics.
- Path: test.csv (adjust the path to your own file location)
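For reference, the notebook code reads two columns from this file, Questions and Answers. The sketch below parses a file of that shape with Python's standard csv module. The two sample rows are invented for illustration and are not taken from the real test.csv.

```python
import csv
import io

# Invented sample content; the real test.csv is not reproduced here.
sample_csv = (
    "Questions,Answers\n"
    "What is data analytics?,Placeholder answer one.\n"
    "What is a pivot table?,Placeholder answer two.\n"
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
questions_list = [row["Questions"] for row in rows]
answers_list = [row["Answers"] for row in rows]

# The two lists must stay aligned: answers_list[i] answers questions_list[i].
print(len(questions_list), len(answers_list))
```

If your own CSV uses different column names, the lookups on Questions and Answers need to change accordingly.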
- Mount Google Drive:
    from google.colab import drive
    drive.mount('/content/drive')
- Import Libraries:
    import numpy as np
    import pandas as pd
    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
- Download NLTK Data:
    nltk.download('punkt')
    nltk.download('wordnet')
    nltk.download('stopwords')
- Read Dataset:
    path = r"/content/drive/MyDrive/IMP1DS INTERVIEW PREP2024/15.DSPROJECT2024/1.NLPPROJECTS2024/test.csv"
    df = pd.read_csv(path, encoding='unicode_escape')
    questions_list = df['Questions'].tolist()
    answers_list = df['Answers'].tolist()
- Initialize Tools:
    from nltk.stem import WordNetLemmatizer, PorterStemmer
    from nltk.corpus import stopwords
    import re
- Preprocess Function:
    def preprocess_with_stopwords(text):
        # Normalize the text; stopwords are kept, as the function name indicates.
        lemmatizer = WordNetLemmatizer()
        stemmer = PorterStemmer()
        text = re.sub(r'[^\w\s]', '', text)  # strip punctuation
        tokens = nltk.word_tokenize(text.lower())
        lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
        stemmed_tokens = [stemmer.stem(token) for token in lemmatized_tokens]
        return ' '.join(stemmed_tokens)
- Setup Vectorizer:
    vectorizer = TfidfVectorizer(tokenizer=nltk.word_tokenize)
    X = vectorizer.fit_transform([preprocess_with_stopwords(q) for q in questions_list])
- Get Response Function:
    def get_response(text):
        processed_text = preprocess_with_stopwords(text)
        vectorized_text = vectorizer.transform([processed_text])
        # Similarity of the query to every stored question.
        similarities = cosine_similarity(vectorized_text, X)[0]
        # Pick the best match directly. Refitting the vectorizer on a subset of
        # questions would change its vocabulary, so the query vector would no
        # longer have the same number of features as the refitted matrix.
        closest = int(np.argmax(similarities))
        if similarities[closest] > 0.6:
            return answers_list[closest]
        return "I can't answer this question."
- Example Query:
    get_response('Who is MS Dhoni?')
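To give a feel for what the vectorizer step computes: with its default settings, scikit-learn's TfidfVectorizer uses a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, so words that appear in every question get the lowest weight. The sketch below computes only that idf term by hand on two made-up tokenized questions. The helper name smoothed_idf and the sample data are assumptions, and the term-frequency and L2-normalization steps that the library also applies are left out.

```python
import math

# Two made-up, already-tokenized questions (illustrative only).
docs = [["what", "is", "data", "analytics"], ["what", "is", "sql"]]

def smoothed_idf(term, documents):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, scikit-learn's default smoothing."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

for term in ["what", "data"]:
    print(term, round(smoothed_idf(term, docs), 3))
```

Here "what" occurs in both questions and gets an idf of exactly 1, while "data" occurs in only one and is weighted more heavily. This is why distinctive words dominate the cosine similarity score.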
- GingerIt for Grammar Check:
    !pip install gingerit
    from gingerit.gingerit import GingerIt
    text = 'What is Data Anlytics'
    parser = GingerIt()
    corrected_text = parser.parse(text)
    print(corrected_text['result'])
- TextBlob for Spelling Correction:
    !pip install textblob
    from textblob import TextBlob
    text = 'What is Data Anlytics'
    blob = TextBlob(text)
    corrected_text = blob.correct()
    print(corrected_text)
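One way to use such a correction step is to apply it to the query before matching. As a rough, dependency-free illustration of that idea, the sketch below swaps in Python's standard difflib module instead of GingerIt or TextBlob and snaps each word to the closest entry in a small vocabulary. The vocabulary, the 0.8 cutoff, and the function name correct_query are assumptions for the example, not part of the project.

```python
import difflib

# Hypothetical vocabulary; in practice it could be built from the dataset's questions.
VOCABULARY = ["what", "is", "data", "analytics"]

def correct_query(text, cutoff=0.8):
    """Replace each word with its closest vocabulary entry, if one is close enough."""
    corrected = []
    for word in text.lower().split():
        matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_query("What is Data Anlytics"))
```

Words with no sufficiently close vocabulary entry are passed through unchanged, so an off-topic query is not forced onto an unrelated word.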
Feel free to explore and contribute to the project! 🚀