Sentiment Analysis using NLP on a Hotel Review Dataset
Hotel booking companies have been amassing tremendous amounts of data.
The reviews left by users are valuable to hotels, but their sheer volume makes extracting
insights difficult. Using Natural Language Processing (NLP) techniques, we carry out a sentiment
analysis of the review data and visualize the results. We then analyze the negative and positive sentiments to identify key strengths and weaknesses reflected in the reviews.
Data Sources
The data was scraped from the Booking.com website (Dataset). All data in the file is already publicly available, and it is originally owned by Booking.com. The dataset contains 515,000 customer reviews and scores for 1493 luxury hotels across Europe.
Libraries Used:
- Pandas – DataFrames and Manipulation
- Numpy – Numerical Library
- Matplotlib – Visualization
- Seaborn – Visualization
- Natural Language Toolkit – Library for NLP
- Scikit-learn – Processing techniques
Pre-Processing steps:
The first step, as with any dataset, is to handle the null values. Since there are only a few of them, the affected rows are simply dropped from our analysis. All symbols and characters other than letters are also removed.
To use textual data for predictive modeling, the text must first be split into individual words.
This technique is called tokenization. The tokens are then converted to lower case. Stop words
such as ‘and’ and ‘but’ are removed, as they carry no significance for our analysis. Finally, the tokens are lemmatized,
that is, reduced to their base forms, and the results are stored.
Note: All of the above steps are applied to both the negative and the positive review text.
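The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used for the analysis: the stop-word set and the `toy_lemmatize` helper are hypothetical stand-ins for NLTK's English stop-word list and a real lemmatizer, which need corpus downloads, and the sample reviews are made up.

```python
import re

# Tiny illustrative stop-word set (assumption); NLTK's English stop-word
# list would be the realistic choice.
STOP_WORDS = {"and", "but", "the", "a", "an", "was", "were", "is", "to", "of", "in", "it"}

def toy_lemmatize(token):
    """Crude stand-in for a real lemmatizer: only strips a plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(review):
    """Return the cleaned tokens of one review, or None if the review is missing."""
    if review is None:
        return None
    letters_only = re.sub(r"[^A-Za-z\s]", " ", review)   # keep letters and whitespace only
    tokens = letters_only.lower().split()                # lowercase, then split into tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [toy_lemmatize(t) for t in tokens]            # reduce tokens to a base form

# Made-up sample data; None plays the role of a null review.
reviews = [
    "The rooms were clean, and the staff was friendly!",
    None,
    "Noisy street but great location.",
]

# Null reviews are dropped; every other review is cleaned.
cleaned = [tokens for tokens in map(preprocess, reviews) if tokens is not None]
print(cleaned)
```

The same `preprocess` function would be applied once to the negative text and once to the positive text of each review.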
Key Insights from the data:
Fig 1 Continent of reviewer
By aggregating the user review scores and plotting them with Seaborn, it is clear that the Ritz Paris is the best-rated hotel and that Hotel Liberty appears to be the worst-rated. We observe that 70% of the people who have left reviews are from Europe, and a large portion of them are from the United Kingdom.
The reviewers’ scores seem fairly consistent and independent of the number of days stayed, with the exception
of a drop at 21 days. The scores left by the reviewers also appear to be independent of the
month/year of stay.
Fig 2(a) Month of stay vs. Reviewer score
Fig 2(b) Number of days stayed vs. Reviewer score
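The kind of aggregation behind these observations can be sketched as follows. The rows, hotel names, and field order are hypothetical and chosen only for illustration; with the real data in a pandas DataFrame, the same result would typically come from a `groupby` followed by `mean`.

```python
from collections import Counter, defaultdict
from statistics import mean

# Made-up rows: (hotel, reviewer continent, reviewer score).
rows = [
    ("Hotel A", "Europe", 9.0),
    ("Hotel A", "Asia", 10.0),
    ("Hotel B", "Europe", 6.0),
    ("Hotel B", "Europe", 7.0),
    ("Hotel C", "North America", 8.0),
]

# Collect the scores of each hotel, then average them.
scores_by_hotel = defaultdict(list)
for hotel, _, score in rows:
    scores_by_hotel[hotel].append(score)
mean_score = {hotel: mean(scores) for hotel, scores in scores_by_hotel.items()}

# Hotels ordered from best to worst average score.
ranking = sorted(mean_score, key=mean_score.get, reverse=True)

# Fraction of reviews coming from each continent.
continent_counts = Counter(continent for _, continent, _ in rows)
continent_share = {c: n / len(rows) for c, n in continent_counts.items()}

print(ranking)
print(continent_share)
```

Grouping by month of stay or by number of days stayed instead of by hotel follows the same pattern with a different grouping key.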
The best reviews are written by couples, while solo travelers seem to leave the worst ones. “Whom” in the dataset represents the following:
- Solo Traveler
- Couple
- Group
- Family with young children
- Family with older children
Fig 3(a) Country of Hotel vs. Reviewer score
The best-rated hotels are in Austria and Spain.
Referencing the Grouped Categories in Bar Charts:
Fig 3(b) “Whom” vs. Reviewer score
Fig 4(a) Air conditioning vs. Reviewer score | Fig 4(b) Room size vs. Reviewer score | Fig 5(a) Room problem vs. Reviewer score |
---|---|---|
Fig 5(b) Location vs. Reviewer score | Fig 6(a) Staff vs. Reviewer score | Fig 6(b) Bed/ Room vs. Reviewer score |
Fig 7(b) Trip type vs. Reviewer score | Fig 7(a) Mobile review vs. Reviewer score |
Pearson Correlation Matrix for the features:
Fig 8 Pearson Correlation Matrix for features
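For readers unfamiliar with it, each cell of such a matrix is the Pearson coefficient between two features: their covariance divided by the product of their standard deviations, giving a value between -1 and 1. The sketch below computes it by hand on made-up numbers; the feature names are hypothetical and are not the dataset's actual columns.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up feature values for four reviews (assumption, for illustration only).
features = {
    "reviewer_score": [9.0, 7.5, 4.0, 8.0],
    "days_stayed": [2, 3, 1, 4],
    "negative_word_count": [1, 5, 12, 3],
}

# Build the full matrix as a nested dict: matrix[a][b] is the correlation of a and b.
names = list(features)
matrix = {a: {b: pearson(features[a], features[b]) for b in names} for a in names}

for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

The matrix is symmetric with ones on the diagonal, which is why heatmaps of it are often drawn with only one triangle filled in.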
Thus, many insights can be extracted from the review data. The analysis can be extended further
by selecting a particular hotel from the dataset and examining each feature involved. Using data to closely observe the nuances of the reviews can greatly help increase customer satisfaction and, by extension, improve customer retention and loyalty and generate interest among new prospects.
Rishan Sanjay
Software (Data) Engineer and Product Management
at Kushagramati