blog_cover.png

Sentiment Analysis using NLP on a Hotel Review Dataset

Hotel booking companies have been amassing tremendous amounts of data.

The reviews left by users are valuable to hotels, but because of the sheer volume, extracting insights is no easy task. Using Natural Language Processing (NLP) techniques, we carry out a sentiment analysis of the review data and visualize the results. The negative and positive sentiments are then analyzed to point out key strengths and weaknesses based on the given reviews.

Data Sources

The data was scraped from the Booking.com website (Dataset). All data in the file is already publicly available and is originally owned by Booking.com. The dataset contains 515,000 customer reviews and scores for 1,493 luxury hotels across Europe.

Libraries Used:

  • Pandas – DataFrames and Manipulation

  • Numpy – Numerical Library

  • Matplotlib – Visualization

  • Seaborn – Visualization

  • Natural Language Toolkit – Library for NLP

  • Scikit-learn – Processing techniques

Pre-Processing steps:

The first step, as with all data, is to handle the null values. Since there are only a few of them, they are simply dropped from our analysis. All symbols and characters other than letters are also removed.

To use textual data for predictive modeling, the text must first be split into individual words. This technique is called tokenization. The tokens are then converted to lower case. Stop words such as ‘and’ and ‘but’ are removed because they carry no significance in our analysis. Finally, lemmatization is applied, reducing each word to its base form, and the values are stored.

Note: All of the above steps are applied to both the negative and the positive reviews.
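The steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the tiny `STOP_WORDS` set and the `LEMMAS` lookup table are hypothetical stand-ins for the stop-word list and lemmatizer that the Natural Language Toolkit provides, and the sample reviews are made up.

```python
import re

# Illustrative stop-word set (assumption); NLTK ships a much fuller English list.
STOP_WORDS = {"and", "but", "the", "a", "an", "was", "were", "is", "are", "to", "of", "in", "it"}

# Toy lemma lookup (assumption) standing in for a real lemmatizer.
LEMMAS = {"rooms": "room", "beds": "bed", "hotels": "hotel"}


def preprocess(review):
    """Clean one review and return its list of lemmatized tokens."""
    # Keep letters and whitespace only; everything else becomes a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", review)
    # Tokenize on whitespace, then convert each token to lower case.
    tokens = [token.lower() for token in letters_only.split()]
    # Drop stop words, which carry no sentiment on their own.
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # Map each remaining token to its base form where the lookup knows one.
    return [LEMMAS.get(token, token) for token in tokens]


# Made-up sample data; None plays the role of a null review.
reviews = [
    "The rooms were clean, and the beds were comfy!",
    None,
    "Noisy street but friendly staff.",
]

# Null values are dropped before any text processing.
cleaned = [preprocess(review) for review in reviews if review is not None]
print(cleaned)
```

The same `preprocess` function would be applied once to the positive reviews and once to the negative ones.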

Key Insights from the data:

fig1.png

Fig 1 Continent of reviewer

By aggregating the user review scores and plotting them with Seaborn’s graphing tools, it is clear that the Ritz Paris is the best-rated hotel and that Hotel Liberty appears to be the worst. We observe that 70% of the people who left reviews are from Europe, and a large portion of them are from the United Kingdom.
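As a rough sketch of this kind of aggregation, the snippet below ranks hotels by mean reviewer score and computes the share of reviews per nationality with Pandas. The column names (`Hotel_Name`, `Reviewer_Nationality`, `Reviewer_Score`) and the toy rows are assumptions made for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy data with assumed column names; the real file has far more rows.
reviews = pd.DataFrame({
    "Hotel_Name": ["Hotel A", "Hotel A", "Hotel B", "Hotel B", "Hotel C"],
    "Reviewer_Nationality": [
        "United Kingdom", "France", "United Kingdom",
        "United States", "United Kingdom",
    ],
    "Reviewer_Score": [9.0, 8.0, 6.0, 5.0, 10.0],
})

# Mean score per hotel, highest first.
mean_scores = (
    reviews.groupby("Hotel_Name")["Reviewer_Score"]
    .mean()
    .sort_values(ascending=False)
)
best_hotel = mean_scores.index[0]
worst_hotel = mean_scores.index[-1]

# Fraction of reviews contributed by each nationality.
nationality_share = reviews["Reviewer_Nationality"].value_counts(normalize=True)

print(mean_scores)
print(nationality_share)
```

A series such as `mean_scores` can then be handed to a Seaborn bar plot to produce charts like the ones in this post.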

The reviewers’ scores seem fairly consistent and independent of the number of days stayed, with the exception of the dip occurring at 21 days. The scores also appear to be independent of the month and year of stay.

fig_2a.png

Fig 2(a) Month of stay vs. Reviewer score

fig_2b.png

Fig 2(b) Number of days stayed vs. Reviewer score

The best reviews are written by couples, while solo travelers seem to leave the worst ones. “Whom” in the dataset represents the following:

  • Solo Traveler

  • Couple

  • Group

  • Family with young children

  • Family with older children

fig_3b (1).png

Fig 3(a) Country of Hotel vs. Reviewer score

The best-rated hotels are in Austria and Spain.

Referencing the Grouped Categories in Bar Charts:

fig_3b.png

Fig 3(b) “Whom” vs. Reviewer score

Fig 4(a) Air conditioning vs. Reviewer score

Fig 4(b) Room size vs. Reviewer score

Fig 5(a) Room problem vs. Reviewer score

Fig 5(b) Location vs. Reviewer score

Fig 6(a) Staff vs. Reviewer score

Fig 6(b) Bed/Room vs. Reviewer score

Fig 7(b) Trip type vs. Reviewer score

Fig 7(a) Mobile review vs. Reviewer score

Pearson Correlation Matrix for the features:

fig8.png

Fig 8 Pearson Correlation Matrix for features
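For readers who want to reproduce a matrix like this, a minimal sketch with Pandas follows. The feature names and values are made up for illustration; only the call to `DataFrame.corr` with `method="pearson"` is the point.

```python
import pandas as pd

# Hypothetical numeric features; the real analysis uses columns derived from the reviews.
features = pd.DataFrame({
    "Reviewer_Score": [9.0, 7.5, 4.0, 8.0, 6.0],
    "Nights_Stayed": [1, 2, 3, 2, 5],
    "Mobile_Review": [1, 0, 1, 0, 0],
})

# Pairwise Pearson correlation between every pair of columns.
corr = features.corr(method="pearson")
print(corr.round(2))
```

The result is a symmetric table with ones on the diagonal, which is the shape a heatmap plot expects as input.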

Thus, a great many insights can be extracted from observing the review data. This can be extended further by selecting a particular hotel from the dataset and observing each feature involved. Using data to meticulously observe the nuances of the reviews can greatly aid in increasing customer satisfaction and, by extension, improve customer retention and loyalty and generate interest among new prospects.

Rishan Sanjay

Software (Data) Engineer and Product Management
at Kushagramati

rishan_shetty.png