
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


PROJECT REPORT
Hotel customer reviews analysis

Instructor: Tran Viet Trung
Group: 16
Member: Ngo Quang Viet
Tran Tung Lam
Yannic Elias Hanel
Le Dam Quan
Ayoub Ala Mostafa

Hà Nội, 2023

Student IDs (in source order): 20194881, 20194788, 2023T039, 2023T051, 2023T011


Table of contents

Introduction
1. Data preparation
1.1. Business analysis
1.2. Data collection
1.3. Data understanding
2. Data cleaning and preprocessing
2.1. Handling missing values and duplications
2.2. Datatypes
2.3. One-hot encoding
2.4. Cleaning text data
3. Exploratory Data Analysis
3.1. Univariate analysis
3.2. Bivariate analysis
3.3. Multivariate analysis
3.4. Text analysis
3.4.1. Text length
3.4.2. Common words
3.4.3. Sentiment analysis
4. Sentiment Classification with Machine Learning model
4.1. Overview
4.2. Prepare data
4.3. Feature extraction
4.3.1. BoW features
4.3.2. TF-IDF features
4.4. Model training
4.5. Model evaluation
Conclusion


Introduction
In an era where digital travel planning is becoming increasingly important, the
analysis of hotel reviews is becoming more and more relevant. This project addresses the
challenge of gaining valuable insights from an abundance of hotel reviews. By automatically
collecting around 20,000 individual reviews via the Google Travel website, we provide a
comprehensive insight into the needs and expectations of travelers.
The immense amount of information available, especially in the form of unstructured text
data from hotel reviews, holds a potential that has so far remained largely untapped. This
data contains valuable information about the customer experience, the quality of services
and the strengths and weaknesses of hotels. In order to fully exploit this potential, we have
supplemented the automatic data collection with advanced text understanding models.
The main goal of this project is not only to collect data, but to understand it in depth and
breadth. By applying advanced text comprehension models and comprehensive data
analysis, we aim to paint a clear picture of customer reviews. This precise interpretation will
allow companies to not only capture the tone of their customers, but also derive concrete
actions to optimize their services and maintain a positive customer experience.
For companies in the tourism sector, the value of this project lies in the possibility of gaining in-depth
insights into customer opinions. By precisely analyzing the collected data, companies can
emphasize their strengths, address weaknesses and make targeted improvements. This is
not only in response to past reviews, but also as a proactive approach to future customer
expectations. At a time when customer loyalty is heavily influenced by online reviews, this
project offers businesses the opportunity to strengthen their online reputation and gain a
clear competitive advantage. By understanding the data collected, companies can deploy
their resources more effectively and continuously adapt their services to the needs of their
customers.
In the following sections, you will dive into the intricacies of our data preparation and
collection process, where a detailed analysis of the user rating scheme is presented. Then
we provide a comprehensive overview of our data cleansing procedures, where we also
discuss the more detailed cleansing of review texts. Finally, we provide an in-depth analysis
of the cleansed data, supported by informative visuals. As an added feature, we conclude
the report with a sentiment analysis for the ratings, which provides additional insight into
user feedback.



1. Data preparation
1.1. Business analysis
This data science project centers around mining hotel reviews to discern the factors
that resonate with customers. The goal extends beyond identifying surface-level
preferences; it encompasses delving into the finer details that shape a guest's perception.
Rather than just cataloging customer likes, the focus is on unraveling the underlying reasons
behind their preferences. It goes beyond recognizing, for instance, a fondness for
comfortable beds, seeking to comprehend how these seemingly small details contribute to
an overall positive guest experience.
The significance of this initiative transcends individual hotels; it aspires to provide valuable
insights to the broader hospitality industry. The findings serve as a strategic tool, offering
foresight into emerging trends and ensuring that hotels are not just keeping pace but
staying ahead in meeting evolving guest expectations. It's akin to a strategic guide,
empowering hotels to proactively enhance guest satisfaction and continually elevate their
standards.

1.2. Data collection
For this analysis, we collect hotel information from 200 hotels across 5 locations:
Birmingham, Edinburgh, Liverpool, London and Manchester, as well as their users’ reviews:
100 for each hotel. All hotel information and reviews are collected from Google Travel.
Hotel information schema is as follows:

| Field name | Type | Description | May be empty? |
| --- | --- | --- | --- |
| source | str | Website from which the hotel information was collected (in this case, only "Google") | No |
| name | str | Hotel name | No |
| address | str | Hotel address | No |
| images_count | int | Number of photos submitted by the hotel's owner | No |
| popular_amenities | list[str] | List of popular amenities | No |


Users’ review schema is as follows:

| Field name | Type | Description | May be empty? |
| --- | --- | --- | --- |
| hotel_name | str | Name of the hotel | No |
| review_text | str | Review in text | No |
| rating | float | Rating score (scale of 5) | No |
| review_timestamp | datetime | Timestamp when the review was made | No |
| trip_type | str | Type of trip (possible values: "Business", "Vacation") | Yes |
| trip_companions | str | With whom the reviewer traveled (possible values: "Family", "Friends", "Couple", "Solo") | Yes |

The crawling workflow is as follows:
- First, crawl a list of URLs pointing to each hotel's details page by executing the
`get_hotel_list.py` script:
python3 get_hotel_list.py [-h] --limit LIMIT [--output OUTPUT] [--headless | --no-headless] locations [locations ...]


List of parameters:

| Parameter | Required? | Description |
| --- | --- | --- |
| -h | No | Shows the help message and exits |
| --limit LIMIT | Yes | Maximum number of hotels per location |
| --output OUTPUT | No | Path to write the URL list to (defaults to stdout, i.e. the console, if not specified) |
| --headless, --no-headless | No | Whether to run the script with a headless browser; running without headless mode is useful for debugging. Defaults to headless mode. |

- Then, from the list of URLs acquired in the above step, we proceed to get the
hotel details as well as the reviews in separate flows: one for collecting hotel
details, one for collecting the hotels' reviews.

To retrieve a list of hotel details (in CSV format), use the `get_hotel_details.py`
script:
python3 get_hotel_details.py [-h] [--output OUTPUT] [--headless | --no-headless] input
List of parameters:
| Parameter | Required? | Description |
| --- | --- | --- |
| -h | No | Displays the help message and exits |
| input | Yes | Path to the file containing the list of URLs collected in the previous step |
| --output OUTPUT | No | Path to the output file; defaults to a CSV file whose name contains the timestamp at which the script started |
| --headless, --no-headless | No | Whether to run the script with a headless browser; running without headless mode is useful for debugging. Defaults to headless mode. |

If your computer/runner is powerful enough, it is advised that you run a few
batches at a time; the processes should not interfere with one another.

To retrieve a list of hotel reviews (in CSV format), use the `get_hotel_reviews.py` script:
python3 get_hotel_reviews.py [-h] [--output OUTPUT] --limit LIMIT [--headless | --no-headless] input
List of parameters:


| Parameter | Required? | Description |
| --- | --- | --- |
| -h | No | Displays the help message and exits |
| input | Yes | Path to the file containing the list of URLs collected in the previous step |
| --output OUTPUT | No | Path to the output file; defaults to a CSV file whose name contains the timestamp at which the script started |
| --limit LIMIT | Yes | Maximum number of reviews per hotel |
| --headless, --no-headless | No | Whether to run the script with a headless browser; running without headless mode is useful for debugging. Defaults to headless mode. |
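
As an illustration of the command-line interface described above, the options of `get_hotel_list.py` could be declared with Python's standard `argparse` module roughly as follows. This is only a sketch under our own assumptions: the function name `build_parser` and the help texts are hypothetical, and the actual script may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="get_hotel_list.py",
        description="Collect URLs of hotel detail pages for the given locations.",
    )
    # Required: maximum number of hotels to collect per location.
    parser.add_argument("--limit", type=int, required=True,
                        help="maximum number of hotels per location")
    # Optional: where to write the URL list; None is taken to mean stdout.
    parser.add_argument("--output", default=None,
                        help="path to write the URL list (default: stdout)")
    # Generates both --headless and --no-headless; headless is the default.
    parser.add_argument("--headless", action=argparse.BooleanOptionalAction,
                        default=True,
                        help="run the browser in headless mode")
    # One or more location names, e.g. London Manchester.
    parser.add_argument("locations", nargs="+",
                        help="locations to search hotels in")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--limit", "40", "London", "Manchester"])
print(args.limit, args.locations, args.headless)
```

`argparse.BooleanOptionalAction` requires Python 3.9 or later; it produces the paired `--headless | --no-headless` flags shown in the usage lines.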

1.3. Data understanding
In the data understanding phase of this project, our focus is on gaining insights into
the collected data, particularly the hotel reviews. This involves exploring, examining, and
comprehending the structure and content of the dataset. The primary objectives are to
identify patterns, trends, and potential challenges within the data, paving the way for more
informed analysis and interpretation.

Overview of Ratings Distribution:
We begin by examining the distribution of ratings across all hotel reviews. Understanding
the distribution helps us identify whether there's a skew towards positive or negative
sentiments.
Text Length Analysis:
Analyzing the length of review texts can provide insights into customers' engagement levels.
We explore the distribution of text lengths to understand if there's a correlation between
review length and the assigned rating.
Hotel-wise Analysis:
We conduct a detailed analysis of ratings, review lengths, and sentiments for each hotel
individually. This allows us to identify specific patterns and variations unique to each
location.
Sentiment Analysis:
Utilizing sentiment analysis models, we aim to categorize each review as positive, negative,
or neutral. This step is crucial in understanding the overall sentiment of customers towards
the hotels.
Topic Modeling:
Applying topic modeling techniques, we extract key themes and topics present in the
reviews. This helps in understanding the major factors influencing customer opinions.
Handling Unstructured Text:
Given that the data primarily consists of unstructured text, we address challenges related to
natural language processing (NLP), including tokenization, stemming, and lemmatization.


2. Data cleaning and preprocessing
2.1. Handling missing values and duplications
The table below shows the count and percentage of null values in each
column of the dataset. The trip_type column has a significant number of missing
values, roughly 49.83% of the records. The trip_companions column also has
a high percentage of null values, around 45.51%.

Given that almost half of the data for trip_type is missing, the strategy for handling
these null values is crucial. Removing such a large portion of the dataset is likely not
advisable, as it would result in a significant loss of data. Imputation might also be
challenging unless there are strong predictors for trip type within the data. One-hot
encoding the trip_type column with an additional category for nulls might be the most
suitable approach here. It would allow us to retain all the data and treat the missing values
as a separate unknown category.
For trip_companions, the approach would be similar due to the high percentage of
missing values. Including an unknown category could be beneficial for any predictive
modeling or analysis, as this allows the model to account for the fact that the information is
missing, which might itself be a pattern of interest.

The review_text column has a relatively small percentage of missing values (0.166%).
This is a very small fraction of the dataset, so it can be handled differently from the
trip_type and trip_companions columns with their substantially higher percentages of
missing data. Since we plan to perform text analysis, especially sentiment analysis, the
quality and completeness of the text data are paramount. We therefore decided to drop all
records that have null values in this column.
Moreover, there are no duplicate records in the dataset.
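
The strategy described in this section can be sketched with Pandas as follows. This is a minimal illustration rather than the project's exact code: the helper names `summarize_nulls` and `handle_missing` are hypothetical, and the column names are taken from the review schema in section 1.2.

```python
import pandas as pd


def summarize_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of null values for every column."""
    counts = df.isna().sum()
    return pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(3),
    })


def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with missing trip information, drop rows without review text."""
    out = df.copy()
    # Missing trip information becomes its own "unknown" category.
    for column in ("trip_type", "trip_companions"):
        out[column] = out[column].fillna("unknown")
    # Reviews without text cannot be used for text analysis.
    out = out.dropna(subset=["review_text"])
    return out.reset_index(drop=True)


# Tiny made-up sample, only to show the calls.
sample = pd.DataFrame({
    "review_text": ["Great stay", None, "Too noisy"],
    "trip_type": ["Business", None, None],
    "trip_companions": [None, "Solo", "Couple"],
})
print(summarize_nulls(sample))
print(handle_missing(sample))
print("duplicate rows:", sample.duplicated().sum())
```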

2.2. Datatypes
We use the `data.dtypes` attribute to list the data type of each column in the dataset,
as shown in the figure below.
The rating column is of type float64, indicating numerical values with decimal points,
which is common for rating data. The images_count column is an integer (int64), which is
appropriate for count data.

Other columns are of the type object, which typically means they are strings or
mixed types. It is worth noting that the review_timestamp column is also listed as an object,
indicating it has not been interpreted as a date or time format by Pandas. We therefore
convert the review_timestamp column from an object type to a datetime type. This conversion
is a necessary step for any analysis involving time series, as it enables functions such as
resampling, time-based indexing, and extracting components of the date like the month or
day of the week. The correct interpretation of this column makes temporal analyses and
visualizations much easier and more efficient.
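
A minimal sketch of this conversion is given below. The helper name `parse_timestamps` and the sample timestamp strings are assumptions for illustration; the exact format of the crawled timestamps is not shown in this report, so unparseable values are simply coerced to `NaT`.

```python
import pandas as pd


def parse_timestamps(df: pd.DataFrame, column: str = "review_timestamp") -> pd.DataFrame:
    """Convert an object column to datetime and derive simple date parts."""
    out = df.copy()
    # errors="coerce" turns values that cannot be parsed into NaT.
    out[column] = pd.to_datetime(out[column], errors="coerce")
    out["review_month"] = out[column].dt.month
    out["review_weekday"] = out[column].dt.day_name()
    return out


# Made-up sample values, only to show the call.
sample = pd.DataFrame({
    "review_timestamp": ["2023-05-01 10:30:00", "2023-11-15 08:00:00", "not a date"],
    "rating": [5.0, 4.0, 3.0],
})
parsed = parse_timestamps(sample)
print(parsed.dtypes)
print(parsed)
```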

2.3. One-hot encoding

One-hot encoding is a common technique used in data preprocessing to convert
categorical data into a numerical format that can be provided to machine learning
algorithms. We perform one-hot encoding on three categorical columns: trip_type,
trip_companions, and popular_amenities.

The trip_type column is one-hot encoded into three columns: type_Business,
type_Vacation, and type_unknown. There were originally two known
types of trips (Business and Vacation), and an additional column has been created for the
unknown type, which represents the null values in the original data as discussed earlier.
The trip_companions column has been transformed into five columns:
companions_Couple, companions_Family, companions_Friends, companions_Solo, and
companions_unknown. There were four known categories for trip
companions and, as with trip_type, an additional category has been added to represent
unknown or missing values.
The popular_amenities column has also been one-hot encoded into several columns,
each representing a particular amenity such as Air conditioning, Airport shuttle, Breakfast …
Since each value is not a single category but a string representation of what was originally
a list, several operations are needed to convert it into a one-hot representation:
1. Clean the popular_amenities data by removing list delimiters.
2. Convert the list of amenities into one-hot encoded format.
3. Remove any leading or trailing spaces from the new column names.
4. Combine any duplicate columns resulting from similar amenities.
5. Merge the one-hot encoded amenities back into the original dataset.
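
The encoding steps above can be sketched with Pandas as follows. This is an illustrative sketch, not the project's code: the function names are hypothetical, the `amenity_` prefix is assumed from the column names that appear later in the correlation heatmap, and the amenities are assumed to be stored as strings such as `"['Wi-Fi', 'Pool']"`.

```python
import pandas as pd


def encode_trip_columns(reviews: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode trip_type and trip_companions, mapping nulls to 'unknown'."""
    out = reviews.copy()
    for column, prefix in (("trip_type", "type"), ("trip_companions", "companions")):
        filled = out[column].fillna("unknown")
        dummies = pd.get_dummies(filled, prefix=prefix, dtype=int)
        out = pd.concat([out.drop(columns=[column]), dummies], axis=1)
    return out


def encode_amenities(hotels: pd.DataFrame,
                     column: str = "popular_amenities",
                     prefix: str = "amenity_") -> pd.DataFrame:
    """One-hot encode a stringified list of amenities (steps 1 to 5 above)."""
    # 1. Remove list delimiters and quotes: "['A', 'B']" -> "A, B".
    cleaned = hotels[column].fillna("").str.replace(r"[\[\]'\"]", "", regex=True)
    # 2. One indicator column per comma-separated amenity.
    dummies = cleaned.str.get_dummies(sep=",")
    # 3. Strip leading and trailing spaces from the new column names.
    dummies.columns = dummies.columns.str.strip()
    # 4. Combine columns that now share the same name.
    dummies = dummies.T.groupby(level=0).max().T
    # 5. Merge the indicators back into the hotel table.
    return pd.concat([hotels.drop(columns=[column]), dummies.add_prefix(prefix)], axis=1)


# Made-up sample rows, only to show the calls.
hotels = pd.DataFrame({
    "name": ["A", "B"],
    "popular_amenities": ["['Wi-Fi', 'Pool']", "['Pool', 'Parking']"],
})
print(encode_amenities(hotels))
```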

2.4. Cleaning text data
We implemented a Cleaner class with different methods for the purpose of cleaning
the review text in the dataset. Each method is discussed below.
remove_usernames: This method is designed to eliminate any potential privacy
concerns or irrelevant information by removing user mentions that could be present in the
review text. These mentions are typically prefixed with an '@' symbol and are not useful for
analyzing the sentiment or content of the review itself.
clean_text: By removing digits and special characters, this method focuses on the
textual content of the reviews. It helps standardize the text for analysis by ensuring that
only alphabetic characters are considered, which is particularly useful for NLP tasks where
numerical values and special characters are often noise.
lower_text: Converting text to lowercase is a fundamental step in text normalization,
reducing the complexity of the text data and ensuring that words are treated the same
regardless of their case (e.g., "Hotel" and "hotel" are recognized as the same word).
remove_empty: This method cleans the dataset by removing any rows where the
review text is missing. Such rows cannot contribute to text-based analysis and could
potentially skew the results if not addressed.
preprocessing: This overarching method applies a sequence of tokenization,
stopwords removal, and lemmatization to prepare the text for further analysis. The goal is
to distill the text down to its most informative elements.
tokenize: Breaking the text into tokens (typically words) is a preparatory step for
many NLP tasks. It allows the application of further processing, such as removing stopwords
or applying lemmatization, on a word-by-word basis.
remove_stopwords: Stopwords are commonly used words that usually have little
lexical content and often don't contribute to the overall meaning of a sentence (e.g., "the",
"is", "and"). Removing them helps to focus on the more meaningful content of the text.
lemmatize: Lemmatization is a more context-aware approach than stemming to reducing
words to their base or dictionary form. It uses morphological analysis and
takes the part of speech of a word into account, which can improve the quality of subsequent
analysis by ensuring that words are not incorrectly shortened.
summarize: Although not directly used in the cleaning process, this method provides
a way to condense longer texts into a more manageable form while retaining the most
informative content. It is particularly useful for very long reviews, which serve as the input
of the transformer-based sentiment classification model we will use later in the EDA.
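
To make the cleaning steps more concrete, the following is a simplified, standard-library-only sketch of such a cleaner. It is not the project's Cleaner class: the class name `SimpleCleaner`, the method `preprocess`, the whitespace tokenizer and the tiny stopword set are all assumptions made for illustration, and lemmatization and summarization are left out because they normally rely on an external NLP library.

```python
import re

# Tiny illustrative stopword set (assumption); real pipelines use a full list.
STOPWORDS = {"the", "is", "and", "a", "an", "was", "were", "to", "of", "in", "it"}


class SimpleCleaner:
    """Simplified stand-in for the text-cleaning steps described above."""

    def remove_usernames(self, text: str) -> str:
        # Drop mentions such as "@john".
        return re.sub(r"@\w+", " ", text)

    def clean_text(self, text: str) -> str:
        # Keep only letters and whitespace; digits and symbols become spaces.
        return re.sub(r"[^A-Za-z\s]", " ", text)

    def lower_text(self, text: str) -> str:
        return text.lower()

    def tokenize(self, text: str) -> list:
        # Whitespace tokenization; a library tokenizer would be more robust.
        return text.split()

    def remove_stopwords(self, tokens: list) -> list:
        return [token for token in tokens if token not in STOPWORDS]

    def preprocess(self, text: str) -> list:
        # Chain the steps: mentions, characters, case, tokens, stopwords.
        text = self.remove_usernames(text)
        text = self.clean_text(text)
        text = self.lower_text(text)
        return self.remove_stopwords(self.tokenize(text))


print(SimpleCleaner().preprocess("@john The Hotel was GREAT, 10/10!"))
```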


3. Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a crucial step in understanding the datasets at
hand. It involves summarizing the main characteristics of the data, often with visual
methods. The objective of EDA is to see what the data can tell us beyond the formal
modeling or hypothesis testing task. This phase is preparatory in nature, intending to
uncover the underlying structure, detect outliers, test assumptions, and generate
hypotheses.

3.1. Univariate analysis
In our endeavor to explore the hotel reviews dataset, we began with univariate
analysis, which focuses on individual variables. This approach is particularly useful when we
aim to summarize and find patterns in the data without considering interactions between
variables.
We initiated our analysis by examining the 'rating' variable. Ratings are the crux of
customer feedback and serve as a direct indicator of customer satisfaction.

Distribution of ratings:


This chart illustrates the frequency of ratings given on a 1 to 5 scale. Observing the
distribution, it is evident that the rating of 5 is the most frequent, suggesting a high level of
satisfaction among the reviewers. Lower ratings are much less common than ratings of 4
and 5, and a rating of 2 is the least frequent. This pattern indicates a skew towards higher
ratings and reflects a positive reception of the hotels being evaluated. It suggests
that the hotels whose reviews we collected generally provide good service and that their
guests are satisfied.


Trip companions:

In this bar chart, we can see the frequency of trips taken with different companions. It is
apparent that the majority of the data falls under 'unknown', indicating that for a large
number of trips, the companion type is not specified. Among the known categories, trips
taken by couples are the most frequent, followed by those taken with family, suggesting
that the hotels are particularly popular with these groups. Trips taken with
friends and solo trips are less frequent, with solo trips being the least common of the
identified categories.
From our perspective, we might infer that the hotels appeal more to couples
and families, potentially due to amenities, activities, or an atmosphere that resonate well
with these groups. Additionally, the lower frequency of trips with friends and of solo trips
is an area for further investigation, to see if there are opportunities to enhance the
appeal for these segments.

3.2. Bivariate analysis
Review count over time

The line graph shows the trend of reviews from 2012 to 2024. There is a consistently low level
of reviews from 2012 until an initial increase starts in 2019. However, the most striking
feature of the graph is the dramatic spike in the number of reviews in 2023, where the
count soars to a peak far exceeding any previous year. This could potentially be attributed to
a special event, a promotional campaign, or a sudden surge in popularity of the hotels
being reviewed.
For 2024, the data shows a very low number of reviews, but since the data was collected at the
beginning of the year, this is not unexpected. It is too early to draw any conclusions for
2024, as the data is incomplete and will accumulate over the course of the year. The
trend for 2024 will become clearer as more data is collected with each passing month.
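
The yearly review counts behind such a line graph can be computed as sketched below. The helper name `reviews_per_year` is hypothetical and the sample dates are made up; only the `review_timestamp` column name comes from the review schema.

```python
import pandas as pd


def reviews_per_year(df: pd.DataFrame, column: str = "review_timestamp") -> pd.Series:
    """Count reviews per calendar year, ignoring unparseable timestamps."""
    timestamps = pd.to_datetime(df[column], errors="coerce").dropna()
    return timestamps.dt.year.value_counts().sort_index()


# Made-up sample dates, only to show the call.
sample = pd.DataFrame({
    "review_timestamp": ["2019-03-01", "2023-06-10", "2023-07-11", "2024-01-02"],
})
print(reviews_per_year(sample))
```

Calling `.plot()` on the returned series (with matplotlib installed) would draw the corresponding line chart.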


The violin plot indicates that across the different trip companions (family, couples, friends, and
solo travelers), the ratings tend to lean towards the higher end of the scale, with most
distributions concentrated towards the top of the 5-point scale. This suggests that regardless of
the travel companion, the overall experience is perceived positively, indicating that the hotels
being evaluated are likely providing a satisfactory experience to their guests. While there
is some variation in satisfaction, particularly with family and solo travelers, the general trend
towards higher ratings is a good sign for the hotels, as it implies that guests tend to leave
happy with their experience. The consistency in higher ratings for couple trips could suggest
that the hotels' offerings are particularly well-suited for couples, while the wider spread in
ratings for family and solo trips may point to a diversity in expectations and experiences
within these groups.

3.3. Multivariate analysis


The first heatmap presents a matrix of correlation values between various
categorical variables related to travel data. In this matrix, each cell's color intensity and
corresponding numerical value represent the strength and direction of the correlation
between the variables: a value of 1 indicates a perfect positive correlation, while 0 indicates
no correlation. Strong positive correlations are visible between certain amenities and trip
types or companions, such as type_Business with amenity_Wi-Fi, and companions_Solo with
amenity_Parking. These relationships suggest that business travelers are likely to require
Wi-Fi, and solo travelers may have a preference or need for parking facilities. However,
many cells display a value of 0, indicating no correlation, which suggests that for many
combinations of these categorical variables, there is no apparent relationship. The matrix is
useful for identifying potential patterns and relationships in the data that could inform
targeted marketing strategies or service improvements.
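
A correlation matrix of this kind can be computed from the one-hot encoded columns as sketched below. The report does not state which correlation measure was used, so this sketch assumes Pandas' default (Pearson); the helper name `indicator_correlations` and the sample values are hypothetical.

```python
import pandas as pd


def indicator_correlations(df: pd.DataFrame,
                           prefixes: tuple = ("type_", "companions_", "amenity_")) -> pd.DataFrame:
    """Pearson correlation matrix of the one-hot indicator columns."""
    columns = [c for c in df.columns if c.startswith(prefixes)]
    return df[columns].astype(float).corr()


# Made-up indicator values, only to show the call.
sample = pd.DataFrame({
    "type_Business":   [1, 0, 1, 0],
    "amenity_Wi-Fi":   [1, 0, 1, 0],
    "companions_Solo": [0, 1, 0, 1],
    "rating":          [5.0, 4.0, 3.0, 5.0],
})
print(indicator_correlations(sample))
```

Note that Pearson correlation ranges from -1 to 1, so negative values (indicators that rarely occur together) are possible as well. The resulting matrix can be passed to a plotting function such as seaborn's `heatmap` to draw the figure.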



3.4. Text analysis
3.4.1. Text length
To determine how much time users invest in writing individual reviews, we examined
the length of the reviews in detail. Our results show that the majority of users use around
50 words/tokens and around 250 characters to describe their experience with the hotel.
While there are also reviews with a length of 400 words/tokens and 2000 characters, such
extensive reviews are relatively rare. The figure also compares the length of the cleaned
text with that of the original.
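
As a small illustration, the two length measures used here can be computed as follows. The helper name `text_lengths` and the whitespace-based word count are assumptions; the report's token counts may come from a different tokenizer.

```python
from statistics import median


def text_lengths(texts):
    """Return (word_count, char_count) for every review text."""
    return [(len(text.split()), len(text)) for text in texts]


# Made-up sample reviews, only to show the call.
reviews = [
    "Great hotel near the station",
    "Clean room, friendly staff and a very good breakfast",
]
lengths = text_lengths(reviews)
print(lengths)
print("median words:", median(words for words, _ in lengths))
```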

3.4.2. Common words


To discern the words most frequently used by reviewers, we examined the most common
words in the review texts. These words carry high information value regarding the customer's
experience and highlight crucial aspects of service. These key terms reveal the prevalent
themes and sentiments expressed by users and provide valuable insight into the aspects that
are most important to customers during their stay. The graph shows that aspects such as
rooms, staff and breakfast stand out as decisive evaluation criteria.
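
A minimal sketch of such a word-frequency count, using only the standard library, is shown below. The helper name `top_words` and the sample tokens are hypothetical; the input is assumed to be the already cleaned and tokenized reviews.

```python
from collections import Counter


def top_words(token_lists, n=10):
    """Return the n most frequent tokens across all tokenized reviews."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    return counter.most_common(n)


# Made-up tokenized reviews, only to show the call.
tokenized = [["room", "clean"], ["room", "staff"], ["staff", "room"]]
print(top_words(tokenized, n=2))
```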

3.4.3. Sentiment analysis
We use existing libraries to score and classify the sentiment of review texts. Specifically,
two different approaches are implemented: lexicon-based and deep learning-based.
The scatter plot shows the relationship between the lexicon-based sentiment scores
and the user ratings, plotted on the X and Y axes respectively. The sentiment
values range from -1 to 1, with numerous data points at each level of sentiment,
which suggests a good variety in the sentiment of the reviews. The ratings are discrete
and range from 1 to 5.



At a glance, there does not appear to be a strong visible correlation between the
lexicon-based sentiment scores and the ratings. We can see a wide spread of sentiment
values for each rating level. For instance, ratings of 5 have sentiment scores ranging from
very negative to very positive.
It's interesting to note that there are many reviews with a sentiment score around 0
across all rating levels. This might suggest that, for these reviews, the lexicon-based
method finds a balance between positive and negative terms.
There are several cases where reviews with negative sentiment scores have high
ratings and vice versa. These could be outliers or instances where the lexicon-based method
does not align well with the actual sentiment expressed in the review text. The variance in
sentiment scores for similar ratings may arise because reviews can contain
sarcasm, mixed sentiments, or complex expressions that a simple lexicon-based method
cannot accurately interpret.
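
To illustrate how a lexicon-based scorer works and why it can disagree with the rating, consider the toy sketch below. The lexicon, its weights and the function name `lexicon_score` are invented for illustration and are not those of the library used in this project.

```python
# Toy sentiment lexicon (assumption): word -> polarity in [-1, 1].
LEXICON = {
    "great": 1.0, "good": 0.5, "clean": 0.5, "friendly": 0.5,
    "bad": -0.5, "dirty": -0.5, "noisy": -0.5, "terrible": -1.0,
}


def lexicon_score(tokens, lexicon=LEXICON):
    """Average polarity of the tokens found in the lexicon (0.0 if none)."""
    hits = [lexicon[token] for token in tokens if token in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


print(lexicon_score(["great", "clean", "room"]))           # clearly positive
print(lexicon_score(["great", "location", "terrible", "bed"]))  # mixed review
print(lexicon_score(["room", "breakfast", "station"]))     # no lexicon words
```

In this toy scorer, a mixed review and a review without any lexicon words both receive a score of 0, regardless of the rating the guest actually gave.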
To fully evaluate the effectiveness of the lexicon-based sentiment analysis, it is
useful to compare these results with the deep learning-based sentiment analysis. The
figures below show the results of the deep learning-based sentiment analysis
compared to user ratings, and the correlation between these sentiments and the ratings.
Let's discuss each figure in detail.
The first set of bar charts shows the count of sentiments (negative, neutral, positive)
for each rating level, for both raw and cleaned review texts.


