Skip to content

Latest commit

 

History

History
34 lines (26 loc) · 1.22 KB

File metadata and controls

34 lines (26 loc) · 1.22 KB

Investigating Spell-Checking Usage in Data Pre-Processing Phase for Topic Modeling on TripAdvisor Hotel Reviews

A project developed as part of my master's at Linköping University

Online customer reviews are very insightful and have a direct impact on revenue for hotel owners.

However, since these reviews are written by thousands of users, there’s a possibility that some reviews present grammar mistakes, resulting in a loss of data quality as a drawback. With that in mind, this research focused on analyzing whether spellchecking applied in the pre-processing phase of data, could be impacting topic modeling analysis on hotel reviews.

Topic modeling was explored using BERTopic as the main algorithm and LDA as a baseline. As expected, BERTopic’s generated topics revealed a better efficiency than LDA regarding their coherence and overall quality.

Surprisingly, the research revealed that spell-checking usage not only doesn’t seem to bring more value to the analysis but also disclosed some negative aspects (words that were incorrectly tagged as misspelled) at a cost of higher computational demand.

Tripadvisor-Logo