DATA MINING CAPSTONE TASK 1
10TH JANUARY, 2018

ABSTRACT
In this task I opted to use Python and the toolkits given to attain results for the task.
It proved highly useful for obtaining an overview of the topics discussed in the reviews. The specific packages I opted to use for the topic extraction process are gensim and sklearn.

IMPLEMENTATION

TASK 1.1 TOPIC MINING OF ALL RESTAURANT REVIEWS
In order to come to terms with what the reviewers were talking about, I chose the LDA topic model to extract 10 topics from all the reviews in the data that were for restaurants.

To vectorize the review data I applied TfidfVectorizer. The transformation used linear term frequencies with IDF reweighting, and I set the n-gram range to 1 to 2, so the collected terms consist of either one or two words.

To visualize the data, I used D3 to produce the drawings. To represent the topic models effectively, I chose a word cloud visualization in which the font size represents the significance of each term within a given topic model.

Some observations from the data are as follows:
· Topics 6, 7 and 9 place strong emphasis on a specific cuisine or food, such as pizza or Chinese food. At the same time, topics 1, 4 and 9 talk about foods or drinks such as fish chips, chicken and “food drinks”.
· The comments towards the restaurants are mostly good, as indicated by topics 0 and 2.
· Topic 3 shows that time is a very important topic when customers review a restaurant.

Figure: word clouds for topics 0 to 9, a graphical representation of the topics mined from the raw restaurant review data.

TASK 1.2 TOPIC MINING OF POSITIVE AND NEGATIVE REVIEWS
In this task I explored the topic distribution for subsets of the reviews. Specifically, the task required observations on the subset of positive reviews and the subset of negative reviews. For the positive subset I used reviews with a star rating >= 4, while for the negative subset I used reviews with a star rating <= 2. I again used LDA as the topic model, with the same configuration as in the previous task.
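The subset selection and the one-or-two-word TF-IDF vectorization described above can be sketched without any dependencies. This is only an illustration under stated assumptions, not the code used for the report: the sample records, the field names `stars` and `text`, and the helper names `split_by_stars`, `ngrams` and `tfidf` are all hypothetical, and the weighting formula (smoothed IDF with L2 normalisation) is assumed to mirror the TfidfVectorizer defaults.

```python
import math
import re
from collections import Counter

# Hypothetical sample records; the real Yelp reviews are not reproduced here.
reviews = [
    {"stars": 5, "text": "Great place, amazing pizza"},
    {"stars": 4, "text": "Amazing sushi and great service"},
    {"stars": 1, "text": "Small portion size and limited menu"},
    {"stars": 2, "text": "Limited menu, slow service"},
    {"stars": 3, "text": "Average food"},
]

def split_by_stars(records, pos_min=4, neg_max=2):
    """Return (positive, negative) review texts; 3-star reviews are dropped."""
    positive = [r["text"] for r in records if r["stars"] >= pos_min]
    negative = [r["text"] for r in records if r["stars"] <= neg_max]
    return positive, negative

def ngrams(text, n_min=1, n_max=2):
    """Lowercase, tokenize, and emit every n-gram with n_min <= n <= n_max."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised.

    Assumed weighting: tf * (ln((1 + N) / (1 + df)) + 1), i.e. linear term
    frequency with smoothed IDF.
    """
    counts = [Counter(ngrams(d)) for d in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    rows = []
    for c in counts:
        row = {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({t: w / norm for t, w in row.items()})
    return rows

positive, negative = split_by_stars(reviews)
negative_weights = tfidf(negative)
```

In this sketch a two-word term such as "portion size", which occurs in only one negative review, receives a higher weight than "limited menu", which occurs in both. The resulting weights would then be passed to the LDA model with 10 topics, once for each subset.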
Some observations from the results are as follows:
· Regardless of whether the reviews are positive or negative, the main focus is on the foods or cuisines that the reviewers frequently revisit.
· In the positive reviews the top topics include Indian food, pizza, sushi and chicken, while in the negative reviews the top topics include, but are not limited to, pizza, hot dogs, sushi and tacos.
· The created subsets offer different information from each other. In the positive reviews there is little that directly rates the services offered. There is an abundance of general phrases such as “great place” and “amazing”, which carry no impression of negativity but cannot give a proper rating. The negative reviews, by contrast, give a better description despite the general phrases they also use, because the reviewers give a clear and specific review of a given topic. For example, “portion size” indicates that the food portion was not satisfactory, and “limited menu” indicates that the reviewer lacked enough options on the menu to decide what to eat or drink.

The diagrams below show the positive and negative reviews. I again chose the word cloud representation.

Figure: word clouds for topics 0 to 9, showing the ten topics mined from the positive restaurant reviews.
Figure: word clouds for topics 0 to 9, showing the ten topics extracted from the negative restaurant reviews.

CONCLUSION
In conclusion, ten topics can be extracted from the raw restaurant data, and after this extraction the data can be divided into subsets for further topic extraction. These subsets, in this case the positive and negative reviews, make it possible to collect further information from the data. I hope this report has given a detailed explanation of my findings from the Yelp data.