ABSTRACT

There is a huge amount of unprocessed text data in the world, and text is one of the main sources of information. A human can easily read and understand a piece of text, identify its most important words (topics), and explain in summarized form what it means. A machine cannot do this on its own, but applying a few basic rules and letting the machine learn from text data with experience makes it possible. Topic modelling and text summarization are two methods of achieving this.

The analytics industry is all about obtaining “information” from data. With the growing amount of data in recent years, most of it unstructured, it is difficult to obtain the relevant and desired information. However, technology has produced some powerful methods that can be used to mine the data and fetch the information we are looking for. One such technique in the field of text mining is topic modelling, which provides a convenient way to analyze large volumes of unclassified text.
A topic is a cluster of words that frequently occur together. Topic modelling can connect words with similar meanings and distinguish between uses of words with multiple meanings. We want up-to-date information in order to take suitable action, yet the amount of information keeps growing. There are many categories of information (economy, sports, health, technology…) and many sources (news sites, blogs, SNS…). An automatic and accurate summarization feature will therefore help us understand the topics and shorten the time needed to do so.
Text summarization is the task of extracting salient information from the original text document. In this process, the extracted information is condensed into a report and presented to the user as a concise summary. It is very difficult for humans to read and interpret the content of large volumes of text manually.

1. INTRODUCTION

1.1 INTRODUCTION

The size of text data is increasing at an exponential rate day by day. Almost all types of institutions, organizations, and business industries store their data electronically. A huge amount of text flows over the internet in the form of digital libraries, repositories, and other textual information such as blogs, social media networks, and e-mails. It is a challenging task to determine appropriate patterns and trends in order to extract valuable knowledge from this large volume of data. A human can easily read and understand a piece of text, identify its most important words (topics), and explain in summarized form what it means. A machine cannot do this on its own, but applying a few basic rules and letting the machine learn from text data with experience makes it possible.
Topic modelling and text summarization techniques can be used to extract useful topics from the data and to summarize the text. Systems have been developed specifically for either topic modelling or text summarization. Topic modelling can usually pick out the most important topics in any sort of text data, irrespective of its domain. Existing systems typically use techniques such as NMF and LDA and form as many clusters of topics as the user specifies. However, these systems fall short on review text. Review text is written by users and is often not grammatically correct, so if we use the review data directly, without any preprocessing, topic modelling and text summarization techniques produce irrelevant results. To solve this problem we need a preprocessing step in which the text data is cleaned. We then run different text summarization modules on the same text data.
Then we rank these summarization algorithms based on the topics picked out by the topic modelling module. This makes the system more effective and prevents the text summarization module from missing any important topic or information.

1.2 OVERVIEW

Architecture: [figure]

Module Description:

Preprocessing Text Data: The preprocessing step involves removing stop words (there, for, to, and, etc.), then stemming the words (studying, study, studied), then creating bigrams (new + york = newyork), and so on.

Keyword Extraction and Topic Modelling: After going through the text data we can derive a few insights, which can be used to write data patterns such as regular expressions for extracting keywords. Keywords are not the same as topics. To extract topics we apply the LDA (Latent Dirichlet Allocation) algorithm and specify the number of topics, i.e., clusters.

Text Summarization: This is the process of condensing the text in a meaningful way. During text summarization we might lose some important data. To cross-verify, we use the most important topics from the topic model. After the verification, the final summarized text is saved.
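The preprocessing and cross-verification steps described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the stop-word list, the bigram table, the crude suffix-stripping `stem` function (a stand-in for a real stemmer), and the helper names `preprocess`, `topic_coverage`, and `rank_summaries` are all assumptions introduced here. Coverage is measured simply as the fraction of topic words that survive in a candidate summary.

```python
import re

# Illustrative lists only; a real system would use much larger resources.
STOP_WORDS = {"there", "for", "to", "and", "the", "a", "is", "of", "had",
              "in", "was", "were"}
# Bigram table keyed on already-stemmed tokens (stemming runs first).
BIGRAMS = {("new", "york"): "new_york"}
SUFFIXES = ("ying", "ied", "ies", "ing", "ed", "es", "y", "s")


def stem(word):
    """Crude suffix stripping; assumed stand-in for a proper stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def merge_bigrams(tokens):
    """Join adjacent tokens that form a known bigram into one token."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in BIGRAMS:
            merged.append(BIGRAMS[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def preprocess(text):
    """Lowercase, drop stop words, stem, then merge bigrams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [stem(t) for t in tokens if t not in STOP_WORDS]
    return merge_bigrams(tokens)


def topic_coverage(summary, topic_words):
    """Fraction of (preprocessed) topic words present in the summary."""
    summary_tokens = set(preprocess(summary))
    topics = {tok for word in topic_words for tok in preprocess(word)}
    if not topics:
        return 0.0
    return len(topics & summary_tokens) / len(topics)


def rank_summaries(candidates, topic_words):
    """Order candidate summaries (name -> text) by coverage, best first."""
    return sorted(candidates,
                  key=lambda name: topic_coverage(candidates[name], topic_words),
                  reverse=True)
```

With this sketch, a summary that drops a topic word scores lower and is ranked below one that keeps it, which is the cross-verification idea in its simplest form.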
1.3 CHALLENGES

• Finding patterns in data.
• Processing a huge amount of data.
• Providing computational power.
• Review text data with grammatical mistakes.
• Missing a few topics from the data.
1.4 PROJECT STATEMENT

The opportunity cost to any business of ignoring unstructured data is enormous in today's fiercely competitive world. According to an IDC survey, unstructured data takes the lion's share of the digital space, occupying approximately 80% by volume compared to only 20% for structured data. While unstructured data is available in abundance, software products and solutions that can accurately analyze the text and present insights in an understandable manner are rare. Topic modelling and text summarization are techniques applied to text data to extract meaningful insights and knowledge from it. Because the data we provide does not contain labels, it needs to be handled with an unsupervised approach. The LDA topic modelling technique can be used on unlabelled data to extract important topics in the form of clusters.
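To illustrate how LDA can find topic clusters without labels, here is a small pure-Python sketch of collapsed Gibbs sampling, one common way of fitting LDA. It is not claimed to be the fitting method used in this project; the function names `lda_gibbs` and `top_words` and the hyperparameter defaults (`alpha`, `beta`, `iterations`) are assumptions chosen for illustration. Input documents are assumed to be already tokenised and preprocessed.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenised documents.

    Returns (doc_topic_counts, topic_word_counts, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][n]
                # Remove this token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Weight of topic k is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n most frequent words of each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]
```

The number of topics is supplied by the caller, matching the idea that the user specifies how many clusters to form. In practice a library implementation would normally be preferred over a hand-written sampler like this one.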
Since summarization does not require any labels, it also works with this data.

1.5 OBJECTIVES

The main objectives are:

• To create a better way of extracting useful information and insights from text data.
• To clean the data and make it more meaningful using certain techniques (bigrams, lemmatization, stop-word removal, etc.).
• To summarize thousands upon thousands of lines of text data into meaningful, crisp data.
• To present the most important information inferred from the data (topics, keywords, etc.).

1.6 SCOPE OF THE PROJECT

• It is an application that uses machine learning techniques to present a given huge amount of text data as the most important topics discussed, and condenses the whole text into an easily readable form.
• This application can be trained and run on any operating system.
• The LDA technique used for topic modelling can reduce the complexity of finding the topics.
• This application can be used not only for finding topics in data but also in recommendation engines, where it can help recommend different articles to a user.
• The most important aspect of this application is that it can be used on text data from any domain (sports, business, medicine, etc.) without any previous training on that domain's data. This saves a lot of time that would otherwise be spent labelling the data.

2. BACKGROUND

2.1 INTRODUCTION

It's no secret that the world has seen an explosion of information in the recent past, an explosion that experts predict will continue as the billions of people who use online resources expand their usage and as Internet penetration increases.
Further, the newly transformed participative Web allows users to become co-creators of content rather than merely consumers. Text constitutes the largest part of Web content. While text documents and the traditional science of Information Retrieval have existed for a long time, the storage of text in electronic form and the resulting ease of dissemination and sharing over the Internet have changed the scenario. We are now witnessing large volumes of text stored in electronic form, along with new ways of exploiting them to obtain useful information and inferences.

2.2 LITERATURE SURVEY

Title: “Preprocessing Techniques for Text Mining – An Overview”
Authors: Dr. S. Vijayarani, Ms. J. Ilamathi, Ms. Nithya
Year: 2015
Description: In this research paper, the authors explain how important preprocessing is for text data before models are built on it.
They describe a few techniques that can be applied to text data to make it more crisp and meaningful, including stop-word removal, stemming, and bigrams. Stop-word removal discards words that carry little meaning and add no value (is, of, had, for, etc.). Stemming is similar to normalization: each word is reduced to its root (study, studying, studied, etc.). Bigrams are pairs of words that make sense only when they appear together.

Title: “Topic Modeling with Document Relative Similarities”
Authors: Jianguang Du, Jing Jiang, Dandan Song, Lejian Liao
Year: 2015
Description: This paper explains that topic models such as Latent Dirichlet Allocation (LDA) are successful at learning hidden topics but do not take document metadata into account.
To tackle this problem, many augmented topic models have been proposed to jointly model text and metadata, but most existing models handle only categorical and numerical types of metadata. The authors identify another type of metadata that can be more natural to obtain in some scenarios: relative similarities among documents. They propose a general model that links LDA with constraints derived from document relative similarities. Specifically, in this model the constraints act as a regularizer of the log likelihood of LDA. They fit the proposed model using Gibbs-EM. Experiments with two real-world datasets show that their model is able to learn meaningful topics. The results also show that their model outperforms the baselines in terms of topic coherence and on a document classification task.

Title: “A Text Mining Research Based on LDA Topic Modelling”
Authors: Zhou Tong, Haiyi Zhang
Year: 2016
Description: In this paper, the authors first present an introduction to text mining and to the probabilistic topic model Latent Dirichlet Allocation. Two experiments are then proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-based solution for searching, exploring, and recommending articles. The latter sets up a user topic model, providing a full study and analysis of Twitter users' interests.
Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.

Title: “Automatic Text Summarization”
Authors: Aarti Patil, Komal Pharande, Dipali Nale, Roshani Agrawal
Year: 2015
Description: This paper explains how text summarization is done. It can be of two types: extractive and abstractive. An abstractive summary relies on Natural Language Processing (NLP) to generate new text, whereas an extractive summary copies exact sentences from the source document.

Ranking of Text Units According to Shallow Linguistic Features: This approach recognizes the most prominent text or sentences using various shallow linguistic features, taking the degree of connectedness among the text units into consideration so that poorly linked sentences are minimized in the resulting summary. The method highlights the effect of lexical chain scoring: nouns and compound nouns are chained by searching for lexically organized relationships between words in the text, using WordNet and lexicographical relationships such as synonyms and hyponyms. All sentences are then ranked on the basis of the sum of the scores of the words in each sentence in order to extract a summary. Word scores are determined using features such as term frequencies, cue words and phrases, and measures of lexical resemblance (chain score, word score, and finally sentence score).

Title: “Study of Abstractive Text Summarization Techniques”
Authors: Sabina Yeasmin, Priyanka Basak Tumpa, Adiba Mahjabin Nitu, Md. Palash Uddin, Emran Ali, and Masud Ibn Afjal
Year: 2017
Description: This paper explains how an extractive summarizer finds the most relevant sentences in a document while avoiding redundant data. Producing a summary this way is easier than with an abstractive summarizer. Examples of extractive summarization techniques include the Term Frequency-Inverse Document Frequency (TF-IDF) method, cluster-based methods, graph-theoretic approaches, and machine learning approaches.
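As a rough illustration of the TF-IDF method mentioned above, the following sketch scores each sentence by the length-normalised sum of its words' TF-IDF weights, treating every sentence as its own "document", and keeps the top-scoring sentences in their original order. The function names `split_sentences` and `tfidf_summary`, the simple regex-based sentence splitter, and the normalisation choice are assumptions made for this example; the sketch also does not attempt the redundancy avoidance that the surveyed paper describes.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf_summary(text, num_sentences=2):
    """Return the num_sentences highest-scoring sentences, in original order."""
    sentences = split_sentences(text)
    tokenised = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(sentences)

    # Document frequency: in how many sentences does each word occur?
    df = Counter(w for tokens in tokenised for w in set(tokens))

    scores = []
    for tokens in tokenised:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        # Term frequency is normalised by sentence length so that long
        # sentences are not favoured just for being long.
        score = sum((count / len(tokens)) * math.log(n / df[w])
                    for w, count in tf.items())
        scores.append(score)

    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(best)]


review = ("The hotel was good. The hotel staff was rude. "
          "The pool was closed for repairs.")
print(tfidf_summary(review, num_sentences=2))
```

Words that occur in every sentence (here "the" and "was") receive an IDF of zero and so contribute nothing, which is why sentences made up mostly of distinctive words rise to the top.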