In the previous chapter, I described an application of unsupervised semantic analysis to the problem of topic segmentation.
Topic segmentation is an internal system task, not directly observable by the end user. In contrast, in this chapter I present work on semantic analysis in the context of digital libraries. The goal is to directly improve the user experience by enabling similarity browsing in the Czech Digital Mathematics Library (DML-CZ).

6.1 Motivation

Mathematicians from all over the world dream of a World Digital Mathematics Library, where (almost) all reviewed mathematical papers in all languages will be stored, indexed and searchable with today's leading-edge information retrieval machinery.
Good resources towards this goal, in addition to publishers' digital libraries, include the two review services for the mathematical community: both Zentralblatt MATH and Mathematical Reviews have more than 2,000,000 entries (paper metadata and reviews) from more than 2,300 mathematical serials and journals.

6.2 MSC classification

Within the DML-CZ project, I investigated the possibilities of classifying (retro-digitized) mathematical papers by machine learning techniques, in order to enrich math searching capabilities and to allow semantically related search. The text of scanned pages is usually optically recognized, so machine learning algorithms may use the full text in addition to metadata to aid training.

Classification is an example of supervised text analysis: the training algorithm requires labeled examples with Mathematics Subject Classification (MSC) codes assigned, in order to learn a decision function (a "classifier") that will thereafter automatically assign MSC codes to new, unlabelled articles. In this work, I investigated several machine learning algorithms with respect to the quality of the resulting classifier.
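To make the supervised setting concrete, the following is a minimal sketch of learning a decision function from labeled examples and applying it to unseen text. It uses a multinomial Naive Bayes classifier with add-one smoothing purely as a simple stand-in; it is not the Support Vector Machine pipeline evaluated in the experiments. The function names, the toy training documents and their MSC labels are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Collect per-class statistics from (tokens, msc_code) pairs."""
    class_doc_counts = Counter()        # number of training documents per MSC code
    word_counts = defaultdict(Counter)  # word frequencies per MSC code
    vocabulary = set()
    for tokens, msc in labeled_docs:
        class_doc_counts[msc] += 1
        word_counts[msc].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the MSC code with the highest log-posterior for the tokens."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_msc, best_score = None, -math.inf
    for msc in sorted(class_doc_counts):
        # Log prior: fraction of training documents carrying this code.
        score = math.log(class_doc_counts[msc] / total_docs)
        total_words = sum(word_counts[msc].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[msc][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_msc, best_score = msc, score
    return best_msc


# Toy training set (hypothetical): top-level MSC 11 = number theory,
# 35 = partial differential equations.
TRAINING = [
    ("prime divisor congruence modulo integer".split(), "11"),
    ("prime sieve integer sequence".split(), "11"),
    ("elliptic boundary equation solution weak".split(), "35"),
    ("parabolic equation boundary regularity".split(), "35"),
]

model = train_naive_bayes(TRAINING)
print(classify("integer prime gap".split(), model))
print(classify("weak solution of a parabolic equation".split(), model))
```

In a realistic setting the tokens would come from the OCR full text and metadata, and the labels from the MSC codes already assigned by reviewers.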
Detailed evaluation shows that almost all methods easily exceed 90% classification accuracy in assigning the first two characters of the MSC code (the primary MSC). With fine-tuning, the best method (Support Vector Machine with Mutual Information feature selection, atc term weighting and 500–2000 features) increases accuracy to 95% or more.

Since the focus of this thesis is unsupervised learning, I will not go into the details of the MSC classification experiments. Instead, I turn to exploratory data analysis of the DML-CZ library, where the goal is to offer similar articles to an interested user without making use of any metadata.

6.3 Datasets

I used three digital libraries as data sources for the experiments:

1. Metadata and full texts of mathematical journals covered by the DML-CZ project. During the first three years of the project, I digitized and collected the data in a digital library, accessible via a web tool called Metadata Editor.
2. arXMLiv, a project for translating the arXiv library into XML with semantic MathML markup. I used the part of arXMLiv that deals with mathematics, for a total of 20,666 articles.

All 65,289 articles were tokenized and converted into bag-of-words vectors. I also extracted mathematical formulae where possible (arXMLiv and the digital-born parts of DML-CZ), and treated them as additional vector space dimensions. The hope here was that mathematical mark-up, when available, may aid similarity assessment in a similar way to words. The final dimensionality (size of the vocabulary plus mathematical formulae) was 355,025 features. The implicit 65,289 × 355,025 matrix was very sparse (39 million non-zero entries, or a density of 0.167%), which is expected and common in NLP applications.

I did not perform any additional preprocessing steps beyond removing stop words (word types that appear in more than 30% of documents) and word types that appear in fewer than 5 documents (mostly OCR and spelling errors).

6.4 Topic modeling

For each document in our database, I periodically pre-compute the ten most semantically related articles and present them to the user through our web interface. Figure 6.1 illustrates these similarities for one randomly selected article.

A natural question is which method of semantic analysis performs best. For example, statistics show that the mean difference between normalized cosine similarity scores produced by TF-IDF and by LSA is 0.0299, with a standard deviation of 0.037. Inspection reveals that in most cases the scores are indeed very similar, but there are also many pairs of documents for which the two methods vary widely, as the relatively large standard deviation would suggest.

Pairs of highly similar articles are also attractive from a plagiarism detection perspective. Finding actual plagiarism in our dataset is unlikely, due to the highly specialized nature of the domain (mathematics) and its structured and dedicated community, where each article is officially reviewed and classified by experts before publishing (with the review permanently becoming a part of that article, with the reviewer's name attached). Still, out of curiosity, I examined all pairs of highly similar articles to see whether I could find any cases of plagiarism within our union of articles from three different repositories.
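The pipeline described in this section (vocabulary pruning, bag-of-words vectors, a top-k lookup of related articles, and a scan for suspiciously similar pairs) can be sketched in a few lines. This is only an illustration under stated assumptions: the exact TF-IDF variant is not specified in the text, so the sketch assumes raw term frequency times log(N/df) with L2 normalization, and all function names, the 0.9 threshold and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def build_vocabulary(docs, max_doc_fraction=0.3, min_docs=5):
    """Map each kept word type to its document frequency.

    Drops stop words (present in more than max_doc_fraction of documents)
    and rare word types (present in fewer than min_docs documents).
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    return {
        word: df
        for word, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_doc_fraction
    }


def tfidf_vector(tokens, doc_freq, n_docs):
    """Sparse, L2-normalized TF-IDF vector stored as {term: weight}."""
    counts = Counter(t for t in tokens if t in doc_freq)
    vec = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def most_similar(index, vectors, k=10):
    """Return the k documents most similar to vectors[index] as (id, score)."""
    scores = [
        (cosine(vectors[index], vec), j)
        for j, vec in enumerate(vectors)
        if j != index
    ]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [(j, score) for score, j in scores[:k]]


def near_duplicates(vectors, threshold=0.9):
    """All document pairs whose similarity reaches the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(vectors)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


# Toy corpus (invented); thresholds are relaxed because it is so small.
DOCS = [
    "the banach space norm operator".split(),
    "the banach space operator norm".split(),
    "the graph vertex edge colouring".split(),
    "the graph edge path".split(),
]
doc_freq = build_vocabulary(DOCS, max_doc_fraction=0.5, min_docs=1)
vectors = [tfidf_vector(tokens, doc_freq, len(DOCS)) for tokens in DOCS]

print(most_similar(2, vectors, k=2))
print(near_duplicates(vectors))
```

The exhaustive pair scan is quadratic in the number of documents, so at the scale of the full collection it would in practice be replaced by an indexed similarity query; the sketch only shows the logic.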
6.5 Evaluation

Perhaps more interesting than comparing one semantic analysis method to another is to consider how well these similarities translate to semantic relatedness as perceived by us, the human users. Unfortunately, I do not have a tagged reference corpus of pair-wise document similarities to compare our results against. However, thanks to the nature of our dataset, I have access to article metadata.
One piece of metadata present for each of our articles is its position within the MSC classification hierarchy. Although this metadata was not used during similarity computations (all methods used are unsupervised), I can use it for evaluation.

6.6 Do formulae matter?

During article parsing, I extracted mathematical formulae from digital-born documents.
Are these additional features helpful for topic modeling? I used our algorithm from Chapter 4 to compute LDA topics over the 20,666 arXMLiv articles, which contain formulae in MathML notation. The result is shown in the accompanying table, where one can see that some short general formulae are indeed highly indicative of a topic.

As a reminder from Chapters 3 and 4, semantic models like LSA or LDA represent every input document as a soft mixture of topics. The figure illustrating this was produced by using the LDA model trained on arXMLiv, described above, to infer the topic mixtures of four DML-CZ articles.
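Because each article is represented as a soft mixture of topics, two articles can be compared by comparing their mixtures. The text does not name the measure used, so the sketch below assumes the Hellinger distance, one common choice for probability distributions; the function names, article identifiers and topic mixtures are invented for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two topic mixtures.

    Each mixture is a sparse dict {topic_id: probability} summing to 1.
    The result lies in [0, 1]: 0 for identical mixtures, 1 for mixtures
    that share no topic.
    """
    topics = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(t, 0.0)) - math.sqrt(q.get(t, 0.0))) ** 2
        for t in topics
    )
    return math.sqrt(squared / 2.0)


def rank_by_topic_similarity(query, mixtures):
    """Sort article ids by increasing distance from the query mixture."""
    return sorted(mixtures, key=lambda doc_id: (hellinger(query, mixtures[doc_id]), doc_id))


# Hypothetical topic mixtures, as a pre-trained model might infer them.
MIXTURES = {
    "article_a": {0: 0.7, 3: 0.3},
    "article_b": {0: 0.6, 3: 0.3, 5: 0.1},
    "article_c": {7: 0.9, 5: 0.1},
}

query = {0: 0.65, 3: 0.35}
print(rank_by_topic_similarity(query, MIXTURES))
```

Since the mixtures are low-dimensional compared with the bag-of-words vectors, such comparisons are cheap even when performed against every article in the collection.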
Note that the two articles depicted in that figure were not used at all during LDA training; topic inference happens over new, unseen documents using a pre-trained model. Semantic similarity browsing in digital libraries is realized by comparing such topic distributions, as opposed to the more usual keyword search.

6.7 Conclusion

Digital libraries are an ideal test-bed for methods of unsupervised text analysis.
The vast amount of data, coupled with the need to intelligently guide users in their quest for information, calls for an automated approach to data mining.

Results of the supervised analysis show the viability of the machine learning approach to the classification of mathematical papers. I used standard algorithms, such as linear support vector machines and artificial neural networks, with preprocessing tools for feature selection, stemming, term weighting, etc. Automated algorithms of this sort could greatly help with reviewer assignment for new article submissions, to name just one concrete application of our research.

Unsupervised analysis is notoriously harder to evaluate; nevertheless, the results presented in this chapter look promising.
In particular, topics reconstructed by Latent Dirichlet Allocation show good interpretability and semantic coherence. I am offering an evaluation form to our users, but a statistically significant amount of data proves hard to collect. In the future, I plan to conduct controlled human experiments to assess topic quality, following the evaluation framework proposed by Chang et al. An extension of this line of research is also part of the European Digital Mathematics Library (EuDML) project.