ABSTRACT—Most outlier detection algorithms in data mining are designed to find outliers in static databases. Those algorithms are generally inappropriate for detecting outliers in dynamic databases, where data continuously arrives in the form of streams, such as sensor data. An association rule based outlier detection method can be applied to streamed data, where frequent item sets are evaluated internally.

One of the outlier detection approaches used for static databases is the clustering based method, where the K-means clustering algorithm is used internally to discover outliers in various static databases. In this paper, we propose two approaches for outlier detection. One is an association rule based method for dynamic databases, and the other is a pruning based local outlier detection method for static databases, which internally uses K-means clustering. Experiments on various data sets show that the deviant data can be detected effectively with fewer computations.


Keywords—outlier detection, static databases, dynamic databases, association rules, k-means clustering.

1. INTRODUCTION

An outlier is a data point or an object which differs from the other data points or objects within a data set. Outlier detection is one of the important data mining techniques commonly used to detect deviant data present in a data set. Applications of outlier detection include credit card fraud detection, network intrusion detection, stock market analysis, etc., where efficient algorithms are used to solve these problems.

The characteristic feature of streamed data is that it arrives continuously and in a time based order. Hence streams are used in various applications like the feeding of sensor data and the measurement of network traffic. Generally, outlier detection algorithms are applied to static databases, because they scan the whole data set in an iterative manner to find top outliers or global outliers. Detecting outliers in streamed data is difficult for two reasons:

1. The data sets in streamed data are unbounded, which prevents scanning the entire data stream.

2. Streamed data require quick responses to requests, and the processing has to be completed in a limited time; multiple scans over the data stream are not possible.

In static databases, distance-based techniques are used to model the data points and determine outliers. The distance functions used include the Euclidean, Manhattan and Minkowski distances. Among clustering based outlier detection methods, the K-means algorithm is the most popular one; it divides the data set into clusters.

The objects or data points which lie near the centroid of a cluster are assigned to the cluster, and those which lie far away from the centroid are the outliers. While evaluating the degree of outlierness of an object, the objects lying well inside a cluster are identified and pruned from each cluster during outlier detection, which reduces the computational time. In this paper, we propose two approaches for outlier detection, one for streaming data and the other for static databases.
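The distance functions mentioned above form a single family: the Manhattan and Euclidean distances are the Minkowski distance of order p = 1 and p = 2 respectively. A minimal sketch in Python (the function name and signature are our own illustration, not taken from the paper):

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p between two equal-length points.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# The classic 3-4-5 right triangle: Manhattan vs. Euclidean distance.
print(minkowski((0, 0), (3, 4), p=1))  # 7.0
print(minkowski((0, 0), (3, 4), p=2))  # 5.0
```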

In the first approach, an algorithm employing association rules to check streamed data for outliers is proposed. The algorithm defines a bounded set of transactions for a data stream and derives association rules from those transactions. The transactions are then evaluated against those rules to find outliers. For instance, if {P, Q} → {R} is an association rule with a high confidence value of 85%, then a transaction containing P and Q but not R is an outlier, because the item R does not appear in the transaction even though the rule predicts it.

A transaction is said to contain an outlier if it does not contain an item which it is supposed to contain according to a high confidence association rule. In the second approach, a static database is taken, a distance function is applied to its data points, and various clusters are formed. The objects which lie near the centroid of a cluster are pruned before evaluating outliers. Then a distance-based measure, the Local Distance-based Outlier Factor (LDOF), is applied to the remaining objects to decide whether each object is an outlier or not. The top n objects with the highest LDOF values are reported as outliers.
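The rule-based check described above can be made concrete. The following Python sketch assumes the outlier degree is the fraction of the transaction's associative closure that is missing from the transaction, as in Narita and Kitagawa's formulation; the function names are our own illustration:

```python
def associative_closure(t, rules):
    """Closure of transaction t under association rules.

    Each rule is a pair (antecedent, consequent) of item sets. Any rule
    whose antecedent is already contained in the closure contributes its
    consequent, repeated until a fixed point is reached.
    """
    closure = set(t)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            if antecedent <= closure and not consequent <= closure:
                closure |= consequent
                changed = True
    return closure


def outlier_degree(t, rules):
    """Fraction of the closure made up of items missing from t itself."""
    closure = associative_closure(t, rules)
    return len(closure - set(t)) / len(closure)


# The transaction {P, Q} violates the rule {P, Q} -> {R}: its closure is
# {P, Q, R}, one of whose three items is missing from the transaction.
rules = [({"P", "Q"}, {"R"})]
print(round(outlier_degree({"P", "Q"}, rules), 3))  # 0.333
```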

The remaining sections of this paper are organized as follows: Section 2 provides related work on outlier detection. Section 3 describes the association rule based outlier detection method. Section 4 describes the pruning based LDOF outlier detection method. Section 5 provides experimental results.

Section 6 concludes the paper.

2. RELATED WORK

A. Data Models in Streamed Data

Streamed data, when compared with traditional static databases, contain unbounded data sets, and it is difficult to choose a finite number of transactions [1] for mining. In order to resolve this problem, two data models were proposed:

1. The accumulative data model, which takes the transactions from the start up to the current time into consideration. If any new transaction arrives, it is accumulated into the database.

2. The sliding window model, which is used to form a bounded data set of transactions. When mining is performed, only the transactions that are currently in the sliding window are taken into consideration, and the mining results are updated as the window slides.

B. Association Rules and Outliers in Streamed Data

We consider the estDec method [3], which internally uses a sliding window model and finds frequent item sets within the current window. Initially, the estDec method examines the transactions that are currently in the sliding window and builds a prefix-tree P from the frequent item sets. Once the window slides to the next bucket, the estDec method finds the new frequent item sets, inserts item sets into P or deletes them from it, and the process continues iteratively. To obtain association rules, the prefix-tree is traversed and subsets of the frequent item sets are obtained.

A traversal stack is maintained to derive association rules; it stores a frequent item set from the prefix-tree [1], and all possible association rules are derived from that item set. A transaction [1] is said to contain an outlier if some items were supposed to be present in the transaction but are actually absent. To evaluate [1] whether a transaction has an outlier, an outlier degree is defined.
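As a simplified stand-in for estDec, which maintains the prefix-tree incrementally, the following Python sketch simply re-counts item sets over the current window after each arrival; the function name, parameters and data are illustrative only:

```python
from collections import Counter, deque
from itertools import combinations


def window_frequent_itemsets(stream, window_size, min_support):
    """Frequent item sets within a sliding window over a transaction stream.

    Unlike estDec, the window contents are naively re-counted after every
    slide. Yields (transaction_index, frequent_itemsets) per arrival.
    """
    window = deque(maxlen=window_size)
    for i, transaction in enumerate(stream):
        window.append(transaction)
        counts = Counter()
        for t in window:
            for size in range(1, len(t) + 1):
                for itemset in combinations(sorted(t), size):
                    counts[itemset] += 1
        threshold = min_support * len(window)
        yield i, {s for s, c in counts.items() if c >= threshold}


stream = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
for i, frequent in window_frequent_itemsets(stream, window_size=2, min_support=1.0):
    print(i, sorted(frequent))
```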

C. Distance-Based Outlier Detection Techniques

Knorr and Ng [4] introduced a definition for distance-based outlier detection techniques [2]: an object x in a data set D is a DB(y, dist)-outlier if at least a fraction y of the objects in D lie at a greater distance than dist from x [2].

Angiulli and Pizzuti [5-7] proposed an outlier detection method based on the concept of an object's neighbourhood, where outliers are determined by ranking each object on the sum of the distances from its k nearest neighbors.

Breunig, Kriegel, Ng and Sander [8] proposed a new measure, the Local Outlier Factor (LOF) of each object, which gives its degree of outlierness. To evaluate the LOF of an object, a restricted neighbourhood of it is considered, and the density of the object is compared with that of its neighbors.
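Knorr and Ng's definition translates directly into code. A brute-force sketch, quadratic in the data size and purely illustrative:

```python
import math


def is_db_outlier(x, data, y, dist):
    """DB(y, dist)-outlier test of Knorr and Ng: x is an outlier if at
    least a fraction y of the other objects lie farther than dist away."""
    others = [o for o in data if o != x]
    far = sum(1 for o in others if math.dist(o, x) > dist)
    return far >= y * len(others)


# Four points in a tight cluster and one far away; only the far point
# has at least 90% of the other objects beyond distance 5.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print([is_db_outlier(p, data, y=0.9, dist=5.0) for p in data])
# [False, False, False, False, True]
```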

Zhang, Hutter and Jin [9] proposed a local distance-based outlier detection method for finding outlier objects. For every object, the local distance-based outlier factor [2] (LDOF) evaluates the degree to which it deviates from its nearest objects. Calculating the LDOF for all the objects is expensive, i.e., O(N²), where N is the number of objects in the data set.

Rough set theory was first proposed by Z. Pawlak and is used to deal with uncertain, vague and incomplete data. The main idea is based on the indiscernibility relation, which describes the indistinguishability of objects. If an object cannot be represented as a crisp set, it can be represented by a rough set, an approximate representation of knowledge. A rough set gives the lower and upper approximations of a set whose knowledge is incompletely specified.
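The lower and upper approximations can be illustrated with a few lines of Python, where the indiscernibility relation is given as a partition of the universe into equivalence classes (the example data is our own):

```python
def rough_approximations(partition, target):
    """Lower/upper approximation of a target set under an indiscernibility
    partition: the lower approximation is the union of classes fully
    contained in the target, the upper the union of classes intersecting it."""
    target = set(target)
    lower, upper = set(), set()
    for block in partition:
        block = set(block)
        if block <= target:
            lower |= block
        if block & target:
            upper |= block
    return lower, upper


# Objects {1..6} grouped into indiscernibility classes; the target set
# {1, 2, 3} cannot be expressed exactly as a union of classes.
partition = [{1, 2}, {3, 4}, {5, 6}]
lower, upper = rough_approximations(partition, {1, 2, 3})
print(sorted(lower), sorted(upper))  # [1, 2] [1, 2, 3, 4]
```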

3. ASSOCIATION RULE BASED OUTLIER DETECTION METHOD

A. Overview

Based on the work in [10], a sliding window is taken to make the data stream bounded. After considering the transactions within the window, a prefix-tree is built, and association rules are then derived from the tree by traversing it iteratively in multiple scans. If a transaction's items are not covered by the association rules list, that transaction may be an outlier.

The following figure shows a sample traversal stack containing items and their counts, with the arrow indicating the top of the stack; each entry pairs an item with the count of the item set formed by it and all entries below it:

item      count
i_n       C(i_1, i_2, ..., i_n)      <- top of stack
i_(n-1)   C(i_1, i_2, ..., i_(n-1))
...
i_2       C(i_1, i_2)
i_1       C(i_1)

Figure 1. A sample traversal stack

The following definitions regarding the outlier degree are given by Narita and Kitagawa in [10]:

Definition 1.

Let t be a transaction and R be a set of association rules. Then t's associative closure t+ is the smallest item set containing t such that, for every rule X → Y in R with X ⊆ t+, the consequent Y is also contained in t+. (1)

Definition 2. Let t be a transaction, R a set of association rules and t+ the associative closure of t. Then the outlier degree of t is given as:

OutlierDegree(t) = |t+ − t| / |t+| (2)

The outlier degree lies between 0 and 1.

Definition 3. A transaction is said to be an outlier transaction if its outlier degree is greater than or equal to the minimum outlier degree, which is a threshold.

The following steps detect outliers in transaction data sets:

1. Build a prefix-tree that keeps track of the frequent item sets within a sliding window.

2. Derive a set of association rules with high confidence values from the prefix-tree.

3. Evaluate outlier transactions from the transaction set within the sliding window.

4. Divide the outlier transactions into two parts, one containing unobserved frequent item sets and the other containing infrequent items.

B. Example

Let us consider a transaction data set [1] consisting of 10 transactions in a sliding window, each consisting of the items Book, Joke, Milk, Bacon, Chocos and Egg.

The following table gives the 10 transactions and the items in each transaction [1]:

Table 1. A Sample Transaction Data Set

TID  Items
1    Book, Joke, Milk
2    Book, Chocos, Joke, Milk
3    Book, Joke, Milk
4    Bacon, Book, Chocos, Egg, Milk
5    Bacon, Book, Chocos, Egg, Joke, Milk
6    Book, Chocos, Joke, Milk
7    Bacon, Book, Egg, Milk
8    Bacon, Book, Egg, Joke, Milk
9    Book, Joke, Milk
10   Bacon, Egg, Milk

For deriving association rules from Table 1, the minimum confidence and minimum support are set to 80% and 50% respectively. The association rules derived with these values are given in Table 2, where each rule has a high confidence value.

Table 2. Association Rules Derived from Table 1

RID  Rule
1    {Joke} → {Book}
2    {Joke, Milk} → {Book}
3    {Joke} → {Book, Milk}
4    {Bacon} → {Egg}
5    {Bacon, Milk} → {Egg}
6    {Bacon} → {Egg, Milk}
7    {Milk} → {Book}

From (1), the associative closure t+ for transaction 2 can be computed.

According to Definition 3 [1], if the minimum outlier degree is set to 0.3, then transaction 2 will be an outlier. Similarly, transaction 10 will also be an outlier, since it does not satisfy all of the high confidence association rules given in Table 2.

4. PRUNING BASED LDOF OUTLIER DETECTION METHOD

A. Overview

Based on the work in [9], we use the Local Distance-based Outlier Factor (LDOF), a measure which specifies by how much an object deviates from its neighbours. If the LDOF value of an object is high, the object deviates strongly from its neighbours and can be considered an outlier. For an object p, the LDOF value is given as:

LDOF(p) = d(p) / D(p) (3)

where d(p) is the k-nearest neighbour distance of p and D(p) is the k-nearest neighbour inner distance of p.

For an object p, the distance d(p) is the average distance from p to all objects in its nearest neighbours' set N(p):

d(p) = (1 / k) Σ_{q ∈ N(p)} dist(p, q) (4)

For an object p, the distance D(p) is the average distance among all pairs of objects in its nearest neighbours' set:

D(p) = (1 / (k(k − 1))) Σ_{q, q′ ∈ N(p), q ≠ q′} dist(q, q′) (5)

B. Procedure

The major drawback of the LDOF algorithm is that it is computationally expensive, because the LDOF has to be calculated for every object in the data set. To resolve this drawback, we use the K-means algorithm to form clusters. After the clusters are formed, the radius of each cluster is calculated, and the objects whose distance from the centroid is less than the radius of the cluster are pruned. The LDOF is then calculated only for the remaining (unpruned) objects, which reduces the total computations. Among these objects, the top-n objects with the highest LDOF values are considered outliers.
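Equations (3)-(5) can be computed directly. The following Python sketch evaluates the LDOF of each point by brute force and reports the top-n; it covers only the ranking step, without the K-means pruning, and the names are our own illustration:

```python
import math
from itertools import combinations


def ldof(p, data, k):
    """Local distance-based outlier factor (eqs. 3-5): the average
    distance from p to its k nearest neighbours, divided by the average
    pairwise distance among those neighbours. Assumes k >= 2."""
    neighbours = sorted((q for q in data if q != p),
                        key=lambda q: math.dist(p, q))[:k]
    knn_dist = sum(math.dist(p, q) for q in neighbours) / k
    pairs = [math.dist(a, b) for a, b in combinations(neighbours, 2)]
    knn_inner_dist = sum(pairs) / len(pairs)
    return knn_dist / knn_inner_dist


def top_n_outliers(data, k, n):
    """Rank all points by LDOF and report the n highest as outliers."""
    return sorted(data, key=lambda p: ldof(p, data, k), reverse=True)[:n]


# Four points in a unit square and one far away: the far point has a
# large kNN distance relative to its neighbours' inner distance.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
print(top_n_outliers(data, k=3, n=1))  # [(8, 8)]
```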

The following steps are involved in the pruning based LDOF outlier detection algorithm:

1. Formation of clusters: the entire data set is considered, the K-means algorithm is applied to generate the clusters, and the radius of each cluster is calculated.

2. Clusters with minimal objects: if a cluster is found to contain fewer objects than the number of outliers required, pruning is avoided in that cluster.

3. Pruning objects within a cluster: the distance of each object in a cluster from its centroid is calculated. If the distance is less than the cluster radius, the object is pruned.

4. Evaluation of outliers: for the unpruned objects in each cluster, the LDOF is calculated, and the top-n objects with the highest LDOF values are declared outliers.

The following algorithm gives the steps involved in the pruning based LDOF outlier detection algorithm:

begin OutlierDetection(DS, c, it, n)
  X ← Kmeans(c, it, DS)
  for each cluster Cj in X do
    Radiusj ← radius(Cj)
  end for
  for each cluster Cj in X do
    if |Cj| > n then
      for each object pi in Cj do
        if distance(pi, oj) < Radiusj then
          prune(pi)
        else
          add pi to U
        end if
      end for
    else
      for each object pi in Cj do
        add pi to U
      end for
    end if
  end for
  for each object pi in U do
    calculate LDOF(pi)
  end for
  sort the objects based on their LDOF(pi) values
  choose the n objects with the highest LDOF(pi) values and display them as outliers
end OutlierDetection

The computational complexity of our algorithm is c·it·N + c·np + (w·N)² [2], where c is the number of clusters to be generated, it is the number of iterations needed, N is the number of objects in the data set DS, np is the average number of objects in each cluster, and w is the fraction of objects that are unpruned.

5. EXPERIMENTAL RESULTS

In this section, we perform two experiments, one regarding outlier detection by deriving association rules from streamed data and the other regarding outlier detection by calculating the LDOF after pruning. The experiments are performed on a PC with an Intel 2.3 GHz processor and 2 GB of memory.

In the first experiment, we consider a supermarket data set consisting of 4627 transactions and 217 attributes representing items in the supermarket. For deriving the association rules [1], the minimum support and minimum confidence are set to 30% and 80% respectively. Among the 217 attributes, we have considered only 18 attributes.

The number of association rules is set to 8. The association rules derived from the data set are given in Table 3, which lists each rule id and the corresponding rule; all of the rules have high confidence values.

Table 3. Association Rules Derived from the Supermarket Data Set

RID  Rule
1    {biscuits, vegetables} → {Book, cake}
2    {total} → {Book, cake}
3    {biscuits, milk-cream} → {Book, cake}
4    {biscuits, fruit} → {Book, cake}
5    {biscuits, frozen-foods} → {Book, cake}
6    {frozen-foods, fruit} → {Book, cake}
7    {frozen-foods, milk-cream} → {Book, cake}
8    {baking-needs, milk-cream} → {Book, cake}

The confidence value for rules 1-4 is 0.84 and for rules 5-8 it is 0.83.

As per the association rules, if a transaction contains biscuits and vegetables, then it should contain both Book and cake. If either Book or cake is found to be missing, then such a transaction is said to be an outlier. Similarly, transactions which do not follow the above high confidence association rules are said to be outliers.

6. CONCLUSIONS

In this paper, we have considered the outlier detection problem in two varieties of databases, one involving data streams, where data arrives continuously and in a time based order, and the other involving static databases, and we have provided an algorithm for each. For streamed data, a sliding window is considered to make the data items in the database bounded. Then, for all the transactions in the bounded data set, a prefix-tree is built and items are added to it, and a traversal stack is maintained over it.

Association rules are derived from the transaction data set, and the transactions whose items do not obey the association rules are declared outliers. For static databases, pruning based outlier detection is performed, which internally uses the K-means clustering algorithm followed by the local distance-based outlier factor [2] for each object within a cluster. The objects with high LDOF values are declared outliers.

Finally, we hope this work will attract further interest in various problem areas of data mining, such as text mining and multimedia data mining.

ACKNOWLEDGEMENTS

The authors would like to acknowledge Li-Jen Kao, Yo-Ping Huang, P. Rajendra, D. Jatindra Kumar and N. Sukumar for their valuable discussions.

REFERENCES

[1] Li-Jen Kao and Yo-Ping Huang, "Association rules based algorithm for identifying outlier transactions in data stream," IEEE Int'l Conference on Systems, Man, and Cybernetics, Oct. 14-17, 2012.

[2] P. Rajendra, D. Jatindra Kumar and N. Sukumar, "An outlier detection method based on clustering," Int'l Conference on Emerging Applications of Information Technology, 2011.

[3] J.H. Chang and W.S. Lee, "Finding recent frequent item sets adaptively over online data streams," in Proc. of the 9th ACM SIGKDD, Washington, DC, USA, pp. 487-492, August 2003.

[4] E.M. Knorr and R.T. Ng, "Algorithms for mining distance-based outliers in large databases," in Proc. 24th Int'l Conf. on Very Large Data Bases (VLDB), pp. 392-403, 1998.

[5] F. Angiulli, S. Basta and C. Pizzuti, "Distance-based detection and prediction of outliers," IEEE Transactions on Knowledge and Data Engineering, 18:145-160, 2006.

[6] F. Angiulli and C. Pizzuti, "Fast outlier detection in high dimensional spaces," in PKDD '02: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, pp. 15-26, 2002.

[7] F. Angiulli and C. Pizzuti, "Outlier mining in large high-dimensional data sets," IEEE Transactions on Knowledge and Data Engineering, 17:203-215, 2005.

[8] M.M. Breunig, H.-P. Kriegel, R.T. Ng and J. Sander, "LOF: identifying density-based local outliers," SIGMOD Rec., 29(2):93-104, 2000.

[9] K. Zhang, M. Hutter and H. Jin, "A new local distance-based outlier detection approach for scattered real-world data," in PAKDD '09: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pp. 813-822, 2009.

[10] K. Narita and H. Kitagawa, "Outlier detection for transactional databases using association rules," in Proc. of the 9th Int'l Conf. on Web-Age Information Management, Zhangjiajie, Hunan, pp. 373-380, July 2008.