Type: Definition Essays
Sample donated: Agnes Robertson
Last updated: September 13, 2019
Partitioning MethodsSamantha O’NeillMSC in Applied Science and Data AnalyticsInstitute of Technology Blanchardstown [email protected]
itb.ieAbstractThe purpose of this paper is torecent current trends in clustering algorithms. There is a focus on portioningmethods in particular k-means. Alternative methods to k-means are proposed toachieve better results based on the type of data being analysed. Influencessuch as distance and similarity measures are discussed.
Finally, the evaluationof clustering algorithms is outlined.Section 1.IntroductionClustering is a highly regarded unsupervised learningmethod. In short, clustering is the grouping of objects within a dataset basedon an intrinsic similarity between the objects.
Unlike classification where thegroups are predefined, clustering does not work based on the group being knownin advance instead the algorithm decides the optimal way to group the data. Ingrouping objects with a high similarity in to a cluster the objects must becompared with objects in other clusters to compare the dissimilarities betweenthem. Clusters are defined in severaldifferent ways generally based on their distance. A well separated cluster isdefined as a set of objects such that a given object in a cluster is closer toevery other object in that cluster than to any object not in that cluster, thiscloseness is the similarity. Centre based clusters are said to contain a set ofobjects such that an object in the cluster is closer and more like the centrecluster compared with the centre of any other cluster. Clusters have manycharacteristics that help define them.
Clusters considered density based clustersare dense regions of points separated by low density regions. Clusters can haveshared properties, this is when a cluster has a shared common property.The ability to group objects in a dataset together isimportant across many industries ranging from interdisciplinary areas such asbiology, mathematics, business and medicine.
In business, many marketers useclustering to discover customer groups based on their buying patterns, thisallows the business to create better marketing campaigns that target the rightsegment of their audience. Advances in medicine have seen the use of clusteringfor image analysis. The positron emission tomography (PET) scan has hadclustering techniques applied to aid distinguishing between areas of tissue andblood. Across industries the goal of clustering is the same, which is to creategroups based on object instances and allocate each one to an appropriate group.
Section 2 of this paper focuses on the various clusteringtechniques as well as potential changes that enhance the overall clusteringmethod. k-means is a type of partition clustering method, looking at this, anexplanation of how the algorithm works is outlined in section 3. Jain, A.K outlines that the main objective of clusteringdata is to examine and determine the real grouping of objects, points and setsof instances within a dataset (Jain, A.
K). Based on this discussion drawing aparallel based on similarity amongst objects is the central factor for acluster. In section 4 of this paper defining similarity based on measurementsof distance between the objects will be looked at.As clustering has no known class label attached it is saidto be much more difficult than supervised methods such as classification(Saxena et al). The complexity lies with being able to determine which group anobject belongs to why the label is absent. Based on this this are numerousparameters that need to be considered for clustering and evaluated to determineif they are a good fit.
This can often lead to higher computational cost as itis more time consuming. Determining what is a “good” cluster is imprecise basedon the nature of the data and the data mining objective. Clustering validationwill be covered in section 5 by examining the different evaluation methods. Section 2. ClusteringTechniquesIn this section, a brief overview of clustering techniqueswill be examined. Rokach, L. suggested that there is a need for differentclustering approaches in handling clustering techniques due to there being noprecise definition as to what a cluster really is (Rokack, L).
Clusteringtechniques are divided in to groups based on their approach. Some of thoseclustering groups include portioning methods, hierarchal methods, grid-basedmethods, density based methods and model-based methods. 2.1 HierarchicalMethodsIn hierarchical clustering methods, the clusters are formedby grouping data objects into a tree of clusters.
There are two approaches toachieve this referred to as bottom up and top down. Hierarchical methods arefound in two forms divisive hierarchical clustering and agglomerativeclustering (Murtagh, F.).2.1.1 Divisivehierarchical clustering Divisive hierarchical clustering is the top down approach.
Every object is put in to a cluster to begin with. The clusters are thensubdivided and broken up to put all objects into smaller clusters oftenreferred to as sub clusters. This approach is continued until all objects havebeen placed in its own cluster or until it satisfies certain criteria for theobject to be ended. The criteria for ending a condition is based on a range ofdeciding factors including maximum size of the cluster and the number ofclusters required.2.1.2 AgglomerativeclusteringAgglomerative clustering it the bottom up approach. Thisinvolves giving every object its own cluster to begin with.
These singleclusters are then joined together until all objects are in a single cluster oruntil it satisfies certain criteria for the object to be ended. The clustersare merged at the smallest distance between neighbouring clusters.2.
1.3 Determining theMerge and Split Points for Hierarchical Clustering MethodsTo further group hierarchical clustering methods there are afew linkages to be considered that can assist in achieving better results(Jain, A.K). The most commonly used methods of these linkages are based onsimilarity measures and are briefly discussed in the proceeding sub segments.
Themethods used to merge and split points has the biggest impact on clusterdefinition.18.104.22.168 SingleLinkage ClusteringOften referred to as the nearest neighbour method it is thelinkage between two clusters based on the distance between the closest pointsthat are in the different clusters.
This is said to be the similarity of theclusters based on their closet points. If there are similarities found in thepair of clusters the clusters are measured to be equivalent to the greatestsimilarity from any of any object of one cluster to another (Saxena et al).Recommended method for non-elliptic shapes but merging is easily impacted byoutliers in a cluster.22.214.171.124 CompleteLinkage ClusteringReferred to the further neighbour and maximum method, it isthe distance between two clusters where the distance between the furthestpoints are found in different clusters. The similarity for this method ismeasured based on their least similar points.
Works best with small clustersand tends to have a bias towards globular clusters.126.96.36.199 AverageLinkage ClusteringIn this method, the cluster proximity is the averagedistance from any object of one cluster to an object of another cluster (Ward,J.H). This method is robust to noise and outliers in the data as it assumes anaverage, it does however have a bias towards globular clusters.
4 Centroid LinkageThis linkage is based on the closeness distance between thecentroid of two clusters given the cluster has a centre. 2.2 Grid Based ModelsThe Kohonen network is an example of a grid based model.
These models are the result of the implementation of a self-organised map basedon a clustering approach of using a grid data structure to organise thecluster.2.3 Density BasedModelsDBSCAN is one example of a density based algorithm. Itcreates a model based on the clustering approach of locating regions of highdensity that are divided by areas of low density in a state of space ofobjects. DBSCAN is a very simple algorithm which has a high success rate. Section 3. PartitionClustering MethodsThe partitioning clustering method can be considered as beingthe opposite of hierarchical clustering.
In partitioning clustering, there isno hierarchical structure, instead this method constructs k partitions of thedatabase based on a criteria function (Lam, D. & Wunsch, D.C). The numberof k partitions is specified in advance and a good partitioning results inobjects from the same cluster being close to each other while objects fromdifferent clusters are said to be far apart from each other. There are severalalgorithms that show the partition clustering approach. They include k-meansclustering, fuzzy c-means and DBSCAN.
3.1 k-meansClustering AlgorithmThe k-means algorithm is one of the best known and widelypopular portioning methods used. It has been bench marked against otheralgorithms and found to be one of the simplest unsupervised learning algorithmsin solving clustering difficulties (Saxena et al). This method works well with high dimensional data.
The algorithmis said to be quick and simple making it easy to understand. It works best whenthe data set being analysed contains distinct or well separated objects (Naik,Azad). When the mean of the cluster can be defined good results are obtained butthis is often a difficult due to the use of categorical variables. As thepoints chosen initially are random it impacts on the outcome making the resultsunstable.
There are variations on k-means which can be used to improveresults, for noisy data and improvement to initial cluster centres. Theseinclude, K-medoid which differs by taking the most centrally located objectinstead of the mean value of the objects. Kernel k-means uses a kernel functionto first map the dataset onto a higher dimensional space to provide greaterseparation and k-Means++ which weights all rows based on their proximity totheir current closest cluster set.3.2 How k-means worksThe procedure follows a number straightforward steps. Themain objective is to define the k centres for each of the clusters1. First, randomly select k cluster centres.
2. Each of the objects is assigned to the group that has theclosest centroid. This is based on assigning objects that are similar to acluster based on the distance between the clusters mean. Once all objects havebeen placed in a cluster the mean can be recalculated.3. All objects are compared to the new mean and once againreallocated to the cluster it is most like.4. Steps 2 and 3 are repeated until the distance betweenobjects and their clusters means converges.
The sum of squared error is themeans for measuring convergence. Figure 1: Sum of Squared Error Formula (Naik, Azad)where, ‘||xi – vj||’ isthe Euclidean distance between xi and vj. ‘ci’ is the number ofdata points in ith cluster. ‘c’ is the number of cluster centres.
Section 4. Measuresof attributes, object similarity and dissimilarityAs previously discussed in this paper clustering is largelybased on similarity. Dissimilarities measures take in to account the degree towhich objects are different, this is what we refer to as the distance betweenobjects and clusters.
Similarity and dissimilarity are both numeric measures ofhow alike or unlike objects are. Similarity usually has a non-negative range of(0,1) where 0 is no similarity and 1 is complete similarity. Dissimilaritiesrange from zero to infinity where 0 is no dissimilarity. Both functions areused to measure the proximity between rows and objects in a dataset. The approach outlined in the previous paragraph workstypically well with numeric fields such as height and weight in calculatingproximities.
However, there are many approaches that can be used when lookingat similarity and distance between various types of attributes to determineproximity which are discussed in further detail in the following sub sections.4.1 Measures ofProximitySelecting the best distance or similarity measure willdepend on both the dataset being analyses and the research problem that istrying to be solved.
It is important to evaluate the structure of the clustersto get a sense of what the clusters look like. There is no one best fit formeasuring proximity so understanding the clusters, attributes, objects andfeatures will aid in selecting a measure.4.1.1 DistanceMeasuresWhen measuring the distance for a sense data set where theattributes are nominal a numeric calculation such as 0 value is the same 1value is different can be applied.Euclidean is a distance measure commonly used for numericdistance measures. It is one of the most common distance metrics and works wellon compact cluster sets.
It works by treating each centroid as the mean of theobjects contained in the cluster. It is highly sensitive to outliers and isdeemed not suitable for application to a dataset with hundreds of attributesand features (Saxena et al). 4.
1.2 SimilarityMeasuresEvery clustering issues applies at least one similaritymeasure. The simplest similarity measure to handle objects that have nominaland categorical variables is to count the number of variables that have thesame value. This is done by following the below formula:S(I,j) = m/pm is the number of matches and p is the total number ofvariables that are either nominal or categorical.
More complex scenarios can benefit from the use of Jaccardsimilarity coefficient, which works particularly well when a count of bothpresence ad absent objects is not appropriate (Saxena et al). The Jaccard coefficient ignores attributeswhere both values are 0. An example of this is an analysis carried out bymarket researchers to clustering customer data to determine their shopping preferences,Jaccard coefficient allows for the products not bought to be excluded from themeasurement as it would not make sense to include them as the number ofproducts not bought would far outweigh products that were bought. Section 5.
ClusterEvaluationsThis section concludes the paper with a synopsis of clusteringvalidation through clustering evaluations. There are reported to be two maincategories of clustering evaluations, subjective and objective. Subjective tends to refer to a subject matter expert who canlook at the structure of each cluster to determine if the grouping of objectsis well segmented. This is a time-consuming process which can be aided with theuse of algorithm and cluster visualisation tools. It is however open to humanerror.
Objective evaluations refer to ones where metrics are usedto measure performance of the clustering algorithm. The measures for thiscalculation are based on inter and intra cluster distances. Evaluating measuresin this way generally involves the calculation of a ratio between the distances.
Davies-Bouldin Index and Silhouette Index are two such ratios.Davies-Bouldin Index works by finding the average distancewithin clusters over distance between clusters. This results in a maximumration which when added together can be divided by the number of clusters toobtain the index. The smaller the index the better the clustering result issaid to be (Arroyo et al).
Silhouette Index is based on the between cluster and withincluster differences. The optimal number of clusters is achieved by increasingthe value of this index. A high silhouette index value is required to deem thisas a good fit as the aim it to obtain clusters with minimum intra-clusterdistance and maximum inter-cluster distance (Arroyo et al). Section 6. ConclusionThe clustering technique is one used industry wide anduseful for grouping objects together based on their similarities anddissimilarities. Determining the best clustering technique will vary dependingon the data set being analysed and business objective set.
As covered insection 3 k-means clustering algorithms is one of the most commonly utilisedalgorithms based on its simplicity when bench marked against other algorithms.k-means does have its limitations and is susceptible to noise, in cases whereoptimal results are not achieved variations such as k-medoids can be applied topartition the given dataset. Understanding the attributes of the dataset isessential to choosing the correct distance and similarity measures to achieveoptimal proximity results. As data comes in all shapes and forms and is everevolving studies suggest that there is no one best measure that can be appliedas standard and therefore understanding the data set is key to achieve a highresult. There are a number of clustering evaluation methods to validate thecluster obtained, again determine the best method and or ratio will be dependenton the data set being analysed. ReferencesJain, A.K, “Dataclustering: 50 years beyond k-means”, Pattern Recognit, Lett, 31 (8),651-666, (2010).Lam, D.
& Wunsch, D.C, “Clustering, academic press libraryin signal processing, SignalProcess”, Theory Mah Learn, (2014).Murtagh, F., “A surveyof recent advances in hierarchical clustering algorithms which use clustercentres”, Comput.J, 26 (4), 354-360, (1984).Naik, Azad, “Data Clustering Algorithms”, accessed online at URL: https://sites.
google.com/site/dataclusteringalgorithms/home,(Last accessed: December 10th, 2017).Rokack, L.
,” ClusteringMethods, Data Mining and Knowledge Discovery Handbook”, Springer, 331-352,(2005).Saxena. A, Prasad. M, Gupta.
A, Bharill. N, Patel. O.P,Tiwari. A, Er. M.J, Ding.
W, Lin. C, “Areview of clustering techniques and developments”, Elesvier, ScienceDirect, (2017).Ward, J.H, “Hierarchical grouping to optimize an objectivefunction”, J. AM Stat. Associ.
58, 235-245, (1963) Arroyo, A., Herrero, A., Tricio, V., Corchado, E., “Analysis of meteorological conditions inSpain by means of clustering techniques”, Journal of Applied Logic, (24),76-89, (2017).
Student Self-EvaluationFormClustering Paper Criteria 1: Evidence of understandingIs it evident from your paperthat you understand what you have written?I feel I have made it evidentthat I understand what I have written. To the best of my ability I have writtenin non-technical terms to convey my understanding of the clustering topic.Criteria 2: Depth, scope & bibliographyIs it evident from your paper that you haveresearched your topic in depth, from many sources, and included content notcovered in lecture notes?I found this topic to be difficult incomparison with the classification paper in terms of obtaining good qualitypapers about recent research trends. I believe it is clear that I haveresearched the theory quiet well but could have gotten more depth in regards tocurrent and emerging trends.
Criteria 3: Structure & flow of the paperDoesthe paper flow well, and is it properly structured?Eachtopic has been given its own section, with sub sections when necessary whichaids the flow of the paper. Every new point discussed in the paper is given anew paragraph to help structure the paper. It contains a clear beginning,middle and end and overall. I feel that the sections flow well, I did debatewith myself whether section 5 “Cluster Evaluations” should be before section 4″Measures of attributes, object similarity and dissimilarity” but decided thatunderstanding the attributes measurements was an important factor for decidingon cluster evaluations.
Criteria 4: Authors comments & analysisHave you included your own analysison what you have read? I have also provided myopinion as to variations of the Naïve Bayes classifier that can be used toobtain better results. In review, I feel I could have improved upon providingcomments on more real-world examples of the use of k-means as a clusteringalgorithm. Overall given the topic areas I covered I feel I commented andprovided analysis on the good and bad points.