INTRODUCTION TO DATA MINING
SECOND EDITION

PANG-NING TAN, Michigan State University
MICHAEL STEINBACH, University of Minnesota
ANUJ KARPATNE, University of Minnesota
VIPIN KUMAR, University of Minnesota
330 Hudson Street, NY, NY 10013
Director, Portfolio Management: Engineering, Computer Science & Global Editions: Julian Partridge
Specialist, Higher Ed Portfolio Management: Matt Goldstein
Portfolio Management Assistant: Meghan Jacoby
Managing Content Producer: Scott Disanno
Content Producer: Carole Snyder
Web Developer: Steve Wright
Rights and Permissions Manager: Ben Ferrini
Manufacturing Buyer, Higher Ed, Lake Side Communications Inc (LSC): Maura Zaldivar-Garcia
Inventory Manager: Ann Lam
Product Marketing Manager: Yvonne Vannatta
Field Marketing Manager: Demetrius Hall
Marketing Assistant: Jon Bryant
Cover Designer: Joyce Wells, jWells Design
Full-Service Project Management: Chandrasekar Subramanian, SPi Global
Copyright © 2019 Pearson Education, Inc. All rights reserved. Manufactured in the United States of America. This publication is protected by Copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions department, please visit www.pearsonhighed.com/permissions/.
Many of the designations by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Library of Congress Cataloging-in-Publication Data on File
Names: Tan, Pang-Ning, author. | Steinbach, Michael, author. | Karpatne, Anuj, author. | Kumar, Vipin, 1956- author.
Title: Introduction to Data Mining / Pang-Ning Tan, Michigan State University, Michael Steinbach, University of Minnesota, Anuj Karpatne, University of Minnesota, Vipin Kumar, University of Minnesota.
Description: Second edition. | New York, NY : Pearson Education, [2019] | Includes bibliographical references and index.
Identifiers: LCCN 2017048641 | ISBN 9780133128901 | ISBN 0133128903
Subjects: LCSH: Data mining.
Classification: LCC QA76.9.D343 T35 2019 | DDC 006.3/12–dc23 LC record available at https://lccn.loc.gov/2017048641
ISBN-10: 0133128903
ISBN-13: 9780133128901

To our families ...
Preface to the Second Edition
Since the first edition, roughly 12 years ago, much has changed in the field of data analysis. The volume and variety of data being collected continue to increase, as has the rate (velocity) at which it is being collected and used to make decisions. Indeed, the term Big Data has been used to refer to the massive and diverse data sets now available. In addition, the term data science has been coined to describe an emerging area that applies tools and techniques from various fields, such as data mining, machine learning, statistics, and many others, to extract actionable insights from data, often big data.
The growth in data has created numerous opportunities for all areas of data analysis. The most dramatic developments have been in the area of predictive modeling, across a wide range of application domains. For instance, recent advances in neural networks, known as deep learning, have shown impressive results in a number of challenging areas, such as image classification, speech recognition, and text categorization and understanding. While not as dramatic, other areas, e.g., clustering, association analysis, and anomaly detection, have also continued to advance. This new edition is in response to those advances.
Overview
As with the first edition, the second edition of the book provides a comprehensive introduction to data mining and is designed to be accessible and useful to students, instructors, researchers, and professionals. Areas covered include data preprocessing, predictive modeling, association analysis, cluster analysis, anomaly detection, and avoiding false discoveries. The goal is to present fundamental concepts and algorithms for each topic, thus providing the reader with the necessary background for the application of data mining to real problems. As before, classification, association analysis, and cluster analysis are each covered in a pair of chapters. The introductory chapter covers basic concepts, representative algorithms, and evaluation techniques, while the following chapter discusses more advanced concepts and algorithms. As before, our objective is to provide the reader with a sound understanding of the foundations of data mining, while still covering many important advanced topics. Because of this approach, the book is useful both as a learning tool and as a reference.
To help readers better understand the concepts that have been presented, we provide an extensive set of examples, figures, and exercises. The solutions to the original exercises, which are already circulating on the web, will be made public. The exercises are mostly unchanged from the last edition, with the exception of new exercises in the chapter on avoiding false discoveries. New exercises for the other chapters and their solutions will be available to instructors via the web. Bibliographic notes are included at the end of each chapter for readers who are interested in more advanced topics, historically important papers, and recent trends. These have also been significantly updated. The book also contains a comprehensive subject and author index.
What is New in the Second Edition?
Some of the most significant improvements in the text have been in the two chapters on classification. The introductory chapter uses the decision tree classifier for illustration, but the discussion on many topics—those that apply across all classification approaches—has been greatly expanded and clarified, including topics such as overfitting, underfitting, the impact of training size, model complexity, model selection, and common pitfalls in model evaluation. Almost every section of the advanced classification chapter has been significantly updated. The material on Bayesian networks, support vector machines, and artificial neural networks has been significantly expanded. We have added a separate section on deep networks to address the current developments in this area. The discussion of evaluation, which occurs in the section on imbalanced classes, has also been updated and improved.
The changes in association analysis are more localized. We have completely reworked the section on the evaluation of association patterns (introductory chapter), as well as the sections on sequence and graph mining (advanced chapter). Changes to cluster analysis are also localized. The introductory chapter added the K-means initialization technique and an updated discussion of cluster evaluation. The advanced clustering chapter adds a new section on spectral graph clustering. Anomaly detection has been greatly revised and expanded. Existing approaches—statistical, nearest neighbor/density-based, and clustering-based—have been retained and updated, while new approaches have been added: reconstruction-based, one-class classification, and information-theoretic. The reconstruction-based approach is illustrated using autoencoder networks that are part of the deep learning paradigm. The data chapter has been updated to include discussions of mutual information and kernel-based techniques.
The last chapter, which discusses how to avoid false discoveries and produce valid results, is completely new and is novel among other contemporary textbooks on data mining. It supplements the discussions in the other chapters with a discussion of the statistical concepts (statistical significance, p-values, false discovery rate, permutation testing, etc.) relevant to avoiding spurious results, and then illustrates these concepts in the context of data mining techniques. This chapter addresses the increasing concern over the validity and reproducibility of results obtained from data analysis. The addition of this last chapter is a recognition of the importance of this topic and an acknowledgment that a deeper understanding of this area is needed for those analyzing data.
The data exploration chapter has been deleted, as have the appendices, from the print edition of the book, but both will remain available on the web. A new appendix provides a brief discussion of scalability in the context of big data.
To the Instructor
As a textbook, this book is suitable for a wide range of students at the advanced undergraduate or graduate level. Since students come to this subject with diverse backgrounds that may not include extensive knowledge of statistics or databases, our book requires minimal prerequisites. No database knowledge is needed, and we assume only a modest background in statistics or mathematics, although such a background will make for easier going in some sections. As before, the book, and more specifically, the chapters covering major data mining topics, are designed to be as self-contained as possible. Thus, the order in which topics can be covered is quite flexible. The core material is covered in chapters 2 (data), 3 (classification), 5 (association analysis), 7 (clustering), and 9 (anomaly detection). We recommend at least a cursory coverage of Chapter 10 (Avoiding False Discoveries) to instill in students some caution when interpreting the results of their data analysis. Although the introductory data chapter (2) should be covered first, the basic classification (3), association analysis (5), and clustering chapters (7) can be covered in any order. Because of the relationship of anomaly detection (9) to classification (3) and clustering (7), these chapters should precede Chapter 9.
Various topics can be selected from the advanced classification, association analysis, and clustering chapters (4, 6, and 8, respectively) to fit the schedule and interests of the instructor and students. We also advise that the lectures be augmented by projects or practical exercises in data mining. Although they are time-consuming, such hands-on assignments greatly enhance the value of the course.
Support Materials
Support materials for all readers of this book are available at http://www-users.cs.umn.edu/~kumar/dmbook.
PowerPoint lecture slides
Suggestions for student projects
Data mining resources, such as algorithms and data sets
Online tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software
Additional support materials, including solutions to exercises, are available only to instructors adopting this textbook for classroom use. The book's resources will be mirrored at www.pearsonhighered.com/cs-resources. Comments and suggestions, as well as reports of errors, can be sent to [email protected].
Acknowledgments
Many people contributed to the first and second editions of the book. We begin by acknowledging our families to whom this book is dedicated. Without their patience and support, this project would have been impossible.
We would like to thank the current and former students of our data mining groups at the University of Minnesota and Michigan State for their contributions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data mining classes. Some of the exercises and presentation slides that they created can be found in the book and its accompanying slides. Students in our data mining groups who provided comments on drafts of the book or who contributed in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric Eilertson, Levent Ertöz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We would also like to thank the students of our data mining classes at the University of Minnesota and Michigan State University who worked with early drafts of the book and provided invaluable feedback. We specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu Wei.
Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida) class tested early versions of the book. We also received many useful suggestions directly from the following UT students: Pankaj Adhikari, Rajiv Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi, Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.
Ronald Kostoff (ONR) read an early version of the clustering chapter and offered numerous suggestions. George Karypis provided invaluable LaTeX assistance in creating an author index. Irene Moulitsas also provided assistance with LaTeX and reviewed some of the appendices. Musetta Steinbach was very helpful in finding errors in the figures.
We would like to acknowledge our colleagues at the University of Minnesota and Michigan State who have helped create a positive environment for data mining research. They include Arindam Banerjee, Dan Boley, Joyce Chai, Anil Jain, Ravi Janardan, Rong Jin, George Karypis, Claudia Neuhauser, Haesun Park, William F. Punch, György Simon, Shashi Shekhar, and Jaideep Srivastava. The collaborators on our many data mining projects, who also have our gratitude, include Ramesh Agrawal, Maneesh Bhargava, Steve Cannon, Alok Choudhary, Imme Ebert-Uphoff, Auroop Ganguly, Piet C. de Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra, Rama Nemani, Nikunj Oza, Chris Potter, Lisiane Pruinelli, Nagiza Samatova, Jonathan Shapiro, Kevin Silverstein, Brian Van Ness, Bonnie Westra, Nevin Young, and Zhi-Li Zhang.
The departments of Computer Science and Engineering at the University of Minnesota and Michigan State University provided computing resources and a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA, NOAA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. In particular, Kamal Abdali, Mitra Basu, Dick Brackney, Jagdish Chandra, Joe Coughlan, Michael Coyle, Stephen Davis, Frederica Darema, Richard Hirsch, Chandrika Kamath, Tsengdar Lee, Raju Namburu, N. Radhakrishnan, James Sidoran, Sylvia Spengler, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, Aidong Zhang, and Xiaodong Zhang have been supportive of our research in data mining and high-performance computing.
It was a pleasure working with the helpful staff at Pearson Education. In particular, we would like to thank Matt Goldstein, Kathy Smith, Carole Snyder, and Joyce Wells. We would also like to thank George Nichols, who helped with the artwork, and Paul Anagnostopoulos, who provided LaTeX support.
We are grateful to the following Pearson reviewers: Leman Akoglu (Carnegie Mellon University), Chien-Chung Chan (University of Akron), Zhengxin Chen (University of Nebraska at Omaha), Chris Clifton (Purdue University), Joydeep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of Technology), J. Michael Hardin (University of Alabama), Jingrui He (Arizona State University), James Hearne (Western Washington University), Hillol Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn Keogh (University of California-Riverside), Bing Liu (University of Illinois at Chicago), Mariofanna Milanova (University of Arkansas at Little Rock), Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University of North Carolina at Charlotte), Xintao Wu (University of North Carolina at Charlotte), and Mohammed J. Zaki (Rensselaer Polytechnic Institute).
Over the years since the first edition, we have also received numerous comments from readers and students who have pointed out typos and various other issues. We are unable to mention these individuals by name, but their input is much appreciated and has been taken into account for the second edition.
Contents
Preface to the Second Edition v
1 Introduction 1
1.1 What Is Data Mining? 4
1.2 Motivating Challenges 5
1.3 The Origins of Data Mining 7
1.4 Data Mining Tasks 9
1.5 Scope and Organization of the Book 13
1.6 Bibliographic Notes 15
1.7 Exercises 21
2 Data 23
2.1 Types of Data 26
2.1.1 Attributes and Measurement 27
2.1.2 Types of Data Sets 34
2.2 Data Quality 42
2.2.1 Measurement and Data Collection Issues 42
2.2.2 Issues Related to Applications 49
2.3 Data Preprocessing 50
2.3.1 Aggregation 51
2.3.2 Sampling 52
2.3.3 Dimensionality Reduction 56
2.3.4 Feature Subset Selection 58
2.3.5 Feature Creation 61
2.3.6 Discretization and Binarization 63
2.3.7 Variable Transformation 69
2.4 Measures of Similarity and Dissimilarity 71
2.4.1 Basics 72
2.4.2 Similarity and Dissimilarity between Simple Attributes 74
2.4.3 Dissimilarities between Data Objects 76
2.4.4 Similarities between Data Objects 78
2.4.5 Examples of Proximity Measures 79
2.4.6 Mutual Information 88
2.4.7 Kernel Functions* 90
2.4.8 Bregman Divergence* 94
2.4.9 Issues in Proximity Calculation 96
2.4.10 Selecting the Right Proximity Measure 98
2.5 Bibliographic Notes 100
2.6 Exercises 105
3 Classification: Basic Concepts and Techniques 113
3.1 Basic Concepts 114
3.2 General Framework for Classification 117
3.3 Decision Tree Classifier 119
3.3.1 A Basic Algorithm to Build a Decision Tree 121
3.3.2 Methods for Expressing Attribute Test Conditions 124
3.3.3 Measures for Selecting an Attribute Test Condition 127
3.3.4 Algorithm for Decision Tree Induction 136
3.3.5 Example Application: Web Robot Detection 138
3.3.6 Characteristics of Decision Tree Classifiers 140
3.4 Model Overfitting 147
3.4.1 Reasons for Model Overfitting 149
3.5 Model Selection 156
3.5.1 Using a Validation Set 156
3.5.2 Incorporating Model Complexity 157
3.5.3 Estimating Statistical Bounds 162
3.5.4 Model Selection for Decision Trees 162
3.6 Model Evaluation 164
3.6.1 Holdout Method 165
3.6.2 Cross-Validation 165
3.7 Presence of Hyper-parameters 168
3.7.1 Hyper-parameter Selection 168
3.7.2 Nested Cross-Validation 170
3.8 Pitfalls of Model Selection and Evaluation 172
3.8.1 Overlap between Training and Test Sets 172
3.8.2 Use of Validation Error as Generalization Error 172
3.9 Model Comparison 173
3.9.1 Estimating the Confidence Interval for Accuracy 174
3.9.2 Comparing the Performance of Two Models 175
3.10 Bibliographic Notes 176
3.11 Exercises 185
4 Classification: Alternative Techniques 193
4.1 Types of Classifiers 193
4.2 Rule-Based Classifier 195
4.2.1 How a Rule-Based Classifier Works 197
4.2.2 Properties of a Rule Set 198
4.2.3 Direct Methods for Rule Extraction 199
4.2.4 Indirect Methods for Rule Extraction 204
4.2.5 Characteristics of Rule-Based Classifiers 206
4.3 Nearest Neighbor Classifiers 208
4.3.1 Algorithm 209
4.3.2 Characteristics of Nearest Neighbor Classifiers 210
4.4 Naïve Bayes Classifier 212
4.4.1 Basics of Probability Theory 213
4.4.2 Naïve Bayes Assumption 218
4.5 Bayesian Networks 227
4.5.1 Graphical Representation 227
4.5.2 Inference and Learning 233
4.5.3 Characteristics of Bayesian Networks 242
4.6 Logistic Regression 243
4.6.1 Logistic Regression as a Generalized Linear Model 244
4.6.2 Learning Model Parameters 245
4.6.3 Characteristics of Logistic Regression 248
4.7 Artificial Neural Network (ANN) 249
4.7.1 Perceptron 250
4.7.2 Multi-layer Neural Network 254
4.7.3 Characteristics of ANN 261
4.8 Deep Learning 262
4.8.1 Using Synergistic Loss Functions 263
4.8.2 Using Responsive Activation Functions 266
4.8.3 Regularization 268
4.8.4 Initialization of Model Parameters 271
4.8.5 Characteristics of Deep Learning 275
4.9 Support Vector Machine (SVM) 276
4.9.1 Margin of a Separating Hyperplane 276
4.9.2 Linear SVM 278
4.9.3 Soft-margin SVM 284
4.9.4 Nonlinear SVM 290
4.9.5 Characteristics of SVM 294
4.10 Ensemble Methods 296
4.10.1 Rationale for Ensemble Method 297
4.10.2 Methods for Constructing an Ensemble Classifier 297
4.10.3 Bias-Variance Decomposition 300
4.10.4 Bagging 302
4.10.5 Boosting 305
4.10.6 Random Forests 310
4.10.7 Empirical Comparison among Ensemble Methods 312
4.11 Class Imbalance Problem 313
4.11.1 Building Classifiers with Class Imbalance 314
4.11.2 Evaluating Performance with Class Imbalance 318
4.11.3 Finding an Optimal Score Threshold 322
4.11.4 Aggregate Evaluation of Performance 323
4.12 Multiclass Problem 330
4.13 Bibliographic Notes 333
4.14 Exercises 345
5 Association Analysis: Basic Concepts and Algorithms 357
5.1 Preliminaries 358
5.2 Frequent Itemset Generation 362
5.2.1 The Apriori Principle 363
5.2.2 Frequent Itemset Generation in the Apriori Algorithm 364
5.2.3 Candidate Generation and Pruning 368
5.2.4 Support Counting 373
5.2.5 Computational Complexity 377
5.3 Rule Generation 380
5.3.1 Confidence-Based Pruning 380
5.3.2 Rule Generation in Apriori Algorithm 381
5.3.3 An Example: Congressional Voting Records 382
5.4 Compact Representation of Frequent Itemsets 384
5.4.1 Maximal Frequent Itemsets 384
5.4.2 Closed Itemsets 386
5.5 Alternative Methods for Generating Frequent Itemsets* 389
5.6 FP-Growth Algorithm* 393
5.6.1 FP-Tree Representation 394
5.6.2 Frequent Itemset Generation in FP-Growth Algorithm 397
5.7 Evaluation of Association Patterns 401
5.7.1 Objective Measures of Interestingness 402
5.7.2 Measures beyond Pairs of Binary Variables 414
5.7.3 Simpson's Paradox 416
5.8 Effect of Skewed Support Distribution 418
5.9 Bibliographic Notes 424
5.10 Exercises 438
6 Association Analysis: Advanced Concepts 451
6.1 Handling Categorical Attributes 451
6.2 Handling Continuous Attributes 454
6.2.1 Discretization-Based Methods 454
6.2.2 Statistics-Based Methods 458
6.2.3 Non-discretization Methods 460
6.3 Handling a Concept Hierarchy 462
6.4 Sequential Patterns 464
6.4.1 Preliminaries 465
6.4.2 Sequential Pattern Discovery 468
6.4.3 Timing Constraints 473
6.4.4 Alternative Counting Schemes 477
6.5 Subgraph Patterns 479
6.5.1 Preliminaries 480
6.5.2 Frequent Subgraph Mining 483
6.5.3 Candidate Generation 487
6.5.4 Candidate Pruning 493
6.5.5 Support Counting 493
6.6 Infrequent Patterns 493
6.6.1 Negative Patterns 494
6.6.2 Negatively Correlated Patterns 495
6.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns 496
6.6.4 Techniques for Mining Interesting Infrequent Patterns 498
6.6.5 Techniques Based on Mining Negative Patterns 499
6.6.6 Techniques Based on Support Expectation 501
6.7 Bibliographic Notes 505
6.8 Exercises 510
7 Cluster Analysis: Basic Concepts and Algorithms 525
7.1 Overview 528
7.1.1 What Is Cluster Analysis? 528
7.1.2 Different Types of Clusterings 529
7.1.3 Different Types of Clusters 531
7.2 K-means 534
7.2.1 The Basic K-means Algorithm 535
7.2.2 K-means: Additional Issues 544
7.2.3 Bisecting K-means 547
7.2.4 K-means and Different Types of Clusters 548
7.2.5 Strengths and Weaknesses 549
7.2.6 K-means as an Optimization Problem 549
7.3 Agglomerative Hierarchical Clustering 554
7.3.1 Basic Agglomerative Hierarchical Clustering Algorithm 555
7.3.2 Specific Techniques 557
7.3.3 The Lance-Williams Formula for Cluster Proximity 562
7.3.4 Key Issues in Hierarchical Clustering 563
7.3.5 Outliers 564
7.3.6 Strengths and Weaknesses 565
7.4 DBSCAN 565
7.4.1 Traditional Density: Center-Based Approach 565
7.4.2 The DBSCAN Algorithm 567
7.4.3 Strengths and Weaknesses 569
7.5 Cluster Evaluation 571
7.5.1 Overview 571
7.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation 574
7.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix 582
7.5.4 Unsupervised Evaluation of Hierarchical Clustering 585
7.5.5 Determining the Correct Number of Clusters 587
7.5.6 Clustering Tendency 588
7.5.7 Supervised Measures of Cluster Validity 589
7.5.8 Assessing the Significance of Cluster Validity Measures 594
7.5.9 Choosing a Cluster Validity Measure 596
7.6 Bibliographic Notes 597
7.7 Exercises 603
8 Cluster Analysis: Additional Issues and Algorithms 613
8.1 Characteristics of Data, Clusters, and Clustering Algorithms 614
8.1.1 Example: Comparing K-means and DBSCAN 614
8.1.2 Data Characteristics 615
8.1.3 Cluster Characteristics 617
8.1.4 General Characteristics of Clustering Algorithms 619
8.2 Prototype-Based Clustering 621
8.2.1 Fuzzy Clustering 621
8.2.2 Clustering Using Mixture Models 627
8.2.3 Self-Organizing Maps (SOM) 637
8.3 Density-Based Clustering 644
8.3.1 Grid-Based Clustering 644
8.3.2 Subspace Clustering 648
8.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering 652
8.4 Graph-Based Clustering 656
8.4.1 Sparsification 657
8.4.2 Minimum Spanning Tree (MST) Clustering 658
8.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS 659
8.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling 660
8.4.5 Spectral Clustering 666
8.4.6 Shared Nearest Neighbor Similarity 673
8.4.7 The Jarvis-Patrick Clustering Algorithm 676
8.4.8 SNN Density 678
8.4.9 SNN Density-Based Clustering 679
8.5 Scalable Clustering Algorithms 681
8.5.1 Scalability: General Issues and Approaches 681
8.5.2 BIRCH 684
8.5.3 CURE 686
8.6 Which Clustering Algorithm? 690
8.7 Bibliographic Notes 693
8.8 Exercises 699
9 Anomaly Detection 703
9.1 Characteristics of Anomaly Detection Problems 705
9.1.1 A Definition of an Anomaly 705
9.1.2 Nature of Data 706
9.1.3 How Anomaly Detection is Used 707
9.2 Characteristics of Anomaly Detection Methods 708
9.3 Statistical Approaches 710
9.3.1 Using Parametric Models 710
9.3.2 Using Non-parametric Models 714
9.3.3 Modeling Normal and Anomalous Classes 715
9.3.4 Assessing Statistical Significance 717
9.3.5 Strengths and Weaknesses 718
9.4 Proximity-based Approaches 719
9.4.1 Distance-based Anomaly Score 719
9.4.2 Density-based Anomaly Score 720
9.4.3 Relative Density-based Anomaly Score 722
9.4.4 Strengths and Weaknesses 723
9.5 Clustering-based Approaches 724
9.5.1 Finding Anomalous Clusters 724
9.5.2 Finding Anomalous Instances 725
9.5.3 Strengths and Weaknesses 728
9.6 Reconstruction-based Approaches 728
9.6.1 Strengths and Weaknesses 731
9.7 One-class Classification 732
9.7.1 Use of Kernels 733
9.7.2 The Origin Trick 734
9.7.3 Strengths and Weaknesses 738
9.8 Information Theoretic Approaches 738
9.8.1 Strengths and Weaknesses 740
9.9 Evaluation of Anomaly Detection 740
9.10 Bibliographic Notes 742
9.11 Exercises 749
10 Avoiding False Discoveries 755
10.1 Preliminaries: Statistical Testing 756
10.1.1 Significance Testing 756
10.1.2 Hypothesis Testing 761
10.1.3 Multiple Hypothesis Testing 767
10.1.4 Pitfalls in Statistical Testing 776
10.2 Modeling Null and Alternative Distributions 778
10.2.1 Generating Synthetic Data Sets 781
10.2.2 Randomizing Class Labels 782
10.2.3 Resampling Instances 782
10.2.4 Modeling the Distribution of the Test Statistic 783
10.3 Statistical Testing for Classification 783
10.3.1 Evaluating Classification Performance 783
10.3.2 Binary Classification as Multiple Hypothesis Testing 785
10.3.3 Multiple Hypothesis Testing in Model Selection 786
10.4 Statistical Testing for Association Analysis 787
10.4.1 Using Statistical Models 788
10.4.2 Using Randomization Methods 794
10.5 Statistical Testing for Cluster Analysis 795
10.5.1 Generating a Null Distribution for Internal Indices 796
10.5.2 Generating a Null Distribution for External Indices 798
10.5.3 Enrichment 798
10.6 Statistical Testing for Anomaly Detection 800
10.7 Bibliographic Notes 803
10.8 Exercises 808
Author Index 816
Subject Index 829
Copyright Permissions 839
1 Introduction
Rapid advances in data collection and storage technology, coupled with the ease with which data can be generated and disseminated, have triggered the explosive growth of data, leading to the current age of big data. Deriving actionable insights from these large data sets is increasingly important in decision making across almost all areas of society, including business and industry; science and engineering; medicine and biotechnology; and government and individuals. However, the amount of data (volume), its complexity (variety), and the rate at which it is being collected and processed (velocity) have simply become too great for humans to analyze unaided. Thus, there is a great need for automated tools for extracting useful information from the big data despite the challenges posed by its enormity and diversity.
Data mining blends traditional data analysis methods with sophisticated algorithms for processing this abundance of data. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some applications that require more advanced techniques for data analysis.
Business and Industry
Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) has allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data, such as web server logs from e-commerce websites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.
Data mining techniques can be used to support a wide range of business intelligence applications, such as customer profiling, targeted marketing, workflow management, store layout, fraud detection, and automated buying and selling. An example of the last application is high-speed stock trading, where decisions on buying and selling have to be made in less than a second using data about financial transactions. Data mining can also help retailers answer important business questions, such as “Who are the most profitable customers?” “What products can be cross-sold or up-sold?” and “What is the revenue outlook of the company for next year?” These questions have inspired the development of such data mining techniques as association analysis (Chapters 5 and 6).
As the Internet continues to revolutionize the way we interact and make decisions in our everyday lives, we are generating massive amounts of data about our online experiences, e.g., web browsing, messaging, and posting on social networking websites. This has opened several opportunities for business applications that use web data. For example, in the e-commerce sector, data about our online viewing or shopping preferences can be used to provide personalized recommendations of products. Data mining also plays a prominent role in supporting several other Internet-based services, such as filtering spam messages, answering search queries, and suggesting social updates and connections. The large corpus of text, images, and videos available on the Internet has enabled a number of advancements in data mining methods, including deep learning, which is discussed in Chapter 4. These developments have led to great advances in a number of applications, such as object recognition, natural language translation, and autonomous driving.
Another domain that has undergone a rapid big data transformation is the use of mobile sensors and devices, such as smartphones and wearable computing devices. With better sensor technologies, it has become possible to collect a variety of information about our physical world using low-cost sensors embedded on everyday objects that are connected to each other, termed the Internet of Things (IoT). This deep integration of physical sensors in digital systems is beginning to generate large amounts of diverse and distributed data about our environment, which can be used for designing convenient, safe, and energy-efficient home systems, as well as for urban planning of smart cities.
Medicine, Science, and Engineering
Researchers in medicine, science, and engineering are rapidly accumulating data that is key to significant new discoveries. For example, as an important step toward improving our understanding of the Earth's climate system, NASA has deployed a series of Earth-orbiting satellites that continuously generate global observations of the land surface, oceans, and atmosphere. However, because of the size and spatio-temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as the following: “What is the relationship of the frequency and intensity of ecosystem disturbances such as droughts and hurricanes to global warming?” “How are land surface precipitation and temperature affected by ocean surface temperature?” and “How well can we predict the beginning and end of the growing season for a region?”
As another example, researchers in molecular biology hope to use the large amounts of genomic data to better understand the structure and function of genes. In the past, traditional methods in molecular biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled scientists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene, and perhaps isolate the genes responsible for certain diseases. However, the noisy, high-dimensional nature of the data requires new data analysis methods. In addition to analyzing gene expression data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.
Another example is the use of data mining techniques to analyze electronic health record (EHR) data, which has become increasingly available. Not very long ago, studies of patients required manually examining the physical records of individual patients and extracting very specific pieces of information pertinent to the particular question being investigated. EHRs allow for a faster and broader exploration of such data. However, there are significant challenges since the observations on any one patient typically occur during their visits to a doctor or hospital and only a small number of details about the health of the patient are measured during any particular visit.
Currently, EHR analysis focuses on simple types of data, e.g., a patient's blood pressure or the diagnosis code of a disease. However, large amounts of more complex types of medical data are also being collected, such as electrocardiograms (ECGs) and neuroimages from magnetic resonance imaging (MRI) or functional Magnetic Resonance Imaging (fMRI). Although challenging to analyze, this data also provides vital information about patients. Integrating and analyzing such data, along with traditional EHR and genomic data, is one of the capabilities needed to enable precision medicine, which aims to provide more personalized patient care.
1.1 What Is Data Mining?
Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large data sets in order to find novel and useful patterns that might otherwise remain unknown. They also provide the capability to predict the outcome of a future observation, such as the amount a customer will spend at an online or a brick-and-mortar store.
Not all information discovery tasks are considered to be data mining. Examples include queries, e.g., looking up individual records in a database or finding web pages that contain a particular set of keywords. This is because such tasks can be accomplished through simple interactions with a database management system or an information retrieval system. These systems rely on traditional computer science techniques, which include sophisticated indexing structures and query processing algorithms, for efficiently organizing and retrieving information from large data repositories. Nonetheless, data mining techniques have been used to enhance the performance of such systems by improving the quality of the search results based on their relevance to the input queries.
Data Mining and Knowledge Discovery in Databases
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information, as shown in Figure 1.1. This process consists of a series of steps, from data preprocessing to postprocessing of data mining results.
Figure 1.1. The process of knowledge discovery in databases (KDD).
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.
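To make these preprocessing steps concrete, the following is a minimal sketch in Python using pandas; it is not from the book, and the data sources and column names are hypothetical. It fuses records from two sources, removes duplicate and incomplete observations, and keeps only the features relevant to a downstream task.

    # Hypothetical illustration of the preprocessing step of KDD (not from the book):
    # fuse two data sources, remove duplicates and incomplete records, select features.
    import pandas as pd

    store_sales = pd.DataFrame({
        "customer_id": [101, 102, 102, 103],
        "amount": [25.0, 40.0, 40.0, None],          # one duplicate row and one missing value
        "channel": ["store", "store", "store", "store"],
    })
    online_sales = pd.DataFrame({
        "customer_id": [101, 104],
        "amount": [18.5, 60.0],
        "channel": ["web", "web"],
    })

    clean = (
        pd.concat([store_sales, online_sales], ignore_index=True)  # fuse the sources
          .drop_duplicates()                                       # remove duplicate observations
          .dropna(subset=["amount"])                               # drop records with missing values
          .loc[:, ["customer_id", "amount", "channel"]]            # select relevant features
    )
    print(clean)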
“Closing the loop” is a phrase often used to refer to the process of integrating data mining results into decision support systems. For example, in business applications, the insights offered by data mining results can be integrated with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step to ensure that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization, which allows analysts to explore the data and the data mining results from a variety of viewpoints. Hypothesis testing methods can also be applied during postprocessing to eliminate spurious data mining results. (See Chapter 10.)
1.2 Motivating Challenges
As mentioned earlier, traditional data analysis techniques have often encountered practical difficulties in meeting the challenges posed by big data applications. The following are some of the specific challenges that motivated the development of data mining.
Scalability
Because of advances in data generation and collection, data sets with sizes of terabytes, petabytes, or even exabytes are becoming common. If data mining algorithms are to handle these massive data sets, they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory. Scalability can also be improved by using sampling or developing parallel and distributed algorithms. A general overview of techniques for scaling up data mining algorithms is given in Appendix F.
High Dimensionality
It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations. If the temperature measurements are taken repeatedly for an extended period, the number of dimensions (features) increases in proportion to the number of measurements taken. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data due to issues such as the curse of dimensionality (to be discussed in Chapter 2). Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.
Heterogeneous and Complex Data
Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional types of data include web and social media data containing text, hyperlinks, images, audio, and videos; DNA data with sequential and three-dimensional structure; and climate data that consists of measurements (temperature, pressure, etc.) at various times and locations on the Earth's surface. Techniques developed for mining such complex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML documents.
Data Ownership and Distribution
Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include the following: (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security and privacy issues.
Non-traditional Analysis
The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor-intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are typically not the result of a carefully designed experiment and often represent opportunistic samples of the data, rather than random samples.
1.3 The Origins of Data Mining
While data mining has traditionally been viewed as an intermediate process within the KDD framework, as shown in Figure 1.1, it has emerged over the years as an academic field within computer science, focusing on all aspects of KDD, including data preprocessing, mining, and postprocessing. Its origin can be traced back to the late 1980s, following a series of workshops organized on the topic of knowledge discovery in databases. The workshops brought together researchers from different disciplines to discuss the challenges and opportunities in applying computational techniques to extract actionable knowledge from large databases. The workshops quickly grew into hugely popular conferences that were attended by researchers and practitioners from both academia and industry. The success of these conferences, along with the interest shown by businesses and industry in recruiting new hires with a data mining background, have fueled the tremendous growth of this field.
The field was initially built upon the methodology and algorithms that researchers had previously used. In particular, data mining researchers draw upon ideas such as (1) sampling, estimation, and hypothesis testing from statistics and (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also been quick to adopt ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval, and to extend them to solve the challenges of mining big data.
A number of other areas also play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing. Techniques from high performance (parallel) computing are often important in addressing the massive size of some data sets. Distributed techniques can also help address the issue of size and are essential when the data cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.
Figure 1.2. Data mining as a confluence of many disciplines.
Data Science and Data-Driven Discovery
Data science is an interdisciplinary field that studies and applies tools and techniques for deriving useful insights from data. Although data science is regarded as an emerging field with a distinct identity of its own, the tools and techniques often come from many different areas of data analysis, such as data mining, statistics, AI, machine learning, pattern recognition, database technology, and distributed and parallel computing. (See Figure 1.2.)
The emergence of data science as a new field is a recognition that, often, none of the existing areas of data analysis provides a complete set of tools for the data analysis tasks that are often encountered in emerging applications. Instead, a broad range of computational, mathematical, and statistical skills is often required. To illustrate the challenges that arise in analyzing such data, consider the following example. Social media and the Web present new opportunities for social scientists to observe and quantitatively measure human behavior on a large scale. To conduct such a study, social scientists work with analysts who possess skills in areas such as web mining, natural language processing (NLP), network analysis, data mining, and statistics. Compared to more traditional research in social science, which is often based on surveys, this analysis requires a broader range of skills and tools, and involves far larger amounts of data. Thus, data science is, by necessity, a highly interdisciplinary field that builds on the continuing work of many fields.
The data-driven approach of data science emphasizes the direct discovery of patterns and relationships from data, especially in large quantities of data, often without the need for extensive domain knowledge. A notable example of the success of this approach is represented by advances in neural networks, i.e., deep learning, which have been particularly successful in areas which have long proved challenging, e.g., recognizing objects in photos or videos and words in speech, as well as in other application areas. However, note that this is just one example of the success of data-driven approaches, and dramatic improvements have also occurred in many other areas of data analysis. Many of these developments are topics described later in this book.
Some cautions on potential limitations of a purely data-driven approach are given in the Bibliographic Notes.
1.4 Data Mining Tasks
Data mining tasks are generally divided into two major categories:
Predictive tasks
The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.
Descriptive tasks
Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.
Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.
Figure 1.3. Four of the core data mining tasks.
Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. On the other hand, forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable. Predictive modeling can be used to identify customers who will respond to a marketing campaign, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.
Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting the species of a flower based on the characteristics of the flower. In particular, consider classifying an Iris flower as one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of various flowers of these three species. A data set with this type of information is the well-known Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set contains four other attributes: sepal width, sepal length, petal length, and petal width. Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively. Also, petal length is broken into the categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively. Based on these categories of petal width and length, the following rules can be derived:
Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.
While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying most of the flowers. Note that flowers from the Setosa species are well separated from the Versicolour and Virginica species with respect to petal width and length, but the latter two species overlap somewhat with respect to these attributes.
Figure 1.4. Petal width versus petal length for 150 Iris flowers.
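As a concrete illustration, the three rules above can be written down and checked in a few lines of Python. The sketch below is not from the book; it assumes scikit-learn is available and uses its bundled copy of the Iris data (which mirrors the UCI data set), then simply counts how many flowers the rules classify correctly and how many they leave unclassified.

    # A minimal sketch that encodes the three rules of Example 1.1 and checks
    # them against the 150 Iris flowers (assumes scikit-learn is installed).
    from sklearn.datasets import load_iris

    iris = load_iris()
    petal_length = iris.data[:, 2]                     # cm
    petal_width = iris.data[:, 3]                      # cm
    species = [iris.target_names[t] for t in iris.target]

    def category(value, low_cut, high_cut):
        """Map a measurement to the categories low, medium, or high."""
        if value < low_cut:
            return "low"
        if value < high_cut:
            return "medium"
        return "high"

    rules = {("low", "low"): "setosa",
             ("medium", "medium"): "versicolor",
             ("high", "high"): "virginica"}

    correct = unclassified = 0
    for pw, pl, truth in zip(petal_width, petal_length, species):
        prediction = rules.get((category(pw, 0.75, 1.75), category(pl, 2.5, 5.0)))
        if prediction is None:
            unclassified += 1                          # no rule fires for this flower
        elif prediction == truth:
            correct += 1

    print(f"correct: {correct}, unclassified: {unclassified}, total: {len(species)}")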
Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets. Because of the exponential size of its search space, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes that have related functionality, identifying web pages that are accessed together, or understanding the relationships between different elements of Earth's climate system.
Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are frequently bought together by customers. For example, we may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items.
Table 1.1. Market basket data.

Transaction ID   Items
1   {Bread, Butter, Diapers, Milk}
2   {Coffee, Sugar, Cookies, Salmon}
3   {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4   {Bread, Butter, Salmon, Chicken}
5   {Eggs, Bread, Butter}
6   {Salmon, Diapers, Milk}
7   {Bread, Tea, Sugar, Eggs}
8   {Coffee, Sugar, Chicken, Eggs}
9   {Bread, Diapers, Milk, Salt}
10  {Tea, Eggs, Cookies, Diapers, Milk}
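To see how such a rule can be evaluated, the short Python sketch below (not from the book) computes the support and confidence of {Diapers} → {Milk} directly from the ten transactions of Table 1.1; both measures are defined formally in Chapter 5.

    # Support and confidence of {Diapers} -> {Milk} from Table 1.1 (a plain-Python
    # sketch, not the book's algorithms; Chapter 5 covers Apriori and related methods).
    transactions = [
        {"Bread", "Butter", "Diapers", "Milk"},
        {"Coffee", "Sugar", "Cookies", "Salmon"},
        {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
        {"Bread", "Butter", "Salmon", "Chicken"},
        {"Eggs", "Bread", "Butter"},
        {"Salmon", "Diapers", "Milk"},
        {"Bread", "Tea", "Sugar", "Eggs"},
        {"Coffee", "Sugar", "Chicken", "Eggs"},
        {"Bread", "Diapers", "Milk", "Salt"},
        {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
    ]

    antecedent, consequent = {"Diapers"}, {"Milk"}
    n_antecedent = sum(antecedent <= t for t in transactions)            # transactions containing Diapers
    n_both = sum((antecedent | consequent) <= t for t in transactions)   # transactions containing both

    support = n_both / len(transactions)        # 5/10 = 0.5
    confidence = n_both / n_antecedent          # 5/5  = 1.0
    print(f"support = {support:.2f}, confidence = {confidence:.2f}")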
Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.
Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be grouped based on their respective topics. Each article is represented as a set of word-frequency pairs (w: c), where w is a word and c is the number of times the word appears in the article. There are two natural clusters in the data set. The first cluster consists of the first four articles, which correspond to news about the economy, while the second cluster contains the last four articles, which correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.
Table 1.2. Collection of news articles.

Article   Word-frequency pairs
1   dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2   machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3   job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4   domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5   patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6   pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7   death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8   medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
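The following sketch (not from the book; it assumes scikit-learn is installed) turns the eight articles of Table 1.2 into word-count vectors and groups them with K-means, the clustering algorithm introduced in Chapter 7. With two clusters, the economy and health care articles are typically recovered, although the numeric cluster labels themselves are arbitrary.

    # Clustering the articles of Table 1.2: a minimal sketch using word-count
    # vectors and K-means (assumes scikit-learn; the result may vary slightly).
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.cluster import KMeans

    articles = [
        {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2},
        {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
        {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2, "index": 3},
        {"domestic": 3, "forecast": 2, "gain": 1, "market": 2, "sale": 3, "price": 2},
        {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
        {"pharmaceutical": 2, "company": 3, "drug": 2, "vaccine": 1, "flu": 3},
        {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 3, "director": 2},
        {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 1},
    ]

    X = DictVectorizer(sparse=False).fit_transform(articles)   # articles x vocabulary matrix
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)   # e.g., [0 0 0 0 1 1 1 1] if the two topics are recovered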
Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances, such as droughts, floods, fires, hurricanes, etc.
Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
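A minimal Python sketch of this idea is shown below; the transaction amounts are made up, and the simple z-score profile merely stands in for the much richer statistical, proximity-based, and other approaches described in Chapter 9.

    # Hypothetical illustration of Example 1.4: profile a user's legitimate
    # transaction amounts and flag a new amount that deviates strongly from it.
    import statistics

    legitimate_amounts = [25.0, 40.0, 32.5, 18.0, 55.0, 29.0, 41.5, 23.0]   # made-up history
    mean = statistics.mean(legitimate_amounts)
    std = statistics.stdev(legitimate_amounts)

    def is_suspicious(amount, threshold=3.0):
        """Flag a transaction more than `threshold` standard deviations from the mean."""
        return abs(amount - mean) / std > threshold

    print(is_suspicious(35.0))    # False: consistent with the user's profile
    print(is_suspicious(950.0))   # True: far outside the profile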
1.5 Scope and Organization of the Book
This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.
We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis. Chapters 3 and 4 cover classification. Chapter 3 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, underfitting, model selection, and performance evaluation. Using this foundation, Chapter 4 describes a number of other important classification techniques: rule-based systems, nearest neighbor classifiers, Bayesian classifiers, artificial neural networks, including deep learning, support vector machines, and ensemble classifiers, which are collections of classifiers. The multiclass and imbalanced class problems are also discussed. These topics can be covered independently.
Association analysis is explored in Chapters 5 and 6. Chapter 5 describes the basics of association analysis: frequent itemsets, association rules, and some of the algorithms used to generate them. Specific types of frequent itemsets—maximal, closed, and hyperclique—that are important for data mining are also discussed, and the chapter concludes with a discussion of evaluation measures for association analysis. Chapter 6 considers a variety of more advanced topics, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., store items → clothing → shoes → sneakers.) This chapter also describes how association analysis can be extended to find sequential patterns (patterns involving order), patterns in graphs, and negative relationships (if one item is present, then the other is not).
Cluster analysis is discussed in Chapters 7 and 8. Chapter 7 first describes the different types of clusters, and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is followed by a discussion of techniques for validating the results of a clustering algorithm. Additional clustering concepts and techniques are explored in Chapter 8, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM), graph-based clustering, spectral clustering, and density-based clustering. There is also a discussion of scalability issues and factors to consider when selecting a clustering algorithm.
Chapter 9 is on anomaly detection. After some basic definitions, several different types of anomaly detection are considered: statistical, distance-based, density-based, clustering-based, reconstruction-based, one-class classification, and information theoretic. The last chapter, Chapter 10, supplements the discussions in the other chapters with a discussion of the statistical concepts important for avoiding spurious results, and then discusses those concepts in the context of data mining techniques studied in the previous chapters. These techniques include statistical hypothesis testing, p-values, the false discovery rate, and permutation testing. Appendices A through F give a brief review of important topics that are used in portions of the book: linear algebra, dimensionality reduction, statistics, regression, optimization, and scaling up data mining techniques for big data.
The subject of data mining, while relatively young compared to statistics or machine learning, is already too large to cover in a single book. Selected references to topics that are only briefly covered, such as data quality, are provided in the Bibliographic Notes section of the appropriate chapter. References to topics not covered in this book, such as mining streaming data and privacy-preserving data mining, are provided in the Bibliographic Notes of this chapter.
1.6 Bibliographic Notes
The topic of data mining has inspired many textbooks. Introductory textbooks include those by Dunham [16], Han et al. [29], Hand et al. [31], Roiger and Geatz [50], Zaki and Meira [61], and Aggarwal [2]. Data mining books with a stronger emphasis on business applications include the works by Berry and Linoff [5], Pyle [47], and Parr Rud [45]. Books with an emphasis on statistical learning include those by Cherkassky and Mulier [11], and Hastie et al. [32]. Similar books with an emphasis on machine learning or pattern recognition are those by Duda et al. [15], Kantardzic [34], Mitchell [43], Webb [57], and Witten and Frank [58]. There are also some more specialized books: Chakrabarti [9] (web mining), Fayyad et al. [20] (collection of early articles on data mining), Fayyad et al. [18] (visualization), Grossman et al. [25] (science and engineering), Kargupta and Chan [35] (distributed data mining), Wang et al. [56] (bioinformatics), and Zaki and Ho [60] (parallel data mining).
There are several conferences related to data mining. Some of the main conferences dedicated to this field include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data mining papers can also be found in other major conferences such as the Conference and Workshop on Neural Information Processing Systems (NIPS), the International Conference on Machine Learning (ICML), the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the National Conference on Artificial Intelligence (AAAI), the IEEE International Conference on Big Data, and the IEEE International Conference on Data Science and Advanced Analytics (DSAA).
Journal publications on data mining include IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, ACM Transactions on Knowledge Discovery from Data, Statistical Analysis and Data Mining, and Information Systems. A variety of open-source data mining software is available, including Weka [27] and Scikit-learn [46]. More recently, data mining software such as Apache Mahout and Apache Spark has been developed for large-scale problems on distributed computing platforms.
There have been a number of general articles on data mining that define the field or its relationship to other fields, particularly statistics. Fayyad et al. [19] describe data mining and how it fits into the total knowledge discovery process. Chen et al. [10] give a database perspective on data mining. Ramakrishnan and Grama [48] provide a general discussion of data mining and present several viewpoints. Hand [30] describes how data mining differs from statistics, as does Friedman [21]. Lambert [40] explores the use of statistics for large data sets and provides some comments on the respective roles of data mining and statistics. Glymour et al. [23] consider the lessons that statistics may have for data mining. Smyth et al. [53] describe how the evolution of data mining is being driven by new types of data and applications, such as those involving streams, graphs, and text. Han et al. [28] consider emerging applications in data mining and Smyth [52] describes some research challenges in data mining. Wu et al. [59] discuss how developments in data mining research can be turned into practical tools. Data mining standards are the subject of a paper by Grossman et al. [24]. Bradley [7] discusses how data mining algorithms can be scaled to large data sets.
The emergence of new data mining applications has produced new challenges that need to be addressed. For instance, concerns about privacy breaches as a result of data mining have escalated in recent years, particularly in application domains such as web commerce and health care. As a result, there is growing interest in developing data mining algorithms that maintain user privacy. Developing techniques for mining encrypted or randomized data is known as privacy-preserving data mining. Some general references in this area include papers by Agrawal and Srikant [3], Clifton et al. [12] and Kargupta et al. [36]. Vassilios et al. [55] provide a survey. Another area of concern is the bias in predictive models that may be used for some applications, e.g., screening job applicants or deciding prison parole [39]. Assessing whether such applications are producing biased results is made more difficult by the fact that the predictive models used for such applications are often black box models, i.e., models that are not interpretable in any straightforward way.
Data science, its constituent fields, and more generally, the new paradigm of knowledge discovery they represent [33], have great potential, some of which has been realized. However, it is important to emphasize that data science works mostly with observational data, i.e., data that was collected by various organizations as part of their normal operation. The consequence of this is that sampling biases are common and the determination of causal factors becomes more problematic. For this and a number of other reasons, it is often hard to interpret the predictive models built from this data [42, 49]. Thus, theory, experimentation and computational simulations will continue to be the methods of choice in many areas, especially those related to science.
More importantly, a purely data-driven approach often ignores the existing knowledge in a particular field. Such models may perform poorly, for example, predicting impossible outcomes or failing to generalize to new situations. However, if the model does work well, e.g., has high predictive accuracy, then this approach may be sufficient for practical purposes in some fields. But in many areas, such as medicine and science, gaining insight into the underlying domain is often the goal. Some recent work attempts to address these issues in order to create theory-guided data science, which takes pre-existing domain knowledge into account [17, 37].
Recent years have witnessed a growing number of applications that rapidly generate continuous streams of data. Examples of stream data include network traffic, multimedia streams, and stock prices. Several issues must be considered when mining data streams, such as the limited amount of memory available, the need for online analysis, and the change of the data over time. Data mining for stream data has become an important area in data mining. Some selected publications are Domingos and Hulten [14] (classification), Giannella et al. [22] (association analysis), Guha et al. [26] (clustering), Kifer et al. [38] (change detection), Papadimitriou et al. [44] (time series), and Law et al. [41] (dimensionality reduction).
Another area of interest is recommender and collaborative filtering systems [1, 6, 8, 13, 54], which suggest movies, television shows, books, products, etc. that a person might like. In many cases, this problem, or at least a component of it, is treated as a prediction problem and thus, data mining techniques can be applied [4, 51].
Bibliography
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[2]C.Aggarwal.Datamining:TheTextbook.Springer,2009.
[3]R.AgrawalandR.Srikant.Privacy-preservingdatamining.InProc.of2000ACMSIGMODIntl.Conf.onManagementofData,pages439–450,Dallas,Texas,2000.ACMPress.
[4]X.AmatriainandJ.M.Pujol.Dataminingmethodsforrecommendersystems.InRecommenderSystemsHandbook,pages227–262.Springer,2015.
[5]M.J.A.BerryandG.Linoff.DataMiningTechniques:ForMarketing,Sales,andCustomerRelationshipManagement.WileyComputerPublishing,2ndedition,2004.
[6]J.Bobadilla,F.Ortega,A.Hernando,andA.Gutiérrez.Recommendersystemssurvey.Knowledge-basedsystems,46:109–132,2013.
[7]P.S.Bradley,J.Gehrke,R.Ramakrishnan,andR.Srikant.Scalingminingalgorithmstolargedatabases.CommunicationsoftheACM,45(8):38–43,2002.
[8]R.Burke.Hybridrecommendersystems:Surveyandexperiments.Usermodelinganduser-adaptedinteraction,12(4):331–370,2002.
[9]S.Chakrabarti.MiningtheWeb:DiscoveringKnowledgefromHypertextData.MorganKaufmann,SanFrancisco,CA,2003.
[10]M.-S.Chen,J.Han,andP.S.Yu.DataMining:AnOverviewfromaDatabasePerspective.IEEETransactionsonKnowledgeandDataEngineering,8(6):866–883,1996.
[11]V.CherkasskyandF.Mulier.LearningfromData:Concepts,Theory,andMethods.Wiley-IEEEPress,2ndedition,1998.
[12]C.Clifton,M.Kantarcioglu,andJ.Vaidya.Definingprivacyfordatamining.InNationalScienceFoundationWorkshoponNextGenerationDataMining,pages126–133,Baltimore,MD,November2002.
[13]C.DesrosiersandG.Karypis.Acomprehensivesurveyofneighborhood-basedrecommendationmethods.Recommendersystemshandbook,pages107–144,2011.
[14]P.DomingosandG.Hulten.Mininghigh-speeddatastreams.InProc.ofthe6thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages71–80,
Boston,Massachusetts,2000.ACMPress.
[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, 2nd edition, 2001.
[16]M.H.Dunham.DataMining:IntroductoryandAdvancedTopics.PrenticeHall,2006.
[17]J.H.Faghmous,A.Banerjee,S.Shekhar,M.Steinbach,V.Kumar,A.R.Ganguly,andN.Samatova.Theory-guideddatascienceforclimatechange.Computer,47(11):74–78,2014.
[18]U.M.Fayyad,G.G.Grinstein,andA.Wierse,editors.InformationVisualizationinDataMiningandKnowledgeDiscovery.MorganKaufmannPublishers,SanFrancisco,CA,September2001.
[19]U.M.Fayyad,G.Piatetsky-Shapiro,andP.Smyth.FromDataMiningtoKnowledgeDiscovery:AnOverview.InAdvancesinKnowledgeDiscoveryandDataMining,pages1–34.AAAIPress,1996.
[20]U.M.Fayyad,G.Piatetsky-Shapiro,P.Smyth,andR.Uthurusamy,editors.AdvancesinKnowledgeDiscoveryandDataMining.AAAI/MITPress,1996.
[21]J.H.Friedman.DataMiningandStatistics:What’stheConnection?Unpublished.www-stat.stanford.edu/~jhf/ftp/dm-stat.ps,1997.
[22]C.Giannella,J.Han,J.Pei,X.Yan,andP.S.Yu.MiningFrequentPatternsinDataStreamsatMultipleTimeGranularities.InH.Kargupta,A.Joshi,K.Sivakumar,andY.Yesha,editors,NextGenerationDataMining,pages191–212.AAAI/MIT,2003.
[23]C.Glymour,D.Madigan,D.Pregibon,andP.Smyth.StatisticalThemesandLessonsforDataMining.DataMiningandKnowledgeDiscovery,1(1):11–28,1997.
[24]R.L.Grossman,M.F.Hornick,andG.Meyer.Dataminingstandardsinitiatives.CommunicationsoftheACM,45(8):59–61,2002.
[25]R.L.Grossman,C.Kamath,P.Kegelmeyer,V.Kumar,andR.Namburu,editors.DataMiningforScientificandEngineeringApplications.KluwerAcademicPublishers,2001.
[26]S.Guha,A.Meyerson,N.Mishra,R.Motwani,andL.O’Callaghan.ClusteringDataStreams:TheoryandPractice.IEEETransactionsonKnowledgeandDataEngineering,15(3):515–528,May/June2003.
[27]M.Hall,E.Frank,G.Holmes,B.Pfahringer,P.Reutemann,andI.H.Witten.TheWEKADataMiningSoftware:AnUpdate.SIGKDDExplorations,11(1),2009.
[28]J.Han,R.B.Altman,V.Kumar,H.Mannila,andD.Pregibon.Emergingscientificapplicationsindatamining.CommunicationsoftheACM,45(8):54–58,2002.
[29]J.Han,M.Kamber,andJ.Pei.DataMining:ConceptsandTechniques.MorganKaufmannPublishers,SanFrancisco,3rdedition,2011.
[30]D.J.Hand.DataMining:StatisticsandMore?TheAmericanStatistician,52(2):112–118,1998.
[31]D.J.Hand,H.Mannila,andP.Smyth.PrinciplesofDataMining.MITPress,2001.
[32]T.Hastie,R.Tibshirani,andJ.H.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,Prediction.Springer,2ndedition,2009.
[33]T.Hey,S.Tansley,K.M.Tolle,etal.Thefourthparadigm:data-intensivescientificdiscovery,volume1.MicrosoftresearchRedmond,WA,2009.
[34]M.Kantardzic.DataMining:Concepts,Models,Methods,andAlgorithms.Wiley-IEEEPress,Piscataway,NJ,2003.
[35]H.KarguptaandP.K.Chan,editors.AdvancesinDistributedandParallelKnowledgeDiscovery.AAAIPress,September2002.
[36]H.Kargupta,S.Datta,Q.Wang,andK.Sivakumar.OnthePrivacyPreservingPropertiesofRandomDataPerturbationTechniques.InProc.ofthe2003IEEEIntl.Conf.onDataMining,pages99–106,Melbourne,Florida,December2003.IEEEComputerSociety.
[37]A.Karpatne,G.Atluri,J.Faghmous,M.Steinbach,A.Banerjee,A.Ganguly,S.Shekhar,N.Samatova,andV.Kumar.Theory-guidedDataScience:ANewParadigmforScientificDiscoveryfromData.IEEETransactionsonKnowledgeandDataEngineering,2017.
[38]D.Kifer,S.Ben-David,andJ.Gehrke.DetectingChangeinDataStreams.InProc.ofthe30thVLDBConf.,pages180–191,Toronto,Canada,2004.MorganKaufmann.
[39]J.Kleinberg,J.Ludwig,andS.Mullainathan.AGuidetoSolvingSocialProblemswithMachineLearning.HarvardBusinessReview,December2016.
[40]D.Lambert.WhatUseisStatisticsforMassiveData?InACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,pages54–62,2000.
[41]M.H.C.Law,N.Zhang,andA.K.Jain.NonlinearManifoldLearningforDataStreams.InProc.oftheSIAMIntl.Conf.onDataMining,LakeBuenaVista,Florida,April2004.SIAM.
[42]Z.C.Lipton.Themythosofmodelinterpretability.arXivpreprintarXiv:1606.03490,2016.
[43]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.
[44]S.Papadimitriou,A.Brockwell,andC.Faloutsos.Adaptive,unsupervisedstreammining.VLDBJournal,13(3):222–239,2004.
[45] O. Parr Rud. Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer Relationship Management. John Wiley & Sons, New York, NY, 2001.
[46]F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel,P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau,M.Brucher,M.Perrot,andE.Duchesnay.Scikit-learn:MachineLearninginPython.JournalofMachineLearningResearch,12:2825–2830,2011.
[47]D.Pyle.BusinessModelingandDataMining.MorganKaufmann,SanFrancisco,CA,2003.
[48]N.RamakrishnanandA.Grama.DataMining:FromSerendipitytoScience—GuestEditors’Introduction.IEEEComputer,32(8):34–37,1999.
[49]M.T.Ribeiro,S.Singh,andC.Guestrin.Whyshoulditrustyou?:Explainingthepredictionsofanyclassifier.InProceedingsofthe22ndACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages1135–1144.ACM,2016.
[50]R.RoigerandM.Geatz.DataMining:ATutorialBasedPrimer.Addison-Wesley,2002.
[51]J.Schafer.TheApplicationofData-MiningtoRecommenderSystems.Encyclopediaofdatawarehousingandmining,1:44–48,2009.
[52]P.Smyth.BreakingoutoftheBlack-Box:ResearchChallengesinDataMining.InProc.ofthe2001ACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,2001.
[53]P.Smyth,D.Pregibon,andC.Faloutsos.Data-drivenevolutionofdataminingalgorithms.CommunicationsoftheACM,45(8):33–37,2002.
[54]X.SuandT.M.Khoshgoftaar.Asurveyofcollaborativefilteringtechniques.Advancesinartificialintelligence,2009:4,2009.
[55]V.S.Verykios,E.Bertino,I.N.Fovino,L.P.Provenza,Y.Saygin,andY.Theodoridis.State-of-the-artinprivacypreservingdatamining.SIGMODRecord,33(1):50–57,2004.
[56]J.T.L.Wang,M.J.Zaki,H.Toivonen,andD.E.Shasha,editors.DataMininginBioinformatics.Springer,September2004.
[57] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd edition, 2002.
[58]I.H.WittenandE.Frank.DataMining:PracticalMachineLearningToolsandTechniques.MorganKaufmann,3rdedition,2011.
[59]X.Wu,P.S.Yu,andG.Piatetsky-Shapiro.DataMining:HowResearchMeetsPracticalDevelopment?KnowledgeandInformationSystems,5(2):248–261,2003.
[60]M.J.ZakiandC.-T.Ho,editors.Large-ScaleParallelDataMining.Springer,September2002.
[61]M.J.ZakiandW.MeiraJr.DataMiningandAnalysis:FundamentalConceptsandAlgorithms.CambridgeUniversityPress,NewYork,2014.
1.7 Exercises
1. Discuss whether or not each of the following activities is a data mining task.
a. Dividingthecustomersofacompanyaccordingtotheirgender.
b. Dividingthecustomersofacompanyaccordingtotheirprofitability.
c. Computingthetotalsalesofacompany.
d. Sortingastudentdatabasebasedonstudentidentificationnumbers.
e. Predictingtheoutcomesoftossinga(fair)pairofdice.
f. Predictingthefuturestockpriceofacompanyusinghistoricalrecords.
g. Monitoringtheheartrateofapatientforabnormalities.
h. Monitoringseismicwavesforearthquakeactivities.
i. Extractingthefrequenciesofasoundwave.
2.SupposethatyouareemployedasadataminingconsultantforanInternetsearchenginecompany.Describehowdataminingcanhelpthecompanybygivingspecificexamplesofhowtechniques,suchasclustering,classification,associationrulemining,andanomalydetectioncanbeapplied.
3.Foreachofthefollowingdatasets,explainwhetherornotdataprivacyisanimportantissue.
a. Censusdatacollectedfrom1900–1950.
b. IPaddressesandvisittimesofwebuserswhovisityourwebsite.
c. ImagesfromEarth-orbitingsatellites.
d. Namesandaddressesofpeoplefromthetelephonebook.
e. NamesandemailaddressescollectedfromtheWeb.
2Data
Thischapterdiscussesseveraldata-relatedissuesthatareimportantforsuccessfuldatamining:
TheTypeofDataDatasetsdifferinanumberofways.Forexample,theattributesusedtodescribedataobjectscanbeofdifferenttypes—quantitativeorqualitative—anddatasetsoftenhavespecialcharacteristics;e.g.,somedatasetscontaintimeseriesorobjectswithexplicitrelationshipstooneanother.Notsurprisingly,thetypeofdatadetermineswhichtoolsandtechniquescanbeusedtoanalyzethedata.Indeed,newresearchindataminingisoftendrivenbytheneedtoaccommodatenewapplicationareasandtheirnewtypesofdata.
TheQualityoftheDataDataisoftenfarfromperfect.Whilemostdataminingtechniquescantoleratesomelevelofimperfectioninthedata,afocusonunderstandingandimprovingdataqualitytypicallyimprovesthequalityoftheresultinganalysis.Dataqualityissuesthatoftenneedtobeaddressedincludethepresenceofnoiseandoutliers;missing,inconsistent,orduplicatedata;anddatathatisbiasedor,insomeotherway,unrepresentativeofthephenomenonorpopulationthatthedataissupposedtodescribe.
PreprocessingStepstoMaketheDataMoreSuitableforDataMiningOften,therawdatamustbeprocessedinordertomakeitsuitablefor
analysis.Whileoneobjectivemaybetoimprovedataquality,othergoalsfocusonmodifyingthedatasothatitbetterfitsaspecifieddataminingtechniqueortool.Forexample,acontinuousattribute,e.g.,length,sometimesneedstobetransformedintoanattributewithdiscretecategories,e.g.,short,medium,orlong,inordertoapplyaparticulartechnique.Asanotherexample,thenumberofattributesinadatasetisoftenreducedbecausemanytechniquesaremoreeffectivewhenthedatahasarelativelysmallnumberofattributes.
AnalyzingDatainTermsofItsRelationshipsOneapproachtodataanalysisistofindrelationshipsamongthedataobjectsandthenperformtheremaininganalysisusingtheserelationshipsratherthanthedataobjectsthemselves.Forinstance,wecancomputethesimilarityordistancebetweenpairsofobjectsandthenperformtheanalysis—clustering,classification,oranomalydetection—basedonthesesimilaritiesordistances.Therearemanysuchsimilarityordistancemeasures,andtheproperchoicedependsonthetypeofdataandtheparticularapplication.
Example2.1(AnIllustrationofData-RelatedIssues).Tofurtherillustratetheimportanceoftheseissues,considerthefollowinghypotheticalsituation.Youreceiveanemailfromamedicalresearcherconcerningaprojectthatyouareeagertoworkon.
Hi,
I’veattachedthedatafilethatImentionedinmypreviousemail.Eachlinecontainsthe
informationforasinglepatientandconsistsoffivefields.Wewanttopredictthelastfieldusing
theotherfields.Idon’thavetimetoprovideanymoreinformationaboutthedatasinceI’mgoing
outoftownforacoupleofdays,buthopefullythatwon’tslowyoudowntoomuch.Andifyou
don’tmind,couldwemeetwhenIgetbacktodiscussyourpreliminaryresults?Imightinvitea
fewothermembersofmyteam.
Thanksandseeyouinacoupleofdays.
Despitesomemisgivings,youproceedtoanalyzethedata.Thefirstfewrowsofthefileareasfollows:
012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6
⋮
Abrieflookatthedatarevealsnothingstrange.Youputyourdoubtsasideandstarttheanalysis.Thereareonly1000lines,asmallerdatafilethanyouhadhopedfor,buttwodayslater,youfeelthatyouhavemadesomeprogress.Youarriveforthemeeting,andwhilewaitingforotherstoarrive,youstrikeupaconversationwithastatisticianwhoisworkingontheproject.Whenshelearnsthatyouhavealsobeenanalyzingthedatafromtheproject,sheasksifyouwouldmindgivingherabriefoverviewofyourresults.
Statistician:So,yougotthedataforallthepatients?
DataMiner:Yes.Ihaven’thadmuchtimeforanalysis,butIdohaveafewinterestingresults.
Statistician:Amazing.ThereweresomanydataissueswiththissetofpatientsthatIcouldn’tdomuch.
DataMiner:Oh?Ididn’thearaboutanypossibleproblems.
Statistician:Well,firstthereisfield5,thevariablewewanttopredict.
It’scommonknowledgeamongpeoplewhoanalyzethistypeofdatathatresultsarebetterifyouworkwiththelogofthevalues,butIdidn’tdiscoverthisuntillater.Wasitmentionedtoyou?
DataMiner:No.
Statistician:Butsurelyyouheardaboutwhathappenedtofield4?It’ssupposedtobemeasuredonascalefrom1to10,with0indicatingamissingvalue,butbecauseofadataentryerror,all10’swerechangedinto0’s.Unfortunately,sincesomeofthepatientshavemissingvaluesforthisfield,it’simpossibletosaywhethera0inthisfieldisareal0ora10.Quiteafewoftherecordshavethatproblem.
DataMiner:Interesting.Werethereanyotherproblems?
Statistician:Yes,fields2and3arebasicallythesame,butIassumethatyouprobablynoticedthat.
DataMiner:Yes,butthesefieldswereonlyweakpredictorsoffield5.
Statistician:Anyway,givenallthoseproblems,I’msurprisedyouwereabletoaccomplishanything.
DataMiner:True,butmyresultsarereallyquitegood.Field1isaverystrongpredictoroffield5.I’msurprisedthatthiswasn’tnoticedbefore.
Statistician:What?Field1isjustanidentificationnumber.
DataMiner:Nonetheless,myresultsspeakforthemselves.
Statistician:Oh,no!Ijustremembered.WeassignedIDnumbersafterwesortedtherecordsbasedonfield5.Thereisastrongconnection,butit’smeaningless.Sorry.
Althoughthisscenariorepresentsanextremesituation,itemphasizestheimportanceof“knowingyourdata.”Tothatend,thischapterwilladdresseach
ofthefourissuesmentionedabove,outliningsomeofthebasicchallengesandstandardapproaches.
2.1TypesofDataAdatasetcanoftenbeviewedasacollectionofdataobjects.Othernamesforadataobjectarerecord,point,vector,pattern,event,case,sample,instance,observation,orentity.Inturn,dataobjectsaredescribedbyanumberofattributesthatcapturethecharacteristicsofanobject,suchasthemassofaphysicalobjectorthetimeatwhichaneventoccurred.Othernamesforanattributearevariable,characteristic,field,feature,ordimension.
Example2.2(StudentInformation).Often,adatasetisafile,inwhichtheobjectsarerecords(orrows)inthefileandeachfield(orcolumn)correspondstoanattribute.Forexample,Table2.1 showsadatasetthatconsistsofstudentinformation.Eachrowcorrespondstoastudentandeachcolumnisanattributethatdescribessomeaspectofastudent,suchasgradepointaverage(GPA)oridentificationnumber(ID).
Table2.1.Asampledatasetcontainingstudentinformation.
StudentID Year GradePointAverage(GPA) …
⋮
1034262 Senior 3.24 …
1052663 Freshman 3.51 …
1082246 Sophomore 3.62 …
Althoughrecord-baseddatasetsarecommon,eitherinflatfilesorrelationaldatabasesystems,thereareotherimportanttypesofdatasetsandsystemsforstoringdata.InSection2.1.2 ,wewilldiscusssomeofthetypesofdatasetsthatarecommonlyencounteredindatamining.However,wefirstconsiderattributes.
2.1.1AttributesandMeasurement
Inthissection,weconsiderthetypesofattributesusedtodescribedataobjects.Wefirstdefineanattribute,thenconsiderwhatwemeanbythetypeofanattribute,andfinallydescribethetypesofattributesthatarecommonlyencountered.
WhatIsanAttribute?Westartwithamoredetaileddefinitionofanattribute.
Definition2.1.Anattributeisapropertyorcharacteristicofanobjectthatcanvary,eitherfromoneobjecttoanotherorfromonetimetoanother.
Forexample,eyecolorvariesfrompersontoperson,whilethetemperatureofanobjectvariesovertime.Notethateyecolorisasymbolicattributewitha
smallnumberofpossiblevalues{brown,black,blue,green,hazel,etc.},whiletemperatureisanumericalattributewithapotentiallyunlimitednumberofvalues.
Atthemostbasiclevel,attributesarenotaboutnumbersorsymbols.However,todiscussandmorepreciselyanalyzethecharacteristicsofobjects,weassignnumbersorsymbolstothem.Todothisinawell-definedway,weneedameasurementscale.
Definition2.2.Ameasurementscaleisarule(function)thatassociatesanumericalorsymbolicvaluewithanattributeofanobject.
Formally,theprocessofmeasurementistheapplicationofameasurementscaletoassociateavaluewithaparticularattributeofaspecificobject.Whilethismayseemabitabstract,weengageintheprocessofmeasurementallthetime.Forinstance,westeponabathroomscaletodetermineourweight,weclassifysomeoneasmaleorfemale,orwecountthenumberofchairsinaroomtoseeiftherewillbeenoughtoseatallthepeoplecomingtoameeting.Inallthesecases,the“physicalvalue”ofanattributeofanobjectismappedtoanumericalorsymbolicvalue.
Withthisbackground,wecandiscussthetypeofanattribute,aconceptthatisimportantindeterminingifaparticulardataanalysistechniqueisconsistentwithaspecifictypeofattribute.
TheTypeofanAttributeItiscommontorefertothetypeofanattributeasthetypeofameasurementscale.Itshouldbeapparentfromthepreviousdiscussionthatanattributecanbedescribedusingdifferentmeasurementscalesandthatthepropertiesofanattributeneednotbethesameasthepropertiesofthevaluesusedtomeasureit.Inotherwords,thevaluesusedtorepresentanattributecanhavepropertiesthatarenotpropertiesoftheattributeitself,andviceversa.Thisisillustratedwithtwoexamples.
Example2.3(EmployeeAgeandIDNumber).TwoattributesthatmightbeassociatedwithanemployeeareIDandage(inyears).Bothoftheseattributescanberepresentedasintegers.However,whileitisreasonabletotalkabouttheaverageageofanemployee,itmakesnosensetotalkabouttheaverageemployeeID.Indeed,theonlyaspectofemployeesthatwewanttocapturewiththeIDattributeisthattheyaredistinct.Consequently,theonlyvalidoperationforemployeeIDsistotestwhethertheyareequal.Thereisnohintofthislimitation,however,whenintegersareusedtorepresenttheemployeeIDattribute.Fortheageattribute,thepropertiesoftheintegersusedtorepresentageareverymuchthepropertiesoftheattribute.Evenso,thecorrespondenceisnotcompletebecause,forexample,ageshaveamaximum,whileintegersdonot.
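As a small illustration (the ID and age values below are hypothetical), the following Python fragment shows that a program will happily average employee IDs even though only equality tests are meaningful for that attribute; keeping the attribute type in mind is the analyst's responsibility.

import numpy as np

employee_id = np.array([1034262, 1052663, 1082246])   # nominal, despite being integers
age = np.array([25, 42, 37])                          # ratio

print(age.mean())                        # meaningful: the average age
print(employee_id.mean())                # computes, but the result is meaningless
print(employee_id[0] == employee_id[1])  # equality is the only valid comparison for IDs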
Example2.4(LengthofLineSegments).ConsiderFigure2.1 ,whichshowssomeobjects—linesegments—andhowthelengthattributeoftheseobjectscanbemappedtonumbersintwodifferentways.Eachsuccessivelinesegment,goingfromthetoptothebottom,isformedbyappendingthetopmostlinesegmenttoitself.Thus,
thesecondlinesegmentfromthetopisformedbyappendingthetopmostlinesegmenttoitselftwice,thethirdlinesegmentfromthetopisformedbyappendingthetopmostlinesegmenttoitselfthreetimes,andsoforth.Inaveryreal(physical)sense,allthelinesegmentsaremultiplesofthefirst.Thisfactiscapturedbythemeasurementsontherightsideofthefigure,butnotbythoseontheleftside.Morespecifically,themeasurementscaleontheleftsidecapturesonlytheorderingofthelengthattribute,whilethescaleontherightsidecapturesboththeorderingandadditivityproperties.Thus,anattributecanbemeasuredinawaythatdoesnotcaptureallthepropertiesoftheattribute.
Figure2.1.Themeasurementofthelengthoflinesegmentsontwodifferentscalesofmeasurement.
Knowingthetypeofanattributeisimportantbecauseittellsuswhichpropertiesofthemeasuredvaluesareconsistentwiththeunderlying
propertiesoftheattribute,andtherefore,itallowsustoavoidfoolishactions,suchascomputingtheaverageemployeeID.
TheDifferentTypesofAttributesAuseful(andsimple)waytospecifythetypeofanattributeistoidentifythepropertiesofnumbersthatcorrespondtounderlyingpropertiesoftheattribute.Forexample,anattributesuchaslengthhasmanyofthepropertiesofnumbers.Itmakessensetocompareandorderobjectsbylength,aswellastotalkaboutthedifferencesandratiosoflength.Thefollowingproperties(operations)ofnumbersaretypicallyusedtodescribeattributes.
1. Distinctness: = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and −
4. Multiplication: × and /

Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio. Table 2.2 gives the definitions of these types, along with information about the statistical operations that are valid for each type. Each attribute type possesses all of the properties and operations of the attribute types above it. Consequently, any property or operation that is valid for nominal, ordinal, and interval attributes is also valid for ratio attributes. In other words, the definition of the attribute types is cumulative. However, this does not mean that the statistical operations appropriate for one attribute type are appropriate for the attribute types above it.

Table 2.2. Different attribute types.

Categorical (Qualitative)
Nominal: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, gender. Operations: mode, entropy, contingency correlation, χ² test.
Ordinal: The values of an ordinal attribute provide enough information to order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.

Numeric (Quantitative)
Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, −). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests.
Ratio: For ratio variables, both differences and ratios are meaningful (×, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.

Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes. As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers. Even if they are represented by numbers, i.e., integers, they should be treated more like symbols. The remaining two types of attributes, interval and ratio, are collectively referred to as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most of the properties of numbers. Note that quantitative attributes can be integer-valued or continuous.
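To make Table 2.2 concrete, here is a minimal Python sketch (with made-up attribute values, and assuming the pandas and scipy packages are available) that computes one statistic valid for each attribute type; a mean of the nominal or ordinal attribute, by contrast, would not be meaningful.

import pandas as pd
from scipy import stats

eye_color = pd.Series(["brown", "blue", "brown", "green", "brown"])  # nominal
grade = pd.Series(["B", "A", "C", "A", "B"])                         # ordinal
temp_c = pd.Series([21.0, 23.5, 19.0, 22.0, 20.5])                   # interval
mass_kg = pd.Series([61.0, 82.5, 70.3, 55.1, 68.0])                  # ratio

print(eye_color.mode()[0])                    # nominal: mode
rank = {"C": 0, "B": 1, "A": 2}               # ordinal: median computed on ranks
print(sorted(rank, key=rank.get)[int(grade.map(rank).median())])
print(temp_c.mean(), temp_c.std())            # interval: mean and standard deviation
print(stats.gmean(mass_kg))                   # ratio: geometric mean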
The types of attributes can also be described in terms of transformations that do not change the meaning of an attribute. Indeed, S. S. Stevens, the psychologist who originally defined the types of attributes shown in Table 2.2, defined them in terms of these permissible transformations. For example, the meaning of a length attribute is unchanged if it is measured in meters instead of feet.

The statistical operations that make sense for a particular type of attribute are those that will yield the same results when the attribute is transformed by using a transformation that preserves the attribute's meaning. To illustrate, the average length of a set of objects is different when measured in meters rather than in feet, but both averages represent the same length. Table 2.3 shows the meaning-preserving transformations for the four attribute types of Table 2.2.
Table 2.3. Transformations that define attribute levels.

Categorical (Qualitative)
Nominal: Any one-to-one mapping, e.g., a permutation of values. Comment: If all employee ID numbers are reassigned, it will not make any difference.
Ordinal: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)
Interval: new_value = a × old_value + b, a and b constants. Comment: The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).
Ratio: new_value = a × old_value. Comment: Length can be measured in meters or feet.

Example 2.5 (Temperature Scales). Temperature provides a good illustration of some of the concepts that have been described. First, temperature can be either an interval or a ratio attribute, depending on its measurement scale. When measured on the Kelvin scale, a temperature of 2° is, in a physically meaningful way, twice that of a temperature of 1°. This is not true when temperature is measured on either the Celsius or Fahrenheit scales, because, physically, a temperature of 1° Fahrenheit (Celsius) is not much different than a temperature of 2° Fahrenheit (Celsius). The problem is that the zero points of the Fahrenheit and Celsius scales are, in a physical sense, arbitrary, and therefore, the ratio of two Celsius or Fahrenheit temperatures is not physically meaningful.
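The following short Python check (with arbitrary temperatures) illustrates the transformations of Table 2.3: the interval (affine) map from Celsius to Fahrenheit preserves differences but not ratios, whereas ratios of Kelvin values remain physically meaningful.

celsius = [10.0, 20.0, 40.0]
fahrenheit = [9 / 5 * c + 32 for c in celsius]   # new_value = a * old_value + b
kelvin = [c + 273.15 for c in celsius]           # ratio scale

print(celsius[2] / celsius[1])         # 2.0, but this ratio has no physical meaning
print(fahrenheit[2] / fahrenheit[1])   # not 2.0; the "ratio" depends on the scale
print(kelvin[2] / kelvin[1])           # a ratio of Kelvin values is meaningful
print((celsius[1] - celsius[0]) * 9 / 5 == fahrenheit[1] - fahrenheit[0])  # differences carry over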
Describing Attributes by the Number of Values
An independent way of distinguishing between attributes is by the number of values they can take.

Discrete  A discrete attribute has a finite or countably infinite set of values. Such attributes can be categorical, such as zip codes or ID numbers, or numeric, such as counts. Discrete attributes are often represented using integer variables. Binary attributes are a special case of discrete attributes and assume only two values, e.g., true/false, yes/no, male/female, or 0/1. Binary attributes are often represented as Boolean variables, or as integer variables that only take the values 0 or 1.
ContinuousAcontinuousattributeisonewhosevaluesarerealnumbers.Examplesincludeattributessuchastemperature,height,orweight.Continuousattributesaretypicallyrepresentedasfloating-pointvariables.Practically,realvaluescanbemeasuredandrepresentedonlywithlimitedprecision.
Intheory,anyofthemeasurementscaletypes—nominal,ordinal,interval,andratio—couldbecombinedwithanyofthetypesbasedonthenumberofattributevalues—binary,discrete,andcontinuous.However,somecombinationsoccuronlyinfrequentlyordonotmakemuchsense.Forinstance,itisdifficulttothinkofarealisticdatasetthatcontainsacontinuousbinaryattribute.Typically,nominalandordinalattributesarebinaryordiscrete,whileintervalandratioattributesarecontinuous.However,countattributes,whicharediscrete,arealsoratioattributes.
AsymmetricAttributesForasymmetricattributes,onlypresence—anon-zeroattributevalue—isregardedasimportant.Consideradatasetinwhicheachobjectisastudentandeachattributerecordswhetherastudenttookaparticularcourseatauniversity.Foraspecificstudent,anattributehasavalueof1ifthestudenttookthecourseassociatedwiththatattributeandavalueof0otherwise.Becausestudentstakeonlyasmallfractionofallavailablecourses,mostofthevaluesinsuchadatasetwouldbe0.Therefore,itismoremeaningfulandmoreefficienttofocusonthenon-zerovalues.Toillustrate,ifstudentsarecomparedonthebasisofthecoursestheydon’ttake,thenmoststudentswouldseemverysimilar,atleastifthenumberofcoursesislarge.Binaryattributeswhereonlynon-zerovaluesareimportantarecalledasymmetric
binaryattributes.Thistypeofattributeisparticularlyimportantforassociationanalysis,whichisdiscussedinChapter5 .Itisalsopossibletohavediscreteorcontinuousasymmetricfeatures.Forinstance,ifthenumberofcreditsassociatedwitheachcourseisrecorded,thentheresultingdatasetwillconsistofasymmetricdiscreteorcontinuousattributes.
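A minimal Python sketch (with a made-up student–course matrix) shows why asymmetric binary attributes call for presence-based comparison: counting the shared 0's makes almost any two students look alike, while restricting attention to the courses actually taken does not.

import numpy as np

n_courses = 1000
a = np.zeros(n_courses, dtype=int)
b = np.zeros(n_courses, dtype=int)
a[[3, 57, 200]] = 1   # courses taken by student a
b[[3, 57, 999]] = 1   # courses taken by student b

both = np.sum((a == 1) & (b == 1))      # courses taken by both students
either = np.sum((a == 1) | (b == 1))    # courses taken by at least one student
matches = np.sum(a == b)                # agreements, dominated by shared 0's

print(matches / n_courses)   # 0.998: the students look nearly identical
print(both / either)         # 0.5: based only on presences, they share half their courses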
GeneralCommentsonLevelsofMeasurementAsdescribedintherestofthischapter,therearemanydiversetypesofdata.Thepreviousdiscussionofmeasurementscales,whileuseful,isnotcompleteandhassomelimitations.Weprovidethefollowingcommentsandguidance.
Distinctness, order, and meaningful intervals and ratios are only four properties of data—many others are possible. For instance, some data is inherently cyclical, e.g., position on the surface of the Earth or time. As another example, consider set valued attributes, where each attribute value is a set of elements, e.g., the set of movies seen in the last year. Define one set of elements (movies) to be greater (larger) than a second set if the second set is a subset of the first. However, such a relationship defines only a partial order that does not match any of the attribute types just defined.

The numbers or symbols used to capture attribute values may not capture all the properties of the attributes or may suggest properties that are not there. An illustration of this for integers was presented in Example 2.3, i.e., averages of IDs and out of range ages.

Data is often transformed for the purpose of analysis—see Section 2.3.7. This often changes the distribution of the observed variable to a distribution that is easier to analyze, e.g., a Gaussian (normal) distribution. Often, such transformations only preserve the order of the original values, and other properties are lost. Nonetheless, if the desired outcome is a statistical test of differences or a predictive model, such a transformation is justified.

The final evaluation of any data analysis, including operations on attributes, is whether the results make sense from a domain point of view.
Insummary,itcanbechallengingtodeterminewhichoperationscanbeperformedonaparticularattributeoracollectionofattributeswithoutcompromisingtheintegrityoftheanalysis.Fortunately,establishedpracticeoftenservesasareliableguide.Occasionally,however,standardpracticesareerroneousorhavelimitations.
2.1.2TypesofDataSets
Therearemanytypesofdatasets,andasthefieldofdataminingdevelopsandmatures,agreatervarietyofdatasetsbecomeavailableforanalysis.Inthissection,wedescribesomeofthemostcommontypes.Forconvenience,wehavegroupedthetypesofdatasetsintothreegroups:recorddata,graph-baseddata,andordereddata.Thesecategoriesdonotcoverallpossibilitiesandothergroupingsarecertainlypossible.
GeneralCharacteristicsofDataSetsBeforeprovidingdetailsofspecifickindsofdatasets,wediscussthreecharacteristicsthatapplytomanydatasetsandhaveasignificantimpactonthedataminingtechniquesthatareused:dimensionality,distribution,andresolution.
Dimensionality
Thedimensionalityofadatasetisthenumberofattributesthattheobjectsinthedatasetpossess.Analyzingdatawithasmallnumberofdimensionstendstobequalitativelydifferentfromanalyzingmoderateorhigh-dimensionaldata.Indeed,thedifficultiesassociatedwiththeanalysisofhigh-dimensionaldataaresometimesreferredtoasthecurseofdimensionality.Becauseofthis,animportantmotivationinpreprocessingthedataisdimensionalityreduction.TheseissuesarediscussedinmoredepthlaterinthischapterandinAppendixB.
Distribution
Thedistributionofadatasetisthefrequencyofoccurrenceofvariousvaluesorsetsofvaluesfortheattributescomprisingdataobjects.Equivalently,thedistributionofadatasetcanbeconsideredasadescriptionoftheconcentrationofobjectsinvariousregionsofthedataspace.Statisticianshaveenumeratedmanytypesofdistributions,e.g.,Gaussian(normal),anddescribedtheirproperties.(SeeAppendixC.)Althoughstatisticalapproachesfordescribingdistributionscanyieldpowerfulanalysistechniques,manydatasetshavedistributionsthatarenotwellcapturedbystandardstatisticaldistributions.
Asaresult,manydataminingalgorithmsdonotassumeaparticularstatisticaldistributionforthedatatheyanalyze.However,somegeneralaspectsofdistributionsoftenhaveastrongimpact.Forexample,supposeacategoricalattributeisusedasaclassvariable,whereoneofthecategoriesoccurs95%ofthetime,whiletheothercategoriestogetheroccuronly5%ofthetime.ThisskewnessinthedistributioncanmakeclassificationdifficultasdiscussedinSection4.11.(Skewnesshasotherimpactsondataanalysisthatarenotdiscussedhere.)
Aspecialcaseofskeweddataissparsity.Forsparsebinary,countorcontinuousdata,mostattributesofanobjecthavevaluesof0.Inmanycases,fewerthan1%ofthevaluesarenon-zero.Inpracticalterms,sparsityisanadvantagebecauseusuallyonlythenon-zerovaluesneedtobestoredandmanipulated.Thisresultsinsignificantsavingswithrespecttocomputationtimeandstorage.Indeed,somedataminingalgorithms,suchastheassociationruleminingalgorithmsdescribedinChapter5 ,workwellonlyforsparsedata.Finally,notethatoftentheattributesinsparsedatasetsareasymmetricattributes.
Resolution
Itisfrequentlypossibletoobtaindataatdifferentlevelsofresolution,andoftenthepropertiesofthedataaredifferentatdifferentresolutions.Forinstance,thesurfaceoftheEarthseemsveryunevenataresolutionofafewmeters,butisrelativelysmoothataresolutionoftensofkilometers.Thepatternsinthedataalsodependonthelevelofresolution.Iftheresolutionistoofine,apatternmaynotbevisibleormaybeburiedinnoise;iftheresolutionistoocoarse,thepatterncandisappear.Forexample,variationsinatmosphericpressureonascaleofhoursreflectthemovementofstormsandotherweathersystems.Onascaleofmonths,suchphenomenaarenotdetectable.
RecordDataMuchdataminingworkassumesthatthedatasetisacollectionofrecords(dataobjects),eachofwhichconsistsofafixedsetofdatafields(attributes).SeeFigure2.2(a) .Forthemostbasicformofrecorddata,thereisnoexplicitrelationshipamongrecordsordatafields,andeveryrecord(object)hasthesamesetofattributes.Recorddataisusuallystoredeitherinflatfilesorinrelationaldatabases.Relationaldatabasesarecertainlymorethana
collectionofrecords,butdataminingoftendoesnotuseanyoftheadditionalinformationavailableinarelationaldatabase.Rather,thedatabaseservesasaconvenientplacetofindrecords.DifferenttypesofrecorddataaredescribedbelowandareillustratedinFigure2.2 .
Figure2.2.Differentvariationsofrecorddata.
TransactionorMarketBasketData
Transactiondataisaspecialtypeofrecorddata,whereeachrecord(transaction)involvesasetofitems.Consideragrocerystore.Thesetofproductspurchasedbyacustomerduringoneshoppingtripconstitutesatransaction,whiletheindividualproductsthatwerepurchasedaretheitems.Thistypeofdataiscalledmarketbasketdatabecausetheitemsineachrecordaretheproductsinaperson’s“marketbasket.”Transactiondataisacollectionofsetsofitems,butitcanbeviewedasasetofrecordswhosefieldsareasymmetricattributes.Mostoften,theattributesarebinary,indicatingwhetheranitemwaspurchased,butmoregenerally,theattributescanbediscreteorcontinuous,suchasthenumberofitemspurchasedortheamountspentonthoseitems.Figure2.2(b) showsasampletransactiondataset.Eachrowrepresentsthepurchasesofaparticularcustomerataparticulartime.
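As a small sketch (the items and transactions are made up), the Python fragment below converts a collection of market basket transactions into the equivalent set of records with asymmetric binary attributes, one column per item.

import pandas as pd

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer", "cola"},
]
items = sorted(set.union(*transactions))
binary = pd.DataFrame([[int(item in t) for item in items] for t in transactions],
                      columns=items)
print(binary)   # mostly 0's once the item catalog becomes large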
TheDataMatrix
Ifallthedataobjectsinacollectionofdatahavethesamefixedsetofnumericattributes,thenthedataobjectscanbethoughtofaspoints(vectors)inamultidimensionalspace,whereeachdimensionrepresentsadistinctattributedescribingtheobject.Asetofsuchdataobjectscanbeinterpretedasanmbynmatrix,wheretherearemrows,oneforeachobject,andncolumns,oneforeachattribute.(Arepresentationthathasdataobjectsascolumnsandattributesasrowsisalsofine.)Thismatrixiscalledadatamatrixorapatternmatrix.Adatamatrixisavariationofrecorddata,butbecauseitconsistsofnumericattributes,standardmatrixoperationcanbeappliedtotransformandmanipulatethedata.Therefore,thedatamatrixisthestandarddataformatformoststatisticaldata.Figure2.2(c) showsasampledatamatrix.
TheSparseDataMatrix
Asparsedatamatrixisaspecialcaseofadatamatrixwheretheattributesareofthesametypeandareasymmetric;i.e.,onlynon-zerovaluesareimportant.Transactiondataisanexampleofasparsedatamatrixthathasonly0–1entries.Anothercommonexampleisdocumentdata.Inparticular,iftheorderoftheterms(words)inadocumentisignored—the“bagofwords”approach—thenadocumentcanberepresentedasatermvector,whereeachtermisacomponent(attribute)ofthevectorandthevalueofeachcomponentisthenumberoftimesthecorrespondingtermoccursinthedocument.Thisrepresentationofacollectionofdocumentsisoftencalledadocument-termmatrix.Figure2.2(d) showsasampledocument-termmatrix.Thedocumentsaretherowsofthismatrix,whilethetermsarethecolumns.Inpractice,onlythenon-zeroentriesofsparsedatamatricesarestored.
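A short sketch of the bag-of-words construction, assuming a recent version of scikit-learn [46] and using three made-up documents; the result is stored as a sparse matrix in which rows are documents, columns are terms, and entries are term counts.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["data mining finds patterns in data",
        "graph mining finds frequent subgraphs",
        "a time series is data ordered in time"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # a sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the terms (columns)
print(X.toarray())                         # dense view; fine only for tiny examples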
Graph-BasedDataAgraphcansometimesbeaconvenientandpowerfulrepresentationfordata.Weconsidertwospecificcases:(1)thegraphcapturesrelationshipsamongdataobjectsand(2)thedataobjectsthemselvesarerepresentedasgraphs.
DatawithRelationshipsamongObjects
Therelationshipsamongobjectsfrequentlyconveyimportantinformation.Insuchcases,thedataisoftenrepresentedasagraph.Inparticular,thedataobjectsaremappedtonodesofthegraph,whiletherelationshipsamongobjectsarecapturedbythelinksbetweenobjectsandlinkproperties,suchasdirectionandweight.ConsiderwebpagesontheWorldWideWeb,whichcontainbothtextandlinkstootherpages.Inordertoprocesssearchqueries,websearchenginescollectandprocesswebpagestoextracttheircontents.Itiswell-known,however,thatthelinkstoandfromeachpageprovideagreatdealofinformationabouttherelevanceofawebpagetoaquery,andthus,mustalsobetakenintoconsideration.Figure2.3(a) showsasetoflinked
webpages.Anotherimportantexampleofsuchgraphdataarethesocialnetworks,wheredataobjectsarepeopleandtherelationshipsamongthemaretheirinteractionsviasocialmedia.
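A minimal sketch (hypothetical pages and links, assuming the networkx package) of representing data objects and their relationships as a graph; here the number of links into a page is one simple indicator of its importance.

import networkx as nx

web = nx.DiGraph()
web.add_edge("page_A", "page_B")   # page A links to page B
web.add_edge("page_A", "page_C")
web.add_edge("page_B", "page_C")

print(dict(web.in_degree()))       # {'page_A': 0, 'page_B': 1, 'page_C': 2}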
DatawithObjectsThatAreGraphs
Ifobjectshavestructure,thatis,theobjectscontainsubobjectsthathaverelationships,thensuchobjectsarefrequentlyrepresentedasgraphs.Forexample,thestructureofchemicalcompoundscanberepresentedbyagraph,wherethenodesareatomsandthelinksbetweennodesarechemicalbonds.Figure2.3(b) showsaball-and-stickdiagramofthechemicalcompoundbenzene,whichcontainsatomsofcarbon(black)andhydrogen(gray).Agraphrepresentationmakesitpossibletodeterminewhichsubstructuresoccurfrequentlyinasetofcompoundsandtoascertainwhetherthepresenceofanyofthesesubstructuresisassociatedwiththepresenceorabsenceofcertainchemicalproperties,suchasmeltingpointorheatofformation.Frequentgraphmining,whichisabranchofdataminingthatanalyzessuchdata,isconsideredinSection6.5.
Figure2.3.Differentvariationsofgraphdata.
OrderedDataForsometypesofdata,theattributeshaverelationshipsthatinvolveorderintimeorspace.DifferenttypesofordereddataaredescribednextandareshowninFigure2.4 .
SequentialTransactionData
Sequentialtransactiondatacanbethoughtofasanextensionoftransactiondata,whereeachtransactionhasatimeassociatedwithit.Consideraretailtransactiondatasetthatalsostoresthetimeatwhichthetransactiontookplace.Thistimeinformationmakesitpossibletofindpatternssuchas“candysalespeakbeforeHalloween.”Atimecanalsobeassociatedwitheachattribute.Forexample,eachrecordcouldbethepurchasehistoryofa
customer,withalistingofitemspurchasedatdifferenttimes.Usingthisinformation,itispossibletofindpatternssuchas“peoplewhobuyDVDplayerstendtobuyDVDsintheperiodimmediatelyfollowingthepurchase.”
Figure2.4(a) showsanexampleofsequentialtransactiondata.Therearefivedifferenttimes—t1,t2,t3,t4,andt5;threedifferentcustomers—C1,C2,andC3;andfivedifferentitems—A,B,C,D,andE.Inthetoptable,eachrowcorrespondstotheitemspurchasedataparticulartimebyeachcustomer.Forinstance,attimet3,customerC2purchaseditemsAandD.Inthebottomtable,thesameinformationisdisplayed,buteachrowcorrespondstoaparticularcustomer.Eachrowcontainsinformationabouteachtransactioninvolvingthecustomer,whereatransactionisconsideredtobeasetofitemsandthetimeatwhichthoseitemswerepurchased.Forexample,customerC3boughtitemsAandCattimet2.
TimeSeriesData
Timeseriesdataisaspecialtypeofordereddatawhereeachrecordisatimeseries,i.e.,aseriesofmeasurementstakenovertime.Forexample,afinancialdatasetmightcontainobjectsthataretimeseriesofthedailypricesofvariousstocks.Asanotherexample,considerFigure2.4(c) ,whichshowsatimeseriesoftheaveragemonthlytemperatureforMinneapolisduringtheyears1982to1994.Whenworkingwithtemporaldata,suchastimeseries,itisimportanttoconsidertemporalautocorrelation;i.e.,iftwomeasurementsarecloseintime,thenthevaluesofthosemeasurementsareoftenverysimilar.
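A small Python sketch of temporal autocorrelation, using a synthetic monthly temperature series rather than the Minneapolis data of Figure 2.4(c): because measurements close in time are similar, the series is strongly correlated with a copy of itself shifted by one month.

import numpy as np

rng = np.random.default_rng(0)
months = np.arange(120)
temps = 10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 1, size=120)

lag1 = np.corrcoef(temps[:-1], temps[1:])[0, 1]   # lag-1 autocorrelation
print(lag1)                                       # close to 1 for a smooth seasonal series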
Figure2.4.Differentvariationsofordereddata.
SequenceData
Sequencedataconsistsofadatasetthatisasequenceofindividualentities,suchasasequenceofwordsorletters.Itisquitesimilartosequentialdata,exceptthattherearenotimestamps;instead,therearepositionsinanorderedsequence.Forexample,thegeneticinformationofplantsandanimalscanberepresentedintheformofsequencesofnucleotidesthatareknownasgenes.Manyoftheproblemsassociatedwithgeneticsequencedatainvolvepredictingsimilaritiesinthestructureandfunctionofgenesfromsimilaritiesinnucleotidesequences.Figure2.4(b) showsasectionofthehumangeneticcodeexpressedusingthefournucleotidesfromwhichallDNAisconstructed:A,T,G,andC.
SpatialandSpatio-TemporalData
Someobjectshavespatialattributes,suchaspositionsorareas,inadditiontoothertypesofattributes.Anexampleofspatialdataisweatherdata(precipitation,temperature,pressure)thatiscollectedforavarietyofgeographicallocations.Oftensuchmeasurementsarecollectedovertime,andthus,thedataconsistsoftimeseriesatvariouslocations.Inthatcase,werefertothedataasspatio-temporaldata.Althoughanalysiscanbeconductedseparatelyforeachspecifictimeorlocation,amorecompleteanalysisofspatio-temporaldatarequiresconsiderationofboththespatialandtemporalaspectsofthedata.
Animportantaspectofspatialdataisspatialautocorrelation;i.e.,objectsthatarephysicallyclosetendtobesimilarinotherwaysaswell.Thus,twopointsontheEarththatareclosetoeachotherusuallyhavesimilarvaluesfortemperatureandrainfall.Notethatspatialautocorrelationisanalogoustotemporalautocorrelation.
Importantexamplesofspatialandspatio-temporaldataarethescienceandengineeringdatasetsthataretheresultofmeasurementsormodeloutput
taken at regularly or irregularly distributed points on a two- or three-dimensional grid or mesh. For instance, Earth science data sets record the temperature or pressure measured at points (grid cells) on latitude–longitude spherical grids of various resolutions, e.g., 1° by 1°. See Figure 2.4(d). As another example, in the simulation of the flow of a gas, the speed and direction of flow at various instants in time can be recorded for each grid point in the simulation. A different type of spatio-temporal data arises from tracking the trajectories of objects, e.g., vehicles, in time and space.
HandlingNon-RecordDataMostdataminingalgorithmsaredesignedforrecorddataoritsvariations,suchastransactiondataanddatamatrices.Record-orientedtechniquescanbeappliedtonon-recorddatabyextractingfeaturesfromdataobjectsandusingthesefeaturestocreatearecordcorrespondingtoeachobject.Considerthechemicalstructuredatathatwasdescribedearlier.Givenasetofcommonsubstructures,eachcompoundcanberepresentedasarecordwithbinaryattributesthatindicatewhetheracompoundcontainsaspecificsubstructure.Sucharepresentationisactuallyatransactiondataset,wherethetransactionsarethecompoundsandtheitemsarethesubstructures.
Insomecases,itiseasytorepresentthedatainarecordformat,butthistypeofrepresentationdoesnotcapturealltheinformationinthedata.Considerspatio-temporaldataconsistingofatimeseriesfromeachpointonaspatialgrid.Thisdataisoftenstoredinadatamatrix,whereeachrowrepresentsalocationandeachcolumnrepresentsaparticularpointintime.However,sucharepresentationdoesnotexplicitlycapturethetimerelationshipsthatarepresentamongattributesandthespatialrelationshipsthatexistamongobjects.Thisdoesnotmeanthatsucharepresentationisinappropriate,butratherthattheserelationshipsmustbetakenintoconsiderationduringtheanalysis.Forexample,itwouldnotbeagoodideatouseadatamining
techniquethatignoresthetemporalautocorrelationoftheattributesorthespatialautocorrelationofthedataobjects,i.e.,thelocationsonthespatialgrid.
2.2DataQualityDataminingalgorithmsareoftenappliedtodatathatwascollectedforanotherpurpose,orforfuture,butunspecifiedapplications.Forthatreason,dataminingcannotusuallytakeadvantageofthesignificantbenefitsof“ad-dressingqualityissuesatthesource.”Incontrast,muchofstatisticsdealswiththedesignofexperimentsorsurveysthatachieveaprespecifiedlevelofdataquality.Becausepreventingdataqualityproblemsistypicallynotanoption,dataminingfocuseson(1)thedetectionandcorrectionofdataqualityproblemsand(2)theuseofalgorithmsthatcantoleratepoordataquality.Thefirststep,detectionandcorrection,isoftencalleddatacleaning.
Thefollowingsectionsdiscussspecificaspectsofdataquality.Thefocusisonmeasurementanddatacollectionissues,althoughsomeapplication-relatedissuesarealsodiscussed.
2.2.1MeasurementandDataCollectionIssues
Itisunrealistictoexpectthatdatawillbeperfect.Theremaybeproblemsduetohumanerror,limitationsofmeasuringdevices,orflawsinthedatacollectionprocess.Valuesorevenentiredataobjectscanbemissing.Inothercases,therecanbespuriousorduplicateobjects;i.e.,multipledataobjectsthatallcorrespondtoasingle“real”object.Forexample,theremightbetwodifferentrecordsforapersonwhohasrecentlylivedattwodifferentaddresses.Evenif
allthedataispresentand“looksfine,”theremaybeinconsistencies—apersonhasaheightof2meters,butweighsonly2kilograms.
Inthenextfewsections,wefocusonaspectsofdataqualitythatarerelatedtodatameasurementandcollection.Webeginwithadefinitionofmeasurementanddatacollectionerrorsandthenconsideravarietyofproblemsthatinvolvemeasurementerror:noise,artifacts,bias,precision,andaccuracy.Weconcludebydiscussingdataqualityissuesthatinvolvebothmeasurementanddatacollectionproblems:outliers,missingandinconsistentvalues,andduplicatedata.
MeasurementandDataCollectionErrorsThetermmeasurementerrorreferstoanyproblemresultingfromthemeasurementprocess.Acommonproblemisthatthevaluerecordeddiffersfromthetruevaluetosomeextent.Forcontinuousattributes,thenumericaldifferenceofthemeasuredandtruevalueiscalledtheerror.Thetermdatacollectionerrorreferstoerrorssuchasomittingdataobjectsorattributevalues,orinappropriatelyincludingadataobject.Forexample,astudyofanimalsofacertainspeciesmightincludeanimalsofarelatedspeciesthataresimilarinappearancetothespeciesofinterest.Bothmeasurementerrorsanddatacollectionerrorscanbeeithersystematicorrandom.
Wewillonlyconsidergeneraltypesoferrors.Withinparticulardomains,certaintypesofdataerrorsarecommonplace,andwell-developedtechniquesoftenexistfordetectingand/orcorrectingtheseerrors.Forexample,keyboarderrorsarecommonwhendataisenteredmanually,andasaresult,manydataentryprogramshavetechniquesfordetectingand,withhumanintervention,correctingsucherrors.
Noise and Artifacts
Noise is the random component of a measurement error. It typically involves the distortion of a value or the addition of spurious objects. Figure 2.5 shows a time series before and after it has been disrupted by random noise. If a bit more noise were added to the time series, its shape would be lost. Figure 2.6 shows a set of data points before and after some noise points (indicated by '+'s) have been added. Notice that some of the noise points are intermixed with the non-noise points.

Figure 2.5. Noise in a time series context.

Figure 2.6. Noise in a spatial context.
Thetermnoiseisoftenusedinconnectionwithdatathathasaspatialortemporalcomponent.Insuchcases,techniquesfromsignalorimageprocessingcanfrequentlybeusedtoreducenoiseandthus,helptodiscoverpatterns(signals)thatmightbe“lostinthenoise.”Nonetheless,theeliminationofnoiseisfrequentlydifficult,andmuchworkindataminingfocusesondevisingrobustalgorithmsthatproduceacceptableresultsevenwhennoiseispresent.
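In the spirit of Figure 2.5, a minimal sketch of random noise: a clean periodic signal is distorted by adding Gaussian noise to every value (the signal and noise level are arbitrary).

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
signal = np.sin(t)                                    # the underlying pattern
noisy = signal + rng.normal(scale=0.3, size=t.shape)  # noise distorts every value

print(np.std(noisy - signal))   # roughly 0.3, the magnitude of the added noise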
Dataerrorscanbetheresultofamoredeterministicphenomenon,suchasastreakinthesameplaceonasetofphotographs.Suchdeterministicdistortionsofthedataareoftenreferredtoasartifacts.
Precision,Bias,andAccuracyInstatisticsandexperimentalscience,thequalityofthemeasurementprocessandtheresultingdataaremeasuredbyprecisionandbias.Weprovidethe
standarddefinitions,followedbyabriefdiscussion.Forthefollowingdefinitions,weassumethatwemakerepeatedmeasurementsofthesameunderlyingquantity.
Definition2.3(Precision).Theclosenessofrepeatedmeasurements(ofthesamequantity)tooneanother.
Definition2.4(Bias).Asystematicvariationofmeasurementsfromthequantitybeingmeasured.
Precisionisoftenmeasuredbythestandarddeviationofasetofvalues,whilebiasismeasuredbytakingthedifferencebetweenthemeanofthesetofvaluesandtheknownvalueofthequantitybeingmeasured.Biascanbedeterminedonlyforobjectswhosemeasuredquantityisknownbymeansexternaltothecurrentsituation.Supposethatwehaveastandardlaboratoryweightwithamassof1gandwanttoassesstheprecisionandbiasofournewlaboratoryscale.Weweighthemassfivetimes,andobtainthefollowingfivevalues:{1.015,0.990,1.013,1.001,0.986}.Themeanofthesevaluesis
1.001,andhence,thebiasis0.001.Theprecision,asmeasuredbythestandarddeviation,is0.013.
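The bias and precision of this example can be computed directly; the short fragment below reproduces the two numbers just quoted (using the sample standard deviation for precision).

import numpy as np

measurements = np.array([1.015, 0.990, 1.013, 1.001, 0.986])
true_value = 1.0                        # known mass of the standard weight, in grams

bias = measurements.mean() - true_value
precision = measurements.std(ddof=1)    # sample standard deviation

print(round(bias, 3))        # 0.001
print(round(precision, 3))   # 0.013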
Itiscommontousethemoregeneralterm,accuracy,torefertothedegreeofmeasurementerrorindata.
Definition2.5(Accuracy)Theclosenessofmeasurementstothetruevalueofthequantitybeingmeasured.
Accuracydependsonprecisionandbias,butthereisnospecificformulaforaccuracyintermsofthesetwoquantities.
One important aspect of accuracy is the use of significant digits. The goal is to use only as many digits to represent the result of a measurement or calculation as are justified by the precision of the data. For example, if the length of an object is measured with a meter stick whose smallest markings are millimeters, then we should record the length of data only to the nearest millimeter. The precision of such a measurement would be ±0.5 mm. We do not review the details of working with significant digits because most readers will have encountered them in previous courses and they are covered in considerable depth in science, engineering, and statistics textbooks.

Issues such as significant digits, precision, bias, and accuracy are sometimes overlooked, but they are important for data mining as well as statistics and science. Many times, data sets do not come with information about the precision of the data, and furthermore, the programs used for analysis return results without any such information. Nonetheless, without some understanding of the accuracy of the data and the results, an analyst runs the risk of committing serious data analysis blunders.
OutliersOutliersareeither(1)dataobjectsthat,insomesense,havecharacteristicsthataredifferentfrommostoftheotherdataobjectsinthedataset,or(2)valuesofanattributethatareunusualwithrespecttothetypicalvaluesforthatattribute.Alternatively,theycanbereferredtoasanomalousobjectsorvalues.Thereisconsiderableleewayinthedefinitionofanoutlier,andmanydifferentdefinitionshavebeenproposedbythestatisticsanddataminingcommunities.Furthermore,itisimportanttodistinguishbetweenthenotionsofnoiseandoutliers.Unlikenoise,outlierscanbelegitimatedataobjectsorvaluesthatweareinterestedindetecting.Forinstance,infraudandnetworkintrusiondetection,thegoalistofindunusualobjectsoreventsfromamongalargenumberofnormalones.Chapter9 discussesanomalydetectioninmoredetail.
MissingValuesItisnotunusualforanobjecttobemissingoneormoreattributevalues.Insomecases,theinformationwasnotcollected;e.g.,somepeopledeclinetogivetheirageorweight.Inothercases,someattributesarenotapplicabletoallobjects;e.g.,often,formshaveconditionalpartsthatarefilledoutonlywhenapersonanswersapreviousquestioninacertainway,butforsimplicity,allfieldsarestored.Regardless,missingvaluesshouldbetakenintoaccountduringthedataanalysis.
Thereareseveralstrategies(andvariationsonthesestrategies)fordealingwithmissingdata,eachofwhichisappropriateincertaincircumstances.Thesestrategiesarelistednext,alongwithanindicationoftheiradvantagesanddisadvantages.
EliminateDataObjectsorAttributes
Asimpleandeffectivestrategyistoeliminateobjectswithmissingvalues.However,evenapartiallyspecifieddataobjectcontainssomeinformation,andifmanyobjectshavemissingvalues,thenareliableanalysiscanbedifficultorimpossible.Nonetheless,ifadatasethasonlyafewobjectsthathavemissingvalues,thenitmaybeexpedienttoomitthem.Arelatedstrategyistoeliminateattributesthathavemissingvalues.Thisshouldbedonewithcaution,however,becausetheeliminatedattributesmaybetheonesthatarecriticaltotheanalysis.
EstimateMissingValues
Sometimesmissingdatacanbereliablyestimated.Forexample,consideratimeseriesthatchangesinareasonablysmoothfashion,buthasafew,widelyscatteredmissingvalues.Insuchcases,themissingvaluescanbeestimated(interpolated)byusingtheremainingvalues.Asanotherexample,consideradatasetthathasmanysimilardatapoints.Inthissituation,theattributevaluesofthepointsclosesttothepointwiththemissingvalueareoftenusedtoestimatethemissingvalue.Iftheattributeiscontinuous,thentheaverageattributevalueofthenearestneighborsisused;iftheattributeiscategorical,thenthemostcommonlyoccurringattributevaluecanbetaken.Foraconcreteillustration,considerprecipitationmeasurementsthatarerecordedbygroundstations.Forareasnotcontainingagroundstation,theprecipitationcanbeestimatedusingvaluesobservedatnearbygroundstations.
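A minimal sketch of these two estimation strategies (the values are made up): linear interpolation for a few scattered gaps in a smooth series, and the average over the nearest neighbors for a missing continuous attribute.

import numpy as np
import pandas as pd

# A smooth series with a few scattered missing values, filled by interpolation.
series = pd.Series([2.0, 2.1, np.nan, 2.3, 2.4, np.nan, 2.6])
print(series.interpolate().tolist())     # [2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6]

# A missing continuous attribute estimated from the most similar objects.
neighbor_values = np.array([12.1, 11.8, 12.4])   # attribute values of the nearest neighbors
print(neighbor_values.mean())                    # the estimate for the missing value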
IgnoretheMissingValueduringAnalysis
Manydataminingapproachescanbemodifiedtoignoremissingvalues.Forexample,supposethatobjectsarebeingclusteredandthesimilaritybetweenpairsofdataobjectsneedstobecalculated.Ifoneorbothobjectsofapairhavemissingvaluesforsomeattributes,thenthesimilaritycanbecalculatedbyusingonlytheattributesthatdonothavemissingvalues.Itistruethatthesimilaritywillonlybeapproximate,butunlessthetotalnumberofattributesissmallorthenumberofmissingvaluesishigh,thisdegreeofinaccuracymaynotmattermuch.Likewise,manyclassificationschemescanbemodifiedtoworkwithmissingvalues.
InconsistentValuesDatacancontaininconsistentvalues.Consideranaddressfield,wherebothazipcodeandcityarelisted,butthespecifiedzipcodeareaisnotcontainedinthatcity.Itispossiblethattheindividualenteringthisinformationtransposedtwodigits,orperhapsadigitwasmisreadwhentheinformationwasscannedfromahandwrittenform.Regardlessofthecauseoftheinconsistentvalues,itisimportanttodetectand,ifpossible,correctsuchproblems.
Sometypesofinconsistencesareeasytodetect.Forinstance,aperson’sheightshouldnotbenegative.Inothercases,itcanbenecessarytoconsultanexternalsourceofinformation.Forexample,whenaninsurancecompanyprocessesclaimsforreimbursement,itchecksthenamesandaddressesonthereimbursementformsagainstadatabaseofitscustomers.
Onceaninconsistencyhasbeendetected,itissometimespossibletocorrectthedata.Aproductcodemayhave“check”digits,oritmaybepossibletodouble-checkaproductcodeagainstalistofknownproductcodes,andthen
correctthecodeifitisincorrect,butclosetoaknowncode.Thecorrectionofaninconsistencyrequiresadditionalorredundantinformation.
Example2.6(InconsistentSeaSurfaceTemperature).Thisexampleillustratesaninconsistencyinactualtimeseriesdatathatmeasurestheseasurfacetemperature(SST)atvariouspointsontheocean.SSTdatawasoriginallycollectedusingocean-basedmeasurementsfromshipsorbuoys,butmorerecently,satelliteshavebeenusedtogatherthedata.Tocreatealong-termdataset,bothsourcesofdatamustbeused.However,becausethedatacomesfromdifferentsources,thetwopartsofthedataaresubtlydifferent.ThisdiscrepancyisvisuallydisplayedinFigure2.7 ,whichshowsthecorrelationofSSTvaluesbetweenpairsofyears.Ifapairofyearshasapositivecorrelation,thenthelocationcorrespondingtothepairofyearsiscoloredwhite;otherwiseitiscoloredblack.(Seasonalvariationswereremovedfromthedatasince,otherwise,alltheyearswouldbehighlycorrelated.)Thereisadistinctchangeinbehaviorwherethedatahasbeenputtogetherin1983.Yearswithineachofthetwogroups,1958–1982and1983–1999,tendtohaveapositivecorrelationwithoneanother,butanegativecorrelationwithyearsintheothergroup.Thisdoesnotmeanthatthisdatashouldnotbeused,onlythattheanalystshouldconsiderthepotentialimpactofsuchdiscrepanciesonthedatamininganalysis.
Figure2.7.CorrelationofSSTdatabetweenpairsofyears.Whiteareasindicatepositivecorrelation.Blackareasindicatenegativecorrelation.
DuplicateDataAdatasetcanincludedataobjectsthatareduplicates,oralmostduplicates,ofoneanother.Manypeoplereceiveduplicatemailingsbecausetheyappearinadatabasemultipletimesunderslightlydifferentnames.Todetectandeliminatesuchduplicates,twomainissuesmustbeaddressed.First,iftherearetwoobjectsthatactuallyrepresentasingleobject,thenoneormorevaluesofcorrespondingattributesareusuallydifferent,andtheseinconsistentvaluesmustberesolved.Second,careneedstobetakentoavoidaccidentallycombiningdataobjectsthataresimilar,butnotduplicates,such
astwodistinctpeoplewithidenticalnames.Thetermdeduplicationisoftenusedtorefertotheprocessofdealingwiththeseissues.
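A small sketch (with made-up mailing records) of the easy half of deduplication: exact duplicates can simply be dropped, whereas near-duplicates, such as slightly different spellings of the same name and address, require approximate matching and a way to resolve the inconsistent attribute values.

import pandas as pd

mailing = pd.DataFrame({
    "name":    ["J. Smith", "J. Smith", "John Smith"],
    "address": ["12 Elm St", "12 Elm St", "12 Elm Street"],
})

print(mailing.drop_duplicates())   # removes the exact duplicate; the near-duplicate
                                   # third row still has to be handled separately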
Insomecases,twoormoreobjectsareidenticalwithrespecttotheattributesmeasuredbythedatabase,buttheystillrepresentdifferentobjects.Here,theduplicatesarelegitimate,butcanstillcauseproblemsforsomealgorithmsifthepossibilityofidenticalobjectsisnotspecificallyaccountedforintheirdesign.AnexampleofthisisgiveninExercise13 onpage108.
2.2.2IssuesRelatedtoApplications
Dataqualityissuescanalsobeconsideredfromanapplicationviewpointasexpressedbythestatement“dataisofhighqualityifitissuitableforitsintendeduse.”Thisapproachtodataqualityhasprovenquiteuseful,particularlyinbusinessandindustry.Asimilarviewpointisalsopresentinstatisticsandtheexperimentalsciences,withtheiremphasisonthecarefuldesignofexperimentstocollectthedatarelevanttoaspecifichypothesis.Aswithqualityissuesatthemeasurementanddatacollectionlevel,manyissuesarespecifictoparticularapplicationsandfields.Again,weconsideronlyafewofthegeneralissues.
Timeliness
Somedatastartstoageassoonasithasbeencollected.Inparticular,ifthedataprovidesasnapshotofsomeongoingphenomenonorprocess,suchasthepurchasingbehaviorofcustomersorwebbrowsingpatterns,thenthissnapshotrepresentsrealityforonlyalimitedtime.Ifthedataisoutofdate,thensoarethemodelsandpatternsthatarebasedonit.
Relevance
Theavailabledatamustcontaintheinformationnecessaryfortheapplication.Considerthetaskofbuildingamodelthatpredictstheaccidentratefordrivers.Ifinformationabouttheageandgenderofthedriverisomitted,thenitislikelythatthemodelwillhavelimitedaccuracyunlessthisinformationisindirectlyavailablethroughotherattributes.
Makingsurethattheobjectsinadatasetarerelevantisalsochallenging.Acommonproblemissamplingbias,whichoccurswhenasampledoesnotcontaindifferenttypesofobjectsinproportiontotheiractualoccurrenceinthepopulation.Forexample,surveydatadescribesonlythosewhorespondtothesurvey.(OtheraspectsofsamplingarediscussedfurtherinSection2.3.2 .)Becausetheresultsofadataanalysiscanreflectonlythedatathatispresent,samplingbiaswilltypicallyleadtoerroneousresultswhenappliedtothebroaderpopulation.
KnowledgeabouttheData
Ideally,datasetsareaccompaniedbydocumentationthatdescribesdifferentaspectsofthedata;thequalityofthisdocumentationcaneitheraidorhinderthesubsequentanalysis.Forexample,ifthedocumentationidentifiesseveralattributesasbeingstronglyrelated,theseattributesarelikelytoprovidehighlyredundantinformation,andweusuallydecidetokeepjustone.(Considersalestaxandpurchaseprice.)Ifthedocumentationispoor,however,andfailstotellus,forexample,thatthemissingvaluesforaparticularfieldareindicatedwitha-9999,thenouranalysisofthedatamaybefaulty.Otherimportantcharacteristicsaretheprecisionofthedata,thetypeoffeatures(nominal,ordinal,interval,ratio),thescaleofmeasurement(e.g.,metersorfeetforlength),andtheoriginofthedata.
2.3DataPreprocessingInthissection,weconsiderwhichpreprocessingstepsshouldbeappliedtomakethedatamoresuitablefordatamining.Datapreprocessingisabroadareaandconsistsofanumberofdifferentstrategiesandtechniquesthatareinterrelatedincomplexways.Wewillpresentsomeofthemostimportantideasandapproaches,andtrytopointouttheinterrelationshipsamongthem.Specifically,wewilldiscussthefollowingtopics:
AggregationSamplingDimensionalityreductionFeaturesubsetselectionFeaturecreationDiscretizationandbinarizationVariabletransformation
Roughlyspeaking,thesetopicsfallintotwocategories:selectingdataobjectsandattributesfortheanalysisorforcreating/changingtheattributes.Inbothcases,thegoalistoimprovethedatamininganalysiswithrespecttotime,cost,andquality.Detailsareprovidedinthefollowingsections.
Aquicknoteaboutterminology:Inthefollowing,wesometimesusesynonymsforattribute,suchasfeatureorvariable,inordertofollowcommonusage.
2.3.1Aggregation
Sometimes“lessismore,”andthisisthecasewithaggregation,thecombiningoftwoormoreobjectsintoasingleobject.Consideradatasetconsistingoftransactions(dataobjects)recordingthedailysalesofproductsinvariousstorelocations(Minneapolis,Chicago,Paris,…)fordifferentdaysoverthecourseofayear.SeeTable2.4 .Onewaytoaggregatetransactionsforthisdatasetistoreplaceallthetransactionsofasinglestorewithasinglestorewidetransaction.Thisreducesthehundredsorthousandsoftransactionsthatoccurdailyataspecificstoretoasingledailytransaction,andthenumberofdataobjectsperdayisreducedtothenumberofstores.
Table2.4.Datasetcontaininginformationaboutcustomerpurchases.
TransactionID Item StoreLocation Date Price …
⋮ ⋮ ⋮ ⋮ ⋮
101123 Watch Chicago 09/06/04 $25.99 …
101123 Battery Chicago 09/06/04 $5.99 …
101124 Shoes Minneapolis 09/06/04 $75.00 …
Anobviousissueishowanaggregatetransactioniscreated;i.e.,howthevaluesofeachattributearecombinedacrossalltherecordscorrespondingtoaparticularlocationtocreatetheaggregatetransactionthatrepresentsthesalesofasinglestoreordate.Quantitativeattributes,suchasprice,aretypicallyaggregatedbytakingasumoranaverage.Aqualitativeattribute,suchasitem,caneitherbeomittedorsummarizedintermsofahigherlevelcategory,e.g.,televisionsversuselectronics.
ThedatainTable2.4 canalsobeviewedasamultidimensionalarray,whereeachattributeisadimension.Fromthisviewpoint,aggregationistheprocessofeliminatingattributes,suchasthetypeofitem,orreducingthe
numberofvaluesforaparticularattribute;e.g.,reducingthepossiblevaluesfordatefrom365daysto12months.ThistypeofaggregationiscommonlyusedinOnlineAnalyticalProcessing(OLAP).ReferencestoOLAParegiveninthebibliographicNotes.
Thereareseveralmotivationsforaggregation.First,thesmallerdatasetsresultingfromdatareductionrequirelessmemoryandprocessingtime,andhence,aggregationoftenenablestheuseofmoreexpensivedataminingalgorithms.Second,aggregationcanactasachangeofscopeorscalebyprovidingahigh-levelviewofthedatainsteadofalow-levelview.Inthepreviousexample,aggregatingoverstorelocationsandmonthsgivesusamonthly,perstoreviewofthedatainsteadofadaily,peritemview.Finally,thebehaviorofgroupsofobjectsorattributesisoftenmorestablethanthatofindividualobjectsorattributes.Thisstatementreflectsthestatisticalfactthataggregatequantities,suchasaveragesortotals,havelessvariabilitythantheindividualvaluesbeingaggregated.Fortotals,theactualamountofvariationislargerthanthatofindividualobjects(onaverage),butthepercentageofthevariationissmaller,whileformeans,theactualamountofvariationislessthanthatofindividualobjects(onaverage).Adisadvantageofaggregationisthepotentiallossofinterestingdetails.Inthestoreexample,aggregatingovermonthslosesinformationaboutwhichdayoftheweekhasthehighestsales.
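To make the idea of aggregation concrete, the following short Python sketch shows one way to replace per-item transactions with a single per-store, per-day transaction. The column names and values are illustrative only and are not taken from Table 2.4; quantitative attributes are summed, while the qualitative item attribute is summarized by a count of distinct items.

# A minimal sketch of daily, per-store aggregation with pandas (illustrative data).
import pandas as pd

sales = pd.DataFrame({
    "store": ["Chicago", "Chicago", "Minneapolis", "Minneapolis"],
    "date":  ["09/06/04", "09/06/04", "09/06/04", "09/07/04"],
    "item":  ["Watch", "Battery", "Shoes", "Socks"],
    "price": [25.99, 5.99, 75.00, 4.50],
})

# Sum the quantitative attribute; summarize the qualitative attribute by a count.
daily_store = (sales
               .groupby(["store", "date"])
               .agg(total_sales=("price", "sum"),
                    distinct_items=("item", "nunique"))
               .reset_index())
print(daily_store)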
Example 2.7 (Australian Precipitation). This example is based on precipitation in Australia from the period 1982–1993. Figure 2.8(a) shows a histogram for the standard deviation of average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, while Figure 2.8(b) shows a histogram for the standard deviation of the average yearly precipitation for the same locations. The average yearly precipitation has less variability than the average monthly precipitation. All precipitation measurements (and their standard deviations) are in centimeters.
Figure2.8.HistogramsofstandarddeviationformonthlyandyearlyprecipitationinAustraliafortheperiod1982–1993.
2.3.2Sampling
Samplingisacommonlyusedapproachforselectingasubsetofthedataobjectstobeanalyzed.Instatistics,ithaslongbeenusedforboththepreliminaryinvestigationofthedataandthefinaldataanalysis.Samplingcanalsobeveryusefulindatamining.However,themotivationsforsamplinginstatisticsanddataminingareoftendifferent.Statisticiansusesamplingbecauseobtainingtheentiresetofdataofinterestistooexpensiveortimeconsuming,whiledataminersusuallysamplebecauseitistoocomputationallyexpensiveintermsofthememoryortimerequiredtoprocess
allthedata.Insomecases,usingasamplingalgorithmcanreducethedatasizetothepointwhereabetter,butmorecomputationallyexpensivealgorithmcanbeused.
Thekeyprincipleforeffectivesamplingisthefollowing:Usingasamplewillworkalmostaswellasusingtheentiredatasetifthesampleisrepresentative.Inturn,asampleisrepresentativeifithasapproximatelythesameproperty(ofinterest)astheoriginalsetofdata.Ifthemean(average)ofthedataobjectsisthepropertyofinterest,thenasampleisrepresentativeifithasameanthatisclosetothatoftheoriginaldata.Becausesamplingisastatisticalprocess,therepresentativenessofanyparticularsamplewillvary,andthebestthatwecandoischooseasamplingschemethatguaranteesahighprobabilityofgettingarepresentativesample.Asdiscussednext,thisinvolveschoosingtheappropriatesamplesizeandsamplingtechnique.
SamplingApproachesTherearemanysamplingtechniques,butonlyafewofthemostbasiconesandtheirvariationswillbecoveredhere.Thesimplesttypeofsamplingissimplerandomsampling.Forthistypeofsampling,thereisanequalprobabilityofselectinganyparticularobject.Therearetwovariationsonrandomsampling(andothersamplingtechniquesaswell):(1)samplingwithoutreplacement—aseachobjectisselected,itisremovedfromthesetofallobjectsthattogetherconstitutethepopulation,and(2)samplingwithreplacement—objectsarenotremovedfromthepopulationastheyareselectedforthesample.Insamplingwithreplacement,thesameobjectcanbepickedmorethanonce.Thesamplesproducedbythetwomethodsarenotmuchdifferentwhensamplesarerelativelysmallcomparedtothedatasetsize,butsamplingwithreplacementissimplertoanalyzebecausetheprobabilityofselectinganyobjectremainsconstantduringthesamplingprocess.
Whenthepopulationconsistsofdifferenttypesofobjects,withwidelydifferentnumbersofobjects,simplerandomsamplingcanfailtoadequatelyrepresentthosetypesofobjectsthatarelessfrequent.Thiscancauseproblemswhentheanalysisrequiresproperrepresentationofallobjecttypes.Forexample,whenbuildingclassificationmodelsforrareclasses,itiscriticalthattherareclassesbeadequatelyrepresentedinthesample.Hence,asamplingschemethatcanaccommodatedifferingfrequenciesfortheobjecttypesofinterestisneeded.Stratifiedsampling,whichstartswithprespecifiedgroupsofobjects,issuchanapproach.Inthesimplestversion,equalnumbersofobjectsaredrawnfromeachgroupeventhoughthegroupsareofdifferentsizes.Inanothervariation,thenumberofobjectsdrawnfromeachgroupisproportionaltothesizeofthatgroup.
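The following Python sketch illustrates simple random sampling (with and without replacement) and the simplest form of stratified sampling, drawing an equal number of objects from each group. The group labels and sizes are made up for illustration.

# Simple random and stratified sampling with NumPy (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1000)                      # object indices
labels = np.repeat([0, 1, 2], [900, 90, 10])      # three groups of very different sizes

without_repl = rng.choice(population, size=50, replace=False)   # sampling without replacement
with_repl = rng.choice(population, size=50, replace=True)       # sampling with replacement

# Stratified sampling: equal numbers from each group, so the rare group is represented.
per_group = 5
strata_sample = np.concatenate([
    rng.choice(population[labels == g], size=per_group, replace=False)
    for g in np.unique(labels)
])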
Example2.8(SamplingandLossofInformation).Onceasamplingtechniquehasbeenselected,itisstillnecessarytochoosethesamplesize.Largersamplesizesincreasetheprobabilitythatasamplewillberepresentative,buttheyalsoeliminatemuchoftheadvantageofsampling.Conversely,withsmallersamplesizes,patternscanbemissedorerroneouspatternscanbedetected.Figure2.9(a)showsadatasetthatcontains8000two-dimensionalpoints,whileFigures2.9(b) and2.9(c) showsamplesfromthisdatasetofsize2000and500,respectively.Althoughmostofthestructureofthisdatasetispresentinthesampleof2000points,muchofthestructureismissinginthesampleof500points.
Figure2.9.Exampleofthelossofstructurewithsampling.
Example2.9(DeterminingtheProperSampleSize).Toillustratethatdeterminingthepropersamplesizerequiresamethodicalapproach,considerthefollowingtask.
Givenasetofdataconsistingofasmallnumberofalmostequalsizedgroups,findatleastone
representativepointforeachofthegroups.Assumethattheobjectsineachgrouparehighly
similartoeachother,butnotverysimilartoobjectsindifferentgroups.Figure2.10(a) shows
anidealizedsetofclusters(groups)fromwhichthesepointsmightbedrawn.
Figure2.10.Findingrepresentativepointsfrom10groups.
Thisproblemcanbeefficientlysolvedusingsampling.Oneapproachistotakeasmallsampleofdatapoints,computethepairwisesimilaritiesbetweenpoints,andthenformgroupsofpointsthatarehighlysimilar.Thedesiredsetofrepresentativepointsisthenobtainedbytakingonepointfromeachofthesegroups.Tofollowthisapproach,however,weneedtodetermineasamplesizethatwouldguarantee,withahighprobability,thedesiredoutcome;thatis,thatatleastonepointwillbeobtainedfromeachcluster.Figure2.10(b) showstheprobabilityofgettingoneobjectfromeachofthe10groupsasthesamplesizerunsfrom10to60.Interestingly,withasamplesizeof20,thereislittlechance(20%)ofgettingasamplethatincludesall10clusters.Evenwithasamplesizeof30,thereisstillamoderatechance(almost40%)ofgettingasamplethatdoesn’tcontainobjectsfromall10clusters.ThisissueisfurtherexploredinthecontextofclusteringbyExercise4 onpage603.
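The probabilities plotted in Figure 2.10(b) can be approximated by a short simulation. The sketch below assumes sampling with replacement from 10 equal-sized groups, which is a simplification of the setting in the example, and estimates how often a sample of a given size contains at least one object from every group.

# Estimating the probability that a sample covers all 10 groups (simulation sketch).
import numpy as np

rng = np.random.default_rng(0)
num_groups, trials = 10, 10000

def prob_all_groups(sample_size):
    hits = 0
    for _ in range(trials):
        draws = rng.integers(0, num_groups, size=sample_size)
        if len(np.unique(draws)) == num_groups:
            hits += 1
    return hits / trials

for s in (10, 20, 30, 40, 50, 60):
    print(s, prob_all_groups(s))    # around 0.2 for a sample of size 20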
ProgressiveSamplingThepropersamplesizecanbedifficulttodetermine,soadaptiveorprogressivesamplingschemesaresometimesused.Theseapproachesstartwithasmallsample,andthenincreasethesamplesizeuntilasampleofsufficientsizehasbeenobtained.Whilethistechniqueeliminatestheneedtodeterminethecorrectsamplesizeinitially,itrequiresthattherebeawaytoevaluatethesampletojudgeifitislargeenough.
Suppose,forinstance,thatprogressivesamplingisusedtolearnapredictivemodel.Althoughtheaccuracyofpredictivemodelsincreasesasthesamplesizeincreases,atsomepointtheincreaseinaccuracylevelsoff.Wewanttostopincreasingthesamplesizeatthisleveling-offpoint.Bykeepingtrackofthechangeinaccuracyofthemodelaswetakeprogressivelylargersamples,andbytakingothersamplesclosetothesizeofthecurrentone,wecangetanestimateofhowclosewearetothisleveling-offpoint,andthus,stopsampling.
2.3.3DimensionalityReduction
Datasetscanhavealargenumberoffeatures.Considerasetofdocuments,whereeachdocumentisrepresentedbyavectorwhosecomponentsarethefrequencieswithwhicheachwordoccursinthedocument.Insuchcases,therearetypicallythousandsortensofthousandsofattributes(components),oneforeachwordinthevocabulary.Asanotherexample,considerasetoftimeseriesconsistingofthedailyclosingpriceofvariousstocksoveraperiodof30years.Inthiscase,theattributes,whicharethepricesonspecificdays,againnumberinthethousands.
Thereareavarietyofbenefitstodimensionalityreduction.Akeybenefitisthatmanydataminingalgorithmsworkbetterifthedimensionality—thenumberofattributesinthedata—islower.Thisispartlybecausedimensionalityreductioncaneliminateirrelevantfeaturesandreducenoiseandpartlybecauseofthecurseofdimensionality,whichisexplainedbelow.Anotherbenefitisthatareductionofdimensionalitycanleadtoamoreunderstandablemodelbecausethemodelusuallyinvolvesfewerattributes.Also,dimensionalityreductionmayallowthedatatobemoreeasilyvisualized.Evenifdimensionalityreductiondoesn’treducethedatatotwoorthreedimensions,dataisoftenvisualizedbylookingatpairsortripletsofattributes,andthenumberofsuchcombinationsisgreatlyreduced.Finally,theamountoftimeandmemoryrequiredbythedataminingalgorithmisreducedwithareductionindimensionality.
Thetermdimensionalityreductionisoftenreservedforthosetechniquesthatreducethedimensionalityofadatasetbycreatingnewattributesthatareacombinationoftheoldattributes.Thereductionofdimensionalitybyselectingattributesthatareasubsetoftheoldisknownasfeaturesubsetselectionorfeatureselection.ItwillbediscussedinSection2.3.4 .
Intheremainderofthissection,webrieflyintroducetwoimportanttopics:thecurseofdimensionalityanddimensionalityreductiontechniquesbasedonlinearalgebraapproachessuchasprincipalcomponentsanalysis(PCA).MoredetailsondimensionalityreductioncanbefoundinAppendixB.
TheCurseofDimensionalityThecurseofdimensionalityreferstothephenomenonthatmanytypesofdataanalysisbecomesignificantlyharderasthedimensionalityofthedataincreases.Specifically,asdimensionalityincreases,thedatabecomesincreasinglysparseinthespacethatitoccupies.Thus,thedataobjectswe
observearequitepossiblynotarepresentativesampleofallpossibleobjects.Forclassification,thiscanmeanthattherearenotenoughdataobjectstoallowthecreationofamodelthatreliablyassignsaclasstoallpossibleobjects.Forclustering,thedifferencesindensityandinthedistancesbetweenpoints,whicharecriticalforclustering,becomelessmeaningful.(ThisisdiscussedfurtherinSections8.1.2,8.4.6,and8.4.8.)Asaresult,manyclusteringandclassificationalgorithms(andotherdataanalysisalgorithms)havetroublewithhigh-dimensionaldataleadingtoreducedclassificationaccuracyandpoorqualityclusters.
LinearAlgebraTechniquesforDimensionalityReductionSomeofthemostcommonapproachesfordimensionalityreduction,particularlyforcontinuousdata,usetechniquesfromlinearalgebratoprojectthedatafromahigh-dimensionalspaceintoalower-dimensionalspace.PrincipalComponentsAnalysis(PCA)isalinearalgebratechniqueforcontinuousattributesthatfindsnewattributes(principalcomponents)that(1)arelinearcombinationsoftheoriginalattributes,(2)areorthogonal(perpendicular)toeachother,and(3)capturethemaximumamountofvariationinthedata.Forexample,thefirsttwoprincipalcomponentscaptureasmuchofthevariationinthedataasispossiblewithtwoorthogonalattributesthatarelinearcombinationsoftheoriginalattributes.SingularValueDecomposition(SVD)isalinearalgebratechniquethatisrelatedtoPCAandisalsocommonlyusedfordimensionalityreduction.Foradditionaldetails,seeAppendicesAandB.
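A minimal sketch of PCA, computed directly from the singular value decomposition of the mean-centered data matrix, is shown below. The data is randomly generated for illustration; the number of components kept (two) is an arbitrary choice.

# PCA via the SVD of the mean-centered data matrix (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 objects, 5 continuous attributes

Xc = X - X.mean(axis=0)                  # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                    # keep the first two principal components
X_reduced = Xc @ Vt[:k].T                # project onto the top-k components
explained = (S**2) / (S**2).sum()        # fraction of variance captured by each component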
2.3.4FeatureSubsetSelection
Anotherwaytoreducethedimensionalityistouseonlyasubsetofthefeatures.Whileitmightseemthatsuchanapproachwouldloseinformation,thisisnotthecaseifredundantandirrelevantfeaturesarepresent.Redundantfeaturesduplicatemuchoralloftheinformationcontainedinoneormoreotherattributes.Forexample,thepurchasepriceofaproductandtheamountofsalestaxpaidcontainmuchofthesameinformation.Irrelevantfeaturescontainalmostnousefulinformationforthedataminingtaskathand.Forinstance,students’IDnumbersareirrelevanttothetaskofpredictingstudents’gradepointaverages.Redundantandirrelevantfeaturescanreduceclassificationaccuracyandthequalityoftheclustersthatarefound.
While some irrelevant and redundant attributes can be eliminated immediately by using common sense or domain knowledge, selecting the best subset of features frequently requires a systematic approach. The ideal approach to feature selection is to try all possible subsets of features as input to the data mining algorithm of interest, and then take the subset that produces the best results. This method has the advantage of reflecting the objective and bias of the data mining algorithm that will eventually be used. Unfortunately, since the number of subsets involving n attributes is $2^n$, such an approach is impractical in most situations and alternative strategies are needed. There are three standard approaches to feature selection: embedded, filter, and wrapper.

Embedded approaches
Feature selection occurs naturally as part of the data mining algorithm. Specifically, during the operation of the data mining algorithm, the algorithm itself decides which attributes to use and which to ignore. Algorithms for building decision tree classifiers, which are discussed in Chapter 3, often operate in this manner.
Filterapproaches
Featuresareselectedbeforethedataminingalgorithmisrun,usingsomeapproachthatisindependentofthedataminingtask.Forexample,wemightselectsetsofattributeswhosepairwisecorrelationisaslowaspossiblesothattheattributesarenon-redundant.
Wrapperapproaches
Thesemethodsusethetargetdataminingalgorithmasablackboxtofindthebestsubsetofattributes,inawaysimilartothatoftheidealalgorithmdescribedabove,buttypicallywithoutenumeratingallpossiblesubsets.
Becausetheembeddedapproachesarealgorithm-specific,onlythefilterandwrapperapproacheswillbediscussedfurtherhere.
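As a small illustration of the filter idea, the sketch below greedily keeps attributes whose pairwise correlation with already-kept attributes stays below a threshold, so that a nearly duplicated attribute is dropped. The threshold and the synthetic data are illustrative assumptions, not part of the text.

# A simple correlation-based filter for feature subset selection (sketch).
import numpy as np

def filter_correlated(X, threshold=0.95):
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=200)])  # redundant copy of column 0
print(filter_correlated(X))    # the near-duplicate column is excluded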
AnArchitectureforFeatureSubsetSelectionItispossibletoencompassboththefilterandwrapperapproacheswithinacommonarchitecture.Thefeatureselectionprocessisviewedasconsistingoffourparts:ameasureforevaluatingasubset,asearchstrategythatcontrolsthegenerationofanewsubsetoffeatures,astoppingcriterion,andavalidationprocedure.Filtermethodsandwrappermethodsdifferonlyinthewayinwhichtheyevaluateasubsetoffeatures.Forawrappermethod,subsetevaluationusesthetargetdataminingalgorithm,whileforafilterapproach,theevaluationtechniqueisdistinctfromthetargetdataminingalgorithm.Thefollowingdiscussionprovidessomedetailsofthisapproach,whichissummarizedinFigure2.11 .
Figure2.11.Flowchartofafeaturesubsetselectionprocess.
Conceptually,featuresubsetselectionisasearchoverallpossiblesubsetsoffeatures.Manydifferenttypesofsearchstrategiescanbeused,butthesearchstrategyshouldbecomputationallyinexpensiveandshouldfindoptimalornearoptimalsetsoffeatures.Itisusuallynotpossibletosatisfybothrequirements,andthus,trade-offsarenecessary.
Anintegralpartofthesearchisanevaluationsteptojudgehowthecurrentsubsetoffeaturescomparestoothersthathavebeenconsidered.Thisrequiresanevaluationmeasurethatattemptstodeterminethegoodnessofasubsetofattributeswithrespecttoaparticulardataminingtask,suchasclassificationorclustering.Forthefilterapproach,suchmeasuresattempttopredicthowwelltheactualdataminingalgorithmwillperformonagivensetofattributes.Forthewrapperapproach,whereevaluationconsistsofactuallyrunningthetargetdataminingalgorithm,thesubsetevaluationfunctionissimplythecriterionnormallyusedtomeasuretheresultofthedatamining.
Becausethenumberofsubsetscanbeenormousanditisimpracticaltoexaminethemall,somesortofstoppingcriterionisnecessary.Thisstrategyisusuallybasedononeormoreconditionsinvolvingthefollowing:thenumberofiterations,whetherthevalueofthesubsetevaluationmeasureisoptimalorexceedsacertainthreshold,whetherasubsetofacertainsizehasbeenobtained,andwhetheranyimprovementcanbeachievedbytheoptionsavailabletothesearchstrategy.
Finally,onceasubsetoffeatureshasbeenselected,theresultsofthetargetdataminingalgorithmontheselectedsubsetshouldbevalidated.Astraightforwardvalidationapproachistorunthealgorithmwiththefullsetoffeaturesandcomparethefullresultstoresultsobtainedusingthesubsetoffeatures.Hopefully,thesubsetoffeatureswillproduceresultsthatarebetterthanoralmostasgoodasthoseproducedwhenusingallfeatures.Anothervalidationapproachistouseanumberofdifferentfeatureselectionalgorithmstoobtainsubsetsoffeaturesandthencomparetheresultsofrunningthedataminingalgorithmoneachsubset.
FeatureWeightingFeatureweightingisanalternativetokeepingoreliminatingfeatures.Moreimportantfeaturesareassignedahigherweight,whilelessimportantfeaturesaregivenalowerweight.Theseweightsaresometimesassignedbasedondomainknowledgeabouttherelativeimportanceoffeatures.Alternatively,theycansometimesbedeterminedautomatically.Forexample,someclassificationschemes,suchassupportvectormachines(Chapter4 ),produceclassificationmodelsinwhicheachfeatureisgivenaweight.Featureswithlargerweightsplayamoreimportantroleinthemodel.Thenormalizationofobjectsthattakesplacewhencomputingthecosinesimilarity(Section2.4.5 )canalsoberegardedasatypeoffeatureweighting.
2.3.5FeatureCreation
Itisfrequentlypossibletocreate,fromtheoriginalattributes,anewsetofattributesthatcapturestheimportantinformationinadatasetmuchmoreeffectively.Furthermore,thenumberofnewattributescanbesmallerthanthenumberoforiginalattributes,allowingustoreapallthepreviouslydescribedbenefitsofdimensionalityreduction.Tworelatedmethodologiesforcreatingnewattributesaredescribednext:featureextractionandmappingthedatatoanewspace.
FeatureExtractionThecreationofanewsetoffeaturesfromtheoriginalrawdataisknownasfeatureextraction.Considerasetofphotographs,whereeachphotographistobeclassifiedaccordingtowhetheritcontainsahumanface.Therawdataisasetofpixels,andassuch,isnotsuitableformanytypesofclassificationalgorithms.However,ifthedataisprocessedtoprovidehigher-levelfeatures,suchasthepresenceorabsenceofcertaintypesofedgesandareasthatarehighlycorrelatedwiththepresenceofhumanfaces,thenamuchbroadersetofclassificationtechniquescanbeappliedtothisproblem.
Unfortunately,inthesenseinwhichitismostcommonlyused,featureextractionishighlydomain-specific.Foraparticularfield,suchasimageprocessing,variousfeaturesandthetechniquestoextractthemhavebeendevelopedoveraperiodoftime,andoftenthesetechniqueshavelimitedapplicabilitytootherfields.Consequently,wheneverdataminingisappliedtoarelativelynewarea,akeytaskisthedevelopmentofnewfeaturesandfeatureextractionmethods.
Althoughfeatureextractionisoftencomplicated,Example2.10 illustratesthatitcanberelativelystraightforward.
Example2.10(Density).Consideradatasetconsistingofinformationabouthistoricalartifacts,which,alongwithotherinformation,containsthevolumeandmassofeachartifact.Forsimplicity,assumethattheseartifactsaremadeofasmallnumberofmaterials(wood,clay,bronze,gold)andthatwewanttoclassifytheartifactswithrespecttothematerialofwhichtheyaremade.Inthiscase,adensityfeatureconstructedfromthemassandvolumefeatures,i.e.,density=mass/volume,wouldmostdirectlyyieldanaccurateclassification.Althoughtherehavebeensomeattemptstoautomaticallyperformsuchsimplefeatureextractionbyexploringbasicmathematicalcombinationsofexistingattributes,themostcommonapproachistoconstructfeaturesusingdomainexpertise.
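In code, a constructed feature of this kind is a one-line transformation of existing attributes, as in the sketch below; the mass and volume values are hypothetical.

# Constructing the density feature of Example 2.10 from mass and volume (illustrative values).
import numpy as np

mass = np.array([10.0, 42.0, 19.3])     # grams
volume = np.array([12.5, 4.7, 1.0])     # cubic centimeters
density = mass / volume                 # new, more discriminative attribute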
MappingtheDatatoaNewSpaceAtotallydifferentviewofthedatacanrevealimportantandinterestingfeatures.Consider,forexample,timeseriesdata,whichoftencontainsperiodicpatterns.Ifthereisonlyasingleperiodicpatternandnotmuchnoise,thenthepatterniseasilydetected.If,ontheotherhand,thereareanumberofperiodicpatternsandasignificantamountofnoise,thenthesepatternsarehardtodetect.Suchpatternscan,nonetheless,oftenbedetectedbyapplyingaFouriertransformtothetimeseriesinordertochangetoarepresentationinwhichfrequencyinformationisexplicit.InExample2.11 ,itwillnotbenecessarytoknowthedetailsoftheFouriertransform.Itisenoughtoknowthat,foreachtimeseries,theFouriertransformproducesanewdataobjectwhoseattributesarerelatedtofrequencies.
Example2.11(FourierAnalysis).ThetimeseriespresentedinFigure2.12(b) isthesumofthreeothertimeseries,twoofwhichareshowninFigure2.12(a) andhavefrequenciesof7and17cyclespersecond,respectively.Thethirdtimeseriesisrandomnoise.Figure2.12(c) showsthepowerspectrumthatcanbecomputedafterapplyingaFouriertransformtotheoriginaltimeseries.(Informally,thepowerspectrumisproportionaltothesquareofeachfrequencyattribute.)Inspiteofthenoise,therearetwopeaksthatcorrespondtotheperiodsofthetwooriginal,non-noisytimeseries.Again,themainpointisthatbetterfeaturescanrevealimportantaspectsofthedata.
Figure2.12.ApplicationoftheFouriertransformtoidentifytheunderlyingfrequenciesintimeseriesdata.
Manyothersortsoftransformationsarealsopossible.BesidestheFouriertransform,thewavelettransformhasalsoprovenveryusefulfortimeseriesandothertypesofdata.
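The frequency-domain view of Example 2.11 can be reproduced with a fast Fourier transform, as in the sketch below. The sampling rate, duration, and noise level are assumptions made for illustration; the two dominant peaks of the power spectrum recover the 7 and 17 cycle-per-second components.

# Mapping a noisy time series to the frequency domain with NumPy's FFT (sketch).
import numpy as np

fs = 200                                    # samples per second (assumed)
t = np.arange(0, 2, 1 / fs)                 # two seconds of data
series = (np.sin(2 * np.pi * 7 * t) +
          np.sin(2 * np.pi * 17 * t) +
          np.random.default_rng(0).normal(scale=0.5, size=t.size))

spectrum = np.abs(np.fft.rfft(series)) ** 2       # power spectrum
freqs = np.fft.rfftfreq(series.size, d=1 / fs)
print(freqs[spectrum.argsort()[-2:]])             # the two dominant frequencies (about 7 and 17 Hz)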
2.3.6DiscretizationandBinarization
Somedataminingalgorithms,especiallycertainclassificationalgorithms,requirethatthedatabeintheformofcategoricalattributes.Algorithmsthatfindassociationpatternsrequirethatthedatabeintheformofbinaryattributes.Thus,itisoftennecessarytotransformacontinuousattributeintoacategoricalattribute(discretization),andbothcontinuousanddiscreteattributesmayneedtobetransformedintooneormorebinaryattributes(binarization).Additionally,ifacategoricalattributehasalargenumberofvalues(categories),orsomevaluesoccurinfrequently,thenitcanbebeneficialforcertaindataminingtaskstoreducethenumberofcategoriesbycombiningsomeofthevalues.
Aswithfeatureselection,thebestdiscretizationorbinarizationapproachistheonethat“producesthebestresultforthedataminingalgorithmthatwillbeusedtoanalyzethedata.”Itistypicallynotpracticaltoapplysuchacriteriondirectly.Consequently,discretizationorbinarizationisperformedinawaythatsatisfiesacriterionthatisthoughttohavearelationshiptogoodperformanceforthedataminingtaskbeingconsidered.Ingeneral,thebestdiscretizationdependsonthealgorithmbeingused,aswellastheotherattributesbeingconsidered.Typically,however,thediscretizationofeachattributeisconsideredinisolation.
Binarization
A simple technique to binarize a categorical attribute is the following: If there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m − 1]. If the attribute is ordinal, then order must be maintained by the assignment. (Note that even if the attribute is originally represented using integers, this process is necessary if the integers are not in the interval [0, m − 1].) Next, convert each of these m integers to a binary number. Since $n = \lceil \log_2(m) \rceil$ binary digits are required to represent these integers, represent these binary numbers using n binary attributes. To illustrate, a categorical variable with 5 values {awful, poor, OK, good, great} would require three binary variables $x_1$, $x_2$, and $x_3$. The conversion is shown in Table 2.5.
Table 2.5. Conversion of a categorical attribute to three binary attributes.
Categorical Value  Integer Value  x1  x2  x3
awful   0  0  0  0
poor    1  0  0  1
OK      2  0  1  0
good    3  0  1  1
great   4  1  0  0
Such a transformation can cause complications, such as creating unintended relationships among the transformed attributes. For example, in Table 2.5, attributes x2 and x3 are correlated because information about the good value is encoded using both attributes. Furthermore, association analysis requires asymmetric binary attributes, where only the presence of the attribute (value = 1) is important. For association problems, it is therefore necessary to introduce one asymmetric binary attribute for each categorical value, as shown in Table 2.6. If the number of resulting attributes is too large, then the techniques described in the following sections can be used to reduce the number of categorical values before binarization.
Table 2.6. Conversion of a categorical attribute to five asymmetric binary attributes.
Categorical Value  Integer Value  x1  x2  x3  x4  x5
awful   0  1  0  0  0  0
poor    1  0  1  0  0  0
OK      2  0  0  1  0  0
good    3  0  0  0  1  0
great   4  0  0  0  0  1
Likewise,forassociationproblems,itcanbenecessarytoreplaceasinglebinaryattributewithtwoasymmetricbinaryattributes.Considerabinaryattributethatrecordsaperson’sgender,maleorfemale.Fortraditionalassociationrulealgorithms,thisinformationneedstobetransformedintotwoasymmetricbinaryattributes,onethatisa1onlywhenthepersonismaleandonethatisa1onlywhenthepersonisfemale.(Forasymmetricbinaryattributes,theinformationrepresentationissomewhatinefficientinthattwobitsofstoragearerequiredtorepresenteachbitofinformation.)
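The two encodings of Tables 2.5 and 2.6 can be produced as in the Python sketch below. The use of pandas, the category order, and the helper expressions are illustrative assumptions; the first mapping gives the compact n-bit encoding of Table 2.5, and the one-hot encoding corresponds to the asymmetric binary attributes of Table 2.6.

# Integer coding, n-bit binarization, and asymmetric (one-hot) binarization (sketch).
import pandas as pd

quality = pd.Series(["awful", "poor", "OK", "good", "great"],
                    dtype=pd.CategoricalDtype(
                        ["awful", "poor", "OK", "good", "great"], ordered=True))

codes = quality.cat.codes                              # integers 0..4, preserving order
binary_n = codes.map(lambda c: format(int(c), "03b"))  # 3 binary digits, as in Table 2.5
asymmetric = pd.get_dummies(quality)                   # one asymmetric binary attribute per value (Table 2.6)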
Discretization of Continuous Attributes
Discretization is typically applied to attributes that are used in classification or association analysis. Transformation of a continuous attribute to a categorical attribute involves two subtasks: deciding how many categories, n, to have and determining how to map the values of the continuous attribute to these categories. In the first step, after the values of the continuous attribute are sorted, they are then divided into n intervals by specifying n − 1 split points. In the second, rather trivial step, all the values in one interval are mapped to the same categorical value. Therefore, the problem of discretization is one of deciding how many split points to choose and where to place them. The result can be represented either as a set of intervals $\{(x_0, x_1], (x_1, x_2], \ldots, (x_{n-1}, x_n)\}$, where $x_0$ and $x_n$ can be $+\infty$ or $-\infty$, respectively, or equivalently, as a series of inequalities $x_0 < x \le x_1, \ldots, x_{n-1} < x < x_n$.

Unsupervised Discretization
A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised). If class information is not used, then relatively simple approaches are common. For instance, the equal width approach divides the range of the attribute into a user-specified number of intervals each having the same width. Such an approach can be badly affected by outliers, and for that reason, an equal frequency (equal depth) approach, which tries to put the same number of objects into each interval, is often preferred. As another example of unsupervised discretization, a clustering method, such as K-means (see Chapter 7), can also be used. Finally, visually inspecting the data can sometimes be an effective approach.

Example 2.12 (Discretization Techniques). This example demonstrates how these approaches work on an actual data set. Figure 2.13(a) shows data points belonging to four different groups, along with two outliers—the large dots on either end. The techniques of the previous paragraph were applied to discretize the x values of these data points into four categorical values. (Points in the data set have a random y component to make it easy to see how many points are in each group.) Visually inspecting the data works quite well, but is not automatic, and thus, we focus on the other three approaches. The split points produced by the techniques equal width, equal frequency, and K-means are shown in
Figures2.13(b) ,2.13(c) ,and2.13(d) ,respectively.Thesplitpointsarerepresentedasdashedlines.
Figure2.13.Differentdiscretizationtechniques.
Inthisparticularexample,ifwemeasuretheperformanceofadiscretizationtechniquebytheextenttowhichdifferentobjectsthatclumptogetherhavethesamecategoricalvalue,thenK-meansperformsbest,followedbyequalfrequency,andfinally,equalwidth.Moregenerally,thebestdiscretizationwilldependontheapplicationandofteninvolvesdomain-specificdiscretization.Forexample,thediscretizationofpeopleintolowincome,middleincome,andhighincomeisbasedoneconomicfactors.
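The three unsupervised approaches compared in Example 2.12 can be sketched as follows. The data here is synthetic (four well-separated groups), and the one-dimensional K-means loop is a deliberately simple stand-in for a full clustering routine.

# Equal width, equal frequency, and (1-D) K-means discretization (sketch).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(loc, 0.5, 50) for loc in (0, 4, 8, 12)])

# Equal width: 4 intervals of identical length.
equal_width = np.digitize(x, np.linspace(x.min(), x.max(), 5)[1:-1])

# Equal frequency: 4 intervals with roughly the same number of points.
equal_freq = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))

# K-means in one dimension: repeatedly assign each value to the nearest of 4 centroids.
centroids = np.quantile(x, [0.125, 0.375, 0.625, 0.875])   # crude initialization
for _ in range(20):
    assign = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
    centroids = np.array([x[assign == k].mean() for k in range(4)])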
SupervisedDiscretization
Ifclassificationisourapplicationandclasslabelsareknownforsomedataobjects,thendiscretizationapproachesthatuseclasslabelsoftenproducebetterclassification.Thisshouldnotbesurprising,sinceanintervalconstructedwithnoknowledgeofclasslabelsoftencontainsamixtureofclasslabels.Aconceptuallysimpleapproachistoplacethesplitsinawaythatmaximizesthepurityoftheintervals,i.e.,theextenttowhichanintervalcontainsasingleclasslabel.Inpractice,however,suchanapproachrequirespotentiallyarbitrarydecisionsaboutthepurityofanintervalandtheminimumsizeofaninterval.
Toovercomesuchconcerns,somestatisticallybasedapproachesstartwitheachattributevalueinaseparateintervalandcreatelargerintervalsbymergingadjacentintervalsthataresimilaraccordingtoastatisticaltest.Analternativetothisbottom-upapproachisatop-downapproachthatstartsbybisectingtheinitialvaluessothattheresultingtwointervalsgiveminimumentropy.Thistechniqueonlyneedstoconsidereachvalueasapossiblesplitpoint,becauseitisassumedthatintervalscontainorderedsetsofvalues.Thesplittingprocessisthenrepeatedwithanotherinterval,typicallychoosingtheintervalwiththeworst(highest)entropy,untilauser-specifiednumberofintervalsisreached,orastoppingcriterionissatisfied.
Entropy-based approaches are one of the most promising approaches to discretization, whether bottom-up or top-down. First, it is necessary to define entropy. Let k be the number of different class labels, $m_i$ be the number of values in the $i^{th}$ interval of a partition, and $m_{ij}$ be the number of values of class j in interval i. Then the entropy $e_i$ of the $i^{th}$ interval is given by the equation

$e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij},$

where $p_{ij} = m_{ij}/m_i$ is the probability (fraction of values) of class j in the $i^{th}$ interval. The total entropy, e, of the partition is the weighted average of the individual interval entropies, i.e.,

$e = \sum_{i=1}^{n} w_i e_i,$

where m is the number of values, $w_i = m_i/m$ is the fraction of values in the $i^{th}$ interval, and n is the number of intervals. Intuitively, the entropy of an interval is a measure of the purity of an interval. If an interval contains only values of one class (is perfectly pure), then the entropy is 0 and it contributes nothing to the overall entropy. If the classes of values in an interval occur equally often (the interval is as impure as possible), then the entropy is a maximum.

Example 2.13 (Discretization of Two Attributes). The top-down method based on entropy was used to independently discretize both the x and y attributes of the two-dimensional data shown in Figure 2.14. In the first discretization, shown in Figure 2.14(a), the x and y attributes were both split into three intervals. (The dashed lines indicate the split points.) In the second discretization, shown in Figure 2.14(b), the x and y attributes were both split into five intervals.
Figure2.14.Discretizingxandyattributesforfourgroups(classes)ofpoints.
Thissimpleexampleillustratestwoaspectsofdiscretization.First,intwodimensions,theclassesofpointsarewellseparated,butinonedimension,thisisnotso.Ingeneral,discretizingeachattributeseparatelyoftenguaranteessuboptimalresults.Second,fiveintervalsworkbetterthanthree,butsixintervalsdonotimprovethediscretizationmuch,atleastintermsofentropy.(Entropyvaluesandresultsforsixintervalsarenotshown.)Consequently,itisdesirabletohaveastoppingcriterionthatautomaticallyfindstherightnumberofpartitions.
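The interval entropy and total entropy defined above are straightforward to compute. The sketch below assumes that each candidate interval is represented simply as a list of the class labels of the values that fall in it; the two example partitions are made up to contrast a pure and an impure split.

# Interval entropy and weighted total entropy of a partition (sketch).
import numpy as np

def interval_entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def total_entropy(intervals):
    m = sum(len(v) for v in intervals)
    return sum(len(v) / m * interval_entropy(v) for v in intervals)

pure = [["a", "a", "a"], ["b", "b", "b"]]      # each interval holds a single class
mixed = [["a", "b", "a"], ["b", "a", "b"]]     # each interval mixes the classes
print(total_entropy(pure), total_entropy(mixed))   # 0.0 versus about 0.92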
CategoricalAttributeswithTooManyValuesCategoricalattributescansometimeshavetoomanyvalues.Ifthecategoricalattributeisanordinalattribute,thentechniquessimilartothoseforcontinuousattributescanbeusedtoreducethenumberofcategories.Ifthecategoricalattributeisnominal,however,thenotherapproachesareneeded.Considera
universitythathasalargenumberofdepartments.Consequently,adepartmentnameattributemighthavedozensofdifferentvalues.Inthissituation,wecoulduseourknowledgeoftherelationshipsamongdifferentdepartmentstocombinedepartmentsintolargergroups,suchasengineering,socialsciences,orbiologicalsciences.Ifdomainknowledgedoesnotserveasausefulguideorsuchanapproachresultsinpoorclassificationperformance,thenitisnecessarytouseamoreempiricalapproach,suchasgroupingvaluestogetheronlyifsuchagroupingresultsinimprovedclassificationaccuracyorachievessomeotherdataminingobjective.
2.3.7VariableTransformation
Avariabletransformationreferstoatransformationthatisappliedtoallthevaluesofavariable.(Weusethetermvariableinsteadofattributetoadheretocommonusage,althoughwewillalsorefertoattributetransformationonoccasion.)Inotherwords,foreachobject,thetransformationisappliedtothevalueofthevariableforthatobject.Forexample,ifonlythemagnitudeofavariableisimportant,thenthevaluesofthevariablecanbetransformedbytakingtheabsolutevalue.Inthefollowingsection,wediscusstwoimportanttypesofvariabletransformations:simplefunctionaltransformationsandnormalization.
Simple Functions
For this type of variable transformation, a simple mathematical function is applied to each value individually. If x is a variable, then examples of such transformations include $x^k$, $\log x$, $e^x$, $\sqrt{x}$, $1/x$, $\sin x$, or $|x|$. In statistics, variable transformations, especially sqrt, log, and 1/x, are often used to transform data that does not have a Gaussian (normal) distribution into data that does. While this can be important, other reasons often take precedence in data mining. Suppose the variable of interest is the number of data bytes in a session, and the number of bytes ranges from 1 to 1 billion. This is a huge range, and it can be advantageous to compress it by using a $\log_{10}$ transformation. In this case, sessions that transferred $10^8$ and $10^9$ bytes would be more similar to each other than sessions that transferred 10 and 1000 bytes (9 − 8 = 1 versus 3 − 1 = 2). For some applications, such as network intrusion detection, this may be what is desired, since the first two sessions most likely represent transfers of large files, while the latter two sessions could be two quite distinct types of sessions.

Variable transformations should be applied with caution because they change the nature of the data. While this is what is desired, there can be problems if the nature of the transformation is not fully appreciated. For instance, the transformation 1/x reduces the magnitude of values that are 1 or larger, but increases the magnitude of values between 0 and 1. To illustrate, the values {1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}. Thus, for all sets of values, the transformation 1/x reverses the order. To help clarify the effect of a transformation, it is important to ask questions such as the following: What is the desired property of the transformed attribute? Does the order need to be maintained? Does the transformation apply to all values, especially negative values and 0? What is the effect of the transformation on the values between 0 and 1? Exercise 17 on page 109 explores other aspects of variable transformation.

Normalization or Standardization
The goal of standardization or normalization is to make an entire set of values have a particular property. A traditional example is that of "standardizing a variable" in statistics. If $\bar{x}$ is the mean (average) of the attribute values and $s_x$ is their standard deviation, then the transformation $x' = (x - \bar{x})/s_x$ creates a new variable that has a mean of 0 and a standard deviation of 1. If different variables are to be used together, e.g., for clustering, then such a transformation is often necessary to avoid having a variable with large values dominate the results of the analysis. To illustrate, consider comparing people based on two variables: age and income. For any two people, the difference in income will likely be much higher in absolute terms (hundreds or thousands of dollars) than the difference in age (less than 150). If the differences in the range of values of age and income are not taken into account, then the comparison between people will be dominated by differences in income. In particular, if the similarity or dissimilarity of two people is calculated using the similarity or dissimilarity measures defined later in this chapter, then in many cases, such as that of Euclidean distance, the income values will dominate the calculation.

The mean and standard deviation are strongly affected by outliers, so the above transformation is often modified. First, the mean is replaced by the median, i.e., the middle value. Second, the standard deviation is replaced by the absolute standard deviation. Specifically, if x is a variable, then the absolute standard deviation of x is given by $\sigma_A = \sum_{i=1}^{m} |x_i - \mu|$, where $x_i$ is the $i^{th}$ value of the variable, m is the number of objects, and $\mu$ is either the mean or median. Other approaches for computing estimates of the location (center) and spread of a set of values in the presence of outliers are described in statistics books. These more robust measures can also be used to define a standardization transformation.
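The standard and the outlier-resistant variants of standardization described above are compared in the sketch below; the small set of values with one extreme outlier is invented for illustration.

# Classic z-score standardization versus a median/absolute-deviation variant (sketch).
import numpy as np

x = np.array([2.0, 3.0, 4.0, 5.0, 200.0])      # one extreme outlier

z = (x - x.mean()) / x.std(ddof=1)             # classic standardization

mu = np.median(x)                              # robust center
sigma_a = np.abs(x - mu).sum()                 # absolute standard deviation, as defined above
z_robust = (x - mu) / sigma_a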
2.4MeasuresofSimilarityandDissimilaritySimilarityanddissimilarityareimportantbecausetheyareusedbyanumberofdataminingtechniques,suchasclustering,nearestneighborclassification,andanomalydetection.Inmanycases,theinitialdatasetisnotneededoncethesesimilaritiesordissimilaritieshavebeencomputed.Suchapproachescanbeviewedastransformingthedatatoasimilarity(dissimilarity)spaceandthenperformingtheanalysis.Indeed,kernelmethodsareapowerfulrealizationofthisidea.ThesemethodsareintroducedinSection2.4.7 andarediscussedmorefullyinthecontextofclassificationinSection4.9.4.
Webeginwithadiscussionofthebasics:high-leveldefinitionsofsimilarityanddissimilarity,andadiscussionofhowtheyarerelated.Forconvenience,thetermproximityisusedtorefertoeithersimilarityordissimilarity.Sincetheproximitybetweentwoobjectsisafunctionoftheproximitybetweenthecorrespondingattributesofthetwoobjects,wefirstdescribehowtomeasuretheproximitybetweenobjectshavingonlyoneattribute.
Wethenconsiderproximitymeasuresforobjectswithmultipleattributes.ThisincludesmeasuressuchastheJaccardandcosinesimilaritymeasures,whichareusefulforsparsedata,suchasdocuments,aswellascorrelationandEuclideandistance,whichareusefulfornon-sparse(dense)data,suchastimeseriesormulti-dimensionalpoints.Wealsoconsidermutualinformation,whichcanbeappliedtomanytypesofdataandisgoodfordetectingnonlinearrelationships.Inthisdiscussion,werestrictourselvestoobjectswithrelativelyhomogeneousattributetypes,typicallybinaryorcontinuous.
Next,weconsiderseveralimportantissuesconcerningproximitymeasures.Thisincludeshowtocomputeproximitybetweenobjectswhentheyhaveheterogeneoustypesofattributes,andapproachestoaccountfordifferencesofscaleandcorrelationamongvariableswhencomputingdistancebetweennumericalobjects.Thesectionconcludeswithabriefdiscussionofhowtoselecttherightproximitymeasure.
Althoughthissectionfocusesonthecomputationofproximitybetweendataobjects,proximitycanalsobecomputedbetweenattributes.Forexample,forthedocument-termmatrixofFigure2.2(d) ,thecosinemeasurecanbeusedtocomputesimilaritybetweenapairofdocumentsorapairofterms(words).Knowingthattwovariablesarestronglyrelatedcan,forexample,behelpfulforeliminatingredundancy.Inparticular,thecorrelationandmutualinformationmeasuresdiscussedlaterareoftenusedforthatpurpose.
2.4.1Basics
DefinitionsInformally,thesimilaritybetweentwoobjectsisanumericalmeasureofthedegreetowhichthetwoobjectsarealike.Consequently,similaritiesarehigherforpairsofobjectsthataremorealike.Similaritiesareusuallynon-negativeandareoftenbetween0(nosimilarity)and1(completesimilarity).
Thedissimilaritybetweentwoobjectsisanumericalmeasureofthedegreetowhichthetwoobjectsaredifferent.Dissimilaritiesarelowerformoresimilarpairsofobjects.Frequently,thetermdistanceisusedasasynonymfordissimilarity,although,asweshallsee,distanceoftenreferstoaspecialclass
ofdissimilarities.Dissimilaritiessometimesfallintheinterval[0,1],butitisalsocommonforthemtorangefrom0to∞.
TransformationsTransformationsareoftenappliedtoconvertasimilaritytoadissimilarity,orviceversa,ortotransformaproximitymeasuretofallwithinaparticularrange,suchas[0,1].Forinstance,wemayhavesimilaritiesthatrangefrom1to10,buttheparticularalgorithmorsoftwarepackagethatwewanttousemaybedesignedtoworkonlywithdissimilarities,oritmayworkonlywithsimilaritiesintheinterval[0,1].Wediscusstheseissuesherebecausewewillemploysuchtransformationslaterinourdiscussionofproximity.Inaddition,theseissuesarerelativelyindependentofthedetailsofspecificproximitymeasures.
Frequently, proximity measures, especially similarities, are defined or transformed to have values in the interval [0,1]. Informally, the motivation for this is to use a scale in which a proximity value indicates the fraction of similarity (or dissimilarity) between two objects. Such a transformation is often relatively straightforward. For example, if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range [0, 1] by using the transformation s′ = (s − 1)/9, where s and s′ are the original and new similarity values, respectively. In the more general case, the transformation of similarities to the interval [0, 1] is given by the expression s′ = (s − min_s)/(max_s − min_s), where max_s and min_s are the maximum and minimum similarity values, respectively. Likewise, dissimilarity measures with a finite range can be mapped to the interval [0, 1] by using the formula d′ = (d − min_d)/(max_d − min_d). This is an example of a linear transformation, which preserves the relative distances between points. In other words, if points $x_1$ and $x_2$ are twice as far apart as points $x_3$ and $x_4$, the same will be true after a linear transformation.

However, there can be complications in mapping proximity measures to the interval [0,1] using a linear transformation. If, for example, the proximity measure originally takes values in the interval $[0, \infty]$, then max_d is not defined and a nonlinear transformation is needed. Values will not have the same relationship to one another on the new scale. Consider the transformation d′ = d/(1 + d) for a dissimilarity measure that ranges from 0 to $\infty$. The dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be transformed into the new dissimilarities 0, 0.33, 0.67, 0.90, 0.99, and 0.999, respectively. Larger values on the original dissimilarity scale are compressed into the range of values near 1, but whether this is desirable depends on the application.

Note that mapping proximity measures to the interval [0,1] can also change the meaning of the proximity measure. For example, correlation, which is discussed later, is a measure of similarity that takes values in the interval $[-1, 1]$. Mapping these values to the interval [0,1] by taking the absolute value loses information about the sign, which can be important in some applications. See Exercise 22 on page 111.

Transforming similarities to dissimilarities and vice versa is also relatively straightforward, although we again face the issues of preserving meaning and changing a linear scale into a nonlinear scale. If the similarity (or dissimilarity) falls in the interval [0,1], then the dissimilarity can be defined as d = 1 − s (s = 1 − d). Another simple approach is to define similarity as the negative of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0, 1, 10, and 100 can be transformed into the similarities 0, −1, −10, and −100, respectively.

The similarities resulting from the negation transformation are not restricted to the range [0,1], but if that is desired, then transformations such as s = 1/(d + 1), s = $e^{-d}$, or s = 1 − (d − min_d)/(max_d − min_d) can be used. For the transformation s = 1/(d + 1), the dissimilarities 0, 1, 10, 100 are transformed into 1, 0.5, 0.09, 0.01, respectively. For s = $e^{-d}$, they become 1.00, 0.37, 0.00, 0.00, respectively, while for s = 1 − (d − min_d)/(max_d − min_d), they become 1.00, 0.99, 0.90, 0.00, respectively. In this discussion, we have focused on converting dissimilarities to similarities. Conversion in the opposite direction is considered in Exercise 23 on page 111.
Ingeneral,anymonotonicdecreasingfunctioncanbeusedtoconvertdissimilaritiestosimilarities,orviceversa.Ofcourse,otherfactorsalsomustbeconsideredwhentransformingsimilaritiestodissimilarities,orviceversa,orwhentransformingthevaluesofaproximitymeasuretoanewscale.Wehavementionedissuesrelatedtopreservingmeaning,distortionofscale,andrequirementsofdataanalysistools,butthislistiscertainlynotexhaustive.
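The dissimilarity-to-similarity transformations just discussed are easy to verify numerically. The sketch below applies them to the same example dissimilarities 0, 1, 10, and 100 used above.

# A few dissimilarity-to-similarity transformations (sketch).
import numpy as np

d = np.array([0.0, 1.0, 10.0, 100.0])

s_recip = 1.0 / (d + 1.0)                               # s = 1/(d+1)
s_exp = np.exp(-d)                                      # s = e^{-d}
s_linear = 1.0 - (d - d.min()) / (d.max() - d.min())    # linear rescaling to [0, 1]
print(np.round(s_recip, 2), np.round(s_exp, 2), np.round(s_linear, 2))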
2.4.2SimilarityandDissimilaritybetweenSimpleAttributes
Theproximityofobjectswithanumberofattributesistypicallydefinedbycombiningtheproximitiesofindividualattributes,andthus,wefirstdiscussproximitybetweenobjectshavingasingleattribute.Considerobjectsdescribedbyonenominalattribute.Whatwoulditmeanfortwosuchobjectstobesimilar?Becausenominalattributesconveyonlyinformationaboutthedistinctnessofobjects,allwecansayisthattwoobjectseitherhavethesamevalueortheydonot.Hence,inthiscasesimilarityistraditionallydefinedas1ifattributevaluesmatch,andas0otherwise.Adissimilaritywouldbedefinedintheoppositeway:0iftheattributevaluesmatch,and1iftheydonot.
For objects with a single ordinal attribute, the situation is more complicated because information about order should be taken into account. Consider an attribute that measures the quality of a product, e.g., a candy bar, on the scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a product, P1, which is rated wonderful, would be closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g., {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}. Then, d(P1, P2) = 3 − 2 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (3 − 2)/4 = 0.25. A similarity for ordinal attributes can then be defined as s = 1 − d.

This definition of similarity (dissimilarity) for an ordinal attribute should make the reader a bit uneasy since this assumes equal intervals between successive values of the attribute, and this is not necessarily so. Otherwise, we would have an interval or ratio attribute. Is the difference between the values fair and good really the same as that between the values OK and wonderful? Probably not, but in practice, our options are limited, and in the absence of more information, this is the standard approach for defining proximity between ordinal attributes.

For interval or ratio attributes, the natural measure of dissimilarity between two objects is the absolute difference of their values. For example, we might compare our current weight and our weight a year ago by saying "I am ten pounds heavier." In cases such as these, the dissimilarities typically range from 0 to $\infty$, rather than from 0 to 1. The similarity of interval or ratio attributes is typically expressed by transforming a dissimilarity into a similarity, as previously described.

Table 2.7 summarizes this discussion. In this table, x and y are two objects that have one attribute of the indicated type. Also, d(x, y) and s(x, y) are the dissimilarity and similarity between x and y, respectively. Other approaches are possible; these are the most common ones.

Table 2.7. Similarity and dissimilarity for simple attributes.
Attribute Type | Dissimilarity | Similarity
Nominal | d = 0 if x = y; d = 1 if x ≠ y | s = 1 if x = y; s = 0 if x ≠ y
Ordinal | d = |x − y|/(n − 1) (values mapped to integers 0 to n − 1, where n is the number of values) | s = 1 − d
Interval or Ratio | d = |x − y| | s = −d, s = 1/(1 + d), s = e^{−d}, s = 1 − (d − min_d)/(max_d − min_d)
Thefollowingtwosectionsconsidermorecomplicatedmeasuresofproximitybetweenobjectsthatinvolvemultipleattributes:(1)dissimilaritiesbetweendataobjectsand(2)similaritiesbetweendataobjects.Thisdivisionallowsustomorenaturallydisplaytheunderlyingmotivationsforemployingvariousproximitymeasures.Weemphasize,however,thatsimilaritiescanbetransformedintodissimilaritiesandviceversausingtheapproachesdescribedearlier.
2.4.3DissimilaritiesbetweenDataObjects
Inthissection,wediscussvariouskindsofdissimilarities.Webeginwithadiscussionofdistances,whicharedissimilaritieswithcertainproperties,andthenprovideexamplesofmoregeneralkindsofdissimilarities.
Distances
We first present some examples, and then offer a more formal description of distances in terms of the properties common to all distances. The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by the following familiar formula:

$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2},$   (2.1)

where n is the number of dimensions and $x_k$ and $y_k$ are, respectively, the $k^{th}$ attributes (components) of x and y. We illustrate this formula with Figure 2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y coordinates of these points, and the distance matrix containing the pairwise distances of these points.

Figure 2.15. Four two-dimensional points.

The Euclidean distance measure given in Equation 2.1 is generalized by the Minkowski distance metric shown in Equation 2.2,

$d(\mathbf{x}, \mathbf{y}) = \left(\sum_{k=1}^{n} |x_k - y_k|^r\right)^{1/r},$   (2.2)

where r is a parameter. The following are the three most common examples of Minkowski distances.

r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that is different between two objects that have only binary attributes, i.e., between two binary vectors.
r = 2. Euclidean distance (L2 norm).
r = ∞. Supremum (Lmax or L∞ norm) distance. This is the maximum difference between any attribute of the objects. More formally, the L∞ distance is defined by Equation 2.3:

$d(\mathbf{x}, \mathbf{y}) = \lim_{r \to \infty}\left(\sum_{k=1}^{n} |x_k - y_k|^r\right)^{1/r}.$   (2.3)

The r parameter should not be confused with the number of dimensions (attributes) n. The Euclidean, Manhattan, and supremum distances are defined for all values of n: 1, 2, 3, …, and specify different ways of combining the differences in each dimension (attribute) into an overall distance.

Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1 and L∞ distances using data from Table 2.8. Notice that all these distance matrices are symmetric; i.e., the ijth entry is the same as the jith entry. In Table 2.9, for instance, the fourth row of the first column and the fourth column of the first row both contain the value 5.1.
Table 2.8. x and y coordinates of four points.
point  x coordinate  y coordinate
p1  0  2
p2  2  0
p3  3  1
p4  5  1
Table 2.9. Euclidean distance matrix for Table 2.8.
      p1   p2   p3   p4
p1   0.0  2.8  3.2  5.1
p2   2.8  0.0  1.4  3.2
p3   3.2  1.4  0.0  2.0
p4   5.1  3.2  2.0  0.0

Table 2.10. L1 distance matrix for Table 2.8.
L1    p1   p2   p3   p4
p1   0.0  4.0  4.0  6.0
p2   4.0  0.0  2.0  4.0
p3   4.0  2.0  0.0  2.0
p4   6.0  4.0  2.0  0.0

Table 2.11. L∞ distance matrix for Table 2.8.
L∞    p1   p2   p3   p4
p1   0.0  2.0  3.0  5.0
p2   2.0  0.0  1.0  3.0
p3   3.0  1.0  0.0  2.0
p4   5.0  3.0  2.0  0.0
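The distance matrices of Tables 2.9–2.11 can be reproduced from the coordinates in Table 2.8 with a few lines of NumPy, as in the sketch below.

# L1, L2, and L∞ distance matrices for the four points of Table 2.8 (sketch).
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)   # p1..p4
diff = np.abs(points[:, None, :] - points[None, :, :])

l1 = diff.sum(axis=2)                       # city block distances (Table 2.10)
l2 = np.sqrt((diff ** 2).sum(axis=2))       # Euclidean distances (Table 2.9)
linf = diff.max(axis=2)                     # supremum distances (Table 2.11)
print(np.round(l2, 1))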
Distances, such as the Euclidean distance, have some well-known properties. If d(x, y) is the distance between two points, x and y, then the following properties hold.

1. Positivity
   a. d(x, y) ≥ 0 for all x and y,
   b. d(x, y) = 0 only if x = y.
2. Symmetry
   d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
   d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Measures that satisfy all three properties are known as metrics. Some people use the term distance only for dissimilarity measures that satisfy these properties, but that practice is often violated. The three properties described here are useful, as well as mathematically pleasing. Also, if the triangle inequality holds, then this property can be used to increase the efficiency of techniques (including clustering) that depend on distances possessing this property. (See Exercise 25.) Nonetheless, many dissimilarities do not satisfy one or more of the metric properties. Example 2.14 illustrates such a measure.

Example 2.14 (Non-metric Dissimilarities: Set Differences). This example is based on the notion of the difference of two sets, as defined in set theory. Given two sets A and B, A − B is the set of elements of A that are not in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1} and B − A = ∅, the empty set. We can define the distance d between two sets A and B as d(A, B) = size(A − B), where size is a function returning the number of elements in a set. This distance measure, which is an integer value greater than or equal to 0, does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality. However, these properties can be made to hold if the dissimilarity measure is modified as follows: d(A, B) = size(A − B) + size(B − A). See Exercise 21 on page 110.
2.4.4SimilaritiesbetweenDataObjects
For similarities, the triangle inequality (or the analogous property) typically does not hold, but symmetry and positivity typically do. To be explicit, if s(x, y) is the similarity between points x and y, then the typical properties of similarities are the following:

1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

There is no general analog of the triangle inequality for similarity measures. It is sometimes possible, however, to show that a similarity measure can easily be converted to a metric distance. The cosine and Jaccard similarity measures, which are discussed shortly, are two examples. Also, for specific similarity measures, it is possible to derive mathematical bounds on the similarity between two objects that are similar in spirit to the triangle inequality.

Example 2.15 (A Non-symmetric Similarity Measure). Consider an experiment in which people are asked to classify a small set of characters as they flash on a screen. The confusion matrix for this experiment records how often each character is classified as itself, and how often each is classified as another character. Using the confusion matrix, we can define a similarity measure between a character x and a character y as the number of times that x is misclassified as y, but note that this measure is not symmetric. For example, suppose that "0" appeared 200 times and was classified as a "0" 160 times, but as an "o" 40 times. Likewise, suppose that "o" appeared 200 times and was classified as an "o" 170 times, but as "0" only 30 times. Then, s(0, o) = 40, but s(o, 0) = 30. In such situations, the similarity measure can be made symmetric by setting s′(x, y) = s′(y, x) = (s(x, y) + s(y, x))/2, where s′ indicates the new similarity measure.
2.4.5ExamplesofProximityMeasures
Thissectionprovidesspecificexamplesofsomesimilarityanddissimilaritymeasures.
SimilarityMeasuresforBinaryDataSimilaritymeasuresbetweenobjectsthatcontainonlybinaryattributesarecalledsimilaritycoefficients,andtypicallyhavevaluesbetween0and1.Avalueof1indicatesthatthetwoobjectsarecompletelysimilar,whileavalueof0indicatesthattheobjectsarenotatallsimilar.Therearemanyrationalesforwhyonecoefficientisbetterthananotherinspecificinstances.
Let x and y be two objects that consist of n binary attributes. The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):

f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1

Simple Matching Coefficient
One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as

SMC = number of matching attribute values / number of attributes = (f11 + f00)/(f01 + f10 + f11 + f00).   (2.4)

This measure counts both presences and absences equally. Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.

Jaccard Coefficient
Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix (see Section 2.1.2). If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased, while a 0 indicates that the product was not purchased. Because the number of products not purchased by any customer far outnumbers the number of products that were purchased, a similarity measure such as SMC would say that all transactions are very similar. As a result, the Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary attributes. The Jaccard coefficient, which is often symbolized by J, is given by the following equation:

J = number of matching presences / number of attributes not involved in 00 matches = f11/(f01 + f10 + f11).   (2.5)

Example 2.16 (The SMC and Jaccard Similarity Coefficients). To illustrate the difference between these two similarity measures, we calculate SMC and J for the following two binary vectors.

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00)/(f01 + f10 + f11 + f00) = (0 + 7)/(2 + 1 + 0 + 7) = 0.7

J = f11/(f01 + f10 + f11) = 0/(2 + 1 + 0) = 0

Cosine Similarity
Documents are often represented as vectors, where each component (attribute) represents the frequency with which a particular term (word) occurs in the document. Even though documents have thousands or tens of thousands of attributes (terms), each document is sparse since it has relatively few nonzero attributes. Thus, as with transaction data, similarity should not depend on the number of shared 0 values because any two documents are likely to "not contain" many of the same words, and therefore, if 0–0 matches are counted, most documents will be highly similar to most other documents. Therefore, a similarity measure for documents needs to ignore 0–0 matches like the Jaccard measure, but also must be able to handle non-binary vectors. The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then

$\cos(\mathbf{x}, \mathbf{y}) = \frac{\langle \mathbf{x}, \mathbf{y}\rangle}{\|\mathbf{x}\|\,\|\mathbf{y}\|} = \frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|},$   (2.6)

where ′ indicates vector or matrix transpose and ⟨x, y⟩ indicates the inner product of the two vectors,

$\langle \mathbf{x}, \mathbf{y}\rangle = \sum_{k=1}^{n} x_k y_k = \mathbf{x}'\mathbf{y},$   (2.7)

and ∥x∥ is the length of the vector x, $\|\mathbf{x}\| = \sqrt{\sum_{k=1}^{n} x_k^2} = \sqrt{\langle \mathbf{x}, \mathbf{x}\rangle} = \sqrt{\mathbf{x}'\mathbf{x}}$.

The inner product of two vectors works well for asymmetric attributes since it depends only on components that are non-zero in both vectors. Hence, the similarity between two documents depends only upon the words that appear in both of them.

Example 2.17 (Cosine Similarity between Two Document Vectors). This example calculates the cosine similarity for the following two data objects, which might represent document vectors:

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

⟨x, y⟩ = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
∥x∥ = √(3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0) = 6.48
∥y∥ = √(1×1 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 + 0×0 + 2×2) = 2.45
cos(x, y) = 0.31

As indicated by Figure 2.16, cosine similarity really is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for length. If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms (words).

Figure 2.16. Geometric illustration of the cosine measure.

Equation 2.6 can also be written as Equation 2.8:

$\cos(\mathbf{x}, \mathbf{y}) = \left\langle \frac{\mathbf{x}}{\|\mathbf{x}\|}, \frac{\mathbf{y}}{\|\mathbf{y}\|} \right\rangle = \langle \mathbf{x}', \mathbf{y}'\rangle,$   (2.8)

where x′ = x/∥x∥ and y′ = y/∥y∥. Dividing x and y by their lengths normalizes them to have a length of 1. This means that cosine similarity does not take the length of the two data objects into account when computing similarity. (Euclidean distance might be a better choice when length is important.) For vectors with a length of 1, the cosine measure can be calculated by taking a simple inner product. Consequently, when many cosine similarities between objects are being computed, normalizing the objects to have unit length can reduce the time required.

Extended Jaccard Coefficient (Tanimoto Coefficient)
The extended Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes. This coefficient, which we shall represent as EJ, is defined by the following equation:

$EJ(\mathbf{x}, \mathbf{y}) = \frac{\langle \mathbf{x}, \mathbf{y}\rangle}{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - \langle \mathbf{x}, \mathbf{y}\rangle} = \frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - \mathbf{x}'\mathbf{y}}.$   (2.9)
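The SMC, Jaccard, and cosine measures are easy to compute directly, as in the sketch below, which uses the vectors of Examples 2.16 and 2.17.

# SMC, Jaccard, and cosine similarity for the vectors in Examples 2.16 and 2.17 (sketch).
import numpy as np

def smc(x, y):
    return np.mean(x == y)

def jaccard(x, y):
    both_zero = np.sum((x == 0) & (y == 0))
    f11 = np.sum((x == 1) & (y == 1))
    return f11 / (len(x) - both_zero)

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x_bin = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y_bin = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
print(smc(x_bin, y_bin), jaccard(x_bin, y_bin))   # 0.7 and 0.0

x_doc = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
y_doc = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])
print(round(cosine(x_doc, y_doc), 2))             # about 0.31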
Correlation
Correlation is frequently used to measure the linear relationship between two sets of values that are observed together. Thus, correlation can measure the relationship between two variables (height and weight) or between two objects (a pair of temperature time series). Correlation is used much more frequently to measure the similarity between attributes, since the values in two data objects come from different attributes, which can have very different attribute types and scales. There are many types of correlation, and indeed correlation is sometimes used in a general sense to mean the relationship between two sets of values that are observed together. In this discussion, we will focus on a measure appropriate for numerical values.

Specifically, Pearson's correlation between two sets of numerical values, i.e., two vectors, x and y, is defined by the following equation:

corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = s_xy / (s_x s_y),    (2.10)

where we use the following standard statistical notation and definitions:

covariance(x, y) = s_xy = (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)(y_k − ȳ)    (2.11)

standard_deviation(x) = s_x = √( (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)² )

standard_deviation(y) = s_y = √( (1/(n−1)) Σ_{k=1}^{n} (y_k − ȳ)² )

x̄ = (1/n) Σ_{k=1}^{n} x_k is the mean of x

ȳ = (1/n) Σ_{k=1}^{n} y_k is the mean of y
Example 2.18 (Perfect Correlation). Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship; that is, x_k = a y_k + b, where a and b are constants. The following two vectors x and y illustrate cases where the correlation is −1 and +1, respectively. In the first case, the means of x and y were chosen to be 0, for simplicity.

x = (−3, 6, 0, 3, −6), y = (1, −2, 0, −1, 2), corr(x, y) = −1, x_k = −3 y_k
x = (3, 6, 0, 3, 6), y = (1, 2, 0, 1, 2), corr(x, y) = 1, x_k = 3 y_k

Example 2.19 (Nonlinear Relationships). If the correlation is 0, then there is no linear relationship between the two sets of values. However, nonlinear relationships can still exist. In the following example, y_k = x_k², but their correlation is 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

Example 2.20 (Visualizing Correlation). It is also easy to judge the correlation between two vectors x and y by plotting pairs of corresponding values of x and y in a scatter plot. Figure 2.17 shows a number of these scatter plots when x and y consist of a set of 30 pairs of values that are randomly generated (with a normal distribution) so that the correlation of x and y ranges from −1 to 1. Each circle in a plot represents one of the 30 pairs of x and y values; its x coordinate is the value of that pair for x, while its y coordinate is the value of the same pair for y.

Figure 2.17. Scatter plots illustrating correlations from −1 to 1.
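The following Python sketch (an illustration added here, not part of the original text; it assumes NumPy is available) computes Pearson's correlation directly from Equations 2.10 and 2.11 and reproduces the correlations in Examples 2.18 and 2.19.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson's correlation (Equation 2.10): covariance divided by the
    product of the standard deviations (n - 1 in all denominators)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sxy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
    sx = np.sqrt(np.sum((x - x.mean()) ** 2) / (len(x) - 1))
    sy = np.sqrt(np.sum((y - y.mean()) ** 2) / (len(y) - 1))
    return sxy / (sx * sy)

# Example 2.18: perfect negative and positive linear relationships.
print(pearson_corr([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))   # -1.0
print(pearson_corr([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]))       #  1.0

# Example 2.19: y = x**2 is a perfect nonlinear relationship,
# yet the (linear) correlation is 0.
print(pearson_corr([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9]))  # 0.0
```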
If we transform x and y by subtracting off their means and then normalizing them so that their lengths are 1, then their correlation can be calculated by taking the dot product. Let us refer to these transformed vectors of x and y as x′ and y′, respectively. (Notice that this transformation is not the same as the standardization used in other contexts, where we subtract the means and divide by the standard deviations, as discussed in Section 2.3.7.) This transformation highlights an interesting relationship between the correlation measure and the cosine measure. Specifically, the correlation between x and y is identical to the cosine between x′ and y′. However, the cosine between x and y is not the same as the cosine between x′ and y′, even though they both have the same correlation measure. In general, the correlation between two vectors is equal to the cosine measure only in the special case when the means of the two vectors are 0.
Differences Among Measures For Continuous Attributes
In this section, we illustrate the difference among the three proximity measures for continuous attributes that we have just defined: cosine, correlation, and Minkowski distance. Specifically, we consider two types of data transformations that are commonly used, namely, scaling (multiplication) by a constant factor and translation (addition) by a constant value. A proximity measure is considered to be invariant to a data transformation if its value remains unchanged even after performing the transformation. Table 2.12 compares the behavior of cosine, correlation, and Minkowski distance measures regarding their invariance to scaling and translation operations. It can be seen that while correlation is invariant to both scaling and translation, cosine is only invariant to scaling but not to translation. Minkowski distance measures, on the other hand, are sensitive to both scaling and translation and are thus invariant to neither.
Table 2.12. Properties of cosine, correlation, and Minkowski distance measures.
Property | Cosine | Correlation | Minkowski Distance
Invariant to scaling (multiplication) | Yes | Yes | No
Invariant to translation (addition) | No | Yes | No
Let us consider an example to demonstrate the significance of these differences among different proximity measures.

Example 2.21 (Comparing proximity measures). Consider the following two vectors x and y with seven numeric attributes.

x = (1, 2, 4, 3, 0, 0, 0)
y = (1, 2, 3, 4, 0, 0, 0)

It can be seen that both x and y have 4 non-zero values, and the values in the two vectors are mostly the same, except for the third and the fourth components. The cosine, correlation, and Euclidean distance between the two vectors can be computed as follows:

cos(x, y) = 29 / (√30 × √30) = 0.9667
correlation(x, y) = 2.3571 / (1.5811 × 1.5811) = 0.9429
Euclidean distance ‖x − y‖ = 1.4142

Not surprisingly, x and y have a cosine and correlation measure close to 1, while the Euclidean distance between them is small, indicating that they are quite similar. Now let us consider the vector ys, which is a scaled version of y (multiplied by a constant factor of 2), and the vector yt, which is constructed by translating y by 5 units as follows:

ys = 2 × y = (2, 4, 6, 8, 0, 0, 0)
yt = y + 5 = (6, 7, 8, 9, 5, 5, 5)

We are interested in finding whether ys and yt show the same proximity with x as shown by the original vector y. Table 2.13 shows the different measures of proximity computed for the pairs (x, y), (x, ys), and (x, yt). It can be seen that the value of correlation between x and y remains unchanged even after replacing y with ys or yt. However, the value of cosine remains equal to 0.9667 when computed for (x, y) and (x, ys), but significantly reduces to 0.7940 when computed for (x, yt). This highlights the fact that cosine is invariant to the scaling operation but not to the translation operation, in contrast with the correlation measure. The Euclidean distance, on the other hand, shows different values for all three pairs of vectors, as it is sensitive to both scaling and translation.

Table 2.13. Similarity between (x, y), (x, ys), and (x, yt).
Measure | (x, y) | (x, ys) | (x, yt)
Cosine | 0.9667 | 0.9667 | 0.7940
Correlation | 0.9429 | 0.9429 | 0.9429
Euclidean Distance | 1.4142 | 5.8310 | 14.2127
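The invariance pattern of Table 2.13 can be checked with a few lines of Python. The sketch below is an illustration added here (not part of the original text); it assumes NumPy is available and only reproduces the qualitative behavior of the three measures under scaling and translation.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def correlation(a, b):
    a_c, b_c = a - a.mean(), b - b.mean()
    return np.dot(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))

x  = np.array([1, 2, 4, 3, 0, 0, 0], dtype=float)
y  = np.array([1, 2, 3, 4, 0, 0, 0], dtype=float)
ys = 2 * y        # scaled version of y
yt = y + 5        # translated version of y

for name, v in [("y", y), ("ys", ys), ("yt", yt)]:
    print(name,
          round(cosine(x, v), 4),
          round(correlation(x, v), 4),
          round(np.linalg.norm(x - v), 4))
# Correlation is identical for all three pairs; cosine changes only when y
# is translated (yt); Euclidean distance changes under both scaling and
# translation.
```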
We can observe from this example that different proximity measures behave differently when scaling or translation operations are applied on the data. The choice of the right proximity measure thus depends on the desired notion of similarity between data objects that is meaningful for a given application. For example, if x and y represented the frequencies of different words in a document-term matrix, it would be meaningful to use a proximity measure that remains unchanged when y is replaced by ys, because ys is just a scaled version of y with the same distribution of words occurring in the document. However, yt is different from y, since it contains a large number of words with non-zero frequencies that do not occur in y. Because cosine is invariant to scaling but not to translation, it will be an ideal choice of proximity measure for this application.

Consider a different scenario in which x represents a location's temperature measured on the Celsius scale for seven days. Let y, ys, and yt be the temperatures measured on those days at a different location, but using three different measurement scales. Note that different units of temperature have different offsets (e.g., Celsius and Kelvin) and different scaling factors (e.g., Celsius and Fahrenheit). It is thus desirable to use a proximity measure that captures the proximity between temperature values without being affected by the measurement scale. Correlation would then be the ideal choice of proximity measure for this application, as it is invariant to both scaling and translation.

As another example, consider a scenario where x represents the amount of precipitation (in cm) measured at seven locations. Let y, ys, and yt be estimates of the precipitation at these locations, which are predicted using three different models. Ideally, we would like to choose a model that accurately reconstructs the measurements in x without making any error. It is evident that y provides a good approximation of the values in x, whereas ys and yt provide poor estimates of precipitation, even though they do capture the trend in precipitation across locations. Hence, we need to choose a proximity measure that penalizes any difference in the model estimates from the actual observations, and is sensitive to both the scaling and translation operations. The Euclidean distance satisfies this property and thus would be the right choice of proximity measure for this application. Indeed, the Euclidean distance is commonly used in computing the accuracy of models, which will be discussed later in Chapter 3.

2.4.6 Mutual Information

Like correlation, mutual information is a measure of similarity between two sets of paired values and is sometimes used as an alternative to correlation, particularly when a nonlinear relationship is suspected between the pairs of values. This measure comes from information theory, which is the study of how to formally define and quantify information. Indeed, mutual information is a measure of how much information one set of values provides about another, given that the values come in pairs, e.g., height and weight. If the two sets of values are independent, i.e., the value of one tells us nothing about the other, then their mutual information is 0. On the other hand, if the two sets of values are completely dependent, i.e., knowing the value of one tells us the value of the other and vice-versa, then they have maximum mutual information. Mutual information does not have a maximum value, but we will define a normalized version of it that ranges between 0 and 1.
To define mutual information, we consider two sets of values, X and Y, which occur in pairs (X, Y). We need to measure the average information in a single set of values, i.e., either in X or in Y, and in the pairs of their values. This is commonly measured by entropy. More specifically, assume X and Y are discrete, that is, X can take m distinct values, u1, u2, ..., um, and Y can take n distinct values, v1, v2, ..., vn. Then their individual and joint entropy can be defined in terms of the probabilities of each value and pair of values as follows:

H(X) = −Σ_{j=1}^{m} P(X = uj) log2 P(X = uj)    (2.12)

H(Y) = −Σ_{k=1}^{n} P(Y = vk) log2 P(Y = vk)    (2.13)

H(X, Y) = −Σ_{j=1}^{m} Σ_{k=1}^{n} P(X = uj, Y = vk) log2 P(X = uj, Y = vk)    (2.14)

where, if the probability of a value or combination of values is 0, then 0 log2(0) is conventionally taken to be 0.

The mutual information of X and Y can now be defined straightforwardly:

I(X, Y) = H(X) + H(Y) − H(X, Y)    (2.15)

Note that H(X, Y) is symmetric, i.e., H(X, Y) = H(Y, X), and thus mutual information is also symmetric, i.e., I(X, Y) = I(Y, X).

Practically, X and Y are either the values in two attributes or two rows of the same data set. In Example 2.22, we will represent those values as two vectors x and y and calculate the probability of each value or pair of values from the frequency with which values or pairs of values (xi, yi) occur in x and y, where xi is the ith component of x and yi is the ith component of y. Let us illustrate using a previous example.

Example 2.22 (Evaluating Nonlinear Relationships with Mutual Information). Recall Example 2.19, where yk = xk², but their correlation was 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

From Figure 2.18, I(x, y) = H(x) + H(y) − H(x, y) = 1.9502. Although a variety of approaches to normalize mutual information are possible (see Bibliographic Notes), for this example, we will apply one that divides the mutual information by log2(min(m, n)) and produces a result between 0 and 1. This yields a value of 1.9502/log2(4) = 0.9751. Thus, we can see that x and y are strongly related. They are not perfectly related because, given a value of y, there is, except for y = 0, some ambiguity about the value of x. Notice that for y = −x, the normalized mutual information would be 1.

Figure 2.18. Computation of mutual information.

Table 2.14. Entropy for x.
xj | P(x = xj) | −P(x = xj) log2 P(x = xj)
−3 | 1/7 | 0.4011
−2 | 1/7 | 0.4011
−1 | 1/7 | 0.4011
0 | 1/7 | 0.4011
1 | 1/7 | 0.4011
2 | 1/7 | 0.4011
3 | 1/7 | 0.4011
H(x) | | 2.8074

Table 2.15. Entropy for y.
yk | P(y = yk) | −P(y = yk) log2 P(y = yk)
9 | 2/7 | 0.5164
4 | 2/7 | 0.5164
1 | 2/7 | 0.5164
0 | 1/7 | 0.4011
H(y) | | 1.9502

Table 2.16. Joint entropy for x and y.
xj | yk | P(x = xj, y = yk) | −P(x = xj, y = yk) log2 P(x = xj, y = yk)
−3 | 9 | 1/7 | 0.4011
−2 | 4 | 1/7 | 0.4011
−1 | 1 | 1/7 | 0.4011
0 | 0 | 1/7 | 0.4011
1 | 1 | 1/7 | 0.4011
2 | 4 | 1/7 | 0.4011
3 | 9 | 1/7 | 0.4011
H(x, y) | | | 2.8074
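The entropy and mutual information values above can be reproduced with a short Python sketch (an illustration added here, not part of the original text; it assumes NumPy is available and estimates probabilities from frequencies, as in Example 2.22).

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Entropy (Equations 2.12-2.14) of a list of (possibly paired) values,
    with probabilities estimated from frequencies."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in counts.values())

def mutual_information(x, y):
    """I(x, y) = H(x) + H(y) - H(x, y), Equation 2.15."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

x = [-3, -2, -1, 0, 1, 2, 3]
y = [9, 4, 1, 0, 1, 4, 9]          # y = x**2, yet the correlation is 0

mi = mutual_information(x, y)       # 1.9502
# Normalize by log2(min(#distinct x values, #distinct y values)) = log2(4).
norm_mi = mi / np.log2(min(len(set(x)), len(set(y))))   # 0.9751
print(round(mi, 4), round(norm_mi, 4))
```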
2.4.7 Kernel Functions*

It is easy to understand how similarity and distance might be useful in an application such as clustering, which tries to group similar objects together. What is much less obvious is that many other data analysis tasks, including predictive modeling and dimensionality reduction, can be expressed in terms of pairwise "proximities" of data objects. More specifically, many data analysis problems can be mathematically formulated to take as input a kernel matrix, K, which can be considered a type of proximity matrix. Thus, an initial preprocessing step is used to convert the input data into a kernel matrix, which is the input to the data analysis algorithm.

More formally, if a data set has m data objects, then K is an m by m matrix. If xi and xj are the ith and jth data objects, respectively, then kij, the ijth entry of K, is computed by a kernel function:

kij = κ(xi, xj)    (2.16)

As we will see in the material that follows, the use of a kernel matrix allows both wider applicability of an algorithm to various kinds of data and an ability to model nonlinear relationships with algorithms that are designed only for detecting linear relationships.

Kernels make an algorithm data independent
If an algorithm uses a kernel matrix, then it can be used with any type of data for which a kernel function can be designed. This is illustrated by Algorithm 2.1. Although only some data analysis algorithms can be modified to use a kernel matrix as input, this approach is extremely powerful because it allows such an algorithm to be used with almost any type of data for which an appropriate kernel function can be defined. Thus, a classification algorithm can be used, for example, with record data, string data, or graph data. If an algorithm can be reformulated to use a kernel matrix, then its applicability to different types of data increases dramatically. As we will see in later chapters, many clustering, classification, and anomaly detection algorithms work only with similarities or distances, and thus, can be easily modified to work with kernels.

Algorithm 2.1 Basic kernel algorithm.
1. Read in the m data objects in the data set.
2. Compute the kernel matrix, K, by applying the kernel function, κ, to each pair of data objects.
3. Run the data analysis algorithm with K as input.
4. Return the analysis result, e.g., predicted class or cluster labels.

Mapping data into a higher dimensional data space can allow modeling of nonlinear relationships
There is yet another, equally important, aspect of kernel-based data algorithms: their ability to model nonlinear relationships with algorithms that model only linear relationships. Typically, this works by first transforming (mapping) the data from a lower dimensional data space to a higher dimensional space.
Example 2.23 (Mapping Data to a Higher Dimensional Space). Consider the relationship between two variables x and y given by the following equation, which defines an ellipse in two dimensions (Figure 2.19(a)):

4x² + 9xy + 7y² = 10    (2.17)

We can map our two-dimensional data to three dimensions by creating three new variables, u, v, and w, which are defined as follows:

u = x², v = xy, w = y²

As a result, we can now express Equation 2.17 as the linear Equation 2.18:

4u + 9v + 7w = 10    (2.18)

This equation describes a plane in three dimensions. Points on the ellipse will lie on that plane, while points inside and outside the ellipse will lie on opposite sides of the plane. See Figure 2.19(b). The viewpoint of this 3D plot is along the surface of the separating plane so that the plane appears as a line.

Figure 2.19. Mapping data to a higher dimensional space: two to three dimensions.

The Kernel Trick
The approach illustrated above shows the value in mapping data to a higher dimensional space, an operation that is integral to kernel-based methods. Conceptually, we first define a function φ that maps data points x and y to data points φ(x) and φ(y) in a higher dimensional space such that the inner product ⟨φ(x), φ(y)⟩ gives the desired measure of proximity of x and y. It may seem that we have potentially sacrificed a great deal by using such an approach, because we can greatly expand the size of our data, increase the computational complexity of our analysis, and encounter problems with the curse of dimensionality by computing similarity in a high-dimensional space. However, this is not the case, since these problems can be avoided by defining a kernel function κ that can compute the same similarity value, but with the data points in the original space, i.e., κ(x, y) = ⟨φ(x), φ(y)⟩. This is known as the kernel trick. Despite the name, the kernel trick has a very solid mathematical foundation and is a remarkably powerful approach for data analysis.

Not every function of a pair of data objects satisfies the properties needed for a kernel function, but it has been possible to design many useful kernels for a wide variety of data types. For example, three common kernel functions are the polynomial, Gaussian (radial basis function (RBF)), and sigmoid kernels. If x and y are two data objects, specifically, two data vectors, then these three kernel functions can be expressed as follows, respectively:

κ(x, y) = (x′y + c)^d    (2.19)

κ(x, y) = exp(−‖x − y‖² / (2σ²))    (2.20)

κ(x, y) = tanh(αx′y + c)    (2.21)

where α and c ≥ 0 are constants, d is an integer parameter that gives the polynomial degree, ‖x − y‖ is the length of the vector x − y, and σ > 0 is a parameter that governs the "spread" of a Gaussian.

Example 2.24 (The Polynomial Kernel). Note that the kernel functions presented above compute the same similarity value as would be computed if we actually mapped the data to a higher dimensional space and then computed an inner product there. For example, for the polynomial kernel of degree 2, let φ be the function that maps a two-dimensional data vector x = (x1, x2) to the higher dimensional space. Specifically, let

φ(x) = (x1², x2², √2 x1x2, √(2c) x1, √(2c) x2, c).    (2.22)

For the higher dimensional space, let the proximity be defined as the inner product of φ(x) and φ(y), i.e., ⟨φ(x), φ(y)⟩. Then, as previously mentioned, it can be shown that

κ(x, y) = ⟨φ(x), φ(y)⟩,    (2.23)

where κ is defined by Equation 2.19 above. Specifically, if x = (x1, x2) and y = (y1, y2), then

κ(x, y) = ⟨φ(x), φ(y)⟩ = x1²y1² + x2²y2² + 2x1x2y1y2 + 2cx1y1 + 2cx2y2 + c².    (2.24)

More generally, the kernel trick depends on defining κ and φ so that Equation 2.23 holds. This has been done for a wide variety of kernels.

This discussion of kernel-based approaches was intended only to provide a brief introduction to this topic and has omitted many details. A fuller discussion of the kernel-based approach is provided in Section 4.9.4, which discusses these issues in the context of nonlinear support vector machines for classification. More general references for kernel-based analysis can be found in the Bibliographic Notes of this chapter.
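Before moving on, the equivalence in Example 2.24 can be checked numerically. The Python sketch below is an illustration added here (not part of the original text); the vectors and the constant c are arbitrary choices used only to show that the degree-2 polynomial kernel equals the inner product of the explicitly mapped points.

```python
import numpy as np

c, d = 1.0, 2  # illustrative constants for the degree-2 polynomial kernel

def poly_kernel(x, y):
    """Polynomial kernel, Equation 2.19: kappa(x, y) = (x'y + c)^d."""
    return (np.dot(x, y) + c) ** d

def phi(x):
    """Explicit feature map of Equation 2.22 for a two-dimensional x."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The kernel evaluated in the original 2-D space equals the inner product
# of the mapped points in the 6-D space (Equation 2.23).
print(poly_kernel(x, y), np.dot(phi(x), phi(y)))   # both print 4.0
```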
2.4.8 Bregman Divergence*

This section provides a brief description of Bregman divergences, which are a family of proximity functions that share some common properties. As a result, it is possible to construct general data mining algorithms, such as clustering algorithms, that work with any Bregman divergence. A concrete example is the K-means clustering algorithm (Section 7.2). Note that this section requires knowledge of vector calculus.
Bregman divergences are loss or distortion functions. To understand the idea of a loss function, consider the following. Let x and y be two points, where y is regarded as the original point and x is some distortion or approximation of it. For example, x may be a point that was generated by adding random noise to y. The goal is to measure the resulting distortion or loss that results if y is approximated by x. Of course, the more similar x and y are, the smaller the loss or distortion. Thus, Bregman divergences can be used as dissimilarity functions.

More formally, we have the following definition.

Definition 2.6 (Bregman divergence). Given a strictly convex function ϕ (with a few modest restrictions that are generally satisfied), the Bregman divergence (loss function) D(x, y) generated by that function is given by the following equation:

D(x, y) = ϕ(x) − ϕ(y) − ⟨∇ϕ(y), (x − y)⟩    (2.25)

where ∇ϕ(y) is the gradient of ϕ evaluated at y, x − y is the vector difference between x and y, and ⟨∇ϕ(y), (x − y)⟩ is the inner product between ∇ϕ(y) and (x − y). For points in Euclidean space, the inner product is just the dot product.

D(x, y) can be written as D(x, y) = ϕ(x) − L(x), where L(x) = ϕ(y) + ⟨∇ϕ(y), (x − y)⟩ represents the equation of a plane that is tangent to the function ϕ at y. Using calculus terminology, L(x) is the linearization of ϕ around the point y, and the Bregman divergence is just the difference between a function and a linear approximation to that function. Different Bregman divergences are obtained by using different choices for ϕ.

Example 2.25. We provide a concrete example using squared Euclidean distance, but restrict ourselves to one dimension to simplify the mathematics. Let x and y be real numbers and ϕ(t) be the real-valued function, ϕ(t) = t². In that case, the gradient reduces to the derivative, and the dot product reduces to multiplication. Specifically, Equation 2.25 becomes Equation 2.26:

D(x, y) = x² − y² − 2y(x − y) = (x − y)²    (2.26)

The graph for this example, with y = 1, is shown in Figure 2.20. The Bregman divergence is shown for two values of x: x = 2 and x = 3.

Figure 2.20. Illustration of Bregman divergence.
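A minimal Python sketch of Equation 2.25, added here as an illustration (not part of the original text; it assumes NumPy is available), shows that the choice ϕ(t) = t² recovers the squared Euclidean distance of Example 2.25.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """Bregman divergence of Equation 2.25:
    D(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(t) = t^2 (squared length) generates the squared Euclidean distance.
phi = lambda t: float(np.dot(t, t))
grad_phi = lambda t: 2 * t

y = np.array([1.0])
for x in (np.array([2.0]), np.array([3.0])):
    print(bregman_divergence(phi, grad_phi, x, y))   # 1.0, then 4.0
```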
2.4.9 Issues in Proximity Calculation

This section discusses several important issues related to proximity measures: (1) how to handle the case in which attributes have different scales and/or are correlated, (2) how to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, and (3) how to handle proximity calculations when attributes have different weights, i.e., when not all attributes contribute equally to the proximity of objects.

Standardization and Correlation for Distance Measures
An important issue with distance measures is how to handle the situation when attributes do not have the same range of values. (This situation is often described by saying that "the variables have different scales.") In a previous example, Euclidean distance was used to measure the distance between people based on two attributes: age and income. Unless these two attributes are standardized, the distance between two people will be dominated by income.

A related issue is how to compute distance when there is correlation between some of the attributes, perhaps in addition to differences in the ranges of values. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal). Correlated variables have a large impact on standard distance measures since a change in any of the correlated variables is reflected in a change in all the correlated variables. Specifically, the Mahalanobis distance between two objects (vectors) x and y is defined as

Mahalanobis(x, y) = (x − y)′ Σ⁻¹ (x − y),    (2.27)

where Σ⁻¹ is the inverse of the covariance matrix of the data. Note that the covariance matrix Σ is the matrix whose ijth entry is the covariance of the ith and jth attributes, as defined by Equation 2.11.

Example 2.26. In Figure 2.21, there are 1000 points, whose x and y attributes have a correlation of 0.6. The distance between the two large points at the opposite ends of the long axis of the ellipse is 14.7 in terms of Euclidean distance, but only 6 with respect to Mahalanobis distance. This is because the Mahalanobis distance gives less emphasis to the direction of largest variance. In practice, computing the Mahalanobis distance is expensive, but can be worthwhile for data whose attributes are correlated. If the attributes are relatively uncorrelated, but have different ranges, then standardizing the variables is sufficient.

Figure 2.21. Set of two-dimensional points. The Mahalanobis distance between the two points represented by large dots is 6; their Euclidean distance is 14.7.
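The following Python sketch (an illustration added here, not part of the original text; it assumes NumPy is available, and the covariance matrix and point pairs are chosen only to loosely mimic Figure 2.21) computes Equation 2.27 and shows how the Mahalanobis distance de-emphasizes the direction of largest variance.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance of Equation 2.27 for a given covariance matrix."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ np.linalg.inv(cov) @ diff)

# Covariance matrix of two attributes with correlation 0.6 (unit variances).
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])

# Two pairs of points with the same Euclidean separation: one pair lies
# along the direction of largest variance, the other across it.
along  = (np.array([ 3.0,  3.0]), np.array([-3.0, -3.0]))
across = (np.array([ 3.0, -3.0]), np.array([-3.0,  3.0]))

for name, (p, q) in [("along", along), ("across", across)]:
    print(name, np.linalg.norm(p - q), mahalanobis(p, q, cov))
# Euclidean distance is identical (about 8.49) for both pairs, but the
# Mahalanobis value is much smaller along the high-variance direction.
```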
Combining Similarities for Heterogeneous Attributes
The previous definitions of similarity were based on approaches that assumed all the attributes were of the same type. A general approach is needed when the attributes are of different types. One straightforward approach is to compute the similarity between each attribute separately using Table 2.7, and then combine these similarities using a method that results in a similarity between 0 and 1. One possible approach is to define the overall similarity as the average of all the individual attribute similarities. Unfortunately, this approach does not work well if some of the attributes are asymmetric attributes. For example, if all the attributes are asymmetric binary attributes, then the similarity measure suggested previously reduces to the simple matching coefficient, a measure that is not appropriate for asymmetric binary attributes. The easiest way to fix this problem is to omit asymmetric attributes from the similarity calculation when their values are 0 for both of the objects whose similarity is being computed. A similar approach also works well for handling missing values.

In summary, Algorithm 2.2 is effective for computing an overall similarity between two objects, x and y, with different types of attributes. This procedure can be easily modified to work with dissimilarities.

Algorithm 2.2 Similarities of heterogeneous objects.
1: For the kth attribute, compute a similarity, sk(x, y), in the range [0, 1].
2: Define an indicator variable, δk, for the kth attribute as follows: δk = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute; δk = 1 otherwise.
3: Compute the overall similarity between the two objects using the following formula:

similarity(x, y) = Σ_{k=1}^{n} δk sk(x, y) / Σ_{k=1}^{n} δk    (2.28)

Using Weights
In much of the previous discussion, all attributes were treated equally when computing proximity. This is not desirable when some attributes are more important to the definition of proximity than others. To address these situations, the formulas for proximity can be modified by weighting the contribution of each attribute. With attribute weights, wk, Equation 2.28 becomes

similarity(x, y) = Σ_{k=1}^{n} wk δk sk(x, y) / Σ_{k=1}^{n} wk δk.    (2.29)

The definition of the Minkowski distance can also be modified as follows:

d(x, y) = (Σ_{k=1}^{n} wk |xk − yk|^r)^{1/r}.    (2.30)
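A sketch of Algorithm 2.2 with the optional weights of Equation 2.29 is given below. This Python code is an illustration added here (not part of the original text); the objects, per-attribute similarity functions, and the treatment of missing values are hypothetical choices made only to show the structure of the procedure.

```python
def heterogeneous_similarity(x, y, attribute_sims, asymmetric, weights=None):
    """Sketch of Algorithm 2.2 with the optional weights of Equation 2.29.

    x, y           : tuples of attribute values (None marks a missing value)
    attribute_sims : one function per attribute, returning a similarity in [0, 1]
    asymmetric     : one bool per attribute (True if the attribute is asymmetric)
    weights        : optional per-attribute weights w_k (defaults to 1)
    """
    n = len(x)
    weights = weights or [1.0] * n
    num = den = 0.0
    for k in range(n):
        # Indicator delta_k: skip asymmetric attributes that are 0 in both
        # objects, and skip attributes with a missing value in either object.
        if x[k] is None or y[k] is None:
            continue
        if asymmetric[k] and x[k] == 0 and y[k] == 0:
            continue
        num += weights[k] * attribute_sims[k](x[k], y[k])
        den += weights[k]
    return num / den if den > 0 else 0.0

# Hypothetical objects with one nominal, one asymmetric binary, and one
# numeric attribute; the similarity functions below are illustrative only.
sims = [lambda a, b: 1.0 if a == b else 0.0,        # nominal: exact match
        lambda a, b: 1.0 if a == b == 1 else 0.0,   # asymmetric binary
        lambda a, b: 1.0 - abs(a - b) / 100.0]      # numeric, range assumed 0-100
print(heterogeneous_similarity(("red", 0, 30), ("red", 0, 40),
                               sims, asymmetric=[False, True, False]))  # 0.95
```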
2.4.10 Selecting the Right Proximity Measure

A few general observations may be helpful. First, the type of proximity measure should fit the type of data. For many types of dense, continuous data, metric distance measures such as Euclidean distance are often used. Proximity between continuous attributes is most often expressed in terms of differences, and distance measures provide a well-defined way of combining these differences into an overall proximity measure. Although attributes can have different scales and be of differing importance, these issues can often be dealt with as described earlier, such as normalization and weighting of attributes.
For sparse data, which often consists of asymmetric attributes, we typically employ similarity measures that ignore 0-0 matches. Conceptually, this reflects the fact that, for a pair of complex objects, similarity depends on the number of characteristics they both share, rather than the number of characteristics they both lack. The cosine, Jaccard, and extended Jaccard measures are appropriate for such data.

There are other characteristics of data vectors that often need to be considered. Invariance to scaling (multiplication) and to translation (addition) were previously discussed with respect to Euclidean distance and the cosine and correlation measures. The practical implications of such considerations are that, for example, cosine is more suitable for sparse document data where only scaling is important, while correlation works better for time series, where both scaling and translation are important. Euclidean distance or other types of Minkowski distance are most appropriate when two data vectors are to match as closely as possible across all components (features).

In some cases, transformation or normalization of the data is needed to obtain a proper similarity measure. For instance, time series can have trends or periodic patterns that significantly impact similarity. Also, a proper computation of similarity often requires that time lags be taken into account. Finally, two time series may be similar only over specific periods of time. For example, there is a strong relationship between temperature and the use of natural gas, but only during the heating season.

Practical consideration can also be important. Sometimes, one or more proximity measures are already in use in a particular field, and thus, others will have answered the question of which proximity measures should be used. Other times, the software package or clustering algorithm being used can drastically limit the choices. If efficiency is a concern, then we may want to choose a proximity measure that has a property, such as the triangle inequality, that can be used to reduce the number of proximity calculations. (See Exercise 25.)

However, if common practice or practical restrictions do not dictate a choice, then the proper choice of a proximity measure can be a time-consuming task that requires careful consideration of both domain knowledge and the purpose for which the measure is being used. A number of different similarity measures may need to be evaluated to see which ones produce results that make the most sense.
2.5BibliographicNotesItisessentialtounderstandthenatureofthedatathatisbeinganalyzed,andatafundamentallevel,thisisthesubjectofmeasurementtheory.Inparticular,oneoftheinitialmotivationsfordefiningtypesofattributeswastobepreciseaboutwhichstatisticaloperationswerevalidforwhatsortsofdata.WehavepresentedtheviewofmeasurementtheorythatwasinitiallydescribedinaclassicpaperbyS.S.Stevens[112].(Tables2.2 and2.3 arederivedfromthosepresentedbyStevens[113].)Whilethisisthemostcommonviewandisreasonablyeasytounderstandandapply,thereis,ofcourse,muchmoretomeasurementtheory.Anauthoritativediscussioncanbefoundinathree-volumeseriesonthefoundationsofmeasurementtheory[88,94,114].Alsoofinterestisawide-rangingarticlebyHand[77],whichdiscussesmeasurementtheoryandstatistics,andisaccompaniedbycommentsfromotherresearchersinthefield.NumerouscritiquesandextensionsoftheapproachofStevenshavebeenmade[66,97,117].Finally,manybooksandarticlesdescribemeasurementissuesforparticularareasofscienceandengineering.
Dataqualityisabroadsubjectthatspanseverydisciplinethatusesdata.Discussionsofprecision,bias,accuracy,andsignificantfigurescanbefoundinmanyintroductoryscience,engineering,andstatisticstextbooks.Theviewofdataqualityas“fitnessforuse”isexplainedinmoredetailinthebookbyRedman[103].ThoseinterestedindataqualitymayalsobeinterestedinMIT’sInformationQuality(MITIQ)Program[95,118].However,theknowledgeneededtodealwithspecificdataqualityissuesinaparticulardomainisoftenbestobtainedbyinvestigatingthedataqualitypracticesofresearchersinthatfield.
Aggregationisalesswell-definedsubjectthanmanyotherpreprocessingtasks.However,aggregationisoneofthemaintechniquesusedbythedatabaseareaofOnlineAnalyticalProcessing(OLAP)[68,76,102].Therehasalsobeenrelevantworkintheareaofsymbolicdataanalysis(BockandDiday[64]).Oneofthegoalsinthisareaistosummarizetraditionalrecorddataintermsofsymbolicdataobjectswhoseattributesaremorecomplexthantraditionalattributes.Specifically,theseattributescanhavevaluesthataresetsofvalues(categories),intervals,orsetsofvalueswithweights(histograms).Anothergoalofsymbolicdataanalysisistobeabletoperformclustering,classification,andotherkindsofdataanalysisondatathatconsistsofsymbolicdataobjects.
Samplingisasubjectthathasbeenwellstudiedinstatisticsandrelatedfields.Manyintroductorystatisticsbooks,suchastheonebyLindgren[90],havesomediscussionaboutsampling,andentirebooksaredevotedtothesubject,suchastheclassictextbyCochran[67].AsurveyofsamplingfordataminingisprovidedbyGuandLiu[74],whileasurveyofsamplingfordatabasesisprovidedbyOlkenandRotem[98].Thereareanumberofotherdatamininganddatabase-relatedsamplingreferencesthatmaybeofinterest,includingpapersbyPalmerandFaloutsos[100],Provostetal.[101],Toivonen[115],andZakietal.[119].
Instatistics,thetraditionaltechniquesthathavebeenusedfordimensionalityreductionaremultidimensionalscaling(MDS)(BorgandGroenen[65],KruskalandUslaner[89])andprincipalcomponentanalysis(PCA)(Jolliffe[80]),whichissimilartosingularvaluedecomposition(SVD)(Demmel[70]).DimensionalityreductionisdiscussedinmoredetailinAppendixB.
Discretizationisatopicthathasbeenextensivelyinvestigatedindatamining.Someclassificationalgorithmsworkonlywithcategoricaldata,andassociationanalysisrequiresbinarydata,andthus,thereisasignificant
motivationtoinvestigatehowtobestbinarizeordiscretizecontinuousattributes.Forassociationanalysis,wereferthereadertoworkbySrikantandAgrawal[111],whilesomeusefulreferencesfordiscretizationintheareaofclassificationincludeworkbyDoughertyetal.[71],ElomaaandRousu[72],FayyadandIrani[73],andHussainetal.[78].
Featureselectionisanothertopicwellinvestigatedindatamining.AbroadcoverageofthistopicisprovidedinasurveybyMolinaetal.[96]andtwobooksbyLiuandMotada[91,92].OtherusefulpapersincludethosebyBlumandLangley[63],KohaviandJohn[87],andLiuetal.[93].
Itisdifficulttoprovidereferencesforthesubjectoffeaturetransformationsbecausepracticesvaryfromonedisciplinetoanother.Manystatisticsbookshaveadiscussionoftransformations,buttypicallythediscussionisrestrictedtoaparticularpurpose,suchasensuringthenormalityofavariableormakingsurethatvariableshaveequalvariance.Weoffertworeferences:Osborne[99]andTukey[116].
Whilewehavecoveredsomeofthemostcommonlyuseddistanceandsimilaritymeasures,therearehundredsofsuchmeasuresandmorearebeingcreatedallthetime.Aswithsomanyothertopicsinthischapter,manyofthesemeasuresarespecifictoparticularfields,e.g.,intheareaoftimeseriesseepapersbyKalpakisetal.[81]andKeoghandPazzani[83].Clusteringbooksprovidethebestgeneraldiscussions.Inparticular,seethebooksbyAnderberg[62],JainandDubes[79],KaufmanandRousseeuw[82],andSneathandSokal[109].
Information-basedmeasuresofsimilarityhavebecomemorepopularlatelydespitethecomputationaldifficultiesandexpenseofcalculatingthem.AgoodintroductiontoinformationtheoryisprovidedbyCoverandThomas[69].Computingthemutualinformationforcontinuousvariablescanbe
straightforwardiftheyfollowawell-knowdistribution,suchasGaussian.However,thisisoftennotthecase,andmanytechniqueshavebeendeveloped.Asoneexample,thearticlebyKhan,etal.[85]comparesvariousmethodsinthecontextofcomparingshorttimeseries.SeealsotheinformationandmutualinformationpackagesforRandMatlab.MutualinformationhasbeenthesubjectofconsiderablerecentattentionduetopaperbyReshef,etal.[104,105]thatintroducedanalternativemeasure,albeitonebasedonmutualinformation,whichwasclaimedtohavesuperiorproperties.Althoughthisapproachhadsomeearlysupport,e.g.,[110],othershavepointedoutvariouslimitations[75,86,108].
Twopopularbooksonthetopicofkernelmethodsare[106]and[107].Thelatteralsohasawebsitewithlinkstokernel-relatedmaterials[84].Inaddition,manycurrentdatamining,machinelearning,andstatisticallearningtextbookshavesomematerialaboutkernelmethods.FurtherreferencesforkernelmethodsinthecontextofsupportvectormachineclassifiersareprovidedinthebibliographicNotesofSection4.9.4.
Bibliography[62]M.R.Anderberg.ClusterAnalysisforApplications.AcademicPress,New
York,December1973.
[63]A.BlumandP.Langley.SelectionofRelevantFeaturesandExamplesinMachineLearning.ArtificialIntelligence,97(1–2):245–271,1997.
[64]H.H.BockandE.Diday.AnalysisofSymbolicData:ExploratoryMethodsforExtractingStatisticalInformationfromComplexData(StudiesinClassification,DataAnalysis,andKnowledgeOrganization).Springer-VerlagTelos,January2000.
[65]I.BorgandP.Groenen.ModernMultidimensionalScaling—TheoryandApplications.Springer-Verlag,February1997.
[66]N.R.Chrisman.Rethinkinglevelsofmeasurementforcartography.CartographyandGeographicInformationSystems,25(4):231–242,1998.
[67]W.G.Cochran.SamplingTechniques.JohnWiley&Sons,3rdedition,July1977.
[68]E.F.Codd,S.B.Codd,andC.T.Smalley.ProvidingOLAP(On-lineAnalyticalProcessing)toUser-Analysts:AnITMandate.WhitePaper,E.F.CoddandAssociates,1993.
[69]T.M.CoverandJ.A.Thomas.Elementsofinformationtheory.JohnWiley&Sons,2012.
[70]J.W.Demmel.AppliedNumericalLinearAlgebra.SocietyforIndustrial&AppliedMathematics,September1997.
[71]J.Dougherty,R.Kohavi,andM.Sahami.SupervisedandUnsupervisedDiscretizationofContinuousFeatures.InProc.ofthe12thIntl.Conf.onMachineLearning,pages194–202,1995.
[72]T.ElomaaandJ.Rousu.GeneralandEfficientMultisplittingofNumericalAttributes.MachineLearning,36(3):201–244,1999.
[73]U.M.FayyadandK.B.Irani.Multi-intervaldiscretizationofcontinuousvaluedattributesforclassificationlearning.InProc.13thInt.JointConf.onArtificialIntelligence,pages1022–1027.MorganKaufman,1993.
[74]F.H.GaohuaGuandH.Liu.SamplingandItsApplicationinDataMining:ASurvey.TechnicalReportTRA6/00,NationalUniversityofSingapore,Singapore,2000.
[75]M.Gorfine,R.Heller,andY.Heller.CommentonDetectingnovelassociationsinlargedatasets.Unpublished(availableathttp://emotion.technion.ac.il/gorfinm/files/science6.pdfon11Nov.2012),2012.
[76]J.Gray,S.Chaudhuri,A.Bosworth,A.Layman,D.Reichart,M.Venkatrao,F.Pellow,andH.Pirahesh.DataCube:ARelationalAggregationOperatorGeneralizingGroup-By,Cross-Tab,andSub-Totals.JournalDataMiningandKnowledgeDiscovery,1(1):29–53,1997.
[77]D.J.Hand.StatisticsandtheTheoryofMeasurement.JournaloftheRoyalStatisticalSociety:SeriesA(StatisticsinSociety),159(3):445–492,1996.
[78]F.Hussain,H.Liu,C.L.Tan,andM.Dash.TRC6/99:Discretization:anenablingtechnique.Technicalreport,NationalUniversityofSingapore,Singapore,1999.
[79]A.K.JainandR.C.Dubes.AlgorithmsforClusteringData.PrenticeHallAdvancedReferenceSeries.PrenticeHall,March1988.
[80]I.T.Jolliffe.PrincipalComponentAnalysis.SpringerVerlag,2ndedition,October2002.
[81]K.Kalpakis,D.Gada,andV.Puttagunta.DistanceMeasuresforEffectiveClusteringofARIMATime-Series.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages273–280.IEEEComputerSociety,2001.
[82]L.KaufmanandP.J.Rousseeuw.FindingGroupsinData:AnIntroductiontoClusterAnalysis.WileySeriesinProbabilityandStatistics.JohnWileyandSons,NewYork,November1990.
[83]E.J.KeoghandM.J.Pazzani.Scalingupdynamictimewarpingfordataminingapplications.InKDD,pages285–289,2000.
[84]KernelMethodsforPatternAnalysisWebsite.http://www.kernel-methods.net/,2014.
[85]S.Khan,S.Bandyopadhyay,A.R.Ganguly,S.Saigal,D.J.EricksonIII,V.Protopopescu,andG.Ostrouchov.Relativeperformanceofmutualinformationestimationmethodsforquantifyingthedependenceamongshortandnoisydata.PhysicalReviewE,76(2):026209,2007.
[86]J.B.KinneyandG.S.Atwal.Equitability,mutualinformation,andthemaximalinformationcoefficient.ProceedingsoftheNationalAcademyofSciences,2014.
[87]R.KohaviandG.H.John.WrappersforFeatureSubsetSelection.ArtificialIntelligence,97(1–2):273–324,1997.
[88]D.Krantz,R.D.Luce,P.Suppes,andA.Tversky.FoundationsofMeasurements:Volume1:Additiveandpolynomialrepresentations.AcademicPress,NewYork,1971.
[89]J.B.KruskalandE.M.Uslaner.MultidimensionalScaling.SagePublications,August1978.
[90]B.W.Lindgren.StatisticalTheory.CRCPress,January1993.
[91]H.LiuandH.Motoda,editors.FeatureExtraction,ConstructionandSelection:ADataMiningPerspective.KluwerInternationalSeriesinEngineeringandComputerScience,453.KluwerAcademicPublishers,July1998.
[92]H.LiuandH.Motoda.FeatureSelectionforKnowledgeDiscoveryandDataMining.KluwerInternationalSeriesinEngineeringandComputerScience,454.KluwerAcademicPublishers,July1998.
[93]H.Liu,H.Motoda,andL.Yu.FeatureExtraction,Selection,andConstruction.InN.Ye,editor,TheHandbookofDataMining,pages22–41.LawrenceErlbaumAssociates,Inc.,Mahwah,NJ,2003.
[94]R.D.Luce,D.Krantz,P.Suppes,andA.Tversky.FoundationsofMeasurements:Volume3:Representation,Axiomatization,andInvariance.AcademicPress,NewYork,1990.
[95]MITInformationQuality(MITIQ)Program.http://mitiq.mit.edu/,2014.
[96]L.C.Molina,L.Belanche,andA.Nebot.FeatureSelectionAlgorithms:ASurveyandExperimentalEvaluation.InProc.ofthe2002IEEEIntl.Conf.onDataMining,2002.
[97]F.MostellerandJ.W.Tukey.Dataanalysisandregression:asecondcourseinstatistics.Addison-Wesley,1977.
[98]F.OlkenandD.Rotem.RandomSamplingfromDatabases—ASurvey.Statistics&Computing,5(1):25–42,March1995.
[99]J.Osborne.NotesontheUseofDataTransformations.PracticalAssessment,Research&Evaluation,28(6),2002.
[100]C.R.PalmerandC.Faloutsos.Densitybiasedsampling:Animprovedmethodfordataminingandclustering.ACMSIGMODRecord,29(2):82–92,2000.
[101]F.J.Provost,D.Jensen,andT.Oates.EfficientProgressiveSampling.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages23–32,1999.
[102]R.RamakrishnanandJ.Gehrke.DatabaseManagementSystems.McGraw-Hill,3rdedition,August2002.
[103]T.C.Redman.DataQuality:TheFieldGuide.DigitalPress,January2001.
[104]D.Reshef,Y.Reshef,M.Mitzenmacher,andP.Sabeti.Equitabilityanalysisofthemaximalinformationcoefficient,withcomparisons.arXivpreprintarXiv:1301.6314,2013.
[105]D.N.Reshef,Y.A.Reshef,H.K.Finucane,S.R.Grossman,G.McVean,P.J.Turnbaugh,E.S.Lander,M.Mitzenmacher,andP.C.
Sabeti.Detectingnovelassociationsinlargedatasets.science,334(6062):1518–1524,2011.
[106]B.SchölkopfandA.J.Smola.Learningwithkernels:supportvectormachines,regularization,optimization,andbeyond.MITpress,2002.
[107]J.Shawe-TaylorandN.Cristianini.Kernelmethodsforpatternanalysis.Cambridgeuniversitypress,2004.
[108]N.SimonandR.Tibshirani.Commenton”DetectingNovelAssociationsInLargeDataSets”byReshefEtAl,ScienceDec16,2011.arXivpreprintarXiv:1401.7645,2014.
[109]P.H.A.SneathandR.R.Sokal.NumericalTaxonomy.Freeman,SanFrancisco,1971.
[110]T.Speed.Acorrelationforthe21stcentury.Science,334(6062):1502–1503,2011.
[111]R.SrikantandR.Agrawal.MiningQuantitativeAssociationRulesinLargeRelationalTables.InProc.of1996ACM-SIGMODIntl.Conf.onManagementofData,pages1–12,Montreal,Quebec,Canada,August1996.
[112]S.S.Stevens.OntheTheoryofScalesofMeasurement.Science,103(2684):677–680,June1946.
[113]S.S.Stevens.Measurement.InG.M.Maranell,editor,Scaling:ASourcebookforBehavioralScientists,pages22–41.AldinePublishingCo.,Chicago,1974.
[114]P.Suppes,D.Krantz,R.D.Luce,andA.Tversky.FoundationsofMeasurements:Volume2:Geometrical,Threshold,andProbabilisticRepresentations.AcademicPress,NewYork,1989.
[115]H.Toivonen.SamplingLargeDatabasesforAssociationRules.InVLDB96,pages134–145.MorganKaufman,September1996.
[116]J.W.Tukey.OntheComparativeAnatomyofTransformations.AnnalsofMathematicalStatistics,28(3):602–632,September1957.
[117]P.F.VellemanandL.Wilkinson.Nominal,ordinal,interval,andratiotypologiesaremisleading.TheAmericanStatistician,47(1):65–72,1993.
[118]R.Y.Wang,M.Ziad,Y.W.Lee,andY.R.Wang.DataQuality.TheKluwerInternationalSeriesonAdvancesinDatabaseSystems,Volume23.KluwerAcademicPublishers,January2001.
[119]M.J.Zaki,S.Parthasarathy,W.Li,andM.Ogihara.EvaluationofSamplingforDataMiningofAssociationRules.TechnicalReportTR617,RensselaerPolytechnicInstitute,1996.
2.6Exercises1.IntheinitialexampleofChapter2 ,thestatisticiansays,“Yes,fields2and3arebasicallythesame.”Canyoutellfromthethreelinesofsampledatathatareshownwhyshesaysthat?
2.Classifythefollowingattributesasbinary,discrete,orcontinuous.Alsoclassifythemasqualitative(nominalorordinal)orquantitative(intervalorratio).Somecasesmayhavemorethanoneinterpretation,sobrieflyindicateyourreasoningifyouthinktheremaybesomeambiguity.
Example:Ageinyears.Answer:Discrete,quantitative,ratio
a. TimeintermsofAMorPM.
b. Brightnessasmeasuredbyalightmeter.
c. Brightnessasmeasuredbypeople’sjudgments.
d. Anglesasmeasuredindegreesbetween0and360.
e. Bronze,Silver,andGoldmedalsasawardedattheOlympics.
f. Heightabovesealevel.
g. Numberofpatientsinahospital.
h. ISBNnumbersforbooks.(LookuptheformatontheWeb.)
i. Abilitytopasslightintermsofthefollowingvalues:opaque,translucent,transparent.
j. Militaryrank.
k. Distancefromthecenterofcampus.
l. Densityofasubstanceingramspercubiccentimeter.
m. Coatchecknumber.(Whenyouattendanevent,youcanoftengiveyourcoattosomeonewho,inturn,givesyouanumberthatyoucanusetoclaimyourcoatwhenyouleave.)
3.Youareapproachedbythemarketingdirectorofalocalcompany,whobelievesthathehasdevisedafoolproofwaytomeasurecustomersatisfaction.Heexplainshisschemeasfollows:“It’ssosimplethatIcan’tbelievethatnoonehasthoughtofitbefore.Ijustkeeptrackofthenumberofcustomercomplaintsforeachproduct.Ireadinadataminingbookthatcountsareratioattributes,andso,mymeasureofproductsatisfactionmustbearatioattribute.ButwhenIratedtheproductsbasedonmynewcustomersatisfactionmeasureandshowedthemtomyboss,hetoldmethatIhadoverlookedtheobvious,andthatmymeasurewasworthless.Ithinkthathewasjustmadbecauseourbestsellingproducthadtheworstsatisfactionsinceithadthemostcomplaints.Couldyouhelpmesethimstraight?”
a. Whoisright,themarketingdirectororhisboss?Ifyouanswered,hisboss,whatwouldyoudotofixthemeasureofsatisfaction?
b. Whatcanyousayabouttheattributetypeoftheoriginalproductsatisfactionattribute?
4.Afewmonthslater,youareagainapproachedbythesamemarketingdirectorasinExercise3 .Thistime,hehasdevisedabetterapproachtomeasuretheextenttowhichacustomerprefersoneproductoverothersimilarproducts.Heexplains,“Whenwedevelopnewproducts,wetypicallycreateseveralvariationsandevaluatewhichonecustomersprefer.Ourstandardprocedureistogiveourtestsubjectsalloftheproductvariationsatonetimeandthenaskthemtoranktheproductvariationsinorderofpreference.However,ourtestsubjectsareveryindecisive,especiallywhenthereare
morethantwoproducts.Asaresult,testingtakesforever.Isuggestedthatweperformthecomparisonsinpairsandthenusethesecomparisonstogettherankings.Thus,ifwehavethreeproductvariations,wehavethecustomerscomparevariations1and2,then2and3,andfinally3and1.Ourtestingtimewithmynewprocedureisathirdofwhatitwasfortheoldprocedure,buttheemployeesconductingthetestscomplainthattheycannotcomeupwithaconsistentrankingfromtheresults.Andmybosswantsthelatestproductevaluations,yesterday.Ishouldalsomentionthathewasthepersonwhocameupwiththeoldproductevaluationapproach.Canyouhelpme?”
a. Isthemarketingdirectorintrouble?Willhisapproachworkforgeneratinganordinalrankingoftheproductvariationsintermsofcustomerpreference?Explain.
b. Isthereawaytofixthemarketingdirector’sapproach?Moregenerally,whatcanyousayabouttryingtocreateanordinalmeasurementscalebasedonpairwisecomparisons?
c. Fortheoriginalproductevaluationscheme,theoverallrankingsofeachproductvariationarefoundbycomputingitsaverageoveralltestsubjects.Commentonwhetheryouthinkthatthisisareasonableapproach.Whatotherapproachesmightyoutake?
5.Canyouthinkofasituationinwhichidentificationnumberswouldbeusefulforprediction?
6.Aneducationalpsychologistwantstouseassociationanalysistoanalyzetestresults.Thetestconsistsof100questionswithfourpossibleanswerseach.
a. Howwouldyouconvertthisdataintoaformsuitableforassociationanalysis?
b. Inparticular,whattypeofattributeswouldyouhaveandhowmanyofthemarethere?
7.Whichofthefollowingquantitiesislikelytoshowmoretemporalautocorrelation:dailyrainfallordailytemperature?Why?
8.Discusswhyadocument-termmatrixisanexampleofadatasetthathasasymmetricdiscreteorasymmetriccontinuousfeatures.
9.Manysciencesrelyonobservationinsteadof(orinadditionto)designedexperiments.Comparethedataqualityissuesinvolvedinobservationalsciencewiththoseofexperimentalscienceanddatamining.
10.Discussthedifferencebetweentheprecisionofameasurementandthetermssingleanddoubleprecision,astheyareusedincomputerscience,typicallytorepresentfloating-pointnumbersthatrequire32and64bits,respectively.
11.Giveatleasttwoadvantagestoworkingwithdatastoredintextfilesinsteadofinabinaryformat.
12.Distinguishbetweennoiseandoutliers.Besuretoconsiderthefollowingquestions.
a. Isnoiseeverinterestingordesirable?Outliers?
b. Cannoiseobjectsbeoutliers?
c. Arenoiseobjectsalwaysoutliers?
d. Areoutliersalwaysnoiseobjects?
e. Cannoisemakeatypicalvalueintoanunusualone,orviceversa?
13. Consider the problem of finding the K-nearest neighbors of a data object. A programmer designs Algorithm 2.3 for this task.

Algorithm 2.3 Algorithm for finding k-nearest neighbors.
1: for i = 1 to number of data objects do
2:   Find the distances of the ith object to all other objects.
3:   Sort these distances in decreasing order. (Keep track of which object is associated with each distance.)
4:   return the objects associated with the first k distances of the sorted list
5: end for

a. Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will return a distance of 0 only for objects that are the same.
b. How would you fix this problem?

14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of proximity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.

15. You are given a set of m objects that is divided into k groups, where the ith group is of size mi. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)

a. We randomly select n × mi/m elements from each group.
b. We randomly select n elements from the data set, without regard for the group to which an object belongs.

16. Consider a document-term matrix, where tfij is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by

tf′ij = tfij × log(m/dfi),    (2.31)

where dfi is the number of documents in which the ith term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

a. What is the effect of this transformation if a term occurs in one document? In every document?
b. What might be the purpose of this transformation?

17. Assume that we apply a square root transformation to a ratio attribute x to obtain the new attribute x*. As part of your analysis, you identify an interval (a, b) in which x* has a linear relationship to another attribute y.

a. What is the corresponding interval (A, B) in terms of x?
b. Give an equation that relates y to x.

18. This exercise compares and contrasts some similarity and distance measures.

a. For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001
y = 0100011000

b. Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain. (Note: The Hamming measure is a distance, while the other three measures are similarities, but don't let this confuse you.)
c. Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)
d. If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share >99.9% of the same genes.)
19. For the following vectors, x and y, calculate the indicated similarity or distance measures.

a. x = (1, 1, 1, 1), y = (2, 2, 2, 2)  cosine, correlation, Euclidean
b. x = (0, 1, 0, 1), y = (1, 0, 1, 0)  cosine, correlation, Euclidean, Jaccard
c. x = (0, −1, 0, 1), y = (1, 0, −1, 0)  cosine, correlation, Euclidean
d. x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1)  cosine, correlation, Jaccard
e. x = (2, −1, 0, 2, 0, −3), y = (−1, 1, −1, 0, 0, −1)  cosine, correlation

20. Here, we further explore the cosine and correlation measures.

a. What is the range of values possible for the cosine measure?
b. If two objects have a cosine measure of 1, are they identical? Explain.
c. What is the relationship of the cosine measure to correlation, if any? (Hint: Look at statistical measures such as mean and standard deviation in cases where cosine and correlation are the same and different.)
d. Figure 2.22(a) shows the relationship of the cosine measure to Euclidean distance for 100,000 randomly generated points that have been normalized to have an L2 length of 1. What general observation can you make about the relationship between Euclidean distance and cosine similarity when vectors have an L2 norm of 1?
e. Figure 2.22(b) shows the relationship of correlation to Euclidean distance for 100,000 randomly generated points that have been standardized to have a mean of 0 and a standard deviation of 1. What general observation can you make about the relationship between Euclidean distance and correlation when the vectors have been standardized to have a mean of 0 and a standard deviation of 1?

Figure 2.22. Graphs for Exercise 20.

f. Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object has an L2 length of 1.
g. Derive the mathematical relationship between correlation and Euclidean distance when each data point has been standardized by subtracting its mean and dividing by its standard deviation.

21. Show that the set difference metric given by

d(A, B) = size(A − B) + size(B − A)    (2.32)

satisfies the metric axioms given on page 77. A and B are sets and A − B is the set difference.

22. Discuss how you might map correlation values from the interval [−1, 1] to the interval [0, 1]. Note that the type of transformation that you use might depend on the application that you have in mind. Thus, consider two applications: clustering time series and predicting the behavior of one time series given another.

23. Given a similarity measure with values in the interval [0, 1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0, ∞].

24. Proximity is typically defined between a pair of objects.

a. Define two ways in which you might define the proximity among a group of objects.
b. How might you define the distance between two sets of points in Euclidean space?
c. How might you define the proximity between two sets of data objects? (Make no assumption about the data objects, except that a proximity measure is defined between any pair of objects.)

25. You are given a set of points S in Euclidean space, as well as the distance of each point in S to a point x. (It does not matter if x ∈ S.)

a. If the goal is to find all points within a specified distance ε of point y, y ≠ x, explain how you could use the triangle inequality and the already calculated distances to x to potentially reduce the number of distance calculations necessary. Hint: The triangle inequality, d(x, z) ≤ d(x, y) + d(y, z), can be rewritten as d(x, y) ≥ d(x, z) − d(y, z).
b. In general, how would the distance between x and y affect the number of distance calculations?
c. Suppose that you can find a small subset of points S′ from the original data set, such that every point in the data set is within a specified distance ε of at least one of the points in S′, and that you also have the pairwise distance matrix for S′. Describe a technique that uses this information to compute, with a minimum of distance calculations, the set of all points within a distance of β of a specified point from the data set.

26. Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 77. Specifically, d(x, y) = 1 − J(x, y).

27. Show that the distance measure defined as the angle between two data vectors, x and y, satisfies the metric axioms given on page 77. Specifically, d(x, y) = arccos(cos(x, y)).

28. Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.
3Classification:BasicConceptsandTechniques
Humanshaveaninnateabilitytoclassifythingsintocategories,e.g.,mundanetaskssuchasfilteringspamemailmessagesormorespecializedtaskssuchasrecognizingcelestialobjectsintelescopeimages(seeFigure3.1 ).Whilemanualclassificationoftensufficesforsmallandsimpledatasetswithonlyafewattributes,largerandmorecomplexdatasetsrequireanautomatedsolution.
Figure3.1.ClassificationofgalaxiesfromtelescopeimagestakenfromtheNASAwebsite.
Thischapterintroducesthebasicconceptsofclassificationanddescribessomeofitskeyissuessuchasmodeloverfitting,modelselection,andmodelevaluation.Whilethesetopicsareillustratedusingaclassificationtechniqueknownasdecisiontreeinduction,mostofthediscussioninthischapterisalsoapplicabletootherclassificationtechniques,manyofwhicharecoveredinChapter4 .
3.1 Basic Concepts

Figure 3.2 illustrates the general idea behind classification. The data for a classification task consists of a collection of instances (records). Each such instance is characterized by the tuple (x, y), where x is the set of attribute values that describe the instance and y is the class label of the instance. The attribute set x can contain attributes of any type, while the class label y must be categorical.

Figure 3.2. A schematic illustration of a classification task.

A classification model is an abstract representation of the relationship between the attribute set and the class label. As will be seen in the next two chapters, the model can be represented in many ways, e.g., as a tree, a probability table, or simply, a vector of real-valued parameters. More formally, we can express it mathematically as a target function f that takes as input the attribute set x and produces an output corresponding to the predicted class label. The model is said to classify an instance (x, y) correctly if f(x) = y.

Table 3.1 shows examples of attribute sets and class labels for various classification tasks. Spam filtering and tumor identification are examples of binary classification problems, in which each data instance can be categorized into one of two classes. If the number of classes is larger than 2, as in the galaxy classification example, then it is called a multiclass classification problem.
Table 3.1. Examples of classification tasks.
Task | Attribute set | Class label
Spam filtering | Features extracted from email message header and content | spam or non-spam
Tumor identification | Features extracted from magnetic resonance imaging (MRI) scans | malignant or benign
Galaxy classification | Features extracted from telescope images | elliptical, spiral, or irregular-shaped
We illustrate the basic concepts of classification in this chapter with the following two examples.

Example 3.1 (Vertebrate Classification). Table 3.2 shows a sample data set for classifying vertebrates into mammals, reptiles, birds, fishes, and amphibians. The attribute set includes characteristics of the vertebrate such as its body temperature, skin cover, and ability to fly. The data set can also be used for a binary classification task such as mammal classification, by grouping the reptiles, birds, fishes, and amphibians into a single category called non-mammals.

Table 3.2. A sample data for the vertebrate classification problem.
Vertebrate Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
human | warm-blooded | hair | yes | no | no | yes | no | mammal
python | cold-blooded | scales | no | no | no | no | yes | reptile
salmon | cold-blooded | scales | no | yes | no | no | no | fish
whale | warm-blooded | hair | yes | yes | no | no | no | mammal
frog | cold-blooded | none | no | semi | no | yes | yes | amphibian
komodo dragon | cold-blooded | scales | no | no | no | yes | no | reptile
bat | warm-blooded | hair | yes | no | yes | yes | yes | mammal
pigeon | warm-blooded | feathers | no | no | yes | yes | no | bird
cat | warm-blooded | fur | yes | no | no | yes | no | mammal
leopard shark | cold-blooded | scales | yes | yes | no | no | no | fish
turtle | cold-blooded | scales | no | semi | no | yes | no | reptile
penguin | warm-blooded | feathers | no | semi | no | yes | no | bird
porcupine | warm-blooded | quills | yes | no | no | yes | yes | mammal
eel | cold-blooded | scales | no | yes | no | no | no | fish
salamander | cold-blooded | none | no | semi | no | yes | yes | amphibian

Example 3.2 (Loan Borrower Classification). Consider the problem of predicting whether a loan borrower will repay the loan or default on the loan payments. The data set used to build the classification model is shown in Table 3.3. The attribute set includes personal information of the borrower such as marital status and annual income, while the class label indicates whether the borrower had defaulted on the loan payments.
Table 3.3. A sample data for the loan borrower classification problem.
ID  Home Owner  Marital Status  Annual Income  Defaulted?
1 Yes Single 125000 No
2 No Married 100000 No
3 No Single 70000 No
4 Yes Married 120000 No
5 No Divorced 95000 Yes
6 No Single 60000 No
7 Yes Divorced 220000 No
8 No Single 85000 Yes
9 No Married 75000 No
10 No Single 90000 Yes
A classification model serves two important roles in data mining. First, it is used as a predictive model to classify previously unlabeled instances. A good classification model must provide accurate predictions with a fast response time. Second, it serves as a descriptive model to identify the characteristics that distinguish instances from different classes. This is particularly useful for critical applications, such as medical diagnosis, where it is insufficient to have a model that makes a prediction without justifying how it reaches such a decision.

For example, a classification model induced from the vertebrate data set shown in Table 3.2 can be used to predict the class label of the following vertebrate:

Vertebrate Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
gila monster | cold-blooded | scales | no | no | no | yes | yes | ?

In addition, it can be used as a descriptive model to help determine characteristics that define a vertebrate as a mammal, a reptile, a bird, a fish, or an amphibian. For example, the model may identify mammals as warm-blooded vertebrates that give birth to their young.

There are several points worth noting regarding the previous example. First, although all the attributes shown in Table 3.2 are qualitative, there are no restrictions on the type of attributes that can be used as predictor variables. The class label, on the other hand, must be of nominal type. This distinguishes classification from other predictive modeling tasks such as regression, where the predicted value is often quantitative. More information about regression can be found in Appendix D.

Another point worth noting is that not all attributes may be relevant to the classification task. For example, the average length or weight of a vertebrate may not be useful for classifying mammals, as these attributes can show the same value for both mammals and non-mammals. Such an attribute is typically discarded during preprocessing. The remaining attributes might not be able to distinguish the classes by themselves, and thus, must be used in concert with other attributes. For instance, the Body Temperature attribute is insufficient to distinguish mammals from other vertebrates. When it is used together with Gives Birth, the classification of mammals improves significantly. However, when additional attributes, such as Skin Cover, are included, the model becomes overly specific and no longer covers all mammals. Finding the optimal combination of attributes that best discriminates instances from different classes is the key challenge in building classification models.
3.2 General Framework for Classification
Classification is the task of assigning labels to unlabeled data instances, and a classifier is used to perform such a task. A classifier is typically described in terms of a model, as illustrated in the previous section. The model is created using a given set of instances, known as the training set, which contains attribute values as well as class labels for each instance. The systematic approach for learning a classification model given a training set is known as a learning algorithm. The process of using a learning algorithm to build a classification model from the training data is known as induction. This process is also often described as "learning a model" or "building a model." The process of applying a classification model to unseen test instances to predict their class labels is known as deduction. Thus, the process of classification involves two steps: applying a learning algorithm to training data to learn a model, and then applying the model to assign labels to unlabeled instances. Figure 3.3 illustrates the general framework for classification.

Figure 3.3. General framework for building a classification model.

A classification technique refers to a general approach to classification, e.g., the decision tree technique that we will study in this chapter. This classification technique, like most others, consists of a family of related models and a number of algorithms for learning these models. In Chapter 4, we will study additional classification techniques, including neural networks and support vector machines.

A couple of notes on terminology. First, the terms "classifier" and "model" are often taken to be synonymous. If a classification technique builds a single, global model, then this is fine. However, while every model defines a classifier, not every classifier is defined by a single model. Some classifiers, such as k-nearest neighbor classifiers, do not build an explicit model (Section 4.3), while other classifiers, such as ensemble classifiers, combine the output of a collection of models (Section 4.10). Second, the term "classifier" is often used in a more general sense to refer to a classification technique. Thus, for example, "decision tree classifier" can refer to the decision tree classification technique or a specific classifier built using that technique. Fortunately, the meaning of "classifier" is usually clear from the context.
In the general framework shown in Figure 3.3, the induction and deduction steps should be performed separately. In fact, as will be discussed later in Section 3.6, the training and test sets should be independent of each other to ensure that the induced model can accurately predict the class labels of instances it has never encountered before. Models that deliver such predictive insights are said to have good generalization performance. The performance of a model (classifier) can be evaluated by comparing the predicted labels against the true labels of instances. This information can be summarized in a table called a confusion matrix. Table 3.4 depicts the confusion matrix for a binary classification problem. Each entry f_{ij} denotes the number of instances from class i predicted to be of class j. For example, f_{01} is the number of instances from class 0 incorrectly predicted as class 1. The number of correct predictions made by the model is (f_{11} + f_{00}) and the number of incorrect predictions is (f_{10} + f_{01}).

Table 3.4. Confusion matrix for a binary classification problem.

                    Predicted Class = 1    Predicted Class = 0
Actual Class = 1    f_{11}                 f_{10}
Actual Class = 0    f_{01}                 f_{00}

Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information into a single number makes it more convenient to compare the relative performance of different models. This can be done using an evaluation metric such as accuracy, which is computed in the following way:

Accuracy = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}.   (3.1)

For binary classification problems, the accuracy of a model is given by

Accuracy = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}.   (3.2)

Error rate is another related metric, which is defined as follows for binary classification problems:

Error rate = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}.   (3.3)

The learning algorithms of most classification techniques are designed to learn models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set. We will revisit the topic of model evaluation in Section 3.6.
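As a quick illustration of Equations 3.2 and 3.3, the following minimal Python sketch (not part of the original text; the four counts are hypothetical) computes accuracy and error rate from the entries of a binary confusion matrix.

# A minimal sketch of Equations 3.2 and 3.3 with hypothetical confusion-matrix counts.
f11, f10 = 40, 10   # actual class 1 predicted as class 1 / as class 0 (hypothetical)
f01, f00 = 5, 45    # actual class 0 predicted as class 1 / as class 0 (hypothetical)

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total      # Equation 3.2
error_rate = (f10 + f01) / total    # Equation 3.3
assert abs(accuracy + error_rate - 1.0) < 1e-12
print(accuracy, error_rate)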
3.3 Decision Tree Classifier
This section introduces a simple classification technique known as the decision tree classifier. To illustrate how a decision tree works, consider the classification problem of distinguishing mammals from non-mammals using the vertebrate data set shown in Table 3.2. Suppose a new species is discovered by scientists. How can we tell whether it is a mammal or a non-mammal? One approach is to pose a series of questions about the characteristics of the species. The first question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a follow-up question: Do the females of the species give birth to their young? Those that do give birth are definitely mammals, while those that do not are likely to be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny anteater).

The previous example illustrates how we can solve a classification problem by asking a series of carefully crafted questions about the attributes of the test instance. Each time we receive an answer, we could ask a follow-up question until we can conclusively decide on its class label. The series of questions and their possible answers can be organized into a hierarchical structure called a decision tree. Figure 3.4 shows an example of the decision tree for the mammal classification problem. The tree has three types of nodes:

A root node, with no incoming links and zero or more outgoing links.
Internal nodes, each of which has exactly one incoming link and two or more outgoing links.
Leaf or terminal nodes, each of which has exactly one incoming link and no outgoing links.

Every leaf node in the decision tree is associated with a class label. The non-terminal nodes, which include the root and internal nodes, contain attribute test conditions that are typically defined using a single attribute. Each possible outcome of the attribute test condition is associated with exactly one child of this node. For example, the root node of the tree shown in Figure 3.4 uses the attribute Body Temperature to define an attribute test condition that has two outcomes, warm and cold, resulting in two child nodes.

Figure 3.4. A decision tree for the mammal classification problem.

Given a decision tree, classifying a test instance is straightforward. Starting from the root node, we apply its attribute test condition and follow the appropriate branch based on the outcome of the test. This will lead us either to another internal node, for which a new attribute test condition is applied, or to a leaf node. Once a leaf node is reached, we assign the class label associated with the node to the test instance. As an illustration, Figure 3.5 traces the path used to predict the class label of a flamingo. The path terminates at a leaf node labeled Non-mammals.

Figure 3.5. Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of applying various attribute test conditions on the unlabeled vertebrate. The vertebrate is eventually assigned to the Non-mammals class.
3.3.1 A Basic Algorithm to Build a Decision Tree

Many possible decision trees can be constructed from a particular data set. While some trees are better than others, finding an optimal one is computationally expensive due to the exponential size of the search space. Efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal, decision tree in a reasonable amount of time. These algorithms usually employ a greedy strategy to grow the decision tree in a top-down fashion by making a series of locally optimal decisions about which attribute to use when partitioning the training data. One of the earliest methods is Hunt's algorithm, which is the basis for many current implementations of decision tree classifiers, including ID3, C4.5, and CART. This subsection presents Hunt's algorithm and describes some of the design issues that must be considered when building a decision tree.

Hunt's Algorithm
In Hunt's algorithm, a decision tree is grown in a recursive fashion. The tree initially contains a single root node that is associated with all the training instances. If a node is associated with instances from more than one class, it is expanded using an attribute test condition that is determined using a splitting criterion. A child leaf node is created for each outcome of the attribute test condition and the instances associated with the parent node are distributed to the children based on the test outcomes. This node expansion step can then be recursively applied to each child node, as long as it has labels of more than one class. If all the instances associated with a leaf node have identical class labels, then the node is not expanded any further. Each leaf node is assigned a class label that occurs most frequently in the training instances associated with the node.
To illustrate how the algorithm works, consider the training set shown in Table 3.3 for the loan borrower classification problem. Suppose we apply Hunt's algorithm to fit the training data. The tree initially contains only a single leaf node, as shown in Figure 3.6(a). This node is labeled as Defaulted = No, since the majority of the borrowers did not default on their loan payments. The training error of this tree is 30%, as three out of the ten training instances have the class label Defaulted = Yes. The leaf node can therefore be further expanded because it contains training instances from more than one class.

Figure 3.6. Hunt's algorithm for building decision trees.

Let Home Owner be the attribute chosen to split the training instances. The justification for choosing this attribute as the attribute test condition will be discussed later. The resulting binary split on the Home Owner attribute is shown in Figure 3.6(b). All the training instances for which Home Owner = Yes are propagated to the left child of the root node and the rest are propagated to the right child. Hunt's algorithm is then recursively applied to each child. The left child becomes a leaf node labeled Defaulted = No, since all instances associated with this node have the identical class label Defaulted = No. The right child has instances from each class label. Hence, we split it further. The resulting subtrees after recursively expanding the right child are shown in Figures 3.6(c) and (d).

Hunt's algorithm, as described above, makes some simplifying assumptions that are often not true in practice. In the following, we describe these assumptions and briefly discuss some of the possible ways for handling them (a minimal code sketch illustrating both cases follows the list).

1. Some of the child nodes created in Hunt's algorithm can be empty if none of the training instances have the particular attribute values. One way to handle this is by declaring each of them a leaf node with a class label that occurs most frequently among the training instances associated with their parent nodes.

2. If all training instances associated with a node have identical attribute values but different class labels, it is not possible to expand this node any further. One way to handle this case is to declare it a leaf node and assign it the class label that occurs most frequently in the training instances associated with this node.
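The following is a minimal Python sketch (not from the text) of the recursive expansion just described, including the two special cases above. The attribute chosen for splitting here is simply the first remaining attribute, standing in for a real splitting criterion, and the small data set is a hypothetical excerpt in the spirit of Table 3.3.

# A minimal sketch of Hunt's algorithm for categorical attributes.
from collections import Counter

def majority_label(rows, label):
    return Counter(r[label] for r in rows).most_common(1)[0][0]

def hunt(rows, attributes, label, parent_rows=None):
    if not rows:                              # case 1: empty child inherits the parent's majority label
        return majority_label(parent_rows, label)
    labels = {r[label] for r in rows}
    if len(labels) == 1:                      # pure node: stop expanding
        return labels.pop()
    if not attributes:                        # case 2: identical attribute values left, take majority label
        return majority_label(rows, label)
    attr = attributes[0]                      # placeholder for a real splitting criterion
    tree = {attr: {}}
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        tree[attr][value] = hunt(subset, attributes[1:], label, rows)
    return tree

data = [  # hypothetical rows in the spirit of Table 3.3
    {"HomeOwner": "Yes", "MaritalStatus": "Single",   "Defaulted": "No"},
    {"HomeOwner": "No",  "MaritalStatus": "Married",  "Defaulted": "No"},
    {"HomeOwner": "No",  "MaritalStatus": "Divorced", "Defaulted": "Yes"},
    {"HomeOwner": "No",  "MaritalStatus": "Single",   "Defaulted": "Yes"},
]
print(hunt(data, ["HomeOwner", "MaritalStatus"], "Defaulted"))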
Design Issues of Decision Tree Induction
Hunt's algorithm is a generic procedure for growing decision trees in a greedy fashion. To implement the algorithm, there are two key design issues that must be addressed.

1. What is the splitting criterion? At each recursive step, an attribute must be selected to partition the training instances associated with a node into smaller subsets associated with its child nodes. The splitting criterion determines which attribute is chosen as the test condition and how the training instances should be distributed to the child nodes. This will be discussed in Sections 3.3.2 and 3.3.3.

2. What is the stopping criterion? The basic algorithm stops expanding a node only when all the training instances associated with the node have the same class labels or have identical attribute values. Although these conditions are sufficient, there are reasons to stop expanding a node much earlier even if the leaf node contains training instances from more than one class. This process is called early termination and the condition used to determine when a node should be stopped from expanding is called a stopping criterion. The advantages of early termination are discussed in Section 3.4.
3.3.2 Methods for Expressing Attribute Test Conditions

Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types.

Binary Attributes
The test condition for a binary attribute generates two potential outcomes, as shown in Figure 3.7.

Figure 3.7. Attribute test condition for a binary attribute.

Nominal Attributes
Since a nominal attribute can have many values, its attribute test condition can be expressed in two ways, as a multiway split or a binary split, as shown in Figure 3.8. For a multiway split (Figure 3.8(a)), the number of outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split. It is also possible to create a binary split by partitioning all values taken by the nominal attribute into two groups. For example, some decision tree algorithms, such as CART, produce only binary splits by considering all 2^{k-1} - 1 ways of creating a binary partition of k attribute values. Figure 3.8(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.

Figure 3.8. Attribute test conditions for nominal attributes.
Ordinal Attributes
Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values. Figure 3.9 illustrates various ways of splitting training records based on the Shirt Size attribute. The groupings shown in Figures 3.9(a) and (b) preserve the order among the attribute values, whereas the grouping shown in Figure 3.9(c) violates this property because it combines the attribute values Small and Large into the same partition while Medium and Extra Large are combined into another partition.

Figure 3.9. Different ways of grouping ordinal attribute values.
Continuous Attributes
For continuous attributes, the attribute test condition can be expressed as a comparison test (e.g., A < v) producing a binary split, or as a range query of the form v_i <= A < v_{i+1}, for i = 1, ..., k, producing a multiway split. The difference between these approaches is shown in Figure 3.10. For the binary split, any possible value v between the minimum and maximum attribute values in the training data can be used for constructing the comparison test A < v. However, it is sufficient to only consider distinct attribute values in the training set as candidate split positions. For the multiway split, any possible collection of attribute value ranges can be used, as long as they are mutually exclusive and cover the entire range of attribute values between the minimum and maximum values observed in the training set. One approach for constructing multiway splits is to apply the discretization strategies described in Section 2.3.6 on page 63. After discretization, a new ordinal value is assigned to each discretized interval, and the attribute test condition is then defined using this newly constructed ordinal attribute.

Figure 3.10. Test condition for continuous attributes.
3.3.3 Measures for Selecting an Attribute Test Condition

There are many measures that can be used to determine the goodness of an attribute test condition. These measures try to give preference to attribute test conditions that partition the training instances into purer subsets in the child nodes, which mostly have the same class labels. Having purer nodes is useful since a node that has all of its training instances from the same class does not need to be expanded further. In contrast, an impure node containing training instances from multiple classes is likely to require several levels of node expansions, thereby increasing the depth of the tree considerably. Larger trees are less desirable as they are more susceptible to model overfitting, a condition that may degrade the classification performance on unseen instances, as will be discussed in Section 3.4. They are also difficult to interpret and incur more training and test time as compared to smaller trees.

In the following, we present different ways of measuring the impurity of a node and the collective impurity of its child nodes, both of which will be used to identify the best attribute test condition for a node.
Impurity Measure for a Single Node
The impurity of a node measures how dissimilar the class labels are for the data instances belonging to a common node. The following are examples of measures that can be used to evaluate the impurity of a node t:

Entropy = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t),   (3.4)

Gini index = 1 - \sum_{i=0}^{c-1} p_i(t)^2,   (3.5)

Classification error = 1 - \max_i [p_i(t)],   (3.6)

where p_i(t) is the relative frequency of training instances that belong to class i at node t, c is the total number of classes, and 0 \log_2 0 = 0 in entropy calculations. All three measures give a zero impurity value if a node contains instances from a single class and maximum impurity if the node has an equal proportion of instances from multiple classes.

Figure 3.11 compares the relative magnitudes of the impurity measures when applied to binary classification problems. Since there are only two classes, p_0(t) + p_1(t) = 1. The horizontal axis refers to the fraction of instances that belong to one of the two classes. Observe that all three measures attain their maximum value when the class distribution is uniform (i.e., p_0(t) = p_1(t) = 0.5) and their minimum value when all the instances belong to a single class (i.e., either p_0(t) or p_1(t) equals 1). The following examples illustrate how the values of the impurity measures vary as we alter the class distribution.

Figure 3.11. Comparison among the impurity measures for binary classification problems.
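Before working through the example nodes below, here is a minimal Python sketch (not from the text) that evaluates Equations 3.4 through 3.6 for a node, given its vector of class counts.

# A minimal sketch of the three impurity measures applied to class counts at a node.
from math import log2

def impurities(counts):
    n = sum(counts)
    p = [c / n for c in counts]
    entropy = -sum(pi * log2(pi) for pi in p if pi > 0)   # 0 log2 0 is taken as 0
    gini = 1 - sum(pi ** 2 for pi in p)
    error = 1 - max(p)
    return entropy, gini, error

print(impurities([0, 6]))   # node N1 below: (0.0, 0.0, 0.0)
print(impurities([1, 5]))   # node N2 below: (~0.650, ~0.278, ~0.167)
print(impurities([3, 3]))   # node N3 below: (1.0, 0.5, 0.5)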
Node N1 (Class = 0 count: 0, Class = 1 count: 6):
Gini = 1 - (0/6)^2 - (6/6)^2 = 0
Entropy = -(0/6) \log_2(0/6) - (6/6) \log_2(6/6) = 0
Error = 1 - max[0/6, 6/6] = 0

Node N2 (Class = 0 count: 1, Class = 1 count: 5):
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
Entropy = -(1/6) \log_2(1/6) - (5/6) \log_2(5/6) = 0.650
Error = 1 - max[1/6, 5/6] = 0.167

Node N3 (Class = 0 count: 3, Class = 1 count: 3):
Gini = 1 - (3/6)^2 - (3/6)^2 = 0.5
Entropy = -(3/6) \log_2(3/6) - (3/6) \log_2(3/6) = 1
Error = 1 - max[3/6, 3/6] = 0.5

Based on these calculations, node N1 has the lowest impurity value, followed by N2 and N3. This example, along with Figure 3.11, shows the consistency among the impurity measures, i.e., if a node N1 has lower entropy than node N2, then the Gini index and error rate of N1 will also be lower than those of N2. Despite their agreement, the attribute chosen as the splitting criterion by the impurity measures can still be different (see Exercise 6 on page 187).

Collective Impurity of Child Nodes
Consider an attribute test condition that splits a node containing N training instances into k children, {v_1, v_2, ..., v_k}, where every child node represents a partition of the data resulting from one of the k outcomes of the attribute test condition. Let N(v_j) be the number of training instances associated with a child node v_j, whose impurity value is I(v_j). Since a training instance in the parent node reaches node v_j for a fraction of N(v_j)/N times, the collective impurity of the child nodes can be computed by taking a weighted sum of the impurities of the child nodes, as follows:

I(children) = \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j),   (3.7)

Example 3.3. Weighted Entropy
Consider the candidate attribute test conditions shown in Figures 3.12(a) and (b) for the loan borrower classification problem.
Figure 3.12. Examples of candidate attribute test conditions.
Splitting on the Home Owner attribute will generate two child nodes, whose weighted entropy can be calculated as follows:

I(Home Owner = yes) = -(0/3) \log_2(0/3) - (3/3) \log_2(3/3) = 0
I(Home Owner = no) = -(3/7) \log_2(3/7) - (4/7) \log_2(4/7) = 0.985
I(Home Owner) = (3/10) x 0 + (7/10) x 0.985 = 0.690

Splitting on Marital Status, on the other hand, leads to three child nodes with a weighted entropy given by

I(Marital Status = Single) = -(2/5) \log_2(2/5) - (3/5) \log_2(3/5) = 0.971
I(Marital Status = Married) = -(0/3) \log_2(0/3) - (3/3) \log_2(3/3) = 0
I(Marital Status = Divorced) = -(1/2) \log_2(1/2) - (1/2) \log_2(1/2) = 1.000
I(Marital Status) = (5/10) x 0.971 + (3/10) x 0 + (2/10) x 1 = 0.686

Thus, Marital Status has a lower weighted entropy than Home Owner.

Identifying the Best Attribute Test Condition
To determine the goodness of an attribute test condition, we need to compare the degree of impurity of the parent node (before splitting) with the weighted degree of impurity of the child nodes (after splitting).
The larger their difference, the better the test condition. This difference, Δ, also termed the gain in purity of an attribute test condition, can be defined as follows:

Δ = I(parent) - I(children),   (3.8)

where I(parent) is the impurity of a node before splitting and I(children) is the weighted impurity measure after splitting. It can be shown that the gain is non-negative, i.e., I(parent) >= I(children), for any reasonable measure such as those presented above. The higher the gain, the purer are the classes in the child nodes relative to the parent node. The splitting criterion in the decision tree learning algorithm selects the attribute test condition that shows the maximum gain. Note that maximizing the gain at a given node is equivalent to minimizing the weighted impurity measure of its children, since I(parent) is the same for all candidate attribute test conditions. Finally, when entropy is used as the impurity measure, the difference in entropy is commonly known as the information gain, Δ_info.

Figure 3.13. Splitting criteria for the loan borrower classification problem using the Gini index.

In the following, we present illustrative approaches for identifying the best attribute test condition given qualitative or quantitative attributes.

Splitting of Qualitative Attributes
Consider the first two candidate splits shown in Figure 3.12 involving the qualitative attributes Home Owner and Marital Status. The initial class distribution at the parent node is (0.3, 0.7), since there are 3 instances of the defaulted class and 7 instances of the non-defaulted class in the training data. Thus,

I(parent) = -(3/10) \log_2(3/10) - (7/10) \log_2(7/10) = 0.881.

The information gains for Home Owner and Marital Status are each given by

Δ_info(Home Owner) = 0.881 - 0.690 = 0.191
Δ_info(Marital Status) = 0.881 - 0.686 = 0.195

The information gain for Marital Status is thus higher due to its lower weighted entropy, and it is therefore chosen for splitting.
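The weighted-entropy and information-gain calculations above can be reproduced with a short Python sketch (not from the text); the class counts per child node are read off Table 3.3.

# A minimal sketch of weighted entropy (Equation 3.7) and information gain (Equation 3.8).
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def weighted_impurity(children):            # children: list of per-child class-count lists
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * entropy(c) for c in children)

parent = [3, 7]                             # [Defaulted = Yes, Defaulted = No]
home_owner = [[0, 3], [3, 4]]               # Home Owner = Yes node, Home Owner = No node
marital_status = [[2, 3], [0, 3], [1, 1]]   # Single, Married, Divorced

for name, split in [("Home Owner", home_owner), ("Marital Status", marital_status)]:
    gain = entropy(parent) - weighted_impurity(split)
    print(name, round(weighted_impurity(split), 3), round(gain, 3))
# Home Owner: weighted entropy ~0.690, gain ~0.191
# Marital Status: weighted entropy ~0.686, gain ~0.195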
Binary Splitting of Qualitative Attributes
Consider building a decision tree using only binary splits and the Gini index as the impurity measure. Figure 3.13 shows examples of four candidate splitting criteria for the Home Owner and Marital Status attributes. Since there are 3 borrowers in the training set who defaulted and 7 others who repaid their loan (see the table in Figure 3.13), the Gini index of the parent node before splitting is

1 - (3/10)^2 - (7/10)^2 = 0.420.

If Home Owner is chosen as the splitting attribute, the Gini indices for the child nodes N1 and N2 are 0 and 0.490, respectively. The weighted average Gini index for the children is

(3/10) x 0 + (7/10) x 0.490 = 0.343,

where the weights represent the proportion of training instances assigned to each child. The gain using Home Owner as the splitting attribute is 0.420 - 0.343 = 0.077. Similarly, we can apply a binary split on the Marital Status attribute. However, since Marital Status is a nominal attribute with three outcomes, there are three possible ways to group the attribute values into a binary split. The weighted average Gini index of the children for each candidate binary split is shown in Figure 3.13. Based on these results, Home Owner and the last binary split using Marital Status are clearly the best candidates, since they both produce the lowest weighted average Gini index. Binary splits can also be used for ordinal attributes, if the binary partitioning of the attribute values does not violate the ordering property of the values.

Binary Splitting of Quantitative Attributes
Consider the problem of identifying the best binary split Annual Income <= τ for the preceding loan approval classification problem. As discussed previously, even though τ can take any value between the minimum and maximum values of annual income in the training set, it is sufficient to only consider the annual income values observed in the training set as candidate split positions. For each candidate τ, the training set is scanned once to count the number of borrowers with annual income less than or greater than τ, along with their class proportions. We can then compute the Gini index at each candidate split position and choose the τ that produces the lowest value.
Computing the Gini index at each candidate split position requires O(N) operations, where N is the number of training instances. Since there are at most N possible candidates, the overall complexity of this brute-force method is O(N^2). It is possible to reduce the complexity of this problem to O(N log N) by using the method described as follows (see the illustration in Figure 3.14). In this method, we first sort the training instances based on their annual income, a one-time cost that requires O(N log N) operations. The candidate split positions are given by the midpoints between every two adjacent sorted values: $55,000, $65,000, $72,500, and so on. For the first candidate, since none of the instances has an annual income less than or equal to $55,000, the Gini index for the child node with Annual Income <= $55,000 is equal to zero. In contrast, there are 3 training instances of the defaulted class and 7 instances of the non-defaulted class with annual income greater than $55,000. The Gini index for this node is 0.420. The weighted average Gini index for the first candidate split position, τ = $55,000, is equal to 0 x 0 + 1 x 0.420 = 0.420.

Figure 3.14. Splitting continuous attributes.

For the next candidate, τ = $65,000, the class distribution of its child nodes can be obtained with a simple update of the distribution for the previous candidate. This is because, as τ increases from $55,000 to $65,000, there is only one training instance affected by the change. By examining the class label of the affected training instance, the new class distribution is obtained. For example, as τ increases to $65,000, there is only one borrower in the training set, with an annual income of $60,000, affected by this change. Since the class label for this borrower is No, the count for class No increases from 0 to 1 (for Annual Income <= $65,000) and decreases from 7 to 6 (for Annual Income > $65,000), as shown in Figure 3.14. The distribution for the Yes class remains unaffected. The updated Gini index for this candidate split position is 0.400.

This procedure is repeated until the Gini index for all candidates has been found. The best split position corresponds to the one that produces the lowest Gini index, which occurs at τ = $97,500. Since the Gini index at each candidate split position can be computed in O(1) time, the complexity of finding the best split position is O(N) once all the values are kept sorted, a one-time operation that takes O(N log N) time. The overall complexity of this method is thus O(N log N), which is much smaller than the O(N^2) time taken by the brute-force method. The amount of computation can be further reduced by considering only candidate split positions located between two adjacent sorted instances with different class labels. For example, we do not need to consider candidate split positions located between $60,000 and $75,000 because all three instances with annual income in this range ($60,000, $70,000, and $75,000) have the same class label. Choosing a split position within this range only increases the degree of impurity, compared to a split position located outside this range. Therefore, the candidate split positions at τ = $65,000 and τ = $72,500 can be ignored. Similarly, we do not need to consider the candidate split positions at $87,500, $92,500, $110,000, $122,500, and $172,500 because they are located between two adjacent instances with the same labels. This strategy reduces the number of candidate split positions to consider from 9 to 2 (excluding the two boundary cases τ = $55,000 and τ = $230,000).
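A minimal Python sketch of the sorted-scan procedure just described is given below (not from the text). It uses the ten Annual Income values of Table 3.3 (in thousands), keeps running class counts on each side of the candidate split, and reports the weighted Gini index of the best midpoint.

# A minimal sketch of the O(N log N) search for the best split Annual Income <= tau.
def gini(counts):
    n = sum(counts)
    return 0.0 if n == 0 else 1 - sum((c / n) ** 2 for c in counts)

data = sorted([(125, 0), (100, 0), (70, 0), (120, 0), (95, 1),
               (60, 0), (220, 0), (85, 1), (75, 0), (90, 1)])   # (income in K, 1 = defaulted)
n = len(data)
left = [0, 0]                                                   # class counts for income <= tau
right = [sum(1 for _, y in data if y == c) for c in (0, 1)]     # class counts for income > tau

best = (float("inf"), None)
for i in range(n - 1):
    left[data[i][1]] += 1                   # move one instance across the split boundary
    right[data[i][1]] -= 1
    tau = (data[i][0] + data[i + 1][0]) / 2                      # midpoint candidate position
    weighted = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
    best = min(best, (weighted, tau))
print(best)                                 # lowest weighted Gini (0.3) at tau = 97.5, i.e., $97,500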
Gain Ratio
One potential limitation of impurity measures such as entropy and the Gini index is that they tend to favor qualitative attributes with a large number of distinct values. Figure 3.12 shows three candidate attributes for partitioning the data set given in Table 3.3. As previously mentioned, the attribute Marital Status is a better choice than the attribute Home Owner, because it provides a larger information gain. However, if we compare them against Customer ID, the latter produces the purest partitions with the maximum information gain, since the weighted entropy and Gini index are equal to zero for its children. Yet, Customer ID is not a good attribute for splitting because it has a unique value for each instance. Even though a test condition involving Customer ID will accurately classify every instance in the training data, we cannot use such a test condition on new test instances with Customer ID values that haven't been seen before during training. This example suggests that having a low impurity value alone is insufficient for finding a good attribute test condition for a node. As we will see later in Section 3.4, having a larger number of child nodes can make a decision tree more complex and consequently more susceptible to overfitting. Hence, the number of children produced by the splitting attribute should also be taken into consideration while deciding the best attribute test condition.

There are two ways to overcome this problem. One way is to generate only binary decision trees, thus avoiding the difficulty of handling attributes with varying numbers of partitions. This strategy is employed by decision tree classifiers such as CART. Another way is to modify the splitting criterion to take into account the number of partitions produced by the attribute. For example, in the C4.5 decision tree algorithm, a measure known as the gain ratio is used to compensate for attributes that produce a large number of child nodes. This measure is computed as follows:

Gain ratio = \frac{Δ_info}{Split Info} = \frac{Entropy(Parent) - \sum_{i=1}^{k} \frac{N(v_i)}{N} Entropy(v_i)}{-\sum_{i=1}^{k} \frac{N(v_i)}{N} \log_2 \frac{N(v_i)}{N}}   (3.9)

where N(v_i) is the number of instances assigned to node v_i and k is the total number of splits. The split information measures the entropy of splitting a node into its child nodes and evaluates whether the split results in a larger number of equally-sized child nodes or not. For example, if every partition has the same number of instances, then for all i, N(v_i)/N = 1/k, and the split information would be equal to \log_2 k. Thus, if an attribute produces a large number of splits, its split information is also large, which in turn reduces the gain ratio.

Example 3.4. Gain Ratio
Consider the data set given in Exercise 2 on page 185. We want to select the best attribute test condition among the following three attributes: Gender, Car Type, and Customer ID. The entropy before splitting is

Entropy(parent) = -(10/20) \log_2(10/20) - (10/20) \log_2(10/20) = 1.

If Gender is used as the attribute test condition:

Entropy(children) = (10/20) [-(6/10) \log_2(6/10) - (4/10) \log_2(4/10)] x 2 = 0.971
Gain Ratio = (1 - 0.971) / [-(10/20) \log_2(10/20) - (10/20) \log_2(10/20)] = 0.029/1 = 0.029

If Car Type is used as the attribute test condition:

Entropy(children) = (4/20) [-(1/4) \log_2(1/4) - (3/4) \log_2(3/4)] + (8/20) x 0 + (8/20) [-(1/8) \log_2(1/8) - (7/8) \log_2(7/8)] = 0.380
Gain Ratio = (1 - 0.380) / [-(4/20) \log_2(4/20) - (8/20) \log_2(8/20) - (8/20) \log_2(8/20)] = 0.620/1.52 = 0.41

Finally, if Customer ID is used as the attribute test condition:

Entropy(children) = (1/20) [-(1/1) \log_2(1/1) - 0] x 20 = 0
Gain Ratio = (1 - 0) / [-(1/20) \log_2(1/20) x 20] = 1/4.32 = 0.23

Thus, even though Customer ID has the highest information gain, its gain ratio is lower than that of Car Type since it produces a larger number of splits.
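The gain-ratio computations of Example 3.4 can be reproduced with the following minimal Python sketch (not from the text); the per-child class counts are those implied by the entropy values above, and the attribute names follow Exercise 2.

# A minimal sketch of the gain ratio of Equation 3.9 for the three candidate splits.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent, children):
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children)
    return (entropy(parent) - weighted) / split_info

parent = [10, 10]
print(round(gain_ratio(parent, [[6, 4], [4, 6]]), 3))           # Gender: ~0.029
print(round(gain_ratio(parent, [[1, 3], [8, 0], [1, 7]]), 3))   # Car Type: ~0.41
print(round(gain_ratio(parent, [[1, 0]] * 20), 3))              # Customer ID: ~0.23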
3.3.4 Algorithm for Decision Tree Induction

Algorithm 3.1 presents a pseudocode for the decision tree induction algorithm. The input to this algorithm is a set of training instances E along with the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the nodes of the tree (Steps 11 and 12) until the stopping criterion is met (Step 1). The details of this algorithm are explained below.

Algorithm 3.1 A skeleton decision tree induction algorithm.
TreeGrowth (E, F)
1:  if stopping_cond(E, F) = true then
2:    leaf = createNode().
3:    leaf.label = Classify(E).
4:    return leaf.
5:  else
6:    root = createNode().
7:    root.test_cond = find_best_split(E, F).
8:    let V = {v | v is a possible outcome of root.test_cond}.
9:    for each v in V do
10:     E_v = {e | root.test_cond(e) = v and e in E}.
11:     child = TreeGrowth(E_v, F).
12:     add child as a descendent of root and label the edge (root -> child) as v.
13:   end for
14: end if
15: return root.

1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree either has a test condition, denoted as node.test_cond, or a class label, denoted as node.label.

2. The find_best_split() function determines the attribute test condition for partitioning the training instances associated with a node. The splitting attribute chosen depends on the impurity measure used. Popular measures include entropy and the Gini index.

3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training instances from class i associated with the node t. The label assigned to the leaf node is typically the one that occurs most frequently in the training instances associated with this node:

leaf.label = argmax_i p(i|t),   (3.10)

where the argmax operator returns the class i that maximizes p(i|t). Besides providing the information needed to determine the class label of a leaf node, p(i|t) can also be used as a rough estimate of the probability that an instance assigned to the leaf node t belongs to class i. Sections 4.11.2 and 4.11.4 in the next chapter describe how such probability estimates can be used to determine the performance of a decision tree under different cost functions.

4. The stopping_cond() function is used to terminate the tree-growing process by checking whether all the instances have identical class labels or attribute values. Since decision tree classifiers employ a top-down, recursive partitioning approach for building a model, the number of training instances associated with a node decreases as the depth of the tree increases. As a result, a leaf node may contain too few training instances to make a statistically significant decision about its class label. This is known as the data fragmentation problem. One way to avoid this problem is to disallow splitting of a node when the number of instances associated with the node falls below a certain threshold. A more systematic way to control the size of a decision tree (number of leaf nodes) will be discussed in Section 3.5.4.
3.3.5 Example Application: Web Robot Detection

Consider the task of distinguishing the access patterns of web robots from those generated by human users. A web robot (also known as a web crawler) is a software program that automatically retrieves files from one or more websites by following the hyperlinks extracted from an initial set of seed URLs. These programs have been deployed for various purposes, from gathering web pages on behalf of search engines to more malicious activities such as spamming and committing click fraud in online advertisements.
Figure3.15.Inputdataforwebrobotdetection.
Thewebrobotdetectionproblemcanbecastasabinaryclassificationtask.Theinputdatafortheclassificationtaskisawebserverlog,asampleofwhichisshowninFigure3.15(a) .Eachlineinthelogfilecorrespondstoarequestmadebyaclient(i.e.,ahumanuserorawebrobot)tothewebserver.Thefieldsrecordedintheweblogincludetheclient'sIPaddress,timestampoftherequest,URLoftherequestedfile,sizeofthefile,anduseragent,whichisafieldthatcontainsidentifyinginformationabouttheclient.
Forhumanusers,theuseragentfieldspecifiesthetypeofwebbrowserormobiledeviceusedtofetchthefiles,whereasforwebrobots,itshouldtechnicallycontainthenameofthecrawlerprogram.However,webrobotsmayconcealtheirtrueidentitiesbydeclaringtheiruseragentfieldstobeidenticaltoknownbrowsers.Therefore,useragentisnotareliablefieldtodetectwebrobots.
Thefirststeptowardbuildingaclassificationmodelistopreciselydefineadatainstanceandassociatedattributes.Asimpleapproachistoconsidereachlogentryasadatainstanceandusetheappropriatefieldsinthelogfileasitsattributeset.Thisapproach,however,isinadequateforseveralreasons.First,manyoftheattributesarenominal-valuedandhaveawiderangeofdomainvalues.Forexample,thenumberofuniqueclientIPaddresses,URLs,andreferrersinalogfilecanbeverylarge.Theseattributesareundesirableforbuildingadecisiontreebecausetheirsplitinformationisextremelyhigh(seeEquation(3.9) ).Inaddition,itmightnotbepossibletoclassifytestinstancescontainingIPaddresses,URLs,orreferrersthatarenotpresentinthetrainingdata.Finally,byconsideringeachlogentryasaseparatedatainstance,wedisregardthesequenceofwebpagesretrievedbytheclient—acriticalpieceofinformationthatcanhelpdistinguishwebrobotaccessesfromthoseofahumanuser.
Abetteralternativeistoconsidereachwebsessionasadatainstance.Awebsessionisasequenceofrequestsmadebyaclientduringagivenvisittothewebsite.Eachwebsessioncanbemodeledasadirectedgraph,inwhichthenodescorrespondtowebpagesandtheedgescorrespondtohyperlinksconnectingonewebpagetoanother.Figure3.15(b) showsagraphicalrepresentationofthefirstwebsessiongiveninthelogfile.Everywebsessioncanbecharacterizedusingsomemeaningfulattributesaboutthegraphthatcontaindiscriminatoryinformation.Figure3.15(c) showssomeoftheattributesextractedfromthegraph,includingthedepthandbreadthofits
correspondingtreerootedattheentrypointtothewebsite.Forexample,thedepthandbreadthofthetreeshowninFigure3.15(b) arebothequaltotwo.
ThederivedattributesshowninFigure3.15(c) aremoreinformativethantheoriginalattributesgiveninthelogfilebecausetheycharacterizethebehavioroftheclientatthewebsite.Usingthisapproach,adatasetcontaining2916instanceswascreated,withequalnumbersofsessionsduetowebrobots(class1)andhumanusers(class0).10%ofthedatawerereservedfortrainingwhiletheremaining90%wereusedfortesting.TheinduceddecisiontreeisshowninFigure3.16 ,whichhasanerrorrateequalto3.8%onthetrainingsetand5.3%onthetestset.Inadditiontoitslowerrorrate,thetreealsorevealssomeinterestingpropertiesthatcanhelpdiscriminatewebrobotsfromhumanusers:
1. Accessesbywebrobotstendtobebroadbutshallow,whereasaccessesbyhumanuserstendtobemorefocused(narrowbutdeep).
2. Webrobotsseldomretrievetheimagepagesassociatedwithawebpage.
3. Sessionsduetowebrobotstendtobelongandcontainalargenumberofrequestedpages.
4. Webrobotsaremorelikelytomakerepeatedrequestsforthesamewebpagethanhumanuserssincethewebpagesretrievedbyhumanusersareoftencachedbythebrowser.
3.3.6CharacteristicsofDecisionTreeClassifiers
Thefollowingisasummaryoftheimportantcharacteristicsofdecisiontreeinductionalgorithms.
1. Applicability:Decisiontreesareanonparametricapproachforbuildingclassificationmodels.Thisapproachdoesnotrequireanypriorassumptionabouttheprobabilitydistributiongoverningtheclassandattributesofthedata,andthus,isapplicabletoawidevarietyofdatasets.Itisalsoapplicabletobothcategoricalandcontinuousdatawithoutrequiringtheattributestobetransformedintoacommonrepresentationviabinarization,normalization,orstandardization.UnlikesomebinaryclassifiersdescribedinChapter4 ,itcanalsodealwithmulticlassproblemswithouttheneedtodecomposethemintomultiplebinaryclassificationtasks.Anotherappealingfeatureofdecisiontreeclassifiersisthattheinducedtrees,especiallytheshorterones,arerelativelyeasytointerpret.Theaccuraciesofthetreesarealsoquitecomparabletootherclassificationtechniquesformanysimpledatasets.
2. Expressiveness: A decision tree provides a universal representation for discrete-valued functions. In other words, it can encode any function of discrete-valued attributes. This is because every discrete-valued function can be represented as an assignment table, where every unique combination of discrete attributes is assigned a class label. Since every combination of attributes can be represented as a leaf in the decision tree, we can always find a decision tree whose label assignments at the leaf nodes match the assignment table of the original function. Decision trees can also provide compact representations of functions when some of the unique combinations of attributes can be represented by the same leaf node. For example, Figure 3.17 shows the assignment table of the Boolean function (A AND B) OR (C AND D) involving four binary attributes, resulting in a total of 2^4 = 16 possible assignments. The tree shown in Figure 3.17 is a compressed encoding of this assignment table. Instead of requiring a fully-grown tree with 16 leaf nodes, it is possible to encode the function using a simpler tree with only 7 leaf nodes. Nevertheless, not all decision trees for discrete-valued attributes can be simplified. One notable example is the parity function, whose value is 1 when there is an even number of true values among its Boolean attributes, and 0 otherwise. Accurate modeling of such a function requires a full decision tree with 2^d nodes, where d is the number of Boolean attributes (see Exercise 1 on page 185).

Figure 3.16. Decision tree model for web robot detection.

Figure 3.17. Decision tree for the Boolean function (A AND B) OR (C AND D).
3. ComputationalEfficiency:Sincethenumberofpossibledecisiontreescanbeverylarge,manydecisiontreealgorithmsemployaheuristic-basedapproachtoguidetheirsearchinthevasthypothesisspace.Forexample,thealgorithmpresentedinSection3.3.4 usesagreedy,top-down,recursivepartitioningstrategyforgrowingadecisiontree.Formanydatasets,suchtechniquesquicklyconstructareasonablygooddecisiontreeevenwhenthetrainingsetsizeisverylarge.Furthermore,onceadecisiontreehasbeenbuilt,classifyingatestrecordisextremelyfast,withaworst-casecomplexityofO(w),wherewisthemaximumdepthofthetree.
4. HandlingMissingValues:Adecisiontreeclassifiercanhandlemissingattributevaluesinanumberofways,bothinthetrainingandthetestsets.Whentherearemissingvaluesinthetestset,theclassifiermustdecidewhichbranchtofollowifthevalueofasplitting
nodeattributeismissingforagiventestinstance.Oneapproach,knownastheprobabilisticsplitmethod,whichisemployedbytheC4.5decisiontreeclassifier,distributesthedatainstancetoeverychildofthesplittingnodeaccordingtotheprobabilitythatthemissingattributehasaparticularvalue.Incontrast,theCARTalgorithmusesthesurrogatesplitmethod,wheretheinstancewhosesplittingattributevalueismissingisassignedtooneofthechildnodesbasedonthevalueofanothernon-missingsurrogateattributewhosesplitsmostresemblethepartitionsmadebythemissingattribute.Anotherapproach,knownastheseparateclassmethodisusedbytheCHAIDalgorithm,wherethemissingvalueistreatedasaseparatecategoricalvaluedistinctfromothervaluesofthesplittingattribute.Figure3.18showsanexampleofthethreedifferentwaysforhandlingmissingvaluesinadecisiontreeclassifier.Otherstrategiesfordealingwithmissingvaluesarebasedondatapreprocessing,wheretheinstancewithmissingvalueiseitherimputedwiththemode(forcategoricalattribute)ormean(forcontinuousattribute)valueordiscardedbeforetheclassifieristrained.
Figure3.18.Methodsforhandlingmissingattributevaluesindecisiontreeclassifier.
Duringtraining,ifanattributevhasmissingvaluesinsomeofthetraininginstancesassociatedwithanode,weneedawaytomeasurethegaininpurityifvisusedforsplitting.Onesimplewayistoexcludeinstanceswithmissingvaluesofvinthecountingofinstancesassociatedwitheverychildnode,generatedforeverypossibleoutcomeofv.Further,ifvischosenastheattributetestconditionatanode,traininginstanceswithmissingvaluesofvcanbepropagatedtothechildnodesusinganyofthemethodsdescribedaboveforhandlingmissingvaluesintestinstances.
5. Handling Interactions among Attributes: Attributes are considered interacting if they are able to distinguish between classes when used together, but individually they provide little or no information. Due to the greedy nature of the splitting criteria in decision trees, such attributes could be passed over in favor of other attributes that are not as useful. This could result in more complex decision trees than necessary. Hence, decision trees can perform poorly when there are interactions among attributes. To illustrate this point, consider the three-dimensional data shown in Figure 3.19(a), which contains 2000 data points from one of two classes, denoted as + and o in the diagram. Figure 3.19(b) shows the distribution of the two classes in the two-dimensional space involving attributes X and Y, which is a noisy version of the XOR Boolean function. We can see that even though the two classes are well-separated in this two-dimensional space, neither of the two attributes contains sufficient information to distinguish between the two classes when used alone. For example, the entropies of the attribute test conditions X <= 10 and Y <= 10 are close to 1, indicating that neither X nor Y provides any reduction in the impurity measure when used individually. X and Y thus represent a case of interaction among attributes. The data set also contains a third attribute, Z, in which both classes are distributed uniformly, as shown in Figures 3.19(c) and 3.19(d), and hence the entropy of any split involving Z is close to 1. As a result, Z is as likely to be chosen for splitting as the interacting but useful attributes, X and Y. For further illustration of this issue, readers are referred to Example 3.7 in Section 3.4.1 and Exercise 7 at the end of this chapter.

Figure 3.19. Example of XOR data involving X and Y, along with an irrelevant attribute Z.
6. HandlingIrrelevantAttributes:Anattributeisirrelevantifitisnotusefulfortheclassificationtask.Sinceirrelevantattributesarepoorlyassociatedwiththetargetclasslabels,theywillprovidelittleornogaininpurityandthuswillbepassedoverbyothermorerelevantfeatures.Hence,thepresenceofasmallnumberofirrelevantattributeswillnotimpactthedecisiontreeconstructionprocess.However,notallattributesthatprovidelittletonogainareirrelevant(seeFigure3.19 ).Hence,iftheclassificationproblemiscomplex(e.g.,involvinginteractionsamongattributes)andtherearealargenumberofirrelevantattributes,thensomeoftheseattributesmaybeaccidentallychosenduringthetree-growingprocess,sincetheymayprovideabettergainthanarelevantattributejustbyrandomchance.Featureselectiontechniquescanhelptoimprovetheaccuracyofdecisiontreesbyeliminatingtheirrelevantattributesduringpreprocessing.WewillinvestigatetheissueoftoomanyirrelevantattributesinSection3.4.1 .
7. HandlingRedundantAttributes:Anattributeisredundantifitisstronglycorrelatedwithanotherattributeinthedata.Sinceredundantattributesshowsimilargainsinpurityiftheyareselectedforsplitting,onlyoneofthemwillbeselectedasanattributetestconditioninthedecisiontreealgorithm.Decisiontreescanthushandlethepresenceofredundantattributes.
8. UsingRectilinearSplits:Thetestconditionsdescribedsofarinthischapterinvolveusingonlyasingleattributeatatime.Asaconsequence,thetree-growingprocedurecanbeviewedastheprocessofpartitioningtheattributespaceintodisjointregionsuntileachregioncontainsrecordsofthesameclass.Theborderbetweentwoneighboringregionsofdifferentclassesisknownasadecisionboundary.Figure3.20 showsthedecisiontreeaswellasthedecisionboundaryforabinaryclassificationproblem.Sincethetestconditioninvolvesonlyasingleattribute,thedecisionboundariesare
rectilinear, i.e., parallel to the coordinate axes. This limits the expressiveness of decision trees in representing decision boundaries of data sets with continuous attributes. Figure 3.21 shows a two-dimensional data set involving binary classes that cannot be perfectly classified by a decision tree whose attribute test conditions are defined based on single attributes. The binary classes in the data set are generated from two skewed Gaussian distributions, centered at (8, 8) and (12, 12), respectively. The true decision boundary is represented by the diagonal dashed line, whereas the rectilinear decision boundary produced by the decision tree classifier is shown by the thick solid line. In contrast, an oblique decision tree may overcome this limitation by allowing the test condition to be specified using more than one attribute. For example, the binary classification data shown in Figure 3.21 can be easily represented by an oblique decision tree with a single root node with test condition x + y < 20.

Figure 3.20. Example of a decision tree and its decision boundaries for a two-dimensional data set.

Figure 3.21. Example of a data set that cannot be partitioned optimally using a decision tree with single-attribute test conditions. The true decision boundary is shown by the dashed line.
Althoughanobliquedecisiontreeismoreexpressiveandcanproducemorecompacttrees,findingtheoptimaltestconditioniscomputationallymoreexpensive.
9. ChoiceofImpurityMeasure:Itshouldbenotedthatthechoiceofimpuritymeasureoftenhaslittleeffectontheperformanceofdecisiontreeclassifierssincemanyoftheimpuritymeasuresarequiteconsistentwitheachother,asshowninFigure3.11 onpage129.Instead,thestrategyusedtoprunethetreehasagreaterimpactonthefinaltreethanthechoiceofimpuritymeasure.
3.4ModelOverfittingMethodspresentedsofartrytolearnclassificationmodelsthatshowthelowesterroronthetrainingset.However,aswewillshowinthefollowingexample,evenifamodelfitswelloverthetrainingdata,itcanstillshowpoorgeneralizationperformance,aphenomenonknownasmodeloverfitting.
Figure3.22.Examplesoftrainingandtestsetsofatwo-dimensionalclassificationproblem.
Figure3.23.Effectofvaryingtreesize(numberofleafnodes)ontrainingandtesterrors.
Example 3.5. Overfitting and Underfitting of Decision Trees
Consider the two-dimensional data set shown in Figure 3.22(a). The data set contains instances that belong to two separate classes, represented as + and o, respectively, where each class has 5400 instances. All instances belonging to the o class were generated from a uniform distribution. For the + class, 5000 instances were generated from a Gaussian distribution centered at (10, 10) with unit variance, while the remaining 400 instances were sampled from the same uniform distribution as the o class. We can see from Figure 3.22(a) that the + class can be largely distinguished from the o class by drawing a circle of appropriate size centered at (10, 10). To learn a classifier using this two-dimensional data set, we randomly sampled 10% of the data for training and used the remaining 90% for testing. The training set, shown in Figure 3.22(b), looks quite representative of the overall data. We used the Gini index as the impurity measure to construct decision trees of increasing sizes (numbers of leaf nodes), by recursively expanding a node into child nodes until every leaf node was pure, as described in Section 3.3.4.

Figure 3.23(a) shows changes in the training and test error rates as the size of the tree varies from 1 to 8. Both error rates are initially large when the tree has only one or two leaf nodes. This situation is known as model underfitting. Underfitting occurs when the learned decision tree is too simplistic, and thus incapable of fully representing the true relationship between the attributes and the class labels. As we increase the tree size from 1 to 8, we can observe two effects. First, both error rates decrease, since larger trees are able to represent more complex decision boundaries. Second, the training and test error rates are quite close to each other, which indicates that the performance on the training set is fairly representative of the generalization performance. As we further increase the size of the tree from 8 to 150, the training error continues to steadily decrease until it eventually reaches zero, as shown in Figure 3.23(b). However, in striking contrast, the test error rate ceases to decrease beyond a certain tree size, and then it begins to increase. The training error rate thus grossly under-estimates the test error rate once the tree becomes too large. Further, the gap between the training and test error rates keeps widening as we increase the tree size. This behavior, which may seem counter-intuitive at first, can be attributed to the phenomenon of model overfitting.
3.4.1 Reasons for Model Overfitting

Model overfitting is the phenomenon where, in the pursuit of minimizing the training error rate, an overly complex model is selected that captures specific patterns in the training data but fails to learn the true nature of the relationships between attributes and class labels in the overall data. To illustrate this, Figure 3.24 shows decision trees and their corresponding decision boundaries (shaded rectangles represent regions assigned to the + class) for two trees of sizes 5 and 50. We can see that the decision tree of size 5 appears quite simple and its decision boundaries provide a reasonable approximation to the ideal decision boundary, which in this case corresponds to a circle centered around the Gaussian distribution at (10, 10). Although its training and test error rates are non-zero, they are very close to each other, which indicates that the patterns learned in the training set should generalize well over the test set. On the other hand, the decision tree of size 50 appears much more complex than the tree of size 5, with complicated decision boundaries. For example, some of its shaded rectangles (assigned the + class) attempt to cover narrow regions in the input space that contain only one or two + training instances. Note that the prevalence of + instances in such regions is highly specific to the training set, as these regions are mostly dominated by o instances in the overall data. Hence, in an attempt to perfectly fit the training data, the decision tree of size 50 starts fine-tuning itself to specific patterns in the training data, leading to poor performance on an independently chosen test set.

Figure 3.24. Decision trees with different model complexities.

Figure 3.25. Performance of decision trees using 20% of the data for training (twice the original training size).
Thereareanumberoffactorsthatinfluencemodeloverfitting.Inthefollowing,weprovidebriefdescriptionsoftwoofthemajorfactors:limitedtrainingsizeandhighmodelcomplexity.Thoughtheyarenotexhaustive,theinterplaybetweenthemcanhelpexplainmostofthecommonmodeloverfittingphenomenainreal-worldapplications.
LimitedTrainingSizeNotethatatrainingsetconsistingofafinitenumberofinstancescanonlyprovidealimitedrepresentationoftheoveralldata.Hence,itispossiblethatthepatternslearnedfromatrainingsetdonotfullyrepresentthetruepatternsintheoveralldata,leadingtomodeloverfitting.Ingeneral,asweincreasethesizeofatrainingset(numberoftraininginstances),thepatternslearnedfromthetrainingsetstartresemblingthetruepatternsintheoveralldata.Hence,
theeffectofoverfittingcanbereducedbyincreasingthetrainingsize,asillustratedinthefollowingexample.
Example 3.6. Effect of Training Size
Suppose that we use twice the number of training instances than what we had used in the experiments conducted in Example 3.5. Specifically, we use 20% of the data for training and use the remainder for testing. Figure 3.25(b) shows the training and test error rates as the size of the tree is varied from 1 to 150. There are two major differences between the trends shown in this figure and those shown in Figure 3.23(b) (using only 10% of the data for training). First, even though the training error rate decreases with increasing tree size in both figures, its rate of decrease is much smaller when we use twice the training size. Second, for a given tree size, the gap between the training and test error rates is much smaller when we use twice the training size. These differences suggest that the patterns learned using 20% of the data for training are more generalizable than those learned using 10% of the data for training.

Figure 3.25(a) shows the decision boundaries for the tree of size 50, learned using 20% of the data for training. In contrast to the tree of the same size learned using 10% of the data for training (see Figure 3.24(d)), we can see that the decision tree is not capturing specific patterns of noisy instances in the training set. Instead, the high model complexity of 50 leaf nodes is being effectively used to learn the boundaries of the + instances centered at (10, 10).

High Model Complexity
Generally, a more complex model has a better ability to represent complex patterns in the data. For example, decision trees with a larger number of leaf nodes can represent more complex decision boundaries than decision trees with fewer leaf nodes. However, an overly complex model also has a tendency to learn specific patterns in the training set that do not generalize well over unseen instances. Models with high complexity should thus be used judiciously to avoid overfitting.
Onemeasureofmodelcomplexityisthenumberof“parameters”thatneedtobeinferredfromthetrainingset.Forexample,inthecaseofdecisiontreeinduction,theattributetestconditionsatinternalnodescorrespondtotheparametersofthemodelthatneedtobeinferredfromthetrainingset.Adecisiontreewithlargernumberofattributetestconditions(andconsequentlymoreleafnodes)thusinvolvesmore“parameters”andhenceismorecomplex.
Givenaclassofmodelswithacertainnumberofparameters,alearningalgorithmattemptstoselectthebestcombinationofparametervaluesthatmaximizesanevaluationmetric(e.g.,accuracy)overthetrainingset.Ifthenumberofparametervaluecombinations(andhencethecomplexity)islarge,thelearningalgorithmhastoselectthebestcombinationfromalargenumberofpossibilities,usingalimitedtrainingset.Insuchcases,thereisahighchanceforthelearningalgorithmtopickaspuriouscombinationofparametersthatmaximizestheevaluationmetricjustbyrandomchance.Thisissimilartothemultiplecomparisonsproblem(alsoreferredasmultipletestingproblem)instatistics.
As an illustration of the multiple comparisons problem, consider the task of predicting whether the stock market will rise or fall in the next ten trading days. If a stock analyst simply makes random guesses, the probability that her prediction is correct on any trading day is 0.5. However, the probability that she will predict correctly at least nine out of ten times is

\frac{\binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0107,

which is extremely low.

Suppose we are interested in choosing an investment advisor from a pool of 200 stock analysts. Our strategy is to select the analyst who makes the most correct predictions in the next ten trading days. The flaw in this strategy is that even if all the analysts make their predictions in a random fashion, the probability that at least one of them makes at least nine correct predictions is

1 - (1 - 0.0107)^{200} = 0.8847,

which is very high. Although each analyst has a low probability of predicting at least nine times correctly, considered together, we have a high probability of finding at least one analyst who can do so. However, there is no guarantee in the future that such an analyst will continue to make accurate predictions by random guessing.

How does the multiple comparisons problem relate to model overfitting? In the context of learning a classification model, each combination of parameter values corresponds to an analyst, while the number of training instances corresponds to the number of days. Analogous to the task of selecting the best analyst who makes the most accurate predictions on consecutive days, the task of a learning algorithm is to select the best combination of parameters that results in the highest accuracy on the training set. If the number of parameter combinations is large but the training size is small, it is highly likely for the learning algorithm to choose a spurious parameter combination that provides high training accuracy just by random chance. In the following example, we illustrate the phenomenon of overfitting due to multiple comparisons in the context of decision tree induction.
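The two probabilities in this illustration can be checked with a few lines of Python (not from the text):

# A minimal sketch of the multiple-comparisons probabilities computed above.
from math import comb

p_single = (comb(10, 9) + comb(10, 10)) / 2 ** 10    # one random guesser right >= 9 of 10 days: ~0.0107
p_any_of_200 = 1 - (1 - p_single) ** 200             # at least one of 200 guessers does so: ~0.8847
print(round(p_single, 4), round(p_any_of_200, 4))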
Figure 3.26. Example of a two-dimensional (X-Y) data set.

Figure 3.27. Training and test error rates illustrating the effect of the multiple comparisons problem on model overfitting.
Example 3.7. Multiple Comparisons and Overfitting
Consider the two-dimensional data set shown in Figure 3.26 containing 500 + and 500 o instances, which is similar to the data shown in Figure 3.19. In this data set, the distributions of both classes are well-separated in the two-dimensional (X-Y) attribute space, but neither of the two attributes (X or Y) is sufficiently informative to be used alone for separating the two classes. Hence, splitting the data set based on any value of the X or Y attribute will provide close to zero reduction in an impurity measure. However, if X and Y are used together in the splitting criterion (e.g., splitting X at 10 and Y at 10), the two classes can be effectively separated.

Figure 3.28. Decision tree with 6 leaf nodes using X and Y as attributes. Splits have been numbered from 1 to 5 in order of their occurrence in the tree.

Figure 3.27(a) shows the training and test error rates for learning decision trees of varying sizes, when 30% of the data is used for training and the remainder for testing. We can see that the two classes can be separated using a small number of leaf nodes. Figure 3.28 shows the decision boundaries for the tree with six leaf nodes, where the splits have been numbered according to their order of appearance in the tree. Note that even though splits 1 and 3 provide trivial gains, their subsequent splits (2, 4, and 5) provide large gains, resulting in effective discrimination of the two classes.

Assume we add 100 irrelevant attributes to the two-dimensional X-Y data. Learning a decision tree from the resulting data will be challenging because the number of candidate attributes to choose from for splitting at every internal node will increase from two to 102. With such a large number of candidate attribute test conditions to choose from, it is quite likely that spurious attribute test conditions will be selected at internal nodes because of the multiple comparisons problem. Figure 3.27(b) shows the training and test error rates after adding 100 irrelevant attributes to the training set. We can see that the test error rate remains close to 0.5 even after using 50 leaf nodes, while the training error rate keeps declining and eventually becomes 0.
3.5ModelSelectionTherearemanypossibleclassificationmodelswithvaryinglevelsofmodelcomplexitythatcanbeusedtocapturepatternsinthetrainingdata.Amongthesepossibilities,wewanttoselectthemodelthatshowslowestgeneralizationerrorrate.Theprocessofselectingamodelwiththerightlevelofcomplexity,whichisexpectedtogeneralizewelloverunseentestinstances,isknownasmodelselection.Asdescribedintheprevioussection,thetrainingerrorratecannotbereliablyusedasthesolecriterionformodelselection.Inthefollowing,wepresentthreegenericapproachestoestimatethegeneralizationperformanceofamodelthatcanbeusedformodelselection.Weconcludethissectionbypresentingspecificstrategiesforusingtheseapproachesinthecontextofdecisiontreeinduction.
3.5.1UsingaValidationSet
Notethatwecanalwaysestimatethegeneralizationerrorrateofamodelbyusing“out-of-sample”estimates,i.e.byevaluatingthemodelonaseparatevalidationsetthatisnotusedfortrainingthemodel.Theerrorrateonthevalidationset,termedasthevalidationerrorrate,isabetterindicatorofgeneralizationperformancethanthetrainingerrorrate,sincethevalidationsethasnotbeenusedfortrainingthemodel.Thevalidationerrorratecanbeusedformodelselectionasfollows.
Given a training set D.train, we can partition D.train into two smaller subsets, D.tr and D.val, such that D.tr is used for training while D.val is used as the validation set. For example, two-thirds of D.train can be reserved as D.tr for training, while the remaining one-third is used as D.val for computing the validation error rate. For any choice of classification model m that is trained on D.tr, we can estimate its validation error rate on D.val, err_val(m). The model that shows the lowest value of err_val(m) can then be selected as the preferred choice of model.

The use of a validation set provides a generic approach for model selection. However, one limitation of this approach is that it is sensitive to the sizes of D.tr and D.val obtained by partitioning D.train. If the size of D.tr is too small, it may result in the learning of a poor classification model with sub-standard performance, since a smaller training set will be less representative of the overall data. On the other hand, if the size of D.val is too small, the validation error rate might not be reliable for selecting models, as it would be computed over a small number of instances.

Figure 3.29. Class distribution of validation data for the two decision trees shown in Figure 3.30.
3.8. Example: Validation Error. In the following example, we illustrate one possible approach for using a validation set in decision tree induction. Figure 3.29 shows the predicted labels at the leaf nodes of the decision trees generated in Figure 3.30. The counts given beneath the leaf nodes represent the proportion of data instances in the validation set that reach each of the nodes. Based on the predicted labels of the nodes, the validation error rate for the left tree is err_val(T_L) = 6/16 = 0.375, while the validation error rate for the right tree is err_val(T_R) = 4/16 = 0.25. Based on their validation error rates, the right tree is preferred over the left one.
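This selection procedure is straightforward to sketch in code. The following is a minimal illustration, not the book's own implementation: it assumes scikit-learn and a synthetic data set, treats the maximum depth of a decision tree as the model choice, splits D.train into D.tr and D.val in the two-thirds/one-third proportions mentioned above, and keeps the candidate with the lowest validation error rate.

# A minimal sketch of validation-set model selection (Section 3.5.1).
# The candidate models (trees of different maximum depth) and the
# synthetic data are illustrative assumptions only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Partition D.train into D.tr (two-thirds) and D.val (one-third).
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

best_model, best_err = None, np.inf
for depth in range(1, 11):                      # candidate models m
    m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    err_val = 1.0 - m.score(X_val, y_val)       # validation error rate err_val(m)
    if err_val < best_err:
        best_model, best_err = m, err_val

print(f"selected depth = {best_model.get_depth()}, err_val = {best_err:.3f}")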
3.5.2 Incorporating Model Complexity
Since the chance of model overfitting increases as the model becomes more complex, a model selection approach should not only consider the training error rate but also the model complexity. This strategy is inspired by a well-known principle known as Occam's razor or the principle of parsimony, which suggests that given two models with the same errors, the simpler model is preferred over the more complex model. A generic approach to account for model complexity while estimating generalization performance is formally described as follows.
Given a training set D.train, let us consider learning a classification model m that belongs to a certain class of models, M. For example, if M represents the set of all possible decision trees, then m can correspond to a specific decision tree learned from the training set. We are interested in estimating the generalization error rate of m, gen.error(m). As discussed previously, the training error rate of m, train.error(m, D.train), can under-estimate gen.error(m) when the model complexity is high. Hence, we represent gen.error(m) as a function of not just the training error rate but also the model complexity of M, complexity(M), as follows:

gen.error(m) = train.error(m, D.train) + α × complexity(M),   (3.11)

where α is a hyper-parameter that strikes a balance between minimizing training error and reducing model complexity. A higher value of α gives more emphasis to the model complexity in the estimation of generalization performance. To choose the right value of α, we can make use of the validation set in a similar way as described in Section 3.5.1. For example, we can iterate through a range of values of α and, for every possible value, learn a model on a subset of the training set, D.tr, and compute its validation error rate on a separate subset, D.val. We can then select the value of α that provides the lowest validation error rate.

Equation 3.11 provides one possible approach for incorporating model complexity into the estimate of generalization performance. This approach is at the heart of a number of techniques for estimating generalization performance, such as the structural risk minimization principle, the Akaike's Information Criterion (AIC), and the Bayesian Information Criterion (BIC). The structural risk minimization principle serves as the building block for learning support vector machines, which will be discussed later in Chapter 4. For more details on AIC and BIC, see the Bibliographic Notes.

In the following, we present two different approaches for estimating the complexity of a model, complexity(M). While the former is specific to decision trees, the latter is more generic and can be used with any class of models.
Estimating the Complexity of Decision Trees

In the context of decision trees, the complexity of a decision tree can be estimated as the ratio of the number of leaf nodes to the number of training instances. Let k be the number of leaf nodes and N_train be the number of training instances. The complexity of a decision tree can then be described as k/N_train. This reflects the intuition that for a larger training size, we can learn a decision tree with a larger number of leaf nodes without it becoming overly complex. The generalization error rate of a decision tree T can then be computed using Equation 3.11 as follows:

err_gen(T) = err(T) + Ω × k/N_train,

where err(T) is the training error of the decision tree and Ω is a hyper-parameter that makes a trade-off between reducing training error and minimizing model complexity, similar to the use of α in Equation 3.11. Ω can be viewed as the relative cost of adding a leaf node relative to incurring a training error. In the literature on decision tree induction, the above approach for estimating the generalization error rate is also termed the pessimistic error estimate. It is called pessimistic as it assumes the generalization error rate to be worse than the training error rate (by adding a penalty term for model complexity). On the other hand, simply using the training error rate as an estimate of the generalization error rate is called the optimistic error estimate or the resubstitution estimate.

3.9. Example: Generalization Error Estimates. Consider the two binary decision trees, T_L and T_R, shown in Figure 3.30. Both trees are generated from the same training data and T_L is generated by expanding three leaf nodes of T_R. The counts shown in the leaf nodes of the trees represent the class distribution of the training instances. If each leaf node is labeled according to the majority class of training instances that reach the node, the training error rate for the left tree will be err(T_L) = 4/24 = 0.167, while the training error rate for the right tree will be err(T_R) = 6/24 = 0.25. Based on their training error rates alone, T_L would be preferred over T_R, even though T_L is more complex (contains a larger number of leaf nodes) than T_R.

Figure 3.30. Example of two decision trees generated from the same training data.

Now, assume that the cost associated with each leaf node is Ω = 0.5. Then, the generalization error estimate for T_L will be

err_gen(T_L) = 4/24 + 0.5 × 7/24 = 7.5/24 = 0.3125

and the generalization error estimate for T_R will be

err_gen(T_R) = 6/24 + 0.5 × 4/24 = 8/24 = 0.3333.

Since T_L has a lower generalization error rate, it will still be preferred over T_R. Note that Ω = 0.5 implies that a node should always be expanded into its two child nodes if it improves the prediction of at least one training instance, since expanding a node is less costly than misclassifying a training instance. On the other hand, if Ω = 1, then the generalization error rate for T_L is err_gen(T_L) = 11/24 = 0.458 and for T_R is err_gen(T_R) = 10/24 = 0.417. In this case, T_R will be preferred over T_L because it has a lower generalization error rate. This example illustrates that different choices of Ω can change our preference of decision trees based on their generalization error estimates. However, for a given choice of Ω, the pessimistic error estimate provides an approach for modeling the generalization performance on unseen test instances. The value of Ω can be selected with the help of a validation set.
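The pessimistic error estimate is simple enough to verify directly. The short sketch below (illustrative only) encodes err_gen(T) = err(T) + Ω × k/N_train and reproduces the numbers of Example 3.9 for both choices of Ω.

# A small sketch of the pessimistic error estimate of Section 3.5.2, applied
# to the two trees of Example 3.9 (T_L: 4 training errors, 7 leaves;
# T_R: 6 training errors, 4 leaves; 24 training instances).
def pessimistic_error(train_errors, num_leaves, n_train, omega):
    return train_errors / n_train + omega * num_leaves / n_train

N_TRAIN = 24
for omega in (0.5, 1.0):
    err_L = pessimistic_error(4, 7, N_TRAIN, omega)
    err_R = pessimistic_error(6, 4, N_TRAIN, omega)
    preferred = "T_L" if err_L < err_R else "T_R"
    print(f"Omega={omega}: err_gen(T_L)={err_L:.4f}, "
          f"err_gen(T_R)={err_R:.4f} -> prefer {preferred}")
# Omega=0.5 -> 0.3125 vs 0.3333 (prefer T_L); Omega=1.0 -> 0.4583 vs 0.4167 (prefer T_R).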
Minimum Description Length Principle

Another way to incorporate model complexity is based on an information-theoretic approach known as the minimum description length or MDL principle. To illustrate this approach, consider the example shown in Figure 3.31. In this example, both person A and person B are given a set of instances with known attribute values x. Assume person A knows the class label y for every instance, while person B has no such information. A would like to share the class information with B by sending a message containing the labels. The message would contain Θ(N) bits of information, where N is the number of instances.

Figure 3.31. An illustration of the minimum description length principle.

Alternatively, instead of sending the class labels explicitly, A can build a classification model from the instances and transmit it to B. B can then apply the model to determine the class labels of the instances. If the model is 100% accurate, then the cost for transmission is equal to the number of bits required to encode the model. Otherwise, A must also transmit information about which instances are misclassified by the model so that B can reproduce the same class labels. Thus, the overall transmission cost, which is equal to the total description length of the message, is

Cost(model, data) = Cost(data|model) + α × Cost(model),   (3.12)

where the first term on the right-hand side is the number of bits needed to encode the misclassified instances, while the second term is the number of bits required to encode the model. There is also a hyper-parameter α that trades off the relative costs of the misclassified instances and the model.

Notice the similarity between this equation and the generic equation for the generalization error rate presented in Equation 3.11. A good model must have a total description length less than the number of bits required to encode the entire sequence of class labels. Furthermore, given two competing models, the model with the lower total description length is preferred. An example showing how to compute the total description length of a decision tree is given in Exercise 10 on page 189.
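The trade-off in Equation 3.12 can be made concrete with a toy calculation. The sketch below assumes one simple encoding, in the spirit of the formulation referenced in Exercise 10: each internal node is encoded by the identity of its splitting attribute, each leaf by its class label, and each misclassified instance by its index. The specific trees being compared are hypothetical.

# A rough sketch of the MDL comparison of Equation 3.12 under one assumed
# encoding: log2(m) bits per internal node to name its splitting attribute,
# log2(c) bits per leaf for its class label, and log2(N) bits to identify
# each misclassified training instance.
import math

def description_length(n_internal, n_leaves, n_errors,
                       m_attributes, n_classes, n_instances, alpha=1.0):
    cost_model = (n_internal * math.log2(m_attributes)
                  + n_leaves * math.log2(n_classes))
    cost_data_given_model = n_errors * math.log2(n_instances)
    return cost_data_given_model + alpha * cost_model

# Hypothetical comparison: a small tree with more errors versus a larger
# tree with fewer errors, on N = 100 instances, 16 attributes, 3 classes.
small = description_length(n_internal=2, n_leaves=3, n_errors=10,
                           m_attributes=16, n_classes=3, n_instances=100)
large = description_length(n_internal=6, n_leaves=7, n_errors=3,
                           m_attributes=16, n_classes=3, n_instances=100)
print(f"small tree: {small:.1f} bits, large tree: {large:.1f} bits")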
3.5.3 Estimating Statistical Bounds
Instead of using Equation 3.11 to estimate the generalization error rate of a model, an alternative way is to apply a statistical correction to the training error rate of the model that is indicative of its model complexity. This can be done if the probability distribution of the training error is available or can be assumed. For example, the number of errors committed by a leaf node in a decision tree can be assumed to follow a binomial distribution. We can thus compute an upper bound to the observed training error rate that can be used for model selection, as illustrated in the following example.
3.10. Example: Statistical Bounds on Training Error. Consider the left-most branch of the binary decision trees shown in Figure 3.30. Observe that the left-most leaf node of T_R has been expanded into two child nodes in T_L. Before splitting, the training error rate of the node is 2/7 = 0.286. By approximating a binomial distribution with a normal distribution, the following upper bound of the training error rate e can be derived:

e_upper(N, e, α) = ( e + z_{α/2}²/(2N) + z_{α/2} × √( e(1−e)/N + z_{α/2}²/(4N²) ) ) / ( 1 + z_{α/2}²/N ),   (3.13)

where α is the confidence level, z_{α/2} is the standardized value from a standard normal distribution, and N is the total number of training instances used to compute e. By replacing α = 25%, N = 7, and e = 2/7, the upper bound for the error rate is e_upper(7, 2/7, 0.25) = 0.503, which corresponds to 7 × 0.503 = 3.521 errors. If we expand the node into its child nodes as shown in T_L, the training error rates for the child nodes are 1/4 = 0.250 and 1/3 = 0.333, respectively. Using Equation (3.13), the upper bounds of these error rates are e_upper(4, 1/4, 0.25) = 0.537 and e_upper(3, 1/3, 0.25) = 0.650, respectively. The overall training error of the child nodes is 4 × 0.537 + 3 × 0.650 = 4.098, which is larger than the estimated error for the corresponding node in T_R, suggesting that it should not be split.
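Equation 3.13 can be evaluated directly to reproduce the numbers above. In the sketch below, the standardized value z_{α/2} = 1.15 for α = 25% is taken from a standard normal table; everything else follows the formula.

# A sketch of the upper bound of Equation 3.13 on a node's training error
# rate, reproducing the values of Example 3.10.
import math

def e_upper(n, e, z):
    """Normal-approximation upper bound on the error rate of a node."""
    num = e + z**2 / (2 * n) + z * math.sqrt(e * (1 - e) / n + z**2 / (4 * n**2))
    return num / (1 + z**2 / n)

z = 1.15  # z_{alpha/2} for alpha = 25% (87.5th percentile of a standard normal)
print(e_upper(7, 2 / 7, z))   # ~0.503: bound for the unsplit node
print(e_upper(4, 1 / 4, z))   # ~0.537
print(e_upper(3, 1 / 3, z))   # ~0.650
print(4 * e_upper(4, 1/4, z) + 3 * e_upper(3, 1/3, z))  # ~4.098 expected errors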
3.5.4 Model Selection for Decision Trees

Building on the generic approaches presented above, we present two commonly used model selection strategies for decision tree induction.

Prepruning (Early Stopping Rule)
In this approach, the tree-growing algorithm is halted before generating a fully grown tree that perfectly fits the entire training data. To do this, a more restrictive stopping condition must be used; e.g., stop expanding a leaf node when the observed gain in the generalization error estimate falls below a certain threshold. This estimate of the generalization error rate can be computed using any of the approaches presented in the preceding three subsections, e.g., by using pessimistic error estimates, by using validation error estimates, or by using statistical bounds. The advantage of prepruning is that it avoids the computations associated with generating overly complex subtrees that overfit the training data. However, one major drawback of this method is that, even if no significant gain is obtained using one of the existing splitting criteria, subsequent splitting may result in better subtrees. Such subtrees would not be reached if prepruning is used because of the greedy nature of decision tree induction.
Post-pruning

In this approach, the decision tree is initially grown to its maximum size. This is followed by a tree-pruning step, which proceeds to trim the fully grown tree in a bottom-up fashion. Trimming can be done by replacing a subtree with (1) a new leaf node whose class label is determined from the majority class of instances affiliated with the subtree (an approach known as subtree replacement), or (2) the most frequently used branch of the subtree (an approach known as subtree raising). The tree-pruning step terminates when no further improvement in the generalization error estimate is observed beyond a certain threshold. Again, the estimates of the generalization error rate can be computed using any of the approaches presented in the previous three subsections. Post-pruning tends to give better results than prepruning because it makes pruning decisions based on a fully grown tree, unlike prepruning, which can suffer from premature termination of the tree-growing process. However, for post-pruning, the additional computations needed to grow the full tree may be wasted when the subtree is pruned.
Figure 3.32 illustrates the simplified decision tree model for the web robot detection example given in Section 3.3.5. Notice that the subtree rooted at depth = 1 has been replaced by one of its branches corresponding to breadth <= 7, width > 3, and MultiP = 1, using subtree raising. On the other hand, the subtree corresponding to depth > 1 and MultiAgent = 0 has been replaced by a leaf node assigned to class 0, using subtree replacement. The subtree for depth > 1 and MultiAgent = 1 remains intact.

Figure 3.32. Post-pruning of the decision tree for web robot detection.
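A single subtree-replacement decision can be sketched using the pessimistic error estimate of Section 3.5.2. The function below is an illustrative simplification (the subtree is summarized only by its error and leaf counts, and the numbers are made up); it is not the pruning procedure used to produce Figure 3.32.

# An illustrative sketch of one subtree-replacement decision, using the
# pessimistic error estimate err_gen = (errors + Omega * leaves) / N_train.
# All counts below are hypothetical.
def should_replace_with_leaf(subtree_errors, subtree_leaves,
                             leaf_errors, n_train, omega=0.5):
    """Return True if collapsing the subtree into a single majority-class
    leaf does not increase the pessimistic generalization error estimate."""
    err_subtree = (subtree_errors + omega * subtree_leaves) / n_train
    err_leaf = (leaf_errors + omega * 1) / n_train
    return err_leaf <= err_subtree  # ties go to the simpler model (Occam's razor)

# A 3-leaf subtree committing 2 training errors vs. a single leaf that would
# misclassify 3 of the instances reaching it, out of 100 training instances.
print(should_replace_with_leaf(subtree_errors=2, subtree_leaves=3,
                               leaf_errors=3, n_train=100))   # True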
3.6 Model Evaluation

The previous section discussed several approaches for model selection that can be used to learn a classification model from a training set D.train. Here we discuss methods for estimating its generalization performance, i.e., its performance on unseen instances outside of D.train. This process is known as model evaluation.

Note that the model selection approaches discussed in Section 3.5 also compute an estimate of the generalization performance using the training set D.train. However, these estimates are biased indicators of the performance on unseen instances, since they were used to guide the selection of the classification model. For example, if we use the validation error rate for model selection (as described in Section 3.5.1), the resulting model would be deliberately chosen to minimize the errors on the validation set. The validation error rate may thus under-estimate the true generalization error rate, and hence cannot be reliably used for model evaluation.

A correct approach for model evaluation would be to assess the performance of a learned model on a labeled test set that has not been used at any stage of model selection. This can be achieved by partitioning the entire set of labeled instances D into two disjoint subsets, D.train, which is used for model selection, and D.test, which is used for computing the test error rate, err_test. In the following, we present two different approaches for partitioning D into D.train and D.test, and computing the test error rate, err_test.
3.6.1 Holdout Method
The most basic technique for partitioning a labeled data set is the holdout method, where the labeled set D is randomly partitioned into two disjoint sets, called the training set D.train and the test set D.test. A classification model is then induced from D.train using the model selection approaches presented in Section 3.5, and its error rate on D.test, err_test, is used as an estimate of the generalization error rate. The proportion of data reserved for training and for testing is typically at the discretion of the analysts, e.g., two-thirds for training and one-third for testing.

Similar to the trade-off faced while partitioning D.train into D.tr and D.val in Section 3.5.1, choosing the right fraction of labeled data to be used for training and testing is non-trivial. If the size of D.train is small, the learned classification model may be improperly learned using an insufficient number of training instances, resulting in a biased estimation of generalization performance. On the other hand, if the size of D.test is small, err_test may be less reliable as it would be computed over a small number of test instances. Moreover, err_test can have a high variance as we change the random partitioning of D into D.train and D.test.

The holdout method can be repeated several times to obtain a distribution of the test error rates, an approach known as random subsampling or the repeated holdout method. This method produces a distribution of the error rates that can be used to understand the variance of err_test.
3.6.2 Cross-Validation

Cross-validation is a widely-used model evaluation method that aims to make effective use of all labeled instances in D for both training and testing. To illustrate this method, suppose that we are given a labeled set that we have randomly partitioned into three equal-sized subsets, S1, S2, and S3, as shown in Figure 3.33. For the first run, we train a model using subsets S2 and S3 (shown as empty blocks) and test the model on subset S1. The test error rate on S1, denoted as err(S1), is thus computed in the first run. Similarly, for the second run, we use S1 and S3 as the training set and S2 as the test set, to compute the test error rate, err(S2), on S2. Finally, we use S1 and S2 for training in the third run, while S3 is used for testing, thus resulting in the test error rate err(S3) for S3. The overall test error rate is obtained by summing up the number of errors committed in each test subset across all runs and dividing it by the total number of instances. This approach is called three-fold cross-validation.

Figure 3.33. Example demonstrating the technique of 3-fold cross-validation.
The k-fold cross-validation method generalizes this approach by segmenting the labeled data D (of size N) into k equal-sized partitions (or folds). During the i-th run, one of the partitions of D is chosen as D.test(i) for testing, while the rest of the partitions are used as D.train(i) for training. A model m(i) is learned using D.train(i) and applied on D.test(i) to obtain the sum of test errors, err_sum(i). This procedure is repeated k times. The total test error rate, err_test, is then computed as

err_test = (1/N) × Σ_{i=1}^{k} err_sum(i).   (3.14)

Every instance in the data is thus used for testing exactly once and for training exactly (k − 1) times. Also, every run uses a (k − 1)/k fraction of the data for training and a 1/k fraction for testing.

The right choice of k in k-fold cross-validation depends on a number of characteristics of the problem. A small value of k will result in a smaller training set at every run, which will result in a larger estimate of the generalization error rate than what is expected of a model trained over the entire labeled set. On the other hand, a high value of k results in a larger training set at every run, which reduces the bias in the estimate of the generalization error rate. In the extreme case, when k = N, every run uses exactly one data instance for testing and the remainder of the data for training. This special case of k-fold cross-validation is called the leave-one-out approach. This approach has the advantage of utilizing as much data as possible for training. However, leave-one-out can produce quite misleading results in some special scenarios, as illustrated in Exercise 11. Furthermore, leave-one-out can be computationally expensive for large data sets as the cross-validation procedure needs to be repeated N times. For most practical applications, a choice of k between 5 and 10 provides a reasonable approach for estimating the generalization error rate, because each fold is able to make use of 80% to 90% of the labeled data for training.

The k-fold cross-validation method, as described above, produces a single estimate of the generalization error rate, without providing any information about the variance of the estimate. To obtain this information, we can run k-fold cross-validation for every possible partitioning of the data into k partitions, and obtain a distribution of test error rates computed for every such partitioning. The average test error rate across all possible partitionings serves as a more robust estimate of the generalization error rate. This approach of estimating the generalization error rate and its variance is known as the complete cross-validation approach. Even though such an estimate is quite robust, it is usually too expensive to consider all possible partitionings of a large data set into k partitions. A more practical solution is to repeat the cross-validation approach multiple times, using a different random partitioning of the data into k partitions every time, and use the average test error rate as the estimate of the generalization error rate. Note that since there is only one possible partitioning for the leave-one-out approach, it is not possible to estimate the variance of the generalization error rate, which is another limitation of this method.
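A minimal sketch of the k-fold estimate of Equation 3.14 is shown below: the errors committed on every held-out fold are summed and divided by N. The data set, the model, and the choice of k are illustrative assumptions.

# A minimal sketch of k-fold cross-validation (Equation 3.14).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
k = 5
total_errors = 0
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    m_i = DecisionTreeClassifier(max_depth=3, random_state=0)
    m_i.fit(X[train_idx], y[train_idx])                  # learn m(i) on D.train(i)
    y_pred = m_i.predict(X[test_idx])                    # apply on D.test(i)
    total_errors += np.sum(y_pred != y[test_idx])        # err_sum(i)

err_test = total_errors / len(y)                         # Equation 3.14
print(f"{k}-fold cross-validation error rate: {err_test:.3f}")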
The k-fold cross-validation does not guarantee that the fraction of positive and negative instances in every partition of the data is equal to the fraction observed in the overall data. A simple solution to this problem is to perform a stratified sampling of the positive and negative instances into k partitions, an approach called stratified cross-validation.
In k-fold cross-validation, a different model is learned at every run, and the performance of every one of the k models on their respective test folds is then aggregated to compute the overall test error rate, err_test. Hence, err_test does not reflect the generalization error rate of any of the k models. Instead, it reflects the expected generalization error rate of the model selection approach, when applied on a training set of the same size as one of the training folds (N(k − 1)/k). This is different than the err_test computed in the holdout method, which exactly corresponds to the specific model learned over D.train. Hence, although effectively utilizing every data instance in D for training and testing, the err_test computed in the cross-validation method does not represent the performance of a single model learned over a specific D.train.

Nonetheless, in practice, err_test is typically used as an estimate of the generalization error of a model built on D. One motivation for this is that when the size of the training folds is closer to the size of the overall data (when k is large), then err_test resembles the expected performance of a model learned over a data set of the same size as D. For example, when k is 10, every training fold is 90% of the overall data. The err_test then should approach the expected performance of a model learned over 90% of the overall data, which will be close to the expected performance of a model learned over D.
3.7 Presence of Hyper-parameters

Hyper-parameters are parameters of learning algorithms that need to be determined before learning the classification model. For instance, consider the hyper-parameter α that appeared in Equation 3.11, which is repeated here for convenience:

gen.error(m) = train.error(m, D.train) + α × complexity(M).

This equation was used for estimating the generalization error for a model selection approach that used an explicit representation of model complexity. (See Section 3.5.2.)

For other examples of hyper-parameters, see Chapter 4.

Unlike regular model parameters, such as the test conditions in the internal nodes of a decision tree, hyper-parameters such as α do not appear in the final classification model that is used to classify unlabeled instances. However, the values of hyper-parameters need to be determined during model selection, a process known as hyper-parameter selection, and must be taken into account during model evaluation. Fortunately, both tasks can be effectively accomplished via slight modifications of the cross-validation approach described in the previous section.

3.7.1 Hyper-parameter Selection

In Section 3.5.2, a validation set was used to select α, and this approach is generally applicable for hyper-parameter selection. Let p be the hyper-parameter that needs to be selected from a finite range of values, P = {p1, p2, …, pn}. Partition D.train into D.tr and D.val. For every choice of hyper-parameter value pi, we can learn a model mi on D.tr, and apply this model on D.val to obtain the validation error rate err_val(pi). Let p* be the hyper-parameter value that provides the lowest validation error rate. We can then use the model m* corresponding to p* as the final choice of classification model.
The above approach, although useful, uses only a subset of the data, D.tr, for training and only a subset, D.val, for validation. The framework of cross-validation, presented in Section 3.6.2, addresses both of those issues, albeit in the context of model evaluation. Here we indicate how to use a cross-validation approach for hyper-parameter selection. To illustrate this approach, let us partition D.train into three folds as shown in Figure 3.34. At every run, one of the folds is used as D.val for validation, and the remaining two folds are used as D.tr for learning a model, for every choice of hyper-parameter value pi. The overall validation error rate corresponding to each pi is computed by summing the errors across all three folds. We then select the hyper-parameter value p* that provides the lowest validation error rate, and use it to learn a model m* on the entire training set D.train.

Figure 3.34. Example demonstrating the 3-fold cross-validation framework for hyper-parameter selection using D.train.
Algorithm 3.2 generalizes the above approach using a k-fold cross-validation framework for hyper-parameter selection. At the i-th run of cross-validation, the data in the i-th fold is used as D.val(i) for validation (Step 4), while the remainder of the data in D.train is used as D.tr(i) for training (Step 5). Then, for every choice of hyper-parameter value pi, a model is learned on D.tr(i) (Step 7), which is applied on D.val(i) to compute its validation error (Step 8). This is used to compute the validation error rate corresponding to models learned using pi over all the folds (Step 11). The hyper-parameter value p* that provides the lowest validation error rate (Step 12) is then used to learn the final model m* on the entire training set D.train (Step 13). Hence, at the end of this algorithm, we obtain the best choice of the hyper-parameter value as well as the final classification model (Step 14), both of which are obtained by making an effective use of every data instance in D.train.
Algorithm 3.2 Procedure model-select(k, P, D.train).
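The following sketch mirrors the procedure described above, under the assumption that the hyper-parameter is the maximum depth of a scikit-learn decision tree and that the data are synthetic; it is an illustration of Algorithm 3.2 rather than a transcription of it.

# A sketch of model-select(k, P, D.train): k-fold cross-validation over
# D.train picks the hyper-parameter value p* with the lowest validation
# error rate, and the final model m* is refit on all of D.train.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def model_select(k, P, X_train, y_train):
    errors = {p: 0 for p in P}                       # validation errors per p
    for tr_idx, val_idx in KFold(n_splits=k, shuffle=True,
                                 random_state=0).split(X_train):
        for p in P:
            m = DecisionTreeClassifier(max_depth=p, random_state=0)
            m.fit(X_train[tr_idx], y_train[tr_idx])          # learn on D.tr(i) (Step 7)
            errors[p] += np.sum(m.predict(X_train[val_idx])  # errors on D.val(i) (Step 8)
                                != y_train[val_idx])
    p_star = min(P, key=lambda p: errors[p])         # lowest validation error rate (Step 12)
    m_star = DecisionTreeClassifier(max_depth=p_star, random_state=0)
    m_star.fit(X_train, y_train)                     # refit m* on all of D.train (Step 13)
    return p_star, m_star

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
p_star, m_star = model_select(k=3, P=[1, 2, 3, 5, 8], X_train=X, y_train=y)
print("selected hyper-parameter value p* =", p_star)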
3.7.2 Nested Cross-Validation
The approach of the previous section provides a way to effectively use all the instances in D.train to learn a classification model when hyper-parameter selection is required. This approach can be applied over the entire data set D to learn the final classification model. However, applying Algorithm 3.2 on D would only return the final classification model m* but not an estimate of its generalization performance, err_test. Recall that the validation error rates used in Algorithm 3.2 cannot be used as estimates of generalization performance, since they are used to guide the selection of the final model m*. However, to compute err_test, we can again use a cross-validation framework for evaluating the performance on the entire data set D, as described originally in Section 3.6.2. In this approach, D is partitioned into D.train (for training) and D.test (for testing) at every run of cross-validation. When hyper-parameters are involved, we can use Algorithm 3.2 to train a model using D.train at every run, thus "internally" using cross-validation for model selection. This approach is called nested cross-validation or double cross-validation. Algorithm 3.3 describes the complete approach for estimating err_test using nested cross-validation in the presence of hyper-parameters.

As an illustration of this approach, see Figure 3.35, where the labeled set D is partitioned into D.train and D.test using a 3-fold cross-validation method.

Figure 3.35. Example demonstrating 3-fold nested cross-validation for computing err_test.
At the i-th run of this method, one of the folds is used as the test set, D.test(i), while the remaining two folds are used as the training set, D.train(i). This is represented in Figure 3.35 as the i-th "outer" run. In order to select a model using D.train(i), we again use an "inner" 3-fold cross-validation framework that partitions D.train(i) into D.tr and D.val at every one of the three inner runs (iterations). As described in Section 3.7, we can use the inner cross-validation framework to select the best hyper-parameter value p*(i) as well as its resulting classification model m*(i) learned over D.train(i). We can then apply m*(i) on D.test(i) to obtain the test error at the i-th outer run. By repeating this process for every outer run, we can compute the average test error rate, err_test, over the entire labeled set D. Note that in the above approach, the inner cross-validation framework is being used for model selection while the outer cross-validation framework is being used for model evaluation.

Algorithm 3.3 The nested cross-validation approach for computing err_test.
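Nested cross-validation can be sketched compactly by letting scikit-learn's GridSearchCV play the role of the inner model-selection loop (Algorithm 3.2), while an explicit outer loop computes err_test. The data set and hyper-parameter grid below are illustrative assumptions.

# A sketch of nested cross-validation (Algorithm 3.3). The outer loop holds
# out D.test(i) purely for evaluation; the inner cross-validation selects the
# hyper-parameter p*(i) and refits m*(i) on D.train(i) alone.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=2)
param_grid = {"max_depth": [1, 2, 3, 5, 8]}

total_errors = 0
outer = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in outer.split(X):                     # i-th outer run
    inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                         param_grid, cv=3)                     # inner model selection
    inner.fit(X[train_idx], y[train_idx])                      # yields m*(i)
    y_pred = inner.predict(X[test_idx])                        # evaluate on D.test(i)
    total_errors += np.sum(y_pred != y[test_idx])

err_test = total_errors / len(y)
print(f"nested cross-validation estimate err_test = {err_test:.3f}")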
3.8 Pitfalls of Model Selection and Evaluation

Model selection and evaluation, when used effectively, serve as excellent tools for learning classification models and assessing their generalization performance. However, when using them in practical settings, there are several pitfalls that can result in improper and often misleading conclusions. Some of these pitfalls are simple to understand and easy to avoid, while others are quite subtle in nature and difficult to catch. In the following, we present two of these pitfalls and discuss best practices to avoid them.
3.8.1 Overlap between Training and Test Sets

One of the basic requirements of a clean model selection and evaluation setup is that the data used for model selection (D.train) must be kept separate from the data used for model evaluation (D.test). If there is any overlap between the two, the test error rate err_test computed over D.test cannot be considered representative of the performance on unseen instances. Comparing the effectiveness of classification models using err_test can then be quite misleading, as an overly complex model can show an inaccurately low value of err_test due to model overfitting (see Exercise 12 at the end of this chapter).
To illustrate the importance of ensuring no overlap between D.train and D.test, consider a labeled data set where all the attributes are irrelevant, i.e., they have no relationship with the class labels. Using such attributes, we should expect no classification model to perform better than random guessing. However, if the test set involves even a small number of data instances that were used for training, there is a possibility for an overly complex model to show better performance than random, even though the attributes are completely irrelevant. As we will see later in Chapter 10, this scenario can actually be used as a criterion to detect overfitting due to an improper experimental setup. If a model shows better performance than a random classifier even when the attributes are irrelevant, it is an indication of a potential feedback between the training and test sets.
3.8.2 Use of Validation Error as Generalization Error

The validation error rate err_val serves an important role during model selection, as it provides "out-of-sample" error estimates of models on D.val, which is not used for training the models. Hence, err_val serves as a better metric than the training error rate for selecting models and hyper-parameter values, as described in Sections 3.5.1 and 3.7, respectively. However, once the validation set has been used for selecting a classification model m*, err_val no longer reflects the performance of m* on unseen instances.

To realize the pitfall in using the validation error rate as an estimate of generalization performance, consider the problem of selecting a hyper-parameter value p from a range of values P using a validation set D.val. If the number of possible values in P is quite large and the size of D.val is small, it is possible to select a hyper-parameter value p* that shows favorable performance on D.val just by random chance. Notice the similarity of this problem with the multiple comparisons problem discussed in Section 3.4.1. Even though the classification model m* learned using p* would show a low validation error rate, it would lack generalizability on unseen test instances.

The correct approach for estimating the generalization error rate of a model m* is to use an independently chosen test set D.test that hasn't been used in any way to influence the selection of m*. As a rule of thumb, the test set should never be examined during model selection, to ensure the absence of any form of overfitting. If the insights gained from any portion of a labeled data set help in improving the classification model even in an indirect way, then that portion of data must be discarded during testing.
3.9 Model Comparison

One difficulty when comparing the performance of different classification models is whether the observed difference in their performance is statistically significant. For example, consider a pair of classification models, M_A and M_B. Suppose M_A achieves 85% accuracy when evaluated on a test set containing 30 instances, while M_B achieves 75% accuracy on a different test set containing 5000 instances. Based on this information, is M_A a better model than M_B? This example raises two key questions regarding the statistical significance of a performance metric:

1. Although M_A has a higher accuracy than M_B, it was tested on a smaller test set. How much confidence do we have that the accuracy for M_A is actually 85%?

2. Is it possible to explain the difference in accuracies between M_A and M_B as a result of variations in the composition of their test sets?

The first question relates to the issue of estimating the confidence interval of model accuracy. The second question relates to the issue of testing the statistical significance of the observed deviation. These issues are investigated in the remainder of this section.

3.9.1 Estimating the Confidence Interval for Accuracy
To determine its confidence interval, we need to establish the probability distribution for sample accuracy. This section describes an approach for deriving the confidence interval by modeling the classification task as a binomial random experiment. The following describes the characteristics of such an experiment:

1. The random experiment consists of N independent trials, where each trial has two possible outcomes: success or failure.

2. The probability of success, p, in each trial is constant.
An example of a binomial experiment is counting the number of heads that turn up when a coin is flipped N times. If X is the number of successes observed in N trials, then the probability that X takes a particular value is given by a binomial distribution with mean Np and variance Np(1 − p):

P(X = v) = (N choose v) p^v (1 − p)^(N−v).

For example, if the coin is fair (p = 0.5) and is flipped fifty times, then the probability that the head shows up 20 times is

P(X = 20) = (50 choose 20) 0.5^20 (1 − 0.5)^30 = 0.0419.

If the experiment is repeated many times, then the average number of heads expected to show up is 50 × 0.5 = 25, while its variance is 50 × 0.5 × 0.5 = 12.5.

The task of predicting the class labels of test instances can also be considered as a binomial experiment. Given a test set that contains N instances, let X be the number of instances correctly predicted by a model and p be the true accuracy of the model. If the prediction task is modeled as a binomial experiment, then X has a binomial distribution with mean Np and variance Np(1 − p). It can be shown that the empirical accuracy, acc = X/N, also has a binomial distribution with mean p and variance p(1 − p)/N (see Exercise 14). The binomial distribution can be approximated by a normal distribution when N is sufficiently large. Based on the normal distribution, the confidence interval for acc can be derived as follows:

P( −Z_{α/2} ≤ (acc − p) / √(p(1 − p)/N) ≤ Z_{1−α/2} ) = 1 − α,   (3.15)

where Z_{α/2} and Z_{1−α/2} are the upper and lower bounds obtained from a standard normal distribution at confidence level (1 − α). Since a standard normal distribution is symmetric around Z = 0, it follows that Z_{α/2} = Z_{1−α/2}. Rearranging this inequality leads to the following confidence interval for p:

( 2 × N × acc + Z_{α/2}² ± Z_{α/2} √( Z_{α/2}² + 4 N acc − 4 N acc² ) ) / ( 2 (N + Z_{α/2}²) ).   (3.16)

The following table shows the values of Z_{α/2} at different confidence levels:

1 − α:    0.99  0.98  0.95  0.9   0.8   0.7   0.5
Z_{α/2}:  2.58  2.33  1.96  1.65  1.28  1.04  0.67
3.11. Example: Confidence Interval for Accuracy. Consider a model that has an accuracy of 80% when evaluated on 100 test instances. What is the confidence interval for its true accuracy at a 95% confidence level? The confidence level of 95% corresponds to Z_{α/2} = 1.96 according to the table given above. Inserting this term into Equation 3.16 yields a confidence interval between 71.1% and 86.7%. The following table shows the confidence interval when the number of instances, N, increases:

N                    20             50             100            500            1000           5000
Confidence Interval  0.584 - 0.919  0.670 - 0.888  0.711 - 0.867  0.763 - 0.833  0.774 - 0.824  0.789 - 0.811

Note that the confidence interval becomes tighter when N increases.
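Equation 3.16 is easy to evaluate programmatically; the sketch below reproduces Example 3.11 and the table above for acc = 0.8 at a 95% confidence level.

# A sketch of the accuracy confidence interval of Equation 3.16.
import math

def accuracy_confidence_interval(acc, n, z):
    center = 2 * n * acc + z**2
    spread = z * math.sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom = 2 * (n + z**2)
    return (center - spread) / denom, (center + spread) / denom

z = 1.96  # Z_{alpha/2} at 95% confidence
for n in (20, 50, 100, 500, 1000, 5000):
    low, high = accuracy_confidence_interval(0.8, n, z)
    print(f"N={n:5d}: ({low:.3f}, {high:.3f})")
# N=100 gives roughly (0.711, 0.867), matching the table above.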
3.9.2 Comparing the Performance of Two Models

Consider a pair of models, M1 and M2, which are evaluated on two independent test sets, D1 and D2. Let n1 denote the number of instances in D1 and n2 denote the number of instances in D2. In addition, suppose the error rate for M1 on D1 is e1 and the error rate for M2 on D2 is e2. Our goal is to test whether the observed difference between e1 and e2 is statistically significant.

Assuming that n1 and n2 are sufficiently large, the error rates e1 and e2 can be approximated using normal distributions. If the observed difference in the error rate is denoted as d = e1 − e2, then d is also normally distributed with mean d_t, its true difference, and variance σ_d². The variance of d can be computed as follows:

σ_d² ≈ σ̂_d² = e1(1 − e1)/n1 + e2(1 − e2)/n2,   (3.17)

where e1(1 − e1)/n1 and e2(1 − e2)/n2 are the variances of the error rates. Finally, at the (1 − α)% confidence level, it can be shown that the confidence interval for the true difference d_t is given by the following equation:

d_t = d ± z_{α/2} σ̂_d.   (3.18)
3.12. Example: Significance Testing. Consider the problem described at the beginning of this section. Model M_A has an error rate of e1 = 0.15 when applied to N1 = 30 test instances, while model M_B has an error rate of e2 = 0.25 when applied to N2 = 5000 test instances. The observed difference in their error rates is d = |0.15 − 0.25| = 0.1. In this example, we are performing a two-sided test to check whether d_t = 0 or d_t ≠ 0. The estimated variance of the observed difference in error rates can be computed as follows:

σ̂_d² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043,

or σ̂_d = 0.0655. Inserting this value into Equation 3.18, we obtain the following confidence interval for d_t at the 95% confidence level:

d_t = 0.1 ± 1.96 × 0.0655 = 0.1 ± 0.128.

As the interval spans the value zero, we can conclude that the observed difference is not statistically significant at a 95% confidence level.

At what confidence level can we reject the hypothesis that d_t = 0? To do this, we need to determine the value of Z_{α/2} such that the confidence interval for d_t does not span the value zero. We can reverse the preceding computation and look for the value Z_{α/2} such that d > Z_{α/2} σ̂_d. Replacing the values of d and σ̂_d gives Z_{α/2} < 1.527. This value first occurs when (1 − α) is approximately 0.936 or lower (for a two-sided test). The result suggests that the null hypothesis can be rejected at a confidence level of 93.6% or lower.
3.10BibliographicNotesEarlyclassificationsystemsweredevelopedtoorganizevariouscollectionsofobjects,fromlivingorganismstoinanimateones.Examplesabound,fromAristotle'scataloguingofspeciestotheDeweyDecimalandLibraryofCongressclassificationsystemsforbooks.Suchatasktypicallyrequiresconsiderablehumanefforts,bothtoidentifypropertiesoftheobjectstobeclassifiedandtoorganizethemintowelldistinguishedcategories.
Withthedevelopmentofstatisticsandcomputing,automatedclassificationhasbeenasubjectofintensiveresearch.Thestudyofclassificationinclassicalstatisticsissometimesknownasdiscriminantanalysis,wheretheobjectiveistopredictthegroupmembershipofanobjectbasedonitscorrespondingfeatures.Awell-knownclassicalmethodisFisher'slineardiscriminantanalysis[142],whichseekstofindalinearprojectionofthedatathatproducesthebestseparationbetweenobjectsfromdifferentclasses.
Manypatternrecognitionproblemsalsorequirethediscriminationofobjectsfromdifferentclasses.Examplesincludespeechrecognition,handwrittencharacteridentification,andimageclassification.ReaderswhoareinterestedintheapplicationofclassificationtechniquesforpatternrecognitionmayrefertothesurveyarticlesbyJainetal.[150]andKulkarnietal.[157]orclassicpatternrecognitionbooksbyBishop[125],Dudaetal.[137],andFukunaga[143].Thesubjectofclassificationisalsoamajorresearchtopicinneuralnetworks,statisticallearning,andmachinelearning.Anin-depthtreatmentonthetopicofclassificationfromthestatisticalandmachinelearningperspectivescanbefoundinthebooksbyBishop[126],CherkasskyandMulier[132],Hastieetal.[148],Michieetal.[162],Murphy[167],andMitchell[165].Recentyearshavealsoseenthereleaseofmanypubliclyavailable
softwarepackagesforclassification,whichcanbeembeddedinprogramminglanguagessuchasJava(Weka[147])andPython(scikit-learn[174]).
An overview of decision tree induction algorithms can be found in the survey articles by Buntine [129], Moret [166], Murthy [168], and Safavian et al. [179]. Examples of some well-known decision tree algorithms include CART [127], ID3 [175], C4.5 [177], and CHAID [153]. Both ID3 and C4.5 employ the entropy measure as their splitting function. An in-depth discussion of the C4.5 decision tree algorithm is given by Quinlan [177]. The CART algorithm was developed by Breiman et al. [127] and uses the Gini index as its splitting function. CHAID [153] uses the statistical χ2 test to determine the best split during the tree-growing process.
Thedecisiontreealgorithmpresentedinthischapterassumesthatthesplittingconditionateachinternalnodecontainsonlyoneattribute.Anobliquedecisiontreecanusemultipleattributestoformtheattributetestconditioninasinglenode[149,187].Breimanetal.[127]provideanoptionforusinglinearcombinationsofattributesintheirCARTimplementation.OtherapproachesforinducingobliquedecisiontreeswereproposedbyHeathetal.[149],Murthyetal.[169],Cantú-PazandKamath[130],andUtgoffandBrodley[187].Althoughanobliquedecisiontreehelpstoimprovetheexpressivenessofthemodelrepresentation,thetreeinductionprocessbecomescomputationallychallenging.Anotherwaytoimprovetheexpressivenessofadecisiontreewithoutusingobliquedecisiontreesistoapplyamethodknownasconstructiveinduction[161].Thismethodsimplifiesthetaskoflearningcomplexsplittingfunctionsbycreatingcompoundfeaturesfromtheoriginaldata.
Besidesthetop-downapproach,otherstrategiesforgrowingadecisiontreeincludethebottom-upapproachbyLandeweerdetal.[159]andPattipatiandAlexandridis[173],aswellasthebidirectionalapproachbyKimand
Landgrebe[154].SchuermannandDoster[181]andWangandSuen[193]proposedusingasoftsplittingcriteriontoaddressthedatafragmentationproblem.Inthisapproach,eachinstanceisassignedtodifferentbranchesofthedecisiontreewithdifferentprobabilities.
Modeloverfittingisanimportantissuethatmustbeaddressedtoensurethatadecisiontreeclassifierperformsequallywellonpreviouslyunlabeleddatainstances.ThemodeloverfittingproblemhasbeeninvestigatedbymanyauthorsincludingBreimanetal.[127],Schaffer[180],Mingers[164],andJensenandCohen[151].Whilethepresenceofnoiseisoftenregardedasoneoftheprimaryreasonsforoverfitting[164,170],JensenandCohen[151]viewedoverfittingasanartifactoffailuretocompensateforthemultiplecomparisonsproblem.
Bishop[126]andHastieetal.[148]provideanexcellentdiscussionofmodeloverfitting,relatingittoawell-knownframeworkoftheoreticalanalysis,knownasbias-variancedecomposition[146].Inthisframework,thepredictionofalearningalgorithmisconsideredtobeafunctionofthetrainingset,whichvariesasthetrainingsetischanged.Thegeneralizationerrorofamodelisthendescribedintermsofitsbias(theerroroftheaveragepredictionobtainedusingdifferenttrainingsets),itsvariance(howdifferentarethepredictionsobtainedusingdifferenttrainingsets),andnoise(theirreducibleerrorinherenttotheproblem).Anunderfitmodelisconsideredtohavehighbiasbutlowvariance,whileanoverfitmodelisconsideredtohavelowbiasbuthighvariance.Althoughthebias-variancedecompositionwasoriginallyproposedforregressionproblems(wherethetargetattributeisacontinuousvariable),aunifiedanalysisthatisapplicableforclassificationhasbeenproposedbyDomingos[136].ThebiasvariancedecompositionwillbediscussedinmoredetailwhileintroducingensemblelearningmethodsinChapter4 .
Variouslearningprinciples,suchastheProbablyApproximatelyCorrect(PAC)learningframework[188],havebeendevelopedtoprovideatheoreticalframeworkforexplainingthegeneralizationperformanceoflearningalgorithms.Inthefieldofstatistics,anumberofperformanceestimationmethodshavebeenproposedthatmakeatrade-offbetweenthegoodnessoffitofamodelandthemodelcomplexity.MostnoteworthyamongthemaretheAkaike'sInformationCriterion[120]andtheBayesianInformationCriterion[182].Theybothapplycorrectivetermstothetrainingerrorrateofamodel,soastopenalizemorecomplexmodels.Anotherwidely-usedapproachformeasuringthecomplexityofanygeneralmodelistheVapnikChervonenkis(VC)Dimension[190].TheVCdimensionofaclassoffunctionsCisdefinedasthemaximumnumberofpointsthatcanbeshattered(everypointcanbedistinguishedfromtherest)byfunctionsbelongingtoC,foranypossibleconfigurationofpoints.TheVCdimensionlaysthefoundationofthestructuralriskminimizationprinciple[189],whichisextensivelyusedinmanylearningalgorithms,e.g.,supportvectormachines,whichwillbediscussedindetailinChapter4 .
TheOccam'srazorprincipleisoftenattributedtothephilosopherWilliamofOccam.Domingos[135]cautionedagainstthepitfallofmisinterpretingOccam'srazorascomparingmodelswithsimilartrainingerrors,insteadofgeneralizationerrors.Asurveyondecisiontree-pruningmethodstoavoidoverfittingisgivenbyBreslowandAha[128]andEspositoetal.[141].Someofthetypicalpruningmethodsincludereducederrorpruning[176],pessimisticerrorpruning[176],minimumerrorpruning[171],criticalvaluepruning[163],cost-complexitypruning[127],anderror-basedpruning[177].QuinlanandRivestproposedusingtheminimumdescriptionlengthprinciplefordecisiontreepruningin[178].
Thediscussionsinthischapteronthesignificanceofcross-validationerrorestimatesisinspiredfromChapter7 inHastieetal.[148].Itisalsoan
excellentresourceforunderstanding“therightandwrongwaystodocross-validation”,whichissimilartothediscussiononpitfallsinSection3.8 ofthischapter.Acomprehensivediscussionofsomeofthecommonpitfallsinusingcross-validationformodelselectionandevaluationisprovidedinKrstajicetal.[156].
Theoriginalcross-validationmethodwasproposedindependentlybyAllen[121],Stone[184],andGeisser[145]formodelassessment(evaluation).Eventhoughcross-validationcanbeusedformodelselection[194],itsusageformodelselectionisquitedifferentthanwhenitisusedformodelevaluation,asemphasizedbyStone[184].Overtheyears,thedistinctionbetweenthetwousageshasoftenbeenignored,resultinginincorrectfindings.Oneofthecommonmistakeswhileusingcross-validationistoperformpre-processingoperations(e.g.,hyper-parametertuningorfeatureselection)usingtheentiredatasetandnot“within”thetrainingfoldofeverycross-validationrun.Ambroiseetal.,usinganumberofgeneexpressionstudiesasexamples,[124]provideanextensivediscussionoftheselectionbiasthatariseswhenfeatureselectionisperformedoutsidecross-validation.UsefulguidelinesforevaluatingmodelsonmicroarraydatahavealsobeenprovidedbyAllisonetal.[122].
Theuseofthecross-validationprotocolforhyper-parametertuninghasbeendescribedindetailbyDudoitandvanderLaan[138].Thisapproachhasbeencalled“grid-searchcross-validation.”Thecorrectapproachinusingcross-validationforbothhyper-parameterselectionandmodelevaluation,asdiscussedinSection3.7 ofthischapter,isextensivelycoveredbyVarmaandSimon[191].Thiscombinedapproachhasbeenreferredtoas“nestedcross-validation”or“doublecross-validation”intheexistingliterature.Recently,TibshiraniandTibshirani[185]haveproposedanewapproachforhyper-parameterselectionandmodelevaluation.Tsamardinosetal.[186]comparedthisapproachtonestedcross-validation.Theexperimentsthey
performedfoundthat,onaverage,bothapproachesprovideconservativeestimatesofmodelperformancewiththeTibshiraniandTibshiraniapproachbeingmorecomputationallyefficient.
Kohavi[155]hasperformedanextensiveempiricalstudytocomparetheperformancemetricsobtainedusingdifferentestimationmethodssuchasrandomsubsamplingandk-foldcross-validation.Theirresultssuggestthatthebestestimationmethodisten-fold,stratifiedcross-validation.
An alternative approach for model evaluation is the bootstrap method, which was presented by Efron in 1979 [139]. In this method, training instances are sampled with replacement from the labeled set, i.e., an instance previously selected to be part of the training set is equally likely to be drawn again. If the original data has N instances, it can be shown that, on average, a bootstrap sample of size N contains about 63.2% of the instances in the original data. Instances that are not included in the bootstrap sample become part of the test set. The bootstrap procedure for obtaining training and test sets is repeated b times, resulting in a different error rate on the test set, err(i), at the i-th run. To obtain the overall error rate, err_boot, the .632 bootstrap approach combines err(i) with the error rate obtained from a training set containing all the labeled examples, err_s, as follows:

err_boot = (1/b) Σ_{i=1}^{b} ( 0.632 × err(i) + 0.368 × err_s ).   (3.19)

Efron and Tibshirani [140] provided a theoretical and empirical comparison between cross-validation and a bootstrap method known as the .632+ rule.

While the .632 bootstrap method presented above provides a robust estimate of the generalization performance, with low variance in its estimate, it may produce misleading results for highly complex models in certain conditions, as demonstrated by Kohavi [155]. This is because the overall error rate is not truly an out-of-sample error estimate, as it depends on the training error rate, err_s, which can be quite small if there is overfitting.

Current techniques such as C4.5 require that the entire training data set fit into main memory. There has been considerable effort to develop parallel and scalable versions of decision tree induction algorithms. Some of the proposed algorithms include SLIQ by Mehta et al. [160], SPRINT by Shafer et al. [183], CMP by Wang and Zaniolo [192], CLOUDS by Alsabti et al. [123], RainForest by Gehrke et al. [144], and ScalParC by Joshi et al. [152]. A survey of parallel algorithms for classification and other data mining tasks is given in [158]. More recently, there has been extensive research to implement large-scale classifiers on the compute unified device architecture (CUDA) [131, 134] and MapReduce [133, 172] platforms.
Bibliography[120]H.Akaike.Informationtheoryandanextensionofthemaximum
likelihoodprinciple.InSelectedPapersofHirotuguAkaike,pages199–213.Springer,1998.
[121]D.M.Allen.Therelationshipbetweenvariableselectionanddataagumentationandamethodforprediction.Technometrics,16(1):125–127,1974.
[122]D.B.Allison,X.Cui,G.P.Page,andM.Sabripour.Microarraydataanalysis:fromdisarraytoconsolidationandconsensus.Naturereviewsgenetics,7(1):55–65,2006.
[123]K.Alsabti,S.Ranka,andV.Singh.CLOUDS:ADecisionTreeClassifierforLargeDatasets.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages2–8,NewYork,NY,August1998.
[124]C.AmbroiseandG.J.McLachlan.Selectionbiasingeneextractiononthebasisofmicroarraygene-expressiondata.Proceedingsofthenationalacademyofsciences,99(10):6562–6566,2002.
[125]C.M.Bishop.NeuralNetworksforPatternRecognition.OxfordUniversityPress,Oxford,U.K.,1995.
[126]C.M.Bishop.PatternRecognitionandMachineLearning.Springer,2006.
[127]L.Breiman,J.H.Friedman,R.Olshen,andC.J.Stone.ClassificationandRegressionTrees.Chapman&Hall,NewYork,1984.
[128]L.A.BreslowandD.W.Aha.SimplifyingDecisionTrees:ASurvey.KnowledgeEngineeringReview,12(1):1–40,1997.
[129]W.Buntine.Learningclassificationtrees.InArtificialIntelligenceFrontiersinStatistics,pages182–201.Chapman&Hall,London,1993.
[130]E.Cantú-PazandC.Kamath.Usingevolutionaryalgorithmstoinduceobliquedecisiontrees.InProc.oftheGeneticandEvolutionaryComputationConf.,pages1053–1060,SanFrancisco,CA,2000.
[131]B.Catanzaro,N.Sundaram,andK.Keutzer.Fastsupportvectormachinetrainingandclassificationongraphicsprocessors.InProceedingsofthe25thInternationalConferenceonMachineLearning,pages104–111,2008.
[132]V.CherkasskyandF.M.Mulier.LearningfromData:Concepts,Theory,andMethods.Wiley,2ndedition,2007.
[133]C.Chu,S.K.Kim,Y.-A.Lin,Y.Yu,G.Bradski,A.Y.Ng,andK.Olukotun.Map-reduceformachinelearningonmulticore.Advancesinneuralinformationprocessingsystems,19:281,2007.
[134]A.Cotter,N.Srebro,andJ.Keshet.AGPU-tailoredApproachforTrainingKernelizedSVMs.InProceedingsofthe17thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages805–813,SanDiego,California,USA,2011.
[135]P.Domingos.TheRoleofOccam'sRazorinKnowledgeDiscovery.DataMiningandKnowledgeDiscovery,3(4):409–425,1999.
[136]P.Domingos.Aunifiedbias-variancedecomposition.InProceedingsof17thInternationalConferenceonMachineLearning,pages231–238,2000.
[137]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,2ndedition,2001.
[138]S.DudoitandM.J.vanderLaan.Asymptoticsofcross-validatedriskestimationinestimatorselectionandperformanceassessment.StatisticalMethodology,2(2):131–154,2005.
[139]B.Efron.Bootstrapmethods:anotherlookatthejackknife.InBreakthroughsinStatistics,pages569–593.Springer,1992.
[140]B.EfronandR.Tibshirani.Cross-validationandtheBootstrap:EstimatingtheErrorRateofaPredictionRule.Technicalreport,StanfordUniversity,1995.
[141]F.Esposito,D.Malerba,andG.Semeraro.AComparativeAnalysisofMethodsforPruningDecisionTrees.IEEETrans.PatternAnalysisandMachineIntelligence,19(5):476–491,May1997.
[142]R.A.Fisher.Theuseofmultiplemeasurementsintaxonomicproblems.AnnalsofEugenics,7:179–188,1936.
[143]K.Fukunaga.IntroductiontoStatisticalPatternRecognition.AcademicPress,NewYork,1990.
[144]J.Gehrke,R.Ramakrishnan,andV.Ganti.RainForest—AFrameworkforFastDecisionTreeConstructionofLargeDatasets.DataMiningandKnowledgeDiscovery,4(2/3):127–162,2000.
[145]S.Geisser.Thepredictivesamplereusemethodwithapplications.JournaloftheAmericanStatisticalAssociation,70(350):320–328,1975.
[146]S.Geman,E.Bienenstock,andR.Doursat.Neuralnetworksandthebias/variancedilemma.Neuralcomputation,4(1):1–58,1992.
[147]M.Hall,E.Frank,G.Holmes,B.Pfahringer,P.Reutemann,andI.H.Witten.TheWEKADataMiningSoftware:AnUpdate.SIGKDDExplorations,11(1),2009.
[148]T.Hastie,R.Tibshirani,andJ.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,andPrediction.Springer,2ndedition,2009.
[149]D.Heath,S.Kasif,andS.Salzberg.InductionofObliqueDecisionTrees.InProc.ofthe13thIntl.JointConf.onArtificialIntelligence,pages1002–1007,Chambery,France,August1993.
[150]A.K.Jain,R.P.W.Duin,andJ.Mao.StatisticalPatternRecognition:AReview.IEEETran.Patt.Anal.andMach.Intellig.,22(1):4–37,2000.
[151]D.JensenandP.R.Cohen.MultipleComparisonsinInductionAlgorithms.MachineLearning,38(3):309–338,March2000.
[152]M.V.Joshi,G.Karypis,andV.Kumar.ScalParC:ANewScalableandEfficientParallelClassificationAlgorithmforMiningLargeDatasets.InProc.of12thIntl.ParallelProcessingSymp.(IPPS/SPDP),pages573–579,Orlando,FL,April1998.
[153]G.V.Kass.AnExploratoryTechniqueforInvestigatingLargeQuantitiesofCategoricalData.AppliedStatistics,29:119–127,1980.
[154]B.KimandD.Landgrebe.Hierarchicaldecisionclassifiersinhigh-dimensionalandlargeclassdata.IEEETrans.onGeoscienceandRemoteSensing,29(4):518–528,1991.
[155]R.Kohavi.AStudyonCross-ValidationandBootstrapforAccuracyEstimationandModelSelection.InProc.ofthe15thIntl.JointConf.onArtificialIntelligence,pages1137–1145,Montreal,Canada,August1995.
[156]D.Krstajic,L.J.Buturovic,D.E.Leahy,andS.Thomas.Cross-validationpitfallswhenselectingandassessingregressionandclassificationmodels.Journalofcheminformatics,6(1):1,2014.
[157]S.R.Kulkarni,G.Lugosi,andS.S.Venkatesh.LearningPatternClassification—ASurvey.IEEETran.Inf.Theory,44(6):2178–2206,1998.
[158]V.Kumar,M.V.Joshi,E.-H.Han,P.N.Tan,andM.Steinbach.HighPerformanceDataMining.InHighPerformanceComputingforComputationalScience(VECPAR2002),pages111–125.Springer,2002.
[159]G.Landeweerd,T.Timmers,E.Gersema,M.Bins,andM.Halic.Binarytreeversussingleleveltreeclassificationofwhitebloodcells.PatternRecognition,16:571–577,1983.
[160]M.Mehta,R.Agrawal,andJ.Rissanen.SLIQ:AFastScalableClassifierforDataMining.InProc.ofthe5thIntl.Conf.onExtendingDatabaseTechnology,pages18–32,Avignon,France,March1996.
[161]R.S.Michalski.Atheoryandmethodologyofinductivelearning.ArtificialIntelligence,20:111–116,1983.
[162]D.Michie,D.J.Spiegelhalter,andC.C.Taylor.MachineLearning,NeuralandStatisticalClassification.EllisHorwood,UpperSaddleRiver,NJ,1994.
[163]J.Mingers.ExpertSystems—RuleInductionwithStatisticalData.JOperationalResearchSociety,38:39–47,1987.
[164]J.Mingers.Anempiricalcomparisonofpruningmethodsfordecisiontreeinduction.MachineLearning,4:227–243,1989.
[165]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.
[166]B.M.E.Moret.DecisionTreesandDiagrams.ComputingSurveys,14(4):593–623,1982.
[167]K.P.Murphy.MachineLearning:AProbabilisticPerspective.MITPress,2012.
[168]S.K.Murthy.AutomaticConstructionofDecisionTreesfromData:AMulti-DisciplinarySurvey.DataMiningandKnowledgeDiscovery,2(4):345–389,1998.
[169]S.K.Murthy,S.Kasif,andS.Salzberg.Asystemforinductionofobliquedecisiontrees.JofArtificialIntelligenceResearch,2:1–33,1994.
[170]T.Niblett.Constructingdecisiontreesinnoisydomains.InProc.ofthe2ndEuropeanWorkingSessiononLearning,pages67–78,Bled,Yugoslavia,May1987.
[171]T.NiblettandI.Bratko.LearningDecisionRulesinNoisyDomains.InResearchandDevelopmentinExpertSystemsIII,Cambridge,1986.
CambridgeUniversityPress.
[172]I.PalitandC.K.Reddy.Scalableandparallelboostingwithmapreduce.IEEETransactionsonKnowledgeandDataEngineering,24(10):1904–1916,2012.
[173]K.R.PattipatiandM.G.Alexandridis.Applicationofheuristicsearchandinformationtheorytosequentialfaultdiagnosis.IEEETrans.onSystems,Man,andCybernetics,20(4):872–887,1990.
[174]F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel,P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau,M.Brucher,M.Perrot,andE.Duchesnay.Scikit-learn:MachineLearninginPython.JournalofMachineLearningResearch,12:2825–2830,2011.
[175]J.R.Quinlan.Discoveringrulesbyinductionfromlargecollectionofexamples.InD.Michie,editor,ExpertSystemsintheMicroElectronicAge.EdinburghUniversityPress,Edinburgh,UK,1979.
[176]J.R.Quinlan.SimplifyingDecisionTrees.Intl.J.Man-MachineStudies,27:221–234,1987.
[177]J.R.Quinlan.C4.5:ProgramsforMachineLearning.Morgan-KaufmannPublishers,SanMateo,CA,1993.
[178]J.R.QuinlanandR.L.Rivest.InferringDecisionTreesUsingtheMinimumDescriptionLengthPrinciple.InformationandComputation,80(3):227–248,1989.
[179]S.R.SafavianandD.Landgrebe.ASurveyofDecisionTreeClassifierMethodology.IEEETrans.Systems,ManandCybernetics,22:660–674,May/June1998.
[180] C. Schaffer. Overfitting avoidance as bias. Machine Learning, 10:153–178, 1993.
[181]J.SchuermannandW.Doster.Adecision-theoreticapproachinhierarchicalclassifierdesign.PatternRecognition,17:359–369,1984.
[182]G.Schwarzetal.Estimatingthedimensionofamodel.Theannalsofstatistics,6(2):461–464,1978.
[183]J.C.Shafer,R.Agrawal,andM.Mehta.SPRINT:AScalableParallelClassifierforDataMining.InProc.ofthe22ndVLDBConf.,pages544–555,Bombay,India,September1996.
[184]M.Stone.Cross-validatorychoiceandassessmentofstatisticalpredictions.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),pages111–147,1974.
[185]R.J.TibshiraniandR.Tibshirani.Abiascorrectionfortheminimumerrorrateincross-validation.TheAnnalsofAppliedStatistics,pages822–
829,2009.
[186]I.Tsamardinos,A.Rakhshani,andV.Lagani.Performance-estimationpropertiesofcross-validation-basedprotocolswithsimultaneoushyper-parameteroptimization.InHellenicConferenceonArtificialIntelligence,pages1–14.Springer,2014.
[187]P.E.UtgoffandC.E.Brodley.Anincrementalmethodforfindingmultivariatesplitsfordecisiontrees.InProc.ofthe7thIntl.Conf.onMachineLearning,pages58–65,Austin,TX,June1990.
[188]L.Valiant.Atheoryofthelearnable.CommunicationsoftheACM,27(11):1134–1142,1984.
[189]V.N.Vapnik.StatisticalLearningTheory.Wiley-Interscience,1998.
[190]V.N.VapnikandA.Y.Chervonenkis.Ontheuniformconvergenceofrelativefrequenciesofeventstotheirprobabilities.InMeasuresofComplexity,pages11–30.Springer,2015.
[191]S.VarmaandR.Simon.Biasinerrorestimationwhenusingcross-validationformodelselection.BMCbioinformatics,7(1):1,2006.
[192]H.WangandC.Zaniolo.CMP:AFastDecisionTreeClassifierUsingMultivariatePredictions.InProc.ofthe16thIntl.Conf.onDataEngineering,pages449–460,SanDiego,CA,March2000.
[193]Q.R.WangandC.Y.Suen.Largetreeclassifierwithheuristicsearchandglobaltraining.IEEETrans.onPatternAnalysisandMachineIntelligence,9(1):91–102,1987.
[194]Y.ZhangandY.Yang.Cross-validationforselectingamodelselectionprocedure.JournalofEconometrics,187(1):95–112,2015.
3.11 Exercises

1. Draw the full decision tree for the parity function of four Boolean attributes, A, B, C, and D. Is it possible to simplify the tree?
2.ConsiderthetrainingexamplesshowninTable3.5 forabinaryclassificationproblem.
Table3.5.DatasetforExercise2.
CustomerID Gender CarType ShirtSize Class
1 M Family Small C0
2 M Sports Medium C0
3 M Sports Medium C0
4 M Sports Large C0
5 M Sports ExtraLarge C0
6 M Sports ExtraLarge C0
7 F Sports Small C0
8 F Sports Small C0
9 F Sports Medium C0
10 F Luxury Large C0
11 M Family Large C1
12 M Family ExtraLarge C1
13 M Family Medium C1
14 M Luxury ExtraLarge C1
15 F Luxury Small C1
16 F Luxury Small C1
17 F Luxury Medium C1
18 F Luxury Medium C1
19 F Luxury Medium C1
20 F Luxury Large C1
a. ComputetheGiniindexfortheoverallcollectionoftrainingexamples.
b. Compute the Gini index for the Customer ID attribute.

c. Compute the Gini index for the Gender attribute.

d. Compute the Gini index for the Car Type attribute using multiway split.

e. Compute the Gini index for the Shirt Size attribute using multiway split.

f. Which attribute is better, Gender, Car Type, or Shirt Size?

g. Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
3.ConsiderthetrainingexamplesshowninTable3.6 forabinaryclassificationproblem.
Table3.6.DatasetforExercise3.
Instance  a1  a2  a3   Target Class
1         T   T   1.0  +
2         T   T   6.0  +
3         T   F   5.0  −
4         F   F   4.0  +
5         F   T   7.0  −
6         F   T   3.0  −
7         F   F   8.0  −
8         T   F   7.0  +
9         F   T   5.0  −

a. What is the entropy of this collection of training examples with respect to the class attribute?

b. What are the information gains of a1 and a2 relative to these training examples?

c. For a3, which is a continuous attribute, compute the information gain for every possible split.

d. What is the best split (among a1, a2, and a3) according to the information gain?

e. What is the best split (between a1 and a2) according to the misclassification error rate?

f. What is the best split (between a1 and a2) according to the Gini index?
4. Show that the entropy of a node never increases after splitting it into smaller successor nodes.
5. Consider the following data set for a binary class problem.
A | B | Class Label
T | F | +
T | T | +
T | T | +
T | F | −
T | T | +
F | F | −
F | F | −
F | F | −
T | T | −
T | F | −
a. Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
b. Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
c. Figure 3.11 shows that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and they are both monotonically decreasing on the range [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain.
6. Consider splitting a parent node P into two child nodes, C1 and C2, using some attribute test condition. The composition of labeled training instances at every node is summarized in the table below.
        | P | C1 | C2
Class 0 | 7 | 3  | 4
Class 1 | 3 | 0  | 3
a. Calculate the Gini index and misclassification error rate of the parent node P.
b. Calculate the weighted Gini index of the child nodes. Would you consider this attribute test condition if Gini is used as the impurity measure?
c. Calculate the weighted misclassification rate of the child nodes. Would you consider this attribute test condition if misclassification rate is used as the impurity measure?
7. Consider the following set of training examples.
X | Y | Z | No. of Class C1 Examples | No. of Class C2 Examples
0 | 0 | 0 | 5 | 40
0 | 0 | 1 | 0 | 15
0 | 1 | 0 | 10 | 5
0 | 1 | 1 | 45 | 0
1 | 0 | 0 | 10 | 5
1 | 0 | 1 | 25 | 0
1 | 1 | 0 | 5 | 20
1 | 1 | 1 | 0 | 15
a. Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?
b. Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate of the induced tree?
c. Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.
8. The following table summarizes a data set with three attributes A, B, C and two class labels +, −. Build a two-level decision tree.
A | B | C | Number of + Instances | Number of − Instances
T | T | T | 5 | 0
F | T | T | 0 | 20
T | F | T | 20 | 0
F | F | T | 0 | 5
T | T | F | 0 | 0
F | T | F | 25 | 0
T | F | F | 0 | 0
F | F | F | 0 | 25
a. According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gains in classification error rate.
b. Repeat for the two children of the root node.
c. How many instances are misclassified by the resulting decision tree?
d. Repeat parts (a), (b), and (c) using C as the splitting attribute.
e. Use the results in parts (c) and (d) to draw conclusions about the greedy nature of the decision tree induction algorithm.
9. Consider the decision tree shown in Figure 3.36.
Figure 3.36. Decision tree and data sets for Exercise 9.
a. Compute the generalization error rate of the tree using the optimistic approach.
b. Compute the generalization error rate of the tree using the pessimistic approach. (For simplicity, use the strategy of adding a factor of 0.5 to each leaf node.)
c. Compute the generalization error rate of the tree using the validation set shown above. This approach is known as reduced error pruning.
10. Consider the decision trees shown in Figure 3.37. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3.
Figure 3.37. Decision trees for Exercise 10.
Compute the total description length of each decision tree according to the following formulation of the minimum description length principle.
The total description length of a tree is given by Cost(tree, data) = Cost(tree) + Cost(data|tree).
Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is log2(m) bits.
Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is log2(k) bits.
Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the computation, you can assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.
Cost(data|tree) is encoded using the classification errors the tree commits on the training set. Each error is encoded by log2(n) bits, where n is the total number of training instances.
Which decision tree is better, according to the MDL principle?
11. This exercise, inspired by the discussions in [155], highlights one of the known limitations of the leave-one-out model evaluation procedure. Let us consider a data set containing 50 positive and 50 negative instances, where the attributes are purely random and contain no information about the class labels. Hence, the generalization error rate of any classification model learned over this data is expected to be 0.5. Let us consider a classifier that assigns the majority class label of training instances (ties resolved by using the positive label as the default class) to any test instance, irrespective of its attribute values. We can call this approach the majority inducer classifier. Determine the error rate of this classifier using the following methods.
a. Leave-one-out.
b. 2-fold stratified cross-validation, where the proportion of class labels at every fold is kept the same as that of the overall data.
c. From the results above, which method provides a more reliable evaluation of the classifier's generalization error rate?
12. Consider a labeled data set containing 100 data instances, which is randomly partitioned into two sets A and B, each containing 50 instances. We use A as the training set to learn two decision trees, T10 with 10 leaf nodes and T100 with 100 leaf nodes. The accuracies of the two decision trees on data sets A and B are shown in Table 3.7.
Table 3.7. Comparing the test accuracy of decision trees T10 and T100.
Data Set | Accuracy of T10 | Accuracy of T100
A | 0.86 | 0.97
B | 0.84 | 0.77
a. Based on the accuracies shown in Table 3.7, which classification model would you expect to have better performance on unseen instances?
b. Now, you tested T10 and T100 on the entire data set (A + B) and found that the classification accuracy of T10 on data set (A + B) is 0.85, whereas the classification accuracy of T100 on the data set (A + B) is 0.87. Based on this new information and your observations from Table 3.7, which classification model would you finally choose for classification?
13. Consider the following approach for testing whether a classifier A beats another classifier B. Let N be the size of a given data set, pA be the accuracy of classifier A, pB be the accuracy of classifier B, and p = (pA + pB)/2 be the average accuracy for both classifiers. To test whether classifier A is significantly better than B, the following Z-statistic is used:

Z = \frac{p_A - p_B}{\sqrt{2p(1-p)/N}}.

Classifier A is assumed to be better than classifier B if Z > 1.96.
Table 3.8 compares the accuracies of three different classifiers, decision tree classifiers, naïve Bayes classifiers, and support vector machines, on various data sets. (The latter two classifiers are described in Chapter 4.)
Summarize the performance of the classifiers given in Table 3.8 using the following 3 × 3 table:
win-loss-draw | Decision tree | Naïve Bayes | Support vector machine
Decision tree  0-0-23
Naïve Bayes  0-0-23
Support vector machine  0-0-23
Table 3.8. Comparing the accuracy of various classification methods.
Data Set  Size (N)  Decision Tree (%)  naïve Bayes (%)  Support vector machine (%)
Anneal 898 92.09 79.62 87.19
Australia 690 85.51 76.81 84.78
Auto 205 81.95 58.05 70.73
Breast 699 95.14 95.99 96.42
Cleve 303 76.24 83.50 84.49
Credit 690 85.80 77.54 85.07
Diabetes 768 72.40 75.91 76.82
German 1000 70.90 74.70 74.40
Glass 214 67.29 48.59 59.81
Heart 270 80.00 84.07 83.70
Hepatitis 155 81.94 83.23 87.10
Horse 368 85.33 78.80 82.61
Ionosphere 351 89.17 82.34 88.89
Iris 150 94.67 95.33 96.00
Labor 57 78.95 94.74 92.98
Led7 3200 73.34 73.16 73.56
Lymphography 148 77.03 83.11 86.49
Pima 768 74.35 76.04 76.95
Sonar 208 78.85 69.71 76.92
Tic-tac-toe 958 83.72 70.04 98.33
Vehicle 846 71.04 45.04 74.94
Wine 178 94.38 96.63 98.88
Zoo 101 93.07 93.07 96.04
Each cell in the table contains the number of wins, losses, and draws when comparing the classifier in a given row to the classifier in a given column.
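To make the Z-statistic above concrete, the following Python sketch compares two classifiers on a single data set; the helper name is ours and the accuracy values are taken from the Anneal row of Table 3.8.

import math

def z_statistic(p_a, p_b, n):
    # Z-statistic for comparing the accuracies of two classifiers
    # on the same data set of size n, as defined in this exercise.
    p = (p_a + p_b) / 2.0                      # average accuracy
    return (p_a - p_b) / math.sqrt(2 * p * (1 - p) / n)

# Decision tree vs. naive Bayes on the Anneal data set (N = 898)
z = z_statistic(0.9209, 0.7962, 898)
print(round(z, 2), "-> A beats B" if z > 1.96 else "-> draw")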
14. Let X be a binomial random variable with mean Np and variance Np(1 − p). Show that the ratio X/N also has a binomial distribution with mean p and variance p(1 − p)/N.
4 Classification: Alternative Techniques
The previous chapter introduced the classification problem and presented a technique known as the decision tree classifier. Issues such as model overfitting and model evaluation were also discussed. This chapter presents alternative techniques for building classification models, from simple techniques such as rule-based and nearest neighbor classifiers to more sophisticated techniques such as artificial neural networks, deep learning, support vector machines, and ensemble methods. Other practical issues such as the class imbalance and multiclass problems are also discussed at the end of the chapter.
4.1 Types of Classifiers
Before presenting specific techniques, we first categorize the different types of classifiers available. One way to distinguish classifiers is by considering the characteristics of their output.
Binary versus Multiclass
Binary classifiers assign each data instance to one of two possible labels, typically denoted as +1 and −1. The positive class usually refers to the category we are more interested in predicting correctly compared to the negative class (e.g., the spam category in email classification problems). If there are more than two possible labels available, then the technique is known as a multiclass classifier. As some classifiers were designed for binary classes only, they must be adapted to deal with multiclass problems. Techniques for transforming binary classifiers into multiclass classifiers are described in Section 4.12.
Deterministic versus Probabilistic
A deterministic classifier produces a discrete-valued label for each data instance it classifies, whereas a probabilistic classifier assigns a continuous score between 0 and 1 to indicate how likely it is that an instance belongs to a particular class, where the probability scores for all the classes sum up to 1. Some examples of probabilistic classifiers include the naïve Bayes classifier, Bayesian networks, and logistic regression. Probabilistic classifiers provide additional information about the confidence in assigning an instance to a class compared to deterministic classifiers. A data instance is typically assigned to the class with the highest probability score, except when the cost of misclassifying the class with lower probability is significantly higher. We will discuss the topic of cost-sensitive classification with probabilistic outputs in Section 4.11.2.
Another way to distinguish the different types of classifiers is based on their technique for discriminating instances from different classes.
Linear versus Nonlinear
A linear classifier uses a linear separating hyperplane to discriminate instances from different classes, whereas a nonlinear classifier enables the construction of more complex, nonlinear decision surfaces. We illustrate an example of a linear classifier (perceptron) and its nonlinear counterpart (multi-layer neural network) in Section 4.7. Although the linearity assumption makes the model less flexible in terms of fitting complex data, linear classifiers are less susceptible to model overfitting compared to nonlinear classifiers. Furthermore, one can transform the original set of attributes, x = (x1, x2, ..., xd), into a more complex feature set, e.g., Φ(x) = (x1, x2, x1x2, x1², x2², ...), before applying the linear classifier. Such feature transformation allows the linear classifier to fit data sets with nonlinear decision surfaces (see Section 4.9.4).
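To make the feature-transformation idea concrete, here is a minimal Python sketch (not from the text) of the quadratic mapping Φ for two attributes; a linear classifier trained on Φ(x) can then represent a quadratic decision surface in the original attribute space.

def phi(x1, x2):
    # Map a two-attribute instance to the expanded feature set
    # (x1, x2, x1*x2, x1^2, x2^2) before applying a linear classifier.
    return (x1, x2, x1 * x2, x1 ** 2, x2 ** 2)

print(phi(1.5, -2.0))   # (1.5, -2.0, -3.0, 2.25, 4.0)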
Global versus Local
A global classifier fits a single model to the entire data set. Unless the model is highly nonlinear, this one-size-fits-all strategy may not be effective when the relationship between the attributes and the class labels varies over the input space. In contrast, a local classifier partitions the input space into smaller regions and fits a distinct model to training instances in each region. The k-nearest neighbor classifier (see Section 4.3) is a classic example of local classifiers. While local classifiers are more flexible in terms of fitting complex decision boundaries, they are also more susceptible to the model overfitting problem, especially when the local regions contain few training examples.
Generative versus Discriminative
Given a data instance x, the primary objective of any classifier is to predict the class label, y, of the data instance. However, apart from predicting the class label, we may also be interested in describing the underlying mechanism that generates the instances belonging to every class label. For example, in the process of classifying spam email messages, it may be useful to understand the typical characteristics of email messages that are labeled as spam, e.g., specific usage of keywords in the subject or the body of the email. Classifiers that learn a generative model of every class in the process of predicting class labels are known as generative classifiers. Some examples of generative classifiers include the naïve Bayes classifier and Bayesian networks. In contrast, discriminative classifiers directly predict the class labels without explicitly describing the distribution of every class label. They solve a simpler problem than generative models since they do not have the onus of deriving insights about the generative mechanism of data instances. They are thus sometimes preferred over generative models, especially when it is not crucial to obtain information about the properties of every class. Some examples of discriminative classifiers include decision trees, rule-based classifiers, nearest neighbor classifiers, artificial neural networks, and support vector machines.
4.2 Rule-Based Classifier
A rule-based classifier uses a collection of "if ... then ..." rules (also known as a rule set) to classify data instances. Table 4.1 shows an example of a rule set generated for the vertebrate classification problem described in the previous chapter. Each classification rule in the rule set can be expressed in the following way:

r_i: (\text{Condition}_i) \rightarrow y_i.   (4.1)

The left-hand side of the rule is called the rule antecedent or precondition. It contains a conjunction of attribute test conditions:

\text{Condition}_i = (A_1 \; op \; v_1) \wedge (A_2 \; op \; v_2) \wedge \ldots \wedge (A_k \; op \; v_k),   (4.2)

where (A_j, v_j) is an attribute-value pair and op is a comparison operator chosen from the set {=, ≠, <, >, ≤, ≥}. Each attribute test (A_j op v_j) is also known as a conjunct. The right-hand side of the rule is called the rule consequent, which contains the predicted class y_i.
A rule r covers a data instance x if the precondition of r matches the attributes of x. r is also said to be fired or triggered whenever it covers a given instance. For an illustration, consider the rule r1 given in Table 4.1 and the following attributes for two vertebrates: hawk and grizzly bear.
Table 4.1. Example of a rule set for the vertebrate classification problem.
r1: (Gives Birth = no) ∧ (Aerial Creature = yes) → Birds
r2: (Gives Birth = no) ∧ (Aquatic Creature = yes) → Fishes
r3: (Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals
r4: (Gives Birth = no) ∧ (Aerial Creature = no) → Reptiles
r5: (Aquatic Creature = semi) → Amphibians
Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates
hawk | warm-blooded | feather | no | no | yes | yes | no
grizzly bear | warm-blooded | fur | yes | no | no | yes | yes
r1 covers the first vertebrate because its precondition is satisfied by the hawk's attributes. The rule does not cover the second vertebrate because grizzly bears give birth to their young and cannot fly, thus violating the precondition of r1.
The quality of a classification rule can be evaluated using measures such as coverage and accuracy. Given a data set D and a classification rule r: A → y, the coverage of the rule is the fraction of instances in D that trigger the rule r. On the other hand, its accuracy or confidence factor is the fraction of instances triggered by r whose class labels are equal to y. The formal definitions of these measures are

\text{Coverage}(r) = \frac{|A|}{|D|}, \qquad \text{Accuracy}(r) = \frac{|A \cap y|}{|A|},   (4.3)

where |A| is the number of instances that satisfy the rule antecedent, |A ∩ y| is the number of instances that satisfy both the antecedent and consequent, and |D| is the total number of instances.
Example 4.1. Consider the data set shown in Table 4.2. The rule

(Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals

has a coverage of 33% since five of the fifteen instances support the rule antecedent. The rule accuracy is 100% because all five vertebrates covered by the rule are mammals.
Table 4.2. The vertebrate data set.
Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
human | warm-blooded | hair | yes | no | no | yes | no | Mammals
python | cold-blooded | scales | no | no | no | no | yes | Reptiles
salmon | cold-blooded | scales | no | yes | no | no | no | Fishes
whale | warm-blooded | hair | yes | yes | no | no | no | Mammals
frog | cold-blooded | none | no | semi | no | yes | yes | Amphibians
komodo dragon | cold-blooded | scales | no | no | no | yes | no | Reptiles
bat | warm-blooded | hair | yes | no | yes | yes | yes | Mammals
pigeon | warm-blooded | feathers | no | no | yes | yes | no | Birds
cat | warm-blooded | fur | yes | no | no | yes | no | Mammals
guppy | cold-blooded | scales | yes | yes | no | no | no | Fishes
alligator | cold-blooded | scales | no | semi | no | yes | no | Reptiles
penguin | warm-blooded | feathers | no | semi | no | yes | no | Birds
porcupine | warm-blooded | quills | yes | no | no | yes | yes | Mammals
eel | cold-blooded | scales | no | yes | no | no | no | Fishes
salamander | cold-blooded | none | no | semi | no | yes | yes | Amphibians
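The coverage and accuracy measures of Equation 4.3 can be computed directly from Table 4.2. The Python sketch below encodes the two attributes used by the rule of Example 4.1 for each vertebrate (the shorthand keys are ours) and evaluates the rule.

data = [  # (attributes, class label), one entry per row of Table 4.2
    ({"gives_birth": "yes", "body_temp": "warm-blooded"}, "Mammals"),    # human
    ({"gives_birth": "no",  "body_temp": "cold-blooded"}, "Reptiles"),   # python
    ({"gives_birth": "no",  "body_temp": "cold-blooded"}, "Fishes"),     # salmon
    ({"gives_birth": "yes", "body_temp": "warm-blooded"}, "Mammals"),    # whale
    ({"gives_birth": "no",  "body_temp": "cold-blooded"}, "Amphibians"), # frog
    ({"gives_birth": "no",  "body_temp": "cold-blooded"}, "Reptiles"),   # komodo dragon
    ({"gives_birth": "yes", "body_temp": "warm-blooded"}, "Mammals"),    # bat
    ({"gives_birth": "no",  "body_temp": "warm-blooded"}, "Birds"),      # pigeon
    ({"gives_birth": "yes", "body_temp": "warm-blooded"}, "Mammals"),    # cat
    ({"gives_birth": "yes", "body_temp": "cold-blooded"}, "Fishes"),     # guppy
    ({"gives_birth": "no",  "body_temp": "cold-blooded"}, "Reptiles"),   # alligator
    ({"gives_birth": "no",  "body_temp": "warm-blooded"}, "Birds"),      # penguin
    ({"gives_birth": "yes", "body_temp": "warm-blooded"}, "Mammals"),    # porcupine
    ({"gives_birth": "no",  "body_temp": "cold-blooded"}, "Fishes"),     # eel
    ({"gives_birth": "no",  "body_temp": "cold-blooded"}, "Amphibians"), # salamander
]

antecedent = {"gives_birth": "yes", "body_temp": "warm-blooded"}
consequent = "Mammals"

covered = [(x, y) for x, y in data if all(x[a] == v for a, v in antecedent.items())]
coverage = len(covered) / len(data)                                   # |A| / |D|
accuracy = sum(y == consequent for _, y in covered) / len(covered)    # |A ∩ y| / |A|
print(f"coverage = {coverage:.0%}, accuracy = {accuracy:.0%}")        # 33%, 100%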
4.2.1 How a Rule-Based Classifier Works
A rule-based classifier classifies a test instance based on the rule triggered by the instance. To illustrate how a rule-based classifier works, consider the rule set shown in Table 4.1 and the following vertebrates:
Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates
lemur | warm-blooded | fur | yes | no | no | yes | yes
turtle | cold-blooded | scales | no | semi | no | yes | no
dogfish shark | cold-blooded | scales | yes | yes | no | no | no
The first vertebrate, which is a lemur, is warm-blooded and gives birth to its young. It triggers the rule r3, and thus, is classified as a mammal. The second vertebrate, which is a turtle, triggers the rules r4 and r5. Since the classes predicted by the rules are contradictory (reptiles versus amphibians), their conflicting classes must be resolved. None of the rules are applicable to a dogfish shark. In this case, we need to determine what class to assign to such a test instance.
4.2.2 Properties of a Rule Set
The rule set generated by a rule-based classifier can be characterized by the following two properties.
Definition 4.1 (Mutually Exclusive Rule Set). The rules in a rule set R are mutually exclusive if no two rules in R are triggered by the same instance. This property ensures that every instance is covered by at most one rule in R.
Definition 4.2 (Exhaustive Rule Set). A rule set R has exhaustive coverage if there is a rule for each combination of attribute values. This property ensures that every instance is covered by at least one rule in R.
Table 4.3. Example of a mutually exclusive and exhaustive rule set.
r1: (Body Temperature = cold-blooded) → Non-mammals
r2: (Body Temperature = warm-blooded) ∧ (Gives Birth = yes) → Mammals
r3: (Body Temperature = warm-blooded) ∧ (Gives Birth = no) → Non-mammals
Together, these two properties ensure that every instance is covered by exactly one rule. An example of a mutually exclusive and exhaustive rule set is shown in Table 4.3. Unfortunately, many rule-based classifiers, including the one shown in Table 4.1, do not have such properties. If the rule set is not exhaustive, then a default rule, rd: { } → yd, must be added to cover the remaining cases. A default rule has an empty antecedent and is triggered when all other rules have failed. yd is known as the default class and is typically assigned to the majority class of training instances not covered by the existing rules. If the rule set is not mutually exclusive, then an instance can be covered by more than one rule, some of which may predict conflicting classes.
Definition 4.3 (Ordered Rule Set). The rules in an ordered rule set R are ranked in decreasing order of their priority. An ordered rule set is also known as a decision list.
The rank of a rule can be defined in many ways, e.g., based on its accuracy or total description length. When a test instance is presented, it will be classified by the highest-ranked rule that covers the instance. This avoids the problem of having conflicting classes predicted by multiple classification rules if the rule set is not mutually exclusive.
An alternative way to handle a non-mutually exclusive rule set without ordering the rules is to consider the consequent of each rule triggered by a test instance as a vote for a particular class. The votes are then tallied to determine the class label of the test instance. The instance is usually assigned to the class that receives the highest number of votes. The vote may also be weighted by the rule's accuracy. Using unordered rules to build a rule-based classifier has both advantages and disadvantages. Unordered rules are less susceptible to errors caused by the wrong rule being selected to classify a test instance, unlike classifiers based on ordered rules, which are sensitive to the choice of rule-ordering criteria. Model building is also less expensive because the rules do not need to be kept in sorted order. Nevertheless, classifying a test instance can be quite expensive because the attributes of the test instance must be compared against the precondition of every rule in the rule set. The sketch below contrasts the two strategies.
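The following Python sketch (not from the text) illustrates the two strategies on the rule set of Table 4.1: a decision list returns the class of the first rule that fires, while the unordered scheme tallies one vote per triggered rule. The default class used here is arbitrary.

rules = [  # Table 4.1, in rank order: (antecedent, predicted class)
    ({"gives_birth": "no",  "aerial": "yes"},             "Birds"),       # r1
    ({"gives_birth": "no",  "aquatic": "yes"},            "Fishes"),      # r2
    ({"gives_birth": "yes", "body_temp": "warm-blooded"}, "Mammals"),     # r3
    ({"gives_birth": "no",  "aerial": "no"},              "Reptiles"),    # r4
    ({"aquatic": "semi"},                                  "Amphibians"), # r5
]

def fires(antecedent, x):
    return all(x.get(a) == v for a, v in antecedent.items())

def classify_ordered(x, default="Default"):
    # Decision list: use the first (highest-ranked) rule that covers x.
    for antecedent, label in rules:
        if fires(antecedent, x):
            return label
    return default

def classify_unordered(x, default="Default"):
    # Unordered rules: every triggered rule casts one (unweighted) vote.
    votes = {}
    for antecedent, label in rules:
        if fires(antecedent, x):
            votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else default

turtle = {"gives_birth": "no", "aquatic": "semi", "aerial": "no", "body_temp": "cold-blooded"}
print(classify_ordered(turtle))    # r4 fires first -> Reptiles
print(classify_unordered(turtle))  # r4 and r5 fire -> one vote each; tie broken arbitrarily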
In the next two sections, we present techniques for extracting an ordered rule set from data. A rule-based classifier can be constructed using (1) direct methods, which extract classification rules directly from data, and (2) indirect methods, which extract classification rules from more complex classification models, such as decision trees and neural networks. Detailed discussions of these methods are presented in Sections 4.2.3 and 4.2.4, respectively.
4.2.3 Direct Methods for Rule Extraction
To illustrate the direct method, we consider a widely-used rule induction algorithm called RIPPER. This algorithm scales almost linearly with the number of training instances and is particularly suited for building models from data sets with imbalanced class distributions. RIPPER also works well with noisy data because it uses a validation set to prevent model overfitting.
RIPPER uses the sequential covering algorithm to extract rules directly from data. Rules are grown in a greedy fashion one class at a time. For binary class problems, RIPPER chooses the majority class as its default class and learns the rules to detect instances from the minority class. For multiclass problems, the classes are ordered according to their prevalence in the training set. Let (y1, y2, ..., yc) be the ordered list of classes, where y1 is the least prevalent class and yc is the most prevalent class. All training instances that belong to y1 are initially labeled as positive examples, while those that belong to other classes are labeled as negative examples. The sequential covering algorithm learns a set of rules to discriminate the positive from negative examples. Next, all training instances from y2 are labeled as positive, while those from classes y3, y4, ..., yc are labeled as negative. The sequential covering algorithm would learn the next set of rules to distinguish y2 from other remaining classes. This process is repeated until we are left with only one class, yc, which is designated as the default class.
Algorithm 4.1 Sequential covering algorithm.
A summary of the sequential covering algorithm is shown in Algorithm 4.1. The algorithm starts with an empty decision list, R, and extracts rules for each class based on the ordering specified by the class prevalence. It iteratively extracts the rules for a given class y using the Learn-One-Rule function. Once such a rule is found, all the training instances covered by the rule are eliminated. The new rule is added to the bottom of the decision list R. This procedure is repeated until the stopping criterion is met. The algorithm then proceeds to generate rules for the next class.
Figure 4.1 demonstrates how the sequential covering algorithm works for a data set that contains a collection of positive and negative examples. The rule R1, whose coverage is shown in Figure 4.1(b), is extracted first because it covers the largest fraction of positive examples. All the training instances covered by R1 are subsequently removed and the algorithm proceeds to look for the next best rule, which is R2.
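A minimal Python sketch of the sequential covering loop follows. The learn_one_rule and covers callables are placeholders for the routines described in the surrounding text, not RIPPER's actual implementation.

def sequential_covering(instances, classes_by_prevalence, learn_one_rule, covers):
    # Greedy sketch: learn rules for one class at a time (least prevalent first)
    # and remove the training instances covered by each extracted rule.
    decision_list = []
    remaining = list(instances)
    for y in classes_by_prevalence[:-1]:        # the last class becomes the default
        while True:
            rule = learn_one_rule(remaining, positive_class=y)
            if rule is None:                    # stopping criterion met for class y
                break
            decision_list.append((rule, y))
            remaining = [x for x in remaining if not covers(rule, x)]
    default_class = classes_by_prevalence[-1]
    return decision_list, default_class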
Learn-One-Rule Function
Finding an optimal rule is computationally expensive due to the exponential search space to explore. The Learn-One-Rule function addresses this problem by growing the rules in a greedy fashion. It generates an initial rule r: { } → +, where the left-hand side is an empty set and the right-hand side corresponds to the positive class. It then refines the rule until a certain stopping criterion is met. The accuracy of the initial rule may be poor because some of the training instances covered by the rule belong to the negative class. A new conjunct must be added to the rule antecedent to improve its accuracy.
Figure 4.1. An example of the sequential covering algorithm.
RIPPER uses FOIL's information gain measure to choose the best conjunct to be added into the rule antecedent. The measure takes into consideration both the gain in accuracy and the support of a candidate rule, where support is defined as the number of positive examples covered by the rule. For example, suppose the rule r: A → + initially covers p0 positive examples and n0 negative examples. After adding a new conjunct B, the extended rule r′: A ∧ B → + covers p1 positive examples and n1 negative examples. The FOIL's information gain of the extended rule is computed as follows:

\text{FOIL's information gain} = p_1 \times \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right).   (4.4)

RIPPER chooses the conjunct with highest FOIL's information gain to extend the rule, as illustrated in the next example.
Example 4.2. [FOIL's Information Gain] Consider the training set for the vertebrate classification problem shown in Table 4.2. Suppose the target class for the Learn-One-Rule function is mammals. Initially, the antecedent of the rule { } → Mammals covers 5 positive and 10 negative examples. Thus, the accuracy of the rule is only 0.333. Next, consider the following three candidate conjuncts to be added to the left-hand side of the rule: Skin Cover = hair, Body Temperature = warm-blooded, and Has Legs = No. The number of positive and negative examples covered by the rule after adding each conjunct, along with their respective accuracy and FOIL's information gain, are shown in the following table.
Candidate rule | p1 | n1 | Accuracy | Info Gain
{Skin Cover = hair} → Mammals | 3 | 0 | 1.000 | 4.755
{Body Temperature = warm-blooded} → Mammals | 5 | 2 | 0.714 | 5.498
{Has Legs = No} → Mammals | 1 | 4 | 0.200 | −0.737
Although {Skin Cover = hair} → Mammals has the highest accuracy among the three candidates, the conjunct Body Temperature = warm-blooded has the highest FOIL's information gain. Thus, it is chosen to extend the rule (see Figure 4.2).
This process continues until adding new conjuncts no longer improves the information gain measure.
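Equation 4.4 and the candidate table of Example 4.2 can be reproduced with a few lines of Python; the counts below are those used in the example.

import math

def foil_gain(p0, n0, p1, n1):
    # FOIL's information gain of extending a rule covering p0 positive and
    # n0 negative examples into one covering p1 positive and n1 negative.
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

p0, n0 = 5, 10   # {} -> Mammals covers 5 positive, 10 negative examples
candidates = {
    "Skin Cover = hair":               (3, 0),
    "Body Temperature = warm-blooded": (5, 2),
    "Has Legs = No":                   (1, 4),
}
for conjunct, (p1, n1) in candidates.items():
    print(f"{conjunct:35s} gain = {foil_gain(p0, n0, p1, n1):6.3f}")
# gains: 4.755, 5.498, -0.737 -> the warm-blooded conjunct is chosen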
Rule Pruning
The rules generated by the Learn-One-Rule function can be pruned to improve their generalization errors. RIPPER prunes the rules based on their performance on the validation set. The following metric is computed to determine whether pruning is needed: (p − n)/(p + n), where p (n) is the number of positive (negative) examples in the validation set covered by the rule. This metric is monotonically related to the rule's accuracy on the validation set. If the metric improves after pruning, then the conjunct is removed. Pruning is done starting from the last conjunct added to the rule. For example, given a rule ABCD → y, RIPPER checks whether D should be pruned first, followed by CD, BCD, etc. While the original rule covers only positive examples, the pruned rule may cover some of the negative examples in the training set.
Building the Rule Set
After generating a rule, all the positive and negative examples covered by the rule are eliminated. The rule is then added into the rule set as long as it does not violate the stopping condition, which is based on the minimum description length principle. If the new rule increases the total description length of the rule set by at least d bits, then RIPPER stops adding rules into its rule set (by default, d is chosen to be 64 bits). Another stopping condition used by RIPPER is that the error rate of the rule on the validation set must not exceed 50%.
Figure 4.2. General-to-specific and specific-to-general rule-growing strategies.
RIPPER also performs additional optimization steps to determine whether some of the existing rules in the rule set can be replaced by better alternative rules. Readers who are interested in the details of the optimization method may refer to the reference cited at the end of this chapter.
Instance Elimination
After a rule is extracted, RIPPER eliminates the positive and negative examples covered by the rule. The rationale for doing this is illustrated in the next example.
Figure 4.3 shows three possible rules, R1, R2, and R3, extracted from a training set that contains 29 positive examples and 21 negative examples. The accuracies of R1, R2, and R3 are 12/15 (80%), 7/10 (70%), and 8/12 (66.7%), respectively. R1 is generated first because it has the highest accuracy. After generating R1, the algorithm must remove the examples covered by the rule so that the next rule generated by the algorithm is different than R1. The question is, should it remove the positive examples only, the negative examples only, or both? To answer this, suppose the algorithm must choose between generating R2 or R3 after R1. Even though R2 has a higher accuracy than R3 (70% versus 66.7%), observe that the region covered by R2 is disjoint from R1, while the region covered by R3 overlaps with R1. As a result, R1 and R3 together cover 18 positive and 5 negative examples (resulting in an overall accuracy of 78.3%), whereas R1 and R2 together cover 19 positive and 6 negative examples (resulting in a lower overall accuracy of 76%). If the positive examples covered by R1 are not removed, then we may overestimate the effective accuracy of R3. If the negative examples covered by R1 are not removed, then we may underestimate the accuracy of R3. In the latter case, we might end up preferring R2 over R3 even though half of the false positive errors committed by R3 have already been accounted for by the preceding rule, R1. This example shows that the effective accuracy after adding R2 or R3 to the rule set becomes evident only when both positive and negative examples covered by R1 are removed.
Figure 4.3. Elimination of training instances by the sequential covering algorithm. R1, R2, and R3 represent regions covered by three different rules.
4.2.4 Indirect Methods for Rule Extraction
This section presents a method for generating a rule set from a decision tree. In principle, every path from the root node to a leaf node of a decision tree can be expressed as a classification rule. The test conditions encountered along the path form the conjuncts of the rule antecedent, while the class label at the leaf node is assigned to the rule consequent. Figure 4.4 shows an example of a rule set generated from a decision tree. Notice that the rule set is exhaustive and contains mutually exclusive rules. However, some of the rules can be simplified as shown in the next example.
Figure 4.4. Converting a decision tree into classification rules.
Example 4.3. Consider the following three rules from Figure 4.4:
r2: (P = No) ∧ (Q = Yes) → +
r3: (P = Yes) ∧ (R = No) → +
r5: (P = Yes) ∧ (R = Yes) ∧ (Q = Yes) → +.
Observe that the rule set always predicts a positive class when the value of Q is Yes. Therefore, we may simplify the rules as follows:
r2′: (Q = Yes) → +
r3: (P = Yes) ∧ (R = No) → +.
r3 is retained to cover the remaining instances of the positive class. Although the rules obtained after simplification are no longer mutually exclusive, they are less complex and are easier to interpret.
In the following, we describe an approach used by the C4.5rules algorithm to generate a rule set from a decision tree. Figure 4.5 shows the decision tree and resulting classification rules obtained for the data set given in Table 4.2.
Figure 4.5. Classification rules extracted from a decision tree for the vertebrate classification problem.
Rule Generation
Classification rules are extracted for every path from the root to one of the leaf nodes in the decision tree. Given a classification rule r: A → y, we consider a simplified rule, r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A. The simplified rule with the lowest pessimistic error rate is retained provided its error rate is less than that of the original rule. The rule-pruning step is repeated until the pessimistic error of the rule cannot be improved further. Because some of the rules may become identical after pruning, the duplicate rules are discarded.
Rule Ordering
After generating the rule set, C4.5rules uses the class-based ordering scheme to order the extracted rules. Rules that predict the same class are grouped together into the same subset. The total description length for each subset is computed, and the classes are arranged in increasing order of their total description length. The class that has the smallest description length is given the highest priority because it is expected to contain the best set of rules. The total description length for a class is given by L_exception + g × L_model, where L_exception is the number of bits needed to encode the misclassified examples, L_model is the number of bits needed to encode the model, and g is a tuning parameter whose default value is 0.5. The tuning parameter depends on the number of redundant attributes present in the model. The value of the tuning parameter is small if the model contains many redundant attributes.
4.2.5 Characteristics of Rule-Based Classifiers
1. Rule-based classifiers have very similar characteristics to decision trees. The expressiveness of a rule set is almost equivalent to that of a decision tree because a decision tree can be represented by a set of mutually exclusive and exhaustive rules. Both rule-based and decision tree classifiers create rectilinear partitions of the attribute space and assign a class to each partition. However, a rule-based classifier can allow multiple rules to be triggered for a given instance, thus enabling the learning of more complex models than decision trees.
2. Like decision trees, rule-based classifiers can handle varying types of categorical and continuous attributes and can easily work in multiclass classification scenarios. Rule-based classifiers are generally used to produce descriptive models that are easier to interpret but give comparable performance to the decision tree classifier.
3. Rule-based classifiers can easily handle the presence of redundant attributes that are highly correlated with one another. This is because once an attribute has been used as a conjunct in a rule antecedent, the remaining redundant attributes would show little to no FOIL's information gain and would thus be ignored.
4. Since irrelevant attributes show poor information gain, rule-based classifiers can avoid selecting irrelevant attributes if there are other relevant attributes that show better information gain. However, if the problem is complex and there are interacting attributes that can collectively distinguish between the classes but individually show poor information gain, it is likely for an irrelevant attribute to be accidentally favored over a relevant attribute just by random chance. Hence, rule-based classifiers can show poor performance in the presence of interacting attributes, when the number of irrelevant attributes is large.
5. The class-based ordering strategy adopted by RIPPER, which emphasizes giving higher priority to rare classes, is well suited for handling training data sets with imbalanced class distributions.
6. Rule-based classifiers are not well-suited for handling missing values in the test set. This is because the position of rules in a rule set follows a certain ordering strategy, and even if a test instance is covered by multiple rules, they can assign different class labels depending on their position in the rule set. Hence, if a certain rule involves an attribute that is missing in a test instance, it is difficult to ignore the rule and proceed to the subsequent rules in the rule set, as such a strategy can result in incorrect class assignments.
4.3 Nearest Neighbor Classifiers
The classification framework shown in Figure 3.3 involves a two-step process: (1) an inductive step for constructing a classification model from data, and (2) a deductive step for applying the model to test examples. Decision tree and rule-based classifiers are examples of eager learners because they are designed to learn a model that maps the input attributes to the class label as soon as the training data becomes available. An opposite strategy would be to delay the process of modeling the training data until it is needed to classify the test instances. Techniques that employ this strategy are known as lazy learners. An example of a lazy learner is the Rote classifier, which memorizes the entire training data and performs classification only if the attributes of a test instance match one of the training examples exactly. An obvious drawback of this approach is that some test instances may not be classified because they do not match any training example.
One way to make this approach more flexible is to find all the training examples that are relatively similar to the attributes of the test instance. These examples, which are known as nearest neighbors, can be used to determine the class label of the test instance. The justification for using nearest neighbors is best exemplified by the following saying: "If it walks like a duck, quacks like a duck, and looks like a duck, then it's probably a duck." A nearest neighbor classifier represents each example as a data point in a d-dimensional space, where d is the number of attributes. Given a test instance, we compute its proximity to the training instances according to one of the proximity measures described in Section 2.4 on page 71. The k-nearest neighbors of a given test instance z refer to the k training examples that are closest to z.
Figure 4.6 illustrates the 1-, 2-, and 3-nearest neighbors of a test instance located at the center of each circle. The instance is classified based on the class labels of its neighbors. In the case where the neighbors have more than one label, the test instance is assigned to the majority class of its nearest neighbors. In Figure 4.6(a), the 1-nearest neighbor of the instance is a negative example. Therefore the instance is assigned to the negative class. If the number of nearest neighbors is three, as shown in Figure 4.6(c), then the neighborhood contains two positive examples and one negative example. Using the majority voting scheme, the instance is assigned to the positive class. In the case where there is a tie between the classes (see Figure 4.6(b)), we may randomly choose one of them to classify the data point.
Figure 4.6. The 1-, 2-, and 3-nearest neighbors of an instance.
The preceding discussion underscores the importance of choosing the right value for k. If k is too small, then the nearest neighbor classifier may be susceptible to overfitting due to noise, i.e., mislabeled examples in the training data. On the other hand, if k is too large, the nearest neighbor classifier may misclassify the test instance because its list of nearest neighbors includes training examples that are located far away from its neighborhood (see Figure 4.7).
Figure 4.7. k-nearest neighbor classification with large k.
4.3.1 Algorithm
A high-level summary of the nearest neighbor classification method is given in Algorithm 4.2. The algorithm computes the distance (or similarity) between each test instance z = (x′, y′) and all the training examples (x, y) ∈ D to determine its nearest neighbor list, Dz. Such computation can be costly if the number of training examples is large. However, efficient indexing techniques are available to reduce the computation needed to find the nearest neighbors of a test instance.
Algorithm 4.2 The k-nearest neighbor classifier.
Once the nearest neighbor list is obtained, the test instance is classified based on the majority class of its nearest neighbors:

\text{Majority Voting:} \quad y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i),   (4.5)

where v is a class label, y_i is the class label for one of the nearest neighbors, and I(·) is an indicator function that returns the value 1 if its argument is true and 0 otherwise.
In the majority voting approach, every neighbor has the same impact on the classification. This makes the algorithm sensitive to the choice of k, as shown in Figure 4.6. One way to reduce the impact of k is to weight the influence of each nearest neighbor x_i according to its distance: w_i = 1/d(x′, x_i)². As a result, training examples that are located far away from z have a weaker impact on the classification compared to those that are located close to z. Using the distance-weighted voting scheme, the class label can be determined as follows:

\text{Distance-Weighted Voting:} \quad y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} w_i \times I(v = y_i).   (4.6)
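The two voting schemes of Equations 4.5 and 4.6 are easy to state in code. The sketch below uses Euclidean distance as the proximity measure; the two-dimensional points are made up for illustration, not an example from the text.

import math
from collections import defaultdict

def knn_predict(train, z, k=3, weighted=False):
    # Classify test point z from labeled training pairs (x, y) using the
    # k nearest neighbors with majority (Eq. 4.5) or distance-weighted (Eq. 4.6) voting.
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda xy: dist(xy[0], z))[:k]
    votes = defaultdict(float)
    for x, y in neighbors:
        votes[y] += 1.0 / (dist(x, z) ** 2 + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"),
         ((3.1, 2.9), "-"), ((0.9, 1.3), "+")]
print(knn_predict(train, (1.1, 1.0), k=3))                  # '+'
print(knn_predict(train, (2.5, 2.5), k=3, weighted=True))   # closer negatives dominate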
4.3.2 Characteristics of Nearest Neighbor Classifiers
1. Nearest neighbor classification is part of a more general technique known as instance-based learning, which does not build a global model, but rather uses the training examples to make predictions for a test instance. (Thus, such classifiers are often said to be "model free.") Such algorithms require a proximity measure to determine the similarity or distance between instances and a classification function that returns the predicted class of a test instance based on its proximity to other instances.
2. Although lazy learners, such as nearest neighbor classifiers, do not require model building, classifying a test instance can be quite expensive because we need to compute the proximity values individually between the test and training examples. In contrast, eager learners often spend the bulk of their computing resources for model building. Once a model has been built, classifying a test instance is extremely fast.
3. Nearest neighbor classifiers make their predictions based on local information. (This is equivalent to building a local model for each test instance.) By contrast, decision tree and rule-based classifiers attempt to find a global model that fits the entire input space. Because the classification decisions are made locally, nearest neighbor classifiers (with small values of k) are quite susceptible to noise.
4. Nearest neighbor classifiers can produce decision boundaries of arbitrary shape. Such boundaries provide a more flexible model representation compared to decision tree and rule-based classifiers that are often constrained to rectilinear decision boundaries. The decision boundaries of nearest neighbor classifiers also have high variability because they depend on the composition of training examples in the local neighborhood. Increasing the number of nearest neighbors may reduce such variability.
5. Nearest neighbor classifiers have difficulty handling missing values in both the training and test sets since proximity computations normally require the presence of all attributes. Although the subset of attributes present in two instances can be used to compute a proximity, such an approach may not produce good results since the proximity measures may be different for each pair of instances and thus hard to compare.
6. Nearest neighbor classifiers can handle the presence of interacting attributes, i.e., attributes that have more predictive power taken in combination than by themselves, by using appropriate proximity measures that can incorporate the effects of multiple attributes together.
7. The presence of irrelevant attributes can distort commonly used proximity measures, especially when the number of irrelevant attributes is large. Furthermore, if there are a large number of redundant attributes that are highly correlated with each other, then the proximity measure can be overly biased toward such attributes, resulting in improper estimates of distance. Hence, the presence of irrelevant and redundant attributes can adversely affect the performance of nearest neighbor classifiers.
8. Nearest neighbor classifiers can produce wrong predictions unless the appropriate proximity measure and data preprocessing steps are taken. For example, suppose we want to classify a group of people based on attributes such as height (measured in meters) and weight (measured in pounds). The height attribute has a low variability, ranging from 1.5 m to 1.85 m, whereas the weight attribute may vary from 90 lb. to 250 lb. If the scale of the attributes is not taken into consideration, the proximity measure may be dominated by differences in the weights of a person, as the sketch after this list illustrates.
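A common remedy for the scale problem mentioned in item 8 is to rescale each attribute to a comparable range before computing proximities. The following min-max normalization sketch uses made-up height and weight values and is only illustrative.

def min_max_scale(column):
    # Rescale a list of attribute values to [0, 1] so that no single
    # attribute dominates the distance computation.
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

heights = [1.50, 1.70, 1.85]    # meters
weights = [90.0, 150.0, 250.0]  # pounds
print(min_max_scale(heights))   # [0.0, 0.571..., 1.0]
print(min_max_scale(weights))   # [0.0, 0.375, 1.0]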
4.4 Naïve Bayes Classifier
Many classification problems involve uncertainty. First, the observed attributes and class labels may be unreliable due to imperfections in the measurement process, e.g., due to the limited precision of sensor devices. Second, the set of attributes chosen for classification may not be fully representative of the target class, resulting in uncertain predictions. To illustrate this, consider the problem of predicting a person's risk for heart disease based on a model that uses their diet and workout frequency as attributes. Although most people who eat healthily and exercise regularly have less chance of developing heart disease, they may still be at risk due to other latent factors, such as heredity, excessive smoking, and alcohol abuse, that are not captured in the model. Third, a classification model learned over a finite training set may not be able to fully capture the true relationships in the overall data, as discussed in the context of model overfitting in the previous chapter. Finally, uncertainty in predictions may arise due to the inherent random nature of real-world systems, such as those encountered in weather forecasting problems.
In the presence of uncertainty, there is a need to not only make predictions of class labels but also provide a measure of confidence associated with every prediction. Probability theory offers a systematic way for quantifying and manipulating uncertainty in data, and thus, is an appealing framework for assessing the confidence of predictions. Classification models that make use of probability theory to represent the relationship between attributes and class labels are known as probabilistic classification models. In this section, we present the naïve Bayes classifier, which is one of the simplest and most widely-used probabilistic classification models.
4.4.1 Basics of Probability Theory
Before we discuss how the naïve Bayes classifier works, we first introduce some basics of probability theory that will be useful in understanding the probabilistic classification models presented in this chapter. This involves defining the notion of probability and introducing some common approaches for manipulating probability values.
Consider a variable X, which can take any discrete value from the set {x1, ..., xk}. When we have multiple observations of that variable, such as in a data set where the variable describes some characteristic of data objects, then we can compute the relative frequency with which each value occurs. Specifically, suppose that X has the value x_i for n_i data objects. The relative frequency with which we observe the event X = x_i is then n_i/N, where N denotes the total number of occurrences (N = \sum_{i=1}^{k} n_i). These relative frequencies characterize the uncertainty that we have with respect to what value X may take for an unseen observation and motivate the notion of probability.
More formally, the probability of an event e, e.g., P(X = x_i), measures how likely it is for the event e to occur. The most traditional view of probability is based on the relative frequency of events (frequentist), while the Bayesian viewpoint (described later) takes a more flexible view of probabilities. In either case, a probability is always a number between 0 and 1. Further, the sum of probability values of all possible events, e.g., outcomes of a variable X, is equal to 1. Variables that have probabilities associated with each possible outcome (values) are known as random variables.
Now, let us consider two random variables, X and Y, that can each take k discrete values. Let n_ij be the number of times we observe X = x_i and Y = y_j, out of a total number of N occurrences. The joint probability of observing X = x_i and Y = y_j together can be estimated as

P(X = x_i, Y = y_j) = \frac{n_{ij}}{N}.   (4.7)

(This is an estimate since we typically have only a finite subset of all possible observations.) Joint probabilities can be used to answer questions such as "what is the probability that there will be a surprise quiz today and I will be late for the class." Joint probabilities are symmetric, i.e., P(X = x, Y = y) = P(Y = y, X = x). For joint probabilities, it is useful to consider their sum with respect to one of the random variables, as described in the following equation:

\sum_{j=1}^{k} P(X = x_i, Y = y_j) = \sum_{j=1}^{k} \frac{n_{ij}}{N} = \frac{n_i}{N} = P(X = x_i),   (4.8)

where n_i is the total number of times we observe X = x_i irrespective of the value of Y. Notice that n_i/N is essentially the probability of observing X = x_i. Hence, by summing out the joint probabilities with respect to a random variable Y, we obtain the probability of observing the remaining variable X. This operation is called marginalization and the probability value P(X = x_i) obtained by marginalizing out Y is sometimes called the marginal probability of X. As we will see later, joint probability and marginal probability form the basic building blocks of a number of probabilistic classification models discussed in this chapter.
Notice that in the previous discussions, we used P(X = x_i) to denote the probability of a particular outcome of a random variable X. This notation can easily become cumbersome when a number of random variables are involved. Hence, in the remainder of this section, we will use P(X) to denote the probability of any generic outcome of the random variable X, while P(x_i) will be used to represent the probability of the specific outcome x_i.
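Equations 4.7 and 4.8 amount to counting and summing. The short Python sketch below estimates joint and marginal probabilities from a made-up list of (X, Y) observations, purely to illustrate the two equations.

from collections import Counter

observations = [("x1", "y1"), ("x1", "y2"), ("x2", "y1"),
                ("x1", "y1"), ("x2", "y2"), ("x1", "y1")]
N = len(observations)

joint = {pair: count / N for pair, count in Counter(observations).items()}  # Eq. 4.7
marginal_x = Counter()
for (x, _), p in joint.items():          # marginalize out Y (Eq. 4.8)
    marginal_x[x] += p

print(joint[("x1", "y1")])   # 3/6 = 0.5
print(dict(marginal_x))      # {'x1': 0.666..., 'x2': 0.333...}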
Bayes Theorem
Suppose you have invited two of your friends Alex and Martha to a dinner party. You know that Alex attends 40% of the parties he is invited to. Further, if Alex is going to a party, there is an 80% chance of Martha coming along. On the other hand, if Alex is not going to the party, the chance of Martha coming to the party is reduced to 30%. If Martha has responded that she will be coming to your party, what is the probability that Alex will also be coming?
Bayes theorem presents the statistical principle for answering questions like the previous one, where evidence from multiple sources has to be combined with prior beliefs to arrive at predictions. Bayes theorem can be briefly described as follows.
Let P(Y|X) denote the conditional probability of observing the random variable Y whenever the random variable X takes a particular value. P(Y|X) is often read as the probability of observing Y conditioned on the outcome of X. Conditional probabilities can be used for answering questions such as "given that it is going to rain today, what will be the probability that I will go to the class." Conditional probabilities of X and Y are related to their joint probability in the following way:

P(Y|X) = \frac{P(X, Y)}{P(X)}, \quad \text{which implies}   (4.9)

P(X, Y) = P(Y|X) \times P(X) = P(X|Y) \times P(Y).   (4.10)

Rearranging the last two expressions in Equation 4.10 leads to Equation 4.11, which is known as Bayes theorem:

P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)}.   (4.11)

Bayes theorem provides a relationship between the conditional probabilities P(Y|X) and P(X|Y). Note that the denominator in Equation 4.11 involves the marginal probability of X, which can also be represented as

P(X) = \sum_{i=1}^{k} P(X, y_i) = \sum_{i=1}^{k} P(X|y_i) \times P(y_i).

Using the previous expression for P(X), we can obtain the following equation for P(Y|X) solely in terms of P(X|Y) and P(Y):

P(Y|X) = \frac{P(X|Y)\,P(Y)}{\sum_{i=1}^{k} P(X|y_i)\,P(y_i)}.   (4.12)

Example 4.4. [Bayes Theorem] Bayes theorem can be used to solve a number of inferential questions about random variables. For example, consider the problem stated at the beginning on inferring whether Alex will come to the party. Let P(A = 1) denote the probability of Alex going to a party, while P(A = 0) denotes the probability of him not going to a party. We know that

P(A = 1) = 0.4, and P(A = 0) = 1 − P(A = 1) = 0.6.

Further, let P(M = 1|A) denote the conditional probability of Martha going to a party conditioned on whether Alex is going to the party. P(M = 1|A) takes the following values:

P(M = 1|A = 1) = 0.8, and P(M = 1|A = 0) = 0.3.

We can use the above values of P(M|A) and P(A) to compute the probability of Alex going to the party given Martha is going to the party, P(A = 1|M = 1), as follows:

P(A = 1|M = 1) = \frac{P(M = 1|A = 1)\,P(A = 1)}{P(M = 1|A = 0)\,P(A = 0) + P(M = 1|A = 1)\,P(A = 1)} = \frac{0.8 \times 0.4}{0.3 \times 0.6 + 0.8 \times 0.4} = 0.64.   (4.13)

Notice that even though the prior probability P(A) of Alex going to the party is low, the observation that Martha is going, M = 1, affects the conditional probability P(A = 1|M = 1). This shows the value of Bayes theorem in combining prior assumptions with observed outcomes to make predictions. Since P(A = 1|M = 1) > 0.5, it is more likely for Alex to join if Martha is going to the party.
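The arithmetic of Example 4.4 is reproduced below as a small Bayes-rule function; the function and variable names are ours.

def posterior(prior_a, p_m_given_a, p_m_given_not_a):
    # P(A=1 | M=1) via Bayes theorem (Equation 4.12 with two outcomes for A).
    evidence = p_m_given_a * prior_a + p_m_given_not_a * (1 - prior_a)
    return p_m_given_a * prior_a / evidence

# P(A=1) = 0.4, P(M=1|A=1) = 0.8, P(M=1|A=0) = 0.3
print(posterior(0.4, 0.8, 0.3))   # 0.64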
Using Bayes Theorem for Classification
For the purpose of classification, we are interested in computing the probability of observing a class label y for a data instance given its set of attribute values x. This can be represented as P(y|x), which is known as the posterior probability of the target class. Using the Bayes theorem, we can represent the posterior probability as

P(y|x) = \frac{P(x|y)\,P(y)}{P(x)}.   (4.14)

Note that the numerator of the previous equation involves two terms, P(x|y) and P(y), both of which contribute to the posterior probability P(y|x). We describe both of these terms in the following.
The first term P(x|y) is known as the class-conditional probability of the attributes given the class label. P(x|y) measures the likelihood of observing x from the distribution of instances belonging to y. If x indeed belongs to class y, then we should expect P(x|y) to be high. From this point of view, the use of class-conditional probabilities attempts to capture the process from which the data instances were generated. Because of this interpretation, probabilistic classification models that involve computing class-conditional probabilities are known as generative classification models. Apart from their use in computing posterior probabilities and making predictions, class-conditional probabilities also provide insights about the underlying mechanism behind the generation of attribute values.
The second term in the numerator of Equation 4.14 is the prior probability P(y). The prior probability captures our prior beliefs about the distribution of class labels, independent of the observed attribute values. (This is the Bayesian viewpoint.) For example, we may have a prior belief that the likelihood of any person to suffer from a heart disease is α, irrespective of their diagnosis reports. The prior probability can either be obtained using expert knowledge, or inferred from the historical distribution of class labels.
The denominator in Equation 4.14 involves the probability of evidence, P(x). Note that this term does not depend on the class label and thus can be treated as a normalization constant in the computation of posterior probabilities. Further, the value of P(x) can be calculated as P(x) = \sum_i P(x|y_i)\,P(y_i).
Bayes theorem provides a convenient way to combine our prior beliefs with the likelihood of obtaining the observed attribute values. During the training phase, we are required to learn the parameters for P(y) and P(x|y). The prior probability P(y) can be easily estimated from the training set by computing the fraction of training instances that belong to each class. To compute the class-conditional probabilities, one approach is to consider the fraction of training instances of a given class for every possible combination of attribute values. For example, suppose that there are two attributes X1 and X2 that can each take a discrete value from c1 to ck. Let n^0 denote the number of training instances belonging to class 0, out of which n_{ij}^0 training instances have X1 = c_i and X2 = c_j. The class-conditional probability can then be given as

P(X_1 = c_i, X_2 = c_j | Y = 0) = \frac{n_{ij}^0}{n^0}.

This approach can easily become computationally prohibitive as the number of attributes increases, due to the exponential growth in the number of attribute value combinations. For example, if every attribute can take k discrete values, then the number of attribute value combinations is equal to k^d, where d is the number of attributes. The large number of attribute value combinations can also result in poor estimates of class-conditional probabilities, since every combination will have fewer training instances when the size of the training set is small.
In the following, we present the naïve Bayes classifier, which makes a simplifying assumption about the class-conditional probabilities, known as the naïve Bayes assumption. The use of this assumption significantly helps in obtaining reliable estimates of class-conditional probabilities, even when the number of attributes is large.
4.4.2 Naïve Bayes Assumption
The naïve Bayes classifier assumes that the class-conditional probability of all attributes x can be factored as a product of class-conditional probabilities of every attribute x_i, as described in the following equation:

P(x|y) = \prod_{i=1}^{d} P(x_i|y),   (4.15)

where every data instance x consists of d attributes, {x1, x2, ..., xd}. The basic assumption behind the previous equation is that the attribute values are conditionally independent of each other, given the class label y. This means that the attributes are influenced only by the target class and if we know the class label, then we can consider the attributes to be independent of each other. The concept of conditional independence can be formally stated as follows.
Conditional Independence
Let X1, X2, and Y denote three sets of random variables. The variables in X1 are said to be conditionally independent of X2, given Y, if the following condition holds:

P(X_1 | X_2, Y) = P(X_1 | Y).   (4.16)

This means that conditioned on Y, the distribution of X1 is not influenced by the outcomes of X2, and hence X1 is conditionally independent of X2. To illustrate the notion of conditional independence, consider the relationship between a person's arm length (X1) and his or her reading skills (X2). One might observe that people with longer arms tend to have higher levels of reading skills, and thus consider X1 and X2 to be related to each other. However, this relationship can be explained by another factor, which is the age of the person (Y). A young child tends to have short arms and lacks the reading skills of an adult. If the age of a person is fixed, then the observed relationship between arm length and reading skills disappears. Thus, we can conclude that arm length and reading skills are not directly related to each other and are conditionally independent when the age variable is fixed.
Another way of describing conditional independence is to consider the joint conditional probability, P(X1, X2 | Y), as follows:

P(X_1, X_2 | Y) = \frac{P(X_1, X_2, Y)}{P(Y)} = \frac{P(X_1, X_2, Y)}{P(X_2, Y)} \times \frac{P(X_2, Y)}{P(Y)} = P(X_1 | X_2, Y) \times P(X_2 | Y) = P(X_1 | Y) \times P(X_2 | Y),   (4.17)

where Equation 4.16 was used to obtain the last line of Equation 4.17. The previous description of conditional independence is quite useful from an operational perspective. It states that the joint conditional probability of X1 and X2 given Y can be factored as the product of conditional probabilities of X1 and X2 considered separately. This forms the basis of the naïve Bayes assumption stated in Equation 4.15.
How a Naïve Bayes Classifier Works
Using the naïve Bayes assumption, we only need to estimate the conditional probability of each x_i given Y separately, instead of computing the class-conditional probability for every combination of attribute values. For example, if n_i^0 and n_j^0 denote the number of training instances belonging to class 0 with X1 = c_i and X2 = c_j, respectively, then the class-conditional probability can be estimated as

P(X_1 = c_i, X_2 = c_j | Y = 0) = \frac{n_i^0}{n^0} \times \frac{n_j^0}{n^0}.

In the previous equation, we only need to count the number of training instances for every one of the k values of an attribute X, irrespective of the values of other attributes. Hence, the number of parameters needed to learn class-conditional probabilities is reduced from k^d to dk. This greatly simplifies the expression for the class-conditional probability and makes it more amenable to learning parameters and making predictions, even in high-dimensional settings.
The naïve Bayes classifier computes the posterior probability for a test instance x by using the following equation:

P(y|x) = \frac{P(y) \prod_{i=1}^{d} P(x_i|y)}{P(x)}.   (4.18)

Since P(x) is fixed for every y and only acts as a normalizing constant to ensure that P(y|x) ∈ [0, 1], we can write

P(y|x) \propto P(y) \prod_{i=1}^{d} P(x_i|y).

Hence, it is sufficient to choose the class that maximizes P(y) \prod_{i=1}^{d} P(x_i|y).
One of the useful properties of the naïve Bayes classifier is that it can easily work with incomplete information about data instances, when only a subset of attributes are observed at every instance. For example, if we only observe p out of d attributes at a data instance, then we can still compute P(y) \prod_{i=1}^{p} P(x_i|y) using those p attributes and choose the class with the maximum value. The naïve Bayes classifier can thus naturally handle missing values in test instances. In fact, in the extreme case where no attributes are observed, we can still use the prior probability P(y) as an estimate of the posterior probability. As we observe more attributes, we can keep refining the posterior probability to better reflect the likelihood of observing the data instance.
In the next two subsections, we describe several approaches for estimating the conditional probabilities P(x_i|y) for categorical and continuous attributes from the training set.
Estimating Conditional Probabilities for Categorical Attributes
For a categorical attribute X_i, the conditional probability P(X_i = c|y) is estimated according to the fraction of training instances in class y where X_i takes on a particular categorical value c:
P(X_i = c | y) = \frac{n_c}{n},

where n is the number of training instances belonging to class y, out of which n_c instances have X_i = c. For example, in the training set given in Figure 4.8, seven people have the class label Defaulted Borrower = No, out of which three people have Home Owner = Yes while the remaining four have Home Owner = No. As a result, the conditional probability P(Home Owner = Yes | Defaulted Borrower = No) is equal to 3/7. Similarly, the conditional probability for defaulted borrowers with Marital Status = Single is given by P(Marital Status = Single | Defaulted Borrower = Yes) = 2/3. Note that the sum of conditional probabilities over all possible outcomes of X_i is equal to one, i.e., \sum_c P(X_i = c | y) = 1.
Figure 4.8. Training set for predicting the loan default problem.
Estimating Conditional Probabilities for Continuous Attributes
There are two ways to estimate the class-conditional probabilities for continuous attributes:
1. We can discretize each continuous attribute and then replace the continuous values with their corresponding discrete intervals. This approach transforms the continuous attributes into ordinal attributes, and the simple method described previously for computing the conditional probabilities of categorical attributes can be employed. Note that the estimation error of this method depends on the discretization strategy (as described in Section 2.3.6 on page 63), as well as the number of discrete intervals. If the number of intervals is too large, every interval may have an insufficient number of training instances to provide a reliable estimate of P(X_i|Y). On the other hand, if the number of intervals is too small, then the discretization process may lose information about the true distribution of continuous values, and thus result in poor predictions.
2. We can assume a certain form of probability distribution for the continuous variable and estimate the parameters of the distribution using the training data. For example, we can use a Gaussian distribution to represent the conditional probability of continuous attributes. The Gaussian distribution is characterized by two parameters, the mean, μ, and the variance, σ². For each class y_j, the class-conditional probability for attribute X_i is

P(X_i = x_i | Y = y_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left[ -\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2} \right].   (4.19)

The parameter μ_ij can be estimated using the sample mean of X_i (x̄) for all training instances that belong to y_j. Similarly, σ_ij² can be estimated from the sample variance (s²) of such training instances. For example, consider the Annual Income attribute shown in Figure 4.8. The sample mean and variance for this attribute with respect to the class No are

\bar{x} = \frac{125 + 100 + 70 + \ldots + 75}{7} = 110, \qquad s^2 = \frac{(125 - 110)^2 + (100 - 110)^2 + \ldots + (75 - 110)^2}{6} = 2975, \qquad s = \sqrt{2975} = 54.54.

Given a test instance with taxable income equal to $120K, we can use the following value as its conditional probability given class No:

P(\text{Income} = 120 | \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\left[ -\frac{(120 - 110)^2}{2 \times 2975} \right] = 0.0072.
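The Gaussian density of Equation 4.19 and the Annual Income computation above can be checked with a few lines of Python; the numbers are those used in the text and the function name is ours.

import math

def gaussian(x, mean, var):
    # Class-conditional density of Equation 4.19 for one continuous attribute.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / (math.sqrt(2 * math.pi) * math.sqrt(var))

# Annual Income for class No: sample mean 110, sample variance 2975
print(round(gaussian(120, 110, 2975), 4))   # 0.0072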
Example4.5.[NaïveBayesClassifier]ConsiderthedatasetshowninFigure4.9(a) ,wherethetargetclassisDefaultedBorrower,whichcantaketwovaluesYesandNo.Wecancomputetheclass-conditionalprobabilityforeachcategoricalattributeandthesamplemeanandvarianceforthecontinuousattribute,assummarizedinFigure4.9(b) .
Weareinterestedinpredictingtheclasslabelofatestinstance.Todo
this,wefirstcomputethepriorprobabilitiesbycountingthenumberoftraininginstancesbelongingtoeveryclass.Wethusobtain and
.Next,wecancomputetheclass-conditionalprobabilityasfollows:
μij Xi(x¯)yj σij2
(s2)
x¯=125+100+70+…+757=100s2=(125−110)2+(100−110)2+…(75−110)26=2975s=2975=54.54.
P(Income=120|No)=12π(54.54)exp−(120−110)22×2975=0.0072.
x=(HomeOwner=No,MaritalStatus=Married,AnnualIncome=$120K)
P(yes)=0.3P(No)=0.7
P(x|No) = P(Home Owner = No|No) × P(Status = Married|No) × P(Annual Income = $120K|No) = 0.0024.
P(x|Yes) = P(Home Owner = No|Yes) × P(Status = Married|Yes) × P(Annual Income = $120K|Yes) = 0.

Figure 4.9. The naïve Bayes classifier for the loan classification problem.

Notice that the class-conditional probability for class Yes has become 0 because there are no instances belonging to class Yes with Status = Married in the training set. Using these class-conditional probabilities, we can estimate the posterior probabilities as

P(No|x) = 0.7 × 0.0024/P(x) = 0.0016α,
P(Yes|x) = 0.3 × 0/P(x) = 0,

where α = 1/P(x) is a normalizing constant. Since P(No|x) > P(Yes|x), the instance is classified as No.
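As a quick illustration of how the pieces of Example 4.5 combine, the minimal sketch below scores the test instance under each class and picks the class with the larger posterior numerator. It is not the book's code; the priors, categorical conditionals, and No-class Gaussian parameters are the values worked out above, and the Yes-class Gaussian parameters (mean 90, variance 25) are assumed values consistent with the density 1.2 × 10⁻⁹ quoted later in the text.

```python
import math

def gaussian(x, mean, var):
    """Class-conditional density for a continuous attribute (Equation 4.19)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Priors and categorical conditionals from the worked example (Figure 4.9(b)).
prior = {"No": 0.7, "Yes": 0.3}
p_home_owner_no = {"No": 4 / 7, "Yes": 1.0}
p_status_married = {"No": 4 / 7, "Yes": 0.0}
# Gaussian parameters for Annual Income per class; the Yes-class values are assumed.
income_params = {"No": (110.0, 2975.0), "Yes": (90.0, 25.0)}

scores = {}
for y in ("No", "Yes"):
    mean, var = income_params[y]
    likelihood = p_home_owner_no[y] * p_status_married[y] * gaussian(120.0, mean, var)
    scores[y] = prior[y] * likelihood        # proportional to P(y|x)

print(scores)                                # the Yes score is 0: P(Status = Married|Yes) = 0
print(max(scores, key=scores.get))           # -> 'No'
```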
Handling Zero Conditional Probabilities
The preceding example illustrates a potential problem with using the naïve Bayes assumption in estimating class-conditional probabilities. If the conditional probability for any of the attributes is zero, then the entire expression for the class-conditional probability becomes zero. Note that zero conditional probabilities arise when the number of training instances is small and the number of possible values of an attribute is large. In such cases, it may happen that a combination of attribute values and class labels is never observed, resulting in a zero conditional probability.

In a more extreme case, if the training instances do not cover some combinations of attribute values and class labels, then we may not be able to even classify some of the test instances. For example, if P(Marital Status = Divorced|No) is zero instead of 1/7, then a data instance with attribute set x = (Home Owner = Yes, Marital Status = Divorced, Income = $120K) has the following class-conditional probabilities:

P(x|No) = 3/7 × 0 × 0.0072 = 0.
P(x|Yes) = 0 × 1/3 × 1.2 × 10⁻⁹ = 0.

Since both the class-conditional probabilities are 0, the naïve Bayes classifier will not be able to classify the instance. To address this problem, it is important to adjust the conditional probability estimates so that they are not as brittle as simply using fractions of training instances. This can be achieved by using the following alternate estimates of conditional probability:
Laplace estimate: P(X_i = c|y) = (n_c + 1)/(n + v),   (4.20)
m-estimate: P(X_i = c|y) = (n_c + mp)/(n + m),   (4.21)

where n is the number of training instances belonging to class y, n_c is the number of training instances with X_i = c and Y = y, v is the total number of attribute values that X_i can take, p is some initial estimate of P(X_i = c|y) that is known a priori, and m is a hyper-parameter that indicates our confidence in using p when the fraction of training instances is too brittle. Note that even if n_c = 0, both the Laplace and m-estimates provide non-zero values of conditional probabilities. Hence, they avoid the problem of vanishing class-conditional probabilities and thus generally provide more robust estimates of posterior probabilities.
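A small sketch of how the two smoothed estimates of Equations 4.20 and 4.21 behave; the counts n, n_c, v and the hyper-parameters m and p below are illustrative values only, not taken from the text.

```python
def laplace_estimate(n_c, n, v):
    """Laplace estimate of P(Xi = c | y) (Equation 4.20)."""
    return (n_c + 1) / (n + v)

def m_estimate(n_c, n, m, p):
    """m-estimate of P(Xi = c | y) (Equation 4.21)."""
    return (n_c + m * p) / (n + m)

# Even when the raw count n_c is 0, both estimates stay non-zero,
# so the class-conditional product no longer vanishes.
print(laplace_estimate(0, 7, 3))         # 0.1 instead of 0
print(m_estimate(0, 7, m=3, p=1 / 3))    # 0.1 instead of 0
```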
CharacteristicsofNaïveBayesClassifiers1. NaïveBayesclassifiersareprobabilisticclassificationmodelsthatare
abletoquantifytheuncertaintyinpredictionsbyprovidingposteriorprobabilityestimates.Theyarealsogenerativeclassificationmodelsastheytreatthetargetclassasthecausativefactorforgeneratingthedatainstances.Hence,apartfromcomputingposteriorprobabilities,naïveBayesclassifiersalsoattempttocapturetheunderlyingmechanismbehindthegenerationofdatainstancesbelongingtoeveryclass.Theyarethususefulforgainingpredictiveaswellasdescriptiveinsights.
2. ByusingthenaïveBayesassumption,theycaneasilycomputeclass-conditionalprobabilitieseveninhigh-dimensionalsettings,providedthattheattributesareconditionallyindependentofeachothergiventheclasslabels.ThispropertymakesnaïveBayesclassifierasimpleandeffectiveclassificationtechniquethatiscommonlyusedindiverseapplicationproblems,suchastextclassification.
3. NaïveBayesclassifiersarerobusttoisolatednoisepointsbecausesuchpointsarenotabletosignificantlyimpacttheconditionalprobabilityestimates,astheyareoftenaveragedoutduringtraining.
4. Naïve Bayes classifiers can handle missing values in the training set by ignoring the missing values of every attribute while computing its conditional probability estimates. Further, naïve Bayes classifiers can effectively handle missing values in a test instance, by using only the non-missing attribute values while computing posterior probabilities. However, if the frequency of missing values for a particular attribute value depends on the class label, then this approach will not accurately estimate posterior probabilities.
5. Naïve Bayes classifiers are robust to irrelevant attributes. If X_i is an irrelevant attribute, then P(X_i|Y) becomes almost uniformly distributed for every class y. The class-conditional probabilities for every class thus receive similar contributions of P(X_i|Y), resulting in negligible impact on the posterior probability estimates.
6. Correlated attributes can degrade the performance of naïve Bayes classifiers because the naïve Bayes assumption of conditional independence no longer holds for such attributes. For example, consider the following probabilities:

P(A = 0|Y = 0) = 0.4,  P(A = 1|Y = 0) = 0.6,
P(A = 0|Y = 1) = 0.6,  P(A = 1|Y = 1) = 0.4,

where A is a binary attribute and Y is a binary class variable. Suppose there is another binary attribute B that is perfectly correlated with A when Y = 0, but is independent of A when Y = 1. For simplicity, assume that the conditional probabilities for B are the same as for A. Given an instance with attributes A = 0, B = 0, and assuming conditional independence, we can compute its posterior probabilities as follows:

P(Y = 0|A = 0, B = 0) = P(A = 0|Y = 0) P(B = 0|Y = 0) P(Y = 0)/P(A = 0, B = 0) = 0.16 × P(Y = 0)/P(A = 0, B = 0),
P(Y = 1|A = 0, B = 0) = P(A = 0|Y = 1) P(B = 0|Y = 1) P(Y = 1)/P(A = 0, B = 0) = 0.36 × P(Y = 1)/P(A = 0, B = 0).

If P(Y = 0) = P(Y = 1), then the naïve Bayes classifier would assign the instance to class 1. However, the truth is,

P(A = 0, B = 0|Y = 0) = P(A = 0|Y = 0) = 0.4,

because A and B are perfectly correlated when Y = 0. As a result, the posterior probability for Y = 0 is

P(Y = 0|A = 0, B = 0) = P(A = 0, B = 0|Y = 0) P(Y = 0)/P(A = 0, B = 0) = 0.4 × P(Y = 0)/P(A = 0, B = 0),

which is larger than that for Y = 1. The instance should have been classified as class 0. Hence, the naïve Bayes classifier can produce incorrect results when the attributes are not conditionally independent given the class labels. Naïve Bayes classifiers are thus not well-suited for handling redundant or interacting attributes.
4.5BayesianNetworksTheconditionalindependenceassumptionmadebynaïveBayesclassifiersmayseemtoorigid,especiallyforclassificationproblemswheretheattributesaredependentoneachotherevenafterconditioningontheclasslabels.WethusneedanapproachtorelaxthenaïveBayesassumptionsothatwecancapturemoregenericrepresentationsofconditionalindependenceamongattributes.
Inthissection,wepresentaflexibleframeworkformodelingprobabilisticrelationshipsbetweenattributesandclasslabels,knownasBayesianNetworks.Bybuildingonconceptsfromprobabilitytheoryandgraphtheory,Bayesiannetworksareabletocapturemoregenericformsofconditionalindependenceusingsimpleschematicrepresentations.Theyalsoprovidethenecessarycomputationalstructuretoperforminferencesoverrandomvariablesinanefficientway.Inthefollowing,wefirstdescribethebasicrepresentationofaBayesiannetwork,andthendiscussmethodsforperforminginferenceandlearningmodelparametersinthecontextofclassification.
4.5.1GraphicalRepresentation
Bayesiannetworksbelongtoabroaderfamilyofmodelsforcapturingprobabilisticrelationshipsamongrandomvariables,knownasprobabilisticgraphicalmodels.Thebasicconceptbehindthesemodelsistousegraphicalrepresentationswherethenodesofthegraphcorrespondtorandomvariablesandtheedgesbetweenthenodesexpressprobabilistic
relationships.Figures4.10(a) and4.10(b) showexamplesofprobabilisticgraphicalmodelsusingdirectededges(witharrows)andundirectededges(withoutarrows),respectively.DirectedgraphicalmodelsarealsoknownasBayesiannetworkswhileundirectedgraphicalmodelsareknownasMarkovrandomfields.Thetwoapproachesusedifferentsemanticsforexpressingrelationshipsamongrandomvariablesandarethususefulindifferentcontexts.Inthefollowing,webrieflydescribeBayesiannetworksthatareusefulinthecontextofclassification.
ABayesiannetwork(alsoreferredtoasabeliefnetwork)involvesdirectededgesbetweennodes,whereeveryedgerepresentsadirectionofinfluenceamongrandomvariables.Forexample,Figure4.10(a) showsaBayesiannetworkwherevariableCdependsuponthevaluesofvariablesAandB,asindicatedbythearrowspointingtowardCfromAandB.Consequently,thevariableCinfluencesthevaluesofvariablesDandE.EveryedgeinaBayesiannetworkthusencodesadependencerelationshipbetweenrandomvariableswithaparticulardirectionality.
Figure4.10.Illustrationsoftwobasictypesofgraphicalmodels.
Bayesiannetworksaredirectedacyclicgraphs(DAG)becausetheydonotcontainanydirectedcyclessuchthattheinfluenceofanodeloopsbacktothesamenode.Figure4.11 showssomeexamplesofBayesiannetworksthatcapturedifferenttypesofdependencestructuresamongrandomvariables.Inadirectedacyclicgraph,ifthereisadirectededgefromXtoY,thenXiscalledtheparentofYandYiscalledthechildofX.NotethatanodecanhavemultipleparentsinaBayesiannetwork,e.g.,nodeDhastwoparentnodes,BandC,inFigure4.11(a) .Furthermore,ifthereisadirectedpathinthenetworkfromXtoZ,thenXisanancestorofZ,whileZisadescendantofX.Forexample,inthediagramshowninFigure4.11(b) ,AisadescendantofDandDisanancestorofB.Notethattherecanbemultipledirectedpathsbetweentwonodesofadirectedacyclicgraph,asisthecasefornodesAandDinFigure4.11(a) .
Figure4.11.ExamplesofBayesiannetworks.
ConditionalIndependenceAnimportantpropertyofaBayesiannetworkisitsabilitytorepresentvaryingformsofconditionalindependenceamongrandomvariables.ThereareseveralwaysofdescribingtheconditionalindependenceassumptionscapturedbyBayesiannetworks.Oneofthemostgenericwaysofexpressingconditionalindependenceistheconceptofd-separation,whichcanbeusedtodetermineifanytwosetsofnodesAandBareconditionallyindependentgivenanothersetofnodesC.AnotherusefulconceptisthatoftheMarkovblanketofanodeY,whichdenotestheminimalsetofnodesXthatmakesYindependentoftheothernodesinthegraph,whenconditionedonX.(SeeBibliographicNotesformoredetailsond-separationandMarkovblanket.)However,forthepurposeofclassification,itissufficienttodescribeasimplerexpressionofconditionalindependenceinBayesiannetworks,knownasthelocalMarkovproperty.
Property1(LocalMarkovProperty).AnodeinaBayesiannetworkisconditionallyindependentofitsnon-descendants,ifitsparentsareknown.
ToillustratethelocalMarkovproperty,considertheBayesnetworkshowninFigure4.11(b) .WecanstatethatAisconditionallyindependentofbothBandDgivenC,becauseCistheparentofAandnodesBandDarenon-descendantsofA.ThelocalMarkovpropertyhelpsininterpretingparent-childrelationshipsinBayesiannetworksasrepresentationsofconditionalprobabilities.Sinceanodeisconditionallyindependentofitsnon-descendants
given its parents, the conditional independence assumptions imposed by a Bayesian network are often sparse in structure. Nonetheless, Bayesian networks are able to express a richer class of conditional independence statements among attributes and class labels than the naïve Bayes classifier. In fact, the naïve Bayes classifier can be viewed as a special type of Bayesian network, where the target class Y is at the root of a tree and every attribute X_i is connected to the root node by a directed edge, as shown in Figure 4.12(a).
Figure4.12.ComparingthegraphicalrepresentationofanaïveBayesclassifierwiththatofagenericBayesiannetwork.
NotethatinanaïveBayesclassifier,everydirectededgepointsfromthetargetclasstotheobservedattributes,suggestingthattheclasslabelisafactorbehindthegenerationofattributes.Inferringtheclasslabelcanthusbeviewedasdiagnosingtherootcausebehindtheobservedattributes.Ontheotherhand,Bayesiannetworksprovideamoregenericstructureofprobabilisticrelationships,sincethetargetclassisnotrequiredtobeattherootofatreebutcanappearanywhereinthegraph,asshowninFigure
4.12(b). In this diagram, inferring Y not only helps in diagnosing the factors influencing X3 and X4, but also helps in predicting the influence of X1 and X2.

Joint Probability
The local Markov property can be used to succinctly express the joint probability of the set of random variables involved in a Bayesian network. To realize this, let us first consider a Bayesian network consisting of d nodes, X1 to Xd, where the nodes have been numbered in such a way that Xi is an ancestor of Xj only if i < j. The joint probability of X = {X1, …, Xd} can be generically factorized using the chain rule of probability as

P(X) = P(X1) P(X2|X1) P(X3|X1, X2) … P(Xd|X1, …, Xd−1) = ∏_{i=1}^d P(Xi|X1, …, Xi−1).   (4.22)

By the way we have constructed the graph, note that the set {X1, …, Xi−1} contains only non-descendants of Xi. Hence, by using the local Markov property, we can write P(Xi|X1, …, Xi−1) as P(Xi|pa(Xi)), where pa(Xi) denotes the parents of Xi. The joint probability can then be represented as

P(X) = ∏_{i=1}^d P(Xi|pa(Xi)).   (4.23)

It is thus sufficient to represent the probability of every node Xi in terms of its parent nodes, pa(Xi), for computing P(X). This is achieved with the help of probability tables that associate every node to its parent nodes as follows:
1. The probability table for node Xi contains the conditional probability values P(Xi|pa(Xi)) for every combination of values in Xi and pa(Xi).
2. If Xi has no parents (pa(Xi) = ∅), then the table contains only the prior probability P(Xi).
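To make the factorization in Equation 4.23 concrete, the following sketch evaluates the joint probability of a toy three-node network A → B → C as a product of per-node probability tables. The network and its table values are made up for illustration; they are not the network of Figure 4.13.

```python
# Toy Bayesian network A -> B -> C over binary variables, with hypothetical tables.
p_a = {0: 0.6, 1: 0.4}                                              # P(A): A has no parents
p_b_given_a = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # P(B=b | A=a), keyed by (b, a)
p_c_given_b = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # P(C=c | B=b), keyed by (c, b)

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b|a) * P(c|b), following Equation 4.23."""
    return p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]

# Sanity check: the joint probability sums to one over all assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(joint(1, 1, 0), total)   # 0.4 * 0.8 * 0.5 = 0.16; total = 1.0
```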
Example4.6.[ProbabilityTables]Figure4.13 showsanexampleofaBayesiannetworkformodelingtherelationshipsbetweenapatient'ssymptomsandriskfactors.Theprobabilitytablesareshownatthesideofeverynodeinthefigure.Theprobabilitytablesassociatedwiththeriskfactors(ExerciseandDiet)containonlythepriorprobabilities,whereasthetablesforheartdisease,heartburn,bloodpressure,andchestpain,containtheconditionalprobabilities.
Figure4.13.ABayesiannetworkfordetectingheartdiseaseandheartburninpatients.
UseofHiddenVariables
ABayesiannetworktypicallyinvolvestwotypesofvariables:observedvariablesthatareclampedtospecificobservedvalues,andunobservedvariables,whosevaluesarenotknownandneedtobeinferredfromthenetwork.Todistinguishbetweenthesetwotypesofvariables,observedvariablesaregenerallyrepresentedusingshadednodeswhileunobservedvariablesarerepresentedusingemptynodes.Figure4.14 showsanexampleofaBayesiannetworkwithobservedvariables(A,B,andE)andunobservedvariables(CandD).
Figure 4.14. Observed and unobserved variables are represented using shaded and unshaded circles, respectively.
Inthecontextofclassification,theobservedvariablescorrespondtothesetofattributesX,whilethetargetclassisrepresentedusinganunobservedvariableYthatneedstobeinferredduringtesting.However,notethatagenericBayesiannetworkmaycontainmanyotherunobservedvariablesapartfromthetargetclass,asrepresentedinFigure4.15 asthesetofvariablesH.Theseunobservedvariablesrepresenthiddenorconfoundingfactorsthataffecttheprobabilitiesofattributesandclasslabels,althoughtheyareneverdirectlyobserved.TheuseofhiddenvariablesenhancestheexpressivepowerofBayesiannetworksinrepresentingcomplexprobabilistic
relationshipsbetweenattributesandclasslabels.ThisisoneofthekeydistinguishingpropertiesofBayesiannetworksascomparedtonaïveBayesclassifiers.
4.5.2 Inference and Learning

Given the probability tables corresponding to every node in a Bayesian network, the problem of inference corresponds to computing the probabilities of different sets of random variables. In the context of classification, one of the key inference problems is to compute the probability of a target class Y taking on a specific value y, given the set of observed attributes at a data instance, x. This can be represented using the following conditional probability:

P(Y = y|x) = P(y, x)/P(x) = P(y, x)/∑_{y′} P(y′, x).   (4.24)

The previous equation involves marginal probabilities of the form P(y, x). They can be computed by marginalizing out the hidden variables H from the joint probability as follows:

P(y, x) = ∑_H P(y, x, H),   (4.25)

where the joint probability P(y, x, H) can be obtained by using the factorization described in Equation 4.23. To understand the nature of computations involved in estimating P(y, x), consider the example Bayesian network shown in Figure 4.15, which involves a target class, Y, three observed attributes, X1 to X3, and four hidden variables, H1 to H4. For this network, we can express P(y, x) as

P(y, x) = ∑_{h1} ∑_{h2} ∑_{h3} ∑_{h4} P(y, x1, x2, x3, h1, h2, h3, h4)
        = ∑_{h1} ∑_{h2} ∑_{h3} ∑_{h4} [P(h1) P(h2) P(x2) P(h4) P(x1|h1, h2) × P(h3|x2, h2) P(y|x1, h3) P(x3|h3, h4)]   (4.26)
        = ∑_{h1} ∑_{h2} ∑_{h3} ∑_{h4} f(h1, h2, h3, h4),   (4.27)

where f is a factor that depends on the values of h1 to h4. In the previous simplistic expression of P(y, x), a different summand is considered for every combination of values, h1 to h4, in the hidden variables, H1 to H4. If we assume that every variable in the network can take k discrete values, then the summation has to be carried out a total of k⁴ times. The computational complexity of this approach is thus O(k⁴). Moreover, the number of computations grows exponentially with the number of hidden variables, making it difficult to use this approach with networks that have a large number of hidden variables. In the following, we present different computational techniques for efficiently performing inferences in Bayesian networks.

Figure 4.15. An example of a Bayesian network with four hidden variables, H1 to H4, three observed attributes, X1 to X3, and one target class Y.
Variable Elimination
To reduce the number of computations involved in estimating P(y, x), let us closely examine the expressions in Equations 4.26 and 4.27. Notice that although f(h1, h2, h3, h4) depends on the values of all four hidden variables, it can be decomposed as a product of several smaller factors, where every factor involves only a small number of hidden variables. For example, the factor P(h4) depends only on the value of h4, and thus acts as a constant multiplicative term when summations are performed over h1, h2, or h3. Hence, if we place P(h4) outside the summations of h1 to h3, we can save some repeated multiplications occurring inside every summand.

In general, we can push every summation as far inside as possible, so that the factors that do not depend on the summing variable are placed outside the summation. This will help reduce the number of wasteful computations by using smaller factors at every summation. To illustrate this process, consider the following sequence of steps for computing P(y, x), by rearranging the order of summations in Equation 4.26:

P(y, x) = P(x2) ∑_{h4} P(h4) ∑_{h3} P(y|x1, h3) P(x3|h3, h4) × ∑_{h2} P(h2) P(h3|x2, h2) ∑_{h1} P(h1) P(x1|h1, h2)   (4.28)
        = P(x2) ∑_{h4} P(h4) ∑_{h3} P(y|x1, h3) P(x3|h3, h4) × ∑_{h2} P(h2) P(h3|x2, h2) f1(h2)   (4.29)
        = P(x2) ∑_{h4} P(h4) ∑_{h3} P(y|x1, h3) P(x3|h3, h4) f2(h3)   (4.30)
        = P(x2) ∑_{h4} P(h4) f3(h4),   (4.31)

where f_i represents the intermediate factor term obtained by summing out h_i. To check if the previous rearrangements provide any improvements in computational efficiency, let us count the number of computations occurring at every step of the process. At the first step (Equation 4.28), we perform a summation over h1 using factors that depend on h1 and h2. This requires considering every pair of values in h1 and h2, resulting in O(k²) computations. Similarly, the second step (Equation 4.29) involves summing out h2 using factors of h2 and h3, leading to O(k²) computations. The third step (Equation 4.30) again requires O(k²) computations as it involves summing out h3 over factors depending on h3 and h4. Finally, the fourth step (Equation 4.31) involves summing out h4 using factors depending on h4, resulting in O(k) computations.

The overall complexity of the previous approach is thus O(k²), which is considerably smaller than the O(k⁴) complexity of the basic approach. Hence, by merely rearranging summations and using algebraic manipulations, we are able to improve the computational efficiency in computing P(y, x). This procedure is known as variable elimination.

The basic concept that variable elimination exploits to reduce the number of computations is the distributive nature of multiplication over addition operations. For example, consider the following multiplication and addition operations:

a · (b + c + d) = a · b + a · c + a · d.

Notice that the right-hand side of the previous equation involves three multiplications and two additions, while the left-hand side involves only one multiplication and two additions, thus saving on two arithmetic operations. This property is utilized by variable elimination in pushing constant terms outside the summation, such that they are multiplied only once.
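The savings from pushing summations inward can be checked numerically. The sketch below is illustrative only: it uses a simple chain Y → H1 → H2 → X with randomly generated probability tables rather than the network of Figure 4.15, computes P(y, x) once by brute-force enumeration over the hidden variables and once by eliminating them one at a time, and verifies that the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3   # every variable takes k values

def random_cpt(shape):
    """Random conditional probability table normalized over its first axis."""
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

# Hypothetical chain-structured network Y -> H1 -> H2 -> X.
p_y = random_cpt((k,))       # P(Y)
p_h1 = random_cpt((k, k))    # P(H1 | Y),  indexed [h1, y]
p_h2 = random_cpt((k, k))    # P(H2 | H1), indexed [h2, h1]
p_x = random_cpt((k, k))     # P(X  | H2), indexed [x, h2]

y, x = 1, 2   # fixed (observed) values

# Brute force: enumerate every (h1, h2) assignment of the hidden variables;
# with m hidden variables this grows as k^m.
brute = sum(p_y[y] * p_h1[h1, y] * p_h2[h2, h1] * p_x[x, h2]
            for h1 in range(k) for h2 in range(k))

# Variable elimination: sum out h2 first (producing a factor over h1), then h1.
# Each elimination touches at most two variables, so the cost grows only
# linearly in the number of hidden variables for a chain.
f1 = np.array([sum(p_h2[h2, h1] * p_x[x, h2] for h2 in range(k)) for h1 in range(k)])
elim = p_y[y] * sum(p_h1[h1, y] * f1[h1] for h1 in range(k))

print(np.isclose(brute, elim))   # True: both give P(Y=y, X=x)
```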
Note that the efficiency of variable elimination depends on the order of hidden variables used for performing summations. Hence, we would ideally like to find the optimal order of variables that results in the smallest number of computations. Unfortunately, finding the optimal order of summations for a generic Bayesian network is an NP-hard problem, i.e., no algorithm is known that can find the optimal ordering in polynomial time. However, there exist efficient techniques for handling special types of Bayesian networks, e.g., those involving tree-like graphs, as described in the following.
Sum-Product Algorithm for Trees
Note that in Equations 4.28 and 4.29, whenever a variable h_i is eliminated during marginalization, it results in the creation of a factor f_i that depends on the neighboring nodes of h_i. f_i is then absorbed in the factors of neighboring variables and the process is repeated until all unobserved variables have been marginalized. This phenomenon of variable elimination can be viewed as transmitting a local message from the variable being marginalized to its neighboring nodes. This idea of message passing utilizes the structure of the graph for performing computations, thus making it possible to use graph-theoretic approaches for making effective inferences. The sum-product algorithm builds on the concept of message passing for computing marginal and conditional probabilities on tree-based graphs.

Figure 4.16 shows an example of a tree involving five variables, X1 to X5. A key characteristic of a tree is that every node in the tree has exactly one parent, and there is only one directed edge between any two nodes in the tree. For the purpose of illustration, let us consider the problem of estimating the marginal probability of X2, P(X2). This can be obtained by marginalizing out every variable in the graph except X2 and rearranging the summations to obtain the following expression:

P(x2) = ∑_{x1} ∑_{x3} ∑_{x4} ∑_{x5} P(x1) P(x2|x1) P(x3|x2) P(x4|x3) P(x5|x3)
      = (∑_{x1} P(x1) P(x2|x1)) × (∑_{x3} P(x3|x2) (∑_{x4} P(x4|x3)) (∑_{x5} P(x5|x3))),

where the first parenthesized factor is the message m12(x2), the inner sums over x4 and x5 are the messages m43(x3) and m53(x3), and the overall sum over x3 is the message m32(x2).

Figure 4.16. An example of a Bayesian network with a tree structure.

Here m_ij(x_j) has been conveniently chosen to represent the factor of x_j that is obtained by summing out x_i. We can view m_ij(x_j) as a local message passed from node x_i to node x_j, as shown using arrows in Figure 4.17(a). These local messages capture the influence of eliminating nodes on the marginal probabilities of neighboring nodes.

Before we formally describe the formula for computing m_ij(x_j) and P(x_j), we first define a potential function ψ(·) that is associated with every node and edge of the graph. We can define the potential of a node X_i as

ψ(X_i) = { P(X_i), if X_i is the root node; 1, otherwise. }   (4.32)
Figure 4.17. Illustration of message passing in the sum-product algorithm.

Similarly, we can define the potential of an edge between nodes X_i and X_j (where X_i is the parent of X_j) as

ψ(X_i, X_j) = P(X_j|X_i).

Using ψ(X_i) and ψ(X_i, X_j), we can represent m_ij(x_j) using the following equation:

m_ij(x_j) = ∑_{x_i} (ψ(x_i) ψ(x_i, x_j) ∏_{k∈N(i)\j} m_ki(x_i)),   (4.33)

where N(i) represents the set of neighbors of node X_i. The message m_ij that is transmitted from X_i to X_j can thus be recursively computed using the messages incident on X_i from its neighboring nodes excluding X_j. Note that the formula for m_ij involves taking a sum over all possible values of X_i, after multiplying the factors obtained from the neighbors of X_i. This approach of message passing is thus called the "sum-product" algorithm. Further, since m_ij represents a notion of "belief" propagated from X_i to X_j, this algorithm is also known as belief propagation. The marginal probability of a node X_i is then given as

P(x_i) = ψ(x_i) ∏_{j∈N(i)} m_ji(x_i).   (4.34)

A useful property of the sum-product algorithm is that it allows the messages to be reused for computing a different marginal probability in the future. For example, if we had to compute the marginal probability for node X3, we would require the following messages from its neighboring nodes: m23(x3), m43(x3), and m53(x3). However, note that m43(x3) and m53(x3) have already been computed in the process of computing the marginal probability of X2 and thus can be reused.

Notice that the basic operations of the sum-product algorithm resemble a message passing protocol over the edges of the network. A node sends out a message to all its neighboring nodes only after it has received incoming messages from all its neighbors. Hence, we can initialize the message passing protocol from the leaf nodes, and transmit messages till we reach the root node. We can then run a second pass of messages from the root node back to the leaf nodes. In this way, we can compute the messages for every edge in both directions, using just O(2|E|) operations, where |E| is the number of edges. Once we have transmitted all possible messages as shown in Figure 4.17(b), we can easily compute the marginal probability of every node in the graph using Equation 4.34.
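Below is a minimal numerical sketch of the message computations for the tree of Figure 4.16 (X1 → X2 → X3, with X4 and X5 as children of X3). The probability tables are made up for illustration. It computes m12, m43, m53, and m32, and checks that Equation 4.34 reproduces the marginal P(X2) obtained by brute-force summation of the joint probability.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 2   # binary variables for simplicity

def cpt(shape):
    """Random conditional probability table normalized over its first axis."""
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

# Hypothetical tables for the tree X1 -> X2 -> X3 -> {X4, X5}.
p1 = cpt((k,))       # P(X1)
p21 = cpt((k, k))    # P(X2 | X1), indexed [x2, x1]
p32 = cpt((k, k))    # P(X3 | X2), indexed [x3, x2]
p43 = cpt((k, k))    # P(X4 | X3), indexed [x4, x3]
p53 = cpt((k, k))    # P(X5 | X3), indexed [x5, x3]

# Messages toward X2, starting from the leaves (Equation 4.33).
m12 = np.array([sum(p1[x1] * p21[x2, x1] for x1 in range(k)) for x2 in range(k)])
m43 = np.array([sum(p43[x4, x3] for x4 in range(k)) for x3 in range(k)])   # all ones
m53 = np.array([sum(p53[x5, x3] for x5 in range(k)) for x3 in range(k)])   # all ones
m32 = np.array([sum(p32[x3, x2] * m43[x3] * m53[x3] for x3 in range(k)) for x2 in range(k)])

marginal_x2 = m12 * m32   # Equation 4.34 with psi(X2) = 1, since X2 is not the root

# Brute-force check against the full joint probability.
check = np.zeros(k)
for x1 in range(k):
    for x2 in range(k):
        for x3 in range(k):
            for x4 in range(k):
                for x5 in range(k):
                    check[x2] += p1[x1] * p21[x2, x1] * p32[x3, x2] * p43[x4, x3] * p53[x5, x3]

print(np.allclose(marginal_x2, check))   # True
```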
In the context of classification, the sum-product algorithm can be easily modified for computing the conditional probability of the class label y given the set of observed attributes x̂, i.e., P(y|x̂). This basically amounts to computing P(y, X = x̂) in Equation 4.24, where X is clamped to the observed values x̂. To handle the scenario where some of the random variables are fixed and do not need to be normalized, we consider the following modification. If X_i is a random variable that is fixed to a specific value x̂_i, then we can simply modify ψ(X_i) and ψ(X_i, X_j) as follows:

ψ(X_i) = { 1, if X_i = x̂_i; 0, otherwise. }   (4.35)
ψ(X_i, X_j) = { P(X_j|x̂_i), if X_i = x̂_i; 0, otherwise. }   (4.36)

We can run the sum-product algorithm using these modified values for every observed variable and thus compute P(y, X = x̂).
Figure4.18.Exampleofapoly-treeanditscorrespondingfactorgraph.
GeneralizationsforNon-TreeGraphsThesum-productalgorithmisguaranteedtooptimallyconvergeinthecaseoftreesusingasinglerunofmessagepassinginbothdirectionsofeveryedge.Thisisbecauseanytwonodesinatreehaveauniquepathforthetransmissionofmessages.Furthermore,sinceeverynodeinatreehasasingleparent,thejointprobabilityinvolvesonlyfactorsofatmosttwovariables.Hence,itissufficienttoconsiderpotentialsoveredgesandnotothergenericsubstructuresinthegraph.
Both of the previous properties are violated in graphs that are not trees, thus making it difficult to directly apply the sum-product algorithm for making inferences. However, a number of variants of the sum-product algorithm have been devised to perform inferences on a broader family of graphs than trees. Many of these variants transform the original graph into an alternative tree-based representation, and then apply the sum-product algorithm on the transformed tree. In this section, we briefly discuss one such transformation, known as factor graphs.
Factorgraphsareusefulformakinginferencesovergraphsthatviolatetheconditionthateverynodehasasingleparent.Nonetheless,theystillrequiretheabsenceofmultiplepathsbetweenanytwonodes,toguaranteeconvergence.Suchgraphsareknownaspoly-trees.Anexampleofapoly-treeisshowninFigure4.18(a) .
A poly-tree can be transformed into a tree-based representation with the help of factor graphs. These graphs consist of two types of nodes, variable nodes (that are represented using circles) and factor nodes (that are represented
usingsquares).Thefactornodesrepresentconditionalindependencerelationshipsamongthevariablesofthepoly-tree.Inparticular,everyprobabilitytablecanberepresentedasafactornode.Theedgesinafactorgraphareundirectedinnatureandrelateavariablenodetoafactornodeifthevariableisinvolvedintheprobabilitytablecorrespondingtothefactornode.Figure4.18(b) presentsthefactorgraphrepresentationofthepoly-treeshowninFigure4.18(a) .
Notethatthefactorgraphofapoly-treealwaysformsatree-likestructure,wherethereisauniquepathofinfluencebetweenanytwonodesinthefactorgraph.Hence,wecanapplyamodifiedformofsum-productalgorithmtotransmitmessagesbetweenvariablenodesandfactornodes,whichisguaranteedtoconvergetooptimalvalues.
LearningModelParametersInallourpreviousdiscussionsonBayesiannetworks,wehadassumedthatthetopologyoftheBayesiannetworkandthevaluesintheprobabilitytablesofeverynodewerealreadyknown.Inthissection,wediscussapproachesforlearningboththetopologyandtheprobabilitytablevaluesofaBayesiannetworkfromthetrainingdata.
Let us first consider the case where the topology of the network is known and we are only required to compute the probability tables. If there are no unobserved variables in the training data, then we can easily compute the probability table for P(X_i|pa(X_i)), by counting the fraction of training instances for every value of X_i and every combination of values in pa(X_i). However, if there are unobserved variables in X_i or pa(X_i), then computing the fraction of training instances for such variables is non-trivial and requires the use of advanced techniques such as the Expectation-Maximization algorithm (described later in Chapter 8).
LearningthestructureoftheBayesiannetworkisamuchmorechallengingtaskthanlearningtheprobabilitytables.Althoughtherearesomescoringapproachesthatattempttofindagraphstructurethatmaximizesthetraininglikelihood,theyareoftencomputationallyinfeasiblewhenthegraphislarge.Hence,acommonapproachforconstructingBayesiannetworksistousethesubjectiveknowledgeofdomainexperts.
4.5.3CharacteristicsofBayesianNetworks
1. Bayesiannetworksprovideapowerfulapproachforrepresentingprobabilisticrelationshipsbetweenattributesandclasslabelswiththehelpofgraphicalmodels.Theyareabletocapturecomplexformsofdependenciesamongvariables.Apartfromencodingpriorbeliefs,theyarealsoabletomodelthepresenceoflatent(unobserved)factorsashiddenvariablesinthegraph.Bayesiannetworksarethusquiteexpressiveandprovidepredictiveaswellasdescriptiveinsightsaboutthebehaviorofattributesandclasslabels.
2. Bayesiannetworkscaneasilyhandlethepresenceofcorrelatedorredundantattributes,asopposedtothenaïveBayesclassifier.ThisisbecauseBayesiannetworksdonotusethenaïveBayesassumptionaboutconditionalindependence,butinsteadareabletoexpressricherformsofconditionalindependence.
3. Similar to the naïve Bayes classifier, Bayesian networks are also quite robust to the presence of noise in the training data. Further, they can handle missing values during training as well as testing. If a test instance contains an attribute X_i with a missing value, then a Bayesian network can perform inference by treating X_i as an unobserved node and marginalizing out its effect on the target class. Hence, Bayesian networks are well-suited for handling incompleteness in the data, and can work with partial information. However, unless the pattern with which missing values occur is completely random, their presence will likely introduce some degree of error and/or bias into the analysis.
4. Bayesiannetworksarerobusttoirrelevantattributesthatcontainnodiscriminatoryinformationabouttheclasslabels.Suchattributesshownoimpactontheconditionalprobabilityofthetargetclass,andarethusrightfullyignored.
5. LearningthestructureofaBayesiannetworkisacumbersometaskthatoftenrequiresassistancefromexpertknowledge.However,oncethestructurehasbeendecided,learningtheparametersofthenetworkcanbequitestraightforward,especiallyifallthevariablesinthenetworkareobserved.
6. Duetoitsadditionalabilityofrepresentingcomplexformsofrelationships,BayesiannetworksaremoresusceptibletooverfittingascomparedtothenaïveBayesclassifier.Furthermore,BayesiannetworkstypicallyrequiremoretraininginstancesforeffectivelylearningtheprobabilitytablesthanthenaïveBayesclassifier.
7. Although the sum-product algorithm provides computationally efficient techniques for performing inference over tree-like graphs, the complexity of the approach increases significantly when dealing with generic graphs of large sizes. In situations where exact inference is computationally infeasible, it is quite common to use approximate inference techniques.
4.6 Logistic Regression
The naïve Bayes and the Bayesian network classifiers described in the previous sections provide different ways of estimating the conditional probability of an instance x given class y, P(x|y). Such models are known as probabilistic generative models. Note that the conditional probability P(x|y) essentially describes the behavior of instances in the attribute space that are generated from class y. However, for the purpose of making predictions, we are finally interested in computing the posterior probability P(y|x). For example, computing the following ratio of posterior probabilities is sufficient for inferring class labels in a binary classification problem:

P(y = 1|x)/P(y = 0|x).

This ratio is known as the odds. If this ratio is greater than 1, then x is classified as y = 1. Otherwise, it is assigned to class y = 0. Hence, one may simply learn a model of the odds based on the attribute values of training instances, without having to compute P(x|y) as an intermediate quantity in the Bayes theorem.

Classification models that directly assign class labels without computing class-conditional probabilities are called discriminative models. In this section, we present a probabilistic discriminative model known as logistic regression, which directly estimates the odds of a data instance x using its attribute values. The basic idea of logistic regression is to use a linear predictor, z = w^T x + b, for representing the odds of x as follows:

P(y = 1|x)/P(y = 0|x) = e^z = e^{w^T x + b},   (4.37)

where w and b are the parameters of the model and a^T denotes the transpose of a vector a. Note that if w^T x + b > 0, then x belongs to class 1 since its odds is greater than 1. Otherwise, x belongs to class 0.

Figure 4.19. Plot of sigmoid (logistic) function, σ(z).

Since P(y = 0|x) + P(y = 1|x) = 1, we can re-write Equation 4.37 as

P(y = 1|x)/(1 − P(y = 1|x)) = e^z.

This can be further simplified to express P(y = 1|x) as a function of z:

P(y = 1|x) = 1/(1 + e^{−z}) = σ(z),   (4.38)

where the function σ(·) is known as the logistic or sigmoid function. Figure 4.19 shows the behavior of the sigmoid function as we vary z. We can see that σ(z) ≥ 0.5 only when z ≥ 0. We can also derive P(y = 0|x) using σ(z) as follows:

P(y = 0|x) = 1 − σ(z) = 1/(1 + e^z).   (4.39)
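As a small numerical sketch, Equations 4.37 and 4.38 translate directly into code. The weights, bias, and test instance below are hypothetical values chosen purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -1.5])   # hypothetical model parameters
b = 0.3
x = np.array([1.2, 0.4])    # hypothetical test instance

z = w @ x + b               # linear predictor
odds = np.exp(z)            # Equation 4.37: P(y=1|x) / P(y=0|x)
p1 = sigmoid(z)             # Equation 4.38: P(y=1|x)

print(odds, p1, 1 - p1)
print("class 1" if z > 0 else "class 0")   # classify by the sign of z (odds > 1)
```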
Hence, if we have learned a suitable value of parameters w and b, we can use Equations 4.38 and 4.39 to estimate the posterior probabilities of any data instance x and determine its class label.

4.6.1 Logistic Regression as a Generalized Linear Model

Since the posterior probabilities are real-valued, their estimation using the previous equations can be viewed as solving a regression problem. In fact, logistic regression belongs to a broader family of statistical regression models, known as generalized linear models (GLM). In these models, the target variable y is considered to be generated from a probability distribution P(y|x), whose mean μ can be estimated using a link function g(·) as follows:

g(μ) = z = w^T x + b.   (4.40)

For binary classification using logistic regression, y follows a Bernoulli distribution (y can either be 0 or 1) and μ is equal to P(y = 1|x). The link function g(·) of logistic regression, called the logit function, can thus be represented as

g(μ) = log(μ/(1 − μ)).

Depending on the choice of link function g(·) and the form of probability distribution P(y|x), GLMs are able to represent a broad family of regression models, such as linear regression and Poisson regression. They require
different approaches for estimating their model parameters, (w, b). In this chapter, we will only discuss approaches for estimating the model parameters of logistic regression, although methods for estimating parameters of other types of GLMs are often similar (and sometimes even simpler). (See Bibliographic Notes for more details on GLMs.)

Note that even though logistic regression has relationships with regression models, it is a classification model since the computed posterior probabilities are eventually used to determine the class label of a data instance.

4.6.2 Learning Model Parameters

The parameters of logistic regression, (w, b), are estimated during training using a statistical approach known as the maximum likelihood estimation (MLE) method. This method involves computing the likelihood of observing the training data given (w, b), and then determining the model parameters (w*, b*) that yield maximum likelihood.

Let D.train = {(x1, y1), (x2, y2), …, (xn, yn)} denote a set of n training instances, where y_i is a binary variable (0 or 1). For a given training instance x_i, we can compute its posterior probabilities using Equations 4.38 and 4.39. We can then express the likelihood of observing y_i given x_i, w, and b as

P(y_i|x_i, w, b) = P(y = 1|x_i)^{y_i} × P(y = 0|x_i)^{1−y_i}
               = (σ(z_i))^{y_i} × (1 − σ(z_i))^{1−y_i}
               = (σ(w^T x_i + b))^{y_i} × (1 − σ(w^T x_i + b))^{1−y_i},   (4.41)

where σ(·) is the sigmoid function as described above. Equation 4.41 basically means that the likelihood P(y_i|x_i, w, b) is equal to P(y = 1|x_i) when y_i = 1, and equal to P(y = 0|x_i) when y_i = 0. The likelihood of all training instances, L(w, b), can then be computed by taking the product of individual likelihoods (assuming independence among training instances) as follows:

L(w, b) = ∏_{i=1}^n P(y_i|x_i, w, b) = ∏_{i=1}^n P(y = 1|x_i)^{y_i} × P(y = 0|x_i)^{1−y_i}.   (4.42)

The previous equation involves multiplying a large number of probability values, each of which is smaller than or equal to 1. Since this naïve computation can easily become numerically unstable when n is large, a more practical approach is to consider the negative logarithm (to base e) of the likelihood function, also known as the cross entropy function:

−log L(w, b) = −∑_{i=1}^n y_i log(P(y = 1|x_i)) + (1 − y_i) log(P(y = 0|x_i))
            = −∑_{i=1}^n y_i log(σ(w^T x_i + b)) + (1 − y_i) log(1 − σ(w^T x_i + b)).

The cross entropy is a loss function that measures how unlikely it is for the training data to be generated from the logistic regression model with parameters (w, b). Intuitively, we would like to find model parameters (w*, b*) that result in the lowest cross entropy, −log L(w*, b*):

(w*, b*) = argmin_{(w,b)} E(w, b) = argmin_{(w,b)} −log L(w, b),   (4.43)

where E(w, b) = −log L(w, b) is the loss function. It is worth emphasizing that E(w, b) is a convex function, i.e., any minima of E(w, b) will be a global minima. Hence, we can use any of the standard convex optimization techniques to solve Equation 4.43, which are mentioned in Appendix E. Here, we briefly describe the Newton-Raphson method that is commonly used for estimating the parameters of logistic regression. For ease of representation, we will use a single vector w̃ = (w^T b)^T to describe (w, b), which is of size one greater than w. Similarly, we will consider the concatenated feature vector x̃ = (x^T 1)^T, such that the linear predictor z = w^T x + b can be succinctly written as z = w̃^T x̃. Also, the concatenation of all training labels, y1 to yn, will be represented as y, the set consisting of σ(z1) to σ(zn) will be represented as σ, and the concatenation of x̃1 to x̃n will be represented as X̃.
The Newton-Raphson method is an iterative method for finding w̃* that uses the following equation to update the model parameters at every iteration:

w̃^(new) = w̃^(old) − H^{−1} ∇E(w̃),   (4.44)

where ∇E(w̃) and H are the first- and second-order derivatives of the loss function E(w̃) with respect to w̃, respectively. The key intuition behind Equation 4.44 is to move the model parameters in the direction of maximum gradient, such that w̃ takes larger steps when ∇E(w̃) is large. When w̃ arrives at a minima after some number of iterations, then ∇E(w̃) would become equal to 0 and thus result in convergence. Hence, we start with some initial values of w̃ (either randomly assigned or set to 0) and use Equation 4.44 to iteratively update w̃ till there are no significant changes in its value (beyond a certain threshold).

The first-order derivative of E(w̃) is given by

∇E(w̃) = −∑_{i=1}^n [y_i x̃_i (1 − σ(w̃^T x̃_i)) − (1 − y_i) x̃_i σ(w̃^T x̃_i)]
       = ∑_{i=1}^n (σ(w̃^T x̃_i) − y_i) x̃_i
       = X̃^T (σ − y),   (4.45)

where we have used the fact that dσ(z)/dz = σ(z)(1 − σ(z)). Using ∇E(w̃), we can compute the second-order derivative of E(w̃) as

H = ∇∇E(w̃) = ∑_{i=1}^n σ(w̃^T x̃_i)(1 − σ(w̃^T x̃_i)) x̃_i x̃_i^T = X̃^T R X̃,   (4.46)

where R is a diagonal matrix whose ith diagonal element R_ii = σ_i(1 − σ_i). We can now use the first- and second-order derivatives of E(w̃) in Equation 4.44 to obtain the following update equation at the kth iteration:

w̃^(k+1) = w̃^(k) − (X̃^T R_k X̃)^{−1} X̃^T (σ_k − y),   (4.47)

where the subscript k under R_k and σ_k refers to using w̃^(k) to compute both terms.
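The update in Equation 4.47 can be written directly in a few lines of linear algebra. The sketch below is a minimal illustration on synthetic data, not library code or the book's implementation: it augments each instance with a constant 1 and iterates w̃ ← w̃ − (X̃ᵀRX̃)⁻¹X̃ᵀ(σ − y). The small ridge added to the Hessian is an implementation convenience for numerical stability, not part of Equation 4.46.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=8):
    """Newton-Raphson updates (Equation 4.47); returns w_tilde = (w, b)."""
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 to every instance
    w_tilde = np.zeros(X_tilde.shape[1])
    for _ in range(n_iter):
        s = sigmoid(X_tilde @ w_tilde)                    # vector of sigma_i values
        R = np.diag(s * (1.0 - s))                        # diagonal matrix with R_ii = sigma_i (1 - sigma_i)
        grad = X_tilde.T @ (s - y)                        # gradient, Equation 4.45
        H = X_tilde.T @ R @ X_tilde + 1e-8 * np.eye(X_tilde.shape[1])   # Hessian (Eq. 4.46) plus tiny ridge
        w_tilde = w_tilde - np.linalg.solve(H, grad)      # Newton step, Equations 4.44 / 4.47
    return w_tilde

# Synthetic two-class data for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w_tilde = fit_logistic_newton(X, y)
preds = (sigmoid(np.hstack([X, np.ones((100, 1))]) @ w_tilde) >= 0.5).astype(int)
print((preds == y).mean())   # training accuracy on this toy data
```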
4.6.3 Characteristics of Logistic Regression

1. Logistic regression is a discriminative model for classification that directly computes the posterior probabilities without making any assumption about the class-conditional probabilities. Hence, it is quite generic and can be applied in diverse applications. It can also be easily extended to multiclass classification, where it is known as multinomial logistic regression. However, its expressive power is limited to learning only linear decision boundaries.
2. Because there are different weights (parameters) for every attribute, the learned parameters of logistic regression can be analyzed to understand the relationships between attributes and class labels.
3. Because logistic regression does not involve computing densities and distances in the attribute space, it can work more robustly even in high-dimensional settings than distance-based methods such as nearest neighbor classifiers. However, the objective function of logistic regression does not involve any term relating to the complexity of the model. Hence, logistic regression does not provide a way to make a trade-off between model complexity and training performance, as compared to other classification models such as support vector
machines.Nevertheless,variantsoflogisticregressioncaneasilybedevelopedtoaccountformodelcomplexity,byincludingappropriatetermsintheobjectivefunctionalongwiththecrossentropyfunction.
4. Logisticregressioncanhandleirrelevantattributesbylearningweightparameterscloseto0forattributesthatdonotprovideanygaininperformanceduringtraining.Itcanalsohandleinteractingattributessincethelearningofmodelparametersisachievedinajointfashionbyconsideringtheeffectsofallattributestogether.Furthermore,ifthereareredundantattributesthatareduplicatesofeachother,thenlogisticregressioncanlearnequalweightsforeveryredundantattribute,withoutdegradingclassificationperformance.However,thepresenceofalargenumberofirrelevantorredundantattributesinhigh-dimensionalsettingscanmakelogisticregressionsusceptibletomodeloverfitting.
5. Logisticregressioncannothandledatainstanceswithmissingvalues,sincetheposteriorprobabilitiesareonlycomputedbytakingaweightedsumofalltheattributes.Iftherearemissingvaluesinatraininginstance,itcanbediscardedfromthetrainingset.However,iftherearemissingvaluesinatestinstance,thenlogisticregressionwouldfailtopredictitsclasslabel.
4.7ArtificialNeuralNetwork(ANN)Artificialneuralnetworks(ANN)arepowerfulclassificationmodelsthatareabletolearnhighlycomplexandnonlineardecisionboundariespurelyfromthedata.Theyhavegainedwidespreadacceptanceinseveralapplicationssuchasvision,speech,andlanguageprocessing,wheretheyhavebeenrepeatedlyshowntooutperformotherclassificationmodels(andinsomecasesevenhumanperformance).Historically,thestudyofartificialneuralnetworkswasinspiredbyattemptstoemulatebiologicalneuralsystems.Thehumanbrainconsistsprimarilyofnervecellscalledneurons,linkedtogetherwithotherneuronsviastrandsoffibercalledaxons.Wheneveraneuronisstimulated(e.g.,inresponsetoastimuli),ittransmitsnerveactivationsviaaxonstootherneurons.Thereceptorneuronscollectthesenerveactivationsusingstructurescalleddendrites,whichareextensionsfromthecellbodyoftheneuron.Thestrengthofthecontactpointbetweenadendriteandanaxon,knownasasynapse,determinestheconnectivitybetweenneurons.Neuroscientistshavediscoveredthatthehumanbrainlearnsbychangingthestrengthofthesynapticconnectionbetweenneuronsuponrepeatedstimulationbythesameimpulse.
Thehumanbrainconsistsofapproximately100billionneuronsthatareinter-connectedincomplexways,makingitpossibleforustolearnnewtasksandperformregularactivities.Notethatasingleneurononlyperformsasimplemodularfunction,whichistorespondtothenerveactivationscomingfromsenderneuronsconnectedatitsdendrite,andtransmititsactivationtoreceptorneuronsviaaxons.However,itisthecompositionofthesesimplefunctionsthattogetherisabletoexpresscomplexfunctions.Thisideaisatthebasisofconstructingartificialneuralnetworks.
Analogoustothestructureofahumanbrain,anartificialneuralnetworkiscomposedofanumberofprocessingunits,callednodes,thatareconnectedwitheachotherviadirectedlinks.Thenodescorrespondtoneuronsthatperformthebasicunitsofcomputation,whilethedirectedlinkscorrespondtoconnectionsbetweenneurons,consistingofaxonsanddendrites.Further,theweightofadirectedlinkbetweentwoneuronsrepresentsthestrengthofthesynapticconnectionbetweenneurons.Asinbiologicalneuralsystems,theprimaryobjectiveofANNistoadapttheweightsofthelinksuntiltheyfittheinput-outputrelationshipsoftheunderlyingdata.
ThebasicmotivationbehindusinganANNmodelistoextractusefulfeaturesfromtheoriginalattributesthataremostrelevantforclassification.Traditionally,featureextractionhasbeenachievedbyusingdimensionalityreductiontechniquessuchasPCA(introducedinChapter2),whichshowlimitedsuccessinextractingnonlinearfeatures,orbyusinghand-craftedfeaturesprovidedbydomainexperts.Byusingacomplexcombinationofinter-connectednodes,ANNmodelsareabletoextractmuchrichersetsoffeatures,resultingingoodclassificationperformance.Moreover,ANNmodelsprovideanaturalwayofrepresentingfeaturesatmultiplelevelsofabstraction,wherecomplexfeaturesareseenascompositionsofsimplerfeatures.Inmanyclassificationproblems,modelingsuchahierarchyoffeaturesturnsouttobeveryuseful.Forexample,inordertodetectahumanfaceinanimage,wecanfirstidentifylow-levelfeaturessuchassharpedgeswithdifferentgradientsandorientations.Thesefeaturescanthenbecombinedtoidentifyfacialpartssuchaseyes,nose,ears,andlips.Finally,anappropriatearrangementoffacialpartscanbeusedtocorrectlyidentifyahumanface.ANNmodelsprovideapowerfularchitecturetorepresentahierarchicalabstractionoffeatures,fromlowerlevelsofabstraction(e.g.,edges)tohigherlevels(e.g.,facialparts).
Artificialneuralnetworkshavehadalonghistoryofdevelopmentsspanningoverfivedecadesofresearch.AlthoughclassicalmodelsofANNsufferedfromseveralchallengesthathinderedprogressforalongtime,theyhavere-emergedwithwidespreadpopularitybecauseofanumberofrecentdevelopmentsinthelastdecade,collectivelyknownasdeeplearning.Inthissection,weexamineclassicalapproachesforlearningANNmodels,startingfromthesimplestmodelcalledperceptronstomorecomplexarchitecturescalledmulti-layerneuralnetworks.Inthenextsection,wediscusssomeoftherecentadvancementsintheareaofANNthathavemadeitpossibletoeffectivelylearnmodernANNmodelswithdeeparchitectures.
4.7.1Perceptron
A perceptron is a basic type of ANN model that involves two types of nodes: input nodes, which are used to represent the input attributes, and an output node, which is used to represent the model output. Figure 4.20 illustrates the basic architecture of a perceptron that takes three input attributes, x1, x2, and x3, and produces a binary output y. The input node corresponding to an attribute x_i is connected via a weighted link w_i to the output node. The weighted link is used to emulate the strength of a synaptic connection between neurons.

Figure 4.20. Basic architecture of a perceptron.

The output node is a mathematical device that computes a weighted sum of its inputs, adds a bias factor b to the sum, and then examines the sign of the result to produce the output ŷ as follows:

ŷ = { 1, if w^T x + b > 0; −1, otherwise. }   (4.48)

To simplify notations, w and b can be concatenated to form w̃ = (w^T b)^T, while x can be appended with 1 at the end to form x̃ = (x^T 1)^T. The output of the perceptron ŷ can then be written:

ŷ = sign(w̃^T x̃),

where the sign function acts as an activation function by providing an output value of +1 if the argument is positive and −1 if its argument is negative.

Learning the Perceptron
Given a training set, we are interested in learning parameters w̃ such that ŷ closely resembles the true y of training instances. This is achieved by using the perceptron learning algorithm given in Algorithm 4.3. The key computation for this algorithm is the iterative weight update formula given in Step 8 of the algorithm:

w_j^(k+1) = w_j^(k) + λ(y_i − ŷ_i^(k)) x_ij,   (4.49)

where w_j^(k) is the weight parameter associated with the jth input link after the kth iteration, λ is a parameter known as the learning rate, and x_ij is the value of the jth attribute of the training example x_i. The justification for Equation 4.49 is rather intuitive. Note that (y_i − ŷ_i) captures the discrepancy between y_i and ŷ_i, such that its value is 0 only when the true label and the predicted output match. Assume x_ij is positive. If ŷ = 0 and y = 1, then w_j is increased at the next iteration so that w̃^T x̃_i can become positive. On the other hand, if ŷ = 1 and y = 0, then w_j is decreased so that w̃^T x̃_i can become negative. Hence, the weights are modified at every iteration to reduce the discrepancies between ŷ and y across all training instances. The learning rate λ, a parameter whose value is between 0 and 1, can be used to control the amount of adjustments made in each iteration. The algorithm halts when the average number of discrepancies is smaller than a threshold γ.

Algorithm 4.3 Perceptron learning algorithm.

The perceptron is a simple classification model that is designed to learn linear decision boundaries in the attribute space. Figure 4.21 shows the decision boundary obtained by applying the perceptron learning algorithm to the data set provided on the left of the figure. However, note that there can be multiple decision boundaries that can separate the two classes, and the perceptron arbitrarily learns one of these boundaries depending on the random initial values of parameters. (The selection of the optimal decision boundary is a problem that will be revisited in the context of support vector machines in Section 4.9.) Further, the perceptron learning algorithm is only guaranteed to converge when the classes are linearly separable. However, if the classes are not linearly separable, the algorithm fails to converge. Figure 4.22 shows an example of nonlinearly separable data given by the XOR function. The perceptron cannot find the right solution for this data because there is no linear decision boundary that can perfectly separate the training instances. Thus, the stopping condition at line 12 of Algorithm 4.3 would never be met and hence, the perceptron learning algorithm would fail to converge. This is a major limitation of perceptrons since real-world classification problems often involve nonlinearly separable classes.

Figure 4.21. Perceptron decision boundary for the data given on the left (+ represents a positively labeled instance while o represents a negatively labeled instance).

Figure 4.22. XOR classification problem. No linear hyperplane can separate the two classes.
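A compact sketch of the perceptron learning loop built around the update of Equation 4.49; it does not reproduce the line numbering of Algorithm 4.3, and the data set below is a hypothetical, linearly separable example with labels in {−1, +1}.

```python
import numpy as np

def train_perceptron(X, y, lam=0.1, max_epochs=100, gamma=0.0):
    """Perceptron learning: iterate the weight update of Equation 4.49 until the
    average number of discrepancies drops to the threshold gamma."""
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 for the bias term
    w_tilde = np.zeros(X_tilde.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X_tilde, y):
            y_hat = 1 if w_tilde @ xi > 0 else -1        # Equation 4.48
            if y_hat != yi:
                w_tilde += lam * (yi - y_hat) * xi       # Equation 4.49
                errors += 1
        if errors / len(y) <= gamma:                     # stopping condition
            break
    return w_tilde

# Hypothetical linearly separable data.
X = np.array([[2.0, 1.0], [1.5, 2.0], [3.0, 2.5], [-1.0, -0.5], [-2.0, -1.5], [-0.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w_tilde = train_perceptron(X, y)
print(np.sign(np.hstack([X, np.ones((6, 1))]) @ w_tilde))   # matches y for this toy data
```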
4.7.2Multi-layerNeuralNetwork
Amulti-layerneuralnetworkgeneralizesthebasicconceptofaperceptrontomorecomplexarchitecturesofnodesthatarecapableoflearningnonlineardecisionboundaries.Agenericarchitectureofamulti-layerneuralnetworkisshowninFigure4.23 wherethenodesarearrangedingroupscalledlayers.Theselayersarecommonlyorganizedintheformofachainsuchthateverylayeroperatesontheoutputsofitsprecedinglayer.Inthisway,thelayersrepresentdifferentlevelsofabstractionthatareappliedontheinputfeaturesinasequentialmanner.Thecompositionoftheseabstractionsgeneratesthefinaloutputatthelastlayer,whichisusedformakingpredictions.Inthefollowing,webrieflydescribethethreetypesoflayersusedinmulti-layerneuralnetworks.
Figure4.23.Exampleofamulti-layerartificialneuralnetwork(ANN).
The first layer of the network, called the input layer, is used for representing inputs from attributes. Every numerical or binary attribute is typically represented using a single node on this layer, while a categorical attribute is either represented using a different node for each categorical value, or by encoding the k-ary attribute using ⌈log₂ k⌉ input nodes. These inputs are fed into intermediary layers known as hidden layers, which are made up of processing units known as hidden nodes. Every hidden node operates on signals received from the input nodes or hidden nodes at the preceding layer, and produces an activation value that is transmitted to the next layer. The final layer is called the output layer and processes the activation values from its preceding layer to produce predictions of output variables. For binary classification, the output layer contains a single node representing the binary class label. In this architecture, since the signals are propagated only in the forward direction from the input layer to the output layer, they are also called feedforward neural networks.
Amajordifferencebetweenmulti-layerneuralnetworksandperceptronsistheinclusionofhiddenlayers,whichdramaticallyimprovestheirabilitytorepresentarbitrarilycomplexdecisionboundaries.Forexample,considertheXORproblemdescribedintheprevioussection.Theinstancescanbeclassifiedusingtwohyperplanesthatpartitiontheinputspaceintotheirrespectiveclasses,asshowninFigure4.24(a) .Becauseaperceptroncancreateonlyonehyperplane,itcannotfindtheoptimalsolution.However,thisproblemcanbeaddressedbyusingahiddenlayerconsistingoftwonodes,asshowninFigure4.24(b) .Intuitively,wecanthinkofeachhiddennodeasaperceptronthattriestoconstructoneofthetwohyperplanes,whiletheoutputnodesimplycombinestheresultsoftheperceptronstoyieldthedecisionboundaryshowninFigure4.24(a) .
Figure4.24.Atwo-layerneuralnetworkfortheXORproblem.
Thehiddennodescanbeviewedaslearninglatentrepresentationsorfeaturesthatareusefulfordistinguishingbetweentheclasses.Whilethefirsthiddenlayerdirectlyoperatesontheinputattributesandthuscapturessimplerfeatures,thesubsequenthiddenlayersareabletocombinethemand
constructmorecomplexfeatures.Fromthisperspective,multi-layerneuralnetworkslearnahierarchyoffeaturesatdifferentlevelsofabstractionthatarefinallycombinedattheoutputnodestomakepredictions.Further,therearecombinatoriallymanywayswecancombinethefeatureslearnedatthehiddenlayersofANN,makingthemhighlyexpressive.ThispropertychieflydistinguishesANNfromotherclassificationmodelssuchasdecisiontrees,whichcanlearnpartitionsintheattributespacebutareunabletocombinetheminexponentialways.
Figure 4.25. Schematic illustration of the parameters of an ANN model with (L − 1) hidden layers.

To understand the nature of computations happening at the hidden and output nodes of ANN, consider the ith node at the lth layer of the network (l > 0), where the layers are numbered from 0 (input layer) to L (output layer), as shown in Figure 4.25. The activation value generated at this node, a_i^l, can be represented as a function of the inputs received from nodes at the preceding layer. Let w_ij^l represent the weight of the connection from the jth node at layer (l − 1) to the ith node at layer l. Similarly, let us denote the bias term at this node as b_i^l. The activation value a_i^l can then be expressed as

a_i^l = f(z_i^l) = f(∑_j w_ij^l a_j^{l−1} + b_i^l),

where z is called the linear predictor and f(·) is the activation function that converts z to a. Further, note that, by definition, a_j^0 = x_j at the input layer and a^L = ŷ at the output node.

There are a number of alternate activation functions apart from the sign function that can be used in multi-layer neural networks. Some examples include linear, sigmoid (logistic), and hyperbolic tangent functions, as shown in Figure 4.26. These functions are able to produce real-valued and nonlinear activation values. Among these activation functions, the sigmoid σ(·) has been widely used in many ANN models, although the use of other types of activation functions in the context of deep learning will be discussed in Section 4.8. We can thus represent a_i^l as
a_i^l = σ(z_i^l) = 1/(1 + e^{−z_i^l}).   (4.50)

Figure 4.26. Types of activation functions used in multi-layer neural networks.

Learning Model Parameters
The weights and bias terms (w, b) of the ANN model are learned during training so that the predictions on training instances match the true labels. This is achieved by using a loss function

E(w, b) = ∑_{k=1}^n Loss(y_k, ŷ_k),   (4.51)

where y_k is the true label of the kth training instance and ŷ_k is equal to a^L, produced by using x_k. A typical choice of the loss function is the squared loss function:

Loss(y_k, ŷ_k) = (y_k − ŷ_k)².   (4.52)

Note that E(w, b) is a function of the model parameters (w, b) because the output activation value a^L depends on the weights and bias terms. We are interested in choosing (w, b) that minimizes the training loss E(w, b). Unfortunately, because of the use of hidden nodes with nonlinear activation functions, E(w, b) is not a convex function of w and b, which means that E(w, b) can have local minima that are not globally optimal. However, we can still apply standard optimization techniques such as the gradient descent method to arrive at a locally optimal solution. In particular, the weight parameter w_ij^l and the bias term b_i^l can be iteratively updated using the following equations:

w_ij^l ← w_ij^l − λ ∂E/∂w_ij^l,   (4.53)
b_i^l ← b_i^l − λ ∂E/∂b_i^l,   (4.54)

where λ is a hyper-parameter known as the learning rate. The intuition behind this equation is to move the weights in a direction that reduces the training loss. If we arrive at a minima using this procedure, the gradient of the training loss will be close to 0, eliminating the second term and resulting in the convergence of weights. The weights are commonly initialized with values drawn randomly from a Gaussian or a uniform distribution.

A necessary tool for updating weights in Equation 4.53 is to compute the partial derivative of E with respect to w_ij^l. This computation is nontrivial especially at hidden layers (l < L), since w_ij^l does not directly affect ŷ = a^L (and
hence the training loss), but has complex chains of influences via activation values at subsequent layers. To address this problem, a technique known as backpropagation was developed, which propagates the derivatives backward from the output layer to the hidden layers. This technique can be described as follows.

Recall that the training loss E is simply the sum of individual losses at training instances. Hence the partial derivative of E can be decomposed as a sum of partial derivatives of individual losses:

∂E/∂w_ij^l = ∑_{k=1}^n ∂Loss(y_k, ŷ_k)/∂w_ij^l.

To simplify discussions, we will consider only the derivatives of the loss at the kth training instance, which will be generically represented as Loss(y, a^L). By using the chain rule of differentiation, we can represent the partial derivatives of the loss with respect to w_ij^l as

∂Loss/∂w_ij^l = ∂Loss/∂a_i^l × ∂a_i^l/∂z_i^l × ∂z_i^l/∂w_ij^l.   (4.55)

The last term of the previous equation can be written as

∂z_i^l/∂w_ij^l = ∂(∑_j w_ij^l a_j^{l−1} + b_i^l)/∂w_ij^l = a_j^{l−1}.

Also, if we use the sigmoid activation function, then

∂a_i^l/∂z_i^l = ∂σ(z_i^l)/∂z_i^l = a_i^l (1 − a_i^l).

Equation 4.55 can thus be simplified as

∂Loss/∂w_ij^l = δ_i^l × a_i^l (1 − a_i^l) × a_j^{l−1},  where δ_i^l = ∂Loss/∂a_i^l.   (4.56)

A similar formula for the partial derivatives with respect to the bias terms b_i^l is given by

∂Loss/∂b_i^l = δ_i^l × a_i^l (1 − a_i^l).   (4.57)

Hence, to compute the partial derivatives, we only need to determine δ_i^l. Using a squared loss function, we can easily write δ^L at the output node as

δ^L = ∂Loss/∂a^L = ∂(y − a^L)²/∂a^L = 2(a^L − y).   (4.58)

However, the approach for computing δ_j^l at hidden nodes (l < L) is more involved. Notice that a_j^l affects the activation values a_i^{l+1} of all nodes at the next layer, which in turn influence the loss. Hence, again using the chain rule of differentiation, δ_j^l can be represented as

δ_j^l = ∂Loss/∂a_j^l = ∑_i (∂Loss/∂a_i^{l+1} × ∂a_i^{l+1}/∂a_j^l)
      = ∑_i (∂Loss/∂a_i^{l+1} × ∂a_i^{l+1}/∂z_i^{l+1} × ∂z_i^{l+1}/∂a_j^l)
      = ∑_i δ_i^{l+1} × a_i^{l+1}(1 − a_i^{l+1}) × w_ij^{l+1}.   (4.59)

The previous equation provides a concise representation of the δ values at layer l in terms of the δ values computed at layer l + 1. Hence, proceeding backward from the output layer L to the hidden layers, we can recursively apply Equation 4.59 to compute δ_i^l at every hidden node. δ_i^l can then be used in Equations 4.56 and 4.57 to compute the partial derivatives of the loss with respect to w_ij^l and b_i^l, respectively. Algorithm 4.4 summarizes the complete approach for learning the model parameters of ANN using backpropagation and the gradient descent method.

Algorithm 4.4 Learning ANN using backpropagation and gradient descent.
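To connect the formulas, the sketch below trains a tiny fully-connected network with one hidden layer on the XOR data using exactly the quantities derived above: sigmoid activations (Equation 4.50), squared loss (Equation 4.52), the delta recursion (Equations 4.58 and 4.59), and the gradient descent updates (Equations 4.53 and 4.54). The architecture (three hidden nodes) and learning rate are illustrative choices, not the book's; it is a minimal sketch rather than a faithful transcription of Algorithm 4.4.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR data (Figure 4.22); labels are 0/1 here so they match sigmoid outputs.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# One hidden layer with 3 nodes, one output node; small random initialization.
W1, b1 = rng.normal(0, 1, (3, 2)), np.zeros(3)   # W1[i, j]: weight from input j to hidden node i
W2, b2 = rng.normal(0, 1, (1, 3)), np.zeros(1)   # W2[0, j]: weight from hidden node j to output
lam = 0.5                                        # learning rate

for epoch in range(5000):
    for x, t in zip(X, y):
        # Forward pass (Equation 4.50 at every node).
        a1 = sigmoid(W1 @ x + b1)
        a2 = sigmoid(W2 @ a1 + b2)
        # Backward pass: delta at the output (Eq. 4.58), then at the hidden layer (Eq. 4.59).
        delta2 = 2 * (a2 - t)
        g2 = delta2 * a2 * (1 - a2)               # delta * a(1 - a), reused in Eqs. 4.56/4.57
        delta1 = (W2.T @ g2).ravel()
        g1 = delta1 * a1 * (1 - a1)
        # Gradient descent updates (Equations 4.53 and 4.54).
        W2 -= lam * np.outer(g2, a1); b2 -= lam * g2
        W1 -= lam * np.outer(g1, x);  b1 -= lam * g1

preds = [float(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)) for x in X]
print(np.round(preds, 2))   # typically close to [0, 1, 1, 0]; an unlucky initialization
                            # may settle in a local minimum, as noted in Section 4.7.3
```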
4.7.3 Characteristics of ANN

1. Multi-layer neural networks with at least one hidden layer are universal approximators; i.e., they can be used to approximate any target function. They are thus highly expressive and can be used to learn complex decision boundaries in diverse applications. ANN can also be used for multiclass classification and regression problems, by appropriately modifying the output layer. However, the high model complexity of classical ANN models makes them susceptible to overfitting, which can be overcome to some extent by using the deep learning techniques discussed in Section 4.8.3.

2. ANN provides a natural way to represent a hierarchy of features at multiple levels of abstraction. The outputs at the final hidden layer of the ANN model thus represent features at the highest level of abstraction that are most useful for classification. These features can also be used as inputs in other supervised classification models, e.g., by replacing the output node of the ANN by any generic classifier.

3. ANN represents complex high-level features as compositions of simpler lower-level features that are easier to learn. This gives ANN the ability to gradually increase the complexity of representations, by adding more hidden layers to the architecture. Further, since simpler features can be combined in combinatorial ways, the number of complex features learned by ANN is much larger than in traditional classification models. This is one of the main reasons behind the high expressive power of deep neural networks.

4. ANN can easily handle irrelevant attributes, by using zero weights for attributes that do not help in improving the training loss. Also, redundant attributes receive similar weights and do not degrade the quality of the classifier. However, if the number of irrelevant or redundant attributes is large, the learning of the ANN model may suffer from overfitting, leading to poor generalization performance.

5. Since the learning of an ANN model involves minimizing a non-convex function, the solutions obtained by gradient descent are not guaranteed to be globally optimal. For this reason, ANN has a tendency to get stuck in local minima, a challenge that can be addressed by using the deep learning techniques discussed in Section 4.8.4.

6. Training an ANN is a time-consuming process, especially when the number of hidden nodes is large. Nevertheless, test examples can be classified rapidly.

7. Just like logistic regression, ANN can learn in the presence of interacting variables, since the model parameters are jointly learned over all variables together. However, ANN cannot handle instances with missing values in the training or testing phase.
4.8 Deep Learning

As described above, the use of hidden layers in ANN is based on the general belief that complex high-level features can be constructed by combining simpler lower-level features. Typically, the greater the number of hidden layers, the deeper the hierarchy of features learned by the network. This motivates the learning of ANN models with long chains of hidden layers, known as deep neural networks. In contrast to "shallow" neural networks that involve only a small number of hidden layers, deep neural networks are able to represent features at multiple levels of abstraction and often require far fewer nodes per layer to achieve generalization performance similar to shallow networks.

Despite the huge potential in learning deep neural networks, it has remained challenging to learn ANN models with a large number of hidden layers using classical approaches. Apart from reasons related to limited computational resources and hardware architectures, there have been a number of algorithmic challenges in learning deep neural networks. First, learning a deep neural network with low training error has been a daunting task because of the saturation of sigmoid activation functions, resulting in slow convergence of gradient descent. This problem becomes even more serious as we move away from the output node to the hidden layers, because of the compounded effects of saturation at multiple layers, known as the vanishing gradient problem. For this reason, classical ANN models have suffered from slow and ineffective learning, leading to poor training and test performance. Second, the learning of deep neural networks is quite sensitive to the initial values of model parameters, chiefly because of the non-convex nature of the optimization function and the slow convergence of gradient descent. Third, deep neural networks with a large number of hidden layers have high model complexity, making them susceptible to overfitting. Hence, even if a deep neural network has been trained to show low training error, it can still suffer from poor generalization performance.

These challenges have deterred progress in building deep neural networks for several decades, and it is only recently that we have started to unlock their immense potential with the help of a number of advances being made in the area of deep learning. Although some of these advances have been around for some time, they have only gained mainstream attention in the last decade, with deep neural networks continually beating records in various competitions and solving problems that were too difficult for other classification approaches.

There are two factors that have played a major role in the emergence of deep learning techniques. First, the availability of larger labeled data sets, e.g., the ImageNet data set contains more than 10 million labeled images, has made it possible to learn more complex ANN models than ever before, without falling easily into the traps of model overfitting. Second, advances in computational abilities and hardware infrastructures, such as the use of graphics processing units (GPUs) for distributed computing, have greatly helped in experimenting with deep neural networks with larger architectures that would not have been feasible with traditional resources.

In addition to the previous two factors, there have been a number of algorithmic advancements to overcome the challenges faced by classical methods in learning deep neural networks. Some examples include the use of more responsive combinations of loss functions and activation functions, better initialization of model parameters, novel regularization techniques, more agile architecture designs, and better techniques for model learning and hyper-parameter selection. In the following, we describe some of the deep learning advances made to address the challenges in learning deep neural networks. Further details on recent developments in deep learning can be obtained from the Bibliographic Notes.
4.8.1 Using Synergistic Loss Functions

One of the major realizations leading to deep learning has been the importance of choosing appropriate combinations of activation and loss functions. Classical ANN models commonly made use of the sigmoid activation function at the output layer, because of its ability to produce real-valued outputs between 0 and 1, which was combined with a squared loss objective to perform gradient descent. It was soon noticed that this particular combination of activation and loss function resulted in the saturation of output activation values, which can be described as follows.

Saturation of Outputs

Although the sigmoid has been widely used as an activation function, it easily saturates at high and low values of inputs that are far away from 0. Observe from Figure 4.27(a) that σ(z) shows variance in its values only when z is close to 0. For this reason, ∂σ(z)/∂z is non-zero for only a small range of z around 0, as shown in Figure 4.27(b). Since ∂σ(z)/∂z is one of the components in the gradient of loss (see Equation 4.55), we get a diminishing gradient value when the activation values are far from 0.

Figure 4.27. Plots of sigmoid function and its derivative.

To illustrate the effect of saturation on the learning of model parameters at the output node, consider the partial derivative of loss with respect to the weight w_j^L at the output node. Using the squared loss function, we can write this as

∂Loss/∂w_j^L = 2(a^L − y) × σ(z^L)(1 − σ(z^L)) × a_j^{L−1}.   (4.60)

In the previous equation, notice that when z^L is highly negative, σ(z^L) (and hence the gradient) is close to 0. On the other hand, when z^L is highly positive, (1 − σ(z^L)) becomes close to 0, nullifying the value of the gradient. Hence, irrespective of whether the prediction a^L matches the true label y or not, the gradient of the loss with respect to the weights is close to 0 whenever z^L is highly positive or negative. This causes an unnecessarily slow convergence of the model parameters of the ANN model, often resulting in poor learning.

Note that it is the combination of the squared loss function and the sigmoid activation function at the output node that together results in diminishing gradients (and thus poor learning) upon saturation of outputs. It is thus important to choose a synergistic combination of loss function and activation function that does not suffer from the saturation of outputs.

Cross entropy loss function

The cross entropy loss function, which was described in the context of logistic regression in Section 4.6.2, can significantly avoid the problem of saturating outputs when used in combination with the sigmoid activation function. The cross entropy loss function of a real-valued prediction ŷ ∈ (0, 1) on a data instance with binary label y ∈ {0, 1} can be defined as

Loss(y, ŷ) = −y log(ŷ) − (1 − y) log(1 − ŷ),   (4.61)

where log represents the natural logarithm (to base e) and 0 log(0) = 0 for convenience. The cross entropy function has foundations in information theory and measures the amount of disagreement between y and ŷ. The partial derivative of this loss function with respect to ŷ = a^L can be given as

δ^L = ∂Loss/∂a^L = −y/a^L + (1 − y)/(1 − a^L) = (a^L − y)/(a^L(1 − a^L)).   (4.62)

Using this value of δ^L in Equation 4.56, we can obtain the partial derivative of the loss with respect to the weight w_j^L at the output node as

∂Loss/∂w_j^L = (a^L − y)/(a^L(1 − a^L)) × a^L(1 − a^L) × a_j^{L−1} = (a^L − y) × a_j^{L−1}.   (4.63)

Notice the simplicity of the previous formula using the cross entropy loss function. The partial derivatives of the loss with respect to the weights at the output node depend only on the difference between the prediction a^L and the true label y. In contrast to Equation 4.60, it does not involve terms such as σ(z^L)(1 − σ(z^L)) that can be impacted by saturation of z^L. Hence, the gradients are high whenever (a^L − y) is large, promoting effective learning of the model parameters at the output node. This has been a major breakthrough in the learning of modern ANN models, and it is now a common practice to use the cross entropy loss function with sigmoid activations at the output node.
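The contrast between Equations 4.60 and 4.63 can be checked numerically. The short sketch below (illustrative; the previous-layer activation is an arbitrary assumption) evaluates both output-node gradients for a confidently wrong prediction and shows that only the squared-loss gradient collapses as z^L saturates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Output-node gradients w.r.t. w_j^L for a single instance with true label y
# and previous-layer activation a_prev (Equations 4.60 and 4.63).
def grad_squared_loss(z_L, y, a_prev):
    a_L = sigmoid(z_L)
    return 2 * (a_L - y) * a_L * (1 - a_L) * a_prev

def grad_cross_entropy(z_L, y, a_prev):
    a_L = sigmoid(z_L)
    return (a_L - y) * a_prev

# A badly wrong prediction: y = 1 but z^L is highly negative (a^L close to 0).
y, a_prev = 1.0, 1.0
for z_L in [-10.0, -5.0, 0.0]:
    print(z_L, grad_squared_loss(z_L, y, a_prev), grad_cross_entropy(z_L, y, a_prev))
# The squared-loss gradient vanishes as z^L saturates, while the cross entropy
# gradient stays close to (a^L - y), i.e., about -1 for a confidently wrong output.
```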
4.8.2 Using Responsive Activation Functions

Even though the cross entropy loss function helps in overcoming the problem of saturating outputs, it still does not solve the problem of saturation at hidden layers, arising due to the use of sigmoid activation functions at hidden nodes. In fact, the effect of saturation on the learning of model parameters is even more aggravated at hidden layers, a problem known as the vanishing gradient problem. In the following, we describe the vanishing gradient problem and the use of a more responsive activation function, called the rectified linear unit (ReLU), to overcome this problem.

Vanishing Gradient Problem

The impact of saturating activation values on the learning of model parameters increases at deeper hidden layers that are farther away from the output node. Even if the activation in the output layer does not saturate, the repeated multiplications performed as we backpropagate the gradients from the output layer to the hidden layers may lead to decreasing gradients in the hidden layers. This is called the vanishing gradient problem, which has been one of the major hindrances in learning deep neural networks.

To illustrate the vanishing gradient problem, consider an ANN model that consists of a single node at every hidden layer of the network, as shown in Figure 4.28. This simplified architecture involves a single chain of hidden nodes, where a single weighted link w^l connects the node at layer l − 1 to the node at layer l. Using Equations 4.56 and 4.59, we can represent the partial derivative of the loss with respect to w^l as

∂Loss/∂w^l = δ^l × a^l(1 − a^l) × a^{l−1},   where δ^l = 2(a^L − y) × ∏_{r=l}^{L−1} (a^{r+1}(1 − a^{r+1}) × w^{r+1}).   (4.64)

Figure 4.28. An example of an ANN model with only one node at every hidden layer.

Notice that if any of the linear predictors z^{r+1} saturates at subsequent layers, then the term a^{r+1}(1 − a^{r+1}) becomes close to 0, thus diminishing the overall gradient. The saturation of activations thus gets compounded and has multiplicative effects on the gradients at hidden layers, making them highly unstable and thus unsuitable for use with gradient descent. Even though the previous discussion only pertains to the simplified architecture involving a single chain of hidden nodes, a similar argument can be made for any generic ANN architecture involving multiple chains of hidden nodes. Note that the vanishing gradient problem primarily arises because of the use of the sigmoid activation function at hidden nodes, which is known to easily saturate, especially after repeated multiplications.

Figure 4.29. Plot of the rectified linear unit (ReLU) activation function.

Rectified Linear Units (ReLU)

To overcome the vanishing gradient problem, it is important to use an activation function f(z) at the hidden nodes that provides a stable and significant value of the gradient whenever a hidden node is active, i.e., z > 0. This is achieved by using rectified linear units (ReLU) as activation functions at hidden nodes, which can be defined as

a = f(z) = { z, if z > 0; 0, otherwise.   (4.65)

The idea of ReLU has been inspired by biological neurons, which are either in an inactive state (f(z) = 0) or show an activation value proportional to the input. Figure 4.29 shows a plot of the ReLU function. We can see that it is linear with respect to z when z > 0. Hence, the gradient of the activation value with respect to z can be written as

∂a/∂z = { 1, if z > 0; 0, if z < 0.   (4.66)

Although f(z) is not differentiable at 0, it is common practice to use ∂a/∂z = 0 when z = 0. Since the gradient of the ReLU activation function is equal to 1 whenever z > 0, it avoids the problem of saturation at hidden nodes, even after repeated multiplications. Using ReLU, the partial derivatives of the loss with respect to the weight and bias parameters can be given by

∂Loss/∂w_ij^l = δ_i^l × I(z_i^l) × a_j^{l−1},   (4.67)

∂Loss/∂b_i^l = δ_i^l × I(z_i^l),   where δ_j^l = Σ_i (δ_i^{l+1} × I(z_i^{l+1}) × w_ij^{l+1}),   and I(z) = { 1, if z > 0; 0, otherwise.   (4.68)

Notice that ReLU shows a linear behavior in the activation values whenever a node is active, as compared to the nonlinear properties of the sigmoid function. This linearity promotes better flow of gradients during backpropagation, and thus simplifies the learning of ANN model parameters. ReLU is also highly responsive at large values of z away from 0, as opposed to the sigmoid activation function, making it more suitable for gradient descent. These differences give ReLU a major advantage over the sigmoid function. Indeed, ReLU is used as the preferred choice of activation function at hidden layers in most modern ANN models.
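The multiplicative effect in Equation 4.64 can be simulated directly. The following sketch is purely illustrative (the random weights and linear predictors are assumptions, not values from a trained network); it compares the chain product of local gradients for sigmoid and ReLU activations as the number of layers grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def chain_gradient_factor(num_layers, activation="sigmoid"):
    """Product term of Equation 4.64 for a single chain of hidden nodes."""
    z = rng.normal(scale=3.0, size=num_layers)   # linear predictors at each layer
    w = rng.normal(scale=1.0, size=num_layers)   # weights along the chain
    if activation == "sigmoid":
        a = 1.0 / (1.0 + np.exp(-z))
        local = a * (1.0 - a)                    # at most 0.25, tiny when |z| is large
    else:  # ReLU
        local = (z > 0).astype(float)            # 1 when the node is active, else 0
    return np.prod(local * w)

for depth in [2, 5, 10, 20]:
    print(depth,
          abs(chain_gradient_factor(depth, "sigmoid")),
          abs(chain_gradient_factor(depth, "relu")))
# The sigmoid product shrinks rapidly with depth (vanishing gradient), whereas the
# ReLU product is either 0 (an inactive path) or a product of the weights alone.
```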
4.8.3 Regularization

A major challenge in learning deep neural networks is the high model complexity of ANN models, which grows with the addition of hidden layers in the network. This can become a serious concern, especially when the training set is small, due to the phenomenon of model overfitting. To overcome this challenge, it is important to use techniques that can help in reducing the complexity of the learned model, known as regularization techniques. Classical approaches for learning ANN models did not have an effective way to promote regularization of the learned model parameters. Hence, they had often been sidelined by other classification methods, such as support vector machines (SVM), which have in-built regularization mechanisms. (SVMs will be discussed in more detail in Section 4.9.)
One of the major advancements in deep learning has been the development of novel regularization techniques for ANN models that are able to offer significant improvements in generalization performance. In the following, we discuss one of the regularization techniques for ANN, known as the dropout method, that has gained a lot of attention in several applications.

Dropout

The main objective of dropout is to avoid the learning of spurious features at hidden nodes, occurring due to model overfitting. It uses the basic intuition that spurious features often "co-adapt" themselves such that they show good training performance only when used in highly selective combinations. On the other hand, relevant features can be used in a diversity of feature combinations and hence are quite resilient to the removal or modification of other features. The dropout method uses this intuition to break complex "co-adaptations" in the learned features by randomly dropping input and hidden nodes in the network during training.

Dropout belongs to a family of regularization techniques that uses the criterion of resilience to random perturbations as a measure of the robustness (and hence, simplicity) of a model. For example, one approach to regularization is to inject noise in the input attributes of the training set and learn a model with the noisy training instances. If a feature learned from the training data is indeed generalizable, it should not be affected by the addition of noise. Dropout can be viewed as a similar regularization approach that perturbs the information content of the training set not only at the level of attributes but also at multiple levels of abstraction, by dropping input and hidden nodes.

The dropout method draws inspiration from the biological process of gene swapping in sexual reproduction, where half of the genes from both parents are combined together to create the genes of the offspring. This favors the selection of parent genes that are not only useful but can also inter-mingle with diverse combinations of genes coming from the other parent. On the other hand, co-adapted genes that function only in highly selective combinations are soon eliminated in the process of evolution. This idea is used in the dropout method for eliminating spurious co-adapted features. A simplified description of the dropout method is provided in the rest of this section.

Figure 4.30. Examples of sub-networks generated in the dropout method using γ = 0.5.

Let (w_k, b_k) represent the model parameters of the ANN model at the k-th iteration of the gradient descent method. At every iteration, we randomly select a fraction γ of input and hidden nodes to be dropped from the network, where γ ∈ (0, 1) is a hyper-parameter that is typically chosen to be 0.5. The weighted links and bias terms involving the dropped nodes are then eliminated, resulting in a "thinned" sub-network of smaller size. The model parameters of the sub-network (w_k^s, b_k^s) are then updated by computing activation values and performing backpropagation on this smaller sub-network. These updated values are then added back in the original network to obtain the updated model parameters, (w_{k+1}, b_{k+1}), to be used in the next iteration.

Figure 4.30 shows some examples of sub-networks that can be generated at different iterations of the dropout method, by randomly dropping input and hidden nodes. Since every sub-network has a different architecture, it is difficult to learn complex co-adaptations in the features that can result in overfitting. Instead, the features at the hidden nodes are learned to be more agile to random modifications in the network structure, thus improving their generalization ability. The model parameters are updated using a different random sub-network at every iteration, till the gradient descent method converges.

Let (w_{k_max}, b_{k_max}) denote the model parameters at the last iteration of the gradient descent method. These parameters are finally scaled down by a factor of (1 − γ), to produce the weights and bias terms of the final ANN model, as follows:

(w*, b*) = ((1 − γ) × w_{k_max}, (1 − γ) × b_{k_max}).

We can now use the complete neural network with model parameters (w*, b*) for testing. The dropout method has been shown to provide significant improvements in the generalization performance of ANN models in a number of applications. It is computationally cheap and can be applied in combination with any of the other deep learning techniques. It also has a number of similarities with a widely-used ensemble learning method known as bagging, which learns multiple models using random subsets of the training set, and then uses the average output of all the models to make predictions. (Bagging will be presented in more detail later in Section 4.10.4.) In a similar vein, it can be shown that the predictions of the final network learned using dropout approximate the average output of all possible 2^n sub-networks that can be formed using n nodes. This is one of the reasons behind the superior regularization abilities of dropout.
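A minimal sketch of the training-time and test-time forward passes under dropout is given below, assuming a single hidden layer with ReLU activations and a sigmoid output; the function names and the value of γ are illustrative. It shows the two ingredients described above: random thinning of the network during training, and scaling of the final parameters by (1 − γ) at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(num_nodes, gamma=0.5):
    """Randomly drop a fraction gamma of nodes (gamma = 0.5 is the typical choice)."""
    return (rng.random(num_nodes) >= gamma).astype(float)

def forward_train(x, W1, b1, W2, b2, gamma=0.5):
    """Forward pass on a 'thinned' sub-network: dropped input and hidden nodes
    contribute nothing, so their links are effectively removed for this iteration."""
    x = x * dropout_mask(x.shape[0], gamma)         # drop input nodes
    h = np.maximum(0.0, W1 @ x + b1)                # hidden layer (ReLU)
    h = h * dropout_mask(h.shape[0], gamma)         # drop hidden nodes
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))     # sigmoid output

def forward_test(x, W1, b1, W2, b2, gamma=0.5):
    """At test time the complete network is used, with the final weights and bias
    terms scaled down by (1 - gamma), as described in the text."""
    W1s, b1s = (1 - gamma) * W1, (1 - gamma) * b1
    W2s, b2s = (1 - gamma) * W2, (1 - gamma) * b2
    h = np.maximum(0.0, W1s @ x + b1s)
    return 1.0 / (1.0 + np.exp(-(W2s @ h + b2s)))
```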
4.8.4 Initialization of Model Parameters

Because of the non-convex nature of the loss function used by ANN models, it is possible to get stuck in locally optimal but globally inferior solutions. Hence, the initial choice of model parameter values plays a significant role in the learning of ANN by gradient descent. The impact of poor initialization is even more aggravated when the model is complex, the network architecture is deep, or the classification task is difficult. In such cases, it is often advisable to first learn a simpler model for the problem, e.g., using a single hidden layer, and then incrementally increase the complexity of the model, e.g., by adding more hidden layers. An alternate approach is to train the model for a simpler task and then use the learned model parameters as initial parameter choices in the learning of the original task. The process of initializing ANN model parameters before the actual training process is known as pretraining.

Pretraining helps in initializing the model to a suitable region in the parameter space that would otherwise be inaccessible by random initialization. Pretraining also reduces the variance in the model parameters by fixing the starting point of gradient descent, thus reducing the chances of overfitting due to multiple comparisons. The models learned by pretraining are thus more consistent and provide better generalization performance.

Supervised Pretraining

A common approach for pretraining is to incrementally train the ANN model in a layer-wise manner, by adding one hidden layer at a time. This approach, known as supervised pretraining, ensures that the parameters learned at every layer are obtained by solving a simpler problem, rather than learning all model parameters together. These parameter values thus provide a good choice for initializing the ANN model. The approach for supervised pretraining can be briefly described as follows.

We start the supervised pretraining process by considering a reduced ANN model with only a single hidden layer. By applying gradient descent on this simple model, we are able to learn the model parameters of the first hidden layer. At the next run, we add another hidden layer to the model and apply gradient descent to learn the parameters of the newly added hidden layer, while keeping the parameters of the first layer fixed. This procedure is recursively applied such that while learning the parameters of the l-th hidden layer, we consider a reduced model with only l hidden layers, whose first (l − 1) hidden layers are not updated on the l-th run but are instead fixed using pretrained values from previous runs. In this way, we are able to learn the model parameters of all (L − 1) hidden layers. These pretrained values are used to initialize the hidden layers of the final ANN model, which is fine-tuned by applying a final round of gradient descent over all the layers.

Unsupervised Pretraining

Supervised pretraining provides a powerful way to initialize model parameters, by gradually growing the model complexity from shallower to deeper networks. However, supervised pretraining requires a sufficient number of labeled training instances for effective initialization of the ANN model. An alternate pretraining approach is unsupervised pretraining, which initializes model parameters by using unlabeled instances that are often abundantly available. The basic idea of unsupervised pretraining is to initialize the ANN model in such a way that the learned features capture the latent structure in the unlabeled data.
Unsupervised pretraining relies on the assumption that learning the distribution of the input data can indirectly help in learning the classification model. It is most helpful when the number of labeled examples is small and the features for the supervised problem bear resemblance to the factors generating the input data. Unsupervised pretraining can be viewed as a different form of regularization, where the focus is not explicitly toward finding simpler features but instead toward finding features that can best explain the input data. Historically, unsupervised pretraining has played an important role in reviving the area of deep learning, by making it possible to train any generic deep neural network without requiring specialized architectures.

Figure 4.31. The basic architecture of a single-layer autoencoder.

Use of Autoencoders

One simple and commonly used approach for unsupervised pretraining is to use an unsupervised ANN model known as an autoencoder. The basic architecture of an autoencoder is shown in Figure 4.31. An autoencoder attempts to learn a reconstruction of the input data by mapping the attributes to latent features, and then re-projecting the latent features back to the original attribute space to create the reconstruction x̂. The latent features are represented using a hidden layer of nodes, while the input and output layers represent the attributes and contain the same number of nodes. During training, the goal is to learn an autoencoder model that provides the lowest reconstruction error, RE(x, x̂), on all input data instances. A typical choice of the reconstruction error is the squared loss function:

RE(x, x̂) = ‖x − x̂‖².

The model parameters of the autoencoder can be learned by using a similar gradient descent method as the one used for learning supervised ANN models for classification. The key difference is the use of the reconstruction error on all training instances as the training loss. Autoencoders that have multiple hidden layers are known as stacked autoencoders.

Autoencoders are able to capture complex representations of the input data by the use of hidden nodes. However, if the number of hidden nodes is large, it is possible for an autoencoder to learn the identity relationship, where the input x is just copied and returned as the output x̂, resulting in a trivial solution. For example, if we use as many hidden nodes as the number of attributes, then it is possible for every hidden node to copy an attribute and simply pass it along to an output node, without extracting any useful information. To avoid this problem, it is common practice to keep the number of hidden nodes smaller than the number of input attributes. This forces the autoencoder to learn a compact and useful encoding of the input data, similar to a dimensionality reduction technique. An alternate approach is to corrupt the input instances by adding random noise, and then learn the autoencoder to reconstruct the original instance from the noisy input. This approach is known as the denoising autoencoder, which offers strong regularization capabilities and is often used to learn complex features even in the presence of a large number of hidden nodes.

To use an autoencoder for unsupervised pretraining, we can follow a layer-wise approach similar to supervised pretraining. In particular, to pretrain the model parameters of the l-th hidden layer, we can construct a reduced ANN model with only l hidden layers and an output layer that contains the same number of nodes as the attributes and is used for reconstruction. The parameters of the l-th hidden layer of this network are then learned using a gradient descent method to minimize the reconstruction error. The use of unlabeled data can be viewed as providing hints to the learning of parameters at every layer that aid in generalization. The final model parameters of the ANN model are then learned by applying gradient descent over all the layers, using the initial values of parameters obtained from pretraining.
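The following sketch trains a single-layer autoencoder by gradient descent on the squared reconstruction error. The choice of a sigmoid hidden layer with a linear output layer, as well as the synthetic data and layer sizes, are assumptions made for illustration; the same layer-wise idea extends to stacked autoencoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, num_hidden, lr=0.01, epochs=2000):
    """Single-layer autoencoder minimizing RE(x, x_hat) = ||x - x_hat||^2,
    with a sigmoid hidden layer and a linear output (reconstruction) layer."""
    p = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(num_hidden, p)); b1 = np.zeros(num_hidden)
    W2 = rng.normal(scale=0.1, size=(p, num_hidden)); b2 = np.zeros(p)
    for _ in range(epochs):
        for x in X:
            h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # latent features (hidden layer)
            x_hat = W2 @ h + b2                        # reconstruction of the input
            d_out = 2.0 * (x_hat - x)                  # gradient of the squared error
            d_hid = (W2.T @ d_out) * h * (1.0 - h)     # backpropagated to hidden layer
            W2 -= lr * np.outer(d_out, h); b2 -= lr * d_out
            W1 -= lr * np.outer(d_hid, x); b1 -= lr * d_hid
    return W1, b1, W2, b2

# Fewer hidden nodes than attributes forces a compact encoding of the input.
X = rng.random((100, 10))
params = train_autoencoder(X, num_hidden=4)
```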
Hybrid Pretraining

Unsupervised pretraining can also be combined with supervised pretraining by using two output layers at every run of pretraining, one for reconstruction and the other for supervised classification. The parameters of the l-th hidden layer are then learned by jointly minimizing the losses on both output layers, usually weighted by a trade-off hyper-parameter α. Such a combined approach often shows better generalization performance than either of the approaches, since it provides a way to balance between the competing objectives of representing the input data and improving classification performance.
4.8.5 Characteristics of Deep Learning

Apart from the basic characteristics of ANN discussed in Section 4.7.3, the use of deep learning techniques provides the following additional characteristics:

1. An ANN model trained for some task can be easily re-used for a different task that involves the same attributes, by using pretraining strategies. For example, we can use the learned parameters of the original task as initial parameter choices for the target task. In this way, ANN promotes re-usability of learning, which can be quite useful when the target application has a smaller number of labeled training instances.

2. Deep learning techniques for regularization, such as the dropout method, help in reducing the model complexity of ANN and thus promote good generalization performance. The use of regularization techniques is especially useful in high-dimensional settings, where the number of training labels is small but the classification problem is inherently difficult.

3. The use of an autoencoder for pretraining can help eliminate irrelevant attributes that are not related to other attributes. Further, it can help reduce the impact of redundant attributes by representing them as copies of the same attribute.

4. Although the learning of an ANN model can succumb to finding inferior and locally optimal solutions, there are a number of deep learning techniques that have been proposed to ensure adequate learning of an ANN. Apart from the methods discussed in this section, some other techniques involve novel architecture designs, such as skip connections between the output layer and lower layers, which aid the easy flow of gradients during backpropagation.

5. A number of specialized ANN architectures have been designed to handle a variety of input data sets. Some examples include convolutional neural networks (CNNs) for two-dimensional gridded objects such as images, and recurrent neural networks (RNNs) for sequences. While CNNs have been extensively used in the area of computer vision, RNNs have found applications in processing speech and language.
4.9 Support Vector Machine (SVM)

A support vector machine (SVM) is a discriminative classification model that learns linear or nonlinear decision boundaries in the attribute space to separate the classes. Apart from maximizing the separability of the two classes, SVM offers strong regularization capabilities, i.e., it is able to control the complexity of the model in order to ensure good generalization performance. Due to its unique ability to innately regularize its learning, SVM is able to learn highly expressive models without suffering from overfitting. It has thus received considerable attention in the machine learning community and is commonly used in several practical applications, ranging from handwritten digit recognition to text categorization. SVM has strong roots in statistical learning theory and is based on the principle of structural risk minimization. Another unique aspect of SVM is that it represents the decision boundary using only a subset of the training examples that are most difficult to classify, known as the support vectors. Hence, it is a discriminative model that is impacted only by training instances near the boundary of the two classes, in contrast to learning the generative distribution of every class.

To illustrate the basic idea behind SVM, we first introduce the concept of the margin of a separating hyperplane and the rationale for choosing such a hyperplane with maximum margin. We then describe how a linear SVM can be trained to explicitly look for this type of hyperplane. We conclude by showing how the SVM methodology can be extended to learn nonlinear decision boundaries by using kernel functions.

4.9.1 Margin of a Separating Hyperplane

The generic equation of a separating hyperplane can be written as

w^T x + b = 0,

where x represents the attributes and (w, b) represent the parameters of the hyperplane. A data instance x_i can belong to either side of the hyperplane depending on the sign of (w^T x_i + b). For the purpose of binary classification, we are interested in finding a hyperplane that places instances of both classes on opposite sides of the hyperplane, thus resulting in a separation of the two classes. If there exists a hyperplane that can perfectly separate the classes in the data set, we say that the data set is linearly separable. Figure 4.32 shows an example of linearly separable data involving two classes, squares and circles. Note that there can be infinitely many hyperplanes that can separate the classes, two of which are shown in Figure 4.32 as lines B1 and B2. Even though every such hyperplane will have zero training error, they can provide different results on previously unseen instances. Which separating hyperplane should we thus finally choose to obtain the best generalization performance? Ideally, we would like to choose a simple hyperplane that is robust to small perturbations. This can be achieved by using the concept of the margin of a separating hyperplane, which can be briefly described as follows.

Figure 4.32. Margin of a hyperplane in a two-dimensional data set.

For every separating hyperplane B_i, let us associate a pair of parallel hyperplanes, b_i1 and b_i2, such that they touch the closest instances of both classes, respectively. For example, if we move B1 parallel to its direction, we can touch the first square using b_11 and the first circle using b_12. b_i1 and b_i2 are known as the margin hyperplanes of B_i, and the distance between them is known as the margin of the separating hyperplane B_i. From the diagram shown in Figure 4.32, notice that the margin for B1 is considerably larger than that for B2. In this example, B1 turns out to be the separating hyperplane with the maximum margin, known as the maximum margin hyperplane.

Rationale for Maximum Margin

Hyperplanes with large margins tend to have better generalization performance than those with small margins. Intuitively, if the margin is small, then any slight perturbation in the hyperplane or the training instances located at the boundary can have quite an impact on the classification performance. Small margin hyperplanes are thus more susceptible to overfitting, as they are barely able to separate the classes with very narrow room to allow perturbations. On the other hand, a hyperplane that is farther away from training instances of both classes has sufficient leeway to be robust to minor modifications in the data, and thus shows superior generalization performance.

The idea of choosing the maximum margin separating hyperplane also has strong foundations in statistical learning theory. It can be shown that the margin of such a hyperplane is inversely related to the VC-dimension of the classifier, which is a commonly used measure of the complexity of a model. As discussed in Section 3.4 of the last chapter, a simpler model should be preferred over a more complex model if they both show similar training performance. Hence, maximizing the margin results in the selection of a separating hyperplane with the lowest model complexity, which is expected to show better generalization performance.
4.9.2 Linear SVM

A linear SVM is a classifier that searches for a separating hyperplane with the largest margin, which is why it is often known as a maximal margin classifier. The basic idea of SVM can be described as follows.

Consider a binary classification problem consisting of n training instances, where every training instance x_i is associated with a binary label y_i ∈ {−1, 1}. Let w^T x + b = 0 be the equation of a separating hyperplane that separates the two classes by placing them on opposite sides. This means that

w^T x_i + b > 0 if y_i = 1,
w^T x_i + b < 0 if y_i = −1.

The distance of any point x from the hyperplane is then given by

D(x) = |w^T x + b| / ‖w‖,

where |·| denotes the absolute value and ‖·‖ denotes the length of a vector. Let the distance of the closest point from the hyperplane with y = 1 be k+ > 0. Similarly, let k− > 0 denote the distance of the closest point from class −1. This can be represented using the following constraints:

(w^T x_i + b)/‖w‖ ≥ k+   if y_i = 1,
(w^T x_i + b)/‖w‖ ≤ −k−   if y_i = −1.   (4.69)

The previous equations can be succinctly represented by using the product of y_i and (w^T x_i + b) as

y_i (w^T x_i + b) ≥ M‖w‖,   (4.70)

where M is a parameter related to the margin of the hyperplane, i.e., if k+ = k− = M, then margin = k+ + k− = 2M. In order to find the maximum margin hyperplane that adheres to the previous constraints, we can consider the following optimization problem:

max_{w,b} M   subject to   y_i (w^T x_i + b) ≥ M‖w‖.   (4.71)

To find the solution to the previous problem, note that if w and b satisfy the constraints of the previous problem, then any scaled version of w and b would satisfy them too. Hence, we can conveniently choose ‖w‖ = 1/M to simplify the right-hand side of the inequalities. Furthermore, maximizing M amounts to minimizing ‖w‖². Hence, the optimization problem of SVM is commonly represented in the following form:

min_{w,b} ‖w‖²/2   subject to   y_i (w^T x_i + b) ≥ 1.   (4.72)
Learning Model Parameters

Equation 4.72 represents a constrained optimization problem with linear inequalities. Since the objective function is convex and quadratic with respect to w, it is known as a quadratic programming problem (QPP), which can be solved using standard optimization techniques, as described in Appendix E. In the following, we present a brief sketch of the main ideas for learning the model parameters of SVM.

First, we rewrite the objective function in a form that takes into account the constraints imposed on its solutions. The new objective function is known as the Lagrangian primal problem, which can be represented as follows:

L_P = ½‖w‖² − Σ_{i=1}^{n} λ_i (y_i (w^T x_i + b) − 1),   (4.73)

where the parameters λ_i ≥ 0 correspond to the constraints and are called the Lagrange multipliers. Next, to minimize the Lagrangian, we take the derivatives of L_P with respect to w and b and set them equal to zero:

∂L_P/∂w = 0 ⇒ w = Σ_{i=1}^{n} λ_i y_i x_i,   (4.74)

∂L_P/∂b = 0 ⇒ Σ_{i=1}^{n} λ_i y_i = 0.   (4.75)

Note that using Equation 4.74, we can represent w completely in terms of the Lagrange multipliers. There is another relationship between (w, b) and λ_i that is derived from the Karush-Kuhn-Tucker (KKT) conditions, a commonly used technique for solving QPP. This relationship can be described as

λ_i [y_i (w^T x_i + b) − 1] = 0.   (4.76)

Equation 4.76 is known as the complementary slackness condition, which sheds light on a valuable property of SVM. It states that the Lagrange multiplier λ_i is strictly greater than 0 only when x_i satisfies the equation y_i (w · x_i + b) = 1, which means that x_i lies exactly on a margin hyperplane. However, if x_i is farther away from the margin hyperplanes such that y_i (w · x_i + b) > 1, then λ_i is necessarily 0. Hence, λ_i > 0 for only a small number of instances that are closest to the separating hyperplane, which are known as support vectors. Figure 4.33 shows the support vectors of a hyperplane as filled circles and squares. Further, if we look at Equation 4.74, we will observe that training instances with λ_i = 0 do not contribute to the weight parameter w. This suggests that w can be concisely represented only in terms of the support vectors in the training data, which are usually far fewer than the overall number of training instances. This ability to represent the decision function only in terms of the support vectors is what gives this classifier the name support vector machines.

Figure 4.33. Support vectors of a hyperplane shown as filled circles and squares.

Using Equations 4.74, 4.75, and 4.76 in Equation 4.73, we obtain the following optimization problem in terms of the Lagrange multipliers λ_i:

max_λ Σ_{i=1}^{n} λ_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j y_i y_j x_i^T x_j   subject to   Σ_{i=1}^{n} λ_i y_i = 0,   λ_i ≥ 0.   (4.77)

The previous optimization problem is called the dual optimization problem. Maximizing the dual problem with respect to λ_i is equivalent to minimizing the primal problem with respect to w and b. The key differences between the dual and primal problems are as follows:
1. Solving the dual problem helps us identify the support vectors in the data, which have non-zero values of λ_i. Further, the solution of the dual problem is influenced only by the support vectors that are closest to the decision boundary of SVM. This helps in summarizing the learning of SVM solely in terms of its support vectors, which are easier to manage computationally. Further, it represents a unique ability of SVM to be dependent only on the instances closest to the boundary, which are harder to classify, rather than the distribution of instances farther away from the boundary.

2. The objective of the dual problem involves only terms of the form x_i^T x_j, which are basically inner products in the attribute space. As we will see later in Section 4.9.4, this property will prove to be quite useful in learning nonlinear decision boundaries using SVM.

Because of these differences, it is useful to solve the dual optimization problem using any of the standard solvers for QPP. Having found an optimal solution for λ_i, we can use Equation 4.74 to solve for w. We can then use Equation 4.76 on the support vectors to solve for b as follows:

b = (1/n_S) Σ_{i∈S} (1 − y_i w^T x_i)/y_i,   (4.78)

where S represents the set of support vectors (S = {i | λ_i > 0}) and n_S is the number of support vectors. The maximum margin hyperplane can then be expressed as

f(x) = Σ_{i=1}^{n} λ_i y_i x_i^T x + b = 0.   (4.79)

Using this separating hyperplane, a test instance x can be assigned a class label using the sign of f(x).

Example 4.7. Consider the two-dimensional data set shown in Figure 4.34, which contains eight training instances. Using quadratic programming, we can solve the optimization problem stated in Equation 4.77 to obtain the Lagrange multiplier λ_i for each training instance. The Lagrange multipliers are depicted in the last column of the table. Notice that only the first two instances have non-zero Lagrange multipliers. These instances correspond to the support vectors for this data set.

Figure 4.34. Example of a linearly separable data set.

Let w = (w1, w2) and b denote the parameters of the decision boundary. Using Equation 4.74, we can solve for w1 and w2 in the following way:

w1 = Σ_i λ_i y_i x_i1 = 65.5261 × 1 × 0.3858 + 65.5261 × (−1) × 0.4871 = −6.64,
w2 = Σ_i λ_i y_i x_i2 = 65.5261 × 1 × 0.4687 + 65.5261 × (−1) × 0.611 = −9.32.

The bias term b can be computed using Equation 4.76 for each support vector:

b(1) = 1 − w · x_1 = 1 − (−6.64)(0.3858) − (−9.32)(0.4687) = 7.9300,
b(2) = −1 − w · x_2 = −1 − (−6.64)(0.4871) − (−9.32)(0.611) = 7.9289.

Averaging these values, we obtain b = 7.93. The decision boundary corresponding to these parameters is shown in Figure 4.34.
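The computations of Example 4.7 can be reproduced with a few lines of code, using the two support vectors, their labels, and the Lagrange multiplier reported in the example (the remaining six instances have λ_i = 0 and do not contribute):

```python
import numpy as np

# The two support vectors from Example 4.7, their labels, and the (identical)
# Lagrange multiplier reported for both of them.
x_sv = np.array([[0.3858, 0.4687],
                 [0.4871, 0.6110]])
y_sv = np.array([1.0, -1.0])
lam  = np.array([65.5261, 65.5261])

# Equation 4.74: w = sum_i lambda_i * y_i * x_i (only support vectors contribute).
w = (lam * y_sv) @ x_sv
# Equation 4.76 on each support vector: b = (1 - y_i * w.x_i) / y_i, then average.
b = np.mean((1.0 - y_sv * (x_sv @ w)) / y_sv)

print(w)   # approximately [-6.64, -9.32]
print(b)   # approximately 7.93

# Classify a test instance with the sign of f(x) = w.x + b (Equation 4.79).
def predict(x):
    return np.sign(w @ x + b)
```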
4.9.3 Soft-margin SVM

Figure 4.35 shows a data set that is similar to Figure 4.32, except it has two new examples, P and Q. Although the decision boundary B1 misclassifies the new examples, while B2 classifies them correctly, this does not mean that B2 is a better decision boundary than B1, because the new examples may correspond to noise in the training data. B1 should still be preferred over B2 because it has a wider margin, and thus, is less susceptible to overfitting. However, the SVM formulation presented in the previous section only constructs decision boundaries that are mistake-free.

Figure 4.35. Decision boundary of SVM for the non-separable case.

This section examines how the formulation of SVM can be modified to learn a separating hyperplane that is tolerant to a small number of training errors, using a method known as the soft-margin approach. More importantly, the method presented in this section allows SVM to learn linear hyperplanes even in situations where the classes are not linearly separable. To do this, the learning algorithm in SVM must consider the trade-off between the width of the margin and the number of training errors committed by the linear hyperplane.

To introduce the concept of training errors in the SVM formulation, let us relax the inequality constraints to accommodate some violations on a small number of training instances. This can be done by introducing a slack variable ξ_i ≥ 0 for every training instance x_i as follows:

y_i (w^T x_i + b) ≥ 1 − ξ_i.   (4.80)

The variable ξ_i allows for some slack in the inequalities of the SVM such that every instance x_i does not need to strictly satisfy y_i (w^T x_i + b) ≥ 1. Further, ξ_i is non-zero only if the margin hyperplanes are not able to place x_i on the same side as the rest of the instances belonging to class y_i. To illustrate this, Figure 4.36 shows a circle P that falls on the opposite side of the separating hyperplane as the rest of the circles, and thus satisfies w^T x + b = −1 + ξ. The distance between P and the margin hyperplane w^T x + b = −1 is equal to ξ/‖w‖. Hence, ξ_i provides a measure of the error of SVM in representing x_i using soft inequality constraints.

Figure 4.36. Slack variables used in soft-margin SVM.
In the presence of slack variables, it is important to learn a separating hyperplane that jointly maximizes the margin (ensuring good generalization performance) and minimizes the values of the slack variables (ensuring low training error). This can be achieved by modifying the optimization problem of SVM as follows:

min_{w,b,ξ_i} ‖w‖²/2 + C Σ_{i=1}^{n} ξ_i   subject to   y_i (w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   (4.81)

where C is a hyper-parameter that makes a trade-off between maximizing the margin and minimizing the training error. A large value of C places more emphasis on minimizing the training error than on maximizing the margin. Notice the similarity of the previous equation with the generic formula of generalization error rate introduced in Section 3.4 of the previous chapter. Indeed, SVM provides a natural way to balance between model complexity and training error in order to maximize generalization performance.

To solve Equation 4.81, we apply the Lagrange multiplier method and convert the primal problem to its corresponding dual problem, similar to the approach described in the previous section. The Lagrangian primal problem of Equation 4.81 can be written as follows:

L_P = ½‖w‖² + C Σ_{i=1}^{n} ξ_i − Σ_{i=1}^{n} λ_i (y_i (w^T x_i + b) − 1 + ξ_i) − Σ_{i=1}^{n} μ_i ξ_i,   (4.82)

where λ_i ≥ 0 and μ_i ≥ 0 are the Lagrange multipliers corresponding to the inequality constraints of Equation 4.81. Setting the derivatives of L_P with respect to w, b, and ξ_i equal to 0, we obtain the following equations:

∂L_P/∂w = 0 ⇒ w = Σ_{i=1}^{n} λ_i y_i x_i,   (4.83)

∂L_P/∂b = 0 ⇒ Σ_{i=1}^{n} λ_i y_i = 0,   (4.84)

∂L_P/∂ξ_i = 0 ⇒ λ_i + μ_i = C.   (4.85)

We can also obtain the complementary slackness conditions by using the following KKT conditions:

λ_i (y_i (w^T x_i + b) − 1 + ξ_i) = 0,   (4.86)

μ_i ξ_i = 0.   (4.87)

Equation 4.86 suggests that λ_i is zero for all training instances except those that reside on the margin hyperplanes w^T x_i + b = ±1, or have ξ_i > 0. These instances with λ_i > 0 are known as support vectors. On the other hand, μ_i given in Equation 4.87 is zero for any training instance that is misclassified, i.e., ξ_i > 0. Further, λ_i and μ_i are related with each other by Equation 4.85. This results in the following three configurations of (λ_i, μ_i):

1. If λ_i = 0 and μ_i = C, then x_i does not reside on the margin hyperplanes and is correctly classified on the same side as other instances belonging to class y_i.

2. If λ_i = C and μ_i = 0, then x_i is misclassified and has a non-zero slack variable ξ_i.

3. If 0 < λ_i < C and 0 < μ_i < C, then x_i resides on one of the margin hyperplanes.

Substituting Equations 4.83 to 4.87 into Equation 4.82, we obtain the following dual optimization problem:

max_λ Σ_{i=1}^{n} λ_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j y_i y_j x_i^T x_j   subject to   Σ_{i=1}^{n} λ_i y_i = 0,   0 ≤ λ_i ≤ C.   (4.88)

Notice that the previous problem looks almost identical to the dual problem of SVM for the linearly separable case (Equation 4.77), except that λ_i is required to not only be greater than 0 but also smaller than a constant value C. Clearly, when C reaches infinity, the previous optimization problem becomes equivalent to Equation 4.77, where the learned hyperplane perfectly separates the classes (with no training errors). However, by capping the values of λ_i at C, the learned hyperplane is able to tolerate a few training errors that have ξ_i > 0.

As before, Equation 4.88 can be solved by using any of the standard solvers for QPP, and the optimal value of w can be obtained by using Equation 4.83. To solve for b, we can use Equation 4.86 on the support vectors that reside on the margin hyperplanes as follows:

b = (1/n_S) Σ_{i∈S} (1 − y_i w^T x_i)/y_i,   (4.89)

where S represents the set of support vectors residing on the margin hyperplanes (S = {i | 0 < λ_i < C}) and n_S is the number of elements in S.
SVM as a Regularizer of Hinge Loss

SVM belongs to a broad class of regularization techniques that use a loss function to represent the training errors and a norm of the model parameters to represent the model complexity. To realize this, notice that the slack variable ξ, used for measuring the training errors in SVM, is equivalent to the hinge loss function, which can be defined as follows:

Loss(y, ŷ) = max(0, 1 − yŷ),

where y ∈ {+1, −1}. In the case of SVM, ŷ corresponds to w^T x + b. Figure 4.37 shows a plot of the hinge loss function as we vary yŷ. We can see that the hinge loss is equal to 0 as long as y and ŷ have the same sign and |ŷ| ≥ 1. However, the hinge loss grows linearly with |ŷ| whenever y and ŷ are of opposite sign or |ŷ| < 1. This is similar to the notion of the slack variable, which is used to measure the distance of a point from its margin hyperplane. Hence, the optimization problem of SVM can be represented in the following equivalent form:

min_{w,b} ‖w‖²/2 + C Σ_{i=1}^{n} Loss(y_i, w^T x_i + b).   (4.90)

Figure 4.37. Hinge loss as a function of yŷ.

Note that using the hinge loss ensures that the optimization problem is convex and can be solved using standard optimization techniques. However, if we use a different loss function, such as the squared loss function that was introduced in Section 4.7 on ANN, it will result in a different optimization problem that may or may not remain convex. Nevertheless, different loss functions can be explored to capture varying notions of training error, depending on the characteristics of the problem.

Another interesting property of SVM that relates it to a broader class of regularization techniques is the concept of a margin. Although minimizing ‖w‖² has the geometric interpretation of maximizing the margin of a separating hyperplane, it is essentially the squared L2 norm of the model parameters, ‖w‖²₂. In general, the Lq norm of w, ‖w‖_q, is equal to the Minkowski distance of order q from w to the origin, i.e.,

‖w‖_q = (Σ_{i=1}^{p} |w_i|^q)^{1/q}.

Minimizing the Lq norm of w to achieve lower model complexity is a generic regularization concept that has several interpretations. For example, minimizing the L2 norm amounts to finding a solution on a hypersphere of smallest radius that shows suitable training performance. To visualize this in two dimensions, Figure 4.38(a) shows the plot of a circle with constant radius r, where every point has the same L2 norm. On the other hand, using the L1 norm ensures that the solution lies on the surface of a diamond-shaped region (a cross-polytope) of smallest size, with vertices along the axes. This is illustrated in Figure 4.38(b) as a square with vertices on the axes at a distance of r from the origin. The L1 norm is commonly used as a regularizer to obtain sparse model parameters with only a small number of non-zero parameter values, such as the use of Lasso in regression problems (see Bibliographic Notes).

Figure 4.38. Plots showing the behavior of two-dimensional solutions with constant L2 and L1 norms.

In general, depending on the characteristics of the problem, different combinations of Lq norms and training loss functions can be used for learning the model parameters, each requiring a different optimization solver. This forms the backbone of a wide range of modeling techniques that attempt to improve the generalization performance by jointly minimizing training error and model complexity. However, in this section, we focus only on the squared L2 norm and the hinge loss function, resulting in the classical formulation of SVM.
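Equation 4.90 suggests a simple way to train a linear soft-margin SVM without a QPP solver: minimize the regularized hinge loss directly by (sub)gradient descent. The sketch below illustrates this equivalent formulation rather than the dual approach described above; the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.001, epochs=1000):
    """Minimize ||w||^2/2 + C * sum_i max(0, 1 - y_i(w.x_i + b)) (Equation 4.90)
    by (sub)gradient descent; labels must be in {-1, +1}."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                    # instances with non-zero hinge loss
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    return np.sign(X @ w + b)

# A large C emphasizes low training error; a small C emphasizes a wide margin.
```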
4.9.4 Nonlinear SVM
The SVM formulations described in the previous sections construct a linear decision boundary to separate the training examples into their respective classes. This section presents a methodology for applying SVM to data sets that have nonlinear decision boundaries. The basic idea is to transform the data from its original attribute space x into a new space φ(x) so that a linear hyperplane can be used to separate the instances in the transformed space, using the SVM approach. The learned hyperplane can then be projected back to the original attribute space, resulting in a nonlinear decision boundary.

Figure 4.39. Classifying data with a nonlinear decision boundary.

Attribute Transformation

To illustrate how attribute transformation can lead to a linear decision boundary, Figure 4.39(a) shows an example of a two-dimensional data set consisting of squares (classified as y = 1) and circles (classified as y = −1). The data set is generated in such a way that all the circles are clustered near the center of the diagram and all the squares are distributed farther away from the center. Instances of the data set can be classified using the following equation:

y = { 1, if √((x1 − 0.5)² + (x2 − 0.5)²) > 0.2; −1, otherwise.   (4.91)

The decision boundary for the data can therefore be written as follows:

√((x1 − 0.5)² + (x2 − 0.5)²) = 0.2,

which can be further simplified into the following quadratic equation:

x1² − x1 + x2² − x2 = −0.46.

A nonlinear transformation φ is needed to map the data from its original attribute space into a new space such that a linear hyperplane can separate the classes. This can be achieved by using the following simple transformation:

φ: (x1, x2) → (x1² − x1, x2² − x2).   (4.92)

Figure 4.39(b) shows the points in the transformed space, where we can see that all the circles are located in the lower left-hand side of the diagram. A linear hyperplane with parameters w and b can therefore be constructed in the transformed space, to separate the instances into their respective classes.

One may think that because the nonlinear transformation possibly increases the dimensionality of the input space, this approach can suffer from the curse of dimensionality that is often associated with high-dimensional data. However, as we will see in the following section, nonlinear SVM is able to avoid this problem by using kernel functions.
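The effect of the transformation in Equation 4.92 can be verified numerically: after mapping the attributes, the quadratic boundary x1² − x1 + x2² − x2 = −0.46 becomes a linear hyperplane with w = (1, 1) and b = 0.46. The sampled data below is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points in the original attribute space and label them with Equation 4.91.
X = rng.random((500, 2))
y = np.where(np.sqrt((X[:, 0] - 0.5)**2 + (X[:, 1] - 0.5)**2) > 0.2, 1, -1)

# The transformation of Equation 4.92: (x1, x2) -> (x1^2 - x1, x2^2 - x2).
Phi = np.column_stack([X[:, 0]**2 - X[:, 0], X[:, 1]**2 - X[:, 1]])

# In the transformed space the classes are separated by the linear hyperplane
# phi_1 + phi_2 = -0.46, i.e., w = (1, 1) and b = 0.46.
pred = np.where(Phi @ np.array([1.0, 1.0]) + 0.46 > 0, 1, -1)
print((pred == y).mean())   # 1.0: the linear boundary separates the classes exactly
```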
Learning a Nonlinear SVM Model

Using a suitable function φ(·), we can transform any data instance x to φ(x). (The details on how to choose φ(·) will become clear later.) The linear hyperplane in the transformed space can be expressed as w^T φ(x) + b = 0. To learn the optimal separating hyperplane, we can substitute φ(x) for x in the formulation of SVM to obtain the following optimization problem:

min_{w,b,ξ_i} ‖w‖²/2 + C Σ_{i=1}^{n} ξ_i   subject to   y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,   ξ_i ≥ 0.   (4.93)

Using Lagrange multipliers λ_i, this can be converted into a dual optimization problem:

max_λ Σ_{i=1}^{n} λ_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j y_i y_j ⟨φ(x_i), φ(x_j)⟩   subject to   Σ_{i=1}^{n} λ_i y_i = 0,   0 ≤ λ_i ≤ C,   (4.94)

where ⟨a, b⟩ denotes the inner product between vectors a and b. Also, the equation of the hyperplane in the transformed space can be represented using λ_i as follows:

Σ_{i=1}^{n} λ_i y_i ⟨φ(x_i), φ(x)⟩ + b = 0.   (4.95)

Further, b is given by

b = (1/n_S) Σ_{i∈S} (1 − y_i Σ_{j=1}^{n} λ_j y_j ⟨φ(x_j), φ(x_i)⟩)/y_i,   (4.96)

where S is the set of support vectors residing on the margin hyperplanes (S = {i | 0 < λ_i < C}) and n_S is the number of elements in S.

Note that in order to solve the dual optimization problem in Equation 4.94, or to use the learned model parameters to make predictions using Equations 4.95 and 4.96, we need only inner products of φ(x). Hence, even though φ(x) may be nonlinear and high-dimensional, it suffices to use a function of the inner products of φ(x) in the transformed space. This can be achieved by using a kernel trick, which can be described as follows.

The inner product between two vectors is often regarded as a measure of similarity between the vectors. For example, the cosine similarity described in Section 2.4.5 on page 79 can be defined as the dot product between two vectors that are normalized to unit length. Analogously, the inner product ⟨φ(x_i), φ(x_j)⟩ can also be regarded as a measure of similarity between two instances, x_i and x_j, in the transformed space. The kernel trick is a method for computing this similarity as a function of the original attributes. Specifically, the kernel function K(u, v) between two instances u and v can be defined as follows:

K(u, v) = ⟨φ(u), φ(v)⟩ = f(u, v),   (4.97)

where f(·) is a function that follows certain conditions as stated by Mercer's Theorem. Although the details of this theorem are outside the scope of the book, we provide a list of some of the commonly used kernel functions:

Polynomial kernel:   K(u, v) = (u^T v + 1)^p   (4.98)
Radial Basis Function kernel:   K(u, v) = e^{−‖u−v‖²/(2σ²)}   (4.99)
Sigmoid kernel:   K(u, v) = tanh(k u^T v − δ)   (4.100)

By using a kernel function, we can directly work with inner products in the transformed space without dealing with the exact forms of the nonlinear transformation function φ. Specifically, this allows us to use high-dimensional transformations (sometimes even involving infinitely many dimensions), while performing calculations only in the original attribute space. Computing the inner products using kernel functions is also considerably cheaper than using the transformed attribute set φ(x). Hence, the use of kernel functions provides a significant advantage in representing nonlinear decision boundaries, without suffering from the curse of dimensionality. This has been one of the major reasons behind the widespread usage of SVM in highly complex and nonlinear problems.

Figure 4.40. Decision boundary produced by a nonlinear SVM with polynomial kernel.

Figure 4.40 shows the nonlinear decision boundary obtained by SVM using the polynomial kernel function given in Equation 4.98. We can see that the learned decision boundary is quite close to the true decision boundary shown in Figure 4.39(a). Although the choice of kernel function depends on the characteristics of the input data, a commonly used kernel function is the radial basis function (RBF) kernel, which involves a single hyper-parameter σ, known as the standard deviation of the RBF kernel.
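The kernel functions of Equations 4.98–4.100 and the kernelized decision function of Equation 4.95 translate directly into code. The sketch below assumes the Lagrange multipliers and the bias b have already been obtained from a dual solver, which is not shown.

```python
import numpy as np

# The three kernel functions listed in Equations 4.98-4.100.
def polynomial_kernel(u, v, p=2):
    return (u @ v + 1.0) ** p

def rbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(u, v, k=1.0, delta=0.0):
    return np.tanh(k * (u @ v) - delta)

def decision_function(x, X_sv, y_sv, lam, b, kernel=rbf_kernel):
    """Kernelized form of Equation 4.95: f(x) = sum_i lambda_i y_i K(x_i, x) + b.
    Only the support vectors (instances with lambda_i > 0) need to be stored."""
    return sum(l * yi * kernel(xi, x) for l, yi, xi in zip(lam, y_sv, X_sv)) + b
```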
4.9.5 Characteristics of SVM

1. The SVM learning problem can be formulated as a convex optimization problem, in which efficient algorithms are available to find the global minimum of the objective function. Other classification methods, such as rule-based classifiers and artificial neural networks, employ a greedy strategy to search the hypothesis space. Such methods tend to find only locally optimum solutions.

2. SVM provides an effective way of regularizing the model parameters by maximizing the margin of the decision boundary. Furthermore, it is able to create a balance between model complexity and training errors by using a hyper-parameter C. This trade-off is generic to a broader class of model learning techniques that capture the model complexity and the training loss using different formulations.

3. Linear SVM can handle irrelevant attributes by learning zero weights corresponding to such attributes. It can also handle redundant attributes by learning similar weights for the duplicate attributes. Furthermore, the ability of SVM to regularize its learning makes it more robust to the presence of a large number of irrelevant and redundant attributes than other classifiers, even in high-dimensional settings. For this reason, nonlinear SVMs are less impacted by irrelevant and redundant attributes than other highly expressive classifiers that can learn nonlinear decision boundaries, such as decision trees.

To compare the effect of irrelevant attributes on the performance of nonlinear SVMs and decision trees, consider the two-dimensional data set shown in Figure 4.41(a) containing 500 + instances and 500 o instances, where the two classes can be easily separated using a nonlinear decision boundary. We incrementally add irrelevant attributes to this data set and compare the performance of two classifiers: decision tree and nonlinear SVM (using the radial basis function kernel), using 70% of the data for training and the rest for testing (a rough sketch of such a comparison is given after this list). Figure 4.41(b) shows the test error rates of the two classifiers as we increase the number of irrelevant attributes. We can see that the test error rate of decision trees swiftly reaches 0.5 (same as random guessing) in the presence of even a small number of irrelevant attributes. This can be attributed to the problem of multiple comparisons while choosing splitting attributes at internal nodes, as discussed in Example 3.7 of the previous chapter. On the other hand, nonlinear SVM shows a more robust and steady performance even after adding a moderately large number of irrelevant attributes. Its test error rate rises only gradually and eventually reaches close to 0.5 after adding 125 irrelevant attributes, at which point it becomes difficult to discern the discriminative information in the original two attributes from the noise in the remaining attributes for learning nonlinear decision boundaries.

Figure 4.41. Comparing the effect of adding irrelevant attributes on the performance of nonlinear SVMs and decision trees.

4. SVM can be applied to categorical data by introducing dummy variables for each categorical attribute value present in the data. For example, if an attribute has three values {Single, Married, Divorced}, we can introduce a binary variable for each of the attribute values.

5. The SVM formulation presented in this chapter is for binary class problems. However, multiclass extensions of SVM have also been proposed.

6. Although the training time of an SVM model can be large, the learned parameters can be succinctly represented with the help of a small number of support vectors, making the classification of test instances quite fast.
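An experiment of this flavor can be reproduced with standard library implementations. The sketch below is an illustrative approximation of the setup rather than the book's exact experiment: the data generator, sample sizes, and noise attributes are assumptions, and scikit-learn's DecisionTreeClassifier and SVC are used as the two classifiers.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two informative attributes with a nonlinear (circular) class boundary.
X = rng.random((1000, 2))
y = np.where((X[:, 0] - 0.5) ** 2 + (X[:, 1] - 0.5) ** 2 > 0.1, 1, -1)

for num_irrelevant in [0, 10, 50, 100]:
    noise = rng.random((X.shape[0], num_irrelevant))        # irrelevant attributes
    X_aug = np.hstack([X, noise])
    X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, train_size=0.7, random_state=0)
    tree_err = 1 - DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    svm_err = 1 - SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
    print(num_irrelevant, round(tree_err, 3), round(svm_err, 3))
```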
4.10 Ensemble Methods

This section presents techniques for improving classification accuracy by aggregating the predictions of multiple classifiers. These techniques are known as ensemble or classifier combination methods. An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier. This section explains why ensemble methods tend to perform better than any single classifier and presents techniques for constructing the classifier ensemble.

4.10.1 Rationale for Ensemble Method

The following example illustrates how an ensemble method can improve a classifier's performance.

Example 4.8. Consider an ensemble of 25 binary classifiers, each of which has an error rate of ε = 0.35. The ensemble classifier predicts the class label of a test example by taking a majority vote on the predictions made by the base classifiers. If the base classifiers are identical, then all the base classifiers will commit the same mistakes. Thus, the error rate of the ensemble remains 0.35. On the other hand, if the base classifiers are independent—i.e., their errors are uncorrelated—then the ensemble makes a wrong prediction only if more than half of the base classifiers predict incorrectly. In this case, the error rate of the ensemble classifier is

e_ensemble = Σ_{i=13}^{25} (25 choose i) ε^i (1 − ε)^{25−i} = 0.06,   (4.101)

which is considerably lower than the error rate of the base classifiers.

Figure 4.42. Comparison between errors of base classifiers and errors of the ensemble classifier.

Figure 4.42 shows the error rate of an ensemble of 25 binary classifiers (e_ensemble) for different base classifier error rates (ε). The diagonal line represents the case in which the base classifiers are identical, while the solid line represents the case in which the base classifiers are independent. Observe that the ensemble classifier performs worse than the base classifiers when ε is larger than 0.5.

The preceding example illustrates two necessary conditions for an ensemble classifier to perform better than a single classifier: (1) the base classifiers should be independent of each other, and (2) the base classifiers should do better than a classifier that performs random guessing. In practice, it is difficult to ensure total independence among the base classifiers. Nevertheless, improvements in classification accuracies have been observed in ensemble methods in which the base classifiers are somewhat correlated.
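Equation 4.101 is just a binomial tail probability and can be evaluated directly, as the following short sketch shows; it also confirms that the ensemble does worse than its base classifiers once ε exceeds 0.5.

```python
from math import comb

def ensemble_error(eps, n_classifiers=25):
    """Error rate of a majority vote over independent base classifiers (Equation 4.101):
    the ensemble errs when more than half of the base classifiers are wrong."""
    k = n_classifiers // 2 + 1          # 13 when there are 25 base classifiers
    return sum(comb(n_classifiers, i) * eps**i * (1 - eps)**(n_classifiers - i)
               for i in range(k, n_classifiers + 1))

print(round(ensemble_error(0.35), 2))   # 0.06, as in Example 4.8
print(round(ensemble_error(0.55), 2))   # worse than the base error once eps > 0.5
```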
4.10.2 Methods for Constructing an Ensemble Classifier

Figure 4.43. A logical view of the ensemble learning method.

A logical view of the ensemble method is presented in Figure 4.43. The basic idea is to construct multiple classifiers from the original data and then aggregate their predictions when classifying unknown examples. The ensemble of classifiers can be constructed in many ways:
1. Bymanipulatingthetrainingset.Inthisapproach,multipletrainingsetsarecreatedbyresamplingtheoriginaldataaccordingtosomesamplingdistributionandconstructingaclassifierfromeachtrainingset.Thesamplingdistributiondetermineshowlikelyitisthatanexamplewillbeselectedfortraining,anditmayvaryfromonetrialtoanother.Baggingandboostingaretwoexamplesofensemblemethodsthatmanipulatetheirtrainingsets.ThesemethodsaredescribedinfurtherdetailinSections4.10.4 and4.10.5 .
Figure4.42.Comparisonbetweenerrorsofbaseclassifiersanderrorsoftheensembleclassifier.
Figure4.43.Alogicalviewoftheensemblelearningmethod.
2. Bymanipulatingtheinputfeatures.Inthisapproach,asubsetofinputfeaturesischosentoformeachtrainingset.Thesubsetcanbeeitherchosenrandomlyorbasedontherecommendationofdomainexperts.Somestudieshaveshownthatthisapproachworksverywellwithdatasetsthatcontainhighlyredundantfeatures.Randomforest,whichisdescribedinSection4.10.6 ,isanensemblemethodthatmanipulatesitsinputfeaturesandusesdecisiontreesasitsbaseclassifiers.
3. Bymanipulatingtheclasslabels.Thismethodcanbeusedwhenthenumberofclassesissufficientlylarge.Thetrainingdataistransformedintoabinaryclassproblembyrandomlypartitioningtheclasslabelsintotwodisjointsubsets, and .TrainingexampleswhoseclassA0 A1
labelbelongstothesubset areassignedtoclass0,whilethosethatbelongtothesubset areassignedtoclass1.Therelabeledexamplesarethenusedtotrainabaseclassifier.Byrepeatingthisprocessmultipletimes,anensembleofbaseclassifiersisobtained.Whenatestexampleispresented,eachbaseclassifier isusedtopredictitsclasslabel.Ifthetestexampleispredictedasclass0,thenalltheclassesthatbelongto willreceiveavote.Conversely,ifitispredictedtobeclass1,thenalltheclassesthatbelongto willreceiveavote.Thevotesaretalliedandtheclassthatreceivesthehighestvoteisassignedtothetestexample.Anexampleofthisapproachistheerror-correctingoutputcodingmethoddescribedonpage331.
4. Bymanipulatingthelearningalgorithm.Manylearningalgorithmscanbemanipulatedinsuchawaythatapplyingthealgorithmseveraltimesonthesametrainingdatawillresultintheconstructionofdifferentclassifiers.Forexample,anartificialneuralnetworkcanchangeitsnetworktopologyortheinitialweightsofthelinksbetweenneurons.Similarly,anensembleofdecisiontreescanbeconstructedbyinjectingrandomnessintothetree-growingprocedure.Forexample,insteadofchoosingthebestsplittingattributeateachnode,wecanrandomlychooseoneofthetopkattributesforsplitting.
The first three approaches are generic methods that are applicable to any classifier, whereas the fourth approach depends on the type of classifier used. The base classifiers for most of these approaches can be generated sequentially (one after another) or in parallel (all at once). Once an ensemble of classifiers has been learned, a test example x is classified by combining the predictions made by the base classifiers C_i(x):

C*(x) = f(C1(x), C2(x), …, Ck(x)),

where f is the function that combines the ensemble responses. One simple approach for obtaining C*(x) is to take a majority vote of the individual predictions. An alternate approach is to take a weighted majority vote, where the weight of a base classifier denotes its accuracy or relevance.
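To make the combination step concrete, here is a minimal Python sketch (the names majority_vote and weighted_vote are illustrative, not from the text) of the two choices of f described above: a plain majority vote and a weighted majority vote in which each base classifier's vote is scaled by a weight such as its accuracy.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: array of shape (k, n) holding the class label predicted
    by each of the k base classifiers for each of the n test instances."""
    predictions = np.asarray(predictions)
    combined = []
    for col in predictions.T:                       # one column per test instance
        labels, counts = np.unique(col, return_counts=True)
        combined.append(labels[np.argmax(counts)])  # class with the most votes
    return np.array(combined)

def weighted_vote(predictions, weights):
    """Weighted majority vote: each base classifier's vote is scaled by a
    weight (e.g., its accuracy on a validation set)."""
    predictions = np.asarray(predictions)
    weights = np.asarray(weights, dtype=float)
    combined = []
    for col in predictions.T:
        labels = np.unique(col)
        scores = [weights[col == lab].sum() for lab in labels]
        combined.append(labels[np.argmax(scores)])
    return np.array(combined)

# Example: three base classifiers, four test instances
preds = [[1, -1, 1, 1],
         [1,  1, -1, 1],
         [-1, 1, 1, 1]]
print(majority_vote(preds))                  # plain vote
print(weighted_vote(preds, [0.9, 0.6, 0.5])) # accuracy-weighted vote
```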
Ensemblemethodsshowthemostimprovementwhenusedwithunstableclassifiers,i.e.,baseclassifiersthataresensitivetominorperturbationsinthetrainingset,becauseofhighmodelcomplexity.Althoughunstableclassifiersmayhavealowbiasinfindingtheoptimaldecisionboundary,theirpredictionshaveahighvarianceforminorchangesinthetrainingsetormodelselection.Thistrade-offbetweenbiasandvarianceisdiscussedindetailinthenextsection.Byaggregatingtheresponsesofmultipleunstableclassifiers,ensemblelearningattemptstominimizetheirvariancewithoutworseningtheirbias.
4.10.3Bias-VarianceDecomposition
Bias-variancedecompositionisaformalmethodforanalyzingthegeneralizationerrorofapredictivemodel.Althoughtheanalysisisslightlydifferentforclassificationthanregression,wefirstdiscussthebasicintuitionofthisdecompositionbyusingananalogueofaregressionproblem.
Consider the illustrative task of reaching a target y by firing projectiles from a starting position, as shown in Figure 4.44. The target corresponds to the desired output at a test instance, while the starting position corresponds to its observed attributes. In this analogy, the projectile represents the model used for predicting the target using the observed attributes. Let ŷ denote the point where the projectile hits the ground, which is analogous to the prediction of the model.

Figure 4.44. Bias-variance decomposition.

Ideally, we would like our predictions to be as close to the true target as possible. However, note that different trajectories of projectiles are possible based on differences in the training data or in the approach used for model selection. Hence, we can observe a variance in the predictions ŷ over different runs of the projectile. Further, the target in our example is not fixed but has some freedom to move around, resulting in a noise component in the true target. This can be understood as the non-deterministic nature of the output variable, where the same set of attributes can have different output values. Let ŷ_avg represent the average prediction of the projectile over multiple runs, and y_avg denote the average target value. The difference between ŷ_avg and y_avg is known as the bias of the model.
In the context of classification, it can be shown that the generalization error of a classification model m can be decomposed into terms involving the bias, variance, and noise components of the model in the following way:

gen.error(m) = c1 × noise + bias(m) + c2 × variance(m),

where c1 and c2 are constants that depend on the characteristics of the training and test sets. Note that while the noise term is intrinsic to the target class, the bias and variance terms depend on the choice of the classification model. The bias of a model represents how close the average prediction of the model is to the average target. Models that are able to learn complex decision boundaries, e.g., models produced by k-nearest neighbor and multi-layer ANN, generally show low bias. The variance of a model captures the stability of its predictions in response to minor perturbations in the training set or the model selection approach.

We can say that a model shows better generalization performance if it has a lower bias and lower variance. However, if the complexity of a model is high but the training size is small, we generally expect to see a lower bias but higher variance, resulting in the phenomenon of overfitting. This phenomenon is pictorially represented in Figure 4.45(a). On the other hand, an overly simplistic model that suffers from underfitting may show a lower variance but would suffer from a high bias, as shown in Figure 4.45(b). Hence, the trade-off between bias and variance provides a useful way for interpreting the effects of underfitting and overfitting on the generalization performance of a model.
Figure 4.45. Plots showing the behavior of two-dimensional solutions with constant L2 and L1 norms.
The bias-variance trade-off can be used to explain why ensemble learning improves the generalization performance of unstable classifiers. If a base classifier shows low bias but high variance, it can become susceptible to overfitting, as even a small change in the training set will result in different predictions. However, by combining the responses of multiple base classifiers, we can expect to reduce the overall variance. Hence, ensemble learning methods show better performance primarily by lowering the variance in the predictions, although they can even help in reducing the bias. One of the simplest approaches for combining predictions and reducing their variance is to compute their average. This forms the basis of the bagging method, described in the following subsection.
4.10.4Bagging
Bagging, which is also known as bootstrap aggregating, is a technique that repeatedly samples (with replacement) from a data set according to a uniform probability distribution. Each bootstrap sample has the same size as the original data. Because the sampling is done with replacement, some instances may appear several times in the same training set, while others may be omitted from the training set. On average, a bootstrap sample D_i contains approximately 63% of the original training data because each sample has a probability 1 − (1 − 1/N)^N of being selected in each D_i. If N is sufficiently large, this probability converges to 1 − 1/e ≈ 0.632. The basic procedure for bagging is summarized in Algorithm 4.5. After training the k classifiers, a test instance is assigned to the class that receives the highest number of votes.

To illustrate how bagging works, consider the data set shown in Table 4.4. Let x denote a one-dimensional attribute and y denote the class label. Suppose we use only one-level binary decision trees, with a test condition x ≤ k, where k is a split point chosen to minimize the entropy of the leaf nodes. Such a tree is also known as a decision stump.

Table 4.4. Example of data set used to construct an ensemble of bagging classifiers.

x | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1
y |  1  |  1  |  1  | −1  | −1  | −1  | −1  |  1  |  1  | 1

Without bagging, the best decision stump we can produce splits the instances at either x ≤ 0.35 or x ≤ 0.75. Either way, the accuracy of the tree is at most 70%. Suppose we apply the bagging procedure on the data set using 10 bootstrap samples. The examples chosen for training in each bagging round are shown
inFigure4.46 .Ontheright-handsideofeachtable,wealsodescribethedecisionstumpbeingusedineachround.
We classify the entire data set given in Table 4.4 by taking a majority vote among the predictions made by each base classifier. The results of the predictions are shown in Figure 4.47. Since the class labels are either −1 or +1, taking the majority vote is equivalent to summing up the predicted values of y and examining the sign of the resulting sum (refer to the second to last row in Figure 4.47). Notice that the ensemble classifier perfectly classifies all 10 examples in the original data.
Algorithm4.5Baggingalgorithm.
Figure4.46.Exampleofbagging.
Theprecedingexampleillustratesanotheradvantageofusingensemblemethodsintermsofenhancingtherepresentationofthetargetfunction.Eventhougheachbaseclassifierisadecisionstump,combiningtheclassifierscanleadtoadecisionboundarythatmimicsadecisiontreeofdepth2.
Baggingimprovesgeneralizationerrorbyreducingthevarianceofthebaseclassifiers.Theperformanceofbaggingdependsonthestabilityofthebaseclassifier.Ifabaseclassifierisunstable,bagginghelpstoreducetheerrorsassociatedwithrandomfluctuationsinthetrainingdata.Ifabaseclassifierisstable,i.e.,robusttominorperturbationsinthetrainingset,thentheerroroftheensembleisprimarilycausedbybiasinthebaseclassifier.Inthissituation,baggingmaynotbeabletoimprovetheperformanceofthebaseclassifierssignificantly.Itmayevendegradetheclassifier'sperformancebecausetheeffectivesizeofeachtrainingsetisabout37%smallerthantheoriginaldata.
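As an illustration of the bagging procedure, the following is a small numpy sketch (the stump-fitting helper is a simplified stand-in, not Algorithm 4.5 itself) that trains decision stumps on bootstrap samples of data in the spirit of Table 4.4 and combines them by summing the ±1 predictions and taking the sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data in the spirit of Table 4.4: a one-dimensional attribute x and labels y in {-1, +1}
x = np.arange(0.1, 1.05, 0.1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def fit_stump(xs, ys):
    """Pick the split point k and leaf labels that minimize training error for a
    one-level tree of the form: predict 'left' if x <= k, else 'right'."""
    vals = np.sort(np.unique(xs))
    candidates = (vals[:-1] + vals[1:]) / 2 if len(vals) > 1 else vals
    best = None
    for k in candidates:
        for left, right in [(-1, 1), (1, -1)]:
            err = np.mean(np.where(xs <= k, left, right) != ys)
            if best is None or err < best[0]:
                best = (err, k, left, right)
    return best[1:]                                   # (split point, left label, right label)

def stump_predict(model, xs):
    k, left, right = model
    return np.where(xs <= k, left, right)

# Bagging: train one stump per bootstrap sample, then take a majority vote
n_rounds = 10
models = []
for _ in range(n_rounds):
    idx = rng.integers(0, len(x), size=len(x))        # sample with replacement
    models.append(fit_stump(x[idx], y[idx]))

votes = np.sum([stump_predict(m, x) for m in models], axis=0)
ensemble_pred = np.where(votes >= 0, 1, -1)           # sign of the summed votes
print("training accuracy:", np.mean(ensemble_pred == y))
```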
Figure4.47.Exampleofcombiningclassifiersconstructedusingthebaggingapproach.
4.10.5Boosting
Boostingisaniterativeprocedureusedtoadaptivelychangethedistributionoftrainingexamplesforlearningbaseclassifierssothattheyincreasinglyfocusonexamplesthatarehardtoclassify.Unlikebagging,boostingassignsaweighttoeachtrainingexampleandmayadaptivelychangetheweightattheendofeachboostinground.Theweightsassignedtothetrainingexamplescanbeusedinthefollowingways:
1. Theycanbeusedtoinformthesamplingdistributionusedtodrawasetofbootstrapsamplesfromtheoriginaldata.
2. Theycanbeusedtolearnamodelthatisbiasedtowardexampleswithhigherweight.
Thissectiondescribesanalgorithmthatusesweightsofexamplestodeterminethesamplingdistributionofitstrainingset.Initially,theexamplesareassignedequalweights,1/N,sothattheyareequallylikelytobechosenfortraining.Asampleisdrawnaccordingtothesamplingdistributionofthetrainingexamplestoobtainanewtrainingset.Next,aclassifierisbuiltfromthetrainingsetandusedtoclassifyalltheexamplesintheoriginaldata.Theweightsofthetrainingexamplesareupdatedattheendofeachboostinground.Examplesthatareclassifiedincorrectlywillhavetheirweightsincreased,whilethosethatareclassifiedcorrectlywillhavetheirweightsdecreased.Thisforcestheclassifiertofocusonexamplesthataredifficulttoclassifyinsubsequentiterations.
Thefollowingtableshowstheexampleschosenduringeachboostinground,whenappliedtothedatashowninTable4.4 .
Boosting(Round1): 7 3 2 8 7 9 4 10 6 3
Boosting(Round2): 5 4 9 4 2 5 1 7 4 2
Boosting(Round3): 4 4 8 10 4 5 4 6 3 4
Initially,alltheexamplesareassignedthesameweights.However,someexamplesmaybechosenmorethanonce,e.g.,examples3and7,becausethesamplingisdonewithreplacement.Aclassifierbuiltfromthedataisthenusedtoclassifyalltheexamples.Supposeexample4isdifficulttoclassify.Theweightforthisexamplewillbeincreasedinfutureiterationsasitgetsmisclassifiedrepeatedly.Meanwhile,examplesthatwerenotchoseninthepreviousround,e.g.,examples1and5,alsohaveabetterchanceofbeingselectedinthenextroundsincetheirpredictionsinthepreviousroundwerelikelytobewrong.Astheboostingroundsproceed,examplesthatarethehardesttoclassifytendtobecomeevenmoreprevalent.Thefinalensembleisobtainedbyaggregatingthebaseclassifiersobtainedfromeachboostinground.
Overtheyears,severalimplementationsoftheboostingalgorithmhavebeendeveloped.Thesealgorithmsdifferintermsof(1)howtheweightsofthetrainingexamplesareupdatedattheendofeachboostinground,and(2)howthepredictionsmadebyeachclassifierarecombined.AnimplementationcalledAdaBoostisexploredinthenextsection.
AdaBoost. Let {(x_j, y_j) | j = 1, 2, …, N} denote a set of N training examples. In the AdaBoost algorithm, the importance of a base classifier C_i depends on its error rate, which is defined as

∈_i = (1/N) [ ∑_{j=1}^{N} w_j I(C_i(x_j) ≠ y_j) ],   (4.102)

where I(p) = 1 if the predicate p is true, and 0 otherwise. The importance of a classifier C_i is given by the following parameter,

α_i = (1/2) ln( (1 − ∈_i) / ∈_i ).

Note that α_i has a large positive value if the error rate is close to 0 and a large negative value if the error rate is close to 1, as shown in Figure 4.48.

Figure 4.48. Plot of α as a function of training error ∈.

The α_i parameter is also used to update the weight of the training examples. To illustrate, let w_i^(j) denote the weight assigned to example (x_i, y_i) during the j-th boosting round. The weight update mechanism for AdaBoost is given by the equation:

w_i^(j+1) = (w_i^(j) / Z_j) × e^{−α_j}  if C_j(x_i) = y_i,
w_i^(j+1) = (w_i^(j) / Z_j) × e^{α_j}   if C_j(x_i) ≠ y_i,   (4.103)

where Z_j is the normalization factor used to ensure that ∑_i w_i^(j+1) = 1. The weight update formula given in Equation 4.103 increases the weights of incorrectly classified examples and decreases the weights of those classified correctly.

Instead of using a majority voting scheme, the prediction made by each classifier C_j is weighted according to α_j. This approach allows AdaBoost to penalize models that have poor accuracy, e.g., those generated at the earlier boosting rounds. In addition, if any intermediate rounds produce an error rate higher than 50%, the weights are reverted back to their original uniform values, w_i = 1/N, and the resampling procedure is repeated. The AdaBoost algorithm is summarized in Algorithm 4.6.

Algorithm 4.6 AdaBoost algorithm.
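The following Python sketch is not the book's Algorithm 4.6; it trains each stump on the weighted data via scikit-learn's sample_weight argument rather than by resampling, but it illustrates how the weighted error of Equation 4.102, the importance α_j, and the weight update of Equation 4.103 fit together.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """Train AdaBoost with decision stumps; y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # all examples start with weight 1/N
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))          # weighted error rate (Eq. 4.102)
        if err >= 0.5:                         # revert to uniform weights and skip the round
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # importance of the stump
        w = w * np.exp(-alpha * y * pred)      # Eq. 4.103: raise weights of mistakes
        w = w / w.sum()                        # normalization factor Z_j
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Combine the stumps with an alpha-weighted vote."""
    agg = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(agg >= 0, 1, -1)

X = np.arange(0.1, 1.05, 0.1).reshape(-1, 1)   # data in the spirit of Table 4.4
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
stumps, alphas = adaboost_fit(X, y, n_rounds=5)
print("training accuracy:", np.mean(adaboost_predict(stumps, alphas, X) == y))
```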
LetusexaminehowtheboostingapproachworksonthedatasetshowninTable4.4 .Initially,alltheexampleshaveidenticalweights.Afterthreeboostingrounds,theexampleschosenfortrainingareshowninFigure4.49(a) .TheweightsforeachexampleareupdatedattheendofeachboostingroundusingEquation4.103 ,asshowninFigure4.50(b) .
Withoutboosting,theaccuracyofthedecisionstumpis,atbest,70%.WithAdaBoost,theresultsofthepredictionsaregiveninFigure4.50(b) .Thefinalpredictionoftheensembleclassifierisobtainedbytakingaweightedaverageofthepredictionsmadebyeachbaseclassifier,whichisshowninthelastrowofFigure4.50(b) .NoticethatAdaBoostperfectlyclassifiesalltheexamplesinthetrainingdata.
Figure4.49.Exampleofboosting.
An important analytical result of boosting shows that the training error of the ensemble is bounded by the following expression:

e_ensemble ≤ ∏_i √( ∈_i (1 − ∈_i) ),   (4.104)

where ∈_i is the error rate of each base classifier i. If the error rate of the base classifier is less than 50%, we can write ∈_i = 0.5 − γ_i, where γ_i measures how much better the classifier is than random guessing. The bound on the training error of the ensemble becomes

e_ensemble ≤ ∏_i √(1 − 4γ_i^2) ≤ exp( −2 ∑_i γ_i^2 ).   (4.105)

Hence, the training error of the ensemble decreases exponentially, which leads to the fast convergence of the algorithm. By focusing on examples that are difficult to classify by the base classifiers, boosting is able to reduce the bias of the final predictions along with the variance. AdaBoost has been shown to provide significant improvements in performance over base classifiers on a range of data sets. Nevertheless, because of its tendency to focus on training examples that are wrongly classified, the boosting technique can be susceptible to overfitting, resulting in poor generalization performance in some scenarios.

Figure 4.50. Example of combining classifiers constructed using the AdaBoost approach.
4.10.6RandomForests
Randomforestsattempttoimprovethegeneralizationperformancebyconstructinganensembleofdecorrelateddecisiontrees.Randomforestsbuildontheideaofbaggingtouseadifferentbootstrapsampleofthetrainingdataforlearningdecisiontrees.However,akeydistinguishingfeatureofrandomforestsfrombaggingisthatateveryinternalnodeofatree,thebestsplittingcriterionischosenamongasmallsetofrandomlyselectedattributes.Inthisway,randomforestsconstructensemblesofdecisiontreesbynotonlymanipulatingtraininginstances(byusingbootstrapsamplessimilartobagging),butalsotheinputattributes(byusingdifferentsubsetsofattributesateveryinternalnode).
GivenatrainingsetDconsistingofninstancesanddattributes,thebasicprocedureoftrainingarandomforestclassifiercanbesummarizedusingthefollowingsteps:
1. Construct a bootstrap sample D_i of the training set by randomly sampling n instances (with replacement) from D.

2. Use D_i to learn a decision tree T_i as follows. At every internal node of T_i, randomly sample a set of p attributes and choose an attribute from this subset that shows the maximum reduction in an impurity measure for splitting. Repeat this procedure until every leaf is pure, i.e., contains instances from the same class.

Once an ensemble of decision trees has been constructed, their average prediction (majority vote) on a test instance is used as the final prediction of the random forest. Note that the decision trees involved in a random forest are unpruned trees, as they are allowed to grow to their largest possible size until every leaf is pure. Hence, the base classifiers of a random forest represent unstable classifiers that have low bias but high variance, because of their large size.
Another property of the base classifiers learned in random forests is the lack of correlation among their model parameters and test predictions. This can be attributed to the use of an independently sampled data set D_i for learning every decision tree T_i, similar to the bagging approach. However, random forests have the additional advantage of choosing a splitting criterion at every internal node using a different (and randomly selected) subset of attributes. This property significantly helps in breaking the correlation structure, if any, among the decision trees T_i.

To realize this, consider a training set involving a large number of attributes, where only a small subset of attributes are strong predictors of the target class, whereas other attributes are weak indicators. Given such a training set, even if we consider different bootstrap samples D_i for learning T_i, we would mostly be choosing the same attributes for splitting at internal nodes, because the weak attributes would be largely overlooked when compared with the strong predictors. This can result in a considerable correlation among the trees. However, if we restrict the choice of attributes at every internal node to a random subset of attributes, we can ensure the selection of both strong and weak predictors, thus promoting diversity among the trees. This principle is utilized by random forests for creating decorrelated decision trees.

By aggregating the predictions of an ensemble of strong and decorrelated decision trees, random forests are able to reduce the variance of the trees without negatively impacting their low bias. This makes random forests quite robust to overfitting. Additionally, because they consider only a small subset of attributes at every internal node, random forests are computationally fast and robust even in high-dimensional settings.

The number of attributes to be selected at every node, p, is a hyper-parameter of the random forest classifier. A small value of p can reduce the correlation among the classifiers but may also reduce their strength. A large value can improve their strength but may result in correlated trees similar to bagging. Although common suggestions for p in the literature include √d and log2(d) + 1, a suitable value of p for a given training set can always be selected by tuning it over a validation set, as described in the previous chapter. However, there is an alternative way for selecting hyper-parameters in random forests, which does not require using a separate validation set. It involves computing a reliable estimate of the generalization error rate directly during training, known as the out-of-bag (oob) error estimate. The oob estimate can be computed for any generic ensemble learning method that builds independent base classifiers using bootstrap samples of the training set, e.g., bagging and random forests. The approach for computing the oob estimate can be described as follows.

Consider an ensemble learning method that uses an independent base classifier T_i built on a bootstrap sample D_i of the training set. Since every training instance x will be used for training approximately 63% of the base classifiers, we can call x an out-of-bag sample for the remaining 37% of base classifiers that did not use it for training. If we use these remaining classifiers to make predictions on x, we can obtain the oob error on x by taking their majority vote and comparing it with its class label. Note that the oob error estimates the error of classifiers on an instance that was not used for training those classifiers. Hence, the oob error can be considered a reliable estimate of generalization error. By taking the average of the oob errors of all training instances, we can compute the overall oob error estimate. This can be used as an alternative to the validation error rate for selecting hyper-parameters. Hence, random forests do not need to use a separate partition of the training set for validation, as they can simultaneously train the base classifiers and compute generalization error estimates on the same data set.
Randomforestshavebeenempiricallyfoundtoprovidesignificantimprovementsingeneralizationperformancethatareoftencomparable,ifnotsuperior,totheimprovementsprovidedbytheAdaBoostalgorithm.RandomforestsarealsomorerobusttooverfittingandrunmuchfasterthantheAdaBoostalgorithm.
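Assuming scikit-learn is available, a minimal sketch of training a random forest with p = √d attributes per split and requesting the out-of-bag error estimate described above might look as follows (the synthetic data set is only a placeholder).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A synthetic data set standing in for a real training set
X, y = make_classification(n_samples=1000, n_features=50, n_informative=5,
                           random_state=0)

# max_features="sqrt" selects p = sqrt(d) attributes at each internal node;
# oob_score=True computes the out-of-bag estimate of generalization accuracy.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)
print("out-of-bag accuracy estimate:", rf.oob_score_)
```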
4.10.7EmpiricalComparisonamongEnsembleMethods
Table4.5 showstheempiricalresultsobtainedwhencomparingtheperformanceofadecisiontreeclassifieragainstbagging,boosting,andrandomforest.Thebaseclassifiersusedineachensemblemethodconsistof50decisiontrees.Theclassificationaccuraciesreportedinthistableareobtainedfromtenfoldcross-validation.Noticethattheensembleclassifiersgenerallyoutperformasingledecisiontreeclassifieronmanyofthedatasets.
Table 4.5. Comparing the accuracy of a decision tree classifier against three ensemble methods.

Data Set | Number of (Attributes, Classes, Instances) | Decision Tree (%) | Bagging (%) | Boosting (%) | RF (%)
Anneal (39,6,898) 92.09 94.43 95.43 95.43
Australia (15,2,690) 85.51 87.10 85.22 85.80
Auto (26,7,205) 81.95 85.37 85.37 84.39
Breast (11,2,699) 95.14 96.42 97.28 96.14
Cleve (14,2,303) 76.24 81.52 82.18 82.18
Credit (16,2,690) 85.8 86.23 86.09 85.8
Diabetes (9,2,768) 72.40 76.30 73.18 75.13
German (21,2,1000) 70.90 73.40 73.00 74.5
Glass (10,7,214) 67.29 76.17 77.57 78.04
Heart (14,2,270) 80.00 81.48 80.74 83.33
Hepatitis (20,2,155) 81.94 81.29 83.87 83.23
Horse (23,2,368) 85.33 85.87 81.25 85.33
Ionosphere (35,2,351) 89.17 92.02 93.73 93.45
Iris (5,3,150) 94.67 94.67 94.00 93.33
Labor (17,2,57) 78.95 84.21 89.47 84.21
Led7 (8,10,3200) 73.34 73.66 73.34 73.06
Lymphography (19,4,148) 77.03 79.05 85.14 82.43
Pima (9,2,768) 74.35 76.69 73.44 77.60
Sonar (61,2,208) 78.85 78.85 84.62 85.58
Tic-tac-toe (10,2,958) 83.72 93.84 98.54 95.82
Vehicle (19,4,846) 71.04 74.11 78.25 74.94
Waveform (22,3,5000) 76.44 83.30 83.90 84.04
Wine (14,3,178) 94.38 96.07 97.75 97.75
Zoo (17,7,101) 93.07 93.07 95.05 97.03
4.11ClassImbalanceProblemInmanydatasetsthereareadisproportionatenumberofinstancesthatbelongtodifferentclasses,apropertyknownasskeworclassimbalance.Forexample,considerahealth-careapplicationwherediagnosticreportsareusedtodecidewhetherapersonhasararedisease.Becauseoftheinfrequentnatureofthedisease,wecanexpecttoobserveasmallernumberofsubjectswhoarepositivelydiagnosed.Similarly,increditcardfrauddetection,fraudulenttransactionsaregreatlyoutnumberedbylegitimatetransactions.
Thedegreeofimbalancebetweentheclassesvariesacrossdifferentapplicationsandevenacrossdifferentdatasetsfromthesameapplication.Forexample,theriskforararediseasemayvaryacrossdifferentpopulationsofsubjectsdependingontheirdietaryandlifestylechoices.However,despitetheirinfrequentoccurrences,acorrectclassificationoftherareclassoftenhasgreatervaluethanacorrectclassificationofthemajorityclass.Forexample,itmaybemoredangeroustoignoreapatientsufferingfromadiseasethantomisdiagnoseahealthyperson.
Moregenerally,classimbalanceposestwochallengesforclassification.First,itcanbedifficulttofindsufficientlymanylabeledsamplesofarareclass.Notethatmanyoftheclassificationmethodsdiscussedsofarworkwellonlywhenthetrainingsethasabalancedrepresentationofbothclasses.Althoughsomeclassifiersaremoreeffectiveathandlingimbalanceinthetrainingdatathanothers,e.g.,rule-basedclassifiersandk-NN,theyareallimpactediftheminorityclassisnotwell-representedinthetrainingset.Ingeneral,aclassifiertrainedoveranimbalanceddatasetshowsabiastowardimprovingitsperformanceoverthemajorityclass,whichisoftennotthedesiredbehavior.
Asaresult,manyexistingclassificationmodels,whentrainedonanimbalanceddataset,maynoteffectivelydetectinstancesoftherareclass.
Second,accuracy,whichisthetraditionalmeasureforevaluatingclassificationperformance,isnotwell-suitedforevaluatingmodelsinthepresenceofclassimbalanceinthetestdata.Forexample,if1%ofthecreditcardtransactionsarefraudulent,thenatrivialmodelthatpredictseverytransactionaslegitimatewillhaveanaccuracyof99%eventhoughitfailstodetectanyofthefraudulentactivities.Thus,thereisaneedtousealternativeevaluationmetricsthataresensitivetotheskewandcancapturedifferentcriteriaofperformancethanaccuracy.
Inthissection,wefirstpresentsomeofthegenericmethodsforbuildingclassifierswhenthereisclassimbalanceinthetrainingset.Wethendiscussmethodsforevaluatingclassificationperformanceandadaptingclassificationdecisionsinthepresenceofaskewedtestset.Intheremainderofthissection,wewillconsiderbinaryclassificationproblemsforsimplicity,wheretheminorityclassisreferredasthepositive classwhilethemajorityclassisreferredasthenegative class.
4.11.1BuildingClassifierswithClassImbalance
There are two primary considerations for building classifiers in the presence of class imbalance in the training set. First, we need to ensure that the learning algorithm is trained over a data set that has adequate representation of both the majority as well as the minority classes. Some common approaches for ensuring this include the methodologies of oversampling and undersampling
thetrainingset.Second,havinglearnedaclassificationmodel,weneedawaytoadaptitsclassificationdecisions(andthuscreateanappropriatelytunedclassifier)tobestmatchtherequirementsoftheimbalancedtestset.Thisistypicallydonebyconvertingtheoutputsoftheclassificationmodeltoreal-valuedscores,andthenselectingasuitablethresholdontheclassificationscoretomatchtheneedsofatestset.Boththeseconsiderationsarediscussedindetailinthefollowing.
OversamplingandUndersamplingThefirststepinlearningwithimbalanceddataistotransformthetrainingsettoabalancedtrainingset,wherebothclasseshavenearlyequalrepresentation.Thebalancedtrainingsetcanthenbeusedwithanyoftheexistingclassificationtechniques(withoutmakinganymodificationsinthelearningalgorithm)tolearnamodelthatgivesequalemphasistobothclasses.Inthefollowing,wepresentsomeofthecommontechniquesfortransforminganimbalancedtrainingsettoabalancedone.
Abasicapproachforcreatingbalancedtrainingsetsistogenerateasampleoftraininginstanceswheretherareclasshasadequaterepresentation.Therearetwotypesofsamplingmethodsthatcanbeusedtoenhancetherepresentationoftheminorityclass:(a)undersampling,wherethefrequencyofthemajorityclassisreducedtomatchthefrequencyoftheminorityclass,and(b)oversampling,whereartificialexamplesoftheminorityclassarecreatedtomakethemequalinproportiontothenumberofnegativeinstances.
Toillustrateundersampling,consideratrainingsetthatcontains100positiveexamplesand1000negativeexamples.Toovercometheskewamongtheclasses,wecanselectarandomsampleof100examplesfromthenegativeclassandusethemwiththe100positiveexamplestocreateabalancedtrainingset.Aclassifierbuiltovertheresultantbalancedsetwillthenbe
unbiasedtowardbothclasses.However,onelimitationofundersamplingisthatsomeoftheusefulnegativeexamples(e.g.,thoseclosertotheactualdecisionboundary)maynotbechosenfortraining,therefore,resultinginaninferiorclassificationmodel.Anotherlimitationisthatthesmallersampleof100negativeinstancesmayhaveahighervariancethanthelargersetof1000.
Oversampling attempts to create a balanced training set by artificially generating new positive examples. A simple approach for oversampling is to duplicate every positive instance n−/n+ times, where n+ and n− are the numbers of positive and negative training instances, respectively. Figure 4.51 illustrates the effect of oversampling on the learning of a decision boundary using a classifier such as a decision tree. Without oversampling, only the positive examples at the bottom right-hand side of Figure 4.51(a) are classified correctly. The positive example in the middle of the diagram is misclassified because there are not enough examples to justify the creation of a new decision boundary to separate the positive and negative instances. Oversampling provides the additional examples needed to ensure that the decision boundary surrounding the positive example is not pruned, as illustrated in Figure 4.51(b). Note that duplicating a positive instance is analogous to doubling its weight during the training stage. Hence, the effect of oversampling can be alternatively achieved by assigning higher weights to positive instances than to negative instances. This method of weighting instances can be used with a number of classifiers such as logistic regression, ANN, and SVM.
Figure4.51.Illustratingtheeffectofoversamplingoftherareclass.
One limitation of the duplication method for oversampling is that the replicated positive examples have an artificially lower variance when compared with their true distribution in the overall data. This can bias the classifier to the specific distribution of training instances, which may not be representative of the overall distribution of test instances, leading to poor generalizability. To overcome this limitation, an alternative approach for oversampling is to generate synthetic positive instances in the neighborhood of existing positive instances. In this approach, called the Synthetic Minority Oversampling Technique (SMOTE), we first determine the k-nearest positive neighbors of every positive instance x, and then generate a synthetic positive instance at some intermediate point along the line segment joining x to one of its randomly chosen k-nearest neighbors, x_k. This process is repeated until the desired number of positive instances is reached. However, one limitation of this approach is that it can only generate new positive instances in the convex hull of the existing positive class. Hence, it does not help improve the representation of the positive class outside the boundary of existing positive
instances.Despitetheircomplementarystrengthsandweaknesses,undersamplingandoversamplingprovideusefuldirectionsforgeneratingbalancedtrainingsetsinthepresenceofclassimbalance.
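The following numpy sketch illustrates the SMOTE idea described above; it is a simplified illustration (the helper name smote_like and its parameters are hypothetical), not a reference implementation. It interpolates each chosen minority instance with one of its k nearest minority neighbors.

```python
import numpy as np

def smote_like(X_pos, n_new, k=5, rng=None):
    """Generate n_new synthetic minority-class instances by interpolating
    between a chosen minority instance and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    X_pos = np.asarray(X_pos, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_pos))
        xi = X_pos[i]
        d = np.linalg.norm(X_pos - xi, axis=1)       # distances to every minority instance
        neighbors = np.argsort(d)[1:k + 1]           # skip xi itself
        xk = X_pos[rng.choice(neighbors)]
        lam = rng.random()                           # random point on the segment xi -> xk
        synthetic.append(xi + lam * (xk - xi))
    return np.array(synthetic)

# Example: 10 minority instances in 2-D, augmented with 40 synthetic points
X_pos = np.random.default_rng(0).normal(size=(10, 2))
X_aug = np.vstack([X_pos, smote_like(X_pos, n_new=40, k=3, rng=0)])
print(X_aug.shape)   # (50, 2)
```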
Assigning Scores to Test Instances. If a classifier returns an ordinal score s(x) for every test instance x such that a higher score denotes a greater likelihood of x belonging to the positive class, then for every possible value of score threshold, s_T, we can create a new binary classifier where a test instance x is classified positive only if s(x) > s_T. Thus, every choice of s_T can potentially lead to a different classifier, and we are interested in finding the classifier that is best suited for our needs.

Ideally, we would like the classification score to vary monotonically with the actual posterior probability of the positive class, i.e., if s(x1) and s(x2) are the scores of any two instances x1 and x2, then s(x1) ≥ s(x2) ⇒ P(y=1|x1) ≥ P(y=1|x2). However, this is difficult to guarantee in practice, as the properties of the classification score depend on several factors such as the complexity of the classification algorithm and the representative power of the training set. In general, we can only expect the classification score of a reasonable algorithm to be weakly related to the actual posterior probability of the positive class, even though the relationship may not be strictly monotonic. Most classifiers can be easily modified to produce such a real-valued score. For example, the signed distance of an instance from the positive margin hyperplane of an SVM can be used as a classification score. As another example, test instances belonging to a leaf in a decision tree can be assigned a score based on the fraction of training instances labeled as positive in the leaf. Also, probabilistic classifiers such as naïve Bayes, Bayesian networks, and logistic regression naturally output estimates of posterior probabilities, P(y=1|x). Next, we discuss some
evaluationmeasuresforassessingthegoodnessofaclassifierinthepresenceofclassimbalance.
Table 4.6. A confusion matrix for a binary classification problem in which the classes are not equally important.

                        Predicted Class
                        +             −
Actual Class    +    f++ (TP)      f+− (FN)
                −    f−+ (FP)      f−− (TN)
4.11.2EvaluatingPerformancewithClassImbalance
Themostbasicapproachforrepresentingaclassifier'sperformanceonatestsetistouseaconfusionmatrix,asshowninTable4.6 .ThistableisessentiallythesameasTable3.4 ,whichwasintroducedinthecontextofevaluatingclassificationperformanceinSection3.2 .Aconfusionmatrixsummarizesthenumberofinstancespredictedcorrectlyorincorrectlybyaclassifierusingthefollowingfourcounts:
True positive (TP) or f++, which corresponds to the number of positive examples correctly predicted by the classifier.

False positive (FP) or f−+ (also known as Type I error), which corresponds to the number of negative examples wrongly predicted as positive by the classifier.

False negative (FN) or f+− (also known as Type II error), which corresponds to the number of positive examples wrongly predicted as negative by the classifier.

True negative (TN) or f−−, which corresponds to the number of negative examples correctly predicted by the classifier.
Theconfusionmatrixprovidesaconciserepresentationofclassificationperformanceonagiventestdataset.However,itisoftendifficulttointerpretandcomparetheperformanceofclassifiersusingthefour-dimensionalrepresentations(correspondingtothefourcounts)providedbytheirconfusionmatrices.Hence,thecountsintheconfusionmatrixareoftensummarizedusinganumberofevaluationmeasures.Accuracyisanexampleofonesuchmeasurethatcombinesthesefourcountsintoasinglevalue,whichisusedextensivelywhenclassesarebalanced.However,theaccuracymeasureisnotsuitableforhandlingdatasetswithimbalancedclassdistributionsasittendstofavorclassifiersthatcorrectlyclassifythemajorityclass.Inthefollowing,wedescribeotherpossiblemeasuresthatcapturedifferentcriteriaofperformancewhenworkingwithimbalancedclasses.
A basic evaluation measure is the true positive rate (TPR), which is defined as the fraction of positive test instances correctly predicted by the classifier:

TPR = TP / (TP + FN).

In the medical community, TPR is also known as sensitivity, while in the information retrieval literature, it is also called recall (r). A classifier with a high TPR has a high chance of correctly identifying the positive instances of the data.

Analogously to TPR, the true negative rate (TNR) (also known as specificity) is defined as the fraction of negative test instances correctly predicted by the classifier, i.e.,

TNR = TN / (FP + TN).

A high TNR value signifies that the classifier correctly classifies any randomly chosen negative instance in the test set. A commonly used evaluation measure that is closely related to TNR is the false positive rate (FPR), which is defined as 1 − TNR:

FPR = FP / (FP + TN).

Similarly, we can define the false negative rate (FNR) as 1 − TPR:

FNR = FN / (FN + TP).

Note that the evaluation measures defined above do not take into account the skew among the classes, which can be formally defined as α = P/(P + N), where P and N denote the number of actual positives and actual negatives, respectively. As a result, changing the relative numbers of P and N will have no effect on TPR, TNR, FPR, or FNR, since they depend only on the fraction of correct classifications for every class, independently of the other class. Furthermore, knowing the values of TPR and TNR (and consequently FNR and FPR) does not by itself help us uniquely determine all four entries of the confusion matrix. However, together with information about the skew factor, α, and the total number of instances, N, we can compute the entire confusion matrix using TPR and TNR, as shown in Table 4.7.

Table 4.7. Entries of the confusion matrix in terms of the TPR, TNR, skew, α, and total number of instances, N.

                  Predicted +                    Predicted −
Actual +     TPR × α × N                  (1 − TPR) × α × N             α × N
Actual −     (1 − TNR) × (1 − α) × N      TNR × (1 − α) × N             (1 − α) × N
                                                                        N
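A small Python sketch of the reconstruction in Table 4.7, which rebuilds the four confusion-matrix counts from the TPR, TNR, skew α, and total number of instances N (the function name is illustrative):

```python
def confusion_from_rates(tpr, tnr, alpha, n):
    """Rebuild the confusion matrix counts from TPR, TNR, the skew
    alpha = P/(P+N), and the total number of instances n (Table 4.7)."""
    pos = alpha * n                 # number of actual positives
    neg = (1 - alpha) * n           # number of actual negatives
    tp = tpr * pos
    fn = (1 - tpr) * pos
    tn = tnr * neg
    fp = (1 - tnr) * neg
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

print(confusion_from_rates(tpr=0.8, tnr=0.9, alpha=0.1, n=1000))
# {'TP': 80.0, 'FN': 20.0, 'FP': 90.0, 'TN': 810.0}
```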
An evaluation measure that is sensitive to the skew is precision, which can be defined as the fraction of correct predictions of the positive class over the total number of positive predictions, i.e.,

Precision, p = TP / (TP + FP).

Precision is also referred to as the positive predictive value (PPV). A classifier that has a high precision is likely to have most of its positive predictions correct. Precision is a useful measure for highly skewed test sets where the positive predictions, even though small in number, are required to be mostly correct. A measure that is closely related to precision is the false discovery rate (FDR), which can be defined as 1 − p:

FDR = FP / (TP + FP).

Although both FDR and FPR focus on FP, they are designed to capture different evaluation objectives and thus can take quite contrasting values, especially in the presence of class imbalance. To illustrate this, consider a classifier with the following confusion matrix:

                        Predicted Class
                        +        −
Actual Class    +      100        0
                −      100      900

Since half of the positive predictions made by the classifier are incorrect, it has an FDR value of 100/(100 + 100) = 0.5. However, its FPR is equal to 100/(100 + 900) = 0.1, which is quite low. This example shows that in the presence of high skew (i.e., a very small value of α), even a small FPR can result in a high FDR. See Section 10.6 for further discussion of this issue.

Note that the evaluation measures defined above provide an incomplete representation of performance, because they either only capture the effect of false positives (e.g., FPR and precision) or the effect of false negatives (e.g., TPR or recall), but not both. Hence, if we optimize only one of these evaluation measures, we may end up with a classifier that shows low FN but high FP, or vice versa. For example, a classifier that declares every instance to be positive will have a perfect recall, but high FPR and very poor precision. On the other hand, a classifier that is very conservative in classifying an instance as positive (to reduce FP) may end up having high precision but very poor recall. We thus need evaluation measures that account for both types of misclassifications, FP and FN. Some examples of such evaluation measures are summarized by the following definitions:

Positive Likelihood Ratio = TPR / FPR.
F1 measure = 2rp / (r + p) = 2 × TP / (2 × TP + FP + FN).
G measure = √(rp) = TP / √((TP + FP)(TP + FN)).

While some of these evaluation measures are invariant to the skew (e.g., the positive likelihood ratio), others (e.g., precision and the F1 measure) are sensitive to skew. Further, different evaluation measures capture the effects of different types of misclassification errors in various ways. For example, the F1 measure represents a harmonic mean between recall and precision, i.e.,

F1 = 2 / (1/r + 1/p).
Becausetheharmonicmeanoftwonumberstendstobeclosertothesmallerofthetwonumbers,ahighvalueof -measureensuresthatbothprecisionandrecallarereasonablyhigh.Similarly,theGmeasurerepresentsthegeometricmeanbetweenrecallandprecision.Acomparisonamongharmonic,geometric,andarithmeticmeansisgiveninthenextexample.
Example 4.9. Consider two positive numbers a = 1 and b = 5. Their arithmetic mean is μ_a = (a + b)/2 = 3 and their geometric mean is μ_g = √(ab) = 2.236. Their harmonic mean is μ_h = (2 × 1 × 5)/6 = 1.667, which is closer to the smaller value between a and b than the arithmetic and geometric means.

A generic extension of the F1 measure is the Fβ measure, which can be defined as follows:

Fβ = (β² + 1) r p / (r + β² p) = (β² + 1) × TP / ((β² + 1) TP + β² FP + FN).   (4.106)

Both precision and recall can be viewed as special cases of Fβ by setting β = 0 and β = ∞, respectively. Low values of β make Fβ closer to precision, and high values make it closer to recall.

A more general measure that captures Fβ as well as accuracy is the weighted accuracy measure, which is defined by the following equation:

Weighted accuracy = (w1 TP + w4 TN) / (w1 TP + w2 FP + w3 FN + w4 TN).   (4.107)

The relationship between weighted accuracy and other performance measures is summarized in the following table:

Measure      w1        w2     w3    w4
Recall        1         1      0     0
Precision     1         0      1     0
Fβ          β² + 1     β²      1     0
Accuracy      1         1      1     1
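As a quick illustration, the following Python helper (an illustrative sketch, not from the text) computes precision, recall, the Fβ measure using the r-p form of Equation 4.106, and the G measure from the four confusion-matrix counts:

```python
import math

def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Precision, recall (TPR), F_beta (r-p form of Eq. 4.106), and the G measure,
    computed from the four confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_beta = (beta**2 + 1) * recall * precision / (recall + beta**2 * precision)
    g = math.sqrt(recall * precision)        # geometric mean of recall and precision
    return {"precision": precision, "recall": recall,
            f"F{beta:g}": f_beta, "G": g}

# The skewed example from the text: TP=100, FN=0, FP=100, TN=900
print(classification_metrics(tp=100, fp=100, fn=0, tn=900))
```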
4.11.3FindinganOptimalScoreThreshold
Given a suitably chosen evaluation measure E and a distribution of classification scores, s(x), on a validation set, we can obtain the optimal score threshold s* on the validation set using the following approach:

1. Sort the scores in increasing order of their values.
2. For every unique value of score, s, consider the classification model that assigns an instance x as positive only if s(x) > s. Let E(s) denote the performance of this model on the validation set.
3. Find s* that maximizes the evaluation measure E(s): s* = argmax_s E(s).

Note that s* can be treated as a hyper-parameter of the classification algorithm that is learned during model selection. Using s*, we can assign a positive label to a future test instance x only if s(x) > s*. If the evaluation measure E is skew invariant (e.g., the positive likelihood ratio), then we can select s* without knowing the skew of the test set, and the resultant classifier formed using s* can be expected to show optimal performance on the test set (with respect to the evaluation measure E). On the other hand, if E is sensitive to the skew (e.g., precision or the F1 measure), then we need to ensure that the skew of the validation set used for selecting s* is similar to that of the test set, so that the classifier formed using s* shows optimal test performance with respect to E. Alternatively, given an estimate of the skew of the test data, α, we can use it along with the TPR and TNR on the validation set to estimate all entries of the confusion matrix (see Table 4.7), and thus the estimate of any evaluation measure E on the test set. The score threshold s* selected using this estimate of E can then be expected to produce optimal test performance with respect to E. Furthermore, the methodology of selecting s* on the validation set can help in comparing the test performance of different classification algorithms, by using the optimal values of s* for each algorithm.
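A minimal Python sketch of this three-step procedure, here using the F1 measure as the evaluation measure E on a hypothetical validation set (the helper name and data are illustrative):

```python
import numpy as np

def best_threshold(scores, labels):
    """Scan every unique validation score as a candidate threshold s and
    return the s* that maximizes the F1 measure (labels are 0/1)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_s, best_f1 = None, -1.0
    for s in np.unique(scores):
        pred = (scores > s).astype(int)          # positive only if s(x) > s
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:
            best_s, best_f1 = s, f1
    return best_s, best_f1

scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.8, 0.9]   # hypothetical validation scores
labels = [0,   0,   1,    0,   1,   1,   1]
print(best_threshold(scores, labels))
```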
4.11.4AggregateEvaluationofPerformance
Although the above approach helps in finding a score threshold s* that provides optimal performance with respect to a desired evaluation measure and a certain amount of skew, α, sometimes we are interested in evaluating the performance of a classifier on a number of possible score thresholds, each corresponding to a different choice of evaluation measure and skew value. Assessing the performance of a classifier over a range of score thresholds is called aggregate evaluation of performance. In this style of analysis, the emphasis is not on evaluating the performance of a single classifier corresponding to the optimal score threshold, but to assess the general quality of ranking produced by the classification scores on the test set. In general, this helps in obtaining robust estimates of classification performance that are not sensitive to specific choices of score thresholds.
ROCCurveOneofthewidely-usedtoolsforaggregateevaluationisthereceiveroperatingcharacteristic(ROC)curve.AnROCcurveisagraphicalapproachfordisplayingthetrade-offbetweenTPRandFPRofaclassifier,overvaryingscorethresholds.InanROCcurve,theTPRisplottedalongthey-axisandtheFPRisshownonthex-axis.Eachpointalongthecurvecorrespondstoaclassificationmodelgeneratedbyplacingathresholdonthetestscoresproducedbytheclassifier.ThefollowingproceduredescribesthegenericapproachforcomputinganROCcurve:
1. Sort the test instances in increasing order of their scores.
2. Select the lowest ranked test instance (i.e., the instance with the lowest score). Assign the selected instance and those ranked above it to the positive class. This approach is equivalent to classifying all the test instances as the positive class. Because all the positive examples are classified correctly and the negative examples are misclassified, TPR = FPR = 1.
3. Select the next test instance from the sorted list. Classify the selected instance and those ranked above it as positive, while those ranked below it as negative. Update the counts of TP and FP by examining the actual class label of the selected instance. If this instance belongs to the positive class, the TP count is decremented and the FP count remains the same as before. If the instance belongs to the negative class, the FP count is decremented and the TP count remains the same as before.
4. Repeat Step 3 and update the TP and FP counts accordingly until the highest ranked test instance is selected. At this final threshold, TPR = FPR = 0, as all instances are labeled as negative.
5. Plot the TPR against the FPR of the classifier.
Example 4.10. [Generating ROC Curve] Figure 4.52 shows an example of how to compute the TPR and FPR values for every choice of score threshold. There are five positive examples and five negative examples in the test set. The class labels of the test instances are shown in the first row of the table, while the second row corresponds to the sorted score values for each instance. The next six rows contain the counts of TP, FP, TN, and FN, along with their corresponding TPR and FPR. The table is then filled from left to right. Initially, all the instances are predicted to be positive. Thus, TP = FP = 5 and TPR = FPR = 1. Next, we assign the test instance with the lowest score as the negative class. Because the selected instance is actually a positive example, the TP count decreases from 5 to 4 and the FP count is the same as before. The FPR and TPR are updated accordingly. This process is repeated until we reach the end of the list, where TPR = 0 and FPR = 0. The ROC curve for this example is shown in Figure 4.53.

Figure 4.52. Computing the TPR and FPR at every score threshold.
Figure4.53.ROCcurveforthedatashowninFigure4.52 .
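The procedure above can be expressed compactly in Python; the sketch below (with made-up scores and labels, not those of Figure 4.52) sweeps the sorted scores and records the (FPR, TPR) pair at every threshold:

```python
import numpy as np

def roc_points(scores, labels):
    """Return the (FPR, TPR) points of the ROC curve, sweeping the score
    threshold from 'everything positive' to 'everything negative' (labels are 0/1)."""
    order = np.argsort(scores)                  # increasing order of scores
    labels = np.asarray(labels)[order]
    pos = labels.sum()
    neg = len(labels) - pos
    tp, fp = pos, neg                           # start: every instance predicted positive
    points = [(1.0, 1.0)]
    for lab in labels:                          # move one instance to the negative side
        if lab == 1:
            tp -= 1
        else:
            fp -= 1
        points.append((fp / neg, tp / pos))
    return points                               # ends at (0, 0): everything negative

scores = [0.25, 0.43, 0.53, 0.76, 0.85, 0.85, 0.85, 0.87, 0.93, 0.95]
labels = [1,    0,    0,    1,    0,    0,    1,    0,    1,    1]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
```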
NotethatinanROCcurve,theTPRmonotonicallyincreaseswithFPR,becausetheinclusionofatestinstanceinthesetofpredictedpositivescaneitherincreasetheTPRortheFPR.TheROCcurvethushasastaircasepattern.Furthermore,thereareseveralcriticalpointsalonganROCcurvethathavewell-knowninterpretations:
(TPR = 0, FPR = 0): The model predicts every instance to be a negative class.
(TPR = 1, FPR = 1): The model predicts every instance to be a positive class.
(TPR = 1, FPR = 0): The perfect model with zero misclassifications.

A good classification model should be located as close as possible to the upper left corner of the diagram, while a model that makes random guesses should reside along the main diagonal, connecting the points (TPR = 0, FPR = 0) and (TPR = 1, FPR = 1). Random guessing means that an instance is classified as a positive class with a fixed probability p, irrespective of its attribute set.
For example, consider a data set that contains n+ positive instances and n− negative instances. The random classifier is expected to correctly classify p·n+ of the positive instances and to misclassify p·n− of the negative instances. Therefore, the TPR of the classifier is (p·n+)/n+ = p, while its FPR is (p·n−)/n− = p. Hence, this random classifier will reside at the point (p, p) in the ROC curve, along the main diagonal.

Figure 4.54. ROC curves for two different classifiers.

Since every point on the ROC curve represents the performance of a classifier generated using a particular score threshold, they can be viewed as different operating points of the classifier. One may choose one of these operating points depending on the requirements of the application. Hence, an ROC curve facilitates the comparison of classifiers over a range of operating points. For example, Figure 4.54 compares the ROC curves of two classifiers, M1 and M2, generated by varying the score thresholds. We can see that M1 is better than M2 when the FPR is less than 0.36, as M1 shows better TPR than M2 for this range of operating points. On the other hand, M2 is superior when the FPR is greater than 0.36, since the TPR of M2 is higher than that of M1 for this range. Clearly, neither of the two classifiers dominates (is strictly better than) the other, i.e., shows higher values of TPR and lower values of FPR over all operating points.
To summarize the aggregate behavior across all operating points, one of the commonly used measures is the area under the ROC curve (AUC). If the classifier is perfect, then its area under the ROC curve will be equal to 1. If the algorithm simply performs random guessing, then its area under the ROC curve will be equal to 0.5.

Although the AUC provides a useful summary of aggregate performance, there are certain caveats in using the AUC for comparing classifiers. First, even if the AUC of algorithm A is higher than the AUC of another algorithm B, this does not mean that algorithm A is always better than B, i.e., that the ROC curve of A dominates that of B across all operating points. For example, even though M1 shows a slightly lower AUC than M2 in Figure 4.54, we can see that both M1 and M2 are useful over different ranges of operating points and neither of them is strictly better than the other across all possible operating points. Hence, we cannot use the AUC to determine which algorithm is better, unless we know that the ROC curve of one of the algorithms dominates the other.

Second, although the AUC summarizes the aggregate performance over all operating points, we are often interested in only a small range of operating points in most applications. For example, even though M1 shows a slightly lower AUC than M2, it shows higher TPR values than M2 for small FPR values (smaller than 0.36). In the presence of class imbalance, the behavior of
an algorithm over small FPR values (also termed early retrieval) is often more meaningful for comparison than the performance over all FPR values. This is because, in many applications, it is important to assess the TPR achieved by a classifier in the first few instances with the highest scores, without incurring a large FPR. Hence, in Figure 4.54, due to the high TPR values of M1 during early retrieval (FPR < 0.36), we may prefer M1 over M2 for imbalanced test sets, despite the lower AUC of M1. Hence, care must be taken while comparing the AUC values of different classifiers, usually by visualizing their ROC curves rather than just reporting their AUC.
AkeycharacteristicofROCcurvesisthattheyareagnostictotheskewinthetestset,becauseboththeevaluationmeasuresusedinconstructingROCcurves(TPRandFPR)areinvarianttoclassimbalance.Hence,ROCcurvesarenotsuitableformeasuringtheimpactofskewonclassificationperformance.Inparticular,wewillobtainthesameROCcurvefortwotestdatasetsthathaveverydifferentskew.
Figure4.55.PRcurvesfortwodifferentclassifiers.
Precision-Recall Curve. An alternate tool for aggregate evaluation is the precision-recall curve (PR curve). The PR curve plots the precision and recall values of a classifier on the y and x axes respectively, by varying the threshold on the test scores. Figure 4.55 shows an example of PR curves for two hypothetical classifiers, M1 and M2. The approach for generating a PR curve is similar to the approach described above for generating an ROC curve. However, there are some key distinguishing features in the PR curve:

1. PR curves are sensitive to the skew factor α = P/(P + N), and different PR curves are generated for different values of α.

2. When the score threshold is lowest (every instance is labeled as positive), the precision is equal to α while the recall is 1. As we increase the score threshold, the number of predicted positives can stay the same or decrease. Hence, the recall monotonically declines as the score threshold increases. In general, the precision may increase or decrease for the same value of recall, upon addition of an instance into the set of predicted positives. For example, if the k-th ranked instance belongs to the negative class, then including it will result in a drop in the precision without affecting the recall. The precision may improve at the next step, which adds the (k+1)-th ranked instance, if this instance belongs to the positive class. Hence, the PR curve is not a smooth, monotonically increasing curve like the ROC curve, and generally has a zigzag pattern. This pattern is more prominent in the left part of the curve, where even a small change in the number of false positives can cause a large change in precision.

3. As we increase the imbalance among the classes (reduce the value of α), the rightmost points of all PR curves will move downwards. At and near the leftmost point on the PR curve (corresponding to larger values of the score threshold), the recall is close to zero, while the precision is equal to the fraction of positives in the top ranked instances of the algorithm. Hence, different classifiers can have different values of precision at the leftmost points of the PR curve. Also, if the classification score of an algorithm monotonically varies with the posterior probability of the positive class, we can expect the PR curve to gradually decrease from a high value of precision on the leftmost point to a constant value of α at the rightmost point, albeit with some ups and downs. This can be observed in the PR curve of algorithm M1 in Figure 4.55, which starts from a higher value of precision on the left that gradually decreases as we move towards the right. On the other hand, the PR curve of algorithm M2 starts from a lower value of precision on the left and shows more drastic ups and downs as we move right, suggesting that the classification score of M2 shows a weaker monotonic relationship with the posterior probability of the positive class.

4. A random classifier that assigns an instance to be positive with a fixed probability p has a precision of α and a recall of p. Hence, a classifier that performs random guessing has a horizontal PR curve with y = α, as shown using a dashed line in Figure 4.55. Note that the random baseline in PR curves depends on the skew in the test set, in contrast to the fixed main diagonal of ROC curves that represents random classifiers.

5. Note that the precision of an algorithm is impacted more strongly by false positives in the top ranked test instances than the FPR of the algorithm. For this reason, the PR curve generally helps to magnify the differences between classifiers in the left portion of the PR curve. Hence, in the presence of class imbalance in the test data, analyzing the PR curves generally provides a deeper insight into the performance of classifiers than the ROC curves, especially in the early retrieval range of operating points.

6. The classifier corresponding to (precision = 1, recall = 1) represents the perfect classifier. Similar to AUC, we can also compute the area under the PR curve of an algorithm, known as AUC-PR. The AUC-PR of a random classifier is equal to α, while that of a perfect algorithm is equal to 1. Note that AUC-PR varies with changing skew in the test set, in contrast to the area under the ROC curve, which is insensitive to the skew. The AUC-PR helps in accentuating the differences between classification algorithms in the early retrieval range of operating points. Hence, it is more suited for evaluating classification performance in the presence of class imbalance than the area under the ROC curve. However, similar to ROC curves, a higher value of AUC-PR does not guarantee the superiority of one classification algorithm over another. This is because the PR curves of two algorithms can easily cross each other, such that they both show better performances in different ranges of operating points. Hence, it is important to visualize the PR curves before comparing their AUC-PR values, in order to ensure a meaningful evaluation.
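For completeness, a small Python sketch (with hypothetical scores and labels) that traces the precision-recall points by lowering the score threshold one instance at a time, together with the random baseline y = α discussed in item 4:

```python
import numpy as np

def pr_points(scores, labels):
    """Precision-recall points obtained by lowering the score threshold one
    instance at a time (labels are 0/1)."""
    order = np.argsort(scores)[::-1]            # decreasing order of scores
    labels = np.asarray(labels)[order]
    pos = labels.sum()
    tp = fp = 0
    points = []
    for lab in labels:                          # predict the next-highest-scoring instance positive
        tp += lab
        fp += 1 - lab
        points.append((tp / pos, tp / (tp + fp)))   # (recall, precision)
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = [1,    1,    0,    1,    0,    0,    1,    0,    0,    1]
alpha = sum(labels) / len(labels)               # precision of a random classifier
print("random baseline precision =", alpha)
for recall, precision in pr_points(scores, labels):
    print(f"recall={recall:.1f}  precision={precision:.2f}")
```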
4.12 Multiclass Problem

Some of the classification techniques described in this chapter are originally designed for binary classification problems. Yet there are many real-world problems, such as character recognition, face identification, and text classification, where the input data is divided into more than two categories. This section presents several approaches for extending the binary classifiers to handle multiclass problems. To illustrate these approaches, let Y = {y1, y2, …, yK} be the set of classes of the input data.

The first approach decomposes the multiclass problem into K binary problems. For each class yi ∈ Y, a binary problem is created where all instances that belong to yi are considered positive examples, while the remaining instances are considered negative examples. A binary classifier is then constructed to separate instances of class yi from the rest of the classes. This is known as the one-against-rest (1-r) approach.

The second approach, which is known as the one-against-one (1-1) approach, constructs K(K − 1)/2 binary classifiers, where each classifier is used to distinguish between a pair of classes, (yi, yj). Instances that do not belong to either yi or yj are ignored when constructing the binary classifier for (yi, yj). In both the 1-r and 1-1 approaches, a test instance is classified by combining the predictions made by the binary classifiers. A voting scheme is typically employed to combine the predictions, where the class that receives the highest number of votes is assigned to the test instance. In the 1-r approach, if an instance is classified as negative, then all classes except for the positive class receive a vote. This approach, however, may lead to ties among the different classes. Another possibility is to transform the outputs of the binary
classifiersintoprobabilityestimatesandthenassignthetestinstancetotheclassthathasthehighestprobability.
Example 4.11. Consider a multiclass problem where Y = {y1, y2, y3, y4}. Suppose a test instance is classified as (+, −, −, −) according to the 1-r approach. In other words, it is classified as positive when y1 is used as the positive class and negative when y2, y3, and y4 are used as the positive class. Using a simple majority vote, notice that y1 receives the highest number of votes, which is four, while the remaining classes receive only three votes. The test instance is therefore classified as y1.

Example 4.12. Suppose the test instance is classified using the 1-1 approach as follows:

Binary pair of classes:  +: y1, −: y2 | +: y1, −: y3 | +: y1, −: y4 | +: y2, −: y3 | +: y2, −: y4 | +: y3, −: y4
Classification:                +      |       +      |       −      |       +      |       −      |       +

The first two rows in this table correspond to the pair of classes (yi, yj) chosen to build the classifier and the last row represents the predicted class for the test instance. After combining the predictions, y1 and y4 each receive two votes, while y2 and y3 each receive only one vote. The test instance is therefore classified as either y1 or y4, depending on the tie-breaking procedure.

Error-Correcting Output Coding
A potential problem with the previous two approaches is that they may be sensitive to binary classification errors. For the 1-r approach given in Example 4.11, if at least one of the binary classifiers makes a mistake in its prediction, then the classifier may end up declaring a tie between classes or making a wrong prediction. For example, suppose the test instance is classified as (+, −, +, −) due to misclassification by the third classifier. In this case, it will be difficult to tell whether the instance should be classified as y1 or y3, unless the probability associated with each class prediction is taken into account.

The error-correcting output coding (ECOC) method provides a more robust way for handling multiclass problems. The method is inspired by an information-theoretic approach for sending messages across noisy channels. The idea behind this approach is to add redundancy into the transmitted message by means of a codeword, so that the receiver may detect errors in the received message and perhaps recover the original message if the number of errors is small.

For multiclass learning, each class yi is represented by a unique bit string of length n known as its codeword. We then train n binary classifiers to predict each bit of the codeword string. The predicted class of a test instance is given by the codeword whose Hamming distance is closest to the codeword produced by the binary classifiers. Recall that the Hamming distance between a pair of bit strings is given by the number of bits that differ.

Example 4.13. Consider a multiclass problem where Y = {y1, y2, y3, y4}. Suppose we encode the classes using the following seven-bit codewords:
Class    Codeword
y1       1 1 1 1 1 1 1
y2       0 0 0 0 1 1 1
y3       0 0 1 1 0 0 1
y4       0 1 0 1 0 1 0

Each bit of the codeword is used to train a binary classifier. If a test instance is classified as (0,1,1,1,1,1,1) by the binary classifiers, then the Hamming distance between this codeword and y1 is 1, while the Hamming distance to the remaining classes is 3. The test instance is therefore classified as y1.

An interesting property of an error-correcting code is that if the minimum Hamming distance between any pair of codewords is d, then any ⌊(d − 1)/2⌋ errors in the output code can be corrected using its nearest codeword. In Example 4.13, because the minimum Hamming distance between any pair of codewords is 4, the classifier may tolerate errors made by one of the seven binary classifiers. If there is more than one classifier that makes a mistake, then the classifier may not be able to compensate for the error.

An important issue is how to design the appropriate set of codewords for different classes. From coding theory, a vast number of algorithms have been developed for generating n-bit codewords with bounded Hamming distance. However, the discussion of these algorithms is beyond the scope of this book. It is worthwhile mentioning that there is a significant difference between the design of error-correcting codes for communication tasks compared to those used for multiclass learning. For communication, the codewords should maximize the Hamming distance between the rows so that error correction
canbeperformed.Multiclasslearning,however,requiresthatboththerow-wiseandcolumn-wisedistancesofthecodewordsmustbewellseparated.Alargercolumn-wisedistanceensuresthatthebinaryclassifiersaremutuallyindependent,whichisanimportantrequirementforensemblelearningmethods.
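To illustrate the decoding step of ECOC, the following Python sketch uses the codewords of Example 4.13 and assigns the class whose codeword is closest in Hamming distance to a hypothetical bit vector produced by the seven binary classifiers:

```python
import numpy as np

# Codewords from Example 4.13: one row per class y1..y4, seven bits each
codewords = np.array([
    [1, 1, 1, 1, 1, 1, 1],   # y1
    [0, 0, 0, 0, 1, 1, 1],   # y2
    [0, 0, 1, 1, 0, 0, 1],   # y3
    [0, 1, 0, 1, 0, 1, 0],   # y4
])
classes = ["y1", "y2", "y3", "y4"]

def ecoc_decode(output_bits):
    """Assign the class whose codeword has the smallest Hamming distance
    to the bit vector produced by the binary classifiers."""
    distances = np.sum(codewords != np.asarray(output_bits), axis=1)
    return classes[int(np.argmin(distances))], distances

# The bit vector from the example: only the first binary classifier erred
predicted_class, distances = ecoc_decode([0, 1, 1, 1, 1, 1, 1])
print(predicted_class, distances.tolist())   # y1 [1, 3, 3, 3]
```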
4.13BibliographicNotesMitchell[278]providesexcellentcoverageonmanyclassificationtechniquesfromamachinelearningperspective.ExtensivecoverageonclassificationcanalsobefoundinAggarwal[195],Dudaetal.[229],Webb[307],Fukunaga[237],Bishop[204],Hastieetal.[249],CherkasskyandMulier[215],WittenandFrank[310],Handetal.[247],HanandKamber[244],andDunham[230].
Directmethodsforrule-basedclassifierstypicallyemploythesequentialcoveringschemeforinducingclassificationrules.Holte's1R[255]isthesimplestformofarule-basedclassifierbecauseitsrulesetcontainsonlyasinglerule.Despiteitssimplicity,Holtefoundthatforsomedatasetsthatexhibitastrongone-to-onerelationshipbetweentheattributesandtheclasslabel,1Rperformsjustaswellasotherclassifiers.Otherexamplesofrule-basedclassifiersincludeIREP[234],RIPPER[218],CN2[216,217],AQ[276],RISE[224],andITRULE[296].Table4.8 showsacomparisonofthecharacteristicsoffouroftheseclassifiers.
Table 4.8. Comparison of various rule-based classifiers.

                                        RIPPER                        CN2 (unordered)       CN2 (ordered)                  AQR
Rule-growing strategy                   General-to-specific           General-to-specific   General-to-specific            General-to-specific (seeded by a positive example)
Evaluation metric                       FOIL's Info gain              Laplace               Entropy and likelihood ratio   Number of true positives
Stopping condition for rule-growing     All examples belong to        No performance gain   No performance gain            Rules cover only positive class
                                        the same class
Rule pruning                            Reduced error pruning         None                  None                           None
Instance elimination                    Positive and negative         Positive only         Positive only                  Positive and negative
Stopping condition for adding rules     Error > 50% or based on MDL   No performance gain   No performance gain            All positive examples are covered
Rule set pruning                        Replace or modify rules       Statistical tests     None                           None
Search strategy                         Greedy                        Beam search           Beam search                    Beam search
Forrule-basedclassifiers,theruleantecedentcanbegeneralizedtoincludeanypropositionalorfirst-orderlogicalexpression(e.g.,Hornclauses).Readerswhoareinterestedinfirst-orderlogicrule-basedclassifiersmayrefertoreferencessuchas[278]orthevastliteratureoninductivelogicprogramming[279].Quinlan[287]proposedtheC4.5rulesalgorithmforextractingclassificationrulesfromdecisiontrees.AnindirectmethodforextractingrulesfromartificialneuralnetworkswasgivenbyAndrewsetal.in[198].
CoverandHart[220]presentedanoverviewofthenearestneighborclassificationmethodfromaBayesianperspective.Ahaprovidedboththeoreticalandempiricalevaluationsforinstance-basedmethodsin[196].PEBLS,whichwasdevelopedbyCostandSalzberg[219],isanearestneighborclassifierthatcanhandledatasetscontainingnominalattributes.
EachtrainingexampleinPEBLSisalsoassignedaweightfactorthatdependsonthenumberoftimestheexamplehelpsmakeacorrectprediction.Hanetal.[243]developedaweight-adjustednearestneighboralgorithm,inwhichthefeatureweightsarelearnedusingagreedy,hill-climbingoptimizationalgorithm.Amorerecentsurveyofk-nearestneighborclassificationisgivenbySteinbachandTan[298].
NaïveBayesclassifiershavebeeninvestigatedbymanyauthors,includingLangleyetal.[267],RamoniandSebastiani[288],Lewis[270],andDomingosandPazzani[227].AlthoughtheindependenceassumptionusedinnaïveBayesclassifiersmayseemratherunrealistic,themethodhasworkedsurprisinglywellforapplicationssuchastextclassification.Bayesiannetworksprovideamoreflexibleapproachbyallowingsomeoftheattributestobeinterdependent.AnexcellenttutorialonBayesiannetworksisgivenbyHeckermanin[252]andJensenin[258].Bayesiannetworksbelongtoabroaderclassofmodelsknownasprobabilisticgraphicalmodels.AformalintroductiontotherelationshipsbetweengraphsandprobabilitieswaspresentedinPearl[283].OthergreatresourcesonprobabilisticgraphicalmodelsincludebooksbyBishop[205],andJordan[259].Detaileddiscussionsofconceptssuchasd-separationandMarkovblanketsareprovidedinGeigeretal.[238]andRussellandNorvig[291].
Generalizedlinearmodels(GLM)arearichclassofregressionmodelsthathavebeenextensivelystudiedinthestatisticalliterature.TheywereformulatedbyNelderandWedderburnin1972[280]tounifyanumberofregressionmodelssuchaslinearregression,logisticregression,andPoissonregression,whichsharesomesimilaritiesintheirformulations.AnextensivediscussionofGLMsisprovidedinthebookbyMcCullaghandNelder[274].
Artificialneuralnetworks(ANN)havewitnessedalongandwindinghistoryofdevelopments,involvingmultiplephasesofstagnationandresurgence.The
ideaofamathematicalmodelofaneuralnetworkwasfirstintroducedin1943byMcCullochandPitts[275].Thisledtoaseriesofcomputationalmachinestosimulateaneuralnetworkbasedonthetheoryofneuralplasticity[289].Theperceptron,whichisthesimplestprototypeofmodernANNs,wasdevelopedbyRosenblattin1958[290].Theperceptronusesasinglelayerofprocessingunitsthatcanperformbasicmathematicaloperationssuchasadditionandmultiplication.However,theperceptroncanonlylearnlineardecisionboundariesandisguaranteedtoconvergeonlywhentheclassesarelinearlyseparable.Despitetheinterestinlearningmulti-layernetworkstoovercomethelimitationsofperceptron,progressinthisarearemainhalteduntiltheinventionofthebackpropagationalgorithmbyWerbosin1974[309],whichallowedforthequicktrainingofmulti-layerANNsusingthegradientdescentmethod.Thisledtoanupsurgeofinterestintheartificialintelligence(AI)communitytodevelopmulti-layerANNmodels,atrendthatcontinuedformorethanadecade.Historically,ANNsmarkaparadigmshiftinAIfromapproachesbasedonexpertsystems(whereknowledgeisencodedusingif-thenrules)tomachinelearningapproaches(wheretheknowledgeisencodedintheparametersofacomputationalmodel).However,therewerestillanumberofalgorithmicandcomputationalchallengesinlearninglargeANNmodels,whichremainedunresolvedforalongtime.ThishinderedthedevelopmentofANNmodelstothescalenecessaryforsolvingreal-worldproblems.Slowly,ANNsstartedgettingoutpacedbyotherclassificationmodelssuchassupportvectormachines,whichprovidedbetterperformanceaswellastheoreticalguaranteesofconvergenceandoptimality.Itisonlyrecentlythatthechallengesinlearningdeepneuralnetworkshavebeencircumvented,owingtobettercomputationalresourcesandanumberofalgorithmicimprovementsinANNssince2006.Thisre-emergenceofANNhasbeendubbedas“deeplearning,”whichhasoftenoutperformedexistingclassificationmodelsandgainedwide-spreadpopularity.
Deeplearningisarapidlyevolvingareaofresearchwithanumberofpotentiallyimpactfulcontributionsbeingmadeeveryyear.Someofthelandmarkadvancementsindeeplearningincludetheuseoflarge-scalerestrictedBoltzmannmachinesforlearninggenerativemodelsofdata[201,253],theuseofautoencodersanditsvariants(denoisingautoencoders)forlearningrobustfeaturerepresentations[199,305,306],andsophisticalarchitecturestopromotesharingofparametersacrossnodessuchasconvolutionalneuralnetworksforimages[265,268]andrecurrentneuralnetworksforsequences[241,242,277].OthermajorimprovementsincludetheapproachofunsupervisedpretrainingforinitializingANNmodels[232],thedropouttechniqueforregularization[254,297],batchnormalizationforfastlearningofANNparameters[256],andmaxoutnetworksforeffectiveusageofthedropouttechnique[240].EventhoughthediscussionsinthischapteronlearningANNmodelswerecenteredaroundthegradientdescentmethod,mostofthemodernANNmodelsinvolvingalargenumberofhiddenlayersaretrainedusingthestochasticgradientdescentmethodsinceitishighlyscalable[207].AnextensivesurveyofdeeplearningapproacheshasbeenpresentedinreviewarticlesbyBengio[200],LeCunetal.[269],andSchmidhuber[293].AnexcellentsummaryofdeeplearningapproachescanalsobeobtainedfromrecentbooksbyGoodfellowetal.[239]andNielsen[281].
Vapnik [303, 304] has written two authoritative books on Support Vector Machines (SVM). Other useful resources on SVM and kernel methods include the books by Cristianini and Shawe-Taylor [221] and Schölkopf and Smola [294]. There are several survey articles on SVM, including those written by Burges [212], Bennet et al. [202], Hearst [251], and Mangasarian [272]. SVM can also be viewed as an L2 norm regularizer of the hinge loss function, as described in detail by Hastie et al. [249]. The L1 norm regularizer of the square loss function can be obtained using the least absolute shrinkage and selection operator (Lasso), which was introduced by Tibshirani in 1996 [301].
TheLassohasseveralinterestingpropertiessuchastheabilitytosimultaneouslyperformfeatureselectionaswellasregularization,sothatonlyasubsetoffeaturesareselectedinthefinalmodel.AnexcellentreviewofLassocanbeobtainedfromabookbyHastieetal.[250].
AsurveyofensemblemethodsinmachinelearningwasgivenbyDiet-terich[222].ThebaggingmethodwasproposedbyBreiman[209].FreundandSchapire[236]developedtheAdaBoostalgorithm.Arcing,whichstandsforadaptiveresamplingandcombining,isavariantoftheboostingalgorithmproposedbyBreiman[210].Itusesthenon-uniformweightsassignedtotrainingexamplestoresamplethedataforbuildinganensembleoftrainingsets.UnlikeAdaBoost,thevotesofthebaseclassifiersarenotweightedwhendeterminingtheclasslabeloftestexamples.TherandomforestmethodwasintroducedbyBreimanin[211].Theconceptofbias-variancedecompositionisexplainedindetailbyHastieetal.[249].Whilethebias-variancedecompositionwasinitiallyproposedforregressionproblemswithsquaredlossfunction,aunifiedframeworkforclassificationproblemsinvolving0–1losseswasintroducedbyDomingos[226].
RelatedworkonminingrareandimbalanceddatasetscanbefoundinthesurveypaperswrittenbyChawlaetal.[214]andWeiss[308].Sampling-basedmethodsforminingimbalanceddatasetshavebeeninvestigatedbymanyauthors,suchasKubatandMatwin[266],Japkowitz[257],andDrummondandHolte[228].Joshietal.[261]discussedthelimitationsofboostingalgorithmsforrareclassmodeling.OtheralgorithmsdevelopedforminingrareclassesincludeSMOTE[213],PNrule[260],andCREDOS[262].
Various alternative metrics that are well-suited for class imbalanced problems are available. The precision, recall, and F1-measure are widely-used metrics in information retrieval [302]. ROC analysis was originally used in signal detection theory for performing aggregate evaluation over a range of score
thresholds.AmethodforcomparingclassifierperformanceusingtheconvexhullofROCcurveswassuggestedbyProvostandFawcettin[286].Bradley[208]investigatedtheuseofareaundertheROCcurve(AUC)asaperformancemetricformachinelearningalgorithms.DespitethevastbodyofliteratureonoptimizingtheAUCmeasureinmachinelearningmodels,itiswell-knownthatAUCsuffersfromcertainlimitations.Forexample,theAUCcanbeusedtocomparethequalityoftwoclassifiersonlyiftheROCcurveofoneclassifierstrictlydominatestheother.However,iftheROCcurvesoftwoclassifiersintersectatanypoint,thenitisdifficulttoassesstherelativequalityofclassifiersusingtheAUCmeasure.Anin-depthdiscussionofthepitfallsinusingAUCasaperformancemeasurecanbeobtainedinworksbyHand[245,246],andPowers[284].TheAUChasalsobeenconsideredtobeanincoherentmeasureofperformance,i.e.,itusesdifferentscaleswhilecomparingtheperformanceofdifferentclassifiers,althoughacoherentinterpretationofAUChasbeenprovidedbyFerrietal.[235].BerrarandFlach[203]describesomeofthecommoncaveatsinusingtheROCcurveforclinicalmicroarrayresearch.Analternateapproachformeasuringtheaggregateperformanceofaclassifieristheprecision-recall(PR)curve,whichisespeciallyusefulinthepresenceofclassimbalance[292].
Anexcellenttutorialoncost-sensitivelearningcanbefoundinareviewarticlebyLingandSheng[271].ThepropertiesofacostmatrixhadbeenstudiedbyElkanin[231].MargineantuandDietterich[273]examinedvariousmethodsforincorporatingcostinformationintotheC4.5learningalgorithm,includingwrappermethods,classdistribution-basedmethods,andloss-basedmethods.Othercost-sensitivelearningmethodsthatarealgorithm-independentincludeAdaCost[233],MetaCost[225],andcosting[312].
Extensiveliteratureisalsoavailableonthesubjectofmulticlasslearning.ThisincludestheworksofHastieandTibshirani[248],Allweinetal.[197],KongandDietterich[264],andTaxandDuin[300].Theerror-correctingoutput
coding(ECOC)methodwasproposedbyDietterichandBakiri[223].Theyhadalsoinvestigatedtechniquesfordesigningcodesthataresuitableforsolvingmulticlassproblems.
Apartfromexploringalgorithmsfortraditionalclassificationsettingswhereeveryinstancehasasinglesetoffeatureswithauniquecategoricallabel,therehasbeenalotofrecentinterestinnon-traditionalclassificationparadigms,involvingcomplexformsofinputsandoutputs.Forexample,theparadigmofmulti-labellearningallowsforaninstancetobeassignedmultipleclasslabelsratherthanjustone.Thisisusefulinapplicationssuchasobjectrecognitioninimages,whereaphotoimagemayincludemorethanoneclassificationobject,suchas,grass,sky,trees,andmountains.Asurveyonmulti-labellearningcanbefoundin[313].Asanotherexample,theparadigmofmulti-instancelearningconsiderstheproblemwheretheinstancesareavailableintheformofgroupscalledbags,andtraininglabelsareavailableatthelevelofbagsratherthanindividualinstances.Multi-instancelearningisusefulinapplicationswhereanobjectcanexistasmultipleinstancesindifferentstates(e.g.,thedifferentisomersofachemicalcompound),andevenifasingleinstanceshowsaspecificcharacteristic,theentirebagofinstancesassociatedwiththeobjectneedstobeassignedtherelevantclass.Asurveyonmulti-instancelearningisprovidedin[314].
Inanumberofreal-worldapplications,itisoftenthecasethatthetraininglabelsarescarceinquantity,becauseofthehighcostsassociatedwithobtaininggold-standardsupervision.However,wealmostalwayshaveabundantaccesstounlabeledtestinstances,whichdonothavesupervisedlabelsbutcontainvaluableinformationaboutthestructureordistributionofinstances.Traditionallearningalgorithms,whichonlymakeuseofthelabeledinstancesinthetrainingsetforlearningthedecisionboundary,areunabletoexploittheinformationcontainedinunlabeledinstances.Incontrast,learningalgorithmsthatmakeuseofthestructureintheunlabeleddataforlearningthe
classificationmodelareknownassemi-supervisedlearningalgorithms[315,316].Theuseofunlabeleddataisalsoexploredintheparadigmofmulti-viewlearning[299,311],whereeveryobjectisobservedinmultipleviewsofthedata,involvingdiversesetsoffeatures.Acommonstrategyusedbymulti-viewlearningalgorithmsisco-training[206],whereadifferentmodelislearnedforeveryviewofthedata,butthemodelpredictionsfromeveryviewareconstrainedtobeidenticaltoeachotherontheunlabeledtestinstances.
Anotherlearningparadigmthatiscommonlyexploredinthepaucityoftrainingdataistheframeworkofactivelearning,whichattemptstoseekthesmallestsetoflabelannotationstolearnareasonableclassificationmodel.Activelearningexpectstheannotatortobeinvolvedintheprocessofmodellearning,sothatthelabelsarerequestedincrementallyoverthemostrelevantsetofinstances,givenalimitedbudgetoflabelannotations.Forexample,itmaybeusefultoobtainlabelsoverinstancesclosertothedecisionboundarythatcanplayabiggerroleinfine-tuningtheboundary.Areviewonactivelearningapproachescanbefoundin[285,295].
Insomeapplications,itisimportanttosimultaneouslysolvemultiplelearningtaskstogether,wheresomeofthetasksmaybesimilartooneanother.Forexample,ifweareinterestedintranslatingapassagewritteninEnglishintodifferentlanguages,thetasksinvolvinglexicallysimilarlanguages(suchasSpanishandPortuguese)wouldrequiresimilarlearningofmodels.Theparadigmofmulti-tasklearninghelpsinsimultaneouslylearningacrossalltaskswhilesharingthelearningamongrelatedtasks.Thisisespeciallyusefulwhensomeofthetasksdonotcontainsufficientlymanytrainingsamples,inwhichcaseborrowingthelearningfromotherrelatedtaskshelpsinthelearningofrobustmodels.Aspecialcaseofmulti-tasklearningistransferlearning,wherethelearningfromasourcetask(withsufficientnumberoftrainingsamples)hastobetransferredtoadestinationtask(withpaucityof
trainingdata).AnextensivesurveyoftransferlearningapproachesisprovidedbyPanetal.[282].
Mostclassifiersassumeeverydatainstancemustbelongtoaclass,whichisnotalwaystrueforsomeapplications.Forexample,inmalwaredetection,duetotheeaseinwhichnewmalwaresarecreated,aclassifiertrainedonexistingclassesmayfailtodetectnewonesevenifthefeaturesforthenewmalwaresareconsiderablydifferentthanthoseforexistingmalwares.Anotherexampleisincriticalapplicationssuchasmedicaldiagnosis,wherepredictionerrorsarecostlyandcanhavesevereconsequences.Inthissituation,itwouldbebetterfortheclassifiertorefrainfrommakinganypredictiononadatainstanceifitisunsureofitsclass.Thisapproach,knownasclassifierwithrejectoption,doesnotneedtoclassifyeverydatainstanceunlessitdeterminesthepredictionisreliable(e.g.,iftheclassprobabilityexceedsauser-specifiedthreshold).Instancesthatareunclassifiedcanbepresentedtodomainexpertsforfurtherdeterminationoftheirtrueclasslabels.
Classifierscanalsobedistinguishedintermsofhowtheclassificationmodelistrained.Abatchclassifierassumesallthelabeledinstancesareavailableduringtraining.Thisstrategyisapplicablewhenthetrainingsetsizeisnottoolargeandforstationarydata,wheretherelationshipbetweentheattributesandclassesdoesnotvaryovertime.Anonlineclassifier,ontheotherhand,trainsaninitialmodelusingasubsetofthelabeleddata[263].Themodelisthenupdatedincrementallyasmorelabeledinstancesbecomeavailable.Thisstrategyiseffectivewhenthetrainingsetistoolargeorwhenthereisconceptdriftduetochangesinthedistributionofthedataovertime.
Bibliography[195]C.C.Aggarwal.Dataclassification:algorithmsandapplications.CRC
Press,2014.
[196]D.W.Aha.Astudyofinstance-basedalgorithmsforsupervisedlearningtasks:mathematical,empirical,andpsychologicalevaluations.PhDthesis,UniversityofCalifornia,Irvine,1990.
[197]E.L.Allwein,R.E.Schapire,andY.Singer.ReducingMulticlasstoBinary:AUnifyingApproachtoMarginClassifiers.JournalofMachineLearningResearch,1:113–141,2000.
[198]R.Andrews,J.Diederich,andA.Tickle.ASurveyandCritiqueofTechniquesForExtractingRulesFromTrainedArtificialNeuralNetworks.KnowledgeBasedSystems,8(6):373–389,1995.
[199]P.Baldi.Autoencoders,unsupervisedlearning,anddeeparchitectures.ICMLunsupervisedandtransferlearning,27(37-50):1,2012.
[200]Y.Bengio.LearningdeeparchitecturesforAI.FoundationsandtrendsRinMachineLearning,2(1):1–127,2009.
[201]Y.Bengio,A.Courville,andP.Vincent.Representationlearning:Areviewandnewperspectives.IEEEtransactionsonpatternanalysisand
machineintelligence,35(8):1798–1828,2013.
[202]K.BennettandC.Campbell.SupportVectorMachines:HypeorHallelujah.SIGKDDExplorations,2(2):1–13,2000.
[203]D.BerrarandP.Flach.CaveatsandpitfallsofROCanalysisinclinicalmicroarrayresearch(andhowtoavoidthem).Briefingsinbioinformatics,pagebbr008,2011.
[204]C.M.Bishop.NeuralNetworksforPatternRecognition.OxfordUniversityPress,Oxford,U.K.,1995.
[205]C.M.Bishop.PatternRecognitionandMachineLearning.Springer,2006.
[206]A.BlumandT.Mitchell.Combininglabeledandunlabeleddatawithco-training.InProceedingsoftheeleventhannualconferenceonComputationallearningtheory,pages92–100.ACM,1998.
[207]L.Bottou.Large-scalemachinelearningwithstochasticgradientdescent.InProceedingsofCOMPSTAT'2010,pages177–186.Springer,2010.
[208]A.P.Bradley.TheuseoftheareaundertheROCcurveintheEvaluationofMachineLearningAlgorithms.PatternRecognition,30(7):1145–1149,1997.
[209]L.Breiman.BaggingPredictors.MachineLearning,24(2):123–140,1996.
[210]L.Breiman.Bias,Variance,andArcingClassifiers.TechnicalReport486,UniversityofCalifornia,Berkeley,CA,1996.
[211]L.Breiman.RandomForests.MachineLearning,45(1):5–32,2001.
[212]C.J.C.Burges.ATutorialonSupportVectorMachinesforPatternRecognition.DataMiningandKnowledgeDiscovery,2(2):121–167,1998.
[213]N.V.Chawla,K.W.Bowyer,L.O.Hall,andW.P.Kegelmeyer.SMOTE:SyntheticMinorityOver-samplingTechnique.JournalofArtificialIntelligenceResearch,16:321–357,2002.
[214]N.V.Chawla,N.Japkowicz,andA.Kolcz.Editorial:SpecialIssueonLearningfromImbalancedDataSets.SIGKDDExplorations,6(1):1–6,2004.
[215]V.CherkasskyandF.Mulier.LearningfromData:Concepts,Theory,andMethods.WileyInterscience,1998.
[216]P.ClarkandR.Boswell.RuleInductionwithCN2:SomeRecentImprovements.InMachineLearning:Proc.ofthe5thEuropeanConf.(EWSL-91),pages151–163,1991.
[217]P.ClarkandT.Niblett.TheCN2InductionAlgorithm.MachineLearning,3(4):261–283,1989.
[218]W.W.Cohen.FastEffectiveRuleInduction.InProc.ofthe12thIntl.Conf.onMachineLearning,pages115–123,TahoeCity,CA,July1995.
[219]S.CostandS.Salzberg.AWeightedNearestNeighborAlgorithmforLearningwithSymbolicFeatures.MachineLearning,10:57–78,1993.
[220] T. M. Cover and P. E. Hart. Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[221]N.CristianiniandJ.Shawe-Taylor.AnIntroductiontoSupportVectorMachinesandOtherKernel-basedLearningMethods.CambridgeUniversityPress,2000.
[222]T.G.Dietterich.EnsembleMethodsinMachineLearning.InFirstIntl.WorkshoponMultipleClassifierSystems,Cagliari,Italy,2000.
[223]T.G.DietterichandG.Bakiri.SolvingMulticlassLearningProblemsviaError-CorrectingOutputCodes.JournalofArtificialIntelligenceResearch,2:263–286,1995.
[224]P.Domingos.TheRISEsystem:Conqueringwithoutseparating.InProc.ofthe6thIEEEIntl.Conf.onToolswithArtificialIntelligence,pages704–707,NewOrleans,LA,1994.
[225]P.Domingos.MetaCost:AGeneralMethodforMakingClassifiersCost-Sensitive.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages155–164,SanDiego,CA,August1999.
[226]P.Domingos.Aunifiedbias-variancedecomposition.InProceedingsof17thInternationalConferenceonMachineLearning,pages231–238,2000.
[227]P.DomingosandM.Pazzani.OntheOptimalityoftheSimpleBayesianClassifierunderZero-OneLoss.MachineLearning,29(2-3):103–130,1997.
[228]C.DrummondandR.C.Holte.C4.5,Classimbalance,andCostsensitivity:Whyunder-samplingbeatsover-sampling.InICML'2004WorkshoponLearningfromImbalancedDataSetsII,Washington,DC,August2003.
[229]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,2ndedition,2001.
[230]M.H.Dunham.DataMining:IntroductoryandAdvancedTopics.PrenticeHall,2006.
[231]C.Elkan.TheFoundationsofCost-SensitiveLearning.InProc.ofthe17thIntl.JointConf.onArtificialIntelligence,pages973–978,Seattle,WA,August2001.
[232]D.Erhan,Y.Bengio,A.Courville,P.-A.Manzagol,P.Vincent,andS.Bengio.Whydoesunsupervisedpre-traininghelpdeeplearning?JournalofMachineLearningResearch,11(Feb):625–660,2010.
[233]W.Fan,S.J.Stolfo,J.Zhang,andP.K.Chan.AdaCost:misclassificationcost-sensitiveboosting.InProc.ofthe16thIntl.Conf.onMachineLearning,pages97–105,Bled,Slovenia,June1999.
[234]J.FürnkranzandG.Widmer.Incrementalreducederrorpruning.InProc.ofthe11thIntl.Conf.onMachineLearning,pages70–77,NewBrunswick,NJ,July1994.
[235]C.Ferri,J.Hernández-Orallo,andP.A.Flach.AcoherentinterpretationofAUCasameasureofaggregatedclassificationperformance.InProceedingsofthe28thInternationalConferenceonMachineLearning(ICML-11),pages657–664,2011.
[236]Y.FreundandR.E.Schapire.Adecision-theoreticgeneralizationofon-linelearningandanapplicationtoboosting.JournalofComputerandSystemSciences,55(1):119–139,1997.
[237]K.Fukunaga.IntroductiontoStatisticalPatternRecognition.AcademicPress,NewYork,1990.
[238]D.Geiger,T.S.Verma,andJ.Pearl.d-separation:Fromtheoremstoalgorithms.arXivpreprintarXiv:1304.1505,2013.
[239]I.Goodfellow,Y.Bengio,andA.Courville.DeepLearning.BookinpreparationforMITPress,2016.
[240]I.J.Goodfellow,D.Warde-Farley,M.Mirza,A.C.Courville,andY.Bengio.Maxoutnetworks.ICML(3),28:1319–1327,2013.
[241]A.Graves,M.Liwicki,S.Fernández,R.Bertolami,H.Bunke,andJ.Schmidhuber.Anovelconnectionistsystemforunconstrainedhandwritingrecognition.IEEEtransactionsonpatternanalysisandmachineintelligence,31(5):855–868,2009.
[242]A.GravesandJ.Schmidhuber.Offlinehandwritingrecognitionwithmultidimensionalrecurrentneuralnetworks.InAdvancesinneuralinformationprocessingsystems,pages545–552,2009.
[243]E.-H.Han,G.Karypis,andV.Kumar.TextCategorizationUsingWeightAdjustedk-NearestNeighborClassification.InProc.ofthe5thPacific-AsiaConf.onKnowledgeDiscoveryandDataMining,Lyon,France,2001.
[244]J.HanandM.Kamber.DataMining:ConceptsandTechniques.MorganKaufmannPublishers,SanFrancisco,2001.
[245]D.J.Hand.Measuringclassifierperformance:acoherentalternativetotheareaundertheROCcurve.Machinelearning,77(1):103–123,2009.
[246]D.J.Hand.Evaluatingdiagnostictests:theareaundertheROCcurveandthebalanceoferrors.Statisticsinmedicine,29(14):1502–1510,2010.
[247]D.J.Hand,H.Mannila,andP.Smyth.PrinciplesofDataMining.MITPress,2001.
[248]T.HastieandR.Tibshirani.Classificationbypairwisecoupling.AnnalsofStatistics,26(2):451–471,1998.
[249]T.Hastie,R.Tibshirani,andJ.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,andPrediction.Springer,2ndedition,2009.
[250]T.Hastie,R.Tibshirani,andM.Wainwright.Statisticallearningwithsparsity:thelassoandgeneralizations.CRCPress,2015.
[251]M.Hearst.Trends&Controversies:SupportVectorMachines.IEEEIntelligentSystems,13(4):18–28,1998.
[252]D.Heckerman.BayesianNetworksforDataMining.DataMiningandKnowledgeDiscovery,1(1):79–119,1997.
[253]G.E.HintonandR.R.Salakhutdinov.Reducingthedimensionalityofdatawithneuralnetworks.Science,313(5786):504–507,2006.
[254]G.E.Hinton,N.Srivastava,A.Krizhevsky,I.Sutskever,andR.R.Salakhutdinov.Improvingneuralnetworksbypreventingco-adaptationoffeaturedetectors.arXivpreprintarXiv:1207.0580,2012.
[255]R.C.Holte.VerySimpleClassificationRulesPerformWellonMostCommonlyUsedDatasets.MachineLearning,11:63–91,1993.
[256]S.IoffeandC.Szegedy.Batchnormalization:Acceleratingdeepnetworktrainingbyreducinginternalcovariateshift.arXivpreprintarXiv:1502.03167,2015.
[257]N.Japkowicz.TheClassImbalanceProblem:SignificanceandStrategies.InProc.ofthe2000Intl.Conf.onArtificialIntelligence:SpecialTrackonInductiveLearning,volume1,pages111–117,LasVegas,NV,June2000.
[258]F.V.Jensen.AnintroductiontoBayesiannetworks,volume210.UCLpressLondon,1996.
[259]M.I.Jordan.Learningingraphicalmodels,volume89.SpringerScience&BusinessMedia,1998.
[260]M.V.Joshi,R.C.Agarwal,andV.Kumar.MiningNeedlesinaHaystack:ClassifyingRareClassesviaTwo-PhaseRuleInduction.InProc.of2001ACM-SIGMODIntl.Conf.onManagementofData,pages91–102,SantaBarbara,CA,June2001.
[261]M.V.Joshi,R.C.Agarwal,andV.Kumar.Predictingrareclasses:canboostingmakeanyweaklearnerstrong?InProc.ofthe8thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages297–306,Edmonton,Canada,July2002.
[262]M.V.JoshiandV.Kumar.CREDOS:ClassificationUsingRippleDownStructure(ACaseforRareClasses).InProc.oftheSIAMIntl.Conf.onDataMining,pages321–332,Orlando,FL,April2004.
[263]J.Kivinen,A.J.Smola,andR.C.Williamson.Onlinelearningwithkernels.IEEEtransactionsonsignalprocessing,52(8):2165–2176,2004.
[264]E.B.KongandT.G.Dietterich.Error-CorrectingOutputCodingCorrectsBiasandVariance.InProc.ofthe12thIntl.Conf.onMachineLearning,pages313–321,TahoeCity,CA,July1995.
[265]A.Krizhevsky,I.Sutskever,andG.E.Hinton.Imagenetclassificationwithdeepconvolutionalneuralnetworks.InAdvancesinneuralinformationprocessingsystems,pages1097–1105,2012.
[266]M.KubatandS.Matwin.AddressingtheCurseofImbalancedTrainingSets:OneSidedSelection.InProc.ofthe14thIntl.Conf.onMachineLearning,pages179–186,Nashville,TN,July1997.
[267]P.Langley,W.Iba,andK.Thompson.AnanalysisofBayesianclassifiers.InProc.ofthe10thNationalConf.onArtificialIntelligence,pages223–228,1992.
[268]Y.LeCunandY.Bengio.Convolutionalnetworksforimages,speech,andtimeseries.Thehandbookofbraintheoryandneuralnetworks,3361(10):1995,1995.
[269]Y.LeCun,Y.Bengio,andG.Hinton.Deeplearning.Nature,521(7553):436–444,2015.
[270]D.D.Lewis.NaiveBayesatForty:TheIndependenceAssumptioninInformationRetrieval.InProc.ofthe10thEuropeanConf.onMachineLearning(ECML1998),pages4–15,1998.
[271]C.X.LingandV.S.Sheng.Cost-sensitivelearning.InEncyclopediaofMachineLearning,pages231–235.Springer,2011.
[272]O.Mangasarian.DataMiningviaSupportVectorMachines.TechnicalReportTechnicalReport01-05,DataMiningInstitute,May2001.
[273]D.D.MargineantuandT.G.Dietterich.LearningDecisionTreesforLossMinimizationinMulti-ClassProblems.TechnicalReport99-30-03,OregonStateUniversity,1999.
[274]P.McCullaghandJ.A.Nelder.Generalizedlinearmodels,volume37.CRCpress,1989.
[275]W.S.McCullochandW.Pitts.Alogicalcalculusoftheideasimmanentinnervousactivity.Thebulletinofmathematicalbiophysics,5(4):115–133,1943.
[276]R.S.Michalski,I.Mozetic,J.Hong,andN.Lavrac.TheMulti-PurposeIncrementalLearningSystemAQ15andItsTestingApplicationtoThree
MedicalDomains.InProc.of5thNationalConf.onArtificialIntelligence,Orlando,August1986.
[277]T.Mikolov,M.Karafiát,L.Burget,J.Cernock`y,andS.Khudanpur.Recurrentneuralnetworkbasedlanguagemodel.InInterspeech,volume2,page3,2010.
[278]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.
[279]S.Muggleton.FoundationsofInductiveLogicProgramming.PrenticeHall,EnglewoodCliffs,NJ,1995.
[280] J. A. Nelder and R. W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society, Series A, 135(3):370–384, 1972.
[281]M.A.Nielsen.Neuralnetworksanddeeplearning.Publishedonline:http://neuralnetworksanddeeplearning.com/.(visited:10.15.2016),2015.
[282]S.J.PanandQ.Yang.Asurveyontransferlearning.IEEETransactionsonknowledgeanddataengineering,22(10):1345–1359,2010.
[283]J.Pearl.Probabilisticreasoninginintelligentsystems:networksofplausibleinference.MorganKaufmann,2014.
[284]D.M.Powers.Theproblemofareaunderthecurve.In2012IEEEInternationalConferenceonInformationScienceandTechnology,pages
567–573.IEEE,2012.
[285]M.Prince.Doesactivelearningwork?Areviewoftheresearch.Journalofengineeringeducation,93(3):223–231,2004.
[286]F.J.ProvostandT.Fawcett.AnalysisandVisualizationofClassifierPerformance:ComparisonunderImpreciseClassandCostDistributions.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages43–48,NewportBeach,CA,August1997.
[287]J.R.Quinlan.C4.5:ProgramsforMachineLearning.Morgan-KaufmannPublishers,SanMateo,CA,1993.
[288]M.RamoniandP.Sebastiani.RobustBayesclassifiers.ArtificialIntelligence,125:209–226,2001.
[289]N.Rochester,J.Holland,L.Haibt,andW.Duda.Testsonacellassemblytheoryoftheactionofthebrain,usingalargedigitalcomputer.IRETransactionsoninformationTheory,2(3):80–93,1956.
[290]F.Rosenblatt.Theperceptron:aprobabilisticmodelforinformationstorageandorganizationinthebrain.Psychologicalreview,65(6):386,1958.
[291]S.J.Russell,P.Norvig,J.F.Canny,J.M.Malik,andD.D.Edwards.Artificialintelligence:amodernapproach,volume2.PrenticehallUpperSaddleRiver,2003.
[292]T.SaitoandM.Rehmsmeier.Theprecision-recallplotismoreinformativethantheROCplotwhenevaluatingbinaryclassifiersonimbalanceddatasets.PloSone,10(3):e0118432,2015.
[293]J.Schmidhuber.Deeplearninginneuralnetworks:Anoverview.NeuralNetworks,61:85–117,2015.
[294]B.SchölkopfandA.J.Smola.LearningwithKernels:SupportVectorMachines,Regularization,Optimization,andBeyond.MITPress,2001.
[295]B.Settles.Activelearningliteraturesurvey.UniversityofWisconsin,Madison,52(55-66):11,2010.
[296]P.SmythandR.M.Goodman.AnInformationTheoreticApproachtoRuleInductionfromDatabases.IEEETrans.onKnowledgeandDataEngineering,4(4):301–316,1992.
[297]N.Srivastava,G.E.Hinton,A.Krizhevsky,I.Sutskever,andR.Salakhutdinov.Dropout:asimplewaytopreventneuralnetworksfromoverfitting.JournalofMachineLearningResearch,15(1):1929–1958,2014.
[298]M.SteinbachandP.-N.Tan.kNN:k-NearestNeighbors.InX.WuandV.Kumar,editors,TheTopTenAlgorithmsinDataMining.ChapmanandHall/CRCReference,1stedition,2009.
[299]S.Sun.Asurveyofmulti-viewmachinelearning.NeuralComputingandApplications,23(7-8):2031–2038,2013.
[300]D.M.J.TaxandR.P.W.Duin.UsingTwo-ClassClassifiersforMulticlassClassification.InProc.ofthe16thIntl.Conf.onPatternRecognition(ICPR2002),pages124–127,Quebec,Canada,August2002.
[301]R.Tibshirani.Regressionshrinkageandselectionviathelasso.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),pages267–288,1996.
[302]C.J.vanRijsbergen.InformationRetrieval.Butterworth-Heinemann,Newton,MA,1978.
[303]V.Vapnik.TheNatureofStatisticalLearningTheory.SpringerVerlag,NewYork,1995.
[304]V.Vapnik.StatisticalLearningTheory.JohnWiley&Sons,NewYork,1998.
[305]P.Vincent,H.Larochelle,Y.Bengio,andP.-A.Manzagol.Extractingandcomposingrobustfeatureswithdenoisingautoencoders.InProceedingsofthe25thinternationalconferenceonMachinelearning,pages1096–1103.ACM,2008.
[306]P.Vincent,H.Larochelle,I.Lajoie,Y.Bengio,andP.-A.Manzagol.Stackeddenoisingautoencoders:Learningusefulrepresentationsina
deepnetworkwithalocaldenoisingcriterion.JournalofMachineLearningResearch,11(Dec):3371–3408,2010.
[307]A.R.Webb.StatisticalPatternRecognition.JohnWiley&Sons,2ndedition,2002.
[308]G.M.Weiss.MiningwithRarity:AUnifyingFramework.SIGKDDExplorations,6(1):7–19,2004.
[309] P. Werbos. Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, 1974.
[310]I.H.WittenandE.Frank.DataMining:PracticalMachineLearningToolsandTechniqueswithJavaImplementations.MorganKaufmann,1999.
[311]C.Xu,D.Tao,andC.Xu.Asurveyonmulti-viewlearning.arXivpreprintarXiv:1304.5634,2013.
[312]B.Zadrozny,J.C.Langford,andN.Abe.Cost-SensitiveLearningbyCost-ProportionateExampleWeighting.InProc.ofthe2003IEEEIntl.Conf.onDataMining,pages435–442,Melbourne,FL,August2003.
[313]M.-L.ZhangandZ.-H.Zhou.Areviewonmulti-labellearningalgorithms.IEEEtransactionsonknowledgeanddataengineering,26(8):1819–1837,2014.
[314]Z.-H.Zhou.Multi-instancelearning:Asurvey.DepartmentofComputerScience&Technology,NanjingUniversity,Tech.Rep,2004.
[315]X.Zhu.Semi-supervisedlearning.InEncyclopediaofmachinelearning,pages892–897.Springer,2011.
[316]X.ZhuandA.B.Goldberg.Introductiontosemi-supervisedlearning.Synthesislecturesonartificialintelligenceandmachinelearning,3(1):1–130,2009.
4.14 Exercises

1. Consider a binary classification problem with the following set of attributes and attribute values:

Air Conditioner = {Working, Broken}
Engine = {Good, Bad}
Mileage = {High, Medium, Low}
Rust = {Yes, No}

Suppose a rule-based classifier produces the following rule set:

Mileage = High → Value = Low
Mileage = Low → Value = High
Air Conditioner = Working, Engine = Good → Value = High
Air Conditioner = Working, Engine = Bad → Value = Low
Air Conditioner = Broken → Value = Low

a. Are the rules mutually exclusive?
b. Is the rule set exhaustive?
c. Is ordering needed for this set of rules?
d. Do you need a default class for the rule set?

2. The RIPPER algorithm (by Cohen [218]) is an extension of an earlier algorithm called IREP (by Fürnkranz and Widmer [234]). Both algorithms apply the reduced-error pruning method to determine whether a rule needs to be pruned. The reduced error pruning method uses a validation set to estimate the generalization error of a classifier. Consider the following pair of rules:

R1: A → C
R2: A ∧ B → C

R2 is obtained by adding a new conjunct, B, to the left-hand side of R1. For this question, you will be asked to determine whether R2 is preferred over R1 from the perspectives of rule-growing and rule-pruning. To determine whether a rule should be pruned, IREP computes the following measure:

v_IREP = (p + (N − n)) / (P + N),

where P is the total number of positive examples in the validation set, N is the total number of negative examples in the validation set, p is the number of positive examples in the validation set covered by the rule, and n is the number of negative examples in the validation set covered by the rule. v_IREP is actually similar to classification accuracy for the validation set. IREP favors rules that have higher values of v_IREP. On the other hand, RIPPER applies the following measure to determine whether a rule should be pruned:

v_RIPPER = (p − n) / (p + n).

a. Suppose R1 is covered by 350 positive examples and 150 negative examples, while R2 is covered by 300 positive examples and 50 negative examples. Compute the FOIL's information gain for the rule R2 with respect to R1.
b. Consider a validation set that contains 500 positive examples and 500 negative examples. For R1, suppose the number of positive examples covered by the rule is 200, and the number of negative examples covered by the rule is 50. For R2, suppose the number of positive examples covered by the rule is 100 and the number of negative examples is 5. Compute v_IREP for both rules. Which rule does IREP prefer?
c. Compute v_RIPPER for the previous problem. Which rule does RIPPER prefer?
3.C4.5rulesisanimplementationofanindirectmethodforgeneratingrulesfromadecisiontree.RIPPERisanimplementationofadirectmethodforgeneratingrulesdirectlyfromdata.
a. Discussthestrengthsandweaknessesofbothmethods.
b. Consideradatasetthathasalargedifferenceintheclasssize(i.e.,someclassesaremuchbiggerthanothers).Whichmethod(betweenC4.5rulesandRIPPER)isbetterintermsoffindinghighaccuracyrulesforthesmallclasses?
4. Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules,

R1: A → + (covers 4 positive and 1 negative examples),
R2: B → + (covers 30 positive and 10 negative examples),
R3: C → + (covers 100 positive and 90 negative examples),

determine which is the best and worst candidate rule according to:
a. Rule accuracy.
b. FOIL's information gain.
c. The likelihood ratio statistic.
d. The Laplace measure.
e. The m-estimate measure (with k = 2 and p+ = 0.2).

5. Figure 4.3 illustrates the coverage of the classification rules R1, R2, and R3. Determine which is the best and worst rule according to:
a. The likelihood ratio statistic.
b. The Laplace measure.
c. The m-estimate measure (with k = 2 and p+ = 0.58).
d. The rule accuracy after R1 has been discovered, where none of the examples covered by R1 are discarded.
e. The rule accuracy after R1 has been discovered, where only the positive examples covered by R1 are discarded.
f. The rule accuracy after R1 has been discovered, where both positive and negative examples covered by R1 are discarded.
6.
a. Supposethefractionofundergraduatestudentswhosmokeis15%andthefractionofgraduatestudentswhosmokeis23%.Ifone-fifthofthecollegestudentsaregraduatestudentsandtherestareundergraduates,whatistheprobabilitythatastudentwhosmokesisagraduatestudent?
b. Giventheinformationinpart(a),isarandomlychosencollegestudentmorelikelytobeagraduateorundergraduatestudent?
c. Repeatpart(b)assumingthatthestudentisasmoker.
d. Suppose30%ofthegraduatestudentsliveinadormbutonly10%oftheundergraduatestudentsliveinadorm.Ifastudentsmokesandlivesinthedorm,isheorshemorelikelytobeagraduateorundergraduatestudent?Youcanassumeindependencebetweenstudentswholiveinadormandthosewhosmoke.
7. Consider the data set shown in Table 4.9.

Table 4.9. Data set for Exercise 7.

Instance A B C Class
1 0 0 0 +
2 0 0 1 −
3 0 1 1 −
4 0 1 1 −
5 0 0 1 +
6 1 0 1 +
7 1 0 1 −
8 1 0 1 −
9 1 1 1 +
10 1 0 1 +

a. Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|−), P(B|−), and P(C|−).
b. Use the estimate of conditional probabilities given in the previous question to predict the class label for a test sample (A = 0, B = 1, C = 0) using the naïve Bayes approach.
c. Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4.
d. Repeat part (b) using the conditional probabilities given in part (c).
e. Compare the two methods for estimating probabilities. Which method is better and why?

8. Consider the data set shown in Table 4.10.

Table 4.10. Data set for Exercise 8.
Instance A B C Class
1 0 0 1 −
2 1 0 1 +
3 0 1 0 −
4 1 0 0 −
5 1 0 1 +
6 0 0 1 +
7 1 1 0 −
8 0 0 0 −
9 0 1 0 +
10 1 1 1 +

a. Estimate the conditional probabilities for P(A=1|+), P(B=1|+), P(C=1|+), P(A=1|−), P(B=1|−), and P(C=1|−) using the same approach as in the previous problem.
b. Use the conditional probabilities in part (a) to predict the class label for a test sample (A=1, B=1, C=1) using the naïve Bayes approach.
c. Compare P(A=1), P(B=1), and P(A=1, B=1). State the relationships between A and B.
d. Repeat the analysis in part (c) using P(A=1), P(B=0), and P(A=1, B=0).
e. Compare P(A=1, B=1|Class=+) against P(A=1|Class=+) and P(B=1|Class=+). Are the variables conditionally independent given the class?
9.
a. ExplainhownaïveBayesperformsonthedatasetshowninFigure4.56 .
b. Ifeachclassisfurtherdividedsuchthattherearefourclasses(A1,A2,B1,andB2),willnaïveBayesperformbetter?
c. Howwilladecisiontreeperformonthisdataset(forthetwo-classproblem)?Whatiftherearefourclasses?
10. Figure 4.57 illustrates the Bayesian network for the data set shown in Table 4.11. (Assume that all the attributes are binary.)
a. Draw the probability table for each node in the network.
b. Use the Bayesian network to compute P(Engine = Bad, Air Conditioner = Broken).

Figure 4.56. Data set for Exercise 9.
Figure 4.57. Bayesian network.

11. Given the Bayesian network shown in Figure 4.58, compute the following probabilities:
a. P(B = good, F = empty, G = empty, S = yes).
b. P(B = bad, F = empty, G = not empty, S = no).
c. Given that the battery is bad, compute the probability that the car will start.
12. Consider the one-dimensional data set shown in Table 4.12.
a. Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors (using majority vote).
b. Repeat the previous analysis using the distance-weighted voting approach described in Section 4.3.1.

Table 4.11. Data set for Exercise 10.

Mileage | Engine | Air Conditioner | Number of Instances with Car Value = Hi | Number of Instances with Car Value = Lo
Hi | Good | Working | 3 | 4
Hi | Good | Broken | 1 | 2
Hi | Bad | Working | 1 | 5
Hi | Bad | Broken | 0 | 4
Lo | Good | Working | 9 | 0
Lo | Good | Broken | 5 | 1
Lo | Bad | Working | 1 | 2
Lo | Bad | Broken | 0 | 2

Figure 4.58. Bayesian network for Exercise 11.

13. The nearest neighbor algorithm described in Section 4.3 can be extended to handle nominal attributes. A variant of the algorithm called PEBLS (Parallel Exemplar-Based Learning System) by Cost and Salzberg [219] measures the distance between two values of a nominal attribute using the modified value difference metric (MVDM). Given a pair of nominal attribute values, V1 and V2, the distance between them is defined as follows:

d(V1, V2) = Σ_{i=1}^{k} | n_{i1}/n_1 − n_{i2}/n_2 |,   (4.108)

where n_{ij} is the number of examples from class i with attribute value V_j and n_j is the number of examples with attribute value V_j.

Table 4.12. Data set for Exercise 12.

x 0.5 3.0 4.5 4.6 4.9 5.2 5.3 5.5 7.0 9.5
y  −   −   +   +   +   −   −   +   −   −

Consider the training set for the loan classification problem shown in Figure 4.8. Use the MVDM measure to compute the distance between every pair of attribute values for the Home Owner and Marital Status attributes.
14.ForeachoftheBooleanfunctionsgivenbelow,statewhethertheproblemislinearlyseparable.
a. AANDBANDC
b. NOTAANDB
c. (AORB)AND(AORC)
d. (AXORB)AND(AORB)
15.
a. DemonstratehowtheperceptronmodelcanbeusedtorepresenttheANDandORfunctionsbetweenapairofBooleanvariables.
b. Commentonthedisadvantageofusinglinearfunctionsasactivationfunctionsformulti-layerneuralnetworks.
16. You are asked to evaluate the performance of two classification models, M1 and M2. The test set you have chosen contains 26 binary attributes, labeled as A through Z. Table 4.13 shows the posterior probabilities obtained by applying the models to the test set. (Only the posterior probabilities for the positive class are shown.) As this is a two-class problem, P(−) = 1 − P(+) and P(−|A,…,Z) = 1 − P(+|A,…,Z). Assume that we are mostly interested in detecting instances from the positive class.
a. Plot the ROC curve for both M1 and M2. (You should plot them on the same graph.) Which model do you think is better? Explain your reasons.
b. For model M1, suppose you choose the cutoff threshold to be t = 0.5. In other words, any test instances whose posterior probability is greater than t will be classified as a positive example. Compute the precision, recall, and F-measure for the model at this threshold value.
c. Repeat the analysis for part (b) using the same cutoff threshold on model M2. Compare the F-measure results for both models. Which model is better? Are the results consistent with what you expect from the ROC curve?
d. Repeat part (b) for model M1 using the threshold t = 0.1. Which threshold do you prefer, t = 0.5 or t = 0.1? Are the results consistent with what you expect from the ROC curve?

Table 4.13. Posterior probabilities for Exercise 16.

Instance | True Class | P(+|A,…,Z,M1) | P(+|A,…,Z,M2)
1 | + | 0.73 | 0.61
2 | + | 0.69 | 0.03
3 | − | 0.44 | 0.68
4 | − | 0.55 | 0.31
5 | + | 0.67 | 0.45
6 | + | 0.47 | 0.09
7 | − | 0.08 | 0.38
8 | − | 0.15 | 0.05
9 | + | 0.45 | 0.01
10 | − | 0.35 | 0.04
17. Following is a data set that contains two attributes, X and Y, and two class labels, "+" and "−". Each attribute can take three different values: 0, 1, or 2.

X | Y | Number of "+" Instances | Number of "−" Instances
0 | 0 | 0 | 100
1 | 0 | 0 | 0
2 | 0 | 0 | 100
0 | 1 | 10 | 100
1 | 1 | 10 | 0
2 | 1 | 10 | 100
0 | 2 | 0 | 100
1 | 2 | 0 | 0
2 | 2 | 0 | 100

The concept for the "+" class is Y = 1 and the concept for the "−" class is X = 0 ∨ X = 2.

a. Build a decision tree on the data set. Does the tree capture the "+" and "−" concepts?
b. What are the accuracy, precision, recall, and F1-measure of the decision tree? (Note that precision, recall, and F1-measure are defined with respect to the "+" class.)
c. Build a new decision tree with the following cost function:

C(i, j) = 0 if i = j; 1 if i = +, j = −; (Number of − instances)/(Number of + instances) if i = −, j = +.

(Hint: only the leaves of the old decision tree need to be changed.) Does the decision tree capture the "+" concept?
d. What are the accuracy, precision, recall, and F1-measure of the new decision tree?

18. Consider the task of building a classifier from random data, where the attribute values are generated randomly irrespective of the class labels. Assume the data set contains instances from two classes, "+" and "−". Half of the data set is used for training while the remaining half is used for testing.
a. Suppose there are an equal number of positive and negative instances in the data and the decision tree classifier predicts every test instance to be positive. What is the expected error rate of the classifier on the test data?
b. Repeat the previous analysis assuming that the classifier predicts each test instance to be positive class with probability 0.8 and negative class with probability 0.2.
c. Suppose two-thirds of the data belong to the positive class and the remaining one-third belong to the negative class. What is the expected error of a classifier that predicts every test instance to be positive?
d. Repeat the previous analysis assuming that the classifier predicts each test instance to be positive class with probability 2/3 and negative class with probability 1/3.
19. Derive the dual Lagrangian for the linear SVM with non-separable data where the objective function is

f(w) = ‖w‖²/2 + C (Σ_{i=1}^{N} ξ_i)².

20. Consider the XOR problem where there are four training points:

(1, 1, −), (1, 0, +), (0, 1, +), (0, 0, −).

Transform the data into the following feature space:

Φ = (1, √2 x1, √2 x2, √2 x1x2, x1², x2²).

Find the maximum margin linear decision boundary in the transformed space.

21. Given the data sets shown in Figure 4.59, explain how the decision tree, naïve Bayes, and k-nearest neighbor classifiers would perform on these data sets.
Figure4.59.DatasetforExercise21.
5 Association Analysis: Basic Concepts and Algorithms
Manybusinessenterprisesaccumulatelargequantitiesofdatafromtheirday-to-dayoperations.Forexample,hugeamountsofcustomerpurchasedataarecollecteddailyatthecheckoutcountersofgrocerystores.Table5.1 givesanexampleofsuchdata,commonlyknownasmarketbaskettransactions.Eachrowinthistablecorrespondstoatransaction,whichcontainsauniqueidentifierlabeledTIDandasetofitemsboughtbyagivencustomer.Retailersareinterestedinanalyzingthedatatolearnaboutthepurchasingbehavioroftheircustomers.Suchvaluableinformationcanbeusedtosupportavarietyofbusiness-relatedapplicationssuchasmarketingpromotions,inventorymanagement,andcustomerrelationshipmanagement.
Table5.1.Anexampleofmarketbaskettransactions.
TID Items
1 {Bread,Milk}
2 {Bread,Diapers,Beer,Eggs}
3 {Milk,Diapers,Beer,Cola}
4 {Bread,Milk,Diapers,Beer}
5 {Bread,Milk,Diapers,Cola}
This chapter presents a methodology known as association analysis, which is useful for discovering interesting relationships hidden in large data sets. The uncovered relationships can be represented in the form of sets of items present in many transactions, which are known as frequent itemsets, or association rules, which represent relationships between two itemsets. For example, the following rule can be extracted from the data set shown in Table 5.1:

{Diapers} → {Beer}.

The rule suggests a relationship between the sale of diapers and beer because many customers who buy diapers also buy beer. Retailers can use these types of rules to help them identify new opportunities for cross-selling their products to the customers.
Besidesmarketbasketdata,associationanalysisisalsoapplicabletodatafromotherapplicationdomainssuchasbioinformatics,medicaldiagnosis,webmining,andscientificdataanalysis.IntheanalysisofEarthsciencedata,forexample,associationpatternsmayrevealinterestingconnectionsamongtheocean,land,andatmosphericprocesses.SuchinformationmayhelpEarthscientistsdevelopabetterunderstandingofhowthedifferentelementsoftheEarthsysteminteractwitheachother.Eventhoughthetechniquespresentedherearegenerallyapplicabletoawidervarietyofdatasets,forillustrativepurposes,ourdiscussionwillfocusmainlyonmarketbasketdata.
Therearetwokeyissuesthatneedtobeaddressedwhenapplyingassociationanalysistomarketbasketdata.First,discoveringpatternsfromalargetransactiondatasetcanbecomputationallyexpensive.Second,someofthediscoveredpatternsmaybespurious(happensimplybychance)andevenfornon-spuriouspatterns,somearemoreinterestingthanothers.Theremainderofthischapterisorganizedaroundthesetwoissues.Thefirstpartofthechapterisdevotedtoexplainingthebasicconceptsofassociationanalysisandthealgorithmsusedtoefficientlyminesuchpatterns.Thesecondpartofthechapterdealswiththeissueofevaluatingthediscoveredpatternsinordertohelppreventthegenerationofspuriousresultsandtorankthepatternsintermsofsomeinterestingnessmeasure.
5.1 Preliminaries

This section reviews the basic terminology used in association analysis and presents a formal description of the task.
BinaryRepresentation
MarketbasketdatacanberepresentedinabinaryformatasshowninTable5.2 ,whereeachrowcorrespondstoatransactionandeachcolumncorrespondstoanitem.Anitemcanbetreatedasabinaryvariablewhosevalueisoneiftheitemispresentinatransactionandzerootherwise.Becausethepresenceofaniteminatransactionisoftenconsideredmoreimportantthanitsabsence,anitemisanasymmetricbinaryvariable.Thisrepresentationisasimplisticviewofrealmarketbasketdatabecauseitignoresimportantaspectsofthedatasuchasthequantityofitemssoldorthepricepaidtopurchasethem.Methodsforhandlingsuchnon-binarydatawillbeexplainedinChapter6 .
Table5.2.Abinary0/1representationofmarketbasketdata.
TID Bread Milk Diapers Beer Eggs Cola
1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 0
5 1 1 1 0 0 1
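The 0/1 representation is straightforward to build from raw transactions. The snippet below is a small illustrative sketch (the transaction list simply mirrors Table 5.1) that produces the matrix of Table 5.2.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
items = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Cola"]

# One row per transaction, one asymmetric binary variable per item.
binary_matrix = [[1 if item in t else 0 for item in items] for t in transactions]
for tid, row in enumerate(binary_matrix, start=1):
    print(tid, row)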
ItemsetandSupportCount
Let I = {i1, i2, …, id} be the set of all items in a market basket data and T = {t1, t2, …, tN} be the set of all transactions. Each transaction ti contains a subset of items chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.

A transaction tj is said to contain an itemset X if X is a subset of tj. For example, the second transaction shown in Table 5.2 contains the itemset {Bread, Diapers} but not {Bread, Milk}. An important property of an itemset is its support count, which refers to the number of transactions that contain a particular itemset. Mathematically, the support count, σ(X), for an itemset X can be stated as follows:

σ(X) = |{ti | X ⊆ ti, ti ∈ T}|,

where the symbol |·| denotes the number of elements in a set. In the data set shown in Table 5.2, the support count for {Beer, Diapers, Milk} is equal to two because there are only two transactions that contain all three items.

Often, the property of interest is the support, which is the fraction of transactions in which an itemset occurs:

s(X) = σ(X)/N.

An itemset X is called frequent if s(X) is greater than some user-defined threshold, minsup.
AssociationRule
An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in terms of its support and confidence. Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are

Support, s(X → Y) = σ(X ∪ Y)/N;   (5.1)
Confidence, c(X → Y) = σ(X ∪ Y)/σ(X).   (5.2)

Example 5.1. Consider the rule {Milk, Diapers} → {Beer}. Because the support count for {Beer, Diapers, Milk} is 2 and the total number of transactions is 5, the rule's support is 2/5 = 0.4. The rule's confidence is obtained by dividing the support count for {Beer, Diapers, Milk} by the support count for {Diapers, Milk}. Since there are 3 transactions that contain milk and diapers, the confidence for this rule is 2/3 = 0.67.
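The quantities in Example 5.1 are easy to verify programmatically. The following sketch (illustrative only, using the transactions of Table 5.1) computes the support count, support, and confidence of the rule {Milk, Diapers} → {Beer}.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset):
    # sigma(X): number of transactions containing every item in X.
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
s = support_count(X | Y) / len(transactions)   # 2/5 = 0.4
c = support_count(X | Y) / support_count(X)    # 2/3, about 0.67
print(round(s, 2), round(c, 2))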
WhyUseSupportandConfidence?
Supportisanimportantmeasurebecausearulethathasverylowsupportmightoccursimplybychance.Also,fromabusinessperspectivealowsupportruleisunlikelytobeinterestingbecauseitmightnotbeprofitabletopromoteitemsthatcustomersseldombuytogether(withtheexceptionofthesituationdescribedinSection5.8 ).Forthesereasons,weareinterestedinfindingruleswhosesupportisgreaterthansomeuser-definedthreshold.As
willbeshowninSection5.2.1 ,supportalsohasadesirablepropertythatcanbeexploitedfortheefficientdiscoveryofassociationrules.
Confidence, on the other hand, measures the reliability of the inference made by a rule. For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability of Y given X.
Associationanalysisresultsshouldbeinterpretedwithcaution.Theinferencemadebyanassociationruledoesnotnecessarilyimplycausality.Instead,itcansometimessuggestastrongco-occurrencerelationshipbetweenitemsintheantecedentandconsequentoftherule.Causality,ontheotherhand,requiresknowledgeaboutwhichattributesinthedatacapturecauseandeffect,andtypicallyinvolvesrelationshipsoccurringovertime(e.g.,greenhousegasemissionsleadtoglobalwarming).SeeSection5.7.1 foradditionaldiscussion.
FormulationoftheAssociationRuleMiningProblem
Theassociationruleminingproblemcanbeformallystatedasfollows:
Definition 5.1. (Association Rule Discovery.) Given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.
A brute-force approach for mining association rules is to compute the support and confidence for every possible rule. This approach is prohibitively expensive because there are exponentially many rules that can be extracted from a data set. More specifically, assuming that neither the left nor the right-hand side of the rule is an empty set, the total number of possible rules, R, extracted from a data set that contains d items is

R = 3^d − 2^(d+1) + 1.   (5.3)

The proof for this equation is left as an exercise to the readers (see Exercise 5 on page 440). Even for the small data set shown in Table 5.1, this approach requires us to compute the support and confidence for 3^6 − 2^7 + 1 = 602 rules. More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%, thus wasting most of the computations. To avoid performing needless computations, it would be useful to prune the rules early without having to compute their support and confidence values.
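For small d, Equation 5.3 can be checked by direct enumeration. The sketch below (illustrative only) counts every rule X → Y in which X and Y are non-empty, disjoint subsets of d items, and compares the count against the closed-form expression; for d = 6 both give 602.

from itertools import combinations

def count_rules(d):
    # Count rules X -> Y with X and Y non-empty and disjoint, drawn from d items.
    count = 0
    for k in range(1, d + 1):                    # k = |X union Y|
        for body in combinations(range(d), k):   # items involved in the rule
            # Each chosen item goes to either X or Y; exclude the two splits
            # that would leave X or Y empty.
            count += 2 ** k - 2
    return count

d = 6
print(count_rules(d), 3 ** d - 2 ** (d + 1) + 1)  # prints: 602 602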
An initial step toward improving the performance of association rule mining algorithms is to decouple the support and confidence requirements. From Equation 5.1, notice that the support of a rule X → Y is the same as the support of its corresponding itemset, X ∪ Y. For example, the following rules have identical support because they involve items from the same itemset, {Beer, Diapers, Milk}:
{Beer,Diapers}→{Milk},{Beer,Milk}→{Diapers},{Diapers,Milk}→{Beer},{Beer}→{Diapers,Milk},{Milk}→{Beer,Diapers},{Diapers}→{Beer,Milk}.
Iftheitemsetisinfrequent,thenallsixcandidaterulescanbeprunedimmediatelywithoutourhavingtocomputetheirconfidencevalues.
Therefore,acommonstrategyadoptedbymanyassociationruleminingalgorithmsistodecomposetheproblemintotwomajorsubtasks:
1. FrequentItemsetGeneration,whoseobjectiveistofindalltheitemsetsthatsatisfytheminsupthreshold.
2. RuleGeneration,whoseobjectiveistoextractallthehighconfidencerulesfromthefrequentitemsetsfoundinthepreviousstep.Theserulesarecalledstrongrules.
Thecomputationalrequirementsforfrequentitemsetgenerationaregenerallymoreexpensivethanthoseofrulegeneration.EfficienttechniquesforgeneratingfrequentitemsetsandassociationrulesarediscussedinSections5.2 and5.3 ,respectively.
5.2 Frequent Itemset Generation

A lattice structure can be used to enumerate the list of all possible itemsets. Figure 5.1 shows an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many practical applications, the search space of itemsets that needs to be explored is exponentially large.

Figure 5.1. An itemset lattice.
A brute-force approach for finding frequent itemsets is to determine the support count for every candidate itemset in the lattice structure. To do this, we need to compare each candidate against every transaction, an operation that is shown in Figure 5.2. If the candidate is contained in a transaction, its support count will be incremented. For example, the support for {Bread, Milk} is incremented three times because the itemset is contained in transactions 1, 4, and 5. Such an approach can be very expensive because it requires O(NMw) comparisons, where N is the number of transactions, M = 2^k − 1 is the number of candidate itemsets, and w is the maximum transaction width. Transaction width is the number of items present in a transaction.
Figure5.2.Countingthesupportofcandidateitemsets.
Therearethreemainapproachesforreducingthecomputationalcomplexityoffrequentitemsetgeneration.
1. Reducethenumberofcandidateitemsets(M).TheAprioriprinciple,describedinthenextSection,isaneffectivewaytoeliminatesomeof
thecandidateitemsetswithoutcountingtheirsupportvalues.2. Reducethenumberofcomparisons.Insteadofmatchingeach
candidateitemsetagainsteverytransaction,wecanreducethenumberofcomparisonsbyusingmoreadvanceddatastructures,eithertostorethecandidateitemsetsortocompressthedataset.WewilldiscussthesestrategiesinSections5.2.4 and5.6 ,respectively.
3. Reducethenumberoftransactions(N).Asthesizeofcandidateitemsetsincreases,fewertransactionswillbesupportedbytheitemsets.Forinstance,sincethewidthofthefirsttransactioninTable5.1 is2,itwouldbeadvantageoustoremovethistransactionbeforesearchingforfrequentitemsetsofsize3andlarger.AlgorithmsthatemploysuchastrategyarediscussedintheBibliographicNotes.
5.2.1TheAprioriPrinciple
ThisSectiondescribeshowthesupportmeasurecanbeusedtoreducethenumberofcandidateitemsetsexploredduringfrequentitemsetgeneration.Theuseofsupportforpruningcandidateitemsetsisguidedbythefollowingprinciple.
Theorem5.1(AprioriPrinciple).Ifanitemsetisfrequent,thenallofitssubsetsmustalsobefrequent.
ToillustratetheideabehindtheAprioriprinciple,considertheitemsetlatticeshowninFigure5.3 .Suppose{c,d,e}isafrequentitemset.Clearly,anytransactionthatcontains{c,d,e}mustalsocontainitssubsets,{c,d},{c,e},{d,e},{c},{d},and{e}.Asaresult,if{c,d,e}isfrequent,thenallsubsetsof{c,d,e}(i.e.,theshadeditemsetsinthisfigure)mustalsobefrequent.
Figure5.3.AnillustrationoftheAprioriprinciple.If{c,d,e}isfrequent,thenallsubsetsofthisitemsetarefrequent.
Conversely,ifanitemsetsuchas{a,b}isinfrequent,thenallofitssupersetsmustbeinfrequenttoo.AsillustratedinFigure5.4 ,theentiresubgraph
containingthesupersetsof{a,b}canbeprunedimmediatelyonce{a,b}isfoundtobeinfrequent.Thisstrategyoftrimmingtheexponentialsearchspacebasedonthesupportmeasureisknownassupport-basedpruning.Suchapruningstrategyismadepossiblebyakeypropertyofthesupportmeasure,namely,thatthesupportforanitemsetneverexceedsthesupportforitssubsets.Thispropertyisalsoknownastheanti-monotonepropertyofthesupportmeasure.
Figure5.4.Anillustrationofsupport-basedpruning.If{a,b}isinfrequent,thenallsupersetsof{a,b}areinfrequent.
Definition 5.2. (Anti-monotone Property.) A measure f possesses the anti-monotone property if for every itemset X that is a proper subset of itemset Y, i.e., X ⊂ Y, we have f(Y) ≤ f(X).
Moregenerally,alargenumberofmeasures—seeSection5.7.1 —canbeappliedtoitemsetstoevaluatevariouspropertiesofitemsets.AswillbeshowninthenextSection,anymeasurethathastheanti-monotonepropertycanbeincorporateddirectlyintoanitemsetminingalgorithmtoeffectivelyprunetheexponentialsearchspaceofcandidateitemsets.
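The anti-monotone property of support can be observed directly on a small data set. The sketch below (illustrative only, again using the transactions of Table 5.1) checks that the support count of every itemset is at most the support count of each of its (k−1)-subsets.

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
items = sorted(set().union(*transactions))

def sigma(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

for k in range(2, len(items) + 1):
    for Y in combinations(items, k):
        assert all(sigma(Y) <= sigma(X) for X in combinations(Y, k - 1))
print("support is anti-monotone on the transactions of Table 5.1")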
5.2.2FrequentItemsetGenerationintheAprioriAlgorithm
Aprioriisthefirstassociationruleminingalgorithmthatpioneeredtheuseofsupport-basedpruningtosystematicallycontroltheexponentialgrowthofcandidateitemsets.Figure5.5 providesahigh-levelillustrationofthefrequentitemsetgenerationpartoftheApriorialgorithmforthetransactionsshowninTable5.1 .Weassumethatthesupportthresholdis60%,whichisequivalenttoaminimumsupportcountequalto3.
Figure5.5.IllustrationoffrequentitemsetgenerationusingtheApriorialgorithm.
Initially, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded because they appear in fewer than three transactions. In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent. Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is (4 choose 2) = 6. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after computing their support values. The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without support-based pruning, there are (6 choose 3) = 20 candidate 3-itemsets that can be formed using the six items given in this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only candidate that has this property is {Bread, Diapers, Milk}. However, even though the subsets of {Bread, Diapers, Milk} are frequent, the itemset itself is not.

The effectiveness of the Apriori pruning strategy can be shown by counting the number of candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce (6 choose 1) + (6 choose 2) + (6 choose 3) = 6 + 15 + 20 = 41 candidates. With the Apriori principle, this number decreases to (6 choose 1) + (4 choose 2) + 1 = 6 + 6 + 1 = 13 candidates, which represents a 68% reduction in the number of candidate itemsets even in this simple example.

The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in Algorithm 5.1. Let Ck denote the set of candidate k-itemsets and Fk denote the set of frequent k-itemsets:

The algorithm initially makes a single pass over the data set to determine the support of each item. Upon completion of this step, the set of all frequent 1-itemsets, F1, will be known (steps 1 and 2). Next, the algorithm will iteratively generate new candidate k-itemsets and prune unnecessary candidates that are guaranteed to be infrequent given the frequent (k−1)-itemsets found in the previous iteration (steps 5 and 6). Candidate generation and pruning are implemented using the functions candidate-gen and candidate-prune, which are described in Section 5.2.3.
To count the support of the candidates, the algorithm needs to make an additional pass over the data set (steps 7–12). The subset function is used to determine all the candidate itemsets in Ck that are contained in each transaction t. The implementation of this function is described in Section 5.2.4. After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than N × minsup (step 13). The algorithm terminates when there are no new frequent itemsets generated, i.e., Fk = ∅ (step 14).

The frequent itemset generation part of the Apriori algorithm has two important characteristics. First, it is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum size of frequent itemsets. Second, it employs a generate-and-test strategy for finding frequent itemsets. At each iteration (level), new candidate itemsets are generated from the frequent itemsets found in the previous iteration. The support for each candidate is then counted and tested against the minsup threshold. The total number of iterations needed by the algorithm is kmax + 1, where kmax is the maximum size of the frequent itemsets.
5.2.3CandidateGenerationandPruning
Thecandidate-genandcandidate-prunefunctionsshowninSteps5and6ofAlgorithm5.1 generatecandidateitemsetsandprunesunnecessaryonesbyperformingthefollowingtwooperations,respectively:
Ck
N×minsup
Fk=∅
kmax+1kmax
1. Candidate Generation. This operation generates new candidate k-itemsets based on the frequent (k−1)-itemsets found in the previous iteration.

2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using support-based pruning, i.e., by removing k-itemsets whose subsets are known to be infrequent in previous iterations. Note that this pruning is done without computing the actual support of these k-itemsets (which could have required comparing them against each transaction).

Algorithm 5.1 Frequent itemset generation of the Apriori algorithm.
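The generate-and-test loop of Algorithm 5.1 can be illustrated with a short Python sketch. The transaction representation (sets of items), the in-line candidate generation and pruning, and all variable names are illustrative assumptions rather than the book's exact pseudocode.

    from itertools import combinations

    def apriori_frequent_itemsets(transactions, minsup):
        """Level-wise generate-and-test search for frequent itemsets.
        transactions: list of sets of items; minsup: fraction in [0, 1]."""
        N = len(transactions)
        min_count = minsup * N
        # Steps 1-2: find all frequent 1-itemsets.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = [{c: s for c, s in counts.items() if s >= min_count}]
        k = 1
        while frequent[-1]:
            k += 1
            prev = list(frequent[-1])
            # Steps 5-6: candidate generation (merge frequent (k-1)-itemsets)
            # and pruning (all (k-1)-subsets must be frequent).
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k and all(
                        frozenset(sub) in frequent[-1]
                        for sub in combinations(union, k - 1)
                    ):
                        candidates.add(union)
            # Steps 7-12: one additional pass over the data to count supports.
            support = dict.fromkeys(candidates, 0)
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        support[c] += 1
            # Step 13: keep candidates that meet the minimum support count.
            frequent.append({c: s for c, s in support.items() if s >= min_count})
        # The loop stops once a level produces no frequent itemsets (step 14).
        return {c: s for level in frequent for c, s in level.items()}

Applied to the transactions of the running example with minsup = 60%, a sketch of this kind yields the same frequent itemsets as the walk-through above.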
Candidate Generation
In principle, there are many ways to generate candidate itemsets. An effective candidate generation procedure must be complete and non-redundant. A candidate generation procedure is said to be complete if it does not omit any frequent itemsets. To ensure completeness, the set of candidate itemsets must subsume the set of all frequent itemsets, i.e., ∀k: F_k ⊆ C_k. A candidate generation procedure is non-redundant if it does not generate the same candidate itemset more than once. For example, the candidate itemset {a, b, c, d} can be generated in many ways: by merging {a, b, c} with {d}, {b, d} with {a, c}, {c} with {a, b, d}, etc. Generation of duplicate candidates leads to wasted computations and thus should be avoided for efficiency reasons. Also, an effective candidate generation procedure should avoid generating too many unnecessary candidates. A candidate itemset is unnecessary if at least one of its subsets is infrequent, and it is thus eliminated in the candidate pruning step.

Next, we will briefly describe several candidate generation procedures, including the one used by the candidate-gen function.

Brute-Force Method
The brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates whose subsets are infrequent (see Figure 5.6). The number of candidate itemsets generated at level k is equal to $\binom{d}{k}$, where d is the total number of items. Although candidate generation is rather trivial, candidate pruning becomes extremely expensive because a large number of itemsets must be examined.

Figure 5.6. A brute-force method for generating candidate 3-itemsets.

F_{k−1} × F_1 Method
An alternative method for candidate generation is to extend each frequent (k−1)-itemset with frequent items that are not part of that (k−1)-itemset. Figure 5.7 illustrates how a frequent 2-itemset can be augmented with a frequent item to produce a candidate 3-itemset.
Figure 5.7. Generating and pruning candidate k-itemsets by merging a frequent (k−1)-itemset with a frequent item. Note that some of the candidates are unnecessary because their subsets are infrequent.

The procedure is complete because every frequent k-itemset is composed of a frequent (k−1)-itemset and a frequent 1-itemset. Therefore, all frequent k-itemsets are part of the candidate k-itemsets generated by this procedure. Figure 5.7 shows that the F_{k−1} × F_1 candidate generation method only produces four candidate 3-itemsets, instead of the $\binom{6}{3} = 20$ itemsets produced by the brute-force method. The F_{k−1} × F_1 method generates a lower number of candidates because every candidate is guaranteed to contain at least one frequent (k−1)-itemset. While this procedure is a substantial improvement over the brute-force method, it can still produce a large number of unnecessary candidates, as the remaining subsets of a candidate itemset can still be infrequent.

Note that the approach discussed above does not prevent the same candidate itemset from being generated more than once. For instance, {Bread, Diapers, Milk} can be generated by merging {Bread, Diapers} with {Milk}, {Bread, Milk} with {Diapers}, or {Diapers, Milk} with {Bread}. One way to avoid generating duplicate candidates is by ensuring that the items in each frequent itemset are kept sorted in their lexicographic order. For example, itemsets such as {Bread, Diapers}, {Bread, Diapers, Milk}, and {Diapers, Milk} follow lexicographic order as the items within every itemset are arranged alphabetically. Each frequent (k−1)-itemset X is then extended with frequent items that are lexicographically larger than the items in X. For example, the itemset {Bread, Diapers} can be augmented with {Milk} because Milk is lexicographically larger than Bread and Diapers. However, we should not augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers} because they violate the lexicographic ordering condition. Every candidate k-itemset is thus generated exactly once, by merging the lexicographically largest item with the remaining k−1 items in the itemset. If the F_{k−1} × F_1 method is used in conjunction with lexicographic ordering, then only two candidate 3-itemsets will be produced in the example illustrated in Figure 5.7. {Beer, Bread, Diapers} and {Beer, Bread, Milk} will not be generated because {Beer, Bread} is not a frequent 2-itemset.
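As a concrete illustration of the lexicographic variant just described, the minimal Python sketch below extends each frequent (k−1)-itemset (stored as a sorted tuple) only with frequent items that are lexicographically larger than its last item. The function name and the data layout are illustrative choices, and the input uses the four frequent 2-itemsets from the running example.

    def candidate_gen_fk1_f1(frequent_k_minus_1, frequent_items):
        """F_{k-1} x F_1 candidate generation with lexicographic ordering.
        frequent_k_minus_1: iterable of sorted item tuples; frequent_items: 1-items."""
        items = sorted(frequent_items)
        candidates = []
        for itemset in sorted(frequent_k_minus_1):
            for item in items:
                # Append only items larger than the largest item already in the
                # itemset, so every candidate is generated exactly once.
                if item > itemset[-1]:
                    candidates.append(itemset + (item,))
        return candidates

    f2 = [("Beer", "Diapers"), ("Bread", "Diapers"),
          ("Bread", "Milk"), ("Diapers", "Milk")]
    f1 = ["Beer", "Bread", "Diapers", "Milk"]
    print(candidate_gen_fk1_f1(f2, f1))
    # [('Beer', 'Diapers', 'Milk'), ('Bread', 'Diapers', 'Milk')]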
F_{k−1} × F_{k−1} Method
This candidate generation procedure, which is used in the candidate-gen function of the Apriori algorithm, merges a pair of frequent (k−1)-itemsets only if their first k−2 items, arranged in lexicographic order, are identical. Let A = {a_1, a_2, …, a_{k−1}} and B = {b_1, b_2, …, b_{k−1}} be a pair of frequent (k−1)-itemsets, arranged lexicographically. A and B are merged if they satisfy the following condition:

a_i = b_i (for i = 1, 2, …, k−2).

Note that in this case, a_{k−1} ≠ b_{k−1} because A and B are two distinct itemsets. The candidate k-itemset generated by merging A and B consists of the first k−2 common items followed by a_{k−1} and b_{k−1} in lexicographic order. This candidate generation procedure is complete, because for every lexicographically ordered frequent k-itemset, there exist two lexicographically ordered frequent (k−1)-itemsets that have identical items in the first k−2 positions.

In Figure 5.8, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Beer, Diapers} with {Diapers, Milk} because the first item in both itemsets is different. Indeed, if {Beer, Diapers, Milk} is a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer, Milk} instead. This example illustrates both the completeness of the candidate generation procedure and the advantages of using lexicographic ordering to prevent duplicate candidates. Also, if we order the frequent (k−1)-itemsets according to their lexicographic rank, itemsets with identical first k−2 items would take consecutive ranks. As a result, the F_{k−1} × F_{k−1} candidate generation method would consider merging a frequent itemset only with ones that occupy the next few ranks in the sorted list, thus saving some computations.

Figure 5.8. Generating and pruning candidate k-itemsets by merging pairs of frequent (k−1)-itemsets.

Figure 5.8 shows that the F_{k−1} × F_{k−1} candidate generation procedure results in only one candidate 3-itemset. This is a considerable reduction from the four candidate 3-itemsets generated by the F_{k−1} × F_1 method. This is because the F_{k−1} × F_{k−1} method ensures that every candidate k-itemset contains at least two frequent (k−1)-itemsets, thus greatly reducing the number of candidates that are generated in this step.

Note that there can be multiple ways of merging two frequent (k−1)-itemsets in the F_{k−1} × F_{k−1} procedure, one of which is merging if their first k−2 items are identical. An alternate approach could be to merge two frequent (k−1)-itemsets A and B if the last k−2 items of A are identical to the first k−2 items of B. For example, {Bread, Diapers} and {Diapers, Milk} could be merged using this approach to generate the candidate 3-itemset {Bread, Diapers, Milk}. As we will see later, this alternate F_{k−1} × F_{k−1} procedure is useful in generating sequential patterns, which will be discussed in Chapter 6.
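A small sketch of the merge condition used by this procedure may also help. It assumes itemsets are kept as lexicographically sorted tuples and is an illustration of the rule stated above, not the textbook's candidate-gen pseudocode.

    def merge_fk1_fk1(a, b):
        """Merge two frequent (k-1)-itemsets (sorted tuples) whose first k-2
        items are identical; return the candidate k-itemset, or None otherwise."""
        if a[:-1] == b[:-1] and a[-1] != b[-1]:
            return a[:-1] + tuple(sorted((a[-1], b[-1])))
        return None

    # {Bread, Diapers} and {Bread, Milk} share their first item, so they merge:
    print(merge_fk1_fk1(("Bread", "Diapers"), ("Bread", "Milk")))
    # ('Bread', 'Diapers', 'Milk')
    # {Beer, Diapers} and {Diapers, Milk} differ in their first item, so they do not:
    print(merge_fk1_fk1(("Beer", "Diapers"), ("Diapers", "Milk")))
    # None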
Candidate Pruning
To illustrate the candidate pruning operation for a candidate k-itemset, X = {i_1, i_2, …, i_k}, consider its k proper subsets, X − {i_j} (∀ j = 1, 2, …, k). If any of them are infrequent, then X is immediately pruned by using the Apriori principle. Note that we don't need to explicitly ensure that all subsets of X of size less than k−1 are frequent (see Exercise 7). This approach greatly reduces the number of candidate itemsets considered during support counting. For the brute-force candidate generation method, candidate pruning requires checking only k subsets of size k−1 for each candidate k-itemset. However, since the F_{k−1} × F_1 candidate generation strategy ensures that at least one of the (k−1)-size subsets of every candidate k-itemset is frequent, we only need to check for the remaining k−1 subsets. Likewise, the F_{k−1} × F_{k−1} strategy requires examining only k−2 subsets of every candidate k-itemset, since two of its (k−1)-size subsets are already known to be frequent in the candidate generation step.

5.2.4 Support Counting

Support counting is the process of determining the frequency of occurrence for every candidate itemset that survives the candidate pruning step. Support counting is implemented in steps 6 through 11 of Algorithm 5.1. A brute-force approach for doing this is to compare each transaction against every candidate itemset (see Figure 5.2) and to update the support counts of candidates contained in a transaction. This approach is computationally expensive, especially when the numbers of transactions and candidate itemsets are large.
An alternative approach is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets. To illustrate, consider a transaction t that contains five items, {1, 2, 3, 5, 6}. There are $\binom{5}{3} = 10$ itemsets of size 3 contained in this transaction. Some of the itemsets may correspond to the candidate 3-itemsets under investigation, in which case, their support counts are incremented. Other subsets of t that do not correspond to any candidates can be ignored.

Figure 5.9 shows a systematic way for enumerating the 3-itemsets contained in t. Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by specifying the smallest item first, followed by the larger items. For instance, given t = {1, 2, 3, 5, 6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are only two items in t whose labels are greater than or equal to 5. The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix tree structure depicted in Figure 5.9. For instance, 1 represents a 3-itemset that begins with item 1, followed by two more items chosen from the set {2, 3, 5, 6}.

Figure 5.9. Enumerating subsets of three items from a transaction t.

After fixing the first item, the prefix tree structure at Level 2 represents the number of ways to select the second item. For example, 1 2 corresponds to itemsets that begin with the prefix (1 2) and are followed by the items 3, 5, or 6. Finally, the prefix tree structure at Level 3 represents the complete set of 3-itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with prefix {2 3} are {2, 3, 5} and {2, 3, 6}.

The prefix tree structure shown in Figure 5.9 demonstrates how itemsets contained in a transaction can be systematically enumerated, i.e., by specifying their items one by one, from the leftmost item to the rightmost item. We still have to determine whether each enumerated 3-itemset corresponds to an existing candidate itemset. If it matches one of the candidates, then the support count of the corresponding candidate is incremented. In the next Section, we illustrate how this matching operation can be performed efficiently using a hash tree structure.
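The enumeration just described is easy to express with a library routine. The sketch below uses itertools.combinations and an illustrative (made-up) set of candidate 3-itemsets; support counts are incremented only for subsets of t that are actual candidates.

    from itertools import combinations

    t = (1, 2, 3, 5, 6)                      # the transaction from the example
    # A hypothetical set of candidate 3-itemsets under investigation.
    candidates = {(1, 2, 3), (1, 3, 6), (2, 5, 6), (4, 5, 6)}
    support = dict.fromkeys(candidates, 0)

    # Enumerate the C(5, 3) = 10 size-3 subsets of t in lexicographic order.
    for subset in combinations(t, 3):
        if subset in candidates:
            support[subset] += 1             # matched an existing candidate

    print(support)
    # e.g. {(1, 2, 3): 1, (1, 3, 6): 1, (2, 5, 6): 1, (4, 5, 6): 0} (dict order may vary)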
Support Counting Using a Hash Tree*
In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree. During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets. That way, instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket, as shown in Figure 5.10.

Figure 5.10. Counting the support of itemsets using hash structure.

Figure 5.11 shows an example of a hash tree structure. Each internal node of the tree uses the following hash function, h(p) = (p − 1) mod 3, where mod refers to the modulo (remainder) operator, to determine which branch of the current node should be followed next. For example, items 1, 4, and 7 are hashed to the same branch (i.e., the leftmost branch) because they have the same remainder after dividing the number by 3. All candidate itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in Figure 5.11 contains 15 candidate 3-itemsets, distributed across 9 leaf nodes.

Figure 5.11. Hashing a transaction at the root node of a hash tree.

Consider the transaction, t = {1, 2, 3, 4, 5, 6}. To update the support counts of the candidate itemsets, the hash tree must be traversed in such a way that all the leaf nodes containing candidate 3-itemsets belonging to t must be visited at least once. Recall that the 3-itemsets contained in t must begin with items 1, 2, or 3, as indicated by the Level 1 prefix tree structure shown in Figure 5.9. Therefore, at the root node of the hash tree, the items 1, 2, and 3 of the transaction are hashed separately. Item 1 is hashed to the left child of the root node, item 2 is hashed to the middle child, and item 3 is hashed to the right child. At the next level of the tree, the transaction is hashed on the second item listed in the Level 2 tree structure shown in Figure 5.9. For example, after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction are hashed. Based on the hash function, items 2 and 5 are hashed to the middle child, while item 3 is hashed to the right child, as shown in Figure 5.12. This process continues until the leaf nodes of the hash tree are reached. The candidate itemsets stored at the visited leaf nodes are compared against the transaction. If a candidate is a subset of the transaction, its support count is incremented. Note that not all the leaf nodes are visited while traversing the hash tree, which helps in reducing the computational cost. In this example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared against the transaction.

Figure 5.12. Subset operation on the leftmost subtree of the root of a candidate hash tree.
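To make the bucketing idea concrete, the following sketch hashes each candidate 3-itemset down an idealized depth-3 tree using the hash function h(p) = (p − 1) mod 3 from the text. The dictionary-of-buckets representation, and the small candidate list, are simplifying assumptions; the real hash tree additionally splits overfull leaves and may store candidates at different depths.

    from itertools import combinations

    def h(p):
        return (p - 1) % 3           # hash function applied at every internal node

    def bucket_key(itemset):
        """Path of branch indices followed by a sorted 3-itemset from the root."""
        return tuple(h(p) for p in itemset)

    # Illustrative candidate 3-itemsets (not the exact ones in Figure 5.11).
    candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (2, 3, 4), (5, 6, 7), (3, 6, 8)]
    buckets = {}
    for c in candidates:
        buckets.setdefault(bucket_key(c), []).append(c)

    # A transaction is matched only against candidates in the leaves it reaches.
    t = (1, 2, 3, 4, 5, 6)
    visited = {bucket_key(s) for s in combinations(t, 3)}
    compared = sum(len(buckets[b]) for b in visited if b in buckets)
    print(compared, "candidates compared out of", len(candidates))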
5.2.5ComputationalComplexity
ThecomputationalcomplexityoftheApriorialgorithm,whichincludesbothitsruntimeandstorage,canbeaffectedbythefollowingfactors.
SupportThreshold
Loweringthesupportthresholdoftenresultsinmoreitemsetsbeingdeclaredasfrequent.Thishasanadverseeffectonthecomputationalcomplexityofthealgorithmbecausemorecandidateitemsetsmustbegeneratedandcounted
ateverylevel,asshowninFigure5.13 .Themaximumsizeoffrequentitemsetsalsotendstoincreasewithlowersupportthresholds.ThisincreasesthetotalnumberofiterationstobeperformedbytheApriorialgorithm,furtherincreasingthecomputationalcost.
Figure5.13.Effectofsupportthresholdonthenumberofcandidateandfrequentitemsetsobtainedfromabenchmarkdataset.
NumberofItems(Dimensionality)
Asthenumberofitemsincreases,morespacewillbeneededtostorethesupportcountsofitems.Ifthenumberoffrequentitemsalsogrowswiththedimensionalityofthedata,theruntimeandstoragerequirementswillincreasebecauseofthelargernumberofcandidateitemsetsgeneratedbythealgorithm.
NumberofTransactions
BecausetheApriorialgorithmmakesrepeatedpassesoverthetransactiondataset,itsruntimeincreaseswithalargernumberoftransactions.
AverageTransactionWidth
Fordensedatasets,theaveragetransactionwidthcanbeverylarge.ThisaffectsthecomplexityoftheApriorialgorithmintwoways.First,themaximumsizeoffrequentitemsetstendstoincreaseastheaveragetransactionwidthincreases.Asaresult,morecandidateitemsetsmustbeexaminedduringcandidategenerationandsupportcounting,asillustratedinFigure5.14 .Second,asthetransactionwidthincreases,moreitemsetsarecontainedinthetransaction.Thiswillincreasethenumberofhashtreetraversalsperformedduringsupportcounting.
A detailed analysis of the time complexity for the Apriori algorithm is presented next.

Figure 5.14. Effect of average transaction width on the number of candidate and frequent itemsets obtained from a synthetic data set.
Generation of frequent 1-itemsets
For each transaction, we need to update the support count for every item present in the transaction. Assuming that w is the average transaction width, this operation requires O(Nw) time, where N is the total number of transactions.

Candidate generation
To generate candidate k-itemsets, pairs of frequent (k−1)-itemsets are merged to determine whether they have at least k−2 items in common. Each merging operation requires at most k−2 equality comparisons. Every merging step can produce at most one viable candidate k-itemset, while in the worst case, the algorithm must try to merge every pair of frequent (k−1)-itemsets found in the previous iteration. Therefore, the overall cost of merging frequent itemsets is

$\sum_{k=2}^{w}(k-2)\,|C_k| \; < \; \text{Cost of merging} \; < \; \sum_{k=2}^{w}(k-2)\,|F_{k-1}|^2,$

where w is the maximum transaction width. A hash tree is also constructed during candidate generation to store the candidate itemsets. Because the maximum depth of the tree is k, the cost for populating the hash tree with candidate itemsets is $O\big(\sum_{k=2}^{w} k\,|C_k|\big)$. During candidate pruning, we need to verify that the k−2 subsets of every candidate k-itemset are frequent. Since the cost for looking up a candidate in a hash tree is O(k), the candidate pruning step requires $O\big(\sum_{k=2}^{w} k(k-2)\,|C_k|\big)$ time.

Support counting
Each transaction of width |t| produces $\binom{|t|}{k}$ itemsets of size k. This is also the effective number of hash tree traversals performed for each transaction. The cost for support counting is $O\big(N \sum_k \binom{w}{k}\alpha_k\big)$, where w is the maximum transaction width and $\alpha_k$ is the cost for updating the support count of a candidate k-itemset in the hash tree.
5.3 Rule Generation
This Section describes how to extract association rules efficiently from a given frequent itemset. Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules that have empty antecedents or consequents (∅ → Y or Y → ∅). An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold. Note that all such rules must have already met the support threshold because they are generated from a frequent itemset.

Example 5.2. Let X = {a, b, c} be a frequent itemset. There are six candidate association rules that can be generated from X: {a, b} → {c}, {a, c} → {b}, {b, c} → {a}, {a} → {b, c}, {b} → {a, c}, and {c} → {a, b}. As each of their support is identical to the support for X, all the rules satisfy the support threshold.

Computing the confidence of an association rule does not require additional scans of the transaction data set. Consider the rule {1, 2} → {3}, which is generated from the frequent itemset X = {1, 2, 3}. The confidence for this rule is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent, too. Since the support counts for both itemsets were already found during frequent itemset generation, there is no need to read the entire data set again.
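Because the needed support counts are already available after frequent itemset generation, confidence can be computed by dictionary lookups alone. The sketch below assumes a support_count mapping from itemsets (frozensets) to counts; the specific values are hypothetical.

    # Support counts gathered during frequent itemset generation (illustrative values).
    support_count = {
        frozenset({1, 2, 3}): 2,
        frozenset({1, 2}): 3,
    }

    def confidence(antecedent, consequent):
        """conf(X -> Y - X) = sigma(Y) / sigma(X), where Y = X union (Y - X)."""
        union = frozenset(antecedent) | frozenset(consequent)
        return support_count[union] / support_count[frozenset(antecedent)]

    print(confidence({1, 2}, {3}))   # sigma({1,2,3}) / sigma({1,2}) = 2/3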
5.3.1 Confidence-Based Pruning
Confidence does not show the anti-monotone property in the same way as the support measure. For example, the confidence for a rule X → Y can be larger, smaller, or equal to the confidence for another rule X̃ → Ỹ, where X̃ ⊆ X and Ỹ ⊆ Y (see Exercise 3 on page 439). Nevertheless, if we compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure.

Theorem 5.2. Let Y be an itemset and X a subset of Y. If a rule X → Y − X does not satisfy the confidence threshold, then any rule X̃ → Y − X̃, where X̃ is a subset of X, must not satisfy the confidence threshold as well.

To prove this theorem, consider the following two rules: X̃ → Y − X̃ and X → Y − X, where X̃ ⊂ X. The confidences of the rules are σ(Y)/σ(X̃) and σ(Y)/σ(X), respectively. Since X̃ is a subset of X, σ(X̃) ≥ σ(X). Therefore, the former rule cannot have a higher confidence than the latter rule.

5.3.2 Rule Generation in Apriori Algorithm

The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent.
Initially, all the high confidence rules that have only one item in the rule consequent are extracted. These rules are then used to generate new candidate rules. For example, if {acd} → {b} and {abd} → {c} are high confidence rules, then the candidate rule {ad} → {bc} is generated by merging the consequents of both rules. Figure 5.15 shows a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}. If any node in the lattice has low confidence, then according to Theorem 5.2, the entire subgraph spanned by the node can be pruned immediately. Suppose the confidence for {bcd} → {a} is low. All the rules containing item a in their consequent, including {cd} → {ab}, {bd} → {ac}, {bc} → {ad}, and {d} → {abc}, can be discarded.

Figure 5.15. Pruning of association rules using the confidence measure.
A pseudocode for the rule generation step is shown in Algorithms 5.2 and 5.3. Note the similarity between the ap-genrules procedure given in Algorithm 5.3 and the frequent itemset generation procedure given in Algorithm 5.1. The only difference is that, in rule generation, we do not have to make additional passes over the data set to compute the confidence of the candidate rules. Instead, we determine the confidence of each rule by using the support counts computed during frequent itemset generation.

Algorithm 5.2 Rule generation of the Apriori algorithm.

Algorithm 5.3 Procedure ap-genrules(f_k, H_m).
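The level-wise rule generation idea of Algorithms 5.2 and 5.3 can be sketched in Python as follows: consequents grow by one item per level and, per Theorem 5.2, only consequents of confident rules are extended further. The function and variable names are illustrative and not the book's ap-genrules; support_count is assumed to hold the counts of the frequent itemset and all of its subsets.

    from itertools import combinations

    def generate_rules(freq_itemset, support_count, minconf):
        """Return confident rules (antecedent, consequent, conf) from one frequent itemset."""
        rules = []
        k = len(freq_itemset)
        # Level 1: consequents containing a single item.
        consequents = [frozenset([i]) for i in freq_itemset]
        while consequents and len(next(iter(consequents))) < k:
            confident = []
            for cons in consequents:
                ant = freq_itemset - cons
                conf = support_count[freq_itemset] / support_count[ant]
                if conf >= minconf:
                    confident.append(cons)
                    rules.append((ant, cons, conf))
            # Merge confident consequents that differ in one item to form the next level;
            # consequents of low-confidence rules are never extended (Theorem 5.2).
            next_level = set()
            for a, b in combinations(confident, 2):
                union = a | b
                if len(union) == len(a) + 1:
                    next_level.add(union)
            consequents = list(next_level)
        return rules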
5.3.3 An Example: Congressional Voting Records

This Section demonstrates the results of applying association analysis to the voting records of members of the United States House of Representatives. The data is obtained from the 1984 Congressional Voting Records Database, which is available at the UCI machine learning data repository. Each transaction contains information about the party affiliation for a representative along with his or her voting record on 16 key issues. There are 435 transactions and 34 items in the data set. The set of items are listed in Table 5.3.

Table 5.3. List of binary attributes from the 1984 United States Congressional Voting Records. Source: The UCI machine learning repository.

1. Republican
2. Democrat
3. handicapped-infants = yes
4. handicapped-infants = no
5. water project cost sharing = yes
6. water project cost sharing = no
7. budget-resolution = yes
8. budget-resolution = no
9. physician fee freeze = yes
10. physician fee freeze = no
11. aid to El Salvador = yes
12. aid to El Salvador = no
13. religious groups in schools = yes
14. religious groups in schools = no
15. anti-satellite test ban = yes
16. anti-satellite test ban = no
17. aid to Nicaragua = yes
18. aid to Nicaragua = no
19. MX-missile = yes
20. MX-missile = no
21. immigration = yes
22. immigration = no
23. synfuel corporation cutback = yes
24. synfuel corporation cutback = no
25. education spending = yes
26. education spending = no
27. right-to-sue = yes
28. right-to-sue = no
29. crime = yes
30. crime = no
31. duty-free-exports = yes
32. duty-free-exports = no
33. export administration act = yes
34. export administration act = no

The Apriori algorithm is then applied to the data set with minsup = 30% and minconf = 90%. Some of the high confidence rules extracted by the algorithm are shown in Table 5.4. The first two rules suggest that most of the members who voted yes for aid to El Salvador and no for budget resolution and MX missile are Republicans; while those who voted no for aid to El Salvador and yes for budget resolution and MX missile are Democrats. These high confidence rules show the key issues that divide members from both political parties.
Table 5.4. Association rules extracted from the 1984 United States Congressional Voting Records.

Association Rule: Confidence
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} → {Republican}: 91.0%
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} → {Democrat}: 97.5%
{crime = yes, right-to-sue = yes, physician fee freeze = yes} → {Republican}: 93.5%
{crime = no, right-to-sue = no, physician fee freeze = no} → {Democrat}: 100%
5.4CompactRepresentationofFrequentItemsetsInpractice,thenumberoffrequentitemsetsproducedfromatransactiondatasetcanbeverylarge.Itisusefultoidentifyasmallrepresentativesetoffrequentitemsetsfromwhichallotherfrequentitemsetscanbederived.TwosuchrepresentationsarepresentedinthisSectionintheformofmaximalandclosedfrequentitemsets.
5.4.1MaximalFrequentItemsets
Definition5.3.(MaximalFrequentItemset.)Afrequentitemsetismaximalifnoneofitsimmediatesupersetsarefrequent.
Toillustratethisconcept,considertheitemsetlatticeshowninFigure5.16 .Theitemsetsinthelatticearedividedintotwogroups:thosethatarefrequentandthosethatareinfrequent.Afrequentitemsetborder,whichisrepresentedbyadashedline,isalsoillustratedinthediagram.Everyitemsetlocatedabovetheborderisfrequent,whilethoselocatedbelowtheborder(theshadednodes)areinfrequent.Amongtheitemsetsresidingneartheborder,
{a,d},{a,c,e},and{b,c,d,e}aremaximalfrequentitemsetsbecausealloftheirimmediatesupersetsareinfrequent.Forexample,theitemset{a,d}ismaximalfrequentbecauseallofitsimmediatesupersets,{a,b,d},{a,c,d},and{a,d,e},areinfrequent.Incontrast,{a,c}isnon-maximalbecauseoneofitsimmediatesupersets,{a,c,e},isfrequent.
Figure5.16.Maximalfrequentitemset.
Maximalfrequentitemsetseffectivelyprovideacompactrepresentationoffrequentitemsets.Inotherwords,theyformthesmallestsetofitemsetsfromwhichallfrequentitemsetscanbederived.Forexample,everyfrequentitemsetinFigure5.16 isasubsetofoneofthethreemaximalfrequent
itemsets,{a,d},{a,c,e},and{b,c,d,e}.Ifanitemsetisnotapropersubsetofanyofthemaximalfrequentitemsets,thenitiseitherinfrequent(e.g.,{a,d,e})ormaximalfrequentitself(e.g.,{b,c,d,e}).Hence,themaximalfrequentitemsets{a,c,e},{a,d},and{b,c,d,e}provideacompactrepresentationofthefrequentitemsetsshowninFigure5.16 .Enumeratingallthesubsetsofmaximalfrequentitemsetsgeneratesthecompletelistofallfrequentitemsets.
Maximalfrequentitemsetsprovideavaluablerepresentationfordatasetsthatcanproduceverylong,frequentitemsets,asthereareexponentiallymanyfrequentitemsetsinsuchdata.Nevertheless,thisapproachispracticalonlyifanefficientalgorithmexiststoexplicitlyfindthemaximalfrequentitemsets.WebrieflydescribeonesuchapproachinSection5.5 .
Despiteprovidingacompactrepresentation,maximalfrequentitemsetsdonotcontainthesupportinformationoftheirsubsets.Forexample,thesupportofthemaximalfrequentitemsets{a,c,e},{a,d},and{b,c,d,e}donotprovideanyinformationaboutthesupportoftheirsubsetsexceptthatitmeetsthesupportthreshold.Anadditionalpassoverthedatasetisthereforeneededtodeterminethesupportcountsofthenon-maximalfrequentitemsets.Insomecases,itisdesirabletohaveaminimalrepresentationofitemsetsthatpreservesthesupportinformation.WedescribesucharepresentationinthenextSection.
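Given the full collection of frequent itemsets, maximality per Definition 5.3 can be checked directly. The sketch below does so by brute force (quadratic in the number of frequent itemsets), which suffices for illustration although dedicated maximal-itemset miners work differently; the example input is illustrative.

    def maximal_frequent(frequent_itemsets):
        """frequent_itemsets: iterable of frozensets; return those with no
        frequent proper superset, i.e., the maximal frequent itemsets."""
        fsets = set(frequent_itemsets)
        return [f for f in fsets if not any(f < other for other in fsets)]

    frequent = [frozenset(s) for s in
                [{'a'}, {'d'}, {'a', 'd'}, {'a', 'c'}, {'c', 'e'}, {'a', 'c', 'e'}]]
    print(maximal_frequent(frequent))
    # [frozenset({'a', 'd'}), frozenset({'a', 'c', 'e'})] (order may vary)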
5.4.2ClosedItemsets
Closeditemsetsprovideaminimalrepresentationofallitemsetswithoutlosingtheirsupportinformation.Aformaldefinitionofacloseditemsetispresentedbelow.
Definition5.4.(ClosedItemset.)AnitemsetXisclosedifnoneofitsimmediatesupersetshasexactlythesamesupportcountasX.
Putanotherway,XisnotclosedifatleastoneofitsimmediatesupersetshasthesamesupportcountasX.ExamplesofcloseditemsetsareshowninFigure5.17 .Tobetterillustratethesupportcountofeachitemset,wehaveassociatedeachnode(itemset)inthelatticewithalistofitscorrespondingtransactionIDs.Forexample,sincethenode{b,c}isassociatedwithtransactionIDs1,2,and3,itssupportcountisequaltothree.Fromthetransactionsgiveninthisdiagram,noticethatthesupportfor{b}isidenticalto{b,c}.Thisisbecauseeverytransactionthatcontainsbalsocontainsc.Hence,{b}isnotacloseditemset.Similarly,sincecoccursineverytransactionthatcontainsbothaandd,theitemset{a,d}isnotclosedasithasthesamesupportasitssuperset{a,c,d}.Ontheotherhand,{b,c}isacloseditemsetbecauseitdoesnothavethesamesupportcountasanyofitssupersets.
Figure5.17.Anexampleoftheclosedfrequentitemsets(withminimumsupportequalto40%).
Aninterestingpropertyofcloseditemsetsisthatifweknowtheirsupportcounts,wecanderivethesupportcountofeveryotheritemsetintheitemsetlatticewithoutmakingadditionalpassesoverthedataset.Forexample,considerthe2-itemset{b,e}inFigure5.17 .Since{b,e}isnotclosed,itssupportmustbeequaltothesupportofoneofitsimmediatesupersets,{a,b,e},{b,c,e},and{b,d,e}.Further,noneofthesupersetsof{b,e}canhaveasupportgreaterthanthesupportof{b,e},duetotheanti-monotonenatureofthesupportmeasure.Hence,thesupportof{b,e}canbecomputedbyexaminingthesupportcountsofallofitsimmediatesupersetsofsizethree
and taking their maximum value. If an immediate superset is closed (e.g., {b, c, e}), we would know its support count. Otherwise, we can recursively compute its support by examining the supports of its immediate supersets of size four. In general, the support count of any non-closed (k−1)-itemset can be determined as long as we know the support counts of all k-itemsets. Hence, one can devise an iterative algorithm that computes the support counts of itemsets at level k−1 using the support counts of itemsets at level k, starting from the level k_max, where k_max is the size of the largest closed itemset.

Even though closed itemsets provide a compact representation of the support counts of all itemsets, they can still be exponentially large in number. Moreover, for most practical applications, we only need to determine the support count of all frequent itemsets. In this regard, closed frequent itemsets provide a compact representation of the support counts of all frequent itemsets, which can be defined as follows.

Definition 5.5. (Closed Frequent Itemset.) An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.

In the previous example, assuming that the support threshold is 40%, {b, c} is a closed frequent itemset because its support is 60%. In Figure 5.17, the closed frequent itemsets are indicated by the shaded nodes.

Algorithms are available to explicitly extract closed frequent itemsets from a given data set. Interested readers may refer to the Bibliographic Notes at the
end of this chapter for further discussions of these algorithms. We can use closed frequent itemsets to determine the support counts for all non-closed frequent itemsets. For example, consider the frequent itemset {a, d} shown in Figure 5.17. Because this itemset is not closed, its support count must be equal to the maximum support count of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}. Also, since {a, d} is frequent, we only need to consider the support of its frequent supersets. In general, the support count of every non-closed frequent k-itemset can be obtained by considering the support of all its frequent supersets of size k+1. For example, since the only frequent superset of {a, d} is {a, c, d}, its support is equal to the support of {a, c, d}, which is 2. Using this methodology, an algorithm can be developed to compute the support for every frequent itemset. The pseudocode for this algorithm is shown in Algorithm 5.4. The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent itemsets. This is because, in order to find the support for a non-closed frequent itemset, the support for all of its supersets must be known. Note that the set of all frequent itemsets can be easily computed by taking the union of all subsets of frequent closed itemsets.

Algorithm 5.4 Support counting using closed frequent itemsets.
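The idea behind Algorithm 5.4 can be sketched compactly by iterating the argument above: a non-closed frequent itemset inherits the largest support among the closed frequent itemsets that contain it. The input format (a dict from closed frequent itemsets to their support counts) and the example values are assumptions for illustration.

    from itertools import combinations

    def supports_from_closed(closed_supports):
        """closed_supports: dict mapping closed frequent itemsets (frozensets) to
        support counts. Returns the support of every frequent itemset, using the
        fact that an itemset's support equals the largest support among the
        closed frequent itemsets containing it."""
        all_supports = {}
        for closed, sup in closed_supports.items():
            for k in range(1, len(closed) + 1):
                for subset in map(frozenset, combinations(closed, k)):
                    if sup > all_supports.get(subset, 0):
                        all_supports[subset] = sup
        return all_supports

    closed = {frozenset({'a', 'c', 'd'}): 2, frozenset({'b', 'c'}): 3}
    print(supports_from_closed(closed)[frozenset({'a', 'd'})])   # 2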
To illustrate the advantage of using closed frequent itemsets, consider the data set shown in Table 5.5, which contains ten transactions and fifteen items. The items can be divided into three groups: (1) Group A, which contains items a1 through a5; (2) Group B, which contains items b1 through b5; and (3) Group C, which contains items c1 through c5. Assuming that the support threshold is 20%, itemsets involving items from the same group are frequent, but itemsets involving items from different groups are infrequent. The total number of frequent itemsets is thus 3 × (2^5 − 1) = 93. However, there are only four closed frequent itemsets in the data: {a3, a4}, {a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5}, and {c1, c2, c3, c4, c5}. It is often sufficient to present only the closed frequent itemsets to the analysts instead of the entire set of frequent itemsets.

Table 5.5. A transaction data set for mining closed itemsets.

TID   a1 a2 a3 a4 a5   b1 b2 b3 b4 b5   c1 c2 c3 c4 c5
1     1  1  1  1  1    0  0  0  0  0    0  0  0  0  0
2     1  1  1  1  1    0  0  0  0  0    0  0  0  0  0
3     1  1  1  1  1    0  0  0  0  0    0  0  0  0  0
4     0  0  1  1  0    1  1  1  1  1    0  0  0  0  0
5     0  0  0  0  0    1  1  1  1  1    0  0  0  0  0
6     0  0  0  0  0    1  1  1  1  1    0  0  0  0  0
7     0  0  0  0  0    0  0  0  0  0    1  1  1  1  1
8     0  0  0  0  0    0  0  0  0  0    1  1  1  1  1
9     0  0  0  0  0    0  0  0  0  0    1  1  1  1  1
10    0  0  0  0  0    0  0  0  0  0    1  1  1  1  1
Finally,notethatallmaximalfrequentitemsetsareclosedbecausenoneofthemaximalfrequentitemsetscanhavethesamesupportcountastheirimmediatesupersets.Therelationshipsamongfrequent,closed,closedfrequent,andmaximalfrequentitemsetsareshowninFigure5.18 .
Figure5.18.Relationshipsamongfrequent,closed,closedfrequent,andmaximalfrequentitemsets.
5.5AlternativeMethodsforGeneratingFrequentItemsets*Aprioriisoneoftheearliestalgorithmstohavesuccessfullyaddressedthecombinatorialexplosionoffrequentitemsetgeneration.ItachievesthisbyapplyingtheAprioriprincipletoprunetheexponentialsearchspace.Despiteitssignificantperformanceimprovement,thealgorithmstillincursconsiderableI/Ooverheadsinceitrequiresmakingseveralpassesoverthetransactiondataset.Inaddition,asnotedinSection5.2.5 ,theperformanceoftheApriorialgorithmmaydegradesignificantlyfordensedatasetsbecauseoftheincreasingwidthoftransactions.SeveralalternativemethodshavebeendevelopedtoovercometheselimitationsandimproveupontheefficiencyoftheApriorialgorithm.Thefollowingisahigh-leveldescriptionofthesemethods.
TraversalofItemsetLattice
AsearchforfrequentitemsetscanbeconceptuallyviewedasatraversalontheitemsetlatticeshowninFigure5.1 .Thesearchstrategyemployedbyanalgorithmdictateshowthelatticestructureistraversedduringthefrequentitemsetgenerationprocess.Somesearchstrategiesarebetterthanothers,dependingontheconfigurationoffrequentitemsetsinthelattice.Anoverviewofthesestrategiesispresentednext.
General-to-Specific versus Specific-to-General: The Apriori algorithm uses a general-to-specific search strategy, where pairs of frequent (k−1)-itemsets are merged to obtain candidate k-itemsets. This general-to-specific search strategy is effective, provided the maximum length of a frequent itemset is not too long. The configuration of frequent itemsets that works best with this strategy is shown in Figure 5.19(a), where the darker nodes represent infrequent itemsets. Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets first, before finding the more general frequent itemsets. This strategy is useful to discover maximal frequent itemsets in dense transactions, where the frequent itemset border is located near the bottom of the lattice, as shown in Figure 5.19(b). The Apriori principle can be applied to prune all subsets of maximal frequent itemsets. Specifically, if a candidate k-itemset is maximal frequent, we do not have to examine any of its subsets of size k−1. However, if the candidate k-itemset is infrequent, we need to check all of its k−1 subsets in the next iteration. Another approach is to combine both general-to-specific and specific-to-general search strategies. This bidirectional approach requires more space to store the candidate itemsets, but it can help to rapidly identify the frequent itemset border, given the configuration shown in Figure 5.19(c).

Figure 5.19. General-to-specific, specific-to-general, and bidirectional search.
EquivalenceClasses:Anotherwaytoenvisionthetraversalistofirstpartitionthelatticeintodisjointgroupsofnodes(orequivalenceclasses).Afrequentitemsetgenerationalgorithmsearchesforfrequentitemsetswithinaparticularequivalenceclassfirstbeforemovingtoanotherequivalenceclass.Asanexample,thelevel-wisestrategyusedintheApriorialgorithmcanbeconsideredtobepartitioningthelatticeonthebasisofitemsetsizes;i.e.,thealgorithmdiscoversallfrequent1-itemsetsfirstbeforeproceedingtolarger-sizeditemsets.Equivalenceclassescanalsobedefinedaccordingtotheprefixorsuffixlabelsofanitemset.Inthiscase,twoitemsetsbelongtothesameequivalenceclassiftheyshareacommonprefixorsuffixoflengthk.Intheprefix-basedapproach,thealgorithmcansearchforfrequentitemsetsstartingwiththeprefixabeforelookingforthosestartingwithprefixesb,c,andsoon.Bothprefix-basedandsuffix-basedequivalenceclassescanbedemonstratedusingthetree-likestructureshowninFigure5.20 .
Figure5.20.Equivalenceclassesbasedontheprefixandsuffixlabelsofitemsets.
Breadth-FirstversusDepth-First:TheApriorialgorithmtraversesthelatticeinabreadth-firstmanner,asshowninFigure5.21(a) .Itfirstdiscoversallthefrequent1-itemsets,followedbythefrequent2-itemsets,andsoon,untilnonewfrequentitemsetsaregenerated.Theitemsetlatticecanalsobetraversedinadepth-firstmanner,asshowninFigures5.21(b) and5.22 .Thealgorithmcanstartfrom,say,nodeainFigure5.22 ,andcountitssupporttodeterminewhetheritisfrequent.Ifso,thealgorithmprogressivelyexpandsthenextlevelofnodes,i.e.,ab,abc,andsoon,untilaninfrequentnodeisreached,say,abcd.Itthenbacktrackstoanotherbranch,say,abce,andcontinuesthesearchfromthere.
Figure5.21.Breadth-firstanddepth-firsttraversals.
Figure5.22.Generatingcandidateitemsetsusingthedepth-firstapproach.
Thedepth-firstapproachisoftenusedbyalgorithmsdesignedtofindmaximalfrequentitemsets.Thisapproachallowsthefrequentitemsetbordertobedetectedmorequicklythanusingabreadth-firstapproach.
Onceamaximalfrequentitemsetisfound,substantialpruningcanbeperformedonitssubsets.Forexample,ifthenodebcdeshowninFigure5.22 ismaximalfrequent,thenthealgorithmdoesnothavetovisitthesubtreesrootedatbd,be,c,d,andebecausetheywillnotcontainanymaximalfrequentitemsets.However,ifabcismaximalfrequent,onlythenodessuchasacandbcarenotmaximalfrequent(butthesubtreesofacandbcmaystillcontainmaximalfrequentitemsets).Thedepth-firstapproachalsoallowsadifferentkindofpruningbasedonthesupportofitemsets.Forexample,supposethesupportfor{a,b,c}isidenticaltothesupportfor{a,b}.Thesubtreesrootedatabdandabecanbeskipped
becausetheyareguaranteednottohaveanymaximalfrequentitemsets.Theproofofthisisleftasanexercisetothereaders.
RepresentationofTransactionDataSet
Therearemanywaystorepresentatransactiondataset.ThechoiceofrepresentationcanaffecttheI/Ocostsincurredwhencomputingthesupportofcandidateitemsets.Figure5.23 showstwodifferentwaysofrepresentingmarketbaskettransactions.Therepresentationontheleftiscalledahorizontaldatalayout,whichisadoptedbymanyassociationruleminingalgorithms,includingApriori.Anotherpossibilityistostorethelistoftransactionidentifiers(TID-list)associatedwitheachitem.Sucharepresentationisknownastheverticaldatalayout.ThesupportforeachcandidateitemsetisobtainedbyintersectingtheTID-listsofitssubsetitems.ThelengthoftheTID-listsshrinksasweprogresstolargersizeditemsets.However,oneproblemwiththisapproachisthattheinitialsetofTID-listsmightbetoolargetofitintomainmemory,thusrequiringmoresophisticatedtechniquestocompresstheTID-lists.WedescribeanothereffectiveapproachtorepresentthedatainthenextSection.
Figure 5.23. Horizontal and vertical data format.
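With the vertical layout, support counting becomes an intersection of TID-lists. A minimal sketch, with made-up TID-lists, is shown below.

    # Vertical data layout: each item maps to the set of transaction IDs containing it.
    tid_lists = {
        'a': {1, 4, 5, 6, 7, 8, 9},
        'b': {1, 2, 5, 7, 8, 10},
        'c': {2, 3, 4, 8, 9},
    }

    def support_count(itemset, tid_lists):
        """Support of an itemset = size of the intersection of its items' TID-lists."""
        tids = None
        for item in itemset:
            tids = tid_lists[item] if tids is None else tids & tid_lists[item]
        return len(tids)

    print(support_count({'a', 'b'}, tid_lists))   # |{1, 5, 7, 8}| = 4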
5.6FP-GrowthAlgorithm*ThisSectionpresentsanalternativealgorithmcalledFP-growththattakesaradicallydifferentapproachtodiscoveringfrequentitemsets.Thealgorithmdoesnotsubscribetothegenerate-and-testparadigmofApriori.Instead,itencodesthedatasetusingacompactdatastructurecalledanFP-treeandextractsfrequentitemsetsdirectlyfromthisstructure.Thedetailsofthisapproacharepresentednext.
5.6.1FP-TreeRepresentation
AnFP-treeisacompressedrepresentationoftheinputdata.ItisconstructedbyreadingthedatasetonetransactionatatimeandmappingeachtransactionontoapathintheFP-tree.Asdifferenttransactionscanhaveseveralitemsincommon,theirpathsmightoverlap.Themorethepathsoverlapwithoneanother,themorecompressionwecanachieveusingtheFP-treestructure.IfthesizeoftheFP-treeissmallenoughtofitintomainmemory,thiswillallowustoextractfrequentitemsetsdirectlyfromthestructureinmemoryinsteadofmakingrepeatedpassesoverthedatastoredondisk.
Figure5.24 showsadatasetthatcontainstentransactionsandfiveitems.ThestructuresoftheFP-treeafterreadingthefirstthreetransactionsarealsodepictedinthediagram.Eachnodeinthetreecontainsthelabelofanitemalongwithacounterthatshowsthenumberoftransactionsmappedontothegivenpath.Initially,theFP-treecontainsonlytherootnoderepresentedbythenullsymbol.TheFP-treeissubsequentlyextendedinthefollowingway:
Figure5.24.ConstructionofanFP-tree.
1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts inside every transaction of the data set. For the data set shown in Figure 5.24, a is the most frequent item, followed by b, c, d, and e.

2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to encode the transaction. Every node along the path has a frequency count of 1.

3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along this path also has a frequency count equal to one. Although the first two transactions have an item in common, which is b, their paths are disjoint because the transactions do not share a common prefix.

4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one.

5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 5.24.

The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data often share a few items in common. In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. The worst-case scenario happens when every transaction has a unique set of items. As none of the transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data. However, the physical storage requirement for the FP-tree is higher because it requires additional space to store pointers between nodes and counters for each item.
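A compact way to see the construction steps above in code is the sketch below: items in each transaction are reordered by decreasing global support count and then inserted along a shared path from the root, incrementing counters on overlap. This is a stripped-down illustration under assumed class and variable names; it omits the node-link pointers that FP-growth later uses to locate items quickly.

    class FPNode:
        def __init__(self, item, parent=None):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}          # item -> FPNode

    def build_fp_tree(transactions, min_count):
        # Pass 1: support count of each item; infrequent items are discarded.
        freq = {}
        for t in transactions:
            for item in t:
                freq[item] = freq.get(item, 0) + 1
        freq = {i: c for i, c in freq.items() if c >= min_count}
        root = FPNode(None)
        # Pass 2: insert each transaction, sorted by decreasing support count.
        for t in transactions:
            items = sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i))
            node = root
            for item in items:
                if item not in node.children:
                    node.children[item] = FPNode(item, parent=node)
                node = node.children[item]
                node.count += 1          # overlapping prefixes share nodes and counts
        return root, freq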
ThesizeofanFP-treealsodependsonhowtheitemsareordered.Thenotionoforderingitemsindecreasingorderofsupportcountsreliesonthepossibilitythatthehighsupportitemsoccurmorefrequentlyacrossallpathsandhencemustbeusedasmostcommonlyoccurringprefixes.Forexample,iftheorderingschemeintheprecedingexampleisreversed,i.e.,fromlowesttohighestsupportitem,theresultingFP-treeisshowninFigure5.25 .Thetreeappearstobedenserbecausethebranchingfactorattherootnodehasincreasedfrom2to5andthenumberofnodescontainingthehighsupportitemssuchasaandbhasincreasedfrom3to12.Nevertheless,orderingbydecreasingsupportcountsdoesnotalwaysleadtothesmallesttree,especiallywhenthehighsupportitemsdonotoccurfrequentlytogetherwiththeotheritems.Forexample,supposeweaugmentthedatasetgiveninFigure5.24 with100transactionsthatcontain{e},80transactionsthatcontain{d},60transactionsthatcontain{c},and40transactionsthatcontain{b}.Itemeisnowmostfrequent,followedbyd,c,b,anda.Withtheaugmentedtransactions,orderingbydecreasingsupportcountswillresultinanFP-treesimilartoFigure5.25 ,whileaschemebasedonincreasingsupportcountsproducesasmallerFP-treesimilartoFigure5.24(iv) .
Figure5.25.
AnFP-treerepresentationforthedatasetshowninFigure5.24 withadifferentitemorderingscheme.
AnFP-treealsocontainsalistofpointersconnectingnodesthathavethesameitems.Thesepointers,representedasdashedlinesinFigures5.24and5.25 ,helptofacilitatetherapidaccessofindividualitemsinthetree.WeexplainhowtousetheFP-treeanditscorrespondingpointersforfrequentitemsetgenerationinthenextSection.
5.6.2FrequentItemsetGenerationinFP-GrowthAlgorithm
FP-growthisanalgorithmthatgeneratesfrequentitemsetsfromanFP-treebyexploringthetreeinabottom-upfashion.GiventheexampletreeshowninFigure5.24 ,thealgorithmlooksforfrequentitemsetsendinginefirst,followedbyd,c,b,andfinally,a.Thisbottom-upstrategyforfindingfrequentitemsetsendingwithaparticularitemisequivalenttothesuffix-basedapproachdescribedinSection5.5 .SinceeverytransactionismappedontoapathintheFP-tree,wecanderivethefrequentitemsetsendingwithaparticularitem,say,e,byexaminingonlythepathscontainingnodee.Thesepathscanbeaccessedrapidlyusingthepointersassociatedwithnodee.TheextractedpathsareshowninFigure5.26(a) .Similarpathsforitemsetsendingind,c,b,andaareshowninFigures5.26(b) ,(c) ,(d) ,and(e) ,respectively.
Figure5.26.Decomposingthefrequentitemsetgenerationproblemintomultiplesubproblems,whereeachsubprobleminvolvesfindingfrequentitemsetsendingine,d,c,b,anda.
FP-growthfindsallthefrequentitemsetsendingwithaparticularsuffixbyemployingadivide-and-conquerstrategytosplittheproblemintosmallersubproblems.Forexample,supposeweareinterestedinfindingallfrequentitemsetsendingine.Todothis,wemustfirstcheckwhethertheitemset{e}itselfisfrequent.Ifitisfrequent,weconsiderthesubproblemoffindingfrequentitemsetsendinginde,followedbyce,be,andae.Inturn,eachofthesesubproblemsarefurtherdecomposedintosmallersubproblems.Bymergingthesolutionsobtainedfromthesubproblems,allthefrequentitemsetsendinginecanbefound.Finally,thesetofallfrequentitemsetscanbegeneratedbymergingthesolutionstothesubproblemsoffindingfrequent
itemsetsendingine,d,c,b,anda.Thisdivide-and-conquerapproachisthekeystrategyemployedbytheFP-growthalgorithm.
Foramoreconcreteexampleonhowtosolvethesubproblems,considerthetaskoffindingfrequentitemsetsendingwithe.
1. Thefirststepistogatherallthepathscontainingnodee.TheseinitialpathsarecalledprefixpathsandareshowninFigure5.27(a) .
Figure5.27.ExampleofapplyingtheFP-growthalgorithmtofindfrequentitemsetsendingine.
2. FromtheprefixpathsshowninFigure5.27(a) ,thesupportcountforeisobtainedbyaddingthesupportcountsassociatedwithnodee.Assumingthattheminimumsupportcountis2,{e}isdeclaredafrequentitemsetbecauseitssupportcountis3.
3. Because{e}isfrequent,thealgorithmhastosolvethesubproblemsoffindingfrequentitemsetsendinginde,ce,be,andae.Beforesolvingthesesubproblems,itmustfirstconverttheprefixpathsintoaconditionalFP-tree,whichisstructurallysimilartoanFP-tree,exceptitisusedtofindfrequentitemsetsendingwithaparticularsuffix.AconditionalFP-treeisobtainedinthefollowingway:
a. First,thesupportcountsalongtheprefixpathsmustbeupdatedbecausesomeofthecountsincludetransactionsthatdonotcontainiteme.Forexample,therightmostpathshowninFigure5.27(a) , ,includesatransaction{b,c}thatdoesnotcontainiteme.Thecountsalongtheprefixpathmustthereforebeadjustedto1toreflecttheactualnumberoftransactionscontaining{b,c,e}.
b. Theprefixpathsaretruncatedbyremovingthenodesfore.Thesenodescanberemovedbecausethesupportcountsalongtheprefixpathshavebeenupdatedtoreflectonlytransactionsthatcontaineandthesubproblemsoffindingfrequentitemsetsendinginde,ce,be,andaenolongerneedinformationaboutnodee.
c. Afterupdatingthesupportcountsalongtheprefixpaths,someoftheitemsmaynolongerbefrequent.Forexample,thenodebappearsonlyonceandhasasupportcountequalto1,whichmeansthatthereisonlyonetransactionthatcontainsbothbande.Itembcanbesafelyignoredfromsubsequentanalysisbecauseallitemsetsendinginbemustbeinfrequent.
TheconditionalFP-treeforeisshowninFigure5.27(b) .Thetreelooksdifferentthantheoriginalprefixpathsbecausethefrequencycountshavebeenupdatedandthenodesbandehavebeeneliminated.
4. FP-growthusestheconditionalFP-treeforetosolvethesubproblemsoffindingfrequentitemsetsendinginde,ce,andae.Tofindthefrequentitemsetsendinginde,theprefixpathsfordaregatheredfromtheconditionalFP-treefore(Figure5.27(c) ).Byaddingthefrequencycountsassociatedwithnoded,weobtainthesupportcountfor{d,e}.Sincethesupportcountisequalto2,{d,e}isdeclaredafrequentitemset.Next,thealgorithmconstructstheconditionalFP-treefordeusingtheapproachdescribedinstep3.Afterupdatingthesupportcountsandremovingtheinfrequentitemc,theconditionalFP-treefordeisshowninFigure5.27(d) .SincetheconditionalFP-treecontainsonlyoneitem,a,whosesupportisequaltominsup,thealgorithmextractsthefrequentitemset{a,d,e}andmovesontothenextsubproblem,whichistogeneratefrequentitemsetsendingince.Afterprocessingtheprefixpathsforc,{c,e}isfoundtobefrequent.However,theconditionalFP-treeforcewillhavenofrequentitemsandthuswillbeeliminated.Thealgorithmproceedstosolvethenextsubproblemandfinds{a,e}tobetheonlyfrequentitemsetremaining.
Thisexampleillustratesthedivide-and-conquerapproachusedintheFP-growthalgorithm.Ateachrecursivestep,aconditionalFP-treeisconstructedbyupdatingthefrequencycountsalongtheprefixpathsandremovingallinfrequentitems.Becausethesubproblemsaredisjoint,FP-growthwillnotgenerateanyduplicateitemsets.Inaddition,thecountsassociatedwiththenodesallowthealgorithmtoperformsupportcountingwhilegeneratingthecommonsuffixitemsets.
FP-growthisaninterestingalgorithmbecauseitillustrateshowacompactrepresentationofthetransactiondatasethelpstoefficientlygeneratefrequentitemsets.Inaddition,forcertaintransactiondatasets,FP-growthoutperformsthestandardApriorialgorithmbyseveralordersofmagnitude.Therun-timeperformanceofFP-growthdependsonthecompactionfactorofthedataset.IftheresultingconditionalFP-treesareverybushy(intheworstcase,afullprefixtree),thentheperformanceofthealgorithmdegradessignificantlybecauseithastogeneratealargenumberofsubproblemsandmergetheresultsreturnedbyeachsubproblem.
5.7EvaluationofAssociationPatternsAlthoughtheAprioriprinciplesignificantlyreducestheexponentialsearchspaceofcandidateitemsets,associationanalysisalgorithmsstillhavethepotentialtogeneratealargenumberofpatterns.Forexample,althoughthedatasetshowninTable5.1 containsonlysixitems,itcanproducehundredsofassociationrulesatparticularsupportandconfidencethresholds.Asthesizeanddimensionalityofrealcommercialdatabasescanbeverylarge,wecaneasilyendupwiththousandsorevenmillionsofpatterns,manyofwhichmightnotbeinteresting.Identifyingthemostinterestingpatternsfromthemultitudeofallpossibleonesisnotatrivialtaskbecause“oneperson'strashmightbeanotherperson'streasure.”Itisthereforeimportanttoestablishasetofwell-acceptedcriteriaforevaluatingthequalityofassociationpatterns.
Thefirstsetofcriteriacanbeestablishedthroughadata-drivenapproachtodefineobjectiveinterestingnessmeasures.Thesemeasurescanbeusedtorankpatterns—itemsetsorrules—andthusprovideastraightforwardwayofdealingwiththeenormousnumberofpatternsthatarefoundinadataset.Someofthemeasurescanalsoprovidestatisticalinformation,e.g.,itemsetsthatinvolveasetofunrelateditemsorcoververyfewtransactionsareconsidereduninterestingbecausetheymaycapturespuriousrelationshipsinthedataandshouldbeeliminated.Examplesofobjectiveinterestingnessmeasuresincludesupport,confidence,andcorrelation.
The second set of criteria can be established through subjective arguments. A pattern is considered subjectively uninteresting unless it reveals unexpected information about the data or provides useful knowledge that can lead to profitable actions. For example, the rule {Butter} → {Bread} may not be interesting, despite having high support and confidence values, because the relationship represented by the rule might seem rather obvious. On the other hand, the rule {Diapers} → {Beer} is interesting because the relationship is quite unexpected and may suggest a new cross-selling opportunity for retailers. Incorporating subjective knowledge into pattern evaluation is a difficult task because it requires a considerable amount of prior information from domain experts. Readers interested in subjective interestingness measures may refer to resources listed in the bibliography at the end of this chapter.

5.7.1 Objective Measures of Interestingness

An objective measure is a data-driven approach for evaluating the quality of association patterns. It is domain-independent and requires only that the user specifies a threshold for filtering low-quality patterns. An objective measure is usually computed based on the frequency counts tabulated in a contingency table. Table 5.6 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation A¯ (B¯) to indicate that A (B) is absent from a transaction. Each entry f_ij in this 2 × 2 table denotes a frequency count. For example, f_11 is the number of times A and B appear together in the same transaction, while f_01 is the number of transactions that contain B but not A. The row sum f_1+ represents the support count for A, while the column sum f_+1 represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.

Table 5.6. A 2-way contingency table for variables A and B.

        B      B¯
A       f_11   f_10   f_1+
A¯      f_01   f_00   f_0+
        f_+1   f_+0   N

Limitations of the Support-Confidence Framework
The classical association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support, which is described more fully in Section 5.8, is that many potentially interesting patterns involving low support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.

Example 5.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a contingency table such as the one shown in Table 5.7.

Table 5.7. Beverage preferences among a group of 1000 people.

        Coffee   Coffee¯
Tea       150        50     200
Tea¯      650       150     800
          800       200    1000
The information given in this table can be used to evaluate the association rule {Tea} → {Coffee}. At first glance, it may appear that people who drink tea also tend to drink coffee because the rule's support (15%) and confidence (75%) values are reasonably high. This argument would have been acceptable except that the fraction of people who drink coffee, regardless of whether they drink tea, is 80%, while the fraction of tea drinkers who drink coffee is only 75%. Thus knowing that a person is a tea drinker actually decreases her probability of being a coffee drinker from 80% to 75%! The rule {Tea} → {Coffee} is therefore misleading despite its high confidence value.

Now consider a similar problem where we are interested in analyzing the relationship between people who drink tea and people who use honey in their beverage. Table 5.8 summarizes the information gathered over the same group of people about their preferences for drinking tea and using honey. If we evaluate the association rule {Tea} → {Honey} using this information, we will find that the confidence value of this rule is merely 50%, which might be easily rejected using a reasonable threshold on the confidence value, say 70%. One thus might consider that the preference of a person for drinking tea has no influence on her preference for using honey. However, the fraction of people who use honey, regardless of whether they drink tea, is only 12%. Hence, knowing that a person drinks tea significantly increases her probability of using honey from 12% to 50%. Further, the fraction of people who do not drink tea but use honey is only 2.5%! This suggests that there is definitely some information in the preference of a person for using honey given that she drinks tea. The rule {Tea} → {Honey} may therefore be falsely rejected if confidence is used as the evaluation measure.

Table 5.8. Information about people who drink tea and people who use honey in their beverage.

        Honey   Honey¯
Tea       100      100     200
Tea¯       20      780     800
          120      880    1000

Note that if we take the support of coffee drinkers into account, we would not be surprised to find that many of the people who drink tea also drink coffee, since the overall number of coffee drinkers is quite large by itself. What is more surprising is that the fraction of tea drinkers who drink coffee is actually less than the overall fraction of people who drink coffee, which points to an inverse relationship between tea drinkers and coffee drinkers. Similarly, if we account for the fact that the support of using honey is inherently small, it is easy to understand that the fraction of tea drinkers who use honey will naturally be small. Instead, what is important to measure is the change in the fraction of honey users, given the information that they drink tea.

The limitations of the confidence measure are well-known and can be understood from a statistical perspective as follows. The support of a variable measures the probability of its occurrence, while the support s(A, B) of a pair of variables A and B measures the probability of the two variables occurring together. Hence, the joint probability P(A, B) can be written as

P(A, B) = s(A, B) = f_11/N.

If we assume A and B are statistically independent, i.e., there is no relationship between the occurrences of A and B, then P(A, B) = P(A) × P(B). Hence, under the assumption of statistical independence between A and B, the support s_indep(A, B) of A and B can be written as

s_indep(A, B) = s(A) × s(B), or equivalently, s_indep(A, B) = (f_1+/N) × (f_+1/N). (5.4)

If the support between two variables, s(A, B), is equal to s_indep(A, B), then A and B can be considered to be unrelated to each other. However, if s(A, B) is widely different from s_indep(A, B), then A and B are most likely dependent. Hence, any deviation of s(A, B) from s(A) × s(B) can be seen as an indication of a statistical relationship between A and B. Since the confidence measure only considers the deviance of s(A, B) from s(A) and not from s(A) × s(B), it fails to account for the support of the consequent, namely s(B). This results in the detection of spurious patterns (e.g., {Tea} → {Coffee}) and the rejection of truly interesting patterns (e.g., {Tea} → {Honey}), as illustrated in the previous example.

Various objective measures have been used to capture the deviance of s(A, B) from s_indep(A, B) that are not susceptible to the limitations of the confidence measure. Below, we provide a brief description of some of these measures and discuss some of their properties.

Interest Factor
The interest factor, which is also called the "lift," can be defined as follows:

I(A, B) = s(A, B) / (s(A) × s(B)) = N f_11 / (f_1+ f_+1). (5.5)

Notice that s(A) × s(B) = s_indep(A, B). Hence, the interest factor measures the ratio of the support of a pattern s(A, B) against its baseline support s_indep(A, B) computed under the statistical independence assumption. Using Equations 5.5 and 5.4, we can interpret the measure as follows:

I(A, B) = 1, if A and B are independent;
I(A, B) > 1, if A and B are positively related;
I(A, B) < 1, if A and B are negatively related. (5.6)
For the tea-coffee example shown in Table 5.7, I = 0.15/(0.2 × 0.8) = 0.9375, thus suggesting a slight negative relationship between tea drinkers and coffee drinkers. Also, for the tea-honey example shown in Table 5.8, I = 0.1/(0.12 × 0.2) = 4.1667, suggesting a strong positive relationship between people who drink tea and people who use honey in their beverage. We can thus see that the interest factor is able to detect meaningful patterns in the tea-coffee and tea-honey examples. Indeed, the interest factor has a number of statistical advantages over the confidence measure that make it a suitable measure for analyzing statistical independence between variables.

Piatetsky-Shapiro (PS) Measure
Instead of computing the ratio between s(A, B) and s(A) × s(B), the PS measure considers the difference between s(A, B) and s(A) × s(B) as follows:

PS = s(A, B) − s(A) × s(B) = f_11/N − (f_1+ f_+1)/N². (5.7)

The PS value is 0 when A and B are mutually independent of each other. Otherwise, PS > 0 when there is a positive relationship between the two variables, and PS < 0 when there is a negative relationship.

Correlation Analysis
Correlation analysis is one of the most popular techniques for analyzing relationships between a pair of variables. For continuous variables, correlation is defined using Pearson's correlation coefficient (see Equation 2.10 on page 83). For binary variables, correlation can be measured using the ϕ-coefficient, which is defined as

ϕ = (f_11 f_00 − f_01 f_10) / √(f_1+ f_+1 f_0+ f_+0). (5.8)
If we rearrange the terms in Equation 5.8, we can show that the φ-coefficient can be rewritten in terms of the support measures of A, B, and {A,B} as follows:

φ = \frac{s(A,B) - s(A) \times s(B)}{\sqrt{s(A) \times (1 - s(A)) \times s(B) \times (1 - s(B))}}.    (5.9)

Note that the numerator in the above equation is identical to the PS measure. Hence, the φ-coefficient can be understood as a normalized version of the PS measure, where the value of the φ-coefficient ranges from -1 to +1. From a statistical viewpoint, the correlation captures the normalized difference between s(A,B) and s_indep(A,B). A correlation value of 0 means no relationship, while a value of +1 suggests a perfect positive relationship and a value of -1 suggests a perfect negative relationship. The correlation measure has a statistical meaning and hence is widely used to evaluate the strength of statistical independence among variables. For instance, the correlation between tea and coffee drinkers in Table 5.7 is -0.0625, which is slightly less than 0. On the other hand, the correlation between people who drink tea and people who use honey in Table 5.8 is 0.5847, suggesting a positive relationship.

IS Measure IS is an alternative measure for capturing the relationship between s(A,B) and s(A) × s(B). The IS measure is defined as follows:

IS(A,B) = \sqrt{I(A,B) \times s(A,B)} = \frac{s(A,B)}{\sqrt{s(A) s(B)}} = \frac{f_{11}}{\sqrt{f_{1+} f_{+1}}}.    (5.10)

Although the definition of IS looks quite similar to that of the interest factor, there are some interesting differences between them. Since IS is the geometric mean of the interest factor and the support of a pattern, IS is large only when both the interest factor and the support are large. Hence, if the interest factors of two patterns are identical, IS prefers the pattern with higher support. It is also possible to show that IS is mathematically equivalent to the cosine measure for binary variables (see Equation 2.6 on page 81). The value of IS thus varies from 0 to 1, where an IS value of 0 corresponds to no co-occurrence of the two variables, while an IS value of 1 denotes a perfect relationship, since they occur in exactly the same transactions. For the tea-coffee example shown in Table 5.7, the value of IS is equal to 0.375, while the value of IS for the tea-honey example in Table 5.8 is 0.6455. The IS measure thus gives a higher value for the {Tea} → {Honey} rule than for the {Tea} → {Coffee} rule, which is consistent with our understanding of the two rules.
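To make the preceding formulas concrete, the following Python sketch (not part of the original text) computes the interest factor, PS, φ-coefficient, and IS directly from the supports s(A), s(B), and s(A,B); the function name pairwise_measures and the variable names are illustrative only. The two calls reuse the support values quoted above for the tea-coffee and tea-honey examples.

# A minimal sketch of the support-based measures in Equations 5.5, 5.7, 5.9, and 5.10.
import math

def pairwise_measures(s_a, s_b, s_ab):
    """Return interest factor, PS, phi-coefficient, and IS for a pair (A, B)."""
    s_indep = s_a * s_b                       # baseline support under independence (Eq. 5.4)
    interest = s_ab / s_indep                 # Eq. 5.5
    ps = s_ab - s_indep                       # Eq. 5.7
    phi = ps / math.sqrt(s_a * (1 - s_a) * s_b * (1 - s_b))   # Eq. 5.9
    is_measure = s_ab / math.sqrt(s_a * s_b)  # Eq. 5.10 (cosine)
    return interest, ps, phi, is_measure

print(pairwise_measures(0.2, 0.8, 0.15))   # tea-coffee: I = 0.9375, phi = -0.0625, IS = 0.375
print(pairwise_measures(0.12, 0.2, 0.1))   # tea-honey:  I ~ 4.17,  phi ~ 0.585,  IS ~ 0.6455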
Alternative Objective Interestingness Measures Note that all of the measures defined in the previous section use different techniques to capture the deviance between s(A,B) and s_indep(A,B) = s(A) × s(B). Some measures use the ratio between s(A,B) and s_indep(A,B), e.g., the interest factor and IS, while some other measures consider the difference between the two, e.g., the PS and the φ-coefficient. Some measures are bounded in a particular range, e.g., the IS and the φ-coefficient, while others are unbounded and do not have a defined maximum or minimum value, e.g., the interest factor. Because of such differences, these measures behave differently when applied to different types of patterns. Indeed, the measures defined above are not exhaustive, and there exist many alternative measures for capturing different properties of relationships between pairs of binary variables. Table 5.9 provides the definitions for some of these measures in terms of the frequency counts of a 2×2 contingency table.

Table 5.9. Examples of objective measures for the itemset {A,B}.

Measure (Symbol)           Definition
Correlation (φ)            (N f_{11} - f_{1+} f_{+1}) / \sqrt{f_{1+} f_{+1} f_{0+} f_{+0}}
Odds ratio (α)             (f_{11} f_{00}) / (f_{10} f_{01})
Kappa (κ)                  (N f_{11} + N f_{00} - f_{1+} f_{+1} - f_{0+} f_{+0}) / (N^2 - f_{1+} f_{+1} - f_{0+} f_{+0})
Interest (I)               (N f_{11}) / (f_{1+} f_{+1})
Cosine (IS)                f_{11} / \sqrt{f_{1+} f_{+1}}
Piatetsky-Shapiro (PS)     f_{11}/N - (f_{1+} f_{+1})/N^2
Collective strength (S)    [(f_{11} + f_{00}) / (f_{1+} f_{+1} + f_{0+} f_{+0})] × [(N - f_{1+} f_{+1} - f_{0+} f_{+0}) / (N - f_{11} - f_{00})]
Jaccard (ζ)                f_{11} / (f_{1+} + f_{+1} - f_{11})
All-confidence (h)         min[ f_{11}/f_{1+}, f_{11}/f_{+1} ]

Consistency among Objective Measures Given the wide variety of measures available, it is reasonable to question whether the measures can produce similar ordering results when applied to a set of association patterns. If the measures are consistent, then we can choose any one of them as our evaluation metric. Otherwise, it is important to understand what their differences are in order to determine which measure is more suitable for analyzing certain types of patterns.

Suppose the measures defined in Table 5.9 are applied to rank the ten contingency tables shown in Table 5.10. These contingency tables are chosen to illustrate the differences among the existing measures. The ordering produced by these measures is shown in Table 5.11 (with 1 as the most interesting and 10 as the least interesting table). Although some of the measures appear to be consistent with each other, others produce quite different ordering results. For example, the rankings given by the φ-coefficient agree mostly with those provided by κ and collective strength, but are quite different than the rankings produced by the interest factor. Furthermore, a contingency table such as E10 is ranked lowest according to the φ-coefficient, but highest according to the interest factor.
Table 5.10. Examples of contingency tables.

Example   f_{11}   f_{10}   f_{01}   f_{00}
E1        8123     83       424      1370
E2        8330     2        622      1046
E3        3954     3080     5        2961
E4        2886     1363     1320     4431
E5        1500     2000     500      6000
E6        4000     2000     1000     3000
E7        9481     298      127      94
E8        4000     2000     2000     2000
E9        7450     2483     4        63
E10       61       2483     4        7452

Table 5.11. Rankings of contingency tables using the measures given in Table 5.9.

        φ    α    κ    I    IS   PS   S    ζ    h
E1      1    3    1    6    2    2    1    2    2
E2      2    1    2    7    3    5    2    3    3
E3      3    2    4    4    5    1    3    6    8
E4      4    8    3    3    7    3    4    7    5
E5      5    7    6    2    9    6    6    9    9
E6      6    9    5    5    6    4    5    5    7
E7      7    6    7    9    1    8    7    1    1
E8      8    10   8    8    8    7    8    8    7
E9      9    4    9    10   4    9    9    4    4
E10     10   5    10   1    10   10   10   10   10
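As a rough illustration of how such a comparison can be carried out, the following Python sketch (not from the text) evaluates a few of the count-based measures of Table 5.9 on three of the contingency tables of Table 5.10 and ranks the tables by each measure; the function and dictionary names are illustrative.

import math

def measures(f11, f10, f01, f00):
    """Selected measures from Table 5.9, computed from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01            # margins for A = 1 and B = 1
    f0p, fp0 = f01 + f00, f10 + f00            # margins for A = 0 and B = 0
    phi = (n * f11 - f1p * fp1) / math.sqrt(f1p * fp1 * f0p * fp0)
    interest = (n * f11) / (f1p * fp1)
    cosine = f11 / math.sqrt(f1p * fp1)
    ps = f11 / n - (f1p * fp1) / n**2
    jaccard = f11 / (f1p + fp1 - f11)
    return {"phi": phi, "I": interest, "IS": cosine, "PS": ps, "zeta": jaccard}

tables = {                                     # counts taken from Table 5.10
    "E1": (8123, 83, 424, 1370),
    "E7": (9481, 298, 127, 94),
    "E10": (61, 2483, 4, 7452),
}
for name in ("phi", "I", "IS"):
    ranked = sorted(tables, key=lambda t: measures(*tables[t])[name], reverse=True)
    print(name, ranked)   # e.g., E10 ranks last by phi but first by I, mirroring Table 5.11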
Properties of Objective Measures The results shown in Table 5.11 suggest that the measures greatly differ from each other and can provide conflicting information about the quality of a pattern. In fact, no measure is universally best for all applications. In the following, we describe some properties of the measures that play an important role in determining if they are suited for a certain application.

Inversion Property

Consider the binary vectors shown in Figure 5.28. The 0/1 value in each column vector indicates whether a transaction (row) contains a particular item (column). For example, the vector A indicates that the item appears in the first and last transactions, whereas the vector B indicates that the item is contained only in the fifth transaction. The vectors Ā and B̄ are the inverted versions of A and B, i.e., their 1 values have been changed to 0 values (presence to absence) and vice versa. Applying this transformation to a binary vector is called inversion. If a measure is invariant under the inversion operation, then its value for the vector pair {Ā, B̄} should be identical to its value for {A, B}. The inversion property of a measure can be tested as follows.

Figure 5.28. Effect of the inversion operation. The vectors Ā and B̄ are inversions of vectors A and B, respectively.

Definition 5.6. (Inversion Property.) An objective measure M is invariant under the inversion operation if its value remains the same when exchanging the frequency counts f_{11} with f_{00} and f_{10} with f_{01}.
Measures that are invariant under the inversion operation include the correlation (φ-coefficient), odds ratio, κ, and collective strength. These measures are especially useful in scenarios where the presence (1's) of a variable is as important as its absence (0's). For example, if we compare two sets of answers to a series of true/false questions where 0's (true) and 1's (false) are equally meaningful, we should use a measure that gives equal importance to occurrences of 0-0's and 1-1's. For the vectors shown in Figure 5.28, the φ-coefficient is equal to -0.1667 regardless of whether we consider the pair {A, B} or the pair {Ā, B̄}. Similarly, the odds ratio for both pairs of vectors is equal to a constant value of 0. Note that even though the φ-coefficient and the odds ratio are invariant to inversion, they can still show different results, as will be shown later.

Measures that do not remain invariant under the inversion operation include the interest factor and the IS measure. For example, the IS value for the pair {Ā, B̄} in Figure 5.28 is 0.825, which reflects the fact that the 1's in Ā and B̄ occur frequently together. However, the IS value of its inverted pair {A, B} is equal to 0, since A and B do not have any co-occurrence of 1's. For asymmetric binary variables, e.g., the occurrence of words in documents, this is indeed the desired behavior. A desirable similarity measure between asymmetric variables should not be invariant to inversion, since for these variables it is more meaningful to capture relationships based on the presence of a variable rather than its absence. On the other hand, if we are dealing with symmetric binary variables, where the relationships between 0's and 1's are equally meaningful, care should be taken to ensure that the chosen measure is invariant to inversion.

Although the values of the interest factor and IS change with the inversion operation, they can still be inconsistent with each other. To illustrate this, consider Table 5.12, which shows the contingency tables for two pairs of variables, {p,q} and {r,s}. Note that r and s are inverted transformations of p and q, respectively, where the roles of 0's and 1's have just been reversed. The interest factor for {p,q} is 1.02 and for {r,s} is 4.08, which means that the interest factor finds the inverted pair {r,s} more related than the original pair {p,q}. On the contrary, the IS value decreases upon inversion from 0.9346 for {p,q} to 0.286 for {r,s}, suggesting quite the opposite trend to that of the interest factor. Even though these measures conflict with each other for this example, each may be the right choice of measure in different applications.
Table 5.12. Contingency tables for the pairs {p,q} and {r,s}.

        p      p̄
q       880    50     930
q̄       50     20     70
        930    70     1000

        r      r̄
s       20     50     70
s̄       50     880    930
        70     930    1000
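The following Python sketch (illustrative, not from the text) makes the inversion test of Definition 5.6 explicit: it swaps f_{11} with f_{00} and f_{10} with f_{01} and compares each measure's value before and after, using the {p,q} counts from Table 5.12.

import math

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

def cosine_is(f11, f10, f01, f00):
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

pq = (880, 50, 50, 20)                 # counts for {p,q} from Table 5.12
rs = (pq[3], pq[2], pq[1], pq[0])      # inversion: swap f11 <-> f00 and f10 <-> f01, giving {r,s}

for m in (interest, cosine_is, odds_ratio):
    print(m.__name__, round(m(*pq), 3), round(m(*rs), 3))
# interest and cosine_is change under inversion; odds_ratio is the same for both pairs.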
Scaling Property

Table 5.13 shows two contingency tables for gender and the grades achieved by students enrolled in a particular course. These tables can be used to study the relationship between gender and performance in the course. The second contingency table has data from the same population but has twice as many males and three times as many females. The actual number of males or females can depend upon the samples available for study, but the relationship between gender and grade should not change just because of differences in sample sizes. Similarly, if the number of students with high and low grades is changed in a new study, a measure of association between gender and grades should remain unchanged. Hence, we need a measure that is invariant to scaling of rows or columns. The process of multiplying a row or column of a contingency table by a constant value is called a row or column scaling operation. A measure that is invariant to scaling does not change its value after any row or column scaling operation.
Table 5.13. The grade-gender example. (a) Sample data of size 100.

        Male   Female
High    30     20       50
Low     40     10       50
        70     30       100

(b) Sample data of size 230.

        Male   Female
High    60     60       120
Low     80     30       110
        140    90       230
Definition 5.7. (Scaling Invariance Property.) Let T be a contingency table with frequency counts [f_{11}; f_{10}; f_{01}; f_{00}]. Let T′ be the transformed contingency table with scaled frequency counts [k_1 k_3 f_{11}; k_2 k_3 f_{10}; k_1 k_4 f_{01}; k_2 k_4 f_{00}], where k_1, k_2, k_3, k_4 are positive constants used to scale the two rows and the two columns of T. An objective measure M is invariant under the row/column scaling operation if M(T) = M(T′).

Note that the use of the term 'scaling' here should not be confused with the scaling operation for continuous variables introduced in Chapter 2 on page 23, where all the values of a variable are multiplied by a constant factor, instead of scaling a row or column of a contingency table.

Scaling of rows and columns in contingency tables occurs in multiple ways in different applications. For example, if we are measuring the effect of a particular medical procedure on two sets of subjects, healthy and diseased, the ratio of healthy to diseased subjects can vary widely across different studies involving different groups of participants. Further, the fraction of healthy and diseased subjects chosen for a controlled study can be quite different from the true fraction observed in the complete population. These differences can result in a row or column scaling of the contingency tables for different populations of subjects. In general, the frequencies of items in a contingency table depend closely on the sample of transactions used to generate the table. Any change in the sampling procedure may result in a row or column scaling transformation. A measure that is expected to be invariant to differences in the sampling procedure must not change with row or column scaling.

Of all the measures introduced in Table 5.9, only the odds ratio (α) is invariant to row and column scaling operations. For example, the value of the odds ratio for both of the tables in Table 5.13 is equal to 0.375. All other measures, such as the φ-coefficient, κ, IS, interest factor, and collective strength (S), change their values when the rows and columns of the contingency table are rescaled. Indeed, the odds ratio is a preferred choice of measure in the medical domain, where it is important to find relationships that do not change with differences in the population sample chosen for a study.
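A quick check of Definition 5.7, as a Python sketch (not from the text): multiply the rows and columns of the grade-gender table by arbitrary positive constants and observe that the odds ratio is unchanged while the φ-coefficient is not. The scaling constants chosen here are illustrative.

import math

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def phi(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    f1p, fp1, f0p, fp0 = f11 + f10, f11 + f01, f01 + f00, f10 + f00
    return (n * f11 - f1p * fp1) / math.sqrt(f1p * fp1 * f0p * fp0)

def scale(table, k1, k2, k3, k4):
    """Row/column scaling of Definition 5.7: columns scaled by k1, k2; rows by k3, k4."""
    f11, f10, f01, f00 = table
    return (k1 * k3 * f11, k2 * k3 * f10, k1 * k4 * f01, k2 * k4 * f00)

grade_gender = (30, 20, 40, 10)           # Table 5.13(a): High/Low rows, Male/Female columns
scaled = scale(grade_gender, 2, 3, 1, 1)  # twice as many males, three times as many females

print(odds_ratio(*grade_gender), odds_ratio(*scaled))   # both 0.375
print(phi(*grade_gender), phi(*scaled))                  # the phi-coefficient changes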
Null Addition Property

Suppose we are interested in analyzing the relationship between a pair of words in a set of documents. If a collection of articles about ice fishing is added to the data set, should the association between the two words be affected? This process of adding unrelated data (in this case, documents) to a given data set is known as the null addition operation.

Definition 5.8. (Null Addition Property.) An objective measure M is invariant under the null addition operation if it is not affected by increasing f_{00}, while all other frequencies in the contingency table stay the same.

For applications such as document analysis or market basket analysis, we would like to use a measure that remains invariant under the null addition operation. Otherwise, the relationship between words can be made to change simply by adding enough documents that do not contain both words! Examples of measures that satisfy this property include the cosine (IS) and Jaccard (ζ) measures, while those that violate this property include the interest factor, PS, odds ratio, and the φ-coefficient.

To demonstrate the effect of null addition, consider the two contingency tables T1 and T2 shown in Table 5.14. Table T2 has been obtained from T1 by adding 1000 extra transactions with both A and B absent. This operation only affects the f_{00} entry of Table T2, which has increased from 100 to 1100, whereas all the other frequencies in the table (f_{11}, f_{10}, and f_{01}) remain the same. Since IS is invariant to null addition, it gives a constant value of 0.875 for both tables. However, the addition of 1000 extra transactions with occurrences of 0-0's changes the value of the interest factor from 0.972 for T1 (denoting a slightly negative correlation) to 1.944 for T2 (positive correlation). Similarly, the value of the odds ratio increases from 7 for T1 to 77 for T2. Hence, when the interest factor or odds ratio is used as the association measure, the relationship between the variables changes with the addition of null transactions in which both variables are absent. In contrast, the IS measure is invariant to null addition, since it considers two variables to be related only if they frequently occur together. Indeed, the IS measure (cosine measure) is widely used to measure similarity among documents, which is expected to depend only on the joint occurrences (1's) of words in documents, and not on their absences (0's).
Table 5.14. An example demonstrating the effect of null addition. (a) Table T1.

        B      B̄
A       700    100    800
Ā       100    100    200
        800    200    1000
(b) Table T2.

        B      B̄
A       700    100    800
Ā       100    1100   1200
        800    1200   2000
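As an illustrative Python sketch (not from the text), the null addition operation of Definition 5.8 can be simulated by simply increasing f_{00} and re-evaluating the measures; the counts below are those of Table 5.14.

import math

def cosine_is(f11, f10, f01, f00):
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

t1 = (700, 100, 100, 100)     # Table 5.14(a)
t2 = (700, 100, 100, 1100)    # Table 5.14(b): null addition of 1000 transactions to f00

for m in (cosine_is, odds_ratio, interest):
    print(m.__name__, round(m(*t1), 3), round(m(*t2), 3))
# cosine_is (IS) is unchanged at 0.875, the odds ratio grows from 7 to 77, and
# the interest factor also changes, as discussed above.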
Table 5.15 provides a summary of the properties of the measures defined in Table 5.9. Even though this list of properties is not exhaustive, it can serve as a useful guide for selecting the right measure for an application. Ideally, if we know the specific requirements of a certain application, we can ensure that the selected measure shows properties that adhere to those requirements. For example, if we are dealing with asymmetric variables, we would prefer to use a measure that is invariant to null addition but not to inversion. On the other hand, if we require the measure to remain invariant to changes in the sample size, we would like to use a measure that does not change with scaling.
Table 5.15. Properties of symmetric measures.

Symbol   Measure              Inversion   Null Addition   Scaling
φ        φ-coefficient        Yes         No              No
α        Odds ratio           Yes         No              Yes
κ        Cohen's              Yes         No              No
I        Interest             No          No              No
IS       Cosine               No          Yes             No
PS       Piatetsky-Shapiro's  Yes         No              No
S        Collective strength  Yes         No              No
ζ        Jaccard              No          Yes             No
h        All-confidence       No          Yes             No
s        Support              No          No              No
Asymmetric Interestingness Measures Note that in the discussion so far, we have only considered measures that do not change their value when the order of the variables is reversed. More specifically, if M is a measure and A and B are two variables, then M(A,B) is equal to M(B,A) if the order of the variables does not matter. Such measures are called symmetric. On the other hand, measures that depend on the order of the variables (M(A,B) ≠ M(B,A)) are called asymmetric measures. For example, the interest factor is a symmetric measure because its value is identical for the rules A → B and B → A. In contrast, confidence is an asymmetric measure since the confidence for A → B and for B → A may not be the same. Note that the use of the term 'asymmetric' to describe a particular type of measure of relationship (one in which the order of the variables is important) should not be confused with the use of 'asymmetric' to describe a binary variable for which only 1's are important. Asymmetric measures are more suitable for analyzing association rules, since the items in a rule do have a specific order. Even though we only considered symmetric measures to discuss the different properties of association measures, the above discussion is also relevant for asymmetric measures. See the Bibliographic Notes for more information about different kinds of asymmetric measures and their properties.
5.7.2 Measures beyond Pairs of Binary Variables

The measures shown in Table 5.9 are defined for pairs of binary variables (e.g., 2-itemsets or association rules). However, many of them, such as support and all-confidence, are also applicable to larger-sized itemsets. Other measures, such as the interest factor, IS, PS, and the Jaccard coefficient, can be extended to more than two variables using the frequency counts tabulated in a multidimensional contingency table. An example of a three-dimensional contingency table for a, b, and c is shown in Table 5.16. Each entry f_{ijk} in this table represents the number of transactions that contain a particular combination of items a, b, and c. For example, f_{101} is the number of transactions that contain a and c, but not b. On the other hand, a marginal frequency such as f_{1+1} is the number of transactions that contain a and c, irrespective of whether b is present in the transaction.
Table 5.16. Example of a three-dimensional contingency table.

c:
        b          b̄
a       f_{111}    f_{101}    f_{1+1}
ā       f_{011}    f_{001}    f_{0+1}
        f_{+11}    f_{+01}    f_{++1}

c̄:
        b          b̄
a       f_{110}    f_{100}    f_{1+0}
ā       f_{010}    f_{000}    f_{0+0}
        f_{+10}    f_{+00}    f_{++0}
Given a k-itemset {i_1, i_2, …, i_k}, the condition for statistical independence can be stated as follows:

f_{i_1 i_2 … i_k} = \frac{f_{i_1 + … +} \times f_{+ i_2 … +} \times … \times f_{+ + … i_k}}{N^{k-1}}.    (5.11)

With this definition, we can extend objective measures such as the interest factor and PS, which are based on deviations from statistical independence, to more than two variables:

I = \frac{N^{k-1} \times f_{i_1 i_2 … i_k}}{f_{i_1 + … +} \times f_{+ i_2 … +} \times … \times f_{+ + … i_k}}

PS = \frac{f_{i_1 i_2 … i_k}}{N} - \frac{f_{i_1 + … +} \times f_{+ i_2 … +} \times … \times f_{+ + … i_k}}{N^k}

Another approach is to define the objective measure as the maximum, minimum, or average value of the associations between pairs of items in a pattern. For example, given a k-itemset X = {i_1, i_2, …, i_k}, we may define the φ-coefficient for X as the average φ-coefficient(i_p, i_q) between every pair of items (i_p, i_q) in X. However, because the measure considers only pairwise associations, it may not capture all the underlying relationships within a pattern. Also, care should be taken in using such alternative measures for more than two variables, since they may not always show the anti-monotone property in the same way as the support measure, making them unsuitable for mining patterns using the Apriori principle.

Analysis of multidimensional contingency tables is more complicated because of the presence of partial associations in the data. For example, some associations may appear or disappear when conditioned upon the value of certain variables. This problem is known as Simpson's paradox and is described in Section 5.7.3. More sophisticated statistical techniques are available to analyze such relationships, e.g., loglinear models, but these techniques are beyond the scope of this book.
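The following Python sketch (illustrative only; the helper name and the numbers are not from the text) computes the count expected under the independence condition of Equation 5.11 and the corresponding extended interest factor, given a k-itemset's support count and the marginal counts of its items.

def extended_interest(f_itemset, marginals, n):
    """Interest factor for a k-itemset: observed count over the count expected
    under statistical independence (Equation 5.11)."""
    k = len(marginals)
    expected = 1.0
    for f in marginals:            # product of the k marginal counts
        expected *= f
    expected /= n ** (k - 1)       # divide by N^(k-1), as in Eq. 5.11
    return f_itemset / expected

# Hypothetical example: 3 items with marginal counts 600, 500, and 400 out of
# N = 1000 transactions, and 150 transactions containing all three items.
print(extended_interest(150, [600, 500, 400], 1000))   # > 1 means positively related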
5.7.3 Simpson's Paradox

It is important to exercise caution when interpreting the association between variables because the observed relationship may be influenced by the presence of other confounding factors, i.e., hidden variables that are not included in the analysis. In some cases, the hidden variables may cause the observed relationship between a pair of variables to disappear or reverse its direction, a phenomenon that is known as Simpson's paradox. We illustrate the nature of this paradox with the following example.

Consider the relationship between the sale of high-definition televisions (HDTVs) and exercise machines, as shown in Table 5.17. The rule {HDTV = Yes} → {Exercise machine = Yes} has a confidence of 99/180 = 55%, and the rule {HDTV = No} → {Exercise machine = Yes} has a confidence of 54/120 = 45%. Together, these rules suggest that customers who buy high-definition televisions are more likely to buy exercise machines than those who do not buy high-definition televisions.

Table 5.17. A two-way contingency table between the sale of high-definition televisions and exercise machines.

                    Buy Exercise Machine
Buy HDTV            Yes     No      Total
Yes                 99      81      180
No                  54      66      120
Total               153     147     300
However, a deeper analysis reveals that the sales of these items depend on whether the customer is a college student or a working adult. Table 5.18 summarizes the relationship between the sale of HDTVs and exercise machines among college students and working adults. Notice that the support counts given in the table for college students and working adults sum up to the frequencies shown in Table 5.17. Furthermore, there are more working adults than college students who buy these items. For college students:

c({HDTV = Yes} → {Exercise machine = Yes}) = 1/10 = 10%,
c({HDTV = No} → {Exercise machine = Yes}) = 4/34 = 11.8%,

while for working adults:

c({HDTV = Yes} → {Exercise machine = Yes}) = 98/170 = 57.7%,
c({HDTV = No} → {Exercise machine = Yes}) = 50/86 = 58.1%.

Table 5.18. Example of a three-way contingency table.

                                      Buy Exercise Machine
Customer Group      Buy HDTV          Yes     No      Total
College Students    Yes               1       9       10
                    No                4       30      34
Working Adults      Yes               98      72      170
                    No                50      36      86
The rules suggest that, for each group, customers who do not buy high-definition televisions are more likely to buy exercise machines, which contradicts the previous conclusion when data from the two customer groups are pooled together. Even if alternative measures such as correlation, odds ratio, or interest are applied, we still find that the sale of HDTVs and exercise machines is positively related in the combined data but negatively related in the stratified data (see Exercise 21 on page 449). The reversal in the direction of association is known as Simpson's paradox.
The paradox can be explained in the following way. First, notice that most customers who buy HDTVs are working adults. This is reflected in the high confidence of the rule {HDTV = Yes} → {Working Adult} (170/180 = 94.4%). Second, the high confidence of the rule {Exercise machine = Yes} → {Working Adult} (148/153 = 96.7%) suggests that most customers who buy exercise machines are also working adults. Since working adults form the largest fraction of customers for both HDTVs and exercise machines, the two items look related, and the rule {HDTV = Yes} → {Exercise machine = Yes} turns out to be stronger in the combined data than it would have been if the data were stratified. Hence, customer group acts as a hidden variable that affects both the fraction of customers who buy HDTVs and the fraction who buy exercise machines. If we factor out the effect of the hidden variable by stratifying the data, we see that the relationship between buying HDTVs and buying exercise machines is not direct, but shows up as an indirect consequence of the effect of the hidden variable.

Simpson's paradox can also be illustrated mathematically as follows. Suppose

a/b < c/d  and  p/q < r/s,

where a/b and p/q may represent the confidence of the rule A → B in two different strata, while c/d and r/s may represent the confidence of the rule Ā → B in the two strata. When the data is pooled together, the confidence values of the rules in the combined data are (a+p)/(b+q) and (c+r)/(d+s), respectively. Simpson's paradox occurs when

\frac{a+p}{b+q} > \frac{c+r}{d+s},

thus leading to the wrong conclusion about the relationship between the variables. The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox. For example, market basket data from a major supermarket chain should be stratified according to store locations, while medical records from various patients should be stratified according to confounding factors such as age and gender.
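A short Python sketch (not from the text) reproduces the reversal numerically from the counts in Table 5.18: within each stratum, the confidence of {HDTV = Yes} → {Exercise machine = Yes} is lower than that of {HDTV = No} → {Exercise machine = Yes}, yet the ordering flips once the strata are pooled.

# (exercise machine buyers, total customers) for HDTV = Yes and HDTV = No in each stratum
strata = {
    "college students": {"yes": (1, 10), "no": (4, 34)},
    "working adults":   {"yes": (98, 170), "no": (50, 86)},
}

def confidence(hits, total):
    return hits / total

pooled = {"yes": [0, 0], "no": [0, 0]}
for name, group in strata.items():
    for hdtv, (hits, total) in group.items():
        pooled[hdtv][0] += hits
        pooled[hdtv][1] += total
    print(name, confidence(*group["yes"]), "<", confidence(*group["no"]))

# Pooled data: 99/180 = 55% for HDTV buyers vs 54/120 = 45% for non-buyers, so the
# direction of the association reverses (Simpson's paradox).
print("pooled", confidence(*pooled["yes"]), ">", confidence(*pooled["no"]))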
5.8 Effect of Skewed Support Distribution

The performance of many association analysis algorithms is influenced by properties of their input data. For example, the computational complexity of the Apriori algorithm depends on properties such as the number of items in the data, the average transaction width, and the support threshold used. This section examines another important property that has significant influence on the performance of association analysis algorithms as well as the quality of extracted patterns. More specifically, we focus on data sets with skewed support distributions, where most of the items have relatively low to moderate frequencies, but a small number of them have very high frequencies.
Figure 5.29. A transaction data set containing three items, p, q, and r, where p is a high-support item and q and r are low-support items.

Figure 5.29 shows an illustrative example of a data set that has a skewed support distribution of its items. While p has a high support of 83.3% in the data, q and r are low-support items with a support of 16.7%. Despite their low support, q and r always occur together in the limited number of transactions in which they appear and hence are strongly related. A pattern mining algorithm therefore should report {q, r} as interesting.
However, note that choosing the right support threshold for mining itemsets such as {q, r} can be quite tricky. If we set the threshold too high (e.g., 20%), then we may miss many interesting patterns involving low-support items such as {q, r}. Conversely, setting the support threshold too low can be detrimental to the pattern mining process for the following reasons. First, the computational and memory requirements of existing association analysis algorithms increase considerably with low support thresholds. Second, the number of extracted patterns also increases substantially with low support thresholds, which makes their analysis and interpretation difficult. In particular, we may extract many spurious patterns that relate a high-frequency item such as p to a low-frequency item such as q. Such patterns, which are called cross-support patterns, are likely to be spurious because the association between p and q is largely influenced by the frequent occurrence of p rather than the joint occurrence of p and q together. Because the support of {p, q} is quite close to the support of {q, r}, we may easily select {p, q} if we set the support threshold low enough to include {q, r}.
An example of a real data set that exhibits a skewed support distribution is shown in Figure 5.30. The data, taken from the PUMS (Public Use Microdata Sample) census data, contains 49,046 records and 2113 asymmetric binary variables. We shall treat the asymmetric binary variables as items and the records as transactions. While more than 80% of the items have support less than 1%, a handful of them have support greater than 90%. To understand the effect of a skewed support distribution on frequent itemset mining, we divide the items into three groups, G1, G2, and G3, according to their support levels, as shown in Table 5.19. We can see that more than 82% of the items belong to G1 and have a support less than 1%. In market basket analysis, such low-support items may correspond to expensive products (such as jewelry) that are seldom bought by customers, but whose patterns are still interesting to retailers. Patterns involving such low-support items, though meaningful, can easily be rejected by a frequent pattern mining algorithm with a high support threshold. On the other hand, setting a low support threshold may result in the extraction of spurious patterns that relate a high-frequency item in G3 to a low-frequency item in G1. For example, at a support threshold equal to 0.05%, there are 18,847 frequent pairs involving items from G1 and G3. Out of these, 93% are cross-support patterns; i.e., the patterns contain items from both G1 and G3.

Figure 5.30. Support distribution of items in the census data set.
Table 5.19. Grouping the items in the census data set based on their support values.

Group             G1       G2         G3
Support           < 1%     1%-90%     > 90%
Number of Items   1735     358        20
This example shows that a large number of weakly related cross-support patterns can be generated when the support threshold is sufficiently low. Note that finding interesting patterns in data sets with skewed support distributions is not just a challenge for the support measure; similar statements can be made about many of the other objective measures discussed in the previous sections. Before presenting a methodology for finding interesting patterns and pruning spurious ones, we formally define the concept of a cross-support pattern.
Definition 5.9. (Cross-Support Pattern.) Let us define the support ratio, r(X), of an itemset X = {i_1, i_2, …, i_k} as

r(X) = \frac{\min[s(i_1), s(i_2), …, s(i_k)]}{\max[s(i_1), s(i_2), …, s(i_k)]}.    (5.12)

Given a user-specified threshold h_c, an itemset X is a cross-support pattern if r(X) < h_c.

Example 5.4. Suppose the support for milk is 70%, while the support for sugar is 10% and caviar is 0.04%. Given h_c = 0.01, the frequent itemset {milk, sugar, caviar} is a cross-support pattern because its support ratio is

r = \frac{\min[0.7, 0.1, 0.0004]}{\max[0.7, 0.1, 0.0004]} = \frac{0.0004}{0.7} = 0.00058 < 0.01.
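For concreteness, here is a tiny Python sketch (not from the text) of the support ratio in Equation 5.12, applied to the supports used in Example 5.4; the function names are illustrative.

def support_ratio(item_supports):
    """Support ratio r(X) of an itemset (Equation 5.12)."""
    return min(item_supports) / max(item_supports)

def is_cross_support(item_supports, hc):
    """An itemset is a cross-support pattern if r(X) < hc (Definition 5.9)."""
    return support_ratio(item_supports) < hc

print(support_ratio([0.7, 0.1, 0.0004]))           # ~0.00057
print(is_cross_support([0.7, 0.1, 0.0004], 0.01))  # True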
Existing measures such as support and confidence may not be sufficient to eliminate cross-support patterns. For example, if we assume h_c = 0.3 for the data set presented in Figure 5.29, the itemsets {p,q}, {p,r}, and {p,q,r} are cross-support patterns because their support ratios, which are equal to 0.2, are less than the threshold h_c. However, their supports are comparable to that of {q,r}, making it difficult to eliminate cross-support patterns without losing interesting ones using a support-based pruning strategy. Confidence pruning also does not help, because the confidence of the rules extracted from cross-support patterns can be very high. For example, the confidence of {q} → {p} is 80% even though {p,q} is a cross-support pattern. The fact that a cross-support pattern can produce a high-confidence rule should not come as a surprise because one of its items (p) appears very frequently in the data. Therefore, p is expected to appear in many of the transactions that contain q. Meanwhile, the rule {q} → {r} also has high confidence even though {q,r} is not a cross-support pattern. This example demonstrates the difficulty of using the confidence measure to distinguish between rules extracted from cross-support patterns and rules extracted from interesting patterns involving strongly connected but low-support items.

Even though the rule {q} → {p} has very high confidence, notice that the rule {p} → {q} has very low confidence, because most of the transactions that contain p do not contain q. In contrast, the rule {r} → {q}, which is derived from {q,r}, has very high confidence. This observation suggests that cross-support patterns can be detected by examining the lowest confidence rule that can be extracted from a given itemset. An approach for finding the rule with the lowest confidence given an itemset can be described as follows.
1. Recall the following anti-monotone property of confidence:

conf({i_1, i_2} → {i_3, i_4, …, i_k}) ≤ conf({i_1, i_2, i_3} → {i_4, i_5, …, i_k}).

This property suggests that confidence never increases as we shift more items from the left-hand side to the right-hand side of an association rule. Because of this property, the lowest confidence rule extracted from a frequent itemset contains only one item on its left-hand side. We denote the set of all rules with only one item on the left-hand side as R_1.
2. Given a frequent itemset {i_1, i_2, …, i_k}, the rule

{i_j} → {i_1, i_2, …, i_{j-1}, i_{j+1}, …, i_k}

has the lowest confidence in R_1 if s(i_j) = max[s(i_1), s(i_2), …, s(i_k)]. This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent: the confidence of a rule is lowest when the support of the antecedent is highest.

3. Summarizing the previous points, the lowest confidence attainable from a frequent itemset {i_1, i_2, …, i_k} is

\frac{s({i_1, i_2, …, i_k})}{\max[s(i_1), s(i_2), …, s(i_k)]}.

This expression is also known as the h-confidence or all-confidence measure. Because of the anti-monotone property of support, the numerator of the h-confidence measure is bounded by the minimum support of any item that appears in the frequent itemset. In other words, the h-confidence of an itemset X = {i_1, i_2, …, i_k} must not exceed the following expression:

h-confidence(X) ≤ \frac{\min[s(i_1), s(i_2), …, s(i_k)]}{\max[s(i_1), s(i_2), …, s(i_k)]}.

Note that the upper bound of h-confidence in the above equation is exactly the same as the support ratio r(X) given in Equation 5.12. Because the support ratio of a cross-support pattern is always less than h_c, the h-confidence of the pattern is also guaranteed to be less than h_c. Therefore, cross-support patterns can be eliminated by ensuring that the h-confidence values of the patterns exceed h_c. As a final note, the advantages of using h-confidence go beyond eliminating cross-support patterns. The measure is also anti-monotone, i.e.,

h-confidence({i_1, i_2, …, i_k}) ≥ h-confidence({i_1, i_2, …, i_{k+1}}),

and thus can be incorporated directly into the mining algorithm. Furthermore, h-confidence ensures that the items contained in an itemset are strongly associated with each other. For example, suppose the h-confidence of an itemset X is 80%. If one of the items in X is present in a transaction, there is at least an 80% chance that the rest of the items in X also belong to the same transaction. Such strongly associated patterns involving low-support items are called hyperclique patterns.

Definition 5.10. (Hyperclique Pattern.) An itemset X is a hyperclique pattern if h-confidence(X) > h_c, where h_c is a user-specified threshold.
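The following Python sketch (illustrative, not from the text) computes the h-confidence of an itemset from a small transaction list and uses it to flag hyperclique patterns; the toy transactions below are hypothetical.

def h_confidence(itemset, transactions):
    """h-confidence = s(itemset) / max item support (the lowest-confidence rule)."""
    n = len(transactions)
    s_itemset = sum(itemset <= t for t in transactions) / n
    max_item_support = max(sum(i in t for t in transactions) / n for i in itemset)
    return s_itemset / max_item_support

def is_hyperclique(itemset, transactions, hc):
    """Definition 5.10: X is a hyperclique pattern if h-confidence(X) > hc."""
    return h_confidence(itemset, transactions) > hc

# Hypothetical transactions: q and r always co-occur; p is a very frequent item.
transactions = [{"p"}, {"p"}, {"p"}, {"p"}, {"p", "q", "r"}, {"q", "r"}]
print(h_confidence({"q", "r"}, transactions))         # 1.0 -> strongly associated
print(is_hyperclique({"p", "q"}, transactions, 0.3))  # False: a cross-support pattern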
5.9 Bibliographic Notes

The association rule mining task was first introduced by Agrawal et al. [324, 325] to discover interesting relationships among items in market basket transactions. Since its inception, extensive research has been conducted to address the various issues in association rule mining, from its fundamental concepts to its implementation and applications. Figure 5.31 shows a taxonomy of the various research directions in this area, which is generally known as association analysis. As much of the research focuses on finding patterns that appear significantly often in the data, the area is also known as frequent pattern mining. A detailed review of some of the research topics in this area can be found in [362] and in [319].
Figure 5.31. An overview of the various research directions in association analysis.

Conceptual Issues

Research on the conceptual issues of association analysis has focused on developing a theoretical formulation of association analysis, extending the formulation to new types of patterns, and going beyond asymmetric binary attributes.
Following the pioneering work by Agrawal et al. [324, 325], there has been a vast amount of research on developing a theoretical formulation for the association analysis problem. In [357], Gunopulos et al. showed the connection between finding maximal frequent itemsets and the hypergraph transversal problem. An upper bound on the complexity of the association analysis task was also derived. Zaki et al. [454, 456] and Pasquier et al. [407] have applied formal concept analysis to study the frequent itemset generation problem. More importantly, such research has led to the development of a class of patterns known as closed frequent itemsets [456]. Friedman et al. [355] have studied the association analysis problem in the context of bump hunting in multidimensional space. Specifically, they consider frequent itemset generation as the task of finding high-density regions in multidimensional space. Formalizing association analysis in a statistical learning framework is another active research direction [414, 435, 444], as it can help address issues related to identifying statistically significant patterns and dealing with uncertain data [320, 333, 343].

Over the years, the association rule mining formulation has been expanded to encompass other rule-based patterns, such as profile association rules [321], cyclic association rules [403], fuzzy association rules [379], exception rules [431], negative association rules [336, 418], weighted association rules [338, 413], dependence rules [422], peculiar rules [462], inter-transaction association rules [353, 440], and partial classification rules [327, 397]. Additionally, the concept of a frequent itemset has been extended to other types of patterns, including closed itemsets [407, 456], maximal itemsets [330], hyperclique patterns [449], support envelopes [428], emerging patterns [347], contrast sets [329], high-utility itemsets [340, 390], approximate or error-tolerant itemsets [358, 389, 451], and discriminative patterns [352, 401, 430]. Association analysis techniques have also been successfully applied to sequential [326, 426], spatial [371], and graph-based [374, 380, 406, 450, 455] data.

Substantial research has been conducted to extend the original association rule formulation to nominal [425], ordinal [392], interval [395], and ratio [356, 359, 425, 443, 461] attributes. One of the key issues is how to define the support measure for these attributes. A methodology was proposed by Steinbach et al. [429] to extend the traditional notion of support to more general patterns and attribute types.
Implementation Issues

Research activities in this area revolve around (1) integrating the mining capability into existing database technology, (2) developing efficient and scalable mining algorithms, (3) handling user-specified or domain-specific constraints, and (4) post-processing the extracted patterns.

There are several advantages to integrating association analysis into existing database technology. First, it can make use of the indexing and query processing capabilities of the database system. Second, it can also exploit the DBMS support for scalability, check-pointing, and parallelization [415]. The SETM algorithm developed by Houtsma et al. [370] was one of the earliest algorithms to support association rule discovery via SQL queries. Since then, numerous methods have been developed to provide capabilities for mining association rules in database systems. For example, the DMQL [363] and M-SQL [373] query languages extend the basic SQL with new operators for mining association rules. The MineRule operator [394] is an expressive SQL operator that can handle both clustered attributes and item hierarchies. Tsur et al. [439] developed a generate-and-test approach called query flocks for mining association rules. A distributed OLAP-based infrastructure was developed by Chen et al. [341] for mining multilevel association rules.

Despite its popularity, the Apriori algorithm is computationally expensive because it requires making multiple passes over the transaction database. Its runtime and storage complexities were investigated by Dunkel and Soparkar [349]. The FP-growth algorithm was developed by Han et al. in [364]. Other algorithms for mining frequent itemsets include the DHP (dynamic hashing and pruning) algorithm proposed by Park et al. [405] and the Partition algorithm developed by Savasere et al. [417]. A sampling-based frequent itemset generation algorithm was proposed by Toivonen [436]. The algorithm requires only a single pass over the data, but it can produce more candidate itemsets than necessary. The Dynamic Itemset Counting (DIC) algorithm [337] makes only 1.5 passes over the data and generates fewer candidate itemsets than the sampling-based algorithm. Other notable algorithms include the tree-projection algorithm [317] and H-Mine [408]. Survey articles on frequent itemset generation algorithms can be found in [322, 367]. A repository of benchmark data sets and software implementations of association rule mining algorithms is available at the Frequent Itemset Mining Implementations (FIMI) repository (http://fimi.cs.helsinki.fi).

Parallel algorithms have been developed to scale up association rule mining for handling big data [318, 360, 399, 420, 457]. A survey of such algorithms can be found in [453]. Online and incremental association rule mining algorithms have also been proposed by Hidber [365] and Cheung et al. [342]. More recently, new algorithms have been developed to speed up frequent itemset mining by exploiting the processing power of GPUs [459] and the MapReduce/Hadoop distributed computing framework [382, 384, 396]. For example, an implementation of frequent itemset mining for the Hadoop framework is available in the Apache Mahout software.¹

¹ http://mahout.apache.org
Srikant et al. [427] have considered the problem of mining association rules in the presence of Boolean constraints such as the following:

(Cookies ∧ Milk) ∨ (descendants(Cookies) ∧ ¬ancestors(Wheat Bread))

Given such a constraint, the algorithm looks for rules that contain both cookies and milk, or rules that contain the descendant items of cookies but not ancestor items of wheat bread. Singh et al. [424] and Ng et al. [400] have also developed alternative techniques for constraint-based association rule mining. Constraints can also be imposed on the support for different itemsets. This problem was investigated by Wang et al. [442], Liu et al. in [387], and Seno et al. [419]. In addition, constraints arising from privacy concerns when mining sensitive data have led to the development of privacy-preserving frequent pattern mining techniques [334, 350, 441, 458].

One potential problem with association analysis is the large number of patterns that can be generated by current algorithms. To overcome this problem, methods to rank, summarize, and filter patterns have been developed. Toivonen et al. [437] proposed the idea of eliminating redundant rules using structural rule covers and grouping the remaining rules using clustering. Liu et al. [388] applied the statistical chi-square test to prune spurious patterns and summarized the remaining patterns using a subset of the patterns called direction setting rules. The use of objective measures to filter patterns has been investigated by many authors, including Brin et al. [336], Bayardo and Agrawal [331], Aggarwal and Yu [323], and DuMouchel and Pregibon [348]. The properties of many of these measures were analyzed by Piatetsky-Shapiro [410], Kamber and Shinghal [376], Hilderman and Hamilton [366], and Tan et al. [433]. The grade-gender example used to highlight the importance of the row and column scaling invariance property was heavily influenced by the discussion given in [398] by Mosteller. Meanwhile, the tea-coffee example illustrating the limitation of confidence was motivated by an example given in [336] by Brin et al. Because of the limitation of confidence, Brin et al. [336] had proposed the idea of using the interest factor as a measure of interestingness. The all-confidence measure was proposed by Omiecinski [402]. Xiong et al. [449] introduced the cross-support property and showed that the all-confidence measure can be used to eliminate cross-support patterns. A key difficulty in using alternative objective measures besides support is their lack of a monotonicity property, which makes it difficult to incorporate the measures directly into the mining algorithms. Xiong et al. [447] have proposed an efficient method for mining correlations by introducing an upper bound function for the φ-coefficient. Although the measure is non-monotone, it has an upper bound expression that can be exploited for the efficient mining of strongly correlated item pairs.

Fabris and Freitas [351] have proposed a method for discovering interesting associations by detecting the occurrences of Simpson's paradox [423]. Megiddo and Srikant [393] described an approach for validating the extracted patterns using hypothesis testing methods. A resampling-based technique was also developed to avoid generating spurious patterns because of the multiple comparison problem. Bolton et al. [335] have applied the Benjamini-Hochberg [332] and Bonferroni correction methods to adjust the p-values of discovered patterns in market basket data. Alternative methods for handling the multiple comparison problem were suggested by Webb [445], Zhang et al. [460], and Llinares-Lopez et al. [391].
Application of subjective measures to association analysis has been investigated by many authors. Silberschatz and Tuzhilin [421] presented two principles by which a rule can be considered interesting from a subjective point of view. The concept of unexpected condition rules was introduced by Liu et al. in [385]. Cooley et al. [344] analyzed the idea of combining soft belief sets using the Dempster-Shafer theory and applied this approach to identify contradictory and novel association patterns in web data. Alternative approaches include using Bayesian networks [375] and neighborhood-based information [346] to identify subjectively interesting patterns.

Visualization also helps the user to quickly grasp the underlying structure of the discovered patterns. Many commercial data mining tools display the complete set of rules (which satisfy both support and confidence threshold criteria) as a two-dimensional plot, with each axis corresponding to the antecedent or consequent itemsets of the rule. Hofmann et al. [368] proposed using Mosaic plots and Double Decker plots to visualize association rules. This approach can visualize not only a particular rule, but also the overall contingency table between itemsets in the antecedent and consequent parts of the rule. Nevertheless, this technique assumes that the rule consequent consists of only a single attribute.
Application Issues

Association analysis has been applied to a variety of application domains such as web mining [409, 432], document analysis [369], telecommunication alarm diagnosis [377], network intrusion detection [328, 345, 381], and bioinformatics [416, 446]. Applications of association and correlation pattern analysis to Earth Science studies have been investigated in [411, 412, 434]. Trajectory pattern mining [339, 372, 438] is another application of spatio-temporal association analysis to identify frequently traversed paths of moving objects.

Association patterns have also been applied to other learning problems such as classification [383, 386], regression [404], and clustering [361, 448, 452]. A comparison between classification and association rule mining was made by Freitas in his position paper [354]. The use of association patterns for clustering has been studied by many authors, including Han et al. [361], Kosters et al. [378], Yang et al. [452], and Xiong et al. [448].
Bibliography[317]R.C.Agarwal,C.C.Aggarwal,andV.V.V.Prasad.ATreeProjection
AlgorithmforGenerationofFrequentItemsets.JournalofParallelandDistributedComputing(SpecialIssueonHighPerformanceDataMining),61(3):350–371,2001.
[318]R.C.AgarwalandJ.C.Shafer.ParallelMiningofAssociationRules.IEEETransactionsonKnowledgeandDataEngineering,8(6):962–969,March1998.
[319]C.AggarwalandJ.Han.FrequentPatternMining.Springer,2014.
[320]C.C.Aggarwal,Y.Li,J.Wang,andJ.Wang.Frequentpatternminingwithuncertaindata.InProceedingsofthe15thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages29–38,Paris,France,2009.
[321]C.C.Aggarwal,Z.Sun,andP.S.Yu.OnlineGenerationofProfileAssociationRules.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages129—133,NewYork,NY,August1996.
[322]C.C.AggarwalandP.S.Yu.MiningLargeItemsetsforAssociationRules.DataEngineeringBulletin,21(1):23–31,March1998.
[323]C.C.AggarwalandP.S.Yu.MiningAssociationswiththeCollectiveStrengthApproach.IEEETrans.onKnowledgeandDataEngineering,13(6):863–873,January/February2001.
[324]R.Agrawal,T.Imielinski,andA.Swami.Databasemining:Aperformanceperspective.IEEETransactionsonKnowledgeandDataEngineering,5:914–925,1993.
[325]R.Agrawal,T.Imielinski,andA.Swami.Miningassociationrulesbetweensetsofitemsinlargedatabases.InProc.ACMSIGMODIntl.Conf.ManagementofData,pages207–216,Washington,DC,1993.
[326]R.AgrawalandR.Srikant.MiningSequentialPatterns.InProc.ofIntl.Conf.onDataEngineering,pages3–14,Taipei,Taiwan,1995.
[327]K.Ali,S.Manganaris,andR.Srikant.PartialClassificationusingAssociationRules.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages115—118,NewportBeach,CA,August1997.
[328]D.Barbará,J.Couto,S.Jajodia,andN.Wu.ADAM:ATestbedforExploringtheUseofDataMininginIntrusionDetection.SIGMODRecord,30(4):15–24,2001.
[329]S.D.BayandM.Pazzani.DetectingGroupDifferences:MiningContrastSets.DataMiningandKnowledgeDiscovery,5(3):213–246,2001.
[330]R.Bayardo.EfficientlyMiningLongPatternsfromDatabases.InProc.of1998ACM-SIGMODIntl.Conf.onManagementofData,pages85–93,Seattle,WA,June1998.
[331]R.BayardoandR.Agrawal.MiningtheMostInterestingRules.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages145–153,SanDiego,CA,August1999.
[332]Y.BenjaminiandY.Hochberg.ControllingtheFalseDiscoveryRate:APracticalandPowerfulApproachtoMultipleTesting.JournalRoyalStatisticalSocietyB,57(1):289–300,1995.
[333]T.Bernecker,H.Kriegel,M.Renz,F.Verhein,andA.Züle.Probabilisticfrequentitemsetmininginuncertaindatabases.InProceedingsofthe15thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages119–128,Paris,France,2009.
[334]R.Bhaskar,S.Laxman,A.D.Smith,andA.Thakurta.Discoveringfrequentpatternsinsensitivedata.InProceedingsofthe16thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages503–512,Washington,DC,2010.
[335]R.J.Bolton,D.J.Hand,andN.M.Adams.DeterminingHitRateinPatternSearch.InProc.oftheESFExploratoryWorkshoponPatternDetectionandDiscoveryinDataMining,pages36–48,London,UK,September2002.
[336]S.Brin,R.Motwani,andC.Silverstein.Beyondmarketbaskets:Generalizingassociationrulestocorrelations.InProc.ACMSIGMODIntl.Conf.ManagementofData,pages265–276,Tucson,AZ,1997.
[337]S.Brin,R.Motwani,J.Ullman,andS.Tsur.DynamicItemsetCountingandImplicationRulesformarketbasketdata.InProc.of1997ACM-SIGMODIntl.Conf.onManagementofData,pages255–264,Tucson,AZ,June1997.
[338]C.H.Cai,A.Fu,C.H.Cheng,andW.W.Kwong.MiningAssociationRuleswithWeightedItems.InProc.ofIEEEIntl.DatabaseEngineeringandApplicationsSymp.,pages68–77,Cardiff,Wales,1998.
[339]H.Cao,N.Mamoulis,andD.W.Cheung.MiningFrequentSpatio-TemporalSequentialPatterns.InProceedingsofthe5thIEEEInternationalConferenceonDataMining,pages82–89,Houston,TX,2005.
[340]R.Chan,Q.Yang,andY.Shen.MiningHighUtilityItemsets.InProceedingsofthe3rdIEEEInternationalConferenceonDataMining,pages19–26,Melbourne,FL,2003.
[341]Q.Chen,U.Dayal,andM.Hsu.ADistributedOLAPinfrastructureforE-Commerce.InProc.ofthe4thIFCISIntl.Conf.onCooperativeInformationSystems,pages209—220,Edinburgh,Scotland,1999.
[342]D.C.Cheung,S.D.Lee,andB.Kao.AGeneralIncrementalTechniqueforMaintainingDiscoveredAssociationRules.InProc.ofthe5thIntl.Conf.
onDatabaseSystemsforAdvancedApplications,pages185–194,Melbourne,Australia,1997.
[343]C.K.Chui,B.Kao,andE.Hung.MiningFrequentItemsetsfromUncertainData.InProceedingsofthe11thPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining,pages47–58,Nanjing,China,2007.
[344]R.Cooley,P.N.Tan,andJ.Srivastava.DiscoveryofInterestingUsagePatternsfromWebData.InM.SpiliopoulouandB.Masand,editors,AdvancesinWebUsageAnalysisandUserProfiling,volume1836,pages163–182.LectureNotesinComputerScience,2000.
[345]P.Dokas,L.Ertöz,V.Kumar,A.Lazarevic,J.Srivastava,andP.N.Tan.DataMiningforNetworkIntrusionDetection.InProc.NSFWorkshoponNextGenerationDataMining,Baltimore,MD,2002.
[346]G.DongandJ.Li.Interestingnessofdiscoveredassociationrulesintermsofneighborhood-basedunexpectedness.InProc.ofthe2ndPacific-AsiaConf.onKnowledgeDiscoveryandDataMining,pages72–86,Melbourne,Australia,April1998.
[347]G.DongandJ.Li.EfficientMiningofEmergingPatterns:DiscoveringTrendsandDifferences.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages43–52,SanDiego,CA,August1999.
[348]W.DuMouchelandD.Pregibon.EmpiricalBayesScreeningforMulti-ItemAssociations.InProc.ofthe7thIntl.Conf.onKnowledgeDiscovery
andDataMining,pages67–76,SanFrancisco,CA,August2001.
[349]B.DunkelandN.Soparkar.DataOrganizationandAccessforEfficientDataMining.InProc.ofthe15thIntl.Conf.onDataEngineering,pages522–529,Sydney,Australia,March1999.
[350]A.V.Evfimievski,R.Srikant,R.Agrawal,andJ.Gehrke.Privacypreservingminingofassociationrules.InProceedingsoftheEighthACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages217–228,Edmonton,Canada,2002.
[351]C.C.FabrisandA.A.Freitas.DiscoveringsurprisingpatternsbydetectingoccurrencesofSimpson'sparadox.InProc.ofthe19thSGESIntl.Conf.onKnowledge-BasedSystemsandAppliedArtificialIntelligence),pages148–160,Cambridge,UK,December1999.
[352]G.Fang,G.Pandey,W.Wang,M.Gupta,M.Steinbach,andV.Kumar.MiningLow-SupportDiscriminativePatternsfromDenseandHigh-DimensionalData.IEEETrans.Knowl.DataEng.,24(2):279–294,2012.
[353]L.Feng,H.J.Lu,J.X.Yu,andJ.Han.Mininginter-transactionassociationswithtemplates.InProc.ofthe8thIntl.Conf.onInformationandKnowledgeManagement,pages225–233,KansasCity,Missouri,Nov1999.
[354]A.A.Freitas.Understandingthecrucialdifferencesbetweenclassificationanddiscoveryofassociationrules—apositionpaper.SIGKDDExplorations,2(1):65–69,2000.
[355]J.H.FriedmanandN.I.Fisher.Bumphuntinginhigh-dimensionaldata.StatisticsandComputing,9(2):123–143,April1999.
[356]T.Fukuda,Y.Morimoto,S.Morishita,andT.Tokuyama.MiningOptimizedAssociationRulesforNumericAttributes.InProc.ofthe15thSymp.onPrinciplesofDatabaseSystems,pages182–191,Montreal,Canada,June1996.
[357]D.Gunopulos,R.Khardon,H.Mannila,andH.Toivonen.DataMining,HypergraphTransversals,andMachineLearning.InProc.ofthe16thSymp.onPrinciplesofDatabaseSystems,pages209–216,Tucson,AZ,May1997.
[358]R.Gupta,G.Fang,B.Field,M.Steinbach,andV.Kumar.Quantitativeevaluationofapproximatefrequentpatternminingalgorithms.InProceedingsofthe14thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages301–309,LasVegas,NV,2008.
[359]E.Han,G.Karypis,andV.Kumar.Min-apriori:Analgorithmforfindingassociationrulesindatawithcontinuousattributes.DepartmentofComputerScienceandEngineering,UniversityofMinnesota,Tech.Rep,1997.
[360]E.-H.Han,G.Karypis,andV.Kumar.ScalableParallelDataMiningforAssociationRules.InProc.of1997ACM-SIGMODIntl.Conf.onManagementofData,pages277–288,Tucson,AZ,May1997.
[361]E.-H.Han,G.Karypis,V.Kumar,andB.Mobasher.ClusteringBasedonAssociationRuleHypergraphs.InProc.ofthe1997ACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,Tucson,AZ,1997.
[362]J.Han,H.Cheng,D.Xin,andX.Yan.Frequentpatternmining:currentstatusandfuturedirections.DataMiningandKnowledgeDiscovery,15(1):55–86,2007.
[363]J.Han,Y.Fu,K.Koperski,W.Wang,andO.R.Zaïane.DMQL:Adataminingquerylanguageforrelationaldatabases.InProc.ofthe1996ACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,Montreal,Canada,June1996.
[364]J.Han,J.Pei,andY.Yin.MiningFrequentPatternswithoutCandidateGeneration.InProc.ACM-SIGMODInt.Conf.onManagementofData(SIGMOD'00),pages1–12,Dallas,TX,May2000.
[365]C.Hidber.OnlineAssociationRuleMining.InProc.of1999ACM-SIGMODIntl.Conf.onManagementofData,pages145–156,Philadelphia,PA,1999.
[366]R.J.HildermanandH.J.Hamilton.KnowledgeDiscoveryandMeasuresofInterest.KluwerAcademicPublishers,2001.
[367]J.Hipp,U.Guntzer,andG.Nakhaeizadeh.AlgorithmsforAssociationRuleMining—AGeneralSurvey.SigKDDExplorations,2(1):58–64,June
2000.
[368]H.Hofmann,A.P.J.M.Siebes,andA.F.X.Wilhelm.VisualizingAssociationRuleswithInteractiveMosaicPlots.InProc.ofthe6thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages227–235,Boston,MA,August2000.
[369]J.D.HoltandS.M.Chung.EfficientMiningofAssociationRulesinTextDatabases.InProc.ofthe8thIntl.Conf.onInformationandKnowledgeManagement,pages234–242,KansasCity,Missouri,1999.
[370]M.HoutsmaandA.Swami.Set-orientedMiningforAssociationRulesinRelationalDatabases.InProc.ofthe11thIntl.Conf.onDataEngineering,pages25–33,Taipei,Taiwan,1995.
[371]Y.Huang,S.Shekhar,andH.Xiong.DiscoveringCo-locationPatternsfromSpatialDatasets:AGeneralApproach.IEEETrans.onKnowledgeandDataEngineering,16(12):1472–1485,December2004.
[372]S.Hwang,Y.Liu,J.Chiu,andE.Lim.MiningMobileGroupPatterns:ATrajectory-BasedApproach.InProceedingsofthe9thPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining,pages713–718,Hanoi,Vietnam,2005.
[373]T.Imielinski,A.Virmani,andA.Abdulghani.DataMine:ApplicationProgrammingInterfaceandQueryLanguageforDatabaseMining.InProc.ofthe2ndIntl.Conf.onKnowledgeDiscoveryandDataMining,pages256–262,Portland,Oregon,1996.
[374]A.Inokuchi,T.Washio,andH.Motoda.AnApriori-basedAlgorithmforMiningFrequentSubstructuresfromGraphData.InProc.ofthe4thEuropeanConf.ofPrinciplesandPracticeofKnowledgeDiscoveryinDatabases,pages13–23,Lyon,France,2000.
[375]S.JaroszewiczandD.Simovici.InterestingnessofFrequentItemsetsUsingBayesianNetworksasBackgroundKnowledge.InProc.ofthe10thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages178–186,Seattle,WA,August2004.
[376]M.KamberandR.Shinghal.EvaluatingtheInterestingnessofCharacteristicRules.InProc.ofthe2ndIntl.Conf.onKnowledgeDiscoveryandDataMining,pages263–266,Portland,Oregon,1996.
[377]M.Klemettinen.AKnowledgeDiscoveryMethodologyforTelecommunicationNetworkAlarmDatabases.PhDthesis,UniversityofHelsinki,1999.
[378]W.A.Kosters,E.Marchiori,andA.Oerlemans.MiningClusterswithAssociationRules.InThe3rdSymp.onIntelligentDataAnalysis(IDA99),pages39–50,Amsterdam,August1999.
[379]C.M.Kuok,A.Fu,andM.H.Wong.MiningFuzzyAssociationRulesinDatabases.ACMSIGMODRecord,27(1):41–46,March1998.
[380]M.KuramochiandG.Karypis.FrequentSubgraphDiscovery.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages313–320,SanJose,CA,
November2001.
[381]W.Lee,S.J.Stolfo,andK.W.Mok.AdaptiveIntrusionDetection:ADataMiningApproach.ArtificialIntelligenceReview,14(6):533–567,2000.
[382]N.Li,L.Zeng,Q.He,andZ.Shi.ParallelImplementationofAprioriAlgorithmBasedonMapReduce.InProceedingsofthe13thACISInternationalConferenceonSoftwareEngineering,ArtificialIntelligence,NetworkingandParallel/DistributedComputing,pages236–241,Kyoto,Japan,2012.
[383]W.Li,J.Han,andJ.Pei.CMAR:AccurateandEfficientClassificationBasedonMultipleClass-associationRules.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages369–376,SanJose,CA,2001.
[384]M.Lin,P.Lee,andS.Hsueh.Apriori-basedfrequentitemsetminingalgorithmsonMapReduce.InProceedingsofthe6thInternationalConferenceonUbiquitousInformationManagementandCommunication,pages26–30,KualaLumpur,Malaysia,2012.
[385]B.Liu,W.Hsu,andS.Chen.UsingGeneralImpressionstoAnalyzeDiscoveredClassificationRules.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages31–36,NewportBeach,CA,August1997.
[386]B.Liu,W.Hsu,andY.Ma.IntegratingClassificationandAssociationRuleMining.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages80–86,NewYork,NY,August1998.
[387]B.Liu,W.Hsu,andY.Ma.Miningassociationruleswithmultipleminimumsupports.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages125—134,SanDiego,CA,August1999.
[388]B.Liu,W.Hsu,andY.Ma.PruningandSummarizingtheDiscoveredAssociations.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages125–134,SanDiego,CA,August1999.
[389]J.Liu,S.Paulsen,W.Wang,A.B.Nobel,andJ.Prins.MiningApproximateFrequentItemsetsfromNoisyData.InProceedingsofthe5thIEEEInternationalConferenceonDataMining,pages721–724,Houston,TX,2005.
[390]Y.Liu,W.-K.Liao,andA.Choudhary.Atwo-phasealgorithmforfastdiscoveryofhighutilityitemsets.InProceedingsofthe9thPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining,pages689–695,Hanoi,Vietnam,2005.
[391]F.Llinares-López,M.Sugiyama,L.Papaxanthos,andK.M.Borgwardt.FastandMemory-EfficientSignificantPatternMiningviaPermutationTesting.InProceedingsofthe21thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages725–734,Sydney,Australia,2015.
5.10 Exercises
1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also, describe whether such rules are subjectively interesting.
a. A rule that has high support and high confidence.
b. A rule that has reasonably high support but low confidence.
c. A rule that has low support and low confidence.
d. A rule that has low support and high confidence.
2. Consider the data set shown in Table 5.20.
Table 5.20. Example of market basket transactions.
Customer ID  Transaction ID  Items Bought
1 0001 {a,d,e}
1 0024 {a,b,c,e}
2 0012 {a,b,d,e}
2 0031 {a,c,d,e}
3 0015 {b,c,e}
3 0022 {b,d,e}
4 0029 {c,d}
4 0040 {a,b,c}
5 0033 {a,d,e}
5 0038 {a,b,e}
a. Compute the support for itemsets {e}, {b,d}, and {b,d,e} by treating each transaction ID as a market basket.
b. Use the results in part (a) to compute the confidence for the association rules {b,d} → {e} and {e} → {b,d}. Is confidence a symmetric measure?
c. Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise).
d. Use the results in part (c) to compute the confidence for the association rules {b,d} → {e} and {e} → {b,d}.
e. Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2.
3.
a. What is the confidence for the rules ∅ → A and A → ∅?
b. Let c1, c2, and c3 be the confidence values of the rules {p} → {q}, {p} → {q,r}, and {p,r} → {q}, respectively. If we assume that c1, c2, and c3 have different values, what are the possible relationships that may exist among c1, c2, and c3? Which rule has the lowest confidence?
c. Repeat the analysis in part (b) assuming that the rules have identical support. Which rule has the highest confidence?
d. Transitivity: Suppose the confidence of the rules A → B and B → C are larger than some threshold, minconf. Is it possible that A → C has a confidence less than minconf?
4. For each of the following measures, determine whether it is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone).
Example: Support, s = σ(X)/|T|, is anti-monotone because s(X) ≥ s(Y) whenever X ⊂ Y.
a. A characteristic rule is a rule of the form {p} → {q1, q2, …, qn}, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let ζ be the minimum confidence of all characteristic rules generated from a given itemset:
ζ({p1, p2, …, pk}) = min[ c({p1} → {p2, p3, …, pk}), …, c({pk} → {p1, p2, …, pk−1}) ]
Is ζ monotone, anti-monotone, or non-monotone?
b. A discriminant rule is a rule of the form {p1, p2, …, pn} → {q}, where the rule consequent contains only a single item. An itemset of size k can produce up to k discriminant rules. Let η be the minimum confidence of all discriminant rules generated from a given itemset:
η({p1, p2, …, pk}) = min[ c({p2, p3, …, pk} → {p1}), …, c({p1, p2, …, pk−1} → {pk}) ]
Is η monotone, anti-monotone, or non-monotone?
c. Repeat the analysis in parts (a) and (b) by replacing the min function with a max function.
5. Prove Equation 5.3. (Hint: First, count the number of ways to create an itemset that forms the left-hand side of the rule. Next, for each size-k itemset selected for the left-hand side, count the number of ways to choose the remaining d − k items to form the right-hand side of the rule.) Assume that neither of the itemsets of a rule are empty.
6. Consider the market basket transactions shown in Table 5.21.
a. What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?
b. What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?
Table 5.21. Market basket transactions.
Transaction ID  Items Bought
1 {Milk,Beer,Diapers}
2 {Bread,Butter,Milk}
3 {Milk,Diapers,Cookies}
4 {Bread,Butter,Cookies}
5 {Beer,Cookies,Diapers}
6 {Milk,Diapers,Bread,Butter}
7 {Bread,Butter,Diapers}
8 {Beer,Diapers}
9 {Milk,Diapers,Bread,Butter}
10 {Beer,Cookies}
c. Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.
d. Find an itemset (of size 2 or larger) that has the largest support.
e. Find a pair of items, a and b, such that the rules {a} → {b} and {b} → {a} have the same confidence.
7. Show that if a candidate k-itemset X has a subset of size less than k − 1 that is infrequent, then at least one of the (k − 1)-size subsets of X is necessarily infrequent.
8. Consider the following set of frequent 3-itemsets:
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {2,3,4}, {2,3,5}, {3,4,5}.
Assume that there are only five items in the data set.
a. List all candidate 4-itemsets obtained by a candidate generation procedure using the Fk−1 × F1 merging strategy.
b. List all candidate 4-itemsets obtained by the candidate generation procedure in Apriori.
c. List all candidate 4-itemsets that survive the candidate pruning step of the Apriori algorithm.
9. The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size k + 1 are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step. Suppose the Apriori algorithm is applied to the data set shown in Table 5.22 with minsup = 30%, i.e., any itemset occurring in less than 3 transactions is considered to be infrequent.
Table 5.22. Example of market basket transactions.
Transaction ID  Items Bought
1 {a,b,d,e}
2 {b,c,d}
3 {a,b,d,e}
4 {a,c,d,e}
5 {b,c,d,e}
6 {b,d,e}
7 {c,d}
8 {a,b,c}
9 {a,d,e}
10 {b,d}
a. Draw an itemset lattice representing the data set given in Table 5.22. Label each node in the lattice with the following letter(s):
N: If the itemset is not considered to be a candidate itemset by the Apriori algorithm. There are two reasons for an itemset not to be considered as a candidate itemset: (1) it is not generated at all during the candidate generation step, or (2) it is generated during the candidate generation step but is subsequently removed during the candidate pruning step because one of its subsets is found to be infrequent.
F: If the candidate itemset is found to be frequent by the Apriori algorithm.
I: If the candidate itemset is found to be infrequent after support counting.
b. What is the percentage of frequent itemsets (with respect to all itemsets in the lattice)?
c. What is the pruning ratio of the Apriori algorithm on this data set? (Pruning ratio is defined as the percentage of itemsets not considered to be a candidate because (1) they are not generated during candidate generation or (2) they are pruned during the candidate pruning step.)
d. What is the false alarm rate (i.e., percentage of candidate itemsets that are found to be infrequent after performing support counting)?
10. The Apriori algorithm uses a hash tree data structure to efficiently count the support of candidate itemsets. Consider the hash tree for candidate 3-itemsets shown in Figure 5.32.
Figure 5.32. An example of a hash tree structure.
a. Given a transaction that contains items {1,3,4,5,8}, which of the hash tree leaf nodes will be visited when finding the candidates of the transaction?
b. Use the visited leaf nodes in part (a) to determine the candidate itemsets that are contained in the transaction {1,3,4,5,8}.
11. Consider the following set of candidate 3-itemsets:
{1,2,3}, {1,2,6}, {1,3,4}, {2,3,4}, {2,4,5}, {3,4,6}, {4,5,6}
a. Construct a hash tree for the above candidate 3-itemsets. Assume the tree uses a hash function where all odd-numbered items are hashed to the left child of a node, while the even-numbered items are hashed to the right child. A candidate k-itemset is inserted into the tree by hashing on each successive item in the candidate and then following the appropriate branch of the tree according to the hash value. Once a leaf node is reached, the candidate is inserted based on one of the following conditions:
Condition 1: If the depth of the leaf node is equal to k (the root is assumed to be at depth 0), then the candidate is inserted regardless of the number of itemsets already stored at the node.
Condition 2: If the depth of the leaf node is less than k, then the candidate can be inserted as long as the number of itemsets stored at the node is less than maxsize. Assume maxsize = 2 for this question.
Condition 3: If the depth of the leaf node is less than k and the number of itemsets stored at the node is equal to maxsize, then the leaf node is converted into an internal node. New leaf nodes are created as children of the old leaf node. Candidate itemsets previously stored in the old leaf node are distributed to the children based on their hash values. The new candidate is also hashed to its appropriate leaf node.
b. How many leaf nodes are there in the candidate hash tree? How many internal nodes are there?
c. Consider a transaction that contains the following items: {1,2,3,5,6}. Using the hash tree constructed in part (a), which leaf nodes will be checked against the transaction? What are the candidate 3-itemsets contained in the transaction?
12. Given the lattice structure shown in Figure 5.33 and the transactions given in Table 5.22, label each node with the following letter(s):
Figure 5.33. An itemset lattice.
M if the node is a maximal frequent itemset,
C if it is a closed frequent itemset,
N if it is frequent but neither maximal nor closed, and
I if it is infrequent.
Assume that the support threshold is equal to 30%.
13. The original association rule mining formulation uses the support and confidence measures to prune uninteresting rules.
a. Draw a contingency table for each of the following rules using the transactions shown in Table 5.23.
Table 5.23. Example of market basket transactions.
Transaction ID  Items Bought
1 {a,b,d,e}
2 {b,c,d}
3 {a,b,d,e}
4 {a,c,d,e}
5 {b,c,d,e}
6 {b,d,e}
7 {c,d}
8 {a,b,c}
9 {a,d,e}
10 {b,d}
Rules: {b} → {c}, {a} → {d}, {b} → {d}, {e} → {c}, {c} → {a}.
b. Use the contingency tables in part (a) to compute and rank the rules in decreasing order according to the following measures.
i. Support.
ii. Confidence.
iii. Interest(X → Y) = P(X,Y)/(P(X)P(Y)).
iv. IS(X → Y) = P(X,Y)/√(P(X)P(Y)).
v. Klosgen(X → Y) = √P(X,Y) × max(P(Y|X) − P(Y), P(X|Y) − P(X)), where P(Y|X) = P(X,Y)/P(X).
vi. Odds ratio(X → Y) = (P(X,Y)P(X̄,Ȳ)) / (P(X,Ȳ)P(X̄,Y)).
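For reference when working through this exercise, the following sketch (plain Python; the function name, variable names, and the sample counts are mine, chosen only for illustration) computes the four measures above from the counts f11, f10, f01, f00 of a 2 × 2 contingency table for a rule X → Y.

import math

def measures(f11, f10, f01, f00):
    """Interest, IS, Klosgen, and odds ratio from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    pxy, px, py = f11 / n, (f11 + f10) / n, (f11 + f01) / n
    interest = pxy / (px * py)
    is_measure = pxy / math.sqrt(px * py)
    klosgen = math.sqrt(pxy) * max(pxy / px - py, pxy / py - px)
    odds_ratio = (f11 * f00) / (f10 * f01)   # same as P(X,Y)P(X',Y') / (P(X,Y')P(X',Y))
    return interest, is_measure, klosgen, odds_ratio

print(measures(f11=20, f10=5, f01=70, f00=5))   # made-up counts, for illustration only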
14. Given the rankings you had obtained in Exercise 13, compute the correlation between the rankings of confidence and the other five measures. Which measure is most highly correlated with confidence? Which measure is least correlated with confidence?
15. Answer the following questions using the data sets shown in Figure 5.34. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of items. We will apply the Apriori algorithm to extract frequent itemsets with minsup = 10% (i.e., itemsets must be contained in at least 1000 transactions).
Figure 5.34. Figures for Exercise 15.
a. Which data set(s) will produce the most number of frequent itemsets?
b. Which data set(s) will produce the fewest number of frequent itemsets?
c. Which data set(s) will produce the longest frequent itemset?
d. Which data set(s) will produce frequent itemsets with highest maximum support?
e. Which data set(s) will produce frequent itemsets containing items with wide-varying support levels (i.e., items with mixed support, ranging from less than 20% to more than 70%)?
16.
a. Prove that the ϕ coefficient is equal to 1 if and only if f11 = f1+ = f+1.
b. Show that if A and B are independent, then P(A,B) × P(Ā,B̄) = P(A,B̄) × P(Ā,B).
c. Show that Yule's Q and Y coefficients
Q = [f11 f00 − f10 f01] / [f11 f00 + f10 f01]
Y = [√(f11 f00) − √(f10 f01)] / [√(f11 f00) + √(f10 f01)]
are normalized versions of the odds ratio.
d. Write a simplified expression for the value of each measure shown in Table 5.9 when the variables are statistically independent.
17. Consider the interestingness measure, M = [P(B|A) − P(B)] / [1 − P(B)], for an association rule A → B.
a. What is the range of this measure? When does the measure attain its maximum and minimum values?
b. How does M behave when P(A,B) is increased while P(A) and P(B) remain unchanged?
c. How does M behave when P(A) is increased while P(A,B) and P(B) remain unchanged?
d. How does M behave when P(B) is increased while P(A,B) and P(A) remain unchanged?
e. Is the measure symmetric under variable permutation?
f. What is the value of the measure when A and B are statistically independent?
g. Is the measure null-invariant?
h. Does the measure remain invariant under row or column scaling operations?
i. How does the measure behave under the inversion operation?
18. Suppose we have market basket data consisting of 100 transactions and 20 items. Assume the support for item a is 25%, the support for item b is 90% and the support for itemset {a, b} is 20%. Let the support and confidence thresholds be 10% and 60%, respectively.
a. Compute the confidence of the association rule {a} → {b}. Is the rule interesting according to the confidence measure?
b. Compute the interest measure for the association pattern {a, b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.
c. What conclusions can you draw from the results of parts (a) and (b)?
d. Prove that if the confidence of the rule {a} → {b} is less than the support of {b}, then:
i. c({ā} → {b}) > c({a} → {b}),
ii. c({ā} → {b}) > s({b}),
where c(·) denotes the rule confidence and s(·) denotes the support of an itemset.
19. Table 5.24 shows a 2 × 2 × 2 contingency table for the binary variables A and B at different values of the control variable C.
Table 5.24. A Contingency Table.
                 A
                 1    0
C = 0   B   1    0    15
            0    15   30
C = 1   B   1    5    0
            0    0    15
a. Compute the ϕ coefficient for A and B when C = 0, C = 1, and C = 0 or 1. Note that ϕ({A,B}) = [P(A,B) − P(A)P(B)] / √(P(A)P(B)(1 − P(A))(1 − P(B))).
b. What conclusions can you draw from the above result?
20. Consider the contingency tables shown in Table 5.25.
a. For table I, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.
b. For table II, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.
Table 5.25. Contingency tables for Exercise 20.
(a) Table I.
        B    B̄
A       9    1
Ā       1    89
(b) Table II.
        B    B̄
A       89   1
Ā       1    9
c. What conclusions can you draw from the results of (a) and (b)?
21. Consider the relationship between customers who buy high-definition televisions and exercise machines as shown in Tables 5.17 and 5.18.
a. Compute the odds ratios for both tables.
b. Compute the ϕ-coefficient for both tables.
c. Compute the interest factor for both tables.
For each of the measures given above, describe how the direction of association changes when data is pooled together instead of being stratified.
6 Association Analysis: Advanced Concepts
The association rule mining formulation described in the previous chapter assumes that the input data consists of binary attributes called items. The presence of an item in a transaction is also assumed to be more important than its absence. As a result, an item is treated as an asymmetric binary attribute and only frequent patterns are considered interesting.
This chapter extends the formulation to data sets with symmetric binary, categorical, and continuous attributes. The formulation will also be extended to incorporate more complex entities such as sequences and graphs. Although the overall structure of association analysis algorithms remains unchanged, certain aspects of the algorithms must be modified to handle the non-traditional entities.
6.1 Handling Categorical Attributes
There are many applications that contain symmetric binary and nominal attributes. The Internet survey data shown in Table 6.1 contains symmetric binary attributes such as Gender and Shop Online, as well as nominal attributes such as Level of Education and State. Using association analysis, we may uncover interesting information about the characteristics of Internet users such as {Shop Online = Yes} → {Privacy Concerns = Yes}.
Table 6.1. Internet survey data with categorical attributes.
Gender  Level of Education  State  Computer at Home  Chat Online  Shop Online  Privacy Concerns
Female Graduate Illinois Yes Yes Yes Yes
Male College California No No No No
Male Graduate Michigan Yes Yes Yes Yes
Female College Virginia No No Yes Yes
Female Graduate California Yes No No Yes
Male College Minnesota Yes Yes Yes Yes
Male College Alaska Yes Yes Yes No
Male High School Oregon Yes No No No
Female Graduate Texas No Yes No No
… … … … … … …
This rule suggests that most Internet users who shop online are concerned about their personal privacy.
To extract such patterns, the categorical and symmetric binary attributes are transformed into "items" first, so that existing association rule mining algorithms can be applied. This type of transformation can be performed by creating a new item for each distinct attribute-value pair. For example, the nominal attribute Level of Education can be replaced by three binary items: Education = College, Education = Graduate, and Education = High School. Similarly, symmetric binary attributes such as Gender can be converted into a pair of binary items, Male and Female. Table 6.2 shows the result of binarizing the Internet survey data.
Table 6.2. Internet survey data after binarizing categorical and symmetric binary attributes.
Male  Female  Education = Graduate  Education = College  …  Privacy = Yes  Privacy = No
0 1 1 0 … 1 0
1 0 0 1 … 0 1
1 0 1 0 … 1 0
0 1 0 1 … 1 0
0 1 1 0 … 1 0
1 0 0 1 … 1 0
1 0 0 1 … 0 1
1 0 0 0 … 0 1
0 1 1 0 … 0 1
… … … … … … …
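To make the attribute-to-item transformation described above concrete, here is a minimal sketch (plain Python; the records below simply mirror a few rows of Table 6.1, and the helper name binarize is mine) that creates one binary item per attribute-value pair.

# Sketch: turn categorical and symmetric binary attributes into "items",
# one item per attribute-value pair, as described above.
def binarize(record):
    """record: dict mapping attribute name -> value, e.g., a row of Table 6.1."""
    return {attr + " = " + str(value) for attr, value in record.items()}

survey = [
    {"Gender": "Female", "Education": "Graduate", "State": "Illinois",
     "Shop Online": "Yes", "Privacy Concerns": "Yes"},
    {"Gender": "Male", "Education": "College", "State": "California",
     "Shop Online": "No", "Privacy Concerns": "No"},
]
transactions = [binarize(row) for row in survey]
for t in transactions:
    print(sorted(t))
# Each transaction is now a set of items such as "Gender = Female" or
# "Privacy Concerns = Yes", ready for a standard algorithm such as Apriori.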
There are several issues to consider when applying association analysis to the binarized data:
1. Some attribute values may not be frequent enough to be part of a frequent pattern. This problem is more evident for nominal attributes that have many possible values, e.g., state names. Lowering the support threshold does not help because it exponentially increases the number of frequent patterns found (many of which may be spurious) and makes the computation more expensive. A more practical solution is to group related attribute values into a small number of categories. For example, each state name can be replaced by its corresponding geographical region, such as Midwest or Southwest. Another possibility is to aggregate the less frequent attribute values into a single category called Others, as shown in Figure 6.1.
Figure 6.1. A pie chart with a merged category called Others.
2. Some attribute values may have considerably higher frequencies than others. For example, suppose 85% of the survey participants own a home computer. By creating a binary item for each attribute value that appears frequently in the data, we may potentially generate many redundant patterns, as illustrated by the following example:
{Computer at Home = Yes, Shop Online = Yes} → {Privacy Concerns = Yes}.
The rule is redundant because it is subsumed by the more general rule given at the beginning of this section. Because the high-frequency items correspond to the typical values of an attribute, they seldom carry any new information that can help us to better understand the pattern. It may therefore be useful to remove such items before applying standard association analysis algorithms. Another possibility is to apply the techniques presented in Section 5.8 for handling data sets with a wide range of support values.
3. Although the width of every transaction is the same as the number of attributes in the original data, the computation time may increase especially when many of the newly created items become frequent. This is because more time is needed to deal with the additional candidate itemsets generated by these items (see Exercise 1 on page 510). One way to reduce the computation time is to avoid generating candidate itemsets that contain more than one item from the same attribute. For example, we do not have to generate a candidate itemset such as {Education = Graduate, Education = College} because the support count of the itemset is zero.
6.2 Handling Continuous Attributes
The Internet survey data described in the previous section may also contain continuous attributes such as the ones shown in Table 6.3. Mining the continuous attributes may reveal useful insights about the data such as "users whose annual income is more than $120K belong to the 45–60 age group" or "users who have more than 3 email accounts and spend more than 15 hours online per week are often concerned about their personal privacy." Association rules that contain continuous attributes are commonly known as quantitative association rules.
Table 6.3. Internet survey data with continuous attributes.
Gender  …  Age  Annual Income  No. of Hours Spent Online per Week  No. of Email Accounts  Privacy Concern
Female … 26 90K 20 4 Yes
Male … 51 135K 10 2 No
Male … 29 80K 10 3 Yes
Female … 45 120K 15 3 Yes
Female … 31 95K 20 5 Yes
Male … 25 55K 25 5 Yes
Male … 37 100K 10 1 No
Male … 41 65K 8 2 No
Female … 26 85K 12 1 No
… … … … … … …
This section describes the various methodologies for applying association analysis to continuous data. We will specifically discuss three types of methods: (1) discretization-based methods, (2) statistics-based methods, and (3) non-discretization methods. The quantitative association rules derived using these methods are quite different in nature.
6.2.1 Discretization-Based Methods
Discretization is the most common approach for handling continuous attributes. This approach groups the adjacent values of a continuous attribute into a finite number of intervals. For example, the Age attribute can be divided into the following intervals: Age ∈ [12,16), Age ∈ [16,20), Age ∈ [20,24), …, Age ∈ [56,60), where [a, b) represents an interval that includes a but not b. Discretization can be performed using any of the techniques described in Section 2.3.6 (equal interval width, equal frequency, entropy-based, or clustering). The discrete intervals are then mapped into asymmetric binary attributes so that existing association analysis algorithms can be applied. Table 6.4 shows the Internet survey data after discretization and binarization.
Table 6.4. Internet survey data after binarizing categorical and continuous attributes.
Male  Female  …  Age < 13  Age ∈ [13,21)  Age ∈ [21,30)  …  Privacy = Yes  Privacy = No
0 1 … 0 0 1 … 1 0
1 0 … 0 0 0 … 0 1
1 0 … 0 0 1 … 1 0
0 1 … 0 0 0 … 1 0
0 1 … 0 0 0 … 1 0
1 0 … 0 0 1 … 1 0
1 0 … 0 0 0 … 0 1
1 0 … 0 0 0 … 0 1
0 1 … 0 0 1 … 0 1
… … … … … … … … …
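As a small illustration of the discretization step used to build Table 6.4, the following sketch (plain Python; the 4-year bin width and the helper names are assumptions made for illustration) maps an Age value to an equal-width interval item.

def make_bins(lo, hi, width):
    """Equal-width intervals [lo, lo+width), ..., covering [lo, hi)."""
    edges = list(range(lo, hi + width, width))
    return list(zip(edges[:-1], edges[1:]))

def age_item(age, bins):
    """Map a continuous Age value to its binary interval item, or None if out of range."""
    for lo, hi in bins:
        if lo <= age < hi:
            return "Age ∈ [{},{})".format(lo, hi)
    return None

bins = make_bins(12, 60, 4)            # [12,16), [16,20), ..., [56,60)
for age in (26, 51, 29):               # sample Age values from Table 6.3
    print(age, "->", age_item(age, bins))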
A key parameter in attribute discretization is the number of intervals used to partition each attribute. This parameter is typically provided by the users and can be expressed in terms of the interval width (for the equal interval width approach), the average number of transactions per interval (for the equal frequency approach), or the number of desired clusters (for the clustering-based approach). The difficulty in determining the right number of intervals can be illustrated using the data set shown in Table 6.5, which summarizes the responses of 250 users who participated in the survey. There are two strong rules embedded in the data:
R1: Age ∈ [16,24) → Chat Online = Yes (s = 8.8%, c = 81.5%).
R2: Age ∈ [44,60) → Chat Online = No (s = 16.8%, c = 70%).
Table 6.5. A breakdown of Internet users who participated in online chat according to their age group.
Age Group  Chat Online = Yes  Chat Online = No
[12,16) 12 13
[16,20) 11 2
[20,24) 11 3
[24,28) 12 13
[28,32) 14 12
[32,36) 15 12
[36,40) 16 14
[40,44) 16 14
[44,48) 4 10
[48,52) 5 11
[52,56) 5 10
[56,60) 4 11
These rules suggest that most of the users from the age group of 16–24 often participate in online chatting, while those from the age group of 44–60 are less likely to chat online. In this example, we consider a rule to be interesting only if its support (s) exceeds 5% and its confidence (c) exceeds 65%. One of the problems encountered when discretizing the Age attribute is how to determine the interval width.
1. If the interval is too wide, then we may lose some patterns because of their lack of confidence. For example, when the interval width is 24 years, R1 and R2 are replaced by the following rules:
R′1: Age ∈ [12,36) → Chat Online = Yes (s = 30%, c = 57.7%).
R′2: Age ∈ [36,60) → Chat Online = No (s = 28%, c = 58.3%).
Despite their higher supports, the wider intervals have caused the confidence for both rules to drop below the minimum confidence threshold. As a result, both patterns are lost after discretization.
2. If the interval is too narrow, then we may lose some patterns because of their lack of support. For example, if the interval width is 4 years, then R1 is broken up into the following two subrules:
R11(4): Age ∈ [16,20) → Chat Online = Yes (s = 4.4%, c = 84.6%).
R12(4): Age ∈ [20,24) → Chat Online = Yes (s = 4.4%, c = 78.6%).
Since the supports for the subrules are less than the minimum support threshold, R1 is lost after discretization. Similarly, the rule R2, which is broken up into four subrules, will also be lost because the support of each subrule is less than the minimum support threshold.
3. If the interval width is 8 years, then the rule R2 is broken up into the following two subrules:
R21(8): Age ∈ [44,52) → Chat Online = No (s = 8.4%, c = 70%).
R22(8): Age ∈ [52,60) → Chat Online = No (s = 8.4%, c = 70%).
Since R21(8) and R22(8) have sufficient support and confidence, R2 can be recovered by aggregating both subrules. Meanwhile, R1 is broken up into the following two subrules:
R11(8): Age ∈ [12,20) → Chat Online = Yes (s = 9.2%, c = 60.5%).
R12(8): Age ∈ [20,28) → Chat Online = Yes (s = 9.2%, c = 59.0%).
Unlike R2, we cannot recover the rule R1 by aggregating the subrules because both subrules fail the confidence threshold.
One way to address these issues is to consider every possible grouping of adjacent intervals. For example, we can start with an interval width of 4 years and then merge the adjacent intervals into wider intervals: Age ∈ [12,16), Age ∈ [12,20), …, Age ∈ [12,60), Age ∈ [16,20), Age ∈ [16,24), etc.
This approach enables the detection of both R1 and R2 as strong rules. However, it also leads to the following computational issues:
1. The computation becomes extremely expensive. If the range is initially divided into k intervals, then k(k−1)/2 binary items must be generated to represent all possible intervals. Furthermore, if an item corresponding to the interval [a, b) is frequent, then all other items corresponding to intervals that subsume [a, b) must be frequent too. This approach can therefore generate far too many candidate and frequent itemsets. To address these problems, a maximum support threshold can be applied to prevent the creation of items corresponding to very wide intervals and to reduce the number of itemsets.
2. Many redundant rules are extracted. For example, consider the following pair of rules:
R3: {Age ∈ [16,20), Gender = Male} → {Chat Online = Yes},
R4: {Age ∈ [16,24), Gender = Male} → {Chat Online = Yes}.
R4 is a generalization of R3 (and R3 is a specialization of R4) because R4 has a wider interval for the Age attribute. If the confidence values for both rules are the same, then R4 should be more interesting because it covers more examples, including those for R3. R3 is therefore a redundant rule.
6.2.2 Statistics-Based Methods
Quantitative association rules can be used to infer the statistical properties of a population. For example, suppose we are interested in finding the average age of certain groups of Internet users based on the data provided in Tables 6.1 and 6.3. Using the statistics-based method described in this section, quantitative association rules such as the following can be extracted:
{Annual Income > $100K, Shop Online = Yes} → Age: Mean = 38.
The rule states that the average age of Internet users whose annual income exceeds $100K and who shop online regularly is 38 years old.
Rule Generation
To generate the statistics-based quantitative association rules, the target attribute used to characterize interesting segments of the population must be specified. By withholding the target attribute, the remaining categorical and continuous attributes in the data are binarized using the methods described in the previous section. Existing algorithms such as Apriori or FP-growth are then applied to extract frequent itemsets from the binarized data. Each frequent itemset identifies an interesting segment of the population. The distribution of the target attribute in each segment can be summarized using descriptive statistics such as mean, median, variance, or absolute deviation. For example, the preceding rule is obtained by averaging the age of Internet users who support the frequent itemset {Annual Income > $100K, Shop Online = Yes}.
The number of quantitative association rules discovered using this method is the same as the number of extracted frequent itemsets. Because of the way the quantitative association rules are defined, the notion of confidence is not applicable to such rules. An alternative method for validating the quantitative association rules is presented next.
Rule Validation
A quantitative association rule is interesting only if the statistics computed from transactions covered by the rule are different than those computed from transactions not covered by the rule. For example, the rule given at the beginning of this section is interesting only if the average age of Internet users who do not support the frequent itemset {Annual Income > $100K, Shop Online = Yes} is significantly higher or lower than 38 years old. To determine whether the difference in their average ages is statistically significant, statistical hypothesis testing methods should be applied.
Consider the quantitative association rule, A → t: μ, where A is a frequent itemset, t is the continuous target attribute, and μ is the average value of t among transactions covered by A. Furthermore, let μ′ denote the average value of t among transactions not covered by A. The goal is to test whether the difference between μ and μ′ is greater than some user-specified threshold, Δ. In statistical hypothesis testing, two opposite propositions, known as the null hypothesis and the alternative hypothesis, are given. A hypothesis test is performed to determine which of these two hypotheses should be accepted, based on evidence gathered from the data (see Appendix C).
In this case, assuming that μ < μ′, the null hypothesis is H0: μ′ = μ + Δ, while the alternative hypothesis is H1: μ′ > μ + Δ. To determine which hypothesis should be accepted, the following Z-statistic is computed:
Z = (μ′ − μ − Δ) / √(s1²/n1 + s2²/n2),    (6.1)
where n1 is the number of transactions supporting A, n2 is the number of transactions not supporting A, s1 is the standard deviation for t among transactions that support A, and s2 is the standard deviation for t among transactions that do not support A. Under the null hypothesis, Z has a standard normal distribution with mean 0 and variance 1. The value of Z computed using Equation 6.1 is then compared against a critical value, Zα, which is a threshold that depends on the desired confidence level. If Z > Zα, then the null hypothesis is rejected and we may conclude that the quantitative association rule is interesting. Otherwise, there is not enough evidence in the data to show that the difference in mean is statistically significant.
Example 6.1. Consider the quantitative association rule
{Income > 100K, Shop Online = Yes} → Age: μ = 38.
Suppose there are 50 Internet users who supported the rule antecedent. The standard deviation of their ages is 3.5. On the other hand, the average age of the 200 users who do not support the rule antecedent is 30 and their standard deviation is 6.5. Assume that a quantitative association rule is considered interesting only if the difference between μ and μ′ is more than 5 years. Using Equation 6.1 we obtain
Z = (38 − 30 − 5) / √(3.5²/50 + 6.5²/200) = 4.4414.
For a one-sided hypothesis test at a 95% confidence level, the critical value for rejecting the null hypothesis is 1.64. Since Z > 1.64, the null hypothesis can be rejected. We therefore conclude that the quantitative association rule is interesting because the difference between the average ages of users who support and do not support the rule antecedent is more than 5 years.
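For concreteness, the short sketch below (plain Python, standard library only; the function and argument names are mine) reproduces the Z-statistic of Example 6.1 using Equation 6.1.

import math

def z_statistic(mean_covered, mean_not_covered, delta, s1, n1, s2, n2):
    """Z-statistic of Equation 6.1: (difference in group means - Delta) / standard error."""
    return (mean_covered - mean_not_covered - delta) / math.sqrt(s1**2 / n1 + s2**2 / n2)

# Example 6.1: mean age 38 over n1 = 50 users covered by the rule (s1 = 3.5),
# mean age 30 over n2 = 200 users not covered (s2 = 6.5), Delta = 5 years.
z = z_statistic(38, 30, 5, s1=3.5, n1=50, s2=6.5, n2=200)
print(round(z, 4))   # 4.4414, which exceeds the critical value of 1.64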
6.2.3 Non-discretization Methods
There are certain applications in which analysts are more interested in finding associations among the continuous attributes, rather than associations among discrete intervals of the continuous attributes. For example, consider the problem of finding word associations in text documents. Table 6.6 shows a document-word matrix where every entry represents the number of times a word appears in a given document. Given such a data matrix, we are interested in finding associations between words (e.g., word1 and word2) instead of associations between ranges of word counts (e.g., word1 ∈ [1,4] and word2 ∈ [2,3]). One way to do this is to transform the data into a 0/1 matrix, where the entry is 1 if the count exceeds some threshold t, and 0 otherwise. While this approach allows analysts to apply existing frequent itemset generation algorithms to the binarized data set, finding the right threshold for binarization can be quite tricky. If the threshold is set too high, it is possible to miss some interesting associations. Conversely, if the threshold is set too low, there is a potential for generating a large number of spurious associations.
Table 6.6. Document-word matrix.
Document  word1  word2  word3  word4  word5  word6
d1 3 6 0 0 0 2
d2 1 2 0 0 0 2
d3 4 2 7 0 0 2
d4 2 0 3 0 0 1
d5 0 0 0 1 1 0
This section presents another methodology for finding associations among continuous attributes, known as the min-Apriori approach. Analogous to traditional association analysis, an itemset is considered to be a collection of continuous attributes, while its support measures the degree of association among the attributes, across multiple rows of the data matrix. For example, a collection of words in Table 6.6 can be referred to as an itemset, whose support is determined by the co-occurrence of words across documents. In min-Apriori, the association among attributes in a given row of the data matrix is obtained by taking the minimum value of the attributes. For example, the association between words word1 and word2 in the document d1 is given by min(3,6) = 3. The support of an itemset is then computed by aggregating its association over all the documents:
s({word1, word2}) = min(3,6) + min(1,2) + min(4,2) + min(2,0) = 6.
The support measure defined in min-Apriori has the following desired properties, which make it suitable for finding word associations in documents:
1. Support increases monotonically as the number of occurrences of a word increases.
2. Support increases monotonically as the number of documents that contain the word increases.
3. Support has an anti-monotone property. For example, consider a pair of itemsets {A, B} and {A, B, C}. Since min({A, B}) ≥ min({A, B, C}), s({A, B}) ≥ s({A, B, C}). Therefore, support decreases monotonically as the number of words in an itemset increases.
The standard Apriori algorithm can be modified to find associations among words using the new support definition.
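The following sketch (plain Python; the dictionary literal simply re-encodes Table 6.6) computes the min-Apriori support of a set of words by summing the row-wise minima, reproducing the value 6 obtained above.

doc_word = {
    "d1": {"word1": 3, "word2": 6, "word3": 0, "word4": 0, "word5": 0, "word6": 2},
    "d2": {"word1": 1, "word2": 2, "word3": 0, "word4": 0, "word5": 0, "word6": 2},
    "d3": {"word1": 4, "word2": 2, "word3": 7, "word4": 0, "word5": 0, "word6": 2},
    "d4": {"word1": 2, "word2": 0, "word3": 3, "word4": 0, "word5": 0, "word6": 1},
    "d5": {"word1": 0, "word2": 0, "word3": 0, "word4": 1, "word5": 1, "word6": 0},
}

def min_support(itemset, matrix):
    """min-Apriori support: sum over documents of the minimum count among the words."""
    return sum(min(row[w] for w in itemset) for row in matrix.values())

print(min_support({"word1", "word2"}, doc_word))            # 3 + 1 + 2 + 0 + 0 = 6
print(min_support({"word1", "word2", "word3"}, doc_word))   # 2, never larger (anti-monotone)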
6.3 Handling a Concept Hierarchy
A concept hierarchy is a multilevel organization of the various entities or concepts defined in a particular domain. For example, in market basket analysis, a concept hierarchy has the form of an item taxonomy describing the "is-a" relationships among items sold at a grocery store, e.g., milk is a kind of food and DVD is a kind of home electronics equipment (see Figure 6.2). Concept hierarchies are often defined according to domain knowledge or based on a standard classification scheme defined by certain organizations (e.g., the Library of Congress classification scheme is used to organize library materials based on their subject categories).
Figure 6.2. Example of an item taxonomy.
A concept hierarchy can be represented using a directed acyclic graph, as shown in Figure 6.2. If there is an edge in the graph from a node p to another node c, we call p the parent of c and c the child of p. For example, milk is the parent of skim milk because there is a directed edge from the node milk to the node skim milk. X̂ is called an ancestor of X (and X a descendent of X̂) if there is a path from node X̂ to node X in the directed acyclic graph. In the diagram shown in Figure 6.2, food is an ancestor of skim milk and AC adaptor is a descendent of electronics.
The main advantages of incorporating concept hierarchies into association analysis are as follows:
1. Items at the lower levels of a hierarchy may not have enough support to appear in any frequent itemset. For example, although the sale of AC adaptors and docking stations may be low, the sale of laptop accessories, which is their parent node in the concept hierarchy, may be high. Also, rules involving high-level categories may have lower confidence than the ones generated using low-level categories. Unless the concept hierarchy is used, there is a potential to miss interesting patterns at different levels of categories.
2. Rules found at the lower levels of a concept hierarchy tend to be overly specific and may not be as interesting as rules at the higher levels. For example, staple items such as milk and bread tend to produce many low-level rules such as {skim milk} → {wheat bread} and {2% milk} → {white bread}. Using a concept hierarchy, they can be summarized into a single rule, {milk} → {bread}. Considering only items residing at the top level of their hierarchies also may not be good enough, because such rules may not be of any practical use. For example, although the rule {food} → {electronics} may satisfy the support and confidence thresholds, it is not informative because the combination of electronics and food items that are frequently purchased by customers are unknown. If milk and batteries are the only items sold together frequently, then the pattern {food, electronics} may have overgeneralized the situation.
Standard association analysis can be extended to incorporate concept hierarchies in the following way. Each transaction t is initially replaced with its extended transaction t′, which contains all the items in t along with their corresponding ancestors. For example, the transaction {DVD, wheat bread} can be extended to {DVD, wheat bread, home electronics, electronics, bread, food}, where home electronics and electronics are the ancestors of DVD, while bread and food are the ancestors of wheat bread.
We can then apply existing algorithms, such as Apriori, to the database of extended transactions. Although such an approach would find rules that span different levels of the concept hierarchy, it would suffer from several obvious limitations as described below:
1. Items residing at the higher levels tend to have higher support counts than those residing at the lower levels of a concept hierarchy. As a result, if the support threshold is set too high, then only patterns involving the high-level items are extracted. On the other hand, if the threshold is set too low, then the algorithm generates far too many patterns (most of which may be spurious) and becomes computationally inefficient.
2. Introduction of a concept hierarchy tends to increase the computation time of association analysis algorithms because of the larger number of items and wider transactions. The number of candidate patterns and frequent patterns generated by these algorithms may also grow exponentially with wider transactions.
3. Introduction of a concept hierarchy may produce redundant rules. A rule X → Y is redundant if there exists a more general rule X̂ → Ŷ, where X̂ is an ancestor of X, Ŷ is an ancestor of Y, and both rules have very similar confidence. For example, suppose {bread} → {milk}, {bread} → {2% milk}, {wheat bread} → {2% milk}, {wheat bread} → {skim milk}, and {bread} → {skim milk} have very similar confidence. The rules involving items from the lower level of the hierarchy are considered redundant because they can be summarized by a rule involving the ancestor items. An itemset such as {skim milk, milk, food} is also redundant because food and milk are ancestors of skim milk. Fortunately, it is easy to eliminate such redundant itemsets during frequent itemset generation, given the knowledge of the hierarchy.
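To illustrate the extended-transaction idea, here is a minimal sketch (plain Python; the toy child-to-parent taxonomy below is an assumption for illustration and is not taken from Figure 6.2) that augments a transaction with all ancestors of its items.

# Sketch: extend each transaction with the ancestors of its items,
# given a concept hierarchy encoded as a child -> parent mapping.
parent = {                      # assumed toy taxonomy
    "skim milk": "milk", "2% milk": "milk", "milk": "food",
    "wheat bread": "bread", "bread": "food",
    "DVD": "home electronics", "home electronics": "electronics",
}

def ancestors(item):
    result = set()
    while item in parent:
        item = parent[item]
        result.add(item)
    return result

def extend(transaction):
    """Return the extended transaction t' = the items plus all their ancestors."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return extended

print(sorted(extend({"DVD", "wheat bread"})))
# ['DVD', 'bread', 'electronics', 'food', 'home electronics', 'wheat bread']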
6.4 Sequential Patterns
Market basket data often contains temporal information about when an item was purchased by customers. Such information can be used to piece together the sequence of transactions made by a customer over a certain period of time. Similarly, event-based data collected from scientific experiments or the monitoring of physical systems, such as telecommunications networks, computer networks, and wireless sensor networks, have an inherent sequential nature to them. This means that an ordinal relation, usually based on temporal precedence, exists among events occurring in such data. However, the concepts of association patterns discussed so far emphasize only "co-occurrence" relationships and disregard the sequential information of the data. The latter information may be valuable for identifying recurring features of a dynamic system or predicting future occurrences of certain events. This section presents the basic concept of sequential patterns and the algorithms developed to discover them.
6.4.1 Preliminaries
The input to the problem of discovering sequential patterns is a sequence dataset, an example of which is shown on the left-hand side of Figure 6.3. Each row records the occurrences of events associated with a particular object at a given time. For example, the first row contains the set of events occurring at timestamp t = 10 for object A. Note that if we only consider the last column of this sequence dataset, it would look similar to a market basket data where every row represents a transaction containing a set of events (items). The traditional concept of association patterns in this data would correspond to common co-occurrences of events across transactions. However, a sequence dataset also contains information about the object and the timestamp of a transaction of events in the first two columns. These columns add context to every transaction, which enables a different style of association analysis for sequence datasets. The right-hand side of Figure 6.3 shows a different representation of the sequence dataset where the events associated with every object appear together, sorted in increasing order of their timestamps. In a sequence dataset, we can look for association patterns of events that commonly occur in a sequential order across objects. For example, in the sequence data shown in Figure 6.3, event 6 is followed by event 1 in all of the sequences. Note that such a pattern cannot be inferred if we treat this as a market basket data by ignoring information about the object and timestamp.
Figure 6.3. Example of a sequence database.
Before presenting a methodology for finding sequential patterns, we provide a brief description of sequences and subsequences.
Sequences
Generally speaking, a sequence is an ordered list of elements (transactions). A sequence can be denoted as s = ⟨e1 e2 e3 … en⟩, where each element ej is a collection of one or more events (items), i.e., ej = {i1, i2, …, ik}. The following is a list of examples of sequences:
Sequence of web pages viewed by a website visitor:
⟨{Homepage} {Electronics} {Cameras and Camcorders} {Digital Cameras} {Shopping Cart} {Order Confirmation} {Return to Shopping}⟩
Sequence of events leading to the nuclear accident at Three-Mile Island:
⟨{clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pump trip} {main water pump trips} {main turbine trips} {reactor pressure increases}⟩
Sequence of classes taken by a computer science major student in different semesters:
⟨{Algorithms and Data Structures, Introduction to Operating Systems} {Database Systems, Computer Architecture} {Computer Networks, Software Engineering} {Computer Graphics, Parallel Programming}⟩
A sequence can be characterized by its length and the number of occurring events. The length of a sequence corresponds to the number of elements present in the sequence, while we refer to a sequence that contains k events as a k-sequence. The web sequence in the previous example contains 7 elements and 7 events; the event sequence at Three-Mile Island contains 8 elements and 8 events; and the class sequence contains 4 elements and 8 events.
Figure 6.4 provides examples of sequences, elements, and events defined for a variety of application domains. Except for the last row, the ordinal attribute associated with each of the first three domains corresponds to calendar time. For the last row, the ordinal attribute corresponds to the location of the bases (A, C, G, T) in the gene sequence. Although the discussion on sequential patterns is primarily focused on temporal events, it can be extended to the case where the events have non-temporal ordering, such as spatial ordering.
Figure 6.4. Examples of elements and events in sequence datasets.
Subsequences
A sequence t is a subsequence of another sequence s if it is possible to derive t from s by simply deleting some events from elements in s or even deleting some elements in s completely. Formally, the sequence t = ⟨t1 t2 … tm⟩ is a subsequence of s = ⟨s1 s2 … sn⟩ if there exist integers 1 ≤ j1 < j2 < ⋯ < jm ≤ n such that t1 ⊆ sj1, t2 ⊆ sj2, …, tm ⊆ sjm. If t is a subsequence of s, then we say that t is contained in s. Table 6.7 gives examples illustrating the idea of subsequences for various sequences.
Table 6.7. Examples illustrating the concept of a subsequence.
Sequence, s                      Sequence, t          Is t a subsequence of s?
⟨{2,4} {3,5,6} {8}⟩              ⟨{2} {3,6} {8}⟩      Yes
⟨{2,4} {3,5,6} {8}⟩              ⟨{2} {8}⟩            Yes
⟨{1,2} {3,4}⟩                    ⟨{1} {2}⟩            No
⟨{2,4} {2,4} {2,5}⟩              ⟨{2} {4}⟩            Yes
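The containment test defined above is straightforward to implement; the sketch below (plain Python, greedy left-to-right matching) checks whether a sequence t is contained in a sequence s and reproduces two of the answers in Table 6.7.

def is_subsequence(t, s):
    """Return True if sequence t (a list of sets) is contained in sequence s."""
    j = 0                                   # current position in s
    for element in t:
        # advance through s until some element of s contains this element of t
        while j < len(s) and not element <= s[j]:
            j += 1
        if j == len(s):
            return False
        j += 1                              # elements of t must map to increasing positions
    return True

s = [{2, 4}, {3, 5, 6}, {8}]
print(is_subsequence([{2}, {3, 6}, {8}], s))          # True (row 1 of Table 6.7)
print(is_subsequence([{1}, {2}], [{1, 2}, {3, 4}]))   # False (row 3 of Table 6.7)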
6.4.2 Sequential Pattern Discovery
Let D be a dataset that contains one or more data sequences. The term data sequence refers to an ordered list of elements associated with a single data object. For example, the dataset shown in Figure 6.5 contains five data sequences, one for each object A, B, C, D, and E.
Figure 6.5. Sequential patterns derived from a dataset that contains five data sequences.
The support of a sequence s is the fraction of all data sequences that contain s. If the support for s is greater than or equal to a user-specified threshold minsup, then s is declared to be a sequential pattern (or frequent sequence).
Definition 6.1 (Sequential Pattern Discovery). Given a sequence dataset D and a user-specified minimum support threshold minsup, the task of sequential pattern discovery is to find all sequences with support ≥ minsup.
In Figure 6.5, the support for the sequence ⟨{1}{2}⟩ is equal to 80% because it is contained in four of the five data sequences (every object except for D). Assuming that the minimum support threshold is 50%, any sequence that is contained in at least three data sequences is considered to be a sequential pattern. Examples of sequential patterns extracted from the given dataset include ⟨{1}{2}⟩, ⟨{1,2}⟩, ⟨{2,3}⟩, ⟨{1,2}{2,3}⟩, etc.
Sequential pattern discovery is a computationally challenging task because the set of all possible sequences that can be generated from a collection of events is exponentially large and difficult to enumerate. For example, a collection of n events can result in the following examples of 1-sequences, 2-sequences, and 3-sequences:
1-sequences: ⟨i1⟩, ⟨i2⟩, …, ⟨in⟩
2-sequences: ⟨{i1,i2}⟩, ⟨{i1,i3}⟩, …, ⟨{in−1,in}⟩, …, ⟨{i1}{i1}⟩, ⟨{i1}{i2}⟩, …, ⟨{in}{in}⟩
3-sequences: ⟨{i1,i2,i3}⟩, ⟨{i1,i2,i4}⟩, …, ⟨{in−2,in−1,in}⟩, …, ⟨{i1}{i1,i2}⟩, ⟨{i1}{i1,i3}⟩, …, ⟨{in−1}{in−1,in}⟩, …, ⟨{i1,i2}{i2}⟩, ⟨{i1,i2}{i3}⟩, …, ⟨{in−1,in}{in}⟩, …, ⟨{i1}{i1}{i1}⟩, ⟨{i1}{i1}{i2}⟩, …, ⟨{in}{in}{in}⟩
The above enumeration is similar in some ways to the itemset lattice introduced in Chapter 5 for market basket data. However, note that the above enumeration is not exhaustive; it only shows some sequences and omits a large number of remaining ones by the use of ellipses (…). This is because the number of candidate sequences is substantially larger than the number of candidate itemsets, which makes their enumeration difficult. There are three reasons for the additional number of candidate sequences:
1. An item can appear at most once in an itemset, but an event can appear more than once in a sequence, in different elements of the sequence. For example, given a pair of items, i1 and i2, only one candidate 2-itemset, {i1,i2}, can be generated. In contrast, there are many candidate 2-sequences that can be generated using only two events: ⟨{i1}{i1}⟩, ⟨{i1}{i2}⟩, ⟨{i2}{i1}⟩, ⟨{i2}{i2}⟩, and ⟨{i1,i2}⟩.
2. Order matters in sequences, but not for itemsets. For example, {i1,i2} and {i2,i1} refer to the same itemset, whereas ⟨{i1}{i2}⟩, ⟨{i2}{i1}⟩, and ⟨{i1,i2}⟩ correspond to different sequences, and thus must be generated separately.
3. For market basket data, the number of distinct items n puts an upper bound on the number of candidate itemsets (2^n − 1), whereas for sequence data, even two events a and b can lead to infinitely many candidate sequences (see Figure 6.6 for an illustration).
Figure 6.6. Comparing the number of itemsets with the number of sequences generated using two events (items). We only show 1-sequences, 2-sequences, and 3-sequences for illustration.
Because of the above reasons, it is challenging to create a sequence lattice that enumerates all possible sequences even when the number of events in the data is small. It is thus difficult to use a brute-force approach for generating sequential patterns that enumerates all possible sequences by traversing the sequence lattice. Despite these challenges, the Apriori principle still holds for sequential data because any data sequence that contains a particular k-sequence must also contain all of its (k−1)-subsequences. As we will see later, even though it is challenging to construct the sequence lattice, it is possible to generate candidate k-sequences from the frequent (k−1)-sequences using the Apriori principle. This allows us to extract sequential patterns from a sequence dataset using an Apriori-like algorithm. The basic structure of this algorithm is shown in Algorithm 6.1.
Algorithm 6.1 Apriori-like algorithm for sequential pattern discovery.
1: k = 1.
2: Fk = { i | i ∈ I ∧ σ({i})/N ≥ minsup }. {Find all frequent 1-subsequences.}
3: repeat
4:   k = k + 1.
5:   Ck = candidate-gen(Fk−1). {Generate candidate k-subsequences.}
6:   Ck = candidate-prune(Ck, Fk−1). {Prune candidate k-subsequences.}
7:   for each data sequence t ∈ T do
8:     Ct = subsequence(Ck, t). {Identify all candidates contained in t.}
9:     for each candidate k-subsequence c ∈ Ct do
10:      σ(c) = σ(c) + 1. {Increment the support count.}
11:    end for
12:  end for
13:  Fk = { c | c ∈ Ck ∧ σ(c)/N ≥ minsup }. {Extract the frequent k-subsequences.}
14: until Fk = ∅
15: Answer = ∪ Fk.
Notice that the structure of the algorithm is almost identical to the Apriori algorithm for frequent itemset discovery, presented in the previous chapter. The algorithm would iteratively generate new candidate k-sequences, prune candidates whose (k−1)-sequences are infrequent, and then count the supports of the remaining candidates to identify the sequential patterns. The detailed aspects of these steps are given next.
Candidate Generation
We generate candidate k-sequences by merging a pair of frequent (k−1)-sequences. Although this approach is similar to the Fk−1 × Fk−1 strategy introduced in Chapter 5 for generating candidate itemsets, there are certain differences. First, in the case of generating sequences, we can merge a (k−1)-sequence with itself to produce a k-sequence, since an event can appear more than once in a sequence. For example, we can merge the 1-sequence ⟨{a}⟩ with itself to produce a candidate 2-sequence, ⟨{a}{a}⟩. Second, recall that in order to avoid generating duplicate candidates, the traditional Apriori algorithm merges a pair of frequent k-itemsets only if their first k − 1 items, arranged in lexicographic order, are identical. In the case of generating sequences, we still use the lexicographic order for arranging events within an element. However, the arrangement of elements in a sequence may not follow the lexicographic order. For example, ⟨{b,c}{a}{d}⟩ is a viable representation of a 4-sequence, even though the elements in the sequence are not arranged according to their lexicographic ranks. On the other hand, ⟨{c,b}{a}{d}⟩ is not a viable representation of the same 4-sequence, since the events in the first element violate the lexicographic order.
Given a sequence s = ⟨e1 e2 e3 … en⟩, where the events in every element are arranged lexicographically, we can refer to the first event of e1 as the first event of s and the last event of en as the last event of s. The criteria for merging sequences can then be stated in the form of the following procedure.
Sequence Merging Procedure
A sequence s(1) is merged with another sequence s(2) only if the subsequence obtained by dropping the first event in s(1) is identical to the subsequence obtained by dropping the last event in s(2). The resulting candidate is given by extending the sequence s(1) as follows:
1. If the last element of s(2) has only one event, append the last element of s(2) to the end of s(1) and obtain the merged sequence.
2. If the last element of s(2) has more than one event, append the last event from the last element of s(2) (that is absent in the last element of s(1)) to the last element of s(1) and obtain the merged sequence.
Figure 6.7 illustrates examples of candidate 4-sequences obtained by merging pairs of frequent 3-sequences. The first candidate, ⟨{1}{2}{3}{4}⟩, is obtained by merging ⟨{1}{2}{3}⟩ with ⟨{2}{3}{4}⟩. Since the last element of the second sequence ({4}) contains only one event, it is simply appended to the first sequence to generate the merged sequence. On the other hand, merging ⟨{1}{5}{3}⟩ with ⟨{5}{3,4}⟩ produces the candidate 4-sequence ⟨{1}{5}{3,4}⟩. In this case, the last element of the second sequence ({3,4}) contains two events. Hence, the last event in this element (4) is added to the last element of the first sequence ({3}) to obtain the merged sequence.
Figure 6.7. Example of the candidate generation and pruning steps of a sequential pattern mining algorithm.
It can be shown that the sequence merging procedure is complete, i.e., it generates every frequent k-subsequence. This is because every frequent k-subsequence s includes a frequent (k−1)-sequence s1, that does not contain the first event of s, and a frequent (k−1)-sequence s2, that does not contain the last event of s. Since s1 and s2 are frequent and follow the criteria for merging sequences, they will be merged to produce every frequent k-subsequence s as one of the candidates. Also, the sequence merging procedure ensures that there is a unique way of generating s only by merging s1 and s2. For example, in Figure 6.7, the sequences ⟨{1}{2}{3}⟩ and ⟨{1}{2,5}⟩ do not have to be merged because removing the first event from the first sequence does not give the same subsequence as removing the last event from the second sequence. Although ⟨{1}{2,5}{3}⟩ is a viable candidate, it is generated by merging a different pair of sequences, ⟨{1}{2,5}⟩ and ⟨{2,5}{3}⟩. This example illustrates that the sequence merging procedure does not generate duplicate candidate sequences.
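A compact sketch of the merging criterion just described is given below (plain Python; sequences are represented as lists of sorted tuples, a representation chosen only for illustration, and the second merge case assumes, as in the examples of Figure 6.7, that the appended event is lexicographically last in the resulting element).

def drop_first_event(seq):
    """Remove the first event from the first element of a sequence."""
    head = seq[0][1:]
    return ([head] if head else []) + list(seq[1:])

def drop_last_event(seq):
    """Remove the last event from the last element of a sequence."""
    tail = seq[-1][:-1]
    return list(seq[:-1]) + ([tail] if tail else [])

def merge(s1, s2):
    """Merge two frequent (k-1)-sequences into a candidate k-sequence, or return None."""
    if drop_first_event(s1) != drop_last_event(s2):
        return None
    if len(s2[-1]) == 1:                                # case 1: append the whole last element
        return list(s1) + [s2[-1]]
    return list(s1[:-1]) + [s1[-1] + (s2[-1][-1],)]     # case 2: append only the last event

# Examples from Figure 6.7 (elements written as sorted tuples):
print(merge([(1,), (2,), (3,)], [(2,), (3,), (4,)]))    # [(1,), (2,), (3,), (4,)]
print(merge([(1,), (5,), (3,)], [(5,), (3, 4)]))        # [(1,), (5,), (3, 4)]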
Candidate Pruning
A candidate k-sequence is pruned if at least one of its (k−1)-sequences is infrequent. For example, consider the candidate 4-sequence ⟨{1}{2}{3}{4}⟩. We need to check if any of the 3-sequences contained in this 4-sequence is infrequent. Since the sequence ⟨{1}{2}{4}⟩ is contained in this sequence and is infrequent, the candidate ⟨{1}{2}{3}{4}⟩ can be eliminated. Readers should be able to verify that the only candidate 4-sequence that survives the candidate pruning step in Figure 6.7 is ⟨{1}{2,5}{3}⟩.
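The pruning check can be sketched as follows (plain Python, using the same list-of-tuples representation as the previous sketch; the names and the small example set of frequent 3-sequences are mine): the (k−1)-subsequences of a candidate are produced by deleting one event at a time, and the candidate survives only if all of them are frequent.

def k_minus_1_subsequences(seq):
    """All sequences obtained by deleting exactly one event from the candidate."""
    result = []
    for i, element in enumerate(seq):
        for j in range(len(element)):
            reduced = element[:j] + element[j + 1:]
            sub = list(seq[:i]) + ([reduced] if reduced else []) + list(seq[i + 1:])
            result.append(sub)
    return result

def survives_pruning(candidate, frequent_k_minus_1):
    """Keep the candidate only if every (k-1)-subsequence is frequent."""
    return all(sub in frequent_k_minus_1 for sub in k_minus_1_subsequences(candidate))

frequent_3 = [[(1,), (2,), (3,)], [(2,), (3,), (4,)], [(1,), (2,), (4,)]]
print(survives_pruning([(1,), (2,), (3,), (4,)], frequent_3))
# False here, because the 3-subsequence [(1,), (3,), (4,)] is not in frequent_3.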
SupportCounting
Duringsupportcounting,thealgorithmidentifiesallcandidatek-sequencesbelongingtoaparticulardatasequenceandincrementstheirsupportcounts.Afterperformingthisstepforeachdatasequence,thealgorithmidentifiesthefrequentk-sequencesanddiscardsallcandidatesequenceswhosesupportvaluesarelessthantheminsupthreshold.
6.4.3 Timing Constraints*

This section presents a sequential pattern formulation where timing constraints are imposed on the events and elements of a pattern. To motivate the need for timing constraints, consider the following sequences of courses taken by two students who enrolled in a data mining class:

Student A: ⟨{Statistics}{Database Systems}{Data Mining}⟩.
Student B: ⟨{Database Systems}{Statistics}{Data Mining}⟩.

The sequential pattern of interest is ⟨{Statistics, Database Systems}{Data Mining}⟩, which means that students who are enrolled in the data mining class must have previously taken a course in statistics and database systems. Clearly, the pattern is supported by both students even though they did not take statistics and database systems at the same time. In contrast, a student who took a statistics course ten years earlier should not be considered as supporting the pattern because the time gap between the courses is too long. Because the formulation presented in the previous section does not incorporate these timing constraints, a new sequential pattern definition is needed.

Figure 6.8 illustrates some of the timing constraints that can be imposed on a pattern. The definition of these constraints and the impact they have on sequential pattern discovery algorithms will be discussed in the following sections. Note that each element of the sequential pattern is associated with a time window [l, u], where l is the earliest occurrence of an event within the time window and u is the latest occurrence of an event within the time window. Note that in this discussion, we allow events within an element to occur at different times. Hence, the actual timing of the event occurrences may not be the same as the lexicographic ordering.
Figure 6.8. Timing constraints of a sequential pattern.
The maxspan Constraint

The maxspan constraint specifies the maximum allowed time difference between the latest and the earliest occurrences of events in the entire sequence. For example, suppose the following data sequences contain elements that occur at consecutive timestamps (1, 2, 3, …), i.e., the ith element in the sequence occurs at the ith timestamp. Assuming that maxspan = 3, the following table contains sequential patterns that are supported and not supported by a given data sequence.

Data Sequence, s | Sequential Pattern, t | Does s support t?
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{3}{4}⟩ | Yes
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{3}{6}⟩ | Yes
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{1,3}{6}⟩ | No
In general, the longer the maxspan, the more likely it is to detect a pattern in a data sequence. However, a longer maxspan can also capture spurious patterns because it increases the chance for two unrelated events to be temporally related. In addition, the pattern may involve events that are already obsolete.

The maxspan constraint affects the support counting step of sequential pattern discovery algorithms. As shown in the preceding examples, some data sequences no longer support a candidate pattern when the maxspan constraint is imposed. If we simply apply Algorithm 6.1, the support counts for some patterns may be overestimated. To avoid this problem, the algorithm must be modified to ignore cases where the interval between the first and last occurrences of events in a given pattern is greater than maxspan.
The mingap and maxgap Constraints

Timing constraints can also be specified to restrict the time difference between two consecutive elements of a sequence. If the maximum time difference (maxgap) is one week, then events in one element must occur within a week's time of the events occurring in the previous element. If the minimum time difference (mingap) is zero, then events in one element must occur after the events occurring in the previous element. (See Figure 6.8.) The following table shows examples of patterns that pass or fail the maxgap and mingap constraints, assuming that maxgap = 3 and mingap = 1. These examples assume each element occurs at consecutive timestamps.

Data Sequence, s | Sequential Pattern, t | maxgap | mingap
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{3}{6}⟩ | Pass | Pass
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{6}{8}⟩ | Pass | Fail
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{1,3}{6}⟩ | Fail | Pass
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{1}{3}{8}⟩ | Fail | Fail
As with maxspan, these constraints will affect the support counting step of sequential pattern discovery algorithms because some data sequences no longer support a candidate pattern when mingap and maxgap constraints are present. These algorithms must be modified to ensure that the timing constraints are not violated when counting the support of a pattern. Otherwise, some infrequent sequences may mistakenly be declared as frequent patterns.

A side effect of using the maxgap constraint is that the Apriori principle might be violated. To illustrate this, consider the data set shown in Figure 6.5. Without mingap or maxgap constraints, the supports of ⟨{2}{5}⟩ and ⟨{2}{3}{5}⟩ are both equal to 60%. However, if mingap = 0 and maxgap = 1, then the support for ⟨{2}{5}⟩ reduces to 40%, while the support for ⟨{2}{3}{5}⟩ is still 60%. In other words, the support has increased when the number of events in a sequence increases, which contradicts the Apriori principle. The violation occurs because the object D does not support the pattern ⟨{2}{5}⟩, since the time gap between events 2 and 5 is greater than maxgap. This problem can be avoided by using the concept of a contiguous subsequence.

Definition 6.2 (Contiguous Subsequence). A sequence s is a contiguous subsequence of w = ⟨e1 e2 … ek⟩ if any one of the following conditions hold:

1. s is obtained from w after deleting an event from either e1 or ek,
2. s is obtained from w after deleting an event from any element ei ∈ w that contains at least two events, or
3. s is a contiguous subsequence of t and t is a contiguous subsequence of w.
The following examples illustrate the concept of a contiguous subsequence:

Data Sequence, s | Sequential Pattern, t | Is t a contiguous subsequence of s?
⟨{1}{2,3}⟩ | ⟨{1}{2}⟩ | Yes
⟨{1,2}{2}{3}⟩ | ⟨{1}{2}⟩ | Yes
⟨{3,4}{1,2}{2,3}{4}⟩ | ⟨{1}{2}⟩ | Yes
⟨{1}{3}{2}⟩ | ⟨{1}{2}⟩ | No
⟨{1,2}{1}{3}{2}⟩ | ⟨{1}{2}⟩ | No

Using the concept of contiguous subsequences, the Apriori principle can be modified to handle maxgap constraints in the following way.

Definition 6.3 (Modified Apriori Principle). If a k-sequence is frequent, then all of its contiguous (k−1)-subsequences must also be frequent.
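The pruning shortcut implied by Definition 6.3 only requires the contiguous (k−1)-subsequences of a candidate. The following minimal sketch, with a helper name of my own, enumerates them directly from conditions 1 and 2 of Definition 6.2; it uses the candidate ⟨{1}{2,3}{4}{5}⟩ discussed in the next paragraph as its example.

# Enumerate the contiguous (k-1)-subsequences of a k-sequence (Definition 6.2,
# conditions 1 and 2). Sequences are tuples of sorted event tuples, as before.

def contiguous_k_minus_1_subsequences(seq):
    subs = set()
    for i, elem in enumerate(seq):
        # an event may be deleted from the first element, the last element,
        # or any element containing at least two events
        if i not in (0, len(seq) - 1) and len(elem) < 2:
            continue
        for ev in elem:
            reduced = tuple(e for e in elem if e != ev)
            subs.add(seq[:i] + ((reduced,) if reduced else ()) + seq[i + 1:])
    return subs

cand = (("1",), ("2", "3"), ("4",), ("5",))
subs = contiguous_k_minus_1_subsequences(cand)
print(len(subs))   # 4: <{2,3}{4}{5}>, <{1}{2}{4}{5}>, <{1}{3}{4}{5}>, and <{1}{2,3}{4}>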
The modified Apriori principle can be applied to the sequential pattern discovery algorithm with minor modifications. During candidate pruning, not all of the (k−1)-subsequences of a candidate k-sequence need to be verified, since some of them may violate the maxgap constraint. For example, if maxgap = 1, it is not necessary to check whether the subsequence ⟨{1}{2,3}{5}⟩ of the candidate ⟨{1}{2,3}{4}{5}⟩ is frequent, since the time difference between elements {2,3} and {5} is greater than one time unit. Instead, only the contiguous subsequences of ⟨{1}{2,3}{4}{5}⟩ need to be examined. These subsequences include ⟨{1}{2,3}{4}⟩, ⟨{2,3}{4}{5}⟩, ⟨{1}{2}{4}{5}⟩, and ⟨{1}{3}{4}{5}⟩.

The Window Size Constraint

Finally, events within an element sj do not have to occur at the same time. A window size threshold (ws) can be defined to specify the maximum allowed time difference between the latest and earliest occurrences of events in any element of a sequential pattern. A window size of 0 means all events in the same element of a pattern must occur simultaneously.

The following example uses ws = 2 to determine whether a data sequence supports a given sequence (assuming mingap = 0, maxgap = 3, and maxspan = ∞).

Data Sequence, s | Sequential Pattern, t | Does s support t?
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{3,4}{5}⟩ | Yes
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{4,6}{8}⟩ | Yes
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{3,4,6}{8}⟩ | No
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{1,3,4}{6,7,8}⟩ | No
In the last example, although the pattern ⟨{1,3,4}{6,7,8}⟩ satisfies the window size constraint, it violates the maxgap constraint because the maximum time difference between events in the two elements is 5 units. The window size constraint also affects the support counting step of sequential pattern discovery algorithms. If Algorithm 6.1 is applied without imposing the window size constraint, the support counts for some of the candidate patterns might be underestimated, and thus some interesting patterns may be lost.
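To make the effect of these timing constraints on support counting concrete, here is a small sketch, under the simplifying assumption ws = 0, of the containment test that a modified Algorithm 6.1 would apply to each data sequence. The function and the representation are my own, not the book's: a data sequence is a list of (timestamp, set-of-events) pairs in increasing time order.

# Does a timed data sequence support a pattern under mingap/maxgap/maxspan (ws = 0)?

def supports(data_seq, pattern, mingap=0, maxgap=float("inf"), maxspan=float("inf")):
    def search(start, idx, first_time, prev_time):
        if idx == len(pattern):                 # every pattern element has been matched
            return True
        for j in range(start, len(data_seq)):
            t, events = data_seq[j]
            if idx > 0 and t - first_time > maxspan:
                break                           # later timestamps only enlarge the span
            gap_ok = idx == 0 or mingap < t - prev_time <= maxgap
            if gap_ok and pattern[idx] <= events:   # pattern element contained at time t
                if search(j + 1, idx + 1, t if idx == 0 else first_time, t):
                    return True
        return False
    return search(0, 0, None, None)

# Example from the text (elements occur at consecutive timestamps 1, 2, ..., 6):
s = [(1, {1, 3}), (2, {3, 4}), (3, {4}), (4, {5}), (5, {6, 7}), (6, {8})]
print(supports(s, [{3}, {6}], mingap=1, maxgap=3))      # True  (the Pass/Pass row)
print(supports(s, [{1, 3}, {6}], mingap=1, maxgap=3))   # False (maxgap is violated)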
6.4.4 Alternative Counting Schemes*

There are multiple ways of defining the support of a sequence given a data sequence. For example, if our database involves long sequences of events, we might be interested in finding subsequences that occur multiple times in the same data sequence. Hence, instead of counting the support of a subsequence as the number of data sequences it is contained in, we can also take into account the number of times a subsequence is contained in a data sequence. This viewpoint gives rise to several different formulations for counting the support of a candidate k-sequence from a database of sequences. For illustrative purposes, consider the problem of counting the support for the sequence ⟨{p}{q}⟩, as shown in Figure 6.9. Assume that ws = 0, mingap = 0, maxgap = 2, and maxspan = 2.
Figure 6.9. Comparing different support counting methods.
COBJ: One occurrence per object. This method looks for at least one occurrence of a given sequence in an object's timeline. In Figure 6.9, even though the sequence ⟨{p}{q}⟩ appears several times in the object's timeline, it is counted only once, with p occurring at t = 1 and q occurring at t = 3.

CWIN: One occurrence per sliding window. In this approach, a sliding time window of fixed length (maxspan) is moved across an object's timeline, one unit at a time. The support count is incremented each time the sequence is encountered in the sliding window. In Figure 6.9, the sequence ⟨{p}{q}⟩ is observed six times using this method.

CMINWIN: Number of minimal windows of occurrence. A minimal window of occurrence is the smallest window in which the sequence occurs given the timing constraints. In other words, a minimal window is the time interval such that the sequence occurs in that time interval, but it does not occur in any of the proper subintervals of it. This definition can be considered a restrictive version of CWIN, because its effect is to shrink and collapse some of the windows that are counted by CWIN. For example, the sequence ⟨{p}{q}⟩ has four minimal window occurrences: (1) the pair (p: t = 2, q: t = 3), (2) the pair (p: t = 3, q: t = 4), (3) the pair (p: t = 5, q: t = 6), and (4) the pair (p: t = 6, q: t = 7). The occurrence of event p at t = 1 and event q at t = 3 is not a minimal window occurrence because it contains a smaller window with (p: t = 2, q: t = 3), which is indeed a minimal window of occurrence.

CDIST_O: Distinct occurrences with possibility of event-timestamp overlap. A distinct occurrence of a sequence is defined to be a set of event-timestamp pairs such that there has to be at least one new event-timestamp pair that is different from a previously counted occurrence. Counting all such distinct occurrences results in the CDIST_O method. If the occurrence times of events p and q are denoted as a tuple (t(p), t(q)), then this method yields eight distinct occurrences of the sequence ⟨{p}{q}⟩ at times (1,3), (2,3), (2,4), (3,4), (3,5), (5,6), (5,7), and (6,7).

CDIST: Distinct occurrences with no event-timestamp overlap allowed. In CDIST_O above, two occurrences of a sequence were allowed to have overlapping event-timestamp pairs, e.g., the overlap between (1,3) and (2,3). In the CDIST method, no overlap is allowed. Effectively, when an event-timestamp pair is considered for counting, it is marked as used and is never used again for subsequent counting of the same sequence. As an example, there are five distinct, non-overlapping occurrences of the sequence ⟨{p}{q}⟩ in the diagram shown in Figure 6.9. These occurrences happen at times (1,3), (2,4), (3,5), (5,6), and (6,7). Observe that these occurrences are subsets of the occurrences observed in CDIST_O.
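The following sketch contrasts two of the counting schemes, CDIST_O and CDIST, for the pattern ⟨{p}{q}⟩. Since Figure 6.9 is not reproduced here, the timeline below is an assumption, chosen only so that it is consistent with the occurrence lists quoted above.

# Contrasting CDIST_O and CDIST for the two-element pattern <{p}{q}>.
p_times = [1, 2, 3, 5, 6]        # timestamps at which event p occurs (assumed)
q_times = [3, 4, 5, 6, 7]        # timestamps at which event q occurs (assumed)
MINGAP, MAXGAP = 0, 2

# CDIST_O: every (t(p), t(q)) pair satisfying the gap constraints is a distinct
# occurrence; overlapping event-timestamp pairs are allowed.
cdist_o = [(tp, tq) for tp in p_times for tq in q_times if MINGAP < tq - tp <= MAXGAP]

# CDIST: scan the same pairs in order, marking event-timestamp pairs as "used";
# an occurrence is counted only if none of its event-timestamps has been used.
used_p, used_q, cdist = set(), set(), []
for tp, tq in sorted(cdist_o):
    if tp not in used_p and tq not in used_q:
        cdist.append((tp, tq))
        used_p.add(tp)
        used_q.add(tq)

print(len(cdist_o), cdist_o)   # 8 occurrences: (1,3), (2,3), (2,4), ...
print(len(cdist), cdist)       # 5 non-overlapping occurrences: (1,3), (2,4), ...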
One final point regarding the counting methods is the need to determine the baseline for computing the support measure. For frequent itemset mining, the baseline is given by the total number of transactions. For sequential pattern mining, the baseline depends on the counting method used. For the COBJ method, the total number of objects in the input data can be used as the baseline. For the CWIN and CMINWIN methods, the baseline is given by the sum of the number of time windows possible in all objects. For methods such as CDIST and CDIST_O, the baseline is given by the sum of the number of distinct timestamps present in the input data of each object.
6.5 Subgraph Patterns

This section describes the application of association analysis methods to graphs, which are more complex entities than itemsets and sequences. A number of entities such as chemical compounds, 3-D protein structures, computer networks, and tree-structured XML documents can be modeled using a graph representation, as shown in Table 6.8.
Table 6.8. Graph representation of entities in various application domains.

Application | Graphs | Vertices | Edges
Web mining | Collection of inter-linked Web pages | Web pages | Hyperlink between pages
Computational chemistry | Chemical compounds | Atoms or ions | Bond between atoms or ions
Computer security | Computer networks | Computers and servers | Interconnection between machines
Semantic Web | XML documents | XML elements | Parent-child relationship between elements
Bioinformatics | 3-D protein structures | Amino acids | Contact residue
A useful data mining task to perform on this type of data is to derive a set of frequently occurring substructures in a collection of graphs. Such a task is known as frequent subgraph mining. A potential application of frequent subgraph mining can be seen in the context of computational chemistry. Each year, new chemical compounds are designed for the development of pharmaceutical drugs, pesticides, fertilizers, etc. Although the structure of a compound is known to play a major role in determining its chemical properties, it is difficult to establish their exact relationship. Frequent subgraph mining can aid this undertaking by identifying the substructures commonly associated with certain properties of known compounds. Such information can help scientists to develop new chemical compounds that have certain desired properties.

This section presents a methodology for applying association analysis to graph-based data. The section begins with a review of some of the basic graph-related concepts and definitions. The frequent subgraph mining problem is then introduced, followed by a description of how the traditional Apriori algorithm can be extended to discover such patterns.

6.5.1 Preliminaries
Graphs

A graph is a data structure that can be used to represent relationships among a set of entities. Mathematically, a graph G = (V, E) is composed of a vertex set V and a set of edges E connecting pairs of vertices. Each edge is denoted by a vertex pair (vi, vj), where vi, vj ∈ V. A label l(vi) can be assigned to each vertex vi, representing the name of an entity. Similarly, each edge (vi, vj) can also be associated with a label l(vi, vj) describing the relationship between a pair of entities. Table 6.8 shows the vertices and edges associated with different types of graphs. For example, in a web graph, the vertices correspond to web pages and the edges represent the hyperlinks between web pages.

Although the size of a graph can generally be represented either by the number of its vertices or its edges, in this chapter we will consider the size of a graph to be its number of edges. Further, we will denote a graph with k edges as a k-graph.
Graph Isomorphism

A basic primitive that is needed to work with graphs is to decide if two graphs with the same number of vertices and edges are equivalent to each other, i.e., represent the same structure of relationships among entities. Graph isomorphism provides a formal definition of graph equivalence that serves as a building block for computing similarities among graphs.

Definition 6.4 (Graph Isomorphism). Two graphs G1 = (V1, E1) and G2 = (V2, E2) are isomorphic to each other (denoted as G1 ≃ G2) if there exist functions fv: V1 → V2 and fe: E1 → E2 that map every vertex and edge, respectively, from G1 to G2, such that the following properties are satisfied:

1. Edge-preserving property: Two vertices va and vb in G1 form an edge in G1 if and only if the vertices fv(va) and fv(vb) form an edge in G2.

2. Label-preserving property: The labels of two vertices va and vb in G1 are equal if and only if the labels of fv(va) and fv(vb) in G2 are equal. Similarly, the labels of two edges (va, vb) and (vc, vd) in G1 are equal if and only if the labels of fe(va, vb) and fe(vc, vd) are equal.

The mapping functions (fv, fe) constitute the isomorphism between the graphs G1 and G2. This is denoted as (fv, fe): G1 → G2. An automorphism is a special type of isomorphism where a graph is mapped onto itself, i.e., V1 = V2 and E1 = E2. Figure 6.10 shows an example of a graph automorphism where the set of vertex labels in both graphs is {A, B}. Even though both graphs look different, they are actually isomorphic to each other because there is a one-to-one mapping between the vertices and edges of both graphs. Since the same graph can be depicted in multiple forms, detecting graph automorphism is a non-trivial problem. A common approach to solving this problem is to assign a canonical label to every graph, such that every automorphism of a graph shares the same canonical label. Canonical labels can also help in arranging graphs in a particular (canonical) order and checking for duplicates. Techniques for constructing canonical labels are not covered in this chapter, but interested readers may consult the Bibliographic Notes at the end of this chapter for more details.

Figure 6.10. Graph isomorphism.
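A brute-force check of Definition 6.4 can be written directly from the two properties, as in the sketch below. The dictionary-based graph representation and the function name are my own, and the exhaustive search over vertex mappings is only workable for very small graphs, which is exactly why canonical labels are used in practice.

# Brute-force labeled graph isomorphism. A graph is a dict of vertex labels plus a
# dict mapping frozenset({u, v}) edges to edge labels.

from itertools import permutations

def are_isomorphic(vlab1, elab1, vlab2, elab2):
    v1, v2 = list(vlab1), list(vlab2)
    if len(v1) != len(v2) or len(elab1) != len(elab2):
        return False
    for perm in permutations(v2):
        fv = dict(zip(v1, perm))                       # candidate vertex mapping
        if any(vlab1[u] != vlab2[fv[u]] for u in v1):
            continue                                   # vertex labels must be preserved
        mapped = {frozenset(fv[u] for u in e): lab for e, lab in elab1.items()}
        if mapped == elab2:                            # edges and edge labels preserved
            return True
    return False

# Two drawings of the same labeled square: an a-b-a-b cycle with all edges labeled "p".
g1_v = {1: "a", 2: "b", 3: "a", 4: "b"}
g1_e = {frozenset({1, 2}): "p", frozenset({2, 3}): "p",
        frozenset({3, 4}): "p", frozenset({4, 1}): "p"}
g2_v = {"w": "b", "x": "a", "y": "b", "z": "a"}
g2_e = {frozenset({"w", "x"}): "p", frozenset({"x", "y"}): "p",
        frozenset({"y", "z"}): "p", frozenset({"z", "w"}): "p"}
print(are_isomorphic(g1_v, g1_e, g2_v, g2_e))   # True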
Subgraphs

Definition 6.5 (Subgraph). A graph G′ = (V′, E′) is a subgraph of another graph G = (V, E) if its vertex set V′ is a subset of V and its edge set E′ is a subset of E, such that the endpoints of every edge in E′ are contained in V′. The subgraph relationship is denoted as G′ ⊆S G.

Example 6.2. Figure 6.11 shows a graph that contains 6 vertices and 11 edges along with one of its possible subgraphs. The subgraph, which is shown in Figure 6.11(b), contains only 4 of the 6 vertices and 4 of the 11 edges in the original graph.

Figure 6.11. Example of a subgraph.

Definition 6.6 (Support). Given a collection of graphs G, the support for a subgraph g is defined as the fraction of all graphs that contain g as its subgraph, i.e.,

s(g) = |{Gi | g ⊆S Gi, Gi ∈ G}| / |G|.   (6.2)

Example 6.3. Consider the five graphs, G1 through G5, shown in Figure 6.12, where the set of vertex labels ranges from a to e but all the edges in the graphs have the same label. The graph g1 shown on the top right-hand diagram is a subgraph of G1, G3, G4, and G5. Therefore, s(g1) = 4/5 = 80%. Similarly, we can show that s(g2) = 60% because g2 is a subgraph of G1, G2, and G3, while s(g3) = 40% because g3 is a subgraph of G1 and one other graph in the collection.

Figure 6.12. Computing the support of a subgraph from a set of graphs.

6.5.2 Frequent Subgraph Mining

This section presents a formal definition of the frequent subgraph mining problem and illustrates the complexity of this task.
Definition 6.7 (Frequent Subgraph Mining). Given a set of graphs G and a support threshold, minsup, the goal of frequent subgraph mining is to find all subgraphs g such that s(g) ≥ minsup.

While this formulation is generally applicable to any type of graph, the discussion presented in this chapter focuses primarily on undirected, connected graphs. The definitions of these graphs are given below:

1. A graph is undirected if it contains only undirected edges. An edge (vi, vj) is undirected if it is indistinguishable from (vj, vi).

2. A graph is connected if there exists a path between every pair of vertices in the graph, in which a path is a sequence of vertices ⟨v1 v2 … vk⟩ such that there is an edge connecting every pair of adjacent vertices (vi, vi+1) in the sequence.

Methods for handling other types of subgraphs (directed or disconnected) are left as an exercise to the readers (see Exercise 15 on page 519).

Mining frequent subgraphs is a computationally expensive task that is much more challenging than mining frequent itemsets or frequent subsequences. The additional complexity in frequent subgraph mining arises due to two major reasons. First, computing the support of a subgraph g given a collection of graphs G is not as straightforward as for itemsets or sequences. This is because it is a non-trivial problem to check if a subgraph g is contained in a graph g′ ∈ G, since the same graph g can be present in a different form in g′ due to graph isomorphism. The problem of verifying if a graph is a subgraph of another graph is known as the subgraph isomorphism problem, which is proven to be NP-complete, i.e., there is no known algorithm for this problem that runs in polynomial time.
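The following sketch makes the cost of support counting concrete: it tests subgraph containment by brute force over injective, label-preserving vertex mappings and then evaluates Equation 6.2 over a toy collection. The representation matches the isomorphism sketch above; all names are my own, and the exponential search illustrates why practical algorithms try hard to avoid unnecessary isomorphism tests.

# Brute-force subgraph containment and the support of Definition 6.6 / Equation 6.2.

from itertools import permutations

def contains(big_v, big_e, g_v, g_e):
    """Return True if the graph (big_v, big_e) contains (g_v, g_e) as a subgraph."""
    small = list(g_v)
    for image in permutations(big_v, len(small)):      # injective vertex mappings
        fv = dict(zip(small, image))
        if any(g_v[u] != big_v[fv[u]] for u in small):
            continue                                   # vertex labels must match
        if all(big_e.get(frozenset(fv[u] for u in e)) == lab for e, lab in g_e.items()):
            return True                                # every edge of g is present
    return False

def support(collection, g_v, g_e):
    """Fraction of graphs in the collection containing g (Equation 6.2)."""
    return sum(contains(v, e, g_v, g_e) for v, e in collection) / len(collection)

# Toy collection: two labeled triangles (only the first contains an a-p-b edge).
G1 = ({1: "a", 2: "b", 3: "c"},
      {frozenset({1, 2}): "p", frozenset({2, 3}): "p", frozenset({1, 3}): "q"})
G2 = ({1: "a", 2: "b", 3: "b"},
      {frozenset({1, 2}): "q", frozenset({2, 3}): "q", frozenset({1, 3}): "q"})
g = ({"x": "a", "y": "b"}, {frozenset({"x", "y"}): "p"})   # single edge a-p-b
print(support([G1, G2], *g))   # 0.5: only G1 contains the subgraph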
Second, the number of candidate subgraphs that can be generated from a given set of vertex and edge labels is far larger than the number of candidate itemsets generated using traditional market basket data sets. This is because of the following reasons:

1. A collection of items forms a unique itemset, but the same set of edge labels can be arranged in an exponential number of ways in a graph, with multiple choices of vertex labels at their endpoints. For example, items p, q, and r form a unique itemset {p, q, r}, but three edges with labels p, q, and r can form multiple graphs, some examples of which are shown in Figure 6.13.

Figure 6.13. Examples of graphs generated using three edges with labels p, q, and r.

2. An item can appear at most once in an itemset, but an edge label can appear multiple times in a graph, because different arrangements of edges with the same edge label represent different graphs. For example, an item p can only generate a single candidate itemset, which is the item itself. However, using a single edge label p and vertex label a, we can generate a number of graphs with different sizes, as shown in Figure 6.14.

Figure 6.14. Graphs of sizes one to three generated using a single edge label p and vertex label a.

Because of the above reasons, it is challenging to enumerate all possible subgraphs that can be generated using a given set of vertex and edge labels. Figure 6.15 shows some examples of 1-graphs, 2-graphs, and 3-graphs that can be generated using vertex labels {a, b} and edge labels {p, q}. It can be seen that even with two vertex and two edge labels, enumerating all possible graphs becomes difficult even for size two. Hence, it is highly impractical to use a brute-force method for frequent subgraph mining that enumerates all possible subgraphs and counts their respective supports.

Figure 6.15. Examples of graphs generated using two edge labels, p and q, and two vertex labels, a and b, for sizes varying from one to three.
However, note that the Apriori principle still holds for subgraphs because a k-graph is frequent only if all of its (k−1)-subgraphs are frequent. Hence, despite the computational challenges in enumerating all possible candidate subgraphs, we can use the Apriori principle to generate candidate k-subgraphs using frequent (k−1)-subgraphs. Algorithm 6.2 presents a generic Apriori-like approach for frequent subgraph mining. In the following, we briefly describe the three main steps of the algorithm: candidate generation, candidate pruning, and support counting.

Algorithm 6.2 Apriori-like algorithm for frequent subgraph mining.

1: F1 ← Find all frequent 1-subgraphs in G
2: F2 ← Find all frequent 2-subgraphs in G
3: k = 2.
4: repeat
5:   k = k + 1.
6:   Ck = candidate-gen(Fk−1). {Generate candidate k-subgraphs.}
7:   Ck = candidate-prune(Ck, Fk−1). {Perform candidate pruning.}
8:   for each graph g ∈ G do
9:     Ct = subgraph(Ck, g). {Identify all candidates contained in g.}
10:    for each candidate k-subgraph c ∈ Ct do
11:      σ(c) = σ(c) + 1. {Increment the support count.}
12:    end for
13:  end for
14:  Fk = {c | c ∈ Ck ∧ σ(c)/N ≥ minsup}. {Extract the frequent k-subgraphs.}
15: until Fk = ∅
16: Answer = ∪ Fk.
6.5.3 Candidate Generation

A pair of frequent (k−1)-subgraphs is merged to form a candidate k-subgraph if they share a common (k−2)-subgraph, known as their core. Given a common core, the subgraph merging procedure can be described as follows:

Subgraph Merging Procedure

Let Gi(k−1) and Gj(k−1) be two frequent (k−1)-subgraphs. Let Gi(k−1) consist of a core Gi(k−2) and an extra edge (u, u′), where u is part of the core. This is depicted in Figure 6.16(a), where the core is represented by a square and the extra edge is represented by a line between u and u′. Similarly, let Gj(k−1) consist of the core Gj(k−2) and the extra edge (v, v′), as shown in Figure 6.16(b).

Figure 6.16. A compact representation of a pair of frequent (k−1)-subgraphs considered for merging.

Using these cores, the two graphs are merged only if there exists an automorphism between the two cores: (fv, fe): Gi(k−2) → Gj(k−2). The resulting candidates are obtained by adding an edge to Gi(k−1) as follows:

1. If fv(u) = v, i.e., u is mapped to v in the automorphism between the cores, then generate a candidate by adding (v, u′) to Gj(k−1), as shown in Figure 6.17(a).

2. If fv(u) = w ≠ v, i.e., u is not mapped to v but to a different vertex w, then generate a candidate by adding (w, u′) to Gj(k−1). Additionally, if the labels of u′ and v′ are identical, then generate another candidate by adding (w, v′) to Gi(k−1), as shown in Figure 6.17(b).

Figure 6.17. Illustration of candidate merging procedures.

Figure 6.18(a) shows the candidate subgraph generated by merging G1 and G2. The shaded vertices and thicker lines represent the core vertices and edges, respectively, of the two graphs, while the dotted lines represent the mapping between the two cores. Note that this example illustrates condition 1 of the subgraph merging procedure, since the endpoints of the extra edges in both graphs are mapped to each other. This results in a single candidate subgraph, G3. On the other hand, Figure 6.18(b) shows an example of condition 2 of the subgraph merging procedure, where the endpoints of the extra edges do not map to each other and the labels of the new endpoints are identical. Merging the two graphs G4 and G5 thus results in two subgraphs, shown in the figure as G6 and G7.

Figure 6.18. Two examples of candidate k-subgraph generation using a pair of (k−1)-subgraphs.

The approach presented above of merging two frequent (k−1)-subgraphs is similar to the Fk−1 × Fk−1 candidate generation strategy introduced for itemsets in Chapter 5, and is guaranteed to exhaustively generate all frequent k-subgraphs as viable candidates (see Exercise 18). However,
there are several notable differences in the candidate generation procedures of itemsets and subgraphs.

1. Merging with Self: Unlike itemsets, a frequent (k−1)-subgraph can be merged with itself to create a candidate k-subgraph. This is especially important when a k-graph contains repeating units of edges contained in a (k−1)-subgraph. As an example, the 3-graphs shown in Figure 6.14 can only be generated from the 2-graphs shown in Figure 6.14 if self-merging is allowed.

2. Multiplicity of Candidates: As described in the subgraph merging procedure, a pair of frequent (k−1)-subgraphs sharing a common core can generate multiple candidates. As an example, if the labels at the endpoints of the extra edges are identical, i.e., l(u′) = l(v′), we will generate two candidates, as shown in Figure 6.18(b). On the other hand, merging a pair of frequent itemsets or subsequences generates a unique candidate itemset or subsequence.

3. Multiplicity of Cores: Two frequent (k−1)-subgraphs can share more than one core of size k−2 that is common to both graphs. Figure 6.19 shows an example of a pair of graphs that share two common cores. Since every choice of a common core can result in a different way of merging the two graphs, this can potentially contribute to the multiplicity of candidates generated by merging the same pair of subgraphs.

Figure 6.19. Multiplicity of cores for the same pair of (k−1)-subgraphs.

4. Multiplicity of Automorphisms: The common cores of the two graphs can be mapped to each other using multiple choices of mapping functions, each resulting in a different automorphism. To illustrate this, Figure 6.20 shows a pair of graphs that share a common core of size four, represented as a square. The first core can exist in three different forms (rotated views), each resulting in a different mapping between the two cores. Since the choice of the mapping function affects the candidate generation procedure, every automorphism of the core can potentially result in a different set of candidates, as shown in Figure 6.20.

Figure 6.20. An example showing multiple ways of mapping the cores of two (k−1)-subgraphs with one another.

5. Generation of Duplicate Candidates: In the case of itemsets, generation of duplicate candidates is avoided by the use of lexicographic ordering, such that two frequent k-itemsets are merged only if their first k−1 items, arranged in lexicographic order, are identical. Unfortunately, in the case of subgraphs, there does not exist a notion of lexicographic ordering among the vertices or edges of a graph. Hence, the same candidate k-subgraph can be generated by merging two different pairs of (k−1)-subgraphs. Figure 6.21 shows an example of a candidate 4-subgraph that can be generated in two different ways, using different pairs of frequent 3-subgraphs. Thus, it is necessary to check for duplicates and eliminate the redundant graphs during candidate pruning.

Figure 6.21. Different pairs of (k−1)-subgraphs can generate the same candidate k-subgraph, thus resulting in duplicate candidates.
Algorithm 6.3 presents the complete procedure for generating the set of all candidate k-subgraphs, Ck, using the set of frequent (k−1)-subgraphs, Fk−1. We consider merging every pair of subgraphs in Fk−1, including pairs involving the same subgraph twice (to ensure self-merging). For every pair of (k−1)-subgraphs, we consider all possible connected cores of size k−2 that can be constructed from the two graphs by removing an edge from each graph. If the two cores are isomorphic, we consider all possible mappings between the vertices and edges of the two cores. For every such mapping, we employ the subgraph merging procedure to produce candidate k-subgraphs, which are added to Ck.

Algorithm 6.3 Procedure for candidate generation: candidate-gen(Fk−1).

1: Ck = ∅.
2: for each pair, Gi(k−1) ∈ Fk−1 and Gj(k−1) ∈ Fk−1, i ≤ j do
3:   {Considering all pairs of frequent (k−1)-subgraphs for merging.}
4:   for each pair, ei ∈ Gi(k−1) and ej ∈ Gj(k−1) do
5:     {Finding all common cores between a pair of frequent (k−1)-subgraphs.}
6:     Gi(k−2) = Gi(k−1) − ei. {Removing an edge from Gi(k−1).}
7:     Gj(k−2) = Gj(k−1) − ej. {Removing an edge from Gj(k−1).}
8:     if Gi(k−2) ≃ Gj(k−2) AND Gi(k−2) and Gj(k−2) are connected graphs then
9:       {Gi(k−2) and Gj(k−2) are common cores of Gi(k−1) and Gj(k−1), respectively.}
10:      for each (fv, fe): Gi(k−2) → Gj(k−2) do
11:        {Generating candidates for every automorphism between the cores.}
12:        Ck = Ck ∪ subgraph-merge(Gi(k−2), Gj(k−2), fv, fe, ei, ej).
13:      end for
14:    end if
15:  end for
16: end for
17: Answer = Ck.
6.5.4 Candidate Pruning

After the candidate k-subgraphs are generated, the candidates whose (k−1)-subgraphs are infrequent need to be pruned. The pruning step can be performed by identifying all possible connected (k−1)-subgraphs that can be constructed by removing one edge from a candidate k-subgraph and then checking whether they have already been identified as frequent. If any of the (k−1)-subgraphs are infrequent, the candidate k-subgraph is discarded. Also, duplicate candidates need to be detected and eliminated. This can be done by comparing the canonical labels of candidate graphs, since the canonical labels of duplicate graphs will be identical. Canonical labels can also help in checking whether a (k−1)-subgraph contained in a candidate k-subgraph is frequent or not, by matching its canonical label with that of every frequent (k−1)-subgraph in Fk−1.

6.5.5 Support Counting

Support counting is also a potentially costly operation because all the candidate subgraphs contained in each graph G ∈ G must be determined. One way to speed up this operation is to maintain a list of graph IDs associated with each frequent (k−1)-subgraph. Whenever a new candidate k-subgraph is generated by merging a pair of frequent (k−1)-subgraphs, their corresponding lists of graph IDs are intersected. Finally, the subgraph isomorphism tests are performed on the graphs in the intersected list to determine whether they contain a particular candidate subgraph.
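A minimal sketch of this graph-ID list optimization is given below; the helper names are my own. Note also that the intersected list gives an upper bound on the candidate's support, so a candidate can be discarded without any isomorphism tests if the intersection is already smaller than minsup × N.

# Count the support of a candidate k-subgraph using the graph-ID lists of the two
# merged (k-1)-subgraphs; the expensive isomorphism test is run only on graphs that
# contain both parents.

def count_support(candidate, parent_ids_1, parent_ids_2, graph_db, contains):
    """Return (support count, id set) for a candidate k-subgraph.

    parent_ids_1/2  -- sets of graph IDs supporting the two merged (k-1)-subgraphs
    graph_db        -- dict mapping graph ID to a (vertex-labels, edge-labels) graph
    contains        -- a subgraph isomorphism test, e.g., the brute-force one above
    """
    candidate_ids = set()
    for gid in parent_ids_1 & parent_ids_2:     # only graphs containing both parents
        big_v, big_e = graph_db[gid]
        if contains(big_v, big_e, *candidate):  # full isomorphism test as a last resort
            candidate_ids.add(gid)
    return len(candidate_ids), candidate_ids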
6.6 Infrequent Patterns*

The association analysis formulation described so far is based on the premise that the presence of an item in a transaction is more important than its absence. As a consequence, patterns that are rarely found in a database are often considered to be uninteresting and are eliminated using the support measure. Such patterns are known as infrequent patterns.

Definition 6.8 (Infrequent Pattern). An infrequent pattern is an itemset or a rule whose support is less than the minsup threshold.

Although a vast majority of infrequent patterns are uninteresting, some of them might be useful to analysts, particularly those that correspond to negative correlations in the data. For example, the sale of two competing products together tends to be low because a customer who buys one of them will most likely not buy the other, and vice versa. Such negatively correlated patterns are useful to help identify competing items, which are items that can be substituted for one another. Examples of competing items include tea versus coffee, butter versus margarine, regular versus diet soda, and desktop versus laptop computers.

Some infrequent patterns may also suggest the occurrence of interesting rare events or exceptional situations in the data. For example, if {Fire = Yes} is frequent but {Fire = Yes, Alarm = On} is infrequent, then the latter is an interesting infrequent pattern because it may indicate faulty alarm systems. To detect such unusual situations, the expected support of a pattern must be determined, so that, if a pattern turns out to have a considerably lower support than expected, it is declared an interesting infrequent pattern.

Mining infrequent patterns is a challenging endeavor because there is an enormous number of such patterns that can be derived from a given data set. More specifically, the key issues in mining infrequent patterns are: (1) how to identify interesting infrequent patterns, and (2) how to efficiently discover them in large data sets. To get some perspective on various types of interesting infrequent patterns, two related concepts, negative patterns and negatively correlated patterns, are introduced in Sections 6.6.1 and 6.6.2, respectively. The relationships among these patterns are elucidated in Section 6.6.3. Finally, two classes of techniques developed for mining interesting infrequent patterns are presented in Sections 6.6.5 and 6.6.6.

6.6.1 Negative Patterns

Let I = {i1, i2, …, id} be a set of items. A negative item, i¯k, denotes the absence of item ik from a given transaction. For example, coffee¯ is a negative item whose value is 1 if a transaction does not contain coffee.

Definition 6.9 (Negative Itemset). A negative itemset X is an itemset that has the following properties: (1) X = A ∪ B¯, where A is a set of positive items and B¯ is a set of negative items with |B¯| ≥ 1, and (2) s(X) ≥ minsup.

Definition 6.10 (Negative Association Rule). A negative association rule is an association rule that has the following properties: (1) the rule is extracted from a negative itemset, (2) the support of the rule is greater than or equal to minsup, and (3) the confidence of the rule is greater than or equal to minconf.

The negative itemsets and negative association rules are collectively known as negative patterns throughout this chapter. An example of a negative association rule is tea → coffee¯, which may suggest that people who drink tea tend to not drink coffee.

6.6.2 Negatively Correlated Patterns

Section 5.7.1 on page 402 described how correlation analysis can be used to analyze the relationship between a pair of categorical variables. Measures such as interest factor (Equation 5.5) and the φ-coefficient (Equation 5.8) were shown to be useful for discovering itemsets that are positively correlated. This section extends the discussion to negatively correlated patterns.
Definition 6.11 (Negatively Correlated Itemset). An itemset X, which is defined as X = {x1, x2, …, xk}, is negatively correlated if

s(X) < ∏_{j=1}^{k} s(xj) = s(x1) × s(x2) × … × s(xk),   (6.3)

where s(xj) is the support of the item xj.

Note that the support of an itemset is an estimate of the probability that a transaction contains the itemset. Hence, the right-hand side of the preceding expression, ∏_{j=1}^{k} s(xj), represents an estimate of the probability that all the items in X are statistically independent. Definition 6.11 suggests that an itemset is negatively correlated if its support is below the expected support computed using the statistical independence assumption. The smaller s(X), the more negatively correlated is the pattern.

Definition 6.12 (Negatively Correlated Association Rule).
An association rule X → Y is negatively correlated if

s(X ∪ Y) < s(X) s(Y),   (6.4)

where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅.

The preceding definition provides only a partial condition for negative correlation between items in X and items in Y. A full condition for negative correlation can be stated as follows:

s(X ∪ Y) < ∏_i s(xi) ∏_j s(yj),   (6.5)

where xi ∈ X and yj ∈ Y. Because the items in X (and in Y) are often positively correlated, it is more practical to use the partial condition to define a negatively correlated association rule instead of the full condition. For example, a rule X → Y may be negatively correlated according to Inequality 6.4 even though the items within X are positively correlated with one another and the items within Y are positively correlated with one another. If Inequality 6.5 is applied instead, such a rule could be missed because it may not satisfy the full condition for negative correlation.

The condition for negative correlation can also be expressed in terms of the support for positive and negative itemsets. Let X¯ and Y¯ denote the corresponding negative itemsets for X and Y, respectively. Since

s(X ∪ Y) − s(X)s(Y)
  = s(X ∪ Y) − [s(X ∪ Y) + s(X ∪ Y¯)][s(X ∪ Y) + s(X¯ ∪ Y)]
  = s(X ∪ Y)[1 − s(X ∪ Y) − s(X ∪ Y¯) − s(X¯ ∪ Y)] − s(X ∪ Y¯)s(X¯ ∪ Y)
  = s(X ∪ Y)s(X¯ ∪ Y¯) − s(X ∪ Y¯)s(X¯ ∪ Y),

the condition for negative correlation can be stated as follows:

s(X ∪ Y)s(X¯ ∪ Y¯) < s(X ∪ Y¯)s(X¯ ∪ Y).   (6.6)

The negatively correlated itemsets and association rules are known as negatively correlated patterns throughout this chapter.
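The two inequalities above can be checked directly from a transaction database, as in the following small sketch; the helper functions and the toy database are my own additions, with supports estimated as relative frequencies.

# Testing Definition 6.11 (Inequality 6.3) and Inequality 6.4 on a transaction list.

def support(transactions, itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def negatively_correlated_itemset(transactions, items):
    """Inequality 6.3: s(X) < product of the individual item supports."""
    expected = 1.0
    for x in items:
        expected *= support(transactions, {x})
    return support(transactions, items) < expected

def negatively_correlated_rule(transactions, lhs, rhs):
    """Inequality 6.4 (partial condition): s(X u Y) < s(X) s(Y)."""
    return support(transactions, set(lhs) | set(rhs)) < \
           support(transactions, lhs) * support(transactions, rhs)

# Tea and coffee rarely co-occur below, so the pair is negatively correlated even
# though each item on its own is fairly frequent.
db = [{"tea"}, {"tea", "milk"}, {"coffee"}, {"coffee", "milk"},
      {"tea"}, {"coffee"}, {"tea", "coffee"}, {"milk"}]
print(negatively_correlated_itemset(db, {"tea", "coffee"}))   # True (0.125 < 0.25)
print(negatively_correlated_rule(db, {"tea"}, {"coffee"}))    # True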
6.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns

Infrequent patterns, negative patterns, and negatively correlated patterns are three closely related concepts. Infrequent patterns and negatively correlated patterns refer only to itemsets or rules that contain positive items, while negative patterns refer to itemsets or rules that contain both positive and negative items. Nevertheless, there are certain commonalities among these concepts, as illustrated in Figure 6.22.
Figure 6.22. Comparisons among infrequent patterns, negative patterns, and negatively correlated patterns.
First, note that many infrequent patterns have corresponding negative patterns. To understand why this is the case, consider the contingency table shown in Table 6.9. If X ∪ Y is infrequent, then it is likely to have a corresponding negative itemset unless minsup is too high. For example, assuming that minsup ≤ 0.25, if X ∪ Y is infrequent, then the support for at least one of the following itemsets, X ∪ Y¯, X¯ ∪ Y, or X¯ ∪ Y¯, must be higher than minsup, since the sum of the supports in a contingency table is 1.

Table 6.9. A two-way contingency table for the association rule X → Y.

   |    Y     |    Y¯    |
X  | s(X ∪ Y) | s(X ∪ Y¯) | s(X)
X¯ | s(X¯ ∪ Y) | s(X¯ ∪ Y¯) | s(X¯)
   |   s(Y)   |   s(Y¯)   |  1

Second, note that many negatively correlated patterns also have corresponding negative patterns. Consider the contingency table shown in Table 6.9 and the condition for negative correlation stated in Inequality 6.6. If X and Y have strong negative correlation, then

s(X ∪ Y¯) × s(X¯ ∪ Y) ≫ s(X ∪ Y) × s(X¯ ∪ Y¯).

Therefore, either X ∪ Y¯ or X¯ ∪ Y, or both, must have relatively high support when X and Y are negatively correlated. These itemsets correspond to the negative patterns. Finally, because the lower the support of X ∪ Y, the more negatively correlated is the pattern, infrequent patterns tend to be stronger negatively correlated patterns than frequent ones.

6.6.4 Techniques for Mining Interesting Infrequent Patterns

In principle, infrequent itemsets are given by all itemsets that are not extracted by standard frequent itemset generation algorithms such as Apriori and FP-growth. These itemsets correspond to those located below the frequent itemset border shown in Figure 6.23.
Figure 6.23. Frequent and infrequent itemsets.
Since the number of infrequent patterns can be exponentially large, especially for sparse, high-dimensional data, techniques developed for mining infrequent patterns focus on finding only interesting infrequent patterns. An example of such patterns is the negatively correlated patterns discussed in Section 6.6.2. These patterns are obtained by eliminating all infrequent itemsets that fail the negative correlation condition provided in Inequality 6.3. This approach can be computationally intensive because the supports for all infrequent itemsets must be computed in order to determine whether they are negatively correlated. Unlike the support measure used for mining frequent itemsets, the correlation-based measures used for mining negatively correlated itemsets do not possess an anti-monotone property that can be exploited for pruning the exponential search space. Although an efficient solution remains elusive, several innovative methods have been developed, as mentioned in the Bibliographic Notes provided at the end of this chapter.

The remainder of this chapter presents two classes of techniques for mining interesting infrequent patterns. Section 6.6.5 describes methods for mining negative patterns in data, while Section 6.6.6 describes methods for finding interesting infrequent patterns based on support expectation.
6.6.5 Techniques Based on Mining Negative Patterns

The first class of techniques developed for mining infrequent patterns treats every item as a symmetric binary variable. Using the approach described in Section 6.1, the transaction data can be binarized by augmenting it with negative items. Figure 6.24 shows an example of transforming the original data into transactions having both positive and negative items. By applying existing frequent itemset generation algorithms such as Apriori on the augmented transactions, all the negative itemsets can be derived.

Figure 6.24. Augmenting a data set with negative items.

Such an approach is feasible only if a few variables are treated as symmetric binary (i.e., we look for negative patterns involving the negation of only a small number of items). If every item must be treated as symmetric binary, the problem becomes computationally intractable due to the following reasons.

1. The number of items doubles when every item is augmented with its corresponding negative item. Instead of exploring an itemset lattice of size 2^d, where d is the number of items in the original data set, the lattice becomes considerably larger, as shown in Exercise 22 on page 522.

2. Support-based pruning is no longer effective when negative items are augmented. For each variable x, either x or x¯ has support greater than or equal to 50%. Hence, even if the support threshold is as high as 50%, half of the items will remain frequent. For lower thresholds, many more items and possibly itemsets containing them will be frequent. The support-based pruning strategy employed by Apriori is effective only when the support for most itemsets is low; otherwise, the number of frequent itemsets grows exponentially.

3. The width of each transaction increases when negative items are augmented. Suppose there are d items available in the original data set. For sparse data sets such as market basket transactions, the width of each transaction tends to be much smaller than d. As a result, the maximum size of a frequent itemset, which is bounded by the maximum transaction width, wmax, tends to be relatively small. When negative items are included, the width of the transactions increases to d because an item is either present in the transaction or absent from the transaction, but not both. Since the maximum transaction width has grown from wmax to d, this will increase the number of frequent itemsets exponentially. As a result, many existing algorithms tend to break down when they are applied to the extended data set.
The previous brute-force approach is computationally expensive because it forces us to determine the support for a large number of positive and negative patterns. Instead of augmenting the data set with negative items, another approach is to determine the support of the negative itemsets based on the support of their corresponding positive items. For example, the support for {p, q¯, r¯} can be computed in the following way:

s({p, q¯, r¯}) = s({p}) − s({p, q}) − s({p, r}) + s({p, q, r}).

More generally, the support for any itemset X ∪ Y¯ can be obtained as follows:

s(X ∪ Y¯) = s(X) + Σ_{i=1}^{n} Σ_{Z⊂Y, |Z|=i} (−1)^i × s(X ∪ Z).   (6.7)

To apply Equation 6.7, s(X ∪ Z) must be determined for every Z that is a subset of Y. The support for any combination of X and Z that exceeds the minsup threshold can be found using the Apriori algorithm. For all other combinations, the supports must be determined explicitly, e.g., by scanning the entire set of transactions. Another possible approach is to either ignore the support for any infrequent itemset X ∪ Z or to approximate it with the minsup threshold.

Several optimization strategies are available to further improve the performance of the mining algorithms. First, the number of variables considered as symmetric binary can be restricted. More specifically, a negative item y¯ is considered interesting only if y is a frequent item. The rationale for this strategy is that rare items tend to produce a large number of infrequent patterns, many of which are uninteresting. By restricting the set Y¯ given in Equation 6.7 to variables whose positive items are frequent, the number of candidate negative itemsets considered by the mining algorithm can be substantially reduced. Another strategy is to restrict the type of negative patterns. For example, the algorithm may consider a negative pattern X ∪ Y¯ only if it contains at least one positive item (i.e., |X| ≥ 1). The rationale for this strategy is that if the data set contains very few positive items with support greater than 50%, then most of the negative patterns of the form X¯ ∪ Y¯ will become frequent, thus degrading the performance of the mining algorithm.
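Equation 6.7 can be evaluated by straightforward inclusion-exclusion over the positive supports, as in the sketch below; the support() helper and the toy database are assumptions used only to keep the example self-contained.

# Support of a pattern with positive items X and negated items Y (Equation 6.7).

from itertools import combinations

def support(transactions, itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def negative_support(transactions, X, Y):
    """s(X u Ybar) = s(X) + sum over nonempty Z subset of Y of (-1)^|Z| * s(X u Z)."""
    X, Y = set(X), list(Y)
    total = support(transactions, X)
    for i in range(1, len(Y) + 1):
        for Z in combinations(Y, i):
            total += (-1) ** i * support(transactions, X | set(Z))
    return total

# s({p, not-q, not-r}) = s({p}) - s({p,q}) - s({p,r}) + s({p,q,r}), as in the text.
db = [{"p"}, {"p", "q"}, {"p", "r"}, {"p", "q", "r"}, {"q"}, {"r"}]
print(negative_support(db, {"p"}, {"q", "r"}))        # 1/6: only the first transaction
print(support(db, {"p"}) - support(db, {"p", "q"})
      - support(db, {"p", "r"}) + support(db, {"p", "q", "r"}))   # same value, by hand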
6.6.6 Techniques Based on Support Expectation

Another class of techniques considers an infrequent pattern to be interesting only if its actual support is considerably smaller than its expected support. For negatively correlated patterns, the expected support is computed based on the statistical independence assumption. This section describes two alternative approaches for determining the expected support of a pattern using (1) a concept hierarchy and (2) a neighborhood-based approach known as indirect association.
Support Expectation Based on Concept Hierarchy

Objective measures alone may not be sufficient to eliminate uninteresting infrequent patterns. For example, suppose two items from unrelated product categories are both frequent. Even though the itemset formed by the two items is infrequent and perhaps negatively correlated, it is not interesting because their lack of support seems obvious to domain experts. Therefore, a subjective approach for determining expected support is needed to avoid generating such infrequent patterns.

In the preceding example, the two items belong to two completely different product categories, which is why it is not surprising to find that their support is low. This example also illustrates the advantage of using domain knowledge to prune uninteresting patterns. For market basket data, the domain knowledge can be inferred from a concept hierarchy such as the one shown in Figure 6.25. The basic assumption of this approach is that items from the same product family are expected to have similar types of interaction with other items. Consequently, two items from the same product family are expected to have similar associations with any other given item. If the actual support for one of these pairs is less than its expected support, then the infrequent pattern is interesting.

Figure 6.25. Example of a concept hierarchy.

To illustrate how to compute the expected support, consider the diagram shown in Figure 6.26. Suppose the itemset {C, G} is frequent. Let s(⋅) denote the actual support of a pattern and ϵ(⋅) denote its expected support. The expected support for any children or siblings of C and G can be computed using the formulas shown below:

ϵ(s(E, J)) = s(C, G) × s(E)/s(C) × s(J)/s(G)   (6.8)
ϵ(s(C, J)) = s(C, G) × s(J)/s(G)   (6.9)
ϵ(s(C, H)) = s(C, G) × s(H)/s(G)   (6.10)

Figure 6.26. Mining interesting negative patterns using a concept hierarchy.

For example, if C and G are frequent, then the expected support between E and J can be computed using Equation 6.8 because these items are children of C and G, respectively. If the actual support for E and J is considerably lower than their expected value, then E and J form an interesting infrequent pattern.
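As a small worked illustration of Equation 6.8, the sketch below computes the expected support of a pair of child items from the observed support of their parent categories; the numeric supports are made-up values, not data from the book.

# Expected support of a pair of child items (Equation 6.8).

def expected_support_children(s_CG, s_E, s_C, s_J, s_G):
    """epsilon(s(E,J)) = s(C,G) * s(E)/s(C) * s(J)/s(G), where E is a child of C and
    J is a child of G in the concept hierarchy."""
    return s_CG * (s_E / s_C) * (s_J / s_G)

# Assumed values: the parents C and G co-occur in 20% of transactions, E accounts
# for half of C's support, and J for a quarter of G's support.
eps = expected_support_children(s_CG=0.20, s_E=0.10, s_C=0.20, s_J=0.05, s_G=0.20)
print(eps)   # 0.025: if the observed s(E,J) is much lower, {E,J} is interesting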
Support Expectation Based on Indirect Association

Consider a pair of items, (a, b), that are rarely bought together by customers. If a and b are unrelated items, such as items from completely different product categories, then their support is expected to be low. On the other hand, if a and b are related items, then their support is expected to be high. The expected support was previously computed using a concept hierarchy. This section presents an approach for determining the expected support between a pair of items by looking at other items commonly purchased together with these two items.

For example, suppose customers who buy a sleeping bag also tend to buy other camping equipment, whereas those who buy a desktop computer also tend to buy other computer accessories such as an optical mouse or a printer. Assuming there is no other item frequently bought together with both a sleeping bag and a desktop computer, the support for these unrelated items is expected to be low. On the other hand, suppose two items are often bought together with the same set of other items. Even without using a concept hierarchy, both items are expected to be somewhat related and their support should be high. Because their actual support is low, the two items form an interesting infrequent pattern. Such patterns are known as indirect association patterns.

A high-level illustration of indirect association is shown in Figure 6.27. Items a and b correspond to the pair of rarely co-occurring items, while Y, which is known as the mediator set, contains items that are commonly purchased together with both a and b. A formal definition of indirect association is presented next.

Figure 6.27. An indirect association between a pair of items.

Definition 6.13 (Indirect Association). A pair of items a, b is indirectly associated via a mediator set Y if the following conditions hold:

1. s({a, b}) < ts (Item pair support condition).
2. ∃Y ≠ ∅ such that:
   a. s({a} ∪ Y) ≥ tf and s({b} ∪ Y) ≥ tf (Mediator support condition).
   b. d({a}, Y) ≥ td, d({b}, Y) ≥ td, where d(X, Z) is an objective measure of the association between X and Z (Mediator dependence condition).

Note that the mediator support and dependence conditions are used to ensure that items in Y form a close neighborhood to both a and b. Some of the dependence measures that can be used include interest, cosine or IS, Jaccard, and other measures previously described in Section 5.7.1 on page 402.

Indirect association has many potential applications. In the market basket domain, a and b may refer to competing items, such as the competing items listed earlier in this section (e.g., tea and coffee). In text mining, indirect association can be used to identify synonyms, antonyms, or words that are used in different contexts. For example, given a collection of documents, the word data may be indirectly associated with gold via the mediator mining. This pattern suggests that the word mining can be used in two different contexts, data mining versus gold mining.

Indirect associations can be generated in the following way.
First, the set of frequent itemsets is generated using standard algorithms such as Apriori or FP-growth. Each pair of frequent k-itemsets is then merged to obtain a candidate indirect association (a, b, Y), where a and b are a pair of items and Y is their common mediator. For example, if {p, q, r} and {p, q, s} are frequent 3-itemsets, then the candidate indirect association (r, s, {p, q}) is obtained by merging the pair of frequent itemsets. Once the candidates have been generated, it is necessary to verify that they satisfy the item pair support and mediator dependence conditions provided in Definition 6.13. However, the mediator support condition does not have to be verified because the candidate indirect association is obtained by merging a pair of frequent itemsets. A summary of the algorithm is shown in Algorithm 6.4.

Algorithm 6.4 Algorithm for mining indirect associations.

1: Generate Fk, the set of frequent itemsets.
2: for k = 2 to kmax do
3:   Ck = {(a, b, Y) | {a} ∪ Y ∈ Fk, {b} ∪ Y ∈ Fk, a ≠ b}
4:   for each candidate (a, b, Y) ∈ Ck do
5:     if s({a, b}) < ts ∧ d({a}, Y) ≥ td ∧ d({b}, Y) ≥ td then
6:       Ik = Ik ∪ {(a, b, Y)}
7:     end if
8:   end for
9: end for
10: Result = ∪ Ik.
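A compact Python sketch of Algorithm 6.4 follows. It assumes the frequent itemsets and their supports have already been produced by Apriori or FP-growth, and it uses the IS (cosine) measure as the dependence measure d; all helper names, the toy database, and the thresholds are my own choices, not values from the text.

# Mining indirect associations from precomputed frequent itemsets (Algorithm 6.4).

from itertools import combinations

def mine_indirect(freq, support, d, ts, td):
    """freq[k] is the set of frequent k-itemsets (frozensets); support(itemset) and
    d(itemset, mediator) are callables; ts and td follow Definition 6.13. The
    mediator support condition holds by construction, since {a} u Y and {b} u Y are
    both frequent itemsets."""
    result = set()
    for k, itemsets in freq.items():
        if k < 2:
            continue
        for f1, f2 in combinations(itemsets, 2):
            mediator = f1 & f2
            if len(mediator) != k - 1:
                continue                          # the two itemsets must share a mediator
            (a,), (b,) = f1 - mediator, f2 - mediator
            if (support(frozenset({a, b})) < ts            # item pair support condition
                    and d(frozenset({a}), mediator) >= td
                    and d(frozenset({b}), mediator) >= td): # mediator dependence condition
                result.add((a, b, mediator))
    return result

# Tiny demonstration with a toy transaction database.
db = [{"tea", "milk"}, {"coffee", "milk"}, {"tea", "milk"}, {"coffee", "milk"},
      {"tea"}, {"coffee"}, {"milk"}, {"tea", "coffee", "milk"}]
sup = lambda X: sum(X <= t for t in db) / len(db)
d = lambda X, Y: sup(X | Y) / (sup(X) * sup(Y)) ** 0.5       # IS (cosine) measure
freq = {1: {frozenset({x}) for x in ("tea", "coffee", "milk")},
        2: {frozenset(p) for p in (("tea", "milk"), ("coffee", "milk"))}}
print(mine_indirect(freq, sup, d, ts=0.2, td=0.6))
# {('tea', 'coffee', frozenset({'milk'}))}  (a and b may appear in either order)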
6.7 Bibliographic Notes

The problem of mining association rules from categorical and continuous data was introduced by Srikant and Agrawal in [495]. Their strategy was to binarize the categorical attributes and to apply equal-frequency discretization to the continuous attributes. A partial completeness measure was also proposed to determine the amount of information loss as a result of discretization. This measure was then used to determine the number of discrete intervals needed to ensure that the amount of information loss can be kept at a certain desired level. Following this work, numerous other formulations have been proposed for mining quantitative association rules. Instead of discretizing the quantitative attributes, a statistical-based approach was developed by Aumann and Lindell [465], where summary statistics such as mean and standard deviation are computed for the quantitative attributes of the rules. This formulation was later extended by other authors including Webb [501] and Zhang et al. [506]. The min-Apriori algorithm was developed by Han et al. [474] for finding association rules in continuous data without discretization. Following min-Apriori, a range of techniques for capturing different types of associations among continuous attributes have been explored. For example, the RAnge support Patterns (RAP) approach developed by Pandey et al. [487] finds groups of attributes that show coherent values across multiple rows of the data matrix. The RAP framework was extended to deal with noisy data by Gupta et al. [473]. Since the rules can be designed to satisfy multiple objectives, evolutionary algorithms for mining quantitative association rules [484, 485] have also been developed. Other techniques include those proposed by Fukuda et al. [471], Lent et al. [480], Wang et al. [500], Ruckert et al. [490], and Miller and Yang [486].

The method described in Section 6.3 for handling concept hierarchies using extended transactions was developed by Srikant and Agrawal [494]. An alternative algorithm was proposed by Han and Fu [475], where frequent itemsets are generated one level at a time. More specifically, their algorithm initially generates all the frequent 1-itemsets at the top level of the concept hierarchy. The set of frequent 1-itemsets is denoted as L(1,1). Using the frequent 1-itemsets in L(1,1), the algorithm proceeds to generate all frequent 2-itemsets at level 1, L(1,2). This procedure is repeated until all the frequent itemsets involving items from the highest level of the hierarchy, L(1,k) (k > 1), are extracted. The algorithm then continues to extract frequent itemsets at the next level of the hierarchy, L(2,1), based on the frequent itemsets in L(1,1). The procedure is repeated until it terminates at the lowest level of the concept hierarchy requested by the user.

The sequential pattern formulation and algorithm described in Section 6.4 were proposed by Agrawal and Srikant in [463, 496]. Similarly, Mannila et al. [483] introduced the concept of frequent episodes, which is useful for mining sequential patterns from a long stream of events. Another formulation of sequential pattern mining based on regular expressions was proposed by Garofalakis et al. in [472]. Joshi et al. have attempted to reconcile the differences between various sequential pattern formulations [477]. The result was a universal formulation of sequential patterns with the different counting schemes described in Section 6.4.4. Alternative algorithms for mining sequential patterns were also proposed by Pei et al. [489], Ayres et al. [466], Cheng et al. [468], and Seno et al. [492]. Reviews of sequential pattern mining algorithms can be found in [482] and [493]. Extensions of the formulation to maximal [470, 481] and closed [499, 504] sequential pattern mining have also been developed in recent years.

The frequent subgraph mining problem was initially introduced by Inokuchi et al. in [476]. They used a vertex-growing approach for generating frequent induced subgraphs from a graph data set. The edge-growing strategy was developed by Kuramochi and Karypis in [478], where they also presented an Apriori-like algorithm called FSG that addresses issues such as multiplicity of candidates, canonical labeling, and vertex invariant schemes. Another frequent subgraph mining algorithm known as gSpan was developed by Yan and Han in [503]. The authors proposed using a minimum DFS code for encoding the various subgraphs. Other variants of the frequent subgraph mining problem were proposed by Zaki in [505], Parthasarathy and Coatney in [488], and Kuramochi and Karypis in [479]. A recent review on graph mining is given by Cheng et al. in [469].

The problem of mining infrequent patterns has been investigated by many authors. Savasere et al. [491] examined the problem of mining negative association rules using a concept hierarchy. Tan et al. [497] proposed the idea of mining indirect associations for sequential and non-sequential data. Efficient algorithms for mining negative patterns have also been proposed by Boulicaut et al. [467], Teng et al. [498], Wu et al. [502], and Antonie and Zaïane [464].
Bibliography[463]R.AgrawalandR.Srikant.MiningSequentialPatterns.InProc.ofIntl.
Conf.onDataEngineering,pages3–14,Taipei,Taiwan,1995.
[464]M.-L.AntonieandO.R.Za¨ıane.MiningPositiveandNegativeAssociationRules:AnApproachforConfinedRules.InProc.ofthe8thEuropeanConf.ofPrinciplesandPracticeofKnowledgeDiscoveryinDatabases,pages27–38,Pisa,Italy,September2004.
[465]Y.AumannandY.Lindell.AStatisticalTheoryforQuantitativeAssociationRules.InKDD99,pages261–270,SanDiego,CA,August1999.
[466]J.Ayres,J.Flannick,J.Gehrke,andT.Yiu.SequentialPatternminingusingabitmaprepresentation.InProc.ofthe8thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages429–435,Edmonton,Canada,July2002.
[467]J.-F.Boulicaut,A.Bykowski,andB.Jeudy.TowardstheTractableDiscoveryofAssociationRuleswithNegations.InProc.ofthe4thIntl.ConfonFlexibleQueryAnsweringSystemsFQAS’00,pages425–434,Warsaw,Poland,October2000.
[468]H.Cheng,X.Yan,andJ.Han.IncSpan:incrementalminingofsequentialpatternsinlargedatabase.InProc.ofthe10thIntl.Conf.on
KnowledgeDiscoveryandDataMining,pages527–532,Seattle,WA,August2004.
[469]H.Cheng,X.Yan,andJ.Han.MiningGraphPatterns.InC.AggarwalandJ.Han,editors,FrequentPatternMining,pages307–338.Springer,2014.
[470]P.Fournier-Viger,C.-W.Wu,A.Gomariz,andV.S.Tseng.VMSP:Efficientverticalminingofmaximalsequentialpatterns.InProceedingsoftheCanadianConferenceonArtificialIntelligence,pages83–94,2014.
[471]T.Fukuda,Y.Morimoto,S.Morishita,andT.Tokuyama.MiningOptimizedAssociationRulesforNumericAttributes.InProc.ofthe15thSymp.onPrinciplesofDatabaseSystems,pages182–191,Montreal,Canada,June1996.
[472]M.N.Garofalakis,R.Rastogi,andK.Shim.SPIRIT:SequentialPatternMiningwithRegularExpressionConstraints.InProc.ofthe25thVLDBConf.,pages223–234,Edinburgh,Scotland,1999.
[473]R.Gupta,N.Rao,andV.Kumar.Discoveryoferror-tolerantbiclustersfromnoisygeneexpressiondata.BMCbioinformatics,12(12):1,2011.
[474]E.-H.Han,G.Karypis,andV.Kumar.Min-Apriori:AnAlgorithmforFindingAssociationRulesinDatawithContinuousAttributes.http://www.cs.umn.edu/˜han,1997.
[475]J.HanandY.Fu.MiningMultiple-LevelAssociationRulesinLargeDatabases.IEEETrans.onKnowledgeandDataEngineering,11(5):798–804,1999.
[476]A.Inokuchi,T.Washio,andH.Motoda.AnApriori-basedAlgorithmforMiningFrequentSubstructuresfromGraphData.InProc.ofthe4thEuropeanConf.ofPrinciplesandPracticeofKnowledgeDiscoveryinDatabases,pages13–23,Lyon,France,2000.
[477]M.V.Joshi,G.Karypis,andV.Kumar.AUniversalFormulationofSequentialPatterns.InProc.oftheKDD’2001workshoponTemporalDataMining,SanFrancisco,CA,August2001.
[478]M.KuramochiandG.Karypis.FrequentSubgraphDiscovery.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages313–320,SanJose,CA,November2001.
[479]M.KuramochiandG.Karypis.DiscoveringFrequentGeometricSubgraphs.InProc.ofthe2002IEEEIntl.Conf.onDataMining,pages258–265,MaebashiCity,Japan,December2002.
6.8 Exercises
1. Consider the traffic accident data set shown in Table 6.10.
Table 6.10. Traffic accident data set.
Weather Condition   Driver's Condition   Traffic Violation   Seat Belt   Crash Severity
Good Alcohol-impaired Exceedspeedlimit No Major
Bad Sober None Yes Minor
Good Sober Disobeystopsign Yes Minor
Good Sober Exceedspeedlimit Yes Major
Bad Sober Disobeytrafficsignal No Major
Good Alcohol-impaired Disobeystopsign Yes Minor
Bad Alcohol-impaired None Yes Major
Good Sober Disobeytrafficsignal Yes Major
Good Alcohol-impaired None No Major
Bad Sober Disobeytrafficsignal No Major
Good Alcohol-impaired Exceedspeedlimit Yes Major
Bad Sober Disobeystopsign Yes Minor
a. Showabinarizedversionofthedataset.
b. Whatisthemaximumwidthofeachtransactioninthebinarizeddata?
c. Assumingthatthesupportthresholdis30%,howmanycandidateandfrequentitemsetswillbegenerated?
d. Create a data set that contains only the following asymmetric binary attributes: (…). For Traffic Violation, only None has a value of 0; the rest of the attribute values are assigned to 1. Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?
e. Comparethenumberofcandidateandfrequentitemsetsgeneratedinparts(c)and(d).
2.
a. Consider the data set shown in Table 6.11. Suppose we apply the following discretization strategies to the continuous attributes of the data set.
Table6.11.DatasetforExercise2 .
TID Temperature Pressure Alarm1 Alarm2 Alarm3
1 95 1105 0 0 1
2 85 1040 1 1 0
3 103 1090 1 1 1
4 97 1084 1 0 0
5 80 1038 0 1 1
6 100 1080 1 1 0
7 83 1025 1 0 1
8 86 1030 1 0 0
9 101 1100 1 1 1
D1: Partition the range of each continuous attribute into 3 equal-sized bins.
D2: Partition the range of each continuous attribute into 3 bins, where each bin contains an equal number of transactions.
For each strategy, answer the following questions:
i. Construct a binarized version of the data set.
ii. Derive all the frequent itemsets having support ≥ 30%.
b. The continuous attributes can also be discretized using a clustering approach.
i. Plot a graph of temperature versus pressure for the data points shown in Table 6.11.
ii. How many natural clusters do you observe from the graph? Assign a label (C1, C2, etc.) to each cluster in the graph.
iii. What type of clustering algorithm do you think can be used to identify the clusters? State your reasons clearly.
iv. Replace the temperature and pressure attributes in Table 6.11 with asymmetric binary attributes C1, C2, etc. Construct a transaction matrix using the new attributes (along with attributes Alarm1, Alarm2, and Alarm3).
v. Derive all the frequent itemsets having support ≥ 30% from the binarized data.
3.ConsiderthedatasetshowninTable6.12 .Thefirstattributeiscontinuous,whiletheremainingtwoattributesareasymmetricbinary.Aruleisconsideredtobestrongifitssupportexceeds15%anditsconfidenceexceeds60%.ThedatagiveninTable6.12 supportsthefollowingtwostrongrules:
Table6.12.DatasetforExercise3 .
A B C
1 1 1
2 1 1
3 1 0
4 1 0
5 1 1
6 0 1
7 0 0
8 1 1
9 0 0
10 0 0
11 0 0
12 0 1
(i) {(1 ≤ A ≤ 2), B = 1} → {C = 1}
(ii) {(5 ≤ A ≤ 8), B = 1} → {C = 1}
a. Computethesupportandconfidenceforbothrules.
b. To find the rules using the traditional Apriori algorithm, we need to discretize the continuous attribute A. Suppose we apply the equal width binning approach to discretize the data, with bin-width = 2, 3, and 4. For each bin-width, state whether the above two rules are discovered by the Apriori algorithm. (Note that the rules may not be in the same exact form as before because they may contain wider or narrower intervals for A.) For each rule that corresponds to one of the above two rules, compute its support and confidence.
c. Commentontheeffectivenessofusingtheequalwidthapproachforclassifyingtheabovedataset.Isthereabin-widththatallowsyoutofindbothrulessatisfactorily?Ifnot,whatalternativeapproachcanyoutaketoensurethatyouwillfindbothrules?
4.ConsiderthedatasetshowninTable6.13 .
Table 6.13. Data set for Exercise 4.
Age (A)   Number of Hours Online per Week (B)
          0–5   5–10   10–20   20–30   30–40
10–15 2 3 5 3 2
15–25 2 5 10 10 3
25–35 10 15 5 3 2
35–50 4 6 5 3 2
a. For each combination of rules given below, specify the rule that has the highest confidence.
i. 15 < A < 25 → 10 < B < 20, 10 < A < 25 → 10 < B < 20, and 15 < A < 35 → 10 < B < 20.
ii. 15 < A < 25 → 10 < B < 20, 15 < A < 25 → 5 < B < 20, and 15 < A < 25 → 5 < B < 30.
iii. 15 < A < 25 → 10 < B < 20 and 10 < A < 35 → 5 < B < 30.
b. Suppose we are interested in finding the average number of hours spent online per week by Internet users between the age of 15 and 35. Write the corresponding statistics-based association rule to characterize the segment of users. To compute the average number of hours spent online, approximate each interval by its midpoint value (e.g., use B = 7.5 to represent the interval 5 < B < 10).
c. Test whether the quantitative association rule given in part (b) is statistically significant by comparing its mean against the average number of hours spent online by other users who do not belong to the age group.
5.Forthedatasetwiththeattributesgivenbelow,describehowyouwouldconvertitintoabinarytransactiondatasetappropriateforassociationanalysis.Specifically,indicateforeachattributeintheoriginaldataset
a. howmanybinaryattributesitwouldcorrespondtointhetransactiondataset,
b. howthevaluesoftheoriginalattributewouldbemappedtovaluesofthebinaryattributes,and
c. ifthereisanyhierarchicalstructureinthedatavaluesofanattributethatcouldbeusefulforgroupingthedataintofewerbinaryattributes.
Thefollowingisalistofattributesforthedatasetalongwiththeirpossiblevalues.Assumethatallattributesarecollectedonaper-studentbasis:
Year:Freshman,Sophomore,Junior,Senior,Graduate:Masters,Graduate:PhD,Professional
Zipcode:zipcodeforthehomeaddressofaU.S.student,zipcodeforthelocaladdressofanon-U.S.student
College:Agriculture,Architecture,ContinuingEducation,Education,LiberalArts,Engineering,NaturalSciences,Business,Law,Medical,Dentistry,Pharmacy,Nursing,VeterinaryMedicine
OnCampus:1ifthestudentlivesoncampus,0otherwise
Eachofthefollowingisaseparateattributethathasavalueof1ifthepersonspeaksthelanguageandavalueof0,otherwise.
–Arabic
–Bengali
–ChineseMandarin
–English
–Portuguese
–Russian
–Spanish
6. Consider the data set shown in Table 6.14. Suppose we are interested in extracting the following association rule:
{α1 ≤ Age ≤ α2, Play Piano = Yes} → {Enjoy Classical Music = Yes}
Table 6.14. Data set for Exercise 6.
Age   Play Piano   Enjoy Classical Music
9 Yes Yes
11 Yes Yes
14 Yes No
17 Yes No
19 Yes Yes
21 No No
25 No No
29 Yes Yes
33 No No
39 No Yes
41 No No
47 No Yes
Tohandlethecontinuousattribute,weapplytheequal-frequencyapproachwith3,4,and6intervals.Categoricalattributesarehandledbyintroducingasmanynewasymmetricbinaryattributesasthenumberofcategoricalvalues.Assumethatthesupportthresholdis10%andtheconfidencethresholdis70%.
a. Suppose we discretize the Age attribute into 3 equal-frequency intervals. Find a pair of values for α1 and α2 that satisfy the minimum support and minimum confidence requirements.
b. Repeat part (a) by discretizing the Age attribute into 4 equal-frequency intervals. Compare the extracted rules against the ones you had obtained in part (a).
c. Repeat part (a) by discretizing the Age attribute into 6 equal-frequency intervals. Compare the extracted rules against the ones you had obtained in part (a).
d. Fromtheresultsinpart(a),(b),and(c),discusshowthechoiceofdiscretizationintervalswillaffecttherulesextractedbyassociationruleminingalgorithms.
7.ConsiderthetransactionsshowninTable6.15 ,withanitemtaxonomygiveninFigure6.25 .
Table6.15.Exampleofmarketbaskettransactions.
TransactionID ItemsBought
1 Chips,Cookies,RegularSoda,Ham
2 Chips,Ham,BonelessChicken,DietSoda
3 Ham,Bacon,WholeChicken,RegularSoda
4 Chips,Ham,BonelessChicken,DietSoda
5 Chips,Bacon,BonelessChicken
6 Chips,Ham,Bacon,WholeChicken,RegularSoda
7 Chips,Cookies,BonelessChicken,DietSoda
a. Whatarethemainchallengesofminingassociationruleswithitemtaxonomy?
b. Consider the approach where each transaction t is replaced by an extended transaction t′ that contains all the items in t as well as their respective ancestors. For example, the transaction t = {…} will be replaced by t′ = {…}. Use this approach to derive all frequent itemsets (up to size 4) with support ≥ 70%.
c. Consider an alternative approach where the frequent itemsets are generated one level at a time. Initially, all the frequent itemsets involving items at the highest level of the hierarchy are generated. Next, we use the frequent itemsets discovered at the higher level of the hierarchy to generate candidate itemsets involving items at the lower levels of the hierarchy. For example, we generate the candidate itemset {…} only if {…} is frequent. Use this approach to derive all frequent itemsets (up to size 4) with support ≥ 70%.
d. Comparethefrequentitemsetsfoundinparts(b)and(c).Commentontheefficiencyandcompletenessofthealgorithms.
8.Thefollowingquestionsexaminehowthesupportandconfidenceofanassociationrulemayvaryinthepresenceofaconcepthierarchy.
a. Consider an item x in a given concept hierarchy. Let x̄1, x̄2, …, x̄k denote the k children of x in the concept hierarchy. Show that s(x) ≤ ∑_{i=1}^{k} s(x̄i), where s(⋅) is the support of an item. Under what conditions will the inequality become an equality?
b. Let p and q denote a pair of items, while p̂ and q̂ are their corresponding parents in the concept hierarchy. If s({p, q}) > minsup, which of the following itemsets are guaranteed to be frequent? (i) {p̂, q}, (ii) {p, q̂}, and (iii) {p̂, q̂}.
c. Consider the association rule {p} → {q}. Suppose the confidence of the rule exceeds minconf. Which of the following rules are guaranteed to have confidence higher than minconf? (i) {p} → {q̂}, (ii) {p̂} → {q}, and (iii) {p̂} → {q̂}.
9.
a. List all the 4-subsequences contained in the following data sequence, assuming no timing constraints:
⟨{1,3} {2} {2,3} {4}⟩
b. Listallthe3-elementsubsequencescontainedinthedatasequenceforpart(a)assumingthatnotimingconstraintsareimposed.
c. Listallthe4-subsequencescontainedinthedatasequenceforpart(a)(assumingthetimingconstraintsareflexible).
d. Listallthe3-elementsubsequencescontainedinthedatasequenceforpart(a)(assumingthetimingconstraintsareflexible).
10. Find all the frequent subsequences with support ≥ 50% given the sequence database shown in Table 6.16. Assume that there are no timing constraints imposed on the sequences.
Table6.16.Exampleofeventsequencesgeneratedbyvarioussensors.
Sensor Timestamp Events
S1 1 A,B
2 C
3 D,E
4 C
S2 1 A,B
2 C,D
3 E
S3 1 B
2 A
3 B
4 D,E
S4 1 C
2 D,E
3 C
4 E
S5 1 B
2 A
3 B,C
4 A,D
11.
a. For each of the sequences w = ⟨e1 e2 … ei ei+1 … elast⟩ given below, determine whether they are subsequences of the sequence
⟨{1,2,3} {2,4} {2,4,5} {3,5} {6}⟩
subjected to the following timing constraints:
mingap = 0 (interval between last event in ei and first event in ei+1 is > 0)
maxgap = 3 (interval between first event in ei and last event in ei+1 is ≤ 3)
maxspan = 5 (interval between first event in e1 and last event in elast is ≤ 5)
ws = 1 (time between first and last events in ei is ≤ 1)
w = ⟨{1} {2} {3}⟩
w = ⟨{1,2,3,4} {5,6}⟩
w = ⟨{2,4} {2,4} {6}⟩
w = ⟨{1} {2,4} {6}⟩
w = ⟨{1,2} {3,4} {5,6}⟩
b. Determine whether each of the subsequences w given in the previous question are contiguous subsequences of the following sequences s:
s = ⟨{1,2,3,4,5,6} {1,2,3,4,5,6} {1,2,3,4,5,6}⟩
s = ⟨{1,2,3,4} {1,2,3,4,5,6} {3,4,5,6}⟩
s = ⟨{1,2} {1,2,3,4} {3,4,5,6} {5,6}⟩
s = ⟨{1,2,3} {2,3,4,5} {4,5,6}⟩
12. For each of the sequences w = ⟨e1, …, elast⟩ given below, determine whether they are subsequences of the following data sequence:
⟨{A,B} {C,D} {A,B} {C,D} {A,B} {C,D}⟩
subjected to the following timing constraints:
mingap = 0 (interval between last event in ei and first event in ei+1 is > 0)
maxgap = 2 (interval between first event in ei and last event in ei+1 is ≤ 2)
maxspan = 6 (interval between first event in e1 and last event in elast is ≤ 6)
ws = 1 (time between first and last events in ei is ≤ 1)
a. w = ⟨{A} {B} {C} {D}⟩
b. w = ⟨{A} {B,C,D} {A}⟩
c. w = ⟨{A} {A,B,C,D} {A}⟩
d. w = ⟨{B,C} {A,D} {B,C}⟩
e. w = ⟨{A,B,C,D} {A,B,C,D}⟩
13. Consider the following frequent 3-sequences:
⟨{1,2,3}⟩, ⟨{1,2}{3}⟩, ⟨{1}{2,3}⟩, ⟨{1,2}{4}⟩, ⟨{1,3}{4}⟩, ⟨{1,2,4}⟩, ⟨{2,3}{3}⟩, ⟨{2,3}{4}⟩, ⟨{2}{3}{3}⟩, and ⟨{2}{3}{4}⟩.
a. Listallthecandidate4-sequencesproducedbythecandidategenerationstepoftheGSPalgorithm.
b. Listallthecandidate4-sequencesprunedduringthecandidatepruningstepoftheGSPalgorithm(assumingnotimingconstraints).
c. Listallthecandidate4-sequencesprunedduringthecandidatepruningstepoftheGSPalgorithm(assumingmaxgap=1).
14. Consider the data sequence shown in Table 6.17 for a given object. Count the number of occurrences for the sequence ⟨{p}{q}{r}⟩ according to the following counting methods:
Table6.17.ExampleofeventsequencedataforExercise14 .
Timestamp Events
1 p,q
2 r
3 s
4 p,q
5 r,s
6 p
7 q,r
8 q,s
9 p
10 q,r,s
a. COBJ(oneoccurrenceperobject).
b. CWIN(oneoccurrenceperslidingwindow).
c. CMINWIN(numberofminimalwindowsofoccurrence).
d. CDIST_O(distinctoccurrenceswithpossibilityofevent-timestampoverlap).
e. CDIST(distinctoccurrenceswithnoeventtimestampoverlapallowed).
15.Describethetypesofmodificationsnecessarytoadaptthefrequentsubgraphminingalgorithmtohandle:
a. Directedgraphs
b. Unlabeledgraphs
c. Acyclicgraphs
d. Disconnectedgraphs
Foreachtypeofgraphgivenabove,describewhichstepofthealgorithmwillbeaffected(candidategeneration,candidatepruning,andsupportcounting),andanyfurtheroptimizationthatcanhelpimprovetheefficiencyofthealgorithm.
16.DrawallcandidatesubgraphsobtainedfromjoiningthepairofgraphsshowninFigure6.28 .
Figure6.28.GraphsforExercise16 .
17.DrawallthecandidatesubgraphsobtainedbyjoiningthepairofgraphsshowninFigure6.29 .
Figure6.29.GraphsforExercise17 .
18. Show that the candidate generation procedure introduced in Section 6.5.3 for frequent subgraph mining is complete, i.e., no frequent k-subgraph can be missed from being generated if every pair of frequent (k−1)-subgraphs is considered for merging.
19.
a. If support is defined in terms of induced subgraph relationship, show that the confidence of the rule g1 → g2 can be greater than 1 if g1 and g2 are allowed to have overlapping vertex sets.
b. What is the time complexity needed to determine the canonical label of a graph that contains |V| vertices?
c. The core of a subgraph can have multiple automorphisms. This will increase the number of candidate subgraphs obtained after merging two
frequentsubgraphsthatsharethesamecore.Determinethemaximumnumberofcandidatesubgraphsobtainedduetoautomorphismofacoreofsizek.
d. Twofrequentsubgraphsofsizekmaysharemultiplecores.Determinethemaximumnumberofcoresthatcanbesharedbythetwofrequentsubgraphs.
20.
a. Consider the two graphs shown below.
b. Draw all the distinct cores obtained when merging the two subgraphs.
c. How many candidates are generated using the following core?
21. The original association rule mining framework considers only the presence of items together in the same transaction. There are situations in which itemsets that are infrequent may also be informative. For instance, the itemset {TV, DVD, ¬VCR} suggests that many customers who buy TVs and DVDs do not buy VCRs.
Inthisproblem,youareaskedtoextendtheassociationruleframeworktonegativeitemsets(i.e.,itemsetsthatcontainbothpresenceandabsenceofitems).Wewillusethenegationsymbol(¬)torefertoabsenceofitems.
a. AnaïvewayforderivingnegativeitemsetsistoextendeachtransactiontoincludeabsenceofitemsasshowninTable6.18 .
Table6.18.Exampleofnumericdataset.
TID TV ¬TV DVD ¬DVD VCR ¬VCR …
1 1 0 0 1 0 1 …
2 1 0 0 1 0 1 …
i. Suppose the transaction database contains 1000 distinct items. What is the total number of positive itemsets that can be generated from these items? (Note: A positive itemset does not contain any negated items.)
ii. What is the maximum number of frequent itemsets that can be generated from these transactions? (Assume that a frequent itemset may contain positive, negative, or both types of items.)
iii. Explain why such a naïve method of extending each transaction with negative items is not practical for deriving negative itemsets.
b. Consider the database shown in Table 6.15. What are the support and confidence values for the following negative association rules involving regular and diet soda?
i. ¬Regular → Diet.
ii. Regular → ¬Diet.
iii. ¬Diet → Regular.
iv. Diet → ¬Regular.
22.Supposewewouldliketoextractpositiveandnegativeitemsetsfromadatasetthatcontainsditems.
a. Consideranapproachwhereweintroduceanewvariabletorepresenteachnegativeitem.Withthisapproach,thenumberofitemsgrowsfromdto2d.Whatisthetotalsizeoftheitemsetlattice,assumingthatanitemsetmaycontainbothpositiveandnegativeitemsofthesamevariable?
b. Assume that an itemset must contain positive or negative items of different variables. For example, the itemset {a, ā, b, c̄} is invalid because it contains both positive and negative items for variable a. What is the total size of the itemset lattice?
23.Foreachtypeofpatterndefinedbelow,determinewhetherthesupportmeasureismonotone,anti-monotone,ornon-monotone(i.e.,neithermonotonenoranti-monotone)withrespecttoincreasingitemsetsize.
a. Itemsets that contain both positive and negative items such as {a, b, c̄, d̄}. Is the support measure monotone, anti-monotone, or non-monotone when applied to such patterns?
b. Boolean logical patterns such as {(a ∨ b ∨ c), d, e}, which may contain both disjunctions and conjunctions of items. Is the support measure monotone, anti-monotone, or non-monotone when applied to such patterns?
24. Many association analysis algorithms rely on an Apriori-like approach for finding frequent patterns. The overall structure of the algorithm is given below.

Algorithm 6.5 Apriori-like algorithm.
1: k = 1.
2: Fk = {i | i ∈ I ∧ σ({i})/N ≥ minsup}. {Find frequent 1-patterns.}
3: repeat
4:   k = k + 1.
5:   Ck = genCandidate(Fk−1). {Candidate Generation}
6:   Ck = pruneCandidate(Ck, Fk−1). {Candidate Pruning}
7:   Ck = count(Ck, D). {Support Counting}
8:   Fk = {c | c ∈ Ck ∧ σ(c)/N ≥ minsup}. {Extract frequent patterns}
9: until Fk = ∅
10: Answer = ∪ Fk.

Suppose we are interested in finding Boolean logical rules such as
{a ∨ b} → {c, d},
which may contain both disjunctions and conjunctions of items. The corresponding itemset can be written as {(a ∨ b), c, d}.
a. Does the Apriori principle still hold for such itemsets?
b. How should the candidate generation step be modified to find such patterns?
c. How should the candidate pruning step be modified to find such patterns?
d. How should the support counting step be modified to find such patterns?
7ClusterAnalysis:BasicConceptsandAlgorithms
Clusteranalysisdividesdataintogroups(clusters)thataremeaningful,useful,orboth.Ifmeaningfulgroupsarethegoal,thentheclustersshouldcapturethenaturalstructureofthedata.Insomecases,however,clusteranalysisisusedfordatasummarizationinordertoreducethesizeofthedata.Whetherforunderstandingorutility,clusteranalysishaslongplayedanimportantroleinawidevarietyoffields:psychologyandothersocialsciences,biology,statistics,patternrecognition,informationretrieval,machinelearning,anddatamining.
Therehavebeenmanyapplicationsofclusteranalysistopracticalproblems.Weprovidesomespecificexamples,organizedbywhetherthepurposeoftheclusteringisunderstandingorutility.
ClusteringforUnderstandingClasses,orconceptuallymeaningfulgroupsofobjectsthatsharecommoncharacteristics,playanimportantroleinhowpeopleanalyzeanddescribetheworld.Indeed,humanbeingsareskilledat
dividingobjectsintogroups(clustering)andassigningparticularobjectstothesegroups(classification).Forexample,evenrelativelyyoungchildrencanquicklylabeltheobjectsinaphotograph.Inthecontextofunderstandingdata,clustersarepotentialclassesandclusteranalysisisthestudyoftechniquesforautomaticallyfindingclasses.Thefollowingaresomeexamples:
Biology.Biologistshavespentmanyyearscreatingataxonomy(hierarchicalclassification)ofalllivingthings:kingdom,phylum,class,order,family,genus,andspecies.Thus,itisperhapsnotsurprisingthatmuchoftheearlyworkinclusteranalysissoughttocreateadisciplineofmathematicaltaxonomythatcouldautomaticallyfindsuchclassificationstructures.Morerecently,biologistshaveappliedclusteringtoanalyzethelargeamountsofgeneticinformationthatarenowavailable.Forexample,clusteringhasbeenusedtofindgroupsofgenesthathavesimilarfunctions.InformationRetrieval.TheWorldWideWebconsistsofbillionsofwebpages,andtheresultsofaquerytoasearchenginecanreturnthousandsofpages.Clusteringcanbeusedtogroupthesesearchresultsintoasmallnumberofclusters,eachofwhichcapturesaparticularaspectofthequery.Forinstance,aqueryof“movie”mightreturnwebpagesgroupedintocategoriessuchasreviews,trailers,stars,andtheaters.Eachcategory(cluster)canbebrokenintosubcategories(subclusters),producingahierarchicalstructurethatfurtherassistsauser’sexplorationofthequeryresults.Climate.UnderstandingtheEarth’sclimaterequiresfindingpatternsintheatmosphereandocean.Tothatend,clusteranalysishasbeenappliedtofindpatternsinatmosphericpressureandoceantemperaturethathaveasignificantimpactonclimate.PsychologyandMedicine.Anillnessorconditionfrequentlyhasanumberofvariations,andclusteranalysiscanbeusedtoidentifythesedifferentsubcategories.Forexample,clusteringhasbeenusedtoidentify
differenttypesofdepression.Clusteranalysiscanalsobeusedtodetectpatternsinthespatialortemporaldistributionofadisease.Business.Businessescollectlargeamountsofinformationaboutcurrentandpotentialcustomers.Clusteringcanbeusedtosegmentcustomersintoasmallnumberofgroupsforadditionalanalysisandmarketingactivities.
ClusteringforUtilityClusteranalysisprovidesanabstractionfromindividualdataobjectstotheclustersinwhichthosedataobjectsreside.Additionally,someclusteringtechniquescharacterizeeachclusterintermsofaclusterprototype;i.e.,adataobjectthatisrepresentativeoftheobjectsinthecluster.Theseclusterprototypescanbeusedasthebasisforanumberofadditionaldataanalysisordataprocessingtechniques.Therefore,inthecontextofutility,clusteranalysisisthestudyoftechniquesforfindingthemostrepresentativeclusterprototypes.
Summarization. Many data analysis techniques, such as regression or principal component analysis, have a time or space complexity of O(m²) or higher (where m is the number of objects), and thus, are not practical for large data sets. However, instead of applying the algorithm to the entire data set, it can be applied to a reduced data set consisting only of cluster prototypes. Depending on the type of analysis, the number of prototypes, and the accuracy with which the prototypes represent the data, the results can be comparable to those that would have been obtained if all the data could have been used.
Compression. Cluster prototypes can also be used for data compression. In particular, a table is created that consists of the prototypes for each cluster; i.e., each prototype is assigned an integer value that is its position (index) in the table. Each object is represented by the index of the prototype associated with its cluster. This type of compression is known as vector quantization and is often applied to image, sound, and video data,
where(1)manyofthedataobjectsarehighlysimilartooneanother,(2)somelossofinformationisacceptable,and(3)asubstantialreductioninthedatasizeisdesired.EfficientlyFindingNearestNeighbors.Findingnearestneighborscanrequirecomputingthepairwisedistancebetweenallpoints.Oftenclustersandtheirclusterprototypescanbefoundmuchmoreefficiently.Ifobjectsarerelativelyclosetotheprototypeoftheircluster,thenwecanusetheprototypestoreducethenumberofdistancecomputationsthatarenecessarytofindthenearestneighborsofanobject.Intuitively,iftwoclusterprototypesarefarapart,thentheobjectsinthecorrespondingclusterscannotbenearestneighborsofeachother.Consequently,tofindanobject’snearestneighbors,itisnecessarytocomputeonlythedistancetoobjectsinnearbyclusters,wherethenearnessoftwoclustersismeasuredbythedistancebetweentheirprototypes.ThisideaismademorepreciseinExercise25 ofChapter2 ,whichisonpage111.
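To make the compression (vector quantization) idea above concrete, the following short Python/NumPy sketch encodes each object by the index of its nearest prototype and then reconstructs an approximation from the prototype table; the prototype and point values are made-up illustrative numbers, not data from the text.

import numpy as np

# Hypothetical prototype table (index -> prototype); in practice these would be
# the cluster centroids produced by a clustering algorithm.
prototypes = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
points = np.array([[0.5, 0.2], [9.7, 10.1], [0.3, 9.6], [10.2, 9.8]])

# Vector quantization: replace each point by the index of its closest prototype.
codes = np.linalg.norm(points[:, None] - prototypes[None, :], axis=2).argmin(axis=1)
print(codes)               # e.g., [0 1 2 1]

# Approximate reconstruction: look the indices back up in the prototype table.
print(prototypes[codes])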
Thischapterprovidesanintroductiontoclusteranalysis.Webeginwithahigh-leveloverviewofclustering,includingadiscussionofthevariousapproachestodividingobjectsintosetsofclustersandthedifferenttypesofclusters.Wethendescribethreespecificclusteringtechniquesthatrepresentbroadcategoriesofalgorithmsandillustrateavarietyofconcepts:K-means,agglomerativehierarchicalclustering,andDBSCAN.Thefinalsectionofthischapterisdevotedtoclustervalidity—methodsforevaluatingthegoodnessoftheclustersproducedbyaclusteringalgorithm.MoreadvancedclusteringconceptsandalgorithmswillbediscussedinChapter8 .Wheneverpossible,wediscussthestrengthsandweaknessesofdifferentschemes.Inaddition,theBibliographicNotesprovidereferencestorelevantbooksandpapersthatexploreclusteranalysisingreaterdepth.
7.1OverviewBeforediscussingspecificclusteringtechniques,weprovidesomenecessarybackground.First,wefurtherdefineclusteranalysis,illustratingwhyitisdifficultandexplainingitsrelationshiptoothertechniquesthatgroupdata.Thenweexploretwoimportanttopics:(1)differentwaystogroupasetofobjectsintoasetofclusters,and(2)typesofclusters.
7.1.1WhatIsClusterAnalysis?
Clusteranalysisgroupsdataobjectsbasedoninformationfoundonlyinthedatathatdescribestheobjectsandtheirrelationships.Thegoalisthattheobjectswithinagroupbesimilar(orrelated)tooneanotheranddifferentfrom(orunrelatedto)theobjectsinothergroups.Thegreaterthesimilarity(orhomogeneity)withinagroupandthegreaterthedifferencebetweengroups,thebetterormoredistincttheclustering.
Inmanyapplications,thenotionofaclusterisnotwelldefined.Tobetterunderstandthedifficultyofdecidingwhatconstitutesacluster,considerFigure7.1 ,whichshows20pointsandthreedifferentwaysofdividingthemintoclusters.Theshapesofthemarkersindicateclustermembership.Figures7.1(b) and7.1(d) dividethedataintotwoandsixparts,respectively.However,theapparentdivisionofeachofthetwolargerclustersintothreesubclustersmaysimplybeanartifactofthehumanvisualsystem.Also,itmaynotbeunreasonabletosaythatthepointsformfourclusters,asshowninFigure7.1(c) .Thisfigureillustratesthatthedefinitionofacluster
isimpreciseandthatthebestdefinitiondependsonthenatureofdataandthedesiredresults.
Figure7.1.Threedifferentwaysofclusteringthesamesetofpoints.
Clusteranalysisisrelatedtoothertechniquesthatareusedtodividedataobjectsintogroups.Forinstance,clusteringcanberegardedasaformofclassificationinthatitcreatesalabelingofobjectswithclass(cluster)labels.However,itderivestheselabelsonlyfromthedata.Incontrast,classificationinthesenseofChapter3 issupervisedclassification;i.e.,new,unlabeledobjectsareassignedaclasslabelusingamodeldevelopedfromobjectswithknownclasslabels.Forthisreason,clusteranalysisissometimesreferredtoasunsupervisedclassification.Whenthetermclassificationisusedwithoutanyqualificationwithindatamining,ittypicallyreferstosupervisedclassification.
Also,whilethetermssegmentationandpartitioningaresometimesusedassynonymsforclustering,thesetermsarefrequentlyusedforapproachesoutsidethetraditionalboundsofclusteranalysis.Forexample,theterm
partitioningisoftenusedinconnectionwithtechniquesthatdividegraphsintosubgraphsandthatarenotstronglyconnectedtoclustering.Segmentationoftenreferstothedivisionofdataintogroupsusingsimpletechniques;e.g.,animagecanbesplitintosegmentsbasedonlyonpixelintensityandcolor,orpeoplecanbedividedintogroupsbasedontheirincome.Nonetheless,someworkingraphpartitioningandinimageandsegmentationisrelatedtoclusteranalysis.
7.1.2DifferentTypesofClusterings
Anentirecollectionofclustersiscommonlyreferredtoasaclustering,andinthissection,wedistinguishvarioustypesofclusterings:hierarchical(nested)versuspartitional(unnested),exclusiveversusoverlappingversusfuzzy,andcompleteversuspartial.
HierarchicalversusPartitional
Themostcommonlydiscusseddistinctionamongdifferenttypesofclusteringsiswhetherthesetofclustersisnestedorunnested,orinmoretraditionalterminology,hierarchicalorpartitional.Apartitionalclusteringissimplyadivisionofthesetofdataobjectsintonon-overlappingsubsets(clusters)suchthateachdataobjectisinexactlyonesubset.Takenindividually,eachcollectionofclustersinFigures7.1(b–d) isapartitionalclustering.
Ifwepermitclusterstohavesubclusters,thenweobtainahierarchicalclustering,whichisasetofnestedclustersthatareorganizedasatree.Eachnode(cluster)inthetree(exceptfortheleafnodes)istheunionofitschildren(subclusters),andtherootofthetreeistheclustercontainingalltheobjects.Often,butnotalways,theleavesofthetreearesingletonclustersof
individualdataobjects.Ifweallowclusterstobenested,thenoneinterpretationofFigure7.1(a) isthatithastwosubclusters(Figure7.1(b) ),eachofwhich,inturn,hasthreesubclusters(Figure7.1(d) ).TheclustersshowninFigures7.1(a–d) ,whentakeninthatorder,alsoformahierarchical(nested)clusteringwith,respectively,1,2,4,and6clustersoneachlevel.Finally,notethatahierarchicalclusteringcanbeviewedasasequenceofpartitionalclusteringsandapartitionalclusteringcanbeobtainedbytakinganymemberofthatsequence;i.e.,bycuttingthehierarchicaltreeataparticularlevel.
ExclusiveversusOverlappingversusFuzzy
TheclusteringsshowninFigure7.1 areallexclusive,astheyassigneachobjecttoasinglecluster.Therearemanysituationsinwhichapointcouldreasonablybeplacedinmorethanonecluster,andthesesituationsarebetteraddressedbynon-exclusiveclustering.Inthemostgeneralsense,anoverlappingornon-exclusiveclusteringisusedtoreflectthefactthatanobjectcansimultaneouslybelongtomorethanonegroup(class).Forinstance,apersonatauniversitycanbebothanenrolledstudentandanemployeeoftheuniversity.Anon-exclusiveclusteringisalsooftenusedwhen,forexample,anobjectis“between”twoormoreclustersandcouldreasonablybeassignedtoanyoftheseclusters.ImagineapointhalfwaybetweentwooftheclustersofFigure7.1 .Ratherthanmakeasomewhatarbitraryassignmentoftheobjecttoasinglecluster,itisplacedinallofthe“equallygood”clusters.
Inafuzzyclustering(Section8.2.1 ),everyobjectbelongstoeveryclusterwithamembershipweightthatisbetween0(absolutelydoesn’tbelong)and1(absolutelybelongs).Inotherwords,clustersaretreatedasfuzzysets.(Mathematically,afuzzysetisoneinwhichanobjectbelongstoeverysetwithaweightthatisbetween0and1.Infuzzyclustering,weoftenimposethe
additionalconstraintthatthesumoftheweightsforeachobjectmustequal1.)Similarly,probabilisticclusteringtechniques(Section8.2.2 )computetheprobabilitywithwhicheachpointbelongstoeachcluster,andtheseprobabilitiesmustalsosumto1.Becausethemembershipweightsorprobabilitiesforanyobjectsumto1,afuzzyorprobabilisticclusteringdoesnotaddresstruemulticlasssituations,suchasthecaseofastudentemployee,whereanobjectbelongstomultipleclasses.Instead,theseapproachesaremostappropriateforavoidingthearbitrarinessofassigninganobjecttoonlyoneclusterwhenitisclosetoseveral.Inpractice,afuzzyorprobabilisticclusteringisoftenconvertedtoanexclusiveclusteringbyassigningeachobjecttotheclusterinwhichitsmembershipweightorprobabilityishighest.
CompleteversusPartial
Acompleteclusteringassignseveryobjecttoacluster,whereasapartialclusteringdoesnot.Themotivationforapartialclusteringisthatsomeobjectsinadatasetmaynotbelongtowell-definedgroups.Manytimesobjectsinthedatasetrepresentnoise,outliers,or“uninterestingbackground.”Forexample,somenewspaperstoriesshareacommontheme,suchasglobalwarming,whileotherstoriesaremoregenericorone-of-a-kind.Thus,tofindtheimportanttopicsinlastmonth’sstories,weoftenwanttosearchonlyforclustersofdocumentsthataretightlyrelatedbyacommontheme.Inothercases,acompleteclusteringoftheobjectsisdesired.Forexample,anapplicationthatusesclusteringtoorganizedocumentsforbrowsingneedstoguaranteethatalldocumentscanbebrowsed.
7.1.3DifferentTypesofClusters
Clusteringaimstofindusefulgroupsofobjects(clusters),whereusefulnessisdefinedbythegoalsofthedataanalysis.Notsurprisingly,severaldifferentnotionsofaclusterproveusefulinpractice.Inordertovisuallyillustratethedifferencesamongthesetypesofclusters,weusetwo-dimensionalpoints,asshowninFigure7.2 ,asourdataobjects.Westress,however,thatthetypesofclustersdescribedhereareequallyvalidforotherkindsofdata.
Figure7.2.Differenttypesofclustersasillustratedbysetsoftwo-dimensionalpoints.
Well-Separated
Aclusterisasetofobjectsinwhicheachobjectiscloser(ormoresimilar)toeveryotherobjectintheclusterthantoanyobjectnotinthecluster.Sometimesathresholdisusedtospecifythatalltheobjectsinaclustermustbesufficientlyclose(orsimilar)tooneanother.Thisidealisticdefinitionofaclusterissatisfiedonlywhenthedatacontainsnaturalclustersthatarequitefarfromeachother.Figure7.2(a) givesanexampleofwell-separatedclustersthatconsistsoftwogroupsofpointsinatwo-dimensionalspace.Thedistancebetweenanytwopointsindifferentgroupsislargerthanthedistancebetweenanytwopointswithinagroup.Well-separatedclustersdonotneedtobeglobular,butcanhaveanyshape.
Prototype-Based
Aclusterisasetofobjectsinwhicheachobjectiscloser(moresimilar)totheprototypethatdefinestheclusterthantotheprototypeofanyothercluster.Fordatawithcontinuousattributes,theprototypeofaclusterisoftenacentroid,i.e.,theaverage(mean)ofallthepointsinthecluster.Whenacentroidisnotmeaningful,suchaswhenthedatahascategoricalattributes,theprototypeisoftenamedoid,i.e.,themostrepresentativepointofacluster.Formanytypesofdata,theprototypecanberegardedasthemostcentralpoint,andinsuchinstances,wecommonlyrefertoprototype-basedclustersascenter-basedclusters.Notsurprisingly,suchclusterstendtobeglobular.Figure7.2(b) showsanexampleofcenter-basedclusters.
Graph-Based
Ifthedataisrepresentedasagraph,wherethenodesareobjectsandthelinksrepresentconnectionsamongobjects(seeSection2.1.2 ),thenaclustercanbedefinedasaconnectedcomponent;i.e.,agroupofobjects
thatareconnectedtooneanother,butthathavenoconnectiontoobjectsoutsidethegroup.Animportantexampleofgraph-basedclustersisacontiguity-basedcluster,wheretwoobjectsareconnectedonlyiftheyarewithinaspecifieddistanceofeachother.Thisimpliesthateachobjectinacontiguity-basedclusterisclosertosomeotherobjectintheclusterthantoanypointinadifferentcluster.Figure7.2(c) showsanexampleofsuchclustersfortwo-dimensionalpoints.Thisdefinitionofaclusterisusefulwhenclustersareirregularorintertwined.However,thisapproachcanhavetroublewhennoiseispresentsince,asillustratedbythetwosphericalclustersofFigure7.2(c) ,asmallbridgeofpointscanmergetwodistinctclusters.
Othertypesofgraph-basedclustersarealsopossible.Onesuchapproach(Section7.3.2 )definesaclusterasaclique;i.e.,asetofnodesinagraphthatarecompletelyconnectedtoeachother.Specifically,ifweaddconnectionsbetweenobjectsintheorderoftheirdistancefromoneanother,aclusterisformedwhenasetofobjectsformsaclique.Likeprototype-basedclusters,suchclusterstendtobeglobular.
Density-Based
Aclusterisadenseregionofobjectsthatissurroundedbyaregionoflowdensity.Figure7.2(d) showssomedensity-basedclustersfordatacreatedbyaddingnoisetothedataofFigure7.2(c) .Thetwocircularclustersarenotmerged,asinFigure7.2(c) ,becausethebridgebetweenthemfadesintothenoise.Likewise,thecurvethatispresentinFigure7.2(c) alsofadesintothenoiseanddoesnotformaclusterinFigure7.2(d) .Adensity-baseddefinitionofaclusterisoftenemployedwhentheclustersareirregularorintertwined,andwhennoiseandoutliersarepresent.Bycontrast,acontiguity-baseddefinitionofaclusterwouldnotworkwellforthedataofFigure7.2(d) becausethenoisewouldtendtoformbridgesbetweenclusters.
Shared-Property(ConceptualClusters)
Moregenerally,wecandefineaclusterasasetofobjectsthatsharesomeproperty.Thisdefinitionencompassesallthepreviousdefinitionsofacluster;e.g.,objectsinacenter-basedclustersharethepropertythattheyareallclosesttothesamecentroidormedoid.However,theshared-propertyapproachalsoincludesnewtypesofclusters.ConsidertheclustersshowninFigure7.2(e) .Atriangulararea(cluster)isadjacenttoarectangularone,andtherearetwointertwinedcircles(clusters).Inbothcases,aclusteringalgorithmwouldneedaveryspecificconceptofaclustertosuccessfullydetecttheseclusters.Theprocessoffindingsuchclustersiscalledconceptualclustering.However,toosophisticatedanotionofaclusterwouldtakeusintotheareaofpatternrecognition,andthus,weonlyconsidersimplertypesofclustersinthisbook.
RoadMapInthischapter,weusethefollowingthreesimple,butimportanttechniquestointroducemanyoftheconceptsinvolvedinclusteranalysis.
K-means.Thisisaprototype-based,partitionalclusteringtechniquethatattemptstofindauser-specifiednumberofclusters(K),whicharerepresentedbytheircentroids.AgglomerativeHierarchicalClustering.Thisclusteringapproachreferstoacollectionofcloselyrelatedclusteringtechniquesthatproduceahierarchicalclusteringbystartingwitheachpointasasingletonclusterandthenrepeatedlymergingthetwoclosestclustersuntilasingle,all-encompassingclusterremains.Someofthesetechniqueshaveanaturalinterpretationintermsofgraph-basedclustering,whileothershaveaninterpretationintermsofaprototype-basedapproach.
DBSCAN.Thisisadensity-basedclusteringalgorithmthatproducesapartitionalclustering,inwhichthenumberofclustersisautomaticallydeterminedbythealgorithm.Pointsinlow-densityregionsareclassifiedasnoiseandomitted;thus,DBSCANdoesnotproduceacompleteclustering.
7.2K-meansPrototype-basedclusteringtechniquescreateaone-levelpartitioningofthedataobjects.Thereareanumberofsuchtechniques,buttwoofthemostprominentareK-meansandK-medoid.K-meansdefinesaprototypeintermsofacentroid,whichisusuallythemeanofagroupofpoints,andistypicallyappliedtoobjectsinacontinuousn-dimensionalspace.K-medoiddefinesaprototypeintermsofamedoid,whichisthemostrepresentativepointforagroupofpoints,andcanbeappliedtoawiderangeofdatasinceitrequiresonlyaproximitymeasureforapairofobjects.Whileacentroidalmostnevercorrespondstoanactualdatapoint,amedoid,byitsdefinition,mustbeanactualdatapoint.Inthissection,wewillfocussolelyonK-means,whichisoneoftheoldestandmostwidely-usedclusteringalgorithms.
7.2.1TheBasicK-meansAlgorithm
TheK-meansclusteringtechniqueissimple,andwebeginwithadescriptionofthebasicalgorithm.WefirstchooseKinitialcentroids,whereKisauser-specifiedparameter,namely,thenumberofclustersdesired.Eachpointisthenassignedtotheclosestcentroid,andeachcollectionofpointsassignedtoacentroidisacluster.Thecentroidofeachclusteristhenupdatedbasedonthepointsassignedtothecluster.Werepeattheassignmentandupdatestepsuntilnopointchangesclusters,orequivalently,untilthecentroidsremainthesame.
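To make these steps concrete, here is a minimal Python/NumPy sketch of the basic algorithm; the function name kmeans, the random initialization, the empty-cluster guard, and the toy data are illustrative choices rather than a prescribed implementation.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means sketch: X is an (m, n) array of points, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Select K points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid (Euclidean distance).
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 1], [1, 2], [8, 8], [8, 9], [0, 9], [1, 8]], dtype=float)
print(kmeans(X, k=3))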
K-meansisformallydescribedbyAlgorithm7.1. TheoperationofK-meansisillustratedinFigure7.3 ,whichshowshow,startingfromthree
centroids,thefinalclustersarefoundinfourassignment-updatesteps.IntheseandotherfiguresdisplayingK-meansclustering,eachsubfigureshows(1)thecentroidsatthestartoftheiterationand(2)theassignmentofthepointstothosecentroids.Thecentroidsareindicatedbythe“+”symbol;allpointsbelongingtothesameclusterhavethesamemarkershape.
Figure7.3.UsingtheK-meansalgorithmtofindthreeclustersinsampledata.
Algorithm 7.1 Basic K-means algorithm.
1: Select K points as initial centroids.
2: repeat
3:   Form K clusters by assigning each point to its closest centroid.
4:   Recompute the centroid of each cluster.
5: until Centroids do not change.

In the first step, shown in Figure 7.3(a), points are assigned to the initial centroids, which are all in the larger group of points. For this example, we use the mean as the centroid. After points are assigned to a centroid, the centroid
isthenupdated.Again,thefigureforeachstepshowsthecentroidatthebeginningofthestepandtheassignmentofpointstothosecentroids.Inthesecondstep,pointsareassignedtotheupdatedcentroids,andthecentroidsareupdatedagain.Insteps2,3,and4,whichareshowninFigures7.3(b) ,(c) ,and(d) ,respectively,twoofthecentroidsmovetothetwosmallgroupsofpointsatthebottomofthefigures.WhentheK-meansalgorithmterminatesinFigure7.3(d) ,becausenomorechangesoccur,thecentroidshaveidentifiedthenaturalgroupingsofpoints.
Foranumberofcombinationsofproximityfunctionsandtypesofcentroids,K-meansalwaysconvergestoasolution;i.e.,K-meansreachesastateinwhichnopointsareshiftingfromoneclustertoanother,andhence,thecentroidsdon’tchange.Becausemostoftheconvergenceoccursintheearlysteps,however,theconditiononline5ofAlgorithm7.1 isoftenreplacedbyaweakercondition,e.g.,repeatuntilonly1%ofthepointschangeclusters.
WeconsidereachofthestepsinthebasicK-meansalgorithminmoredetailandthenprovideananalysisofthealgorithm’sspaceandtimecomplexity.
Assigning Points to the Closest Centroid
To assign a point to the closest centroid, we need a proximity measure that quantifies the notion of "closest" for the specific data under consideration. Euclidean (L2) distance is often used for data points in Euclidean space, while cosine similarity is more appropriate for documents. However, several types of proximity measures can be appropriate for a given type of data. For example, Manhattan (L1) distance can be used for Euclidean data, while the Jaccard measure is often employed for documents.
Usually, the similarity measures used for K-means are relatively simple since the algorithm repeatedly calculates the similarity of each point to each
centroid.Insomecases,however,suchaswhenthedataisinlow-dimensionalEuclideanspace,itispossibletoavoidcomputingmanyofthesimilarities,thussignificantlyspeedinguptheK-meansalgorithm.BisectingK-means(describedinSection7.2.3 )isanotherapproachthatspeedsupK-meansbyreducingthenumberofsimilaritiescomputed.
CentroidsandObjectiveFunctionsStep4oftheK-meansalgorithmwasstatedrathergenerallyas“recomputethecentroidofeachcluster,”sincethecentroidcanvary,dependingontheproximitymeasureforthedataandthegoaloftheclustering.Thegoaloftheclusteringistypicallyexpressedbyanobjectivefunctionthatdependsontheproximitiesofthepointstooneanotherortotheclustercentroids;e.g.,minimizethesquareddistanceofeachpointtoitsclosestcentroid.Weillustratethiswithtwoexamples.However,thekeypointisthis:afterwehavespecifiedaproximitymeasureandanobjectivefunction,thecentroidthatweshouldchoosecanoftenbedeterminedmathematically.WeprovidemathematicaldetailsinSection7.2.6 ,andprovideanon-mathematicaldiscussionofthisobservationhere.
DatainEuclideanSpace
ConsiderdatawhoseproximitymeasureisEuclideandistance.Forourobjectivefunction,whichmeasuresthequalityofaclustering,weusethesumofthesquarederror(SSE),whichisalsoknownasscatter.Inotherwords,wecalculatetheerrorofeachdatapoint,i.e.,itsEuclideandistancetotheclosestcentroid,andthencomputethetotalsumofthesquarederrors.GiventwodifferentsetsofclustersthatareproducedbytwodifferentrunsofK-means,weprefertheonewiththesmallestsquarederrorsincethismeansthattheprototypes(centroids)ofthisclusteringareabetterrepresentationof
the points in their cluster. Using the notation in Table 7.1, the SSE is formally defined as follows:

SSE = ∑_{i=1}^{K} ∑_{x∈Ci} dist(ci, x)²   (7.1)

where dist is the standard Euclidean (L2) distance between two objects in Euclidean space.

Table 7.1. Table of notation.
Symbol   Description
x        An object.
Ci       The ith cluster.
ci       The centroid of cluster Ci.
c        The centroid of all points.
mi       The number of objects in the ith cluster.
m        The number of objects in the data set.
K        The number of clusters.

Given these assumptions, it can be shown (see Section 7.2.6) that the centroid that minimizes the SSE of the cluster is the mean. Using the notation in Table 7.1, the centroid (mean) of the ith cluster is defined by Equation 7.2:

ci = (1/mi) ∑_{x∈Ci} x   (7.2)

To illustrate, the centroid of a cluster containing the three two-dimensional points (1,1), (2,3), and (6,2) is ((1+2+6)/3, (1+3+2)/3) = (3, 2).
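As a small numerical check of Equations 7.1 and 7.2, the following sketch recomputes the centroid of the three-point cluster above and its contribution to the SSE; the variable names are our own.

import numpy as np

cluster = np.array([[1, 1], [2, 3], [6, 2]], dtype=float)

# Equation 7.2: the centroid is the mean of the points in the cluster.
centroid = cluster.mean(axis=0)                       # -> [3., 2.]

# Equation 7.1, restricted to one cluster: sum of squared distances to the centroid.
sse = np.sum(np.linalg.norm(cluster - centroid, axis=1) ** 2)

print(centroid, sse)                                  # SSE = 5 + 2 + 9 = 16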
Steps3and4oftheK-meansalgorithmdirectlyattempttominimizetheSSE(ormoregenerally,theobjectivefunction).Step3formsclustersbyassigningpointstotheirnearestcentroid,whichminimizestheSSEforthegivensetofcentroids.Step4recomputesthecentroidssoastofurtherminimizetheSSE.However,theactionsofK-meansinSteps3and4areguaranteedtoonlyfindalocalminimumwithrespecttotheSSEbecausetheyarebasedonoptimizingtheSSEforspecificchoicesofthecentroidsandclusters,ratherthanforallpossiblechoices.Wewilllaterseeanexampleinwhichthisleadstoasuboptimalclustering.
DocumentData
To illustrate that K-means is not restricted to data in Euclidean space, we consider document data and the cosine similarity measure. Here we assume that the document data is represented as a document-term matrix as described on page 37. Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid; this quantity is known as the cohesion of the cluster. For this objective it can be shown that the cluster centroid is, as for Euclidean data, the mean. The analogous quantity to the total SSE is the total cohesion, which is given by Equation 7.3:

Total Cohesion = ∑_{i=1}^{K} ∑_{x∈Ci} cosine(x, ci)   (7.3)
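The following sketch evaluates Equation 7.3 for a single cluster of toy term-frequency vectors; the document vectors are made up purely for illustration.

import numpy as np

docs = np.array([[1.0, 2.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 1.0]])  # toy tf vectors
centroid = docs.mean(axis=0)

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Equation 7.3, restricted to one cluster: add up cosine(x, centroid) over its members.
cohesion = sum(cosine(x, centroid) for x in docs)
print(cohesion)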
TheGeneralCase
Thereareanumberofchoicesfortheproximityfunction,centroid,andobjectivefunctionthatcanbeusedinthebasicK-meansalgorithmandthatareguaranteedtoconverge.Table7.2 showssomepossiblechoices,
includingthetwothatwehavejustdiscussed.NoticethatforManhattandistanceandtheobjectiveofminimizingthesumofthedistances,theappropriatecentroidisthemedianofthepointsinacluster.
Table 7.2. K-means: Common choices for proximity, centroids, and objective functions.
Proximity Function        Centroid   Objective Function
Manhattan (L1)            median     Minimize sum of the L1 distance of an object to its cluster centroid
Squared Euclidean (L2²)   mean       Minimize sum of the squared L2 distance of an object to its cluster centroid
cosine                    mean       Maximize sum of the cosine similarity of an object to its cluster centroid
Bregman divergence        mean       Minimize sum of the Bregman divergence of an object to its cluster centroid

The last entry in the table, Bregman divergence (Section 2.4.8), is actually a class of proximity measures that includes the squared Euclidean distance, L2², the Mahalanobis distance, and cosine similarity. The importance of Bregman divergence functions is that any such function can be used as the basis of a K-means style clustering algorithm with the mean as the centroid. Specifically, if we use a Bregman divergence as our proximity function, then the resulting clustering algorithm has the usual properties of K-means with respect to convergence, local minima, etc. Furthermore, the properties of such a clustering algorithm can be developed for all possible Bregman divergences. For example, K-means algorithms that use cosine similarity or squared Euclidean distance are particular instances of a general clustering algorithm based on Bregman divergences.
FortherestofourK-meansdiscussion,weusetwo-dimensionaldatasinceitiseasytoexplainK-meansanditspropertiesforthistypeofdata.But,assuggestedbythelastfewparagraphs,K-meansisageneralclusteringalgorithmandcanbeusedwithawidevarietyofdatatypes,suchasdocumentsandtimeseries.
ChoosingInitialCentroidsWhenrandominitializationofcentroidsisused,differentrunsofK-meanstypicallyproducedifferenttotalSSEs.Weillustratethiswiththesetoftwo-dimensionalpointsshowninFigure7.3 ,whichhasthreenaturalclustersofpoints.Figure7.4(a) showsaclusteringsolutionthatistheglobalminimumoftheSSEforthreeclusters,whileFigure7.4(b) showsasuboptimalclusteringthatisonlyalocalminimum.
Figure7.4.Threeoptimalandnon-optimalclusters.
ChoosingtheproperinitialcentroidsisthekeystepofthebasicK-meansprocedure.Acommonapproachistochoosetheinitialcentroidsrandomly,buttheresultingclustersareoftenpoor.
Example7.1(PoorInitialCentroids).Randomlyselectedinitialcentroidscanbepoor.WeprovideanexampleofthisusingthesamedatasetusedinFigures7.3 and7.4 .Figures7.3 and7.5 showtheclustersthatresultfromtwoparticularchoicesofinitialcentroids.(Forbothfigures,thepositionsoftheclustercentroidsinthevariousiterationsareindicatedbycrosses.)InFigure7.3 ,eventhoughalltheinitialcentroidsarefromonenaturalcluster,theminimumSSEclusteringisstillfound.InFigure7.5 ,however,eventhoughtheinitialcentroidsseemtobebetterdistributed,weobtainasuboptimalclustering,withhighersquarederror.
Figure7.5.PoorstartingcentroidsforK-means.
Example7.2(LimitsofRandomInitialization).Onetechniquethatiscommonlyusedtoaddresstheproblemofchoosinginitialcentroidsistoperformmultipleruns,eachwithadifferentsetofrandomlychoseninitialcentroids,andthenselectthesetofclusterswiththeminimumSSE.Whilesimple,thisstrategymightnotworkverywell,dependingonthedatasetandthenumberofclusterssought.We
demonstratethisusingthesampledatasetshowninFigure7.6(a) .Thedataconsistsoftwopairsofclusters,wheretheclustersineach(top-bottom)pairareclosertoeachotherthantotheclustersintheotherpair.Figure7.6(b–d) showsthatifwestartwithtwoinitialcentroidsperpairofclusters,thenevenwhenbothcentroidsareinasinglecluster,thecentroidswillredistributethemselvessothatthe“true”clustersarefound.However,Figure7.7 showsthatifapairofclustershasonlyoneinitialcentroidandtheotherpairhasthree,thentwoofthetrueclusterswillbecombinedandonetrueclusterwillbesplit.
Figure7.6.Twopairsofclusterswithapairofinitialcentroidswithineachpairofclusters.
Figure7.7.Twopairsofclusterswithmoreorfewerthantwoinitialcentroidswithinapairofclusters.
Notethatanoptimalclusteringwillbeobtainedaslongastwoinitialcentroidsfallanywhereinapairofclusters,sincethecentroidswillredistributethemselves,onetoeachcluster.Unfortunately,asthenumberofclustersbecomeslarger,itisincreasinglylikelythatatleastonepairofclusterswillhaveonlyoneinitialcentroid—seeExercise4 onpage603.
Inthiscase,becausethepairsofclustersarefartherapartthanclusterswithinapair,theK-meansalgorithmwillnotredistributethecentroidsbetweenpairsofclusters,andthus,onlyalocalminimumwillbeachieved.
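The multiple-runs strategy of Example 7.2 is easy to sketch in code: run K-means from several random initializations and keep the clustering with the lowest SSE. The helper below re-implements a small K-means loop so the example is self-contained; the function names, the number of runs, and the synthetic data are illustrative assumptions.

import numpy as np

def run_kmeans(X, k, seed):
    """One run of basic K-means from a random start; returns (centroids, labels, SSE)."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(100):
        labels = np.linalg.norm(X[:, None] - c[None, :], axis=2).argmin(axis=1)
        new_c = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else c[j]
                          for j in range(k)])
        if np.allclose(new_c, c):
            break
        c = new_c
    labels = np.linalg.norm(X[:, None] - c[None, :], axis=2).argmin(axis=1)
    sse = np.sum((X - c[labels]) ** 2)
    return c, labels, sse

# Four well-separated synthetic clusters arranged as two pairs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in [(0, 0), (0, 5), (5, 0), (5, 5)]])

# Perform several runs with different random initial centroids; keep the lowest SSE.
best = min((run_kmeans(X, k=4, seed=s) for s in range(10)), key=lambda r: r[2])
print("best SSE:", best[2])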
Becauseoftheproblemswithusingrandomlyselectedinitialcentroids,whichevenrepeatedrunsmightnotovercome,othertechniquesareoftenemployedforinitialization.Oneeffectiveapproachistotakeasampleofpointsandclusterthemusingahierarchicalclusteringtechnique.Kclustersareextractedfromthehierarchicalclustering,andthecentroidsofthoseclustersareusedastheinitialcentroids.Thisapproachoftenworkswell,butispracticalonlyif(1)thesampleisrelativelysmall,e.g.,afewhundredtoafewthousand(hierarchicalclusteringisexpensive),and(2)Kisrelativelysmallcomparedtothesamplesize.
Thefollowingprocedureisanotherapproachtoselectinginitialcentroids.Selectthefirstpointatrandomortakethecentroidofallpoints.Then,foreachsuccessiveinitialcentroid,selectthepointthatisfarthestfromanyoftheinitialcentroidsalreadyselected.Inthisway,weobtainasetofinitialcentroidsthatisguaranteedtobenotonlyrandomlyselectedbutalsowellseparated.Unfortunately,suchanapproachcanselectoutliers,ratherthanpointsindenseregions(clusters),whichcanleadtoasituationwheremanyclustershavejustonepoint—anoutlier—whichreducesthenumberofcentroidsforformingclustersforthemajorityofpoints.Also,itisexpensivetocomputethefarthestpointfromthecurrentsetofinitialcentroids.Toovercometheseproblems,thisapproachisoftenappliedtoasampleofthepoints.Becauseoutliersarerare,theytendnottoshowupinarandomsample.Incontrast,pointsfromeverydenseregionarelikelytobeincludedunlessthesamplesizeisverysmall.Also,thecomputationinvolvedinfindingtheinitialcentroidsisgreatlyreducedbecausethesamplesizeistypicallymuchsmallerthanthenumberofpoints.
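A minimal sketch of the farthest-point selection just described, applied to a random sample of the data; the helper name farthest_first_centroids and the sample size are our own choices.

import numpy as np

def farthest_first_centroids(X, k, sample_size=100, seed=0):
    """Pick well-separated initial centroids by repeatedly taking the farthest point,
    working on a random sample so that outliers are unlikely to be selected."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    centroids = [sample.mean(axis=0)]          # start from the centroid of the sample
    for _ in range(k - 1):
        # Distance of every sampled point to its closest centroid chosen so far.
        d = np.linalg.norm(sample[:, None] - np.array(centroids)[None, :], axis=2).min(axis=1)
        centroids.append(sample[d.argmax()])   # next centroid: the farthest such point
    return np.array(centroids)

X = np.random.default_rng(7).normal(size=(500, 2))
print(farthest_first_centroids(X, k=3))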
K-means++
More recently, a new approach for initializing K-means, called K-means++, has been developed. This procedure is guaranteed to find a K-means clustering solution that is optimal to within a factor of O(log k), which in practice translates into noticeably better clustering results in terms of lower SSE. This technique is similar to the idea just discussed of picking the first centroid at random and then picking each remaining centroid as the point as far from the remaining centroids as possible. Specifically, K-means++ picks centroids incrementally until k centroids have been picked. At every such step, each point has a probability of being picked as the new centroid that is proportional to the square of its distance to its closest centroid. It might seem that this approach might tend to choose outliers for centroids, but because outliers are rare, by definition, this is unlikely.
ThedetailsofK-means++initializationaregivenbyAlgorithm7.2. TherestofthealgorithmisthesameasordinaryK-means.
Algorithm 7.2 K-means++ initialization algorithm.
1: For the first centroid, pick one of the points at random.
2: for i = 1 to number of trials do
3:   Compute the distance, d(x), of each point to its closest centroid.
4:   Assign each point a probability proportional to each point's d(x)².
5:   Pick the new centroid from the remaining points using the weighted probabilities.
6: end for
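A minimal Python sketch of the K-means++ selection rule follows; for simplicity it draws a single candidate per step rather than the several trials of Algorithm 7.2, so it is a simplification of, not a substitute for, the procedure above.

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids; each new centroid is chosen with probability
    proportional to the squared distance d(x)^2 to its closest existing centroid."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]            # first centroid: uniformly at random
    for _ in range(k - 1):
        d2 = np.linalg.norm(X[:, None] - np.array(centroids)[None, :], axis=2).min(axis=1) ** 2
        probs = d2 / d2.sum()                        # probability proportional to d(x)^2
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

X = np.random.default_rng(3).normal(size=(200, 2))
print(kmeans_pp_init(X, k=4))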
Later,wewilldiscusstwootherapproachesthatarealsousefulforproducingbetter-quality(lowerSSE)clusterings:usingavariantofK-meansthatislesssusceptibletoinitializationproblems(bisectingK-means)andusingpostprocessingto“fixup”thesetofclustersproduced.K-means++couldbecombinedwitheitherapproach.
Time and Space Complexity
The space requirements for K-means are modest because only the data points and centroids are stored. Specifically, the storage required is O((m + K)n), where m is the number of points and n is the number of attributes. The time requirements for K-means are also modest, basically linear in the number of data points. In particular, the time required is O(I × K × m × n), where I is the number of iterations required for convergence. As mentioned, I is often small and can usually be safely bounded, as most changes typically occur in the first few iterations. Therefore, K-means is linear in m, the number of points, and is efficient as well as simple provided that K, the number of clusters, is significantly less than m.
7.2.2K-means:AdditionalIssues
HandlingEmptyClustersOneoftheproblemswiththebasicK-meansalgorithmisthatemptyclusterscanbeobtainedifnopointsareallocatedtoaclusterduringtheassignmentstep.Ifthishappens,thenastrategyisneededtochooseareplacementcentroid,sinceotherwise,thesquarederrorwillbelargerthannecessary.Oneapproachistochoosethepointthatisfarthestawayfromanycurrentcentroid.Ifnothingelse,thiseliminatesthepointthatcurrentlycontributes
mosttothetotalsquarederror.(AK-means++approachcouldbeusedaswell.)AnotherapproachistochoosethereplacementcentroidatrandomfromtheclusterthathasthehighestSSE.ThiswilltypicallysplittheclusterandreducetheoverallSSEoftheclustering.Ifthereareseveralemptyclusters,thenthisprocesscanberepeatedseveraltimes.
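A sketch of the first replacement strategy (move the empty cluster's centroid to the point farthest from any current centroid); the function name and the toy arrays are illustrative.

import numpy as np

def replace_empty_centroids(X, centroids, labels):
    """If a cluster received no points, move its centroid to the point that is
    farthest from every current centroid (a large contributor to the SSE)."""
    for j in range(len(centroids)):
        if not np.any(labels == j):                          # cluster j is empty
            d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2).min(axis=1)
            centroids[j] = X[d.argmax()]
    return centroids

# Toy example in which the third centroid ends up with no assigned points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
centroids = np.array([[0.5, 0.0], [10.0, 10.0], [50.0, 50.0]])
labels = np.linalg.norm(X[:, None] - centroids[None, :], axis=2).argmin(axis=1)
print(replace_empty_centroids(X, centroids, labels))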
OutliersWhenthesquarederrorcriterionisused,outlierscanundulyinfluencetheclustersthatarefound.Inparticular,whenoutliersarepresent,theresultingclustercentroids(prototypes)aretypicallynotasrepresentativeastheyotherwisewouldbeandthus,theSSEwillbehigher.Becauseofthis,itisoftenusefultodiscoveroutliersandeliminatethembeforehand.Itisimportant,however,toappreciatethattherearecertainclusteringapplicationsforwhichoutliersshouldnotbeeliminated.Whenclusteringisusedfordatacompression,everypointmustbeclustered,andinsomecases,suchasfinancialanalysis,apparentoutliers,e.g.,unusuallyprofitablecustomers,canbethemostinterestingpoints.
Anobviousissueishowtoidentifyoutliers.AnumberoftechniquesforidentifyingoutlierswillbediscussedinChapter9 .Ifweuseapproachesthatremoveoutliersbeforeclustering,weavoidclusteringpointsthatwillnotclusterwell.Alternatively,outlierscanalsobeidentifiedinapostprocessingstep.Forinstance,wecankeeptrackoftheSSEcontributedbyeachpoint,andeliminatethosepointswithunusuallyhighcontributions,especiallyovermultipleruns.Also,weoftenwanttoeliminatesmallclustersbecausetheyfrequentlyrepresentgroupsofoutliers.
ReducingtheSSEwithPostprocessing
AnobviouswaytoreducetheSSEistofindmoreclusters,i.e.,tousealargerK.However,inmanycases,wewouldliketoimprovetheSSE,butdon’twanttoincreasethenumberofclusters.ThisisoftenpossiblebecauseK-meanstypicallyconvergestoalocalminimum.Varioustechniquesareusedto“fixup”theresultingclustersinordertoproduceaclusteringthathaslowerSSE.ThestrategyistofocusonindividualclusterssincethetotalSSEissimplythesumoftheSSEcontributedbyeachcluster.(WewillusethetermstotalSSEandclusterSSE,respectively,toavoidanypotentialconfusion.)WecanchangethetotalSSEbyperformingvariousoperationsontheclusters,suchassplittingormergingclusters.Onecommonlyusedapproachistoemployalternateclustersplittingandmergingphases.Duringasplittingphase,clustersaredivided,whileduringamergingphase,clustersarecombined.Inthisway,itisoftenpossibletoescapelocalSSEminimaandstillproduceaclusteringsolutionwiththedesirednumberofclusters.Thefollowingaresometechniquesusedinthesplittingandmergingphases.
TwostrategiesthatdecreasethetotalSSEbyincreasingthenumberofclustersarethefollowing:
Splitacluster:TheclusterwiththelargestSSEisusuallychosen,butwecouldalsosplittheclusterwiththelargeststandarddeviationforoneparticularattribute.
Introduceanewclustercentroid:Oftenthepointthatisfarthestfromanyclustercenterischosen.WecaneasilydeterminethisifwekeeptrackoftheSSEcontributedbyeachpoint.AnotherapproachistochooserandomlyfromallpointsorfromthepointswiththehighestSSEwithrespecttotheirclosestcentroids.
Twostrategiesthatdecreasethenumberofclusters,whiletryingtominimizetheincreaseintotalSSE,arethefollowing:
Disperseacluster:Thisisaccomplishedbyremovingthecentroidthatcorrespondstotheclusterandreassigningthepointstootherclusters.Ideally,theclusterthatisdispersedshouldbetheonethatincreasesthetotalSSEtheleast.
Mergetwoclusters:Theclusterswiththeclosestcentroidsaretypicallychosen,althoughanother,perhapsbetter,approachistomergethetwoclustersthatresultinthesmallestincreaseintotalSSE.ThesetwomergingstrategiesarethesameonesthatareusedinthehierarchicalclusteringtechniquesknownasthecentroidmethodandWard’smethod,respectively.BothmethodsarediscussedinSection7.3 .
UpdatingCentroidsIncrementallyInsteadofupdatingclustercentroidsafterallpointshavebeenassignedtoacluster,thecentroidscanbeupdatedincrementally,aftereachassignmentofapointtoacluster.Noticethatthisrequireseitherzeroortwoupdatestoclustercentroidsateachstep,sinceapointeithermovestoanewcluster(twoupdates)orstaysinitscurrentcluster(zeroupdates).Usinganincrementalupdatestrategyguaranteesthatemptyclustersarenotproducedbecauseallclustersstartwithasinglepoint,andifaclustereverhasonlyonepoint,thenthatpointwillalwaysbereassignedtothesamecluster.
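The constant-time update of a centroid when a single point joins or leaves its cluster can be written as follows; the helper names are ours, and the numbers reuse the three-point cluster from the discussion of Equation 7.2.

import numpy as np

def add_point(centroid, size, x):
    """New mean and size after point x joins a cluster with the given mean and size."""
    return centroid + (x - centroid) / (size + 1), size + 1

def remove_point(centroid, size, x):
    """New mean and size after point x leaves a cluster with the given mean and size (size > 1)."""
    return centroid - (x - centroid) / (size - 1), size - 1

c, m = np.array([3.0, 2.0]), 3                     # centroid and size of {(1,1), (2,3), (6,2)}
c, m = add_point(c, m, np.array([7.0, 2.0]))
print(c, m)                                        # (4, 2), 4 -- same as recomputing the mean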
Inaddition,ifincrementalupdatingisused,therelativeweightofthepointbeingaddedcanbeadjusted;e.g.,theweightofpointsisoftendecreasedastheclusteringproceeds.Whilethiscanresultinbetteraccuracyandfasterconvergence,itcanbedifficulttomakeagoodchoicefortherelativeweight,especiallyinawidevarietyofsituations.Theseupdateissuesaresimilartothoseinvolvedinupdatingweightsforartificialneuralnetworks.
Yetanotherbenefitofincrementalupdateshastodowithusingobjectivesotherthan“minimizeSSE.”Supposethatwearegivenanarbitraryobjectivefunctiontomeasurethegoodnessofasetofclusters.Whenweprocessanindividualpoint,wecancomputethevalueoftheobjectivefunctionforeachpossibleclusterassignment,andthenchoosetheonethatoptimizestheobjective.SpecificexamplesofalternativeobjectivefunctionsaregiveninSection7.5.2 .
Onthenegativeside,updatingcentroidsincrementallyintroducesanorderdependency.Inotherwords,theclustersproducedusuallydependontheorderinwhichthepointsareprocessed.Althoughthiscanbeaddressedbyrandomizingtheorderinwhichthepointsareprocessed,thebasicK-meansapproachofupdatingthecentroidsafterallpointshavebeenassignedtoclustershasnoorderdependency.Also,incrementalupdatesareslightlymoreexpensive.However,K-meansconvergesratherquickly,andtherefore,thenumberofpointsswitchingclustersquicklybecomesrelativelysmall.
7.2.3BisectingK-means
ThebisectingK-meansalgorithmisastraightforwardextensionofthebasicK-meansalgorithmthatisbasedonasimpleidea:toobtainKclusters,splitthesetofallpointsintotwoclusters,selectoneoftheseclusterstosplit,andsoon,untilKclustershavebeenproduced.ThedetailsofbisectingK-meansaregivenbyAlgorithm7.3.
Thereareanumberofdifferentwaystochoosewhichclustertosplit.Wecanchoosethelargestclusterateachstep,choosetheonewiththelargestSSE,oruseacriterionbasedonbothsizeandSSE.Differentchoicesresultindifferentclusters.
BecauseweareusingtheK-meansalgorithm“locally,”i.e.,tobisectindividualclusters,thefinalsetofclustersdoesnotrepresentaclusteringthatisalocalminimumwithrespecttothetotalSSE.Thus,weoftenrefinetheresultingclustersbyusingtheirclustercentroidsastheinitialcentroidsforthestandardK-meansalgorithm.
Algorithm 7.3 Bisecting K-means algorithm.
1: Initialize the list of clusters to contain the cluster consisting of all points.
2: repeat
3:   Remove a cluster from the list of clusters.
4:   {Perform several "trial" bisections of the chosen cluster.}
5:   for i = 1 to number of trials do
6:     Bisect the selected cluster using basic K-means.
7:   end for
8:   Select the two clusters from the bisection with the lowest total SSE.
9:   Add these two clusters to the list of clusters.
10: until The list of clusters contains K clusters.

Example 7.3 (Bisecting K-means and Initialization). To illustrate that bisecting K-means is less susceptible to initialization problems, we show, in Figure 7.8, how bisecting K-means finds four clusters in the data set originally shown in Figure 7.6(a). In iteration 1, two pairs of clusters are found; in iteration 2, the rightmost pair of clusters is split; and in iteration 3, the leftmost pair of clusters is split. Bisecting K-means has less trouble with initialization because it performs several trial bisections and takes the one with the lowest SSE, and because there are only two centroids at each step.
Figure7.8.BisectingK-meansonthefourclustersexample.
Finally,byrecordingthesequenceofclusteringsproducedasK-meansbisectsclusters,wecanalsousebisectingK-meanstoproduceahierarchicalclustering.
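The sketch below follows the structure of Algorithm 7.3 and uses scikit-learn's KMeans for the bisection step; the availability of scikit-learn, the function name, and the "largest SSE" splitting criterion are our assumptions for illustration, not a prescribed implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    """Sketch of Algorithm 7.3: repeatedly bisect the cluster with the
    largest SSE until k clusters remain (each cluster must have >= 2 points)."""
    rng = np.random.RandomState(seed)
    clusters = [X]                                    # start with all points
    while len(clusters) < k:
        # Choose the cluster with the largest SSE to split.
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        best_labels, best_sse = None, np.inf
        for _ in range(n_trials):                     # several trial bisections
            km = KMeans(n_clusters=2, n_init=1,
                        random_state=rng.randint(10**6)).fit(target)
            if km.inertia_ < best_sse:                # keep the lowest-SSE split
                best_labels, best_sse = km.labels_, km.inertia_
        clusters += [target[best_labels == 0], target[best_labels == 1]]
    return clusters
```

As noted above, the resulting clusters are often refined by running standard K-means with their centroids as the initial centroids.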
7.2.4K-meansandDifferentTypesofClusters
K-meansanditsvariationshaveanumberoflimitationswithrespecttofindingdifferenttypesofclusters.Inparticular,K-meanshasdifficultydetectingthe“natural”clusters,whenclustershavenon-sphericalshapesorwidelydifferentsizesordensities.ThisisillustratedbyFigures7.9 ,7.10 ,and7.11 .InFigure7.9 ,K-meanscannotfindthethreenaturalclustersbecauseoneoftheclustersismuchlargerthantheothertwo,andhence,thelargerclusteris
broken,whileoneofthesmallerclustersiscombinedwithaportionofthelargercluster.InFigure7.10 ,K-meansfailstofindthethreenaturalclustersbecausethetwosmallerclustersaremuchdenserthanthelargercluster.Finally,inFigure7.11 ,K-meansfindstwoclustersthatmixportionsofthetwonaturalclustersbecausetheshapeofthenaturalclustersisnotglobular.
Figure7.9.K-meanswithclustersofdifferentsize.
Figure7.10.K-meanswithclustersofdifferentdensity.
Figure7.11.K-meanswithnon-globularclusters.
ThedifficultyinthesethreesituationsisthattheK-meansobjectivefunctionisamismatchforthekindsofclusterswearetryingtofindbecauseitisminimizedbyglobularclustersofequalsizeanddensityorbyclustersthatarewellseparated.However,theselimitationscanbeovercome,insomesense,iftheuseriswillingtoacceptaclusteringthatbreaksthenaturalclustersintoanumberofsubclusters.Figure7.12 showswhathappenstothethreepreviousdatasetsifwefindsixclustersinsteadoftwoorthree.Eachsmallerclusterispureinthesensethatitcontainsonlypointsfromoneofthenaturalclusters.
Figure7.12.UsingK-meanstofindclustersthataresubclustersofthenaturalclusters.
7.2.5StrengthsandWeaknesses
K-meansissimpleandcanbeusedforawidevarietyofdatatypes.Itisalsoquiteefficient,eventhoughmultiplerunsareoftenperformed.Somevariants,includingbisectingK-means,areevenmoreefficient,andarelesssusceptibletoinitializationproblems.K-meansisnotsuitableforalltypesofdata,however.Itcannothandlenon-globularclustersorclustersofdifferentsizesanddensities,althoughitcantypicallyfindpuresubclustersifalargeenoughnumberofclustersisspecified.K-meansalsohastroubleclusteringdatathatcontainsoutliers.Outlierdetectionandremovalcanhelpsignificantlyinsuchsituations.Finally,K-meansisrestrictedtodataforwhichthereisanotionofacenter(centroid).Arelatedtechnique,K-medoidclustering,doesnothavethisrestriction,butismoreexpensive.
7.2.6K-meansasanOptimizationProblem
Here,wedelveintothemathematicsbehindK-means.Thissection,whichcanbeskippedwithoutlossofcontinuity,requiresknowledgeofcalculusthroughpartialderivatives.Familiaritywithoptimizationtechniques,especiallythosebasedongradientdescent,canalsobehelpful.
Asmentionedearlier,givenanobjectivefunctionsuchas“minimizeSSE,”clusteringcanbetreatedasanoptimizationproblem.Onewaytosolvethisproblem—tofindaglobaloptimum—istoenumerateallpossiblewaysofdividingthepointsintoclustersandthenchoosethesetofclustersthatbestsatisfiestheobjectivefunction,e.g.,thatminimizesthetotalSSE.Ofcourse,thisexhaustivestrategyiscomputationallyinfeasibleandasaresult,amorepracticalapproachisneeded,evenifsuchanapproachfindssolutionsthatarenotguaranteedtobeoptimal.Onetechnique,whichisknownasgradientdescent,isbasedonpickinganinitialsolutionandthenrepeatingthefollowingtwosteps:computethechangetothesolutionthatbestoptimizestheobjectivefunctionandthenupdatethesolution.
We assume that the data is one-dimensional, i.e., dist(x, y) = (x − y)². This does not change anything essential, but greatly simplifies the notation.

Derivation of K-means as an Algorithm to Minimize the SSE
In this section, we show how the centroid for the K-means algorithm can be mathematically derived when the proximity function is Euclidean distance and the objective is to minimize the SSE. Specifically, we investigate how we can best update a cluster centroid so that the cluster SSE is minimized. In mathematical terms, we seek to minimize Equation 7.1, which we repeat here, specialized for one-dimensional data:

SSE = Σ_{i=1}^{K} Σ_{x∈C_i} (c_i − x)²    (7.4)

Here, C_i is the ith cluster, x is a point in C_i, and c_i is the mean of the ith cluster. See Table 7.1 for a complete list of notation.

We can solve for the kth centroid c_k, which minimizes Equation 7.4, by differentiating the SSE, setting it equal to 0, and solving, as indicated below:

∂SSE/∂c_k = ∂/∂c_k Σ_{i=1}^{K} Σ_{x∈C_i} (c_i − x)² = Σ_{i=1}^{K} Σ_{x∈C_i} ∂/∂c_k (c_i − x)² = Σ_{x∈C_k} 2 (c_k − x) = 0

Σ_{x∈C_k} 2 (c_k − x) = 0  ⇒  m_k c_k = Σ_{x∈C_k} x  ⇒  c_k = (1/m_k) Σ_{x∈C_k} x

Thus, as previously indicated, the best centroid for minimizing the SSE of a cluster is the mean of the points in the cluster.

Derivation of K-means for SAE
To demonstrate that the K-means algorithm can be applied to a variety of different objective functions, we consider how to partition the data into K clusters such that the sum of the Manhattan (L1) distances of points from the center of their clusters is minimized. We are seeking to minimize the sum of the L1 absolute errors (SAE) as given by the following equation, where dist_{L1} is the L1 distance. Again, for notational simplicity, we use one-dimensional data, i.e., dist_{L1} = |c_i − x|.

SAE = Σ_{i=1}^{K} Σ_{x∈C_i} dist_{L1}(c_i, x)    (7.5)

We can solve for the kth centroid c_k, which minimizes Equation 7.5, by differentiating the SAE, setting it equal to 0, and solving:

∂SAE/∂c_k = ∂/∂c_k Σ_{i=1}^{K} Σ_{x∈C_i} |c_i − x| = Σ_{i=1}^{K} Σ_{x∈C_i} ∂/∂c_k |c_i − x| = Σ_{x∈C_k} ∂/∂c_k |c_k − x| = 0

Σ_{x∈C_k} ∂/∂c_k |c_k − x| = 0  ⇒  Σ_{x∈C_k} sign(x − c_k) = 0

If we solve for c_k, we find that c_k = median{x ∈ C_k}, the median of the points in the cluster. The median of a group of points is straightforward to compute and less susceptible to distortion by outliers.
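The two derivations can be checked numerically with a quick sketch, assuming NumPy; the toy one-dimensional "cluster" below is ours and chosen only so that the mean and median differ visibly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 50.0])       # one-dimensional "cluster"
centers = np.linspace(0, 60, 6001)              # candidate centroids on a fine grid

sse = [((x - c) ** 2).sum() for c in centers]   # squared error for each candidate
sae = [np.abs(x - c).sum() for c in centers]    # absolute error for each candidate

print(centers[np.argmin(sse)], x.mean())        # both ~12.0: SSE is minimized by the mean
print(centers[np.argmin(sae)], np.median(x))    # both ~3.0: SAE is minimized by the median
```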
7.3AgglomerativeHierarchicalClusteringHierarchicalclusteringtechniquesareasecondimportantcategoryofclusteringmethods.AswithK-means,theseapproachesarerelativelyoldcomparedtomanyclusteringalgorithms,buttheystillenjoywidespreaduse.Therearetwobasicapproachesforgeneratingahierarchicalclustering:
Agglomerative:Startwiththepointsasindividualclustersand,ateachstep,mergetheclosestpairofclusters.Thisrequiresdefininganotionofclusterproximity.
Divisive:Startwithone,all-inclusiveclusterand,ateachstep,splitaclusteruntilonlysingletonclustersofindividualpointsremain.Inthiscase,weneedtodecidewhichclustertosplitateachstepandhowtodothesplitting.
Agglomerativehierarchicalclusteringtechniquesarebyfarthemostcommon,and,inthissection,wewillfocusexclusivelyonthesemethods.AdivisivehierarchicalclusteringtechniqueisdescribedinSection8.4.2 .
Ahierarchicalclusteringisoftendisplayedgraphicallyusingatree-likediagramcalledadendrogram,whichdisplaysboththecluster-subclusterrelationshipsandtheorderinwhichtheclustersweremerged(agglomerativeview)orsplit(divisiveview).Forsetsoftwo-dimensionalpoints,suchasthosethatwewilluseasexamples,ahierarchicalclusteringcanalsobegraphicallyrepresentedusinganestedclusterdiagram.Figure7.13 showsanexampleofthesetwotypesoffiguresforasetoffourtwo-dimensionalpoints.
Thesepointswereclusteredusingthesingle-linktechniquethatisdescribedinSection7.3.2 .
Figure7.13.Ahierarchicalclusteringoffourpointsshownasadendrogramandasnestedclusters.
7.3.1BasicAgglomerativeHierarchicalClusteringAlgorithm
Manyagglomerativehierarchicalclusteringtechniquesarevariationsonasingleapproach:startingwithindividualpointsasclusters,successivelymergethetwoclosestclustersuntilonlyoneclusterremains.ThisapproachisexpressedmoreformallyinAlgorithm7.4 .
Algorithm 7.4 Basic agglomerative hierarchical clustering algorithm.
1: Compute the proximity matrix, if necessary.
2: repeat
3:   Merge the closest two clusters.
4:   Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5: until Only one cluster remains.

Defining Proximity between Clusters
The key operation of Algorithm 7.4 is the computation of the proximity between two clusters, and it is the definition of cluster proximity that differentiates the various agglomerative hierarchical techniques that we will discuss. Cluster proximity is typically defined with a particular type of cluster in mind (see Section 7.1.3). For example, many agglomerative hierarchical clustering techniques, such as MIN, MAX, and Group Average, come from a graph-based view of clusters. MIN defines cluster proximity as the proximity between the closest two points that are in different clusters, or using graph terms, the shortest edge between two nodes in different subsets of nodes. This yields contiguity-based clusters as shown in Figure 7.2(c). Alternatively, MAX takes the proximity between the farthest two points in different clusters to be the cluster proximity, or using graph terms, the longest edge between two nodes in different subsets of nodes. (If our proximities are distances, then the names, MIN and MAX, are short and suggestive. For similarities, however, where higher values indicate closer points, the names seem reversed. For that reason, we usually prefer to use the alternative names, single link and complete link, respectively.) Another graph-based approach, the group average technique, defines cluster proximity to be the average pairwise proximities (average length of edges) of all pairs of points from different clusters. Figure 7.14 illustrates these three approaches.
Figure7.14.Graph-baseddefinitionsofclusterproximity.
If,instead,wetakeaprototype-basedview,inwhicheachclusterisrepresentedbyacentroid,differentdefinitionsofclusterproximityaremorenatural.Whenusingcentroids,theclusterproximityiscommonlydefinedastheproximitybetweenclustercentroids.Analternativetechnique,Ward’smethod,alsoassumesthataclusterisrepresentedbyitscentroid,butitmeasurestheproximitybetweentwoclustersintermsoftheincreaseintheSSEthatresultsfrommergingthetwoclusters.LikeK-means,Ward’smethodattemptstominimizethesumofthesquareddistancesofpointsfromtheirclustercentroids.
Time and Space Complexity
The basic agglomerative hierarchical clustering algorithm just presented uses a proximity matrix. This requires the storage of (1/2) m² proximities (assuming the proximity matrix is symmetric), where m is the number of data points. The space needed to keep track of the clusters is proportional to the number of clusters, which is m − 1, excluding singleton clusters. Hence, the total space complexity is O(m²).

The analysis of the basic agglomerative hierarchical clustering algorithm is also straightforward with respect to computational complexity. O(m²) time is required to compute the proximity matrix. After that step, there are m − 1 iterations involving steps 3 and 4 because there are m clusters at the start and two clusters are merged during each iteration. If performed as a linear search of the proximity matrix, then for the ith iteration, Step 3 requires O((m − i + 1)²) time, which is proportional to the current number of clusters squared. Step 4 requires O(m − i + 1) time to update the proximity matrix after the merger of two clusters. (A cluster merger affects O(m − i + 1) proximities for the techniques that we consider.) Without modification, this would yield a time complexity of O(m³). If the distances from each cluster to all other clusters are stored as a sorted list (or heap), it is possible to reduce the cost of finding the two closest clusters to O(m − i + 1). However, because of the additional complexity of keeping data in a sorted list or heap, the overall time required for a hierarchical clustering based on Algorithm 7.4 is O(m² log m).

The space and time complexity of hierarchical clustering severely limits the size of data sets that can be processed. We discuss scalability approaches for clustering algorithms, including hierarchical clustering techniques, in Section 8.5. Note, however, that the bisecting K-means algorithm presented in Section 7.2.3 is a scalable algorithm that produces a hierarchical clustering.

7.3.2 Specific Techniques

Sample Data
To illustrate the behavior of the various hierarchical clustering algorithms, we will use sample data that consists of six two-dimensional points, which are shown in Figure 7.15. The x and y coordinates of the points and the Euclidean distances between them are shown in Tables 7.3 and 7.4, respectively.
Figure7.15.Setofsixtwo-dimensionalpoints.
Table7.3.xy-coordinatesofsixpoints.
Point xCoordinate yCoordinate
p1 0.4005 0.5306
p2 0.2148 0.3854
p3 0.3457 0.3156
p4 0.2652 0.1875
p5 0.0789 0.4139
p6 0.4548 0.3022
Table7.4.Euclideandistancematrixforsixpoints.
p1 p2 p3 p4 p5 p6
p1 0.00 0.24 0.22 0.37 0.34 0.23
p2 0.24 0.00 0.15 0.20 0.14 0.25
p3 0.22 0.15 0.00 0.15 0.28 0.11
p4 0.37 0.20 0.15 0.00 0.29 0.22
p5 0.34 0.14 0.28 0.29 0.00 0.39
p6 0.23 0.25 0.11 0.22 0.39 0.00
SingleLinkorMINForthesinglelinkorMINversionofhierarchicalclustering,theproximityoftwoclustersisdefinedastheminimumofthedistance(maximumofthesimilarity)betweenanytwopointsinthetwodifferentclusters.Usinggraphterminology,ifyoustartwithallpointsassingletonclustersandaddlinksbetweenpointsoneatatime,shortestlinksfirst,thenthesesinglelinkscombinethepointsintoclusters.Thesinglelinktechniqueisgoodathandlingnon-ellipticalshapes,butissensitivetonoiseandoutliers.
Example 7.4 (Single Link). Figure 7.16 shows the result of applying the single link technique to our example data set of six points. Figure 7.16(a) shows the nested clusters as a sequence of nested ellipses, where the numbers associated with the ellipses indicate the order of the clustering. Figure 7.16(b) shows the same information, but as a dendrogram. The height at which two clusters are merged in the dendrogram reflects the distance of the two clusters. For instance, from Table 7.4, we see that the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster in the dendrogram. As another example, the distance between clusters {3, 6} and {2, 5} is given by

dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = min(0.15, 0.25, 0.28, 0.39) = 0.15.

Figure 7.16. Single link clustering of the six points shown in Figure 7.15.

Complete Link or MAX or CLIQUE
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined as the maximum of the distance (minimum of the similarity) between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then a group of points is not a cluster until all the points in it are completely linked, i.e., form a clique. Complete link is less susceptible to noise and outliers, but it can break large clusters and it favors globular shapes.

Example 7.5 (Complete Link).
Figure 7.17 shows the results of applying MAX to the sample data set of six points. As with single link, points 3 and 6 are merged first. However, {3, 6} is merged with {4}, instead of {2, 5} or {1}, because

dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) = max(0.15, 0.22) = 0.22,
dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = max(0.15, 0.25, 0.28, 0.39) = 0.39,
dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1)) = max(0.22, 0.23) = 0.23.

Figure 7.17. Complete link clustering of the six points shown in Figure 7.15.

Group Average
For the group average version of hierarchical clustering, the proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters. This is an intermediate approach between the single and complete link approaches. Thus, for group average, the cluster proximity proximity(C_i, C_j) of clusters C_i and C_j, which are of size m_i and m_j, respectively, is expressed by the following equation:

proximity(C_i, C_j) = ( Σ_{x∈C_i, y∈C_j} proximity(x, y) ) / (m_i × m_j).    (7.6)

Example 7.6 (Group Average). Figure 7.18 shows the results of applying the group average approach to the sample data set of six points. To illustrate how group average works, we calculate the distance between some clusters:

dist({3, 6, 4}, {1}) = (0.22 + 0.37 + 0.23) / (3 × 1) = 0.28
dist({2, 5}, {1}) = (0.24 + 0.34) / (2 × 1) = 0.29
dist({3, 6, 4}, {2, 5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29) / (3 × 2) = 0.26

Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), clusters {3, 6, 4} and {2, 5} are merged at the fourth stage.

Figure 7.18. Group average clustering of the six points shown in Figure 7.15.

Ward's Method and Centroid Methods
For Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when two clusters are merged. Thus, this method uses the same objective function as K-means clustering. While it might seem that this feature makes Ward's method somewhat distinct from other hierarchical techniques, it can be shown mathematically that Ward's method is very similar to the group average method when the proximity between two points is taken to be the square of the distance between them.

Example 7.7 (Ward's Method). Figure 7.19 shows the results of applying Ward's method to the sample data set of six points. The clustering that is produced is different from those produced by single link, complete link, and group average.
Figure7.19.Ward’sclusteringofthesixpointsshowninFigure7.15 .
Centroidmethodscalculatetheproximitybetweentwoclustersbycalculatingthedistancebetweenthecentroidsofclusters.ThesetechniquesmayseemsimilartoK-means,butaswehaveremarked,Ward’smethodisthecorrecthierarchicalanalog.
Centroidmethodsalsohaveacharacteristic—oftenconsideredbad—thatisnotpossessedbytheotherhierarchicalclusteringtechniquesthatwehavediscussed:thepossibilityofinversions.Specifically,twoclustersthataremergedcanbemoresimilar(lessdistant)thanthepairofclustersthatweremergedinapreviousstep.Fortheothermethods,thedistancebetweenmergedclustersmonotonicallyincreases(oris,atworst,non-increasing)asweproceedfromsingletonclusterstooneall-inclusivecluster.
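The techniques just described can be reproduced on the six points of Table 7.3 with SciPy's hierarchical clustering routines, a sketch of which follows; the availability of SciPy is an assumption, and its method names map to the techniques above as single link (MIN), complete link (MAX), group average, the centroid method, and Ward's method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# The six two-dimensional points of Table 7.3.
points = np.array([[0.4005, 0.5306], [0.2148, 0.3854], [0.3457, 0.3156],
                   [0.2652, 0.1875], [0.0789, 0.4139], [0.4548, 0.3022]])
d = pdist(points)                        # condensed Euclidean distances (Table 7.4)

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(d, method=method)        # rows: (cluster i, cluster j, merge height, size)
    print(method, np.round(Z[:, 2], 3))  # the sequence of merge heights (dendrogram levels)
```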
7.3.3TheLance-WilliamsFormulaforClusterProximity
Any of the cluster proximities that we have discussed in this section can be viewed as a choice of different parameters (in the Lance-Williams formula shown below in Equation 7.7) for the proximity between clusters Q and R, where R is formed by merging clusters A and B. In this equation, p(., .) is a proximity function, while m_A, m_B, and m_Q are the numbers of points in clusters A, B, and Q, respectively. In other words, after we merge clusters A and B to form cluster R, the proximity of the new cluster, R, to an existing cluster, Q, is a linear function of the proximities of Q with respect to the original clusters A and B. Table 7.5 shows the values of these coefficients for the techniques that we have discussed.

p(R, Q) = α_A p(A, Q) + α_B p(B, Q) + β p(A, B) + γ |p(A, Q) − p(B, Q)|    (7.7)

Table 7.5. Table of Lance-Williams coefficients for common hierarchical clustering approaches.

Clustering Method    α_A    α_B    β    γ
Single Link    1/2    1/2    0    −1/2
Complete Link    1/2    1/2    0    1/2
Group Average    m_A/(m_A+m_B)    m_B/(m_A+m_B)    0    0
Centroid    m_A/(m_A+m_B)    m_B/(m_A+m_B)    −m_A m_B/(m_A+m_B)²    0
Ward's    (m_A+m_Q)/(m_A+m_B+m_Q)    (m_B+m_Q)/(m_A+m_B+m_Q)    −m_Q/(m_A+m_B+m_Q)    0

Any hierarchical clustering technique that can be expressed using the Lance-Williams formula does not need to keep the original data points. Instead, the proximity matrix is updated as clustering occurs. While a general formula is appealing, especially for implementation, it is easier to understand the different hierarchical methods by looking directly at the definition of cluster proximity that each method uses.
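The following is a minimal sketch of one Lance-Williams update of a distance matrix after merging two clusters, assuming NumPy; the function name and the use of the single link coefficients from Table 7.5 as defaults are our choices for illustration.

```python
import numpy as np

def lance_williams_update(D, a, b, alpha_a=0.5, alpha_b=0.5, beta=0.0, gamma=-0.5):
    """Merge clusters a and b (rows/columns of distance matrix D) and compute
    the distance of the merged cluster R to every other cluster Q using
    Equation 7.7.  The defaults are the single link coefficients."""
    d_rq = (alpha_a * D[a] + alpha_b * D[b] + beta * D[a, b]
            + gamma * np.abs(D[a] - D[b]))          # one value per existing cluster
    D_new = np.delete(np.delete(D, [a, b], axis=0), [a, b], axis=1)
    d_rq = np.delete(d_rq, [a, b])
    # Append the merged cluster R as the last row and column.
    D_new = np.vstack([np.hstack([D_new, d_rq[:, None]]),
                       np.append(d_rq, 0.0)[None, :]])
    return D_new
```

With the single link coefficients, α_A p(A,Q) + α_B p(B,Q) − (1/2)|p(A,Q) − p(B,Q)| reduces to min(p(A,Q), p(B,Q)), as expected.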
7.3.4 Key Issues in Hierarchical Clustering
Lack of a Global Objective Function
We previously mentioned that agglomerative hierarchical clustering cannot be viewed as globally optimizing an objective function. Instead, agglomerative hierarchical clustering techniques use various criteria to decide locally, at each step, which clusters should be merged (or split for divisive approaches). This approach yields clustering algorithms that avoid the difficulty of attempting to solve a hard combinatorial optimization problem. (It can be shown that the general clustering problem for an objective function such as "minimize SSE" is computationally infeasible.) Furthermore, such approaches do not have difficulties in choosing initial points. Nonetheless, the time complexity of O(m² log m) and the space complexity of O(m²) are prohibitive in many cases.

Ability to Handle Different Cluster Sizes
One aspect of agglomerative hierarchical clustering that we have not yet discussed is how to treat the relative sizes of the pairs of clusters that are merged. (This discussion applies only to cluster proximity schemes that involve sums, such as centroid, Ward's, and group average.) There are two approaches: weighted, which treats all clusters equally, and unweighted, which takes the number of points in each cluster into account. Note that the terminology of weighted or unweighted refers to the data points, not the clusters. In other words, treating clusters of unequal size equally (the weighted approach) gives different weights to the points in different clusters, while taking the cluster size into account (the unweighted approach) gives points in different clusters the same weight.
We will illustrate this using the group average technique discussed in Section 7.3.2, which is the unweighted version of the group average technique. In the clustering literature, the full name of this approach is the Unweighted Pair Group Method using Arithmetic averages (UPGMA). In Table 7.5, which gives the formula for updating cluster similarity, the coefficients for UPGMA involve the size, m_A and m_B, of each of the clusters, A and B, that were merged: α_A = m_A/(m_A + m_B), α_B = m_B/(m_A + m_B), β = 0, γ = 0. For the weighted version of group average, known as WPGMA, the coefficients are constants that are independent of the cluster sizes: α_A = 1/2, α_B = 1/2, β = 0, γ = 0. In general, unweighted approaches are preferred unless there is reason to believe that individual points should have different weights; e.g., perhaps classes of objects have been unevenly sampled.

Merging Decisions Are Final
Agglomerative hierarchical clustering algorithms tend to make good local decisions about combining two clusters because they can use information about the pairwise similarity of all points. However, once a decision is made to merge two clusters, it cannot be undone at a later time. This approach prevents a local optimization criterion from becoming a global optimization criterion. For example, although the "minimize squared error" criterion from K-means is used in deciding which clusters to merge in Ward's method, the clusters at each level do not represent local minima with respect to the total SSE. Indeed, the clusters are not even stable, in the sense that a point in one cluster can be closer to the centroid of some other cluster than it is to the centroid of its current cluster. Nonetheless, Ward's method is often used as a robust method of initializing a K-means clustering, indicating that a local "minimize squared error" objective function does have a connection to a global "minimize squared error" objective function.
Therearesometechniquesthatattempttoovercomethelimitationthatmergesarefinal.Oneapproachattemptstofixupthehierarchicalclusteringbymovingbranchesofthetreearoundsoastoimproveaglobalobjectivefunction.AnotherapproachusesapartitionalclusteringtechniquesuchasK-meanstocreatemanysmallclusters,andthenperformshierarchicalclusteringusingthesesmallclustersasthestartingpoint.
7.3.5Outliers
OutliersposethemostseriousproblemsforWard’smethodandcentroid-basedhierarchicalclusteringapproachesbecausetheyincreaseSSEanddistortcentroids.Forclusteringapproaches,suchassinglelink,completelink,andgroupaverage,outliersarepotentiallylessproblematic.Ashierarchicalclusteringproceedsforthesealgorithms,outliersorsmallgroupsofoutlierstendtoformsingletonorsmallclustersthatdonotmergewithanyotherclustersuntilmuchlaterinthemergingprocess.Bydiscardingsingletonorsmallclustersthatarenotmergingwithotherclusters,outlierscanberemoved.
7.3.6StrengthsandWeaknesses
Thestrengthsandweaknessesofspecificagglomerativehierarchicalclusteringalgorithmswerediscussedabove.Moregenerally,suchalgorithmsaretypicallyusedbecausetheunderlyingapplication,e.g.,creationofataxonomy,requiresahierarchy.Also,somestudieshavesuggestedthatthesealgorithmscanproducebetter-qualityclusters.However,agglomerativehierarchicalclusteringalgorithmsareexpensiveintermsoftheircomputationalandstoragerequirements.Thefactthatallmergesarefinalcan
alsocausetroublefornoisy,high-dimensionaldata,suchasdocumentdata.Inturn,thesetwoproblemscanbeaddressedtosomedegreebyfirstpartiallyclusteringthedatausinganothertechnique,suchasK-means.
7.4DBSCANDensity-basedclusteringlocatesregionsofhighdensitythatareseparatedfromoneanotherbyregionsoflowdensity.DBSCANisasimpleandeffectivedensity-basedclusteringalgorithmthatillustratesanumberofimportantconceptsthatareimportantforanydensity-basedclusteringapproach.Inthissection,wefocussolelyonDBSCANafterfirstconsideringthekeynotionofdensity.Otheralgorithmsforfindingdensity-basedclustersaredescribedinthenextchapter.
7.4.1TraditionalDensity:Center-BasedApproach
Althoughtherearenotasmanyapproachesfordefiningdensityastherearefordefiningsimilarity,thereareseveraldistinctmethods.Inthissectionwediscussthecenter-basedapproachonwhichDBSCANisbased.OtherdefinitionsofdensitywillbepresentedinChapter8 .
Inthecenter-basedapproach,densityisestimatedforaparticularpointinthedatasetbycountingthenumberofpointswithinaspecifiedradius,Eps,ofthatpoint.Thisincludesthepointitself.ThistechniqueisgraphicallyillustratedbyFigure7.20 .ThenumberofpointswithinaradiusofEpsofpointAis7,includingAitself.
Figure7.20.Center-baseddensity.
Thismethodissimpletoimplement,butthedensityofanypointwilldependonthespecifiedradius.Forinstance,iftheradiusislargeenough,thenallpointswillhaveadensityofm,thenumberofpointsinthedataset.Likewise,iftheradiusistoosmall,thenallpointswillhaveadensityof1.Anapproachfordecidingontheappropriateradiusforlow-dimensionaldataisgiveninthenextsectioninthecontextofourdiscussionofDBSCAN.
ClassificationofPointsAccordingtoCenter-BasedDensityThecenter-basedapproachtodensityallowsustoclassifyapointasbeing(1)intheinteriorofadenseregion(acorepoint),(2)ontheedgeofadenseregion(aborderpoint),or(3)inasparselyoccupiedregion(anoiseorbackgroundpoint).Figure7.21 graphicallyillustratestheconceptsofcore,border,andnoisepointsusingacollectionoftwo-dimensionalpoints.Thefollowingtextprovidesamoreprecisedescription.
Figure7.21.Core,border,andnoisepoints.
Core points: These points are in the interior of a density-based cluster. A point is a core point if there are at least MinPts points within a distance of Eps of it, where MinPts and Eps are user-specified parameters. In Figure 7.21, point A is a core point for the indicated radius (Eps) if MinPts ≤ 7.
Borderpoints:Aborderpointisnotacorepoint,butfallswithintheneighborhoodofacorepoint.InFigure7.21 ,pointBisaborderpoint.Aborderpointcanfallwithintheneighborhoodsofseveralcorepoints.
Noisepoints:Anoisepointisanypointthatisneitheracorepointnoraborderpoint.InFigure7.21 ,pointCisanoisepoint.
7.4.2TheDBSCANAlgorithm
Given the previous definitions of core points, border points, and noise points, the DBSCAN algorithm can be informally described as follows. Any two core points that are close enough (within a distance Eps of one another) are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as the core point. (Ties need to be resolved if a border point is close to core points from different clusters.) Noise points are discarded. The formal details are given in Algorithm 7.5. This algorithm uses the same concepts and finds the same clusters as the original DBSCAN, but is optimized for simplicity, not efficiency.
Algorithm 7.5 DBSCAN algorithm.
1: Label all points as core, border, or noise points.
2: Eliminate noise points.
3: Put an edge between all core points within a distance Eps of each other.
4: Make each group of connected core points into a separate cluster.
5: Assign each border point to one of the clusters of its associated core points.

Time and Space Complexity
The basic time complexity of the DBSCAN algorithm is O(m × time to find points in the Eps-neighborhood), where m is the number of points. In the worst case, this complexity is O(m²). However, in low-dimensional spaces (especially 2D space), data structures such as kd-trees allow efficient retrieval of all points within a given distance of a specified point, and the time complexity can be as low as O(m log m) in the average case. The space requirement of DBSCAN, even for high-dimensional data, is O(m) because it is necessary to keep only a small amount of data for each point, i.e., the cluster label and the identification of each point as a core, border, or noise point.
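The sketch below mirrors the five steps of Algorithm 7.5, favoring clarity over efficiency; it assumes NumPy and SciPy are available, and the function name is ours. In practice, a library implementation such as scikit-learn's DBSCAN class is the usual choice.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dbscan(X, eps, min_pts):
    """Sketch of Algorithm 7.5.  Returns one label per point: -1 for noise,
    otherwise a cluster id."""
    D = squareform(pdist(X))                                  # all pairwise distances
    neighbors = [np.where(row <= eps)[0] for row in D]        # Eps-neighborhoods
    core = np.array([len(n) >= min_pts for n in neighbors])   # step 1 (point counts itself)

    labels = np.full(len(X), -1)                              # -1 = noise (step 2)
    cluster_id = 0
    for i in np.where(core)[0]:                               # steps 3-4: connect core points
        if labels[i] != -1:
            continue
        stack = [i]
        labels[i] = cluster_id
        while stack:                                          # flood fill over core-core edges
            j = stack.pop()
            for k in neighbors[j]:
                if core[k] and labels[k] == -1:
                    labels[k] = cluster_id
                    stack.append(k)
        cluster_id += 1

    for i in np.where(~core)[0]:                              # step 5: border points join the
        for k in neighbors[i]:                                # cluster of a nearby core point
            if core[k]:
                labels[i] = labels[k]
                break
    return labels
```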
Selection of DBSCAN Parameters
There is, of course, the issue of how to determine the parameters Eps and MinPts. The basic approach is to look at the behavior of the distance from a point to its kth nearest neighbor, which we will call the k-dist. For points that belong to some cluster, the value of k-dist will be small if k is not larger than the cluster size. Note that there will be some variation, depending on the density of the cluster and the random distribution of points, but on average, the range of variation will not be huge if the cluster densities are not radically different. However, for points that are not in a cluster, such as noise points, the k-dist will be relatively large. Therefore, if we compute the k-dist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of Eps. If we select this distance as the Eps parameter and take the value of k as the MinPts parameter, then points for which k-dist is less than Eps will be labeled as core points, while other points will be labeled as noise or border points.

Figure 7.22 shows a sample data set, while the k-dist graph for the data is given in Figure 7.23. The value of Eps that is determined in this way depends on k, but does not change dramatically as k changes. If the value of k is too small, then even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters. If the value of k is too large, then small clusters (of size less than k) are likely to be labeled as noise. The original DBSCAN algorithm used a value of k = 4, which appears to be a reasonable value for most two-dimensional data sets.
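A minimal sketch of the sorted k-dist curve follows, assuming scikit-learn and matplotlib are available; the function name is ours, and the knee of the resulting curve is read off the plot by eye, as in Figure 7.23.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_dist_plot(X, k=4):
    """Plot the sorted distance of every point to its kth nearest neighbor;
    the 'knee' of the curve is a candidate Eps (with MinPts = k)."""
    # k + 1 because each query point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    k_dist = np.sort(dist[:, -1])
    plt.plot(k_dist)
    plt.xlabel("Points sorted by %dth nearest neighbor distance" % k)
    plt.ylabel("%dth nearest neighbor distance" % k)
    plt.show()
```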
Figure7.22.Sampledata.
Figure7.23.K-distplotforsampledata.
ClustersofVaryingDensityDBSCANcanhavetroublewithdensityifthedensityofclustersvarieswidely.ConsiderFigure7.24 ,whichshowsfourclustersembeddedinnoise.Thedensityoftheclustersandnoiseregionsisindicatedbytheirdarkness.Thenoisearoundthepairofdenserclusters,AandB,hasthesamedensityasclustersCandD.ForafixedMinPts,iftheEpsthresholdischosensothatDBSCANfindsCandDasdistinctclusters,withthepointssurroundingthem
asnoise,thenAandBandthepointssurroundingthemwillbecomeasinglecluster.IftheEpsthresholdissetsothatDBSCANfindsAandBasseparateclusters,andthepointssurroundingthemaremarkedasnoise,thenC,D,andthepointssurroundingthemwillalsobemarkedasnoise.
Figure7.24.Fourclustersembeddedinnoise.
An Example
To illustrate the use of DBSCAN, we show the clusters that it finds in the relatively complicated two-dimensional data set shown in Figure 7.22. This data set consists of 3000 two-dimensional points. The Eps threshold for this data was found by plotting the sorted distances of the fourth nearest neighbor of each point (Figure 7.23) and identifying the value at which there is a sharp increase. We selected Eps = 10, which corresponds to the knee of the curve. The clusters found by DBSCAN using these parameters, i.e., Eps = 10 and MinPts = 4, are shown in Figure 7.25(a). The core points, border points, and noise points are displayed in Figure 7.25(b).
Figure7.25.DBSCANclusteringof3000two-dimensionalpoints.
7.4.3StrengthsandWeaknesses
BecauseDBSCANusesadensity-baseddefinitionofacluster,itisrelativelyresistanttonoiseandcanhandleclustersofarbitraryshapesandsizes.Thus,DBSCANcanfindmanyclustersthatcouldnotbefoundusingK-means,suchasthoseinFigure7.22 .Asindicatedpreviously,however,DBSCANhastroublewhentheclustershavewidelyvaryingdensities.Italsohastroublewithhigh-dimensionaldatabecausedensityismoredifficulttodefineforsuchdata.OnepossibleapproachtodealingwithsuchissuesisgiveninSection8.4.9 .Finally,DBSCANcanbeexpensivewhenthecomputationofnearestneighborsrequirescomputingallpairwiseproximities,asisusuallythecaseforhigh-dimensionaldata.
7.5ClusterEvaluationInsupervisedclassification,theevaluationoftheresultingclassificationmodelisanintegralpartoftheprocessofdevelopingaclassificationmodel,andtherearewell-acceptedevaluationmeasuresandprocedures,e.g.,accuracyandcross-validation,respectively.However,becauseofitsverynature,clusterevaluationisnotawell-developedorcommonlyusedpartofclusteranalysis.Nonetheless,clusterevaluation,orclustervalidationasitismoretraditionallycalled,isimportant,andthissectionwillreviewsomeofthemostcommonandeasilyappliedapproaches.
Theremightbesomeconfusionastowhyclusterevaluationisnecessary.Manytimes,clusteranalysisisconductedasapartofanexploratorydataanalysis.Hence,evaluationseemstobeanunnecessarilycomplicatedadditiontowhatissupposedtobeaninformalprocess.Furthermore,becausethereareanumberofdifferenttypesofclusters—insomesense,eachclusteringalgorithmdefinesitsowntypeofcluster—itcanseemthateachsituationmightrequireadifferentevaluationmeasure.Forinstance,K-meansclustersmightbeevaluatedintermsoftheSSE,butfordensity-basedclusters,whichneednotbeglobular,SSEwouldnotworkwellatall.
Nonetheless,clusterevaluationshouldbeapartofanyclusteranalysis.Akeymotivationisthatalmosteveryclusteringalgorithmwillfindclustersinadataset,evenifthatdatasethasnonaturalclusterstructure.Forinstance,considerFigure7.26 ,whichshowstheresultofclustering100pointsthatarerandomly(uniformly)distributedontheunitsquare.TheoriginalpointsareshowninFigure7.26(a) ,whiletheclustersfoundbyDBSCAN,K-means,andcompletelinkareshowninFigures7.26(b) ,7.26(c) ,and7.26(d) ,respectively.SinceDBSCANfoundthreeclusters(afterwesetEpsbylooking
atthedistancesofthefourthnearestneighbors),wesetK-meansandcompletelinktofindthreeclustersaswell.(InFigure7.26(b) thenoiseisshownbythesmallmarkers.)However,theclustersdonotlookcompellingforanyofthethreemethods.Inhigherdimensions,suchproblemscannotbesoeasilydetected.
Figure7.26.
Clusteringof100uniformlydistributedpoints.
7.5.1Overview
Beingabletodistinguishwhetherthereisnon-randomstructureinthedataisjustoneimportantaspectofclustervalidation.Thefollowingisalistofseveralimportantissuesforclustervalidation.
1. Determiningtheclusteringtendencyofasetofdata,i.e.,distinguishingwhethernon-randomstructureactuallyexistsinthedata.
2. Determiningthecorrectnumberofclusters.3. Evaluatinghowwelltheresultsofaclusteranalysisfitthedatawithout
referencetoexternalinformation.4. Comparingtheresultsofaclusteranalysistoexternallyknownresults,
suchasexternallyprovidedclasslabels.5. Comparingtwosetsofclusterstodeterminewhichisbetter.
Noticethatitems1,2,and3donotmakeuseofanyexternalinformation—theyareunsupervisedtechniques—whileitem4requiresexternalinformation.Item5canbeperformedineitherasupervisedoranunsupervisedmanner.Afurtherdistinctioncanbemadewithrespecttoitems3,4,and5:Dowewanttoevaluatetheentireclusteringorjustindividualclusters?
Whileitispossibletodevelopvariousnumericalmeasurestoassessthedifferentaspectsofclustervaliditymentionedabove,thereareanumberofchallenges.First,ameasureofclustervalidityissometimesquitelimitedinthescopeofitsapplicability.Forexample,mostworkonmeasuresofclusteringtendencyhasbeendonefortwo-orthree-dimensionalspatialdata.Second,weneedaframeworktointerpretanymeasure.Ifweobtainavalue
of10forameasurethatevaluateshowwellclusterlabelsmatchexternallyprovidedclasslabels,doesthisvaluerepresentagood,fair,orpoormatch?Thegoodnessofamatchoftencanbemeasuredbylookingatthestatisticaldistributionofthisvalue,i.e.,howlikelyitisthatsuchavalueoccursbychance.Finally,ifameasureistoocomplicatedtoapplyortounderstand,thenfewwilluseit.
Theevaluationmeasures,orindices,thatareappliedtojudgevariousaspectsofclustervalidityaretraditionallyclassifiedintothefollowingthreetypes.
Unsupervised.Measuresthegoodnessofaclusteringstructurewithoutrespecttoexternalinformation.AnexampleofthisistheSSE.Unsupervisedmeasuresofclustervalidityareoftenfurtherdividedintotwoclasses:measuresofclustercohesion(compactness,tightness),whichdeterminehowcloselyrelatedtheobjectsinaclusterare,andmeasuresofclusterseparation(isolation),whichdeterminehowdistinctorwell-separatedaclusterisfromotherclusters.Unsupervisedmeasuresareoftencalledinternalindicesbecausetheyuseonlyinformationpresentinthedataset.
Supervised.Measurestheextenttowhichtheclusteringstructurediscoveredbyaclusteringalgorithmmatchessomeexternalstructure.Anexampleofasupervisedindexisentropy,whichmeasureshowwellclusterlabelsmatchexternallysuppliedclasslabels.Supervisedmeasuresareoftencalledexternalindicesbecausetheyuseinformationnotpresentinthedataset.
Relative.Comparesdifferentclusteringsorclusters.Arelativeclusterevaluationmeasureisasupervisedorunsupervisedevaluationmeasurethatisusedforthepurposeofcomparison.Thus,relativemeasuresarenotactuallyaseparatetypeofclusterevaluationmeasure,butareinsteadaspecificuseofsuchmeasures.Asanexample,twoK-meansclusteringscanbecomparedusingeithertheSSEorentropy.
Intheremainderofthissection,weprovidespecificdetailsconcerningclustervalidity.Wefirstdescribetopicsrelatedtounsupervisedclusterevaluation,beginningwith(1)measuresbasedoncohesionandseparation,and(2)twotechniquesbasedontheproximitymatrix.Sincetheseapproachesareusefulonlyforpartitionalsetsofclusters,wealsodescribethepopularcopheneticcorrelationcoefficient,whichcanbeusedfortheunsupervisedevaluationofahierarchicalclustering.Weendourdiscussionofunsupervisedevaluationwithbriefdiscussionsaboutfindingthecorrectnumberofclustersandevaluatingclusteringtendency.Wethenconsidersupervisedapproachestoclustervalidity,suchasentropy,purity,andtheJaccardmeasure.Weconcludethissectionwithashortdiscussionofhowtointerpretthevaluesof(unsupervisedorsupervised)validitymeasures.
7.5.2UnsupervisedClusterEvaluationUsingCohesionandSeparation
Manyinternalmeasuresofclustervalidityforpartitionalclusteringschemesarebasedonthenotionsofcohesionorseparation.Inthissection,weuseclustervaliditymeasuresforprototype-andgraph-basedclusteringtechniquestoexplorethesenotionsinsomedetail.Intheprocess,wewillalsoseesomeinterestingrelationshipsbetweenprototype-andgraph-basedmeasures.
Ingeneral,wecanconsiderexpressingoverallclustervalidityforasetofKclustersasaweightedsumofthevalidityofindividualclusters,
overallvalidity=∑i=1Kwi validity(Ci). (7.8)
Thevalidityfunctioncanbecohesion,separation,orsomecombinationofthesequantities.Theweightswillvarydependingontheclustervaliditymeasure.Insomecases,theweightsaresimply1orthesizeofthecluster,whileinothercasestobediscussedabitlater,theyreflectamorecomplicatedpropertyofthecluster.
Graph-BasedViewofCohesionandSeparationFromagraph-basedview,thecohesionofaclustercanbedefinedasthesumoftheweightsofthelinksintheproximitygraphthatconnectpointswithinthecluster.SeeFigure7.27(a) .(Recallthattheproximitygraphhasdataobjectsasnodes,alinkbetweeneachpairofdataobjects,andaweightassignedtoeachlinkthatistheproximitybetweenthetwodataobjectsconnectedbythelink.)Likewise,theseparationbetweentwoclusterscanbemeasuredbythesumoftheweightsofthelinksfrompointsinoneclustertopointsintheothercluster.ThisisillustratedinFigure7.27(b) .
Figure7.27.Graph-basedviewofclustercohesionandseparation.
Most simply, the cohesion and separation for a graph-based cluster can be expressed using Equations 7.9 and 7.10, respectively. The proximity function can be a similarity or a dissimilarity. For similarity, as in Table 7.6, higher values are better for cohesion while lower values are better for separation. For dissimilarity, the opposite is true, i.e., lower values are better for cohesion while higher values are better for separation. More complicated approaches are possible but typically embody the basic ideas of Figures 7.27(a) and 7.27(b).

cohesion(C_i) = Σ_{x∈C_i, y∈C_i} proximity(x, y)    (7.9)

separation(C_i, C_j) = Σ_{x∈C_i, y∈C_j} proximity(x, y)    (7.10)

Prototype-Based View of Cohesion and Separation
For prototype-based clusters, the cohesion of a cluster can be defined as the sum of the proximities with respect to the prototype (centroid or medoid) of the cluster. Similarly, the separation between two clusters can be measured by the proximity of the two cluster prototypes. This is illustrated in Figure 7.28, where the centroid of a cluster is indicated by a "+".
Figure7.28.Prototype-basedviewofclustercohesionandseparation.
Cohesion for a prototype-based cluster is given in Equation 7.11, while two measures for separation are given in Equations 7.12 and 7.13, respectively, where c_i is the prototype (centroid) of cluster C_i and c is the overall prototype (centroid). There are two measures for separation because, as we will see in the next section, the separation of cluster prototypes from an overall prototype is sometimes directly related to the separation of cluster prototypes from one another. (This is true, for example, for Euclidean distance.) Note that Equation 7.11 is the cluster SSE if we let proximity be the squared Euclidean distance.

cohesion(C_i) = Σ_{x∈C_i} proximity(x, c_i)    (7.11)

separation(C_i, C_j) = proximity(c_i, c_j)    (7.12)

separation(C_i) = proximity(c_i, c)    (7.13)

Relationship between Prototype-Based Cohesion and Graph-Based Cohesion
While the graph-based and prototype-based approaches to measuring the cohesion and separation of a cluster seem distinct, for some proximity measures they are equivalent. For instance, for the SSE and points in Euclidean space, it can be shown (Equation 7.14) that the average pairwise distance between the points in a cluster is equivalent to the SSE of the cluster. See Exercise 27 on page 610.

Cluster SSE = Σ_{x∈C_i} dist(c_i, x)² = (1 / (2 m_i)) Σ_{x∈C_i} Σ_{y∈C_i} dist(x, y)²    (7.14)
Relationship of the Two Approaches to Prototype-Based Separation
When proximity is measured by Euclidean distance, the traditional measure of separation between clusters is the between group sum of squares (SSB), which is the sum of the squared distance of a cluster centroid, c_i, to the overall mean, c, of all the data points. The SSB is given by Equation 7.15, where c_i is the mean of the ith cluster and c is the overall mean. The higher the total SSB of a clustering, the more separated the clusters are from one another.

Total SSB = Σ_{i=1}^{K} m_i dist(c_i, c)²    (7.15)

It is straightforward to show that the total SSB is directly related to the pairwise distances between the centroids. In particular, if the cluster sizes are equal, i.e., m_i = m/K, then this relationship takes the simple form given by Equation 7.16. (See Exercise 28 on page 610.) It is this type of equivalence that motivates the definition of prototype separation in terms of both Equations 7.12 and 7.13.

Total SSB = (1 / 2K) Σ_{i=1}^{K} Σ_{j=1}^{K} (m / K) dist(c_i, c_j)²    (7.16)

Relationship between Cohesion and Separation
For some validity measures, there is also a strong relationship between cohesion and separation. Specifically, it is possible to show that the sum of the total SSE and the total SSB is a constant; i.e., that it is equal to the total sum of squares (TSS), which is the sum of squares of the distance of each point to the overall mean of the data. The importance of this result is that minimizing SSE (cohesion) is equivalent to maximizing SSB (separation).
We provide the proof of this fact below, since the approach illustrates techniques that are also applicable to proving the relationships stated in the last two sections. To simplify the notation, we assume that the data is one-dimensional, i.e., dist(x, y) = (x − y)². Also, we use the fact that the cross-term Σ_{i=1}^{K} Σ_{x∈C_i} (x − c_i)(c − c_i) is 0. (See Exercise 29 on page 610.)

TSS = Σ_{i=1}^{K} Σ_{x∈C_i} (x − c)²
    = Σ_{i=1}^{K} Σ_{x∈C_i} ((x − c_i) − (c − c_i))²
    = Σ_{i=1}^{K} Σ_{x∈C_i} (x − c_i)² − 2 Σ_{i=1}^{K} Σ_{x∈C_i} (x − c_i)(c − c_i) + Σ_{i=1}^{K} Σ_{x∈C_i} (c − c_i)²
    = Σ_{i=1}^{K} Σ_{x∈C_i} (x − c_i)² + Σ_{i=1}^{K} Σ_{x∈C_i} (c − c_i)²
    = Σ_{i=1}^{K} Σ_{x∈C_i} (x − c_i)² + Σ_{i=1}^{K} |C_i| (c − c_i)²
    = SSE + SSB

Relationship between Graph- and Centroid-Based Cohesion
It can also be shown that there is a relationship between graph- and centroid-based cohesion measures for Euclidean distance. For simplicity, we once again assume one-dimensional data. Recall that c_i = (1/m_i) Σ_{y∈C_i} y. Then, taking proximity to be the squared Euclidean distance and using Equation 7.14,

2 m_i cohesion(C_i) = 2 m_i Σ_{x∈C_i} proximity(x, c_i) = Σ_{x∈C_i} 2 m_i (x − c_i)² = Σ_{x∈C_i} Σ_{y∈C_i} (x − y)² = Σ_{x∈C_i, y∈C_i} proximity(x, y).

More generally, in cases where a centroid makes sense for the data, the simple graph- or centroid-based measures of cluster validity we presented are often related.
OverallMeasuresofCohesionandSeparationThepreviousdefinitionsofclustercohesionandseparationgaveussomesimpleandwell-definedmeasuresofindividualclustervaliditythatcanbecombinedintoanoverallmeasureofclustervaliditybyusingaweightedsum,asindicatedinEquation7.8 .However,weneedtodecidewhatweightstouse.Notsurprisingly,theweightsusedcanvarywidely.Often,butnotalways,theyareeitherafunctionofclustersizeor1,whichtreatsallclustersequallyregardlessofsize.
The CLUstering TOolkit (CLUTO) (see the Bibliographic Notes) uses the cluster evaluation measures described in Table 7.6, as well as some other evaluation measures not mentioned here. Only similarity measures are used: cosine, correlation, Jaccard, and the inverse of Euclidean distance. I1 is a measure of cohesion in terms of the pairwise similarity of objects in the cluster. I2 is a measure of cohesion that can be expressed either in terms of the sum of the similarities of objects in the cluster to the cluster centroid or in terms of the pairwise similarities of objects in the cluster. E1 is a measure of separation. It can be defined in terms of the similarity of a cluster centroid to the overall centroid or in terms of the pairwise similarities of objects in the cluster to objects in other clusters. (Although E1 is a measure of separation, the second definition shows that it also uses cluster cohesion, albeit in the cluster weight.) G1, which is a cluster validity measure based on both cohesion and separation, is the sum of the pairwise similarities of all objects in the cluster with all objects outside the cluster (the total weight of the edges of the similarity graph that must be cut to separate the cluster from all other clusters) divided by the sum of the pairwise similarities of objects in the cluster.

Table 7.6. Table of graph-based cluster evaluation measures.

Name    Cluster Measure    Cluster Weight    Type
I1    Σ_{x∈C_i, y∈C_i} sim(x, y)    1/m_i    graph-based cohesion
I2    Σ_{x∈C_i} sim(x, c_i)    1    prototype-based cohesion
I2    Σ_{x∈C_i, y∈C_i} sim(x, y)    1    prototype-based cohesion
E1    sim(c_i, c)    m_i    prototype-based separation
E1    Σ_{j=1}^{k} Σ_{x∈C_i, y∈C_j} sim(x, y)    m_i / (Σ_{x∈C_i, y∈C_i} sim(x, y))    graph-based separation
G1    Σ_{j=1, j≠i}^{k} Σ_{x∈C_i, y∈C_j} sim(x, y)    1 / (Σ_{x∈C_i, y∈C_i} sim(x, y))    graph-based separation and cohesion

Note that any unsupervised measure of cluster validity potentially can be used as an objective function for a clustering algorithm and vice versa. CLUTO takes this approach by using an algorithm that is similar to the incremental K-means algorithm discussed in Section 7.2.2. Specifically, each point is assigned to the cluster that produces the best value for the cluster evaluation function. The cluster evaluation measure I2 corresponds to traditional K-means and produces clusters that have good SSE values. The other measures produce clusters that are not as good with respect to SSE, but that are more optimal with respect to the specified cluster validity measure.

Evaluating Individual Clusters and Objects
So far, we have focused on using cohesion and separation in the overall evaluation of a group of clusters. Most of these measures of cluster validity also can be used to evaluate individual clusters and objects. For example, we can rank individual clusters according to their specific value of cluster validity, i.e., cluster cohesion or separation. A cluster that has a high value of cohesion
maybeconsideredbetterthanaclusterthathasalowervalue.Thisinformationoftencanbeusedtoimprovethequalityofaclustering.If,forexample,aclusterisnotverycohesive,thenwemaywanttosplititintoseveralsubclusters.Ontheotherhand,iftwoclustersarerelativelycohesive,butnotwellseparated,wemaywanttomergethemintoasinglecluster.
Wecanalsoevaluatetheobjectswithinaclusterintermsoftheircontributiontotheoverallcohesionorseparationofthecluster.Objectsthatcontributemoretothecohesionandseparationarenearthe“interior”ofthecluster.Thoseobjectsforwhichtheoppositeistrueareprobablynearthe“edge”ofthecluster.Inthefollowingsection,weconsideraclusterevaluationmeasurethatusesanapproachbasedontheseideastoevaluatepoints,clusters,andtheentiresetofclusters.
The Silhouette Coefficient
The popular method of silhouette coefficients combines both cohesion and separation. The silhouette coefficient of an individual point is computed in the following three steps. We use distances, but an analogous approach can be used for similarities.

1. For the ith object, calculate its average distance to all other objects in its cluster. Call this value a_i.
2. For the ith object and any cluster not containing the object, calculate the object's average distance to all the objects in the given cluster. Find the minimum such value with respect to all clusters; call this value b_i.
3. For the ith object, the silhouette coefficient is s_i = (b_i − a_i) / max(a_i, b_i).

The value of the silhouette coefficient can vary between −1 and 1. A negative value is undesirable because this corresponds to a case in which a_i, the average distance to points in the cluster, is greater than b_i, the minimum average distance to points in another cluster. We want the silhouette coefficient to be positive (a_i < b_i), and for a_i to be as close to 0 as possible, since the coefficient assumes its maximum value of 1 when a_i = 0.
Wecancomputetheaveragesilhouettecoefficientofaclusterbysimplytakingtheaverageofthesilhouettecoefficientsofpointsbelongingtothecluster.Anoverallmeasureofthegoodnessofaclusteringcanbeobtainedbycomputingtheaveragesilhouettecoefficientofallpoints.
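The three steps above translate directly into the following sketch, which assumes NumPy and SciPy, that labels is an integer NumPy array, and that every cluster has at least two points; the function name is ours. Library routines such as scikit-learn's silhouette_samples offer the same computation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette(X, labels):
    """Silhouette coefficient of every point, following the three steps above."""
    s = np.zeros(len(X))
    for i, x in enumerate(X):
        own = labels == labels[i]
        own[i] = False                                     # exclude the point itself
        a = cdist([x], X[own]).mean()                      # step 1: a_i
        b = min(cdist([x], X[labels == c]).mean()          # step 2: b_i
                for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)                         # step 3: s_i
    return s
```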
Example7.8(SilhouetteCoefficient).Figure7.29 showsaplotofthesilhouettecoefficientsforpointsin10clusters.Darkershadesindicatelowersilhouettecoefficients.
Figure7.29.Silhouettecoefficientsforpointsintenclusters.
7.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
Inthissection,weexamineacoupleofunsupervisedapproachesforassessingclustervaliditythatarebasedontheproximitymatrix.Thefirstcomparesanactualandidealizedproximitymatrix,whilethesecondusesvisualization.
GeneralCommentsonUnsupervisedClusterEvaluationMeasuresInadditiontothemeasurespresentedabove,alargenumberofothermeasureshavebeenproposedasunsupervisedclustervaliditymeasures.Almostallthesemeasures,includingthemeasurespresentedaboveareintendedforpartitional,center-basedclusters.Inparticular,noneofthemdoeswellforcontinuity-ordensity-basedclusters.Indeed,arecentevaluation—seeBibliographicNotes—ofadozensuchmeasuresfoundthatalthoughanumberofthemdidwellintermsofhandlingissuessuchasnoiseanddifferingsizesanddensity,noneofthemexceptarelativelyrecentlyproposedmeasure,ClusteringValidationindexbasedonNearestNeighbors(CVNN),didwellonhandlingarbitraryshapes.Thesilhouetteindex,however,didwellonallotherissuesexaminedexceptforthat.
Mostunsupervisedclusterevaluationmeasures,suchasthesilhouettecoefficient,incorporatebothcohesion(compactness)andseparation.WhenusedwithapartitionalclusteringalgorithmsuchasK-means,thesemeasureswilltendtodecreaseuntilthe“natural”setofclustersisfoundandstartincreasingonceclustersarebeingsplit“toofinely”sinceseparationwillsufferandcohesionwillnotimprovemuch.Thus,thesemeasurescanprovideawaytodeterminethenumberofclusters.However,ifthedefinitionofacluster
usedbytheclusteringalgorithm,differsfromthatoftheclusterevaluationmeasure,thenthesetofclustersidentifiedasoptimalbythealgorithmandvalidationmeasurecandiffer.Forinstance,CLUTOusesthemeasuresdescribedinTable7.6 todriveitsclustering,andthus,theclusteringproducedwillnotusuallymatchtheoptimalclustersaccordingtothesilhouettecoefficient.LikewiseforstandardK-meansandSSE.Inaddition,ifthereactuallyaresubclustersthatarenotseparatedverywellfromoneanother,thenmethodsthatincorporatebothmayprovideonlyacoarseviewoftheclusterstructureofthedata.Anotherconsiderationisthatwhenclusteringforsummarization,wearenotinterestedinthe“natural”clusterstructureofthedata,butratherwanttoachieveacertainlevelofapproximation,e.g.,wanttoreduceSSEtoacertainlevel.
Moregenerally,iftherearenottoomanyclusters,thenitcanbebettertoexamineclustercohesionandseparationindependently.Thiscangiveamorecomprehensiveviewofhowcohesiveeachclusterisandhowwelleachpairofclustersisseparatedfromoneanother.Forinstance,givenacentroidbasedclustering,wecouldcomputethepairwisesimilarityordistanceofthecentroids,i.e.,computethedistanceorsimilaritymatrixofthecentroids.Theapproachjustoutlinedissimilartolookingattheconfusionmatrixforaclassificationprobleminsteadofclassificationmeasures,suchasaccuracyortheF-measure.
MeasuringClusterValidityviaCorrelationIfwearegiventhesimilaritymatrixforadatasetandtheclusterlabelsfromaclusteranalysisofthedataset,thenwecanevaluatethe“goodness”oftheclusteringbylookingatthecorrelationbetweenthesimilaritymatrixandanidealversionofthesimilaritymatrixbasedontheclusterlabels.(Withminorchanges,thefollowingappliestoproximitymatrices,butforsimplicity,wediscussonlysimilaritymatrices.)Morespecifically,anidealclusterisone
whosepointshaveasimilarityof1toallpointsinthecluster,andasimilarityof0toallpointsinotherclusters.Thus,ifwesorttherowsandcolumnsofthesimilaritymatrixsothatallobjectsbelongingtothesameclusteraretogether,thenanidealclustersimilaritymatrixhasablockdiagonalstructure.Inotherwords,thesimilarityisnon-zero,i.e.,1,insidetheblocksofthesimilaritymatrixwhoseentriesrepresentintra-clustersimilarity,and0elsewhere.Theidealclustersimilaritymatrixisconstructedbycreatingamatrixthathasonerowandonecolumnforeachdatapoint—justlikeanactualsimilaritymatrix—andassigninga1toanentryiftheassociatedpairofpointsbelongstothesamecluster.Allotherentriesare0.
High correlation between the ideal and actual similarity matrices indicates that the points that belong to the same cluster are close to each other, while low correlation indicates the opposite. (Because the actual and ideal similarity matrices are symmetric, the correlation is calculated only among the n(n−1)/2 entries below or above the diagonal of the matrices.) Consequently, this is not a good measure for many density- or contiguity-based clusters, because they are not globular and can be closely intertwined with other clusters.

Example 7.9 (Correlation of Actual and Ideal Similarity Matrices). To illustrate this measure, we calculated the correlation between the ideal and actual similarity matrices for the K-means clusters shown in Figure 7.26(c) (random data) and Figure 7.30(a) (data with three well-separated clusters). The correlations were 0.5810 and 0.9235, respectively, which reflects the expected result that the clusters found by K-means in the random data are worse than the clusters found by K-means in data with well-separated clusters.
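A minimal sketch of this measure follows, assuming NumPy and SciPy and that labels is an integer NumPy array; the function name and the simple min-max transformation of distances into similarities are our choices for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def similarity_correlation(X, labels):
    """Correlation between the actual similarity matrix and the ideal
    block-diagonal matrix implied by the cluster labels (upper triangle only)."""
    d = squareform(pdist(X))
    sim = 1 - (d - d.min()) / (d.max() - d.min())                # distances -> similarities
    ideal = (labels[:, None] == labels[None, :]).astype(float)   # 1 iff same cluster
    iu = np.triu_indices(len(X), k=1)                            # the n(n-1)/2 entries above the diagonal
    return np.corrcoef(sim[iu], ideal[iu])[0, 1]
```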
Figure7.30.Similaritymatrixforwell-separatedclusters.
JudgingaClusteringVisuallybyItsSimilarityMatrixTheprevioustechniquesuggestsamoregeneral,qualitativeapproachtojudgingasetofclusters:Orderthesimilaritymatrixwithrespecttoclusterlabelsandthenplotit.Intheory,ifwehavewell-separatedclusters,thenthesimilaritymatrixshouldberoughlyblock-diagonal.Ifnot,thenthepatternsdisplayedinthesimilaritymatrixcanrevealtherelationshipsbetweenclusters.Again,allofthiscanbeappliedtodissimilaritymatrices,butforsimplicity,wewillonlydiscusssimilaritymatrices.
Example7.10(VisualizingaSimilarityMatrix).ConsiderthepointsinFigure7.30(a) ,whichformthreewell-separatedclusters.IfweuseK-meanstogroupthesepointsintothreeclusters,then
we should have no trouble finding these clusters because they are well-separated. The separation of these clusters is illustrated by the reordered similarity matrix shown in Figure 7.30(b). (For uniformity, we have transformed the distances into similarities using the formula s = 1 − (d − min_d) / (max_d − min_d).) Figure 7.31 shows the reordered similarity matrices for clusters found in the random data set of Figure 7.26 by DBSCAN, K-means, and complete link.

Figure 7.31. Similarity matrices for clusters from random data.

The well-separated clusters in Figure 7.30 show a very strong, block-diagonal pattern in the reordered similarity matrix. However, there are also weak block-diagonal patterns (see Figure 7.31) in the reordered similarity matrices of the clusterings found by K-means, DBSCAN, and complete link in the random data. Just as people can find patterns in clouds, data mining algorithms can find clusters in random data. While it is entertaining to find patterns in clouds, it is pointless and perhaps embarrassing to find clusters in noise.

This approach may seem hopelessly expensive for large data sets, since the computation of the proximity matrix takes O(m²) time, where m is the number
ofobjects,butwithsampling,thismethodcanstillbeused.Wecantakeasampleofdatapointsfromeachcluster,computethesimilaritybetweenthesepoints,andplottheresult.Itissometimesnecessarytooversamplesmallclustersandundersamplelargeonestoobtainanadequaterepresentationofallclusters.
7.5.4UnsupervisedEvaluationofHierarchicalClustering
Thepreviousapproachestoclusterevaluationareintendedforpartitionalclusterings.Herewediscussthecopheneticcorrelation,apopularevaluationmeasureforhierarchicalclusterings.Thecopheneticdistancebetweentwoobjectsistheproximityatwhichanagglomerativehierarchicalclusteringtechniqueputstheobjectsinthesameclusterforthefirsttime.Forexample,ifatsomepointintheagglomerativehierarchicalclusteringprocess,thesmallestdistancebetweenthetwoclustersthataremergedis0.1,thenallpointsinoneclusterhaveacopheneticdistanceof0.1withrespecttothepointsintheothercluster.Inacopheneticdistancematrix,theentriesarethecopheneticdistancesbetweeneachpairofobjects.Thecopheneticdistanceisdifferentforeachhierarchicalclusteringofasetofpoints.
Example7.11(CopheneticDistanceMatrix).Table7.7 showsthecopheneticdistancematrixforthesinglelinkclusteringshowninFigure7.16 .(Thedataforthisfigureconsistsofthesixtwo-dimensionalpointsgiveninTable2.14 .)
Table7.7.CopheneticdistancematrixforsinglelinkanddatainTable
2.14 onpage90.
Point P1 P2 P3 P4 P5 P6
P1 0 0.222 0.222 0.222 0.222 0.222
P2 0.222 0 0.148 0.151 0.139 0.148
P3 0.222 0.148 0 0.151 0.148 0.110
P4 0.222 0.151 0.151 0 0.151 0.151
P5 0.222 0.139 0.148 0.151 0 0.148
P6 0.222 0.148 0.110 0.151 0.148 0
TheCopheneticCorrelationCoefficient(CPCC)isthecorrelationbetweentheentriesofthismatrixandtheoriginaldissimilaritymatrixandisastandardmeasureofhowwellahierarchicalclustering(ofaparticulartype)fitsthedata.Oneofthemostcommonusesofthismeasureistoevaluatewhichtypeofhierarchicalclusteringisbestforaparticulartypeofdata.
Example7.12(CopheneticCorrelationCoefficient).WecalculatedtheCPCCforthehierarchicalclusteringsshowninFigures7.16 –7.19 .ThesevaluesareshowninTable7.8 .Thehierarchicalclusteringproducedbythesinglelinktechniqueseemstofitthedatalesswellthantheclusteringsproducedbycompletelink,groupaverage,andWard’smethod.
Table7.8.CopheneticcorrelationcoefficientfordataofTable2.14andfouragglomerativehierarchicalclusteringtechniques.
Technique CPCC
SingleLink 0.44
CompleteLink 0.63
GroupAverage 0.66
Ward’s 0.64
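The CPCC computation of Example 7.12 can be reproduced with SciPy's cophenet routine, as sketched below; the availability of SciPy is an assumption, and small differences from Table 7.8 may arise from rounding of the input coordinates.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# The six two-dimensional points used in the hierarchical clustering examples.
points = np.array([[0.4005, 0.5306], [0.2148, 0.3854], [0.3457, 0.3156],
                   [0.2652, 0.1875], [0.0789, 0.4139], [0.4548, 0.3022]])
d = pdist(points)                                    # original dissimilarities

for method in ["single", "complete", "average", "ward"]:
    cpcc, coph_dists = cophenet(linkage(d, method=method), d)
    print(method, round(cpcc, 2))                    # compare with Table 7.8
```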
7.5.5DeterminingtheCorrectNumberofClusters
Variousunsupervisedclusterevaluationmeasurescanbeusedtoapproximatelydeterminethecorrectornaturalnumberofclusters.
Example7.13(NumberofClusters).ThedatasetofFigure7.29 has10naturalclusters.Figure7.32showsaplotoftheSSEversusthenumberofclustersfora(bisecting)K-meansclusteringofthedataset,whileFigure7.33 showstheaveragesilhouettecoefficientversusthenumberofclustersforthesamedata.ThereisadistinctkneeintheSSEandadistinctpeakinthesilhouettecoefficientwhenthenumberofclustersisequalto10.
Figure7.32.SSEversusnumberofclustersforthedataofFigure7.29 onpage582.
Figure7.33.AveragesilhouettecoefficientversusnumberofclustersforthedataofFigure7.29 .
Thus,wecantrytofindthenaturalnumberofclustersinadatasetbylookingforthenumberofclustersatwhichthereisaknee,peak,ordipintheplotoftheevaluationmeasurewhenitisplottedagainstthenumberofclusters.Ofcourse,suchanapproachdoesnotalwaysworkwell.Clusterscanbe
considerablymoreintertwinedoroverlappingthanthoseshowninFigure7.29 .Also,thedatacanconsistofnestedclusters.Actually,theclustersinFigure7.29 aresomewhatnested;i.e.,therearefivepairsofclusterssincetheclustersareclosertoptobottomthantheyarelefttoright.ThereisakneethatindicatesthisintheSSEcurve,butthesilhouettecoefficientcurveisnotasclear.Insummary,whilecautionisneeded,thetechniquewehavejustdescribedcanprovideinsightintothenumberofclustersinthedata.
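The following sketch, using scikit-learn on synthetic data with 10 well-separated groups, shows how such a plot can be produced; the data set and the range of K are assumptions made for illustration only.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=10, random_state=0)   # 10 "natural" clusters
sse, sil = [], []
for k in range(2, 21):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)                       # SSE of the clustering
    sil.append(silhouette_score(X, km.labels_))   # average silhouette coefficient
# A knee in sse and a peak in sil near K = 10 suggest the natural number of clusters.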
7.5.6ClusteringTendency
Oneobviouswaytodetermineifadatasethasclustersistotrytoclusterit.However,almostallclusteringalgorithmswilldutifullyfindclusterswhengivendata.Toaddressthisissue,wecouldevaluatetheresultingclustersandonlyclaimthatadatasethasclustersifatleastsomeoftheclustersareofgoodquality.However,thisapproachdoesnotaddressthefacttheclustersinthedatacanbeofadifferenttypethanthosesoughtbyourclusteringalgorithm.Tohandlethisadditionalproblem,wecouldusemultiplealgorithmsandagainevaluatethequalityoftheresultingclusters.Iftheclustersareuniformlypoor,thenthismayindeedindicatethattherearenoclustersinthedata.
Alternatively,andthisisthefocusofmeasuresofclusteringtendency,wecantrytoevaluatewhetheradatasethasclusterswithoutclustering.Themostcommonapproach,especiallyfordatainEuclideanspace,hasbeentousestatisticaltestsforspatialrandomness.Unfortunately,choosingthecorrectmodel,estimatingtheparameters,andevaluatingthestatisticalsignificanceofthehypothesisthatthedataisnon-randomcanbequitechallenging.Nonetheless,manyapproacheshavebeendeveloped,mostofthemforpointsinlow-dimensionalEuclideanspace.
Example 7.14 (Hopkins Statistic). For this approach, we generate p points that are randomly distributed across the data space and also sample p actual data points. For both sets of points, we find the distance to the nearest neighbor in the original data set. Let the u_i be the nearest neighbor distances of the artificially generated points, while the w_i are the nearest neighbor distances of the sample of points from the original data set. The Hopkins statistic H is then defined by Equation 7.17:

H = (∑_{i=1}^{p} w_i) / (∑_{i=1}^{p} u_i + ∑_{i=1}^{p} w_i)   (7.17)

If the randomly generated points and the sample of data points have roughly the same nearest neighbor distances, then H will be near 0.5. Values of H near 0 and 1 indicate, respectively, data that is highly clustered and data that is regularly distributed in the data space. To give an example, the Hopkins statistic for the data of Figure 7.26 was computed for p = 20 and 100 different trials. The average value of H was 0.56 with a standard deviation of 0.03. The same experiment was performed for the well-separated points of Figure 7.30. The average value of H was 0.95 with a standard deviation of 0.006.
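A minimal sketch of this computation, assuming a numeric data matrix X and using scikit-learn's nearest neighbor search, might look as follows; the function name and default values are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, p=20, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # u_i: nearest neighbor distances of p points generated uniformly in the data space
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(p, X.shape[1]))
    u = nn.kneighbors(U, n_neighbors=1)[0].ravel()
    # w_i: nearest neighbor distances of p sampled data points (excluding the point itself)
    S = X[rng.choice(len(X), size=p, replace=False)]
    w = nn.kneighbors(S, n_neighbors=2)[0][:, 1]
    return w.sum() / (u.sum() + w.sum())          # H, as in Equation 7.17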
7.5.7SupervisedMeasuresofClusterValidity
Whenwehaveexternalinformationaboutdata,itistypicallyintheformofexternallyderivedclasslabelsforthedataobjects.Insuchcases,theusualprocedureistomeasurethedegreeofcorrespondencebetweenthecluster
labelsandtheclasslabels.Butwhyisthisofinterest?Afterall,ifwehavetheclasslabels,thenwhatisthepointinperformingaclusteranalysis?Motivationsforsuchananalysisincludethecomparisonofclusteringtechniqueswiththe“groundtruth”ortheevaluationoftheextenttowhichamanualclassificationprocesscanbeautomaticallyproducedbyclusteranalysis,e.g.,theclusteringofnewsarticles.Anotherpotentialmotivationcouldbetoevaluatewhetherobjectsinthesameclustertendtohavethesamelabelforsemi-supervisedlearningtechniques.
Weconsidertwodifferentkindsofapproaches.Thefirstsetoftechniquesusemeasuresfromclassification,suchasentropy,purity,andtheF-measure.Thesemeasuresevaluatetheextenttowhichaclustercontainsobjectsofasingleclass.Thesecondgroupofmethodsisrelatedtothesimilaritymeasuresforbinarydata,suchastheJaccardmeasurethatwesawinChapter2 .Theseapproachesmeasuretheextenttowhichtwoobjectsthatareinthesameclassareinthesameclusterandviceversa.Forconvenience,wewillrefertothesetwotypesofmeasuresasclassification-orientedandsimilarity-oriented,respectively.
Classification-OrientedMeasuresofClusterValidityThereareanumberofmeasuresthatarecommonlyusedtoevaluatetheperformanceofaclassificationmodel.Inthissection,wewilldiscussfive:entropy,purity,precision,recall,andtheF-measure.Inthecaseofclassification,wemeasurethedegreetowhichpredictedclasslabelscorrespondtoactualclasslabels,butforthemeasuresjustmentioned,nothingfundamentalischangedbyusingclusterlabelsinsteadofpredictedclasslabels.Next,wequicklyreviewthedefinitionsofthesemeasuresinthecontextofclustering.
Entropy: The degree to which each cluster consists of objects of a single class. For each cluster, the class distribution of the data is calculated first, i.e., for cluster i we compute p_ij, the probability that a member of cluster i belongs to class j, as p_ij = m_ij / m_i, where m_i is the number of objects in cluster i and m_ij is the number of objects of class j in cluster i. Using this class distribution, the entropy of each cluster i is calculated using the standard formula, e_i = −∑_{j=1}^{L} p_ij log_2 p_ij, where L is the number of classes. The total entropy for a set of clusters is calculated as the sum of the entropies of each cluster weighted by the size of each cluster, i.e., e = ∑_{i=1}^{K} (m_i / m) e_i, where K is the number of clusters and m is the total number of data points.
Purity: Another measure of the extent to which a cluster contains objects of a single class. Using the previous terminology, the purity of cluster i is purity(i) = max_j p_ij, and the overall purity of a clustering is purity = ∑_{i=1}^{K} (m_i / m) purity(i).
Precision: The fraction of a cluster that consists of objects of a specified class. The precision of cluster i with respect to class j is precision(i, j) = p_ij.
Recall: The extent to which a cluster contains all objects of a specified class. The recall of cluster i with respect to class j is recall(i, j) = m_ij / m_j, where m_j is the number of objects in class j.
F-measure: A combination of both precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class. The F-measure of cluster i with respect to class j is F(i, j) = (2 × precision(i, j) × recall(i, j)) / (precision(i, j) + recall(i, j)).
TheF-measureofasetofclusters,partitionalorhierarchicalispresentedonpage594whenwediscussclustervalidityforhierarchicalclusterings.
Example7.15(SupervisedEvaluationMeasures).Wepresentanexampletoillustratethesemeasures.Specifically,weuseK-meanswiththecosinesimilaritymeasuretocluster3204newspaper
articlesfromtheLosAngelesTimes.Thesearticlescomefromsixdifferentclasses:Entertainment,Financial,Foreign,Metro,National,andSports.Table7.9 showstheresultsofaK-meansclusteringtofindsixclusters.Thefirstcolumnindicatesthecluster,whilethenextsixcolumnstogetherformtheconfusionmatrix;i.e.,thesecolumnsindicatehowthedocumentsofeachcategoryaredistributedamongtheclusters.Thelasttwocolumnsaretheentropyandpurityofeachcluster,respectively.
Table7.9.K-meansclusteringresultsfortheLATimesdocumentdataset.
Ideally,eachclusterwillcontaindocumentsfromonlyoneclass.Inreality,eachclustercontainsdocumentsfrommanyclasses.Nevertheless,manyclusterscontaindocumentsprimarilyfromjustoneclass.Inparticular,cluster3,whichcontainsmostlydocumentsfromtheSportssection,isexceptionallygood,bothintermsofpurityandentropy.Thepurityandentropyoftheotherclustersisnotasgood,butcantypicallybegreatlyimprovedifthedataispartitionedintoalargernumberofclusters.
Cluster Entertainment Financial Foreign Metro National Sports Entropy Purity
1 3 5 40 506 96 27 1.2270 0.7474
2 4 7 280 29 39 2 1.1472 0.7756
3 1 1 1 7 4 671 0.1813 0.9796
4 10 162 3 119 73 2 1.7487 0.4390
5 331 22 5 70 13 23 1.3976 0.7134
6 5 358 12 212 48 13 1.5523 0.5525
Total 354 555 341 943 273 738 1.1450 0.7203
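The measures defined above can be computed directly from the confusion matrix of Table 7.9; the short NumPy sketch below is one way to do so (rows are clusters, columns are classes).

import numpy as np

M = np.array([[  3,   5,  40, 506,  96,  27],
              [  4,   7, 280,  29,  39,   2],
              [  1,   1,   1,   7,   4, 671],
              [ 10, 162,   3, 119,  73,   2],
              [331,  22,   5,  70,  13,  23],
              [  5, 358,  12, 212,  48,  13]], dtype=float)   # m_ij from Table 7.9

m_i = M.sum(axis=1)                    # cluster sizes
m_j = M.sum(axis=0)                    # class sizes
p = M / m_i[:, None]                   # p_ij = m_ij / m_i
with np.errstate(divide="ignore", invalid="ignore"):
    e_i = -np.nansum(p * np.log2(p), axis=1)              # entropy of each cluster
    precision = p                                          # precision(i, j)
    recall = M / m_j                                       # recall(i, j)
    F = 2 * precision * recall / (precision + recall)      # F(i, j)
purity_i = p.max(axis=1)                                   # purity of each cluster
entropy_total = np.sum(m_i / m_i.sum() * e_i)              # size-weighted total entropy
purity_total = np.sum(m_i / m_i.sum() * purity_i)          # size-weighted total purity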
Precision, recall, and the F-measure can be calculated for each cluster. To give a concrete example, we consider cluster 1 and the Metro class of Table 7.9. The precision is 506/677 = 0.75, the recall is 506/943 = 0.26, and hence, the F value is 0.39. In contrast, the F value for cluster 3 and Sports is 0.94. As in classification, the confusion matrix gives the most detailed information.
Similarity-Oriented Measures of Cluster Validity The measures that we discuss in this section are all based on the premise that any two objects that are in the same cluster should be in the same class and vice versa. We can view this approach to cluster validity as involving the comparison of two matrices: (1) the ideal cluster similarity matrix discussed previously, which has a 1 in the ij-th entry if two objects, i and j, are in the same cluster and 0 otherwise, and (2) a class similarity matrix defined with respect to class labels, which has a 1 in the ij-th entry if two objects, i and j, belong to the same class, and a 0 otherwise. As before, we can take the correlation of these two matrices as the measure of cluster validity. This measure is known as Hubert's Γ statistic in the clustering validation literature.
Example 7.16 (Correlation between Cluster and Class Matrices). To demonstrate this idea more concretely, we give an example involving five data points, p1, p2, p3, p4, and p5, two clusters, C1 = {p1, p2, p3} and C2 = {p4, p5}, and two classes, L1 = {p1, p2} and L2 = {p3, p4, p5}. The ideal cluster and class similarity matrices are given in Tables 7.10 and 7.11. The correlation between the entries of these two matrices is 0.359.
Table7.10.Idealclustersimilaritymatrix.
Point p1 p2 p3 p4 p5
p1 1 1 1 0 0
p2 1 1 1 0 0
p3 1 1 1 0 0
p4 0 0 0 1 1
p5 0 0 0 1 1
Table7.11.Classsimilaritymatrix.
Point p1 p2 p3 p4 p5
p1 1 1 0 0 0
p2 1 1 0 0 0
p3 0 0 1 1 1
p4 0 0 1 1 1
p5 0 0 1 1 1
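For this small example, the correlation can be computed directly; the sketch below builds both matrices from the label vectors and, following the example, correlates all entries of the two matrices (including the diagonal).

import numpy as np

cluster = np.array([1, 1, 1, 2, 2])     # C1 = {p1, p2, p3}, C2 = {p4, p5}
label   = np.array([1, 1, 2, 2, 2])     # L1 = {p1, p2},     L2 = {p3, p4, p5}

S_cluster = (cluster[:, None] == cluster[None, :]).astype(float)   # Table 7.10
S_class   = (label[:, None] == label[None, :]).astype(float)       # Table 7.11

gamma = np.corrcoef(S_cluster.ravel(), S_class.ravel())[0, 1]
print(round(gamma, 3))                  # approximately 0.359, as in the example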
More generally, we can use any of the measures for binary similarity that we saw in Section 2.4.5. (For example, we can convert these two matrices into binary vectors by appending the rows.) We repeat the definitions of the four quantities used to define those similarity measures, but modify our descriptive text to fit the current context. Specifically, we need to compute the following four quantities for all pairs of distinct objects. (There are m(m−1)/2 such pairs, if m is the number of objects.)
f00 = number of pairs of objects having a different class and a different cluster
f01 = number of pairs of objects having a different class and the same cluster
f10 = number of pairs of objects having the same class and a different cluster
f11 = number of pairs of objects having the same class and the same cluster
In particular, the simple matching coefficient, which is known as the Rand statistic in this context, and the Jaccard coefficient are two of the most frequently used cluster validity measures:

Rand statistic = (f00 + f11) / (f00 + f01 + f10 + f11)   (7.18)

Jaccard coefficient = f11 / (f01 + f10 + f11)   (7.19)

Example 7.17 (Rand and Jaccard Measures). Based on these formulas, we can readily compute the Rand statistic and Jaccard coefficient for the example based on Tables 7.10 and 7.11. Noting that f00 = 4, f01 = 2, f10 = 2, and f11 = 2, the Rand statistic = (2 + 4)/10 = 0.6 and the Jaccard coefficient = 2/(2 + 2 + 2) = 0.33.
We also note that the four quantities, f00, f01, f10, and f11, define a contingency table as shown in Table 7.12.
Table 7.12. Two-way contingency table for determining whether pairs of objects are in the same class and same cluster.
                  Same Cluster   Different Cluster
Same Class        f11            f10
Different Class   f01            f00
Previously,inthecontextofassociationanalysis—seeSection5.7.1 onpage402—wepresentedanextensivediscussionofmeasuresofassociationthatcanbeusedforthistypeofcontingencytable.(CompareTable7.12 onpage593withTable5.6 onpage402.)Thosemeasurescanalsobeappliedtoclustervalidity.
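As a small sketch, the four quantities and the two measures of Equations 7.18 and 7.19 can be computed from the cluster and class labels of Example 7.17 as follows.

from itertools import combinations

cluster = [1, 1, 1, 2, 2]     # cluster labels of p1, ..., p5
label   = [1, 1, 2, 2, 2]     # class labels of p1, ..., p5

f = {"00": 0, "01": 0, "10": 0, "11": 0}
for i, j in combinations(range(len(cluster)), 2):
    same_class = int(label[i] == label[j])
    same_cluster = int(cluster[i] == cluster[j])
    f[f"{same_class}{same_cluster}"] += 1      # first index: class, second: cluster

rand = (f["00"] + f["11"]) / sum(f.values())         # (4 + 2) / 10 = 0.6
jaccard = f["11"] / (f["01"] + f["10"] + f["11"])    # 2 / (2 + 2 + 2) = 0.33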
Cluster Validity for Hierarchical Clusterings So far in this section, we have discussed supervised measures of cluster validity only for partitional clusterings. Supervised evaluation of a hierarchical clustering is more difficult for a variety of reasons, including the fact that a preexisting hierarchical structure often does not exist. In addition, although relatively pure clusters often exist at certain levels in the hierarchical clustering, as the clustering proceeds, the clusters will become impure. The key idea of the approach presented here, which is based on the F-measure, is to evaluate whether a hierarchical clustering contains, for each class, at least one cluster that is relatively pure and includes most of the objects of that class. To evaluate a hierarchical clustering with respect to this goal, we compute, for each class, the F-measure for each cluster in the cluster hierarchy, and then take the maximum F-measure attained for any cluster. Finally, we calculate an overall F-measure for the hierarchical clustering by computing the weighted average of all per-class F-measures, where the weights are based on the class sizes. More formally, this hierarchical F-measure is defined as follows:

F = ∑_j (m_j / m) max_i F(i, j)

where the maximum is taken over all clusters i at all levels, m_j is the number of objects in class j, and m is the total number of objects. Note that this measure can also be applied for a partitional clustering without modification.
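A minimal sketch of this computation is given below, assuming the hierarchy is represented simply as a list of clusters from all levels, each cluster a set of object indices; the function name and representation are illustrative assumptions.

def hierarchical_f_measure(clusters, class_of, classes):
    # clusters: list of sets of object indices (all clusters, from every level)
    # class_of: class label of each object; classes: the distinct class labels
    m = len(class_of)
    total = 0.0
    for c in classes:
        members = {i for i, lab in enumerate(class_of) if lab == c}
        best = 0.0
        for cl in clusters:
            n_ij = len(cl & members)
            if n_ij == 0:
                continue
            prec, rec = n_ij / len(cl), n_ij / len(members)
            best = max(best, 2 * prec * rec / (prec + rec))   # F(i, j)
        total += len(members) / m * best                      # weight by class size m_j / m
    return total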
7.5.8AssessingtheSignificanceofClusterValidityMeasures
Clustervaliditymeasuresareintendedtohelpusmeasurethegoodnessoftheclustersthatwehaveobtained.Indeed,theytypicallygiveusasinglenumberasameasureofthatgoodness.However,wearethenfacedwiththeproblemofinterpretingthesignificanceofthisnumber,ataskthatmightbeevenmoredifficult.
Theminimumandmaximumvaluesofclusterevaluationmeasurescanprovidesomeguidanceinmanycases.Forinstance,bydefinition,apurityof0isbad,whileapurityof1isgood,atleastifwetrustourclasslabelsandwantourclusterstructuretoreflecttheclassstructure.Likewise,anentropyof0isgood,asisanSSEof0.
Sometimes,however,thereisnominimumormaximumvalue,orthescaleofthedatamightaffecttheinterpretation.Also,evenifthereareminimumandmaximumvalueswithobviousinterpretations,intermediatevaluesstillneedtobeinterpreted.Insomecases,wecanuseanabsolutestandard.If,forexample,weareclusteringforutility,weareoftenwillingtotolerateonlyacertainleveloferrorintheapproximationofourpointsbyaclustercentroid.
Butifthisisnotthecase,thenwemustdosomethingelse.Acommonapproachistointerpretthevalueofourvaliditymeasureinstatisticalterms.Specifically,weattempttojudgehowlikelyitisthatourobservedvaluewasachievedbyrandomchance.Thevalueisgoodifitisunusual;i.e.,ifitisunlikelytobetheresultofrandomchance.Themotivationforthisapproachisthatweareinterestedonlyinclustersthatreflectnon-randomstructureinthedata,andsuchstructuresshouldgenerateunusuallyhigh(low)valuesofourclustervaliditymeasure,atleastifthevaliditymeasuresaredesignedtoreflectthepresenceofstrongclusterstructure.
Example7.18(SignificanceofSSE).
Toshowhowthisworks,wepresentanexamplebasedonK-meansandtheSSE.Supposethatwewantameasureofhowgoodthewell-separatedclustersofFigure7.30 arewithrespecttorandomdata.Wegeneratemanyrandomsetsof100pointshavingthesamerangeasthepointsinthethreeclusters,findthreeclustersineachdatasetusingK-means,andaccumulatethedistributionofSSEvaluesfortheseclusterings.ByusingthisdistributionoftheSSEvalues,wecanthenestimatetheprobabilityoftheSSEvaluefortheoriginalclusters.Figure7.34 showsthehistogramoftheSSEfrom500randomruns.ThelowestSSEshowninFigure7.34 is0.0173.ForthethreeclustersofFigure7.30 ,theSSEis0.0050.Wecouldthereforeconservativelyclaimthatthereislessthana1%chancethataclusteringsuchasthatofFigure7.30 couldoccurbychance.
Figure7.34.HistogramofSSEfor500randomdatasets.
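The randomization procedure of Example 7.18 can be sketched as follows, assuming a numeric data matrix X and using scikit-learn's K-means; the number of trials and other parameter values are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def sse_reference_distribution(X, k, n_trials=500, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    sses = []
    for _ in range(n_trials):
        R = rng.uniform(lo, hi, size=X.shape)    # random data with the same range as X
        sses.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(R).inertia_)
    return np.array(sses)

# actual = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).inertia_
# An estimate of the chance of doing this well by random chance:
# np.mean(sse_reference_distribution(X, 3) <= actual)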
Inthepreviousexample,randomizationwasusedtoevaluatethestatisticalsignificanceofaclustervaliditymeasure.However,forsomemeasures,such
asHubert’sΓstatistic,thedistributionisknownandcanbeusedtoevaluatethemeasure.Inaddition,anormalizedversionofthemeasurecanbecomputedbysubtractingthemeananddividingbythestandarddeviation.SeeBibliographicNotesformoredetails.
Westressthatthereismoretoclusterevaluation(unsupervisedorsupervised)thanobtaininganumericalmeasureofclustervalidity.Unlessthisvaluehasanaturalinterpretationbasedonthedefinitionofthemeasure,weneedtointerpretthisvalueinsomeway.Ifourclusterevaluationmeasureisdefinedsuchthatlower(higher)valuesindicatestrongerclusters,thenwecanusestatisticstoevaluatewhetherthevaluewehaveobtainedisunusuallylow(high),providedwehaveadistributionfortheevaluationmeasure.Wehavepresentedanexampleofhowtofindsuchadistribution,butthereisconsiderablymoretothistopic,andwereferthereadertotheBibliographicNotesformorepointers.
Finally,evenwhenanevaluationmeasureisusedasarelativemeasure,i.e.,tocomparetwoclusterings,westillneedtoassessthesignificanceinthedifferencebetweentheevaluationmeasuresofthetwoclusterings.Althoughonevaluewillalmostalwaysbebetterthananother,itcanbedifficulttodetermineifthedifferenceissignificant.Notethattherearetwoaspectstothissignificance:whetherthedifferenceisstatisticallysignificant(repeatable)andwhetherthemagnitudeofthedifferenceismeaningfulwithrespecttotheapplication.Manywouldnotregardadifferenceof0.1%assignificant,evenifitisconsistentlyreproducible.
7.5.9ChoosingaClusterValidityMeasure
Therearemanymoremeasuresforevaluatingclustervaliditythanhavebeendiscussedhere.Variousbooksandarticlesproposevariousmeasuresasbeingsuperiortoothers.Inthissection,weoffersomehigh-levelguidance.First,itisimportanttodistinguishwhethertheclusteringisbeingperformedforsummarizationorunderstanding.Ifsummarization,typicallyclasslabelsarenotinvolvedandthegoalismaximumcompression.Thisisoftendonebyfindingclustersthatminimizethedistanceofobjectstotheirclosestclustercentroid.Indeed,theclusteringprocessisoftendrivenbytheobjectiveofminimizingrepresentationerror.MeasuressuchasSSEaremorenaturalforthisapplication.
Iftheclusteringisbeingperformedforunderstanding,thenthesituationismorecomplicated.Fortheunsupervisedcase,virtuallyallmeasurestrytomaximizecohesionandseparation.Somemeasuresobtaina“best”valueforaparticularvalueofK,thenumberofclusters.Althoughthismightseemappealing,suchmeasurestypicallyonlyidentifyone“right”numberofclusters,evenwhensubclustersarepresent.(RecallthatcohesionandseparationcontinuetoincreaseforK-meansuntilthereisoneclusterperpoint.)Moregenerally,ifthenumberofclustersisnottoolarge,itcanbeusefultomanuallyexaminetheclustercohesionofeachclusterandthepairwiseseparationofclusters.However,notethatveryfewclustervaliditymeasuresareappropriatetocontiguityordensity-basedclustersthatcanhaveirregularandintertwinedshapes.
Forthesupervisedcase,whereclusteringisalmostalwaysperformedwithagoalofgeneratingunderstandableclusters,theidealresultofclusteringistoproduceclustersthatmatchtheunderlyingclassstructure.Evaluatingthematchbetweenasetofclustersandclassesisanon-trivialproblem.TheF-measureandhierarchicalF-measurediscussedearlier,areexamplesofhowtoevaluatesuchamatch.OtherexamplescanbefoundinthereferencestoclusterevaluationprovidedintheBibliographicNotes.Inanycase,whenthe
number of clusters is relatively small, the confusion matrix can be more informative than any single measure of cluster validity since it indicates which classes tend to appear in clusters with which other classes. Note that supervised cluster evaluation indices are independent of whether the clusters are center-, contiguity-, or density-based.
Inconclusion,itisimportanttorealizethatclusteringisoftenusedasanexploratorydatatechniquewhosegoalisoftennottoprovideacrispanswer,butrathertoprovidesomeinsightintotheunderlyingstructureofthedata.Inthissituation,clustervalidityindicesareusefultotheextenttheyareusefultothatendgoal.
7.6 Bibliographic Notes
Discussion in this chapter has been most heavily influenced by the books on cluster analysis written by Jain and Dubes [536], Anderberg [509], and Kaufman and Rousseeuw [540], as well as the more recent book edited by Aggarwal and Reddy [507]. Additional clustering books that may also be of interest include those by Aldenderfer and Blashfield [508], Everitt et al. [527], Hartigan [533], Mirkin [548], Murtagh [550], Romesburg [553], and Späth [557]. A more statistically oriented approach to clustering is given by the pattern recognition book of Duda et al. [524], the machine learning book of Mitchell [549], and the book on statistical learning by Hastie et al. [534]. General surveys of clustering are given by Jain et al. [537] and Xu and Wunsch [560], while a survey of spatial data mining techniques is provided by Han et al. [532]. Berkhin [515] provides a survey of clustering techniques for data mining. A good source of references to clustering outside of the data mining field is the article by Arabie and Hubert [511]. A paper by Kleinberg [541] provides a discussion of some of the trade-offs that clustering algorithms make and proves that it is impossible for a clustering algorithm to simultaneously possess three simple properties. A wide-ranging, retrospective article by Jain provides a look at clustering during the 50 years from the invention of K-means [535].
TheK-meansalgorithmhasalonghistory,butisstillthesubjectofcurrentresearch.TheK-meansalgorithmwasnamedbyMacQueen[545],althoughitshistoryismoreextensive.BockexaminestheoriginsofK-meansandsomeofitsextensions[516].TheISODATAalgorithmbyBallandHall[513]wasanearly,butsophisticatedversionofK-meansthatemployedvariouspre-andpostprocessingtechniquestoimproveonthebasicalgorithm.TheK-meansalgorithmandmanyofitsvariationsaredescribedindetailinthebooksby
Anderberg[509]andJainandDubes[536].ThebisectingK-meansalgorithmdiscussedinthischapterwasdescribedinapaperbySteinbachetal.[558],andanimplementationofthisandotherclusteringapproachesisfreelyavailableforacademicuseintheCLUTO(CLUsteringTOolkit)packagecreatedbyKarypis[520].Boley[517]hascreatedadivisivepartitioningclusteringalgorithm(PDDP)basedonfindingthefirstprincipaldirection(component)ofthedata,andSavaresiandBoley[555]haveexploreditsrelationshiptobisectingK-means.RecentvariationsofK-meansareanewincrementalversionofK-means(Dhillonetal.[522]),X-means(PellegandMoore[552]),andK-harmonicmeans(Zhangetal[562]).HamerlyandElkan[531]discusssomeclusteringalgorithmsthatproducebetterresultsthanK-means.WhilesomeofthepreviouslymentionedapproachesaddresstheinitializationproblemofK-meansinsomemanner,otherapproachestoimprovingK-meansinitializationcanalsobefoundintheworkofBradleyandFayyad[518].TheK-means++initializationapproachwasproposedbyArthurandVassilvitskii[512].DhillonandModha[523]presentageneralizationofK-means,calledsphericalK-means,whichworkswithcommonlyusedsimilarityfunctions.AgeneralframeworkforK-meansclusteringthatusesdissimilarityfunctionsbasedonBregmandivergenceswasconstructedbyBanerjeeetal.[514].
Hierarchicalclusteringtechniquesalsohavealonghistory.MuchoftheinitialactivitywasintheareaoftaxonomyandiscoveredinbooksbyJardineandSibson[538]andSneathandSokal[556].General-purposediscussionsofhierarchicalclusteringarealsoavailableinmostoftheclusteringbooksmentionedabove.Agglomerativehierarchicalclusteringisthefocusofmostworkintheareaofhierarchicalclustering,butdivisiveapproacheshavealsoreceivedsomeattention.Forexample,Zahn[561]describesadivisivehierarchicaltechniquethatusestheminimumspanningtreeofagraph.Whilebothdivisiveandagglomerativeapproachestypicallytaketheviewthatmerging(splitting)decisionsarefinal,therehasbeensomeworkbyFisher
[528]andKarypisetal.[539]toovercometheselimitations.MurtaghandContrerasprovidearecentoverviewofhierarchicalclusteringalgorithms[551]andhavealsoproposedalineartimehierarchicalclusteringalgorithm[521].
Esteretal.proposedDBSCAN[526],whichwaslatergeneralizedtotheGDBSCANalgorithmbySanderetal.[554]inordertohandlemoregeneraltypesofdataanddistancemeasures,suchaspolygonswhoseclosenessismeasuredbythedegreeofintersection.AnincrementalversionofDBSCANwasdevelopedbyKriegeletal.[525].OneinterestingoutgrowthofDBSCANisOPTICS(OrderingPointsToIdentifytheClusteringStructure)(Ankerstetal.[510]),whichallowsthevisualizationofclusterstructureandcanalsobeusedforhierarchicalclustering.Arecentdiscussionofdensity-basedclusteringbyKriegeletal.[542]providesaveryreadablesynopsisofdensity-basedclusteringandrecentdevelopments.
Anauthoritativediscussionofclustervalidity,whichstronglyinfluencedthediscussioninthischapter,isprovidedinChapter4 ofJainandDubes’clusteringbook[536].ArecentreviewofclustervaliditymeasuresbyXiongandLicanbefoundin[559].OtherrecentreviewsofclustervalidityarethoseofHalkidietal.[529,530]andMilligan[547].SilhouettecoefficientsaredescribedinKaufmanandRousseeuw’sclusteringbook[540].ThesourceofthecohesionandseparationmeasuresinTable7.6 isapaperbyZhaoandKarypis[563],whichalsocontainsadiscussionofentropy,purity,andthehierarchicalF-measure.TheoriginalsourceofthehierarchicalF-measureisanarticlebyLarsenandAone[543].TheCVNNmeasurewasintroducedbyLietal.[544].Anaxiomaticapproachtoclusteringvalidityispresentedin[546].ManyofthepopularindicesforclustervalidationareimplementedintheNbClustRpackage,whichisdescribedinthearticlebyCharradetal.[519].
Bibliography[507]C.C.AggarwalandC.K.Reddy,editors.DataClustering:Algorithms
andApplications.Chapman&Hall/CRC,1stedition,2013.
[508]M.S.AldenderferandR.K.Blashfield.ClusterAnalysis.SagePublications,LosAngeles,1985.
[509]M.R.Anderberg.ClusterAnalysisforApplications.AcademicPress,NewYork,December1973.
[510]M.Ankerst,M.M.Breunig,H.-P.Kriegel,andJ.Sander.OPTICS:OrderingPointsToIdentifytheClusteringStructure.InProc.of1999ACM-SIGMODIntl.Conf.onManagementofData,pages49–60,Philadelphia,Pennsylvania,June1999.ACMPress.
[511]P.Arabie,L.Hubert,andG.D.Soete.Anoverviewofcombinatorialdataanalysis.InP.Arabie,L.Hubert,andG.D.Soete,editors,ClusteringandClassification,pages188–217.WorldScientific,Singapore,January1996.
[512]D.ArthurandS.Vassilvitskii.k-means++:Theadvantagesofcarefulseeding.InProceedingsoftheeighteenthannualACM-SIAMsymposiumonDiscretealgorithms,pages1027–1035.SocietyforIndustrialandAppliedMathematics,2007.
[513]G.BallandD.Hall.AClusteringTechniqueforSummarizingMultivariateData.BehaviorScience,12:153–155,March1967.
[514]A.Banerjee,S.Merugu,I.S.Dhillon,andJ.Ghosh.ClusteringwithBregmanDivergences.InProc.ofthe2004SIAMIntl.Conf.onDataMining,pages234–245,LakeBuenaVista,FL,April2004.
[515]P.Berkhin.SurveyOfClusteringDataMiningTechniques.Technicalreport,AccrueSoftware,SanJose,CA,2002.
[516]H.-H.Bock.Originsandextensionsofthe-meansalgorithminclusteranalysis.JournalÉlectroniqued’HistoiredesProbabilitésetdelaStatistique[electroniconly],4(2):Article–14,2008.
[517]D.Boley.PrincipalDirectionDivisivePartitioning.DataMiningandKnowledgeDiscovery,2(4):325–344,1998.
[518]P.S.BradleyandU.M.Fayyad.RefiningInitialPointsforK-MeansClustering.InProc.ofthe15thIntl.Conf.onMachineLearning,pages91–99,Madison,WI,July1998.MorganKaufmannPublishersInc.
[519]M.Charrad,N.Ghazzali,V.Boiteau,andA.Niknafs.NbClust:anRpackagefordeterminingtherelevantnumberofclustersinadataset.JournalofStatisticalSoftware,61(6):1–36,2014.
[520]CLUTO2.1.2:SoftwareforClusteringHigh-DimensionalDatasets.www.cs.umn.edu/∼karypis,October2016.
[521]P.ContrerasandF.Murtagh.Fast,lineartimehierarchicalclusteringusingtheBairemetric.Journalofclassification,29(2):118–143,2012.
[522]I.S.Dhillon,Y.Guan,andJ.Kogan.IterativeClusteringofHighDimensionalTextDataAugmentedbyLocalSearch.InProc.ofthe2002IEEEIntl.Conf.onDataMining,pages131–138.IEEEComputerSociety,2002.
[523]I.S.DhillonandD.S.Modha.ConceptDecompositionsforLargeSparseTextDataUsingClustering.MachineLearning,42(1/2):143–175,2001.
[524]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,secondedition,2001.
[525]M.Ester,H.-P.Kriegel,J.Sander,M.Wimmer,andX.Xu.IncrementalClusteringforMininginaDataWarehousingEnvironment.InProc.ofthe24thVLDBConf.,pages323–333,NewYorkCity,August1998.MorganKaufmann.
[526]M.Ester,H.-P.Kriegel,J.Sander,andX.Xu.ADensity-BasedAlgorithmforDiscoveringClustersinLargeSpatialDatabaseswithNoise.InProc.ofthe2ndIntl.Conf.onKnowledgeDiscoveryandDataMining,pages226–231,Portland,Oregon,August1996.AAAIPress.
[527]B.S.Everitt,S.Landau,andM.Leese.ClusterAnalysis.ArnoldPublishers,London,4thedition,May2001.
[528]D.Fisher.IterativeOptimizationandSimplificationofHierarchicalClusterings.JournalofArtificialIntelligenceResearch,4:147–179,1996.
[529]M.Halkidi,Y.Batistakis,andM.Vazirgiannis.Clustervaliditymethods:partI.SIGMODRecord(ACMSpecialInterestGrouponManagementofData),31(2):40–45,June2002.
[530]M.Halkidi,Y.Batistakis,andM.Vazirgiannis.Clusteringvaliditycheckingmethods:partII.SIGMODRecord(ACMSpecialInterestGrouponManagementofData),31(3):19–27,Sept.2002.
[531]G.HamerlyandC.Elkan.Alternativestothek-meansalgorithmthatfindbetterclusterings.InProc.ofthe11thIntl.Conf.onInformationandKnowledgeManagement,pages600–607,McLean,Virginia,2002.ACMPress.
[532]J.Han,M.Kamber,andA.Tung.SpatialClusteringMethodsinDataMining:Areview.InH.J.MillerandJ.Han,editors,GeographicDataMiningandKnowledgeDiscovery,pages188–217.TaylorandFrancis,London,December2001.
[533]J.Hartigan.ClusteringAlgorithms.Wiley,NewYork,1975.
[534]T.Hastie,R.Tibshirani,andJ.H.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,Prediction.Springer,NewYork,2001.
[535]A.K.Jain.Dataclustering:50yearsbeyondK-means.Patternrecognitionletters,31(8):651–666,2010.
[536]A.K.JainandR.C.Dubes.AlgorithmsforClusteringData.PrenticeHallAdvancedReferenceSeries.PrenticeHall,March1988.
[537]A.K.Jain,M.N.Murty,andP.J.Flynn.Dataclustering:Areview.ACMComputingSurveys,31(3):264–323,September1999.
[538]N.JardineandR.Sibson.MathematicalTaxonomy.Wiley,NewYork,1971.
[539]G.Karypis,E.-H.Han,andV.Kumar.MultilevelRefinementforHierarchicalClustering.TechnicalReportTR99-020,UniversityofMinnesota,Minneapolis,MN,1999.
[540]L.KaufmanandP.J.Rousseeuw.FindingGroupsinData:AnIntroductiontoClusterAnalysis.WileySeriesinProbabilityandStatistics.JohnWileyandSons,NewYork,November1990.
[541]J.M.Kleinberg.AnImpossibilityTheoremforClustering.InProc.ofthe16thAnnualConf.onNeuralInformationProcessingSystems,December,9–142002.
[542]H.-P.Kriegel,P.Kröger,J.Sander,andA.Zimek.Density-basedclustering.WileyInterdisciplinaryReviews:DataMiningandKnowledgeDiscovery,1(3):231–240,2011.
[543]B.LarsenandC.Aone.FastandEffectiveTextMiningUsingLinear-TimeDocumentClustering.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages16–22,SanDiego,California,1999.ACMPress.
[544]Y.Liu,Z.Li,H.Xiong,X.Gao,J.Wu,andS.Wu.Understandingandenhancementofinternalclusteringvalidationmeasures.Cybernetics,IEEETransactionson,43(3):982–994,2013.
[545]J.MacQueen.Somemethodsforclassificationandanalysisofmultivariateobservations.InProc.ofthe5thBerkeleySymp.onMathematicalStatisticsandProbability,pages281–297.UniversityofCaliforniaPress,1967.
[546]M.Meilă.ComparingClusterings:AnAxiomaticView.InProceedingsofthe22NdInternationalConferenceonMachineLearning,ICML’05,pages577–584,NewYork,NY,USA,2005.ACM.
[547]G.W.Milligan.ClusteringValidation:ResultsandImplicationsforAppliedAnalyses.InP.Arabie,L.Hubert,andG.D.Soete,editors,ClusteringandClassification,pages345–375.WorldScientific,Singapore,January1996.
[548]B.Mirkin.MathematicalClassificationandClustering,volume11ofNonconvexOptimizationandItsApplications.KluwerAcademicPublishers,August1996.
[549]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.
[550]F.Murtagh.MultidimensionalClusteringAlgorithms.Physica-Verlag,HeidelbergandVienna,1985.
[551]F.MurtaghandP.Contreras.Algorithmsforhierarchicalclustering:anoverview.WileyInterdisciplinaryReviews:DataMiningandKnowledgeDiscovery,2(1):86–97,2012.
[552]D.PellegandA.W.Moore.X-means:ExtendingK-meanswithEfficientEstimationoftheNumberofClusters.InProc.ofthe17thIntl.Conf.onMachineLearning,pages727–734.MorganKaufmann,SanFrancisco,CA,2000.
[553]C.Romesburg.ClusterAnalysisforResearchers.LifeTimeLearning,Belmont,CA,1984.
[554]J.Sander,M.Ester,H.-P.Kriegel,andX.Xu.Density-BasedClusteringinSpatialDatabases:TheAlgorithmGDBSCANanditsApplications.DataMiningandKnowledgeDiscovery,2(2):169–194,1998.
[555]S.M.SavaresiandD.Boley.AcomparativeanalysisonthebisectingK-meansandthePDDPclusteringalgorithms.IntelligentDataAnalysis,8(4):345–362,2004.
[556]P.H.A.SneathandR.R.Sokal.NumericalTaxonomy.Freeman,SanFrancisco,1971.
[557]H.Späth.ClusterAnalysisAlgorithmsforDataReductionandClassificationofObjects,volume4ofComputersandTheirApplication.EllisHorwoodPublishers,Chichester,1980.ISBN0-85312-141-9.
[558]M.Steinbach,G.Karypis,andV.Kumar.AComparisonofDocumentClusteringTechniques.InProc.ofKDDWorkshoponTextMining,Proc.ofthe6thIntl.Conf.onKnowledgeDiscoveryandDataMining,Boston,MA,August2000.
[559]H.XiongandZ.Li.ClusteringValidationMeasures.InC.C.AggarwalandC.K.Reddy,editors,DataClustering:AlgorithmsandApplications,pages571–605.Chapman&Hall/CRC,2013.
[560]R.Xu,D.Wunsch,etal.Surveyofclusteringalgorithms.NeuralNetworks,IEEETransactionson,16(3):645–678,2005.
[561]C.T.Zahn.Graph-TheoreticalMethodsforDetectingandDescribingGestaltClusters.IEEETransactionsonComputers,C-20(1):68–86,Jan.1971.
[562]B.Zhang,M.Hsu,andU.Dayal.K-HarmonicMeans—ADataClusteringAlgorithm.TechnicalReportHPL-1999-124,HewlettPackardLaboratories,Oct.291999.
[563]Y.ZhaoandG.Karypis.Empiricalandtheoreticalcomparisonsofselectedcriterionfunctionsfordocumentclustering.MachineLearning,55(3):311–331,2004.
7.7 Exercises
1. Consider a data set consisting of 2^20 data vectors, where each vector has 32 components and each component is a 4-byte value. Suppose that vector quantization is used for compression, and that 2^16 prototype vectors are used. How many bytes of storage does that data set take before and after compression and what is the compression ratio?
2.Findallwell-separatedclustersinthesetofpointsshowninFigure7.35 .
Figure7.35.PointsforExercise2 .
3.Manypartitionalclusteringalgorithmsthatautomaticallydeterminethenumberofclustersclaimthatthisisanadvantage.Listtwosituationsinwhichthisisnotthecase.
4. Given K equally sized clusters, the probability that a randomly chosen initial centroid will come from any given cluster is 1/K, but the probability that each cluster will have exactly one initial centroid is much lower. (It should be clear that having one initial centroid in each cluster is a good starting situation for K-means.) In general, if there are K clusters and each cluster has n points, then the probability, p, of selecting in a sample of size K one initial centroid from each cluster is given by Equation 7.20. (This assumes sampling with replacement.)

p = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids) = (K! n^K) / (Kn)^K = K! / K^K   (7.20)

From this formula we can calculate, for example, that the chance of having one initial centroid from each of four clusters is 4!/4^4 = 0.0938.
a. PlottheprobabilityofobtainingonepointfromeachclusterinasampleofsizeKforvaluesofKbetween2and100.
b. For K clusters, K = 10, 100, and 1000, find the probability that a sample of size 2K contains at least one point from each cluster. You can use either mathematical methods or statistical simulation to determine the answer.
5.IdentifytheclustersinFigure7.36 usingthecenter-,contiguity-,anddensity-baseddefinitions.Alsoindicatethenumberofclustersforeachcaseandgiveabriefindicationofyourreasoning.Notethatdarknessorthenumberofdotsindicatesdensity.Ifithelps,assumecenter-basedmeansK-means,contiguity-basedmeanssinglelink,anddensity-basedmeansDBSCAN.
Figure7.36.ClustersforExercise5 .
6.Forthefollowingsetsoftwo-dimensionalpoints,(1)provideasketchofhowtheywouldbesplitintoclustersbyK-meansforthegivennumberof
clustersand(2)indicateapproximatelywheretheresultingcentroidswouldbe.Assumethatweareusingthesquarederrorobjectivefunction.Ifyouthinkthatthereismorethanonepossiblesolution,thenpleaseindicatewhethereachsolutionisaglobalorlocalminimum.NotethatthelabelofeachdiagraminFigure7.37 matchesthecorrespondingpartofthisquestion,e.g.,Figure7.37(a) goeswithpart(a).
Figure7.37.DiagramsforExercise6 .
a. Assumingthatthepointsareuniformlydistributedinthecircle,howmanypossiblewaysarethere(intheory)topartitionthepointsintotwoclusters?Whatcanyousayaboutthepositionsofthetwocentroids?(Again,youdon’tneedtoprovideexactcentroidlocations,justaqualitativedescription.)
b. Thedistancebetweentheedgesofthecirclesisslightlygreaterthantheradiiofthecircles.
c. Thedistancebetweentheedgesofthecirclesismuchlessthantheradiiofthecircles.
d.
e. Hint:Usethesymmetryofthesituationandrememberthatwearelookingforaroughsketchofwhattheresultwouldbe.
K=2.
K=3.
K=3.
K=2.
K=3.
7.Supposethatforadataset
therearempointsandKclusters,
halfthepointsandclustersarein“moredense”regions,
halfthepointsandclustersarein“lessdense”regions,and
thetworegionsarewell-separatedfromeachother.
Forthegivendataset,whichofthefollowingshouldoccurinordertominimizethesquarederrorwhenfindingKclusters:
a. Centroidsshouldbeequallydistributedbetweenmoredenseandlessdenseregions.
b. Morecentroidsshouldbeallocatedtothelessdenseregion.
c. Morecentroidsshouldbeallocatedtothedenserregion.
Note:Donotgetdistractedbyspecialcasesorbringinfactorsotherthandensity.However,ifyoufeelthetrueanswerisdifferentfromanygivenabove,justifyyourresponse.
8.Considerthemeanofaclusterofobjectsfromabinarytransactiondataset.Whataretheminimumandmaximumvaluesofthecomponentsofthemean?Whatistheinterpretationofcomponentsoftheclustermean?Whichcomponentsmostaccuratelycharacterizetheobjectsinthecluster?
9.Giveanexampleofadatasetconsistingofthreenaturalclusters,forwhich(almostalways)K-meanswouldlikelyfindthecorrectclusters,butbisectingK-meanswouldnot.
10.WouldthecosinemeasurebetheappropriatesimilaritymeasuretousewithK-meansclusteringfortimeseriesdata?Whyorwhynot?Ifnot,whatsimilaritymeasurewouldbemoreappropriate?
11.TotalSSEisthesumoftheSSEforeachseparateattribute.WhatdoesitmeaniftheSSEforonevariableislowforallclusters?Lowforjustonecluster?Highforallclusters?Highforjustonecluster?HowcouldyouusethepervariableSSEinformationtoimproveyourclustering?
12.Theleaderalgorithm(Hartigan[533])representseachclusterusingapoint,knownasaleader,andassignseachpointtotheclustercorrespondingtotheclosestleader,unlessthisdistanceisaboveauser-specifiedthreshold.Inthatcase,thepointbecomestheleaderofanewcluster.
a. WhataretheadvantagesanddisadvantagesoftheleaderalgorithmascomparedtoK-means?
b. Suggestwaysinwhichtheleaderalgorithmmightbeimproved.
13.TheVoronoidiagramforasetofKpointsintheplaneisapartitionofallthepointsoftheplaneintoKregions,suchthateverypoint(oftheplane)isassignedtotheclosestpointamongtheKspecifiedpoints—seeFigure7.38 .WhatistherelationshipbetweenVoronoidiagramsandK-meansclusters?WhatdoVoronoidiagramstellusaboutthepossibleshapesofK-meansclusters?
Figure7.38.VoronoidiagramforExercise13 .
14. You are given a data set with 100 records and are asked to cluster the data. You use K-means to cluster the data, but for all values of K, 1 ≤ K ≤ 100, the K-means algorithm returns only one non-empty cluster. You then apply an incremental version of K-means, but obtain exactly the same result. How is this possible? How would single link or DBSCAN handle such data?
15.Traditionalagglomerativehierarchicalclusteringroutinesmergetwoclustersateachstep.Doesitseemlikelythatsuchanapproachaccuratelycapturesthe(nested)clusterstructureofasetofdatapoints?Ifnot,explainhowyoumightpostprocessthedatatoobtainamoreaccurateviewoftheclusterstructure.
16.UsethesimilaritymatrixinTable7.13 toperformsingleandcompletelinkhierarchicalclustering.Showyourresultsbydrawingadendrogram.Thedendrogramshouldclearlyshowtheorderinwhichthepointsaremerged.
Table7.13.SimilaritymatrixforExercise16 .
p1 p2 p3 p4 p5
p1 1.00 0.10 0.41 0.55 0.35
p2 0.10 1.00 0.64 0.47 0.98
p3 0.41 0.64 1.00 0.44 0.85
p4 0.55 0.47 0.44 1.00 0.76
p5 0.35 0.98 0.85 0.76 1.00
17. Hierarchical clustering is sometimes used to generate K clusters, K > 1, by taking the clusters at the Kth level of the dendrogram. (Root is at level 1.) By looking at the clusters produced in this way, we can evaluate the behavior of hierarchical clustering on different types of data and clusters, and also compare hierarchical approaches to K-means.
Thefollowingisasetofone-dimensionalpoints:{6,12,18,24,30,42,48}.
a. Foreachofthefollowingsetsofinitialcentroids,createtwoclustersbyassigningeachpointtothenearestcentroid,andthencalculatethetotalsquarederrorforeachsetoftwoclusters.Showboththeclustersandthetotalsquarederrorforeachsetofcentroids.
i. {18,45}
ii. {15,40}
b. Dobothsetsofcentroidsrepresentstablesolutions;i.e.,iftheK-meansalgorithmwasrunonthissetofpointsusingthegivencentroidsasthestartingcentroids,wouldtherebeanychangeintheclustersgenerated?
c. Whatarethetwoclustersproducedbysinglelink?
d. Whichtechnique,K-meansorsinglelink,seemstoproducethe“mostnatural”clusteringinthissituation?(ForK-means,taketheclusteringwiththelowestsquarederror.)
e. Whatdefinition(s)ofclusteringdoesthisnaturalclusteringcorrespondto?(Well-separated,center-based,contiguous,ordensity.)
f. Whatwell-knowncharacteristicoftheK-meansalgorithmexplainsthepreviousbehavior?
18.SupposewefindKclustersusingWard’smethod,bisectingK-means,andordinaryK-means.Whichofthesesolutionsrepresentsalocalorglobalminimum?Explain.
19. Hierarchical clustering algorithms require O(m² log(m)) time, and consequently, are impractical to use directly on larger data sets. One possible technique for reducing the time required is to sample the data set. For example, if K clusters are desired and √m points are sampled from the m points, then a hierarchical clustering algorithm will produce a hierarchical clustering in roughly O(m) time. K clusters can be extracted from this hierarchical clustering by taking the clusters on the Kth level of the dendrogram. The remaining points can then be assigned to a cluster in linear time, by using various strategies. To give a specific example, the centroids of the K clusters can be computed, and then each of the m − √m remaining points can be assigned to the cluster associated with the closest centroid.
Foreachofthefollowingtypesofdataorclusters,discussbrieflyif(1)samplingwillcauseproblemsforthisapproachand(2)whatthoseproblemsare.Assumethatthesamplingtechniquerandomlychoosespointsfromthetotalsetofmpointsandthatanyunmentionedcharacteristicsofthedataorclustersareasoptimalaspossible.Inotherwords,focusonlyonproblemscausedbytheparticularcharacteristicmentioned.Finally,assumethatKisverymuchlessthanm.
a. Datawithverydifferentsizedclusters.
b. High-dimensionaldata.
c. Datawithoutliers,i.e.,atypicalpoints.
d. Datawithhighlyirregularregions.
e. Datawithglobularclusters.
f. Datawithwidelydifferentdensities.
g. Datawithasmallpercentageofnoisepoints.
h. Non-Euclideandata.
i. Euclideandata.
j. Datawithmanyandmixedattributetypes.
20.ConsiderthefollowingfourfacesshowninFigure7.39 .Again,darknessornumberofdotsrepresentsdensity.Linesareusedonlytodistinguishregionsanddonotrepresentpoints.
Figure7.39.FigureforExercise20 .
a. Foreachfigure,couldyouusesinglelinktofindthepatternsrepresentedbythenose,eyes,andmouth?Explain.
b. Foreachfigure,couldyouuseK-meanstofindthepatternsrepresentedbythenose,eyes,andmouth?Explain.
c. WhatlimitationdoesclusteringhaveindetectingallthepatternsformedbythepointsinFigure7.39(c) ?
21.ComputetheentropyandpurityfortheconfusionmatrixinTable7.14 .
Table7.14.ConfusionmatrixforExercise21 .
Cluster Entertainment Financial Foreign Metro National Sports Total
#1 1 1 0 11 4 676 693
#2 27 89 333 827 253 33 1562
#3 326 465 8 105 16 29 949
Total 354 555 341 943 273 738 3204
22.Youaregiventwosetsof100pointsthatfallwithintheunitsquare.Onesetofpointsisarrangedsothatthepointsareuniformlyspaced.Theothersetofpointsisgeneratedfromauniformdistributionovertheunitsquare.
a. Isthereadifferencebetweenthetwosetsofpoints?
b. If so, which set of points will typically have a smaller SSE for K = 10 clusters?
c. WhatwillbethebehaviorofDBSCANontheuniformdataset?Therandomdataset?
23.UsingthedatainExercise24 ,computethesilhouettecoefficientforeachpoint,eachofthetwoclusters,andtheoverallclustering.
24. Given the set of cluster labels and similarity matrix shown in Tables 7.15 and 7.16, respectively, compute the correlation between the similarity matrix and the ideal similarity matrix, i.e., the matrix whose ij-th entry is 1 if two objects belong to the same cluster, and 0 otherwise.
Table7.15.TableofclusterlabelsforExercise24 .
Point ClusterLabel
P1 1
P2 1
P3 2
P4 2
Table7.16.SimilaritymatrixforExercise24 .
Point P1 P2 P3 P4
P1 1 0.8 0.65 0.55
P2 0.8 1 0.7 0.6
P3 0.65 0.7 1 0.9
P4 0.55 0.6 0.9 1
25.ComputethehierarchicalF-measurefortheeightobjects{p1,p2,p3,p4,p5,p6,p7,andp8}andhierarchicalclusteringshowninFigure7.40 .ClassAcontainspointsp1,p2,andp3,whilep4,p5,p6,p7,andp8belongtoclassB.
Figure7.40.HierarchicalclusteringforExercise25 .
26.ComputethecopheneticcorrelationcoefficientforthehierarchicalclusteringsinExercise16 .(Youwillneedtoconvertthesimilaritiesintodissimilarities.)
27.ProveEquation7.14 .
28.ProveEquation7.16 .
29. Prove that ∑_{i=1}^{K} ∑_{x∈C_i} (x − m_i)(m − m_i) = 0. This fact was used in the proof that TSS = SSE + SSB in Section 7.5.2.
30.Clustersofdocumentscanbesummarizedbyfindingthetopterms(words)forthedocumentsinthecluster,e.g.,bytakingthemostfrequentkterms,wherekisaconstant,say10,orbytakingalltermsthatoccurmorefrequentlythanaspecifiedthreshold.SupposethatK-meansisusedtofindclustersofbothdocumentsandwordsforadocumentdataset.
a. HowmightasetoftermclustersdefinedbythetoptermsinadocumentclusterdifferfromthewordclustersfoundbyclusteringthetermswithK-means?
b. Howcouldtermclusteringbeusedtodefineclustersofdocuments?
31.Wecanrepresentadatasetasacollectionofobjectnodesandacollectionofattributenodes,wherethereisalinkbetweeneachobjectandeachattribute,andwheretheweightofthatlinkisthevalueoftheobjectforthatattribute.Forsparsedata,ifthevalueis0,thelinkisomitted.Bipartiteclusteringattemptstopartitionthisgraphintodisjointclusters,whereeachclusterconsistsofasetofobjectnodesandasetofattributenodes.Theobjectiveistomaximizetheweightoflinksbetweentheobjectandattributenodesofacluster,whileminimizingtheweightoflinksbetweenobjectandattributelinksindifferentclusters.Thistypeofclusteringisalsoknownasco-clusteringbecausetheobjectsandattributesareclusteredatthesametime.
a. Howisbipartiteclustering(co-clustering)differentfromclusteringthesetsofobjectsandattributesseparately?
b. Arethereanycasesinwhichtheseapproachesyieldthesameclusters?
c. Whatarethestrengthsandweaknessesofco-clusteringascomparedtoordinaryclustering?
32.InFigure7.41 ,matchthesimilaritymatrices,whicharesortedaccordingtoclusterlabels,withthesetsofpoints.Differencesinshadingandmarkershapedistinguishbetweenclusters,andeachsetofpointscontains100pointsandthreeclusters.Inthesetofpointslabeled2,therearethreeverytight,equalsizedclusters.
Figure7.41.PointsandsimilaritymatricesforExercise32 .
8ClusterAnalysis:AdditionalIssuesandAlgorithms
Alargenumberofclusteringalgorithmshavebeendevelopedinavarietyofdomainsfordifferenttypesofapplications.Noalgorithmissuitableforalltypesofdata,clusters,andapplications.Infact,itseemsthatthereisalwaysroomforanewclusteringalgorithmthatismoreefficientorbettersuitedtoaparticulartypeofdata,cluster,orapplication.Instead,wecanonlyclaimthatwehavetechniquesthatworkwellinsomesituations.Thereasonisthat,inmanycases,whatconstitutesagoodsetofclustersisopentosubjectiveinterpretation.Furthermore,whenanobjectivemeasureisemployedtogiveaprecisedefinitionofacluster,theproblemoffindingtheoptimalclusteringisoftencomputationallyinfeasible.
Thischapterfocusesonimportantissuesinclusteranalysisandexplorestheconceptsandapproachesthathavebeendevelopedtoaddressthem.Webeginwithadiscussionofthekeyissuesofclusteranalysis,namely,thecharacteristicsofdata,clusters,andalgorithmsthatstronglyimpactclustering.Theseissues
are important for understanding, describing, and comparing clustering techniques, and provide the basis for deciding which technique to use in a specific situation. For example, many clustering algorithms have a time or space complexity of O(m²) (m being the number of objects) and, thus, are not suitable for large data sets. We then discuss additional clustering techniques. For each technique, we describe the algorithm, including the issues it addresses and the methods that it uses to address them. We conclude this chapter by providing some general guidelines for selecting a clustering algorithm for a given application.
8.1CharacteristicsofData,Clusters,andClusteringAlgorithmsThissectionexploresissuesrelatedtothecharacteristicsofdata,clusters,andalgorithmsthatareimportantforabroadunderstandingofclusteranalysis.Someoftheseissuesrepresentchallenges,suchashandlingnoiseandoutliers.Otherissuesinvolveadesiredfeatureofanalgorithm,suchasanabilitytoproducethesameresultregardlessoftheorderinwhichthedataobjectsareprocessed.Thediscussioninthissection,alongwiththediscussionofdifferenttypesofclusteringsinSection7.1.2 anddifferenttypesofclustersinSection7.1.3 ,identifiesanumberof“dimensions”thatcanbeusedtodescribeandcomparevariousclusteringalgorithmsandtheclusteringresultsthattheyproduce.Toillustratethis,webeginthissectionwithanexamplethatcomparestwoclusteringalgorithmsthatweredescribedinthepreviouschapter,DBSCANandK-means.Thisisfollowedbyamoredetaileddescriptionofthecharacteristicsofdata,clusters,andalgorithmsthatimpactclusteranalysis.
8.1.1Example:ComparingK-meansandDBSCAN
Tosimplifythecomparison,weassumethattherearenotiesindistancesforeitherK-meansorDBSCANandthatDBSCANalwaysassignsaborderpointthatisassociatedwithseveralcorepointstotheclosestcorepoint.
Both DBSCAN and K-means are partitional clustering algorithms that assign each object to a single cluster, but K-means typically clusters all the objects, while DBSCAN discards objects that it classifies as noise.
K-means uses a prototype-based notion of a cluster; DBSCAN uses a density-based concept.
DBSCAN can handle clusters of different sizes and shapes and is not strongly affected by noise or outliers. K-means has difficulty with non-globular clusters and clusters of different sizes. Both algorithms can perform poorly when clusters have widely differing densities.
K-means can only be used for data that has a well-defined centroid, such as a mean or median. DBSCAN requires that its definition of density, which is based on the traditional Euclidean notion of density, be meaningful for the data.
K-means can be applied to sparse, high-dimensional data, such as document data. DBSCAN typically performs poorly for such data because the traditional Euclidean definition of density does not work well for high-dimensional data.
The original versions of K-means and DBSCAN were designed for Euclidean data, but both have been extended to handle other types of data.
DBSCAN makes no assumption about the distribution of the data. The basic K-means algorithm is equivalent to a statistical clustering approach (mixture models) that assumes all clusters come from spherical Gaussian distributions with different means but the same covariance matrix. See Section 8.2.2.
DBSCAN and K-means both look for clusters using all attributes, that is, they do not look for clusters that involve only a subset of the attributes.
K-means can find clusters that are not well separated, even if they overlap (see Figure 7.2(b)), but DBSCAN merges clusters that overlap.
The K-means algorithm has a time complexity of O(m), while DBSCAN takes O(m²) time, except for special cases such as low-dimensional Euclidean data.
DBSCAN produces the same set of clusters from one run to another, while K-means, which is typically used with random initialization of centroids, does not.
DBSCAN automatically determines the number of clusters; for K-means, the number of clusters needs to be specified as a parameter. However, DBSCAN has two other parameters that must be specified, Eps and MinPts.
K-means clustering can be viewed as an optimization problem, i.e., minimize the sum of the squared error of each point to its closest centroid, and as a specific case of a statistical clustering approach (mixture models). DBSCAN is not based on any formal model.
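The contrast can be seen directly with scikit-learn's implementations of the two algorithms; the data set and parameter values below are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)   # two non-globular clusters

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)    # K must be specified
db = DBSCAN(eps=0.2, min_samples=5).fit(X)                     # Eps and MinPts instead

# K-means assigns every point to a cluster, while DBSCAN labels noise points as -1
# and can recover the two crescent-shaped clusters that K-means tends to split.
print(np.unique(km.labels_), np.unique(db.labels_))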
8.1.2DataCharacteristics
Thefollowingaresomecharacteristicsofdatathatcanstronglyaffectclusteranalysis.
High Dimensionality In high-dimensional data sets, the traditional Euclidean notion of density, which is the number of points per unit volume, becomes meaningless. To see this, consider that as the number of dimensions increases, the volume increases rapidly, and unless the number of points grows exponentially with the number of dimensions, the density tends to 0. (Volume is exponential in the number of dimensions. For instance, a hypersphere with radius, r, and dimension, d, has volume proportional to r^d.) Also, proximity tends to become more uniform in high-dimensional spaces. Another way to view this fact is that there are more dimensions (attributes) that contribute to the proximity between two points and this tends to make the proximity more uniform. Since most clustering techniques are based on
proximityordensity,theycanoftenhavedifficultywithhigh-dimensionaldata.Onewaytoaddresssuchproblemsistoemploydimensionalityreductiontechniques.Anotherapproach,asdiscussedinSections8.4.6 and8.4.8 ,istoredefinethenotionsofproximityanddensity.
SizeManyclusteringalgorithmsthatworkwellforsmallormedium-sizedatasetsareunabletohandlelargerdatasets.Thisisaddressedfurtherinthediscussionofthecharacteristicsofclusteringalgorithms—scalabilityisonesuchcharacteristic—andinSection8.5 ,whichdiscussesscalableclusteringalgorithms.
SparsenessSparsedataoftenconsistsofasymmetricattributes,wherezerovaluesarenotasimportantasnon-zerovalues.Therefore,similaritymeasuresappropriateforasymmetricattributesarecommonlyused.However,other,relatedissuesalsoarise.Forexample,arethemagnitudesofnon-zeroentriesimportant,ordotheydistorttheclustering?Inotherwords,doestheclusteringworkbestwhenthereareonlytwovalues,0and1?
NoiseandOutliersAnatypicalpoint(outlier)canoftenseverelydegradetheperformanceofclusteringalgorithms,especiallyalgorithmssuchasK-meansthatareprototype-based.Ontheotherhand,noisecancausetechniques,suchassinglelink,tojoinclustersthatshouldnotbejoined.Insomecases,algorithmsforremovingnoiseandoutliersareappliedbeforeaclusteringalgorithmisused.Alternatively,somealgorithmscandetectpointsthatrepresentnoiseandoutliersduringtheclusteringprocessandthendeletethemorotherwiseeliminatetheirnegativeeffects.Inthepreviouschapter,forinstance,wesawthatDBSCANautomaticallyclassifieslow-densitypointsasnoiseandremovesthemfromtheclusteringprocess.Chameleon(Section8.4.4 ),SNNdensity-basedclustering(Section8.4.9 ),andCURE(Section8.5.3 )arethreeofthealgorithmsinthischapterthatexplicitlydealwithnoiseandoutliersduringtheclusteringprocess.
TypeofAttributesandDataSetAsdiscussedinChapter2 ,datasetscanbeofvarioustypes,suchasstructured,graph,orordered,whileattributesareusuallycategorical(nominalorordinal)orquantitative(intervalorratio),andarebinary,discrete,orcontinuous.Differentproximityanddensitymeasuresareappropriatefordifferenttypesofdata.Insomesituations,dataneedstobediscretizedorbinarizedsothatadesiredproximitymeasureorclusteringalgorithmcanbeused.Anothercomplicationoccurswhenattributesareofwidelydifferingtypes,e.g.,continuousandnominal.Insuchcases,proximityanddensityaremoredifficulttodefineandoftenmoreadhoc.Finally,specialdatastructuresandalgorithmsareoftenneededtohandlecertaintypesofdataefficiently.
ScaleDifferentattributes,e.g.,heightandweight,areoftenmeasuredondifferentscales.Thesedifferencescanstronglyaffectthedistanceorsimilaritybetweentwoobjectsand,consequently,theresultsofaclusteranalysis.Considerclusteringagroupofpeoplebasedontheirheights,whicharemeasuredinmeters,andtheirweights,whicharemeasuredinkilograms.IfweuseEuclideandistanceasourproximitymeasure,thenheightwillhavelittleimpactandpeoplewillbeclusteredmostlybasedontheweightattribute.If,however,westandardizeeachattributebysubtractingoffitsmeananddividingbyitsstandarddeviation,thenwewillhaveeliminatedeffectsduetothedifferenceinscale.Moregenerally,normalizationtechniques,suchasthosediscussedinSection2.3.7 ,aretypicallyusedtohandletheseissues.
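The standardization step mentioned above can be written in a few lines; the height and weight values below are made up for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.75,  80.0],      # height in meters, weight in kilograms
              [1.60,  55.0],
              [1.90, 100.0]])
X_std = StandardScaler().fit_transform(X)   # subtract the mean and divide by the
                                            # standard deviation of each attribute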
MathematicalPropertiesoftheDataSpaceSomeclusteringtechniquescalculatethemeanofacollectionofpointsoruseothermathematicaloperationsthatonlymakesenseinEuclideanspaceorinotherspecificdataspaces.Otheralgorithmsrequirethatthedefinitionofdensitybemeaningfulforthedata.
8.1.3ClusterCharacteristics
Thedifferenttypesofclusters,suchasprototype-,graph-,anddensity-based,weredescribedearlierinSection7.1.3 .Here,wedescribeotherimportantcharacteristicsofclusters.
DataDistributionSomeclusteringtechniquesassumeaparticulartypeofdistributionforthedata.Morespecifically,theyoftenassumethatdatacanbemodeledasarisingfromamixtureofdistributions,whereeachclustercorrespondstoadistribution.ClusteringbasedonmixturemodelsisdiscussedinSection8.2.2 .
ShapeSomeclustersareregularlyshaped,e.g.,rectangularorglobular,butingeneral,clusterscanbeofarbitraryshape.TechniquessuchasDBSCANandsinglelinkcanhandleclustersofarbitraryshape,butprototype-basedschemesandsomehierarchicaltechniques,suchascompletelinkandgroupaverage,cannot.Chameleon(Section8.4.4 )andCURE(Section8.5.3 )areexamplesoftechniquesthatwerespecificallydesignedtoaddressthisproblem.
DifferingSizesManyclusteringmethods,suchasK-means,don’tworkwellwhenclustershavedifferentsizes.(SeeSection7.2.4 .)ThistopicisdiscussedfurtherinSection8.6 .
DifferingDensitiesClustersthathavewidelyvaryingdensitycancauseproblemsformethodssuchasDBSCANandK-means.TheSNNdensity-basedclusteringtechniquepresentedinSection8.4.9 addressesthisissue.
PoorlySeparatedClustersWhenclusterstouchoroverlap,someclusteringtechniquescombineclustersthatshouldbekeptseparate.Eventechniquesthatfinddistinctclustersarbitrarilyassignpointstooneclusteroranother.Fuzzyclustering,whichisdescribedinSection8.2.1 ,isonetechniquefordealingwithdatathatdoesnotformwell-separatedclusters.
RelationshipsamongClustersInmostclusteringtechniques,thereisnoexplicitconsiderationoftherelationshipsbetweenclusters,suchastheirrelativeposition.Self-organizingmaps(SOM),whicharedescribedinSection8.2.3 ,areaclusteringtechniquethatdirectlyconsiderstherelationshipsbetweenclustersduringtheclusteringprocess.Specifically,theassignmentofapointtooneclusteraffectsthedefinitionsofnearbyclusters.
SubspaceClustersClustersmayonlyexistinasubsetofdimensions(attributes),andtheclustersdeterminedusingonesetofdimensionsarefrequentlyquitedifferentfromtheclustersdeterminedbyusinganotherset.Whilethisissuecanarisewithasfewastwodimensions,itbecomesmoreacuteasdimensionalityincreases,becausethenumberofpossiblesubsetsofdimensionsisexponentialinthetotalnumberofdimensions.Forthatreason,itisnotfeasibletosimplylookforclustersinallpossiblesubsetsofdimensionsunlessthenumberofdimensionsisrelativelylow.
Oneapproachistoapplyfeatureselection,whichwasdiscussedinSection2.3.4 .However,thisapproachassumesthatthereisonlyonesubsetofdimensionsinwhichtheclustersexist.Inreality,clusterscanexistinmanydistinctsubspaces(setsofdimensions),someofwhichoverlap.Section8.3.2 considerstechniquesthataddressthegeneralproblemofsubspaceclustering,i.e.,offindingbothclustersandthedimensionstheyspan.
8.1.4GeneralCharacteristicsofClusteringAlgorithms
Clusteringalgorithmsarequitevaried.Weprovideageneraldiscussionofimportantcharacteristicsofclusteringalgorithmshere,andmakemorespecificcommentsduringourdiscussionofparticulartechniques.
OrderDependenceForsomealgorithms,thequalityandnumberofclustersproducedcanvary,perhapsdramatically,dependingontheorderinwhichthedataisprocessed.Whileitwouldseemdesirabletoavoidsuchalgorithms,sometimestheorderdependenceisrelativelyminororthealgorithmhasotherdesirablecharacteristics.SOM(Section8.2.3 )isanexampleofanalgorithmthatisorderdependent.
NondeterminismClusteringalgorithms,suchasK-means,arenotorder-dependent,buttheyproducedifferentresultsforeachrunbecausetheyrelyonaninitializationstepthatrequiresarandomchoice.Becausethequalityoftheclusterscanvaryfromoneruntoanother,multiplerunscanbenecessary.
Scalability It is not unusual for a data set to contain millions of objects, and the clustering algorithms used for such data sets should have linear or near-linear time and space complexity. Even algorithms that have a complexity of O(m²) are not practical for large data sets. Furthermore, clustering techniques for data sets cannot always assume that all the data will fit in main memory or that data elements can be randomly accessed. Such algorithms are infeasible for large data sets. Section 8.5 is devoted to the issue of scalability.
ParameterSelectionMostclusteringalgorithmshaveoneormoreparametersthatneedtobesetbytheuser.Itcanbedifficulttochoosethe
O(m2)
propervalues;thus,theattitudeisusually,“thefewerparameters,thebetter.”Choosingparametervaluesbecomesevenmorechallengingifasmallchangeintheparametersdrasticallychangestheclusteringresults.Finally,unlessaprocedure(whichmightinvolveuserinput)isprovidedfordeterminingparametervalues,auserofthealgorithmisreducedtousingtrialanderrortofindsuitableparametervalues.
Perhapsthemostwell-knownparameterselectionproblemisthatof“choosingtherightnumberofclusters”forpartitionalclusteringalgorithms,suchasK-means.OnepossibleapproachtothatissueisgiveninSection7.5.5 ,whilereferencestoothersareprovidedintheBibliographicNotes.
TransformingtheClusteringProblemtoAnotherDomainOneapproachtakenbysomeclusteringtechniquesistomaptheclusteringproblemtoaprobleminadifferentdomain.Graph-basedclustering,forinstance,mapsthetaskoffindingclusterstothetaskofpartitioningaproximitygraphintoconnectedcomponents.
TreatingClusteringasanOptimizationProblemClusteringisoftenviewedasanoptimizationproblem:dividethepointsintoclustersinawaythatmaximizesthegoodnessoftheresultingsetofclustersasmeasuredbyauser-specifiedobjectivefunction.Forexample,theK-meansclusteringalgorithm(Section7.2 )triestofindthesetofclustersthatminimizesthesumofthesquareddistanceofeachpointfromitsclosestclustercentroid.Intheory,suchproblemscanbesolvedbyenumeratingallpossiblesetsofclustersandselectingtheonewiththebestvalueoftheobjectivefunction,butthisexhaustiveapproachiscomputationallyinfeasible.Forthisreason,manyclusteringtechniquesarebasedonheuristicapproachesthatproducegood,butnotoptimalclusterings.Anotherapproachistouseobjectivefunctionsonagreedyorlocalbasis.Inparticular,thehierarchicalclusteringtechniques
discussedinSection7.3 proceedbymakinglocallyoptimal(greedy)decisionsateachstepoftheclusteringprocess.
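To make the optimization view concrete, the short sketch below (Python; the randomly generated data and the helper name kmeans_sse are our own illustrative assumptions, not part of the text) evaluates the K-means objective, i.e., the sum of the squared distance of each point to its closest centroid.

import numpy as np

def kmeans_sse(points, centroids):
    # Distance from every point to every centroid; shape (m, k).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # Each point contributes the squared distance to its closest centroid.
    return np.sum(dists.min(axis=1) ** 2)

# Tiny illustration: 100 random two-dimensional points and 3 candidate centroids.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
C = rng.normal(size=(3, 2))
print(kmeans_sse(X, C))  # the objective a clustering algorithm tries to minimize

An exhaustive search over all possible sets of clusters would find the minimum of this function, but, as noted above, heuristic or greedy approaches are used in practice.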
RoadMapWearrangeourdiscussionofclusteringalgorithmsinamannersimilartothatofthepreviouschapter,groupingtechniquesprimarilyaccordingtowhethertheyareprototype-based,density-based,orgraph-based.Thereis,however,aseparatediscussionforscalableclusteringtechniques.Weconcludethischapterwithadiscussionofhowtochooseaclusteringalgorithm.
8.2Prototype-BasedClusteringInprototype-basedclustering,aclusterisasetofobjectsinwhichanyobjectisclosertotheprototypethatdefinestheclusterthantotheprototypeofanyothercluster.Section7.2 describedK-means,asimpleprototype-basedclusteringalgorithmthatusesthecentroidoftheobjectsinaclusterastheprototypeofthecluster.Thissectiondiscussesclusteringapproachesthatexpandontheconceptofprototype-basedclusteringinoneormoreways,asdiscussednext:
- Objects are allowed to belong to more than one cluster. More specifically, an object belongs to every cluster with some weight. Such an approach addresses the fact that some objects are equally close to several cluster prototypes.
- A cluster is modeled as a statistical distribution, i.e., objects are generated by a random process from a statistical distribution that is characterized by a number of statistical parameters, such as the mean and variance. This viewpoint generalizes the notion of a prototype and enables the use of well-established statistical techniques.
- Clusters are constrained to have fixed relationships. Most commonly, these relationships are constraints that specify neighborhood relationships, i.e., the degree to which two clusters are neighbors of each other. Constraining the relationships among clusters can simplify the interpretation and visualization of the data.
Weconsiderthreespecificclusteringalgorithmstoillustratetheseextensionsofprototype-basedclustering.Fuzzyc-meansusesconceptsfromthefieldoffuzzylogicandfuzzysettheorytoproposeaclusteringscheme,whichismuchlikeK-means,butwhichdoesnotrequireahardassignmentofapoint
toonlyonecluster.Mixturemodelclusteringtakestheapproachthatasetofclusterscanbemodeledasamixtureofdistributions,oneforeachcluster.TheclusteringschemebasedonSelf-OrganizingMaps(SOM)performsclusteringwithinaframeworkthatrequiresclusterstohaveaprespecifiedrelationshiptooneanother,e.g.,atwo-dimensionalgridstructure.
8.2.1FuzzyClustering
If data objects are distributed in well-separated groups, then a crisp classification of the objects into disjoint clusters seems like an ideal approach. However, in most cases, the objects in a data set cannot be partitioned into well-separated clusters, and there will be a certain arbitrariness in assigning an object to a particular cluster. Consider an object that lies near the boundary of two clusters, but is slightly closer to one of them. In many such cases, it might be more appropriate to assign a weight to each object and each cluster that indicates the degree to which the object belongs to the cluster. Mathematically, w_{ij} is the weight with which object x_i belongs to cluster C_j.
As shown in the next section, probabilistic approaches can also provide such weights. While probabilistic approaches are useful in many situations, there are times when it is difficult to determine an appropriate statistical model. In such cases, non-probabilistic clustering techniques are needed to provide similar capabilities. Fuzzy clustering techniques are based on fuzzy set theory and provide a natural technique for producing a clustering in which membership weights (the w_{ij}) have a natural (but not probabilistic) interpretation. This section describes the general approach of fuzzy clustering and provides a specific example in terms of fuzzy c-means (fuzzy K-means).
Fuzzy Sets
LotfiZadehintroducedfuzzysettheoryandfuzzylogicin1965asawayofdealingwithimprecisionanduncertainty.Briefly,fuzzysettheoryallowsanobjecttobelongtoasetwithadegreeofmembershipbetween0and1,whilefuzzylogicallowsastatementtobetruewithadegreeofcertaintybetween0and1.Traditionalsettheoryandlogicarespecialcasesoftheirfuzzycounterpartsthatrestrictthedegreeofsetmembershiporthedegreeofcertaintytobeeither0or1.Fuzzyconceptshavebeenappliedtomanydifferentareas,includingcontrolsystems,patternrecognition,anddataanalysis(classificationandclustering).
Considerthefollowingexampleoffuzzylogic.Thedegreeoftruthofthestatement“Itiscloudy”canbedefinedtobethepercentageofcloudcoverinthesky,e.g.,iftheskyis50%coveredbyclouds,thenwewouldassign“Itiscloudy”adegreeoftruthof0.5.Ifwehavetwosets,“cloudydays”and“non-cloudydays,”thenwecansimilarlyassigneachdayadegreeofmembershipinthetwosets.Thus,ifadaywere25%cloudy,itwouldhavea25%degreeofmembershipin“cloudydays”anda75%degreeofmembershipin“non-cloudydays.”
Fuzzy Clusters Assume that we have a set of data points X = {x_1, …, x_m}, where each point, x_i, is an n-dimensional point, i.e., x_i = (x_i1, …, x_in). A collection of fuzzy clusters, C_1, C_2, …, C_k, is a subset of all possible fuzzy subsets of X. (This simply means that the membership weights (degrees), w_{ij}, have been assigned values between 0 and 1 for each point, x_i, and each cluster, C_j.) However, we also want to impose the following reasonable conditions on the clusters in order to ensure that the clusters form what is called a fuzzy pseudo-partition.
1. All the weights for a given point, x_i, add up to 1:
∑_{j=1}^{k} w_{ij} = 1
2. Each cluster, C_j, contains, with non-zero weight, at least one point, but does not contain, with a weight of one, all of the points:
0 < ∑_{i=1}^{m} w_{ij} < m
Fuzzy c-means While there are many types of fuzzy clustering—indeed, many data analysis algorithms can be "fuzzified"—we only consider the fuzzy version of K-means, which is called fuzzy c-means. In the clustering literature, the version of K-means that does not use incremental updates of cluster centroids is sometimes referred to as c-means, and this was the term adapted by the fuzzy community for the fuzzy version of K-means. The fuzzy c-means algorithm, also sometimes known as FCM, is given by Algorithm 8.1.
Algorithm 8.1 Basic fuzzy c-means algorithm.
1: Select an initial fuzzy pseudo-partition, i.e., assign values to all the w_{ij}.
2: repeat
3: Compute the centroid of each cluster using the fuzzy pseudo-partition.
4: Recompute the fuzzy pseudo-partition, i.e., the w_{ij}.
5: until The centroids don't change. (Alternative stopping conditions are "if the change in the error is below a specified threshold" or "if the absolute change in any w_{ij} is below a given threshold.")
Afterinitialization,FCMrepeatedlycomputesthecentroidsofeachclusterandthefuzzypseudo-partitionuntilthepartitiondoesnotchange.FCMissimilarinstructuretotheK-meansalgorithm,whichafterinitialization,alternatesbetweenastepthatupdatesthecentroidsandastepthatassignseachobjecttotheclosestcentroid.Specifically,computingafuzzypseudo-partitionisequivalenttotheassignmentstep.AswithK-means,FCMcanbeinterpretedasattemptingtominimizethesumofthesquarederror(SSE),althoughFCMisbasedonafuzzyversionofSSE.Indeed,K-meanscanberegardedasaspecialcaseofFCMandthebehaviorofthetwoalgorithmsisquitesimilar.ThedetailsofFCMaredescribedbelow.
ComputingSSE
The definition of the sum of the squared error (SSE) is modified as follows:
SSE(C_1, C_2, …, C_k) = ∑_{j=1}^{k} ∑_{i=1}^{m} w_{ij}^p dist(x_i, c_j)^2    (8.1)
where c_j is the centroid of the jth cluster and p, which is the exponent that determines the influence of the weights, has a value between 1 and ∞. Note that this SSE is just a weighted version of the traditional K-means SSE given in Equation 7.1.
Initialization
Random initialization is often used. In particular, weights are chosen randomly, subject to the constraint that the weights associated with any object must sum to 1. As with K-means, random initialization is simple, but often results in a clustering that represents a local minimum in terms of the SSE. Section 7.2.1, which contains a discussion on choosing initial centroids for K-means, has considerable relevance for FCM as well.
ComputingCentroids
The definition of the centroid given in Equation 8.2 can be derived by finding the centroid that minimizes the fuzzy SSE as given by Equation 8.1. (See the approach in Section 7.2.6.) For a cluster, C_j, the corresponding centroid, c_j, is defined by the following equation:
c_j = ∑_{i=1}^{m} w_{ij}^p x_i / ∑_{i=1}^{m} w_{ij}^p    (8.2)
The fuzzy centroid definition is similar to the traditional definition except that all points are considered (any point can belong to any cluster, at least somewhat) and the contribution of each point to the centroid is weighted by its membership degree. In the case of traditional crisp sets, where all w_{ij} are either 0 or 1, this definition reduces to the traditional definition of a centroid.
There are a few considerations when choosing the value of p. Choosing p = 2 simplifies the weight update formula—see Equation 8.4. However, if p is chosen to be near 1, then fuzzy c-means behaves like traditional K-means. Going in the other direction, as p gets larger, all the cluster centroids approach the global centroid of all the data points. In other words, the partition becomes fuzzier as p increases.
Updating the Fuzzy Pseudo-partition
Because the fuzzy pseudo-partition is defined by the weights, this step involves updating the weights w_{ij} associated with the ith point and jth cluster. The weight update formula given in Equation 8.3 can be derived by minimizing the SSE of Equation 8.1 subject to the constraint that the weights sum to 1.
w_{ij} = (1/dist(x_i, c_j)^2)^{1/(p−1)} / ∑_{q=1}^{k} (1/dist(x_i, c_q)^2)^{1/(p−1)}    (8.3)
This formula might appear a bit mysterious. However, note that if p = 2, then we obtain Equation 8.4, which is somewhat simpler. We provide an intuitive explanation of Equation 8.4, which, with a slight modification, also applies to Equation 8.3.
w_{ij} = (1/dist(x_i, c_j)^2) / ∑_{q=1}^{k} (1/dist(x_i, c_q)^2)    (8.4)
Intuitively, the weight w_{ij}, which indicates the degree of membership of point x_i in cluster C_j, should be relatively high if x_i is close to centroid c_j (if dist(x_i, c_j) is low) and relatively low if x_i is far from centroid c_j (if dist(x_i, c_j) is high). If w_{ij} = 1/dist(x_i, c_j)^2, which is the numerator of Equation 8.4, then this will indeed be the case. However, the membership weights for a point will not sum to one unless they are normalized; i.e., divided by the sum of all the weights as in Equation 8.4. To summarize, the membership weight of a point in a cluster is just the reciprocal of the square of the distance between the point and the cluster centroid divided by the sum of all the membership weights of the point.
Now consider the impact of the exponent 1/(p−1) in Equation 8.3. If p > 2, then this exponent decreases the weight assigned to clusters that are close to the point. Indeed, as p goes to infinity, the exponent tends to 0 and weights tend to the value 1/k. On the other hand, as p approaches 1, the exponent increases the membership weights of points to which the cluster is close. As p goes to 1, the membership weight goes to 1 for the closest cluster and to 0 for all the other clusters. This corresponds to K-means.
Example 8.1 (Fuzzy c-means on Three Circular Clusters). Figure 8.1 shows the result of applying fuzzy c-means to find three clusters for a two-dimensional data set of 100 points. Each point was assigned to the cluster in which it had the largest membership weight. The points belonging to each cluster are shown by different marker shapes, while the degree of membership in the cluster is shown by the shading. The darker the points, the stronger their membership in the cluster to which they have been assigned. The membership in a cluster is strongest toward the center of the cluster and weakest for those points that are between clusters.
Figure8.1.Fuzzyc-meansclusteringofatwo-dimensionalpointset.
StrengthsandLimitations
ApositivefeatureofFCMisthatitproducesaclusteringthatprovidesanindicationofthedegreetowhichanypointbelongstoanycluster.Otherwise,ithasmuchthesamestrengthsandweaknessesasK-means,althoughitissomewhatmorecomputationallyintensive.
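As a concrete illustration of Equations 8.2 and 8.4, the following sketch (Python; the function name fuzzy_cmeans, the default p = 2, and the random initialization scheme are our own illustrative assumptions, not part of the text) iterates the two FCM steps, recomputing centroids and recomputing the fuzzy pseudo-partition, until the centroids stop changing much.

import numpy as np

def fuzzy_cmeans(X, k, p=2, n_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Initial fuzzy pseudo-partition: random weights, each row summing to 1.
    W = rng.random((m, k))
    W /= W.sum(axis=1, keepdims=True)
    centroids = None
    for _ in range(n_iter):
        # Equation 8.2: centroids are membership-weighted means of all points.
        Wp = W ** p
        new_centroids = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
        # Equations 8.3/8.4: update weights from inverse squared distances.
        d2 = ((X[:, None, :] - new_centroids[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # avoid division by zero for points at a centroid
        inv = (1.0 / d2) ** (1.0 / (p - 1))
        W = inv / inv.sum(axis=1, keepdims=True)
        if centroids is not None and np.abs(new_centroids - centroids).max() < tol:
            break
        centroids = new_centroids
    return centroids, W

Each row of the returned W gives the membership weights of one point in the k clusters, which is the kind of information visualized by the shading in Figure 8.1.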
8.2.2ClusteringUsingMixtureModels
Thissectionconsidersclusteringbasedonstatisticalmodels.Itisoftenconvenientandeffectivetoassumethatdatahasbeengeneratedasaresultofastatisticalprocessandtodescribethedatabyfindingthestatisticalmodelthatbestfitsthedata,wherethestatisticalmodelisdescribedintermsofadistributionandasetofparametersforthatdistribution.Atahighlevel,thisprocessinvolvesdecidingonastatisticalmodelforthedataandestimatingtheparametersofthatmodelfromthedata.Thissectiondescribesaparticularkindofstatisticalmodel,mixturemodels,whichmodelthedatabyusinganumberofstatisticaldistributions.Eachdistributioncorrespondstoaclusterandtheparametersofeachdistributionprovideadescriptionofthecorrespondingcluster,typicallyintermsofitscenterandspread.
Thediscussioninthissectionproceedsasfollows.Afterprovidingadescriptionofmixturemodels,weconsiderhowparameterscanbeestimatedforstatisticaldatamodels.Wefirstdescribehowaprocedureknownasmaximumlikelihoodestimation(MLE)canbeusedtoestimateparametersforsimplestatisticalmodelsandthendiscusshowwecanextendthisapproachforestimatingtheparametersofmixturemodels.Specifically,wedescribethewell-knownExpectation-Maximization(EM)algorithm,whichmakesaninitialguessfortheparameters,andtheniterativelyimprovestheseestimates.WepresentexamplesofhowtheEMalgorithmcanbeusedto
clusterdatabyestimatingtheparametersofamixturemodelanddiscussitsstrengthsandlimitations.
Afirmunderstandingofstatisticsandprobability,ascoveredinAppendixC,isessentialforunderstandingthissection.Also,forconvenienceinthefollowingdiscussion,weusethetermprobabilitytorefertobothprobabilityandprobabilitydensity.
MixtureModelsMixturemodelsviewthedataasasetofobservationsfromamixtureofdifferentprobabilitydistributions.Theprobabilitydistributionscanbeanything,butareoftentakentobemultivariatenormal,asthistypeofdistributioniswellunderstood,mathematicallyeasytoworkwith,andhasbeenshowntoproducegoodresultsinmanyinstances.Thesetypesofdistributionscanmodelellipsoidalclusters.
Conceptually,mixturemodelscorrespondtothefollowingprocessofgeneratingdata.Givenseveraldistributions,usuallyofthesametype,butwithdifferentparameters,randomlyselectoneofthesedistributionsandgenerateanobjectfromit.Repeattheprocessmtimes,wheremisthenumberofobjects.
More formally, assume that there are K distributions and m objects, X = {x_1, …, x_m}. Let the jth distribution have parameters θ_j, and let Θ be the set of all parameters, i.e., Θ = {θ_1, …, θ_K}. Then, prob(x_i | θ_j) is the probability of the ith object if it comes from the jth distribution. The probability that the jth distribution is chosen to generate an object is given by the weight w_j, 1 ≤ j ≤ K, where these weights (probabilities) are subject to the constraint that they sum to one, i.e., ∑_{j=1}^{K} w_j = 1. Then, the probability of an object x is given by Equation 8.5.
prob(x | Θ) = ∑_{j=1}^{K} w_j p_j(x | θ_j)    (8.5)
If the objects are generated in an independent manner, then the probability of the entire set of objects is just the product of the probabilities of each individual x_i.
prob(X | Θ) = ∏_{i=1}^{m} prob(x_i | Θ) = ∏_{i=1}^{m} ∑_{j=1}^{K} w_j p_j(x_i | θ_j)    (8.6)
For mixture models, each distribution describes a different group, i.e., a different cluster. By using statistical methods, we can estimate the parameters of these distributions from the data and thus describe these distributions (clusters). We can also identify which objects belong to which clusters. However, mixture modeling does not produce a crisp assignment of objects to clusters, but rather gives the probability with which a specific object belongs to a particular cluster.
Example 8.2 (Univariate Gaussian Mixture). We provide a concrete illustration of a mixture model in terms of Gaussian distributions. The probability density function for a one-dimensional Gaussian distribution at a point x is
prob(x | Θ) = (1/(√(2π) σ)) e^{−(x−μ)^2 / (2σ^2)}.    (8.7)
The parameters of the Gaussian distribution are given by θ = (μ, σ), where μ is the mean of the distribution and σ is the standard deviation. Assume that there are two Gaussian distributions, with a common standard deviation of 2 and means of −4 and 4, respectively. Also assume that each of the two distributions is selected with equal probability, i.e., w_1 = w_2 = 0.5. Then Equation 8.5 becomes the following:
prob(x | Θ) = (1/(2√(2π))) e^{−(x+4)^2/8} + (1/(2√(2π))) e^{−(x−4)^2/8}.    (8.8)
Figure 8.2(a) shows a plot of the probability density function of this mixture model, while Figure 8.2(b) shows the histogram for 20,000 points generated from this mixture model.
Figure 8.2. Mixture model consisting of two normal distributions with means of −4 and 4, respectively. Both distributions have a standard deviation of 2.
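As a quick numerical illustration of Equation 8.5, the sketch below (Python; the function names gaussian_pdf and mixture_density and the choice of evaluation points are our own illustrative choices) evaluates the two-component mixture of Example 8.2 at a few points, weighting each Gaussian density by 0.5.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # One-dimensional Gaussian density, as in Equation 8.7.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mixture_density(x, weights, mus, sigmas):
    # Equation 8.5: weighted sum of the component densities.
    return sum(w * gaussian_pdf(x, mu, s) for w, mu, s in zip(weights, mus, sigmas))

# Example 8.2: equal weights, means -4 and 4, common standard deviation 2.
for x in (-4.0, 0.0, 4.0):
    print(x, mixture_density(x, [0.5, 0.5], [-4.0, 4.0], [2.0, 2.0]))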
Estimating Model Parameters Using Maximum Likelihood Given a statistical model for the data, it is necessary to estimate the parameters of that model. A standard approach used for this task is maximum likelihood estimation, which we now explain.
Consider a set of m points that are generated from a one-dimensional Gaussian distribution. Assuming that the points are generated independently, the probability of these points is just the product of their individual probabilities. (Again, we are dealing with probability densities, but to keep our terminology simple, we will refer to probabilities.) Using Equation 8.7, we can write this probability as shown in Equation 8.9. Because this probability would be a very small number, we typically will work with the log probability, as shown in Equation 8.10.
prob(X | Θ) = ∏_{i=1}^{m} (1/(√(2π) σ)) e^{−(x_i−μ)^2 / (2σ^2)}    (8.9)
log prob(X | Θ) = −∑_{i=1}^{m} (x_i−μ)^2 / (2σ^2) − 0.5 m log 2π − m log σ    (8.10)
We would like to find a procedure to estimate μ and σ if they are unknown. One approach is to choose the values of the parameters for which the data is most probable (most likely). In other words, choose the μ and σ that maximize Equation 8.9. This approach is known in statistics as the maximum likelihood principle, and the process of applying this principle to estimate the parameters of a statistical distribution from the data is known as maximum likelihood estimation (MLE).
The principle is called the maximum likelihood principle because, given a set of data, the probability of the data, regarded as a function of the parameters, is called a likelihood function. To illustrate, we rewrite Equation 8.9 as Equation 8.11 to emphasize that we view the statistical parameters μ and σ as our variables and that the data is regarded as a constant. For practical reasons, the log likelihood is more commonly used. The log likelihood function derived from the log probability of Equation 8.10 is shown in Equation 8.12. Note that the parameter values that maximize the log likelihood also maximize the likelihood since log is a monotonically increasing function.
likelihood(Θ | X) = L(Θ | X) = ∏_{i=1}^{m} (1/(√(2π) σ)) e^{−(x_i−μ)^2 / (2σ^2)}    (8.11)
log likelihood(Θ | X) = ℓ(Θ | X) = −∑_{i=1}^{m} (x_i−μ)^2 / (2σ^2) − 0.5 m log 2π − m log σ    (8.12)
Example 8.3 (Maximum Likelihood Parameter Estimation). We provide a concrete illustration of the use of MLE for finding parameter values. Suppose that we have the set of 200 points whose histogram is shown in Figure 8.3(a). Figure 8.3(b) shows the maximum log likelihood plot for the 200 points under consideration. The values of the parameters for which the log probability is a maximum are μ = −4.1 and σ = 2.1, which are close to the parameter values of the underlying Gaussian distribution, μ = −4.0 and σ = 2.0.
Figure 8.3. 200 points from a Gaussian distribution and their log probability for different parameter values.
Graphing the likelihood of the data for different values of the parameters is not practical, at least if there are more than two parameters. Thus, the standard statistical procedure is to derive the maximum likelihood estimates of a statistical parameter by taking the derivative of the likelihood function with respect to that parameter, setting the result equal to 0, and solving. In particular, for a Gaussian distribution, it can be shown that the mean and standard deviation of the sample points are the maximum likelihood estimates of the corresponding parameters of the underlying distribution. (See Exercise 9 on page 700.) Indeed, for the 200 points considered in our example, the parameter values that maximized the log likelihood were precisely the mean and standard deviation of the 200 points, i.e., μ = −4.1 and σ = 2.1.
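This closed-form result is easy to check numerically. The sketch below (Python; the data is re-simulated here since the 200 points of Example 8.3 are not available, so the printed values will only be close to those in the text) compares the log likelihood of Equation 8.12 at the sample mean and standard deviation with the log likelihood at nearby parameter values.

import numpy as np

def log_likelihood(x, mu, sigma):
    # Equation 8.12 for a one-dimensional Gaussian.
    m = len(x)
    return (-np.sum((x - mu) ** 2) / (2 * sigma ** 2)
            - 0.5 * m * np.log(2 * np.pi) - m * np.log(sigma))

rng = np.random.default_rng(0)
x = rng.normal(loc=-4.0, scale=2.0, size=200)   # stand-in for the 200 points

mu_hat, sigma_hat = x.mean(), x.std()           # maximum likelihood estimates
print(mu_hat, sigma_hat, log_likelihood(x, mu_hat, sigma_hat))
# Nearby parameter values give a smaller log likelihood.
print(log_likelihood(x, mu_hat + 0.5, sigma_hat))
print(log_likelihood(x, mu_hat, sigma_hat + 0.5))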
Estimating Mixture Model Parameters Using Maximum Likelihood: The EM Algorithm We can also use the maximum likelihood approach to estimate the model parameters for a mixture model. In the simplest case, we know which data objects come from which distributions, and the situation reduces to one of estimating the parameters of a single distribution given data from that distribution. For most common distributions, the maximum likelihood estimates of the parameters are calculated from simple formulas involving the data.
In a more general (and more realistic) situation, we do not know which points were generated by which distribution. Thus, we cannot directly calculate the probability of each data point, and hence, it would seem that we cannot use the maximum likelihood principle to estimate parameters. The solution to this problem is the EM algorithm, which is shown in Algorithm 8.2. Briefly, given a guess for the parameter values, the EM algorithm calculates the probability that each point belongs to each distribution and then uses these probabilities to compute a new estimate for the parameters. (These parameters are the ones that maximize the likelihood.) This iteration continues until the estimates of the parameters either do not change or change very little. Thus, we still employ maximum likelihood estimation, but via an iterative search.
Algorithm 8.2 EM algorithm.
1: Select an initial set of model parameters. (As with K-means, this can be done randomly or in a variety of ways.)
2: repeat
3: Expectation Step For each object, calculate the probability that each object belongs to each distribution, i.e., calculate prob(distribution j | x_i, Θ).
4: Maximization Step Given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood.
5: until The parameters do not change. (Alternatively, stop if the change in the parameters is below a specified threshold.)
The EM algorithm is similar to the K-means algorithm given in Section 7.2.1. Indeed, the K-means algorithm for Euclidean data is a special case of the EM algorithm for spherical Gaussian distributions with equal covariance matrices, but different means. The expectation step corresponds to the K-means step of assigning each object to a cluster. Instead, each object is assigned to every cluster (distribution) with some probability. The maximization step corresponds to computing the cluster centroids. Instead, all the parameters of the distributions, as well as the weight parameters, are selected to maximize the likelihood. This process is often straightforward, as the parameters are typically computed using formulas derived from maximum likelihood estimation. For instance, for a single Gaussian distribution, the MLE estimate of the mean is the mean of the objects in the distribution. In the context of mixture models and the EM algorithm, the computation of the mean is modified to account for the fact that every object belongs to a distribution with a certain probability. This is illustrated further in the following example.
Example 8.4 (Simple Example of EM Algorithm). This example illustrates how EM operates when applied to the data in Figure 8.2. To keep the example as simple as possible, we assume that we know that the standard deviation of both distributions is 2.0 and that points were generated with equal probability from both distributions. We will refer to the left and right distributions as distributions 1 and 2, respectively.
We begin the EM algorithm by making initial guesses for μ_1 and μ_2, say, μ_1 = −2 and μ_2 = 3. Thus, the initial parameters, θ = (μ, σ), for the two distributions are, respectively, θ_1 = (−2, 2) and θ_2 = (3, 2). The set of parameters for the entire mixture model is Θ = {θ_1, θ_2}. For the expectation step of EM, we want to compute the probability that a point came from a particular distribution; i.e., we want to compute prob(distribution 1 | x_i, Θ) and prob(distribution 2 | x_i, Θ). These values can be expressed by Equation 8.13, which is a straightforward application of Bayes rule, which is described in Appendix C.
prob(distribution j | x_i, Θ) = 0.5 prob(x_i | θ_j) / (0.5 prob(x_i | θ_1) + 0.5 prob(x_i | θ_2)),    (8.13)
where 0.5 is the probability (weight) of each distribution and j is 1 or 2.
For instance, assume one of the points is 0. Using the Gaussian density function given in Equation 8.7, we compute that prob(0 | θ_1) = 0.12 and prob(0 | θ_2) = 0.06. (Again, we are really computing probability densities.) Using these values and Equation 8.13, we find that prob(distribution 1 | 0, Θ) = 0.12/(0.12 + 0.06) = 0.66 and prob(distribution 2 | 0, Θ) = 0.06/(0.12 + 0.06) = 0.33. This means that the point 0 is twice as likely to belong to distribution 1 as distribution 2 based on the current assumptions for the parameter values.
After computing the cluster membership probabilities for all 20,000 points, we compute new estimates for μ_1 and μ_2 (using Equations 8.14 and 8.15) in the maximization step of the EM algorithm. Notice that the new estimate for the mean of a distribution is just a weighted average of the points, where the weights are the probabilities that the points belong to the distribution, i.e., the prob(distribution j | x_i) values.
μ_1 = ∑_{i=1}^{20,000} x_i prob(distribution 1 | x_i, Θ) / ∑_{i=1}^{20,000} prob(distribution 1 | x_i, Θ)    (8.14)
μ_2 = ∑_{i=1}^{20,000} x_i prob(distribution 2 | x_i, Θ) / ∑_{i=1}^{20,000} prob(distribution 2 | x_i, Θ)    (8.15)
We repeat these two steps until the estimates of μ_1 and μ_2 either don't change or change very little. Table 8.1 gives the first few iterations of the EM algorithm when it is applied to the set of 20,000 points. For this data, we know which distribution generated which point, so we can also compute the mean of the points from each distribution. The means are μ_1 = −3.98 and μ_2 = 4.03.
Table 8.1. First few iterations of the EM algorithm for the simple example.
Iteration   μ_1     μ_2
0           −2.00   3.00
1           −3.74   4.10
2           −3.94   4.07
3           −3.97   4.04
4           −3.98   4.03
5           −3.98   4.03
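A minimal sketch of this expectation-maximization loop for the two-component, one-dimensional case of Example 8.4 is shown below (Python; the helper gaussian_pdf, the re-simulated data, and the fixed number of iterations are our own illustrative assumptions, and because the data is re-simulated the printed means will only approximate Table 8.1).

import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(0)
# Simulate the data of Figure 8.2: 10,000 points from each component.
x = np.concatenate([rng.normal(-4, 2, 10000), rng.normal(4, 2, 10000)])

mu1, mu2, sigma = -2.0, 3.0, 2.0          # initial guesses, as in Example 8.4
for iteration in range(10):
    # Expectation step (Equation 8.13): posterior probability of each component.
    p1 = 0.5 * gaussian_pdf(x, mu1, sigma)
    p2 = 0.5 * gaussian_pdf(x, mu2, sigma)
    w1 = p1 / (p1 + p2)
    w2 = 1.0 - w1
    # Maximization step (Equations 8.14 and 8.15): probability-weighted means.
    mu1 = np.sum(w1 * x) / np.sum(w1)
    mu2 = np.sum(w2 * x) / np.sum(w2)
    print(iteration + 1, round(mu1, 2), round(mu2, 2))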
Example8.5(TheEMAlgorithmonSampleDataSets).WegivethreeexamplesthatillustratetheuseoftheEMalgorithmtofindclustersusingmixturemodels.Thefirstexampleisbasedonthedatasetusedtoillustratethefuzzyc-meansalgorithm—seeFigure8.1 .Wemodeledthisdataasamixtureofthreetwo-dimensionalGaussiandistributionswithdifferentmeansandidenticalcovariancematrices.WethenclusteredthedatausingtheEMalgorithm.TheresultsareshowninFigure8.4 .Eachpointwasassignedtotheclusterinwhichithadthelargestmembershipweight.Thepointsbelongingtoeachclusterareshownbydifferentmarkershapes,whilethedegreeofmembershipintheclusterisshownbytheshading.Membershipinaclusterisrelativelyweakforthosepointsthatareontheborderofthetwoclusters,butstrongelsewhere.ItisinterestingtocomparethemembershipweightsandprobabilitiesofFigures8.4 and8.1 .(SeeExercise11 onpage700.)
Figure8.4.EMclusteringofatwo-dimensionalpointsetwiththreeclusters.
For our second example, we apply mixture model clustering to data that contains clusters with different densities. The data consists of two natural clusters, each with roughly 500 points. This data was created by combining two sets of Gaussian data, one with a center at (−4, 1) and a standard deviation of 2, and one with a center at (0, 0) and a standard deviation of 0.5. Figure 8.5 shows the clustering produced by the EM algorithm. Despite the differences in the density, the EM algorithm is quite successful at identifying the original clusters.
Figure8.5.EMclusteringofatwo-dimensionalpointsetwithtwoclustersofdifferingdensity.
Forourthirdexample,weusemixturemodelclusteringonadatasetthatK-meanscannotproperlyhandle.Figure8.6(a) showstheclusteringproducedbyamixturemodelalgorithm,whileFigure8.6(b) showstheK-meansclusteringofthesamesetof1,000points.Formixturemodelclustering,eachpointhasbeenassignedtotheclusterforwhichithasthehighestprobability.Inbothfigures,differentmarkersareusedtodistinguishdifferentclusters.Donotconfusethe‘+’and‘x’markersinFigure8.6(a) .
Figure8.6.MixturemodelandK-meansclusteringofasetoftwo-dimensionalpoints.
AdvantagesandLimitationsofMixtureModelClusteringUsingtheEMAlgorithmFindingclustersbymodelingthedatausingmixturemodelsandapplyingtheEMalgorithmtoestimatetheparametersofthosemodelshasavarietyofadvantagesanddisadvantages.Onthenegativeside,theEMalgorithmcanbeslow,itisnotpracticalformodelswithlargenumbersofcomponents,anditdoesnotworkwellwhenclusterscontainonlyafewdatapointsorifthedatapointsarenearlyco-linear.Thereisalsoaprobleminestimatingthenumberofclustersor,moregenerally,inchoosingtheexactformofthemodeltouse.ThisproblemtypicallyhasbeendealtwithbyapplyingaBayesianapproach,which,roughlyspeaking,givestheoddsofonemodelversusanother,basedonanestimatederivedfromthedata.Mixturemodelscanalsohavedifficultywithnoiseandoutliers,althoughworkhasbeendonetodealwiththisproblem.
Onthepositiveside,mixturemodelsaremoregeneralthanK-meansorfuzzyc-meansbecausetheycanusedistributionsofvarioustypes.Asaresult,mixturemodels(basedonGaussiandistributions)canfindclustersofdifferentsizesandellipticalshapes.Also,amodel-basedapproachprovidesadisciplinedwayofeliminatingsomeofthecomplexityassociatedwithdata.Toseethepatternsindata,itisoftennecessarytosimplifythedata,andfittingthedatatoamodelisagoodwaytodothatifthemodelisagoodmatchforthedata.Furthermore,itiseasytocharacterizetheclustersproduced,becausetheycanbedescribedbyasmallnumberofparameters.Finally,manysetsofdataareindeedtheresultofrandomprocesses,andthusshouldsatisfythestatisticalassumptionsofthesemodels.
8.2.3Self-OrganizingMaps(SOM)
TheKohonenSelf-OrganizingFeatureMap(SOFMorSOM)isaclusteringanddatavisualizationtechniquebasedonaneuralnetworkviewpoint.DespitetheneuralnetworkoriginsofSOM,itismoreeasilypresented—atleastinthecontextofthischapter—asavariationofprototype-basedclustering.Aswithothertypesofcentroid-basedclustering,thegoalofSOMistofindasetofcentroids(referencevectorsinSOMterminology)andtoassigneachobjectinthedatasettothecentroidthatprovidesthebestapproximationofthatobject.Inneuralnetworkterminology,thereisoneneuronassociatedwitheachcentroid.
AswithincrementalK-means,dataobjectsareprocessedoneatatimeandtheclosestcentroidisupdated.UnlikeK-means,SOMimposesatopographicorderingonthecentroidsandnearbycentroidsarealsoupdated.Furthermore,SOMdoesnotkeeptrackofthecurrentclustermembershipofanobject,and,unlikeK-means,ifanobjectswitchesclusters,thereisnoexplicitupdateoftheoldclustercentroid.However,iftheoldclusterisintheneighborhoodofthenewcluster,itwillbeupdated.Theprocessingofpointscontinuesuntilsomepredeterminedlimitisreachedorthecentroidsarenotchangingverymuch.ThefinaloutputoftheSOMtechniqueisasetofcentroidsthatimplicitlydefineclusters.Eachclusterconsistsofthepointsclosesttoaparticularcentroid.Thefollowingsectionexploresthedetailsofthisprocess.
TheSOMAlgorithmAdistinguishingfeatureofSOMisthatitimposesatopographic(spatial)organizationonthecentroids(neurons).Figure8.7 showsanexampleofatwo-dimensionalSOMinwhichthecentroidsarerepresentedbynodesthat
areorganizedinarectangularlattice.Eachcentroidisassignedapairofcoordinates(i,j).Sometimes,suchanetworkisdrawnwithlinksbetweenadjacentnodes,butthatcanbemisleadingbecausetheinfluenceofonecentroidonanotherisviaaneighborhoodthatisdefinedintermsofcoordinates,notlinks.TherearemanytypesofSOMneuralnetworks,butwerestrictourdiscussiontotwo-dimensionalSOMswitharectangularorhexagonalorganizationofthecentroids.
Figure8.7.Two-dimensional3-by-3rectangularSOMneuralnetwork.
EventhoughSOMissimilartoK-meansorotherprototype-basedapproaches,thereisafundamentaldifference.CentroidsusedinSOMhaveapredeterminedtopographicorderingrelationship.Duringthetrainingprocess,SOMuseseachdatapointtoupdatetheclosestcentroidandcentroidsthatarenearbyinthetopographicordering.Inthisway,SOMproducesanorderedsetofcentroidsforanygivendataset.Inotherwords,thecentroidsthatareclosetoeachotherintheSOMgridaremorecloselyrelatedtoeachotherthantothecentroidsthatarefartheraway.Becauseofthisconstraint,thecentroidsofatwo-dimensionalSOMcanbeviewedaslyingonatwo-dimensionalsurfacethattriestofitthen-dimensionaldataaswellaspossible.TheSOMcentroidscanalsobethoughtofastheresultofanonlinearregressionwithrespecttothedatapoints.
Atahighlevel,clusteringusingtheSOMtechniqueconsistsofthestepsdescribedinAlgorithm8.3 .
Algorithm 8.3 Basic SOM Algorithm.
1: Initialize the centroids.
2: repeat
3: Select the next object.
4: Determine the closest centroid to the object.
5: Update this centroid and the centroids that are close, i.e., in a specified neighborhood.
6: until The centroids don't change much or a threshold is exceeded.
7: Assign each object to its closest centroid and return the centroids and clusters.
Initialization
This step (line 1) can be performed in a number of ways. One approach is to choose each component of a centroid randomly from the range of values observed in the data for that component. While this approach works, it is not necessarily the best approach, especially for producing rapid convergence. Another approach is to randomly choose the initial centroids from the available data points. This is very much like randomly selecting centroids for K-means.
Selection of an Object
The first step in the loop (line 3) is the selection of the next object. This is fairly straightforward, but there are some difficulties. Because convergence can require many steps, each data object may be used multiple times, especially if the number of objects is small. However, if the number of objects is large, then not every object needs to be used. It is also possible to enhance the influence of certain groups of objects by increasing their frequency in the training set.
Assignment
Thedeterminationoftheclosestcentroid(line4)isalsorelativelystraightforward,althoughitrequiresthespecificationofadistancemetric.TheEuclideandistancemetricisoftenused,asisthedotproductmetric.Whenusingthedotproductdistance,thedatavectorsaretypicallynormalizedbeforehandandthereferencevectorsarenormalizedateachstep.Insuchcases,usingthedotproductmetricisequivalenttousingthecosinemeasure.
Update
The update step (line 5) is the most complicated. Let m_1, …, m_k be the centroids. (For a rectangular grid, note that k is the product of the number of rows and the number of columns.) For time step t, let p(t) be the current object (point) and assume that the closest centroid to p(t) is m_j. Then, for time t + 1, the jth centroid is updated by using the following equation. (We will see shortly that the update is really restricted to centroids whose neurons are in a small neighborhood of m_j.)
m_j(t+1) = m_j(t) + h_j(t) (p(t) − m_j(t))    (8.16)
Thus, at time t, a centroid m_j(t) is updated by adding a term, h_j(t) (p(t) − m_j(t)), which is proportional to the difference, p(t) − m_j(t), between the current object, p(t), and centroid, m_j(t). h_j(t) determines the effect that the difference, p(t) − m_j(t), will have and is chosen so that (1) it diminishes with time and (2) it enforces a neighborhood effect, i.e., the effect of an object is strongest on the centroids closest to the centroid m_j. Here we are referring to the distance in the grid, not the distance in the data space. Typically, h_j(t) is chosen to be one of the following two functions:
h_j(t) = α(t) exp(−dist(r_j, r_k)^2 / (2σ^2(t)))    (Gaussian function)
h_j(t) = α(t) if dist(r_j, r_k) ≤ threshold    (step function)
These functions require more explanation. α(t) is a learning rate parameter, 0 < α(t) < 1, which decreases monotonically with time and controls the rate of convergence. r_k = (x_k, y_k) is the two-dimensional point that gives the grid coordinates of the kth centroid. dist(r_j, r_k) is the Euclidean distance between the grid locations of the two centroids, i.e., √((x_j − x_k)^2 + (y_j − y_k)^2). Consequently, for centroids whose grid locations are far from the grid location of centroid m_j, the influence of object p(t) will be either greatly diminished or non-existent. Finally, note that σ is the typical Gaussian variance parameter and controls the width of the neighborhood, i.e., a small σ will yield a small neighborhood, while a large σ will yield a wide neighborhood. The threshold used for the step function also controls the neighborhood size.
Remember, it is the neighborhood updating technique that enforces a relationship (ordering) between centroids associated with neighboring neurons.
Termination
Deciding when we are close enough to a stable set of centroids is an important issue. Ideally, iteration should continue until convergence occurs, that is, until the reference vectors either do not change or change very little. The rate of convergence will depend on a number of factors, such as the data and α(t). We will not discuss these issues further, except to mention that, in general, convergence can be slow and is not guaranteed.
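A minimal sketch of one SOM update step, following Equation 8.16 with the Gaussian neighborhood function, is given below (Python; the 3-by-3 grid, the learning rate and neighborhood schedules, and the function name som_update are our own illustrative assumptions, not prescriptions from the text).

import numpy as np

def som_update(centroids, grid_coords, x, t, alpha0=0.5, sigma0=1.0):
    # One update of a SOM for data point x at time step t (Equation 8.16).
    alpha = alpha0 / (1 + t)                # learning rate decreasing with time
    sigma = sigma0 / (1 + 0.01 * t)         # shrinking neighborhood width
    # Closest centroid (winning neuron) in the data space.
    j = np.argmin(np.linalg.norm(centroids - x, axis=1))
    # Neighborhood influence h uses distance in the grid, not in the data space.
    grid_dist2 = np.sum((grid_coords - grid_coords[j]) ** 2, axis=1)
    h = alpha * np.exp(-grid_dist2 / (2 * sigma ** 2))
    # Move every centroid toward x in proportion to its neighborhood weight.
    return centroids + h[:, None] * (x - centroids)

# 3-by-3 rectangular grid of centroids for two-dimensional data, as in Figure 8.7.
grid_coords = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
rng = np.random.default_rng(0)
centroids = rng.random((9, 2))
for t, x in enumerate(rng.random((500, 2))):
    centroids = som_update(centroids, grid_coords, x, t)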
Example8.6(DocumentData).Wepresenttwoexamples.Inthefirstcase,weapplySOMwitha4-by-4hexagonalgridtodocumentdata.Weclustered3204newspaperarticlesfromtheLosAngelesTimes,whichcomefrom6differentsections:Entertainment,Financial,Foreign,Metro,National,andSports.Figure8.8 showstheSOMgrid.Wehaveusedahexagonalgrid,whichallowseachcentroidtohavesiximmediateneighborsinsteadoffour.EachSOMgridcell(cluster)hasbeenlabeledwiththemajorityclasslabeloftheassociatedpoints.Theclustersofeachparticularcategoryformcontiguousgroups,andtheirpositionrelativetoothercategoriesofclustersgivesusadditionalinformation,e.g.,thattheMetrosectioncontainsstoriesrelatedtoallothersections.
Figure8.8.VisualizationoftherelationshipsbetweenSOMclusterforLosAngelesTimesdocumentdataset.
Example8.7(Two-DimensionalPoints).Inthesecondcase,weusearectangularSOMandasetoftwo-dimensionaldatapoints.Figure8.9(a) showsthepointsandthepositionsofthe36referencevectors(shownasx’s)producedbySOM.Thepointsarearrangedinacheckerboardpatternandaresplitintofiveclasses:circles,triangles,squares,diamonds,andhexagons(stars).A6-by-6two-dimensionalrectangulargridofcentroidswasusedwithrandominitialization.AsFigure8.9(a) shows,thecentroidstendtodistributethemselvestothedenseareas.Figure8.9(b) indicatesthemajorityclassofthepointsassociatedwiththatcentroid.Theclustersassociatedwithtrianglepointsareinonecontiguousarea,asarethecentroidsassociatedwiththefourothertypesofpoints.ThisisaresultoftheneighborhoodconstraintsenforcedbySOM.Whiletherearethesamenumberofpointsineachofthefivegroups,noticealsothatthecentroidsarenotevenlydistributed.Thisispartlyduetotheoveralldistributionofpointsandpartlyanartifactofputtingeachcentroidinasinglecluster.
Figure8.9.SOMappliedtotwo-dimensionaldatapoints.
ApplicationsOncetheSOMvectorsarefound,theycanbeusedformanypurposesotherthanclustering.Forexample,withatwo-dimensionalSOM,itispossibletoassociatevariousquantitieswiththegridpointsassociatedwitheachcentroid(cluster)andtovisualizetheresultsviavarioustypesofplots.Forexample,plottingthenumberofpointsassociatedwitheachclusteryieldsaplotthatrevealsthedistributionofpointsamongclusters.Atwo-dimensionalSOMisanonlinearprojectionoftheoriginalprobabilitydistributionfunctionintotwodimensions.Thisprojectionattemptstopreservetopologicalfeatures;thus,usingSOMtocapturethestructureofthedatahasbeencomparedtotheprocessof“pressingaflower.”
StrengthsandLimitationsSOMisaclusteringtechniquethatenforcesneighborhoodrelationshipsontheresultingclustercentroids.Becauseofthis,clustersthatareneighborsaremorerelatedtooneanotherthanclustersthatarenot.Suchrelationshipsfacilitatetheinterpretationandvisualizationoftheclusteringresults.Indeed,thisaspectofSOMhasbeenexploitedinmanyareas,suchasvisualizingwebdocumentsorgenearraydata.
SOMalsohasanumberoflimitations,whicharelistednext.SomeofthelistedlimitationsareonlyvalidifweconsiderSOMtobeastandardclusteringtechniquethataimstofindthetrueclustersinthedata,ratherthanatechniquethatusesclusteringtohelpdiscoverthestructureofthedata.Also,
someoftheselimitationshavebeenaddressedeitherbyextensionsofSOMorbyclusteringalgorithmsinspiredbySOM.(SeetheBibliographicNotes.)
- The user must choose the settings of parameters, the neighborhood function, the grid type, and the number of centroids.
- A SOM cluster often does not correspond to a single natural cluster. In some cases, a SOM cluster might encompass several natural clusters, while in other cases a single natural cluster is split into several SOM clusters. This problem is partly due to the use of a grid of centroids and partly due to the fact that SOM, like other prototype-based clustering techniques, tends to split or combine natural clusters when they are of varying sizes, shapes, and densities.
- SOM lacks a specific objective function. SOM attempts to find a set of centroids that best approximate the data, subject to the topographic constraints among the centroids, but the success of SOM in doing this cannot be expressed by a function. This can make it difficult to compare different SOM clustering results.
- SOM is not guaranteed to converge, although, in practice, it typically does.
8.3 Density-Based Clustering
In Section 7.4, we considered DBSCAN, a simple, but effective algorithm for finding density-based clusters, i.e., dense regions of objects that are surrounded by low-density regions. This section examines additional density-based clustering techniques that address issues of efficiency, finding clusters in subspaces, and more accurately modeling density. First, we consider grid-based clustering, which breaks the data space into grid cells and then forms clusters from cells that are sufficiently dense. Such an approach can be efficient and effective, at least for low-dimensional data. Next, we consider subspace clustering, which looks for clusters (dense regions) in subsets of all dimensions. For a data space with n dimensions, potentially 2^n − 1 subspaces need to be searched, and thus an efficient technique is needed to do this. CLIQUE is a grid-based clustering algorithm that provides an efficient approach to subspace clustering based on the observation that dense areas in a high-dimensional space imply the existence of dense areas in lower-dimensional space. Finally, we describe DENCLUE, a clustering technique that uses kernel density functions to model density as the sum of the influences of individual data objects. While DENCLUE is not fundamentally a grid-based technique, it does employ a grid-based approach to improve efficiency.
8.3.1 Grid-Based Clustering
A grid is an efficient way to organize a set of data, at least in low dimensions. The idea is to split the possible values of each attribute into a number of contiguous intervals, creating a set of grid cells. (We are assuming, for this discussion and the remainder of the section, that our attributes are ordinal, interval, or continuous.) Each object falls into a grid cell whose corresponding attribute intervals contain the values of the object. Objects can be assigned to grid cells in one pass through the data, and information about each cell, such as the number of points in the cell, can also be gathered at the same time.
Thereareanumberofwaystoperformclusteringusingagrid,butmostapproachesarebasedondensity,atleastinpart,andthus,inthissection,wewillusegrid-basedclusteringtomeandensity-basedclusteringusingagrid.Algorithm8.4 describesabasicapproachtogrid-basedclustering.Variousaspectsofthisapproachareexplorednext.
Algorithm 8.4 Basic grid-based clustering algorithm.
1: Define a set of grid cells.
2: Assign objects to the appropriate cells and compute the density of each cell.
3: Eliminate cells having a density below a specified threshold, τ.
4: Form clusters from contiguous (adjacent) groups of dense cells.
Defining Grid Cells This is a key step in the process, but also the least well defined, as there are many ways to split the possible values of each attribute into a number of contiguous intervals. For continuous attributes, one common approach is to split the values into equal width intervals. If this approach is applied to each attribute, then the resulting grid cells all have the same volume, and the density of a cell is conveniently defined as the number of points in the cell.
However,moresophisticatedapproachescanalsobeused.Inparticular,forcontinuousattributesanyofthetechniquesthatarecommonlyusedtodiscretizeattributescanbeapplied.(SeeSection2.3.6 .)Inadditiontotheequalwidthapproachalreadymentioned,thisincludes(1)breakingthevaluesofanattributeintointervalssothateachintervalcontainsanequalnumberofpoints,i.e.,equalfrequencydiscretization,or(2)usingclustering.Anotherapproach,whichisusedbythesubspaceclusteringalgorithmMAFIA,initiallybreaksthesetofvaluesofanattributeintoalargenumberofequalwidthintervalsandthencombinesintervalsofsimilardensity.
Regardlessoftheapproachtaken,thedefinitionofthegridhasastrongimpactontheclusteringresults.Wewillconsiderspecificaspectsofthislater.
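A compact sketch of Algorithm 8.4 for two-dimensional data is shown below (Python; the equal width binning, the breadth-first merge of adjacent dense cells over the 8 neighboring cells, and the function name grid_cluster are our own illustrative choices, not part of the text).

import numpy as np
from collections import defaultdict, deque

def grid_cluster(points, n_bins=7, tau=9):
    # Basic grid-based clustering: bin points, keep dense cells, join adjacent dense cells.
    lo, hi = points.min(axis=0), points.max(axis=0)
    width = (hi - lo) / n_bins
    # Steps 1-2: assign each point to a cell and record the points per cell.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cell = tuple(np.minimum(((p - lo) / width).astype(int), n_bins - 1))
        cells[cell].append(idx)
    # Step 3: keep only cells whose density (count) reaches the threshold tau.
    dense = {c for c, members in cells.items() if len(members) >= tau}
    # Step 4: form clusters from contiguous groups of dense cells (8-connectivity).
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        queue, group = deque([start]), []
        seen.add(start)
        while queue:
            c = queue.popleft()
            group.extend(cells[c])
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    nb = (c[0] + di, c[1] + dj)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(group)
    return clusters

With n_bins = 7 and tau = 9, this sketch corresponds to the 7-by-7 grid and the density threshold discussed for Figure 8.10 and Table 8.2 below.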
TheDensityofGridCellsAnaturalwaytodefinethedensityofagridcell(oramoregenerallyshapedregion)isasthenumberofpointsdividedbythevolumeoftheregion.Inotherwords,densityisthenumberofpointsperamountofspace,regardlessofthedimensionalityofthatspace.Specific,low-dimensionalexamplesofdensityarethenumberofroadsignspermile(onedimension),thenumberofeaglespersquarekilometerofhabitat(twodimensions),andthenumberofmoleculesofagaspercubiccentimeter(threedimensions).Asmentioned,however,acommonapproachistousegridcellsthathavethesamevolumesothatthenumberofpointspercellisadirectmeasureofthecell’sdensity.
Example8.8(Grid-BasedDensity).
Figure8.10 showstwosetsoftwo-dimensionalpointsdividedinto49cellsusinga7-by-7grid.Thefirstsetcontains200pointsgeneratedfromauniformdistributionoveracirclecenteredat(2,3)ofradius2,whilethesecondsethas100pointsgeneratedfromauniformdistributionoveracirclecenteredat(6,3)ofradius1.ThecountsforthegridcellsareshowninTable8.2 .Sincethecellshaveequalvolume(area),wecanconsiderthesevaluestobethedensitiesofthecells.
Figure8.10.Grid-baseddensity.
Table8.2.Pointcountsforgridcells.
0 0 0 0 0 0 0
0 0 0 0 0 0 0
4 17 18 6 0 0 0
14 14 13 13 0 18 27
11 18 10 21 0 24 31
3 20 14 4 0 0 0
0 0 0 0 0 0 0
FormingClustersfromDenseGridCellsFormingclustersfromadjacentgroupsofdensecellsisrelativelystraightforward.(InFigure8.10 ,forexample,itisclearthattherewouldbetwoclusters.)Thereare,however,someissues.Weneedtodefinewhatwemeanbyadjacentcells.Forexample,doesatwo-dimensionalgridcellhave4adjacentcellsor8?Also,weneedanefficienttechniquetofindtheadjacentcells,particularlywhenonlyoccupiedcellsarestored.
TheclusteringapproachdefinedbyAlgorithm8.4 hassomelimitationsthatcouldbeaddressedbymakingthealgorithmslightlymoresophisticated.Forexample,therearelikelytobepartiallyemptycellsontheboundaryofacluster.Often,thesecellsarenotdense.Ifso,theywillbediscardedandpartsofaclusterwillbelost.Figure8.10 andTable8.2 showthatfourpartsofthelargerclusterwouldbelostifthedensitythresholdis9.Theclusteringprocesscouldbemodifiedtoavoiddiscardingsuchcells,althoughthiswouldrequireadditionalprocessing.
Itisalsopossibletoenhancebasicgrid-basedclusteringbyusingmorethanjustdensityinformation.Inmanycases,thedatahasbothspatialandnon-spatialattributes.Inotherwords,someoftheattributesdescribethelocationofobjectsintimeorspace,whileotherattributesdescribeotheraspectsoftheobjects.Acommonexampleishouses,whichhavebothalocationandanumberofothercharacteristics,suchaspriceorfloorspaceinsquarefeet.Becauseofspatial(ortemporal)autocorrelation,objectsinaparticularcelloftenhavesimilarvaluesfortheirotherattributes.Insuchcases,itispossibletofilterthecellsbasedonthestatisticalpropertiesofoneormorenon-spatial
attributes,e.g.,averagehouseprice,andthenformclustersbasedonthedensityoftheremainingpoints.
Strengths and Limitations On the positive side, grid-based clustering can be very efficient and effective. Given a partitioning of each attribute, a single pass through the data can determine the grid cell of every object and the count of every grid cell. Also, even though the number of potential grid cells can be high, grid cells need to be created only for non-empty cells. Thus, the time and space complexity of defining the grid, assigning each object to a cell, and computing the density of each cell is only O(m), where m is the number of points. If adjacent, occupied cells can be efficiently accessed, for example, by using a search tree, then the entire clustering process will be highly efficient, e.g., with a time complexity of O(m log m). For this reason, the grid-based approach to density clustering forms the basis of a number of clustering algorithms, such as STING, GRIDCLUS, WaveCluster, Bang-Clustering, CLIQUE, and MAFIA.
On the negative side, grid-based clustering, like most density-based clustering schemes, is very dependent on the choice of the density threshold τ. If τ is too high, then clusters will be lost. If τ is too low, two clusters that should be separate may be joined. Furthermore, if there are clusters and noise of differing densities, then it might not be possible to find a single value of τ that works for all parts of the data space.
There are also a number of issues related to the grid-based approach. In Figure 8.10, for example, the rectangular grid cells do not accurately capture the density of the circular boundary areas. We could attempt to alleviate this problem by making the grid finer, but the number of points in the grid cells associated with a cluster would likely show more fluctuation because points in the cluster are not evenly distributed. Indeed, some grid cells, including those in the interior of the cluster, might even be empty. Another issue is that, depending on the placement or size of the cells, a group of points can appear in just one cell or be split between several different cells. The same group of points might be part of a cluster in the first case, but be discarded in the second. Finally, as dimensionality increases, the number of potential grid cells increases rapidly, exponentially in the number of dimensions. Even though it is not necessary to explicitly consider empty grid cells, it can easily happen that most grid cells contain a single object. In other words, grid-based clustering tends to work poorly for high-dimensional data.
8.3.2SubspaceClustering
Theclusteringtechniquesconsidereduntilnowfoundclustersbyusingalloftheattributes.However,ifonlysubsetsofthefeaturesareconsidered,i.e.,subspacesofthedata,thentheclustersthatwefindcanbequitedifferentfromonesubspacetoanother.Therearetworeasonsthatsubspaceclustersmightbeinteresting.First,thedatamaybeclusteredwithrespecttoasmallsetofattributes,butrandomlydistributedwithrespecttotheremainingattributes.Second,therearecasesinwhichdifferentclustersexistindifferentsetsofdimensions.Consideradatasetthatrecordsthesalesofvariousitemsatvarioustimes.(Thetimesarethedimensionsandtheitemsaretheobjects.)Someitemsmightshowsimilarbehavior(clustertogether)forparticularsetsofmonths,e.g.,summer,butdifferentclusterswouldlikelybecharacterizedbydifferentmonths(dimensions).
Example 8.9 (Subspace Clusters). Figure 8.11(a) shows a set of points in three-dimensional space. There are three clusters of points in the full space, which are represented by squares, diamonds, and triangles. In addition, there is one set of points, represented by circles, that is not a cluster in three-dimensional space. Each dimension (attribute) of the example data set is split into a fixed number (η) of equal width intervals. There are η = 20 intervals, each of size 0.1. This partitions the data space into rectangular cells of equal volume, and thus, the density of each unit is the fraction of points it contains. Clusters are contiguous groups of dense cells. To illustrate, if the threshold for a dense cell is ξ = 0.06, or 6% of the points, then three one-dimensional clusters can be identified in Figure 8.12, which shows a histogram of the data points of Figure 8.11(a) for the x attribute.
Figure8.11.Examplefiguresforsubspaceclustering.
Figure8.12.Histogramshowingthedistributionofpointsforthexattribute.
Figure8.11(b) showsthepointsplottedinthexyplane.(Thezattributeisignored.)Thisfigurealsocontainshistogramsalongthexandyaxesthatshowthedistributionofthepointswithrespecttotheirxandycoordinates,respectively.(Ahigherbarindicatesthatthecorrespondingintervalcontainsrelativelymorepoints,andviceversa.)Whenweconsidertheyaxis,weseethreeclusters.Oneisfromthecirclepointsthatdonotformaclusterinthefullspace,oneconsistsofthesquarepoints,andoneconsistsofthediamondandtrianglepoints.Therearealsothreeclustersinthexdimension;theycorrespondtothethreeclusters—diamonds,triangles,andsquares—inthefullspace.Thesepointsalsoformdistinctclustersinthexyplane.Figure8.11(c) showsthepointsplottedinthexzplane.Therearetwoclusters,ifweconsideronlythezattribute.Oneclustercorrespondstothepointsrepresentedbycircles,whiletheother
consistsofthediamond,triangle,andsquarepoints.Thesepointsalsoformdistinctclustersinthexzplane.InFigure8.11(d) ,therearethreeclusterswhenweconsiderboththeyandzcoordinates.Oneoftheseclustersconsistsofthecircles;anotherconsistsofthepointsmarkedbysquares.Thediamondsandtrianglesformasingleclusterintheyzplane.
Thesefiguresillustrateacoupleofimportantfacts.First,asetofpoints—thecircles—maynotformaclusterintheentiredataspace,butmayformaclusterinasubspace.Second,clustersthatexistinthefulldataspace(orevenasubspace)showupasclustersinlower-dimensionalspaces.Thefirstfacttellsusthatweneedtolookinsubsetsofdimensionstofindclusters,whilethesecondfacttellsusthatmanyoftheclusterswefindinsubspacesarelikelytobe“shadows”(projections)ofhigher-dimensionalclusters.Thegoalistofindtheclustersandthedimensionsinwhichtheyexist,butwearetypicallynotinterestedinclustersthatareprojectionsofhigher-dimensionalclusters.
CLIQUECLIQUE(CLusteringInQUEst)isagrid-basedclusteringalgorithmthatmethodicallyfindssubspaceclusters.Itisimpracticaltocheckeachsubspaceforclustersbecausethenumberofsuchsubspacesisexponentialinthenumberofdimensions.Instead,CLIQUEreliesonthefollowingproperty:
Monotonicitypropertyofdensity-basedclustersIfasetofpointsformsadensity-basedclusterinkdimensions(attributes),thenthesamesetofpointsisalsopartofadensity-basedclusterinallpossiblesubsetsofthosedimensions.
Consider a set of adjacent, k-dimensional cells that form a cluster; i.e., there is a collection of adjacent cells that have a density above the specified threshold ξ. A corresponding set of cells in k − 1 dimensions can be found by omitting one of the k dimensions (attributes). The lower-dimensional cells are still adjacent, and each low-dimensional cell contains all points of the corresponding high-dimensional cell. It can contain additional points as well. Thus, a low-dimensional cell has a density greater than or equal to that of its corresponding high-dimensional cell. Consequently, the low-dimensional cells form a cluster; i.e., the points form a cluster with the reduced set of attributes.
Algorithm 8.5 gives a simplified version of the steps involved in CLIQUE. Conceptually, the CLIQUE algorithm is similar to the Apriori algorithm for finding frequent itemsets. See Chapter 5.
Algorithm 8.5 CLIQUE.
1: Find all the dense areas in the one-dimensional spaces corresponding to each attribute. This is the set of dense one-dimensional cells.
2: k ← 2
3: repeat
4: Generate all candidate dense k-dimensional cells from dense (k − 1)-dimensional cells.
5: Eliminate cells that have fewer than ξ points.
6: k ← k + 1
7: until There are no candidate dense k-dimensional cells.
8: Find clusters by taking the union of all adjacent, high-density cells.
9: Summarize each cluster using a small set of inequalities that describe the attribute ranges of the cells in the cluster.
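The Apriori-style candidate generation in step 4 can be sketched as follows (Python; representing a dense cell as a frozenset of (dimension, interval) pairs and the function name generate_candidates are our own illustrative assumptions, not CLIQUE's actual data structures).

from itertools import combinations

def generate_candidates(dense_cells, k):
    # Combine dense (k-1)-dimensional cells into candidate k-dimensional cells.
    # Each cell is a frozenset of (dimension, interval_index) pairs. A candidate
    # must cover k distinct dimensions, and all of its (k-1)-dimensional subsets
    # must be dense, which is exactly the monotonicity property CLIQUE relies on.
    candidates = set()
    for a, b in combinations(dense_cells, 2):
        union = a | b
        dims = {d for d, _ in union}
        if len(union) != k or len(dims) != k:
            continue  # the two cells do not join into a valid k-dimensional cell
        if all(frozenset(sub) in dense_cells for sub in combinations(union, k - 1)):
            candidates.add(frozenset(union))
    return candidates

# Dense one-dimensional cells: intervals 3 and 4 on dimension 0, interval 7 on dimension 1.
dense_1d = {frozenset({(0, 3)}), frozenset({(0, 4)}), frozenset({(1, 7)})}
print(generate_candidates(dense_1d, 2))   # two candidate two-dimensional cells

Each surviving candidate would then be checked against the data (step 5) before the next round of candidate generation.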
StrengthsandLimitationsofCLIQUEThemostusefulfeatureofCLIQUEisthatitprovidesanefficienttechniqueforsearchingsubspacesforclusters.Sincethisapproachisbasedonthewell-knownAprioriprinciplefromassociationanalysis,itspropertiesarewellunderstood.AnotherusefulfeatureisCLIQUE’sabilitytosummarizethelistofcellsthatcomprisesaclusterwithasmallsetofinequalities.
ManylimitationsofCLIQUEareidenticaltothepreviouslydiscussedlimitationsofothergrid-baseddensityschemes.OtherlimitationsaresimilartothoseoftheApriorialgorithm.Specifically,justasfrequentitemsetscanshareitems,theclustersfoundbyCLIQUEcanshareobjects.Allowingclusterstooverlapcangreatlyincreasethenumberofclustersandmakeinterpretationdifficult.AnotherissueisthatApriori—likeCLIQUE—potentiallyhasexponentialtimecomplexity.Inparticular,CLIQUEwillhavedifficultyiftoomanydensecellsaregeneratedatlowervaluesofk.Raisingthedensitythresholdξcanalleviatethisproblem.StillanotherpotentiallimitationofCLIQUEisexploredinExercise20 onpage702.
8.3.3DENCLUE:AKernel-BasedSchemeforDensity-BasedClustering
DENCLUE(DENsityCLUstEring)isadensity-basedclusteringapproachthatmodelstheoveralldensityofasetofpointsasthesumofinfluencefunctionsassociatedwitheachpoint.Theresultingoveralldensityfunctionwillhavelocalpeaks,i.e.,localdensitymaxima,andtheselocalpeakscanbeusedtodefineclustersinanaturalway.Specifically,foreachdatapoint,ahill-climbingprocedurefindsthenearestpeakassociatedwiththatpoint,andthesetofall
datapointsassociatedwithaparticularpeak(calledalocaldensityattractor)becomesacluster.However,ifthedensityatalocalpeakistoolow,thenthepointsintheassociatedclusterareclassifiedasnoiseanddiscarded.Also,ifalocalpeakcanbeconnectedtoasecondlocalpeakbyapathofdatapoints,andthedensityateachpointonthepathisabovetheminimumdensitythreshold,thentheclustersassociatedwiththeselocalpeaksaremerged.Therefore,clustersofanyshapecanbediscovered.
Example8.10(DENCLUEDensity).WeillustratetheseconceptswithFigure8.13 ,whichshowsapossibledensityfunctionforaone-dimensionaldataset.PointsA–Earethepeaksofthisdensityfunctionandrepresentlocaldensityattractors.Thedottedverticallinesdelineatelocalregionsofinfluenceforthelocaldensityattractors.Pointsintheseregionswillbecomecenter-definedclusters.Thedashedhorizontallineshowsadensitythreshold,ξ.Allpointsassociatedwithalocaldensityattractorthathasadensitylessthanξ,suchasthoseassociatedwithC,willbediscarded.Allotherclustersarekept.Notethatthiscanincludepointswhosedensityislessthanξ,aslongastheyareassociatedwithlocaldensityattractorswhosedensityisgreaterthanξ.Finally,clustersthatareconnectedbyapathofpointswithadensityaboveξarecombined.ClustersAandBwouldremainseparate,whileclustersDandEwouldbecombined.
Figure8.13.IllustrationofDENCLUEdensityconceptsinonedimension.
Thehigh-leveldetailsoftheDENCLUEalgorithmaresummarizedinAlgorithm8.6 .Next,weexplorevariousaspectsofDENCLUEinmoredetail.First,weprovideabriefoverviewofkerneldensityestimationandthenpresentthegrid-basedapproachthatDENCLUEusesforapproximatingthedensity.
Algorithm 8.6 DENCLUE algorithm.
1: Derive a density function for the space occupied by the data points.
2: Identify the points that are local maxima. (These are the density attractors.)
3: Associate each point with a density attractor by moving in the direction of maximum increase in density.
4: Define clusters consisting of points associated with a particular density attractor.
5: Discard clusters whose density attractor has a density less than a user-specified threshold of ξ.
6: Combine clusters that are connected by a path of points that all have a density of ξ or higher.
Kernel Density Estimation DENCLUE is based on a well-developed area of statistics and pattern recognition that is known as kernel density estimation. The goal of this collection of techniques (and many other statistical techniques as well) is to describe the distribution of the data by a function. For kernel density estimation, the contribution of each point to the overall density function is expressed by an influence or kernel function. The overall density function is simply the sum of the influence functions associated with each point.
Typically, the influence or kernel function is symmetric (the same in all directions) and its value (contribution) decreases as the distance from the point increases. For example, for a particular point, x, the Gaussian function,
K(y) = e^{−distance(x, y)^2 / (2σ^2)},
is often used as a kernel function. σ is a parameter, analogous to standard deviation, which governs how quickly the influence of a point diminishes with distance. Figure 8.14(a) shows what a Gaussian density function would look like for a single point in two dimensions, while Figures 8.14(c) and 8.14(d) show the overall density function produced by applying the Gaussian influence function to the set of points shown in Figure 8.14(b).
Figure8.14.ExampleoftheGaussianinfluence(kernel)functionandanoveralldensityfunction.
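To make the density computation concrete, the following minimal NumPy sketch (not from the text; the bandwidth, step size, and the numerical hill-climbing procedure are illustrative assumptions) sums Gaussian influence functions over a small data set and moves a point uphill toward its density attractor:

```python
import numpy as np

def gaussian_density(x, data, sigma=1.0):
    """Overall density at point x: sum of Gaussian influence (kernel)
    functions centered at each data point."""
    sq_dists = np.sum((data - x) ** 2, axis=1)
    return np.sum(np.exp(-sq_dists / (2 * sigma ** 2)))

def hill_climb(x, data, sigma=1.0, step=0.1, iters=100):
    """Move x uphill on the density surface using a numerical gradient,
    approximating the search for the point's density attractor."""
    x = x.astype(float).copy()
    for _ in range(iters):
        grad = np.zeros_like(x)
        for d in range(len(x)):
            e = np.zeros_like(x)
            e[d] = 1e-4
            grad[d] = (gaussian_density(x + e, data, sigma)
                       - gaussian_density(x - e, data, sigma)) / 2e-4
        norm = np.linalg.norm(grad)
        if norm < 1e-8:
            break
        x += step * grad / norm
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # two small Gaussian blobs in two dimensions
    data = np.vstack([rng.normal(0, 0.3, (50, 2)),
                      rng.normal(3, 0.3, (50, 2))])
    attractor = hill_climb(np.array([0.5, 0.5]), data, sigma=0.5)
    print("density attractor near:", attractor.round(2))
```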
Implementation Issues
Computation of kernel density can be quite expensive, and DENCLUE uses a number of approximations to implement its basic approach efficiently. First, it explicitly computes density only at data points. However, this still would result in an O(m²) time complexity because the density at each point is a function of the density contributed by every point. To reduce the time complexity, DENCLUE uses a grid-based implementation to efficiently define neighborhoods and thus limit the number of points that need to be considered to define the density at a point. First, a preprocessing step creates a set of grid cells. Only occupied cells are created, and these cells and their related information can be efficiently accessed via a search tree. Then, when computing the density of a point and finding its nearest density attractor, DENCLUE considers only the points in the neighborhood; i.e., points in the same cell and in cells that are connected to the point's cell. While this approach can sacrifice some accuracy with respect to density estimation, computational complexity is greatly reduced.
StrengthsandLimitationsofDENCLUEDENCLUEhasasolidtheoreticalfoundationbecauseitisbasedontheconceptofkerneldensityestimation,whichisawell-developedareaofstatistics.Forthisreason,DENCLUEprovidesamoreflexibleandpotentiallymoreaccuratewaytocomputedensitythanothergrid-basedclusteringtechniquesandDBSCAN.(DBSCANisaspecialcaseofDENCLUE.)Anapproachbasedonkerneldensityfunctionsisinherentlycomputationallyexpensive,butDENCLUEemploysgrid-basedtechniquestoaddresssuchissues.Nonetheless,DENCLUEcanbemorecomputationallyexpensivethanotherdensity-basedclusteringtechniques.Also,theuseofagridcanadverselyaffecttheaccuracyofthedensityestimation,anditmakesDENCLUEsusceptibletoproblemscommontogrid-basedapproaches;e.g.,thedifficultyofchoosingthepropergridsize.Moregenerally,DENCLUEsharesmanyofthestrengthsandlimitationsofotherdensity-basedapproaches.Forinstance,DENCLUEisgoodathandlingnoiseandoutliersanditcanfindclustersofdifferentshapesandsize,butithastroublewithhigh-dimensionaldataanddatathatcontainsclustersofwidelydifferentdensities.
8.4 Graph-Based Clustering

Section 7.3 discussed a number of clustering techniques that took a graph-based view of data, in which data objects are represented by nodes and the proximity between two data objects is represented by the weight of the edge between the corresponding nodes. This section considers some additional graph-based clustering algorithms that use a number of key properties and characteristics of graphs. The following are some key approaches, different subsets of which are employed by these algorithms.
1. Sparsifytheproximitygraphtokeeponlytheconnectionsofanobjectwithitsnearestneighbors.Thissparsificationisusefulforhandlingnoiseandoutliers.Italsoallowstheuseofhighlyefficientgraphpartitioningalgorithmsthathavebeendevelopedforsparsegraphs.
2. Defineasimilaritymeasurebetweentwoobjectsbasedonthenumberofnearestneighborsthattheyshare.Thisapproach,whichisbasedontheobservationthatanobjectanditsnearestneighborsusuallybelongtothesameclass,isusefulforovercomingproblemswithhighdimensionalityandclustersofvaryingdensity.
3. Definecoreobjectsandbuildclustersaroundthem.Todothisforgraph-basedclustering,itisnecessarytointroduceanotionofdensity-basedonaproximitygraphorasparsifiedproximitygraph.AswithDBSCAN,buildingclustersaroundcoreobjectsleadstoaclusteringtechniquethatcanfindclustersofdifferingshapesandsizes.
4. Usetheinformationintheproximitygraphtoprovideamoresophisticatedevaluationofwhethertwoclustersshouldbemerged.Specifically,twoclustersaremergedonlyiftheresultingclusterwillhavecharacteristicssimilartotheoriginaltwoclusters.
Webeginbydiscussingthesparsificationofproximitygraphs,providingthreeexamplesoftechniqueswhoseapproachtoclusteringisbasedsolelyonthistechnique:MST,whichisequivalenttothesinglelinkclusteringalgorithm,Opossum,andspectralclustering.WethendiscussChameleon,ahierarchicalclusteringalgorithmthatusesanotionofself-similaritytodetermineifclustersshouldbemerged.WenextdefineSharedNearestNeighbor(SNN)similarity,anewsimilaritymeasure,andintroducetheJarvis-Patrickclusteringalgorithm,whichusesthissimilarity.Finally,wediscusshowtodefinedensityandcoreobjectsbasedonSNNsimilarityandintroduceanSNNdensity-basedclusteringalgorithm,whichcanbeviewedasDBSCANwithanewsimilaritymeasure.
8.4.1 Sparsification
The m by m proximity matrix for m data points can be represented as a dense graph in which each node is connected to all others and the weight of the edge between any pair of nodes reflects their pairwise proximity. Although every object has some level of similarity to every other object, for most data sets, objects are highly similar to a small number of objects and weakly similar to most other objects. This property can be used to sparsify the proximity graph (matrix) by setting many of these low-similarity (high-dissimilarity) values to 0 before beginning the actual clustering process. The sparsification may be performed, for example, by breaking all links that have a similarity (dissimilarity) below (above) a specified threshold, or by keeping only links to the k nearest neighbors of each point. This latter approach creates what is called a k-nearest neighbor graph.
Sparsificationhasseveralbeneficialeffects:
- Data size is reduced. The amount of data that needs to be processed to cluster the data is drastically reduced. Sparsification can often eliminate more than 99% of the entries in a proximity matrix. As a result, the size of problems that can be handled is increased.
- Clustering often works better. Sparsification techniques keep the connections to the nearest neighbors of an object while breaking the connections to more distant objects. This is in keeping with the nearest neighbor principle that the nearest neighbors of an object tend to belong to the same class (cluster) as the object itself. This reduces the impact of noise and outliers and sharpens the distinction between clusters.
- Graph partitioning algorithms can be used. There has been a considerable amount of work on heuristic algorithms for finding mincut partitionings of sparse graphs, especially in the areas of parallel computing and the design of integrated circuits. Sparsification of the proximity graph makes it possible to use graph partitioning algorithms for the clustering process. For example, Opossum and Chameleon use graph partitioning.
Sparsificationoftheproximitygraphshouldberegardedasaninitialstepbeforetheuseofactualclusteringalgorithms.Intheory,aperfectsparsificationcouldleavetheproximitymatrixsplitintoconnectedcomponentscorrespondingtothedesiredclusters,butinpractice,thisrarelyhappens.Itiseasyforasingleedgetolinktwoclustersorforasingleclustertobesplitintoseveraldisconnectedsubclusters.AsweshallseewhenwediscussJarvis-PatrickandSNNdensity-basedclustering,thesparseproximitygraphisoftenmodifiedtoyieldanewproximitygraph.Thisnewproximitygraphcanagainbesparsified.Clusteringalgorithmsworkwiththeproximitygraphthatistheresultofallthesepreprocessingsteps.ThisprocessissummarizedinFigure8.15 .
Figure8.15.Idealprocessofclusteringusingsparsification.
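A minimal sketch of k-nearest-neighbor sparsification of a similarity matrix might look as follows (an illustration only, not the implementation used by any of the algorithms above; the similarity transform in the usage example is an arbitrary choice):

```python
import numpy as np

def knn_sparsify(sim, k):
    """Keep, for each object, only the links to its k most similar
    neighbors; all other entries of the similarity matrix are set to 0.
    The result is a (generally asymmetric) k-nearest-neighbor graph."""
    m = sim.shape[0]
    sparse = np.zeros_like(sim)
    for i in range(m):
        order = np.argsort(sim[i])[::-1]              # most similar first
        neighbors = [j for j in order if j != i][:k]  # skip the point itself
        sparse[i, neighbors] = sim[i, neighbors]
    return sparse

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    points = rng.random((6, 2))
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    sim = 1.0 / (1.0 + dists)      # turn distances into a positive similarity
    print(knn_sparsify(sim, k=2).round(2))
```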
8.4.2 Minimum Spanning Tree (MST) Clustering
InSection7.3 ,wherewedescribedagglomerativehierarchicalclusteringtechniques,wementionedthatdivisivehierarchicalclusteringalgorithmsalsoexist.Wesawanexampleofonesuchtechnique,bisectingK-means,inSection7.2.3 .Anotherdivisivehierarchicaltechnique,MST,startswiththeminimumspanningtreeoftheproximitygraphandcanbeviewedasanapplicationofsparsificationforfindingclusters.Webrieflydescribethisalgorithm.Interestingly,thisalgorithmalsoproducesthesameclusteringassinglelinkagglomerativeclustering.SeeExercise13 onpage700.
Aminimumspanningtreeofagraphisasubgraphthat(1)hasnocycles,i.e.,isatree,(2)containsallthenodesofthegraph,and(3)hastheminimumtotaledgeweightofallpossiblespanningtrees.Theterminology,minimumspanningtree,assumesthatweareworkingonlywithdissimilaritiesordistances,andwewillfollowthisconvention.Thisisnotalimitation,however,sincewecanconvertsimilaritiestodissimilaritiesormodifythenotionofaminimumspanningtreetoworkwithsimilarities.Anexampleofaminimumspanningtreeforsometwo-dimensionalpointsisshowninFigure8.16 .
Figure8.16.Minimumspanningtreeforasetofsixtwo-dimensionalpoints.
TheMSTdivisivehierarchicalalgorithmisshowninAlgorithm8.7 .ThefirststepistofindtheMSToftheoriginaldissimilaritygraph.Notethataminimumspanningtreecanbeviewedasaspecialtypeofsparsifiedgraph.Step3canalsobeviewedasgraphsparsification.Hence,MSTcanbeviewedasaclusteringalgorithmbasedonthesparsificationofthedissimilaritygraph.
Algorithm 8.7 MST divisive hierarchical clustering algorithm.
1: Compute a minimum spanning tree for the dissimilarity graph.
2: repeat
3:   Create a new cluster by breaking the link corresponding to the largest dissimilarity.
4: until Only singleton clusters remain.
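A rough sketch of this divisive procedure, assuming SciPy is available and stopping once a desired number of clusters is reached rather than continuing to singletons:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import squareform, pdist

def mst_clusters(points, num_clusters):
    """Compute the MST of the full dissimilarity graph, then repeatedly
    break the remaining edge with the largest dissimilarity until the
    desired number of connected components (clusters) is reached."""
    dist = squareform(pdist(points))
    mst = minimum_spanning_tree(dist).toarray()   # remaining MST edge weights
    while True:
        n_comp, labels = connected_components(mst, directed=False)
        if n_comp >= num_clusters:
            return labels
        i, j = np.unravel_index(np.argmax(mst), mst.shape)
        mst[i, j] = 0.0                           # break the longest edge

if __name__ == "__main__":
    pts = np.array([[0, 0], [0, 1], [1, 0],
                    [10, 10], [10, 11], [11, 10]], dtype=float)
    print(mst_clusters(pts, 2))   # e.g., [0 0 0 1 1 1]
```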
8.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
OPOSSUMisaclusteringtechniqueforclusteringsparse,high-dimensionaldata,e.g.,documentormarketbasketdata.LikeMST,itperformsclusteringbasedonthesparsificationofaproximitygraph.However,OPOSSUMusestheMETISalgorithm,whichwasspecificallycreatedforpartitioningsparsegraphs.ThestepsofOPOSSUMaregiveninAlgorithm8.8 .
Thesimilaritymeasuresusedarethoseappropriateforsparse,high-dimensionaldata,suchastheextendedJaccardmeasureorthecosinemeasure.TheMETISgraphpartitioningprogrampartitionsasparsegraphintokdistinctcomponents,wherekisauser-specifiedparameter,inorderto(1)minimizetheweightoftheedges(thesimilarity)betweencomponentsand(2)fulfillabalanceconstraint.OPOSSUMusesoneofthefollowingtwobalanceconstraints:(1)thenumberofobjectsineachclustermustberoughlythesame,or(2)thesumoftheattributevaluesmustberoughlythesame.Thesecondconstraintisusefulwhen,forexample,theattributevaluesrepresentthecostofanitem.
Algorithm 8.8 OPOSSUM clustering algorithm.
1: Compute a sparsified similarity graph.
2: Partition the similarity graph into k distinct components (clusters) using METIS.
StrengthsandWeaknessesOPOSSUMissimpleandfast.Itpartitionsthedataintoroughlyequal-sizedclusters,which,dependingonthegoaloftheclustering,canbeviewedasanadvantageoradisadvantage.Becausetheyareconstrainedtobeofroughlyequalsize,clusterscanbebrokenorcombined.However,ifOPOSSUMisusedtogeneratealargenumberofclusters,thentheseclustersaretypicallyrelativelypurepiecesoflargerclusters.Indeed,OPOSSUMissimilartotheinitialstepoftheChameleonclusteringroutine,whichisdiscussednext.
8.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
Agglomerative hierarchical clustering techniques operate by merging the two most similar clusters, where the definition of cluster similarity depends on the particular algorithm. Some agglomerative algorithms, such as group average, base their notion of similarity on the strength of the connections between the two clusters (e.g., the pairwise similarity of points in the two clusters), while other techniques, such as the single link method, use the closeness of the clusters (e.g., the minimum distance between points in different clusters) to measure cluster similarity. Although there are two basic approaches, using only one of these two approaches can lead to mistakes in merging clusters. Consider Figure 8.17, which shows four clusters. If we use the closeness of clusters (as measured by the closest two points in different clusters) as our merging criterion, then we would merge the two circular clusters, (c) and (d), which almost touch, instead of the rectangular clusters, (a) and (b), which are separated by a small gap. However, intuitively, we should have merged the rectangular clusters, (a) and (b). Exercise 15 on page 700 asks for an example of a situation in which the strength of connections likewise leads to an unintuitive result.
Figure8.17.Situationinwhichclosenessisnottheappropriatemergingcriterion.©1999,IEEE
Anotherproblemisthatmostclusteringtechniqueshaveaglobal(static)modelofclusters.Forinstance,K-meansassumesthattheclusterswillbeglobular,whileDBSCANdefinesclustersbasedonasingledensitythreshold.Clusteringschemesthatusesuchaglobalmodelcannothandlecasesinwhichclustercharacteristics,suchassize,shape,anddensity,varywidelybetweenclusters.Asanexampleoftheimportanceofthelocal(dynamic)modelingofclusters,considerFigure8.18 .Ifweusetheclosenessofclusterstodeterminewhichpairofclustersshouldbemerged,aswouldbethecaseifweused,forexample,thesinglelinkclusteringalgorithm,thenwewouldmergeclusters(a)and(b).However,wehavenottakenintoaccountthecharacteristicsofeachindividualcluster.Specifically,wehaveignoredthedensityoftheindividualclusters.Forclusters(a)and(b),whicharerelativelydense,thedistancebetweenthetwoclustersissignificantlylargerthanthedistancebetweenapointanditsnearestneighborswithinthesamecluster.
Thisisnotthecaseforclusters(c)and(d),whicharerelativelysparse.Indeed,whenclusters(c)and(d)aremerged,theyyieldaclusterthatseemsmoresimilartotheoriginalclustersthantheclusterthatresultsfrommergingclusters(a)and(b).
Figure8.18.Illustrationofthenotionofrelativecloseness.©1999,IEEE
Chameleonisanagglomerativeclusteringalgorithmthataddressestheissuesoftheprevioustwoparagraphs.Itcombinesaninitialpartitioningofthedata,usinganefficientgraphpartitioningalgorithm,withanovelhierarchicalclusteringschemethatusesthenotionsofclosenessandinterconnectivity,togetherwiththelocalmodelingofclusters.Thekeyideaisthattwoclustersshouldbemergedonlyiftheresultingclusterissimilartothetwooriginalclusters.Self-similarityisdescribedfirst,andthentheremainingdetailsoftheChameleonalgorithmarepresented.
DecidingWhichClusterstoMergeTheagglomerativehierarchicalclusteringtechniquesconsideredinSection7.3 repeatedlycombinethetwoclosestclustersandareprincipallydistinguishedfromoneanotherbythewaytheydefineclusterproximity.Incontrast,Chameleonaimstomergethepairofclustersthatresultsinaclusterthatismostsimilartotheoriginalpairofclusters,asmeasuredbycloseness
andinterconnectivity.Becausethisapproachdependsonlyonthepairofclustersandnotonaglobalmodel,Chameleoncanhandledatathatcontainsclusterswithwidelydifferentcharacteristics.
Followingaremoredetailedexplanationsofthepropertiesofclosenessandinterconnectivity.Tounderstandtheseproperties,itisnecessarytotakeaproximitygraphviewpointandtoconsiderthenumberofthelinksandthestrengthofthoselinksamongpointswithinaclusterandacrossclusters.
Relative Closeness (RC) is the absolute closeness of two clusters normalized by the internal closeness of the clusters. Two clusters are combined only if the points in the resulting cluster are almost as close to each other as in each of the original clusters. Mathematically,

RC(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}{\frac{m_i}{m_i+m_j}\bar{S}_{EC}(C_i) + \frac{m_j}{m_i+m_j}\bar{S}_{EC}(C_j)},   (8.17)

where m_i and m_j are the sizes of clusters C_i and C_j, respectively; \bar{S}_{EC}(C_i, C_j) is the average weight of the edges (of the k-nearest neighbor graph) that connect clusters C_i and C_j; \bar{S}_{EC}(C_i) is the average weight of edges if we bisect cluster C_i; and \bar{S}_{EC}(C_j) is the average weight of edges if we bisect cluster C_j. (EC stands for edge cut.) Figure 8.18 illustrates the notion of relative closeness. As discussed previously, while clusters (a) and (b) are closer in absolute terms than clusters (c) and (d), this is not true if the nature of the clusters is taken into account.

Relative Interconnectivity (RI) is the absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters. Two clusters are combined if the points in the resulting cluster are almost as strongly connected as points in each of the original clusters. Mathematically,

RI(C_i, C_j) = \frac{EC(C_i, C_j)}{\frac{1}{2}\left(EC(C_i) + EC(C_j)\right)},   (8.18)

where EC(C_i, C_j) is the sum of the edges (of the k-nearest neighbor graph) that connect clusters C_i and C_j; EC(C_i) is the minimum sum of the cut edges if we bisect cluster C_i; and EC(C_j) is the minimum sum of the cut edges if we bisect cluster C_j. Figure 8.19 illustrates the notion of relative interconnectivity. The two circular clusters, (c) and (d), have more connections than the rectangular clusters, (a) and (b). However, merging (c) and (d) produces a cluster that has connectivity quite different from that of (c) and (d). In contrast, merging (a) and (b) produces a cluster with connectivity very similar to that of (a) and (b).

Figure 8.19. Illustration of the notion of relative interconnectedness. ©1999, IEEE

RI and RC can be combined in many different ways to yield an overall measure of self-similarity. One approach used in Chameleon is to merge the pair of clusters that maximizes RI(C_i, C_j) * RC(C_i, C_j)^α, where α is a user-specified parameter that is typically greater than 1.

Chameleon Algorithm
Chameleonconsistsofthreekeysteps:sparsification,graphpartitioning,andhierarchicalclustering.Algorithm8.9 andFigure8.20 describethesesteps.
Figure8.20.OverallprocessbywhichChameleonperformsclustering.©1999,IEEE
Algorithm 8.9 Chameleon algorithm.
1: Build a k-nearest neighbor graph.
2: Partition the graph using a multilevel graph partitioning algorithm.
3: repeat
4:   Merge the clusters that best preserve the cluster self-similarity with respect to relative interconnectivity and relative closeness.
5: until No more clusters can be merged.

Sparsification
The first step in Chameleon is to generate a k-nearest neighbor graph. Conceptually, such a graph is derived from the proximity graph, and it contains links only between a point and its k-nearest neighbors, i.e., the points to which it is closest. As mentioned, working with a sparsified proximity graph
insteadofthefullproximitygraphcansignificantlyreducetheeffectsofnoiseandoutliersandimprovecomputationalefficiency.
GraphPartitioningOnceasparsifiedgraphhasbeenobtained,anefficientmultilevelgraphpartitioningalgorithm,suchasMETIS(seeBibliographicNotes),canbeusedtopartitionthedataset.Chameleonstartswithanall-inclusivegraph(cluster)andthenbisectsthelargestcurrentsubgraph(cluster)untilnoclusterhasmorethan points,where isauser-specifiedparameter.Thisprocessresultsinalargenumberofroughlyequallysizedgroupsofwell-connectedvertices(highlysimilardatapoints).Thegoalistoensurethateachpartitioncontainsobjectsmostlyfromonetruecluster.
AgglomerativeHierarchicalClustering
Asdiscussedpreviously,Chameleonmergesclustersbasedonthenotionofself-similarity.Chameleoncanbeparameterizedtomergemorethanonepairofclustersinasinglestepandtostopbeforeallobjectshavebeenmergedintoasinglecluster.
Complexity
Assume that m is the number of data points and p is the number of partitions. Performing an agglomerative hierarchical clustering of the p partitions obtained from the graph partitioning requires time O(p² log p). (See Section 7.3.1.) The amount of time required for partitioning the graph is O(mp + m log m). The time complexity of graph sparsification depends on how much time it takes to build the k-nearest neighbor graph. For low-dimensional data, this takes O(m log m) time if a k-d tree or a similar type of data structure is used. Unfortunately, such data structures only work well for low-dimensional data sets, and thus, for high-dimensional data sets, the time complexity of the sparsification becomes O(m²). Because only the k-nearest neighbor list needs to be stored, the space complexity is O(km) plus the space required to store the data.
Example8.11.ChameleonwasappliedtotwodatasetsthatclusteringalgorithmssuchasK-meansandDBSCANhavedifficultyclustering.TheresultsofthisclusteringareshowninFigure8.21 .Theclustersareidentifiedbytheshadingofthepoints.InFigure8.21(a) ,thetwoclustersareirregularlyshapedandquiteclosetoeachother.Also,noiseispresent.InFigure8.21(b) ,thetwoclustersareconnectedbyabridge,andagain,noiseispresent.Nonetheless,Chameleonidentifieswhatmostpeoplewouldidentifyasthenaturalclusters.Chameleonhasspecificallybeenshowntobeveryeffectiveforclusteringspatialdata.Finally,noticethatChameleondoesnotdiscardnoisepoints,asdootherclusteringschemes,butinsteadassignsthemtotheclusters.
Figure 8.21. Chameleon applied to cluster a pair of two-dimensional sets of points. ©1999, IEEE
StrengthsandLimitationsChameleoncaneffectivelyclusterspatialdata,eventhoughnoiseandoutliersarepresentandtheclustersareofdifferentshapes,sizes,anddensity.Chameleonassumesthatthegroupsofobjectsproducedbythesparsificationandgraphpartitioningprocessaresubclusters;i.e.,thatmostofthepointsinapartitionbelongtothesametruecluster.Ifnot,thenagglomerativehierarchicalclusteringwillonlycompoundtheerrorsbecauseitcanneverseparateobjectsthathavebeenwronglyputtogether.(SeethediscussioninSection7.3.4 .)Thus,Chameleonhasproblemswhenthepartitioningprocessdoesnotproducesubclusters,asisoftenthecaseforhigh-dimensionaldata.
8.4.5 Spectral Clustering
Spectral clustering is an elegant graph partitioning approach that exploits properties of the similarity graph to determine the cluster partitions. Specifically, it examines the graph's spectrum, i.e., the eigenvalues and eigenvectors associated with the adjacency matrix of the graph, to identify the natural clusters of the data. To motivate the ideas behind this approach, consider the similarity graph shown in Figure 8.22 for a data set that contains 6 data points. The link weights in the graph are computed based on some similarity measure, with a threshold applied to remove links with low similarity values. The sparsification produces a graph with two connected components, {v1, v2, v3} and {v4, v5, v6}, which trivially represent the two clusters in the data.
Figure8.22.Exampleofasimilaritygraphwithtwoconnectedcomponentsalongwithitsweightedadjacencymatrix(W),graphLaplacianmatrix(L),andeigendecomposition.
The top right-hand panel of the figure also shows the weighted adjacency matrix of the graph, denoted as W, and a diagonal matrix, D, whose diagonal elements correspond to the sum of the weights of the links incident to each node in the graph, i.e.,

D_{ij} = \begin{cases} \Sigma_k W_{ik}, & \text{if } i = j; \\ 0, & \text{otherwise.} \end{cases}

Note that the rows and columns of the weighted adjacency matrix have been ordered in such a way that nodes belonging to the same connected component are next to each other. With this ordering, the matrix W has a block structure of the form

W = \begin{pmatrix} W_1 & 0 \\ 0 & W_2 \end{pmatrix},
in which the off-diagonal blocks are matrices of zero values since there are no links connecting a node from the first connected component to a node from the second connected component. Indeed, if the sparse graph contains k connected components, its weighted adjacency matrix can be re-ordered into the following block diagonal form:

W = \begin{pmatrix} W_1 & 0 & \cdots & 0 \\ 0 & W_2 & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & W_k \end{pmatrix}.   (8.19)

This example suggests the possibility of identifying the inherent clusters of a data set by examining the block structure of its weighted adjacency matrix.

Unfortunately, unless the clusters are well-separated, the adjacency matrices associated with most similarity graphs are not in block diagonal form. For example, consider the graph shown in Figure 8.23, in which there is a link between nodes v3 and v4 with a low similarity value. If we are interested in generating two clusters, we could break the weakest link, located between (v3, v4), to split the graph into two partitions. Because there is only one connected component in the graph, the block structure in W is harder to discern.
Figure8.23.Exampleofasimilaritygraphwithasingleconnectedcomponentalongwithitsweightedadjacencymatrix(W),graphLaplacianmatrix(L),andeigendecomposition.
Fortunately, there is a more objective way to create the cluster partitions by considering the graph spectrum. First, we need to compute the graph Laplacian matrix, which is formally defined as follows:

L = D − W.   (8.20)

The graph Laplacian matrices for the examples shown in Figures 8.22 and 8.23 are depicted in the bottom left panel of both diagrams. The matrix has several notable properties:

1. It is a symmetric matrix since both W and D are symmetric.
2. It is a positive semi-definite matrix, which means v^T L v ≥ 0 for any input vector v.
3. All eigenvalues of L must be non-negative. The eigenvalues and eigenvectors for the graphs shown in Figures 8.22 and 8.23 are denoted in the diagrams as Λ and V, respectively. Note that the eigenvalues of the graph Laplacian matrix are given by the diagonal elements of Λ.
4. The smallest eigenvalue of L is zero, with the corresponding eigenvector e, which is a vector of 1s. This is because

W\mathbf{e} = \begin{pmatrix} W_{11} & W_{12} & \cdots & W_{1n} \\ W_{21} & W_{22} & \cdots & W_{2n} \\ \cdots & \cdots & \cdots & \cdots \\ W_{n1} & W_{n2} & \cdots & W_{nn} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ \cdots \\ 1 \end{pmatrix} = \begin{pmatrix} \Sigma_j W_{1j} \\ \Sigma_j W_{2j} \\ \cdots \\ \Sigma_j W_{nj} \end{pmatrix}, \quad D\mathbf{e} = \begin{pmatrix} \Sigma_j W_{1j} & 0 & \cdots & 0 \\ 0 & \Sigma_j W_{2j} & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & \Sigma_j W_{nj} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ \cdots \\ 1 \end{pmatrix} = \begin{pmatrix} \Sigma_j W_{1j} \\ \Sigma_j W_{2j} \\ \cdots \\ \Sigma_j W_{nj} \end{pmatrix}.

Thus, We = De, which is equivalent to (D − W)e = 0. This can be simplified into the eigenvalue equation Le = 0e since L = D − W.
5. A graph with k connected components has an adjacency matrix W in block diagonal form as shown in Equation 8.19. Its graph Laplacian matrix also has a block diagonal form,

L = \begin{pmatrix} L_1 & 0 & \cdots & 0 \\ 0 & L_2 & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & L_k \end{pmatrix}.

In addition, its graph Laplacian matrix has k eigenvalues of zeros, with the corresponding eigenvectors

\begin{pmatrix} e_1 \\ 0 \\ \cdots \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ e_2 \\ \cdots \\ 0 \end{pmatrix}, \cdots, \begin{pmatrix} 0 \\ 0 \\ \cdots \\ e_k \end{pmatrix},

where the e_i's are vectors of 1's and the 0's are vectors of 0's. For example, the graph shown in Figure 8.22 contains two connected components, which is why its graph Laplacian matrix has two eigenvalues of zeros. More importantly, its first two eigenvectors (normalized to unit length),

\begin{matrix} v_1 \rightarrow \\ v_2 \rightarrow \\ v_3 \rightarrow \\ v_4 \rightarrow \\ v_5 \rightarrow \\ v_6 \rightarrow \end{matrix} \begin{pmatrix} 0.58 & 0 \\ 0.58 & 0 \\ 0.58 & 0 \\ 0 & -0.58 \\ 0 & -0.58 \\ 0 & -0.58 \end{pmatrix},

corresponding to the first two columns in V, provide information about the cluster membership of each node. A node that belongs to the first cluster has a positive value in its first eigenvector and a zero value in its second eigenvector, whereas a node that belongs to the second cluster has a zero value in the first eigenvector and a negative value in the second eigenvector.

The graph shown in Figure 8.23 has one eigenvalue of zero because it has only one connected component. Nevertheless, if we examine the first two eigenvectors of its graph Laplacian matrix,

\begin{matrix} v_1 \rightarrow \\ v_2 \rightarrow \\ v_3 \rightarrow \\ v_4 \rightarrow \\ v_5 \rightarrow \\ v_6 \rightarrow \end{matrix} \begin{pmatrix} 0.41 & -0.41 \\ 0.41 & -0.43 \\ 0.41 & -0.38 \\ 0.41 & 0.38 \\ 0.41 & 0.42 \\ 0.41 & \cdots \end{pmatrix},

the graph can be easily split into two clusters since the set of nodes {v1, v2, v3} has a negative value in the second eigenvector whereas {v4, v5, v6} has a
positivevalueinthesecondeigenvector.Inshort,theeigenvectorsofthegraphLaplacianmatrixcontaininformationthatcanbeusedtopartitionthegraphintoitsunderlyingcomponents.However,insteadofmanuallycheckingtheeigenvectors,itiscommonpracticetoapplyasimpleclusteringalgorithmsuchasK-meanstohelpextracttheclustersfromtheeigenvectors.AsummaryofthespectralclusteringalgorithmisgiveninAlgorithm8.10 .
Algorithm 8.10 Spectral clustering algorithm.
1: Create a sparsified similarity graph G.
2: Compute the graph Laplacian for G, L (see Equation (8.20)).
3: Create a matrix V from the first k eigenvectors of L.
4: Apply K-means clustering on V to obtain the k clusters.

Example 8.12. Consider the two-dimensional ring data shown in Figure 8.24(b), which contains 350 data points. The first 100 points belong to the inner ring while the remaining 250 points belong to the outer ring. A heat map showing the Euclidean distance between every pair of points is depicted in Figure 8.24(a). While the points in the inner ring are relatively close to each other, those located in the outer ring can be quite far from each other. As a result, standard clustering algorithms such as K-means perform poorly on the data. In contrast, applying spectral clustering on the sparsified similarity graph can produce the correct clustering results (see Figure 8.24(d)). Here, the similarity between points is calculated using the Gaussian radial basis function and the graph is sparsified by choosing the 10 nearest neighbors of each data point. The sparsification reduces the similarity between a data point located in the inner ring and a corresponding point in the outer ring, which enables spectral clustering to effectively partition the data set into two clusters.
Figure8.24.ApplicationofK-meansandspectralclusteringtoatwo-dimensionalringdata.
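The following is a minimal sketch of Algorithm 8.10 using NumPy and SciPy; the similarity function, its bandwidth, and the use of kmeans2 for the final K-means step are illustrative choices rather than the exact setup used to produce Figure 8.24:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(sim, k):
    """Cluster objects from a (sparsified) similarity matrix by following
    Algorithm 8.10: build the graph Laplacian L = D - W, take the
    eigenvectors of its k smallest eigenvalues, and run K-means on them."""
    W = sim.copy()
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    V = eigvecs[:, :k]                     # first k eigenvectors of L
    _, labels = kmeans2(V, k, minit="++")
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 0.2, (20, 2)),
                     rng.normal(3, 0.2, (20, 2))])
    d2 = np.sum((pts[:, None] - pts[None, :]) ** 2, axis=2)
    sim = np.exp(-d2 / 0.5)                # Gaussian radial basis similarity
    print(spectral_clustering(sim, k=2))
```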
Relationship between Spectral Clustering and Graph Partitioning
The objective of graph partitioning is to break the weak links in a graph until the desired number of cluster partitions is obtained. One way to assess the quality of the partitions is by summing up the weights of the links that were removed. The resulting measure is known as graph cut. Unfortunately, minimizing the graph cut of the partitions alone is insufficient as it tends to produce clusters with highly imbalanced sizes. For example, consider the graph shown in Figure 8.25. Suppose we are interested in partitioning the graph into two connected components. The graph cut measure prefers to break the link between v4 and v5 because it has the lowest weight.

Figure 8.25. Example to illustrate the limitation of using graph cut as evaluation measure for graph partitioning.

Unfortunately, such a split would create one cluster with a single isolated node and another cluster containing all the remaining nodes. To overcome this limitation, alternative measures have been proposed, including

Ratio cut(C_1, C_2, \cdots, C_k) = \frac{1}{2}\sum_{i=1}^{k}\frac{\Sigma_{p \in C_i, q \notin C_i} W_{pq}}{|C_i|},

where C_1, C_2, ..., C_k denote the cluster partitions. The numerator represents the sum of the weights of the broken links, i.e., the graph cut, while the denominator represents the size of each cluster partition. Such a measure can be used to ensure that the resulting clusters are more balanced in terms of their sizes. More importantly, it can be shown that minimizing the ratio cut for a graph is equivalent to finding a cluster membership matrix Y that minimizes the expression Tr[Y^T L Y], where Tr[·] denotes the trace of a matrix and L is the graph Laplacian, subject to the constraint Y^T Y = I. By relaxing the requirement that Y is a binary matrix, we can use the Lagrange multiplier method to solve the optimization problem:

ℒ = Tr[Y^T L Y] − λ(Tr[Y^T Y − I]),   \frac{\partial ℒ}{\partial Y} = LY − λY = 0 \Rightarrow LY = λY.

In other words, an approximate solution to the ratio cut minimization problem can be obtained by finding the eigenvectors of the graph Laplacian matrix, which is exactly the approach used by spectral clustering.

Strengths and Limitations
As shown in Example 8.12, the strength of spectral clustering lies in its ability to detect clusters of varying sizes and shapes. However, the clustering performance depends on how the similarity graph is created and sparsified. In particular, tuning the parameters of the similarity function (e.g., Gaussian radial basis function) to produce an appropriate sparse graph for spectral clustering can be quite a challenge. The time complexity of the algorithm depends on how fast the eigenvectors of the graph Laplacian matrix can be computed. Efficient eigensolvers for sparse matrices are available, e.g., those based on Krylov subspace methods, especially when the number of clusters chosen is small. The storage complexity is O(N²), though it can be significantly reduced using a sparse representation for the graph Laplacian matrix. In many ways, spectral clustering behaves similarly to the K-means
clusteringalgorithm.First,theybothrequiretheusertospecifythenumberofclustersasinputparameter.Bothmethodsarealsosusceptibletothepresenceofoutliers,whichtendtoformtheirownconnectedcomponents(clusters).Thus,preprocessingorpostprocessingmethodswillbeneededtohandleoutliersinthedata.
8.4.6 Shared Nearest Neighbor Similarity
Insomecases,clusteringtechniquesthatrelyonstandardapproachestosimilarityanddensitydonotproducethedesiredclusteringresults.Thissectionexaminesthereasonsforthisandintroducesanindirectapproachtosimilaritythatisbasedonthefollowingprinciple:
Iftwopointsaresimilartomanyofthesamepoints,thentheyaresimilartooneanother,evenifa
directmeasurementofsimilaritydoesnotindicatethis.
WemotivatethediscussionbyfirstexplainingtwoproblemsthatanSNNversionofsimilarityaddresses:lowsimilarityanddifferencesindensity.
ProblemswithTraditionalSimilarityinHigh-DimensionalDataInhigh-dimensionalspaces,itisnotunusualforsimilaritytobelow.Consider,forexample,asetofdocumentssuchasacollectionofnewspaperarticlesthatcomefromavarietyofsectionsofthenewspaper:Entertainment,Financial,Foreign,Metro,National,andSports.AsexplainedinChapter2 ,thesedocumentscanbeviewedasvectorsinahigh-dimensionalspace,
whereeachcomponentofthevector(attribute)recordsthenumberoftimesthateachwordinavocabularyoccursinadocument.Thecosinesimilaritymeasureisoftenusedtoassessthesimilaritybetweendocuments.Forthisexample,whichcomesfromacollectionofarticlesfromtheLosAngelesTimes,Table8.3 givestheaveragecosinesimilarityineachsectionandamongtheentiresetofdocuments.
Table8.3.Similarityamongdocumentsindifferentsectionsofanewspaper.
Section AverageCosineSimilarity
Entertainment 0.032
Financial 0.030
Foreign 0.030
Metro 0.021
National 0.027
Sports 0.036
AllSections 0.014
Thesimilarityofeachdocumenttoitsmostsimilardocument(thefirstnearestneighbor)isbetter,0.39onaverage.However,aconsequenceoflowsimilarityamongobjectsofthesameclassisthattheirnearestneighborisoftennotofthesameclass.InthecollectionofdocumentsfromwhichTable8.3 wasgenerated,about20%ofthedocumentshaveanearestneighborofadifferentclass.Ingeneral,ifdirectsimilarityislow,thenitbecomesanunreliableguideforclusteringobjects,especiallyforagglomerativehierarchicalclustering,wheretheclosestpointsareputtogetherandcannot
beseparatedafterward.Nonetheless,itisstillusuallythecasethatalargemajorityofthenearestneighborsofanobjectbelongtothesameclass;thisfactcanbeusedtodefineaproximitymeasurethatismoresuitableforclustering.
ProblemswithDifferencesinDensityAnotherproblemrelatestodifferencesindensitiesbetweenclusters.Figure8.26 showsapairoftwo-dimensionalclustersofpointswithdifferingdensity.Thelowerdensityoftherightmostclusterisreflectedinaloweraveragedistanceamongthepoints.Eventhoughthepointsinthelessdenseclusterformanequallyvalidcluster,typicalclusteringtechniqueswillhavemoredifficultyfindingsuchclusters.Also,normalmeasuresofcohesion,suchasSSE,willindicatethattheseclustersarelesscohesive.Toillustratewitharealexample,thestarsinagalaxyarenolessrealclustersofstellarobjectsthantheplanetsinasolarsystem,eventhoughtheplanetsinasolarsystemareconsiderablyclosertooneanotheronaverage,thanthestarsinagalaxy.
Figure8.26.Twocircularclustersof200uniformlydistributedpoints.
SNNSimilarityComputationInbothsituations,thekeyideaistotakethecontextofpointsintoaccountindefiningthesimilaritymeasure.ThisideacanbemadequantitativebyusingasharednearestneighbordefinitionofsimilarityinthemannerindicatedbyAlgorithm8.11 .Essentially,theSNNsimilarityisthenumberofsharedneighborsaslongasthetwoobjectsareoneachother’snearestneighborlists.Notethattheunderlyingproximitymeasurecanbeanymeaningfulsimilarityordissimilaritymeasure.
Algorithm 8.11 Computing shared nearest neighbor similarity.
1: Find the k-nearest neighbors of all points.
2: if two points, x and y, are not among the k-nearest neighbors of each other then
3:   similarity(x, y) ← 0
4: else
5:   similarity(x, y) ← number of shared neighbors
6: end if

The computation of SNN similarity is described by Algorithm 8.11 and graphically illustrated by Figure 8.27. Each of the two black points has eight nearest neighbors, including each other. Four of those nearest neighbors (the points in gray) are shared. Thus, the shared nearest neighbor similarity between the two points is 4.
Figure8.27.ComputationofSNNsimilaritybetweentwopoints.
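A direct, if inefficient, sketch of Algorithm 8.11 for a precomputed distance matrix (an illustration only; it uses an O(m²) double loop over all pairs):

```python
import numpy as np

def snn_similarity(dist, k):
    """Shared nearest neighbor similarity (Algorithm 8.11): for each pair of
    points that appear in each other's k-nearest-neighbor lists, count the
    number of neighbors they share; otherwise the similarity is 0."""
    m = dist.shape[0]
    # k nearest neighbors of each point (excluding the point itself)
    knn = [set(np.argsort(dist[i])[1:k + 1]) for i in range(m)]
    snn = np.zeros((m, m), dtype=int)
    for i in range(m):
        for j in range(i + 1, m):
            if j in knn[i] and i in knn[j]:
                shared = len(knn[i] & knn[j])
                snn[i, j] = snn[j, i] = shared
    return snn

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.random((10, 2))
    dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
    print(snn_similarity(dist, k=4))
```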
ThesimilaritygraphoftheSNNsimilaritiesamongobjectsiscalledtheSNNsimilaritygraph.BecausemanypairsofobjectswillhaveanSNNsimilarityof0,thisisaverysparsegraph.
SNNSimilarityversusDirectSimilaritySNNsimilarityisusefulbecauseitaddressessomeoftheproblemsthatoccurwithdirectsimilarity.First,sinceittakesintoaccountthecontextofanobjectbyusingthenumberofsharednearestneighbors,SNNsimilarityhandlesthesituationinwhichanobjecthappenstoberelativelyclosetoanotherobject,butbelongstoadifferentclass.Insuchcases,theobjectstypicallydonotsharemanynearneighborsandtheirSNNsimilarityislow.
SNNsimilarityalsoaddressesproblemswithclustersofvaryingdensity.Inalow-densityregion,theobjectsarefartherapartthanobjectsindenserregions.However,theSNNsimilarityofapairofpointsonlydependsonthenumberofnearestneighborstwoobjectsshare,nothowfartheseneighborsarefromeachobject.Thus,SNNsimilarityperformsanautomaticscalingwithrespecttothedensityofthepoints.
8.4.7 The Jarvis-Patrick Clustering Algorithm
Algorithm8.12 expressestheJarvis-Patrickclusteringalgorithmusingtheconceptsofthelastsection.TheJPclusteringalgorithmreplacestheproximitybetweentwopointswiththeSNNsimilarity,whichiscalculatedasdescribedinAlgorithm8.11 .AthresholdisthenusedtosparsifythismatrixofSNNsimilarities.Ingraphterms,anSNNsimilaritygraphiscreatedandsparsified.ClustersaresimplytheconnectedcomponentsoftheSNNgraph.
Algorithm 8.12 Jarvis-Patrick clustering algorithm.
1: Compute the SNN similarity graph.
2: Sparsify the SNN similarity graph by applying a similarity threshold.
3: Find the connected components (clusters) of the sparsified SNN similarity graph.

The storage requirements of the JP clustering algorithm are only O(km), because it is not necessary to store the entire similarity matrix, even initially. The basic time complexity of JP clustering is O(m²), since the creation of the k-nearest neighbor list can require the computation of O(m²) proximities. However, for certain types of data, such as low-dimensional Euclidean data, special techniques, e.g., a k-d tree, can be used to more efficiently find the k-nearest neighbors without computing the entire similarity matrix. This can reduce the time complexity from O(m²) to O(m log m).

Example 8.13 (JP Clustering of a Two-Dimensional Data Set). We applied JP clustering to the "fish" data set shown in Figure 8.28(a) to find the clusters shown in Figure 8.28(b). The size of the nearest neighbor list was 20, and two points were placed in the same cluster if they shared at least 10 points. The different clusters are shown by the different markers and different shading. The points whose marker is an "x" were classified as noise by Jarvis-Patrick. They are mostly in the transition regions between clusters of different density.
Figure8.28.Jarvis-Patrickclusteringofatwo-dimensionalpointset.
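A compact sketch of the complete JP procedure under the same illustrative conventions as the previous sketch (k and the shared-neighbor threshold are the user-supplied parameters; singleton components can be treated as noise):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import squareform, pdist

def jarvis_patrick(points, k=20, min_shared=10):
    """Jarvis-Patrick clustering (Algorithm 8.12): compute SNN similarity,
    keep only links with at least `min_shared` shared neighbors, and return
    the connected components of the sparsified SNN graph as cluster labels."""
    dist = squareform(pdist(points))
    m = len(points)
    knn = [set(np.argsort(dist[i])[1:k + 1]) for i in range(m)]
    adj = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            if j in knn[i] and i in knn[j] and len(knn[i] & knn[j]) >= min_shared:
                adj[i, j] = adj[j, i] = 1
    _, labels = connected_components(adj, directed=False)
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])
    print(jarvis_patrick(pts, k=10, min_shared=4))
```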
StrengthsandLimitationsBecauseJPclusteringisbasedonthenotionofSNNsimilarity,itisgoodatdealingwithnoiseandoutliersandcanhandleclustersofdifferentsizes,
shapes,anddensities.Thealgorithmworkswellforhigh-dimensionaldataandisparticularlygoodatfindingtightclustersofstronglyrelatedobjects.
However,JPclusteringdefinesaclusterasaconnectedcomponentintheSNNsimilaritygraph.Thus,whetherasetofobjectsissplitintotwoclustersorleftasonecandependonasinglelink.Hence,JPclusteringissomewhatbrittle;i.e.,itcansplittrueclustersorjoinclustersthatshouldbekeptseparate.
Another potential limitation is that not all objects are clustered. However, these objects can be added to existing clusters, and in some cases, there is no requirement for a complete clustering. JP clustering has a basic time complexity of O(m²), which is the time required to compute the nearest neighbor list for a set of objects in the general case. In certain cases, e.g., low-dimensional data, special techniques can be used to reduce the time complexity for finding nearest neighbors to O(m log m). Finally, as with other clustering algorithms, choosing the best values for the parameters can be challenging.

8.4.8 SNN Density

As discussed in the introduction to this chapter, traditional Euclidean density becomes meaningless in high dimensions. This is true whether we take a grid-based view, such as that used by CLIQUE, a center-based view, such as that used by DBSCAN, or a kernel-density estimation approach, such as that used by DENCLUE. It is possible to use the center-based definition of density with a similarity measure that works well for high dimensions, e.g., cosine or Jaccard, but as described in Section 8.4.6, such measures still have problems. However, because the SNN similarity measure reflects the local
configurationofthepointsinthedataspace,itisrelativelyinsensitivetovariationsindensityandthedimensionalityofthespace,andisapromisingcandidateforanewmeasureofdensity.
ThissectionexplainshowtodefineaconceptofSNNdensitybyusingSNNsimilarityandfollowingtheDBSCANapproachdescribedinSection7.4 .Forclarity,thedefinitionsofthatsectionarerepeated,withappropriatemodificationtoaccountforthefactthatweareusingSNNsimilarity.
Corepoints.Apointisacorepointifthenumberofpointswithinagivenneighborhoodaroundthepoint,asdeterminedbySNNsimilarityandasuppliedparameterEpsexceedsacertainthresholdMinPts,whichisalsoasuppliedparameter.
Borderpoints.Aborderpointisapointthatisnotacorepoint,i.e.,therearenotenoughpointsinitsneighborhoodforittobeacorepoint,butitfallswithintheneighborhoodofacorepoint.
Noisepoints.Anoisepointisanypointthatisneitheracorepointnoraborderpoint.
SNNdensitymeasuresthedegreetowhichapointissurroundedbysimilarpoints(withrespecttonearestneighbors).Thus,pointsinregionsofhighandlowdensitywilltypicallyhaverelativelyhighSNNdensity,whilepointsinregionswherethereisatransitionfromlowtohighdensity—pointsthatarebetweenclusters—willtendtohavelowSNNdensity.Suchanapproachiswell-suitedfordatasetsinwhichtherearewidevariationsindensity,butclustersoflowdensityarestillinteresting.
Example8.14(Core,Border,andNoisePoints).
TomaketheprecedingdiscussionofSNNdensitymoreconcrete,weprovideanexampleofhowSNNdensitycanbeusedtofindcorepointsandremovenoiseandoutliers.Thereare10,000pointsinthe2DpointdatasetshowninFigure8.29(a) .Figures8.29(b–d) distinguishbetweenthesepointsbasedontheirSNNdensity.Figure8.29(b) showsthepointswiththehighestSNNdensity,whileFigure8.29(c) showspointsofintermediateSNNdensity,andFigure8.29(d) showsfiguresofthelowestSNNdensity.Fromthesefigures,weseethatthepointsthathavehighdensity(i.e.,highconnectivityintheSNNgraph)arecandidatesforbeingrepresentativeorcorepointssincetheytendtobelocatedwellinsidethecluster,whilethepointsthathavelowconnectivityarecandidatesforbeingnoisepointsandoutliers,astheyaremostlyintheregionssurroundingtheclusters.
Figure8.29.SNNdensityoftwo-dimensionalpoints.
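Assuming an SNN similarity matrix such as the one computed in the earlier sketch, SNN density and core points can be obtained as follows (a minimal illustration of the definitions above, with Eps interpreted as a threshold on SNN similarity):

```python
import numpy as np

def snn_density(snn, eps):
    """SNN density of each point: the number of other points whose SNN
    similarity to it is at least eps."""
    snn = snn.copy()
    np.fill_diagonal(snn, 0)   # do not count a point as its own neighbor
    return np.sum(snn >= eps, axis=1)

def core_points(snn, eps, min_pts):
    """Indices of core points: points whose SNN density is at least MinPts."""
    return np.where(snn_density(snn, eps) >= min_pts)[0]

if __name__ == "__main__":
    snn = np.array([[0, 5, 4], [5, 0, 6], [4, 6, 0]])
    print(core_points(snn, eps=4, min_pts=2))   # -> [0 1 2]
```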
8.4.9 SNN Density-Based Clustering
TheSNNdensitydefinedabovecanbecombinedwiththeDBSCANalgorithmtocreateanewclusteringalgorithm.ThisalgorithmissimilartotheJPclusteringalgorithminthatitstartswiththeSNNsimilaritygraph.However,
insteadofusingathresholdtosparsifytheSNNsimilaritygraphandthentakingconnectedcomponentsasclusters,theSNNdensity-basedclusteringalgorithmsimplyappliesDBSCAN.
TheSNNDensity-basedClusteringAlgorithmThestepsoftheSNNdensity-basedclusteringalgorithmareshowninAlgorithm8.13 .
Algorithm 8.13 SNN density-based clustering algorithm.
1: Compute the SNN similarity graph.
2: Apply DBSCAN with user-specified parameters for Eps and MinPts.

The algorithm automatically determines the number of clusters in the data. Note that not all the points are clustered. The points that are discarded include noise and outliers, as well as points that are not strongly connected to a group of points. SNN density-based clustering finds clusters in which the points are strongly related to one another. Depending on the application, we might want to discard many of the points. For example, SNN density-based clustering is good for finding topics in groups of documents.

Example 8.15 (SNN Density-based Clustering of Time Series). The SNN density-based clustering algorithm presented in this section is more flexible than Jarvis-Patrick clustering or DBSCAN. Unlike DBSCAN, it
canbeusedforhigh-dimensionaldataandsituationsinwhichtheclustershavedifferentdensities.UnlikeJarvis-Patrick,whichperformsasimplethresholdingandthentakestheconnectedcomponentsasclusters,SNNdensity-basedclusteringusesalessbrittleapproachthatreliesontheconceptsofSNNdensityandcorepoints.
To demonstrate the capabilities of SNN density-based clustering on high-dimensional data, we applied it to monthly time series data of atmospheric pressure at various points on the Earth. More specifically, the data consists of the average monthly sea-level pressure (SLP) for a period of 41 years at each point on a 2.5° longitude-latitude grid. The SNN density-based clustering algorithm found the clusters (gray regions) indicated in Figure 8.30. Note that these are clusters of time series of length 492 months, even though they are visualized as two-dimensional regions. The white areas are regions in which the pressure was not as uniform. The clusters near the poles are elongated because of the distortion of mapping a spherical surface to a rectangle.
Figure8.30.ClustersofpressuretimeseriesfoundusingSNNdensity-basedclustering.
UsingSLP,Earthscientistshavedefinedtimeseries,calledclimateindices,whichareusefulforcapturingthebehaviorofphenomenainvolvingtheEarth’sclimate.Forexample,anomaliesinclimateindicesarerelatedtoabnormallyloworhighprecipitationortemperatureinvariouspartsoftheworld.SomeoftheclustersfoundbySNNdensity-basedclusteringhaveastrongconnectiontosomeoftheclimateindicesknowntoEarthscientists.
Figure8.31 showstheSNNdensitystructureofthedatafromwhichtheclusterswereextracted.Thedensityhasbeennormalizedtobeonascale
between0and1.Thedensityofatimeseriesmayseemlikeanunusualconcept,butitmeasuresthedegreetowhichthetimeseriesanditsnearestneighborshavethesamenearestneighbors.Becauseeachtimeseriesisassociatedwithalocation,itispossibletoplotthesedensitiesonatwo-dimensionalplot.Becauseoftemporalautocorrelation,thesedensitiesformmeaningfulpatterns,e.g.,itispossibletovisuallyidentifytheclustersofFigure8.31 .
Figure8.31.SNNdensityofpressuretimeseries.
StrengthsandLimitations
ThestrengthsandlimitationsofSNNdensity-basedclusteringaresimilartothoseofJPclustering.However,theuseofcorepointsandSNNdensityaddsconsiderablepowerandflexibilitytothisapproach.
8.5 Scalable Clustering Algorithms

Even the best clustering algorithm is of little value if it takes an unacceptably long time to execute or requires too much memory. This section examines clustering techniques that place significant emphasis on scalability to the very large data sets that are becoming increasingly common. We start by discussing some general strategies for scalability, including approaches for reducing the number of proximity calculations, sampling the data, partitioning the data, and clustering a summarized representation of the data. We then discuss two specific examples of scalable clustering algorithms: CURE and BIRCH.
8.5.1 Scalability: General Issues and Approaches

The amount of storage required for many clustering algorithms is more than linear; e.g., with hierarchical clustering, memory requirements are usually O(m²), where m is the number of objects. For 10,000,000 objects, for example, the amount of memory required is proportional to 10^14, a number still well beyond the capacities of current systems. Note that because of the requirement for random data access, many clustering algorithms cannot easily be modified to efficiently use secondary storage (disk), for which random data access is slow. Likewise, the amount of computation required for some clustering algorithms is more than linear. In the remainder of this section, we discuss a variety of techniques for reducing the amount of computation and
storagerequiredbyaclusteringalgorithm.CUREandBIRCHusesomeofthesetechniques.
MultidimensionalorSpatialAccessMethodsManytechniques,suchasK-means,JarvisPatrickclustering,andDBSCAN,needtofindtheclosestcentroid,thenearestneighborsofapoint,orallpointswithinaspecifieddistance.Itispossibletousespecialtechniquescalledmultidimensionalorspatialaccessmethodstomoreefficientlyperformthesetasks,atleastforlow-dimensionaldata.Thesetechniques,suchasthek-dtreeorR*-tree,typicallyproduceahierarchicalpartitionofthedataspacethatcanbeusedtoreducethetimerequiredtofindthenearestneighborsofapoint.Notethatgrid-basedclusteringschemesalsopartitionthedataspace.
BoundsonProximitiesAnotherapproachtoavoidingproximitycomputationsistouseboundsonproximities.Forinstance,whenusingEuclideandistance,itispossibletousethetriangleinequalitytoavoidmanydistancecalculations.Toillustrate,ateachstageoftraditionalK-means,itisnecessarytoevaluatewhetherapointshouldstayinitscurrentclusterorbemovedtoanewcluster.Ifweknowthedistancebetweenthecentroidsandthedistanceofapointtothe(newlyupdated)centroidoftheclustertowhichitcurrentlybelongs,thenwemightbeabletousethetriangleinequalitytoavoidcomputingthedistanceofthepointtoanyoftheothercentroids.SeeExercise21 onpage702.
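A small sketch of this idea for one K-means assignment pass (an illustration only; it uses the standard bound that another centroid c_j cannot be closer than the current centroid c when d(x, c) ≤ d(c, c_j)/2):

```python
import numpy as np

def assign_with_bounds(points, centroids, labels):
    """One K-means assignment pass that uses the triangle inequality to skip
    distance computations: if the distance from a point to its current
    centroid is at most half the distance between that centroid and another
    centroid, the other centroid cannot be closer."""
    cc = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    skipped = 0
    new_labels = labels.copy()
    for i, x in enumerate(points):
        c = labels[i]
        d_cur = np.linalg.norm(x - centroids[c])
        best, best_d = c, d_cur
        for j in range(len(centroids)):
            if j == c:
                continue
            if d_cur <= cc[c, j] / 2:   # triangle inequality: j cannot win
                skipped += 1
                continue
            d = np.linalg.norm(x - centroids[j])
            if d < best_d:
                best, best_d = j, d
        new_labels[i] = best
    return new_labels, skipped

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.random((200, 2))
    cents = np.array([[0.2, 0.2], [0.8, 0.8], [0.2, 0.8]])
    init = np.argmin(np.linalg.norm(pts[:, None] - cents[None, :], axis=2), axis=1)
    labels, skipped = assign_with_bounds(pts, cents, init)
    print("distance computations skipped:", skipped)
```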
Sampling
Another approach to reducing the time complexity is to sample. In this approach, a sample of points is taken, these points are clustered, and then the remaining points are assigned to the existing clusters, typically to the closest cluster. If the number of points sampled is √m, then the time complexity of an O(m²) algorithm is reduced to O(m). A key problem with sampling, though, is that small clusters can be lost. When we discuss CURE,
wewillprovideatechniqueforinvestigatinghowfrequentlysuchproblemsoccur.
PartitioningtheDataObjectsAnothercommonapproachtoreducingtimecomplexityistousesomeefficienttechniquetopartitionthedataintodisjointsetsandthenclusterthesesetsseparately.Thefinalsetofclusterseitheristheunionoftheseseparatesetsofclustersorisobtainedbycombiningand/orrefiningtheseparatesetsofclusters.WeonlydiscussbisectingK-means(Section7.2.3 )inthissection,althoughmanyotherapproachesbasedonpartitioningarepossible.Onesuchapproachwillbedescribed,whenwedescribeCURElateroninthissection.
IfK-meansisusedtofindKclusters,thenthedistanceofeachpointtoeachclustercentroidiscalculatedateachiteration.WhenKislarge,thiscanbeveryexpensive.BisectingK-meansstartswiththeentiresetofpointsandusesK-meanstorepeatedlybisectanexistingclusteruntilwehaveobtainedKclusters.Ateachstep,thedistanceofpointstotwoclustercentroidsiscomputed.Exceptforthefirststep,inwhichtheclusterbeingbisectedconsistsofallthepoints,weonlycomputethedistanceofasubsetofpointstothetwocentroidsbeingconsidered.Becauseofthisfact,bisectingK-meanscanrunsignificantlyfasterthanregularK-means.
SummarizationAnotherapproachtoclusteringistosummarizethedata,typicallyinasinglepass,andthenclusterthesummarizeddata.Inparticular,theleaderalgorithm(seeExercise12 onpage605)eitherputsadataobjectintheclosestcluster(ifthatclusterissufficientlyclose)orstartsanewclusterthatcontainsthecurrentobject.Thisalgorithmislinearinthenumberofobjectsandcanbeusedtosummarizethedatasothatotherclusteringtechniquescanbeused.TheBIRCHalgorithmusesasimilarconcept.
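A minimal sketch of the leader algorithm described above (the distance threshold is the only parameter; the names and the Euclidean distance are illustrative choices):

```python
import numpy as np

def leader_clustering(points, threshold):
    """Single-pass leader algorithm: assign each point to the nearest
    existing leader if it is within `threshold`, otherwise start a new
    cluster with the point as its leader."""
    leaders, labels = [], []
    for x in points:
        if leaders:
            d = np.linalg.norm(np.array(leaders) - x, axis=1)
            j = int(np.argmin(d))
            if d[j] <= threshold:
                labels.append(j)
                continue
        leaders.append(x)
        labels.append(len(leaders) - 1)
    return np.array(labels), np.array(leaders)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(5, 0.2, (20, 2))])
    labels, leaders = leader_clustering(pts, threshold=1.0)
    print(len(leaders), "clusters")
```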
ParallelandDistributedComputationIfitisnotpossibletotakeadvantageofthetechniquesdescribedearlier,oriftheseapproachesdonotyieldthedesiredaccuracyorreductionincomputationtime,thenotherapproachesareneeded.Ahighlyeffectiveapproachistodistributethecomputationamongmultipleprocessors.
8.5.2 BIRCH
BIRCH(BalancedIterativeReducingandClusteringusingHierarchies)isahighlyefficientclusteringtechniquefordatainEuclideanvectorspaces,i.e.,dataforwhichaveragesmakesense.BIRCHcanefficientlyclustersuchdatawithonepassandcanimprovethatclusteringwithadditionalpasses.BIRCHcanalsodealeffectivelywithoutliers.
BIRCHisbasedonthenotionofaClusteringFeature(CF)andaCFtree.Theideaisthataclusterofdatapoints(vectors)canberepresentedbyatripleofnumbers(N,LS,SS),whereNisthenumberofpointsinthecluster,LSisthelinearsumofthepoints,andSSisthesumofsquaresofthepoints.Thesearecommonstatisticalquantitiesthatcanbeupdatedincrementallyandthatcanbeusedtocomputeanumberofimportantquantities,suchasthecentroidofaclusteranditsvariance(standarddeviation).Thevarianceisusedasameasureofthediameterofacluster.
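A minimal sketch of a clustering feature with incremental updates; the radius formula follows directly from N, LS, and SS (illustrative code, not BIRCH's actual data structures):

```python
import numpy as np

class ClusteringFeature:
    """BIRCH-style clustering feature (N, LS, SS): point count, linear sum,
    and sum of squares, all of which can be updated incrementally."""
    def __init__(self, dim):
        self.n = 0
        self.ls = np.zeros(dim)
        self.ss = 0.0

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += float(np.dot(x, x))

    def merge(self, other):
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # root-mean-square distance of the points to the centroid,
        # derived from N, LS, and SS alone
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(np.dot(c, c)), 0.0))

if __name__ == "__main__":
    cf = ClusteringFeature(dim=2)
    for x in np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]):
        cf.add(x)
    print(cf.centroid(), round(cf.radius(), 3))
```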
These quantities can also be used to compute the distance between clusters. The simplest approach is to calculate an L1 (city block) or L2 (Euclidean) distance between centroids. We can also use the diameter (variance) of the merged cluster as a distance. A number of different distance measures for clusters are defined by BIRCH, but all can be computed using the summary statistics.

A CF tree is a height-balanced tree. Each interior node has entries of the form [CF_i, child_i], where child_i is a pointer to the ith child node. The space that each entry takes and the page size determine the number of entries in an interior node. The space of each entry is, in turn, determined by the number of attributes of each point.

Leaf nodes consist of a sequence of clustering features, CF_i, where each clustering feature represents a number of points that have been previously scanned. Leaf nodes are subject to the restriction that each leaf node must have a diameter that is less than a parameterized threshold, T. The space that each entry takes, together with the page size, determines the number of entries in a leaf.
ByadjustingthethresholdparameterT,theheightofthetreecanbecontrolled.Tcontrolsthefinenessoftheclustering,i.e.,theextenttowhichthedataintheoriginalsetofdataisreduced.ThegoalistokeeptheCFtreeinmainmemorybyadjustingtheTparameterasnecessary.
ACFtreeisbuiltasthedataisscanned.Aseachdatapointisencountered,theCFtreeistraversed,startingfromtherootandchoosingtheclosestnodeateachlevel.Whentheclosestleafclusterforthecurrentdatapointisfinallyidentified,atestisperformedtoseeifaddingthedataitemtothecandidateclusterwillresultinanewclusterwithadiametergreaterthanthegiventhreshold,T.Ifnot,thenthedatapointisaddedtothecandidateclusterbyupdatingtheCFinformation.Theclusterinformationforallnodesfromtheleaftotherootisalsoupdated.
IfthenewclusterhasadiametergreaterthanT,thenanewentryiscreatediftheleafnodeisnotfull.Otherwisetheleafnodemustbesplit.Thetwoentries(clusters)thatarefarthestapartareselectedasseedsandtheremainingentriesaredistributedtooneofthetwonewleafnodes,basedonwhichleaf
nodecontainstheclosestseedcluster.Oncetheleafnodehasbeensplit,theparentnodeisupdatedandsplitifnecessary;i.e.,iftheparentnodeisfull.Thisprocessmaycontinueallthewaytotherootnode.
BIRCHfollowseachsplitwithamergestep.Attheinteriornodewherethesplitstops,thetwoclosestentriesarefound.Iftheseentriesdonotcorrespondtothetwoentriesthatjustresultedfromthesplit,thenanattemptismadetomergetheseentriesandtheircorrespondingchildnodes.Thisstepisintendedtoincreasespaceutilizationandavoidproblemswithskeweddatainputorder.
BIRCHalsohasaprocedureforremovingoutliers.Whenthetreeneedstoberebuiltbecauseithasrunoutofmemory,thenoutlierscanoptionallybewrittentodisk.(Anoutlierisdefinedtobeanodethathasfarfewerdatapointsthanaverage.)Atcertainpointsintheprocess,outliersarescannedtoseeiftheycanbeabsorbedbackintothetreewithoutcausingthetreetogrowinsize.Ifso,theyarereabsorbed.Ifnot,theyaredeleted.
BIRCHconsistsofanumberofphasesbeyondtheinitialcreationoftheCFtree.AllthephasesofBIRCHaredescribedbrieflyinAlgorithm8.14 .
Algorithm 8.14 BIRCH.
1: Load the data into memory by creating a CF tree that summarizes the data.
2: Build a smaller CF tree if it is necessary for phase 3. T is increased, and then the leaf node entries (clusters) are reinserted. Since T has increased, some clusters will be merged.
3: Perform global clustering. Different forms of global clustering (clustering that uses the pairwise distances between all the clusters) can be used. However, an agglomerative, hierarchical technique was selected. Because the clustering features store summary information that is important to certain kinds of clustering, the global clustering algorithm can be applied as if it were being applied to all the points in a cluster represented by the CF.
4: Redistribute the data points using the centroids of clusters discovered in step 3, and thus, discover a new set of clusters. This overcomes certain problems that can occur in the first phase of BIRCH. Because of page size constraints and the T parameter, points that should be in one cluster are sometimes split, and points that should be in different clusters are sometimes combined. Also, if the data set contains duplicate points, these points can sometimes be clustered differently, depending on the order in which they are encountered. By repeating this phase multiple times, the process converges to a locally optimal solution.

8.5.3 CURE

CURE (Clustering Using REpresentatives) is a clustering algorithm that uses a variety of different techniques to create an approach that can handle large data sets, outliers, and clusters with non-spherical shapes and non-uniform sizes. CURE represents a cluster by using multiple representative points from the cluster. These points will, in theory, capture the geometry and shape of the cluster. The first representative point is chosen to be the point farthest from the center of the cluster, while the remaining points are chosen so that they are farthest from all the previously chosen points. In this way, the representative points are naturally relatively well distributed. The number of points chosen is a parameter, but it was found that a value of 10 or more worked well.
Once the representative points are chosen, they are shrunk toward the center by a factor, α. This helps moderate the effect of outliers, which are usually farther away from the center and thus, are shrunk more. For example, a representative point that was a distance of 10 units from the center would move by 3 units (for α = 0.7), while a representative point at a distance of 1 unit would only move 0.3 units.

CURE uses an agglomerative hierarchical scheme to perform the actual clustering. The distance between two clusters is the minimum distance between any two representative points (after they are shrunk toward their respective centers). While this scheme is not exactly like any other hierarchical scheme that we have seen, it is equivalent to centroid-based hierarchical clustering if α = 0, and roughly the same as single link hierarchical clustering if α = 1. Notice that while a hierarchical clustering scheme is used, the goal of CURE is to find a given number of clusters as specified by the user.

CURE takes advantage of certain characteristics of the hierarchical clustering process to eliminate outliers at two different points in the clustering process. First, if a cluster is growing slowly, then it may consist of outliers, since by definition, outliers are far from others and will not be merged with other points very often. In CURE, this first phase of outlier elimination typically occurs when the number of clusters is 1/3 the original number of points. The second phase of outlier elimination occurs when the number of clusters is on the order of K, the number of desired clusters. At this point, small clusters are again eliminated.
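A sketch of representative-point selection and shrinking, interpreting the shrinking step so that a point keeps a fraction α of its distance to the centroid, consistent with the numeric example above (illustrative only, not CURE's actual implementation):

```python
import numpy as np

def cure_representatives(cluster_points, num_rep=10, alpha=0.7):
    """Pick well-scattered representative points for a cluster (farthest from
    the centroid first, then farthest from all previously chosen points) and
    shrink them toward the centroid. The shrunken point keeps a fraction
    `alpha` of its distance to the centroid."""
    center = cluster_points.mean(axis=0)
    d = np.linalg.norm(cluster_points - center, axis=1)
    reps = [cluster_points[np.argmax(d)]]          # farthest point from center
    while len(reps) < min(num_rep, len(cluster_points)):
        d_min = np.min(
            np.linalg.norm(cluster_points[:, None] - np.array(reps)[None, :], axis=2),
            axis=1)
        reps.append(cluster_points[np.argmax(d_min)])
    reps = np.array(reps)
    return center + alpha * (reps - center)        # shrink toward the center

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cluster = rng.normal(0, 1, (100, 2))
    print(cure_representatives(cluster, num_rep=5, alpha=0.7).round(2))
```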
Because the worst-case complexity of CURE is O(m² log m), it cannot be applied directly to large data sets. For this reason, CURE uses two techniques to speed up the clustering process. The first technique takes a random sample and performs hierarchical clustering on the sampled data points. This is followed by a final pass that assigns each remaining point in the data set to one of the clusters by choosing the cluster with the closest representative point. We discuss CURE's sampling approach in more detail later.

In some cases, the sample required for clustering is still too large and a second additional technique is required. In this situation, CURE partitions the sample data and then clusters the points in each partition. This preclustering step is then followed by a clustering of the intermediate clusters and a final pass that assigns each point in the data set to one of the clusters. CURE's partitioning scheme is also discussed in more detail later.

Algorithm 8.15 summarizes CURE. Note that K is the desired number of clusters, m is the number of points, p is the number of partitions, and q is the desired reduction of points in a partition, i.e., the number of clusters in a partition is m/(pq). Therefore, the total number of clusters is m/q. For example, if m = 10,000, p = 10, and q = 100, then each partition contains 10,000/10 = 1000 points, and there would be 1000/100 = 10 clusters in each partition and 10,000/100 = 100 clusters overall.

Algorithm 8.15 CURE.
1: Draw a random sample from the data set. The CURE paper is notable for explicitly deriving a formula for what the size of this sample should be in order to guarantee, with high probability, that all clusters are represented by a minimum number of points.
2: Partition the sample into p equal-sized partitions.
3: Cluster the points in each partition into m/(pq) clusters using CURE's hierarchical clustering algorithm to obtain a total of m/q clusters. Note that some outlier elimination occurs during this process.
4: Use CURE's hierarchical clustering algorithm to cluster the m/q clusters found in the previous step until only K clusters remain.
5: Eliminate outliers. This is the second phase of outlier elimination.
6: Assign all remaining data points to the nearest cluster to obtain a complete clustering.

Sampling in CURE
A key issue in using sampling is whether the sample is representative, that is, whether it captures the characteristics of interest. For clustering, the issue is whether we can find the same clusters in the sample as in the entire set of objects. Ideally, we would like the sample to contain some objects for each cluster and for there to be a separate cluster in the sample for those objects that belong to separate clusters in the entire data set.

A more concrete and attainable goal is to guarantee (with a high probability) that we have at least some points from each cluster. The number of points required for such a sample varies from one data set to another and depends on the number of objects and the sizes of the clusters. The creators of CURE derived a bound for the sample size that would be needed to ensure (with high probability) that we obtain at least a certain number of points from a cluster. Using the notation of this book, this bound is given by the following theorem.
Theorem8.1.Letfbeafraction, .Forcluster ofsize ,wewillobtainatleast objectsfromcluster withaprobabilityof
,ifoursamplesizesisgivenbythefollowing:
wheremisthenumberofobjects.
Whilethisexpressionmightlookintimidating,itisreasonablyeasytouse.Supposethatthereare100,000objectsandthatthegoalistohavean80%chanceofobtaining10%oftheobjectsincluster ,whichhasasizeof1000.Inthiscase, ,andthus .Ifthegoalisa5%sampleof ,whichis50objects,thenasamplesizeof6440willsuffice.
Again,CUREusessamplinginthefollowingway.Firstasampleisdrawn,andthenCUREisusedtoclusterthissample.Afterclustershavebeenfound,eachunclusteredpointisassignedtotheclosestcluster.
PartitioningWhensamplingisnotenough,CUREalsousesapartitioningapproach.Theideaistodividethepointsintopgroupsofsizem/pandtouseCUREtoclustereachpartitioninordertoreducethenumberofobjectsbyafactorof
,whereqcanberoughlythoughtofastheaveragesizeofaclusterina
0≤f≤1 Ci mif*mi Ci
1−δ,0≤δ≤1
s=fm+mmi*log1δ+mmilog12δ+2*f*mi*log1δ. (8.21)
Cif=0.1,δ=0.2,m=100,000,mi=1000 s=11,962
Ci
q>1
partition.Overall, clustersareproduced.(NotethatsinceCURErepresentseachclusterbyanumberofrepresentativepoints,thereductioninthenumberofobjectsisnotq.)Thispreclusteringstepisthenfollowedbyafinalclusteringofthem/qintermediateclusterstoproducethedesirednumberofclusters(K).BothclusteringpassesuseCURE’shierarchicalclusteringalgorithmandarefollowedbyafinalpassthatassignseachpointinthedatasettooneoftheclusters.
Thekeyissueishowpandqshouldbechosen.AlgorithmssuchasCUREhaveatimecomplexityof orhigher,andfurthermore,requirethatallthedatabeinmainmemory.Wethereforewanttochoosepsmallenoughsothatanentirepartitioncanbeprocessedinmainmemoryandina‘reasonable’amountoftime.Atthecurrenttime,atypicaldesktopcomputercanperformahierarchicalclusteringofafewthousandobjectsinafewseconds.
Anotherfactorforchoosingp,andalsoq,concernsthequalityoftheclustering.Specifically,theobjectiveistochoosethevaluesofpandqsuchthatobjectsfromthesameunderlyingclusterendupinthesameclusterseventually.Toillustrate,supposethereare1000objectsandaclusterofsize100.Ifwerandomlygenerate100partitions,theneachpartitionwill,onaverage,haveonlyonepointfromourcluster.Thesepointswilllikelybeputinclusterswithpointsfromotherclustersorwillbediscardedasoutliers.Ifwegenerateonly10partitionsof100objects,butqis50,thenthe10pointsfromeachcluster(onaverage)willlikelystillbecombinedwithpointsfromotherclusters,becausethereareonly(onaverage)10pointsperclusterandweneedtoproduce,foreachpartition,twoclusters.Toavoidthislastproblem,whichconcernstheproperchoiceofq,asuggestedstrategyisnottocombineclustersiftheyaretoodissimilar.
mq
O(m2)
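To make Equation 8.21 concrete, the following minimal Python sketch evaluates the bound and reproduces the worked example above. It is not code from the CURE paper; the function name and the use of natural logarithms are our own assumptions.

```python
import math

def cure_sample_size(m, m_i, f, delta):
    """Sample size bound of Equation 8.21.

    m     : total number of objects
    m_i   : size of the cluster that should be represented
    f     : desired fraction of that cluster in the sample
    delta : allowed failure probability (guarantee holds with prob. 1 - delta)
    """
    log_term = math.log(1.0 / delta)
    return (f * m
            + (m / m_i) * log_term
            + (m / m_i) * math.sqrt(log_term ** 2 + 2 * f * m_i * log_term))

# Worked example: 80% chance (delta = 0.2) of obtaining 10% of a cluster
# of size 1000 when there are 100,000 objects in total.
print(round(cure_sample_size(100_000, 1000, 0.10, 0.2)))  # about 11962
print(round(cure_sample_size(100_000, 1000, 0.05, 0.2)))  # about 6440
```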
8.6WhichClusteringAlgorithm?Avarietyoffactorsneedtobeconsideredwhendecidingwhichtypeofclusteringtechniquetouse.Many,ifnotall,ofthesefactorshavebeendiscussedtosomeextentinthecurrentandpreviouschapters.Ourgoalinthissectionistosuccinctlysummarizethesefactorsinawaythatshedssomelightonwhichclusteringalgorithmmightbeappropriateforaparticularclusteringtask.
TypeofClusteringForaclusteringalgorithmtobeappropriateforatask,thetypeofclusteringproducedbythealgorithmneedstomatchthetypeofclusteringneededbytheapplication.Forsomeapplications,suchascreatingabiologicaltaxonomy,ahierarchyispreferred.Inthecaseofclusteringforsummarization,apartitionalclusteringistypical.Inyetotherapplications,bothcanproveuseful.
Mostclusteringapplicationsrequireaclusteringofall(oralmostall)oftheobjects.Forinstance,ifclusteringisusedtoorganizeasetofdocumentsforbrowsing,thenwewouldlikemostdocumentstobelongtoagroup.However,ifwewantedtofindthestrongestthemesinasetofdocuments,thenwemightprefertohaveaclusteringschemethatproducesonlyverycohesiveclusters,evenifmanydocumentswereleftunclustered.
Finally,mostapplicationsofclusteringassumethateachobjectisassignedtoonecluster(oroneclusteronalevelforhierarchicalschemes).Aswehaveseen,however,probabilisticandfuzzyschemesprovideweightsthatindicatethedegreeorprobabilityofmembershipinvariousclusters.Othertechniques,suchasDBSCANandSNNdensity-basedclustering,havethenotionofcore
points,whichstronglybelongtoonecluster.Suchconceptsmaybeusefulincertainapplications.
TypeofClusterAnotherkeyaspectiswhetherthetypeofclustermatchestheintendedapplication.Therearethreecommonlyencounteredtypesofclusters:prototype-,graph-,anddensity-based.Prototype-basedclusteringschemes,aswellassomegraph-basedclusteringschemes—completelink,centroid,andWard’s—tendtoproduceglobularclustersinwhicheachobjectisclosetothecluster’sprototypeand/ortotheotherobjectsinthecluster.If,forexample,wewanttosummarizethedatatoreduceitssizeandwewanttodosowiththeminimumamountoferror,thenoneofthesetypesoftechniqueswouldbemostappropriate.Incontrast,density-basedclusteringtechniques,aswellassomegraph-basedclusteringtechniques,suchassinglelink,tendtoproduceclustersthatarenotglobularandthuscontainmanyobjectsthatarenotverysimilartooneanother.Ifclusteringisusedtosegmentageographicalareaintocontiguousregionsbasedonthetypeoflandcover,thenoneofthesetechniquesismoresuitablethanaprototype-basedschemesuchasK-means.
CharacteristicsofClustersBesidesthegeneraltypeofcluster,otherclustercharacteristicsareimportant.Ifwewanttofindclustersinsubspacesoftheoriginaldataspace,thenwemustchooseanalgorithmsuchasCLIQUE,whichexplicitlylooksforsuchclusters.Similarly,ifweareinterestedinenforcingspatialrelationshipsbetweenclusters,thenSOMorsomerelatedapproachwouldbeappropriate.Also,clusteringalgorithmsdifferwidelyintheirabilitytohandleclustersofvaryingshapes,sizes,anddensities.
CharacteristicsoftheDataSetsandAttributesAsdiscussedintheintroduction,thetypeofdatasetandattributescandictatethetypeofalgorithmtouse.Forinstance,theK-meansalgorithmcanonlybeusedondataforwhichanappropriateproximitymeasureisavailablethatallows
meaningfulcomputationofaclustercentroid.Forotherclusteringtechniques,suchasmanyagglomerativehierarchicalapproaches,theunderlyingnatureofthedatasetsandattributesislessimportantaslongasaproximitymatrixcanbecreated.
NoiseandOutliersNoiseandoutliersareparticularlyimportantaspectsofthedata.Wehavetriedtoindicatetheeffectofnoiseandoutliersonthevariousclusteringalgorithmsthatwehavediscussed.Inpractice,however,itcanbedifficulttoevaluatetheamountofnoiseinthedatasetorthenumberofoutliers.Morethanthat,whatisnoiseoranoutliertoonepersonmightbeinterestingtoanotherperson.Forexample,ifweareusingclusteringtosegmentanareaintoregionsofdifferentpopulationdensity,wedonotwanttouseadensity-basedclusteringtechnique,suchasDBSCAN,thatassumesthatregionsorpointswithdensitylowerthanaglobalthresholdarenoiseoroutliers.Asanotherexample,hierarchicalclusteringschemes,suchasCURE,oftendiscardclustersofpointsthataregrowingslowlyassuchgroupstendtorepresentoutliers.However,insomeapplicationswearemostinterestedinrelativelysmallclusters;e.g.,inmarketsegmentation,suchgroupsmightrepresentthemostprofitablecustomers.
NumberofDataObjectsWehaveconsideredhowclusteringisaffectedbythenumberofdataobjectsinconsiderabledetailinprevioussections.Wereiterate,however,thatthisfactoroftenplaysanimportantroleindeterminingthetypeofclusteringalgorithmtobeused.Supposethatwewanttocreateahierarchicalclusteringofasetofdata,wearenotinterestedinacompletehierarchythatextendsallthewaytoindividualobjects,butonlytothepointatwhichwehavesplitthedataintoafewhundredclusters.Ifthedataisverylarge,wecannotdirectlyapplyanagglomerativehierarchicalclusteringtechnique.Wecould,however,useadivisiveclusteringtechnique,suchastheminimumspanningtree(MST)algorithm,whichisthedivisiveanalogtosinglelink,butthiswouldonlyworkifthedatasetisnottoolarge.BisectingK-
meanswouldalsoworkformanydatasets,butifthedatasetislargeenoughthatitcannotbecontainedcompletelyinmemory,thenthisschemealsorunsintoproblems.Inthissituation,atechniquesuchasBIRCH,whichdoesnotrequirethatalldatabeinmainmemory,becomesmoreuseful.
NumberofAttributesWehavealsodiscussedtheimpactofdimensionalityatsomelength.Again,thekeypointistorealizethatanalgorithmthatworkswellinlowormoderatedimensionsmaynotworkwellinhighdimensions.Asinmanyothercasesinwhichaclusteringalgorithmisinappropriatelyapplied,theclusteringalgorithmwillrunandproduceclusters,buttheclusterswilllikelynotrepresentthetruestructureofthedata.
ClusterDescriptionOneaspectofclusteringtechniquesthatisoftenoverlookedishowtheresultingclustersaredescribed.Prototypeclustersaresuccinctlydescribedbyasmallsetofclusterprototypes.Inthecaseofmixturemodels,theclustersaredescribedintermsofsmallsetsofparameters,suchasthemeanvectorandthecovariancematrix.Thisisalsoaverycompactandunderstandablerepresentation.ForSOM,itistypicallypossibletovisualizetherelationshipsbetweenclustersinatwo-dimensionalplot,suchasthatofFigure8.8 .Forgraph-anddensity-basedclusteringapproaches,however,clustersaretypicallydescribedassetsofclustermembers.Nonetheless,inCURE,clusterscanbedescribedbya(relatively)smallsetofrepresentativepoints.Also,forgrid-basedclusteringschemes,suchasCLIQUE,morecompactdescriptionscanbegeneratedintermsofconditionsontheattributevaluesthatdescribethegridcellsinthecluster.
AlgorithmicConsiderationsTherearealsoimportantaspectsofalgorithmsthatneedtobeconsidered.Isthealgorithmnon-deterministicororder-dependent?Doesthealgorithmautomaticallydeterminethenumberofclusters?Isthereatechniquefordeterminingthevaluesofvariousparameters?Manyclusteringalgorithmstrytosolvetheclusteringproblemby
tryingtooptimizeanobjectivefunction.Istheobjectiveagoodmatchfortheapplicationobjective?Ifnot,thenevenifthealgorithmdoesagoodjoboffindingaclusteringthatisoptimalorclosetooptimalwithrespecttotheobjectivefunction,theresultisnotmeaningful.Also,mostobjectivefunctionsgivepreferencetolargerclustersattheexpenseofsmallerclusters.
SummaryThetaskofchoosingtheproperclusteringalgorithminvolvesconsideringalloftheseissues,anddomain-specificissuesaswell.Thereisnoformulafordeterminingthepropertechnique.Nonetheless,ageneralknowledgeofthetypesofclusteringtechniquesthatareavailableandconsiderationoftheissuesmentionedabove,togetherwithafocusontheintendedapplication,shouldallowadataanalysttomakeaninformeddecisiononwhichclusteringapproach(orapproaches)totry.
8.7BibliographicNotesAnextensivediscussionoffuzzyclustering,includingadescriptionoffuzzyc-meansandformalderivationsoftheformulaspresentedinSection8.2.1 ,canbefoundinthebookonfuzzyclusteranalysisbyHöppneretal.[595].Whilenotdiscussedinthischapter,AutoClassbyCheesemanetal.[573]isoneoftheearliestandmostprominentmixture-modelclusteringprograms.AnintroductiontomixturemodelscanbefoundinthetutorialofBilmes[568],thebookbyMitchell[606](whichalsodescribeshowtheK-meansalgorithmcanbederivedfromamixturemodelapproach),andthearticlebyFraleyandRaftery[581].Mixturemodelisanexampleofaprobabilisticclusteringmethod,inwhichtheclustersarerepresentedashiddenvariablesinthemodel.MoresophisticatedprobabilisticclusteringmethodssuchaslatentDirichletallocation(LDA)[570]havebeendevelopedinrecentyearsfordomainssuchastextclustering.
Besidesdataexploration,SOManditssupervisedlearningvariant,LearningVectorQuantization(LVQ),havebeenusedformanytasks:imagesegmentation,organizationofdocumentfiles,andspeechprocessing.OurdiscussionofSOMwascastintheterminologyofprototype-basedclustering.ThebookonSOMbyKohonenetal.[601]containsanextensiveintroductiontoSOMthatemphasizesitsneuralnetworkorigins,aswellasadiscussionofsomeofitsvariationsandapplications.OneimportantSOM-relatedclusteringdevelopmentistheGenerativeTopographicMap(GTM)algorithmbyBishopetal.[569],whichusestheEMalgorithmtofindGaussianmodelssatisfyingtwo-dimensionaltopographicconstraints.
ThedescriptionofChameleoncanbefoundinthepaperbyKarypisetal.[599].Capabilitiessimilar,althoughnotidenticaltothoseofChameleonhave
beenimplementedintheCLUTOclusteringpackagebyKarypis[575].TheMETISgraphpartitioningpackagebyKarypisandKumar[600]isusedtoperformgraphpartitioninginbothprograms,aswellasintheOPOSSUMclusteringalgorithmbyStrehlandGhosh[616].AdetaileddiscussiononspectralclusteringcanbefoundinthetutorialbyvonLuxburg[618].ThespectralclusteringmethoddescribedinthischapterisbasedonanunnormalizedgraphLaplacianmatrixandtheratiocutmeasure[590].AlternativeformulationsofspectralclusteringhavebeendevelopedusingnormalizedgraphLaplacianmatricesforotherevaluationmeasures[613].
ThenotionofSNNsimilaritywasintroducedbyJarvisandPatrick[596].AhierarchicalclusteringschemebasedonasimilarconceptofmutualnearestneighborswasproposedbyGowdaandKrishna[586].Guhaetal.[589]createdROCK,ahierarchicalgraph-basedclusteringalgorithmforclusteringtransactiondata,whichamongotherinterestingfeatures,alsousesanotionofsimilaritybasedonsharedneighborsthatcloselyresemblestheSNNsimilaritydevelopedbyJarvisandPatrick.AdescriptionoftheSNNdensity-basedclusteringtechniquecanbefoundinthepublicationsofErtözetal.[578,579].SNNdensity-basedclusteringwasusedbySteinbachetal.[614]tofindclimateindices.
Examplesofgrid-basedclusteringalgorithmsareOptiGrid(HinneburgandKeim[594]),theBANGclusteringsystem(SchikutaandErhart[611]),andWaveCluster(Sheikholeslamietal.[612]).TheCLIQUEalgorithmisdescribedinthepaperbyAgrawaletal.[564].MAFIA(Nageshetal.[608])isamodificationofCLIQUEwhosegoalisimprovedefficiency.Kailingetal.[598]havedevelopedSUBCLU(density-connectedSUBspaceCLUstering),asubspaceclusteringalgorithmbasedonDBSCAN.TheDENCLUEalgorithmwasproposedbyHinneburgandKeim[593].
OurdiscussionofscalabilitywasstronglyinfluencedbythearticleofGhosh[584].Awide-rangingdiscussionofspecifictechniquesforclusteringmassivedatasetscanbefoundinthepaperbyMurtagh[607].CUREisworkbyGuhaetal.[588],whiledetailsofBIRCHareinthepaperbyZhangetal.[620].CLARANS(NgandHan[609])isanalgorithmforscalingK-medoidclusteringtolargerdatabases.AdiscussionofscalingEMandK-meansclusteringtolargedatasetsisprovidedbyBradleyetal.[571,572].AparallelimplementationofK-meansontheMapReduceframeworkhasalsobeendeveloped[621].InadditiontoK-means,otherclusteringalgorithmsthathavebeenimplementedontheMapReduceframeworkincludeDBScan[592],spectralclustering[574],andhierarchicalclustering[617].
Inadditiontotheapproachesdescribedinthischapter,therearemanyotherclusteringmethodsproposedintheliterature.Oneclassofmethodsthathasbecomeincreasinglypopularinrecentyearsisbasedonnon-negativematrixfactorization(NMF)[602].Theideaisanextensionofthesingularvaluedecomposition(SVD)approachdescribedinChapter2 ,inwhichthedatamatrixisdecomposedintoaproductoflower-rankmatricesthatrepresenttheunderlyingcomponentsorclustersinthedata.InNMF,additionalconstraintsareimposedtoensurenon-negativityintheelementsofthecomponentmatrices.Withdifferentformulationsandconstraints,theNMFmethodcanbeshowntobeequivalenttootherclusteringapproaches,includingK-meansandspectralclustering[577,603].Anotherpopularclassofmethodsutilizestheconstraintsprovidedbyuserstoguidetheclusteringalgorithm.Suchalgorithmsarecommonlyknownasconstrainedclusteringorsemi-supervisedclustering[566,567,576,619].
Therearemanyaspectsofclusteringthatwehavenotcovered.AdditionalpointersaregiveninthebooksandsurveysmentionedintheBibliographicNotesofthepreviouschapter.Here,wementionfourareas—omitting,unfortunately,manymore.Clusteringoftransactiondata(Gantietal.[582],
Gibsonetal.[585],Hanetal.[591],andPetersandZaki[610])isanimportantarea,astransactiondataiscommonandofcommercialimportance.Streamingdataisalsobecomingincreasinglycommonandimportantascommunicationsandsensornetworksbecomepervasive.TwointroductionstoclusteringfordatastreamsaregiveninarticlesbyBarbará[565]andGuhaetal.[587].Conceptualclustering(FisherandLangley[580],Jonyeretal.[597],Mishraetal.[605],MichalskiandStepp[604],SteppandMichalski[615]),whichusesmorecomplicateddefinitionsofclustersthatoftencorrespondbettertohumannotionsofacluster,isanareaofclusteringwhosepotentialhasperhapsnotbeenfullyrealized.Finally,therehasbeenagreatdealofclusteringworkfordatacompressionintheareaofvectorquantization.ThebookbyGershoandGray[583]isastandardtextinthisarea.
Bibliography[564]R.Agrawal,J.Gehrke,D.Gunopulos,andP.Raghavan.Automatic
subspaceclusteringofhighdimensionaldatafordataminingapplications.InProc.of1998ACMSIGMODIntl.Conf.onManagementofData,pages94–105,Seattle,Washington,June1998.ACMPress.
[565]D.Barbará.Requirementsforclusteringdatastreams.SIGKDDExplorationsNewsletter,3(2):23–27,2002.
[566]S.Basu,A.Banerjee,andR.Mooney.Semi-supervisedclusteringbyseeding.InProceedingsof19thInternationalConferenceonMachineLearning,pages19–26,2002.
[567]S.Basu,I.Davidson,andK.Wagstaff.ConstrainedClustering:AdvancesinAlgorithms,Theory,andApplications.CRCPress,2008.
[568]J.Bilmes.AGentleTutorialontheEMAlgorithmanditsApplicationtoParameterEstimationforGaussianMixtureandHiddenMarkovModels.TechnicalReportICSITR-97-021,UniversityofCaliforniaatBerkeley,1997.
[569]C.M.Bishop,M.Svensen,andC.K.I.Williams.GTM:Aprincipledalternativetotheself-organizingmap.InC.vonderMalsburg,W.vonSeelen,J.C.Vorbruggen,andB.Sendhoff,editors,ArtificialNeural
Networks—ICANN96.Intl.Conf,Proc.,pages165–170.Springer-Verlag,Berlin,Germany,1996.
[570]D.M.Blei,A.Y.Ng,andM.I.Jordan.LatentDirichletAllocation.JournalofMachineLearningResearch,3(4-5):993–1022,2003.
[571]P.S.Bradley,U.M.Fayyad,andC.Reina.ScalingClusteringAlgorithmstoLargeDatabases.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages9–15,NewYorkCity,August1998.AAAIPress.
[572]P.S.Bradley,U.M.Fayyad,andC.Reina.ScalingEM(ExpectationMaximization)ClusteringtoLargeDatabases.TechnicalReportMSR-TR-98-35,MicrosoftResearch,October1999.
[573]P.Cheeseman,J.Kelly,M.Self,J.Stutz,W.Taylor,andD.Freeman.AutoClass:aBayesianclassificationsystem.InReadingsinknowledgeacquisitionandlearning:automatingtheconstructionandimprovementofexpertsystems,pages431–441.MorganKaufmannPublishersInc.,1993.
[574]W.Y.Chen,Y.Song,H.Bai,C.J.Lin,andE.Y.Chang.Parallelspectralclusteringindistributedsystems.IEEETransactionsonPatternAnalysisandMachineIntelligence,33(3):568586,2011.
[575]CLUTO2.1.2:SoftwareforClusteringHigh-DimensionalDatasets.www.cs.umn.edu/∼karypis,October2016.
[576]I.DavidsonandS.Basu.Asurveyofclusteringwithinstancelevelconstraints.ACMTransactionsonKnowledgeDiscoveryfromData,1:1–41,2007.
[577]C.Ding,X.He,andH.Simon.Ontheequivalenceofnonnegativematrixfactorizationandspectralclustering.InProcoftheSIAMInternationalConferenceonDataMining,page606-610,2005.
[578]L.Ertöz,M.Steinbach,andV.Kumar.ANewSharedNearestNeighborClusteringAlgorithmanditsApplications.InWorkshoponClusteringHighDimensionalDataanditsApplications,Proc.ofTextMine’01,FirstSIAMIntl.Conf.onDataMining,Chicago,IL,USA,2001.
[579]L.Ertöz,M.Steinbach,andV.Kumar.FindingClustersofDifferentSizes,Shapes,andDensitiesinNoisy,HighDimensionalData.InProc.ofthe2003SIAMIntl.Conf.onDataMining,SanFrancisco,May2003.SIAM.
[580]D.FisherandP.Langley.Conceptualclusteringanditsrelationtonumericaltaxonomy.ArtificialIntelligenceandStatistics,pages77–116,1986.
[581]C.FraleyandA.E.Raftery.HowManyClusters?WhichClusteringMethod?AnswersViaModel-BasedClusterAnalysis.TheComputerJournal,41(8):578–588,1998.
[582]V.Ganti,J.Gehrke,andR.Ramakrishnan.CACTUS–ClusteringCategoricalDataUsingSummaries.InProc.ofthe5thIntl.Conf.on
KnowledgeDiscoveryandDataMining,pages73–83.ACMPress,1999.
[583]A.GershoandR.M.Gray.VectorQuantizationandSignalCompression,volume159ofKluwerInternationalSeriesinEngineeringandComputerScience.KluwerAcademicPublishers,1992.
[584]J.Ghosh.ScalableClusteringMethodsforDataMining.InN.Ye,editor,HandbookofDataMining,pages247–277.LawrenceEalbaumAssoc,2003.
[585]D.Gibson,J.M.Kleinberg,andP.Raghavan.ClusteringCategoricalData:AnApproachBasedonDynamicalSystems.VLDBJournal,8(3–4):222–236,2000.
[586]K.C.GowdaandG.Krishna.AgglomerativeClusteringUsingtheConceptofMutualNearestNeighborhood.PatternRecognition,10(2):105–112,1978.
[587]S.Guha,A.Meyerson,N.Mishra,R.Motwani,andL.O’Callaghan.ClusteringDataStreams:TheoryandPractice.IEEETransactionsonKnowledgeandDataEngineering,15(3):515–528,May/June2003.
[588]S.Guha,R.Rastogi,andK.Shim.CURE:AnEfficientClusteringAlgorithmforLargeDatabases.InProc.of1998ACM-SIGMODIntl.Conf.onManagementofData,pages73–84.ACMPress,June1998.
[589]S.Guha,R.Rastogi,andK.Shim.ROCK:ARobustClusteringAlgorithmforCategoricalAttributes.InProc.ofthe15thIntl.Conf.onDataEngineering,pages512–521.IEEEComputerSociety,March1999.
[590]L.HagenandA.Kahng.Newspectralmethodsforratiocutpartitioningandclustering.IEEETrans.Computer-AidedDesign,11(9):10741085,1992.
[591]E.-H.Han,G.Karypis,V.Kumar,andB.Mobasher.HypergraphBasedClusteringinHigh-DimensionalDataSets:ASummaryofResults.IEEEDataEng.Bulletin,21(1):15–22,1998.
[592]Y.He,H.Tan,W.Luo,H.Mao,D.Ma,S.Feng,andJ.Fan.MR-DBSCAN:anefficientparalleldensity-basedclusteringalgorithmusingMapReduce.InProcoftheIEEEInternationalConferenceonParallelandDistributedSystems,pages473–480,2011.
[593]A.HinneburgandD.A.Keim.AnEfficientApproachtoClusteringinLargeMultimediaDatabaseswithNoise.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages58–65,NewYorkCity,August1998.AAAIPress.
[594]A.HinneburgandD.A.Keim.OptimalGrid-Clustering:TowardsBreakingtheCurseofDimensionalityinHigh-DimensionalClustering.InProc.ofthe25thVLDBConf.,pages506–517,Edinburgh,Scotland,UK,September1999.MorganKaufmann.
[595]F.Höppner,F.Klawonn,R.Kruse,andT.Runkler.FuzzyClusterAnalysis:MethodsforClassification,DataAnalysisandImageRecognition.JohnWiley&Sons,NewYork,July21999.
[596]R.A.JarvisandE.A.Patrick.ClusteringUsingaSimilarityMeasureBasedonSharedNearestNeighbors.IEEETransactionsonComputers,C-22(11):1025–1034,1973.
[597]I.Jonyer,D.J.Cook,andL.B.Holder.Graph-basedhierarchicalconceptualclustering.JournalofMachineLearningResearch,2:19–43,2002.
[598]K.Kailing,H.-P.Kriegel,andP.Kröger.Density-ConnectedSubspaceClusteringforHigh-DimensionalData.InProc.ofthe2004SIAMIntl.Conf.onDataMining,pages428–439,LakeBuenaVista,Florida,April2004.SIAM.
[599]G.Karypis,E.-H.Han,andV.Kumar.CHAMELEON:AHierarchicalClusteringAlgorithmUsingDynamicModeling.IEEEComputer,32(8):68–75,August1999.
[600]G.KarypisandV.Kumar.Multilevelk-wayPartitioningSchemeforIrregularGraphs.JournalofParallelandDistributedComputing,48(1):96–129,1998.
[601]T.Kohonen,T.S.Huang,andM.R.Schroeder.Self-OrganizingMaps.Springer-Verlag,December2000.
[602]D.D.LeeandH.S.Seung.Learningthepartsofobjectsbynon-negativematrixfactorization.Nature,401(6755):788791,1999.
[603]T.LiandC.H.Q.Ding.TheRelationshipsAmongVariousNonnegativeMatrixFactorizationMethodsforClustering.InProcoftheIEEEInternationalConferenceonDataMining,pages362–371,2006.
[604]R.S.MichalskiandR.E.Stepp.AutomatedConstructionofClassifications:ConceptualClusteringVersusNumericalTaxonomy.IEEETransactionsonPatternAnalysisandMachineIntelligence,5(4):396–409,1983.
[605]N.Mishra,D.Ron,andR.Swaminathan.ANewConceptualClusteringFramework.MachineLearningJournal,56(1–3):115–151,July/August/September2004.
[606]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.
[607]F.Murtagh.Clusteringmassivedatasets.InJ.Abello,P.M.Pardalos,andM.G.C.Reisende,editors,HandbookofMassiveDataSets.Kluwer,2000.
[608]H.Nagesh,S.Goil,andA.Choudhary.ParallelAlgorithmsforClusteringHigh-DimensionalLarge-ScaleDatasets.InR.L.Grossman,C.Kamath,P.Kegelmeyer,V.Kumar,andR.Namburu,editors,DataMiningforScientificandEngineeringApplications,pages335–356.KluwerAcademicPublishers,Dordrecht,Netherlands,October2001.
[609]R.T.NgandJ.Han.CLARANS:AMethodforClusteringObjectsforSpatialDataMining.IEEETransactionsonKnowledgeandDataEngineering,14(5):1003–1016,2002.
[610]M.PetersandM.J.Zaki.CLICKS:ClusteringCategoricalDatausingK-partiteMaximalCliques.InProc.ofthe21stIntl.Conf.onDataEngineering,Tokyo,Japan,April2005.
[611]E.SchikutaandM.Erhart.TheBANG-ClusteringSystem:Grid-BasedDataAnalysis.InAdvancesinIntelligentDataAnalysis,ReasoningaboutData,SecondIntl.Symposium,IDA-97,London,volume1280ofLectureNotesinComputerScience,pages513–524.Springer,August1997.
[612]G.Sheikholeslami,S.Chatterjee,andA.Zhang.Wavecluster:Amulti-resolutionclusteringapproachforverylargespatialdatabases.InProc.ofthe24thVLDBConf.,pages428–439,NewYorkCity,August1998.MorganKaufmann.
[613]J.ShiandJ.Malik.Normalizedcutsandimagesegmentation.IEEETransactionsonPatternAnalysisandMachineIntelligence,22(8):888905,2000.
[614]M.Steinbach,P.-N.Tan,V.Kumar,S.Klooster,andC.Potter.Discoveryofclimateindicesusingclustering.InKDD’03:ProceedingsoftheninthACMSIGKDDinternationalconferenceonKnowledgediscoveryanddatamining,pages446–455,NewYork,NY,USA,2003.ACMPress.
[615]R.E.SteppandR.S.Michalski.Conceptualclusteringofstructuredobjects:Agoal-orientedapproach.ArtificialIntelligence,28(1):43–69,1986.
[616]A.StrehlandJ.Ghosh.AScalableApproachtoBalanced,High-dimensionalClusteringofMarket-Baskets.InProc.ofthe7thIntl.Conf.onHighPerformanceComputing(HiPC2000),volume1970ofLectureNotesinComputerScience,pages525–536,Bangalore,India,December2000.Springer.
[617]T.Sun,C.Shu,F.Li,H.Yu,L.Ma,andY.Fang.Anefficienthierarchicalclusteringmethodforlargedatasetswithmap-reduce.InProcoftheIEEEInternationalConferenceonParallelandDistributedComputing,ApplicationsandTechnologies,pages494–499,2009.
[618]U.vonLuxburg.Atutorialonspectralclustering.StatisticsandComputing,17(4):395–416,2007.
[619]K.Wagstaff,C.Cardie,S.Rogers,andS.Schroedl.ConstrainedK-meansClusteringwithBackgroundKnowledge.InProceedingsof18thInternationalConferenceonMachineLearning,pages577–584,2001.
[620]T.Zhang,R.Ramakrishnan,andM.Livny.BIRCH:anefficientdataclusteringmethodforverylargedatabases.InProc.of1996ACM-SIGMODIntl.Conf.onManagementofData,pages103–114,Montreal,Quebec,Canada,June1996.ACMPress.
[621]W.Zhao,H.Ma,andQ.He.ParallelK-MeansClusteringbasedonMapReduce.InProcoftheIEEEInternationalConferenceonCloudComputing,page674-679,2009.
8.8Exercises1.Forsparsedata,discusswhyconsideringonlythepresenceofnon-zerovaluesmightgiveamoreaccurateviewoftheobjectsthanconsideringtheactualmagnitudesofvalues.Whenwouldsuchanapproachnotbedesirable?
2.DescribethechangeinthetimecomplexityofK-meansasthenumberofclusterstobefoundincreases.
3. Consider a set of documents. Assume that all documents have been normalized to have unit length of 1. What is the "shape" of a cluster that consists of all documents whose cosine similarity to a centroid is greater than some specified constant? In other words, cos(d, c) ≥ δ, where 0 < δ ≤ 1.
4.Discusstheadvantagesanddisadvantagesoftreatingclusteringasanoptimizationproblem.Amongotherfactors,considerefficiency,non-determinism,andwhetheranoptimization-basedapproachcapturesalltypesofclusteringsthatareofinterest.
5.Whatisthetimeandspacecomplexityoffuzzyc-means?OfSOM?HowdothesecomplexitiescomparetothoseofK-means?
6.TraditionalK-meanshasanumberoflimitations,suchassensitivitytooutliersanddifficultyinhandlingclustersofdifferentsizesanddensities,orwithnon-globularshapes.Commentontheabilityoffuzzyc-meanstohandlethesesituations.
7.Forthefuzzyc-meansalgorithmdescribedinthisbook,thesumofthemembershipdegreeofanypointoverallclustersis1.Instead,wecouldonlyrequirethatthemembershipdegreeofapointinaclusterbebetween0and1.Whataretheadvantagesanddisadvantagesofsuchanapproach?
8.Explainthedifferencebetweenlikelihoodandprobability.
9.Equation8.12 givesthelikelihoodforasetofpointsfromaGaussiandistributionasafunctionofthemeanμandthestandarddeviationσ.Showmathematicallythatthemaximumlikelihoodestimateofμandσarethesamplemeanandthesamplestandarddeviation,respectively.
10.Wetakeasampleofadultsandmeasuretheirheights.Ifwerecordthegenderofeachperson,wecancalculatetheaverageheightandthevarianceoftheheight,separately,formenandwomen.Suppose,however,thatthisinformationwasnotrecorded.Woulditbepossibletostillobtainthisinformation?Explain.
11.ComparethemembershipweightsandprobabilitiesofFigures8.1 and8.4 ,whichcome,respectively,fromapplyingfuzzyandEMclusteringtothesamesetofdatapoints.Whatdifferencesdoyoudetect,andhowmightyouexplainthesedifferences?
12.Figure8.32 showsaclusteringofatwo-dimensionalpointdatasetwithtwoclusters:Theleftmostcluster,whosepointsaremarkedbyasterisks,issomewhatdiffuse,whiletherightmostcluster,whosepointsaremarkedbycircles,iscompact.Totherightofthecompactcluster,thereisasinglepoint(markedbyanarrow)thatbelongstothediffusecluster,whosecenterisfartherawaythanthatofthecompactcluster.ExplainwhythisispossiblewithEMclustering,butnotK-meansclustering.
Figure8.32.DatasetforExercise12 .EMclusteringofatwo-dimensionalpointsetwithtwoclustersofdifferingdensity.
13.ShowthattheMSTclusteringtechniqueofSection8.4.2 producesthesameclustersassinglelink.Toavoidcomplicationsandspecialcases,assumethatallthepairwisesimilaritiesaredistinct.
14.Onewaytosparsifyaproximitymatrixisthefollowing:Foreachobject(rowinthematrix),setallentriesto0exceptforthosecorrespondingtotheobjectsk-nearestneighbors.However,thesparsifiedproximitymatrixistypicallynotsymmetric.
a. Ifobjectaisamongthek-nearestneighborsofobjectb,whyisbnotguaranteedtobeamongthek-nearestneighborsofa?
b. Suggestatleasttwoapproachesthatcouldbeusedtomakethesparsifiedproximitymatrixsymmetric.
15.Giveanexampleofasetofclustersinwhichmergingbasedontheclosenessofclustersleadstoamorenaturalsetofclustersthanmergingbasedonthestrengthofconnection(interconnectedness)ofclusters.
16.Table8.4 liststhetwonearestneighborsoffourpoints.CalculatetheSNNsimilaritybetweeneachpairofpointsusingthedefinitionofSNNsimilaritydefinedinAlgorithm8.11 .
Table8.4.Twonearestneighborsoffourpoints.
Point FirstNeighbor SecondNeighbor
1 4 3
2 3 4
3 4 2
4 3 1
17.ForthedefinitionofSNNsimilarityprovidedbyAlgorithm8.11 ,thecalculationofSNNdistancedoesnottakeintoaccountthepositionofsharedneighborsinthetwonearestneighborlists.Inotherwords,itmightbedesirabletogivehighersimilaritytotwopointsthatsharethesamenearestneighborsinthesameorroughlythesameorder.
a. DescribehowyoumightmodifythedefinitionofSNNsimilaritytogivehighersimilaritytopointswhosesharedneighborsareinroughlythesameorder.
b. Discusstheadvantagesanddisadvantagesofsuchamodification.
18.NameatleastonesituationinwhichyouwouldnotwanttouseclusteringbasedonSNNsimilarityordensity.
19.Grid-clusteringtechniquesaredifferentfromotherclusteringtechniquesinthattheypartitionspaceinsteadofsetsofpoints.
a. Howdoesthisaffectsuchtechniquesintermsofthedescriptionoftheresultingclustersandthetypesofclustersthatcanbefound?
b. Whatkindofclustercanbefoundwithgrid-basedclustersthatcannotbefoundbyothertypesofclusteringapproaches?(Hint:SeeExercise20inChapter7 ,page608.)
20.InCLIQUE,thethresholdusedtofindclusterdensityremainsconstant,evenasthenumberofdimensionsincreases.Thisisapotentialproblembecausedensitydropsasdimensionalityincreases;i.e.,tofindclustersinhigherdimensionsthethresholdhastobesetatalevelthatmaywellresultinthemergingoflow-dimensionalclusters.Commentonwhetheryoufeelthisistrulyaproblemand,ifso,howyoumightmodifyCLIQUEtoaddressthisproblem.
21.GivenasetofpointsinEuclideanspace,whicharebeingclusteredusingtheK-meansalgorithmwithEuclideandistance,thetriangleinequalitycanbeusedintheassignmentsteptoavoidcalculatingallthedistancesofeachpointtoeachclustercentroid.Provideageneraldiscussionofhowthismightwork.
22.InsteadofusingtheformuladerivedinCURE—seeEquation8.21 —wecouldrunaMonteCarlosimulationtodirectlyestimatetheprobabilitythatasampleofsizeswouldcontainatleastacertainfractionofthepointsfromacluster.UsingaMonteCarlosimulationcomputetheprobabilitythatasampleofsizescontains50%oftheelementsofaclusterofsize100,wherethetotalnumberofpointsis1000,andwherescantakethevalues100,200,or500.
9 Anomaly Detection

In anomaly detection, the goal is to find objects that do not conform to normal patterns or behavior. Often, anomalous objects are known as outliers, since, on a scatter plot of the data, they lie far away from other data points. Anomaly detection is also known as deviation detection, because anomalous objects have attribute values that deviate significantly from the expected or typical attribute values, or as exception mining, because anomalies are exceptional in some sense. In this chapter, we will mostly use the terms anomaly or outlier. There are a variety of anomaly detection approaches from several areas, including statistics, machine learning, and data mining. All try to capture the idea that an anomalous data object is unusual or in some way inconsistent with other objects.

Although unusual objects or events are, by definition, relatively rare, their detection and analysis provides critical insights that are useful in a number of applications. The following examples illustrate applications for which anomalies are of considerable interest.

Fraud Detection. The purchasing behavior of someone who steals a credit card is often different from that of the original owner. Credit card companies attempt to detect a theft by looking for buying patterns that characterize theft or by noticing a change from typical behavior. Similar approaches are relevant in many domains such as detecting insurance claim fraud and insider trading.

Intrusion Detection. Unfortunately, attacks on computer systems and computer networks are commonplace. While some of these attacks, such as those designed to disable or overwhelm computers and networks, are obvious, other attacks, such as those designed to secretly gather information, are difficult to detect. Many of these intrusions can only be detected by monitoring systems and networks for unusual behavior.

Ecosystem Disturbances. The Earth's ecosystem has been experiencing rapid changes in the last few decades due to natural or anthropogenic reasons. This includes an increased propensity for extreme events, such as heat waves, droughts, and floods, which have a huge impact on the environment. Identifying such extreme events from sensor recordings and satellite images is important for understanding their origins and behavior, as well as for devising sustainable adaptation policies.

Medicine and Public Health. For a particular patient, unusual symptoms or test results, such as an anomalous MRI scan, may indicate potential health problems. However, whether a particular test result is anomalous may depend on many other characteristics of the patient, such as age, sex, and genetic makeup. Furthermore, the categorization of a result as anomalous or not incurs a cost: unneeded additional tests if a patient is healthy and potential harm to the patient if a condition is left undiagnosed and untreated. The detection of emerging disease outbreaks, such as H1N1-influenza or SARS, which result in unusual and alarming test results in a series of patients, is also important for monitoring the spread of diseases and taking preventive actions.

Aviation Safety. Since aircraft are highly complex and dynamic systems, they are prone to accidents, often with drastic consequences, due to mechanical, environmental, or human factors. To monitor the occurrence of such anomalies, most commercial airplanes are equipped with a large number of sensors to measure different flight parameters, such as information from the control system, the avionics and propulsion systems, and pilot actions. Identifying abnormal events in these sensor recordings (e.g., an anomalous sequence of pilot actions or an abnormally functioning aircraft component) can help prevent aircraft accidents and promote aviation safety.

Although much of the recent interest in anomaly detection is driven by applications in which anomalies are the focus, historically, anomaly detection (and removal) has been viewed as a data preprocessing technique to eliminate erroneous data objects that may be recorded because of human error, a problem with the measuring device, or the presence of noise. Such anomalies provide no interesting information but only distort the analysis of normal objects. The identification and removal of such erroneous data objects is not the focus of this chapter. Instead, the emphasis is on detecting anomalous objects that are interesting in their own right.
9.1 Characteristics of Anomaly Detection Problems

Anomaly detection problems are quite diverse in nature as they appear in multiple application domains under different settings. This diversity in problem characteristics has resulted in a rich variety of anomaly detection methods that are useful in different situations. Before we discuss these methods, it will be useful to describe some of the key characteristics of anomaly detection problems that motivate the different styles of anomaly detection methods.

9.1.1 A Definition of an Anomaly

An important characteristic of an anomaly detection problem is the way an anomaly is defined. Since anomalies are rare occurrences that are not fully understood, they can be defined in different ways depending on the problem requirements. However, the following high-level definition of an anomaly encompasses most of the definitions commonly employed.

Definition 9.1. An anomaly is an observation that doesn't fit the distribution of the data for normal instances, i.e., is unlikely under the distribution of the majority of instances.

We note the following points:

This definition does not assume that the distribution is easy to express in terms of well-known statistical distributions. Indeed, the difficulty of doing so is the reason that many anomaly detection approaches use non-statistical approaches. Nonetheless, these approaches aim to find data objects that are not common.

Conceptually, we can rank data objects according to the probability of seeing such an object or something more extreme. The lower the probability, the more likely the object is an anomaly. Often, the reciprocal of the probability is used as a ranking score. Again, this is only practical in some cases. Such approaches are discussed in Section 9.3.

There can be various causes of an anomaly: noise, the object comes from another distribution, e.g., a few grapefruit mixed with oranges, or the object is just a rare occurrence of data from the distribution, e.g., a 7 foot tall person. As mentioned, we are not interested in anomalies due to noise.
9.1.2 Nature of Data

The nature of the input data plays a key role in deciding the choice of a suitable anomaly detection technique. Some of the common characteristics of the input data include the number and types of attributes, and the representation used for describing every data instance.

Univariate or Multivariate

If the data contains a single attribute, the question of whether an object is anomalous depends on whether the object's value for that attribute is anomalous. However, if the data objects are represented using many attributes, a data object may have anomalous values for some attributes but ordinary values for other attributes. Furthermore, an object may be anomalous even if none of its attribute values are individually anomalous. For example, it is common to have people who are two feet tall (children) or are 100 pounds in weight, but uncommon to have a two-foot tall person who weighs 100 pounds. Identifying an anomaly in a multivariate setting is thus challenging, particularly when the dimensionality of the data is high.

Record Data or Proximity Matrix

The most common approach for representing a data set is to use record data or its variants, e.g., a data matrix, where every data instance is described using the same set of attributes. However, for the purpose of anomaly detection, it is often sufficient to know how different an instance is in comparison to other instances. Hence, some anomaly detection methods work with a different representation of the input data known as a proximity matrix, where every entry in the matrix denotes the pairwise proximity (similarity or dissimilarity) between two instances. Note that a data matrix can always be converted to a proximity matrix by using an appropriate proximity measure. Also, a similarity matrix can be easily converted to a distance matrix using any of the transformations presented in Section 2.4.1.
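As a small illustration of the last point, the following sketch converts a similarity matrix with values in [0, 1] to a dissimilarity matrix using the simple transformation d = 1 − s. This is only one of several transformations that could be used, and the helper name is ours.

```python
import numpy as np

def similarity_to_dissimilarity(S):
    """Convert a similarity matrix S with entries in [0, 1] to a
    dissimilarity matrix using d = 1 - s.  Other monotone decreasing
    transformations are equally valid choices."""
    return 1.0 - np.asarray(S, dtype=float)

# Toy example: pairwise cosine similarities of three instances.
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
print(similarity_to_dissimilarity(S))
```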
Availability of Labels

The label of a data instance denotes whether the instance is normal or anomalous. If we have a training set with labels for every data instance, then the problem of anomaly detection translates to a supervised learning (classification) problem. Classification techniques that address the so-called rare class problem are particularly relevant because anomalies are relatively rare with respect to normal objects. See Section 4.11.

However, in most practical applications, we do not have a training set with accurate and representative labels of the normal and anomalous classes. Note that obtaining labels of the anomalous class is especially challenging because of their rarity. It is thus difficult for a human expert to catalog every type of anomaly since the properties of the anomalous class are often unknown. Hence, most anomaly detection problems are unsupervised in nature, i.e., the input data does not have any labels. All anomaly detection methods presented in this chapter operate in the unsupervised setting.

Note that in the absence of labels, it is challenging to differentiate anomalies from normal instances given an input data set. However, anomalies typically have some properties that techniques can take advantage of to make finding anomalies practical. Two key properties are the following:

Relatively Small in Number

Since anomalies are infrequent, most input data sets have a predominance of normal instances. The input data set is thus often used as an imperfect representation of the normal class in most anomaly detection techniques. However, the performance of such methods needs to be robust to the presence of outliers in the input data. Some anomaly detection methods also provide a mechanism to specify the expected number of outliers in the input data. Such methods can work with a larger number of anomalies in the data.

Sparsely Distributed

Anomalies, unlike normal objects, are often unrelated to each other and hence distributed sparsely in the space of attributes. Indeed, the successful operation of most anomaly detection methods depends on anomalies not being tightly clustered. However, some anomaly detection methods are specifically designed to find clustered anomalies (see Section 9.5.1), which are assumed to either be small in size or distant from other instances.
9.1.3 How Anomaly Detection is Used

There are two different ways in which any generic anomaly detection method can be used. In the first approach, we are given an input data set that contains both normal and anomalous instances, and are required to identify anomalies in this input data. All anomaly detection approaches presented in this chapter are able to operate in this setup. In the second approach, we are also provided with test instances (appearing one at a time) that need to be identified as anomalies. Most anomaly detection methods (with a few exceptions) are able to use the input data set to provide outputs on new test instances. Finding anomalies by finding anomalous clusters (Section 9.5.1) is one of the exceptions.
9.2 Characteristics of Anomaly Detection Methods

To cater to the diverse needs of anomaly detection problems, a number of techniques have been explored using concepts from different research disciplines. In this section, we provide a high-level description of some of the common characteristics of anomaly detection methods that are helpful in understanding their commonalities and differences.

Model-based vs. Model-free Many approaches for anomaly detection use the input data to build models that can be used to identify whether a test instance is anomalous or not. Most model-based techniques for anomaly detection build a model of the normal class and identify anomalies that do not fit this model. For example, we can fit a Gaussian distribution to model the normal class and then identify anomalies that do not conform well to the learned distribution. The other type of model-based techniques learns a model of both the normal and anomalous classes, and identifies instances as anomalies if they are more likely to belong to the anomalous class. Although these approaches technically require representative labels from both classes, they often make assumptions about the nature of the anomalous class, e.g., that anomalies are rare and sparsely distributed, and thus can work even in an unsupervised setting.

In addition to identifying anomalies, model-based methods provide information about the nature of the normal class and sometimes even the anomalous class. However, the assumptions they make about the properties of normal and anomalous classes may not hold true in every problem. In contrast, model-free approaches do not explicitly characterize the distribution of the normal or anomalous classes. Instead, they directly identify instances as anomalies without learning models from the input data. For example, an instance can be identified as an anomaly if it is quite different from other instances in its neighborhood. Model-free approaches are often intuitive and simple to use.

Global vs. Local Perspective An instance can be identified as an anomaly either by considering the global context, e.g., by building a model over all normal instances and using this global model for anomaly detection, or by considering the local perspective of every data instance. Specifically, an anomaly detection approach is termed local if its output on a given instance does not change if instances outside its local neighborhood are modified or removed. The difference between the global and local perspective can result in significant differences in the results of an anomaly detection method, because an object may seem unusual with respect to all objects globally, but not with respect to objects in its local neighborhood. For example, a person whose height is 6 feet 5 inches is unusually tall with respect to the general population, but not with respect to professional basketball players.

Label vs. Score Different approaches for anomaly detection produce their outputs in different formats. The most basic type of output is a binary anomaly label: an object is either identified as an anomaly or as a normal instance. However, labels do not provide any information about the degree to which an instance is anomalous. Frequently, some of the detected anomalies are more extreme than others, while some instances labeled as normal may be on the verge of being identified as anomalies.

Hence, many anomaly detection methods produce an anomaly score that indicates how strongly an instance is likely to be an anomaly. An anomaly score can easily be sorted and converted into ranks, so that an analyst can be provided with only the top-most scoring anomalies. Alternatively, a cutoff threshold can be applied to an anomaly score to obtain binary anomaly labels. The task of choosing the right threshold is often left to the discretion of the analyst. However, sometimes the scores have an associated meaning, e.g., statistical significance (see Section 9.3), which makes the analysis of anomalies easier and more interpretable.
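The following minimal sketch illustrates the two ways of consuming anomaly scores described above, assuming (as a convention, not a requirement) that higher scores are more anomalous; the function names are illustrative only.

```python
import numpy as np

def top_m_anomalies(scores, m):
    """Indices of the m highest-scoring instances (higher = more anomalous)."""
    return np.argsort(np.asarray(scores))[::-1][:m]

def scores_to_labels(scores, threshold):
    """Convert anomaly scores to binary labels with a cutoff threshold."""
    return (np.asarray(scores) >= threshold).astype(int)

scores = np.array([0.2, 3.1, 0.4, 0.3, 5.7, 0.1])
print(top_m_anomalies(scores, 2))     # indices of the two most anomalous instances
print(scores_to_labels(scores, 1.0))  # binary labels under a threshold of 1.0
```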
In the following sections, we provide brief descriptions of six types of anomaly detection approaches. For each type, we will describe their basic idea, key features, and underlying assumptions using illustrative examples. At the end of every section, we also discuss their strengths and weaknesses in handling different aspects of anomaly detection problems. To follow common practice, we will use the terms outlier and anomaly interchangeably in the remainder of this chapter.
9.3 Statistical Approaches

Statistical approaches make use of probability distributions (e.g., the Gaussian distribution) to model the normal class. A key feature of such distributions is that they associate a probability value to every data instance, indicating how likely it is for the instance to be generated from the distribution. Anomalies are then identified as instances that are unlikely to be generated from the probability distribution of the normal class.

There are two types of models that can be used to represent the probability distribution of the normal class: parametric models and non-parametric models. While parametric models use well-known families of statistical distributions that require estimating parameters from the data, non-parametric models are more flexible and learn the distribution of the normal class directly from the available data. In the following, we discuss both of these types of models for anomaly detection.

9.3.1 Using Parametric Models

Some of the common types of parametric models that are widely used for describing many types of data sets include the Gaussian distribution, the Poisson distribution, and the binomial distribution. They involve parameters that need to be learned from the data, e.g., a Gaussian distribution requires identifying the mean and variance parameters from the data.

Parametric models are quite effective in representing the behavior of the normal class, especially when the normal class is known to follow a specific distribution. The anomaly scores computed by parametric models also have strong theoretical properties, which can be used for analyzing the anomaly scores and assessing their statistical significance. In the following, we discuss the use of the Gaussian distribution for modeling the normal class, in the univariate and multivariate settings.
Using the Univariate Gaussian Distribution The Gaussian (normal) distribution is one of the most frequently used distributions in statistics, and we will use it to describe a simple approach to statistical outlier detection. The Gaussian distribution has two parameters, μ and σ, which are the mean and standard deviation, respectively, and is represented using the notation N(μ, σ). The probability density function f(x) of a point x under the Gaussian distribution is given as

f(x) = (1/√(2πσ²)) e^(−(x − μ)²/(2σ²)).   (9.1)

Figure 9.1 shows the probability density function of N(0, 1). We can see that p(x) declines as x moves farther from the center of the distribution. We can thus use the distance of a point x from the origin as an anomaly score. As we will see later in Section 9.3.4, this distance value has an interpretation in terms of probability that can be used to assess the confidence in calling x an outlier.

Figure 9.1. Probability density function of a Gaussian distribution with a mean of 0 and a standard deviation of 1.

If the attribute of interest x follows a Gaussian distribution with mean μ and standard deviation σ, i.e., N(μ, σ), a common approach is to transform the attribute x to a new attribute z, which has a N(0, 1) distribution. This can be done by using z = (x − μ)/σ, which is called the z-score. Note that z² is directly related to the probability density of the point x in Equation 9.1, since that equation can be rewritten as follows:

p(x) = (1/√(2πσ²)) e^(−z²/2).   (9.2)

The parameters μ and σ of the Gaussian distribution can be estimated from the training data of mostly normal instances, by using the sample mean x̄ as μ and the sample standard deviation s_x as σ. However, if we believe the outliers are distorting the estimates of these parameters too much, more robust estimates of these quantities can be used (see the Bibliographic Notes).
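A minimal sketch of this univariate approach, assuming the data is a single attribute and using the absolute z-score as the anomaly score. The sample mean and standard deviation are used as estimates of μ and σ, so, as noted above, strong outliers can distort them.

```python
import numpy as np

def z_score_anomaly_scores(x):
    """Absolute z-scores of a single attribute, used as anomaly scores
    under the assumption that the normal class is roughly Gaussian."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std(ddof=1)   # sample estimates of mu and sigma
    return np.abs((x - mu) / sigma)

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.5])  # last value is unusual
print(z_score_anomaly_scores(data).round(2))  # the last value gets the largest score
```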
Using the Multivariate Gaussian Distribution For a data set comprised of more than one continuous attribute, we can use a multivariate Gaussian distribution to model the normal class. A multivariate Gaussian distribution N(μ, Σ) involves two parameters, the mean vector μ and the covariance matrix Σ, which need to be estimated from the data. The probability density of a point x distributed as N(μ, Σ) is given by

f(x) = (1/((2π)^(p/2) |Σ|^(1/2))) exp(−(x − μ) Σ⁻¹ (x − μ)ᵀ / 2),   (9.3)

where p is the number of dimensions of x and |Σ| denotes the determinant of the covariance matrix Σ.

In the case of a multivariate Gaussian distribution, the distance of a point x from the center μ cannot be directly used as a viable anomaly score. This is because a multivariate normal distribution is not symmetrical with respect to its center if there are correlations between the attributes. To illustrate this, Figure 9.2 shows the probability density of a two-dimensional multivariate Gaussian distribution with mean of (0, 0) and a covariance matrix of

Σ = ( 1.00  0.75 ; 0.75  3.00 ).

Figure 9.2. Probability density of points for the Gaussian distribution used to generate the points of Figure 9.3.

Figure 9.3. Mahalanobis distance of points from the center of a two-dimensional set of 2002 points.

The probability density varies asymmetrically as we move outward from the center in different directions. To account for this fact, we need a distance measure that takes the shape of the data into consideration. The Mahalanobis distance is one such distance measure. (See Equation 2.27 on page 96.) The Mahalanobis distance between a point x and the mean of the data x̄ is given by

Mahalanobis(x, x̄) = (x − x̄) S⁻¹ (x − x̄)ᵀ,   (9.4)

where S is the estimated covariance matrix of the data. Note that the Mahalanobis distance between x and x̄ is directly related to the probability density of x in Equation 9.3, when x̄ and S are used as estimates of μ and Σ, respectively. (See Exercise 9 on page 751.)

Example 9.1 (Outliers in a Multivariate Normal Distribution). Figure 9.3 shows the Mahalanobis distance (from the mean of the distribution) for points in a two-dimensional data set. The points A (−4, 4) and B (5, 5) are outliers that were added to the data set, and their Mahalanobis distance is indicated in the figure. The other 2000 points of the data set were randomly generated using the distribution used for Figure 9.2.

Both A and B have large Mahalanobis distances. However, even though A is closer to the center (the large black x at (0, 0)) as measured by Euclidean distance, it is farther away than B in terms of the Mahalanobis distance because the Mahalanobis distance takes the shape of the distribution into account. In particular, point B has a Euclidean distance of 5√2 and a Mahalanobis distance of 24, while the point A has a Euclidean distance of 4√2 and a Mahalanobis distance of 35.

The above approaches assume that the normal class is generated from a single Gaussian distribution. Note that this may not always be the case, especially if there are multiple types of normal classes that have different means and variances. In such cases, we can use a Gaussian mixture model (as described in Chapter 8.2.2) to represent the normal class. For each point, the smallest Mahalanobis distance of the point to any of the Gaussian distributions is computed and used as the anomaly score. This approach is related to the clustering-based approaches for anomaly detection, which will be described in Section 9.5.
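A minimal sketch of Mahalanobis-based scoring, assuming the sample mean and sample covariance are used as estimates of μ and Σ in Equation 9.4. It generates data resembling Example 9.1, though the random points differ from those shown in the figures.

```python
import numpy as np

def mahalanobis_scores(X):
    """Mahalanobis distance (Equation 9.4) of each row of X from the
    sample mean, using the sample covariance matrix, as an anomaly score."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mean
    # (x - mean) S^{-1} (x - mean)^T, computed row by row
    return np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

rng = np.random.default_rng(0)
normal = rng.multivariate_normal([0, 0], [[1.0, 0.75], [0.75, 3.0]], size=2000)
X = np.vstack([normal, [[-4, 4], [5, 5]]])   # append outliers A and B
scores = mahalanobis_scores(X)
print(scores[-2:])   # the two appended points receive much larger scores
```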
9.3.2 Using Non-parametric Models

An alternative for modeling the distribution of the normal class is to use kernel density estimation-based techniques that employ kernel functions (described in Section 8.3.3) to approximate the density of the normal class from the available data. This results in the construction of a non-parametric probability distribution of the normal class, such that regions with a dense occurrence of normal instances have high probability and vice-versa. Note that kernel-based approaches do not assume that the data conforms to any known family of distributions but instead derive the distribution purely from the data. Having learned a probability density for the normal class using the kernel density approach, the anomaly score of an instance is computed as the inverse of its probability with respect to the learned density.
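A minimal sketch of kernel density based scoring, using SciPy's Gaussian kernel density estimator with its default bandwidth rule (a choice of ours, not prescribed by the text) and the inverse of the estimated density as the anomaly score.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_anomaly_scores(X):
    """Anomaly scores as the inverse of a kernel density estimate of the
    normal class.  Bandwidth selection matters in practice; the default
    rule is used here for simplicity."""
    X = np.asarray(X, dtype=float)
    kde = gaussian_kde(X.T)            # gaussian_kde expects shape (d, n)
    density = kde.evaluate(X.T)
    return 1.0 / (density + 1e-12)     # small constant avoids division by zero

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[6.0, 6.0]]])  # one far-away point
scores = kde_anomaly_scores(X)
print(scores[-1] > np.median(scores))  # True: the far-away point scores highest
```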
A simpler non-parametric approach to modeling the normal class is to build a histogram of the normal data. For example, if the data contains a single continuous attribute, then we can construct bins for different ranges of the attribute, using the equal-width discretization technique described in Section 2.3.6. We can then check if a new test instance falls in any of the bins of the histogram. If it does not fall in any of the bins, we can identify it as an anomaly. Otherwise, we can use the inverse of the height (frequency) of the bin in which it falls as its anomaly score. This approach is known as the frequency-based or counting-based approach for anomaly detection.

A key step in using frequency-based approaches for anomaly detection is choosing the size of the bin used for constructing the histogram. A small bin size can falsely identify many normal instances as anomalous, since they might fall in empty or sparsely populated bins. On the other hand, if the bin size is too large, many anomalous instances may fall in heavily populated bins and go unnoticed. Thus, choosing the right bin size is challenging, and often requires trying multiple size options or using expert knowledge.
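A minimal sketch of the frequency-based approach for a single continuous attribute, assuming equal-width bins built on the training data; values falling in empty bins or outside the training range are given an infinite score.

```python
import numpy as np

def histogram_anomaly_scores(train, test, n_bins=20):
    """Frequency-based anomaly scores for one continuous attribute.
    A test value in an empty bin (or outside the range of the training
    data) gets an infinite score; otherwise the score is the inverse of
    the count of the bin it falls in."""
    counts, edges = np.histogram(train, bins=n_bins)
    idx = np.searchsorted(edges, test, side='right') - 1
    scores = np.full(len(test), np.inf)
    in_range = (idx >= 0) & (idx < n_bins)
    hit = in_range & (counts[np.clip(idx, 0, n_bins - 1)] > 0)
    scores[hit] = 1.0 / counts[idx[hit]]
    return scores

rng = np.random.default_rng(2)
train = rng.normal(50, 5, size=1000)
test = np.array([50.0, 52.0, 90.0])          # last value lies far outside the range
print(histogram_anomaly_scores(train, test))  # finite, finite, inf
```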
9.3.3 Modeling Normal and Anomalous Classes

The statistical approaches described so far only model the distribution of the normal class but not the anomalous class. They assume that the training set predominantly has normal instances. However, if there are outliers present in the training data, which is common in most practical applications, the learning of the probability distributions corresponding to the normal class may be distorted, resulting in poor identification of anomalies.

Here, we present a statistical approach for anomaly detection that can tolerate a considerable fraction (λ) of outliers in the training set, provided that the outliers are uniformly distributed (and thus not clustered) in the attribute space. This approach makes use of a mixture modeling technique to learn the distribution of normal and anomalous classes. This approach is similar to the Expectation-Maximization (EM) based technique introduced in the context of clustering in Chapter 8.2.2. Note that λ, the fraction of outliers, is like a prior.
The basic idea of this approach is to assume that instances are generated with probability λ from the anomalous class, which has uniform distribution p_A, and with probability 1 − λ from the normal class, which has the distribution f_M(θ), where θ represents the parameters of the distribution. The approach for assigning training instances to the normal and anomaly classes can be described as follows. Initially, all the objects are assigned to the normal class and the set of anomalous objects is empty. At every iteration of the EM algorithm, objects are transferred from the normal class to the anomaly class to improve the likelihood of the overall data. Let M_t and A_t be the sets of normal and anomalous objects, respectively, at iteration t. The likelihood of the data set D, L_t(D), and its log-likelihood, log L_t(D), are then given by the following equations:

L_t(D) = ∏_{x_i ∈ D} P(x_i) = ( (1 − λ)^|M_t| ∏_{x_i ∈ M_t} P_M(x_i, θ_t) ) ( λ^|A_t| ∏_{x_i ∈ A_t} P_A(x_i) )   (9.5)

log L_t(D) = |M_t| log(1 − λ) + Σ_{x_i ∈ M_t} log P_M(x_i, θ_t) + |A_t| log λ + Σ_{x_i ∈ A_t} log P_A(x_i)   (9.6)

where |M_t| and |A_t| are the number of objects in the normal and anomaly classes, respectively, and θ_t represents the parameters of the distribution of the normal class, which can be estimated using M_t. If the transfer of an object x from M_t to A_t results in a significant increase in the log-likelihood of the data (greater than a threshold c), then x is assigned to the set of outliers. The set of outliers A_t keeps growing till we achieve the maximum likelihood of the data using M_t and A_t. This approach is summarized in Algorithm 9.1.

Algorithm 9.1 Likelihood-based outlier detection.
1: Initialization: At time t = 0, let M_t contain all the objects, while A_t is empty.
2: for each object x that belongs to M_t do
3:   Move x from M_t to A_t to produce the new data sets M_{t+1} and A_{t+1}.
4:   Compute the new log-likelihood of D, log L_{t+1}(D).
5:   Compute the difference, Δ = log L_{t+1}(D) − log L_t(D).
6:   if Δ > c, where c is some threshold then
7:     Classify x as an anomaly.
8:     Increment t by one and use M_{t+1} and A_{t+1} in the next iteration.
9:   end if
10: end for

Because the number of normal objects is large compared to the number of anomalies, the distribution of the normal objects may not change much when an object is moved to the set of anomalies. In that case, the contribution of each normal object to the overall likelihood of the normal objects will remain relatively constant. Furthermore, each object moved to the set of anomalies contributes a fixed amount to the likelihood of the anomalies. Thus, the overall change in the total likelihood of the data when an object is moved to the set of anomalies is roughly equal to the probability of the object under a uniform distribution (weighted by λ) minus the probability of the object under the distribution of the normal data objects (weighted by 1 − λ). Consequently, the set of anomalies will tend to consist of those objects that have significantly higher probability under a uniform distribution than under the distribution of the normal objects.

In the situation just discussed, the approach described by Algorithm 9.1 is roughly equivalent to classifying objects with a low probability under the distribution of normal objects as outliers. For example, when applied to the points in Figure 9.3, this technique would classify points A and B (and other points far from the mean) as outliers. However, if the distribution of the normal objects changes significantly as anomalies are removed or the distribution of the anomalies can be modeled in a more sophisticated manner, then the results produced by this approach will be different than the results of simply classifying low-probability objects as outliers. Also, this approach can work even when the distribution of normal objects is multi-modal, e.g., by using a mixture of Gaussian distributions for f_M(θ). Also, conceptually, it should be possible to use this approach with distributions other than Gaussian.
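A minimal one-dimensional sketch of Algorithm 9.1, assuming a Gaussian model (re-estimated from the current normal set) for the normal class and a uniform distribution over the observed range for the anomalous class. The parameters lam and c correspond to λ and the threshold c above; the single greedy pass over the objects is a simplification.

```python
import numpy as np
from scipy.stats import norm

def likelihood_based_outliers(x, lam=0.05, c=0.0):
    """Greedy 1-D sketch of Algorithm 9.1: move an object to the anomaly
    set if doing so raises the data log-likelihood by more than c."""
    x = np.asarray(x, dtype=float)
    p_uniform = 1.0 / (x.max() - x.min())   # density of the uniform anomaly model
    normal = np.ones(len(x), dtype=bool)    # True = in M (normal), False = in A

    def log_likelihood(mask):
        m = x[mask]
        mu, sigma = m.mean(), m.std(ddof=1)  # re-estimate normal-class parameters
        ll_normal = mask.sum() * np.log(1 - lam) + norm.logpdf(m, mu, sigma).sum()
        ll_anom = (~mask).sum() * (np.log(lam) + np.log(p_uniform))
        return ll_normal + ll_anom

    current = log_likelihood(normal)
    for i in range(len(x)):
        trial = normal.copy()
        trial[i] = False                     # tentatively move x_i to the anomaly set
        new = log_likelihood(trial)
        if new - current > c:                # significant gain: keep x_i as an anomaly
            normal, current = trial, new
    return np.where(~normal)[0]

data = np.concatenate([np.random.default_rng(3).normal(0, 1, 200), [8.0, -9.0]])
print(likelihood_based_outliers(data))       # indices of the two extreme values
```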
9.3.4 Assessing Statistical Significance

Statistical approaches provide a way to assign a measure of confidence for the instances detected as anomalies. For example, since the anomaly scores computed by statistical approaches have a probabilistic meaning, we can apply a threshold to these scores with statistical guarantees. Alternatively, it is possible to define statistical tests (also termed discordancy tests) that can identify the statistical significance of an instance being identified as an anomaly by a statistical approach. Many of these discordancy tests are highly specialized and assume a level of statistical knowledge beyond the scope of this text. Thus, we illustrate the basic ideas with a simple example that uses univariate Gaussian distributions, and refer the reader to the Bibliographic Notes for further pointers.

Consider the Gaussian distribution N(0, 1) shown in Figure 9.1. As discussed previously in Section 9.3.1, most of the probability density is centered around zero and there is little probability that an object (value) belonging to N(0, 1) will occur in the tails of the distribution. For instance, there is only a probability of 0.0027 that an object lies beyond the central area between ±3 standard deviations. More generally, if c is a constant and x is the attribute value of an object, then the probability that |x| ≥ c decreases rapidly as c increases. Let α = prob(|x| ≥ c). Table 9.1 shows some sample values for c and the corresponding values for α when the distribution is N(0, 1). Note that a value that is more than 4 standard deviations from the mean is a one-in-ten-thousand occurrence.
Table 9.1. Sample pairs (c, α), α = prob(|x| ≥ c), for a Gaussian distribution with mean 0 and standard deviation 1.

c      α for N(0, 1)
1.00   0.3173
1.50   0.1336
2.00   0.0455
2.50   0.0124
3.00   0.0027
3.50   0.0005
4.00   0.0001

This interpretation of the distance of a point from the center can be used as the basis of a test to assess whether an object is an outlier, using the following definition.

Definition 9.2 (Outlier for a Single Gaussian Attribute). An object with attribute value x from a Gaussian distribution with mean of 0 and standard deviation 1, N(0, 1), is an outlier if

|x| ≥ c,    (9.7)

where c is a constant chosen so that P(|x| ≥ c) = α, where P represents probability.

To use this definition it is necessary to specify a value for α. From the viewpoint that unusual values (objects) indicate a value from a different distribution, α indicates the probability that we mistakenly classify a value from the given distribution as an outlier. From the viewpoint that an outlier is a rare value of an N(0, 1) distribution, α specifies the degree of rareness.

More generally, for a Gaussian distribution with mean μ and standard deviation σ, we can first compute the z-score of x and then apply the above test to the z-score. In practice, this works well when μ and σ are estimated from a large population. A more sophisticated statistical procedure (Grubbs' test), which takes into account the distortion of parameter estimates caused by outliers, is explored in Exercise 7 on page 750.

The approach to outlier detection presented here is equivalent to testing data objects for statistical significance and classifying the statistically significant objects as anomalies. This is discussed in more detail in Chapter 10.
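The entries of Table 9.1 and the test of Definition 9.2 are easy to reproduce; the short sketch below, assuming SciPy and NumPy, computes α = prob(|x| ≥ c) from the standard normal survival function and flags values whose z-scores exceed a chosen c. The function name and the default c = 3.0 are illustrative choices.

import numpy as np
from scipy.stats import norm

for c in [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]:
    alpha = 2 * norm.sf(c)                    # prob(|x| >= c) for N(0, 1); sf is the survival function
    print(f"c = {c:.2f}  alpha = {alpha:.4f}")

def gaussian_outliers(x, c=3.0):
    # Flag values whose z-score exceeds c in magnitude (Definition 9.2 applied to z-scores).
    z = (x - x.mean()) / x.std()
    return np.abs(z) >= c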
9.3.5 Strengths and Weaknesses

Statistical approaches to outlier detection have a firm theoretical foundation and build on standard statistical techniques. When there is sufficient knowledge of the data and the type of test that should be applied, these approaches are statistically justifiable and can be very effective. They can also provide confidence intervals associated with the anomaly scores, which can be very helpful in making decisions about test instances, e.g., determining thresholds on the anomaly score.

However, if the wrong model is chosen, then a normal instance can be erroneously identified as an outlier. For example, the data may be modeled as coming from a Gaussian distribution, but may actually come from a distribution that has a higher probability (than the Gaussian distribution) of having values far from the mean. Statistical distributions with this type of behavior are common in practice and are known as heavy-tailed distributions. Also, we note that while there are a wide variety of statistical outlier tests for single attributes, far fewer options are available for multivariate data, and these tests can perform poorly for high-dimensional data.
9.4 Proximity-based Approaches
Proximity-based methods identify anomalies as those instances that are most distant from the other objects. This relies on the assumption that normal instances are related and hence appear close to each other, while anomalies are different from the other instances and hence are relatively far from other instances. Since many of the proximity-based techniques are based on distances, they are also referred to as distance-based outlier detection techniques.

Proximity-based approaches are model-free anomaly detection techniques, since they do not construct an explicit model of the normal class for computing the anomaly score. They make use of the local perspective of every data instance to compute its anomaly score. They are more general than statistical approaches, since it is often easier to determine a meaningful proximity measure for a data set than to determine its statistical distribution. In the following, we present some of the basic proximity-based approaches for defining an anomaly score. Primarily, these techniques differ in the way they analyze the locality of a data instance.
9.4.1 Distance-based Anomaly Score

One of the simplest ways to define a proximity-based anomaly score of a data instance x is to use the distance to its kth nearest neighbor, dist(x, k). If an instance x has many other instances located close to it (characteristic of the normal class), it will have a low value of dist(x, k). On the other hand, an anomalous instance x will be quite distant from its k neighboring instances and would thus have a high value of dist(x, k).

Figure 9.4 shows a set of points in a two-dimensional space that have been shaded according to their distance to the kth nearest neighbor, dist(x, k) (where k = 5). Note that point C has been correctly assigned a high anomaly score, as it is located far away from other instances.

Figure 9.4. Anomaly score based on the distance to the fifth nearest neighbor.

Note that dist(x, k) can be quite sensitive to the value of k. If k is too small, e.g., 1, then a small number of outliers located close to each other can show a low anomaly score. For example, Figure 9.5 shows anomaly scores using dist(x, k) with k = 1 for a set of normal points and two outliers that are located close to each other (shading reflects anomaly scores). Note that both C and its neighbor have a low anomaly score. If k is too large, then it is possible for all objects in a cluster that has fewer than k objects to become anomalies. For example, Figure 9.6 shows a data set that has a small cluster of size 5 and a larger cluster of size 30. For k = 5, the anomaly score of all points in the smaller cluster is very high.

Figure 9.5. Anomaly score based on the distance to the first nearest neighbor. Nearby outliers have low anomaly scores.

Figure 9.6. Anomaly score based on distance to the fifth nearest neighbor. A small cluster becomes an outlier.

An alternative distance-based anomaly score that is more robust to the choice of k is the average distance to the first k nearest neighbors, avg.dist(x, k). Indeed, avg.dist(x, k) is widely used in a number of applications as a reliable proximity-based anomaly score.
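Both dist(x, k) and avg.dist(x, k) can be computed with a nearest-neighbor index. The sketch below assumes scikit-learn and NumPy and treats X as an array with one row per instance; the helper name is ours.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_distance_scores(X, k=5):
    # Request k+1 neighbors because each training point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    dist_k = dists[:, -1]                     # dist(x, k): distance to the kth nearest neighbor
    avg_dist_k = dists[:, 1:].mean(axis=1)    # avg.dist(x, k): average over the k nearest neighbors
    return dist_k, avg_dist_k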
9.4.2 Density-based Anomaly Score

The density around an instance can be defined as n/V(d), where n is the number of instances within a specified distance d from the instance, and V(d) is the volume of the neighborhood. Since V(d) is constant for a given d, the density around an instance is often represented using the number of instances n within a fixed distance d. This definition is similar to the one used by the DBSCAN clustering algorithm in Section 7.4. From a density-based viewpoint, anomalies are instances that are in regions of low density. Hence, an anomaly will have a smaller number of instances within a distance d than a normal instance.

Similar to the trade-off in choosing the parameter k in distance-based measures, it is challenging to choose the parameter d in density-based measures. If d is too small, then many normal instances can incorrectly show low density values. If d is too large, then many anomalies may have densities that are similar to normal instances.

Note that the distance-based and density-based views of proximity are quite similar to each other. To realize this, consider the k-nearest neighbors of a data instance x, whose distance to the kth nearest neighbor is given by dist(x, k). In this approach, dist(x, k) provides a measure of the density around x, using a different value of d for every instance. If dist(x, k) is large, the density around x is small, and vice-versa. Distance-based and density-based anomaly scores thus follow an inverse relationship. This can be used to define the following measures of density that are based on the two distance measures, dist(x, k) and avg.dist(x, k):

density(x, k) = 1 / dist(x, k),
avg.density(x, k) = 1 / avg.dist(x, k).
9.4.3 Relative Density-based Anomaly Score

The above proximity-based approaches only consider the locality of an individual instance for computing its anomaly score. In scenarios where the data contains regions of varying densities, such methods would not be able to correctly identify anomalies, as the notion of a normal locality would change across regions.

To illustrate this, consider the set of two-dimensional points in Figure 9.7. This figure has one rather loose cluster of points, another dense cluster of points, and two points, C and D, which are quite far from these two clusters. Assigning anomaly scores to points according to dist(x, k) with k = 5 correctly identifies point C to be an anomaly, but shows a low score for point D. In fact, the score for D is much lower than that of many points that are part of the loose cluster. To correctly identify anomalies in such data sets, we need a notion of density that is relative to the densities of neighboring instances. For example, point D in Figure 9.7 has a higher absolute density than point A, but its density is lower relative to its nearest neighbors.
Figure9.7.Anomalyscorebasedonthedistancetothefifthnearestneighbor,whenthereareclustersofvaryingdensities.
There are many ways to define the relative density of an instance. For a point x, one approach is to compute the ratio of the average density of its k-nearest neighbors, y_1, . . . , y_k, to the density of x, as follows:

relative density(x, k) = ( ∑_{i=1}^{k} density(y_i, k) / k ) / density(x, k).    (9.8)

The relative density of a point is high when the average density of points in its neighborhood is significantly higher than the density of the point.

Note that by replacing density(x, k) with avg.density(x, k) in the above equation, we can obtain a more robust measure of relative density. The above approach is similar to that used by the Local Outlier Factor (LOF) score, which is a widely-used measure for detecting anomalies using relative density. (See Bibliographic Notes.) However, LOF uses a somewhat different definition of density to achieve results that are more robust.
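A direct implementation of Equation 9.8 is straightforward once the k-nearest neighbors are known. The following sketch, assuming scikit-learn and NumPy, computes density(x, k) = 1/dist(x, k) and the ratio in Equation 9.8; the function name is illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def relative_density_scores(X, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, idx = nn.kneighbors(X)             # column 0 is the point itself
    density = 1.0 / dists[:, -1]              # density(x, k) = 1 / dist(x, k)
    neighbor_density = density[idx[:, 1:]].mean(axis=1)
    return neighbor_density / density         # Equation 9.8; high values indicate anomalies

Values much larger than 1 mark points that are considerably less dense than their own neighbors.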
Example 9.2 (Relative Density Anomaly Detection). Figure 9.8 shows the performance of the relative density-based anomaly detection method on the example data set used previously in Figure 9.7. The anomaly score of every point is computed using Equation 9.8 (with k = 10). The shading of every point represents its score, i.e., points with a higher score are darker. We have labeled points A, C, and D, which have the largest anomaly scores. Respectively, these points are the most extreme anomaly, the most extreme point with respect to the compact set of points, and the most extreme point in the loose set of points.

Figure 9.8. Relative density (LOF) outlier scores for the two-dimensional points of Figure 9.7.
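In practice, the LOF score itself is available in standard libraries. The brief sketch below, assuming scikit-learn, returns LOF-based scores in which larger values are more anomalous; as noted above, LOF's density definition differs somewhat from Equation 9.8.

from sklearn.neighbors import LocalOutlierFactor

def lof_scores(X, k=10):
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)
    return -lof.negative_outlier_factor_      # larger values indicate more anomalous points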
9.4.4 Strengths and Weaknesses
Proximity-basedapproachesarenon-parametricinnatureandhencearenotrestrictedtoanyparticularformofdistributionofthenormalandanomalousclasses.Theyhaveabroadapplicabilityoverawiderangeofanomalydetectionproblemswhereareasonableproximitymeasurecanbedefinedbetweeninstances.Theyarequiteintuitiveandvisuallyappealing,sinceproximity-basedanomaliescanbeinterpretedvisuallywhenthedatacanbedisplayedintwo-orthree-dimensionalscatterplots.
However, the effectiveness of proximity-based methods depends greatly on the choice of the distance measure. Defining distances in high-dimensional spaces can be challenging. In some cases, dimensionality reduction techniques can be used to map the instances into a lower-dimensional feature space. Proximity-based methods can then be applied in the reduced space for detecting anomalies. Another challenge common to all proximity-based methods is their high computational complexity. Given n points, computing the anomaly score for every point requires considering all pairwise distances, resulting in an O(n²) running time. For large data sets this can be too expensive, although specialized algorithms can be used to improve performance in some cases, e.g., with low-dimensional data sets. Choosing the right value of the parameters (k or d) in proximity-based methods is also difficult and often requires domain expertise.
9.5 Clustering-based Approaches
Clustering-based methods for anomaly detection use clusters to represent the normal class. This relies on the assumption that normal instances appear close to each other and hence can be grouped into clusters. Anomalies are then identified as instances that do not fit well in the clustering of the normal class, or appear in small clusters that are far apart from the clusters of the normal class. Clustering-based methods can be categorized into two types: methods that consider small clusters as anomalies, and methods that define a point as anomalous if it does not fit the clustering well, typically as measured by its distance from a cluster center. We describe both types of clustering-based methods next.
9.5.1 Finding Anomalous Clusters
Thisapproachassumesthepresenceofclusteredanomaliesinthedata,wheretheanomaliesappearintightgroupsofsmallsize.Clusteredanomaliesappearwhentheanomaliesarebeinggeneratedfromthesameanomalousclass.Forexample,anetworkattackmayhaveacommonpatterninitsoccurrence,possiblybecauseofacommonattacker,whoappearsinsimilarwaysinmultipleinstances.
Clustersofanomaliesaregenerallysmallinsize,sinceanomaliesarerareinnature.Theyarealsoexpectedtobequitedistantfromtheclustersofthenormalclass,sinceanomaliesdonotconformtonormalpatternsorbehavior.Hence,abasicapproachfordetectinganomalousclustersistoclusterthe
overalldataandflagclustersthatareeithertoosmallinsizeortoofarfromotherclusters.
Forinstance,ifweuseaprototype-basedmethodforclusteringtheoveralldata,e.g.,usingK-means,everyclustercanberepresentedbyitsprototype,e.g.,thecentroidofthecluster.Wecanthentreateveryprototypeasapointandstraightforwardlyidentifyclustersthataredistantfromtherest.Asanotherexample,ifweareusinghierarchicaltechniquessuchasMIN,MAX,orGroupAverage—seeSection7.3 —thenanomaliesareoftenidentifiedasthoseinstancesthatareinsmallclustersorremainsingletonsevenafteralmostallotherpointshavebeenclustered.
9.5.2 Finding Anomalous Instances
Fromaclusteringperspective,anotherwayofdescribingananomalyisasaninstancethatcannotbeexplainedwellbyanyofthenormalclusters.Hence,abasicapproachforanomalydetectionistofirstclusterallthedata(comprisedmainlyofnormalinstances)andthenassessthedegreetowhicheveryinstancebelongstoitsrespectivecluster.Forexample,ifweuseK-meansclustering,thedistanceofaninstancetoitsclustercentroidrepresentshowstronglyitbelongstothecluster.Instancesthatarequitedistantfromtheirrespectiveclustercentroidscanthusbeidentifiedasanomalies.
Althoughclustering-basedmethodsforanomalydetectionarequiteintuitiveandsimpletouse,thereareanumberofconsiderationsthatmustbekeptinmindwhileusingthem,aswediscussinthefollowing.
Assessing the Extent to Which an Object Belongs to a Cluster
For prototype-based clusters, there are several ways to assess the extent to which an instance belongs to a cluster. One method is to measure the distance of an instance from its cluster prototype and consider this as the anomaly score of the instance. However, if the clusters are of differing densities, then we can construct an anomaly score that measures the relative distance of an instance from the cluster prototype with respect to the distances of the other instances in the cluster. Another possibility, provided that the clusters can be accurately modeled in terms of Gaussian distributions, is to use the Mahalanobis distance.
Forclusteringtechniquesthathaveanobjectivefunction,wecanassignananomalyscoretoaninstancethatreflectstheimprovementintheobjectivefunctionwhenthatinstanceiseliminatedfromtheoveralldata.However,suchanapproachisoftencomputationallyintensive.Forthatreason,thedistance-basedapproachesofthepreviousparagraphareusuallypreferred.
Example9.3(Clustering-BasedExample).ThisexampleisbasedonthesetofpointsshowninFigure9.7 .Prototype-basedclusteringinthisexampleusestheK-meansalgorithm,andtheanomalyscoreofapointiscomputedintwoways:(1)bythepoint’sdistancefromitsclosestcentroid,and(2)bythepoint’srelativedistancefromitsclosestcentroid,wheretherelativedistanceistheratioofthepoint’sdistancefromthecentroidtothemediandistanceofallpointsintheclusterfromthecentroid.Thelatterapproachisusedtoadjustforthelargedifferenceindensitybetweencompactandlooseclusters.
TheresultinganomalyscoresareshowninFigures9.9 and9.10 .Asbefore,theanomalyscore,measuredinthiscasebythedistanceor
relativedistance,isindicatedbytheshading.Weusetwoclustersineachcase.Theapproachbasedonrawdistancehasproblemswiththedifferingdensitiesoftheclusters,e.g.,Disnotconsideredanoutlier.Fortheapproachbasedonrelativedistances,thepointsthathavepreviouslybeenidentifiedasoutliersusingLOF(A,C,andD)alsoshowupasanomalieshere.
Figure9.9.Distanceofpointsfromclosestcentroid.
Figure9.10.Relativedistanceofpointsfromclosestcentroid.
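The two scores used in Example 9.3 can be sketched as follows, assuming scikit-learn's K-means implementation and two clusters as in the example. The relative score divides each point's distance to its closest centroid by the median distance of the points in that cluster; the function name and settings are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_anomaly_scores(X, n_clusters=2):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels = km.labels_
    dist = np.linalg.norm(X - km.cluster_centers_[labels], axis=1)    # distance to closest centroid
    medians = np.array([np.median(dist[labels == c]) for c in range(n_clusters)])
    return dist, dist / medians[labels]        # raw distance score and relative distance score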
Impact of Outliers on the Initial Clustering
Clustering-based schemes are often sensitive to the presence of outliers in the data. Hence, the presence of outliers can degrade the quality of the clusters corresponding to the normal class, since these clusters are discovered by clustering the overall data, which is comprised of normal and anomalous instances. To address this issue, the following approach can be used: instances are clustered; outliers, which are the points farthest from any cluster, are removed; and then the instances are clustered again. This approach can also be applied at every iteration of the K-means algorithm; the K-means– algorithm is an example of such an algorithm. While there is no guarantee that this approach will yield optimal results, it is easy to use.
Amoresophisticatedapproachistohaveaspecialgroupforinstancesthatdonotcurrentlyfitwellinanycluster.Thisgrouprepresentspotentialoutliers.Astheclusteringprocessproceeds,clusterschange.Instancesthatnolongerbelongstronglytoanyclusterareaddedtothesetofpotentialoutliers,whileinstancescurrentlyintheoutliergrouparetestedtoseeiftheynowstronglybelongtoaclusterandcanberemovedfromthesetofpotentialoutliers.Theinstancesremaininginthesetattheendoftheclusteringareclassifiedasoutliers.Again,thereisnoguaranteeofanoptimalsolutionoreventhatthisapproachwillworkbetterthanthesimpleronedescribedpreviously.
The Number of Clusters to Use
Clustering techniques such as K-means do not automatically determine the number of clusters. This is a problem when using clustering-based methods for anomaly detection, since whether an object is considered an anomaly or not may depend on the number of clusters. For instance, a group of 10 objects may be relatively close to one another, but may be included as part of a larger cluster if only a few large clusters are found. In that case, each of the 10 points could be regarded as an anomaly, even though they would have formed a cluster if a large enough number of clusters had been specified.
Aswithsomeoftheotherissues,thereisnosimpleanswertothisproblem.Onestrategyistorepeattheanalysisfordifferentnumbersofclusters.Anotherapproachistofindalargenumberofsmallclusters.Theideaisthat(1)smallerclusterstendtobemorecohesiveand(2)ifanobjectisananomalyevenwhentherearealargenumberofsmallclusters,thenitislikelyatrueanomaly.Thedownsideisthatgroupsofanomaliesmayformsmallclustersandthusescapedetection.
9.5.3 Strengths and Weaknesses
Clustering-basedtechniquescanoperateinanunsupervisedsettingastheydonotrequiretrainingdataconsistingofonlynormalinstances.Alongwithidentifyinganomalies,thelearnedclustersofthenormalclasshelpinunderstandingthenatureofthenormaldata.Someclusteringtechniques,suchasK-means,havelinearornear-lineartimeandspacecomplexityandthus,ananomalydetectiontechniquebasedonsuchalgorithmscanbehighlyefficient.However,theperformanceofclustering-basedanomalydetectionmethodsisheavilydependentuponthenumberofclustersusedaswellasthepresenceofoutliersinthedata.AsdiscussedinChapters7 and8 ,eachclusteringalgorithmissuitableonlyforacertaintypeofdata;hencetheclusteringalgorithmneedstobechosencarefullytoeffectivelycapturetheclusterstructureinthedata.
9.6 Reconstruction-based Approaches
Reconstruction-based techniques rely on the assumption that the normal class resides in a space of lower dimensionality than the original space of attributes. In other words, there are patterns in the distribution of the normal class that can be captured using lower-dimensional representations, e.g., by using dimensionality reduction techniques.
To illustrate this, consider a data set of normal instances, where every instance is represented using p continuous attributes, x_1, . . . , x_p. If there is a hidden structure in the normal class, we can expect to approximate this data using fewer than p derived features. One common approach for deriving useful features from a data set is to use principal components analysis (PCA), as described in Section 2.3.3. By applying PCA on the original data, we obtain p principal components, y_1, . . . , y_p, where every principal component is a linear combination of the original attributes. Each principal component captures the maximum amount of variation in the original data subject to the constraint that it must be orthogonal to the preceding principal components. Thus, the amount of variation captured decreases for each successive principal component, and hence, it is possible to approximate the original data using the top k principal components, y_1, . . . , y_k. Indeed, if there is a hidden structure in the normal class, we can expect to obtain a good approximation using a smaller number of features, k < p.

Once we have derived a smaller set of k features, we can project any new data instance x to its k-dimensional representation y. Moreover, we can also re-project y back to the original space of p attributes, resulting in a reconstruction of x. Let us denote this reconstruction as x̂ and the squared Euclidean distance between x and x̂ as the reconstruction error:

Reconstruction Error(x) = ||x − x̂||².
Sincethelow-dimensionalfeaturesarespecificallylearnedtoexplainmostofthevariationinthenormaldata,wecanexpectthereconstructionerrortobelowfornormalinstances.However,thereconstructionerrorishighforanomalousinstances,astheydonotconformtothehiddenstructureofthenormalclass.Thereconstructionerrorcanthusbeusedasaneffectiveanomalydetectionscore.
Asanillustrationofareconstruction-basedapproachforanomalydetection,consideratwo-dimensionaldatasetofnormalinstances,shownascirclesinFigure9.11 .Theblacksquaresareanomalousinstances.Thesolidblacklineshowsthefirstprincipalcomponentlearnedfromthisdata,whichcorrespondstothedirectionofmaximumvarianceofnormalinstances.
Figure9.11.Reconstructionofatwo-dimensionaldatausingasingleprincipalcomponent(shownassolidblackline).
Wecanseethatmostofthenormalinstancesarecenteredaroundthisline.
This suggests that the first principal component provides a good approximation to the normal class using a lower-dimensional representation. Using this representation, we can project every data instance x to a point on the line. This projection, x̂, serves as a reconstruction of the original instance using a single principal component.

The distance between x and x̂, which corresponds to the reconstruction error of x, is shown as dashed lines in Figure 9.11. We can see that, since the first principal component has been learned to best fit the normal class, the reconstruction errors of the normal instances are quite small in value. However, the reconstruction errors for anomalous instances (shown as squares) are high, since they do not adhere to the structure of the normal class.
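A sketch of the PCA-based reconstruction error, assuming scikit-learn, is shown below: the data are projected onto the top k principal components, mapped back to the original space, and scored by the squared distance to the original instance. The function name and the default k = 1 (as in Figure 9.11) are illustrative.

import numpy as np
from sklearn.decomposition import PCA

def pca_reconstruction_error(X, k=1):
    pca = PCA(n_components=k).fit(X)
    Y = pca.transform(X)                      # k-dimensional representation y
    X_hat = pca.inverse_transform(Y)          # reconstruction of x in the original attribute space
    return np.sum((X - X_hat) ** 2, axis=1)   # Reconstruction Error(x) = ||x - x_hat||^2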
Although PCA provides a simple approach for capturing low-dimensional representations, it can only derive features that are linear combinations of the original attributes. When the normal class exhibits nonlinear patterns, it is difficult to capture them using PCA. In such scenarios, the use of an artificial neural network known as an autoencoder provides one possible approach for nonlinear dimensionality reduction and reconstruction. As described in Section 4.7, autoencoders are widely used in the context of deep learning to derive complex features from the training data in an unsupervised setting.

An autoencoder (also referred to as an autoassociator or a mirroring network) is a multi-layer neural network, where the number of input and output neurons is equal to the number of original attributes. Figure 9.12 shows the general architecture of an autoencoder, which involves two basic steps, encoding and decoding. During encoding, a data instance x is transformed to a low-dimensional representation y, using a number of nonlinear transformations in the encoding layers. Notice that the number of neurons reduces at every encoding layer, so as to learn low-dimensional representations from the original data. The learned representation y is then mapped back to the original space of attributes using the decoding layers, resulting in a reconstruction of x, denoted by x̂. The distance between x and x̂ (the reconstruction error) is then used as a measure of an anomaly score.
Figure9.12.Abasicarchitectureoftheautoencoder.
Inordertolearnanautoencoderfromaninputdatasetcomprisingprimarilyofnormalinstances,wecanusethebackpropagationtechniquesintroducedinthecontextofartificialneuralnetworksinSection4.7 .Theautoencoderschemeprovidesapowerfulapproachforlearningcomplexandnonlinearrepresentationsofthenormalclass.Anumberofvariantsofthebasicautoencoderschemedescribedabovehavealsobeenexploredtolearnrepresentationsindifferenttypesofdatasets.Forexample,thedenoisingautoencoderisabletorobustlylearnnonlinearrepresentationsfromthetrainingdata,eveninthepresenceofnoise.Formoredetailsonthedifferenttypesofautoencoders,seetheBibliographicNotes.
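As a rough illustration of the autoencoder idea, the sketch below trains a small network to reproduce its input through a narrow bottleneck and uses the reconstruction error as the anomaly score. For simplicity it repurposes scikit-learn's MLPRegressor rather than a dedicated deep learning library, so it should be read only as a schematic of the approach; the layer sizes and other settings are arbitrary choices.

import numpy as np
from sklearn.neural_network import MLPRegressor

def autoencoder_scores(X_train, X_test, bottleneck=2):
    # Encode and decode through a (16, bottleneck, 16) network trained to map X_train onto itself.
    ae = MLPRegressor(hidden_layer_sizes=(16, bottleneck, 16),
                      activation="relu", max_iter=2000, random_state=0)
    ae.fit(X_train, X_train)                  # training data is assumed to be mostly normal
    X_hat = ae.predict(X_test)
    return np.sum((X_test - X_hat) ** 2, axis=1)   # reconstruction error as the anomaly score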
9.6.1 Strengths and Weaknesses
Reconstruction-based techniques provide a generic approach for modeling the normal class that does not require many assumptions about the distribution of normal instances. They are able to learn a rich variety of representations of the normal class by using a broad family of dimensionality reduction techniques. They can also be used in the presence of irrelevant attributes, since an attribute that does not share any relationship with the other attributes is likely to be ignored in the encoding step, as it would not be of much use in reconstructing the normal class. However, since the reconstruction error is computed by measuring the distance between x and x̂ in the original space of attributes, performance can suffer when the number of attributes is large.
9.7 One-class Classification
One-class classification approaches learn a decision boundary in the attribute space that encloses all normal objects on one side of the boundary. Figure 9.13 shows an example of a decision boundary in the one-class setting, where points belonging to one side of the boundary (shaded) belong to the normal class. This is in contrast to the binary classification approaches introduced in Chapters 3 and 4 that learn boundaries to separate objects from two classes.
Figure9.13.Thedecisionboundaryofaone-classclassificationproblemattemptstoenclosethenormalinstancesonthesamesideoftheboundary.
One-classclassificationpresentsauniqueperspectiveonanomalydetection,where,insteadoflearningthedistributionofthenormalclass,thefocusisonmodelingtheboundaryofthenormalclass.Fromanoperationalstandpoint,learningtheboundaryisindeedwhatweneedtodistinguishanomaliesfrom
normalobjects.InthewordsofVladimirVapnik,“Oneshouldsolvethe[classification]problemdirectlyandneversolveamoregeneralproblem[suchaslearningthedistributionofthenormalclass]asanintermediatestep.”
In this section, we present an SVM-based one-class approach, known as one-class SVM, which only uses training instances from the normal class to learn its decision boundary. Contrast this with a normal SVM (see Section 4.9), which uses training instances from two classes. This involves the use of kernels and a novel "origin trick," described as follows. (See Section 2.4.7 for an introduction to kernel methods.)
9.7.1 Use of Kernels

In order to learn a nonlinear boundary that encloses the normal class, we transform the data to a higher-dimensional space where the normal class can be separated using a linear hyperplane. This can be done by using a function ϕ that maps every data instance x in the original space of attributes to a point ϕ(x) in the transformed high-dimensional space. (The choice of the mapping function will become clear later.) In the transformed space, the training instances can be separated using a linear hyperplane defined by parameters (w, ρ) as follows:

⟨w, ϕ(x)⟩ = ρ,

where ⟨x, y⟩ denotes the inner product between vectors x and y. Ideally, we want a linear hyperplane that places all of the normal instances on one side. Hence, we want (w, ρ) to be such that ⟨w, ϕ(x)⟩ > ρ if x belongs to the normal class, and ⟨w, ϕ(x)⟩ < ρ if x belongs to the anomaly class.

Let {x_1, x_2, . . . , x_n} be the set of training instances belonging to the normal class. Similar to the use of kernels in SVMs (see Chapter 4), we define w as a linear combination of the ϕ(x_i)'s:

w = ∑_{i=1}^{n} α_i ϕ(x_i).

The separating hyperplane can then be described using the α_i's and ρ as follows:

∑_{i=1}^{n} α_i ⟨ϕ(x_i), ϕ(x)⟩ = ρ.

Note that the above equation deals with inner products of ϕ(x) in the transformed space to describe the hyperplane. To compute such inner products, we can make use of kernel functions, κ(x, y) = ⟨ϕ(x), ϕ(y)⟩, introduced in Section 2.4.7. Note that kernel functions are extensively used for learning nonlinear boundaries in binary classification problems, e.g., using kernel SVMs presented in Chapter 4. However, learning nonlinear boundaries in the one-class setting is challenging in the absence of any information about the anomaly class during training. To overcome this challenge, one-class SVM uses the "origin trick" to learn the separating hyperplane, which works best with certain types of kernel functions. This approach can be briefly described as follows.

9.7.2 The Origin Trick

Consider the Gaussian kernel that is commonly used for learning nonlinear boundaries, which can be defined as

κ(x, y) = exp( −||x − y||² / (2σ²) ),
where || · || denotes the length of a vector and σ is a hyper-parameter. Before we use the Gaussian kernel to learn a separating hyperplane in the one-class setting, it will be worthwhile to first understand what the transformed space of a Gaussian kernel looks like. There are two important properties of the transformed space of Gaussian kernels that are useful for understanding the intuition behind one-class SVMs:

1. Every point is mapped to a hypersphere of unit radius. To realize this, consider the kernel function κ(x, x) of a point x onto itself. Since ||x − x||² = 0,

κ(x, x) = ⟨ϕ(x), ϕ(x)⟩ = ||ϕ(x)||² = 1.

This implies that the length of ϕ(x) is equal to 1, and hence, ϕ(x) resides on a hypersphere of unit radius for all x.

2. Every point is mapped to the same orthant in the transformed space. For any two points x and y, since κ(x, y) = ⟨ϕ(x), ϕ(y)⟩ ≥ 0, the angle between ϕ(x) and ϕ(y) is always smaller than π/2. Hence, the mappings of all points lie in the same orthant (high-dimensional analogue of "quadrant") in the transformed space.

For illustrative purposes, Figure 9.14 shows a schematic visualization of the transformed space of Gaussian kernels, using the above two considerations. The black dots represent the mappings of training instances in the transformed space, which lie on a quarter arc of a circle with unit radius. In this view, the objective of one-class SVM is to learn a linear hyperplane that can separate the black dots from the mappings of anomalous instances, which would also reside on the same quarter arc. There are many possible hyperplanes that can achieve this task, two of which are shown in Figure 9.14 as dashed lines. In order to choose the best hyperplane (shown as a bold line), we make use of the principle of structural risk minimization, discussed in Chapter 4 in the context of SVM. There are three main requirements that we seek in the optimal hyperplane defined by parameters (w, ρ):

Figure 9.14. Illustrating the concept of one-class SVM in the transformed space.

1. The hyperplane should have a large "margin," or a small value of ||w||². Having a large margin ensures that the model is simple and hence less susceptible to the phenomenon of overfitting.

2. The hyperplane should be as distant from the origin as possible. This ensures a tight representation of points on the upper side of the hyperplane (corresponding to the normal class). Notice from Figure 9.14 that the distance of a hyperplane from the origin is essentially ρ/||w||. Hence, maximizing ρ translates to maximizing the distance of the hyperplane from the origin.

3. In the style of "soft-margin" SVMs, if some of the training instances lie on the opposite side of the hyperplane (corresponding to the anomaly class), then the distance of such points from the hyperplane should be minimized. Note that it is important for an anomaly detection algorithm to be robust to a small number of outliers in the training set as that is quite common in real-world problems. An example of an anomalous training instance is shown in Figure 9.14 as the lower-most black dot on the quarter arc. If a training instance x_i lies on the opposite side of the hyperplane (corresponding to the anomaly class), its distance from the hyperplane, as measured by its slack variable ξ_i, should be kept small. If x_i lies on the side corresponding to the normal class, then ξ_i = 0.

The above three requirements provide the foundation of the optimization objective of one-class SVM, which can be formally described as follows:

min_{w, ρ, ξ}  (1/2) ||w||² − ρ + (1/(nν)) ∑_{i=1}^{n} ξ_i,    (9.9)
subject to  ⟨w, ϕ(x_i)⟩ ≥ ρ − ξ_i,  ξ_i ≥ 0,

where n is the number of training instances and ν ∈ (0, 1] is a hyper-parameter that maintains a trade-off between reducing the model complexity and improving the coverage of the decision boundary in keeping the training instances on the same side.

Notice the similarity of the above equation to the optimization objective of binary SVM, introduced in Chapter 4. However, a key difference in one-class SVM is that the constraints are only defined for the normal class but not the anomaly class. At first glance, this might seem to be a serious problem, because the hyperplane is held by constraints from one side (corresponding to the normal class) but is unconstrained from the other side. However, with the help of the "origin trick," one-class SVM is able to overcome this insufficiency by maximizing the distance of the hyperplane from the origin. From this perspective, the origin acts as a surrogate second class and the learned hyperplane attempts to separate the normal class from this second class in a manner similar to the way a binary SVM separates two classes.
Equation 9.9 is an instance of a quadratic programming problem (QPP) with linear inequality constraints, which is similar to the optimization problem of binary SVM. Hence, the optimization procedures discussed in Chapter 4 for learning a binary SVM can be directly applied for solving Equation 9.9. The learned one-class SVM can then be applied on a test instance to identify if it belongs to the normal class or the anomaly class. Further, if a test instance is identified as an anomaly, its distance from the hyperplane can be seen as an estimate of its anomaly score.
Thehyper-parameterνofone-classSVMhasaspecialinterpretation.Itrepresentsanupperboundonthefractionoftraininginstancesthatcanbetoleratedasanomalieswhilelearningthehyperplane.Thismeansthatnνrepresentsthemaximumnumberoftraininginstancesthatcanbeplacedontheothersideofthehyperplane(correspondingtotheanomalyclass).Alowvalueofνassumesthatthetrainingsethasasmallernumberofoutliers,whileahighvalueofνensuresthatthelearningofthehyperplaneisrobusttoalargenumberofoutliersduringtraining.
Figure 9.15 shows the learned decision boundary for an example training set of size 200 using ν = 0.1. We can see that the training data consists of mostly normal instances generated from a Gaussian distribution centered at (0, 0). However, there are also some outliers in the input data that do not conform to the distribution of the normal class. With ν = 0.1, the one-class SVM is able to place at most 20 training instances on the other side of the hyperplane (corresponding to the anomaly class). This results in a decision boundary that robustly encloses the majority of normal instances. If we instead use ν = 0.05, we would only have the budget to tolerate at most 10 outliers in the training set, resulting in the decision boundary shown in Figure 9.16(a). We can see that this decision boundary assigns a much larger region to the normal class than is necessary. On the other hand, the decision boundary learned using ν = 0.2 is shown in Figure 9.16(b), which appears to be much more compact as it can tolerate up to 40 outliers in the training data. The choice of ν thus plays a crucial role in the learning of the decision boundary in one-class SVMs.

Figure 9.15. Decision boundary of one-class SVM with ν = 0.1.

Figure 9.16. Decision boundaries of one-class SVM for varying values of ν.
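A one-class SVM with a Gaussian kernel can be sketched as follows, assuming scikit-learn. The parameter nu corresponds to the hyper-parameter ν discussed above, and scikit-learn's gamma equals 1/(2σ²) in the kernel definition; the function name is illustrative.

from sklearn.svm import OneClassSVM

def one_class_svm_scores(X_train, X_test, nu=0.1):
    # Train on (mostly normal) instances; nu bounds the fraction tolerated on the anomaly side.
    ocsvm = OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(X_train)
    labels = ocsvm.predict(X_test)             # +1 for the normal side, -1 for the anomaly side
    scores = -ocsvm.decision_function(X_test)  # larger values lie farther on the anomalous side
    return labels, scores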
9.7.3 Strengths and Weaknesses
One-classSVMsleveragetheprincipleofstructuralriskminimizationinthelearningofthedecisionboundary,whichhasstrongtheoreticalfoundations.Theyhavetheabilitytostrikeabalancebetweenthesimplicityofthemodelandtheeffectivenessoftheboundaryinenclosingthedistributionofthenormalclass.Byusingthehyper-parameterν,theyprovideabuilt-inmechanismtoavoidoutliersinthetrainingdata,whichisoftencommoninreal-worldproblems.However,asillustratedinFigure9.16 ,thechoiceofνsignificantlyimpactsthepropertiesofthelearneddecisionboundary.Choosingtherightvalueofνisdifficult,sincethehyper-parameterselectiontechniquesdiscussedinChapter4 areonlyapplicableinthemulticlasssetting,whereitispossibletodefinevalidationerrorrates.Also,theuseofa
Gaussiankernelrequiresarelativelylargetrainingsizetoeffectivelylearnnonlineardecisionboundariesintheattributespace.Further,likeregularSVM,one-classSVMhasahighcomputationalcost.Hence,itisexpensivetotrain,especiallywhenthetrainingsetislarge.
9.8 Information Theoretic Approaches
These approaches assume that the normal class can be represented using compact representations, also known as codes. Instead of explicitly learning such representations, the focus of information theoretic approaches is to quantify the amount of information required for encoding them. If the normal class shows some structure or pattern, we can expect to encode it using a small number of bits. Anomalies can then be identified as instances that introduce irregularities in the data, which increase the overall information content of the data set. This is an admissible definition of an anomaly in an operational setting, since anomalies are often associated with an element of surprise, as they do not conform to the patterns or behavior of the normal class.
Thereareanumberofapproachesforquantifyingtheinformationcontent(alsoreferredtoascomplexity)ofadataset.Forexample,ifthedatasetcontainsacategoricalvariable,wecanassessitsinformationcontentusingtheentropymeasure,describedinSection2.3.6 .Fordatasetswithothertypesofattributes,othermeasuressuchastheKolmogorovcomplexitycanbeused.Intuitively,theKolmogorovcomplexitymeasuresthecomplexityofadatasetbythesizeofthesmallestcomputerprogram(writteninapre-specifiedlanguage)thatcanreproducetheoriginaldata.Amorepracticalapproachistocompressthedatausingstandardcompressiontechniques,andusethesizeoftheresultingcompressedfileasameasureoftheinformationcontentoftheoriginaldata.
A basic information theoretic approach for anomaly detection can be described as follows. Let us denote the information content of a data set D as Info(D). Consider computing the anomaly score of a data instance x in D. If we remove x from D, we can measure the information content of the remaining data as Info(D_x). If x is indeed an anomaly, it would show a high value of

Gain(x) = Info(D) − Info(D_x).

This happens because anomalies are expected to be surprising, and thus, their elimination should result in a substantial reduction in the information content. We can thus use Gain(x) as a measure of anomaly score.

Typically, the reduction in information content is measured by eliminating a subset of instances (that are deemed anomalous) and not just a single instance. This is because most measures of information content are not sensitive to the elimination of a single instance, e.g., the size of a compressed data file does not change substantially by removing a single data entry. It is thus necessary to identify the smallest subset of instances X that shows the largest value of Gain(X) upon elimination. This is a non-trivial problem requiring exponential time complexity, although approximate solutions with linear time complexity have also been proposed. (See Bibliographic Notes.)

Example 9.4. Given a survey report of the Height and Weight of a collection of participants, we want to identify those participants that have unusual heights and weights. Both Height and Weight can be represented as categorical variables that take three values: {low, medium, high}. Table 9.2 shows the data for the weight and height information of 100 participants, which has an entropy of 2.08. We can see that there is a pattern in the height and weight distribution of normal participants, since most participants that have a high value of Weight also have a high value of Height, and vice-versa. However, there are 5 participants that have a high Weight value but a low Height value, which is quite unusual. By eliminating these 5 instances, the entropy of the resulting data set becomes 1.89, resulting in a gain of 2.08 − 1.89 = 0.19.

Table 9.2. Survey data of weight and height of 100 participants.

Weight   Height   Frequency
low      low      20
low      medium   15
medium   medium   40
high     high     20
high     low      5
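The entropy and gain in Example 9.4 can be verified with a few lines of code, assuming NumPy; the helper name is ours.

import numpy as np

def entropy(counts):
    # Entropy (in bits) of the empirical distribution given by the category counts.
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log2(p))

counts_all = [20, 15, 40, 20, 5]              # the five rows of Table 9.2
counts_wo = [20, 15, 40, 20]                  # after removing the 5 unusual participants
gain = entropy(counts_all) - entropy(counts_wo)
print(round(entropy(counts_all), 2), round(entropy(counts_wo), 2), round(gain, 2))
# prints 2.08 1.89 0.19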
9.8.1 Strengths and Weaknesses
Informationtheoreticapproachesoperateintheunsupervisedsetting,astheydonotrequireaseparatetrainingsetofnormal-onlyinstances.Theydonotmakemanyassumptionsaboutthestructureofthenormalclassandaregenericenoughtobeappliedwithdatasetsofvaryingtypesandproperties.However,theperformanceofinformationtheoreticapproachesdependsheavilyonthechoiceofthemeasureusedforcapturingtheinformationcontentofadataset.Themeasureshouldbesuitablychosensothatitissensitivetotheeliminationofasmallnumberofinstances.Thisisoftenachallenge,sincecompressiontechniquesareoftenrobusttosmalldeviations,renderingthemusefulonlywhenanomaliesarelargeinnumbers.Further,informationtheoreticapproachessufferfromhighcomputationalcost,makingthemexpensivetoapplyonlargedatasets.
9.9 Evaluation of Anomaly Detection
When class labels are available to distinguish between anomalies and normal data, the effectiveness of an anomaly detection scheme can be evaluated by using the measures of classification performance discussed in Section 4.11. Since the anomalous class is usually much smaller than the normal class, measures such as precision, recall, and false positive rate are more appropriate than accuracy. In particular, the false positive rate, which is often referred to as the false alarm rate, often determines the practicality of the anomaly detection scheme, since too many false alarms render an anomaly detection system useless.
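For instance, assuming scikit-learn, precision, recall, and the false alarm rate can be computed from labeled data as in the brief sketch below; the toy label vectors are purely illustrative.

import numpy as np
from sklearn.metrics import precision_score, recall_score, confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # 1 = anomaly, 0 = normal (toy labels)
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])   # detector decisions on the same instances

precision = precision_score(y_true, y_pred)          # fraction of flagged instances that are anomalies
recall = recall_score(y_true, y_pred)                # fraction of anomalies that are flagged
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
false_alarm_rate = fp / (fp + tn)                    # false positive rate ("false alarm rate")
print(precision, recall, false_alarm_rate)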
Ifclasslabelsarenotavailable,thenevaluationischallenging.Formodel-basedapproaches,theeffectivenessofoutlierdetectioncanbejudgedwithrespecttotheimprovementinthegoodnessoffitofthemodelonceanomaliesareeliminated.Similarlyforinformationtheoreticapproaches,theinformationgaingivesameasureoftheeffectiveness.Forreconstruction-basedapproaches,thereconstructionerrorprovidesameasurethatcanbeusedforevaluation.
The evaluation presented in the last paragraph is analogous to the unsupervised evaluation measures for cluster analysis, where measures such as the sum of the squared error (SSE) or the silhouette index (see Section 7.5) can be computed even when class labels are not present. Such measures were referred to as "internal" measures because they use only information present in the data set. The same is true of the anomaly evaluation measures mentioned in the last paragraph, i.e., they are internal measures. The key point is that the anomalies of interest for a particular application may not be those that an anomaly detection algorithm labels as anomalies, just as the cluster labels produced by a clustering algorithm may not be consistent with the class labels provided externally. In practice, this means that selecting and tuning an anomaly detection approach is often based on feedback from the users of such a system.
Amoregeneralwaytoevaluatetheresultsofanomalydetectionistolookatthedistributionoftheanomalyscores.Thetechniquesthatwehavediscussedassumethatonlyarelativelysmallfractionofthedataconsistsofanomalies.Thus,themajorityofanomalyscoresshouldberelativelylow,withasmallerfractionofscorestowardthehighend.(Thisassumesthatahigherscoreindicatesaninstanceismoreanomalous.)Thus,bylookingatthedistributionofthescoresviaahistogramordensityplot,wecanassesswhethertheapproachweareusinggeneratesscoresthatbehaveinareasonablemanner.Weillustratewithanexample.
Example 9.5 (Distribution of Anomaly Scores). Figures 9.17 and 9.18 show the anomaly scores of two clusters of points. Both have 100 points, but the leftmost cluster is less dense. Figure 9.17, which uses the average distance to the kth nearest neighbor (average KNN dist), shows higher anomaly scores for the points in the less dense cluster. In contrast, Figure 9.18, which uses the LOF for its anomaly scoring, shows similar scores between the two clusters.
Figure9.17.Anomalyscorebasedonaveragedistancetofifthnearestneighbor.
Figure9.18.AnomalyscorebasedonLOFusingfivenearestneighbors.
ThehistogramsoftheaverageKNNdistandtheLOFscoreareshowninfigures9.19 and9.20 ,respectively.ThehistogramoftheLOFscoresshowsmostpointswithsimilaranomalyscoresandafewpointswithsignificantlylargervalues.ThehistogramoftheaverageKNNdistshowsabimodaldistribution.
Figure9.19.Histogramofanomalyscorebasedonaveragedistancetothefifthnearestneighbor.
Figure9.20.HistogramofLOFanomalyscoreusingfivenearestneighbors.
ThekeypointisthatthedistributionofanomalyscoresshouldlooksimilartothatoftheLOFscoresinthisexample.Theremaybeoneormoresecondarypeaksinthedistributionasonemovestotheright,butthesesecondarypeaksshouldonlycontainarelativelysmallfractionofthe
points,andnotalargefractionofthepointsaswiththeaverageKNNdistapproach.
9.10 Bibliographic Notes
Anomaly detection has a long history, particularly in statistics, where it is known as outlier detection. Relevant books on the topic are those of Aggarwal [623], Barnett and Lewis [627], Hawkins [648], and Rousseeuw and Leroy [683]. The article by Beckman and Cook [629] provides a general overview of how statisticians look at the subject of outlier detection and provides a history of the subject dating back to comments by Bernoulli in 1777. Also see the related articles [630, 649]. Another general article on outlier detection is the one by Barnett [626]. Articles on finding outliers in multivariate data include those by Davies and Gather [639], Gnanadesikan and Kettenring [646], Rocke and Woodruff [681], Rousseeuw and van Zomeren [685], and Scott [690]. Rosner [682] provides a discussion of finding multiple outliers at the same time.
SurveysbyChandolaetal.[633]andHodgeandAustin[651]provideextensivecoverageofoutlierdetectionmethods,asdoesarecentbookonthetopicbyAggarwal[623].MarkouandSingh[674,675]giveatwo-partreviewoftechniquesfornoveltydetectionthatcoversstatisticalandneuralnetworktechniques,respectively.Pimentoetal.[678]isanotherreviewofnoveltydetectionapproaches,includingmanyofthemethodsdiscussedinthischapter.
Statisticalapproachesforanomalydetectionintheunivariatecasearewellcoveredbythebooksinthefirstparagraph.Shyuetal.[692]useanapproachbasedonprincipalcomponentsandtheMahalanobisdistancetoproduceanomalyscoresformultivariatedata.AnexampleofthekerneldensityapproachforanomalydetectionisgivenbySchubertetal.[688].ThemixturemodeloutlierapproachdiscussedinSection9.3.3 isfromEskin
[641]. An approach based on the χ² measure is given by Ye and Chen [695]. Outlier detection based on geometric ideas, such as the depth of convex hulls, has been explored in papers by Johnson et al. [654], Liu et al. [673], and Rousseeuw et al. [684].
Thenotionofadistance-basedoutlierandthefactthatthisdefinitioncanincludemanystatisticaldefinitionsofanoutlierwasdescribedbyKnorretal.[663–665].Ramaswamyetal.[680]proposeanefficientdistance-basedoutlierdetectionprocedurethatgiveseachobjectanoutlierscorebasedonthedistanceofitsk-nearestneighbor.EfficiencyisachievedbypartitioningthedatausingthefirstphaseofBIRCH(Section8.5.2 ).Chaudharyetal.[634]usek-dtreestoimprovetheefficiencyofoutlierdetection,whileBayandSchwabacher[628]userandomizationandpruningtoimproveperformance.
For relative density-based approaches, the best known technique is the local outlier factor (LOF) (Breunig et al. [631, 632]), which grew out of DBSCAN. Another locally aware anomaly detection algorithm is LOCI by Papadimitriou et al. [677]. A more recent view of the local approach is given by Schubert et al. [689]. Proximities can be viewed as a graph. The connectivity-based outlier factor (COF) by Tang et al. [694] is a graph-based approach to local outlier detection. A survey of graph-based approaches is provided by Akoglu et al. [625].
Highdimensionalityposessignificantproblemsfordistance-anddensity-basedapproaches.Adiscussionofoutlierremovalinhigh-dimensionalspacecanbefoundinthepapersbyAggarwalandYu[624]andDunaganandVempala[640].Zimeketal.provideasurveyofanomalydetectionapproachesforhigh-dimensionalnumericaldata[696].
Clusteringandanomalydetectionhavealongrelationship.InChapters7and8 ,weconsideredtechniques,suchasBIRCH,CURE,DENCLUE,DB-SCAN,andSNNdensity-basedclustering,whichspecificallyincludetechniquesforhandlinganomalies.StatisticalapproachesthatfurtherdiscussthisrelationshiparedescribedinpapersbyScott[690]andHardinandRocke[647].TheK-means–algorithm,whichcansimultaneouslyhandleclusteringandoutliers,wasproposedbyChawlaandGionis[637].
Ourdiscussionofreconstruction-basedapproachesfocusedonaneuralnetwork-basedapproach,i.e.,theautoencoder.Morebroadly,adiscussionofapproachesintheareaofneuralnetworkscanbefoundinpapersbyGhoshandSchwartzbard[645],Sykacek[693],andHawkinsetal.[650],whodiscussreplicatornetworks.TheoneclassSVMapproachforanomalydetectionwascreatedbySchölkopfetal.[686]andimprovedbyLietal.[672].Moregenerally,techniquesforoneclassclassificationaresurveyedin[662].TheuseofinformationmeasuresinanomalydetectionisdescribedbyLeeandXiang[671].
Inthischapter,wefocusedonunsupervisedanomalydetection.Supervisedanomalydetectionfallsintothecategoryofrareclassclassification.WorkonrareclassdetectionincludestheworkofJoshietal.[655–659].Therareclassproblemisalsosometimesreferredtoastheimbalanceddatasetproblem.OfrelevanceareanAAAIworkshop(Japkowicz[653]),anICMLworkshop(Chawlaetal.[635]),andaspecialissueofSIGKDDExplorations(Chawlaetal.[636]).
EvaluationofunsupervisedanomalydetectionapproacheswasdiscussedinSection9.9 .SeealsothediscussioninChapter8 ofthebookbyAggarwal[623].Insummary,evaluationapproachesarequitelimited.Forsupervisedanomalydetection,anoverviewofcurrentapproachesforevaluationcanbefoundinSchubertetal.[687].
Inthischapter,wehavefocusedonbasicanomalydetectionschemes.Wehavenotconsideredschemesthattakeintoaccountthespatialortemporalnatureofthedata.Shekharetal.[691]provideadetaileddiscussionoftheproblemofspatialoutliersandpresentaunifiedapproachtospatialoutlierdetection.AdiscussionofthechallengesforanomalydetectioninclimatedataisprovidedbyKawaleetal.[660].
TheissueofoutliersintimeserieswasfirstconsideredinastatisticallyrigorouswaybyFox[643].Muirhead[676]providesadiscussionofdifferenttypesofoutliersintimeseries.AbrahamandChuang[622]proposeaBayesianapproachtooutliersintimeseries,whileChenandLiu[638]considerdifferenttypesofoutliersintimeseriesandproposeatechniquetodetectthemandobtaingoodestimatesoftimeseriesparameters.WorkonfindingdeviantorsurprisingpatternsintimeseriesdatabaseshasbeenperformedbyJagadishetal.[652]andKeoghetal.[661].
Animportantapplicationareaforanomalydetectionisintrusiondetection.SurveysoftheapplicationsofdataminingtointrusiondetectionaregivenbyLeeandStolfo[669]andLazarevicetal.[668].Inadifferentpaper,Lazarevicetal.[667]provideacomparisonofanomalydetectionroutinesspecifictonetworkintrusion.Garciaetal.[644]providearecentsurveyofanomalydetectionfornetworkintrusiondetection.AframeworkforusingdataminingtechniquesforintrusiondetectionisprovidedbyLeeetal.[670].Clustering-basedapproachesintheareaofintrusiondetectionincludeworkbyEskinetal.[642],LaneandBrodley[666],andPortnoyetal.[679].
Bibliography[622]B.AbrahamandA.Chuang.OutlierDetectionandTimeSeries
Modeling.Technometrics,31(2):241–248,May1989.
[623]C.C.Aggarwal.OutlierAnalysis.SpringerScience&BusinessMedia,2013.
[624]C.C.AggarwalandP.S.Yu.OutlierDetectionforHighDimensionalData.InProceedingsofthe2001ACMSIGMODInternationalConferenceonManagementofData,SIGMOD’01,pages37–46,NewYork,NY,USA,2001.ACM.
[625]L.Akoglu,H.Tong,andD.Koutra.Graphbasedanomalydetectionanddescription:asurvey.DataMiningandKnowledgeDiscovery,29(3):626–688,2015.
[626]V.Barnett.TheStudyofOutliers:PurposeandModel.AppliedStatistics,27(3):242–250,1978.
[627]V.BarnettandT.Lewis.OutliersinStatisticalData.WileySeriesinProbabilityandStatistics.JohnWiley&Sons,3rdedition,April1994.
[628]S.D.BayandM.Schwabacher.Miningdistance-basedoutliersinnearlineartimewithrandomizationandasimplepruningrule.InProc.ofthe
9thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages29–38.ACMPress,2003.
[629]R.J.BeckmanandR.D.Cook.‘Outlier……….s’.Technometrics,25(2):119–149,May1983.
[630]R.J.BeckmanandR.D.Cook.[‘Outlier……….s’]:Response.Technometrics,25(2):161–163,May1983.
[631]M.M.Breunig,H.-P.Kriegel,R.T.Ng,andJ.Sander.OPTICS-OF:IdentifyingLocalOutliers.InProceedingsoftheThirdEuropeanConferenceonPrinciplesofDataMiningandKnowledgeDiscovery,pages262–270.Springer-Verlag,1999.
[632]M.M.Breunig,H.-P.Kriegel,R.T.Ng,andJ.Sander.LOF:Identifyingdensity-basedlocaloutliers.InProc.of2000ACM-SIGMODIntl.Conf.onManagementofData,pages93–104.ACMPress,2000.
[633]V.Chandola,A.Banerjee,andV.Kumar.Anomalydetection:Asurvey.ACMcomputingsurveys(CSUR),41(3):15,2009.
[634]A.Chaudhary,A.S.Szalay,andA.W.Moore.Veryfastoutlierdetectioninlargemultidimensionaldatasets.InProc.ACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery(DMKD),2002.
[635]N.V.Chawla,N.Japkowicz,andA.Kolcz,editors.WorkshoponLearningfromImbalancedDataSetsII,20thIntl.Conf.onMachine
Learning,2000.AAAIPress.
[636]N.V.Chawla,N.Japkowicz,andA.Kolcz,editors.SIGKDDExplorationsNewsletter,Specialissueonlearningfromimbalanceddatasets,volume6(1),June2004.ACMPress.
[637]S.ChawlaandA.Gionis.k-means-:AUnifiedApproachtoClusteringandOutlierDetection.InSDM,pages189–197.SIAM,2013.
[638]C.ChenandL.-M.Liu.JointEstimationofModelParametersandOutlierEffectsinTimeSeries.JournaloftheAmericanStatisticalAssociation,88(421):284–297,March1993.
[639]L.DaviesandU.Gather.TheIdentificationofMultipleOutliers.JournaloftheAmericanStatisticalAssociation,88(423):782–792,September1993.
[640]J.DunaganandS.Vempala.Optimaloutlierremovalinhigh-dimensionalspaces.JournalofComputerandSystemSciences,SpecialIssueonSTOC2001,68(2):335–373,March2004.
[641]E.Eskin.AnomalyDetectionoverNoisyDatausingLearnedProbabilityDistributions.InProc.ofthe17thIntl.Conf.onMachineLearning,pages255–262,2000.
[642]E.Eskin,A.Arnold,M.Prerau,L.Portnoy,andS.J.Stolfo.Ageometricframeworkforunsupervisedanomalydetection.InApplicationsofData
MininginComputerSecurity,pages78–100.KluwerAcademics,2002.
[643]A.J.Fox.OutliersinTimeSeries.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),34(3):350–363,1972.
[644]P.Garcia-Teodoro,J.Diaz-Verdejo,G.Maciá-Fernández,andE.Vázquez.Anomaly-basednetworkintrusiondetection:Techniques,systemsandchallenges.computers&security,28(1):18–28,2009.
[645]A.GhoshandA.Schwartzbard.AStudyinUsingNeuralNetworksforAnomalyandMisuseDetection.In8thUSENIXSecuritySymposium,August1999.
[646]R.GnanadesikanandJ.R.Kettenring.RobustEstimates,Residuals,andOutlierDetectionwithMultiresponseData.Biometrics,28(1):81–124,March1972.
[647]J.HardinandD.M.Rocke.OutlierDetectionintheMultipleClusterSettingusingtheMinimumCovarianceDeterminantEstimator.ComputationalStatisticsandDataAnalysis,44:625–638,2004.
[648]D.M.Hawkins.IdentificationofOutliers.MonographsonAppliedProbabilityandStatistics.Chapman&Hall,May1980.
[649]D.M.Hawkins.‘[Outlier……….s]’:Discussion.Technometrics,25(2):155–156,May1983.
[650]S.Hawkins,H.He,G.J.Williams,andR.A.Baxter.OutlierDetectionUsingReplicatorNeuralNetworks.InDaWaK2000:Proc.ofthe4thIntnl.Conf.onDataWarehousingandKnowledgeDiscovery,pages170–180.Springer-Verlag,2002.
[651]V.J.HodgeandJ.Austin.ASurveyofOutlierDetectionMethodologies.ArtificialIntelligenceReview,22:85–126,2004.
[652]H.V.Jagadish,N.Koudas,andS.Muthukrishnan.MiningDeviantsinaTimeSeriesDatabase.InProc.ofthe25thVLDBConf.,pages102–113,1999.
[653]N.Japkowicz,editor.WorkshoponLearningfromImbalancedDataSetsI,SeventeenthNationalConferenceonArtificialIntelligence,PublishedasTechnicalReportWS-00-05,2000.AAAIPress.
[654]T.Johnson,I.Kwok,andR.T.Ng.FastComputationof2-DimensionalDepthContours.InKDD98,pages224–228,1998.
[655]M.V.Joshi.OnEvaluatingPerformanceofClassifiersforRareClasses.InProc.ofthe2002IEEEIntl.Conf.onDataMining,pages641–644,2002.
[656]M.V.Joshi,R.C.Agarwal,andV.Kumar.Miningneedleinahaystack:Classifyingrareclassesviatwo-phaseruleinduction.InProc.of2001ACM-SIGMODIntl.Conf.onManagementofData,pages91–102.ACMPress,2001.
[657]M.V.Joshi,R.C.Agarwal,andV.Kumar.Predictingrareclasses:canboostingmakeanyweaklearnerstrong?InProc.of2002ACM-SIGMODIntl.Conf.onManagementofData,pages297–306.ACMPress,2002.
[658]M.V.Joshi,R.C.Agarwal,andV.Kumar.PredictingRareClasses:ComparingTwo-PhaseRuleInductiontoCost-SensitiveBoosting.InProc.ofthe6thEuropeanConf.ofPrinciplesandPracticeofKnowledgeDiscoveryinDatabases,pages237–249.Springer-Verlag,2002.
[659]M.V.Joshi,V.Kumar,andR.C.Agarwal.EvaluatingBoostingAlgorithmstoClassifyRareClasses:ComparisonandImprovements.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages257–264,2001.
[660]J.Kawale,S.Chatterjee,A.Kumar,S.Liess,M.Steinbach,andV.Kumar.Anomalyconstructioninclimatedata:issuesandchallenges.InNASAConferenceonIntelligentDataUnderstandingCIDU,2011.
[661]E.Keogh,S.Lonardi,andB.Chiu.FindingSurprisingPatternsinaTimeSeriesDatabaseinLinearTimeandSpace.InProc.ofthe8thIntl.Conf.onKnowledgeDiscoveryandDataMining,Edmonton,Alberta,Canada,July2002.
[662]S.S.KhanandM.G.Madden.One-classclassification:taxonomyofstudyandreviewoftechniques.TheKnowledgeEngineeringReview,29(03):345–374,2014.
[663]E.M.KnorrandR.T.Ng.AUnifiedNotionofOutliers:PropertiesandComputation.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryand
DataMining,pages219–222,1997.
[664]E.M.KnorrandR.T.Ng.AlgorithmsforMiningDistance-BasedOutliersinLargeDatasets.InProc.ofthe24thVLDBConf.,pages392–403,August1998.
[665]E.M.Knorr,R.T.Ng,andV.Tucakov.Distance-basedoutliers:algorithmsandapplications.TheVLDBJournal,8(3-4):237–253,2000.
[666]T.LaneandC.E.Brodley.AnApplicationofMachineLearningtoAnomalyDetection.InProc.20thNIST-NCSCNationalInformationSystemsSecurityConf.,pages366–380,1997.
[667] A. Lazarevic, L. Ertöz, V. Kumar, A. Ozgur, and J. Srivastava. A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection. In Proc. of the 2003 SIAM Intl. Conf. on Data Mining, 2003.
[668] A. Lazarevic, V. Kumar, and J. Srivastava. Intrusion Detection: A Survey. In Managing Cyber Threats: Issues, Approaches and Challenges, pages 19–80. Kluwer Academic Publisher, 2005.
[669] W. Lee and S. J. Stolfo. Data Mining Approaches for Intrusion Detection. In 7th USENIX Security Symposium, pages 26–29, January 1998.
[670] W. Lee, S. J. Stolfo, and K. W. Mok. A Data Mining Framework for Building Intrusion Detection Models. In IEEE Symposium on Security and Privacy, pages 120–132, 1999.
[671] W. Lee and D. Xiang. Information-theoretic measures for anomaly detection. In Proc. of the 2001 IEEE Symposium on Security and Privacy, pages 130–143, May 2001.
[672] K.-L. Li, H.-K. Huang, S.-F. Tian, and W. Xu. Improving one-class SVM for anomaly detection. In Machine Learning and Cybernetics, 2003 International Conference on, volume 5, pages 3077–3081. IEEE, 2003.
[673] R. Y. Liu, J. M. Parelius, and K. Singh. Multivariate analysis by data depth: descriptive statistics, graphics and inference. Annals of Statistics, 27(3):783–858, 1999.
[674] M. Markou and S. Singh. Novelty detection: A review, part 1: Statistical approaches. Signal Processing, 83(12):2481–2497, 2003.
[675] M. Markou and S. Singh. Novelty detection: A review, part 2: Neural network based approaches. Signal Processing, 83(12):2499–2521, 2003.
[676] C. R. Muirhead. Distinguishing Outlier Types in Time Series. Journal of the Royal Statistical Society, Series B (Methodological), 48(1):39–47, 1986.
[677] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In Data Engineering, 2003. Proceedings. 19th International Conference on, pages 315–326. IEEE, 2003.
[678] M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
[679] L. Portnoy, E. Eskin, and S. J. Stolfo. Intrusion detection with unlabeled data using clustering. In ACM Workshop on Data Mining Applied to Security, 2001.
[680] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proc. of the 2000 ACM-SIGMOD Intl. Conf. on Management of Data, pages 427–438. ACM Press, 2000.
[681] D. M. Rocke and D. L. Woodruff. Identification of Outliers in Multivariate Data. Journal of the American Statistical Association, 91(435):1047–1061, September 1996.
[682] B. Rosner. On the Detection of Many Outliers. Technometrics, 17(3):221–227, 1975.
[683] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. Wiley Series in Probability and Statistics. John Wiley & Sons, September 2003.
[684] P. J. Rousseeuw, I. Ruts, and J. W. Tukey. The Bagplot: A Bivariate Boxplot. The American Statistician, 53(4):382–387, November 1999.
[685] P. J. Rousseeuw and B. C. van Zomeren. Unmasking Multivariate Outliers and Leverage Points. Journal of the American Statistical Association, 85(411):633–639, September 1990.
[686] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, J. C. Platt, et al. Support Vector Method for Novelty Detection. In NIPS, volume 12, pages 582–588, 1999.
[687] E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel. On evaluation of outlier rankings and outlier scores. In Proceedings of the 2012 SIAM International Conference on Data Mining. SIAM, 2012.
[688] E. Schubert, A. Zimek, and H.-P. Kriegel. Generalized Outlier Detection with Flexible Kernel Density Estimates. In SDM, volume 14, pages 542–550. SIAM, 2014.
[689] E. Schubert, A. Zimek, and H.-P. Kriegel. Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery, 28(1):190–237, 2014.
[690] D. W. Scott. Partial Mixture Estimation and Outlier Detection in Data and Regression. In M. Hubert, G. Pison, A. Struyf, and S. V. Aelst, editors, Theory and Applications of Recent Robust Methods, Statistics for Industry and Technology. Birkhauser, 2003.
[691] S. Shekhar, C.-T. Lu, and P. Zhang. A Unified Approach to Detecting Spatial Outliers. GeoInformatica, 7(2):139–166, June 2003.
[692] M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Chang. A Novel Anomaly Detection Scheme Based on Principal Component Classifier. In Proc. of the 2003 IEEE Intl. Conf. on Data Mining, pages 353–365, 2003.
[693] P. Sykacek. Equivalent error bars for neural network classifiers trained by Bayesian inference. In Proc. of the European Symposium on Artificial Neural Networks, pages 121–126, 1997.
[694] J. Tang, Z. Chen, A. W.-c. Fu, and D. Cheung. A robust outlier detection scheme for large data sets. In 6th Pacific-Asia Conf. on Knowledge Discovery and Data Mining. Citeseer, 2001.
[695] N. Ye and Q. Chen. Chi-square Statistical Profiling for Anomaly Detection. In Proc. of the 2000 IEEE Workshop on Information Assurance and Security, pages 187–193, June 2000.
[696] A. Zimek, E. Schubert, and H.-P. Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5):363–387, 2012.
9.11 Exercises

1. Compare and contrast the different techniques for anomaly detection that were presented in Section 9.2. In particular, try to identify circumstances in which the definitions of anomalies used in the different techniques might be equivalent or situations in which one might make sense, but another would not. Be sure to consider different types of data.
2. Consider the following definition of an anomaly: An anomaly is an object that is unusually influential in the creation of a data model.
a. Compare this definition to that of the standard model-based definition of an anomaly.
b. For what sizes of data sets (small, medium, or large) is this definition appropriate?
3. In one approach to anomaly detection, objects are represented as points in a multidimensional space, and the points are grouped into successive shells, where each shell represents a layer around a grouping of points, such as a convex hull. An object is an anomaly if it lies in one of the outer shells.
a. To which of the definitions of an anomaly in Section 9.2 is this definition most closely related?
b. Name two problems with this definition of an anomaly.
4. Association analysis can be used to find anomalies as follows. Find strong association patterns, which involve some minimum number of objects. Anomalies are those objects that do not belong to any such patterns. To make this more concrete, we note that the hyperclique association pattern discussed in Section 5.8 is particularly suitable for such an approach. Specifically, given a user-selected h-confidence level, maximal hyperclique patterns of objects are found. All objects that do not appear in a maximal hyperclique pattern of at least size three are classified as outliers.
a. Does this technique fall into any of the categories discussed in this chapter? If so, which one?
b. Name one potential strength and one potential weakness of this approach.
5. Discuss techniques for combining multiple anomaly detection techniques to improve the identification of anomalous objects. Consider both supervised and unsupervised cases.
6. Describe the potential time complexity of anomaly detection approaches based on the following approaches: model-based using clustering, proximity-based, and density. No knowledge of specific techniques is required. Rather, focus on the basic computational requirements of each approach, such as the time required to compute the density of each object.
7. The Grubbs' test, which is described by Algorithm 9.2, is a more statistically sophisticated procedure for detecting outliers than that of Definition 9.2. It is iterative and also takes into account the fact that the z-score does not have a normal distribution. This algorithm computes the z-score of each value based on the sample mean and standard deviation of the current set of values. The value with the largest magnitude z-score is discarded if its z-score is larger than g_c, the critical value of the test for an outlier at significance level α. This process is repeated until no objects are eliminated. Note that the sample mean, standard deviation, and g_c are updated at each iteration.
a. What is the limit of the value g_c = ((m − 1)/√m) √(t_c² / (m − 2 + t_c²)) used for Grubbs' test as m approaches infinity? Use a significance level of 0.05.
b. Describe, in words, the meaning of the previous result.

Algorithm 9.2 Grubbs' approach for outlier elimination.
1: Input the values and α. {m is the number of values, α is a parameter, and t_c is a value chosen so that α = P(x ≥ t_c) for a t distribution with m − 2 degrees of freedom.}
2: repeat
3: Compute the sample mean (x̄) and standard deviation (s_x).
4: Compute a value g_c so that P(|z| ≥ g_c) = α. (In terms of t_c and m, g_c = ((m − 1)/√m) √(t_c² / (m − 2 + t_c²)).)
5: Compute the z-score of each value, i.e., z = (x − x̄)/s_x.
6: Let g = max |z|, i.e., find the z-score of largest magnitude and call it g.
7: if g > g_c then
8: Eliminate the value corresponding to g.
9: m ← m − 1
10: end if
11: until No objects are eliminated.
8. Many statistical tests for outliers were developed in an environment in which a few hundred observations was a large data set. We explore the limitations of such approaches.
a. For a set of 1,000,000 values, how likely are we to have outliers according to the test that says a value is an outlier if it is more than three standard deviations from the average? (Assume a normal distribution.)
b. Does the approach that states an outlier is an object of unusually low probability need to be adjusted when dealing with large data sets? If so, how?
9. The probability density of a point x with respect to a multivariate normal distribution having a mean μ and covariance matrix Σ is given by the equation

f(x) = (1 / ((2π)^(m/2) |Σ|^(1/2))) e^(−(x − μ) Σ⁻¹ (x − μ)ᵀ / 2).   (9.10)

Using the sample mean x̄ and covariance matrix S as estimates of the mean μ and covariance matrix Σ, respectively, show that the log f(x) is equal to the Mahalanobis distance between a data point x and the sample mean x̄ plus a constant that does not depend on x.

10. Compare the following two measures of the extent to which an object belongs to a cluster: (1) the distance of an object from the centroid of its closest cluster and (2) the silhouette coefficient described in Section 7.5.2.

11. Consider the (relative distance) K-means scheme for outlier detection described in Section 9.5 and the accompanying figure, Figure 9.10.
a. The points at the bottom of the compact cluster shown in Figure 9.10 have a somewhat higher outlier score than those points at the top of the compact cluster. Why?
b. Suppose that we choose the number of clusters to be much larger, e.g., 10. Would the proposed technique still be effective in finding the most extreme outlier at the top of the figure? Why or why not?
c. The use of relative distance adjusts for differences in density. Give an example of where such an approach might lead to the wrong conclusion.
12. If the probability that a normal object is classified as an anomaly is 0.01 and the probability that an anomalous object is classified as anomalous is 0.99, then what is the false alarm rate and detection rate if 99% of the objects are normal? (Use the definitions given below.)

detection rate = (number of anomalies detected) / (total number of anomalies)   (9.11)
false alarm rate = (number of false anomalies) / (number of objects classified as anomalies)   (9.12)

13. When a comprehensive training set is available, a supervised anomaly detection technique can typically outperform an unsupervised anomaly detection technique when performance is evaluated using measures such as the detection and false alarm rate. However, in some cases, such as fraud detection, new types of anomalies are always developing. Performance can be evaluated according to the detection and false alarm rates, because it is usually possible to determine, upon investigation, whether an object (transaction) is anomalous. Discuss the relative merits of supervised and unsupervised anomaly detection under such conditions.

14. Consider a group of documents that has been selected from a much larger set of diverse documents so that the selected documents are as dissimilar from one another as possible. If we consider documents that are not highly related (connected, similar) to one another as being anomalous, then all of the documents that we have selected might be classified as anomalies. Is it possible for a data set to consist only of anomalous objects, or is this an abuse of the terminology?
15. Consider a set of points, where most points are in regions of low density, but a few points are in regions of high density. If we define an anomaly as a point in a region of low density, then most points will be classified as anomalies. Is this an appropriate use of the density-based definition of an anomaly, or should the definition be modified in some way?

16. Consider a set of points that are uniformly distributed on the interval [0,1]. Is the statistical notion of an outlier as an infrequently observed value meaningful for this data?

17. An analyst applies an anomaly detection algorithm to a data set and finds a set of anomalies. Being curious, the analyst then applies the anomaly detection algorithm to the set of anomalies.
a. Discuss the behavior of each of the anomaly detection techniques described in this chapter. (If possible, try this for real data sets and algorithms.)
b. What do you think the behavior of an anomaly detection algorithm should be when applied to a set of anomalous objects?
10 Avoiding False Discoveries

The previous chapters have described the algorithms, concepts, and methodologies of four key areas of data mining: classification, association analysis, cluster analysis, and anomaly detection. A thorough understanding of this material provides the foundation required to start analyzing data in real-world situations. However, without careful consideration of some important issues in evaluating the performance of a data mining procedure, the results produced may not be meaningful or reproducible, i.e., the results may be false discoveries. The widespread nature of this problem has been reported by a number of high-profile publications in scientific fields, and is likewise common in commerce and government. Hence, it is important to understand some of the common reasons for unreliable data mining results and how to avoid these false discoveries.

When a data mining algorithm is applied to a data set, it will dutifully produce clusters, patterns, predictive models, or a list of anomalies. However, any available data set is only a finite sample from the overall population (distribution) of all instances, and there is often significant variability among instances within a population. Thus, the patterns and models discovered from a specific data set may not always capture the true nature of the population, i.e., allow accurate estimation or modeling of the properties of interest. Sometimes, the same algorithm will produce entirely different or inconsistent results when applied to another sample of data, thus indicating that the discovered results are spurious, e.g., not reproducible.

To produce valid (reliable and reproducible) results, it is important to ensure that a discovered pattern or relationship in the data is not an outcome of random chance (arising due to natural variability in the data samples), but rather represents a significant effect. This often involves using statistical procedures, as will be described later. While ensuring the significance of a single result is demanding, the problem becomes more complex when we have multiple results that need to be evaluated simultaneously, such as the large numbers of itemsets typically discovered by a frequent pattern mining algorithm. In this case, many or even most of the results will represent false discoveries. This is also discussed in detail in this chapter.

The purpose of this chapter is to cover a few selected topics, knowledge of which is important for avoiding common data analysis problems and producing valid data mining results. Some of these topics have been discussed in specific contexts earlier in the book, particularly in the evaluation sections of the preceding chapters. We will build upon these discussions to provide an in-depth view of some standard procedures for avoiding false discoveries that are applicable across most areas of data mining. Many of these approaches were developed by statisticians for designed experiments, where the goal was to control external factors as much as possible. Currently, however, these approaches are often (perhaps even mostly) applied to observational data. A key goal of this chapter is to show how these techniques can be applied to typical data mining tasks to help ensure that the resulting models and patterns are valid.
10.1 Preliminaries: Statistical Testing

Before we discuss approaches for producing valid results in data mining problems, we first introduce the basic paradigm of statistical testing that is widely used for making inferences about the validity of results. A statistical test is a generic procedure for measuring the evidence that the outcome (result) of an experiment or a data analysis procedure provides for accepting or rejecting a hypothesis. For example, given the outcome of an experiment to study a new drug for a disease, we can test the evidence for the hypothesis that there is a measurable effect of the drug in treating the disease. As another example, given the outcome of a classifier on a test data set, we can test the evidence for the hypothesis that the classifier performs better than random guessing. In the following, we describe different frameworks for statistical testing.
10.1.1 Significance Testing

Suppose you want to hire a stockbroker who can make profitable decisions on your investments with a high success rate. You know of a stockbroker, Alice, who made profitable decisions for 7 of her last 10 stock picks. How confident would you be in offering her the job, based on your assumption that a performance as good as Alice's is not likely due to random chance?

Questions of the above type can be answered using the basic tools of significance testing. Note that in any general problem of statistical testing, we are looking for some evidence in the outcome to validate a desired phenomenon, pattern, or relationship. For the problem of hiring a successful stockbroker, the desired phenomenon is that Alice indeed has knowledge of how the stock prices vary and uses this knowledge to make 7 correct decisions out of 10. However, there is also a possibility that Alice's performance is no better than what might be obtained by randomly guessing on all 10 decisions. The primary goal of significance testing is to check whether there is sufficient evidence in the outcome to reject the default hypothesis (also called the null hypothesis) that Alice's performance for making profitable stock decisions is no better than random.
Null Hypothesis

The null hypothesis is a general statement that the desired pattern or phenomenon of interest is not true and that the observed outcome can be explained by natural variability, e.g., by random chance. The null hypothesis is assumed to be true until there is sufficient evidence to indicate otherwise. It is commonly denoted as H0. Informally, if the result obtained from the data is unlikely under the null hypothesis, this provides evidence that our result is not just a result of natural variability in the data.

For example, the null hypothesis in the stockbroker problem could be that Alice is no better at making decisions than a person who performs random guessing. Rejecting this null hypothesis would imply that there are sufficient grounds to believe Alice's performance is better than random guessing. More generally, we are interested in the rejection of the null hypothesis, since that typically implies an outcome that is not due to natural variability.

Since declaring the null hypothesis is the first step in the framework of significance testing, care must be taken to state it in a precise and complete manner so that the subsequent steps produce meaningful results. This is important because misstating or loosely stating the null hypothesis can yield misleading conclusions. A general approach is to begin with a statement of the desired result, e.g., a pattern captures an actual relationship between variables, and take the null hypothesis to be the negation (opposite) of that statement, e.g., the pattern is due to natural variability in the data.
Test Statistic

To perform significance testing, we first need a way to quantify the evidence in the observed outcome with respect to the null hypothesis. This is achieved by using a test statistic, R, which typically summarizes every possible outcome as a numerical value. More specifically, the test statistic enables the computation of the probability of an outcome under the null hypothesis. For example, in the stockbroker problem, R could be the number of successful (profitable) decisions made in the last 10 decisions. In this way, the test statistic reduces an outcome consisting of 10 different decisions into a single numerical value, i.e., the count of successful decisions.

The test statistic is typically a count or real-valued quantity and measures how "extreme" an observed result is under the null hypothesis. Depending on the choice of the null hypothesis and the way the test statistic is designed, there can be different ways of defining what is "extreme" relative to the null hypothesis. For example, an observed test statistic, R_obs, can be considered extreme if it is greater than or equal to a certain value, R_H, smaller than or equal to a certain value, R_L, or outside a specified interval, [R_L, R_H]. The first two cases result in "one-sided tests" (right-tailed and left-tailed, respectively), while the last case results in a "two-sided test."

Null Distribution

Having decided an appropriate test statistic for a problem, the next step in significance testing is to determine the distribution of the test statistic under the null hypothesis. This is known as the null distribution, which can be formally described as follows.

Definition 10.1 (null distribution). Given a test statistic, R, the distribution of R under the null hypothesis, H0, is called the null distribution, P(R | H0).

The null distribution can be determined in a number of ways. For example, we can use statistical assumptions about the behavior of R under H0 to generate exact statistical models of the null distribution. We can also conduct experiments to produce samples from H0 and then analyze these samples to approximate the null distribution. In general, the approach for determining the null distribution depends on the specific characteristics of the problem. We will discuss approaches for determining the null distribution in the context of data mining problems in Section 10.2. We illustrate with an example of the null distribution for the stockbroker problem.

Example 10.1 (Null Distribution for Stockbroker Problem). Consider the stockbroker problem where the test statistic, R, is the number of successes of a stockbroker in the last N = 100 decisions. Under the null hypothesis that the stockbroker performs no better than random guessing, the probability of making a successful decision would be p = 0.5. Assuming that the decisions on different days are independent of each other, the probability of obtaining an observed value of R, the total number of successes in N decisions, under the null hypothesis can be modeled using the binomial distribution, which is described by the following equation:

P(R | H0) = (N choose R) × p^R × (1 − p)^(N − R).

Figure 10.1 shows the plot of this null distribution as a function of R for N = 100.

Figure 10.1. Null distribution for the stockbroker problem with N = 100.
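The null distribution of Example 10.1 is easy to compute numerically. The following Python sketch (an illustration, not code from the text) uses scipy.stats.binom to evaluate P(R | H0) for N = 100 and p = 0.5, and to compute the probability of observing a hypothetical count of 70 or more successes under the null hypothesis, which anticipates the tail probabilities discussed next.

```python
from scipy.stats import binom

N, p = 100, 0.5          # number of decisions and success probability under H0
R_obs = 70               # hypothetical observed number of successful decisions

# P(R | H0) for every possible count 0..N (the curve plotted in Figure 10.1)
null_pmf = [binom.pmf(r, N, p) for r in range(N + 1)]

# Probability of observing R_obs or more successes under H0.
# binom.sf(k, N, p) gives P(R > k), so use R_obs - 1 to include R_obs itself.
tail_prob = binom.sf(R_obs - 1, N, p)
print(f"P(R >= {R_obs} | H0) = {tail_prob:.3e}")
```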
The null distribution can be used to determine how unlikely it is to obtain the observed value of the test statistic, R_obs, under the null hypothesis. In particular, the null distribution can be used to compute the probability of obtaining R_obs or "something more extreme" under the null hypothesis. This probability is called the p-value, which can be formally defined as follows:
Definition 10.2 (p-value). The p-value of an observed test statistic, R_obs, is the probability of obtaining R_obs or something more extreme from the null distribution. Depending on how "more extreme" is defined for the test statistic, R, under the null hypothesis, H0, the p-value of R_obs can be written as follows:

p-value(R_obs) = P(R ≥ R_obs | H0), for right-tailed tests;
p-value(R_obs) = P(R ≤ R_obs | H0), for left-tailed tests;
p-value(R_obs) = P(R ≥ |R_obs| or R ≤ −|R_obs| | H0), for two-sided tests.

The reason that we account for "something more extreme" in the calculation of p-values is that the probability of any particular result is often 0 or close to 0. P-values thus capture the aggregate tail probabilities of the null distribution for test statistic values that are at least as extreme as R_obs. For the stockbroker problem, since larger values of the test statistic (count of successful decisions) would be considered more extreme under the null hypothesis, we would compute p-values using the right tail of the null distribution.

Example 10.2 (P-values are Tail Probabilities). To illustrate the fact that p-values can be computed using the left tail, right tail, or both, consider an example where the null distribution is a Gaussian distribution with mean 0 and standard deviation 1, i.e., N(0,1). Figure 10.2 shows the test statistic values corresponding to a p-value of 0.05 for left-tailed, right-tailed, or two-sided tests. (See shaded regions.) We can see that the p-values correspond to the area in the tails of the null distribution. While a two-sided test has 0.025 probability in each of the tails, a one-sided test has all of its 0.05 probability in a single tail.

Figure 10.2. Illustration of p-values as shaded regions for left-tailed, right-tailed, and two-sided tests.
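A minimal sketch, assuming the standard normal null distribution of Example 10.2, of how the three kinds of p-values in Definition 10.2 can be computed with scipy.stats.norm; the observed value of 1.8 is a hypothetical choice for illustration.

```python
from scipy.stats import norm

R_obs = 1.8                                  # hypothetical observed test statistic

p_right = norm.sf(R_obs)                     # P(R >= R_obs | H0), right-tailed
p_left = norm.cdf(R_obs)                     # P(R <= R_obs | H0), left-tailed
p_two = 2 * norm.sf(abs(R_obs))              # P(R >= |R_obs| or R <= -|R_obs| | H0)

print(p_right, p_left, p_two)
```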
Assessing Statistical Significance

P-values provide the necessary tool to assess the strength of the evidence in a result against the null hypothesis. The key idea is that if the p-value is low, then a result at least as extreme as the observed result is less likely to be obtained from H0. For example, if the p-value of a result is 0.01, then there is only a 1% chance of observing a result from the null hypothesis that is at least as extreme as the observed result.

A low p-value indicates smaller probabilities in the tails of the null distribution (for both one-sided and two-sided tests). This can provide sufficient evidence to believe that the observed result is a significant departure from the null hypothesis, thus convincing us to reject H0. Formally, we often use a threshold on p-values (called the level of significance) and describe an observed p-value that is lower than this threshold as statistically significant.

Definition 10.3 (Statistically Significant Result). Given a user-defined level of significance, α, a result is called statistically significant if it has a p-value lower than α.

Some common choices for the level of significance are 0.05 (5%) and 0.01 (1%). The p-value of a statistically significant result denotes the probability of falsely rejecting H0 when H0 is true. Hence, a low p-value provides higher confidence that the observed result is not likely to be consistent with H0, thus making it worthy of further investigation. This often means gathering additional data or conducting non-statistical verification, e.g., by performing experimental validation. (See Bibliographic Notes.) However, even when the p-value is low, there is always some chance (unless the p-value is 0) that H0 is true and we have merely encountered a rare event.

It is important to keep in mind that a p-value is a conditional probability, i.e., it is computed under the assumption that H0 is true. Consequently, a p-value is not the probability of H0, which may be likely or unlikely even if the test result is not significant. Thus, if a result is not significant, then it is not appropriate to say that we accept the null hypothesis. Instead, it is better to say that we fail to reject the null hypothesis. However, when the null hypothesis is known to be true most of the time, e.g., when testing for an effect or result that is rarely seen, it is common to say that we accept the null hypothesis. (See Exercise 6.)
10.1.2 Hypothesis Testing

While significance testing was developed by the famous statistician Fisher as an actionable framework for statistical inference, its intended use is limited to exploratory analyses of the null hypothesis in the preliminary stages of a study, e.g., to refine the null hypothesis or modify future experiments. One of the major limitations of significance testing is that it does not explicitly specify an alternative hypothesis, H1, which is typically the statement we would like to establish as true, i.e., that a result is not spurious. Hence, significance testing can be used to reject the null hypothesis but is unsuitable for determining whether an observed result actually supports H1.

The framework of hypothesis testing, developed by the statisticians Neyman and Pearson, provides a more objective and rigorous approach for statistical testing, by explicitly defining both null and alternative hypotheses. Hence, apart from computing a p-value, i.e., the probability of falsely rejecting the null hypothesis when H0 is true, we can also compute the probability of falsely saying a result is not significant when the alternative hypothesis is true. This allows hypothesis testing to provide a more detailed assessment of the evidence provided by an observed result.

In hypothesis testing, we first define both the null and alternative hypotheses (H0 and H1, respectively) and choose a test statistic, R, that helps to differentiate the behavior of results under the null hypothesis from the behavior of results under the alternative hypothesis. As with significance testing, care must be taken to ensure that the null and alternative hypotheses are precisely and comprehensively defined. We then model the distribution of the test statistic under the null hypothesis, P(R | H0), as well as under the alternative hypothesis, P(R | H1). Similar to the null distribution, there can be many ways of generating the distribution of R under the alternative hypothesis, H1, e.g., by making statistical assumptions about the nature of H1 or by conducting experiments and analyzing samples from H1. In the following example, we concretely illustrate a simple approach for modeling P(R | H1) in the stockbroker problem.

Example 10.3 (Alternative Hypothesis for Stockbroker Problem). In Example 10.1, we saw that under the null hypothesis of random guessing, the probability of obtaining a success on any given day can be assumed to be p = 0.5. There could be many alternative hypotheses for this problem, all of which would assume that the probability of success is greater than 0.5, i.e., p > 0.5, thus representing a situation where the stockbroker performs better than random guessing. To be specific, assume p = 0.7. The distribution of the test statistic (number of successes in N = 100 decisions) under the alternative hypothesis would then be given by the following binomial distribution:

P(R | H1) = (N choose R) × p^R × (1 − p)^(N − R).

Figure 10.3 shows the plot of this distribution (dotted line) with respect to the null distribution (solid line). We can see that the alternative distribution is shifted toward the right. Notice that if a stockbroker has more than 60 successes, then this outcome will be more likely under H1 than H0.

Figure 10.3. Null and alternative distributions for the stockbroker problem with N = 100.
Critical Region

Given the distributions of the test statistic under the null and alternative hypotheses, the framework of hypothesis testing decides if we should "reject" or "not reject" the null hypothesis, given the evidence provided by the test statistic computed from an observed result. This binary decision is typically made by specifying a range of possible values of the test statistic that are too extreme under H0. This set of values is called the critical region. If the observed test statistic, R_obs, falls in this region, then the null hypothesis is rejected. Otherwise, the null hypothesis is not rejected.

The critical region corresponds to the collection of extreme results whose probability of occurrence under the null hypothesis is less than a threshold. The critical region can either be in the left tail, right tail, or both left and right tails of the null distribution, depending on the type of statistical testing being used. The probability of the critical region under H0 is called the significance level, α. In other words, it is the probability of falsely rejecting the null hypothesis for results belonging to the critical region when H0 is true. In most applications, a low value of α (e.g., 0.05 or 0.01) is specified by the user to define the critical region.

Rejecting the null hypothesis if the test statistic falls in the critical region is equivalent to evaluating the p-value of the test statistic and rejecting the null hypothesis if the p-value falls below a pre-specified threshold, α. Note that while every result has a different p-value, the significance level, α, in hypothesis testing is a fixed constant whose value is decided before any tests are performed.

Type I and Type II Errors

Up to this point, hypothesis testing may seem similar to significance testing, at least superficially. However, by considering both the null and alternative hypotheses, hypothesis testing allows us to look at two different types of errors, type I errors and type II errors, as defined below.

Definition 10.4 (Type I Error). A type I error is the error of incorrectly rejecting the null hypothesis for a result. The probability of incurring a type I error is called the type I error rate, α. It is equal to the probability of the critical region under H0, i.e., α is the same as the significance level. Formally,

α = P(R ∈ Critical Region | H0).

Definition 10.5 (Type II Error). A type II error is the error of falsely calling a result not significant when the alternative hypothesis is true. The probability of incurring a type II error is called the type II error rate, β, which is equal to the probability of observing test statistic values outside the critical region under H1, i.e.,

β = P(R ∉ Critical Region | H1).

Note that deciding the critical region (specifying α) automatically determines the value of β for a particular test, given the distribution of the test statistic under the alternative hypothesis.

A closely related concept to the type II error rate is the power of the test, which is the probability of the critical region under H1, i.e., 1 − β. Power is an important characteristic of a test because it indicates how effective a test will be at correctly rejecting the null hypothesis. Low power means that many results that actually show the desired pattern or phenomenon will not be considered significant and thus will be missed. As a consequence, if the power of a test is low, then it may not be appropriate to ignore results that fall outside the critical region. Increasing the size of the critical region to increase power and decrease type II errors will increase type I errors, and vice versa. Hence, it is the balance between ensuring a low value of α and a high value of power that is at the core of hypothesis testing.
When the distribution of the test statistic under the null and alternative hypotheses depends on the number of samples used to estimate the test statistic, then increasing the number of samples helps obtain less variable estimates of the true null and alternative distributions. This reduces the chances of type I and type II errors. For example, evaluating a stockbroker on 100 decisions is more likely to give us an accurate estimate of their true success rate than evaluating the stockbroker on 10 decisions. The minimum number of samples required for ensuring a low value of α while having a high value of power is often determined by a statistical procedure called power analysis. (See Bibliographic Notes for more details.)
Example 10.4 (Classifying Medical Results). Suppose the value of a blood test is used as the test statistic, R, to identify whether a patient has a certain disease or not. It is known that the value of this test statistic has a Gaussian distribution with mean 40 and standard deviation 5 for patients that do not have the disease. For patients having the disease, the test statistic has a mean of 60 and a standard deviation of 5. These distributions are shown in Figure 10.4. H0 is the null hypothesis that the patient doesn't have the disease, i.e., R comes from the leftmost distribution shown in the subfigures of Figure 10.4. H1 is the alternative hypothesis that the patient has the disease, i.e., R comes from the rightmost distribution in the subfigures of Figure 10.4.

Figure 10.4. Distribution of the test statistic for the alternative hypothesis (rightmost density curve) and null hypothesis (leftmost density curve). Shaded region in right subfigure is α.

Suppose the critical region is chosen to be 50 and above, since a level of 50 is exactly halfway between the means of the two distributions. The significance level, α, corresponding to this choice of critical region, shown as the shaded region in the right subfigure of Figure 10.4, can then be calculated as follows:

α = P(R ≥ 50 | H0)
  = P(R ≥ 50 | R ~ N(μ = 40, σ = 5))
  = ∫_{50}^{∞} (1/√(2πσ²)) e^(−(R − μ)²/(2σ²)) dR
  = ∫_{50}^{∞} (1/√(50π)) e^(−(R − 40)²/50) dR
  = 0.023.

The type II error rate, β, for this choice of critical region can also be found to be equal to 0.023. (This is only because the null and alternative hypotheses have the same distribution, except for their means, and the boundary of the critical region, 50, is halfway between their means.) This is shown as the shaded region in the left subfigure of Figure 10.5. The power is then equal to 1 − 0.023 = 0.977, which is shown in the right subfigure of Figure 10.5.

Figure 10.5. Shaded region in left subfigure is β and shaded region in right subfigure is power.

If we use an α of 0.05 instead of 0.023, the critical region would be slightly expanded to 48.22 and above. This would increase the power from 0.977 to 0.991, though at the cost of a higher value of α. On the other hand, decreasing α to 0.01 would decrease the power to 0.952.
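The quantities in Example 10.4 can be reproduced directly with scipy.stats.norm. The sketch below is an illustration of that calculation, not code from the book; the variable names are ours.

```python
from scipy.stats import norm

mu0, mu1, sigma = 40, 60, 5      # null (healthy) and alternative (disease) means
threshold = 50                   # critical region: R >= 50

alpha = norm.sf(threshold, mu0, sigma)   # P(R >= 50 | H0)  -> about 0.023
beta = norm.cdf(threshold, mu1, sigma)   # P(R <  50 | H1)  -> about 0.023
power = 1 - beta                         # -> about 0.977

# Choosing alpha = 0.05 instead moves the critical region boundary to about 48.22
# and raises the power to about 0.991.
threshold_05 = norm.isf(0.05, mu0, sigma)
power_05 = norm.sf(threshold_05, mu1, sigma)
print(alpha, beta, power, threshold_05, power_05)
```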
Effect Size

Effect size brings in domain considerations by considering whether the observed result is significant from a domain point of view. For example, suppose that a new drug is found to lower blood pressure, but only by 1%. This difference will be statistically significant with a large enough test group, but the medical significance of an effect size of 1% is probably not worth the cost of the medicine and the potential for side effects. Thus, a consideration of effect size is critical, since it can often happen that a result is statistically significant, but of no practical importance in the domain. This is particularly true for large data sets.

Definition 10.6 (effect size). The effect size measures the magnitude of the effect or characteristic being evaluated, and is typically the magnitude of the test statistic.

In most problems there is a desired effect size that helps determine the null and alternative hypotheses. For the stockbroker problem, as illustrated in Example 10.3, the desired effect size is the desired probability of success, 0.7. For the medical testing problem, which was just discussed in Example 10.4, the effect size is the value of the threshold used to define the cutoff between normal patients and those with the disease. When comparing the means of two sets of observations (A and B), the effect size is the difference in the means, i.e., μA − μB, or the absolute difference, |μA − μB|.

The desired effect size impacts the choice of the critical region, and thus the significance level and power of the test. Exercises 4 and 5 further explore some of these concepts.
10.1.3 Multiple Hypothesis Testing

The statistical testing frameworks discussed so far are designed to measure the evidence in a single result, i.e., whether the result belongs to the null hypothesis or the alternative hypothesis. However, many situations produce multiple results that need to be evaluated. For example, frequent pattern mining typically produces many frequent itemsets from a given transaction data set, and we need to test every frequent itemset to determine whether there is a statistically significant association among its constituent items. The multiple hypothesis testing problem (also called the multiple testing problem or multiple comparison problem) addresses the statistical testing problem where multiple results are involved and a statistical test is performed on every result.

The simplest approach is to compute the p-value under the null hypothesis for each result independently of other results. If the p-value is significant for any result, then the null hypothesis is rejected for that result. However, this strategy will typically produce many erroneous results when the number of results to be tested is large. For example, even if something only has a 5% chance of happening for a single result, it will happen, on average, 5 times out of 100. Thus, our approach for hypothesis testing needs to be modified.

When working with multiple tests, we are interested in reporting the total number of errors committed on a collection of results (also referred to as a family of results). For example, if we have a collection of m results, we can count the total number of times a type I error or a type II error is committed out of the m tests. The aggregate information of the performance across all tests can be summarized by the confusion matrix shown in Table 10.1. In this table, a result that actually belongs to the null hypothesis is called a 'negative' while a result that actually belongs to the alternative hypothesis is called a 'positive.' This table is essentially the same as Table 4.6, which was introduced in the context of evaluating classification performance in Section 4.11.2.
Table 10.1. Confusion table in the context of multiple hypothesis testing.

|                    | Declared significant (+ prediction) | Declared not significant (− prediction) | Total          |
| True H1 (actual +) | True Positive (TP)                  | False Negative (FN), type II error      | Positives (m1) |
| True H0 (actual −) | False Positive (FP), type I error   | True Negative (TN)                      | Negatives (m0) |
| Total              | Positive Predictions (Ppred)        | Negative Predictions (Npred)            | m              |
In most practical situations where we are performing multiple hypothesis testing, e.g., while using statistical tests to evaluate whether a collection of patterns, clusters, etc., are spurious, the required entries in Table 10.1 are seldom available. (For classification, the table is available when reliable labels are available, in which case many of the quantities of interest can be directly estimated. See Section 10.3.2.) When entries are not available, we need to estimate them, or more typically, quantities derived from these entries. The following paragraphs of this section describe various approaches for doing this.
Family-wise Error Rate (FWER)

A useful error metric when dealing with a family of results is the family-wise error rate (FWER), which is the probability of observing even a single false positive (type I error) in the entire set of m results. In particular,

FWER = P(FP > 0).

If the FWER is lower than a certain threshold, say α, then the probability of observing any type I error among all the results is less than α. The FWER thus measures the probability of observing a type I error in any or all of the m tests. Controlling the FWER, i.e., ensuring the FWER is low, is useful in applications where a set of results is discarded if even a single test is erroneous (produces a type I error). For example, consider the problem of selecting a stockbroker described in Example 10.3. In this case, the goal is to find, from a pool of applicants, a stockbroker that makes correct decisions at least 70% of the time. Even a single type I error can lead to an erroneous hiring decision. In such cases, estimating the FWER gives us a better picture of the performance of the entire set of results than the naïve approach of computing p-values separately for each result. The following example illustrates this concept in the context of the stockbroker problem.
Example 10.5 (Testing Multiple Stockbrokers). Consider the problem of selecting successful stockbrokers from a pool of m = 50 candidates. For every stockbroker, we perform a statistical test to check whether their performance (number of successful decisions in the last N decisions) is better than random guessing. If we use a significance level of α = 0.05 for every such test, the probability of making a type I error on any individual candidate is equal to 0.05. However, if we assume that the results are independent, the probability of observing even a single type I error in any of the 50 tests, i.e., the FWER, is given by

FWER = 1 − (1 − α)^m = 1 − (1 − 0.05)^50 = 0.923,   (10.1)

which is extremely high. Even though the probability of observing no false positives on a single test is quite high (1 − α = 0.95), the probability of seeing no false positives across all tests (0.95^50 = 0.077) diminishes by repeated multiplication. Hence, the FWER can be quite high when m is large, even if the type I error rate, α, is low.
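A quick numerical check of Equation 10.1 (a sketch, not code from the text):

```python
alpha, m = 0.05, 50
fwer = 1 - (1 - alpha) ** m   # probability of at least one type I error in m independent tests
print(round(fwer, 3))         # 0.923
```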
Bonferroni Procedure

A number of procedures have been developed to ensure that the FWER of a set of results is lower than an acceptable threshold, α, which is often 0.05. These approaches, called FWER controlling procedures, basically try to adjust the p-value threshold that is used for every test, so that there is only a small chance of erroneously rejecting the null hypothesis in the presence of multiple tests. To illustrate this category of procedures, we describe the most conservative approach, which is the Bonferroni procedure.

Definition 10.7 (Bonferroni procedure). If m results are to be tested so that the FWER is less than α, the Bonferroni procedure sets the significance level for every test to be α* = α/m.

The intuition behind the Bonferroni procedure can be understood by observing the formula for FWER in Equation 10.1, where the m tests are assumed to be independent of each other. Using a reduced significance level of α/m in Equation 10.1 and applying the binomial theorem, we can see that the FWER is controlled below α as follows:

FWER = 1 − (1 − α/m)^m
     = 1 − [1 + m(−α/m) + (m choose 2)(−α/m)² + … + (−α/m)^m]
     = α − (m choose 2)(−α/m)² − (m choose 3)(−α/m)³ − … − (−α/m)^m
     ≤ α.

While the above discussion was for the case where the tests are assumed to be independent, the Bonferroni approach guarantees no type I error in the m tests with a probability of 1 − α, irrespective of whether the tests (results) are correlated or independent. We illustrate the importance of the Bonferroni procedure for controlling the FWER using the following example.

Example 10.6 (Bonferroni Procedure). In the multiple stockbroker problem described in Example 10.5, we analyze the effect of the Bonferroni procedure in controlling the FWER. The null distribution for an individual stockbroker can be modeled using the binomial distribution with p = 0.5 and N = 100. Given a set of m results simulated from the null distribution (assuming the results are independent), we compare the performance of two competing approaches: the naïve approach, which uses a significance level of α = 0.05, and the Bonferroni procedure, which uses a significance level of α* = 0.05/m.

Figure 10.6 shows the FWER of these two approaches as we vary the number of results, m. (We used 10^6 simulations.) We can see that the FWER of the Bonferroni procedure is always controlled to be at most α, while the FWER of the naïve approach shoots up rapidly and reaches a value close to 1 for m greater than 70. Hence, the Bonferroni procedure is preferred over the naïve approach when m is large and the FWER is the error metric we wish to control.

Figure 10.6. The family-wise error rate (FWER) curves for the naïve approach and the Bonferroni procedure as a function of the number of results, m.

The Bonferroni procedure is almost always overly conservative, i.e., it will eliminate some non-spurious results, especially when the number of results is large and the results may be correlated with each other, e.g., in frequent pattern mining. In the extreme case where all m results are perfectly correlated with each other (and hence identical), the Bonferroni procedure would still use a significance level of α/m even though a significance level of α would have sufficed. To address this limitation, a number of alternative FWER controlling procedures have been developed that are less conservative than Bonferroni when dealing with correlated results. (See Bibliographic Notes for more details.)
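The simulation behind Example 10.6 and Figure 10.6 can be mimicked on a much smaller scale with the following sketch (10,000 runs instead of 10^6); the function name, seed, and run count are illustrative assumptions, not the book's implementation.

```python
import numpy as np
from scipy.stats import binom

def fwer(m, alpha=0.05, N=100, p=0.5, runs=10_000, bonferroni=False):
    """Fraction of simulation runs with at least one false positive among m
    stockbrokers whose success counts are all drawn from the null distribution."""
    rng = np.random.default_rng(0)
    level = alpha / m if bonferroni else alpha
    counts = rng.binomial(N, p, size=(runs, m))      # null success counts
    pvals = binom.sf(counts - 1, N, p)               # right-tailed p-values P(R >= count | H0)
    # A run contributes a family-wise error if any of its m tests is declared significant.
    return np.mean((pvals < level).any(axis=1))

for m in (10, 50, 100):
    print(m, fwer(m), fwer(m, bonferroni=True))
```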
False Discovery Rate (FDR)

By definition, all FWER controlling procedures seek a low probability for obtaining any false positives, and thus are not the appropriate tool when the goal is to allow some false positives in order to get more true positives. For example, in frequent pattern mining, we are interested in selecting frequent itemsets that show statistically significant associations (actual positives), while discarding the remaining ones. As another example, when testing for a serious disease, it is better to get more true positives (detect more actual cases of the disease) even if that means generating some false positives. In both cases, we are ready to tolerate a few false positives as long as we are able to achieve reasonable power for the detection of true positives.

The false discovery rate (FDR) provides an error metric to measure the rate of false positives, which are also called false discoveries. To compute FDR, we first define a variable, Q, that is equal to the number of false positives, FP, divided by the total number of results predicted as positive, Ppred. (See Table 10.1.)

Q = FP/Ppred = FP/(TP + FP), if Ppred > 0;
Q = 0, if Ppred = 0.

When we know FP, the number of false positives, as in classification, Q is essentially the false discovery rate, as defined in Section 4.11.2, which introduced measures for evaluating classification performance under class imbalance. As such, Q is closely related to the precision. Specifically, precision = 1 − FDR = 1 − Q. However, in the context of statistical testing, when Ppred = 0, i.e., when no results are predicted as positive, Q = 0 by definition. However, in data mining, precision, and thus FDR as defined in Section 4.11.2, are typically considered to be undefined in that situation.

In the cases where we do not know FP, it is not possible to use Q as the false discovery rate. Nonetheless, it is still possible to estimate the value of Q on average, i.e., to compute the expected value of Q and use that as our false discovery rate. Formally,

FDR = E(Q).   (10.2)

The FDR is a useful metric for ensuring that the rate of false positives is low, especially in cases where the positives are highly skewed, i.e., the number of actual positives in the collection of results, m1, is very small compared to the number of actual negatives, m0.

Benjamini-Hochberg Procedure

Statistical testing procedures that try to control the FDR are known as FDR controlling procedures. These procedures can typically ensure a low number of false positives (even when the positive class is relatively infrequent) while providing higher power than the more conservative FWER controlling procedures. A widely-used FDR controlling procedure is the Benjamini-Hochberg (BH) procedure, which sorts the results in increasing order of their p-values and uses a different significance level, α(i), for every result, Ri.

The basic idea behind the BH procedure is that if we have observed a large number of significant results that have a lower p-value than a given result, Ri, we can be less stringent while testing Ri and use a more relaxed significance level than α/m. Algorithm 10.1 provides a summary of the BH procedure. The first step in this algorithm is to compute the p-values for every result and sort the results in increasing order of their p-values (steps 1 to 2). Thus, pi would correspond to the ith smallest p-value. The significance level, αi, for pi is then computed using the following correction (line 3):

αi = i × α/m.

Notice that the significance level for the smallest p-value, p1, is equal to α/m, which is the same as the correction used in the Bonferroni procedure. Further, the significance level for the largest p-value, pm, is equal to α, which is the significance level for a single test (without accounting for multiple hypothesis testing). In between these two p-values, the significance level grows linearly from α/m to α. Hence, the BH procedure can be viewed as striking a balance between the overly conservative Bonferroni procedure and the overly liberal naïve approach, thus resulting in higher power (finding more actual positives) without producing too many false positives. Let k be the largest index such that pk is lower than its significance level, αk (line 4). The BH procedure then declares the first k p-values as significant (lines 4 to 5). It can be shown that the FDR computed using the BH procedure is guaranteed to be smaller than α. In particular,

FDR ≤ (m0/m) α ≤ α,   (10.3)

where m0 is the number of actual negative results and m is the total number of results. (See Table 10.1.)

Algorithm 10.1 Benjamini-Hochberg (BH) FDR algorithm.
1: Compute p-values for the m results.
2: Order the p-values from smallest to largest (p1 to pm).
3: Compute the significance level for pi as αi = i × α/m.
4: Let k be the largest index such that pk ≤ αk.
5: Reject H0 for all results corresponding to the first k p-values, pi, 1 ≤ i ≤ k.
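A direct translation of Algorithm 10.1 into Python is given below as a sketch; the function name and the sample p-values are ours, not the book's.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean array marking the results whose null hypothesis is
    rejected by the BH procedure at the desired FDR level alpha."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)                     # step 2: sort p-values, smallest first
    levels = np.arange(1, m + 1) * alpha / m  # step 3: alpha_i = i * alpha / m
    below = p[order] <= levels                # which sorted p-values meet their level
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # step 4: largest index with p_k <= alpha_k
        reject[order[:k + 1]] = True          # step 5: reject H0 for the first k p-values
    return reject

# Hypothetical usage on a small set of p-values.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.35, 0.74]))
```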
Example 10.7 (BH and Bonferroni procedure). Consider the multiple stockbroker problem discussed in Example 10.6, where instead of assuming all m stockbrokers to belong to the null distribution, we may have a small number of m1 candidates who belong to an alternative distribution. The null distribution can be modeled by the binomial distribution with 0.5 probability of making a successful decision. The alternative distribution can be modeled by the binomial distribution with 0.55 probability of success, which is a slightly higher probability of success than that of random guessing. We consider N = 100 decisions for both the null and alternative distributions.

We are interested in comparing the performance of the Bonferroni and BH procedures in detecting a large fraction of actual positives (stockbrokers that indeed perform better than random guessing) without incurring a lot of false positives. We ran 10^6 simulations of m stockbrokers where m1 stockbrokers in each simulation belong to the alternative distribution while the rest belong to the null distribution. We chose m1 = m/3 to demonstrate the effect of a skewed positive class, which is quite common in most practical applications of multiple hypothesis testing. Figure 10.7 shows the plots of FDR and expected power as we vary the number of stockbrokers in each simulation run, m, for three competing procedures: the naïve approach, the Bonferroni procedure, and the BH procedure. The threshold, α, in each of the three procedures was chosen to be 0.05.

Figure 10.7. Comparing the performance of multiple comparison procedures as we vary the number of results, m, and set m1 = m/3 results as positive. α = 0.05 for each of the three procedures.

We can see that the FDR of both the Bonferroni and the BH procedures is smaller than 0.05 for all values of m, but the FDR of the naïve approach is not controlled and is close to 0.1. This shows that the naïve approach is quite relaxed in calling a result positive, and thus incurs more false positives. However, it also generally shows a high power, as many of the actual positives are indeed labeled as positive. On the other hand, the FDR of the Bonferroni procedure is much smaller than 0.05 and is the lowest among all three approaches. This is because the Bonferroni approach aims at controlling a more stringent error metric, i.e., the FWER, to be smaller than 0.05. However, it also has low power, as it is conservative in calling an actual positive significant, since its goal is to avoid any false positives.

The BH procedure strikes a balance between being conservative and relaxed, such that its FDR is always smaller than 0.05 but its expected power is quite high and comparable to the naïve approach. Hence, at the cost of a minor increase in the FDR over the Bonferroni procedure, it is able to achieve much higher power and thus obtain a better trade-off between minimizing type I errors and type II errors in multiple hypothesis testing problems. However, we emphasize that FWER controlling procedures, such as Bonferroni, and FDR controlling procedures, such as BH, are intended for two different tasks, and thus the best procedure to use in any particular situation will vary depending on the goal of the analysis.

Equation 10.3 states that the FDR of the BH procedure is less than or equal to (m0/m) × α, which is equal to α only when m0 = m, i.e., when there are no actual positives in the results. Hence, the BH procedure generally discovers a smaller number of true positives, i.e., has lower power, than it should given a desired FDR of α. To address this limitation of the BH procedure, a number of statistical testing procedures have been developed to provide tighter control over FDR, such as the positive FDR controlling procedures and the local FDR controlling procedures. These techniques generally show better power than the BH procedure while ensuring a small number of false positives. (See Bibliographic Notes for more details.)

Note that some users of FDR controlling procedures assume that α should be chosen in the same way as for hypothesis (significance) testing or for FWER controlling procedures, which traditionally use α = 0.05 or α = 0.01. However, for FDR controlling procedures, α is the desired false discovery rate and is often chosen to have a value greater than 0.05, e.g., 0.20. The reason for this is that in many cases the person evaluating the results is willing to accept more false positives in order to get more true positives. This is especially true when few, if any, potential positive results are produced when α is set to a low value, such as 0.05 or 0.01. In the previous example, we chose α to be the same for all three techniques to keep the discussion simple.
10.1.4 Pitfalls in Statistical Testing

The statistical testing approaches presented above provide an effective framework for measuring the evidence in results. However, as with other data analysis techniques, using them incorrectly can often produce misleading conclusions. Much of the misunderstanding is centered on the use of p-values. In particular, p-values are commonly assigned additional meaning beyond what can be supported by the data and these procedures. In the following, we discuss some of the common pitfalls in statistical testing that should be avoided to produce valid results. Some of these describe p-values and their proper role, while others identify common misinterpretations and misuses. (See Bibliographic Notes for more details.)
1. A p-value is not the probability that the null hypothesis is true. As described previously in Definition 10.2, the p-value is the conditional probability of observing a particular value of a test statistic, R, or something more extreme under the null hypothesis. Hence, we are assuming the null hypothesis is true in order to compute the p-value. A p-value does measure how compatible the observed result is with the null hypothesis.

2. Typically, there can be many hypotheses that explain a result that is found to be significant or non-significant under the null hypothesis. Note that a result that is declared non-significant, i.e., has a high p-value, was not necessarily generated from the null distribution. For example, if we model the null distribution using a Gaussian distribution with mean 0 and standard deviation 1, we will find an observed test statistic, R_obs = 1.5, to be non-significant at a 5% level. However, the result could be from the alternative distribution, even if there is a low (but nonzero) probability of that event. Further, if we misspecified our null hypothesis, then the same observation could have easily come from another distribution, e.g., a Gaussian distribution with mean 1 and standard deviation 1, under which it is more likely. Hence, declaring a result to be non-significant does not amount to "accepting" the null hypothesis. Likewise, a significant result may be explained by many alternative hypotheses. Hence, rejecting the null hypothesis does not necessarily imply that we have accepted the alternative hypothesis. This is one of the reasons that p-values, or more generally the results of statistical testing, are not usually sufficient for making decisions. Factors external to the statistics, such as domain knowledge, must be applied as well.

3. A low p-value does not imply a useful effect size (magnitude of the test statistic) and vice versa. Recall that the effect size is the size of the test statistic for which the result is considered important in the domain of interest. Thus, effect size brings in domain considerations by considering whether the observed result is significant from a domain point of view. For example, suppose that a new drug is found to lower blood pressure, but only by 1%. This difference may be statistically significant, but the medical significance of an effect size of 1% is probably not worth the cost of the medicine and the potential for side effects. In particular, a significant p-value may not have a large effect size, and a non-significant p-value does not imply no effect size. Since p-values depend very strongly on the size of the data set, small p-values for big data applications are becoming increasingly common, since even small effect sizes will show up as being statistically significant. Thus, it becomes critical to take effect size into consideration to avoid generating results that are statistically significant but not useful. In particular, even if a result is declared significant, we should ensure that its effect size is greater than a domain-specified threshold to be of practical importance.

Example 10.8 (Significant p-values in random data). To illustrate the point that we can obtain significantly low p-values even with small effect sizes, we consider the pairwise correlations of 10 random vectors that were generated from a Gaussian distribution with mean 0 and standard deviation 1. The null hypothesis is that the correlation of any two vectors is 0. Figures 10.8a and 10.8b show that as the length of the vectors, n, increases, the maximum absolute pairwise correlation of any pair of vectors tends to 0, but the average number of pairwise correlations that have a p-value less than 0.05 remains constant at about 2.25. This shows that the number of significant pairwise correlations does not decrease when n is large, although the effect size (maximum absolute correlation) is quite low. (A sketch of this simulation is given after this list.)

This example also illustrates the importance of adjusting for multiple tests. There are 45 pairwise correlations and thus, on average, we would expect 0.05 × 45 = 2.25 significant correlations at the 5% level, as is shown in Figure 10.8b.

Figure 10.8. Visualizing the effect of changing the vector length, n, on the correlations among 10 random vectors.

4. It is unsound to analyze a result in multiple ways until we are able to declare it statistically significant. This creates a multiple hypothesis testing problem, and the p-values of individual results are no longer a good guide as to whether the null hypothesis should be rejected or not. (Such approaches are known as p-value hacking.) This can include cases where p-values are not explicitly used but the data is pre-processed or adjusted until a model is found that is acceptable to the investigator.
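The simulation of Example 10.8 can be sketched as follows, using scipy.stats.pearsonr for the pairwise correlations and their p-values; the function name, seed, and the specific vector lengths tried are our illustrative assumptions, not the book's setup.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

def count_significant_correlations(n, n_vectors=10, alpha=0.05, seed=0):
    """Generate random vectors of length n and count pairwise correlations
    whose p-value falls below alpha; also report the largest absolute correlation."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_vectors, n))
    results = [pearsonr(X[i], X[j]) for i, j in combinations(range(n_vectors), 2)]
    max_abs_corr = max(abs(r) for r, _ in results)   # effect size shrinks as n grows
    n_significant = sum(p < alpha for _, p in results)
    return max_abs_corr, n_significant

for n in (100, 1000, 10000):
    print(n, count_significant_correlations(n))
```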
10.2 Modeling Null and Alternative Distributions

A primary requirement for conducting statistical testing is to know how the test statistic is distributed under the null hypothesis (and sometimes under the alternative hypothesis). In conventional problems of statistical testing, this consideration is kept in mind while designing the experimental setup for collecting data, so that we have enough data samples pertaining to the null and alternative hypotheses. For example, in order to test the effect of a new drug in curing a disease, experimental data is usually collected from two groups of subjects that are as similar as possible in all respects, except that one group is administered the drug while the control group is not. The data samples from the two groups then provide information to model the alternative and null distributions, respectively. There is an extensive body of work on experimental design that provides guidelines for conducting experiments and collecting data pertaining to the null and alternative hypotheses, so that they can be used later for statistical testing. However, such guidelines cannot be directly applied when dealing with observational data, where the data is collected without a prior hypothesis in mind, as is common in many data mining problems.

Hence, a central objective when using statistical testing with observational data is to come up with an approach to model the distribution of the test statistic under the null and alternative hypotheses. In some cases, this can be done by making some statistical assumptions about the observations, e.g., that the data follows a known statistical distribution such as the normal, binomial, or hypergeometric distribution. For example, the instances in a data set may be generated by a single normal distribution whose mean and variance can be estimated from the data set. Note that in almost all cases where a statistical model is used, the parameters of that model must be estimated from the data. Hence, any probabilities calculated using a statistical model may have some inherent error, with the magnitude of that error dependent on how well the chosen distribution fits the data and how well the parameters of the model can be estimated.

In some scenarios, it is difficult, or even impossible, to adequately model the behavior of the data with a known statistical distribution. An alternative method is to first generate sample data sets under the null or alternative hypotheses, and then model the distribution of the test statistic using the new data sets. For the alternative hypothesis, the new data sets must be similar to the current data set, but should reflect the natural variability inherent in the data. For the null hypothesis, these data sets should be as similar as possible to the original data set, but lack the structure or pattern of interest, e.g., a connection between attributes and values, cluster structure, or associations between attributes.

In the following, we describe some generic procedures for estimating the null distribution in the context of statistical testing for data mining problems. (Unfortunately, outside of using a known statistical distribution, there are not many widely used methods for generating the alternative distribution.) These procedures will serve as the building blocks of the specific approaches for statistical testing discussed in Sections 10.3 to 10.6. Note that the exact details of the approach used for estimating the null distribution depend on the specific type of problem being studied and the nature of the hypotheses being considered. However, at a high level, the approaches involve generating completely new synthetic data sets or randomizing labels. In addition, we will discuss approaches for resampling existing instances, which can be useful for generating confidence intervals for various data mining results, such as the accuracy of a predictive model.
10.2.1 Generating Synthetic Data Sets

For analyses involving unlabeled data, such as clustering and frequent pattern mining, the main approach for estimating a null distribution is to generate synthetic data sets, either by randomizing the order of attribute values or by generating new instances. The resultant data sets should be similar to the original data set in all manners except that they lack the pattern of interest, e.g., cluster structure or frequent patterns, whose significance has to be assessed.

For example, if we need to assess whether the items in a transaction data set are related to each other beyond whatever association occurs by random chance, we can generate synthetic transaction data sets by randomizing (permuting) the order of existing entries within rows and columns of the binary representation of the transaction data set. The goal is that the resulting data set should have properties similar to the original data set in terms of the number of times an item occurs in the transactions (i.e., the support of an item) and the number of items in every transaction (i.e., the length of a transaction), but have statistically independent items. These synthetic data sets can be processed to find association patterns, and these results can be used to provide an estimate of the distribution of the test statistic of interest, e.g., the support or confidence of an itemset, under the null hypothesis. See Section 10.4 for more details.

If we need to evaluate whether the cluster structure in a data set is better than we might expect by random chance, we need to generate new instances that, when combined in a data set, lack cluster structure. The synthetic data sets can be clustered and used to estimate the null distribution of the test statistic. For cluster analysis, the quantity of interest, i.e., the test statistic, is typically some measure of clustering goodness, such as SSE or the silhouette coefficient. See Section 10.5.

Although the process of randomizing attributes may appear simple, executing this approach can be very challenging, since a naïve attempt at generating synthetic data sets may omit important characteristics or structure of the original data and thus may yield an inadequate approximation of the true null distribution. For example, given time series data, we need to ensure that consecutive values in the randomized time series are similar to each other, since time series data typically exhibit temporal autocorrelation. Further, if the time series data have a yearly cycle (e.g., in climate data), we would need to ensure that such cyclic patterns are also preserved in the synthetically generated time series.

Specific techniques for generating synthetic data sets will be discussed in more detail in the context of association analysis and clustering in Sections 10.4 and 10.5, respectively.
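As a simple illustration of the general idea, the sketch below independently permutes each attribute (column) of a data matrix, which destroys cluster structure and associations between attributes while keeping each attribute's marginal distribution intact. Note that this is only an approximation of the transaction-data approach described above, since it preserves column marginals (item supports) but not row totals (transaction lengths); the function name and usage are our assumptions.

```python
import numpy as np

def permute_attributes(X, seed=None):
    """Return a copy of data matrix X in which each column has been
    independently shuffled, yielding one synthetic data set under the null
    hypothesis of no joint structure among attributes."""
    rng = np.random.default_rng(seed)
    X_null = np.array(X, copy=True)
    for j in range(X_null.shape[1]):
        X_null[:, j] = rng.permutation(X_null[:, j])
    return X_null

# Hypothetical usage: generate several null data sets and compute a test statistic
# (e.g., the SSE of a clustering) on each to approximate its null distribution.
```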
10.2.2 Randomizing Class Labels

When every data instance has an associated class label, a common approach for generating new data is to randomly permute the class labels, a process also referred to as permutation testing. This involves repeatedly shuffling (permuting) labels among data objects at random to generate a new data set that is identical to the old data set except for the label assignments. A classification model is built on each of these data sets and a test statistic calculated, e.g., classification accuracy. The resulting set of values, one for each permutation, can be used to provide a distribution for the statistic under the null hypothesis that the attributes in the data set have no relationship with the class labels. As will be described in Section 10.3.1, this approach can be used to test how likely it is to achieve the classification performance of a learned classifier on a test set just by random chance. Although permuting the labels is simple, it can result in inferior models of the null distribution. (See Bibliographic Notes.)
10.2.3 Resampling Instances
Ideally, we would like to have multiple samples from the underlying population of data instances so that we can assess the validity and generalizability of the models and patterns our data mining algorithms produce. One way to simulate such samples is to randomly sample instances from the original data to create synthetic collections of data instances—an approach called statistical resampling. For example, a common approach for generating new data sets is to use bootstrap sampling, where data instances are randomly selected with replacement such that the resultant data set is of the same size as the original set. For classification, an alternative to bootstrap sampling is k-fold cross-validation, where the data set is systematically split into k subsets (folds). As we will see later in Section 10.3.1, such statistical resampling approaches are used to compute distributions of measures of classification performance, such as accuracy, precision, and recall. Resampling approaches such as the bootstrap can also be used to estimate the distribution of the support of a frequent itemset. We can also use these distributions to produce confidence intervals for these measures.
10.2.4 Modeling the Distribution of the Test Statistic
Given multiple samples of data sets generated under the null hypothesis, we can compute the test statistic on every set of samples to obtain the null distribution of the test statistic. This distribution can be used to provide estimates of the probabilities used in statistical testing procedures, such as p-values. One way to achieve this is to fit statistical models, e.g., the normal or the binomial distribution, to the test statistic values from the data sets generated under the null hypothesis. Alternatively, we can also make use of non-parametric approaches for computing p-values, given enough samples. For instance, we can count the fraction of times the test statistic generated under the null distribution exceeds (or takes "more extreme" values than) the test statistic of the observed result, and use this fraction as the p-value of the result.
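As a concrete illustration, the following is a minimal sketch of this non-parametric p-value calculation in Python, assuming the null-distribution samples have already been generated by one of the procedures above (here they are simply simulated random numbers for illustration):

```python
import numpy as np

def empirical_p_value(observed_stat, null_stats):
    """Fraction of null-distribution samples that are at least as extreme
    (here: at least as large) as the observed test statistic."""
    null_stats = np.asarray(null_stats)
    return np.mean(null_stats >= observed_stat)

# Illustrative usage: 1000 test-statistic values simulated under the null.
rng = np.random.default_rng(0)
null_stats = rng.normal(loc=0.0, scale=1.0, size=1000)
print(empirical_p_value(2.5, null_stats))
```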
10.3 Statistical Testing for Classification

There are a number of problems in classification that can be viewed from the perspective of statistical testing and thus can benefit from the techniques described previously in this chapter for avoiding false discoveries. In the following, we discuss some of these problems and the statistical testing approaches that can be used to address them. Note that approaches for comparing whether the performance of two models is significantly different are provided in Section 3.9.2.
10.3.1 Evaluating Classification Performance
Suppose that a classifier applied to a test set shows an accuracy of x%. In order to assess the validity of the classifier's results, it is important to understand how likely it is to obtain x% accuracy by random chance, i.e., when there is no relationship between the attributes in the data set and the class label. Also, if we choose a certain threshold for the classification accuracy to identify effective classifiers, then we would like to know how many times we can expect to falsely reject a good classifier that shows an accuracy lower than the threshold due to the natural variability in the data.
Such questions about the validity of a classifier's performance can be addressed by viewing this problem from the perspective of hypothesis testing as follows. Consider a statistical testing setup where we learn a classifier on a training set and evaluate the learned classifier on a test set. The null hypothesis for this test is that the classifier is not able to learn a generalizable relationship between the attributes and the class labels from the given training set. The alternative hypothesis is that the classifier indeed learns a generalizable relationship between the attributes and the class labels from the training set. To evaluate whether an observed result belongs to the null hypothesis or the alternative hypothesis, we can use a measure of the classifier's performance on the test set, e.g., precision, recall, or accuracy, as the test statistic.
Randomization
In order to perform statistical testing using the above setup, we first need to generate new sample data sets under the null hypothesis that there are no non-random relationships between the attributes and class labels. This can be achieved by randomly permuting the class labels of the training data such that for every permutation of the labels, we produce a new training set where the attributes and class labels are unrelated to each other. We can then learn a classifier on every sample training set and apply the learned models on the test set to obtain a null distribution of the test statistic (classification performance). Then, for example, if we use accuracy as our test statistic, the observed value of accuracy for the model learned using the original labels should be significantly higher than most or all of the accuracies generated by models learned over randomly permuted labels. However, note that a classifier may have a significant p-value but have an accuracy only slightly better than a random classifier, especially if the data set is large. Hence, it is important to take the effect size of the classifier (the actual value of classification performance) into account along with information about its p-value.
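The sketch below illustrates this label-permutation procedure with scikit-learn; the synthetic data, the decision tree classifier, and the choice of 200 permutations are illustrative assumptions rather than part of the procedure itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Observed accuracy for the model learned with the original labels.
observed = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

# Null distribution: retrain after randomly permuting the training labels.
null_acc = []
for _ in range(200):
    y_perm = rng.permutation(y_tr)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_perm)
    null_acc.append(model.score(X_te, y_te))

p_value = np.mean(np.asarray(null_acc) >= observed)
print(f"observed accuracy = {observed:.3f}, permutation p-value = {p_value:.3f}")
```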
Bootstrap and Cross-Validation
Another type of analysis relevant to predictive models, such as classification, is to model the distribution of various measures of classification performance. One way to estimate such distributions is to generate bootstrap samples from the labeled data (preserving the original labels) to create new training and test sets. The performance of a classification model trained and evaluated on a number of these bootstrapped data sets can then be used to generate a distribution for the measure of interest. Another way to create such a distribution would be to use the randomized cross-validation procedure (discussed in Section 3.6.2), where the process of randomly partitioning the labeled data into k folds is repeated multiple times.
Such resampling approaches can also help in estimating confidence intervals for measures of the true performance of the classifier trained over all possible instances. A confidence interval is an interval of parameter values that is expected to contain the true value of the parameter a certain percentage of the time. That percentage is called the confidence level. For example, given the distribution of a classifier's accuracy, we can estimate the interval of values that contains 95% of the distribution. This serves as the confidence interval of the classifier's true accuracy at the 95% confidence level. To quantify the inherent uncertainty in the result, confidence intervals are often reported along with point estimates of a model's output.
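A minimal sketch of such a bootstrap confidence interval follows, again using scikit-learn and synthetic data; evaluating each model on the out-of-bag instances and using 500 bootstrap replicates are illustrative choices, not requirements of the method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

accuracies = []
n = len(y)
for _ in range(500):
    # Bootstrap sample (with replacement) for training; out-of-bag points for testing.
    idx = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), idx)
    if len(oob) == 0:
        continue
    model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    accuracies.append(model.score(X[oob], y[oob]))

# 95% confidence interval from the 2.5th and 97.5th percentiles of the distribution.
lo, hi = np.percentile(accuracies, [2.5, 97.5])
print(f"95% bootstrap confidence interval for accuracy: [{lo:.3f}, {hi:.3f}]")
```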
10.3.2 Binary Classification as Multiple Hypothesis Testing
The process of estimating the generalization performance of a binary classifier resembles the problem of multiple hypothesis testing discussed previously in Section 10.1.2. In particular, every test instance belongs to the null hypothesis (negative class) or the alternative hypothesis (positive class). By applying a classification model on every test instance, we assign each instance to the positive or the negative class. The performance of a classification model on a set of results (results of classifying instances in a test set) can then be summarized by the familiar confusion matrix presented in Table 10.1.
A unique aspect of binary classification that differentiates it from conventional problems of multiple hypothesis testing is the availability of ground truth labels on test instances. Hence, instead of making inferences using statistical assumptions (e.g., the distribution of the test statistic under the null and alternative hypotheses), we can directly compute error estimates for rejecting the null or alternative hypotheses using empirical methods, such as those presented in Section 4.11.2. Table 10.2 shows the correspondence between the error metrics used in statistical testing and evaluation measures used in classification problems.
Table 10.2. Correspondence between statistical testing concepts and classifier evaluation measures.

Statistical Testing Concept    Classifier Evaluation Measure    Formula
Type I Error Rate, α           False Positive Rate              FP / (FP + TN)
Type II Error Rate, β          False Negative Rate              FN / (TP + FN)
Power, 1 − β                   Recall                           TP / (TP + FN)
While these error metrics can be readily computed with the help of labeled data, the reliability of such estimates depends on the accuracy of the test labels, which may not always be perfect. In such cases, it is important to quantify the uncertainty in the evaluation measures arising due to inaccuracies in the test labels. (See Bibliographic Notes for more details.) Further, when we apply a learned classification model on unlabeled instances, we can use statistical methods for quantifying the uncertainty in the classification outputs. For example, we can bootstrap the training set (as discussed in Section 10.3.1) to generate multiple classification models, and the distribution of their outputs on an unseen instance can be used to estimate the confidence interval of the output on that instance.
Although the above discussion was focused on assessing the quality of a classifier that produces binary outputs, statistical considerations can also be used to assess the quality of a classifier that produces real-valued outputs, such as classification scores. The performance of a classifier across a range of score thresholds is generally analyzed with the help of Receiver Operating Characteristic (ROC) curves, as discussed in Section 4.11.4. The basic approach behind generating an ROC curve is to sort the predictions according to their score values and then plot the true positive rate and the false positive rate for every possible value of the score threshold. Note that this approach bears some resemblance to the FDR controlling procedures described in Section 10.1.3, where the top few ranking results (those with the lowest p-values) are declared positive while controlling the FDR. However, in the presence of ground truth labels, we can empirically estimate measures of classification performance for different score thresholds without making use of any explicit statistical models or assumptions.
10.3.3 Multiple Hypothesis Testing in Model Selection
The problem of multiple hypothesis testing plays a major role in the process of model selection, where even if a more complex model shows better performance than a simpler model, the difference in their performances may not be statistically significant. Specifically, from a statistical perspective, a model with a higher complexity offers a larger number of possible solutions that a learning algorithm can choose from, for a given classification problem. For example, having a larger number of attributes provides a larger set of candidate splitting criteria that a decision tree learning algorithm can choose from to best fit the training data. However, when the training size is small and the number of candidate models is large, there is a higher chance of picking a spurious model. More generally, this version of the multiple hypothesis testing problem is known as selective inference. This problem arises in situations where the number of possible solutions for a given problem, such as building a predictive model, is large, but the number of tests to robustly determine the efficacy of a solution is quite small. Selective inference may lead to the model overfitting problem described in Section 3.4.
How does the multiple comparison procedure relate to model overfitting? Many learning algorithms explore a set of independent alternatives, {γi}, and then choose an alternative, γmax, that maximizes a given criterion function. The algorithm will add γmax to the current model in order to improve its training error. This procedure is repeated until no further improvement is observed. As an example, during decision tree growing, multiple tests are performed to determine which attribute can best split the training data. The attribute that leads to the best split is chosen to extend the tree as long as the stopping criterion has not been satisfied.

Let T0 be the initial decision tree and Tx be the new tree after inserting an internal node for attribute x. Consider the following stopping criterion for a decision tree classifier: x is added to the tree if the observed gain, Δ(T0, Tx), is greater than some predefined threshold α. If there is only one attribute test condition to be evaluated, then we can avoid inserting spurious nodes by choosing a large enough value of α. However, in practice, there is more than one test condition available and the decision tree algorithm must choose the best splitting attribute xmax from a set of candidates, {x1, x2, …, xk}. The multiple comparison problem arises because the algorithm applies the test Δ(T0, Txmax) > α instead of Δ(T0, Tx) > α to decide whether the decision tree should be extended. Just as with the multiple stockbroker example, as the number of alternatives, k, increases, so does our chance of finding Δ(T0, Txmax) > α. Unless the gain function Δ or the threshold α is modified to account for k, the algorithm may inadvertently add spurious nodes with low predictive power to the tree, which leads to the model overfitting problem.

This effect becomes more pronounced when the number of training instances from which xmax is chosen is small, because the variance of Δ(T0, Txmax) is higher when fewer training instances are available. As a result, the probability of finding Δ(T0, Txmax) > α increases when there are very few training instances. This often happens when the decision tree grows deeper, which in turn reduces the number of instances covered by the nodes and increases the likelihood of adding unnecessary nodes into the tree.
10.4 Statistical Testing for Association Analysis

Since problems in association analysis are usually unsupervised, i.e., we do not have access to ground truth labels to evaluate results, it is important to employ robust statistical testing approaches to ensure that the discovered results are statistically significant and not spurious. For example, in the discovery of frequent itemsets, we often use evaluation measures such as the support of an itemset to measure its interestingness. (The uncertainty in such evaluation measures can be quantified by using resampling methods, e.g., by bootstrapping the transactions and generating a distribution of the support of an itemset from the resulting data sets.) Given a suitable evaluation measure, we also need to specify a threshold on the measure to identify interesting patterns such as frequent itemsets. Although the choice of a relevant threshold is generally guided by domain considerations, it can also be informed with the help of statistical procedures, as we discuss in the following. To simplify this discussion, we assume that our transaction data set is represented as a sparse binary matrix, with 1's representing the presence of items and 0's representing their absence. (See Section 2.1.2.)
Given a transaction data set, consider a result to be the discovery of a frequent k-itemset and the test statistic to be the support of the itemset or any other evaluation measure described in Section 5.7. The null hypothesis for this result would be that the k items in the itemset are unrelated to each other. Given a collection of frequent itemsets, we could then apply multiple hypothesis testing methods such as the FWER or FDR controlling procedures to identify significant patterns with strongly associated items. However, the itemsets found by an association mining algorithm overlap in terms of the items they contain. Hence, the multiple results in association analysis cannot be assumed to be independent of each other. For this reason, approaches such as the Bonferroni procedure may be overly conservative in calling a result significant, which leads to low power. Further, a transaction data set may have structure or characteristics, e.g., a subset of transactions containing a large number of items, which need to be accounted for when applying multiple hypothesis testing procedures.
Before we can apply statistical testing procedures for problems related to association analysis, we first need to estimate the distribution of the test statistic of an itemset under the null hypothesis of no association among the items. This can be done either by making use of statistical models or by performing randomization experiments. Both these categories of approaches are described in the following.
10.4.1 Using Statistical Models
Under the null hypothesis that the items are unrelated, we can model the support count of an itemset using statistical models of independence among items. For itemsets containing two independent items, we can use Fisher's exact test. For itemsets containing more than two items, we can use alternative tests of independence such as the chi-squared (χ²) test. Both these approaches are illustrated in the following.

Using Fisher's Exact Test

Consider the problem of modeling the support count of a 2-itemset, {A, B}, under the null hypothesis that A and B occur independently of each other. We are given a data set with N transactions where A and B appear NA and NB times, respectively. Assuming that A and B are independent, the probability of observing A and B together would then be given by

$$p_{AB} = p_A \times p_B = \frac{N_A}{N} \times \frac{N_B}{N},$$

where pA and pB are the probabilities of observing A and B individually, which are approximated by their support. The probability of not observing A and B together would then be equal to (1 − pAB). Assuming that the N transactions are independent, we can consider modeling NAB, the number of times A and B appear together, using the binomial distribution (introduced in Section 10.1) as follows:

$$P(N_{AB} = k) = \binom{N}{k}\,(p_{AB})^k\,(1 - p_{AB})^{N-k}.$$

However, the binomial distribution does not accurately model the support count of {A, B} because it assigns NAB, the support count of {A, B}, a positive probability even when NAB exceeds the individual support counts of A and B. More specifically, the binomial distribution represents the probability of observing an event (co-occurrence of A and B) when sampled with replacement with a fixed probability (pAB). However, in reality, the probability of observing {A, B} decreases if we have already sampled A and B a number of times, because the support counts of A and B are fixed.

Fisher's exact test was designed to handle the above situation, where we perform sampling without replacement from a finite population of fixed size. This test is readily explained using the same terminology used in Section 5.7, which dealt with the evaluation of association patterns. For easy reference, we reproduce the contingency table, Table 5.6, which was used in that discussion. See Table 10.3.

Table 10.3. A 2-way contingency table for variables A and B.

        B      B̄
A      f11    f10    f1+
Ā      f01    f00    f0+
       f+1    f+0     N

We use the notation Ā (B̄) to indicate that A (B) is absent from a transaction. Each entry fij in this 2 × 2 table denotes a frequency count. For example, f11 is the number of times A and B appear together in the same transaction, while f01 is the number of transactions that contain B but not A. The row sum f1+ represents the support count for A, while the column sum f+1 represents the support count for B.

Note that if N, the number of transactions, and the supports of A (f1+) and B (f+1) are fixed, i.e., held constant, then f0+ and f+0 are fixed. This also implies that specifying the value for one of the entries, f11, f10, f01, or f00, completely specifies the rest of the entries in the table. In that case, Fisher's exact test gives us a simple formula for exactly computing the probability of any specific contingency table. Because of our intended application, we express the formula in terms of the support count, NAB, for the 2-itemset {A, B}. Note that f11 is the observed support count of {A, B}:

$$P(N_{AB} = f_{11}) = \frac{\binom{f_{1+}}{f_{11}}\binom{f_{0+}}{f_{+1} - f_{11}}}{\binom{N}{f_{+1}}}. \qquad (10.4)$$
Example 10.9 (Fisher's Exact Test). We illustrate the application of Fisher's exact test using the tea-coffee example described at the beginning of Section 5.7.1. We are interested in modeling the null distribution of the support count of {Tea, Coffee}. As described in Section 5.7.1, the co-occurrence of Tea and Coffee can be summarized using the contingency table shown in Table 10.4. We can see that the support count of Coffee is 800 and the support count of Tea is 200, out of a total of 1000 transactions.

Table 10.4. Beverage preferences among a group of 1000 people.

         Coffee    Coffeē
Tea         150        50      200
Teā         650       150      800
            800       200     1000

To model the null distribution of the support count of {Tea, Coffee}, we simply apply Equation 10.4 from our discussion of Fisher's exact test. This yields the following:

$$P(N_{AB} = f_{11}) = \frac{\binom{200}{f_{11}}\binom{800}{800 - f_{11}}}{\binom{1000}{800}},$$

where NAB is the support count of {Tea, Coffee}.

Figure 10.9 shows a plot of the null distribution of the support count for {Tea, Coffee}. We can see that the largest probability for the support count occurs when it is equal to 160. An intuitive explanation for this fact is that when Tea and Coffee are independent, the probability of observing Tea and Coffee together is equal to the product of their individual probabilities, i.e., 0.8 × 0.2 = 0.16. The expected support count of {Tea, Coffee} is thus equal to 0.16 × 1000 = 160. Support counts that are less than 160 indicate negative associations among the items.
Figure 10.9. Plot of the probability of the support count given the independence of Tea and Coffee.
Hence, the p-value of a support count of 150 can be calculated by summing the probabilities in the left tail of the null distribution, for support counts of 150 and smaller. This yields a p-value of 0.032. This result is not conclusive, since a support count of 150 or less will occur roughly 3 times out of 100, on average, if Tea and Coffee are independent. However, the low p-value tends to indicate that tea and coffee are related, albeit in a negative way, i.e., tea drinkers are less likely to drink coffee than those who don't drink tea. Note that this is just an example and does not necessarily reflect reality. Also note that this finding is consistent with our previous analysis using alternative measures, such as the interest factor (lift measure), as described in Section 5.7.1.
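Equation 10.4 is a hypergeometric probability, so the null distribution and the left-tail p-value above can be checked with SciPy's hypergeometric distribution. The sketch below hard-codes the counts of Table 10.4; the p-value it prints should be close to the 0.032 quoted above:

```python
from scipy.stats import hypergeom

N, n_tea, n_coffee = 1000, 200, 800   # totals from Table 10.4
f11 = 150                             # observed support count of {Tea, Coffee}

# Null distribution of the support count (Equation 10.4): hypergeometric with
# population N, 200 "tea" transactions, and 800 draws (the coffee transactions).
null = hypergeom(M=N, n=n_tea, N=n_coffee)

print("P(N_AB = 160) =", null.pmf(160))      # mode of the null distribution
print("left-tail p-value =", null.cdf(f11))  # P(N_AB <= 150), about 0.03
```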
Although the discussion above was centered on the support measure, we can also model the null distribution of any other objective interestingness measure of a 2-itemset introduced in Section 5.7, such as interest, odds ratio, cosine, or all-confidence. This is because all the entries of the contingency table can be uniquely determined by the support measure of the 2-itemset, given the number of transactions and the support counts of the two items. More specifically, the probabilities displayed in Figure 10.9 are the probabilities of specific contingency tables corresponding to a specific value of support for the 2-itemset. For each of these tables, the values of any objective interestingness measure (of two items) can be calculated, and these values define the null distribution of the measure being considered. This approach can also be used to evaluate interestingness measures of association rules, such as the confidence of A → B, where A and B are itemsets.

Note that using Fisher's exact test is equivalent to using the hypergeometric distribution.

Using the Chi-Squared Test

The chi-squared (χ²) test provides a generic but approximate approach for measuring the statistical independence among multiple items in an itemset. The basic idea behind the χ² test is to compute the expected value of every entry in a contingency table, such as the one shown in Table 10.4, assuming that the items are statistically independent. The differences between the observed and expected values in the contingency table can then be used to compute a test statistic that follows the χ² distribution under the null hypothesis of no association between the items.
Formally, consider a two-dimensional contingency table where the entry at the ith row and jth column is denoted by Oi,j (i, j ∈ {0, 1}). (We use the notation Oi,j instead of fij since the former is traditionally used to represent the "observed" value in discussions of the χ² statistic.) If the sum of all entries is equal to N, then we can compute the expected value at every entry as

$$E_{i,j} = N \times \left(\frac{\sum_i O_{i,j}}{N}\right) \times \left(\frac{\sum_j O_{i,j}}{N}\right). \qquad (10.5)$$

This follows from the fact that the joint probability of observing independent events is equal to the product of the individual probabilities. When all items are statistically independent, Oi,j would usually be close to Ei,j for all values of i and j. Hence, the differences between Oi,j and Ei,j can be used to measure the deviation of the observed contingency table from the null hypothesis of no association. In particular, we can compute the following test statistic:

$$R = \sum_i \sum_j \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}. \qquad (10.6)$$

Note that R = 0 only if Oi,j and Ei,j are equal for every value of i and j. It can be shown that the null distribution of R can be approximated by the χ² distribution with 1 degree of freedom when N is large. We can thus compute the p-value of an observed value of R using standard implementations of the χ² distribution.

While the above discussion was centered on the analysis of a two-dimensional contingency table involving two items, the χ² test can be readily extended to multi-dimensional contingency tables involving more than two items. For example, given a k-itemset X = {i1, i2, …, ik}, we can construct a k-dimensional contingency table with observed entries represented as O_{i1,i2,…,ik} (i1, i2, …, ik ∈ {0, 1}). The expected values of the contingency table and the test statistic R could then be computed as follows:

$$E_{i_1,i_2,\ldots,i_k} = N \times \prod_{j=1}^{k}\left(\frac{\sum_{i_j} O_{i_1,i_2,\ldots,i_k}}{N}\right). \qquad (10.7)$$

$$R = \sum_{i_1}\sum_{i_2}\cdots\sum_{i_k} \frac{(O_{i_1,i_2,\ldots,i_k} - E_{i_1,i_2,\ldots,i_k})^2}{E_{i_1,i_2,\ldots,i_k}}. \qquad (10.8)$$

Under the null hypothesis that all k items in the itemset X are statistically independent, the distribution of R can again be approximated by a χ² distribution. However, the general formula for the degrees of freedom is df = (number of rows − 1) × (number of columns − 1). Thus, if we have a 4 by 3 contingency table, then df = (4 − 1) × (3 − 1) = 6.
10.4.2 Using Randomization Methods

When it is difficult to model the null distribution of itemsets using statistical models, an alternative approach is to generate synthetic transaction data sets under the null hypothesis of no association among the items, with the same number of items and transactions as the original data. This involves randomly permuting the rows or columns in the original data such that the items in the resultant data are unrelated to each other. As discussed in Section 10.2.1, we must ensure while randomizing the attributes that the resultant data sets are similar to the original data set in all respects except for the desired effect we are interested in evaluating, which is the association among items.

A basic structure we would like to preserve in the synthetic data sets is the support of every item in the original data. In particular, every item should appear in the same number of transactions in the synthetic data sets as in the original data set. One way to preserve this support structure of items is to randomly permute the entries in each column of the original data set independently of the other columns. This ensures that the items have the same support in the synthetically generated data sets but are independent of each other. However, this may violate a different property of the original data that we would like to preserve, which is the length of every transaction (the number of items in a transaction). This property can be preserved by randomly shuffling the entries within each row instead, i.e., the row sums are preserved. However, a drawback of this approach is that the support of every item in the resultant data set may be different from the support of the items in the original data set.
A randomization approach that can preserve both the supports and the transaction lengths of the original data is swap randomization. The basic idea is to pick a pair of ones in the original data set from two different rows and columns, say at (row k, column i) and (row l, column j), where k ≠ l and i ≠ j. (See the left table in Figure 10.10.) These two entries define the diagonal of a rectangle of values in the binary transaction matrix. If the entries at the opposite corners of the rectangle, i.e., (row k, column j) and (row l, column i), are zeros, then we can swap these zeros with the ones, as shown in Figure 10.10. Note that by performing this swap, both the row sums and column sums are preserved while the association with other items is broken. This process continues until it is likely that the data set is significantly different from the original one. (An appropriate threshold for the number of swaps needs to be determined depending on the size and nature of the original data set.)

Figure 10.10. Illustration of a swap for swap randomization.
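The following is a minimal sketch of this procedure on a toy binary matrix; the matrix, the number of swaps, and the attempt limit are illustrative assumptions, and, as noted above, a practical implementation would need a more careful choice of the number of swaps:

```python
import numpy as np

def swap_randomize(data, n_swaps, seed=0):
    """Repeatedly apply the swap of Figure 10.10 to a binary transaction matrix.
    Every successful swap preserves all row sums and all column sums."""
    D = data.copy()
    rng = np.random.default_rng(seed)
    n_rows, n_cols = D.shape
    swaps_done, attempts = 0, 0
    while swaps_done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        k, l = rng.choice(n_rows, size=2, replace=False)
        i, j = rng.choice(n_cols, size=2, replace=False)
        # A swap is possible only when (k, i) and (l, j) hold ones and the
        # opposite corners (k, j) and (l, i) hold zeros.
        if D[k, i] == 1 and D[l, j] == 1 and D[k, j] == 0 and D[l, i] == 0:
            D[k, i], D[l, j] = 0, 0
            D[k, j], D[l, i] = 1, 1
            swaps_done += 1
    return D

# A tiny illustrative transaction matrix (rows = transactions, columns = items).
data = np.array([[1, 1, 0, 0],
                 [0, 1, 1, 0],
                 [1, 0, 1, 1],
                 [0, 0, 1, 1]])
randomized = swap_randomize(data, n_swaps=20)
print(data.sum(axis=0), randomized.sum(axis=0))  # identical column sums (item supports)
print(data.sum(axis=1), randomized.sum(axis=1))  # identical row sums (transaction lengths)
```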
Swap randomization has been shown to preserve the properties of transaction data sets more accurately than the other approaches mentioned. However, it is very computationally intensive, particularly for larger data sets, which can limit its application. Furthermore, apart from the support of items and transaction lengths, there may be other types of structure in the transaction data that swap randomization may not be able to preserve. For instance, there may be some known correlations among the items (due to domain considerations) that we would like to retain in the synthetic data sets while breaking the correlations among other items. A good example is data sets that record the presence or absence of a genetic variation at various locations on the genome (items) across multiple subjects (transactions). Items representing locations that are close on the genetic sequence are known to be highly correlated. This local structure of correlation may be lost in the synthetic data sets if we treat each column identically while randomizing. What is needed in this case is to keep the local correlation but to break the correlation of areas that are farther away.
After constructing synthetic data sets pertaining to the null hypothesis, we can generate the null distribution of the support of an itemset by observing its support in the synthetic data sets. This procedure can help in deciding support thresholds using statistical considerations so that the discovered frequent itemsets are statistically significant.
10.5 Statistical Testing for Cluster Analysis

The goodness of a clustering is typically evaluated with the help of cluster validity measures that either capture the cohesion or separation of clusters, such as the sum of squared errors (SSE), or make use of external labels, such as entropy. In some cases, the minimum and maximum values of these measures have intuitive interpretations that can be used to examine the goodness of a clustering. For instance, if we are given the true class labels of instances and we want our clustering to reflect the class structure, then a purity of 0 is bad, while a purity of 1 is good. Likewise, an entropy of 0 is good, as is an SSE of 0. However, in many cases, we are given intermediate values of cluster validity measures, which are difficult to interpret directly without the help of domain considerations.
Statistical testing procedures provide a useful way of measuring the significance of a discovered clustering. In particular, we can consider the null hypothesis that there is no cluster structure among the instances and that the clustering algorithm is producing a random partitioning of the data. The approach is to use the cluster validity measure as a test statistic. The distribution of that test statistic under the assumption that the data has no clustering structure is the null distribution. We can then test whether the validity measure actually observed for the data is significant. In the following, we consider two general cases: (1) the test statistic is an internal clustering validity index computed for unlabeled data, such as SSE or the silhouette coefficient, or (2) the test statistic is an external index, i.e., the cluster labels are to be compared against class labels, using a measure such as entropy or purity. These cluster validity measures are described in Section 7.5.
10.5.1 Generating a Null Distribution for Internal Indices
Internal indices measure the goodness of a clustering only by reference to the data itself; see Section 7.5.2. Furthermore, the clustering is often driven by an objective function, and in those cases, the measure of a clustering's goodness is provided by the objective function. Thus, most of the time, statistical evaluation of a clustering is not performed.
Another reason that such an evaluation is not performed is the difficulty in generating a null distribution. In particular, to get a meaningful null distribution for determining cluster structure, we need to create data with similar overall properties and characteristics as the data we have, except that it has no cluster structure. But this can be difficult since data often has a complex structure, e.g., the dependencies among observations in time series data. Nonetheless, statistical testing can be useful if the difficulties can be overcome. We present a simple example to illustrate the approach.
Example 10.10 (Significance of SSE). This example is based on K-means and the SSE. Suppose that we want a measure of how the well-separated clusters of Figure 10.11(a) compare with respect to random data. We generate many random (uniformly distributed) sets of 100 points having the same range of values along the two dimensions as the points in the three clusters, find three clusters in each data set using K-means, and accumulate the distribution of SSE values for these clusterings. By using this distribution of SSE values, we can then estimate the probability of the SSE value for the original clusters. Figure 10.11(b) shows the histogram of the SSE from 500 random runs. The lowest SSE in the histogram is 0.0173. For the three clusters of Figure 10.11(a), the SSE is 0.0050. We could therefore conservatively claim that there is less than a 1% chance that a clustering such as that of Figure 10.11(a) could occur by chance.
Figure 10.11. Using randomization to evaluate the p-value for a clustering.
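The sketch below mimics the procedure of Example 10.10 with scikit-learn's KMeans, whose inertia_ attribute is the SSE. The three synthetic clusters, the 500 random trials, and all numeric settings are illustrative stand-ins for the data of Figure 10.11, not a reproduction of it:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for Figure 10.11(a): three tight, well-separated 2-D clusters.
centers = np.array([[0.2, 0.2], [0.5, 0.8], [0.8, 0.3]])
X = np.vstack([c + 0.02 * rng.standard_normal((33, 2)) for c in centers])

observed_sse = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).inertia_

# Null distribution: uniform random data over the same range, clustered the same way.
lo, hi = X.min(axis=0), X.max(axis=0)
null_sse = []
for _ in range(500):
    X_rand = rng.uniform(lo, hi, size=X.shape)
    null_sse.append(KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_rand).inertia_)

p_value = np.mean(np.asarray(null_sse) <= observed_sse)  # smaller SSE is better
print(f"observed SSE = {observed_sse:.4f}, empirical p-value = {p_value:.3f}")
```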
In the previous example, it was relatively straightforward to use randomization to evaluate the statistical significance of an internal cluster validity measure. In practice, domain evaluation is usually more important. For instance, a document clustering scheme could be evaluated by looking at the documents and judging whether the clusters make sense. More generally, a domain expert would evaluate the clusters for suitability to a desired application. Nonetheless, a statistical evaluation of a clustering is sometimes necessary. A reference to an example for climate time series is provided in the Bibliographic Notes.
10.5.2 Generating a Null Distribution for External Indices
If external labels are used for evaluation, then a clustering is evaluated using a measure such as entropy or the Rand statistic (see Section 7.5.7), which assesses how closely the cluster structure, as reflected in the cluster labels, matches the class labels. Some of these measures can be modeled with a statistical distribution, e.g., the adjusted Rand index, which is based on the multivariate hypergeometric distribution. If a measure has a well-known distribution, then this distribution can be used to compute a p-value.
However, randomization can also be used to generate a null distribution in this case, as follows.

1. Generate M randomized sets of labels, L1, …, Li, …, LM.
2. For each randomized set of labels, compute the value of the external index. Let mi be the value of the external index obtained for the ith randomization, and let m0 be the value of the external index for the original set of labels.
3. Assuming that a larger value of the external index is more desirable, define the p-value of m0 to be the fraction of the mi for which mi > m0, i.e.,

$$\text{p-value}(m_0) = \frac{|\{m_i : m_i > m_0\}|}{M}. \qquad (10.9)$$
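The sketch below applies this three-step procedure using the adjusted Rand index from scikit-learn as the external index; the class labels, the cluster labels, and M = 1000 are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Hypothetical class labels and cluster labels for 300 instances.
class_labels = rng.integers(0, 3, size=300)
cluster_labels = np.where(rng.random(300) < 0.8,
                          class_labels, rng.integers(0, 3, size=300))

m0 = adjusted_rand_score(class_labels, cluster_labels)  # observed external index

# Step 1 and 2: randomize the labels M times and recompute the index each time.
M = 1000
m = np.array([adjusted_rand_score(class_labels, rng.permutation(cluster_labels))
              for _ in range(M)])

# Step 3 (Equation 10.9): fraction of randomized indices exceeding the observed one.
p_value = np.sum(m > m0) / M
print(f"m0 = {m0:.3f}, p-value = {p_value:.4f}")
```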
As with the case of unsupervised evaluation of a clustering, domain significance often assumes a dominant role. For example, consider clustering news articles into distinct groups as in Example 7.15, where the articles belong to the classes Entertainment, Financial, Foreign, Metro, National, and Sports. If we have the same number of clusters as the number of classes of news articles, then an ideal clustering would have two characteristics. First, every cluster would contain only documents from one class, i.e., it would be pure. Second, every cluster would contain all of the documents from a particular class. An actual clustering of documents can be statistically significant, but still be quite poor in terms of purity and/or containing all the documents of a particular document class. Sometimes, such situations are still of interest, as we describe next.
10.5.3 Enrichment
In some cases involving labeled data, the goal of evaluating clusters is to find clusters that have more instances of a particular class than would be expected for a random clustering. When a cluster has more than the expected number of instances of a specific class, we say that the cluster is enriched in that class. This approach is commonly used in the analysis of bioinformatics data, such as gene expression data, but is applicable in many other areas as well. Furthermore, this approach can be used for any collection of groups, not just those created by clustering. We illustrate this approach with a simple example.
Example 10.11 (Enrichment of Neighborhoods of a City in Terms of Income Levels). Assume that in a particular city there are 10 distinct neighborhoods, which correspond to clusters in our problem. Overall, there are 10,000 people in the city. Further, assume that there are 3 income levels, Poor (30%), Medium (50%), and Wealthy (20%). Finally, assume that one of the neighborhoods has 1,000 residents, 23% of whom fall into the Wealthy category. The question is whether this neighborhood has more wealthy people than expected by random chance. The contingency table for this example is shown in Table 10.5. We can analyze this table by using Fisher's exact test. (See Example 10.9 in Section 10.4.1.)

Table 10.5. Two-way contingency table of income level and neighborhood membership for a city of 10,000 people.

              In Neighborhood    Not In Neighborhood
Wealthy              230                1,770           2,000
Not Wealthy          770                7,230           8,000
                   1,000                9,000          10,000

Using Fisher's exact test, we find that the p-value for this result is 0.0076. This would seem to indicate that more wealthy people live in this neighborhood than would be expected by random chance at a significance level of 1%. However, several points need to be made. First, we may very well be testing every group against every neighborhood to look for enrichment. Thus, there would be 30 tests overall and the p-values should be adjusted for multiple comparisons. For instance, if we use the Bonferroni procedure, 0.0076 would not be a significant result since the significance threshold is now 0.01/30 = 0.0003. Also, the odds ratio for this contingency table is only 1.22. Hence, even if the difference is significant, the actual magnitude of the difference doesn't seem very large, i.e., it is not very far from an odds ratio of 1. In addition, note that multiplying all the entries of the table by 10 will greatly decrease the p-value (≈ 10⁻⁹), but the odds ratio will remain the same. Despite these issues, enrichment can be a valuable tool and has yielded useful results for a variety of applications.
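The numbers quoted in this example can be checked with scipy.stats.fisher_exact, as in the sketch below (a one-sided test for over-representation of wealthy residents); the printed values should come out close to the odds ratio of 1.22 and p-value of 0.0076 mentioned above:

```python
from scipy.stats import fisher_exact

# Table 10.5: wealthy vs. not wealthy, inside vs. outside the neighborhood.
table = [[230, 1770],
         [770, 7230]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, one-sided p-value = {p_value:.4f}")
```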
10.6 Statistical Testing for Anomaly Detection

Anomaly detection algorithms typically produce outputs in the form of class labels (when a classification model is trained over labeled anomalies) or anomaly scores. Statistical considerations can be used to ensure the validity of both these types of outputs, as described in the following.
Supervised Anomaly Detection

If we have access to labeled anomalous instances, the problem of anomaly detection can be converted to a binary classification problem, where the negative class corresponds to the normal data instances, while the positive class corresponds to the anomalous instances. The statistical testing procedures discussed in Section 10.3 for classification are directly relevant for avoiding false discoveries in supervised anomaly detection, albeit with the additional challenges of building a model for imbalanced classes. (See Section 4.11.) In particular, we need to ensure that the classification error metric used during statistical testing is sensitive to the imbalance among the classes and gives enough emphasis to the errors related to the rare anomaly class (false positives and false negatives). After learning a valid classification model, we can also use statistical methods to capture the uncertainty in the outputs of the model on unseen instances. For example, we can use resampling approaches such as the bootstrapping technique to learn multiple classification models from the training set, and the distribution of the labels they produce on an unseen instance can be used to estimate confidence intervals of the true class label of the instance.
Unsupervised Anomaly Detection

Most unsupervised anomaly detection approaches produce an anomaly score for data instances to indicate how anomalous an instance is with respect to the normal class. It is then important to decide a suitable threshold on the anomaly score to identify instances that are significantly anomalous and hence are worthy of further investigation. The choice of a threshold is generally specified by the user based on domain considerations of what is acceptable as a significant departure from the normal behavior. Such decisions can also be reinforced with the help of statistical testing methods.
In particular, from a statistical perspective, we can consider every instance to be a result and its anomaly score to be the test statistic. The null hypothesis is that the instance belongs to the normal class, while the alternative hypothesis is that the instance is significantly different from other points from the normal class and hence is an anomaly. Hence, given the null distribution of the anomaly score, we can compute the p-value of every result and use this information to determine statistically significant anomalies.
A prime requirement for performing statistical testing for anomaly detection is to obtain the distribution of anomaly scores for instances that belong to the normal class, as this is the null distribution. If the anomaly detection approach is based on statistical techniques (see Section 9.3), we have access to a statistical model for estimating the distribution of the normal class. In other cases, we can use randomization methods to generate synthetic data sets where the instances only belong to the normal class. For example, if it is possible to construct a model of the data without anomalies, then this model can be used to generate multiple samples of the data, and in turn, those samples can be used to create a distribution of the anomaly scores for instances that are normal. Unfortunately, just as for generating synthetic data for clustering, there is usually no easy way to construct random data sets that look similar to the original data in all respects except that they contain only normal instances.
If anomaly detection is to be useful, however, then at some point the results of the anomaly detection, particularly the top-ranking anomalies, need to be evaluated by domain experts to assess the performance of the algorithm. If the anomalies produced by the algorithm do not agree with the expert assessment, this does not necessarily mean that the algorithm is not performing well. Instead, it may just mean that the definitions of an anomaly being used by the expert and the algorithm differ. For instance, the expert may view certain aspects of the data as irrelevant, but the algorithm may be treating them as important. In such cases, these aspects of the data can be deemphasized to help refine the statistical testing procedures. Alternatively, there may be new types of anomalies that the expert is unfamiliar with, since anomalies are, by their very nature, supposed to be surprising.
Base Rate Fallacy

Consider an anomaly detection system that can accurately detect 99.9% of the fraudulent credit card transactions with a false alarm rate of only 0.01%. If a transaction is flagged as an anomaly by the system, how likely is it to be genuinely fraudulent? A common misconception is that the majority of the detected anomalies are fraudulent transactions, given the high detection rate and low false alarm rate of the system. However, this can be misleading if the skew of the data is not taken into consideration. This problem is also known as the base rate fallacy or base rate neglect.
To illustrate the problem, consider the contingency table shown in Table 10.6. Let d be the detection rate (i.e., true positive rate) of the system and f be its false alarm rate, or, to be more specific,

$$P(\text{Alarm} \mid \text{Fraud}) = d \quad \text{and} \quad P(\text{Alarm} \mid \text{Not Fraud}) = f.$$

Table 10.6. Contingency table for an anomaly detection system with detection rate d and false alarm rate f.

             Alarm                  No Alarm
Fraud        dαN                    (1 − d)αN                    αN
No Fraud     f(1 − α)N              (1 − f)(1 − α)N              (1 − α)N
             dαN + f(1 − α)N        (1 − d)αN + (1 − f)(1 − α)N  N

Our goal is to calculate the precision of the system, i.e., P(Fraud | Alarm). If the precision is high, then the majority of the alarms are indeed triggered by fraudulent transactions. Based on the information given in Table 10.6, the precision of the system can be calculated as follows:

$$\text{Precision} = \frac{d\alpha N}{d\alpha N + f(1 - \alpha)N} = \frac{d\alpha}{f + (d - f)\alpha}, \qquad (10.10)$$

where α is the fraction of fraudulent transactions in the data. Since d = 0.999 and f = 0.0001, the precision of the system is

$$\text{Precision} = \frac{0.999\alpha}{0.0001 + 0.9989\alpha}. \qquad (10.11)$$

If the data is not skewed, e.g., when α = 0.5, then the precision would be very high, 0.9999, so we can trust that the majority of the flagged transactions are fraudulent. However, if the data is highly skewed, e.g., when α = 2 × 10⁻⁵ (one in fifty thousand transactions), then the precision is only 0.167, which means that only about one in six alarms is a true anomaly.
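A few lines of Python suffice to evaluate Equation 10.10 for the detection rate and false alarm rate used above at several values of α:

```python
def precision(d, f, alpha):
    """Precision of the detector from Equation 10.10: P(Fraud | Alarm)."""
    return d * alpha / (f + (d - f) * alpha)

d, f = 0.999, 0.0001
for alpha in [0.5, 1e-3, 2e-5]:
    print(f"alpha = {alpha:g}: precision = {precision(d, f, alpha):.4f}")
```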
The preceding example illustrates the importance of considering the skewness of the data when choosing an appropriate anomaly detection system for a given application. If the event of interest occurs rarely, say, in one in fifty thousand of the population, then even a system with a 99.9% detection rate and a 0.01% false alarm rate can still make 5 mistakes for every 6 anomalies flagged by the system. The precision of the system degrades significantly as the skewness of the data increases. The crux of this problem lies in the fact that the detection rate and false alarm rate are metrics that are not sensitive to skewness in the class distribution, a problem that was first alluded to in Section 4.11 during our discussion of the class imbalance problem. The lesson here is that any evaluation of an anomaly detection system must take into account the degree of skewness in the data before deploying the system into practice.
10.7 Bibliographic Notes

Recently, there has been a growing body of literature that is concerned with the validity and reproducibility of research results. Perhaps the most well-known work in that area is the paper by Ioannidis [721], which asserts that most published research findings are false. There have been various critiques of this work, e.g., see Goodman and Greenland [717] and Ioannidis' rebuttal [716, 722]. Regardless, concern about the validity and reproducibility of results has only continued to expand. A paper by Simmons et al. [742] states that almost any effect in psychology can be presented as statistically significant given current practice. The paper also suggests recommended changes in research practice and article review. A Nature survey by Baker [697] reported that more than 70% of researchers have tried and failed to replicate other researchers' results, and 50% have failed to replicate their own results. On a more positive note, Jager and Leek [724] looked at published medical research and, although they identified a need for improvements, concluded that "our analysis suggests that the medical literature remains a reliable record of scientific progress." The recent book by Nate Silver [741] has discussed a number of predictive failures in various areas, including baseball, politics, and economics. Although numerous other studies and references can be cited in a number of areas, the key point is that there is a widespread perception, backed by a fair amount of evidence, that many current data analyses are not trustworthy and that there are various steps that can be taken to improve the situation [699, 723, 729]. Although this chapter has focused on statistical issues, many of the changes advocated, e.g., by Ioannidis in his original paper, are not statistical in nature.
The notion of significance testing was introduced by the prominent statistician Ronald Fisher [710, 734]. In response to perceived shortcomings, Neyman and Pearson [735, 736] introduced hypothesis testing. The two approaches have often been merged in an approach known as null hypothesis statistical testing (NHST) [731], which has been the source of many problems [712, 720]. A number of p-value misconceptions are summarized in various recent papers, for example, those by Goodman [715], Nuzzo [737], and Gelman [711]. The American Statistical Association has recently issued a statement on p-values [751]. Papers that describe the Bayesian approach, as exemplified by the Bayes factor and prior odds, are Kass and Raftery [727] and Goodman and Sander [716]. A recent paper by Benjamin and a large number of other prominent statisticians [699] uses such an approach to argue that 0.005, instead of 0.05, should be the default p-value threshold for statistical significance. More generally, the misinterpretation and misuse of p-values is not the only problem, as some have noted [730]. Note that both Fisher's significance testing and the Neyman-Pearson hypothesis testing approaches were designed with statistically designed experiments in mind, but are often, perhaps mostly, applied to observational data. Indeed, most data being analyzed nowadays is observational data.
The seminal paper for the false discovery rate is by Benjamini and Hochberg [701]. The positive false discovery rate was proposed by Storey [743–745]. Efron has advocated the use of the local false discovery rate [704–707]. The work of Efron, Storey, Tibshirani, and others has been applied in a software package for analyzing microarray data, SAM: Significance Analysis of Microarrays [707, 746, 750]. More generally, most mathematical and statistical software has packages for computing FDR. In particular, see the fdrtool package in R by Strimmer [748, 749] or the q-value routine [698, 747], which is available in Bioconductor, a well-known repository of R packages. A recent survey of past and current work in multiple hypothesis testing (multiple comparisons) is given by Benjamini [700].
As discussed in Section 10.2, resampling approaches, especially those based on randomization/permutation and the bootstrap/cross-validation, are the main approach to modeling the null distribution or the distributions of evaluation metrics, and thus, to computing evaluation measures of interest, such as p-values, false discovery rates, and confidence intervals. Discussion of and references for the bootstrap and cross-validation are provided in the Bibliographic Notes of Chapter 3. General resources on permutation/randomization include books by Edgington and Onghena [703], Good [714], and Pesarin and Salmaso [740], as well as the articles by Collingridge [702], Ernst [709], and Welch [756]. Although such techniques are widely used, there are limitations, such as those discussed in some detail by Efron [705]. In this paper, Efron describes a Bayesian approach for estimating an empirical null distribution and using it to compute a "local" false discovery rate that is more accurate than approaches using a null distribution based on randomization or theoretical approaches.
As we have seen in the application-specific sections, different areas of data analysis tend to use approaches specific to their problem. The permutation (randomization) of class labels described in Section 10.3.1 is a straightforward and well-known technique in classification. The paper by Ojala and Garriga [738] examines this approach in more depth and presents an alternative randomization approach that can help identify, for a given data set, whether dependency among features is important in the classification performance. The paper by Jensen and Cohen [726] is a relevant reference for the discussion of multiple hypothesis testing in model selection. Clustering has relatively little work in terms of statistical validation since most users rely on measures of clustering goodness to evaluate outcomes. However, some useful resources are Chapter 4 of Jain and Dubes' clustering book [725] and the recent survey of clustering validity measures by Xiong and Li [757]. The swap randomization approach was introduced into association analysis by Gionis et al. [713]. This paper has a number of references that trace the origin of this approach in other areas, as well as references to other papers for the assessment of association patterns. This work was extended to real-valued matrices by Ojala et al. [739]. Another important resource for statistically sound association pattern discovery is the work of Webb [752–755]. Hämäläinen and Webb taught a tutorial at KDD 2014, Statistically Sound Pattern Discovery. Relevant publications by Hämäläinen include [719] and [718].
The design of experiments to reduce variability and increase power is a core component of statistics. There are a number of general books on the topic, e.g., the one by Montgomery [732], but many more specialized treatments of the topic are available for various domains. In recent years, A/B testing has emerged as a common tool of companies for comparing two alternatives, e.g., two web pages. A recent paper by Kohavi et al. [728] provides a survey and practical guide to A/B testing and some of its variants.
Much of the material presented in Sections 10.1 and 10.2 is covered in various statistics books and articles, many of which were mentioned previously. Additional reference material for significance and hypothesis testing can be found in introductory texts, although, as mentioned above, these two approaches are not always clearly distinguished. The use of hypothesis testing is widespread in a number of domains, e.g., medicine, since the approach allows investigators to determine how many samples will be needed to achieve certain target values for the Type I error, power, and effect size. See, for example, Ellis [708] and Murphy et al. [733].
Bibliography

[697] M. Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604):452–454, 2016.

[698] D. Bass, A. Dabney, and D. Robinson. qvalue: Q-value estimation for false discovery rate control. R package, 2012.

[699] D. J. Benjamin, J. Berger, M. Johannesson, B. A. Nosek, E.-J. Wagenmakers, R. Berk, K. Bollen, B. Brembs, L. Brown, C. Camerer, et al. Redefine statistical significance. PsyArXiv, 2017.

[700] Y. Benjamini. Simultaneous and selective inference: current successes and future challenges. Biometrical Journal, 52(6):708–721, 2010.

[701] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), pages 289–300, 1995.

[702] D. S. Collingridge. A primer on quantitized data analysis and permutation testing. Journal of Mixed Methods Research, 7(1):81–97, 2013.

[703] E. Edgington and P. Onghena. Randomization tests. CRC Press, 2007.

[704] B. Efron. Local false discovery rates. Division of Biostatistics, Stanford University, 2005.

[705] B. Efron. Large-scale simultaneous hypothesis testing. Journal of the American Statistical Association, 2012.

[706] B. Efron et al. Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):1–22, 2008.

[707] B. Efron, R. Tibshirani, J. D. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96(456):1151–1160, 2001.

[708] P. D. Ellis. The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press, 2010.

[709] M. D. Ernst et al. Permutation methods: a basis for exact inference. Statistical Science, 19(4):676–685, 2004.

[710] R. A. Fisher. Statistical methods for research workers. In Breakthroughs in Statistics, pages 66–70. Springer, 1992 (originally, 1925).

[711] A. Gelman. Commentary: P values and statistical practice. Epidemiology, 24(1):69–72, 2013.

[712] G. Gigerenzer. Mindless statistics. The Journal of Socio-Economics, 33(5):587–606, 2004.

[713] A. Gionis, H. Mannila, T. Mielikäinen, and P. Tsaparas. Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(3):14, 2007.

[714] P. Good. Permutation tests: a practical guide to resampling methods for testing hypotheses. Springer Science & Business Media, 2013.

[715] S. Goodman. A dirty dozen: twelve p-value misconceptions. In Seminars in Hematology, volume 45(13), pages 135–140. Elsevier, 2008.

[716] S. Goodman and S. Greenland. Assessing the Unreliability of the Medical Literature: A Response to "Why Most Published Research Findings are False". bepress, 2007.

[717] S. Goodman and S. Greenland. Why most published research findings are false: problems in the analysis. PLoS Med, 4(4):e168, 2007.

[718] W. Hämäläinen. Efficient search for statistically significant dependency rules in binary data. PhD Thesis, Department of Computer Science, University of Helsinki, 2010.

[719] W. Hämäläinen. Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowledge and Information Systems, 32(2):383–414, 2012.

[720] R. Hubbard. Alphabet Soup: Blurring the Distinctions Between p's and α's in Psychological Research. Theory & Psychology, 14(3):295–327, 2004.

[721] J. P. Ioannidis. Why most published research findings are false. PLoS Med, 2(8):e124, 2005.

[722] J. P. Ioannidis. Why most published research findings are false: author's reply to Goodman and Greenland. PLoS Medicine, 4(6):e215, 2007.

[723] J. P. Ioannidis. How to make more published research true. PLoS Medicine, 11(10):e1001747, 2014.

[724] L. R. Jager and J. T. Leek. An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics, 15(1):1–12, 2013.

[725] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall Advanced Reference Series. Prentice Hall, March 1988.

[726] D. Jensen and P. R. Cohen. Multiple Comparisons in Induction Algorithms. Machine Learning, 38(3):309–338, March 2000.

[727] R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.

[728] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, and N. Pohlmann. Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1168–1176. ACM, 2013.

[729] D. Lakens, F. G. Adolfi, C. Albers, F. Anvari, M. A. Apps, S. E. Argamon, M. A. van Assen, T. Baguley, R. Becker, S. D. Benning, et al. Justify Your Alpha: A Response to "Redefine Statistical Significance". PsyArXiv, 2017.

[730] J. T. Leek and R. D. Peng. Statistics: P values are just the tip of the iceberg. Nature, 520(7549):612, 2015.

[731] E. F. Lindquist. Statistical analysis in educational research. Houghton Mifflin, 1940.

[732] D. C. Montgomery. Design and analysis of experiments. John Wiley & Sons, 2017.

[733] K. R. Murphy, B. Myors, and A. Wolach. Statistical power analysis: A simple and general model for traditional and modern hypothesis tests. Routledge, 2014.

[734] J. Neyman. R. A. Fisher (1890–1962): An Appreciation. Science, 156(3781):1456–1460, 1967.

[735] J. Neyman and E. S. Pearson. On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, pages 175–240, 1928.

[736] J. Neyman and E. S. Pearson. On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, pages 263–294, 1928.

[737] R. Nuzzo. Scientific method: Statistical errors. Nature News, Feb. 12, 2014.

[738] M. Ojala and G. C. Garriga. Permutation tests for studying classifier performance. Journal of Machine Learning Research, 11(Jun):1833–1863, 2010.

[739] M. Ojala, N. Vuokko, A. Kallio, N. Haiminen, and H. Mannila. Randomization of real-valued matrices for assessing the significance of data mining results. In Proceedings of the 2008 SIAM International Conference on Data Mining, pages 494–505. SIAM, 2008.

[740] F. Pesarin and L. Salmaso. Permutation tests for complex data: theory, applications and software. John Wiley & Sons, 2010.

[741] N. Silver. The signal and the noise: Why so many predictions fail - but some don't. Penguin, 2012.

[742] J. P. Simmons, L. D. Nelson, and U. Simonsohn. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, page 0956797611417632, 2011.

[743] J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):479–498, 2002.

[744] J. D. Storey. The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, pages 2013–2035, 2003.

[745] J. D. Storey, J. E. Taylor, and D. Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205, 2004.

[746] J. D. Storey and R. Tibshirani. SAM: thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In The Analysis of Gene Expression Data, pages 272–290. Springer, 2003.

[747] J. D. Storey, W. Xiao, J. T. Leek, R. G. Tompkins, and R. W. Davis. Significance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences of the United States of America, 102(36):12837–12842, 2005.

[748] K. Strimmer. fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics, 24(12):1461–1462, 2008.

[749] K. Strimmer. A unified approach to false discovery rate estimation. BMC Bioinformatics, 9(1):303, 2008.

[750] V. G. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9):5116–5121, 2001.

[751] R. L. Wasserstein and N. A. Lazar. The ASA's statement on p-values: context, process, and purpose. The American Statistician, 2016.

[752] G. I. Webb. Discovering significant patterns. Machine Learning, 68(1):1–33, 2007.

[753] G. I. Webb. Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Machine Learning, 71(2):307–323, 2008.

[754] G. I. Webb. Self-sufficient itemsets: An approach to screening potentially interesting associations between items. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1):3, 2010.

[755] G. I. Webb and J. Vreeken. Efficient discovery of the most interesting associations. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):15, 2014.

[756] W. J. Welch. Construction of permutation tests. Journal of the American Statistical Association, 85(411):693–698, 1990.

[757] H. Xiong and Z. Li. Clustering Validation Measures. In C. C. Aggarwal and C. K. Reddy, editors, Data Clustering: Algorithms and Applications, pages 571–605. Chapman & Hall/CRC, 2013.
10.8 Exercises

1. Statistical testing proceeds in a manner analogous to the mathematical proof technique, proof by contradiction, which proves a statement by assuming it is false and then deriving a contradiction. Compare and contrast statistical testing and proof by contradiction.
2. Which of the following are suitable null hypotheses? If not, explain why.

a. Comparing two groups: Consider comparing the average blood pressure of a group of subjects, both before and after they are placed on a low salt diet. In this case, the null hypothesis is that a low salt diet does reduce blood pressure, i.e., that the average blood pressure of the subjects is the same before and after the change in diet.

b. Classification: Assume there are two classes, labeled + and −, where we are most interested in the positive class, e.g., the presence of a disease. H0 is the statement that the class of an object is negative, i.e., that the patient does not have the disease.

c. Association Analysis: For frequent patterns, the null hypothesis is that the items are independent and thus, any pattern that we detect is spurious.

d. Clustering: The null hypothesis is that there is cluster structure in the data beyond what might occur at random.

e. Anomaly Detection: Our assumption, H0, is that an object is not anomalous.
3. Consider once again the coffee-tea example presented in Example 10.9. The following two tables are the same as the one presented in Example 10.9, except that each entry has been divided by 10 (left table) or multiplied by 10 (right table).
Table 10.7. Beverage preferences among a group of 100 people (left) and 10,000 people (right).

Left table (100 people):

         Coffee    Coffeē
Tea          15         5       20
Teā          65        15       80
             80        20      100

Right table (10,000 people):

         Coffee    Coffeē
Tea        1500       500     2000
Teā        6500      1500     8000
           8000      2000    10000
a. Compute the p-value of the observed support count for each table, i.e., for 15 and 1500. What pattern do you observe as the sample size increases?

b. Compute the odds ratio and interest factor for the two contingency tables presented in this problem and the original table of Example 10.9. (See Section 5.7.1 for definitions of these two measures.) What pattern do you observe?

c. The odds ratio and interest factor are measures of effect size. Are these two effect sizes significant from a practical point of view?

d. What would you conclude about the relationship between p-values and effect size for this situation?
4. Consider the different combinations of effect size and p-value applied to an experiment where we want to determine the efficacy of a new drug:
(i) effect size small, p-value small
(ii) effect size small, p-value large
(iii) effect size large, p-value small
(iv) effect size large, p-value large
Whether an effect size is small or large depends on the domain, which in this case is medical. For this problem, consider a small p-value to be less than 0.001, while a large p-value is above 0.05. Assume that the sample size is relatively large, e.g., thousands of patients with the condition that the drug is intended to treat.
a. Which combination(s) would very likely be of interest?
b. Which combination(s) would very likely not be of interest?
c. If the sample size were small, would that change your answers?

5. For Neyman–Pearson hypothesis testing, we need to balance the tradeoff between α, the probability of a type I error, and power, i.e., 1 − β, where β is the probability of a type II error. Compute α, β, and the power for the cases given below, where we specify the null and alternative distributions and the accompanying critical region. All distributions are Gaussian with some specified mean μ and standard deviation σ, i.e., N(μ, σ). Let T be the test statistic.
a. H0: N(0, 1), H1: N(3, 1), critical region: T > 2.
b. H0: N(0, 1), H1: N(3, 1), critical region: |T| > 2.
c. H0: N(−1, 1), H1: N(3, 1), critical region: T > 1.
d. H0: N(−1, 1), H1: N(3, 1), critical region: |T| > 1.
e. H0: N(−1, 0.5), H1: N(3, 0.5), critical region: T > 1.
f. H0: N(−1, 0.5), H1: N(3, 0.5), critical region: |T| > 1.
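For these cases, α is the probability that T falls in the critical region under the null distribution, β is the probability that T falls outside the critical region under the alternative distribution, and power = 1 − β. A minimal Python sketch (not from the text) that evaluates these quantities; the helper name alpha_beta_power is illustrative:

    # Sketch for Exercise 5 (assumed approach, not the book's code).
    from scipy.stats import norm

    def alpha_beta_power(mu0, mu1, sigma, threshold, two_sided=False):
        h0, h1 = norm(mu0, sigma), norm(mu1, sigma)
        if two_sided:                       # critical region |T| > threshold
            alpha = h0.sf(threshold) + h0.cdf(-threshold)
            power = h1.sf(threshold) + h1.cdf(-threshold)
        else:                               # critical region T > threshold
            alpha = h0.sf(threshold)
            power = h1.sf(threshold)
        return alpha, 1 - power, power      # (alpha, beta, power)

    print(alpha_beta_power(0, 3, 1, 2))             # case (a): T > 2
    print(alpha_beta_power(0, 3, 1, 2, True))       # case (b): |T| > 2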
6. A p-value measures the probability of the result given that the null hypothesis is true. However, many people who calculate p-values have used them as the probability of the null hypothesis given the result, which is erroneous. A Bayesian approach to this problem is summarized by Equation 10.12:

posterior odds of H1 = P(H1 | xobs) / P(H0 | xobs) = [f(xobs | H1) / f(xobs | H0)] × [P(H1) / P(H0)]     (10.12)

This approach computes the ratio of the probability of the alternative and null hypotheses (H1 and H0, respectively) given the observed outcome, xobs. In turn, this quantity is expressed as the product of two factors: the Bayes factor and the prior odds. The prior odds is the ratio of the probability of H1 to the probability of H0 based on prior information about how likely we believe each hypothesis to be. Usually, the prior odds is estimated directly based on experience. For example, in drug testing in the laboratory, it may be known that most drugs do not produce potentially therapeutic effects. The Bayes factor is the ratio of the probability or probability density of the observed outcome, xobs, under H1 and H0. This quantity is computed and represents a measure of how much more or less likely the observed result is under the alternative hypothesis than under the null hypothesis. Conceptually, the higher it is, the more we would tend to prefer the alternative to the null. The higher the Bayes factor, the stronger the evidence provided by the data for H1. More generally, this approach can be applied to assess the evidence for any hypothesis versus another. Thus, the roles of H0 and H1 can be (and often are) reversed in Equation 10.12.
a. Suppose that the Bayes factor is 20, which is very strong, but the prior odds are 0.01. Would you be inclined to prefer the alternative or null hypothesis?
b. Suppose the prior odds are 0.25, the null distribution is Gaussian with density given by f0(x) = N(0, 2), and the alternative distribution is given by f1(x) = N(3, 1). Compute the Bayes factor and posterior odds of H1 for the following values of xobs: 2, 2.5, 3, 3.5, 4, 4.5, 5. Explain the pattern that you see in both quantities.
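A small sketch (not part of the text) of the calculation in part (b): with the stated densities f0 = N(0, 2) and f1 = N(3, 1) and prior odds of 0.25, the Bayes factor at xobs is the ratio of the two densities there, and the posterior odds are the Bayes factor times the prior odds.

    # Sketch for Exercise 6(b); assumptions are those stated in the exercise.
    from scipy.stats import norm

    f0, f1, prior_odds = norm(0, 2), norm(3, 1), 0.25
    for x_obs in [2, 2.5, 3, 3.5, 4, 4.5, 5]:
        bayes_factor = f1.pdf(x_obs) / f0.pdf(x_obs)
        posterior_odds = bayes_factor * prior_odds
        print(x_obs, bayes_factor, posterior_odds)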
7. Consider the problem of determining whether a coin is a fair one, i.e., P(heads) = P(tails) = 0.5, by flipping the coin 10 times. Use the binomial theorem and basic probability to answer the following questions.
a. A coin is flipped ten times and it comes up heads every time. What is the probability of getting 10 heads in a row, and what would you conclude about whether the coin is fair?
b. Suppose 10,000 coins are each flipped 10 times in a row, and the flips of 10 coins result in all heads. Can you confidently say that these coins are not fair?
c. What can you conclude about results when evaluated individually versus in a group?
d. Suppose that you flip each coin 20 times and then evaluate 10,000 coins. Can you now confidently say that any coin which yields all heads is not fair?
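For part (a), the probability of 10 heads from a fair coin is 0.5^10; for the later parts it helps to compare this probability with the number of coins being tested. A short sketch (not from the text):

    # Sketch for Exercise 7 (not the book's code).
    from scipy.stats import binom

    p10 = binom.pmf(10, 10, 0.5)      # P(10 heads in 10 flips) = 0.5**10
    p20 = binom.pmf(20, 20, 0.5)      # P(20 heads in 20 flips) = 0.5**20
    n_coins = 10000
    print(p10, n_coins * p10)         # expected all-heads coins among 10,000 fair coins (10 flips)
    print(p20, n_coins * p20)         # expected all-heads coins among 10,000 fair coins (20 flips)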
8. Algorithm 10.1 on page 773 provides a method for calculating the false discovery rate using the method advocated by Benjamini and Hochberg. The description in the text is presented in terms of ordering the p-values and adjusting the significance level to assess whether a p-value is significant. Another way to interpret this method is in terms of ordering the p-values, smallest to largest, and computing adjusted p-values, p′i = pi × m/i, where i identifies the ith smallest p-value and m is the number of p-values. The statistical significance is determined based on whether p′i ≤ α, where α is the desired false discovery rate.
a. Compute the adjusted p-values for the p-values in Table 10.8. Note that the adjusted p-values may not be monotonic. In that case, an adjusted p-value that is larger than its successor is changed to have the same value as its successor.

Table 10.8. Ordered collection of p-values.

    i                  1      2      3     4      5     6     7     8    9     10
    original p-value   0.001  0.005  0.05  0.065  0.15  0.21  0.25  0.3  0.45  0.5

b. If the desired FDR is 20%, i.e., α = 0.20, then for which p-values is H0 rejected?
c. Suppose that we use the Bonferroni procedure instead. For different values of α, namely 0.01, 0.05, and 0.10, compute the modified p-value threshold, α* = α/10, that the Bonferroni procedure will use to evaluate the p-values. Then determine, for each value of α*, for which p-values H0 will be rejected. (If a p-value equals the threshold, it is rejected.)
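A brief sketch (not the book's Algorithm 10.1) of the adjusted p-value computation described above, applied to the p-values of Table 10.8, together with the Bonferroni thresholds of part (c):

    # Sketch for Exercise 8 (not the book's code).
    import numpy as np

    p = np.array([0.001, 0.005, 0.05, 0.065, 0.15, 0.21, 0.25, 0.3, 0.45, 0.5])
    m = len(p)
    adjusted = p * m / np.arange(1, m + 1)                   # p'_i = p_i * m / i
    adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]   # make non-monotone values equal to their successor
    print(adjusted)
    print("BH rejects H0 for:", p[adjusted <= 0.20])         # part (b), alpha = 0.20
    for alpha in [0.01, 0.05, 0.10]:                         # part (c), Bonferroni
        print(alpha, "threshold:", alpha / m, "rejects:", p[p <= alpha / m])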
9. The positive false discovery rate (pFDR) is similar to the false discovery rate defined in Section 10.1.3, but assumes that the number of true positives is greater than 0. Calculation of the pFDR is similar to that of the FDR, but requires an assumption on the value of m0, the number of results that satisfy the null hypothesis. The pFDR is less conservative than the FDR, but more complicated to compute.
The positive false discovery rate also allows the definition of an FDR analogue of the p-value. The q-value is the expected fraction of hypotheses that will be false if the given hypothesis is accepted. Specifically, the q-value associated with a p-value is the expected proportion of false positives among all hypotheses that are more extreme, i.e., have a lower p-value. Thus, the q-value associated with a p-value is the positive false discovery rate that would result if that p-value were used as the threshold for rejection.
Below we show 50 p-values, their Benjamini–Hochberg adjusted p-values, and their q-values.

p-values:
    0.0000 0.0000 0.0002 0.0004 0.0004 0.0010 0.0089 0.0089 0.0288 0.0479
    0.0755 0.0755 0.0755 0.1136 0.1631 0.2244 0.2964 0.3768 0.3768 0.3768
    0.4623 0.4623 0.4623 0.5491 0.5491 0.6331 0.7107 0.7107 0.7107 0.7107
    0.7107 0.8371 0.9201 0.9470 0.9470 0.9660 0.9660 0.9660 0.9790 0.9928
    0.9928 0.9928 0.9928 0.9960 0.9960 0.9989 0.9989 0.9995 0.9999 1.0000

BH adjusted p-values:
    0.0000 0.0000 0.0033 0.0040 0.0040 0.0083 0.0556 0.0556 0.1600 0.2395
    0.2904 0.2904 0.2904 0.4057 0.5437 0.7012 0.8718 0.9420 0.9420 0.9420
    1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
    1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
    1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

q-values:
    0.0000 0.0000 0.0023 0.0033 0.0033 0.0068 0.0454 0.0454 0.1267 0.1861
    0.2509 0.2509 0.2509 0.3351 0.4198 0.4989 0.5681 0.6257 0.6257 0.6257
    0.6723 0.6723 0.6723 0.7090 0.7090 0.7375 0.7592 0.7592 0.7592 0.7592
    0.7592 0.7879 0.8032 0.8078 0.8078 0.8108 0.8108 0.8108 0.8129 0.8150
    0.8150 0.8150 0.8150 0.8155 0.8155 0.8159 0.8159 0.8160 0.8161 0.8161
a. How many p-values are considered significant using the BH adjusted p-values and thresholds of 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30?
b. How many p-values are considered significant using the q-values and thresholds of 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30?
c. Compare the two sets of results.
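The counts asked for in parts (a) and (b) can be read off the listing above. A short sketch (not from the text); the arrays simply repeat the BH adjusted p-values and q-values shown above:

    # Sketch for Exercise 9 (not the book's code).
    import numpy as np

    bh = np.array([0.0000, 0.0000, 0.0033, 0.0040, 0.0040, 0.0083, 0.0556, 0.0556, 0.1600, 0.2395,
                   0.2904, 0.2904, 0.2904, 0.4057, 0.5437, 0.7012, 0.8718, 0.9420, 0.9420, 0.9420]
                  + [1.0000] * 30)
    q = np.array([0.0000, 0.0000, 0.0023, 0.0033, 0.0033, 0.0068, 0.0454, 0.0454, 0.1267, 0.1861,
                  0.2509, 0.2509, 0.2509, 0.3351, 0.4198, 0.4989, 0.5681, 0.6257, 0.6257, 0.6257,
                  0.6723, 0.6723, 0.6723, 0.7090, 0.7090, 0.7375, 0.7592, 0.7592, 0.7592, 0.7592,
                  0.7592, 0.7879, 0.8032, 0.8078, 0.8078, 0.8108, 0.8108, 0.8108, 0.8129, 0.8150,
                  0.8150, 0.8150, 0.8150, 0.8155, 0.8155, 0.8159, 0.8159, 0.8160, 0.8161, 0.8161])
    for t in [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]:
        print(t, "BH:", int((bh <= t).sum()), "q-values:", int((q <= t).sum()))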
10. An alternative to the definitions of the false discovery rate discussed so far is the local false discovery rate, which is based on modeling the observed values of the test statistic as a mixture of two distributions, where most of the observations come from the null distribution and some observations (the interesting ones) come from the alternative distribution. (See Section 8.2.2 for more information on mixture models.) If Z is our test statistic, the density, f(z), of Z is given by the following:

f(z) = p0 f0(z) + p1 f1(z),     (10.13)

where p0 is the probability that an instance comes from the null distribution, f0(z) is the distribution of the test statistic under the null hypothesis, p1 is the probability that an instance comes from the alternative distribution, and f1(z) is the distribution of the test statistic under the alternative hypothesis.

Using Bayes' theorem, we can derive the probability of the null hypothesis for any value of z as follows:

p(H0 | z) = f(H0 and z) / f(z) = p0 f0(z) / f(z).     (10.14)

The quantity p(H0 | z) is the quantity that we would like to define as the local fdr. Since p0 is often close to 1, the local false discovery rate, represented as fdr (all lowercase), is defined as the following:

fdr(z) = f0(z) / f(z).     (10.15)

This is a point estimate, not an interval estimate as with the standard FDR, which is based on p-values, and as such, it will vary with the value of the test statistic. Note that the local fdr has an easy interpretation, namely, as the ratio of the density of observations from the null distribution to observations from both the null and alternative distributions. It also has the advantage of being directly interpretable as a real probability.

The challenge, of course, is in estimating the densities involved in Equation 10.15, which are usually estimated empirically. We consider the following simple case, where we specify the distributions as Gaussian distributions. The null distribution is given by f0(z) = N(0, 1), while the alternative distribution is given by f1(z) = N(3, 1), with p0 = 0.999 and p1 = 0.001.
a. Compute p(H0 | z) for the following values of z: 2, 2.5, 3, 3.5, 4, 4.5, 5.
b. Compute the local fdr for the following values of z: 2, 2.5, 3, 3.5, 4, 4.5, 5.
c. How close are these two sets of values?
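A small sketch (not from the text) of parts (a) and (b) under the stated assumptions (f0 = N(0, 1), f1 = N(3, 1), p0 = 0.999, p1 = 0.001), applying Equations 10.13–10.15 directly:

    # Sketch for Exercise 10 (not the book's code).
    from scipy.stats import norm

    f0, f1, p0, p1 = norm(0, 1), norm(3, 1), 0.999, 0.001
    for z in [2, 2.5, 3, 3.5, 4, 4.5, 5]:
        f = p0 * f0.pdf(z) + p1 * f1.pdf(z)      # mixture density, Eq. 10.13
        p_h0_given_z = p0 * f0.pdf(z) / f        # Eq. 10.14
        local_fdr = f0.pdf(z) / f                # Eq. 10.15
        print(z, p_h0_given_z, local_fdr)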
11. The following are two alternatives to swap randomization (presented in Section 10.4.2) for randomizing a binary matrix so that the number of 1s in any row and column is preserved. Examine each method and (i) verify that it does indeed preserve the number of 1s in any row and column, and (ii) identify the problem with the alternative approach.
a. Randomly permute the order of the columns and rows. An example is shown in Figure 10.12.

Figure 10.12. A 3×3 matrix before and after randomizing the order of the rows and columns. The leftmost matrix is the original.

b. Figure 10.13 shows another approach to randomizing a binary matrix. This approach converts the binary matrix to a row-column representation, then randomly reassigns the columns to various entries, and finally converts the data back into the original binary matrix format.

Figure 10.13. A 4×4 matrix before and after randomizing the entries. From right to left, the tables represent the following: the original binary data matrix, the matrix in row-column format, the row-column format after randomly permuting the entries in the col column, and the matrix reconstructed from the randomized row-column representation.
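For reference, swap randomization itself preserves the row and column sums exactly, because each swap only exchanges the positions of a 0 and a 1 within two rows and two columns. The sketch below (not the book's code; swap_randomize and preserves_margins are illustrative names) shows one way to implement the swap step and to check that the margins are preserved; for the permutation alternative in part (a), the natural check would instead compare sorted margin vectors.

    # Sketch related to Exercise 11 (not the book's code).
    import numpy as np

    def preserves_margins(original, randomized):
        return (original.sum(axis=0) == randomized.sum(axis=0)).all() and \
               (original.sum(axis=1) == randomized.sum(axis=1)).all()

    def swap_randomize(matrix, n_swaps=1000, seed=0):
        m = matrix.copy()
        rng = np.random.default_rng(seed)
        for _ in range(n_swaps):
            r1, r2 = rng.integers(0, m.shape[0], 2)
            c1, c2 = rng.integers(0, m.shape[1], 2)
            # swap only a 2x2 "checkerboard" submatrix, which keeps all margins fixed
            if m[r1, c1] == m[r2, c2] == 1 and m[r1, c2] == m[r2, c1] == 0:
                m[r1, c1] = m[r2, c2] = 0
                m[r1, c2] = m[r2, c1] = 1
        return m

    x = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
    print(preserves_margins(x, swap_randomize(x)))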
Author Index
Abdulghani,A.,433Abe,N.,345Abello,J.,698Abiteboul,S.,437Abraham,B.,745Adams,N.M.,430Adolfi,F.G.,807Adomavicius,G.,18Aelst,S.V.,749Afshar,R.,509Agarwal,R.C.,342,429,747Aggarwal,C.,18,430,507,509Aggarwal,C.C.,339,429,430,600,602,745,808Agrawal,R.,18,104,183,184,430,431,435,436,507,509,696Aha,D.W.,181,339Akaike,H.,181Akoglu,L.,745Aksehirli,E.,434Albers,C.,807Alcalá-Fdez,J.,508Aldenderfer,M.S.,600Alexandridis,M.G.,184Ali,K.,430Allen,D.M.,181Allison,D.B.,181Allwein,E.L.,340Alsabti,K.,181Altman,R.B.,19Alvarez,J.L.,508Amatriain,X.,18Ambroise,C.,181
Anderberg,M.R.,102,600Anderson,T.W.,849Andrews,R.,340Ankerst,M.,600Antonie,M.-L.,507Anvari,F.,807Aone,C.,601Apps,M.A.,807Arabie,P.,600,602Argamon,S.E.,807Arnold,A.,746Arthur,D.,600Atluri,G.,19,508Atwal,G.S.,103Aumann,Y.,507Austin,J.,747Ayres,J.,507
Baguley,T.,807Bai,H.,696Baker,M.,806Baker,R.J.,343Bakiri,G.,341Bakos,J.,438Baldi,P.,340Ball,G.,600Bandyopadhyay,S.,103Banerjee,A.,18,19,600,696,746Barbará,D.,430,696Barnett,V.,745Bass,D.,806Bastide,Y.,435Basu,S.,696Batistakis,Y.,601Baxter,R.A.,747Bay,S.D.,430,746Bayardo,R.,430Becker,R.,807Beckman,R.J.,746Belanche,L.,104Belkin,M.,849Ben-David,S.,19Bengio,S.,341Bengio,Y.,340,341,343,345Benjamin,D.J.,806Benjamini,Y.,430,806Bennett,K.,340Benning,S.D.,807Berger,J.,806
Berk,R.,806Berkhin,P.,600Bernecker,T.,430Berrar,D.,340Berry,M.J.A.,18Bertino,E.,20Bertolami,R.,342Bhaskar,R.,430Bienenstock,E.,182Bilmes,J.,696Bins,M.,183Bishop,C.M.,181,340,696Blashfield,R.K.,600Blei,D.M.,696Blondel,M.,20,184Blum,A.,102,340Bobadilla,J.,18Bock,H.-H.,600Bock,H.H.,102Boiteau,V.,600Boley,D.,600,602Bollen,K.,806Bolton,R.J.,430Borg,I.,102Borgwardt,K.M.,434Boswell,R.,340Bosworth,A.,103Bottou,L.,340Boulicaut,J.-F.,507Bowyer,K.W.,340Bradley,A.P.,340
Bradley,P.S.,18,438,600,696Bradski,G.,182Bratko,I.,183Breiman,L.,181,340Brembs,B.,806Breslow,L.A.,181Breunig,M.M.,600,746Brin,S.,430,431,436Brockwell,A.,20Brodley,C.E.,184,747Brown,L.,806Brucher,M.,20,184Bunke,H.,342Buntine,W.,181Burges,C.J.C.,340Burget,L.,343Burke,R.,18Buturovic,L.J.,183Bykowski,A.,507
Cai,C.H.,431Cai,J.,438Camerer,C.,806Campbell,C.,340Canny,J.F.,344Cantú-Paz,E.,181Cao,H.,431Cardie,C.,699Carreira-Perpinan,M.A.,849Carvalho,C.,435Catanzaro,B.,182Ceri,S.,434Cernock`y,J.,343Chakrabarti,S.,18Chan,P.K.,19,341Chan,R.,431Chandola,V.,746Chang,E.Y.,696Chang,L.,749Charrad,M.,600Chatterjee,S.,698,747Chaudhary,A.,746Chaudhuri,S.,103Chawla,N.V.,340,746Chawla,S.,746Cheeseman,P.,696Chen,B.,436Chen,C.,746Chen,M.-S.,18,435,509Chen,Q.,431,508,749Chen,S.,434
Chen,S.-C.,749Chen,W.Y.,696Chen,Z.,749Cheng,C.H.,431Cheng,H.,432,507Cheng,R.,437Cherkassky,V.,18,182,340Chervonenkis,A.Y.,184Cheung,D.,749Cheung,D.C.,431Cheung,D.W.,431,437Chiu,B.,747Chiu,J.,433Choudhary,A.,434,698Chrisman,N.R.,102Chu,C.,182Chu,G.,808Chuang,A.,745Chui,C.K.,431Chung,S.M.,433,508Clark,P.,340Clifton,C.,18,437Clifton,D.A.,748Clifton,L.,748Coatney,M.,435,508Cochran,W.G.,102Codd,E.F.,102Codd,S.B.,102Cohen,P.R.,183,807Cohen,W.W.,340Collingridge,D.S.,806
Contreras,P.,600,602Cook,D.J.,698Cook,R.D.,746Cooley,R.,431Cost,S.,340Cotter,A.,182Cournapeau,D.,20,184Courville,A.,340,341Courville,A.C.,341Couto,J.,430Cover,T.M.,102,341Cristianini,N.,104,341Cui,X.,181
Dabney,A.,806Dash,M.,103Datta,S.,19Davidson,I.,696Davies,L.,746Davis,R.W.,808Dayal,U.,431,508,602Dean,J.,890Demmel,J.W.,102,832,849Deng,A.,807Desrosiers,C.,18Dhillon,I.S.,600Diaz-Verdejo,J.,746Diday,E.,102Diederich,J.,340Dietterich,T.G.,341,343Ding,C.,437,696Ding,C.H.Q.,698Dokas,P.,431Domingos,P.,18,182,341Dong,G.,431Donoho,D.L.,849Doster,W.,184Dougherty,J.,102Doursat,R.,182Drummond,C.,341Dubes,R.C.,103,601,807Dubourg,V.,20,184Duchesnay,E.,20,184Duda,R.O.,18,182,341,600Duda,W.,344
Dudoit,S.,182Duin,R.P.W.,182,344DuMouchel,W.,431Dunagan,J.,746Dunham,M.H.,18,341Dunkel,B.,431
Edgington,E.,806Edwards,D.D.,344Efron,B.,182,806Elkan,C.,341,601Ellis,P.D.,806Elomaa,T.,103Erhan,D.,341Erhart,M.,698EricksonIII,D.J.,103Ernst,M.D.,806Ertöz,L.,431,696,697,748Eskin,E.,746,748Esposito,F.,182Ester,M.,600–602Everitt,B.S.,601Evfimievski,A.V.,431Ezeife,C.,508
Fürnkranz,J.,341Fabris,C.C.,431Faghmous,J.,19Faghmous,J.H.,18Faloutsos,C.,20,104,748,849Fan,J.,697Fan,W.,341Fang,G.,432,436Fang,Y.,699Fawcett,T.,344Fayyad,U.M.,18,103,438,600,696Feng,L.,432,437Feng,S.,697Fernández,S.,342Ferri,C.,341Field,B.,432Finucane,H.K.,104Fisher,D.,601,697Fisher,N.I.,432Fisher,R.A.,182,806Flach,P.,340Flach,P.A.,341Flannick,J.,507Flynn,P.J.,601Fodor,I.K.,849Fournier-Viger,P.,507Fovino,I.N.,20Fox,A.J.,746Fraley,C.,697Frank,E.,19,20,182,345Frasca,B.,807
Frawley,W.,435Freeman,D.,696Freitas,A.A.,431,432Freund,Y.,341Friedman,J.,182,342Friedman,J.H.,19,181,432,601Fu,A.,431,433Fu,A.W.-c.,749Fu,Y.,432,508Fukuda,T.,432,434,507Fukunaga,K.,182,341Furuichi,E.,435
Gada,D.,103Ganguly,A.,19Ganguly,A.R.,18,103Ganti,V.,182,697Gao,X.,601GaohuaGu,F.H.,103Garcia-Teodoro,P.,746Garofalakis,M.N.,507Garriga,G.C.,807Gather,U.,746Geatz,M.,20Gehrke,J.,18,19,104,182,431,507,696,697Geiger,D.,341Geisser,S.,182Gelman,A.,806Geman,S.,182Gersema,E.,183Gersho,A.,697Ghazzali,N.,600Ghemawat,S.,890Ghosh,A.,746Ghosh,J.,600,697,699Giannella,C.,19Gibbons,P.B.,748Gibson,D.,697Gigerenzer,G.,806Gionis,A.,746,806Glymour,C.,19Gnanadesikan,R.,746Goethals,B.,434Goil,S.,698
Goldberg,A.B.,345Golub,G.H.,832Gomariz,A.,507Good,P.,806Goodfellow,I.,341Goodfellow,I.J.,341Goodman,R.M.,344Goodman,S.,806Gorfine,M.,103Gowda,K.C.,697Grama,A.,20Gramfort,A.,20,184Graves,A.,342Gray,J.,103Gray,R.M.,697Greenland,S.,806Gries,D.,890Grimes,C.,849Grinstein,G.G.,18Grisel,O.,20,184Groenen,P.,102Grossman,R.L.,19,698Grossman,S.R.,104Guan,Y.,600Guestrin,C.,20Guha,S.,19,697Gunopulos,D.,432,437,696Guntzer,U.,432Gupta,M.,432Gupta,R.,432,508Gutiérrez,A.,18
Hagen,L.,697Haibt,L.,344Haight,R.,436Haiminen,N.,807Halic,M.,183Halkidi,M.,601Hall,D.,600Hall,L.O.,340Hall,M.,19,182Hamerly,G.,601Hamilton,H.J.,432Han,E.,432Han,E.-H.,183,342,432,508,601,697,698Han,J.,18,19,342,430,432–435,437,507–509,601,698Hand,D.J.,19,103,342,430Hardin,J.,746Hart,P.E.,18,182,341,600Hartigan,J.,601Hastie,T.,19,182,342,601Hatonen,K.,437Hawkins,D.M.,747Hawkins,S.,747He,H.,747He,Q.,433,699He,X.,437,696He,Y.,437,697Hearst,M.,342Heath,D.,182Heckerman,D.,342Heller,R.,103Heller,Y.,103
Hernando,A.,18Hernández-Orallo,J.,341Herrera,F.,508Hey,T.,19Hidber,C.,432Hilderman,R.J.,432Hinneburg,A.,697Hinton,G.,343Hinton,G.E.,342–344Hipp,J.,432Ho,C.-T.,20Hochberg,Y.,430,806Hodge,V.J.,747Hofmann,H.,432Holbrook,S.R.,437Holder,L.B.,698Holland,J.,344Holmes,G.,19,182Holt,J.D.,433Holte,R.C.,341,342Hong,J.,343Hornick,M.F.,19Houtsma,M.,433Hsieh,M.J.,509Hsu,M.,431,508,602Hsu,W.,434Hsueh,S.,433Huang,H.-K.,748Huang,T.S.,698Huang,Y.,433Hubbard,R.,807
Hubert,L.,600,602Hubert,M.,749Hulten,G.,18Hung,E.,431Hussain,F.,103Hwang,S.,433Hämäläinen,W.,807Höppner,F.,697
Iba,W.,343Imielinski,T.,430,433Inokuchi,A.,433,508Ioannidis,J.P.,807Ioffe,S.,342Irani,K.B.,103
Jagadish,H.V.,747Jager,L.R.,807Jain,A.K.,19,103,182,601,807Jajodia,S.,430Janardan,R.,849Japkowicz,N.,340,342,746,747Jardine,N.,601Jaroszewicz,S.,433Jarvis,R.A.,697Jensen,D.,104,183,807Jensen,F.V.,342Jeudy,B.,507Johannesson,M.,806John,G.H.,103Johnson,T.,747Jolliffe,I.T.,103,849Jonyer,I.,698Jordan,M.I.,342,696Joshi,A.,19Joshi,M.V.,183,342,343,508,747
Kahng,A.,697Kailing,K.,698Kallio,A.,807Kalpakis,K.,103Kamath,C.,19,181,698Kamber,M.,19,342,433,601Kantarcioglu,M.,18Kantardzic,M.,19Kao,B.,431Karafiát,M.,343Kargupta,H.,19Karpatne,A.,19Karypis,G.,18,183,342,432,433,436,508,509,601,602,697,698Kasif,S.,182,183Kass,G.V.,183Kass,R.E.,807Kaufman,L.,103,601Kawale,J.,747Kegelmeyer,P.,19,698Kegelmeyer,W.P.,340Keim,D.A.,697Kelly,J.,696Keogh,E.,747Keogh,E.J.,103Keshet,J.,182Kettenring,J.R.,746Keutzer,K.,182Khan,S.,103Khan,S.S.,747Khardon,R.,432Khoshgoftaar,T.M.,20
Khudanpur,S.,343Kifer,D.,19Kim,B.,183Kim,S.K.,182Kinney,J.B.,103Kitagawa,H.,748Kitsuregawa,M.,436Kivinen,J.,343Klawonn,F.,697Kleinberg,J.,19Kleinberg,J.M.,601,697Klemettinen,M.,433,437Klooster,S.,435,436,698Knorr,E.M.,747Kogan,J.,600Kohavi,R.,102,103,183,807Kohonen,T.,698Kolcz,A.,340,746Kong,E.B.,343Koperski,K.,432Kosters,W.A.,433Koudas,N.,747Koutra,D.,745Kröger,P.,698Kramer,S.,509Krantz,D.,103–105Kriegel,H.,430Kriegel,H.-P.,600–602,698,746,748,749Krishna,G.,697Krizhevsky,A.,342–344Krstajic,D.,183
Kruse,R.,697Kruskal,J.B.,103,849Kröger,P.,601Kubat,M.,343Kuhara,S.,435Kulkarni,S.R.,183Kumar,A.,747Kumar,V.,18,19,183,342–344,431,432,435–437,508,509,601,602,696–698,746–748,849Kuok,C.M.,433Kuramochi,M.,433,508Kwok,I.,747Kwong,W.W.,431
Lagani,V.,184Lajoie,I.,345Lakens,D.,807Lakhal,L.,435Lakshmanan,L.V.S.,434Lambert,D.,19Landau,S.,601Lander,E.S.,104Landeweerd,G.,183Landgrebe,D.,183,184Lane,T.,747Langford,J.C.,345,849Langley,P.,102,343,697Larochelle,H.,345Larsen,B.,601Lavrac,N.,343Lavrač,N.,434Law,M.H.C.,19Laxman,S.,430Layman,A.,103Lazar,N.A.,808Lazarevic,A.,431,748Leahy,D.E.,183LeCun,Y.,343Lee,D.D.,698Lee,P.,433Lee,S.D.,431,437Lee,W.,433,748Lee,Y.W.,105Leek,J.T.,807,808Leese,M.,601
Lent,B.,508Leroy,A.M.,748Lewis,D.D.,343Lewis,T.,745Li,F.,699Li,J.,431Li,K.-L.,748Li,N.,433Li,Q.,849Li,T.,698Li,W.,105,433,438Li,Y.,430Li,Z.,601,602,808Liao,W.-K.,434Liess,S.,747Lim,E.,433Lin,C.J.,696Lin,K.-I.,434,849Lin,M.,433Lin,Y.-A.,182Lindell,Y.,507Lindgren,B.W.,103Lindquist,E.F.,807Ling,C.X.,343Linoff,G.,18Lipton,Z.C.,20Liu,B.,434,437,509Liu,H.,103,104Liu,J.,434Liu,L.-M.,746Liu,R.Y.,748
Liu,Y.,433,434,601Livny,M.,699Liwicki,M.,342Llinares-López,F.,434Lonardi,S.,747Lu,C.-T.,749Lu,H.J.,432,435,437Lu,Y.,438Luce,R.D.,103–105Ludwig,J.,19Lugosi,G.,183Luo,C.,508Luo,W.,697
Ma,D.,697Ma,H.,699Ma,L.,699Ma,Y.,434Mabroukeh,N.R.,508Maciá-Fernández,G.,746MacQueen,J.,601Madden,M.G.,747Madigan,D.,19Malerba,D.,182Maletic,J.I.,434Malik,J.,698Malik,J.M.,344Mamoulis,N.,431Manganaris,S.,430Mangasarian,O.,343Mannila,H.,19,342,432,437,508,806,807Manzagol,P.-A.,341,345Mao,H.,697Mao,J.,182Maranell,G.M.,104Marchiori,E.,433Marcus,A.,434Margineantu,D.D.,343Markou,M.,748Martin,D.,508Masand,B.,431Mata,J.,508Matsuzawa,H.,434Matwin,S.,343McCullagh,P.,343
McCulloch,W.S.,343McLachlan,G.J.,181McVean,G.,104Megiddo,N.,434Mehta,M.,183,184Meilǎ,M.,602MeiraJr.,W.,20Meo,R.,434Merugu,S.,600Meyer,G.,19Meyerson,A.,19,697Michalski,R.S.,183,343,698,699Michel,V.,20,184Michie,D.,183,184Mielikäinen,T.,806Mikolov,T.,343Miller,H.J.,601Miller,R.J.,434,508Milligan,G.W.,602Mingers,J.,183Mirkin,B.,602Mirza,M.,341Mishra,N.,19,697,698Misra,J.,890Mitchell,T.,20,183,340,343,602,698Mitzenmacher,M.,104Mobasher,B.,432,697Modha,D.S.,600Moens,S.,434Mok,K.W.,433,748Molina,L.C.,104
Montgomery,D.C.,807Mooney,R.,696Moore,A.W.,602,746Moret,B.M.E.,183Morimoto,Y.,432,434,507Morishita,S.,432,507Mortazavi-Asl,B.,435,508Mosteller,F.,104,434Motoda,H.,103,104,433,508Motwani,R.,19,430,431,436,437,697Mozetic,I.,343Mueller,A.,434Muggleton,S.,343Muirhead,C.R.,748Mulier,F.,18,340Mulier,F.M.,182Mullainathan,S.,19Murphy,K.P.,183Murphy,K.R.,807Murtagh,F.,600,602,698Murthy,S.K.,183Murty,M.N.,601Muthukrishnan,S.,747Myers,C.L.,508Myneni,R.,435Myors,B.,807Müller,K.-R.,849
Nagesh,H.,698Nakhaeizadeh,G.,432Namburu,R.,19,698Naughton,J.F.,438Navathe,S.,435,509Nebot,A.,104Nelder,J.A.,343Nelson,L.D.,808Nemani,R.,435Nestorov,S.,437Neyman,J.,807Ng,A.Y.,182,696Ng,R.T.,434,698,746,747Niblett,T.,183,340Nielsen,M.A.,343Niknafs,A.,600Nishio,S.,435Niyogi,P.,849Nobel,A.B.,434Norvig,P.,344Nosek,B.A.,806Novak,P.K.,434Nuzzo,R.,807
O’Callaghan,L.,19,697Oates,T.,104Oerlemans,A.,433Ogihara,M.,105,438Ohsuga,S.,438Ojala,M.,807Olken,F.,104Olshen,R.,181Olukotun,K.,182Omiecinski,E.,435,509Onghena,P.,806Ono,T.,435Orihara,M.,438Ortega,F.,18Osborne,J.,104Ostrouchov,G.,103others,19,184,602,748,806,807Ozden,B.,435Ozgur,A.,435,748
Padmanabhan,B.,438,509Page,G.P.,181Palit,I.,183Palmer,C.R.,104Pan,S.J.,343Pandey,G.,432,508Pang,A.,434Papadimitriou,S.,20,748Papaxanthos,L.,434Pardalos,P.M.,698Parelius,J.M.,748Park,H.,849Park,J.S.,435ParrRud,O.,20Parthasarathy,S.,105,435,438,508Pasquier,N.,435Passos,A.,20,184Patrick,E.A.,697Pattipati,K.R.,184Paulsen,S.,434Pazzani,M.,341,430Pazzani,M.J.,103Pearl,J.,341,344Pearson,E.S.,807Pedregosa,F.,20,184Pei,J.,19,432,433,435,508Pelleg,D.,602Pellow,F.,103Peng,R.D.,807Perrot,M.,20,184Pesarin,F.,807
Peters,M.,698Pfahringer,B.,19,182Piatetsky-Shapiro,G.,18,20,435Pimentel,M.A.,748Pirahesh,H.,103Pison,G.,749Pitts,W.,343Platt,J.C.,748Pohlmann,N.,807Portnoy,L.,746,748Potter,C.,435,436,698Powers,D.M.,344Prasad,V.V.V.,429Pregibon,D.,19,20,431Prerau,M.,746Prettenhofer,P.,20,184Prince,M.,344Prins,J.,434Protopopescu,V.,103Provenza,L.P.,20Provost,F.J.,104,344Psaila,G.,434Pujol,J.M.,18Puttagunta,V.,103Pyle,D.,20
Quinlan,J.R.,184,344
Raftery,A.E.,697,807Raghavan,P.,696,697Rakhshani,A.,184Ramakrishnan,N.,20Ramakrishnan,R.,18,104,182,697,699Ramaswamy,S.,435,748Ramkumar,G.D.,435Ramoni,M.,344Ranka,S.,181,435Rao,N.,508Rastogi,R.,507,697,748Reddy,C.K.,183,600,602,808Redman,T.C.,104Rehmsmeier,M.,344Reichart,D.,103Reina,C.,696Reisende,M.G.C.,698Renz,M.,430Reshef,D.,104Reshef,D.N.,104Reshef,Y.,104Reshef,Y.A.,104Reutemann,P.,19,182Ribeiro,M.T.,20Richter,L.,509Riondato,M.,435Riquelme,J.C.,508Rissanen,J.,183Rivest,R.L.,184Robinson,D.,806Rochester,N.,344
Rocke,D.M.,746,748Rogers,S.,699Roiger,R.,20Romesburg,C.,602Ron,D.,698Ronkainen,P.,437Rosenblatt,F.,344Rosenthal,A.,437Rosete,A.,508Rosner,B.,748Rotem,D.,104Rousseeuw,P.J.,103,601,748Rousu,J.,103Roweis,S.T.,849Ruckert,U.,509Runkler,T.,697Russell,S.J.,344Ruts,I.,748
Sabeti,P.,104Sabeti,P.C.,104Sabripour,M.,181Safavian,S.R.,184Sahami,M.,102Saigal,S.,103Saito,T.,344Salakhutdinov,R.,344Salakhutdinov,R.R.,342Salmaso,L.,807Salzberg,S.,182,183,340Samatova,N.,18,19Sander,J.,600–602,746Sarawagi,S.,435Sarinnapakorn,K.,749Satou,K.,435Saul,L.K.,849Savaresi,S.M.,602Savasere,A.,435,509Saygin,Y.,20Schölkopf,B.,344Schafer,J.,20Schaffer,C.,184Schapire,R.E.,340,341Scheuermann,P.,436Schikuta,E.,698Schmidhuber,J.,342,344Schroeder,M.R.,698Schroedl,S.,699Schubert,E.,748,749Schuermann,J.,184
Schwabacher,M.,746Schwartzbard,A.,746Schwarz,G.,184Schölkopf,B.,104,748,849Scott,D.W.,749Sebastiani,P.,344Self,M.,696Semeraro,G.,182Sendhoff,B.,696Seno,M.,436,509Settles,B.,344Seung,H.S.,698Shafer,J.C.,184,429Shasha,D.E.,20Shawe-Taylor,J.,104,341,748Sheikholeslami,G.,698Shekhar,S.,18,19,433,435,437,749Shen,W.,509Shen,Y.,431Sheng,V.S.,343Shi,J.,698Shi,Z.,433Shibayama,G.,435Shim,K.,507,697,748Shinghal,R.,433Shintani,T.,436Shu,C.,699Shyu,M.-L.,749Sibson,R.,601Siebes,A.P.J.M.,432Siegmund,D.,808
Silberschatz,A.,435,436Silva,V.d.,849Silver,N.,808Silverstein,C.,430,436Simmons,J.P.,808Simon,H.,696Simon,N.,104Simon,R.,184Simonsohn,U.,808Simovici,D.,433Simpson,E.-H.,436Singer,Y.,340Singh,K.,748Singh,L.,436Singh,S.,20,748Singh,V.,181Sivakumar,K.,19Smalley,C.T.,102Smith,A.D.,430Smola,A.J.,104,343,344,748,849Smyth,P.,18–20,342,344Sneath,P.H.A.,104,602Soete,G.D.,600,602Sokal,R.R.,104,602Song,Y.,696Soparkar,N.,431Speed,T.,104Spiegelhalter,D.J.,183Spiliopoulou,M.,431Späth,H.,602Srebro,N.,182
Srikant,R.,18,104,430,431,434,436,507,509Srivastava,J.,431,436,509,748Srivastava,N.,342,344Steinbach,M.,18,19,183,344,432,435–437,508,602,696–698,747Stepp,R.E.,698,699Stevens,S.S.,104Stolfo,S.J.,341,433,746,748Stone,C.J.,181Stone,M.,184Storey,J.D.,806,808Stork,D.G.,18,182,341,600Strang,G.,832Strehl,A.,699Strimmer,K.,808Struyf,A.,749Stutz,J.,696Su,X.,20Suen,C.Y.,185Sugiyama,M.,434Sun,S.,344Sun,T.,699Sun,Z.,430Sundaram,N.,182Suppes,P.,103–105Sutskever,I.,342–344Suzuki,E.,436Svensen,M.,696Swami,A.,430,433,508Swaminathan,R.,698Sykacek,P.,749Szalay,A.S.,746
Szegedy,C.,342
Takagi,T.,435Tan,C.L.,103Tan,H.,697Tan,P.-N.,344,698Tan,P.N.,183,431,435–437,509Tang,J.,749Tang,S.,435Tansley,S.,19Tao,D.,345Taouil,R.,435Tarassenko,L.,748Tatti,N.,437Tax,D.M.J.,344Tay,S.H.,437,509Taylor,C.C.,183Taylor,J.E.,808Taylor,W.,696Tenenbaum,J.B.,849Teng,W.G.,509Thakurta,A.,430Theodoridis,Y.,20Thirion,B.,20,184Thomas,J.A.,102Thomas,S.,183,435Thompson,K.,343Tian,S.-F.,748Tibshirani,R.,19,104,182,184,342,344,601,806,808Tibshirani,R.J.,184Tickle,A.,340Timmers,T.,183Toivonen,H.,20,105,432,437,508
Tokuyama,T.,432,434,507Tolle,K.M.,19Tompkins,R.G.,808Tong,H.,745Torregrosa,A.,436Tsamardinos,I.,184Tsaparas,P.,806Tseng,V.S.,507Tsoukatos,I.,437Tsur,S.,431,435,437Tucakov,V.,747Tukey,J.W.,104,105,748Tung,A.,437,601Turnbaugh,P.J.,104Tusher,V.,806Tusher,V.G.,808Tuzhilin,A.,18,436,438,509Tversky,A.,103–105Tzvetkov,P.,509
Ullman,J.,431,437Uslaner,E.M.,103Utgoff,P.E.,184Uthurusamy,R.,18
Vaidya,J.,18,437Valiant,L.,184vanAssen,M.A.,807vanRijsbergen,C.J.,344vanZomeren,B.C.,748Vanderplas,J.,20,184Vandin,F.,435vanderLaan,M.J.,182VanLoan,C.F.,832Vapnik,V.,345Vapnik,V.N.,184Varma,S.,184Varoquaux,G.,20,184Vassilvitskii,S.,600Vazirgiannis,M.,601Velleman,P.F.,105Vempala,S.,746Venkatesh,S.S.,183Venkatrao,M.,103Verhein,F.,430Verkamo,A.I.,508Verma,T.S.,341Verykios,V.S.,20Vincent,P.,340,341,345Virmani,A.,433Vitter,J.S.,890vonLuxburg,U.,699vonSeelen,W.,696vonderMalsburg,C.,696Vorbruggen,J.C.,696Vreeken,J.,808
Vu,Q.,436Vuokko,N.,807Vázquez,E.,746
Wagenmakers,E.-J.,806Wagstaff,K.,696,699Wainwright,M.,342Walker,T.,807Wang,H.,184Wang,J.,430,509Wang,J.T.L.,20Wang,K.,437,509Wang,L.,437Wang,Q.,19Wang,Q.R.,185Wang,R.Y.,105Wang,W.,432,434Wang,Y.R.,105Warde-Farley,D.,341Washio,T.,433,508Wasserstein,R.L.,808Webb,A.R.,20,345Webb,G.I.,434,437,509,808Weiss,G.M.,345Weiss,R.,20,184Welch,W.J.,808Werbos,P.,345Widmer,G.,341Widom,J.,508Wierse,A.,18Wilhelm,A.F.X.,432Wilkinson,L.,105Williams,C.K.I.,696Williams,G.J.,747Williamson,R.C.,343,748
Wimmer,M.,600Wish,M.,849Witten,I.H.,19,20,182,345Wojdanowski,R.,748Wolach,A.,807Wong,M.H.,433Woodruff,D.L.,748Wu,C.-W.,507Wu,J.,601Wu,N.,430Wu,S.,601Wu,X.,20,344,509Wunsch,D.,602
Xiang,D.,748Xiao,W.,808Xin,D.,432Xiong,H.,433,436,437,601,602,808,849Xu,C.,345Xu,R.,602Xu,W.,748Xu,X.,600–602Xu,Y.,807
Yamamura,Y.,435Yan,X.,19,432,437,507,509Yang,C.,438Yang,Q.,343,431Yang,Y.,185,434,508Yao,Y.Y.,438Ye,J.,849Ye,N.,104,697,749Yesha,Y.,19Yin,Y.,432Yiu,T.,507Yoda,K.,434Yu,H.,436,699Yu,J.X.,432Yu,L.,104Yu,P.S.,18–20,430,435,745Yu,Y.,182
Zaïane,O.R.,432,507Zadrozny,B.,345Zahn,C.T.,602Zaki,M.J.,20,105,438,509,698Zaniolo,C.,184Zeng,C.,438Zeng,L.,433Zhang,A.,698Zhang,B.,438,602Zhang,C.,509Zhang,F.,438Zhang,H.,438,509Zhang,J.,341Zhang,M.-L.,345Zhang,N.,19Zhang,P.,749Zhang,S.,509Zhang,T.,699Zhang,Y.,185,437,438Zhang,Z.,438Zhao,W.,699Zhao,Y.,602Zhong,N.,438Zhou,Z.-H.,345Zhu,H.,435Zhu,X.,345Ziad,M.,105Zimek,A.,601,748,749Züfle,A.,430
Subject Index
k-nearestneighborgraph,657,663,664
accuracy,119,196activationfunction,251AdaBoost,306aggregation,51–52anomalydetection
applications,703–704clustering-based,724–728
example,726impactofoutliers,726membershipinacluster,725numberofclusters,728strengthsandweaknesses,728
definition,705–706definitions
distance-based,719density-based,720–724deviationdetection,703exceptionmining,703outliers,703proximity-based
distance-based,seeanomalydetection,distance-basedrelativedensity,722–723
example,723statistical,710–719
Gaussian,710Grubbs,751likelihoodapproach,715multivariate,712strengthsandweaknesses,718
techniques,708–709Apriori
algorithm,364principle,363
associationanalysis,357categoricalattributes,451continuousattributes,454indirect,503pattern,358rule,seerule
attribute,26–33definitionof,27numberofvalues,32type,27–32
asymmetric,32–33binary,32continuous,30,32discrete,32generalcomments,33–34interval,29,30nominal,29,30ordinal,29,30qualitative,30quantitative,30ratio,29
avoidingfalsediscoveries,755–806considerationsforanomalydetection,800–803considerationsforassociationanalysis,787randomization,793–795considerationsforclassification,783–787considerationsforclusteranalysis,795–800generatinganulldistribution,776–783
permutationtest,781randomization,781
hypothesistesting,seehypothesistestingmultiplehypothesistesting,seeFalseDiscoveryRateproblemswithsignificanceandhypothesistesting,778
axon,249
backpropagation,258bagging,seeclassifierBayes
naive,seeclassifiernetwork,seeclassifiertheorem,214
biasvariancedecomposition,300binarization,seediscretization,binarization,452,455BIRCH,684–686BonferroniProcedure),768boosting,seeclassifierBregmandivergence,94–95
candidategeneration,367,368,471,487itemset,362pruning,368,472,493rule,381sequence,468
case,seeobjectchameleon,660–666
algorithm,664–665graphpartitioning,664,665mergingstrategy,662relativecloseness,663relativeinterconnectivity,663self-similarity,656,661,663–665strengthsandlimitations,666
characteristic,seeattributecityblockdistance,seedistance,cityblockclass
imbalance,313classification
classlabel,114evaluation,119
classifierbagging,302base,296Bayesianbelief,227boosting,305combination,296decisiontree,119ensemble,296logisticregression,243
maximalmargin,278naive-Bayes,218nearestneighbor,208neuralnetworks,249perceptron,250probabilistic,212randomforest,310Rote,208rule-based,195supportvectormachine,276unstable,300
climateindices,680clusteranalysis
algorithmcharacteristics,619–620mappingtoanotherdomain,620nondeterminism,619optimization,620orderdependence,619parameterselection,seeparameterselectionscalability,seescalability
applications,525–527asanoptimizationproblem,620chameleon,seechameleonchoosinganalgorithm,690–693clustercharacteristics,617–618
datadistribution,618density,618poorlyseparated,618relationships,618shape,618size,618
subspace,618clusterdensity,618clustershape,548,618clustersize,618datacharacteristics,615–617
attributetypes,617datatypes,617high-dimensionality,616mathematicalproperties,617noise,616outliers,616scale,617size,616sparseness,616
DBSCAN,seeDBSCANdefinitionof,525,528DENCLUE,seeDENCLUEdensity-basedclustering,644–656fuzzyclustering,seefuzzyclusteringgraph-basedclustering,656–681
sparsification,657–658grid-basedclustering,seegrid-basedclusteringhierarchical,seehierarchicalclustering
CURE,seeCURE,seeCUREminimumspanningtree,658–659
Jarvis-Patrick,seeJarvis-PatrickK-means,seeK-meansmixturemodes,seemixturemodelsopossum,seeopossumparameterselection,567,587,619prototype-basedclustering,621–644
seesharednearestneighbor,density-basedclustering,679self-organizingmaps,seeself-organizingmapsspectralclustering,666subspaceclustering,seesubspaceclusteringsubspaceclusters,618typesofclusterings,529–531
complete,531exclusive,530fuzzy,530hierarchical,529overlapping,530partial,531partitional,529
typesofclusters,531–533conceptual,533density-based,532graph-based,532prototype-based,531well-separated,531
validation,seeclustervalidationclustervalidation,571–597
assessmentofmeasures,594–596clusteringtendency,571,588cohesion,574–579copheneticcorrelation,586forindividualclusters,581forindividualobjects,581hierarchical,585,594numberofclusters,587relativemeasures,574separation,574–578
silhouettecoefficient,581–582supervised,589–594
classificationmeasures,590–592similaritymeasures,592–594
supervisedmeasures,573unsupervised,574–589unsupervisedmeasures,573withproximitymatrix,582–585
codeword,332compactionfactor,400concepthierarchy,462conditionalindependence,229confidence
factor,196level,857measure,seemeasure
confusionmatrix,118constraint
maxgap,475maxspan,474mingap,475timing,473windowsize,476
contingencytable,402correlation
ϕ-coefficient,406coverage,196criticalregion,seehypothesistesting,criticalregioncross-validation,165CURE,686–690
algorithm,686,688
clusterfeature,684clusteringfeaturetree,684useofpartitioning,689–690useofsampling,688–689
curseofdimensionality,292
dag,seegraphdata
attribute,seeattributeattributetypes,617cleaning,seedataquality,datacleaningdistribution,618high-dimensional,616
problemswithsimilarity,673marketbasket,357mathematicalproperties,617noise,616object,seeobjectoutliers,616preprocessing,seepreprocessingquality,seedataqualityscale,617set,seedatasetsimilarity,seesimilaritysize,616sparse,616transformations,seetransformationstypes,617typesof,23,26–42
dataquality,23,42–50applicationissues,49–50
datadocumentation,50relevance,49timliness,49
datacleaning,42errors,43–48
accuracy,45
artifacts,44bias,45collection,43duplicatedata,48inconsistentvalues,47–48measurment,43missingvalues,46–47noise,43–44outliers,46precision,45significantdigits,45
dataset,26characteristics,34–35
dimensionality,34resolution,35sparsity,34
typesof,34–42graph-based,37–38matrix,seematrixordered,38–41record,35–37sequence,40sequential,38spatial,41temporal,38timeseries,39transaction,36
DBSCAN,565–569algorithm,567comparisontoK-means,614–615complexity,567
definitionofdensity,565parameterselection,567typesofpoints,566
border,566core,566noise,566
decisionboundary,146list,198stump,303tree,seeclassifier
deduplication,48DENCLUE,652–656
algorithm,653implementationissues,654kerneldensityestimation,654strengthsandlimitations,654
dendrite,249density
centerbased,565dimension,seeattributedimensionality
curse,57dimensionalityreduction,56–58,833–848
factoranalysis,840–842FastMap,845ISOMAP,845–847issues,847–848LocallyLinearEmbedding,842–844multidimensionalscaling,844–845PCA,58
SVD,58discretization,63–69,221
association,seeassociationbinarization,64–65clustering,456equalfrequency,456equalwidth,456ofbinaryattributes,seediscretization,binarizationofcategoricalvariables,68–69ofcontinuousattributes,65–68
supervised,66–68unsupervised,65–66
dissimilarity,76–78,94–95choosing,98–100definitionof,72distance,seedistancenon-metric,77transformations,72–75
distance,76–77cityblock,76Euclidean,76,822Hamming,332L1norm,76L2norm,76L∞,76Lmax,76Mahalanobis,96Manhattan,76metric,77
positivity,77symmetry,77
triangleinequality,77Minkowski,76–77supremum,76
distributionbinomial,162Gaussian,162,221
eagerlearner,seelearneredge,480effectsize,seehypothesistesting,effectsizeelement,466EMalgorithm,631–637ensemblemethod,seeclassifierentity,seeobjectentropy,67,128
useindiscretization,seediscretization,supervisederror
error-correctingoutputcoding,331generalization,156pessimistic,158
errorrate,119estimateerror,164Euclideandistance,seedistance,Euclideanevaluation
association,401exhaustive,198
factoranalysis,seedimensionalityreduction,factoranalysisFalseDiscoveryRate,778
Benjamini-HochbergFDR,772family-wiseerrorrate,768FastMap,seedimensionalityreduction,FastMapfeature
irrelevant,144featurecreation,61–63
featureextraction,61–62mappingdatatoanewspace,62–63
featureextraction,seefeaturecreation,featureextractionfeatureselection,58–61
architecturefor,59–60featureweighting,61irrelevantfeatures,58redundantfeatures,58typesof,58–59
embedded,58filter,59wrapper,59
field,seeattributeFouriertransform,62FP-growth,393FP-tree,seetreefrequentsubgraph,479fuzzyclustering,621–626
fuzzyc-means,623–626algorithm,623centroids,624example,626initialization,624
SSE,624strenthsandlimitations,626weightupdate,625
fuzzysets,622fuzzypsuedo-partition,623
gainratio,135generalization,seeruleginiindex,128graph,480
connected,484directedacyclic,462Laplacian,667undirected,484
grid-basedclustering,644–648algorithm,645clusters,646density,645gridcells,645
hierarchicalclustering,554–565agglomerativealgorithm,555centroidmethods,562clusterproximity,555
Lance-Williamsformula,562completelink,555,558–559complexity,556groupaverage,555,559–560inversions,562MAX,seecompletelinkmergingdecisions,564MIN,seesinglelinksinglelink,555,558Ward’smethod,561
high-dimensionalityseedata,high-dimensional,616
holdout,165hypothesis
alternative,459,858null,459,858
hypothesistesting,761criticalregion,763effectsize,766power,764TypeIerror,763TypeIIerror,764
independenceconditional,218
informationgainentropy-based,131FOIL’s,201
interest,seemeasureISOMAP,seedimensionalityreduction,ISOMAPisomorphism
definition,481item,seeattribute,358
competing,494negative,494
itemset,359candidate,seecandidateclosed,386maximal,384
Jarvis-Patrick,676–678algorithm,676example,677strengthsandlimitations,677
K-means,534–553algorithm,535–536bisecting,547–548centroids,537,539
choosinginitial,539–544comparisontodBSCAN,614–615complexity,544derivation,549–553emptyclusters,544incremental,546K-means++,543–544limitations,548–549objectivefunctions,537,539outliers,545reducingSEE,545–546
kerneldensityestimation,654kernelfunction,90–94
L1norm,seedistance,L1normL2norm,seedistance,L2normLagrangian,280lazylearner,seelearnerlearner
eager,208,211lazy,208,211
leastsquares,831leave-one-out,167lexicographicorder,371linearalgebra,817–832
matrix,seematrixvector,seevector
linearregression,831linearsystemsofequations,831lineartransformation,seematrix,lineartransformationLocallyLinearEmbedding,seedimensionalityreduction,LocallyLinearEmbedding
m-estimate,224majorityvoting,seevotingManhattandistance,seedistance,Manhattanmargin
soft,284marketbasketdata,seedatamatrix,37,823–829
addition,824–825columnvector,824confusion,seeconfusionmatrixdefinition,823–824document-term,37eigenvalue,829eigenvaluedecomposition,829–830eigenvector,829indataanalysis,831–832inverse,828–829linearcombination,835lineartransformations,827–829
columnspace,828leftnullspace,828nullspace,828projection,827reflection,827rotation,827rowspace,828scaling,827
multiplication,825–827positivesemidefinite,835rank,828rowvector,824
scalarmultiplication,825singularvalue,830singularvaluedecomposition,830singularvector,830sparse,37
maxgap,seeconstraintmaximumlikelihoodestimation,629–631maxspan,seeconstraintMDL,160mean,222measure
confidence,360consistency,408interest,405IS,406objective,401properties,409support,360symmetric,414
measurement,27–32definitionof,27scale,27
permissibletransformations,30–31types,27–32
metricaccuracy,119
metricsclassification,119
min-Apriori,461mingap,seeconstraintminimumdescriptionlength,seeMDL
missingvalues,seedataquality,errors,missingvaluesmixturemodels,627–637
advantagesandlimitations,637definitionof,627–629EMalgorithm,631–637maximumlikelihoodestimation,629–631
modelcomparison,173descriptive,116generalization,118overfitting,147predictive,116selection,156
modelcomplexityOccam’sRazor
AIC,157BIC,157
monotonicity,364multiclass,330multidimensionalscaling,seedimensionalityreduction,multidimensionalscalingmultiplecomparison,seeFalseDiscoveryRatemultiplehypothesistesting,seeFalseDiscoveryRate
family-wiseerrorrate,seefamily-wiseerrorratemutualexclusive,198mutualinformation,88–89
nearestneighborclassifier,seeclassifiernetwork
Bayesian,seeclassifiermultilayer,seeclassifierneural,seeclassifier
neuron,249node
internal,120leaf,120non-terminal,120root,120terminal,120
noise,211nulldistribution,758nullhypothesis,757
object,26observation,seeobjectOccam’srazor,157OLAP,51opposum,659–660
algorithm,660strengthsandweaknesses,660
outliers,seedataqualityoverfitting,seemodel,149
p-value,759pattern
cross-support,420hyperclique,423infrequent,493negative,494negativelycorrelated,495,496sequential,seesequentialsubgraph,seesubgraph
PCA,833–836examples,836mathematics,834–835
perceptron,seeclassifierpermutationtest,781Piatesky-Shapiro
PS,405point,seeobjectpower,seehypothesistesting,powerPrecision-RecallCurve,328precondition,195preprocessing,23,50–71
aggregation,seeaggregationbinarization,seediscretization,binarizationdimensionalityreduction,56discretization,seediscretizationfeaturecreation,seefeaturecreationfeatureselection,seefeatureselectionsampling,seesamplingtransformations,seetransformations
proximity,71–100choosing,98–100
cluster,555definitionof,71dissimilarity,seedissimilaritydistance,seedistanceforsimpleattributes,74–75issues,96–98
attributeweights,98combiningproximities,97–98correlation,96standardization,96
similarity,seesimilaritytransformations,72–74
pruningpost-pruning,163prepruning,162
randomforestseeclassifier,310
randomization,781associationpatterns,793–795
ReceiverOperatingCharacteristiccurve,seeROCrecord,seeobjectreducederrorpruning,189,346regression
logistic,243ROC,323Roteclassifier,seeclassifierrule
antecedent,195association,360candidate,seecandidateconsequent,195generalization,458generation,205,362,380,458ordered,198ordering,206pruning,202quantitative,454
discretization-based,454non-discretization,460statistics-based,458
redundant,458specialization,458validation,459
ruleset,195
sample,seeobjectsampling,52–56,314
approaches,53–54progressive,55–56random,53stratified,54withreplacement,54withoutreplacement,54
samplesize,54–55scalability
clusteringalgorithms,681–690BIRCH,684–686CURE,686–690generalissues,681–684
segmentation,529self-organizingmaps,637–644
algorithm,638–641applications,643strengthsandlimitations,643
sensitivity,319sequence
datasequence,468definition,466
sequentialpattern,464patterndiscovery,468timingconstraints,seeconstraint
sequentialcovering,199sharednearestneighbor,656
density,678–679density-basedclustering,679–681
algorithm,680example,680strengthsandlimitations,681
principle,657similarity,673–676
computation,675differencesindensity,674versusdirectsimilarity,676
significancelevel,859
significancetesting,761nulldistribution,seenulldistributionnullhypothesis,seenullhypothesisp-value,seep-valuestatisticalsignificance,seestatisticalsignificance
similarity,24,78–85choosing,98–100correlation,83–85cosine,81–82,822definitionof,72differences,85–88extendedJaccard,83Jaccard,80–81kernelfunction,90–94mutualinformation,88–89sharednearestneighbor,seesharednearestneighbor,similaritysimplematchingcoefficient,80Tanimoto,83transformations,72–75
Simpson’sparadox,416softsplitting,178
SOM,618,seeself-organizingmapsspecialization,seerulesplitinformation,135statisticalsignificance,760statistics
covarinacematrix,834subgraph
core,487definition,482pattern,479support,seesupport
subsequence,467contiguous,475
subspaceclustering,648–652CLIQUE,650
algorithm,651monotonicityproperty,651strengthsandlimitations,652
example,648subtree
replacement,163support
count,359counting,373,473,477,493limitation,402measure,seemeasurepruning,364sequence,468subgraph,483
supportvector,276supportvectormachine,seeclassifier
SVD,838–840example,838–840mathematics,838
SVM,seeclassifiernonlinear,290
svmnon-separable,284
synapse,249
taxonomy,seeconcepthierarchytransaction,358
extended,463width,379
transformations,69–71betweensimilarityanddissimilarity,72–74normalization,70–71simplefunctions,70standardization,70–71
treeconditionalFP-tree,398decision,seeclassifierFP-tree,394hash,375oblique,146
triangleinequality,77truepositive,319TypeIerror,seehypothesistesting,TypeIerrorTypeIIerror,seehypothesistesting,TypeIIerror
underfitting,149universalapproximator,261
variable,seeattributevariance,222vector,817–823
addition,817–818column,seematrix,columnvectordefinition,817dotproduct,820–822indataanalysis,822–823linearindependence,821–822mean,823mulitplicationbyascalar,818–819norm,820orthogonal,819–821orthogonalprojection,821row,seematrix,rowvectorspace,819–820basis,819dimension,819independentcomponents,819linearcombination,819span,819
vectorquantization,527vertex,480voting
distance-weighted,210majority,210
wavelettransform,63webcrawler,138windowsize,seeconstraint
Copyright Permissions

Some figures and part of the text of Chapter 8 originally appeared in the article "Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data," Levent Ertöz, Michael Steinbach, and Vipin Kumar, Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, May 1–3, 2003, SIAM. © 2003, SIAM.

Some figures and part of the text of Chapter 5 appeared in the article "Selecting the Right Objective Measure for Association Analysis," Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava, Information Systems, 29(4), 293–313, 2004, Elsevier. © 2004, Elsevier.

Some of the figures and text of Chapter 8 appeared in the article "Discovery of Climate Indices Using Clustering," Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Steven Klooster, and Christopher Potter, KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 446–455, Washington, DC, August 2003, ACM. © 2003, ACM, INC. DOI=http://doi.acm.org/10.1145/956750.956801

Some of the figures (1–7, 13) and text of Chapter 7 originally appeared in the chapter "The Challenge of Clustering High-Dimensional Data," Levent Ertoz, Michael Steinbach, and Vipin Kumar, in New Directions in Statistical Physics, Econophysics, Bioinformatics, and Pattern Recognition, 273–312, Editor, Luc Wille, Springer, ISBN 3-540-43182-9. © 2004, Springer-Verlag.

Some of the figures and text of Chapter 8 originally appeared in the article "Chameleon: Hierarchical Clustering Using Dynamic Modeling," by George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar, IEEE Computer, Volume 32(8), 68–75, August, 1999, IEEE. © 1999, IEEE.
Contents1. INTRODUCTIONTODATAMINING2. INTRODUCTIONTODATAMINING3. PrefacetotheSecondEdition
A. OverviewB. WhatisNewintheSecondEdition?C. TotheInstructorD. SupportMaterials
4. Contents5. 1Introduction
A. 1.1WhatIsDataMining?B. 1.2MotivatingChallengesC. 1.3TheOriginsofDataMiningD. 1.4DataMiningTasksE. 1.5ScopeandOrganizationoftheBookF. 1.6BibliographicNotesG. BibliographyH. 1.7Exercises
6. 2DataA. 2.1TypesofData
1. 2.1.1AttributesandMeasurementa. WhatIsanAttribute?b. TheTypeofanAttributec. TheDifferentTypesofAttributesd. DescribingAttributesbytheNumberofValuese. AsymmetricAttributes
f. GeneralCommentsonLevelsofMeasurement
2. 2.1.2TypesofDataSetsa. GeneralCharacteristicsofDataSets
a. Dimensionalityb. Distributionc. Resolution
b. RecordDataa. TransactionorMarketBasketDatab. TheDataMatrixc. TheSparseDataMatrix
c. Graph-BasedDataa. DatawithRelationshipsamongObjectsb. DatawithObjectsThatAreGraphs
d. OrderedDataa. SequentialTransactionDatab. TimeSeriesDatac. SequenceDatad. SpatialandSpatio-TemporalData
e. HandlingNon-RecordData
B. 2.2DataQuality1. 2.2.1MeasurementandDataCollectionIssues
a. MeasurementandDataCollectionErrorsb. NoiseandArtifactsc. Precision,Bias,andAccuracyd. Outliers
e. MissingValuesa. EliminateDataObjectsorAttributesb. EstimateMissingValuesc. IgnoretheMissingValueduringAnalysis
f. InconsistentValuesg. DuplicateData
2. 2.2.2IssuesRelatedtoApplications
C. 2.3DataPreprocessing1. 2.3.1Aggregation2. 2.3.2Sampling
a. SamplingApproachesb. ProgressiveSampling
3. 2.3.3DimensionalityReductiona. TheCurseofDimensionalityb. LinearAlgebraTechniquesforDimensionalityReduction
4. 2.3.4FeatureSubsetSelectiona. AnArchitectureforFeatureSubsetSelectionb. FeatureWeighting
5. 2.3.5FeatureCreationa. FeatureExtractionb. MappingtheDatatoaNewSpace
6. 2.3.6DiscretizationandBinarizationa. Binarizationb. DiscretizationofContinuousAttributes
a. UnsupervisedDiscretizationb. SupervisedDiscretization
c. CategoricalAttributeswithTooManyValues
7. 2.3.7VariableTransformationa. SimpleFunctionsb. NormalizationorStandardization
D. 2.4MeasuresofSimilarityandDissimilarity1. 2.4.1Basics
a. Definitionsb. Transformations
2. 2.4.2SimilarityandDissimilaritybetweenSimpleAttributes3. 2.4.3DissimilaritiesbetweenDataObjects
a. Distances
4. 2.4.4SimilaritiesbetweenDataObjects5. 2.4.5ExamplesofProximityMeasures
a. SimilarityMeasuresforBinaryDataa. SimpleMatchingCoefficientb. JaccardCoefficient
b. CosineSimilarityc. ExtendedJaccardCoefficient(TanimotoCoefficient)d. Correlatione. DifferencesAmongMeasuresForContinuousAttributes
6. 2.4.6MutualInformation7. 2.4.7KernelFunctions*
8. 2.4.8BregmanDivergence*9. 2.4.9IssuesinProximityCalculation
a. StandardizationandCorrelationforDistanceMeasuresb. CombiningSimilaritiesforHeterogeneousAttributesc. UsingWeights
10. 2.4.10SelectingtheRightProximityMeasure
E. 2.5BibliographicNotesF. BibliographyG. 2.6Exercises
7. 3Classification:BasicConceptsandTechniquesA. 3.1BasicConceptsB. 3.2GeneralFrameworkforClassificationC. 3.3DecisionTreeClassifier
1. 3.3.1ABasicAlgorithmtoBuildaDecisionTreea. Hunt'sAlgorithmb. DesignIssuesofDecisionTreeInduction
2. 3.3.2MethodsforExpressingAttributeTestConditions3. 3.3.3MeasuresforSelectinganAttributeTestCondition
a. ImpurityMeasureforaSingleNodeb. CollectiveImpurityofChildNodesc. Identifyingthebestattributetestconditiond. SplittingofQualitativeAttributese. BinarySplittingofQualitativeAttributesf. BinarySplittingofQuantitativeAttributesg. GainRatio
4. 3.3.4AlgorithmforDecisionTreeInduction
5. 3.3.5ExampleApplication:WebRobotDetection6. 3.3.6CharacteristicsofDecisionTreeClassifiers
D. 3.4ModelOverfitting1. 3.4.1ReasonsforModelOverfitting
a. LimitedTrainingSizeb. HighModelComplexity
E. 3.5ModelSelection1. 3.5.1UsingaValidationSet2. 3.5.2IncorporatingModelComplexity
a. EstimatingtheComplexityofDecisionTreesb. MinimumDescriptionLengthPrinciple
3. 3.5.3EstimatingStatisticalBounds4. 3.5.4ModelSelectionforDecisionTrees
F. 3.6ModelEvaluation1. 3.6.1HoldoutMethod2. 3.6.2Cross-Validation
G. 3.7PresenceofHyper-parameters1. 3.7.1Hyper-parameterSelection2. 3.7.2NestedCross-Validation
H. 3.8PitfallsofModelSelectionandEvaluation1. 3.8.1OverlapbetweenTrainingandTestSets2. 3.8.2UseofValidationErrorasGeneralizationError
I. 3.9ModelComparison1. 3.9.1EstimatingtheConfidenceIntervalforAccuracy
*
2. 3.9.2ComparingthePerformanceofTwoModels
J. 3.10BibliographicNotesK. BibliographyL. 3.11Exercises
8. 4Classification:AlternativeTechniquesA. 4.1TypesofClassifiersB. 4.2Rule-BasedClassifier
1. 4.2.1HowaRule-BasedClassifierWorks2. 4.2.2PropertiesofaRuleSet3. 4.2.3DirectMethodsforRuleExtraction
a. Learn-One-RuleFunctiona. RulePruningb. BuildingtheRuleSet
b. InstanceElimination
4. 4.2.4IndirectMethodsforRuleExtraction5. 4.2.5CharacteristicsofRule-BasedClassifiers
C. 4.3NearestNeighborClassifiers1. 4.3.1Algorithm2. 4.3.2CharacteristicsofNearestNeighborClassifiers
D. 4.4NaïveBayesClassifier1. 4.4.1BasicsofProbabilityTheory
a. BayesTheoremb. UsingBayesTheoremforClassification
2. 4.4.2NaïveBayesAssumption
a. ConditionalIndependenceb. HowaNaïveBayesClassifierWorks
c.EstimatingConditionalProbabilitiesforCategoricalAttributes
d.EstimatingConditionalProbabilitiesforContinuousAttributes
e. HandlingZeroConditionalProbabilitiesf. CharacteristicsofNaïveBayesClassifiers
E. 4.5BayesianNetworks1. 4.5.1GraphicalRepresentation
a. ConditionalIndependenceb. JointProbabilityc. UseofHiddenVariables
2. 4.5.2InferenceandLearninga. VariableEliminationb. Sum-ProductAlgorithmforTreesc. GeneralizationsforNon-TreeGraphsd. LearningModelParameters
3. 4.5.3CharacteristicsofBayesianNetworks
F. 4.6LogisticRegression1. 4.6.1LogisticRegressionasaGeneralizedLinearModel2. 4.6.2LearningModelParameters3. 4.6.3CharacteristicsofLogisticRegression
G. 4.7ArtificialNeuralNetwork(ANN)1. 4.7.1Perceptron
a. LearningthePerceptron
2. 4.7.2Multi-layerNeuralNetworka. LearningModelParameters
3. 4.7.3CharacteristicsofANN
H. 4.8DeepLearning1. 4.8.1UsingSynergisticLossFunctions
a. SaturationofOutputsb. Crossentropylossfunction
2. 4.8.2UsingResponsiveActivationFunctionsa. VanishingGradientProblemb. RectifiedLinearUnits(ReLU)
3. 4.8.3Regularizationa. Dropout
4. 4.8.4InitializationofModelParametersa. SupervisedPretrainingb. UnsupervisedPretrainingc. UseofAutoencodersd. HybridPretraining
5. 4.8.5CharacteristicsofDeepLearning
I. 4.9SupportVectorMachine(SVM)1. 4.9.1MarginofaSeparatingHyperplane
a. RationaleforMaximumMargin
2. 4.9.2LinearSVMa. LearningModelParameters
3. 4.9.3Soft-marginSVMa. SVMasaRegularizerofHingeLoss
4. 4.9.4NonlinearSVMa. AttributeTransformationb. LearningaNonlinearSVMModel
5. 4.9.5CharacteristicsofSVM
J. 4.10EnsembleMethods1. 4.10.1RationaleforEnsembleMethod2. 4.10.2MethodsforConstructinganEnsembleClassifier3. 4.10.3Bias-VarianceDecomposition4. 4.10.4Bagging5. 4.10.5Boosting
a. AdaBoost
6. 4.10.6RandomForests7. 4.10.7EmpiricalComparisonamongEnsembleMethods
K. 4.11ClassImbalanceProblem1. 4.11.1BuildingClassifierswithClassImbalance
a. OversamplingandUndersamplingb. AssigningScorestoTestInstances
2. 4.11.2EvaluatingPerformancewithClassImbalance3. 4.11.3FindinganOptimalScoreThreshold4. 4.11.4AggregateEvaluationofPerformance
a. ROCCurveb. Precision-RecallCurve
L. 4.12MulticlassProblemM. 4.13BibliographicNotesN. BibliographyO. 4.14Exercises
9. 5AssociationAnalysis:BasicConceptsandAlgorithmsA. 5.1PreliminariesB. 5.2FrequentItemsetGeneration
1. 5.2.1TheAprioriPrinciple2. 5.2.2FrequentItemsetGenerationintheAprioriAlgorithm3. 5.2.3CandidateGenerationandPruning
a. CandidateGenerationa. Brute-ForceMethodb. Methodc. Method
b. CandidatePruning
4. 5.2.4SupportCountinga. SupportCountingUsingaHashTree*
5. 5.2.5ComputationalComplexity
List of Tables

1. Table 1.1. Market basket data.
2. Table 1.2. Collection of news articles.
3. Table 2.1. A sample data set containing student information.
4. Table 2.2. Different attribute types.
5. Table 2.3. Transformations that define attribute levels.
6. Table 2.4. Data set containing information about customer purchases.
7. Table 2.5. Conversion of a categorical attribute to three binary attributes.
8. Table 2.6. Conversion of a categorical attribute to five asymmetric binary attributes.
9. Table 2.7. Similarity and dissimilarity for simple attributes.
10. Table 2.8. x and y coordinates of four points.
11. Table 2.9. Euclidean distance matrix for Table 2.8.
12. Table 2.10. L1 distance matrix for Table 2.8.
13. Table 2.11. L∞ distance matrix for Table 2.8.
14. Table 2.12. Properties of cosine, correlation, and Minkowski distance measures.
15. Table 2.13. Similarity between (x, y), (x, ys), and (x, yt).
16. Table 2.14. Entropy for x.
17. Table 2.15. Entropy for y.
18. Table 2.16. Joint entropy for x and y.
19. Table 3.1. Examples of classification tasks.
20. Table 3.2. A sample data for the vertebrate classification problem.
21. Table 3.3. A sample data for the loan borrower classification problem.
22. Table 3.4. Confusion matrix for a binary classification problem.
23. Table 3.5. Data set for Exercise 2.
24. Table 3.6. Data set for Exercise 3.
25. Table 3.7. Comparing the test accuracy of decision trees T10 and T100.
26. Table 3.8. Comparing the accuracy of various classification methods.
27. Table 4.1. Example of a rule set for the vertebrate classification problem.
28. Table 4.2. The vertebrate data set.
29. Table 4.3. Example of a mutually exclusive and exhaustive rule set.
30. Table 4.4. Example of data set used to construct an ensemble of bagging classifiers.
31. Table 4.5. Comparing the accuracy of a decision tree classifier against three ensemble methods.
32. Table 4.6. A confusion matrix for a binary classification problem in which the classes are not equally important.
33. Table 4.7. Entries of the confusion matrix in terms of the TPR, TNR, skew, α, and total number of instances, N.
34. Table 4.8. Comparison of various rule-based classifiers.
35. Table 4.9. Data set for Exercise 7.
36. Table 4.10. Data set for Exercise 8.
37. Table 4.11. Data set for Exercise 10.
38. Table 4.12. Data set for Exercise 12.
39. Table 4.13. Posterior probabilities for Exercise 16.
40. Table 5.1. An example of market basket transactions.
41. Table 5.2. A binary 0/1 representation of market basket data.
42. Table 5.3. List of binary attributes from the 1984 United States Congressional Voting Records. Source: The UCI machine learning repository.
43. Table 5.4. Association rules extracted from the 1984 United States Congressional Voting Records.
44. Table 5.5. A transaction data set for mining closed itemsets.
45. Table 5.6. A 2-way contingency table for variables A and B.
46. Table 5.7. Beverage preferences among a group of 1000 people.
47. Table 5.8. Information about people who drink tea and people who use honey in their beverage.
48. Table 5.9. Examples of objective measures for the itemset {A, B}.
49. Table 5.10. Example of contingency tables.
50. Table 5.11. Rankings of contingency tables using the measures given in Table 5.9.
51. Table 5.12. Contingency tables for the pairs {p, q} and {r, s}.
52. Table 5.13. The grade-gender example. (a) Sample data of size 100.
53. Table 5.14. An example demonstrating the effect of null addition.
54. Table 5.15. Properties of symmetric measures.
55. Table 5.16. Example of a three-dimensional contingency table.
56. Table 5.17. A two-way contingency table between the sale of high-definition television and exercise machine.
57. Table 5.18. Example of a three-way contingency table.
58. Table 5.19. Grouping the items in the census data set based on their support values.
59. Table 5.20. Example of market basket transactions.
60. Table 5.21. Market basket transactions.
61. Table 5.22. Example of market basket transactions.
62. Table 5.23. Example of market basket transactions.
63. Table 5.24. A Contingency Table.
64. Table 5.25. Contingency tables for Exercise 20.
65. Table 6.1. Internet survey data with categorical attributes.
66. Table 6.2. Internet survey data after binarizing categorical and symmetric binary attributes.
67. Table 6.3. Internet survey data with continuous attributes.
68. Table 6.4. Internet survey data after binarizing categorical and continuous attributes.
69. Table 6.5. A breakdown of Internet users who participated in online chat according to their age group.
70. Table 6.6. Document-word matrix.
71. Table 6.7. Examples illustrating the concept of a subsequence.
72. Table 6.8. Graph representation of entities in various application domains.
73. Table 6.9. A two-way contingency table for the association rule X→Y.
74. Table 6.10. Traffic accident data set.
75. Table 6.11. Data set for Exercise 2.
76. Table 6.12. Data set for Exercise 3.
77. Table 6.13. Data set for Exercise 4.
78. Table 6.14. Data set for Exercise 6.
79. Table 6.15. Example of market basket transactions.
80. Table 6.16. Example of event sequences generated by various sensors.
81. Table 6.17. Example of event sequence data for Exercise 14.
82. Table 6.18. Example of numeric data set.
83. Table 7.1. Table of notation.
84. Table 7.2. K-means: Common choices for proximity, centroids, and objective functions.
85. Table 7.3. xy-coordinates of six points.
86. Table 7.4. Euclidean distance matrix for six points.
87. Table 7.5. Table of Lance-Williams coefficients for common hierarchical clustering approaches.
88. Table 7.6. Table of graph-based cluster evaluation measures.
89. Table 7.7. Cophenetic distance matrix for single link and data in Table 2.14 on page 90.
90. Table 7.8. Cophenetic correlation coefficient for data of Table 2.14 and four agglomerative hierarchical clustering techniques.
91. Table 7.9. K-means clustering results for the LA Times document data set.
92. Table 7.10. Ideal cluster similarity matrix.
93. Table 7.11. Class similarity matrix.
94. Table 7.12. Two-way contingency table for determining whether pairs of objects are in the same class and same cluster.
95. Table 7.13. Similarity matrix for Exercise 16.
96. Table 7.14. Confusion matrix for Exercise 21.
97. Table 7.15. Table of cluster labels for Exercise 24.
98. Table 7.16. Similarity matrix for Exercise 24.
99. Table 8.1. First few iterations of the EM algorithm for the simple example.
100. Table 8.2. Point counts for grid cells.
101. Table 8.3. Similarity among documents in different sections of a newspaper.
102. Table 8.4. Two nearest neighbors of four points.
103. Table 9.1. Sample pairs (c, α), α = prob(|x| ≥ c) for a Gaussian distribution with mean 0 and standard deviation 1.
104. Table 9.2. Survey data of weight and height of 100 participants.
105. Table 10.1. Confusion table in the context of multiple hypothesis testing.
106. Table 10.2. Correspondence between statistical testing concepts and classifier evaluation measures.
107. Table 10.3. A 2-way contingency table for variables A and B.
108. Table 10.4. Beverage preferences among a group of 1000 people.
109. Table 10.5. Beverage preferences among a group of 1000 people.
110. Table 10.6. Contingency table for an anomaly detection system with detection rate d and false alarm rate f.
111. Table 10.7. Beverage preferences among a group of 100 people (left) and 10,000 people (right).
112. Table 10.8. Ordered Collection of p-values.
- INTRODUCTION TO DATA MINING
- Preface to the Second Edition
- Overview
- What is New in the Second Edition?
- To the Instructor
- Support Materials
- Contents
- 1 Introduction
- 1.1 What Is Data Mining?
- 1.2 Motivating Challenges
- 1.3 The Origins of Data Mining
- 1.4 Data Mining Tasks
- 1.5 Scope and Organization of the Book
- 1.6 Bibliographic Notes
- Bibliography
- 1.7 Exercises
- 2 Data
- 2.1 Types of Data
- 2.1.1 Attributes and Measurement
- What Is an Attribute?
- The Type of an Attribute
- The Different Types of Attributes
- Describing Attributes by the Number of Values
- Asymmetric Attributes
- General Comments on Levels of Measurement
- 2.1.2 Types of Data Sets
- General Characteristics of Data Sets
- Dimensionality
- Distribution
- Resolution
- Record Data
- Transaction or Market Basket Data
- The Data Matrix
- The Sparse Data Matrix
- Graph-Based Data
- Data with Relationships among Objects
- Data with Objects That Are Graphs
- Ordered Data
- Sequential Transaction Data
- Time Series Data
- Sequence Data
- Spatial and Spatio-Temporal Data
- Handling Non-Record Data
- 2.2 Data Quality
- 2.2.1 Measurement and Data Collection Issues
- Measurement and Data Collection Errors
- Noise and Artifacts
- Precision, Bias, and Accuracy
- Outliers
- Missing Values
- Eliminate Data Objects or Attributes
- Estimate Missing Values
- Ignore the Missing Value during Analysis
- Inconsistent Values
- Duplicate Data
- 2.2.2 Issues Related to Applications
- 2.3 Data Preprocessing
- 2.3.1 Aggregation
- 2.3.2 Sampling
- Sampling Approaches
- Progressive Sampling
- 2.3.3 Dimensionality Reduction
- The Curse of Dimensionality
- Linear Algebra Techniques for Dimensionality Reduction
- 2.3.4 Feature Subset Selection
- An Architecture for Feature Subset Selection
- Feature Weighting
- 2.3.5 Feature Creation
- Feature Extraction
- Mapping the Data to a New Space
- 2.3.6 Discretization and Binarization
- Binarization
- Discretization of Continuous Attributes
- Unsupervised Discretization
- Supervised Discretization
- Categorical Attributes with Too Many Values
- 2.3.7 Variable Transformation
- Simple Functions
- Normalization or Standardization
- 2.4 Measures of Similarity and Dissimilarity
- 2.4.1 Basics
- Definitions
- Transformations
- 2.4.2 Similarity and Dissimilarity between Simple Attributes
- 2.4.3 Dissimilarities between Data Objects
- Distances
- 2.4.4 Similarities between Data Objects
- 2.4.5 Examples of Proximity Measures
- Similarity Measures for Binary Data
- Simple Matching Coefficient
- Jaccard Coefficient
- Cosine Similarity
- Extended Jaccard Coefficient (Tanimoto Coefficient)
- Correlation
- Differences Among Measures For Continuous Attributes
- 2.4.6 Mutual Information
- 2.4.7 Kernel Functions*
- 2.4.8 Bregman Divergence*
- 2.4.9 Issues in Proximity Calculation
- Standardization and Correlation for Distance Measures
- Combining Similarities for Heterogeneous Attributes
- Using Weights
- 2.4.10 Selecting the Right Proximity Measure
- 2.5 Bibliographic Notes
- Bibliography
- 2.6 Exercises
- 3 Classification: Basic Concepts and Techniques
- 3.1 Basic Concepts
- 3.2 General Framework for Classification
- 3.3 Decision Tree Classifier
- 3.3.1 A Basic Algorithm to Build a Decision Tree
- Hunt's Algorithm
- Design Issues of Decision Tree Induction
- 3.3.2 Methods for Expressing Attribute Test Conditions
- 3.3.3 Measures for Selecting an Attribute Test Condition
- Impurity Measure for a Single Node
- Collective Impurity of Child Nodes
- Identifying the best attribute test condition
- Splitting of Qualitative Attributes
- Binary Splitting of Qualitative Attributes
- Binary Splitting of Quantitative Attributes
- Gain Ratio
- 3.3.4 Algorithm for Decision Tree Induction
- 3.3.5 Example Application: Web Robot Detection
- 3.3.6 Characteristics of Decision Tree Classifiers
- 3.4 Model Overfitting
- 3.4.1 Reasons for Model Overfitting
- Limited Training Size
- High Model Complexity
- 3.5 Model Selection
- 3.5.1 Using a Validation Set
- 3.5.2 Incorporating Model Complexity
- Estimating the Complexity of Decision Trees
- Minimum Description Length Principle
- 3.5.3 Estimating Statistical Bounds
- 3.5.4 Model Selection for Decision Trees
- 3.6 Model Evaluation
- 3.6.1 Holdout Method
- 3.6.2 Cross-Validation
- 3.7 Presence of Hyper-parameters
- 3.7.1 Hyper-parameter Selection
- 3.7.2 Nested Cross-Validation
- 3.8 Pitfalls of Model Selection and Evaluation
- 3.8.1 Overlap between Training and Test Sets
- 3.8.2 Use of Validation Error as Generalization Error
- 3.9 Model Comparison*
- 3.9.1 Estimating the Confidence Interval for Accuracy
- 3.9.2 Comparing the Performance of Two Models
- 3.10 Bibliographic Notes
- Bibliography
- 3.11 Exercises
- 4 Classification: Alternative Techniques
- 4.1 Types of Classifiers
- 4.2 Rule-Based Classifier
- 4.2.1 How a Rule-Based Classifier Works
- 4.2.2 Properties of a Rule Set
- 4.2.3 Direct Methods for Rule Extraction
- Learn-One-Rule Function
- Rule Pruning
- Building the Rule Set
- Instance Elimination
- 4.2.4 Indirect Methods for Rule Extraction
- 4.2.5 Characteristics of Rule-Based Classifiers
- 4.3 Nearest Neighbor Classifiers
- 4.3.1 Algorithm
- 4.3.2 Characteristics of Nearest Neighbor Classifiers
- 4.4 Naïve Bayes Classifier
- 4.4.1 Basics of Probability Theory
- Bayes Theorem
- Using Bayes Theorem for Classification
- 4.4.2 Naïve Bayes Assumption
- Conditional Independence
- How a Naïve Bayes Classifier Works
- Estimating Conditional Probabilities for Categorical Attributes
- Estimating Conditional Probabilities for Continuous Attributes
- Handling Zero Conditional Probabilities
- Characteristics of Naïve Bayes Classifiers
- 4.5 Bayesian Networks
- 4.5.1 Graphical Representation
- Conditional Independence
- Joint Probability
- Use of Hidden Variables
- 4.5.2 Inference and Learning
- Variable Elimination
- Sum-Product Algorithm for Trees
- Generalizations for Non-Tree Graphs
- Learning Model Parameters
- 4.5.3 Characteristics of Bayesian Networks
- 4.6 Logistic Regression
- 4.6.1 Logistic Regression as a Generalized Linear Model
- 4.6.2 Learning Model Parameters
- 4.6.3 Characteristics of Logistic Regression
- 4.7 Artificial Neural Network (ANN)
- 4.7.1 Perceptron
- Learning the Perceptron
- 4.7.2 Multi-layer Neural Network
- Learning Model Parameters
- 4.7.3 Characteristics of ANN
- 4.8 Deep Learning
- 4.8.1 Using Synergistic Loss Functions
- Saturation of Outputs
- Cross entropy loss function
- 4.8.2 Using Responsive Activation Functions
- Vanishing Gradient Problem
- Rectified Linear Units (ReLU)
- 4.8.3 Regularization
- Dropout
- 4.8.4 Initialization of Model Parameters
- Supervised Pretraining
- Unsupervised Pretraining
- Use of Autoencoders
- Hybrid Pretraining
- 4.8.5 Characteristics of Deep Learning
- 4.9 Support Vector Machine (SVM)
- 4.9.1 Margin of a Separating Hyperplane
- Rationale for Maximum Margin
- 4.9.2 Linear SVM
- Learning Model Parameters
- 4.9.3 Soft-margin SVM
- SVM as a Regularizer of Hinge Loss
- 4.9.4 Nonlinear SVM
- Attribute Transformation
- Learning a Nonlinear SVM Model
- 4.9.5 Characteristics of SVM
- 4.10 Ensemble Methods
- 4.10.1 Rationale for Ensemble Method
- 4.10.2 Methods for Constructing an Ensemble Classifier
- 4.10.3 Bias-Variance Decomposition
- 4.10.4 Bagging
- 4.10.5 Boosting
- AdaBoost
- 4.10.6 Random Forests
- 4.10.7 Empirical Comparison among Ensemble Methods
- 4.11 Class Imbalance Problem
- 4.11.1 Building Classifiers with Class Imbalance
- Oversampling and Undersampling
- Assigning Scores to Test Instances
- 4.11.2 Evaluating Performance with Class Imbalance
- 4.11.3 Finding an Optimal Score Threshold
- 4.11.4 Aggregate Evaluation of Performance
- ROC Curve
- Precision-Recall Curve
- 4.12 Multiclass Problem
- 4.13 Bibliographic Notes
- Bibliography
- 4.14 Exercises
- 5 Association Analysis: Basic Concepts and Algorithms
- 5.1 Preliminaries
- 5.2 Frequent Itemset Generation
- 5.2.1 The Apriori Principle
- 5.2.2 Frequent Itemset Generation in the Apriori Algorithm
- 5.2.3 Candidate Generation and Pruning
- Candidate Generation
- Brute-Force Method
- Fk−1×F1 Method
- Fk−1×Fk−1 Method
- Candidate Pruning
- 5.2.4 Support Counting
- Support Counting Using a Hash Tree*
- 5.2.5 Computational Complexity
- 5.3 Rule Generation
- 5.3.1 Confidence-Based Pruning
- 5.3.2 Rule Generation in Apriori Algorithm
- 5.3.3 An Example: Congressional Voting Records
- 5.4 Compact Representation of Frequent Itemsets
- 5.4.1 Maximal Frequent Itemsets
- 5.4.2 Closed Itemsets
- 5.5 Alternative Methods for Generating Frequent Itemsets*
- 5.6 FP-Growth Algorithm*
- 5.6.1 FP-Tree Representation
- 5.6.2 Frequent Itemset Generation in FP-Growth Algorithm
- 5.7 Evaluation of Association Patterns
- 5.7.1 Objective Measures of Interestingness
- Alternative Objective Interestingness Measures
- Consistency among Objective Measures
- Properties of Objective Measures
- Inversion Property
- Scaling Property
- Null Addition Property
- Asymmetric Interestingness Measures
- 5.7.2 Measures beyond Pairs of Binary Variables
- 5.7.3 Simpson's Paradox
- 5.8 Effect of Skewed Support Distribution
- 5.9 Bibliographic Notes
- Bibliography
- 5.10 Exercises
- 6 Association Analysis: Advanced Concepts
- 6.1 Handling Categorical Attributes
- 6.2 Handling Continuous Attributes
- 6.2.1 Discretization-Based Methods
- 6.2.2 Statistics-Based Methods
- Rule Generation
- Rule Validation
- 6.2.3 Non-discretization Methods
- 6.3 Handling a Concept Hierarchy
- 6.4 Sequential Patterns
- 6.4.1 Preliminaries
- Sequences
- Subsequences
- 6.4.2 Sequential Pattern Discovery
- 6.4.3 Timing Constraints*
- The maxspan Constraint
- The mingap and maxgap Constraints
- The Window Size Constraint
- 6.4.4 Alternative Counting Schemes*
- 6.5 Subgraph Patterns
- 6.5.1 Preliminaries
- Graphs
- Graph Isomorphism
- Subgraphs
- 6.5.2 Frequent Subgraph Mining
- 6.5.3 Candidate Generation
- 6.5.4 Candidate Pruning
- 6.5.5 Support Counting
- 6.6 Infrequent Patterns*
- 6.6.1 Negative Patterns
- 6.6.2 Negatively Correlated Patterns
- 6.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
- 6.6.4 Techniques for Mining Interesting Infrequent Patterns
- 6.6.5 Techniques Based on Mining Negative Patterns
- 6.6.6 Techniques Based on Support Expectation
- Support Expectation Based on Concept Hierarchy
- Support Expectation Based on Indirect Association
- 6.7 Bibliographic Notes
- Bibliography
- 6.8 Exercises
- 7 Cluster Analysis: Basic Concepts and Algorithms
- 7.1 Overview
- 7.1.1 What Is Cluster Analysis?
- 7.1.2 Different Types of Clusterings
- 7.1.3 Different Types of Clusters
- Road Map
- 7.2 K-means
- 7.2.1 The Basic K-means Algorithm
- Assigning Points to the Closest Centroid
- Centroids and Objective Functions
- Data in Euclidean Space
- Document Data
- The General Case
- Choosing Initial Centroids
- K-means++
- Time and Space Complexity
- 7.2.2 K-means: Additional Issues
- Handling Empty Clusters
- Outliers
- Reducing the SSE with Postprocessing
- Updating Centroids Incrementally
- 7.2.3 Bisecting K-means
- 7.2.4 K-means and Different Types of Clusters
- 7.2.5 Strengths and Weaknesses
- 7.2.6 K-means as an Optimization Problem
- Derivation of K-means as an Algorithm to Minimize the SSE
- Derivation of K-means for SAE
- 7.3 Agglomerative Hierarchical Clustering
- 7.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
- Defining Proximity between Clusters
- Time and Space Complexity
- 7.3.2 Specific Techniques
- Sample Data
- Single Link or MIN
- Complete Link or MAX or CLIQUE
- Group Average
- Ward’s Method and Centroid Methods
- 7.3.3 The Lance-Williams Formula for Cluster Proximity
- 7.3.4 Key Issues in Hierarchical Clustering
- Lack of a Global Objective Function
- Ability to Handle Different Cluster Sizes
- Merging Decisions Are Final
- 7.3.5 Outliers
- 7.3.6 Strengths and Weaknesses
- 7.4 DBSCAN
- 7.4.1 Traditional Density: Center-Based Approach
- Classification of Points According to Center-Based Density
- 7.4.2 The DBSCAN Algorithm
- Time and Space Complexity
- Selection of DBSCAN Parameters
- Clusters of Varying Density
- An Example
- 7.4.3 Strengths and Weaknesses
- 7.5 Cluster Evaluation
- 7.5.1 Overview
- 7.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
- Graph-Based View of Cohesion and Separation
- Prototype-Based View of Cohesion and Separation
- Relationship between Prototype-Based Cohesion and Graph-Based Cohesion
- Relationship of the Two Approaches to Prototype-Based Separation
- Relationship between Cohesion and Separation
- Relationship between Graph- and Centroid-Based Cohesion
- Overall Measures of Cohesion and Separation
- Evaluating Individual Clusters and Objects
- The Silhouette Coefficient
- 7.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
- General Comments on Unsupervised Cluster Evaluation Measures
- Measuring Cluster Validity via Correlation
- Judging a Clustering Visually by Its Similarity Matrix
- 7.5.4 Unsupervised Evaluation of Hierarchical Clustering
- 7.5.5 Determining the Correct Number of Clusters
- 7.5.6 Clustering Tendency
- 7.5.7 Supervised Measures of Cluster Validity
- Classification-Oriented Measures of Cluster Validity
- Similarity-Oriented Measures of Cluster Validity
- Cluster Validity for Hierarchical Clusterings
- 7.5.8 Assessing the Significance of Cluster Validity Measures
- 7.5.9 Choosing a Cluster Validity Measure
- 7.6 Bibliographic Notes
- Bibliography
- 7.7 Exercises
- 8 Cluster Analysis: Additional Issues and Algorithms
- 8.1 Characteristics of Data, Clusters, and Clustering Algorithms
- 8.1.1 Example: Comparing K-means and DBSCAN
- 8.1.2 Data Characteristics
- 8.1.3 Cluster Characteristics
- 8.1.4 General Characteristics of Clustering Algorithms
- Road Map
- 8.2 Prototype-Based Clustering
- 8.2.1 Fuzzy Clustering
- Fuzzy Sets
- Fuzzy Clusters
- Fuzzy c-means
- Computing SSE
- Initialization
- Computing Centroids
- Updating the Fuzzy Pseudo-partition
- Strengths and Limitations
- 8.2.2 Clustering Using Mixture Models
- Mixture Models
- Estimating Model Parameters Using Maximum Likelihood
- Estimating Mixture Model Parameters Using Maximum Likelihood: The EM Algorithm
- Advantages and Limitations of Mixture Model Clustering Using the EM Algorithm
- 8.2.3 Self-Organizing Maps (SOM)
- The SOM Algorithm
- Initialization
- Selection of an Object
- Assignment
- Update
- Termination
- Applications
- Strengths and Limitations
- 8.3 Density-Based Clustering
- 8.3.1 Grid-Based Clustering
- Defining Grid Cells
- The Density of Grid Cells
- Forming Clusters from Dense Grid Cells
- Strengths and Limitations
- 8.3.2 Subspace Clustering
- CLIQUE
- Strengths and Limitations of CLIQUE
- 8.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
- Kernel Density Estimation
- Implementation Issues
- Strengths and Limitations of DENCLUE
- 8.4 Graph-Based Clustering
- 8.4.1 Sparsification
- 8.4.2 Minimum Spanning Tree (MST) Clustering
- 8.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
- Strengths and Weaknesses
- 8.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
- Deciding Which Clusters to Merge
- Chameleon Algorithm
- Sparsification
- Graph Partitioning
- Agglomerative Hierarchical Clustering
- Complexity
- Strengths and Limitations
- 8.4.5 Spectral Clustering
- Relationship between Spectral Clustering and Graph Partitioning
- Strengths and Limitations
- 8.4.6 Shared Nearest Neighbor Similarity
- Problems with Traditional Similarity in High-Dimensional Data
- Problems with Differences in Density
- SNN Similarity Computation
- SNN Similarity versus Direct Similarity
- 8.4.7 The Jarvis-Patrick Clustering Algorithm
- Strengths and Limitations
- 8.4.8 SNN Density
- 8.4.9 SNN Density-Based Clustering
- The SNN Density-based Clustering Algorithm
- Strengths and Limitations
- 8.5 Scalable Clustering Algorithms
- 8.5.1 Scalability: General Issues and Approaches
- 8.5.2 BIRCH
- 8.5.3 CURE
- Sampling in CURE
- Partitioning
- 8.6 Which Clustering Algorithm?
- 8.7 Bibliographic Notes
- Bibliography
- 8.8 Exercises
- 9 Anomaly Detection
- 9.1 Characteristics of Anomaly Detection Problems
- 9.1.1 A Definition of an Anomaly
- 9.1.2 Nature of Data
- 9.1.3 How Anomaly Detection is Used
- 9.2 Characteristics of Anomaly Detection Methods
- 9.3 Statistical Approaches
- 9.3.1 Using Parametric Models
- Using the Univariate Gaussian Distribution
- Using the Multivariate Gaussian Distribution
- 9.3.2 Using Non-parametric Models
- 9.3.3 Modeling Normal and Anomalous Classes
- 9.3.4 Assessing Statistical Significance
- 9.3.5 Strengths and Weaknesses
- 9.4 Proximity-based Approaches
- 9.4.1 Distance-based Anomaly Score
- 9.4.2 Density-based Anomaly Score
- 9.4.3 Relative Density-based Anomaly Score
- 9.4.4 Strengths and Weaknesses
- 9.5 Clustering-based Approaches
- 9.5.1 Finding Anomalous Clusters
- 9.5.2 Finding Anomalous Instances
- Assessing the Extent to Which an Object Belongs to a Cluster
- Impact of Outliers on the Initial Clustering
- The Number of Clusters to Use
- 9.5.3 Strengths and Weaknesses
- 9.6 Reconstruction-based Approaches
- 9.6.1 Strengths and Weaknesses
- 9.7 One-class Classification
- 9.7.1 Use of Kernels
- 9.7.2 The Origin Trick
- 9.7.3 Strengths and Weaknesses
- 9.8 Information Theoretic Approaches
- 9.8.1 Strengths and Weaknesses
- 9.9 Evaluation of Anomaly Detection
- 9.10 Bibliographic Notes
- Bibliography
- 9.11 Exercises
- 10 Avoiding False Discoveries
- 10.1 Preliminaries: Statistical Testing
- 10.1.1 Significance Testing
- Null Hypothesis
- Test Statistic
- Null Distribution
- Assessing Statistical Significance
- 10.1.2 Hypothesis Testing
- Critical Region
- Type I and Type II Errors
- Effect Size
- 10.1.3 Multiple Hypothesis Testing
- Family-wise Error Rate (FWER)
- Bonferroni Procedure
- False discovery rate (FDR)
- Benjamini-Hochberg Procedure
- 10.1.4 Pitfalls in Statistical Testing
- 10.2 Modeling Null and Alternative Distributions
- 10.2.1 Generating Synthetic Data Sets
- 10.2.2 Randomizing Class Labels
- 10.2.3 Resampling Instances
- 10.2.4 Modeling the Distribution of the Test Statistic
- 10.3 Statistical Testing for Classification
- 10.3.1 Evaluating Classification Performance
- 10.3.2 Binary Classification as Multiple Hypothesis Testing
- 10.3.3 Multiple Hypothesis Testing in Model Selection
- 10.4 Statistical Testing for Association Analysis
- 10.4.1 Using Statistical Models
- Using Fisher’s Exact Test
- Using the Chi-Squared Test
- 10.4.2 Using Randomization Methods
- 10.5 Statistical Testing for Cluster Analysis
- 10.5.1 Generating a Null Distribution for Internal Indices
- 10.5.2 Generating a Null Distribution for External Indices
- 10.5.3 Enrichment
- 10.6 Statistical Testing for Anomaly Detection
- 10.7 Bibliographic Notes
- Bibliography
- 10.8 Exercises
- Author Index
- Subject Index
- Copyright Permissions