Lec14_text_classification
Text Classification and Sentiment Analysis
Rui Xia
rxia@njust.edu.cn
2013.11.23

Outline
- Basic Concepts
- Text Classification
  - Text Representation
  - Feature Selection
  - Classification Algorithm
- Sentiment Analysis
  - Traditional Methods
  - New Challenges
  - New Directions

Real-world PR&ML Applications

Structure of a PR&ML System
[Figure: block diagram with the components Training Samples (Labeled Patterns), Pattern, Feature Representation, Feature Selection, Classifier, Class Label]

A Text Classification System
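Text Representation
- Vector Space Model (VSM), also called Bag-of-words (BOW) model

Term Weight
- BOOL (presence): $\omega_{ki} = \begin{cases} 1, & \text{if } t_i \text{ exists in } d_k \\ 0, & \text{otherwise} \end{cases}$
- Term frequency (TF): $\omega_{ki} = tf_{ki}$
- Inverse document frequency (IDF): $\omega_{ki} = \log \frac{N}{df_i}$
- TF-IDF: $\omega_{ki} = tf_{ki} \cdot \log \frac{N}{df_i}$

An Example of Text Representation
- Training data (labeled documents)

  教育 (Education):
  1. 北京理工大學計算機專業創建于1958年,是中國最早設立計算機專業的高校之一
     (The computer science program of 北京理工大學 was founded in 1958 and is one of the earliest university computer science programs established in China.)
  2. 北京理工大學學子在第四屆中國計算機博弈錦標賽中奪冠
     (Students of 北京理工大學 won the 4th China Computer Games Championship.)

  體育 (Sports):
  1. 北京理工大學體育館是2008年中國北京奧林匹克運動會的排球預賽場地
     (The gymnasium of 北京理工大學 was the volleyball preliminary venue of the 2008 Beijing Olympic Games in China.)
  2. 第五屆東亞運動會中中國軍團獎牌總數創新高,男女排球雙雙奪冠
     (At the 5th East Asian Games the Chinese team's medal total reached a new high, and both the men's and women's volleyball teams won the title.)

An Example of TR (cont.)
- Bag of words (containing 40 words)

  1958 2008 奧林匹克 北京 博弈 場地 創 創建 大學 的
  第四 第五 東亞 奪冠 高校 計算機 獎牌 屆 錦標賽 軍團
  理工 男女 年 排球 設立 是 雙雙 體育館 新高 學子
  于 預賽 運動會 在 之一 中 中國 專業 總數 最早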
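The term weights above (BOOL, TF, TF-IDF) can be tried out on the four example documents. What follows is a minimal Python sketch under stated assumptions, not the lecture's own code: the word segmentation of each document was done by hand so that the tokens match the 40-word bag of words, `log` is taken to be the natural logarithm, and the names `bool_weight`, `tf_weight`, `tfidf_weight` and `vectorize` are hypothetical.

```python
import math
from collections import Counter

# The four example documents, segmented by hand (an assumption) into the
# tokens of the 40-word bag of words.
docs = [
    "北京 理工 大學 計算機 專業 創建 于 1958 年 是 中國 最早 設立 計算機 專業 的 高校 之一".split(),
    "北京 理工 大學 學子 在 第四 屆 中國 計算機 博弈 錦標賽 中 奪冠".split(),
    "北京 理工 大學 體育館 是 2008 年 中國 北京 奧林匹克 運動會 的 排球 預賽 場地".split(),
    "第五 屆 東亞 運動會 中 中國 軍團 獎牌 總數 創 新高 男女 排球 雙雙 奪冠".split(),
]
labels = ["教育", "教育", "體育", "體育"]

# Vocabulary: every distinct term in the training documents.
vocab = sorted({term for doc in docs for term in doc})

# N is the number of documents; df[t] is the number of documents containing t.
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))


def bool_weight(doc, term):
    """BOOL (presence): 1 if the term occurs in the document, else 0."""
    return 1 if term in doc else 0


def tf_weight(doc, term):
    """TF: number of occurrences of the term in the document."""
    return doc.count(term)


def tfidf_weight(doc, term):
    """TF-IDF: tf * log(N / df). Only valid for terms in vocab (df > 0)."""
    return tf_weight(doc, term) * math.log(N / df[term])


def vectorize(doc, weight):
    """Map a document to a vector with one weight per vocabulary term."""
    return [weight(doc, term) for term in vocab]


if __name__ == "__main__":
    print(len(vocab))  # 40
    for label, doc in zip(labels, docs):
        print(label, vectorize(doc, bool_weight))
```

A term such as 中國 that occurs in all four documents gets a TF-IDF weight of 0, because log(N / df) = log(4 / 4) = 0. A term such as 計算機 that is confined to the 教育 documents keeps a positive weight.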
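Outline
- Basic Concepts
- Text Classification
  - Text Representation
  - Feature Selection
    - Term Frequency (TF)
    - Document Frequency (DF)
    - Mutual Information (MI)
    - Information Gain (IG)
  - Classifier Design
- Sentiment Analysis

Feature Selection - Frequency Filter
- Document Frequency (DF): features are ranked according to their document frequency in the training corpus.
- Term Frequency (TF): features are ranked according to their term frequency in the training corpus.
- Shortcomings: unsupervised, so the ranking lacks the class information.

Basic Probability Estimate
- A statistic table for feature t and class c

  feature \ class    $c_j$       $\bar{c}_j$
  $t_i$              $A_{ij}$    $B_{ij}$
  $\bar{t}_i$        $C_{ij}$    $D_{ij}$

$$P(c_j) \approx \frac{A_{ij} + C_{ij}}{N_{all}}$$
$$P(t_i) \approx \frac{A_{ij} + B_{ij}}{N_{all}}$$
$$P(\bar{t}_i) \approx \frac{C_{ij} + D_{ij}}{N_{all}}$$
$$P(c_j \mid t_i) \approx \frac{A_{ij} + 1}{A_{ij} + B_{ij} + C}$$
$$P(c_j \mid \bar{t}_i) \approx \frac{C_{ij} + 1}{C_{ij} + D_{ij} + C}$$

Concepts from Information Theory
- Entropy
$$H(X) = -\sum_{x} p(x) \log p(x)$$
- Joint Entropy
$$H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log p(x, y)$$
- Conditional Entropy
$$H(Y \mid X) = \sum_{x} p(x) H(Y \mid X = x) = -\sum_{x} \sum_{y} p(x, y) \log p(y \mid x)$$
$$H(Y \mid X) = H(X, Y) - H(X)$$

Feature Selection - MI
- Mutual Information (MI): MIo

Feature Selection - IG
- Information Gain (IG)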
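The entropy definitions from the information-theory slide, and the identity H(Y|X) = H(X,Y) - H(X), can be checked numerically. The following Python sketch is illustrative only: the joint distribution `joint` is made-up toy data, the logarithm is assumed to be base 2 (the slide writes only `log`), and the function names are hypothetical.

```python
import math
from collections import defaultdict


def entropy(probs):
    """H = -sum p log2 p, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def marginal_x(joint):
    """p(x) = sum over y of p(x, y), for a dict keyed by (x, y)."""
    px = defaultdict(float)
    for (x, _y), p in joint.items():
        px[x] += p
    return dict(px)


def joint_entropy(joint):
    """H(X, Y) = -sum over (x, y) of p(x, y) log2 p(x, y)."""
    return entropy(joint.values())


def conditional_entropy(joint):
    """H(Y | X) = -sum over (x, y) of p(x, y) log2 p(y | x)."""
    px = marginal_x(joint)
    return -sum(
        p * math.log2(p / px[x])  # p(y | x) = p(x, y) / p(x)
        for (x, _y), p in joint.items()
        if p > 0
    )


# Toy joint distribution p(x, y); the values are invented for illustration.
joint = {
    ("x0", "y0"): 0.5,
    ("x0", "y1"): 0.25,
    ("x1", "y0"): 0.25,
    ("x1", "y1"): 0.0,
}

h_x = entropy(marginal_x(joint).values())
h_xy = joint_entropy(joint)
h_y_given_x = conditional_entropy(joint)

if __name__ == "__main__":
    print(h_x, h_xy, h_y_given_x)
    # The chain rule H(Y|X) = H(X,Y) - H(X) holds up to rounding error.
    print(abs(h_y_given_x - (h_xy - h_x)) < 1e-12)
```

Computing H(Y|X) directly from p(y|x) and comparing it with H(X,Y) - H(X) is a quick sanity check that a joint distribution has been normalized and indexed correctly.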