Analyzing Textual Data and Social Media

In the previous chapters, we focused on the analysis of structured data, mostly in tabular format. In reality, plain text is the most predominant form of data available today. Text analysis applies analysis of word frequency distributions, pattern recognition, tagging, link and association analysis, sentiment analysis, and visualization. We will analyze text with the Python Natural Language Toolkit (NLTK) library. NLTK comes with a collection of sample texts called corpora. A small example of network analysis will also be covered. The following topics will be discussed in this chapter:

• Installing NLTK
• Filtering out stopwords, names, and numbers
• The bag-of-words model
• Analyzing word frequencies
• Naive Bayes classification
• Sentiment analysis
• Creating word clouds
• Social network analysis

Installing NLTK

NLTK is a Python API for the analysis of texts written in natural languages, such as English. NLTK was created in 2001 and was originally intended as a teaching tool. Install NLTK with the following command:

    $ sudo pip install nltk
    $ pip freeze | grep nltk
    nltk==2.0.4

As usual, we will check the installation with a new version of the pkg_check.py file. The following import statement is required:

    import nltk

If everything works, we should get a result similar to the following:

    nltk version 2.0.4
    nltk.app DESCRIPTION chartparser: Chart Parser chunkparser: Regular-Expression Chunk Parser collocations: Find collocations in text concordance: Part
    nltk.ccg DESCRIPTION For more information see nltk/doc/contrib/ccg/ccg.pdf PACKAGE CONTENTS api chart combinator lexicon DATA BackwardApplication
    [...] These perform simple pattern matching on sentences typed by users, and respond with automatically g
    nltk.chunk DESCRIPTION Classes and interfaces for identifying non-overlapping linguistic groups (such as base noun phrases) in unrestricted text. This
    nltk.classify DESCRIPTION Classes and interfaces for labeling tokens with category labels (or "class labels"). Typically, labels are represented with stri
    nltk.cluster DESCRIPTION This module contains a number of basic clustering algorithms. Clustering describes the task of discovering groups of similar ite
    nltk.corpus
    nltk.draw
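
The pkg_check.py file itself is not reproduced in this section, so the following is only a minimal sketch of how such a check could be written, not the original script. The function name summarize_package and the max_chars parameter are hypothetical. The sketch imports a package, reads its version if one is exposed, and collects a truncated docstring excerpt for each public submodule, which is roughly the shape of the output listed above. It is demonstrated on the standard library json package so that it runs without NLTK installed.

```python
import importlib
import pkgutil


def summarize_package(package_name, max_chars=150):
    """Return (version, [(submodule_name, doc_excerpt), ...]) for a package.

    Hypothetical helper for illustration; not the book's pkg_check.py.
    """
    package = importlib.import_module(package_name)
    version = getattr(package, "__version__", "unknown")
    rows = []
    # Plain modules have no __path__, so fall back to an empty search path.
    for info in pkgutil.iter_modules(getattr(package, "__path__", [])):
        if info.name.startswith("_"):
            continue  # skip private helpers such as __main__
        full_name = f"{package_name}.{info.name}"
        try:
            module = importlib.import_module(full_name)
        except Exception:
            # A submodule may need an optional dependency; keep its name only.
            rows.append((full_name, ""))
            continue
        # Collapse whitespace so the excerpt fits on one line, then truncate.
        doc = " ".join((module.__doc__ or "").split())
        rows.append((full_name, doc[:max_chars]))
    return version, rows


if __name__ == "__main__":
    # "json" ships with Python; replace it with "nltk" once NLTK is installed.
    version, rows = summarize_package("json")
    print("json version", version)
    for name, excerpt in rows:
        print(name, excerpt)
```

Calling summarize_package("nltk") after a successful installation should list submodules such as nltk.chunk and nltk.classify with the start of their descriptions. The exact text depends on the installed NLTK version.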