資源描述:
《基于hadoop的云存儲》由會員上傳分享,免費(fèi)在線閱讀,更多相關(guān)內(nèi)容在教育資源-天天文庫。
1、1基于Hadoop的云存儲云計算隨時隨地訪問您的應(yīng)用2云存儲的數(shù)據(jù)管理特點(diǎn)與挑戰(zhàn)特點(diǎn)挑戰(zhàn)計算資源是可伸縮的數(shù)據(jù)具有備份數(shù)據(jù)存儲在大量分布的節(jié)點(diǎn)之上數(shù)據(jù)的自我管理和自調(diào)優(yōu)基于大量節(jié)點(diǎn)的查詢優(yōu)化算法基于大量節(jié)點(diǎn)的索引結(jié)構(gòu)資源調(diào)度和負(fù)載均衡多租戶情況3我們面臨的問題您如何來管理大量的應(yīng)用程序?運(yùn)行任務(wù)來處理100百萬兆字節(jié)的數(shù)據(jù)花費(fèi)11天在一臺電腦上讀取數(shù)據(jù)需要大量低價的計算機(jī)故障處理速度問題(15分鐘修復(fù)1000臺計算機(jī)),但…可靠性問題在大型計算機(jī)集群中,每天都有計算機(jī)出現(xiàn)故障集群的規(guī)模不斷變化需要通用
2、的基礎(chǔ)架構(gòu)必須是高效且可靠的4解決方案開源的Apache項(xiàng)目Hadoop主要包括:DistributedFileSystem–分布的數(shù)據(jù)Map/Reduce–分布的應(yīng)用程序使用Java開發(fā)運(yùn)行在Linux,MacOS/X,Windows,andSolaris廉價的硬件設(shè)備5Typicallyin2levelarchitectureNodesarecommodityPCs40nodes/rackUplinkfromrackis8gigabitRack-internalis1gigabitHardware
3、ClusterofHadoop6DistributedFileSystemSinglenamespaceforentireclusterManagedbyasinglenamenode.Filesaresingle-writerandappend-only.Optimizedforstreamingreadsoflargefiles.Filesarebrokenintolargeblocks.Typically128MBReplicatedtoseveraldatanodes,forreliabili
4、tyAccessfromJava,C,orcommandline.7BlockPlacementDefaultis3replicas,butsettableBlocksareplaced(writesarepipelined):OnsamenodeOndifferentrackOntheotherrackClientsreadfromclosestreplicaIfthereplicationforablockdropsbelowtarget,itisautomaticallyre-replicate
5、d.8HowisYahoousingHadoop?StartedwithbuildingbetterapplicationsScaleupwebscalebatchapplications(search,ads,…)Factoroutcommoncodefromexistingsystems,sonewapplicationswillbeeasiertowriteManagethemanyclusters9RunningProductionWebMapSearchneedsagraphofthe“kn
6、own”webInvertedges,computelinktext,wholegraphheuristicsPeriodicbatchjobusingMap/ReduceUsesachainof~100map/reducejobsScale1trillionedgesingraphLargestshuffleis450TBFinaloutputis300TBcompressedRunson10,000coresRawdiskused5PB10TerabyteSortBenchmarkStartedb
7、yJimGrayatMicrosoftin1998Sorting10billion100byterecordsHadoopwonthegeneralcategoryin209seconds910nodes2quad-coreXeons@2.0Ghz/node4SATAdisks/node8GBram/node1gbethernet/node40nodes/rack8gbethernetuplink/rackPreviousrecordswas297seconds11HadoopclustersWeha
8、ve~20,000machinesrunningHadoopOurlargestclustersarecurrently2000nodesSeveralpetabytesofuserdata(compressed,unreplicated)Werunhundredsofthousandsofjobseverymonth12ResearchClusterUsage13WhoUsesHadoop?Amazon/A9AOLFacebookFoxinteract