Advanced Analytics with Spark (Spark高级数据分析)

Subtitle: None

Author: Ryza (USA) et al.

Classification number:

ISBN: 9787564159108


Description

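  In this practical book by Ryza et al., Advanced Analytics with Spark (reprint edition, in English), four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you, by example, how to approach data analysis problems.

  You will start with an introduction to Spark and its ecosystem, then dive into patterns that apply standard techniques, including classification, collaborative filtering, and anomaly detection, to fields such as genomics, security, and finance. If you have a basic understanding of machine learning and statistics and you program in Java, Python, or Scala, you will find these patterns useful for your own data analysis applications.

  Patterns include:
    • Recommending music and the Audioscrobbler data set
    • Predicting forest cover with decision trees
    • Detecting anomalies in network traffic with K-means clustering
    • Understanding Wikipedia with Latent Semantic Analysis
    • Analyzing co-occurrence networks with GraphX
    • Analyzing New York City taxi trip data with geospatial and temporal data analysis
    • Estimating financial risk through Monte Carlo simulation
    • Analyzing genomics data and the BDG project
    • Analyzing neuroimaging data with PySpark and Thunder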

Table of Contents

Foreword
Preface
1. Analyzing Big Data
    The Challenges of Data Science
    Introducing Apache Spark
    About This Book
2. Introduction to Data Analysis with Scala and Spark
    Scala for Data Scientists
    The Spark Programming Model
    Record Linkage
    Getting Started: The Spark Shell and SparkContext
    Bringing Data from the Cluster to the Client
    Shipping Code from the Client to the Cluster
    Structuring Data with Tuples and Case Classes
    Aggregations
    Creating Histograms
    Summary Statistics for Continuous Variables
    Creating Reusable Code for Computing Summary Statistics
    Simple Variable Selection and Scoring
    Where to Go from Here
3. Recommending Music and the Audioscrobbler Data Set
    Data Set
    The Alternating Least Squares Recommender Algorithm
    Preparing the Data
    Building a First Model
    Spot Checking Recommendations
    Evaluating Recommendation Quality
    Computing AUC
    Hyperparameter Selection
    Making Recommendations
    Where to Go from Here
4. Predicting Forest Cover with Decision Trees
    Fast Forward to Regression
    Vectors and Features
    Training Examples
    Decision Trees and Forests
    Covtype Data Set
    Preparing the Data
    A First Decision Tree
    Decision Tree Hyperparameters
    Tuning Decision Trees
    Categorical Features Revisited
    Random Decision Forests
    Making Predictions
    Where to Go from Here
5. Anomaly Detection in Network Traffic with K-means Clustering
    Anomaly Detection
    K-means Clustering
    Network Intrusion
    KDD Cup 1999 Data Set
    A First Take on Clustering
    Choosing k
    Visualization in R
    Feature Normalization
    Categorical Variables
    Using Labels with Entropy
    Clustering in Action
    Where to Go from Here
6. Understanding Wikipedia with Latent Semantic Analysis
    The Term-Document Matrix
    Getting the Data
    Parsing and Preparing the Data
    Lemmatization
    Computing the TF-IDFs
    Singular Value Decomposition
    Finding Important Concepts
    Querying and Scoring with the Low-Dimensional Representation
    Term-Term Relevance
    Document-Document Relevance
    Term-Document Relevance
    Multiple-Term Queries
    Where to Go from Here
7. Analyzing Co-occurrence Networks with GraphX
    The MEDLINE Citation Index: A Network Analysis
    Getting the Data
    Parsing XML Documents with Scala's XML Library
    Analyzing the MeSH Major Topics and Their Co-occurrences
    Constructing a Co-occurrence Network with GraphX
    Understanding the Structure of Networks
    Connected Components
    Degree Distribution
    Filtering Out Noisy Edges
    Processing EdgeTriplets
    Analyzing the Filtered Graph
    Small-World Networks
    Cliques and Clustering Coefficients
    Computing Average Path Length with Pregel
    Where to Go from Here
8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
    Getting the Data
    Working with Temporal and Geospatial Data in Spark
    Temporal Data with JodaTime and NScalaTime
    Geospatial Data with the Esri Geometry API and Spray
    Exploring the Esri Geometry API
    Intro to GeoJSON
    Preparing the New York City Taxi Trip Data
    Handling Invalid Records at Scale
    Geospatial Analysis
    Sessionization in Spark
    Building Sessions: Secondary Sorts in Spark
    Where to Go from Here
9. Estimating Financial Risk through Monte Carlo Simulation
    Terminology
    Methods for Calculating VaR
    Variance-Covariance
    Historical Simulation
    Monte Carlo Simulation
    Our Model
    Getting the Data
    Preprocessing
    Determining the Factor Weights
    Sampling
    The Multivariate Normal Distribution
    Running the Trials
    Visualizing the Distribution of Returns
    Evaluating Our Results
    Where to Go from Here
10. Analyzing Genomics Data and the BDG Project
    Decoupling Storage from Modeling
    Ingesting Genomics Data with the ADAM CLI
    Parquet Format and Columnar Storage
    Predicting Transcription Factor Binding Sites from ENCODE Data
    Querying Genotypes from the 1000 Genomes Project
    Where to Go from Here
11. Analyzing Neuroimaging Data with PySpark and Thunder
    Overview of PySpark
    PySpark Internals
    Overview and Installation of the Thunder Library
    Loading Data with Thunder
    Thunder Core Data Types
    Categorizing Neuron Types with Thunder
    Where to Go from Here
A. Deeper into Spark
B. Upcoming MLlib Pipelines API
Index

