Hadoop: The Definitive Guide

Subtitle: None

Author: Tom White

Classification number:

ISBN: 9787564126766


Introduction

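This book shows how Apache Hadoop can unlock the power of your data. This comprehensive resource demonstrates how to use the Hadoop framework to build and maintain reliable, scalable distributed systems. Hadoop is an open source implementation of the MapReduce algorithm, on which Google built its empire. Programmers will find out how to analyze large datasets, and administrators will learn how to set up and run Hadoop clusters.

Hadoop: The Definitive Guide (English reprint edition, revised second edition) covers recent changes to Hadoop, including new features such as Hive, Sqoop, and Avro. It also provides case studies that show how Hadoop solves specific problems. Looking to get the most out of your data? This is the book for you.

· Use the Hadoop Distributed File System (HDFS) to store large datasets, and run distributed computations over those datasets with MapReduce
· Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence
· Discover common pitfalls and advanced features for writing real-world MapReduce programs
· Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
· Use Pig, a high-level query language, for large-scale data processing
· Analyze datasets with Hive, Hadoop's data warehousing system
· Take advantage of HBase, Hadoop's database, for structured and semi-structured data
· Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems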

Hadoop: The Definitive Guide (English reprint edition, revised second edition)

foreword

preface

1. meet hadoop

data!

data storage and analysis

comparison with other systems

rdbms

grid computing

volunteer computing

a brief history of hadoop

apache hadoop and the hadoop ecosystem

2. mapreduce

a weather dataset

data format

analyzing the data with unix tools

analyzing the data with hadoop

map and reduce

java mapreduce

scaling out

data flow

combiner functions

running a distributed mapreduce job

hadoop streaming

ruby

python

hadoop pipes

compiling and running

3. the hadoop distributed filesystem

the design of hdfs

hdfs concepts

blocks

namenodes and datanodes

the command-line interface

basic filesystem operations

hadoop filesystems

interfaces

the java interface

reading data from a hadoop url

reading data using the filesystem api

writing data

directories

querying the filesystem

deleting data

data flow

anatomy of a file read

anatomy of a file write

coherency model

parallel copying with distcp

keeping an hdfs cluster balanced

hadoop archives

using hadoop archives

limitations

4. hadoop i/o

data integrity

data integrity in hdfs

localfilesystem

checksumfilesystem

compression

codecs

compression and input splits

using compression in mapreduce

serialization

the writable interface

writable classes

implementing a custom writable

serialization frameworks

avro

file-based data structures

sequencefile

mapfile

5. developing a mapreduce application

the configuration api

combining resources

variable expansion

configuring the development environment

managing configuration

genericoptionsparser, tool, and toolrunner

writing a unit test

mapper

reducer

running locally on test data

running a job in a local job runner

testing the driver

running on a cluster

packaging

launching a job

the mapreduce web ui

retrieving the results

debugging a job

using a remote debugger

tuning a job

profiling tasks

mapreduce workflows

decomposing a problem into mapreduce jobs

running dependent jobs

6. how mapreduce works

anatomy of a mapreduce job run

job submission

job initialization

task assignment

task execution

progress and status updates

job completion

failures

task failure

tasktracker failure

jobtracker failure

job scheduling

the fair scheduler

the capacity scheduler

shuffle and sort

the map side

the reduce side

configuration tuning

task execution

speculative execution

task jvm reuse

skipping bad records

the task execution environment

7. mapreduce types and formats

mapreduce types

the default mapreduce job

input formats

input splits and records

text input

binary input

multiple inputs

database input (and output)

output formats

text output

binary output

multiple outputs

lazy output

database output

8. mapreduce features

counters

built-in counters

user-defined java counters

user-defined streaming counters

sorting

preparation

partial sort

total sort

secondary sort

joins

map-side joins

reduce-side joins

side data distribution

using the job configuration

distributed cache

mapreduce library classes

9. setting up a hadoop cluster

cluster specification

network topology

cluster setup and installation

installing java

creating a hadoop user

installing hadoop

testing the installation

ssh configuration

hadoop configuration

configuration management

environment settings

important hadoop daemon properties

hadoop daemon addresses and ports

other hadoop properties

user account creation

security

kerberos and hadoop

delegation tokens

other security enhancements

benchmarking a hadoop cluster

hadoop benchmarks

user jobs

hadoop in the cloud

hadoop on amazon ec2

10. administering hadoop

hdfs

persistent data structures

safe mode

audit logging

tools

monitoring

logging

metrics

java management extensions

maintenance

routine administration procedures

commissioning and decommissioning nodes

upgrades

11. pig

installing and running pig

execution types

running pig programs

grunt

pig latin editors

an example

generating examples

comparison with databases

pig latin

structure

statements

expressions

types

schemas

functions

user-defined functions

a filter udf

an eval udf

a load udf

data processing operators

loading and storing data

filtering data

grouping and joining data

sorting data

combining and splitting data

pig in practice

parallelism

parameter substitution

12. hive

installing hive

the hive shell

an example

running hive

configuring hive

hive services

the metastore

comparison with traditional databases

schema on read versus schema on write

updates, transactions, and indexes

hiveql

data types

operators and functions

tables

managed tables and external tables

partitions and buckets

storage formats

importing data

altering tables

dropping tables

querying data

sorting and aggregating

mapreduce scripts

joins

subqueries

views

user-defined functions

writing a udf

writing a udaf

13. hbase

hbasics

backdrop

concepts

whirlwind tour of the data model

implementation

installation

test drive

clients

java

avro, rest, and thrift

example

schemas

loading data

web queries

hbase versus rdbms

successful service

hbase

use case: hbase at streamy.com

praxis

versions

hdfs

ui

metrics

schema design

counters

bulk load

14. zookeeper

installing and running zookeeper

an example

group membership in zookeeper

creating the group

joining a group

listing members in a group

deleting a group

the zookeeper service

data model

operations

implementation

consistency

sessions

states

building applications with zookeeper

a configuration service

the resilient zookeeper application

a lock service

more distributed data structures and protocols

zookeeper in production

resilience and performance

configuration

15. sqoop

getting sqoop

a sample import

generated code

additional serialization systems

database imports: a deeper look

controlling the import

imports and consistency

direct-mode imports

working with imported data

imported data and hive

importing large objects

performing an export

exports: a deeper look

exports and transactionality

exports and sequencefiles

16. case studies

hadoop usage at last.fm

last.fm: the social music revolution

hadoop at last.fm

generating charts with hadoop

the track statistics program

summary

hadoop and hive at facebook

introduction

hadoop at facebook

hypothetical use case studies

hive

problems and future work

nutch search engine

background

data structures

selected examples of hadoop data processing in nutch

summary

log processing at rackspace

requirements/the problem

brief history

choosing hadoop

collection and storage

mapreduce for logs

cascading

fields, tuples, and pipes

operations

taps, schemes, and flows

cascading in practice

flexibility

hadoop and cascading at sharethis

summary

terabyte sort on apache hadoop

using pig and wukong to explore billion-edge network graphs

measuring community

everybody's talkin' at me: the twitter reply graph

symmetric links

community extraction

a. installing apache hadoop
b. cloudera's distribution for hadoop
c. preparing the ncdc weather data
index
