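Introduction
Discover how Apache Hadoop can unleash the power of your data. This comprehensive book shows how to use the Hadoop framework to build and maintain reliable, scalable distributed systems. Hadoop is an open source implementation of the MapReduce algorithm, an important cornerstone on which Google built its empire. Programmers can explore how to analyze massive datasets, and administrators can learn how to set up and run Hadoop clusters.
"Hadoop: The Definitive Guide (English Reprint Edition) (Second Edition, Revised)" covers recent changes to Hadoop, including new features such as Hive, Sqoop, and Avro. It also provides case studies that show how Hadoop solves specific problems. Looking to get the most out of your data? This is your book.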
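·Use the Hadoop Distributed File System (HDFS) to store massive datasets, and run distributed computations over those datasets with MapReduce
·Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence
·Discover common pitfalls and advanced features for writing real-world MapReduce programs
·Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
·Use Pig, a high-level query language, to process data at scale
·Analyze datasets with Hive, Hadoop's data warehousing system
·Take advantage of HBase, Hadoop's database, to handle structured and semi-structured data
·Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems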
Table of Contents
"Hadoop: The Definitive Guide (English Reprint Edition) (Second Edition, Revised)"
foreword
preface
1. meet hadoop
data!
data storage and analysis
comparison with other systems
rdbms
grid computing
volunteer computing
a brief history of hadoop
apache hadoop and the hadoop ecosystem
2. mapreduce
a weather dataset
data format
analyzing the data with unix tools
analyzing the data with hadoop
map and reduce
java mapreduce
scaling out
data flow
combiner functions
running a distributed mapreduce job
hadoop streaming
ruby
python
hadoop pipes
compiling and running
3. the hadoop distributed filesystem
the design of hdfs
hdfs concepts
blocks
namenodes and datanodes
the command-line interface
basic filesystem operations
hadoop filesystems
interfaces
the java interface
reading data from a hadoop url
reading data using the filesystem api
writing data
directories
querying the filesystem
deleting data
data flow
anatomy of a file read
anatomy of a file write
coherency model
parallel copying with distcp
keeping an hdfs cluster balanced
hadoop archives
using hadoop archives
limitations
4. hadoop i/o
data integrity
data integrity in hdfs
localfilesystem
checksumfilesystem
compression
codecs
compression and input splits
using compression in mapreduce
serialization
the writable interface
writable classes
implementing a custom writable
serialization frameworks
avro
file-based data structures
sequencefile
mapfile
5. developing a mapreduce application
the configuration api
combining resources
variable expansion
configuring the development environment
managing configuration
genericoptionsparser, tool, and toolrunner
writing a unit test
mapper
reducer
running locally on test data
running a job in a local job runner
testing the driver
running on a cluster
packaging
launching a job
the mapreduce web ui
retrieving the results
debugging a job
using a remote debugger
tuning a job
profiling tasks
mapreduce workflows
decomposing a problem into mapreduce jobs
running dependent jobs
6. how mapreduce works
anatomy of a mapreduce job run
job submission
job initialization
task assignment
task execution
progress and status updates
job completion
failures
task failure
tasktracker failure
jobtracker failure
job scheduling
the fair scheduler
the capacity scheduler
shuffle and sort
the map side
the reduce side
configuration tuning
task execution
speculative execution
task jvm reuse
skipping bad records
the task execution environment
7. mapreduce types and formats
mapreduce types
the default mapreduce job
input formats
input splits and records
text input
binary input
multiple inputs
database input (and output)
output formats
text output
binary output
multiple outputs
lazy output
database output
8. mapreduce features
counters
built-in counters
user-defined java counters
user-defined streaming counters
sorting
preparation
partial sort
total sort
secondary sort
joins
map-side joins
reduce-side joins
side data distribution
using the job configuration
distributed cache
mapreduce library classes
9. setting up a hadoop cluster
cluster specification
network topology
cluster setup and installation
installing java
creating a hadoop user
installing hadoop
testing the installation
ssh configuration
hadoop configuration
configuration management
environment settings
important hadoop daemon properties
hadoop daemon addresses and ports
other hadoop properties
user account creation
security
kerberos and hadoop
delegation tokens
other security enhancements
benchmarking a hadoop cluster
hadoop benchmarks
user jobs
hadoop in the cloud
hadoop on amazon ec2
10. administering hadoop
hdfs
persistent data structures
safe mode
audit logging
tools
monitoring
logging
metrics
java management extensions
maintenance
routine administration procedures
commissioning and decommissioning nodes
upgrades
11. pig
installing and running pig
execution types
running pig programs
grunt
pig latin editors
an example
generating examples
comparison with databases
pig latin
structure
statements
expressions
types
schemas
functions
user-defined functions
a filter udf
an eval udf
a load udf
data processing operators
loading and storing data
filtering data
grouping and joining data
sorting data
combining and splitting data
pig in practice
parallelism
parameter substitution
12. hive
installing hive
the hive shell
an example
running hive
configuring hive
hive services
the metastore
comparison with traditional databases
schema on read versus schema on write
updates, transactions, and indexes
hiveql
data types
operators and functions
tables
managed tables and external tables
partitions and buckets
storage formats
importing data
altering tables
dropping tables
querying data
sorting and aggregating
mapreduce scripts
joins
subqueries
views
user-defined functions
writing a udf
writing a udaf
13. hbase
hbasics
backdrop
concepts
whirlwind tour of the data model
implementation
installation
test drive
clients
java
avro, rest, and thrift
example
schemas
loading data
web queries
hbase versus rdbms
successful service
hbase
use case: hbase at streamy.com
praxis
versions
hdfs
ui
metrics
schema design
counters
bulk load
14. zookeeper
installing and running zookeeper
an example
group membership in zookeeper
creating the group
joining a group
listing members in a group
deleting a group
the zookeeper service
data model
operations
implementation
consistency
sessions
states
building applications with zookeeper
a configuration service
the resilient zookeeper application
a lock service
more distributed data structures and protocols
zookeeper in production
resilience and performance
configuration
15. sqoop
getting sqoop
a sample import
generated code
additional serialization systems
database imports: a deeper look
controlling the import
imports and consistency
direct-mode imports
working with imported data
imported data and hive
importing large objects
performing an export
exports: a deeper look
exports and transactionality
exports and sequencefiles
16. case studies
hadoop usage at last.fm
last.fm: the social music revolution
hadoop at last.fm
generating charts with hadoop
the track statistics program
summary
hadoop and hive at facebook
introduction
hadoop at facebook
hypothetical use case studies
hive
problems and future work
nutch search engine
background
data structures
selected examples of hadoop data processing in nutch
summary
log processing at rackspace
requirements/the problem
brief history
choosing hadoop
collection and storage
mapreduce for logs
cascading
fields, tuples, and pipes
operations
taps, schemes, and flows
cascading in practice
flexibility
hadoop and cascading at sharethis
summary
terabyte sort on apache hadoop
using pig and wukong to explore billion-edge network graphs
measuring community
everybody's talkin' at me: the twitter reply graph
symmetric links
community extraction
a. installing apache hadoop
b. cloudera's distribution for hadoop
c. preparing the ncdc weather data
index