Subtitle: none

Author:

Call Number:

ISBN: 9780123884268

Description

As the computer industry retools to leverage massively parallel graphics processing units (GPUs), this book is designed to meet the needs of working software developers who need to understand GPU programming with CUDA and increase efficiency in their projects. CUDA Application Design and Development starts with an introduction to parallel computing concepts for readers with no previous parallel experience, then focuses on issues of immediate importance to working software developers: achieving high performance, maintaining competitiveness, analyzing CUDA benefits versus costs, and determining application lifespan.

The book then details the thinking behind CUDA and teaches how to create, analyze, and debug CUDA applications. Throughout, the focus is on software engineering issues: how to use CUDA in the context of existing application code, with existing compilers, languages, software tools, and industry-standard API libraries. Using an approach refined in a series of well-received articles at Dr. Dobb's Journal, author Rob Farber takes the reader step by step from fundamentals to implementation, moving from language theory to practical coding.

    • Includes multiple examples building from simple to more complex applications in four key areas: machine learning, visualization, vision recognition, and mobile computing
    • Addresses the foundational issues for CUDA development: multi-threaded programming and the different memory hierarchy
    • Includes teaching chapters designed to give a full understanding of CUDA tools, techniques, and structure
    • Presents CUDA techniques in the context of the hardware they are implemented on, as well as other styles of programming that will help readers bridge into the new material

Table of Contents

Front Cover 1
CUDA Application Design and Development 4
Copyright 5
Dedication 6
Table of Contents 8
Foreword 12
Preface 14
1 First Programs and How to Think in CUDA 19
Source Code and Wiki 20
Distinguishing CUDA from Conventional Programming with a Simple Example 20
Choosing a CUDA API 23
Some Basic CUDA Concepts 25
Understanding Our First Runtime Kernel 28
Three Rules of GPGPU Programming 30
Rule 1: Get the Data on the GPU and Keep It There 30
Rule 2: Give the GPGPU Enough Work to Do 31
Rule 3: Focus on Data Reuse within the GPGPU to Avoid Memory Bandwidth Limitations 31
Big-O Considerations and Data Transfers 32
CUDA and Amdahl's Law 34
Data and Task Parallelism 35
Hybrid Execution: Using Both CPU and GPU Resources 36
Regression Testing and Accuracy 39
Silent Errors 40
Introduction to Debugging 41
UNIX Debugging 42
NVIDIA's cuda-gdb Debugger 42
The CUDA Memory Checker 44
Use cuda-gdb with the UNIX ddd Interface 45
Windows Debugging with Parallel Nsight 47
Summary 48
2 CUDA for Machine Learning and Optimization 51
Modeling and Simulation 52
Fitting Parameterized Models 53
Nelder-Mead Method 54
Levenberg-Marquardt Method 54
Algorithmic Speedups 55
Machine Learning and Neural Networks 56
XOR: An Important Nonlinear Machine-Learning Problem 57
An Example Objective Function 59
A Complete Functor for Multiple GPU Devices and the Host Processors 60
Brief Discussion of a Complete Nelder-Mead Optimization Code 62
Performance Results on XOR 71
Performance Discussion 71
Summary 74
The C++ Nelder-Mead Template 75
3 The CUDA Tool Suite: Profiling a PCA/NLPCA Functor 81
PCA and NLPCA 82
Autoencoders 83
An Example Functor for PCA Analysis 84
An Example Functor for NLPCA Analysis 86
Obtaining Basic Profile Information 89
Gprof: A Common UNIX Profiler 91
The NVIDIA Visual Profiler: Computeprof 92
Parallel Nsight for Microsoft Visual Studio 95
The Nsight Timeline Analysis 95
The NVTX Tracing Library 97
Scaling Behavior of the CUDA API 98
Tuning and Analysis Utilities (TAU) 100
Summary 101
4 The CUDA Execution Model 103
GPU Architecture Overview 104
Thread Scheduling: Orchestrating Performance and Parallelism via the Execution Configuration 105
Relevant computeprof Values for a Warp 108
Warp Divergence 108
Guidelines for Warp Divergence 109
Relevant computeprof Values for Warp Divergence 110
Warp Scheduling and TLP 110
Relevant computeprof Values for Occupancy 112
ILP: Higher Performance at Lower Occupancy 112
ILP Hides Arithmetic Latency 113
ILP Hides Data Latency 116
ILP in the Future 116
Relevant computeprof Values for Instruction Rates 118
Little's Law 118
CUDA Tools to Identify Limiting Factors 120
The nvcc Compiler 121
Launch Bounds 122
The Disassembler 123
PTX Kernels 124
GPU Emulators 125
Summary 126
5 CUDA Memory 127
The CUDA Memory Hierarchy 127
GPU Memory 129
L2 Cache 130
Relevant computeprof Values for the L2 Cache 131
L1 Cache 132
Relevant computeprof Values for the L1 Cache 133
CUDA Memory Types 134
Registers 134
Local Memory 134
Relevant computeprof Values for Local Memory Cache 135
Shared Memory 135
Relevant computeprof Values for Shared Memory 138
Constant Memory 138
Texture Memory 139
Relevant computeprof Values for Texture Memory 142
Global Memory 142
Common Coalescing Use Cases 144
Allocation of Global Memory 145
Limiting Factors in the Design of Global Memory 146
Relevant computeprof Values for Global Memory 148
Summary 149
6 Efficiently Using GPU Memory 151
Reduction 152
The Reduction Template 152
A Test Program for functionReduce.h 158
Results 162
Utilizing Irregular Data Structures 164
Sparse Matrices and the CUSP Library 167
Graph Algorithms 169
SoA, AoS, and Other Structures 172
Tiles and Stencils 172
Summary 173
7 Techniques to Increase Parallelism 175
CUDA Contexts Extend Parallelism 176
Streams and Contexts 177
Multiple GPUs 177
Explicit Synchronization 178
Implicit Synchronization 179
The Unified Virtual Address Space 180
A Simple Example 180
Profiling Results 183
Out-of-Order Execution with Multiple Streams 184
Tip for Concurrent Kernel Execution on the Same GPU 187
Atomic Operations for Implicitly Concurrent Kernels 187
Tying Data to Computation 190
Manually Partitioning Data 190
Mapped Memory 191
How Mapped Memory Works 193
Summary 194
8 CUDA for All GPU and CPU Applications 197
Pathways from CUDA to Multiple Hardware Backends 198
The PGI CUDA x86 Compiler 199
The PGI CUDA x86 Compiler 201
An x86 core as an SM 203
The NVIDIA NVCC Compiler 204
Ocelot 205
Swan 206
MCUDA 206
Accessing CUDA from Other Languages 206
SWIG 207
Copperhead 207
EXCEL 208
MATLAB 208
Libraries 209
CUBLAS 209
CUFFT 209
MAGMA 220
phiGEMM Library 221
CURAND 221
Summary 223
9 Mixing CUDA and Rendering 225
OpenGL 226
GLUT 226
Mapping GPU Memory with OpenGL 227
Using Primitive Restart for 3D Performance 228
Introduction to the Files in the Framework 231
The Demo and Perlin Example Kernels 231
The Demo Kernel 232
The Demo Kernel to Generate a Colored Sinusoidal Surface 232
Perlin Noise 235
Using the Perlin Noise Kernel to Generate Artificial Terrain 237
The simpleGLmain.cpp File 242
The simpleVBO.cpp File 246
The callbacksVBO.cpp File 251
Summary 256
10 CUDA in a Cloud and Cluster Environments 259
The Message Passing Interface (MPI) 260
The MPI Programming Model 260
The MPI Communicator 261
MPI Rank 261
Master-Slave 263
Point-to-Point Basics 263
How MPI Communicates 264
Bandwidth 266
Balance Ratios 267
Considerations for Large MPI Runs 270
Scalability of the Initial Data Load 270
Using MPI to Perform a Calculation 271
Check Scalability 272
Cloud Computing 273
A Code Example 274
Data Generation 274
Summary 282
11 CUDA for Real Problems 283
Working with High-Dimensional Data 284
PCA/NLPCA 285
Multidimensional Scaling 285
K-Means Clustering 286
Expectation-Maximization 286
Support Vector Machines 287
Bayesian Networks 287
Mutual Information 288
Force-Directed Graphs 289
Monte Carlo Methods 290
Molecular Modeling 291
Quantum Chemistry 291
Interactive Workflows 292
A Plethora of Projects 292
Summary 293
12 Application Focus on Live Streaming Video 295
Topics in Machine Vision 296
3D Effects 297
Segmentation of Flesh-colored Regions 297
Edge Detection 298
FFmpeg 299
TCP Server 301
Live Stream Application 305
kernelWave(): An Animated Kernel 305
kernelFlat(): Render the Image on a Flat Surface 306
kernelSkin(): Keep Only Flesh-colored Regions 306
kernelSobel(): A Simple Sobel Edge Detection Filter 307
The launch_kernel() Method 308
The simpleVBO.cpp File 309
The callbacksVBO.cpp File 309
Building and Running the Code 313
The Future 313
Machine Learning 313
The Connectome 314
Summary 315
Listing for simpleVBO.cpp 315
Works Cited 321
Index 329
A 329
B 329
C 329
D 329
E 330
F 330
G 330
H 330
I 330
J 330
K 330
L 331
M 331
N 331
O 331
P 332
Q 332
R 332
S 332
T 332
U 333
V 333
W 333
X 333
