NVIDIA CUDA Reference Manual
Existing suballocations can be resized using nvtxMemRegionsResize; for example, a previous suballocation at address ptr can be resized from its original 16 bytes to a new size. In a similar fashion, existing suballocations can be removed using nvtxMemRegionsUnregister; for example, the previous suballocation at address ptr can be removed. Naming API. Any allocation can be assigned a name, so that future Compute Sanitizer error reports can refer to the allocation by its name.

This example names the allocation at address ptr "My Allocation". At present, only the leak and unused-memory reports make use of allocation names. Permissions API. For this example, we use the global program scope by calling nvtxMemCudaGetProcessWidePermissions, meaning that permissions are applied to all kernel launches.

This example restricts the allocation at address ptr to read-only permissions. The following example gets the permissions handle from device device; that handle is used with nvtxMemPermissionsAssign to change the permissions for the allocation at address ptr, previously restricted to read-only at the global scope, to read-write (no atomics allowed) for kernels launched on device. Advanced Permissions Management. Permissions can be assigned at the scope of a specific stream by means of custom permissions objects.

The following example restricts the allocation at address ptr to read-only permissions. Note that permissions set on a scope also apply to allocations whose permissions were never assigned; for example, excluding write permissions will block writes for all allocations with unassigned permissions on that scope. Permissions objects created with nvtxMemPermissionsCreate are applied to kernel launches on the stream bound to the created object.

Permissions objects currently bound can be unbound using nvtxMemPermissionsUnbind and destroyed using nvtxMemPermissionsDestroy. The host issues a succession of kernel invocations to the device. Each kernel is executed as a batch of threads organized as a grid of CTAs (Figure 1). A cooperative thread array (CTA) is a set of concurrent threads that execute the same kernel program.

A grid is a set of CTAs that execute independently. PTX threads may access data from multiple memory spaces during their execution, as illustrated by Figure 2. Each thread has a private local memory. Each thread block (CTA) has a shared memory visible to all threads of the block and with the same lifetime as the block. Finally, all threads have access to the same global memory. There are additional memory spaces accessible by all threads: the constant, texture, and surface memory spaces.

Constant and texture memory are read-only; surface memory is readable and writable. The global, constant, texture, and surface memory spaces are optimized for different memory usages. For example, texture memory offers different addressing modes as well as data filtering for specific data formats.

Note that texture and surface memory is cached, and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory writes, so any texture fetch or surface read to an address that has been written to via a global or a surface write in the same kernel call returns undefined data. In other words, a thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.

The global, constant, and texture memory spaces are persistent across kernel launches by the same application. Both the host and the device maintain their own local memory, referred to as host memory and device memory, respectively. The device memory may be mapped and read or written by the host, or, for more efficient transfer, copied from the host memory through optimized API calls that utilize the device's high-performance Direct Memory Access (DMA) engine.

When a host program invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors. A multiprocessor consists of multiple Scalar Processor (SP) cores, a multithreaded instruction unit, and on-chip shared memory.

The multiprocessor creates, manages, and executes concurrent threads in hardware with zero scheduling overhead. It implements a single-instruction barrier synchronization.

Fast barrier synchronization together with lightweight thread creation and zero-overhead thread scheduling efficiently support very fine-grained parallelism, allowing, for example, a low-granularity decomposition of problems by assigning one thread to each data element (such as a pixel in an image, a voxel in a volume, or a cell in a grid-based computation). To manage hundreds of threads running several different programs, the multiprocessor employs an architecture we call SIMT (single-instruction, multiple-thread).

The multiprocessor maps each thread to one scalar processor core, and each scalar thread executes independently with its own instruction address and register state. The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of parallel threads called warps.

This term originates from weaving, the first parallel thread technology. Individual threads composing a SIMT warp start together at the same program address but are otherwise free to branch and execute independently.

When a multiprocessor is given one or more thread blocks to execute, it splits them into warps that get scheduled by the SIMT unit.

The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. At every instruction issue time, the SIMT unit selects a warp that is ready to execute and issues the next instruction to the active threads of the warp.

A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.

Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads.

For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance.

Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually. How many blocks a multiprocessor can process at once depends on how many registers per thread and how much shared memory per block are required for a given kernel since the multiprocessor's registers and shared memory are split among all the threads of the batch of blocks.

If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.

On architectures prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from. Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp.

With Independent Thread Scheduling , the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units.

Independent Thread Scheduling can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity of previous hardware architectures.

In particular, any warp-synchronous code, such as synchronization-free intra-warp reductions, should be revisited to ensure compatibility with Volta and beyond. See the section on Compute Capability 7.x. As illustrated by Figure 3, each multiprocessor has on-chip memory of four types: registers, shared memory, a read-only constant cache, and a read-only texture cache.

The local and global memory spaces are read-write regions of device memory and are not cached. PTX programs are a collection of text source modules (files). PTX source modules have an assembly-language style syntax with instruction operation codes and operands. Pseudo-operations specify symbol and addressing management. The ptxas optimizing backend compiler optimizes and assembles PTX source modules to produce corresponding binary object files.

All whitespace characters are equivalent; whitespace is ignored except for its use in separating tokens in the language. The C preprocessor cpp may be used to process PTX source modules. Lines beginning with # are preprocessor directives; common preprocessor directives include #include, #define, #if, #ifdef, #else, #endif, #line, and #file. Each PTX module must begin with a .version directive specifying the PTX language version. Comments in PTX follow C/C++ syntax; comments cannot occur within character constants, string literals, or within other comments.

A PTX statement is either a directive or an instruction. Statements begin with an optional label and end with a semicolon. Directive keywords begin with a dot, so no conflict is possible with user-defined identifiers. Instructions are formed from an instruction opcode followed by a comma-separated list of zero or more operands, and terminated with a semicolon.

Operands may be register variables, constant expressions, address expressions, or label names. Instructions have an optional guard predicate which controls conditional execution.

The guard predicate follows the optional label and precedes the opcode, and is written as @p, where p is a predicate register. The guard predicate may be optionally negated, written as @!p. Instruction keywords are listed in Table 2. All instruction keywords are reserved tokens in PTX.
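As a small sketch of this statement syntax (the register names and label are hypothetical, chosen only for illustration), a predicated fragment might look like:

```ptx
    setp.eq.s32  %p1, %r1, %r2;    // set predicate %p1 = (%r1 == %r2)
@%p1  bra     DONE;                // guarded branch, taken when %p1 is true
@!%p1 add.s32 %r3, %r3, 1;         // negated guard: executes only when %p1 is false
DONE:
    ret;
```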

PTX does not specify a maximum length for identifiers, and suggests that all implementations support a minimum length of at least 1024 characters. PTX allows the percentage sign as the first character of an identifier. The percentage sign can be used to avoid name conflicts, e.g., between user-defined variable names and compiler-generated names.

PTX predefines one constant and a small number of special registers that begin with the percentage sign, listed in Table 3. PTX supports integer and floating-point constants and constant expressions. These constants may be used in data initialization and as operands to instructions.

Type checking rules remain the same for integer, floating-point, and bit-size types. For predicate-type data and instructions, integer constants are allowed and are interpreted as in C, i.e., zero values are False and non-zero values are True. Integer constants are 64 bits in size and are either signed or unsigned, i.e., of type .s64 or .u64. When used in an instruction or data initialization, each integer constant is converted to the appropriate size based on the data or instruction type at its use.

Integer literals may be written in decimal, hexadecimal, octal, or binary notation. The syntax follows that of C. Integer literals may be followed immediately by the letter U to indicate that the literal is unsigned.
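The notations above can be illustrated with a few hedged initializer declarations (variable names are hypothetical):

```ptx
.global .u32 dec_val = 42;      // decimal
.global .u32 hex_val = 0xFA;    // hexadecimal
.global .u32 oct_val = 052;     // octal (leading zero, as in C)
.global .u32 bin_val = 0b1010;  // binary
.global .u32 uns_val = 7U;      // U suffix marks the literal unsigned
```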

Integer literals are non-negative and have a type determined by their magnitude and optional type suffix as follows: literals are signed (.s64) unless the value cannot be fully represented in .s64 or the unsigned (U) suffix is specified, in which case the literal is unsigned (.u64). Floating-point constants are represented as 64-bit double-precision values, and all floating-point constant expressions are evaluated using 64-bit double-precision arithmetic.

The only exception is the 32-bit hex notation for expressing an exact single-precision floating-point value; such values retain their exact 32-bit single-precision value and may not be used in constant expressions. Each 64-bit floating-point constant is converted to the appropriate floating-point size based on the data or instruction type at its use.

Floating-point literals may be written with an optional decimal point and an optional signed exponent. PTX includes a second representation of floating-point constants for specifying the exact machine representation using a hexadecimal constant. To specify IEEE double-precision floating point values, the constant begins with 0d or 0D followed by 16 hex digits.

To specify IEEE single-precision floating point values, the constant begins with 0f or 0F followed by 8 hex digits. In PTX, integer constants may be used as predicates. For predicate-type data initializers and instruction operands, integer constants are interpreted as in C, i.e., zero values are False and non-zero values are True. In PTX, constant expressions are formed using operators as in C and are evaluated using rules similar to those in C, but simplified by restricting types and sizes, removing most casts, and defining full semantics to eliminate cases where expression evaluation in C is implementation dependent.
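A brief sketch of the exact-representation notation (register names are hypothetical; the hex digits encode IEEE 754 bit patterns for 1.0):

```ptx
.reg .f64 %fd1;
.reg .f32 %f1;
mov.f64  %fd1, 0d3FF0000000000000;   // exactly 1.0 as a 64-bit double
mov.f32  %f1,  0f3F800000;           // exactly 1.0 as a 32-bit single
```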

Constant expressions are formed from constant literals, unary plus and minus, basic arithmetic operators (addition, subtraction, multiplication, division), comparison operators, the conditional ternary operator (?:), and parentheses.

Integer constant expressions also allow unary logical negation (!), bitwise complement (~), remainder (%), shift operators (<< and >>), bit-type operators (&, |, and ^), and logical operators (&& and ||). Constant expressions in PTX do not support casts between integer and floating-point. Constant expressions are evaluated using the same operator precedence as in C. Table 4 gives operator precedence and associativity. Operator precedence is highest for unary operators and decreases with each line in the chart.

Operators on the same line have the same precedence and are evaluated right-to-left for unary operators and left-to-right for binary operators. Integer constant expressions are evaluated at compile time according to a set of rules that determine the type (signed .s64 versus unsigned .u64) of each sub-expression. These rules are based on the rules in C, but they have been simplified to apply only to 64-bit integers, and behavior is fully defined in all cases (specifically, for remainder and shift operators). Table 5 contains a summary of the constant expression evaluation rules.

While the specific resources available in a given target GPU will vary, the kinds of resources will be common across platforms, and these resources are abstracted in PTX through state spaces and data types. A state space is a storage area with particular characteristics. All variables reside in some state space. The characteristics of a state space include its size, addressability, access speed, access rights, and level of sharing between threads.

The state spaces defined in PTX are a byproduct of parallel programming and graphics programming. The list of state spaces is shown in Table 6, and properties of state spaces are shown in Table 7.

The address of a register variable may not be taken; however, device function input and return parameters may have their address taken via mov, in which case the parameter is located on the stack frame and its address is in the .local state space. The number of registers is limited, and will vary from platform to platform.

When the limit is exceeded, register variables will be spilled to memory, causing changes in performance. For each architecture, there is a recommended maximum number of registers to use (see the CUDA Programming Guide for details). Registers may be typed (signed integer, unsigned integer, floating point, predicate) or untyped. Register size is restricted; aside from predicate registers, which are 1-bit, scalar registers have a width of 8-, 16-, 32-, or 64-bits, and vector registers have a width of 16-, 32-, 64-, or 128-bits.

The most common use of 8-bit registers is with ld, st, and cvt instructions, or as elements of vector tuples. Registers differ from the other state spaces in that they are not fully addressable, i.e., it is not possible to refer to the address of a register.
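A minimal sketch of register declarations and the typical 8-bit usage described above (all names, including the address register addr, are hypothetical):

```ptx
.reg .pred p;           // 1-bit predicate register
.reg .b8   bval;        // 8-bit untyped register
.reg .u32  r1, r2;      // 32-bit unsigned integer registers
.reg .f64  fd1;         // 64-bit floating-point register
.reg .u64  addr;        // hypothetical register holding an address

ld.global.u8  bval, [addr];   // 8-bit registers are most useful with ld/st/cvt
cvt.u32.u8    r1, bval;       // widen the byte into a 32-bit register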

When compiling to use the Application Binary Interface (ABI), register variables are restricted to function scope and may not be declared at module scope.

Registers may have alignment boundaries required by multi-word loads and stores. The special register (.sreg) state space holds predefined, read-only registers, such as the thread, CTA, and grid identifier registers; all special registers are predefined. The constant (.const) state space is a read-only memory initialized by the host. Constant memory is accessed with a ld.const instruction. Constant memory is restricted in size, currently limited to 64 KB, which can be used to hold statically-sized constant variables.

There is an additional 640 KB of constant memory, organized as ten independent 64 KB regions. The driver may allocate and initialize constant buffers in these regions and pass pointers to the buffers as kernel function parameters. Since the ten regions are not contiguous, the driver must ensure that constant buffers are allocated so that each buffer fits entirely within a 64 KB region and does not span a region boundary. Statically-sized constant variables have an optional variable initializer; constant variables with no explicit initializer are initialized to zero by default.

Constant buffers allocated by the driver are initialized by the host, and pointers to such buffers are passed to the kernel as parameters. See the description of kernel parameter attributes in Kernel Function Parameter Attributes for more details on passing pointers to constant buffers as kernel parameters. Previous versions of PTX exposed constant memory as a set of eleven 64 KB banks, with explicit bank numbers required for variable declaration and during access; banks were specified using the .const[bank] modifier, where bank ranged from 0 to 10.

If no bank number was given, bank zero was assumed. The global (.global) state space is memory that is accessible by all threads in a context. It is the mechanism by which different CTAs and different grids can communicate. Use ld.global, st.global, and atom.global to access global variables. Global variables have an optional variable initializer; global variables with no explicit initializer are initialized to zero by default.
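The constant and global state spaces described above can be sketched as follows (variable names are hypothetical, and the constant's value is assumed set by the host):

```ptx
.const  .u32 gamma = 16;          // statically-sized constant variable
.global .f32 bias  = 0f3F800000;  // global scalar initialized to 1.0

.reg .u32 r1;
.reg .f32 f1;
ld.const.u32   r1, [gamma];   // read from constant memory
ld.global.f32  f1, [bias];    // read from global memory
st.global.f32  [bias], f1;    // write back to global memory
```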

The local (.local) state space is private memory for each thread to keep its own data. It is typically standard memory with a cache. The size is limited, as it must be allocated on a per-thread basis. In implementations that do not support a stack, all local memory variables are stored at fixed addresses and recursive function calls are not supported. The parameter (.param) state space is used to pass input arguments from the host to the kernel and to declare formal input and return parameters for device functions. Kernel function parameters differ from device function parameters in terms of access and sharing (read-only versus read-write, per-kernel versus per-thread).

Each kernel function definition includes an optional list of parameters. These parameters are addressable, read-only variables declared in the .param state space. Values passed from the host to the kernel are accessed through these parameter variables using ld.param instructions. The kernel parameter variables are shared across all CTAs within a grid. The address of a kernel parameter may be moved into a register using the mov instruction; the resulting address is in the .param state space and is accessed using ld.param instructions.
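A minimal sketch of a kernel entry reading its parameters as described above (the kernel and parameter names are hypothetical):

```ptx
.entry scale ( .param .u64 pdata, .param .f32 factor )
{
    .reg .u64 %rd1;
    .reg .f32 %f1;
    ld.param.u64  %rd1, [pdata];    // pointer argument passed from the host
    ld.param.f32  %f1,  [factor];   // scalar argument
    ret;
}
```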

Kernel function parameters may represent normal data values, or they may hold addresses to objects in constant, global, local, or shared state spaces. In the case of pointers, the compiler and runtime system need information about which parameters are pointers, and to which state space they point. Kernel parameter attribute directives are used to provide this information at the PTX level. See Kernel Function Parameter Attributes for a description of kernel parameter attribute directives.

Kernel function parameters may be declared with an optional .ptr attribute. Kernel Parameter Attribute: .ptr. Used to specify the state space and, optionally, the alignment of memory pointed to by a pointer-type kernel parameter. The alignment value N, if present, must be a power of two. If no state space is specified, the pointer is assumed to be a generic address pointing to one of const, global, local, or shared memory.

If no alignment is specified, the memory pointed to is assumed to be aligned to a 4-byte boundary. Spaces are not permitted between .ptr and its state space and .align qualifiers. The most common use of the .param space in device functions is for passing objects by value that do not fit within a PTX register, such as C structures larger than 8 bytes. In this case, a byte array in parameter space is used. Typically, the caller will declare a locally-scoped .param byte array variable.
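The .ptr attribute can be sketched as follows (the entry and parameter names are hypothetical):

```ptx
.entry foo ( .param .u32 len,
             .param .u64 .ptr.global.align 16 pA,   // pointer into global memory
             .param .u64 .ptr.align 8 pB )          // generic address pointer
{
    ret;
}
```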

This will be passed by value to a callee, which declares a .param formal parameter of matching size and alignment. Function input parameters may be read via ld.param, and function return parameters may be written using st.param. Aside from passing structures by value, .param space is also required whenever a formal parameter has its address taken within the called function. In PTX, the address of a function input parameter may be moved into a register using the mov instruction. Note that the parameter will be copied to the stack if necessary, and so the address will be in the .local state space and is accessed via ld.local and st.local. It is not possible to use mov to get the address of a return parameter or a locally-scoped .param space variable.

The shared (.shared) state space is a per-CTA region of memory. An address in shared memory can be read and written by any thread in a CTA. Shared memory typically has some optimizations to support the sharing: one example is broadcast, where all threads read from the same address; another is sequential access from sequential threads. The texture (.tex) state space is global memory accessed via the texture instruction. It is shared by all threads in a context. Texture memory is read-only and cached, so accesses to texture memory are not coherent with global memory stores to the texture image.

The GPU hardware has a fixed number of texture bindings that can be accessed within a single kernel (typically 128). Multiple names may be bound to the same physical texture identifier. An error is generated if the maximum number of physical resources is exceeded. The texture name must be of type .u32 or .u64. Physical texture resources are allocated on a per-kernel granularity, and .tex variables are required to be defined in the global scope. Texture memory is read-only. A texture's base address is assumed to be aligned to a 16-byte boundary.

See Texture Sampler and Surface Types for the description of the .texref type. In PTX, the fundamental types reflect the native data types supported by the target architectures. A fundamental type specifies both a basic type and a size. Register variables are always of a fundamental type, and instructions operate on these types. The same type-size specifiers are used for both variable definitions and for typing instructions, so their names are intentionally short.

Table 8 lists the fundamental type specifiers for each basic type. Most instructions have one or more type specifiers, needed to fully specify instruction behavior. Operand types and sizes are checked against instruction types for compatibility.

Two fundamental types are compatible if they have the same basic type and are the same size. Signed and unsigned integer types are compatible if they have the same size. The bit-size type is compatible with any fundamental type having the same size. In principle, all variables aside from predicates could be declared using only bit-size types, but typed variables enhance program readability and allow for better operand type checking.

For convenience, ld , st , and cvt instructions permit source and destination data operands to be wider than the instruction-type size, so that narrow values may be loaded, stored, and converted using regular-width registers. For example, 8-bit or bit values may be held directly in bit or bit registers when being loaded, stored, or converted to other types and sizes.

The fundamental floating-point types supported in PTX have implicit bit representations that indicate the number of bits used to store exponent and mantissa. For example, the .f32 type uses 8 bits for the exponent and 23 bits for the mantissa. In addition to the floating-point representations assumed by the fundamental types, PTX allows a number of alternate floating-point data formats, such as bf16 and tf32.

Alternate data formats cannot be used as fundamental types. They are supported as source or destination formats by certain instructions. PTX includes built-in opaque types for defining texture, sampler, and surface descriptor variables. These types have named fields similar to structures, but all information about layout, field ordering, base address, and overall size is hidden to a PTX program, hence the term opaque.

The use of these opaque types is limited to variable definition at module scope and in kernel entry parameter lists, static initialization of module-scope variables, and reference by texture, sampler, and surface instructions. Indirect access to textures and surfaces using pointers to opaque variables is supported beginning with PTX ISA version 3.1. Indirect access to textures is supported only in unified texture mode (see below). The three built-in types are .texref, .samplerref, and .surfref. For working with textures and samplers, PTX has two modes of operation. In the unified mode, texture and sampler information is accessed through a single .texref handle. In the independent mode, texture and sampler information each have their own handle, allowing them to be defined separately and combined at the site of usage in the program.

In independent mode, the fields of the .texref type that describe sampler properties are ignored, since sampler information is provided through a separate .samplerref handle. Table 9 and Table 10 list the named members of each type for unified and independent texture modes.

These members and their values have precise mappings to methods and values defined in the texture HW class as well as exposed values via the API.

Fields width, height, and depth specify the size of the texture or surface in number of elements in each dimension. If no value is specified, the default is set by the runtime system based on the source language. In independent texture mode, the sampler properties are carried in an independent .samplerref variable. The force_unnormalized_coords field is defined only in independent texture mode: when True, the texture header setting is overridden and unnormalized coordinates are used; when False, the texture header setting is used.

Variables using these types may be declared at module scope or within kernel entry parameter lists. At module scope, these variables must be in the .global state space. As kernel parameters, these variables are declared in the .param state space. When declared at module scope, the types may be initialized using a list of static expressions assigning values to the named members.

Currently, OpenCL is the only source language that defines these fields. Table 12 and Table 11 show the enumeration values defined in OpenCL version 1. In PTX, a variable declaration describes both the variable's type and its state space. In addition to fundamental types, PTX supports types for simple aggregate objects such as vectors and arrays. All storage for data is specified with variable declarations. Every variable must reside in one of the state spaces enumerated in the previous section.

A variable declaration names the space in which the variable resides, its type and size, its name, an optional array size, an optional initializer, and an optional fixed address for the variable. Limited-length vector types are supported. Vectors of length 2 and 4 of any non-predicate fundamental type can be declared by prefixing the type with .v2 or .v4. Vectors must be based on a fundamental type, and they may reside in the register space.

Vectors cannot exceed 128 bits in length; for example, .v4 .f64 is not allowed. Three-element vectors may be handled by using a .v4 vector, where the fourth element provides padding. This is a common case for three-dimensional grids, textures, etc. By default, vector variables are aligned to a multiple of their overall size (vector length times base-type size), to enable vector load and store instructions, which require addresses aligned to a multiple of the access size.
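Vector declarations and a vector load can be sketched as follows (variable names are hypothetical):

```ptx
.reg    .v4 .f32 vcoord;    // 128-bit vector of four .f32 elements
.reg    .v2 .u32 vpair;     // 64-bit vector of two .u32 elements
.global .v4 .f32 vpos;      // vector variable in global memory

ld.global.v4.f32  vcoord, [vpos];   // address must be 16-byte aligned
```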

Array declarations are provided to allow the programmer to reserve space. To declare an array, the variable name is followed with dimensional declarations similar to fixed-size array declarations in C.

The size of each dimension is a constant expression. The size of the array specifies how many elements should be reserved. When declared with an initializer, the first dimension of the array may be omitted. The size of the first array dimension is determined by the number of elements in the array initializer.
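Array declarations and initializers of the kind discussed here might be sketched as follows (the names index and offset match the prose that follows, but the exact values are illustrative):

```ptx
.global .u8  index[] = { 0, 1, 2, 3, 4, 5, 6, 7 };            // first dimension inferred: 8 elements
.global .s32 offset[][2] = { {-1, 0}, {0, -1}, {1, 0}, {0, 1} };  // 4x2 array
```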

Array index has eight elements, and array offset is a 4x2 array. A scalar takes a single value, while vectors and arrays take nested lists of values inside of curly braces (the nesting matches the dimensionality of the declaration). As in C, array initializers may be incomplete, i.e., they may specify fewer than the full number of elements, with the remaining elements initialized to zero.

Currently, variable initialization is supported only for constant and global state spaces. Variables in constant and global state spaces with no explicit initializer are initialized to zero by default. Initializers are not allowed in external variable declarations. Variable names appearing in initializers represent the address of the variable; this can be used to statically initialize a pointer to a variable. Only variables in the .const or .global state spaces may be used in initializers. By default, the resulting address is the offset in the variable's state space (as is the case when taking the address of a variable with a mov instruction).

An operator, generic(), is provided to create a generic address for variables used in initializers. The only expressions allowed in the mask() operator are an integer constant expression and a symbol expression representing the address of a variable.

The mask operator extracts n consecutive bits from the expression used in initializers and inserts these bits at the lowest position of the initialized variable. The number n and the starting position of the bits to be extracted is specified by the integer immediate mask. Device function names appearing in initializers represent the address of the first instruction in the function; this can be used to initialize a table of function pointers to be used with indirect calls.

Variables that hold addresses of variables or functions should be of type .u32 or .u64. Initializers are allowed for all types except .f16, .f16x2, and .pred. Byte alignment of storage for all addressable variables can be specified in the variable declaration. Alignment is specified using an optional .align byte-count specifier immediately following the state-space specifier.

The variable will be aligned to an address which is an integer multiple of byte-count. The alignment value byte-count must be a power of two. For arrays, alignment specifies the address alignment for the starting address of the entire array, not for individual elements. The default alignment for scalar and array variables is to a multiple of the base-type size. The default alignment for vector variables is to a multiple of the overall vector size.

Note that all PTX instructions that access memory require that the address be aligned to a multiple of the access size. The access size of a memory instruction is the total number of bytes accessed in memory.

For example, the access size of ld.v4.b32 is 16 bytes. Since PTX supports virtual registers, it is quite common for a compiler frontend to generate a large number of register names. Rather than require explicit declaration of every name, PTX supports a syntax for creating a set of variables having a common prefix string appended with integer suffixes. This shorthand syntax may be used with any of the fundamental types and with any state space, and may be preceded by an alignment specifier.
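The parameterized-name shorthand can be sketched as follows (prefixes are hypothetical):

```ptx
.reg .b32 %r<100>;    // declares %r0, %r1, ..., %r99
.reg .f64 %fd<10>;    // declares %fd0 through %fd9
```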

Array variables cannot be declared this way, nor are initializers permitted. Variables may be declared with an optional .attribute directive, which allows specifying special attributes of variables. Multiple attributes are separated by commas.

Variable Attribute Directive: .attribute. All operands in instructions have a known type from their declarations. Each operand type must be compatible with the type determined by the instruction template and instruction type.

There is no automatic conversion between types. The bit-size type is compatible with every type having the same size. Integer types of a common size are compatible with each other. Operands having type different from but compatible with the instruction type are silently cast to the instruction type. The source operands are denoted in the instruction descriptions by the names a , b , and c.

PTX describes a load-store machine, so operands for ALU instructions must all be in variables declared in the .reg register state space. For most operations, the sizes of the operands must be consistent. The cvt (convert) instruction takes a variety of operand types and sizes, as its job is to convert from nearly any data type to any other data type and size.

The ld, st, mov, and cvt instructions copy data from one location to another. The mov instruction copies data between registers. Most instructions have an optional predicate guard that controls conditional execution, and a few instructions have additional predicate source operands. Predicate operands are denoted by the names p, q, r, s. PTX instructions that produce a single result store the result in the field denoted by d (for destination) in the instruction descriptions.

The result operand is a scalar or vector variable in the register state space. Using scalar variables as operands is straightforward. The interesting capabilities begin with addresses, arrays, and vectors. All the memory instructions take an address operand that specifies the memory location being accessed.

This addressable operand takes one of the address-expression forms described below. The address must be naturally aligned to a multiple of the access size. If an address is not properly aligned, the resulting behavior is undefined. For example, among other things, the access may proceed by silently masking off low-order address bits to achieve proper rounding, or the instruction may fault. The address size may be either 32-bit or 64-bit.

Addresses are zero-extended to the specified width as needed, and truncated if the register width exceeds the state space address width for the target architecture. Address arithmetic is performed using integer arithmetic and logical instructions. Examples include pointer arithmetic and pointer comparisons. All addresses and address computations are byte-based; there is no support for C-style pointer arithmetic. The mov instruction can be used to move the address of a variable into a pointer.

The address is an offset in the state space in which the variable is declared. Load and store operations move data between registers and locations in addressable state spaces. The syntax is similar to that used in many assembly languages, where scalar variables are simply named and addresses are de-referenced by enclosing the address expression in square brackets.

Address expressions include variable names, address registers, address register plus byte offset, and immediate address expressions which evaluate at compile-time to a constant address.
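These address-expression forms might look as follows in load and store instructions (the variable and register names are hypothetical):

```
.global .u32 g;
.reg .u32 r;
.reg .u64 addr;
ld.global.u32 r, [g];        // variable name
ld.global.u32 r, [g+4];      // variable plus byte offset
ld.global.u32 r, [addr];     // address register
ld.global.u32 r, [addr+16];  // address register plus byte offset
st.global.u32 [addr], r;     // store through an address register
```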

If a memory instruction does not specify a state space, the operation is performed using generic addressing. The state spaces const , local and shared are modeled as windows within the generic address space.

Each window is defined by a window base and a window size that is equal to the size of the corresponding state space. A generic address maps to global memory unless it falls within the window for const , local , or shared memory.
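As a sketch of how a program moves between a specific state space and generic addressing, the cvta instruction converts a state-space address into a generic one, after which a load with no state-space qualifier resolves through the window mechanism (names are illustrative):

```
.shared .align 4 .b32 buf[64];
.reg .u64 gptr;
.reg .u32 v;
cvta.shared.u64 gptr, buf;  // shared-window address -> generic address
ld.u32 v, [gptr];           // generic load; the address falls inside
                            // the shared window, so shared memory is read
```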

Within each window, a generic address maps to an address in the underlying state space by subtracting the window base from the generic address.

Arrays of all types can be declared, and the identifier becomes an address constant in the space where the array is declared.

The size of the array is a constant in the program. Array elements can be accessed using an explicitly calculated byte address, or by indexing into the array using square-bracket notation.

The expression within square brackets is either a constant integer, a register variable, or a simple register-with-constant-offset expression, where the offset is a constant expression that is either added to or subtracted from a register variable. If more complicated indexing is desired, it must be written as an address calculation prior to use.
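Under these rules, array accesses might be sketched as follows (the array name and size are illustrative):

```
.global .u32 a[10];
.reg .u32 s, i;
ld.global.u32 s, [a+0];  // explicitly calculated byte address
ld.global.u32 s, a[4];   // constant integer index
ld.global.u32 s, a[i];   // register variable index
ld.global.u32 s, a[i+4]; // register plus constant offset
```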

Vector operands are supported by a limited subset of instructions, which include mov, ld, st, and tex. Vectors may also be passed as arguments to called functions. Vector elements can be extracted from the vector with the suffixes .x, .y, .z, and .w, as well as .r, .g, .b, and .a. Vector loads and stores can be used to implement wide loads and stores, which may improve memory performance.

Function names can be used in a mov instruction to get the address of the function into a register, for use in an indirect call.
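Vector use and a function-address move might be sketched as follows (register names, the address in addr, and the function name myfunc are hypothetical):

```
.reg .v4 .f32 V;
.reg .f32 a, b, c, d;
.reg .u64 fptr, addr;
mov.f32 a, V.x;                         // extract one element via a suffix
ld.global.v4.f32 {a, b, c, d}, [addr];  // wide load into four scalars
st.global.v2.f32 [addr+8], {a, b};      // wide store of two elements
mov.u64 fptr, myfunc;                   // function address for indirect call
```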

Operands of different sizes or types must be converted prior to the operation. Table 13 shows what precision and format the cvt instruction uses given operands of differing types. For example, if a cvt.s32.u16 instruction is given a u16 source operand and an s32 destination operand, the u16 is zero-extended to s32.
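A couple of conversions following those rules, with illustrative register names:

```
.reg .u16 u;
.reg .s32 s;
.reg .f32 f;
cvt.s32.u16 s, u;     // u16 source is zero-extended to s32
cvt.rn.f32.s32 f, s;  // integer to float, rounding to nearest even
```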



