d3d/archive/D3D11_3_FunctionalSpec.htm [27310:27485]:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Description: Thread group sync and/or memory barrier.

Operation: Sync has options _uglobal, _ugroup, _g and _t, described further below.

In graphics shader stages, only sync_uglobal is allowed.

In the Compute Shader, (_uglobal or _ugroup*) and/or _g must be specified. _t is optional in addition.

*Note the _ugroup option will not be exposed to developers unless discovered to be critical – discussed further below.

_uglobal:
---------
Global u# (UAV) memory fence. All prior u# memory reads/writes by this thread in program order are made visible to all threads on the "entire GPU" before any subsequent u# memory accesses by this thread. The "entire GPU" part of the definition is replaced by a less-than-global scope in one case though, described below. This applies to all UAV memory bound at the currently executing pipeline (graphics or compute).

_uglobal is available in any shader stage.

For any bound UAV that has not been declared by the shader as "Globally Coherent" (see the discussion of the Shader Memory Consistency Model), the _uglobal u# memory fence only has visibility within the current Compute Shader thread-group for that UAV (as if _ugroup instead of _uglobal). (This issue only applies to the Compute Shader, since the graphics shaders must declare all UAVs as Globally Coherent.)

_ugroup:
--------
Thread group scope u# (UAV) memory fence. All prior u# memory reads/writes by this thread in program order are made visible to all threads in the thread group before any subsequent u# memory accesses by this thread. This applies to all UAV memory bound at the current Shader stage.

_ugroup is available in the Compute Shader only.

Note that _ugroup will initially not be exposed to developers, although drivers will be tested by Microsoft such that they handle the option correctly through test shaders.
If missing the _ugroup option becomes a significant issue for developers, Microsoft will consider exposing it in the future via compiler update.

If _ugroup were to be exposed, for some implementations, the advantage of specifying _ugroup when that is all that is needed (instead of _uglobal) is that the sync operation can complete more quickly. Other implementations do not distinguish _ugroup from _uglobal, so both operations are equivalent and behave like _uglobal. Basically, it does not hurt for applications to specify their intent by requesting the narrowest scope of sync necessary.

Note that even if a particular UAV is declared as "Globally Coherent" (see the discussion of the Shader Memory Consistency Model), a _ugroup sync operation could still function more efficiently on that UAV if a global barrier is not required.

_g:
---
g# (Thread Group Shared Memory) fence. All prior g# memory reads/writes by this thread in program order are made visible to all threads in the thread group before any subsequent g# memory accesses by this thread. This applies to all of the current Thread Group's g# Shared Memory.

_g is available in the Compute Shader only.

_t:
---
Thread group sync. All threads within a single thread group (those that can share access to a common set of shared register space) will be executed up to the point where they reach this instruction before any thread can continue.

_t cannot be placed in dynamic flow control (branches which could vary within a thread group), but can be present in uniform flow control, where all threads in the group pick the same path.

_t is available in the Compute Shader only.

--------
Listing of Compute Shader “sync” variants:

sync_g
sync_ugroup*
sync_uglobal
sync_g_t
sync_ugroup_t*
sync_uglobal_t
sync_ugroup_g*
sync_uglobal_g
sync_ugroup_g_t*
sync_uglobal_g_t

*Variants with _ugroup may not be targeted by the HLSL compiler, per the earlier discussion in the _ugroup section above.
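
The rule stated under "Operation" (a UAV fence scope and/or _g is required, _t is optional, and graphics stages accept only sync_uglobal) can be checked against the listing above with a small enumeration. This is only an illustrative sketch in Python, not part of the spec or of any D3D tooling; the function names and the "compute"/"graphics" stage labels are hypothetical.

```python
from itertools import product

# UAV fence scope is one of: none, thread-group scope, or global scope.
UAV_SCOPES = ("", "_ugroup", "_uglobal")


def compute_sync_variants():
    """Enumerate the sync variants the text permits in the Compute Shader."""
    variants = []
    for uav, g, t in product(UAV_SCOPES, ("", "_g"), ("", "_t")):
        # At least one memory fence (_uglobal/_ugroup and/or _g) must be
        # present; _t on its own is not a legal variant.
        if not uav and not g:
            continue
        variants.append("sync" + uav + g + t)
    return variants


def is_valid_sync(name, stage):
    """Check a variant name for a stage ("compute" or "graphics").

    The stage labels are assumptions made for this sketch.
    """
    if stage == "compute":
        return name in compute_sync_variants()
    # Graphics shader stages: only sync_uglobal is allowed.
    return name == "sync_uglobal"
```

Enumerating this way yields ten Compute Shader variants, the same set as the listing, and rejects a bare `sync_t`, which carries no memory fence.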
Listing of Graphics Shader “sync” variants: sync_uglobal only.

Observations:
-------------
Memory fences prevent affected instructions from being reordered by compilers or hardware across the fence. Multiple reads from the same address by a shader invocation that are not separated by memory barriers or writes to the address can be collapsed together. Likewise for writes. But accesses separated by a barrier cannot be merged or moved across the barrier.

Memory fences are not necessary for atomic operations to a given address by different threads to function correctly. Fences are needed when atomics and/or load/store operations need to be synchronized with respect to each other as they appear in individual threads from the point of view of other threads.

In the Pixel Shader, discard instructions imply a sync_uglobal fence, in that instructions cannot be reordered across the discard.

sync_uglobal in helper pixels (which run only to support derivatives) or discarded pixels may or may not have any effect. Note that helper or discarded pixels are not allowed to write to UAVs (in the case of discard, if the writes are issued after the discard), and values returned from UAVs are not allowed to contribute to derivative calculations. Therefore, whether sync_uglobal is honored for helper pixels, or when issued after a discard, is moot.
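
As a rough CPU-side analogy for the combined fence and thread group sync (sync_g_t), the sketch below uses Python threads and `threading.Barrier`. It is not shader code and does not model GPU memory behavior: the list `shared` merely stands in for g# thread group shared memory, the barrier stands in for sync_g_t, and `GROUP_SIZE` and `run_group` are hypothetical names chosen for illustration.

```python
import threading

GROUP_SIZE = 8  # hypothetical thread-group size for this sketch


def run_group():
    shared = [0] * GROUP_SIZE   # stands in for g# shared memory
    result = [0] * GROUP_SIZE
    barrier = threading.Barrier(GROUP_SIZE)  # stands in for sync_g_t

    def thread_body(tid):
        # Phase 1: each thread writes only its own slot.
        shared[tid] = tid * tid
        # Every thread reaches the barrier on the same (uniform) path.
        # No thread continues until all have arrived, so all phase-1
        # writes are complete and visible afterwards.
        barrier.wait()
        # Phase 2: reading another thread's slot is now safe.
        neighbor = (tid + 1) % GROUP_SIZE
        result[tid] = shared[neighbor]

    threads = [
        threading.Thread(target=thread_body, args=(tid,))
        for tid in range(GROUP_SIZE)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return result
```

If only some threads called `barrier.wait()`, the remaining threads would wait indefinitely, which is one way to picture why _t is restricted to uniform flow control.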
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -