d3d/archive/D3D11_3_FunctionalSpec.htm [27310:27485]:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Description: Thread group sync and/or memory barrier.
Operation: Sync has options _uglobal, _ugroup, _g and _t,
described further below.
In graphics shader stages, only sync_uglobal is allowed.
In the Compute Shader, (_uglobal or _ugroup*)
and/or _g must be specified. _t is
optional in addition.
*Note the _ugroup option will not be exposed
to developers unless discovered to be critical –
discussed further below.
_uglobal:
---------
Global u# (UAV) memory fence.
All prior u# memory reads/writes by this thread in
program order are made visible to all threads
on the "entire GPU" before any subsequent u# memory
accesses by this thread. The "entire GPU" part
of the definition is replaced by a less-than-global
scope in one case though, described below.
This applies to all UAV memory bound to the
currently executing pipeline (graphics or compute).
_uglobal is available in any shader
stage.
For any bound UAV that has not been
declared by the shader as "Globally
Coherent" (see the discussion of the
Shader Memory Consistency Model), the
_uglobal u# memory fence only has visibility
within the current Compute Shader thread-group
for that UAV (as if _ugroup instead of _uglobal).
(This issue only applies to the Compute Shader,
since the graphics shaders must declare all UAVs as
Globally Coherent).
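The scope-narrowing rule for _uglobal can be sketched as a small C++ function. This is only an illustrative model of the paragraph above, not driver or API code: the names ShaderStage, FenceScope, UavBinding and uglobalScopeFor are hypothetical and do not exist in Direct3D.

```cpp
// Hypothetical types for illustration only; none of these exist in D3D.
enum class ShaderStage { Graphics, Compute };
enum class FenceScope { ThreadGroup, EntireGpu };

struct UavBinding {
    bool globallyCoherent;  // true if the shader declared the UAV "Globally Coherent"
};

// Visibility scope that a sync_uglobal fence provides for one bound UAV,
// following the rule described in the text.
FenceScope uglobalScopeFor(ShaderStage stage, const UavBinding& uav) {
    // Only the Compute Shader may bind a UAV that is not Globally Coherent;
    // for such a UAV the fence behaves as if _ugroup had been requested.
    if (stage == ShaderStage::Compute && !uav.globallyCoherent) {
        return FenceScope::ThreadGroup;
    }
    // Graphics shaders must declare all UAVs Globally Coherent, so a
    // non-coherent UAV is not an expected input for ShaderStage::Graphics.
    return FenceScope::EntireGpu;
}
```

For example, uglobalScopeFor(ShaderStage::Compute, UavBinding{false}) yields FenceScope::ThreadGroup, while the same call with a Globally Coherent UAV yields FenceScope::EntireGpu.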
_ugroup:
--------
Thread group scope u# (UAV) memory fence.
All prior u# memory reads/writes by this thread in
program order are made visible to all threads
in the thread group before any subsequent u# memory
accesses by this thread.
This applies to all UAV memory bound to the
current Shader stage.
_ugroup is available in the Compute Shader only.
Note that _ugroup will initially not be exposed to
developers, although Microsoft will use test shaders
to verify that drivers handle the option correctly.
If the lack of the _ugroup option becomes a
significant issue for developers, Microsoft
will consider exposing it in the future via a compiler
update.
If _ugroup were to be exposed, for some implementations,
the advantage of specifying _ugroup when that
is all that is needed (instead of _uglobal) is that
the sync operation can complete more quickly.
Other implementations do not distinguish _ugroup from
_uglobal, so both operations are equivalent and behave
like _uglobal. Basically, it does not hurt for
applications to specify their intent by requesting
the narrowest scope of sync necessary.
Note that even if a particular UAV is declared as
"Globally Coherent" (see the discussion of the
Shader Memory Consistency Model), a _ugroup sync
operation could still function more efficiently on
that UAV if a global barrier is not required.
_g:
---
g# (Thread Group Shared Memory) fence.
All prior g# memory reads/writes by this thread in
program order are made visible to all threads
in the thread group before any subsequent g# memory
accesses by this thread.
This applies to all of the current Thread
Group's g# Shared Memory.
_g is available in the Compute Shader only.
_t:
---
Thread group sync. All threads within a
single thread group (those that can share
access to a common set of shared register
space) will be executed up to the point
where they reach this instruction before
any thread can continue.
_t cannot be placed in
dynamic flow control (branches which could
vary within a thread group), but can be
present in uniform flow control, where all
threads in the group pick the same path.
_t is available in the Compute Shader only.
--------
Listing of Compute Shader “sync” variants:
sync_g
sync_ugroup*
sync_uglobal
sync_g_t
sync_ugroup_t*
sync_uglobal_t
sync_ugroup_g*
sync_uglobal_g
sync_ugroup_g_t*
sync_uglobal_g_t
*Variants with _ugroup may not be targeted
by the HLSL compiler, per the earlier discussion
in the _ugroup section above.
Listing of Graphics Shader “sync” variants:
sync_uglobal only.
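The two listings above follow mechanically from the option rules, which the following C++ sketch tries to show. It assumes that _uglobal and _ugroup are mutually exclusive, which matches the listing but is not stated outright. The names Stage, UavFence, SyncOptions, isValidSync, mnemonic and validVariants are hypothetical helpers, not part of any D3D API.

```cpp
#include <string>
#include <vector>

// Hypothetical helper types for illustration; not part of any D3D API.
enum class Stage { Graphics, Compute };
enum class UavFence { None, Group, Global };  // assumed mutually exclusive

struct SyncOptions {
    UavFence uav;  // _ugroup or _uglobal, if any
    bool g;        // _g: thread group shared memory fence
    bool t;        // _t: thread group sync
};

// True if the option combination is allowed in the given stage.
bool isValidSync(Stage stage, const SyncOptions& o) {
    if (stage == Stage::Graphics) {
        // Graphics shader stages: sync_uglobal only.
        return o.uav == UavFence::Global && !o.g && !o.t;
    }
    // Compute Shader: a UAV fence and/or _g is required; _t is optional.
    return o.uav != UavFence::None || o.g;
}

// Builds the instruction name in the order used by the listing.
std::string mnemonic(const SyncOptions& o) {
    std::string s = "sync";
    if (o.uav == UavFence::Group)  s += "_ugroup";
    if (o.uav == UavFence::Global) s += "_uglobal";
    if (o.g) s += "_g";
    if (o.t) s += "_t";
    return s;
}

// Enumerates every option combination and keeps the valid ones.
std::vector<std::string> validVariants(Stage stage) {
    std::vector<std::string> out;
    const UavFence uavs[] = {UavFence::None, UavFence::Group, UavFence::Global};
    for (bool t : {false, true}) {
        for (bool g : {false, true}) {
            for (UavFence u : uavs) {
                SyncOptions o{u, g, t};
                if (isValidSync(stage, o)) out.push_back(mnemonic(o));
            }
        }
    }
    return out;
}
```

Under these assumptions, validVariants(Stage::Compute) produces the ten Compute Shader names listed above (in a different order), and validVariants(Stage::Graphics) produces only sync_uglobal.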
Observations:
-------------
Memory fences prevent affected instructions
from being reordered by compilers or hardware
across the fence.
Multiple reads from the same address
by a shader invocation that are not separated
by memory barriers or writes to the address
can be collapsed together. Likewise for writes.
But accesses separated by a barrier cannot be
merged or moved across the barrier.
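The read-collapsing rule can be modeled with a short C++ sketch. It is a deliberate simplification: it tracks a single memory class, treats every fence as covering that class, assumes distinct addresses never alias, and models only loads. OpKind, MemOp and loadsAfterCollapsing are hypothetical names introduced for this illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical model of one shader invocation's memory operations.
enum class OpKind { Load, Store, Fence };

struct MemOp {
    OpKind kind;
    unsigned address;  // ignored for Fence
};

// Counts the loads that must still be issued if every collapse the rule
// permits is performed: a repeated load of an address may reuse the earlier
// result unless a fence or a write to that address separates the two loads.
std::size_t loadsAfterCollapsing(const std::vector<MemOp>& ops) {
    std::set<unsigned> reusable;  // addresses whose loaded value may be reused
    std::size_t issued = 0;
    for (const MemOp& op : ops) {
        switch (op.kind) {
        case OpKind::Load:
            // insert() reports whether the address was newly added.
            if (reusable.insert(op.address).second) ++issued;
            break;
        case OpKind::Store:
            reusable.erase(op.address);  // a write separates later reads
            break;
        case OpKind::Fence:
            reusable.clear();  // nothing may be merged across the barrier
            break;
        }
    }
    return issued;
}
```

In this model, two back-to-back loads of one address count as a single load, while the same two loads with a fence between them count as two. Collapsing is a permission rather than a requirement, so the result is a lower bound on the loads an implementation issues.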
Memory fences are not necessary for atomic operations
to a given address by different threads to function
correctly. Fences are needed when atomics and/or
load/store operations need to be synchronized with
respect to each other as they appear in individual
threads from the point of view of other threads.
In the Pixel Shader, discard instructions imply
a sync_uglobal fence, in that instructions cannot be
reordered across the discard. sync_uglobal in
helper pixels (which run only to support
derivatives) or discarded pixels may or may not
have any effect. Note that helper or discarded
pixels are not allowed to write to UAVs
(in the case of discard, this applies to writes
issued after the discard), and values returned from
UAVs are not allowed to contribute to derivative
calculations. Therefore whether sync_uglobal is
honored for helper pixels, or when issued after a
discard, is moot.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -