Summary: 78 instances, 60 unique

| Text | Count |
| --- | ---: |
| `// FIXME: __any_sync with mask isn't supported by HIP yet.` | 1 |
| `// TODO: Add more inputs` | 5 |
| `// FIXME: some advanced "cudaMemAdvise" flags are not supported by HIP.` | 1 |
| `// TODO: optimization to use multiple warps per row.` | 1 |
| `// TODO: need to do a tree-reduction to fully take advantage of unrolling` | 1 |
| `# TODO: add in-place FloatToFused8BitRowwiseQuantized conversion` | 1 |
| `// TODO: Add option to pass in total_L` | 1 |
| `// TODO: Have default slower path` | 1 |
| `# TODO: add per-feature based converter option (based on embedding_specs during inference)` | 1 |
| `// TODO: fp16 and weighted` | 1 |
| `// TODO: need to do a tree-reduction to fully take advantage of` | 1 |
| `// TODO: add a flag later to control whether underflow` | 1 |
| `// TODO: use shared memory` | 1 |
| `# TODO: the result here is wrong. Once MixedDim version for FusedNBit quantization is done, switch to that.` | 1 |
| `// TODO: need to tune` | 1 |
| `// TODO: lift range_list into shared memory. However, when nrows is large,` | 1 |
| `// TODO: Have default slower path` | 4 |
| `!ten->is_cuda(); // TODO: Should be a better way to do this` | 1 |
| `// TODO: T26263653 fix signed-integer-overflow undefined behavior` | 1 |
| `// TODO: more smart ways for thread partitions considering the` | 1 |
| `// FIXME: Enable merge_pooled_embeddings for HIP.` | 1 |
| `// TODO: to parallelize, we should easily identify segments belong to` | 1 |
| `// TODO: add compressed_data_size and check array bound` | 1 |
| `// FIXME: some advanced "cudaMemAdvise" flags are not supported by HIP.` | 2 |
| `FIXME: make it in-place fill.` | 1 |
| `// TODO: respect output_dtype` | 1 |
| `TODO: populate the supported list of optimizers` | 1 |
| `// TODO: if not hyper sparse, transpose a block of A matrix as in SpMDM.` | 1 |
| `* TODO: if Aq_zero_point == 0, allow passing nullptr.` | 2 |
| `// TODO: use accessor for the following 3 parameters` | 1 |
| `// TODO: should print warning because this path is not optimized yet` | 1 |
| `// TODO: avx512 path` | 2 |
| `// TODO: Have default slower path` | 2 |
| `// TODO: replace uses in backward kernels with warp find qparams` | 1 |
| `// TODO: when G == nthreads + 1, we'll have a big load imbalance because` | 1 |
| `# TODO: optimize embedding pruning and quantization latency.` | 1 |
| `// TODO: measure load imbalance` | 1 |
| `// TODO: pass in dimension info and calculate qparams for rowwise integer` | 1 |
| `// TODO: handle transpose during packing` | 1 |
| `// TODO: torch check input is 2D` | 1 |
| `// TODO: Have default slower path` | 2 |
| `# TODO: diagnostic if dir does not exist` | 2 |
| `// FIXME: __any_sync with mask isn't supported by HIP yet.` | 2 |
| `# TODO: ideally the version.py should be generated when setup is run` | 1 |
| `// TODO: JIT` | 2 |
| `// TODO: Make this tunable` | 2 |
| `// TODO: Have default slower path` | 1 |
| `// TODO: change to avx512 when avx512 support is available` | 1 |
| `// TODO: fallback when AVX2 is not available` | 1 |
| `// TODO: unroll this loop?` | 1 |
| `// TODO: Wrap this as a inverseTransposeConvWeights()?` | 1 |
| `# FIXME: How to view the PyTorch Tensor as a different type (e.g., uint8)` | 1 |
| `// TODO: check appropriate vals for g,h,w` | 3 |
| `// // TODO: do we care about 64-bit indices? Currently we just ignore.` | 1 |
| `// TODO: short-circuit when B_zero_point is 0 or A_zero_point is 0` | 1 |
| `/* TODO: Remove the condition protecting the slow path because even when the` | 1 |
| `// passs after D22767058. TODO: optimize and make sure pooled_embs is` | 1 |
| `// TODO: Remove this when threading is supported.` | 1 |
| `// TODO: increase code sharing (templates for accumulator_ty, accumulation, outputs per thread, etc?)` | 1 |
| `// FIXME: __ballot_sync with mask isn't supported by HIP yet.` | 1 |
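A tally like the one above (total instances plus unique comment texts) can be regenerated by scanning the source tree for TODO/FIXME markers. The following is a minimal sketch, assuming a Python script run from the repository root; the file extensions, marker regex, and output format are illustrative choices, not the tooling actually used to produce this report.

```python
# Hypothetical sketch: count TODO/FIXME comment lines under a source tree.
# The regex, extension set, and CLI are assumptions for illustration only.
import collections
import pathlib
import re

MARKER = re.compile(r"(?://|#|\*|/\*)\s*(?:TODO|FIXME)\b.*")
EXTENSIONS = {".cc", ".cpp", ".cu", ".cuh", ".h", ".py"}

def tally(root: str) -> collections.Counter:
    """Return a Counter mapping each TODO/FIXME comment text to its count."""
    counts = collections.Counter()
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in EXTENSIONS:
            continue
        for line in path.read_text(errors="ignore").splitlines():
            match = MARKER.search(line)
            if match:
                counts[match.group(0).strip()] += 1
    return counts

if __name__ == "__main__":
    counts = tally(".")
    print(f"Summary: {sum(counts.values())} instances, {len(counts)} unique")
    for text, count in counts.most_common():
        print(f"{count:>3}  {text}")
```

Note that the table also contains a few entries where code precedes the marker (e.g. `!ten->is_cuda(); // TODO: ...`); a scanner that keys on the full line rather than the matched comment suffix would count those as distinct texts, which is one reason otherwise identical-looking comments can appear as separate rows.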