Summary: 78 instances, 60 unique

| Text | Count |
| --- | ---: |
| `// FIXME: __any_sync with mask isn't supported by HIP yet.` | 1 |
| `// TODO: Add more inputs` | 5 |
| `// FIXME: some advanced "cudaMemAdvise" flags are not supported by HIP.` | 1 |
| `// TODO: optimization to use multiple warps per row.` | 1 |
| `// TODO: need to do a tree-reduction to fully take advantage of unrolling` | 1 |
| `# TODO: add in-place FloatToFused8BitRowwiseQuantized conversion` | 1 |
| `// TODO: Add option to pass in total_L` | 1 |
| `// TODO: Have default slower path` | 1 |
| `# TODO: add per-feature based converter option (based on embedding_specs during inference)` | 1 |
| `// TODO: fp16 and weighted` | 1 |
| `// TODO: need to do a tree-reduction to fully take advantage of` | 1 |
| `// TODO: add a flag later to control whether underflow` | 1 |
| `// TODO: use shared memory` | 1 |
| `# TODO: the result here is wrong. Once MixedDim version for FusedNBit quantization is done, switch to that.` | 1 |
| `// TODO: need to tune` | 1 |
| `// TODO: lift range_list into shared memory. However, when nrows is large,` | 1 |
| `// TODO: Have default slower path` | 4 |
| `!ten->is_cuda(); // TODO: Should be a better way to do this` | 1 |
| `// TODO: T26263653 fix signed-integer-overflow undefined behavior` | 1 |
| `// TODO: more smart ways for thread partitions considering the` | 1 |
| `// FIXME: Enable merge_pooled_embeddings for HIP.` | 1 |
| `// TODO: to parallelize, we should easily identify segments belong to` | 1 |
| `// TODO: add compressed_data_size and check array bound` | 1 |
| `// FIXME: some advanced "cudaMemAdvise" flags are not supported by HIP.` | 2 |
| `FIXME: make it in-place fill.` | 1 |
| `// TODO: respect output_dtype` | 1 |
| `TODO: populate the supported list of optimizers` | 1 |
| `// TODO: if not hyper sparse, transpose a block of A matrix as in SpMDM.` | 1 |
| `* TODO: if Aq_zero_point == 0, allow passing nullptr.` | 2 |
| `// TODO: use accessor for the following 3 parameters` | 1 |
| `// TODO: should print warning because this path is not optimized yet` | 1 |
| `// TODO: avx512 path` | 2 |
| `// TODO: Have default slower path` | 2 |
| `// TODO: replace uses in backward kernels with warp find qparams` | 1 |
| `// TODO: when G == nthreads + 1, we'll have a big load imbalance because` | 1 |
| `# TODO: optimize embedding pruning and quantization latency.` | 1 |
| `// TODO: measure load imbalance` | 1 |
| `// TODO: pass in dimension info and calculate qparams for rowwise integer` | 1 |
| `// TODO: handle transpose during packing` | 1 |
| `// TODO: torch check input is 2D` | 1 |
| `// TODO: Have default slower path` | 2 |
| `# TODO: diagnostic if dir does not exist` | 2 |
| `// FIXME: __any_sync with mask isn't supported by HIP yet.` | 2 |
| `# TODO: ideally the version.py should be generated when setup is run` | 1 |
| `// TODO: JIT` | 2 |
| `// TODO: Make this tunable` | 2 |
| `// TODO: Have default slower path` | 1 |
| `// TODO: change to avx512 when avx512 support is available` | 1 |
| `// TODO: fallback when AVX2 is not available` | 1 |
| `// TODO: unroll this loop?` | 1 |
| `// TODO: Wrap this as a inverseTransposeConvWeights()?` | 1 |
| `# FIXME: How to view the PyTorch Tensor as a different type (e.g., uint8)` | 1 |
| `// TODO: check appropriate vals for g,h,w` | 3 |
| `// // TODO: do we care about 64-bit indices? Currently we just ignore.` | 1 |
| `// TODO: short-circuit when B_zero_point is 0 or A_zero_point is 0` | 1 |
| `/* TODO: Remove the condition protecting the slow path because even when the` | 1 |
| `// passs after D22767058. TODO: optimize and make sure pooled_embs is` | 1 |
| `// TODO: Remove this when threading is supported.` | 1 |
| `// TODO: increase code sharing (templates for accumulator_ty, accumulation, outputs per thread, etc?)` | 1 |
| `// FIXME: __ballot_sync with mask isn't supported by HIP yet.` | 1 |
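A tally like the one above (total instances plus unique comment texts) can be regenerated by scanning the source tree for TODO/FIXME markers. The following is a minimal sketch, assuming a Python script run from the repository root; the file extensions, marker regex, and output format are illustrative choices, not the tooling actually used to produce this report.

```python
# Hypothetical sketch: count TODO/FIXME comment lines under a source tree.
# The regex, extension set, and CLI are assumptions for illustration only.
import collections
import pathlib
import re

MARKER = re.compile(r"(?://|#|\*|/\*)\s*(?:TODO|FIXME)\b.*")
EXTENSIONS = {".cc", ".cpp", ".cu", ".cuh", ".h", ".py"}

def tally(root: str) -> collections.Counter:
    """Return a Counter mapping each TODO/FIXME comment text to its count."""
    counts = collections.Counter()
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in EXTENSIONS:
            continue
        for line in path.read_text(errors="ignore").splitlines():
            match = MARKER.search(line)
            if match:
                counts[match.group(0).strip()] += 1
    return counts

if __name__ == "__main__":
    counts = tally(".")
    print(f"Summary: {sum(counts.values())} instances, {len(counts)} unique")
    for text, count in counts.most_common():
        print(f"{count:>3}  {text}")
```

Note that the table also contains a few entries where code precedes the marker (e.g. `!ten->is_cuda(); // TODO: ...`); a scanner that keys on the full line rather than the matched comment suffix would count those as distinct texts, which is one reason otherwise identical-looking comments can appear as separate rows.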