Summary: 28 instances, 28 unique

| Text | Count |
| --- | --- |
| # TODO: investigate how to avoid using torch.cat (because it seems to be slow for CPU tensors) | 1 |
| # TODO: Optimizer state gets cast to FP16 and back to FP32 for | 1 |
| # TODO: is there a way to split by device and dtype without appending in the inner loop? | 1 |
| # TODO (Min): these two factors doesn't have to be the same. More tuning can be done. | 1 |
| # TODO: We need to make sure following conditions hold before preparing the graph for the pipeline: | 1 |
| // TODO: Print which tensor fails. | 1 |
| # TODO: There seems to be a memory leak that is solved by following line. | 1 |
| # TODO type process_group once `distributed` module is stubbed | 1 |
| // TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson) | 1 |
| # TODO (min): this may not work with activation checkpointing when | 1 |
| # TODO (Min): Future PyTorch versions may provide a way to completely disable this | 1 |
| # TODO: This dependency injection causes deadlock if this partition | 1 |
| # TODO: fill in the types for these, or otherwise figure out some | 1 |
| TODO (min): need a way of determine how much to increase the step size? | 1 |
| # TODO (Min): the `group` used here in the key is the object hash, not the content | 1 |
| # TODO (min, mike): We need to have a way to check that training loop & DDP | 1 |
| TODO (min): have both `set_scale` and `set_num_gradients_to_accumulate` | 1 |
| # TODO: Remove the toggle here when github open issue #801 is resolved. | 1 |
| # TODO More precise types here. | 1 |
| # TODO (Min): this global used to be assigned every time this file is imported. | 1 |
| # TODO: One downside of doing it this way, is direct use of | 1 |
| # TODO: deprecate this function | 1 |
| # TODO set process_group to the write type once torch.distributed is stubbed | 1 |
| # TODO: Remove the toggle-enable_nccl_base_collectives when github open issue #801 is resolved. | 1 |
| TODO (Min): It would be nice to capture the stepping state as well. | 1 |
| # TODO: this list could be huge. Can we avoid materializing? | 1 |
| # TODO: Debug why we need this special handling | 1 |
| # TODO make Variable and Function more precise | 1 |