fairscale/optim/adascale.py (4 lines):
- line 422: # TODO (min): this may not work with activation checkpointing when
- line 426: # TODO (min, mike): We need to have a way to check that training loop & DDP
- line 579: TODO (min): need a way of determine how much to increase the step size?
- line 581: TODO (min): have both `set_scale` and `set_num_gradients_to_accumulate`

fairscale/utils/reduce_scatter_bucketer.py (3 lines):
- line 15: # TODO: Remove the toggle-enable_nccl_base_collectives when github open issue #801 is resolved.
- line 140: # TODO: investigate how to avoid using torch.cat (because it seems to be slow for CPU tensors)
- line 190: # TODO (Min): the `group` used here in the key is the object hash, not the content

fairscale/nn/data_parallel/fully_sharded_data_parallel.py (3 lines):
- line 61: # TODO: Remove the toggle here when github open issue #801 is resolved.
- line 90: TODO (Min): It would be nice to capture the stepping state as well.
- line 1396: # TODO (Min): Future PyTorch versions may provide a way to completely disable this

fairscale/clib/fused_adam_cuda/multi_tensor_apply.cuh (2 lines):
- line 14: // TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson)
- line 58: // TODO: Print which tensor fails.

stubs/torch/__init__.pyi (2 lines):
- line 112: # TODO: One downside of doing it this way, is direct use of
- line 930: # TODO: fill in the types for these, or otherwise figure out some

fairscale/experimental/nn/distributed_pipeline/partition_handler.py (2 lines):
- line 106: # TODO: This dependency injection causes deadlock if this partition
- line 287: # TODO: There seems to be a memory leak that is solved by following line.

fairscale/optim/adam.py (1 line):
- line 176: # TODO: Optimizer state gets cast to FP16 and back to FP32 for

fairscale/experimental/nn/distributed_pipeline/pipeline.py (1 line):
- line 194: # TODO: Debug why we need this special handling

fairscale/experimental/nn/mevo.py (1 line):
- line 405: # TODO (Min): these two factors doesn't have to be the same. More tuning can be done.

fairscale/utils/testing.py (1 line):
- line 335: # TODO (Min): this global used to be assigned every time this file is imported.

fairscale/experimental/nn/distributed_pipeline/graph.py (1 line):
- line 132: # TODO: We need to make sure following conditions hold before preparing the graph for the pipeline:

fairscale/optim/grad_scaler.py (1 line):
- line 205: # TODO: is there a way to split by device and dtype without appending in the inner loop?

stubs/torch/autograd/__init__.pyi (1 line):
- line 13: # TODO make Variable and Function more precise

fairscale/experimental/nn/data_parallel/gossip/utils/helpers.py (1 line):
- line 100: # TODO: deprecate this function

stubs/torch/nn/parallel/scatter_gather.pyi (1 line):
- line 22: # TODO More precise types here.

stubs/torch/nn/parallel/distributed.pyi (1 line):
- line 23: # TODO type process_group once `distributed` module is stubbed

fairscale/nn/data_parallel/fsdp_optim_utils.py (1 line):
- line 58: # TODO: this list could be huge. Can we avoid materializing?

stubs/torch/nn/modules/batchnorm.pyi (1 line):
- line 45: # TODO set process_group to the write type once torch.distributed is stubbed
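
A few of the notes above point at concrete, self-contained techniques, illustrated below with generic sketches rather than fairscale's actual code.

For the torch.cat remark in fairscale/utils/reduce_scatter_bucketer.py (line 140), the usual candidate replacement is to preallocate one flat buffer and `copy_` each tensor into its slice instead of concatenating. This is a minimal sketch of that pattern with made-up function names; whether it actually beats `torch.cat` for many small CPU tensors is exactly what the TODO asks to benchmark.

```python
import torch

def flatten_with_cat(tensors):
    # Baseline: torch.cat allocates the output and copies every input in one call.
    return torch.cat([t.reshape(-1) for t in tensors])

def flatten_with_copy(tensors):
    # Candidate alternative: preallocate the flat buffer once, then copy each
    # tensor into its slice of the buffer.
    total = sum(t.numel() for t in tensors)
    out = torch.empty(total, dtype=tensors[0].dtype, device=tensors[0].device)
    offset = 0
    for t in tensors:
        n = t.numel()
        out.narrow(0, offset, n).copy_(t.reshape(-1))
        offset += n
    return out

tensors = [torch.randn(1024) for _ in range(64)]
assert torch.equal(flatten_with_cat(tensors), flatten_with_copy(tensors))
```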
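
The fairscale/optim/adam.py note (line 176) is cut off mid-sentence, but the surviving part describes optimizer state being cast to FP16 and back to FP32. Independent of what the surrounding code does with it, that round trip is lossy, since FP16 keeps only 10 mantissa bits. A quick standalone demonstration (not fairscale code):

```python
import torch

state = torch.rand(1000) * 1e-3           # small values, e.g. exp_avg-style entries
round_tripped = state.half().float()       # FP32 -> FP16 -> FP32
max_err = (state - round_tripped).abs().max().item()
print(f"max round-trip error: {max_err:.2e}")  # nonzero: only ~3 decimal digits survive
```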
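
The fairscale/optim/grad_scaler.py question (line 205) refers to the common pattern of bucketing gradients by (device, dtype) so each bucket can be handled with a single fused call, currently done by appending inside a nested loop. A minimal sketch of that grouping with hypothetical names, making no claim about how fairscale or PyTorch actually implement it:

```python
from collections import defaultdict
import torch

def group_by_device_and_dtype(tensors):
    # Maps (device, dtype) -> list of tensors; the append inside the loop is
    # the part the TODO would like to avoid.
    groups = defaultdict(list)
    for t in tensors:
        groups[(t.device, t.dtype)].append(t)
    return groups

grads = [torch.randn(4), torch.randn(4, dtype=torch.float64), torch.randn(2)]
for (device, dtype), bucket in group_by_device_and_dtype(grads).items():
    print(device, dtype, len(bucket))
```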
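
For the fairscale/nn/data_parallel/fsdp_optim_utils.py note (line 58), the generic Python answer to "can we avoid materializing?" is to produce items lazily with a generator instead of building the list up front; whether that fits the real call site depends on how the list is consumed, which this report does not show. A generic sketch:

```python
def values_as_list(n):
    # Materializes everything: O(n) memory even if the caller only streams over it.
    return [i * i for i in range(n)]

def values_lazy(n):
    # Generator: items are produced on demand (O(1) memory), but it can be
    # iterated only once and has no len().
    for i in range(n):
        yield i * i

total = sum(values_lazy(1_000_000))  # never holds the full list in memory
```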