fairscale/optim/adascale.py (4 lines):
- line 422: # TODO (min): this may not work with activation checkpointing when
- line 426: # TODO (min, mike): We need to have a way to check that training loop & DDP
- line 579: TODO (min): need a way of determine how much to increase the step size?
- line 581: TODO (min): have both `set_scale` and `set_num_gradients_to_accumulate`

fairscale/utils/reduce_scatter_bucketer.py (3 lines):
- line 15: # TODO: Remove the toggle-enable_nccl_base_collectives when github open issue #801 is resolved.
- line 140: # TODO: investigate how to avoid using torch.cat (because it seems to be slow for CPU tensors)
- line 190: # TODO (Min): the `group` used here in the key is the object hash, not the content

fairscale/nn/data_parallel/fully_sharded_data_parallel.py (3 lines):
- line 61: # TODO: Remove the toggle here when github open issue #801 is resolved.
- line 90: TODO (Min): It would be nice to capture the stepping state as well.
- line 1396: # TODO (Min): Future PyTorch versions may provide a way to completely disable this

fairscale/clib/fused_adam_cuda/multi_tensor_apply.cuh (2 lines):
- line 14: // TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson)
- line 58: // TODO: Print which tensor fails.

stubs/torch/__init__.pyi (2 lines):
- line 112: # TODO: One downside of doing it this way, is direct use of
- line 930: # TODO: fill in the types for these, or otherwise figure out some

fairscale/experimental/nn/distributed_pipeline/partition_handler.py (2 lines):
- line 106: # TODO: This dependency injection causes deadlock if this partition
- line 287: # TODO: There seems to be a memory leak that is solved by following line.

fairscale/optim/adam.py (1 line):
- line 176: # TODO: Optimizer state gets cast to FP16 and back to FP32 for

fairscale/experimental/nn/distributed_pipeline/pipeline.py (1 line):
- line 194: # TODO: Debug why we need this special handling

fairscale/experimental/nn/mevo.py (1 line):
- line 405: # TODO (Min): these two factors doesn't have to be the same. More tuning can be done.

fairscale/utils/testing.py (1 line):
- line 335: # TODO (Min): this global used to be assigned every time this file is imported.

fairscale/experimental/nn/distributed_pipeline/graph.py (1 line):
- line 132: # TODO: We need to make sure following conditions hold before preparing the graph for the pipeline:

fairscale/optim/grad_scaler.py (1 line):
- line 205: # TODO: is there a way to split by device and dtype without appending in the inner loop?

stubs/torch/autograd/__init__.pyi (1 line):
- line 13: # TODO make Variable and Function more precise

fairscale/experimental/nn/data_parallel/gossip/utils/helpers.py (1 line):
- line 100: # TODO: deprecate this function

stubs/torch/nn/parallel/scatter_gather.pyi (1 line):
- line 22: # TODO More precise types here.

stubs/torch/nn/parallel/distributed.pyi (1 line):
- line 23: # TODO type process_group once `distributed` module is stubbed

fairscale/nn/data_parallel/fsdp_optim_utils.py (1 line):
- line 58: # TODO: this list could be huge. Can we avoid materializing?

stubs/torch/nn/modules/batchnorm.pyi (1 line):
- line 45: # TODO set process_group to the write type once torch.distributed is stubbed
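
A few of the notes above point at concrete, self-contained techniques, illustrated below with generic sketches rather than fairscale's actual code.

For the torch.cat remark in fairscale/utils/reduce_scatter_bucketer.py (line 140), the usual candidate replacement is to preallocate one flat buffer and `copy_` each tensor into its slice instead of concatenating. This is a minimal sketch of that pattern with made-up function names; whether it actually beats `torch.cat` for many small CPU tensors is exactly what the TODO asks to benchmark.

```python
import torch

def flatten_with_cat(tensors):
    # Baseline: torch.cat allocates the output and copies every input in one call.
    return torch.cat([t.reshape(-1) for t in tensors])

def flatten_with_copy(tensors):
    # Candidate alternative: preallocate the flat buffer once, then copy each
    # tensor into its slice of the buffer.
    total = sum(t.numel() for t in tensors)
    out = torch.empty(total, dtype=tensors[0].dtype, device=tensors[0].device)
    offset = 0
    for t in tensors:
        n = t.numel()
        out.narrow(0, offset, n).copy_(t.reshape(-1))
        offset += n
    return out

tensors = [torch.randn(1024) for _ in range(64)]
assert torch.equal(flatten_with_cat(tensors), flatten_with_copy(tensors))
```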
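
The fairscale/optim/adam.py note (line 176) is cut off mid-sentence, but the surviving part describes optimizer state being cast to FP16 and back to FP32. Independent of what the surrounding code does with it, that round trip is lossy, since FP16 keeps only 10 mantissa bits. A quick standalone demonstration (not fairscale code):

```python
import torch

state = torch.rand(1000) * 1e-3           # small values, e.g. exp_avg-style entries
round_tripped = state.half().float()       # FP32 -> FP16 -> FP32
max_err = (state - round_tripped).abs().max().item()
print(f"max round-trip error: {max_err:.2e}")  # nonzero: only ~3 decimal digits survive
```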
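
The fairscale/optim/grad_scaler.py question (line 205) refers to the common pattern of bucketing gradients by (device, dtype) so each bucket can be handled with a single fused call, currently done by appending inside a nested loop. A minimal sketch of that grouping with hypothetical names, making no claim about how fairscale or PyTorch actually implement it:

```python
from collections import defaultdict
import torch

def group_by_device_and_dtype(tensors):
    # Maps (device, dtype) -> list of tensors; the append inside the loop is
    # the part the TODO would like to avoid.
    groups = defaultdict(list)
    for t in tensors:
        groups[(t.device, t.dtype)].append(t)
    return groups

grads = [torch.randn(4), torch.randn(4, dtype=torch.float64), torch.randn(2)]
for (device, dtype), bucket in group_by_device_and_dtype(grads).items():
    print(device, dtype, len(bucket))
```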
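
For the fairscale/nn/data_parallel/fsdp_optim_utils.py note (line 58), the generic Python answer to "can we avoid materializing?" is to produce items lazily with a generator instead of building the list up front; whether that fits the real call site depends on how the list is consumed, which this report does not show. A generic sketch:

```python
def values_as_list(n):
    # Materializes everything: O(n) memory even if the caller only streams over it.
    return [i * i for i in range(n)]

def values_lazy(n):
    # Generator: items are produced on demand (O(1) memory), but it can be
    # iterated only once and has no len().
    for i in range(n):
        yield i * i

total = sum(values_lazy(1_000_000))  # never holds the full list in memory
```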