models/nlp/electra/run_pretraining.py (7 lines):
	- line 9: TODO: Combine two segments into a single example. https://github.com/google-research/electra/blob/master/build_pretraining_dataset.py
	- line 10: TODO: Add zero-padding for shorter sequences
	- line 64: # TODO: Limit code duplication between train_step and val_step.
	- line 67: # TODO: Re-add validation step
	- line 82: # TODO: Make temperature a hyperparameter
	- line 151: # TODO: Check that accuracy_mask is different
	- line 280: # SageMaker may have some extra strings. TODO: Test this on SM.


models/nlp/common/arguments.py (4 lines):
	- line 40: # TODO: Change this to per_gpu_train_batch_size
	- line 96: # TODO: Move this to PathArguments?
	- line 107: # TODO: Pre-layer norm is not yet supported in transformers. PR is at https://github.com/huggingface/transformers/pull/3929, but maintainers are unresponsive.
	- line 163: # TODO: Remove these since they're a little too specific


models/nlp/albert/run_squad.py (3 lines):
	- line 238: # TODO: Improve. If only tf.keras.clone_model(model) worked.
	- line 440: # TODO: Abstract out to specify any checkpoint path
	- line 448: # TODO: Abstract out to specify any logs path


models/nlp/albert/run_pretraining.py (3 lines):
	- line 162: # TODO: On iteration 0, loss=11 and loss_scale()=32768, so scaled_loss=inf.
	- line 181: # TODO: Does placing this clip before or after allreduce affect accuracy?
	- line 368: # SageMaker may have some extra strings. TODO: Test this on SM.
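
Note on the line 162 entry: the reported overflow is consistent with half-precision arithmetic, where the largest finite value is 65504. The sketch below reproduces the arithmetic only. It assumes the scaled loss is held in float16, which the TODO does not state, and `scaled_loss_overflows` and `largest_safe_scale` are hypothetical helpers, not functions from run_pretraining.py.

```python
# Illustrative check only; assumes the scaled loss is stored in float16.
FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def scaled_loss_overflows(loss, loss_scale, max_value=FLOAT16_MAX):
    """Return True if loss * loss_scale exceeds the largest finite value."""
    return loss * loss_scale > max_value


def largest_safe_scale(loss, max_value=FLOAT16_MAX):
    """Largest power-of-two scale keeping loss * scale finite (hypothetical helper)."""
    if loss <= 0:
        raise ValueError("loss must be positive")
    scale = 1.0
    # Keep doubling while the next doubling still fits in the finite range.
    while loss * scale * 2 <= max_value:
        scale *= 2
    return scale


# Values quoted in the TODO: loss=11, loss_scale()=32768, product 360448.
print(scaled_loss_overflows(11.0, 32768.0))
print(largest_safe_scale(11.0))
```

Under that assumption, a starting scale of 4096 or lower would keep the first scaled loss finite for a loss of 11.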


models/vision/detection/awsdet/models/anchor_heads/retina_head.py (3 lines):
	- line 78: positive_fraction=1.0, # no sampling TODO: pass sampler as arg into anchor target generator
	- line 82: #TODO make losses package common to all models
	- line 278: TODO: move to utils


models/nlp/common/optimizers.py (3 lines):
	- line 19: wd_schedule = train_args.weight_decay  # TODO: Get weight decay schedule working.
	- line 40: TODO: If weight decay is decoupled from the gradient update, then we need to decay the weight decay
	- line 44: TODO: Unclear if LAMB needs a weight decay schedule. If so, that would explain why performance gets


models/vision/classification/models/resnet_evo.py (2 lines):
	- line 313: use_bias: whether to use biases for convolutional layers or not #FIXME:
	- line 365: if not backend.is_keras_tensor(input_tensor): # FIXME: dead code


legacy/models/resnet/tensorflow/train_imagenet_resnet_hvd.py (2 lines):
	- line 642: inputs = features  # TODO: Should be using feature columns?
	- line 686: loss = tf.identity(loss, name='loss')  # For access by logger (TODO: Better way to access it?)


models/vision/detection/awsdet/utils/runner/hooks/lr_updater.py (2 lines):
	- line 39: #TODO param groups not supported
	- line 63: # TODO: setting different learning rates for different param groups unsupported


models/vision/detection/awsdet/utils/runner/runner.py (2 lines):
	- line 171: # TODO: move this method out of runner
	- line 205: list: Current learning rate (#TODO: support individual LR for param groups)


models/vision/detection/awsdet/core/anchor/anchor_target.py (2 lines):
	- line 60: Used to provide a normalizing factor for bbox regression loss calculation. TODO: revisit this implementation
	- line 142: # TODO: handle scenario where all flags are False


models/vision/detection/awsdet/models/detectors/faster_rcnn.py (2 lines):
	- line 46: #TODO: delegate to assigner and sampler in the future
	- line 153: # AS: currently we limit eval to 1 image bs per GPU - TODO: extend to multiple


models/vision/detection/tools/test.py (1 line):
	- line 124: # TODO: support multiple images per gpu (only minor changes are needed)


models/vision/detection/awsdet/datasets/builder.py (1 line):
	- line 10: #TODO: Handle cases with multiple datasets, etc.


models/vision/detection/awsdet/core/bbox/bbox_target.py (1 line):
	- line 90: # TODO: concat proposals list and rois_list


models/vision/detection/awsdet/utils/runner/hooks/logger/text.py (1 line):
	- line 42: # TODO: resolve this hack


models/vision/detection/awsdet/models/backbones/resnet_aws.py (1 line):
	- line 294: use_bias: whether to use biases for convolutional layers or not #FIXME:


models/nlp/common/utils.py (1 line):
	- line 97: # TODO: Merge this with gather_indexes()


ci/frcnn/train_sagemaker.py (1 line):
	- line 124: # FIXME: currently hardcoded to SGD


models/vision/detection/awsdet/core/bbox/transforms.py (1 line):
	- line 59: max_ratio = np.abs(np.log(16.0/1000)) #TODO: make part of config
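
Note on the line 59 entry: the hardcoded expression evaluates to roughly 4.135. One possible way to lift it into configuration is sketched below. The function name and the `min_scale` and `max_scale` parameters are hypothetical and do not come from transforms.py; only the default values 16.0 and 1000 are taken from the quoted line.

```python
import math


def max_delta_ratio(min_scale=16.0, max_scale=1000.0):
    """Absolute log ratio of two scales; the defaults mirror the hardcoded constant."""
    return abs(math.log(min_scale / max_scale))


# With the defaults this matches np.abs(np.log(16.0/1000)).
print(round(max_delta_ratio(), 4))
```

A config entry could then supply the two scales instead of the literal in the source.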


models/vision/detection/awsdet/apis/train.py (1 line):
	- line 101: # add regularization losses TODO: change multiplier based on type of regularizer (currently L2)


models/vision/detection/awsdet/models/anchor_heads/anchor_head.py (1 line):
	- line 98: TODO: build common when assigner and sampler functionality is done


models/vision/detection/awsdet/datasets/loader/build_loader.py (1 line):
	- line 27: for each GPU. - TODO: unused


models/vision/classification/train_backbone.py (1 line):
	- line 141: #TODO support optimizers choice via config


models/vision/classification/models/resnet.py (1 line):
	- line 300: use_bias: whether to use biases for convolutional layers or not #FIXME:


models/vision/detection/awsdet/utils/runner/dist_utils.py (1 line):
	- line 20: return hvd.rank(), hvd.local_rank(), hvd.size(), hvd.local_size() #TODO return a dict instead


models/vision/detection/awsdet/models/anchor_heads/rpn_head.py (1 line):
	- line 252: TODO: move to utils