picotron/model.py (5 lines):
- line 13: # TODO: maybe wrap RoPE in a RotaryEmbedding(nn.Module) class later (sketch below)
- line 108: # TODO: support an attention mask
- line 147: # TODO: replace the attention implementation with FlexAttention (sketch below)
- line 185: # TODO: avoid single-line chained operations; they are harder to debug
- line 231: # TODO: use the default position_ids for RoPE during training; if time allows, work on generation

submit_slurm_jobs.py (2 lines):
- line 152: # TODO: batch submissions into SLURM job arrays (sketch below)
- line 219: # TODO: add more subcommands, e.g. "python slurm.py submit_jobs --....", "python slurm.py update_jobs --....", "python slurm.py cancel_jobs --....", "python slurm.py check_status --...." (sketch below)

train.py (1 line):
- line 185: # TODO: load an existing checkpoint here to continue pre-training (sketch below)

picotron/utils.py (1 line):
- line 67: # TODO: LayerNorm is also split across TP ranks for sequence parallelism (note below)

picotron/pipeline_parallel/pipeline_parallel.py (1 line):
- line 173: # TODO: call deallocate_output_tensor as in Megatron-LM to free sent activations early (sketch below)

picotron/tensor_parallel/tensor_parallel.py (1 line):
- line 225: # TODO: pad the vocab size, e.g. to a multiple of the TP world size (sketch below)
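
For the model.py line 13 item, a minimal sketch of what a RotaryEmbedding(nn.Module) could look like, assuming the usual theta = 10000 parameterization and [batch, num_heads, seq_len, head_dim] tensors; the buffer layout and apply_rotary_pos_emb helper are illustrative, not picotron's current code.

```python
# Hypothetical sketch, not picotron's current code: RoPE wrapped in a module.
import torch
import torch.nn as nn

class RotaryEmbedding(nn.Module):
    def __init__(self, head_dim: int, base: float = 10000.0):
        super().__init__()
        # One inverse frequency per pair of channels in a head.
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, position_ids: torch.Tensor):
        # position_ids: [batch, seq_len] -> cos, sin: [batch, seq_len, head_dim]
        freqs = torch.einsum("bs,d->bsd", position_ids.float(), self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(x, cos, sin):
    # x: [batch, num_heads, seq_len, head_dim]; cos/sin broadcast over heads.
    return x * cos.unsqueeze(1) + rotate_half(x) * sin.unsqueeze(1)
```

Taking position_ids as a forward argument also lines up with the line 231 note: training can pass the default arange positions, while generation can later pass offset positions for the decoded tokens.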
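
For the model.py line 147 item, a sketch of what switching to FlexAttention could look like (requires PyTorch >= 2.5); it also touches the line 108 mask TODO, since masking is expressed through a block mask. The [batch, num_heads, seq_len, head_dim] layout is assumed, and flex_attention is usually wrapped in torch.compile for speed.

```python
# Sketch under the assumption of PyTorch >= 2.5 and [B, H, S, D] tensors.
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def causal(b, h, q_idx, kv_idx):
    # Each query attends only to keys at or before its own position.
    return q_idx >= kv_idx

def attention(q, k, v):
    seq_len = q.shape[-2]
    # B=None / H=None broadcasts the same mask over batch and heads.
    block_mask = create_block_mask(
        causal, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len, device=q.device
    )
    return flex_attention(q, k, v, block_mask=block_mask)
```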
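
For the submit_slurm_jobs.py line 152 item, a sketch of batching many runs into a single SLURM job array instead of one sbatch call per run. The manifest format, the minimal sbatch header, and the `--config` flag on train.py are assumptions, not the script's current behavior.

```python
# Sketch only: paths, the manifest format, and train.py's CLI are assumptions.
import subprocess
from pathlib import Path

def submit_job_array(config_paths, logs_dir="logs"):
    logs = Path(logs_dir)
    logs.mkdir(exist_ok=True)
    # One config path per line; each array task picks its line by SLURM_ARRAY_TASK_ID.
    manifest = logs / "configs.txt"
    manifest.write_text("\n".join(config_paths) + "\n")

    script = f"""#!/bin/bash
#SBATCH --job-name=picotron
#SBATCH --array=0-{len(config_paths) - 1}
#SBATCH --output={logs}/%A_%a.out
CONFIG=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" {manifest})
srun python train.py --config "$CONFIG"
"""
    script_path = logs / "array_job.slurm"
    script_path.write_text(script)
    subprocess.run(["sbatch", str(script_path)], check=True)
```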
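
For the submit_slurm_jobs.py line 219 item, a sketch of the subcommand interface using argparse sub-parsers; the handlers and flags are placeholders, and the elided options ("--....") are left to be filled in.

```python
# Sketch of an argparse subcommand layout; handlers and flags are placeholders.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="slurm.py")
    sub = parser.add_subparsers(dest="command", required=True)

    submit = sub.add_parser("submit_jobs", help="submit a batch of jobs")
    submit.add_argument("--configs", nargs="+", required=True)

    sub.add_parser("update_jobs", help="update pending jobs")

    cancel = sub.add_parser("cancel_jobs", help="cancel jobs by id")
    cancel.add_argument("--job-ids", nargs="+", required=True)

    sub.add_parser("check_status", help="print the status of submitted jobs")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    # Dispatch here, e.g. {"submit_jobs": submit_jobs, ...}[args.command](args)
```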
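
For the train.py line 185 item, a sketch of resuming pre-training from a checkpoint; the checkpoint keys ("model", "optimizer", "step") and the path handling are assumptions, not picotron's actual checkpoint format.

```python
# Sketch only: the checkpoint layout below is an assumption.
import os
import torch

def maybe_resume(model, optimizer, ckpt_path):
    """Return the step to resume from (0 if no checkpoint exists)."""
    if not os.path.isfile(ckpt_path):
        return 0
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1
```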
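
The picotron/utils.py line 67 note is terse, so the following is only one reading of it, borrowed from how Megatron-LM handles sequence parallelism: each TP rank sees a different slice of the sequence, so the gradients of the (replicated) LayerNorm parameters have to be all-reduced over the TP group.

```python
# Hedged sketch: assumes norm parameters are identifiable by name.
import torch.distributed as dist

def allreduce_layernorm_grads(model, tp_group):
    for name, param in model.named_parameters():
        if "norm" in name and param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=tp_group)
```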
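
For the pipeline_parallel.py line 173 item, the core of Megatron-LM's trick: once a stage has sent its output activation downstream, only the autograd graph is needed locally, so the tensor's storage can be swapped for a one-element placeholder. Megatron-LM pairs this with a custom backward call, because torch.autograd.backward would otherwise reject the shape mismatch between the shrunken output and its incoming gradient.

```python
# Sketch of the storage swap; Megatron-LM additionally asserts the tensor is
# not a view and uses a custom backward that skips autograd's shape check.
import torch

def deallocate_output_tensor(out: torch.Tensor):
    # Keep grad_fn (the graph) alive but free the large activation storage.
    out.data = torch.empty((1,), device=out.device, dtype=out.dtype)
```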
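
For the tensor_parallel.py line 225 item, a sketch of padding the vocabulary so it splits evenly across TP ranks; following Megatron-LM's convention, it also rounds up to a multiple of 128 for GEMM friendliness. The function name and the default multiple are assumptions.

```python
# Sketch: pad vocab to a multiple of (multiple * tp_world_size).
def pad_vocab_size(vocab_size: int, tp_world_size: int, multiple: int = 128) -> int:
    step = multiple * tp_world_size
    return ((vocab_size + step - 1) // step) * step

# e.g. pad_vocab_size(50257, tp_world_size=2) -> 50432
```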