Summary: 696 instances, 579 unique

Count  Text
-----  ----
1      # TODO: see if we need to provide a way to specify these values. could ask for the whole
1      #ifdef USE_DML // TODO: This is necessary for the workaround in TransformGraph
1      // TODO: Revisit whether there is a significant benefit for the current
1      // TODO: add 16bit tensorize GEMM for AVX2
1      if (is_verbose) { // TODO: should we disable this if the number of nodes is above a certain threshold?
1      // TODO: support float16
1      // TODO: create one for nuphar
5      // Could reduce the chance of arithmetic overflow. TODO: fix it
1      TODO: When the NNAPI or CoreML EPs are setup to use the provider bridge the source code for that will be included
2      // TODO move this tvm_codegen
2      nullptr); // TODO: use thread pool from OpContext
2      // TODO remove this after decoupling compiler and runtime of WeightLayout
1      codegen_target_ = std::make_unique(target_str, 128, 1); // TODO: use real values
1      // TODO: Remove asym padding correction.
1      // TODO: get hostname for inclusion in Tensorboard events filename
1      // TODO: limit num_beams > 1 when we can have another operator for greedy search.
1      // TODO: currently this is a hacking to overwrite Conv's weight. The correct way to do this should be:
1      //TODO: fast_divmod only supports int32
1      // FIXME: This will trigger a llvm buffer overflow when group is too large
1      #command.extend([inputs_arg]) TODO: rebind IO inputs in TRT 8.2
1      // TODO: in most cases, the path name is a relative path and the behavior of the following line of code is undefined.
1      // TODO: Update TreeEnsemble* ops to use this instead of write_scores if possible.
1      # TODO: figure out a way to secure args.account_key to prevent later code changes
1      // TODO: What other high level categories are meaningful? Model? Optimizer? Execution?
1      //TODO tix, refactor the mpi related code to populate all fields correctly by default.
1      // TODO: this is a static hash lookup, it's easy to do it better
2      // TODO: complex64 complex128
2      // TODO: BinaryElementwiseBroadcastPrepare shall take dy_tensor as const Tensor*.
1      // TODO: Make this save the pointer, and not copy the whole structure across
1      # TODO: provider_bridge_ort.cc should not include nccl.h
1      * TODO: add support for arguments of type std::wstring. Currently it is not supported as the underneath
2      // TODO: trim the ending '\r' and/or '\n'
1      // TODO: use thread pool to parrellel
1      // TODO: move to nuphar
1      #TODO: add epsilon attribute
1      // TODO: Temperately removed path: parallelize by partitioning the batch rows,
1      // TODO - Split apart multiple op-sets with a common domain into multiple registries, as required by Lotus
1      # TODO: handle inner graph
1      # TODO: for now, core framework depends on CUDA. It should be moved to TensorRT EP
1      for (int64_t i = 0; i < num_elements; ++i) { // TODO: parallelize this? didn't give any benefit in my tests
1      //TODO: we might want to support load custom op schema on the fly
1      # TODO: shall we add back the checking of children op types.
1      // TODO 2: There is an inherent assumption that if we cast a cpu op's input from float16 to float
1      # TODO: Add anything that is needed for testing a minimal build
1      #include "core/graph/graph_utils.h" // TODO: Minimize usage of this given we want to use Actions in a minimal build
1      # TODO - review and add ML ops as needed
1      // TODO: We could decide that an ORT model is load only even in a full build,
1      // TODO: Subgraph not supported yet.
1      // TODO: support other modes
1      // TODO: Should we also ensure that element type for tensor is set ?
1      // TODO: Support 'string' and 'float16' types for completeness
1      // vocab_mask shape (vocab_size). TODO: support shape (batch_size, vocab_size)
1      // TODO: Add multi-threading logic if num_of_copies_per_batch is large enough
1      # is generating attribute error. TODO investigate the pipelines to
1      #TODO: the graph library should focus on ONNX IR, it shouldn't depend on math libraries like MKLML/OpenBlas
1      // TODO does the extra output need any changes?
1      // TODO Validate against allowed operator output NAMES. The types are not deduced.
1      // TODO: Minor Optimization possibility: This input_axis_patch will be common across all inputs
4      // TODO: temporary hack until View is improved (it doesn't work with Alias)
1      // TODO: support new attributes for MaxPool-8 and MaxPool-10
1      # TODO: use run_orttraining_test_orttrainer_frontend_separately to work around a sporadic segfault.
1      // TODO this should be an error case, needs more investigation
1      // TODO: only calculate once per model.
1      # TODO: Assert return type
1      # TODO: Move this to model_desc_validation.py
2      // TODO: remove this.
1      // TODO: change to use MlasExecuteThreaded
1      // TODO: this is a temp workaround for removing rank tensor before adding optimizer.
1      // TODO: merge all videoframes to a single DX12Texture Resource before call ConvertDX12TextureToGPUTensor.
1      TODO: How to support dynamic axes? Dimensions are determined by samples
1      // TODO: use Span.IndexOf when/if it removes the IEquatable type constraint
1      //TODO: not sure how to support reverse stride. may be able to calculate the shape differently
1      #TODO: fix the warnings, they are dangerous
1      libs = ['onnxruntime_pybind11_state.so', 'libdnnl.2.dylib', 'mimalloc.so'] # TODO add libmklml and libiomp5 later.
1      // TODO reduction types
1      # TODO: ESRP team is working on enable signing workflow on Mac. Should enable the following step when it's ready.
1      // TODO: Should we remove the OrtAllocatorType field from the OrtMemoryInfo struct to
1      // TODO: Currently, triggers copy on stream 0, investigate if we can still do that
1      # TODO: Explore ways to make self._graph_info.initializer_names and self._graph_info.initializer_names_to_train
2      //TODO: change to use MLAS when no layout could apply
1      # TODO: enable switching the feature on/off in the middle of the training
1      //TODO: for better performance, we could try to transfer the nodes in parent graph to sub-graph directly,
1      // TODO add support of scalar C for Gemm
1      # TODO: merge ConstantOfShape -->Cast to ConstantOfShape (need update the data type of value)
1      // TODO: release 16bit tensorize GEMV for AVX2
1      //TODO: sync_gpu if needed.
1      // TODO: remove compute_root
1      CXX_STANDARD 17 # TODO remove when everything else moves to 17
1      // TODO: Temporary threadpool for the executor. This is a costly way to handle the problem.
1      // TODO: change it to use provider::CopyTensor for non-CPU devices
1      // TODO: we don't really need the second parameter
1      // TODO: If the specified level of the optimization is greater than 1 then
1      // TODO: replace with ONNX one when that one is fixed
1      // TODO: a better check is to check does the CPU kernel with float exist or not.
1      // TODO: We probably need to do the recursion inline when processing the main graph in order to maximize efficiency.
1      // TODO convert {1} outputs back to scaler {} once that is done DnnlElementwiseCapability
1      # TODO: convert it to the specified input_type
1      // TODO: as fetch_allocators is optional, it should be a pointer instead of reference
1      // TODO: fix this checker later
1      // TODO: when T is FP16, compute in FP32, then cast result back to FP16. length_penalty_ might also be float.
1      // TODO if you need the number of physical cores you'll need to parse
3      //TODO: add a default constructor
1      // TODO: Tune the cost
1      // TODO: Scale during the tensorization phase instead of using the video frame pipeline when the input bounds are not the same size as the tensor
1      // TODO: Ideally we could avoid the overhead of creating a new Tensor (mainly cost of copying type and shape info)
1      // TODO standadize the getter and setter convention in env for other fields.
1      // TODO: Not sure if we need to purely limit to iOS builds, so limit to __APPLE__ for now
1      // TODO: Move away from usage of vector of `OrtMemoryInfo`s per weight (initializer)
1      # FIXME: find the first at::Tensor param anywhere in the signature
1      handle->allow_unaligned_buffers = info.allow_unaligned_buffers; // TODO remove this
1      // TODO: Add this to onnx instead of adding it here.
1      # TODO: Would be nice to specify ORT release and pull the config for that release.
2      # TODO: Try to reuse the output buffers as some of the output tensors are same sizes,
1      // TODO: Replace this with usage of code in Softmax operator
1      // TODO: topi only support 2d-pool, MaxPool1d and MaxPool3d will need to be added if necessary.
1      // TODO: make the max request size configable.
1      // TODO: test if it is reg
1      // TODO: support multi-target support
1      // TODO: solve the op version issue in the entire training framework
1      // TODO: optimization opportunity - if all tensors in the seq are scalars, we can potentially represent them
1      // TODO: Profile and tune this batch parallel execution based on input size.
1      * TODO: Consider MEMFS or IndexedDB instead of console.
1      // TODO: enable 16 bit tensorization
1      // TODO: add generic functions to manipulate the graph
1      # TODO: will input order get messed up
1      # TODO: this is a temporary allowlist for ops need type promotion
2      // TODO change name space
1      @TODO: Is it worth adding the max number of passes a rule should be applied for?
1      # TODO: thiagofc: Temporary experimental namespace for new PyTorch front-end
1      // TODO: adding other activations that can be fused.
1      // TODO: support stride
1      // TODO: x86 release builds produce warning C4723: potential divide by 0
1      // TODO: squeeze other shapes to 2D case
1      // TODO: This is copied from onnx schemas. When the change is in and we update this can be removed.
1      // TODO: rename tvm_target to a proper name
1      # TODO: how to support multiple CUDA versions?
1      // TODO: re-visit the different cases to see if there are other situations aside from constant ops.
1      // TODO: Find a better name than Reserve() and update in all places.
1      #TODO: set MASM flags properly
1      // TODO : Need to add support for float64
1      // TODO (contd.) We could also possibly absorb the per-thread logic in a new allocator decorator that derives
2      // TODO support ResizeBilinear
1      // TODO: this format for unbound output needs more discussion
1      // TODO: Implement BinarySearch in GLSL
1      //TODO:check overflow
18     // TODO remove deprecated global config
1      // TODO investigate whether we want to support subgraph using CoreML EP
1      // TODO: support other type
2      // TODO: add a global cache if need to handle cases for multiple frames running simultaneously with different batch_size
1      // TODO: Validate if this filtering makes sense.
1      // TODO: Re-enable when attributes values are guaranteed to be filled.
1      # TODO: accept loss_fn as an onnx model. build self.onnx_model_ with model and loss_fn
1      // TODO replace NodeUseCount approximation
1      # TODO: match mask path for GPT2LMHeadModel_BeamSearchStep.
1      //TODO: rewrite it with PathFindNextComponentW
1      // TODO: handle symbolic dim
1      # TODO: This assumes that the first output corresponds to the first input.
1      # TODO: to unpack the input argument.
1      // TODO: we could refactor the allocators to not require the call to GetAllocator but that change is much bigger
1      // TODO: temporary hack until View is improved (it doesn't work with Alias)
4      // TODO: implement reduction type of NONE
1      // TODO: Run benchmarks to determine if a dedicated kernel doing data copy will be faster than invoking cudaMemcpy ?
1      // TODO: we may delay-init this variable
1      # TODO: Remove when experimental checkpoint functions are removed.
1      /** TODO: remove this class
1      // TODO: change to TypedArray::ByteLength() in next node-addon-api release.
1      // TODO: If the specified optimization is greater than 1 then insert a Cast to the
1      // TODO: Scale during the tensorization phase instead of using the video frame pipeline when the input bounds are not the same size as the tensor
1      // TODO: Both onnxruntime and ONNX lists of types seem somewhat incomplete and incomparable.
1      TODO: mix of in-place and return, but results are different
1      // TODO: Create a general class to describe for computation topology and unify all similar uses.
1      # TODO: support these activation attributes
1      # TODO: detect when installed extensions are outdated and need reinstallation. Hash? Version file?
1      // TODO: Support string concat
1      // TODO: this method does not handle parallel optimization.
1      // TODO: Assumes that the OrtValue is a Tensor, make this generic to handle non-Tensors
1      // TODO: support other type
1      // TODO: Check if framebuffer is ready
1      // TODO: remove it after weight layout refactoring
1      // TODO change namespace from codegen to nuphar
1      // TODO: fix const_cast; Graph doesn't provide us a method "GetMutableInputs".
1      # TODO: Use external format if input has extra data.
1      // TODO: use GraphTransformer::GetCompatibleExecutionProviders if we need something more flexible
1      // TODO: currently variadic inputs have their count corrected (should be one entry with the total number of
1      // TODO support 2+ inputs for Min/Max op
1      // TODO need to ensure the data is written in little-endian format...
1      TODO: Support the Add use case: Sum(x, y) ~= Add.
1      // TODO: support float16
1      # TODO: Make detection of parameter update robust
1      // TODO: revisit the logic here to make it simpler
1      # TODO: replace this once PyTorch supports unregister natively.
1      // TODO: Clean up the ROCMExecutionProvider interface to avoid this
1      // TODO is the encoding correct? https://github.com/onnx/onnx/issues/2392
1      beam_scorer_->Initialize(allocator_, parameters_->sequence_length); // TODO: use temp_space_allocator
1      //TODO: (csteegz) Add modelspec for both paths.
1      // TODO: The resolve below is likely unnecessary.
1      // TODO: Reuse buffer for input_ids and position_ids to reduce memory allocation.
1      // TODO: refactor it
1      // TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson)
1      # TODO: Only apply required types to the global type list and ignore if it's model based per-op type reduction
1      # TODO: assert the the initializer raw data is not 0 (False)
1      # TODO:use OnnxModel.graph_topological_sort(self.model.graph) from transformers.onnx_model
1      # TODO: shall we have a separated group for nodes with subgraph?
1      #include "core/providers/nuphar/compiler/traverse_shape_infer.h" // TODO: get rid of this
1      // TODO: if there is a FusedKernelFn attached, onnxruntime will generate
1      #include // TODO remove this after removing tvm::runtime
1      // TODO: if the tensor need be copied, does it have enough room?
1      // TODO: we may need consider the lifetime of alloc carefully
1      // TODO: Should we remove the OrtAllocatorType field from OrtMemoryInfo to
1      // TODO: document the path separator convention? '/' vs '\'
1      // TODO add quantized matmul when reshape support quantized input
1      // TODO: remove tvm core function
2      // TODO: change to delayed call
1      // TODO: optimize memory
1      // TODO: enable following when is_train is present
1      // TODO: re-enable this shortcut once the index calculation bug is fixed.
2      // TODO retire this file
1      // TODO make OptimizerGraphConfig::loss_scale_input_name optional
2      // TODO: use MLAS or BLAS
1      std::uniform_real_distribution dist(0.0, 1.0); // TODO: should this be initialized per batch?
1      // TODO: May want a different calculation for each branch below instead.
1      // TODO: ideally we can just reuse the CPU EP registered with the session, but in the training session case
1      // TODO analysis move to namespace nuphar
1      // TODO change name space from tvm_codegen to nuphar
1      self._model_bytes = path_or_bytes # TODO: This is bad as we're holding the memory indefinitely
1      // TODO move this to another place
1      TODO: replace Gemm_Subgraph by MatMul + Add
1      // TODO: Use the mutli weight / grad version.
1      // TODO: validate options and memory format
1      // TODO - refactor the function with similar logic in Transpose3DKernel using 16x16 Tile
1      // TODO make OptimizerNodeConfig::loss_scale_input_name optional
1      // TODO: Should we consider weights here too?
1      // TODO: name() method of `type_info` class is implementation dependent
1      // TODO: eliminate use of const_casts
1      // TODO: Currently we use global AlgoPerfCache for ConvGrad only. Conv's perf cache is still per node.
1      // TODO: Shouldn't we check stage2_fp32_node_args to conditionally transform this?
1      // TODO should Run() call io_binding.SynchronizeInputs() or should it let the callers do it?
1      # TODO: search the directory to find correct output filename of "python setup.py install" when python version is not 3.6
1      // TODO hack - constant folding currently doesn't work after mixed precision transformation so it's disabled for now
1      // TODO: Add check to ensure that string arrays are not passed - we currently don't support string tensors in CUDA
1      // TODO: change this logic to removing a partition with only all alias ops
1      # TODO: monitor this for perf issues
1      /* Get Tensor element type */ //TODO: Assumed value is Tensor, need to support non-tensor types in future
1      // TODO: how to handle type conversion
1      // TODO: handling Cuda EP options
1      // TODO add multi-target support
1      // TODO: we should not have to add the TypeConstraint below, since it is meant to be in
1      // TODO: Does this need (and maybe AllocateNode) to be threadsafe so nodes_ and num_of_nodes_ managed more carefully?
2      // NV_TODO: optimize speed -- pass things needed in, optimize kernel speed, add half2
1      // TODO: this is actually used for the rocm kernels which explicitly ask for inputs from CPU.
1      // TODO: Task 812756: NNAPI EP, add support for subgraph (If and Loop operators)
1      // TODO: Parallelize this operation
1      //#include "core/codegen/passes/utils/codegen_context.h" // TODO: remove it
3      // TODO: Remove this contrib kernel registration and the schema from the appropriate places
1      # TODO: add postprocess of ONNX model to use graph input directly: glboal_mask = global_attention_mask
1      // TODO: Make an allocator for upload heaps
2      // TODO: log kernel inputs?
1      #include "core/providers/nuphar/common/analysis/shape_expr.h" // TODO: remove this shape_expr after shape_infernece refinement
1      // TODO: It's ugly to include a .cc file but this .cc file defines the implementation of some templates which we need.
1      # TODO: support peephole
1      // TODO: this IA("") does not cause kernel to know it is unneeded.
1      TODO: Create environment variables to allow extensions to be hosted outside ONNX runtime installation folder
1      # TODO: Implement a cleaner way to auto-generate torch_gpu_allocator.cc
1      // TODO: use cub::DeviceSegmentedRadixSort::SortPairsDescending instead of cub::DeviceRadixSort::SortPairsDescending
1      # TODO: Rework by implementing the "__getattribute__" method.
1      // TODO: push this to ONNX
1      # TODO: detect num_heads from graph like FusionAttention
1      // TODO move compilation part out of create_state_func to above
1      // TODO: refactor the following logic for both JIT-caching and AOT support
1      // TODO: unclear why we should go through a string-representation of type
1      // TODO: We construct a Transpose kernel on each call as doing so is fairly lightweight.
1      // TODO: mask_index can be used in softmax to save some calculation.
1      //TODO: should be size_t
1      @TODO: We need to define a contract about whether a rewrite rule is allowed to leave
1      // TODO: Must match onnxruntime::profiling::EventRecord
1      // TODO: Support non-tensors
1      // TODO: Understand the reason for failure by calling GetLastError().
1      /* TODO support these options
1      // TODO: remove
1      //TODO: might not need two sets, the union of them might be enough
1      // TODO: Identify error cases where-in an initializer is used on different
1      #include "core/framework/allocator.h" // TODO: get rid of this
1      // TODO: update OrtMemoryInfo, use unordered_set instead
1      // TODO: This is copied from onnx schemas. When the change is in and we update this can be removed.
1      // TODO: support more general depthwise convolution.
1      // TODO: It's slightly insane we don't support optionally removing the output edges as part of Graph::RemoveNode
1      // TODO: use custom kernel of expand to improve the performance.
1      // TODO: after AOT support, we should move it to a proper location
2      # TODO: Current DLPack doesn't support bool and PyTorch disables converting bool tensor to DLPack in recent commit.
1      // TODO - ...
1      // TODO: Only used for TRT and CUDA EP currently, need to add more identifiers to use it across all EPs
1      // TODO need a heuristic when to insert FP32 Cast
1      // TODO should we allow sharing of non-CPU allocators?
1      // TODO: support more types
1      if (inner_dim_size < kParallelizationThreshold) { // TODO: tune this, arbitrary threshold
1      // TODO - handle non-tensor types
1      std::remove(ToUTF8String(model_uri).c_str()); // TODO would be good to have something like RemoveFile(PathString)
1      // TODO: revisit to see if we can eliminate typeconstraint step
1      //TODO: valildate shapes
1      // TODO: support it
1      use_external_data_format = (config.n_layer > 24) #TODO: find a way to check model size > 2GB
2      // TODO: Once Schema is checked in to onnx lets fix this to match that
1      // TODO: Convert RunNodeAsync return Status.
2      //TODO: avoid using const_cast
2      // TODO Validate against allowed operator input NAMES. The types are not deduced.
1      // TODO figure out the eigen threadpool stuff for use here
1      // TODO: Could move to a helper
1      // TODO: Multi-thread if needed
2      // TODO: fix the warnings
1      //TODO: Don't do it when Protobuf_USE_STATIC_LIBS is OFF
1      // TODO: please reimplement output alias using the right algorithm.
1      // TODO: support other type
1      // TODO: temparary sulotion, need to unify the interface in EP and AllocatorManager
1      // TODO: enhance opgen.py to support inplace binary operations.
1      // TODO: remove default capture
2      // TODO: instead of calling Promote, we may consider to expose
1      // TODO: change identity to other alias
1      // TODO: could copy blocks instead of individual items for non std::string types
2      // TODO: add support for mixed precisions
1      // Dml methods (TODO need to figure out how these need to move to session somehow...)
1      TODO: Create a generic mechanism to pass arguments from ORTModule into each extension setup.py
1      # TODO: get the constant string via pybind.
1      # TODO: Non-contiguous tensor input in execution_session_run_forward, need tensor copy.
1      // TODO: support other type
1      //TODO: if we reuse the nodes in parent graph, maybe we don't need to resolve it.
1      TODO: Currently the type tracking is not specific to a version of the operator.
2      // TODO: is it possible to use nsync_from_time_point_ ?
1      // TODO: what is the difference between the device allocator and the arena allocator?
1      // TODO: Support all valid session options
1      #include "core/codegen/passes/utils/ort_tvm_utils.h" // TODO remove this after removing tvm::runtime
1      // TODO: Implement logic to check if the framebuffer is ready
1      //TODO: if initializer zero point is a vector of a unique value, we can still compute it
1      // TODO: Re-visit this method if more subgraph ops get accepted into ONNX
1      // TODO: Check if some version of the ONNX IR allows op_type to be empty
1      // TODO add a list of condition to handle dynamic registration
1      # TODO: move this info to a file (json?) and load the data from there.
1      // TODO: Investigate why IsConflict() was not triggered when there were duplicate Tile CUDA
1      // TODO: use ORT allocator to avoid allocating from heap directly
1      // TODO: Validate Bounds
1      // TODO: how to support layout
1      // TODO: Print which tensor fails.
1      # TODO: link to the Changelog.md
1      // TODO: add other activation methods
1      //TODO: The dynamic and statically planned alignments may not be the same, need to check
1      // TODO: Add memory analysis
1      # TODO: Decide if we need to wrap the type in a std::vector. Issue is that the element type is internal
2      // TODO: the error message should tell which operator raises it.
2      // NV_TODO: investigate cub support for half
1      // TODO: Do we need this class or is IAllocator::MakeUniquePtr sufficient/better
1      // TODO: force checksum validation in final release
1      // TODO: Support string depth-to-space.
1      // TODO: handling CPU EP options
1      //TODO: Better handle op version and domain.
1      #include "core/codegen/passes/utils/ort_tvm_utils.h" // TODO: remove this after decoupling layout compile and run
1      // TODO: Investigate whether we should set a max size, and clarify the cost of having a buffer smaller than
1      assert not is_sequence(vi.type) # TODO: handle loop accumulation in sequence_type
1      // TODO potential optimization opportunity
1      // TODO support more dimensionalities
1      // TODO: Explore padding the provided input shape [N, C, D] to [N, C, 1, D]
1      // TODO Here we require the all floating point inputs are generated by an immediate
1      // TODO: copy between devices? i.e. multiple GPUs
19     //TODO: fix the warnings
1      // TODO: Try expand inputs/outputs after first subgraph call instead. That may get better peroformance, but more complex to implement.
1      // TODO: Investigate benefits of using an arena based allocator
1      # TODO: Remove when experimental checkpoint functions are removed.
1      # TODO: rcom
1      // TODO ROCM added support recently, should verify.
1      # TODO: use other information (like initializer names) to identify different embedding weights automatically.
1      static_cast(element_count), //TODO: check range
1      // TODO: optimize the algorithm -- in some cases, if the last two dims are
1      // TODO: handle more general cases for activations f, h, g and activation_alpha and
1      // TODO: we want to support wildcard or regex?
1      // TODO: Switch this to Path.of when the minimum Java version is 11.
1      // TODO: modify row/col order for other dimensions.
3      // TODO: Clean up the CUDAExecutionProvider interface to avoid this
2      // TODO: Make an allocator for readback heaps
1      // TODO: fix this change it to batched gemm/conv
1      // TODO: What is the plan for optional sequence tensors ?
1      // TODO: Check if we can upgrade all the current opset 6 models that are being tested
1      # TODO: Support roberta (position starts from 2 instead of 0) in EmbedLayerNormalization kernel
1      // TODO: support profling
1      // TODO: We should consider checking the cut is valid --- Cut Infos should
1      #include "core/common/cpuid_info.h" // TODO: refactor to control through config
2      // TODO: get maximum available size from memory areana
1      // TODO: Currently we remove the node and use the input name from the node being removed.
1      // TODO: Need a proper scheme to generate unique names to avoid name-collision.
1      // TODO: Handle optional attribute but no default value specified in op definition.
1      default: // TODO: will add more type support if needed.
1      // TODO: refactor the LSTMcell building to a tvm function
2      // TODO: only call OnRunStart for all providers in-use
1      // TODO this memset can be expensive, a sparse tensor representation would help here
1      // TODO: should specify the access characteristics of model_path. Is this read only during the
1      // TODO: this is a temporary workaround to apply the DML EP's custom graph transformer prior to partitioning. This
1      // TODO: decide whether this should be a class
1      nullptr, // TODO: support add_qk in quantized attention
1      # TODO: remove after PyTorch ONNX exporter supports VAR_KEYWORD parameters.
1      //TODO: Fix the warning. The variable should be put in the environment class.
1      // TODO: Check if we can adopt a similar approach to deal with asymmetric pads in 'ConvTranspose'
1      // TODO Instead of returning an error, should we set mem pattern to false here and log a warning saying so?
1      // TODO: handle stride
2      // TODO: decprecate GatherND-1 after updating training models to opset-12
1      #TODO: fix the warnings, they are dangerous
1      // TODO: remove it, after TVM is fixed
1      // TODO: remove this when we update to 6.1 or later
1      # TODO: add fully-qualified name.
1      //TODO: use format any to choose the best layout
2      // TODO ReduceKernel::ReduceKernelShared() is still used by some other training classes though it's not used here - this should be refactored.
1      // TODO: support attribute 'ceil_mode'
2      // TODO: Handle string copies when the kernel eventually supports string type.
1      // TODO: validate inputs.
1      // TODO: support more types than float
9      // TODO remove deprecated global config
1      //TODO:ensure dtype_!=nullptr
1      ) #TODO: shall we use one-sided: less or greater according to "top1_match_rate"
1      // TODO: Refine this rough metric to choose when to parallelize.
1      // TODO: maybe group the gradient builders and split them into different files.
1      // TODO: This should probably be defined to be the equality operator on TensorShapeProto.
1      # TODO: Pick the right "out" Torch parameter; do not assume the first one
1      # TODO: import from onnxconverter_common when it is stable
1      // TODO: support pre-allocated output
1      TODO: This is visible for testing at the moment, but we should rather make it private. */
1      // TODO: when the graph contain a function node, and user pass in the dll which could
1      //TODO: Fix the warning
1      // TODO: build N-Gram index (hash table with prefix of length NGram - 1 as key, and list of last word of NGram as value) for fast matching.
1      // TODO:supporting other coordinateTransformModes
1      // TODO is the encoding correct? https://github.com/onnx/onnx/issues/2392
1      // TODO: This is consistent with the previous implementation but inconsistent with GetValueType which returns
1      // TODO: Add check to ensure that string arrays are not passed - we currently don't support string tensors in CUDA
1      // TODO: This is a hack which strips the suffix added within Lotus transforms that insert mem copies.
1      # TODO: comple this once ORT training can do evaluation.
3      // TODO: decprecate GatherND-1 after updating training models to opset-12
1      // TODO: The current implementation of sequence ops relies on tensor copies. Ideally we should try to avoid
1      // caller. TODO: Make sure this semantics is clear in InferenceSession API.
1      // TODO: for now this just needs to support ONNX ops. If we ever had a transformer that was going to
1      // TODO: make it better
1      # TODO: check quantize_inputs after sub graph is supported
1      // TODO : Make this generic enough to compute default attributes for multi-dimensional conv
1      //TODO: we need to make it a stand-alone header because both graph.cc and model.cc need to implement create instance of the graph object.
1      // TODO: Check all videoFrames come from either CPU or GPU.
1      @TODO: Is a bottom-up traversal more efficient?
1      // TODO: change name
1      // TODO: Depreacate this schema when training support is udpated to opset-12
1      // TODO: See if we need to calculate the root_nodes_ of the filtered graph.
1      // TODO: calculate source location from 'stack'
1      // TODO: create a list of must-to-fetch tensors and pass it to all graph transformer.
1      // TODO: support attribute 'ceil_mode' and 'storage_order'
1      // TODO: figure out how to get the correct element type.
1      # TODO: call convert_longformer_to_onnx to export onnx instead.
1      TODO: Note that the above doesn't handle the case where training_mode=false and a non-zero
1      // TODO: OverlappingTimeSchedules should be private
1      // TODO: support setting attributes of the custom op.
1      // TODO: change the Cost and ForcePartition to a more complex form.
1      // TODO: use LoggingManager::GetTimestamp and date::operator<<
1      #include "core/codegen/mti/mti_tvm_utils.h" // TODO: remove this after decoupling layout compile and run
1      // TODO: Need to add denotations here
1      // TODO: We could try and call ExecuteGraph to get the output shape from fetches so the rank is correct,
1      // Dml methods (TODO need to figure out how these need to move to session somehow...)
1      # TODO: add postprocess of ONNX model to calculate based on graph input: input_mask = (attention_mask - 1) * 10000.0
1      // TODO: The CUDA EP may not need to use the per-thread allocator for allocations that would use this map
5      // TODO: implement profiling
1      // TODO: handle symbolic dimensions
2      // TODO: add flags
1      // TODO: make this the Base
1      // TODO: support new attributes for AveragePool-10
1      // TODO support more algos when they become available.
1      // TODO The line above has const-correctness issues. Below is a possible fix which copies the tensor_proto data
1      // TODO: These two AllocateMLValue... methods are in the API purely for unit test usage.
1      # TODO: Remove when experimental checkpoint functions are removed
4      // TODO move the following variables to member variables for extending life-time
1      "engine": "torch", #TODO: test torchscript
1      // TODO: How to handle unload failure ?
1      // TODO rename this file to creator_base
1      unsigned max_num_graph_transformation_steps = 10; // TODO choose a good default here?
1      // TODO: This is a cuDNN bug that gave wrong results in certain strided convolution gradient setups
1      //TODO: expand to arbitrary permutation or transpose on given 2 dims for higher dimensional tensors
1      // Dml methods (TODO need to figure out how these need to move to session somehow...)
1      // TODO: Remove this after we have supported asymmetric padding in the CUDA ConvTranspose kernel
1      // TODO : Need to add support for multi-dimensional conv
1      // TODO: throw exception?
1      // TODO: The copy being made is possibly redundant if this occurs in a subgraph
1      // TODO: add support for sequence not axis 0
1      // TODO: remove once we have a better solution for alias optimization
1      //TODO: enable the check back
1      /* TODO: we can improve this if the concrete shapes are known for both as below.
1      // TODO: this mapping is temporary.
1      class DivGrad : public ReduceKernel { // TODO: not to derive from ReduceKernel.
1      // TODO remove this after rewriting Cost function and ForcePartition
1      // TODO: Change the current interface to take Path object for model path
1      #TODO: fix the warnings
1      // TODO 1: Check if the kernel available
1      // TODO: test case in which a tensor is produced by a forward op, stashed
1      // TODO: We should check that names specified in kernel_type_constraints are
1      // TODO implement hardware feature detection in windows.
1      //TODO: is it optional or not?
1      // TODO: wrap with change generation guard
1      // TODO: Should we re-use the block if the size is large enough? Would probably need to allow it
1      // TODO: handle invalid conversion cases
1      // TODO: add 16bit tensorize GEMV/GEMM for AVX512
1      #include // TODO: remove this after decoupling layout compile and run
1      // TODO: Graph::FuseSubGraph and Graph::FinalizeFuseSubGraph should now create valid edges so this call to
1      TODO: Used for CUDA & TRT only for now, will have one more PR to apply this for all EPs.
1      # TODO: Currently we have different collect() for entropy and percentile method respectively.
4      //TODO:fix the warnings
1      * @remarks TODO: Provide way to load from in-memory bytes without copying. InferenceSession would need to
1      // TODO Add isa support
1      // TODO: Simple copy for now. std::string would be better as std::reference_wrapper
1      //TODO Add exhaustive and default cases for algo.
1      #TODO: support graph_topological_sort() in subgraphs
1      // TODO: Why do we need to handle "Cast" here?
1      // TODO: enable support for Indices
1      // TODO: Add support for 'double' to the scoring functions in ml_common.h
1      // TODO: should I try to use SetThreadSelectedCpuSets?
1      // TODO: move it to weight layout place
1      // TODO: Compute is a const function, generator cannot be a private meber
1      # FIXME: warn if we have not consumed all torch parameters (either as
1      // TODO return these only when packing is enabled.
2      // TODO: node->ImplicitInputDefs() need to be added here for control flow nodes.
1      //TODO: we are only provide host allocate method in compute context.
1      # TODO: PyTorch exporter bug: changes the initializer order in ONNX model
2      // TODO: Expose knobs so that users can set fields associated with OrtArenaCfg so that we can pass it to the following method
2      //TODO: need to support float -> float16 and float16-> float conversion
1      // TODO: check binary size of this and replace it with constexpr if large
4      // TODO: add webgl 1 handling.
1      # TODO: support GEMM op if needed
1      // TODO: Ideally, we should not need to transform a function-body here.
1      # TODO: let user give their own Boost installation
1      {"Pow", {1}}, // TODO: Pow's input_1 is differentiable, but gradient not yet implemented
1      // TODO: re-use node arg of mul5 so that it is allowed to be graph output (Need modify CheckNode as well).
1      # TODO: this test will be replaced with convergence test ported from backend
1      // TODO - Evaluate switching the types below to support transparent comparators and enable
1      // TODO: rename class name to more target-specific in the tvm refactoring
1      // TODO: This macro can be removed when we changed all APIs to return a status code.
1      # TODO This is a temporary fix to stop exporting QAttention with qkv_hidden_sizes
1      //TODO: valildate shapes
1      // TODO: add target as an input
1      # TODO: this is the last time we have to do this!!!
1      // TODO: Move this to the right location. Its only here for quick experimentation.
2      // TODO: log kernel outputs?
1      // TODO: differentiate reuse and alias, by add AllocKind::kAlias?
1      // TODO: Take into account the input's shape to get a tighter bound for the ends.
1      // TODO add other post load processing here
2      // TODO: this is actually used for the cuda kernels which explicitly ask for inputs from CPU.
1      // TODO add more validation here like checking shape of the allocated buffers
1      // TODO:
1      // TODO: move all tvm related code a manager
1      // TODO: add more tensor utilities
1      // TODO in training mode, it may be possible to infer some of
1      # TODO: handle automatically by merging lines
1      // TODO: replace it with GemmBatch for performance, it's OK for now as GemmBatch unrolls as well
1      # TODO: Add more.
1      ['Slice', '', '10+', sliceV10], // TODO: support 'steps' for Slice-10
1      #TODO: inlcude external data in total model size.
1      # TODO: Handle mutliple results
1      # TODO: this variable is not really needed
1      // TODO: there is no need to manually malloc/free these memory, it is insecure
1      // TODO: Could avoid the topological sort in the GraphViewer ctor by constructing from an existing
1      // TODO: move this to an operator in graph.
1      // TODO: Tell user why it has conflicts
1      // TODO: enable support for non-default dilations
1      // TODO: verify if scalar shape works here together with matmul_add_fusion.
1      TODO: Remove once PyTorch 1.8.2 or newer is released
1      // TODO: This currently assumes that all outputs of FP32 ops are FP32.
1      // TODO: Do we need to catch any std::exceptions from creating/writing to disk and convert to Status codes?
1      # TODO: implement input_forget = 1
2      // TODO we need this to properly return a packed vec4 from kernels.
1      // TODO: swap texR, texC order in following function so row is corresponding to u and column is corresponding to
1      #TODO: move this step to a CPU-only machine to save GPU resources.
1      #define SHARED_PROVIDER_TODO 0
1      // TODO: add more parameter in consructor to support different target
3      // TODO: allow specification of axis1 and axis2?
1      // TODO: Support 'Variadic' inputs/outputs
1      // TODO: Currently we use global AlgoPerfCache for ConvGrad only. Conv's perf cache is till per node.
1      const Tensor* marshalled_initializer = nullptr; // TODO: change it to unique_ptr
1      // TODO: validataion might not be needed, since it's already done once in the fw pass
1      using onnxruntime::common::Status; // TODO: Needed by WinML, but shouldn't be put into the global namespace like this
2      // TODO support other data types
1      if (x_size < kParallelizationThreshold) { // TODO: tune this, arbitrary threshold
1      // TODO: Support string type
1      // TODO: Currently we have no way to transport this error to the API user