core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala (17 lines):
- line 556: * TODO SPARK-24942 Improve cluster resource management with jobs containing barrier stage
- line 1208: // TODO: Probably should actually find among the active jobs that need this
- line 1625: // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
- line 1941: // TODO Refactor this out to a function that accepts a ResultStage
- line 1979: // TODO: Perhaps we want to mark the resultStage as failed?
- line 2069: // TODO: SPARK-35547: Clean all push-based shuffle metadata like merge enabled and
- line 2070: // TODO: finalized as we are clearing all the merge results.
- line 2119: // TODO: Cancel running tasks in the failed stage -- cf. SPARK-17064
- line 2180: // TODO: support to rollback result tasks.
- line 2213: // TODO: mark the executor as failed only if there were lots of fetch failures on it
- line 2280: // TODO SPARK-24877 leave the zombie tasks and ignore their completion events.
- line 2288: // TODO Refactor the failure handling logic to combine similar code with that of
- line 2484: // TODO: SPARK-35536: Cancel finalizeShuffleMerge if the stage is cancelled
- line 2485: // TODO: during shuffleMergeFinalizeWaitSec
- line 2571: // TODO: Lower-level scheduler should also deal with this
- line 2587: // TODO: SPARK-35549: Currently merge statuses results which come after shuffle merge
- line 2588: // TODO: is finalized is not registered.
python/pyspark/pandas/series.py (16 lines):
- line 1063: # TODO: NaN and None when ``arg`` is an empty dict
- line 1064: # TODO: Support ps.Series ``arg``
- line 1198: # TODO: Currently, changing index labels taking dictionary/Series is not supported.
- line 2216: self._column_label, scol.alias(name_like_string(self.name)) # TODO: dtype?
- line 2335: self._psdf._internal.with_new_spark_column(self._column_label, scol) # TODO: dtype?
- line 2381: # TODO: last two examples from pandas produce different results.
- line 2811: # TODO: Categorical type isn't supported (due to PySpark's limitation) and
- line 3769: # TODO: not all arguments are implemented comparing to pandas' for now.
- line 3978: # TODO: add 'interpolation' parameter.
- line 4060: # TODO: add axis, pct, na_option parameter
- line 4858: # TODO: introduce 'in_place'; fully support 'regex'
- line 5112: return self._with_new_scol(current) # TODO: dtype?
- line 5194: self._column_label, scol # TODO: dtype?
- line 5210: self._column_label, scol # TODO: dtype?
- line 5584: return self._with_new_scol(cond) # TODO: dtype?
- line 5590: sdf, index_fields=combined._internal.index_fields, data_fields=[None] # TODO: dtype?
python/pyspark/pandas/groupby.py (15 lines):
- line 156: # TODO: Series support is not implemented yet.
- line 157: # TODO: not all arguments are implemented comparing to pandas' for now.
- line 667: # TODO: 'q' accepts list like type
- line 800: # TODO: sync the doc.
- line 932: # TODO: sync the doc.
- line 1072: # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
- line 1335: # TODO: skipna should be implemented.
- line 1377: # TODO: groupby multiply columns should be implemented.
- line 2100: # TODO: implement 'dropna' parameter
- line 3265: # TODO: 'adjust', 'axis', 'method' parameter should be implemented.
- line 3854: # TODO: Implement 'percentiles', 'include', and 'exclude' arguments.
- line 3855: # TODO: Add ``DataFrame.select_dtypes`` to See Also when 'include'
- line 4231: # TODO: add keep parameter
- line 4313: # TODO: add keep parameter
- line 4398: # TODO: add bins, normalize parameter
python/pyspark/pandas/namespace.py (12 lines):
- line 1399: # TODO: add `coerce_float` and 'parse_dates' parameters
- line 1470: # TODO: add `coerce_float`, `params`, and 'parse_dates' parameters
- line 1525: # TODO: add `coerce_float`, `params`, and 'parse_dates' parameters
- line 1580: if " " not in striped: # TODO: identify the table name or not more precisely.
- line 2339: # TODO: there are many parameters to implement and support. See pandas's pd.concat.
- line 2491: ): # TODO: support dict
- line 2670: # FIXME: better ordering
- line 2683: # FIXME: better ordering
- line 2690: # TODO: NaN and None difference for missing values. pandas seems to be filling NaN.
- line 2743: data_fields=None, # TODO: dtypes?
- line 2842: # TODO: Add back:
- line 2916: # TODO: Add back:
python/pyspark/pandas/generic.py (11 lines):
- line 151: # TODO: add 'axis' parameter
- line 211: # TODO: add 'axis' parameter
- line 272: # TODO: add 'axis' parameter
- line 333: # TODO: add 'axis' parameter
- line 2393: # TODO: by argument only support the grouping name and as_index only for now. Documentation
- line 2790: # TODO: 'center', 'win_type', 'on', 'axis' parameter should be implemented.
- line 2822: # TODO: 'center' and 'axis' parameter should be implemented.
- line 2846: # TODO: 'adjust', 'axis', 'method' parameter should be implemented.
- line 3322: # TODO: add 'downcast' when value parameter exists
- line 3401: # TODO: add 'downcast' when value parameter exists
- line 3480: # TODO: add 'axis', 'inplace', 'downcast'
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala (8 lines):
- line 113: /* TODO (once LDA can be trained with Strings or given a dictionary)
- line 132: /* TODO (once LDA can be trained with Strings or given a dictionary)
- line 241: // TODO: declare in LDAModel and override once implemented in DistributedLDAModel
- line 363: // TODO: declare in LDAModel and override once implemented in DistributedLDAModel
- line 683: // TODO: Avoid zip, which is inefficient.
- line 710: // TODO: generalize this for asymmetric (non-scalar) alpha
- line 737: // TODO: generalize this for asymmetric (non-scalar) alpha
- line 809: // TODO:
mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala (8 lines):
- line 250: // TODO: use Breeze UFunc
- line 263: // TODO: use Breeze UFunc
- line 458: // TODO: squared error is more natural but converges slower
- line 494: // TODO: allocate outputs as one big array and then create BDMs from it
- line 523: // TODO: allocate deltas as one big array and then create BDMs from it
- line 534: // TODO: explain why delta of top layer is null (because it might contain loss+layer)
- line 843: // TODO: will make a copy if vector is a subvector of BDV (see Vectors code)
- line 848: // TODO: deprecate standard optimizer because it needs Vector
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (7 lines):
- line 375: // Please do not insert any other rules in between. See the TODO comments in rule
- line 629: // TODO: For Cube/Rollup just set nullability to be `true`.
- line 739: // TODO: mark Aggregate as resolved even if it has GROUPING SETS. We can expand it at the end
- line 911: // TODO: Don't construct the physical container until after analysis.
- line 922: // TODO: Support Pandas UDF.
- line 1135: // TODO (SPARK-27484): handle streaming write commands when we have them.
- line 3937: // TODO: since the field name is already resolved, it's more efficient if
python/pyspark/ml/connect/classification.py (7 lines):
- line 96: # TODO: add a setting seed param.
- line 99: # TODO: support training on GPU
- line 100: # TODO: support L1 / L2 regularization
- line 132: # TODO: early stopping
- line 227: # TODO: support pandas dataframe fitting
- line 254: # TODO: support GPU.
- line 343: # TODO: Use spark broadast for `model_state_dict`,
python/pyspark/pandas/indexes/base.py (6 lines):
- line 437: # TODO: avoid using default index?
- line 802: # TODO: add downcast parameter for fillna function
- line 830: internal = InternalFrame( # TODO: dtypes?
- line 1208: # TODO: add error parameter
- line 2261: # TODO: non-categorical or categorical with different categories
- line 2322: # TODO: We can't support different type of values in a single column for now.
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (6 lines):
- line 487: // TODO: In future we can have Spark support columns sorted in descending order
- line 684: // TODO: figure out how to drop multiple partitions in one call
- line 988: new Path(loadPath), // TODO: Use URI
- line 998: loadPath: String, // TODO URI
- line 1333: // TODO: stats should include all the other two fields (`numFiles` and `numPartitions`).
- line 1346: // TODO: still fill the rowCount even if sizeInBytes is empty. Might break anything?
python/pyspark/pandas/base.py (6 lines):
- line 262: # TODO: This is a quick hack to support NumPy type. We should revisit this.
- line 466: # TODO: support more APIs?
- line 966: # TODO: axis and many arguments should be implemented.
- line 1055: # TODO: axis, skipna, and many arguments should be implemented.
- line 1118: # TODO: add frep and axis parameter
- line 1193: # TODO: Update Documentation for Bins Parameter when its supported
python/pyspark/pandas/window.py (6 lines):
- line 161: # TODO: 'min_periods' is not equivalent in pandas because it does not count NA as
- line 209: lambda psser: psser._with_new_scol(func(psser.spark.column)), # TODO: dtype?
- line 957: applied.append(agg_column._with_new_scol(func(agg_column.spark.column))) # TODO: dtype?
- line 1444: # TODO: when add 'axis' parameter, should add to here too.
- line 2554: # TODO: when add 'adjust' parameter, should add to here too.
- line 2665: # TODO: when add 'adjust' parameter, should add to here too.
python/pyspark/core/rdd.py (5 lines):
- line 1055: # TODO: add log warning for when more than one iteration was run
- line 3677: # TODO: add option to control map-side combining
- line 3774: # TODO: add control over map-side aggregation
- line 3982: # TODO: support variant with custom partitioner
- line 4203: # TODO: add variant with custom partitioner
common/utils/src/main/scala/org/apache/spark/util/ClosureCleaner.scala (5 lines):
- line 209: // TODO: clean all inner closures first. This requires us to find the inner objects.
- line 210: // TODO: cache outerClasses / innerClasses / accessedFields
- line 621: // TODO: maybe lift this restriction to support other functional interfaces in the future
- line 949: // TODO: maybe lift this restriction and support other functional interfaces
- line 1078: // TODO: Recursively find inner closures that we indirectly reference, e.g.
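The ClosureCleaner entries above concern trimming what Spark closures drag along when they are serialized. As a rough, Spark-free illustration of the underlying capture problem (all names in the sketch are made up for this example), a lambda that reads an instance field captures the whole enclosing object, not just the field:

```scala
// Minimal sketch of the closure-capture issue ClosureCleaner exists to mitigate.
// `Multiplier` and its fields are hypothetical names used only for illustration.
class Multiplier(val factor: Int, val hugeUnrelatedState: Array[Byte]) extends Serializable {
  // Reads `this.factor`, so the returned lambda captures the whole Multiplier instance,
  // including `hugeUnrelatedState`, if it ever gets serialized with a task.
  def makeClosure: Int => Int = (x: Int) => x * factor

  // Manual workaround: copy the field into a local so only an Int is captured.
  def makeCleanClosure: Int => Int = {
    val localFactor = factor
    (x: Int) => x * localFactor
  }
}

object ClosureCaptureSketch extends App {
  val m = new Multiplier(3, new Array[Byte](1 << 20))
  println(m.makeClosure(14))      // 42
  println(m.makeCleanClosure(14)) // 42, but without dragging the instance along
}
```

ClosureCleaner automates this kind of trimming by nulling out outer references a closure does not actually use; the TODOs above are about extending that analysis (inner closures, caching the discovered classes and fields).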
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala (5 lines):
- line 284: // TODO: We don't support an array value tags in map yet.
- line 321: // TODO: This method might have to be removed. Some logics duplicate `convertObject()`
- line 430: // TODO: we don't support partial results now
- line 444: // TODO: find a more efficient way to convert ArrayBuffer to GenericArrayData
- line 521: // TODO: This function unnecessarily does type dispatch. Should merge it with `castTo`.
mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala (5 lines):
- line 182: // TODO: allow to specify label precision and feature precision.
- line 356: // TODO: This implementation has performance issues due to unnecessary serialization.
- line 357: // TODO: It is better (but trickier) if we can cast the old vector type to new type directly.
- line 409: // TODO: This implementation has performance issues due to unnecessary serialization.
- line 410: // TODO: It is better (but trickier) if we can cast the new vector type to old type directly.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TryEval.scala (5 lines):
- line 84: // TODO: support TRY eval mode on datetime arithmetic expressions.
- line 122: // TODO: support TRY eval mode on datetime arithmetic expressions.
- line 159: // TODO: support TRY eval mode on datetime arithmetic expressions.
- line 198: // TODO: support TRY eval mode on datetime arithmetic expressions.
- line 230: // TODO: support TRY eval mode on datetime arithmetic expressions.
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala (4 lines):
- line 134: * UnionObjectInspector: (tag: Int, object data) (TODO: not supported by SparkSQL yet)
- line 278: // TODO we don't support the HiveVarcharObjectInspector yet.
- line 842: // TODO decimal precision?
- line 1118: // TODO precise, scale?
core/src/main/scala/org/apache/spark/MapOutputTracker.scala (4 lines):
- line 260: // TODO support updateMergeResult for similar use cases as updateMapOutput
- line 1675: // TODO: SPARK-35036: Instead of reading map blocks in case of AQE with Push based shuffle,
- line 1676: // TODO: improve push based shuffle to read partial merged blocks satisfying the start/end
- line 1677: // TODO: map indexes
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (4 lines):
- line 143: // TODO: move the rest of the table commands from ddl.scala to this file
- line 1184: // TODO: [SPARK-28692] unify this after we unify the
- line 1232: // TODO: some Hive fileformat + row serde might be mapped to Spark data source, e.g. CSV.
- line 1248: // TODO: should we keep Hive serde properties?
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala (4 lines):
- line 361: // TODO: Variable cleanup (once we add SQL script execution logic).
- line 362: // TODO: Add interpreter tests as well.
- line 380: // TODO: Variable cleanup (once we add SQL script execution logic).
- line 381: // TODO: Add interpreter tests as well.
mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala (4 lines):
- line 112: * TODO: Future extensions: The following functionality is planned for the future:
- line 186: * TODO: Track which features are known to be continuous already; do not update counts for them.
- line 256: // TODO: This might be able to handle 0's more efficiently.
- line 357: // TODO: Check more carefully about whether this whole class will be included in a closure.
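For context on the VectorIndexer TODOs just above, here is a minimal, self-contained usage sketch of the public API they refer to (the toy data and the maxCategories threshold are arbitrary choices for illustration, not taken from the source):

```scala
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object VectorIndexerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("vector-indexer-sketch").getOrCreate()

    // Toy data: feature 0 has three distinct values, feature 1 has only two.
    val data = Seq(
      Vectors.dense(0.1, 1.0),
      Vectors.dense(0.4, 0.0),
      Vectors.dense(0.7, 1.0)
    )
    val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

    // Features with at most maxCategories distinct values are treated as categorical;
    // the rest are left as continuous.
    val indexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexed")
      .setMaxCategories(2)

    val model = indexer.fit(df)
    println(s"categorical feature indices: ${model.categoryMaps.keys.mkString(", ")}")
    model.transform(df).show(truncate = false)

    spark.stop()
  }
}
```

With maxCategories = 2, only feature 1 is indexed as categorical; the TODO at line 186 is about letting the indexer skip counting distinct values for features already known to be continuous.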
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (4 lines):
- line 1254: // TODO For v2 commands, we will cast the string back to its actual value,
- line 1621: // TODO we should use the visitRowFormatDelimited function here. However HiveScriptIOSchema
- line 4660: // TODO we need proper support for the NULL format.
- line 5972: // TODO a partition spec is allowed to have optional values. This is currently violated.
mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala (4 lines):
- line 106: // TODO: support unbounded pattern length when maxPatternLength = 0
- line 357: // TODO: We collect projected postfixes into memory. We should also compare the performance
- line 358: // TODO: of keeping them on shuffle files.
- line 463: // TODO: use PrimitiveKeyOpenHashMap
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala (4 lines):
- line 914: // TODO: manage the memory used here, and spill it into disk in case of OOM.
- line 928: // TODO (SPARK-36284): Add shuffle checksum support for push-based shuffle
- line 966: // TODO: release the buf here to free memory earlier
- line 1153: // TODO SPARK-36284 Add shuffle checksum support for push-based shuffle
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (3 lines):
- line 259: // TODO (SPARK-27484): handle streaming write commands when we have them.
- line 845: // TODO: although map type is not orderable, technically map type should be able to be
- line 854: // TODO: Remove this type check once we support Variant ordering
python/pyspark/pandas/typedef/typehints.py (3 lines):
- line 191: # TODO: considering the precision & scale for decimal type.
- line 652: # TODO: once pandas exposes a typing module like numpy.typing, we should deprecate
- line 675: # TODO: Remove this variadic-generic hack by tuple once ww drop Python up to 3.9.
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala (3 lines):
- line 156: // TODO: Move to `DistributedPlan`
- line 163: def outputPartitioning: Partitioning = UnknownPartitioning(0) // TODO: WRONG WIDTH!
- line 511: // TODO: refactor and reuse the code from RDD's take()
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/OperatorStateMetadata.scala (3 lines):
- line 416: // TODO: [SPARK-50845]: Currently, deleteSchemaFiles is a no-op since earliestBatchIdKept
- line 429: // TODO: [SPARK-50845]: Currently, deleteSchemaFiles is a no-op since thresholdBatchId
- line 479: // TODO: [SPARK-50845]: Return earliest schema file we need after implementing
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (3 lines):
- line 264: // TODO: do not reorder consecutive `Add`s or `Multiply`s with different `failOnError` flags
- line 325: // TODO: `EqualTo` for structural types are not working. Until SPARK-24443 is addressed,
- line 326: // TODO: we exclude them in this rule.
core/src/main/scala/org/apache/spark/deploy/master/Master.scala (3 lines):
- line 292: // TODO Prevent repeated registrations from some driver
- line 423: // TODO: It might be good to instead have the submission client poll the master to determine
- line 451: // TODO: It would be nice for this to be a synchronous response
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala (3 lines):
- line 424: // TODO: we should follow hive to roll back if one partition path failed to create.
- line 469: // TODO: we should follow hive to roll back if one partition path failed to delete, and support
- line 502: // TODO: we should follow hive to roll back if one partition path failed to rename.
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (3 lines):
- line 190: // TODO: operator pushdown.
- line 258: // TODO (SPARK-27484): we should add the writing node before the plan is analyzed.
- line 934: // TODO validate baseStateStoreCkptId
core/src/main/scala/org/apache/spark/storage/BlockManager.scala (3 lines):
- line 666: // TODO: We might need to rate limit re-registering.
- line 1420: // TODO: need a better way to handle blocks with indeterminate/unordered results, replicas
- line 2032: // TODO: Avoid a linear scan by creating another mapping of RDD.id to blocks.
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (3 lines):
- line 76: // TODO: Implement SPARK-11543?
- line 550: // TODO: for Multinomial logistic regression, take numClasses into account
- line 1048: // TODO: get numClasses and numFeatures together from dataset
python/pyspark/pandas/accessors.py (3 lines):
- line 328: # TODO: codes here partially duplicate `DataFrame.apply`. Can we deduplicate?
- line 626: # TODO: Index will be lost in this case.
- line 913: # TODO: In this case, it avoids the shortcut for now (but only infers schema)
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala (3 lines):
- line 100: // TODO: use the actual number of slots for standalone mode.
- line 174: // TODO: We should kill any running task attempts when the task set manager becomes a zombie.
- line 1302: // TODO: Threshold should also look at standard deviation of task durations and have a lower
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala (3 lines):
- line 206: // TODO: it's difficult to support string operators without advanced statistics.
- line 285: // TODO: It is difficult to support other binary comparisons for String/Binary
- line 670: // TODO: It is difficult to support other binary comparisons for String/Binary
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (3 lines):
- line 546: // TODO: Implement a more accurate method for checking whether a decimal value can be cast
- line 1089: // TODO: Could be faster?
- line 1117: // TODO: Could be faster?
core/src/main/scala/org/apache/spark/api/r/BaseRRunner.scala (3 lines):
- line 67: // TODO: optimize it to use one socket
- line 184: // TODO: Pass a byte array from R to avoid this cast ?
- line 207: // TODO: We should propagate this error to the task thread
python/pyspark/sql/pandas/types.py (3 lines):
- line 566: # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
- line 600: # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
- line 670: # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
sql/connect/common/src/main/protobuf/spark/connect/commands.proto (3 lines):
- line 269: // TODO: How do we indicate errors?
- line 270: // TODO: Consider adding status, last progress etc here.
- line 316: // TODO: Consider reusing Explain from AnalyzePlanRequest message.
mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala (3 lines):
- line 211: * TODO: See if we can deprecate `intercept` in `GeneralizedLinearModel`, and always
- line 281: // TODO: Apply feature scaling to the weight vector instead of input data.
- line 298: * TODO: For better convergence, in logistic regression, the intercepts should be computed
core/src/main/scala/org/apache/spark/rdd/RDD.scala (3 lines):
- line 167: // TODO: Handle changes of StorageLevel
- line 1947: // TODO We can collect all the RDDs that needs to be checkpointed, and then checkpoint
- line 2093: // TODO: this can be per-partition. e.g. UnionRDD can have different deterministic level for
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala (3 lines):
- line 183: // TODO: we can avoid boxing if future version of Protobuf provide primitive accessors.
- line 407: // TODO revisit validation of protobuf-catalyst fields.
- line 470: // TODO: All of the code below this line is same between protobuf and avro, it can be shared.
python/pyspark/pandas/indexes/multi.py (3 lines):
- line 690: # TODO: We might need to handle internal state change.
- line 704: # TODO: add 'name' parameter after pd.MultiIndex.name is implemented
- line 834: # TODO: ADD error parameter
mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala (2 lines):
- line 29: // TODO: switch to MLlib BLAS interface
- line 41: // TODO: add code if matrices isTranspose!!!
core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala (2 lines):
- line 37: * TODO: This is marked as sharable to get a handle to RBackend. Is it safe to re-use
- line 265: // TODO: find best method in matching methods.
core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala (2 lines):
- line 99: // TODO: clean up this clunky method signature
- line 112: // TODO: simplify this to limit each task to its own slot
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala (2 lines):
- line 83: // TODO: To reduce code diff of SPARK-29665, we create stub implementations for file source v2, so
- line 101: // TODO: implement a light-weight partition inference which only looks at the path of one leaf
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala (2 lines):
- line 134: // TODO: things can go wrong if the common expression is nondeterministic. We
- line 137: // TODO: we should calculate the ref count and also inline the common expression
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala (2 lines):
- line 128: // TODO: Generalize statistics collection.
- line 129: // TODO: Why fs.getContentSummary returns wrong size on Jenkins?
sql/core/src/main/scala/org/apache/spark/sql/classic/DataStreamWriter.scala (2 lines):
- line 150: // TODO (SPARK-27484): we should add the writing node before the plan is analyzed.
- line 168: * TODO (SPARK-33638): Full support of v2 table creation
core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java (2 lines):
- line 436: // TODO: try to find space on previous pages
- line 757: * TODO: support forced spilling
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala (2 lines):
- line 1078: // TODO: although map type is not orderable, technically map type should be able to be used
- line 1123: // TODO: although map type is not orderable, technically map type should be able to be used
mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala (2 lines):
- line 70: // TODO: Output vectors of dimension numHashFunctions in SPARK-18450
- line 114: // TODO: This hashDistance function requires more discussion in SPARK-18454
core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java (2 lines):
- line 113: // TODO: we're wasting 32 bits of space here; we can probably store fewer bits of the hashcode
- line 353: // TODO: use existing ShuffleWriteMetrics
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala (2 lines):
- line 321: // TODO: add support for multiple col families with HDFSBackedStateStoreProvider
- line 422: // TODO: The validation should be moved to a higher level so that it works for all state store
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala (2 lines):
- line 238: // TODO: find a fast and stable way for sparse data.
- line 362: // TODO: The conditions below are not fully tested.
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala (2 lines):
- line 69: // TODO: We will need to prune bad plans when we improve plan space exploration
- line 101: // TODO: Decouple final output schema from expression evaluation so this copy can be
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileWrite.scala (2 lines):
- line 101: // TODO: [SPARK-36340] Unify check schema filed of DataSource V2 Insert.
- line 131: // TODO: after partitioning is supported in V2:
python/pyspark/sql/connect/protobuf/functions.py (2 lines):
- line 50: # TODO: simplify the code when _invoke_function() supports None as input.
- line 89: # TODO: simplify the code when _invoke_function() supports None as input.
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala (2 lines):
- line 432: // TODO: Use ByteBuffer to optimize
- line 437: // TODO: Use ByteBuffer to optimize
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (2 lines):
- line 556: // TODO SPARK-24819 If the job requires more slots than available (both busy and free
- line 730: // TODO SPARK-24823 Cancel a job that contains barrier stage(s) if the barrier tasks don't get
sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala (2 lines):
- line 1259: // TODO handle the metadata?
- line 1642: // TODO: streaming could be adapted to use this interface
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala (2 lines):
- line 91: // TODO: operator pushdown.
- line 105: // TODO (SPARK-27484): we should add the writing node before the plan is analyzed.
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CommitLog.scala (2 lines):
- line 63: // TODO [SPARK-49462] This validation should be relaxed for a stateless query.
- line 64: // TODO [SPARK-50653] This validation should be relaxed to support reading
python/pyspark/sql/connect/dataframe.py (2 lines):
- line 963: # TODO: reuse error handling code in sql.DataFrame.withWatermark()
- line 1732: # TODO: revisit classic Spark's Dataset.col
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExtendedExec.scala (2 lines):
- line 48: // TODO We need a new listTable overload that takes a pattern string.
- line 183: // TODO "Created Time", "Last Access", "Partition Statistics"
core/src/main/scala/org/apache/spark/executor/ExecutorExitCode.scala (2 lines):
- line 62: // TODO: replace external block store with concrete implementation name
- line 64: // TODO: replace external block store with concrete implementation name
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingLocalVariableManager.scala (2 lines):
- line 40: // TODO [SPARK-50785]: Refactor ForStatementExec to use local variables properly.
- line 67: // TODO: Update logic and comments once stored procedures are introduced.
resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala (2 lines):
- line 180: // TODO (SPARK-33481) This is a naive way of calculating numMergersDesired for a stage,
- line 181: // TODO we can use better heuristics to calculate numMergersDesired for a stage.
python/pyspark/ml/torch/distributor.py (2 lines):
- line 429: "--rdzv_id=0", # TODO: setup random ID that is gleaned from env variables
- line 920: # TODO: need to do this in a safe way to avoid issues during concurrent runs
mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala (2 lines):
- line 29: // TODO: Add aggregate stats (once available). This will happen after we move the DecisionTree
- line 92: // TODO: Once the implementation has been moved to this API, then include sufficient
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverKubernetesCredentialsFeatureStep.scala (2 lines):
- line 34: // TODO clean up this class, and credentials in general. See also SparkKubernetesClientFactory.
- line 63: // TODO decide whether or not to apply this step entirely in the caller, i.e. the builder.
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala (2 lines):
- line 387: // TODO SPARK-24528 Sort order is currently ignored if buckets are coalesced.
- line 389: // TODO Currently Spark does not support writing columns sorting in descending order
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala (2 lines):
- line 36: // TODO: we should have 2 RDDs: an RDD[InternalRow] for row-based scan, an `RDD[ColumnarBatch]` for
- line 78: // TODO: SPARK-25083 remove the type erasure hack in data source scan
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala (2 lines):
- line 121: // TODO: implement hive compatibility as rules.
- line 229: tmpLocation.toString, // TODO: URI
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala (2 lines):
- line 191: // TODO: Add zero/seqOp/combOp option to aggregateMessages. (SPARK-5438)
- line 638: // TODO: Keep more values in log space, and only exponentiate when needed.
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java (2 lines):
- line 258: // TODO: set real configuration map
- line 275: // TODO: We don't do anything for now, just log this for debugging.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (2 lines):
- line 324: // TODO: support nullCount updates for specific outer joins
- line 344: // TODO: It's error-prone to estimate cardinalities for LeftSemi and LeftAnti based on basic
core/src/main/scala/org/apache/spark/util/Utils.scala (2 lines):
- line 2566: // TODO: [SPARK-36744] needs to support IO encryption for push-based shuffle
- line 3253: // FIXME: We copy the stream on the level of bytes to avoid encoding problems.
sql/core/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala (2 lines):
- line 142: // TODO: we can avoid boxing if future version of avro provide primitive accessors.
- line 408: // TODO: move the following method in Decimal object on creating Decimal from BigDecimal?
core/src/main/scala/org/apache/spark/util/SizeEstimator.scala (2 lines):
- line 86: // TODO: Is this arch dependent ?
- line 151: // TODO: We could use reflection on the VMOption returned ?
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java (2 lines):
- line 212: // TODO: [SPARK-33236] Change the message when this service is able to handle NM restart
- line 775: // TODO we may use a new exception class to include the finalizeShuffleMerge
sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala (2 lines):
- line 57: // TODO: Instead of returning a default value here, find a way to return a meaningful size
- line 152: // TODO: Instead of returning a default value here, find a way to return a meaningful size
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (2 lines):
- line 69: // TODO: Move the planner an optimizer into here from SessionState.
- line 579: // TODO: We use next(), i.e. take the first plan returned by the planner, here for now,
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala (2 lines):
- line 368: // TODO: support change column name/dataType/metadata/position.
- line 801: // TODO: Validate the value
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/LogicalQueryStage.scala (2 lines):
- line 36: // TODO we can potentially include only [[QueryStageExec]] in this class if we make the aggregation
- line 56: // TODO this is not accurate when there is other physical nodes above QueryStageExec.
core/src/main/scala/org/apache/spark/util/JsonProtocol.scala (2 lines):
- line 67: // TODO: Remove this file and put JSON serialization into each individual class.
- line 1361: // TODO: Drop the redundant "Shuffle" since it's inconsistent with related classes.
graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala (2 lines):
- line 88: // TODO: unpersist the replicated vertices in `replicatedVertexView` but leave the edges alone
- line 221: // TODO: Because we only have a clustered index on the source vertex ID, we can't filter
python/pyspark/pandas/plot/matplotlib.py (2 lines):
- line 423: # TODO: this logic is similar to KdePlot. Might have to deduplicate it.
- line 965: # TODO: check if value of y is plottable
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java (2 lines):
- line 99: // FIXME extract the right info type
- line 403: // TODO: set the correct default fetch size
python/pyspark/ml/connect/tuning.py (2 lines):
- line 418: # TODO: support pandas dataframe fitting
- line 481: # TODO:
sql/core/src/main/scala/org/apache/spark/sql/classic/SparkSession.scala (2 lines):
- line 306: // TODO: use MutableProjection when rowRDD is another DataFrame and the applied
- line 405: // TODO: use MutableProjection when rowRDD is another DataFrame and the applied
mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala (2 lines):
- line 560: * TODO: Consider adding check for correct class name.
- line 811: // TODO: Revert back to the original content if save is not successful.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (2 lines):
- line 1044: // TODO: Pruning `UnionLoop`s needs to take into account both the outer `Project` and the inner
- line 1962: // TODO: non-deterministic predicates could be pushed through some operators that do not change
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala (2 lines):
- line 38: // TODO: all commands should look up table from the current catalog. The `SessionCatalog` doesn't
- line 49: // TODO: create a real SYSTEM catalog to host `TempVariableManager` under the SESSION namespace.
python/pyspark/sql/types.py (2 lines):
- line 288: # TODO: do this properly like on the scala side
- line 2463: # TODO: type cast (such as int -> long)
mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala (2 lines):
- line 90: // TODO: Output vectors of dimension numHashFunctions in SPARK-18450
- line 215: // TODO: Save using the existing format of Array[Vector] once SPARK-12878 is resolved.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala (2 lines):
- line 137: // TODO: support whole stage codegen
- line 295: // TODO: consider large decimal and interval type
mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala (2 lines):
- line 107: // TODO: Fix the MultiProbe NN Search in SPARK-18454
- line 277: // TODO: Remove recreateCol logic once SPARK-17154 is resolved.
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/HiveSQLException.java (2 lines):
- line 104: // TODO: set correct vendorCode field
- line 116: // TODO: convert sqlState, etc.
core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala (2 lines):
- line 237: * TODO: Don't use a global map; these should be tied to a SparkContext (SPARK-13051).
- line 303: * TODO: Eventually if this spreads out to more values then using
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java (2 lines):
- line 58: * TODO: decimal requiring more than 8 bytes, INT96. Schema mismatch.
- line 63: * TODO: make this always return ColumnarBatches.
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionContext.scala (2 lines):
- line 70: // TODO: After introducing stored procedures, we need to handle the case with multiple
- line 137: // TODO: Introduce a separate class for different frame types (Script, Stored Procedure,
mllib/src/main/scala/org/apache/spark/ml/regression/Regressor.scala (2 lines):
- line 36: // TODO: defaultEvaluator (follow-up PR)
- line 48: // TODO: defaultEvaluator (follow-up PR)
mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala (2 lines):
- line 313: // TODO: predicted labels are +1 or -1 for GBT. Need a better way to store this info.
- line 402: // TODO: Fix this issue for real.
connector/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala (2 lines):
- line 69: // TODO is it necessary to have separate configs for initial poll time vs ongoing poll time?
- line 164: // TODO what about hosts specified by ip vs name
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala (2 lines):
- line 370: * TODO: Make the use of zero matrices more storage efficient.
- line 541: // TODO: Try to use aggregateByKey instead of reduceByKey to get rid of intermediate matrices
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (2 lines):
- line 46: // TODO: We should tighten up visibility of the classes here once we clean up Hive coupling.
- line 176: // TODO: Selective case sensitivity.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/V2ExpressionBuilder.scala (2 lines):
- line 309: // TODO supports other expressions
- line 390: // TODO supports other aggregate functions
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala (2 lines):
- line 203: // TODO: support accumulator in multiple UDF
- line 665: // TODO: This has a race condition if interruption occurs, as completed may still become true.
mllib/src/main/scala/org/apache/spark/ml/param/params.scala (2 lines):
- line 1026: // TODO: Provide a better method name for Java users.
- line 1035: // TODO: Provide a better method name for Java users.
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala (2 lines):
- line 168: // TODO: Restore this from the checkpoint when possible.
- line 616: * TODO: We do not seem to clear up all values in StateOperatorProgress which are bound to the
core/src/main/scala/org/apache/spark/util/random/StratifiedSamplingUtils.scala (2 lines):
- line 97: // TODO change this to the streaming version
- line 241: // TODO implement the streaming version of sampling w/ replacement that doesn't require counts
core/src/main/resources/org/apache/spark/ui/static/dataTables.rowsGroup.js (2 lines):
- line 212: TODO: Provide function which determines the all s and s with "rowspan" html-attribute is parent (groupped) for the specified or . To use in selections, editing or hover styles.
- line 214: TODO: Feature
sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala (2 lines):
- line 152: // TODO this check is based on assumptions of callers' behavior but is sufficient for now.
- line 172: // TODO order by partition size.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala (2 lines):
- line 364: * TODO: Validate somewhere (in debug mode?) that children are ordered correctly.
- line 1173: // TODO: currently if the class name ends with "$", we think it's a scala object, there is
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (2 lines):
- line 345: * TODO: Remove this. It is used because CreateTempViewUsing is not a Catalyst plan.
- line 984: // TODO: what does this message mean?
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java (2 lines):
- line 688: // TODO: Add more collation-aware string expressions.
- line 717: // TODO: Add other collation-aware expressions.
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala (2 lines):
- line 409: // TODO: Make standardizeFeatures and standardizeLabel configurable.
- line 528: // TODO: Make standardizeFeatures and standardizeLabel configurable.
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/StreamingForeachBatchHelper.scala (2 lines):
- line 71: // TODO: Add query id to the log.
- line 191: // TODO: Better handling (e.g. retries) on exceptions like EOFException to avoid
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java (1 line):
- line 28: * TODO: merge this into parquet-mr.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala (1 line):
- line 212: // TODO: create a generic representation for views, after we add view support to v2 catalog. For now
python/pyspark/pandas/supported_api_gen.py (1 line):
- line 395: # TODO: Take into account that this function can create links incorrectly
sql/core/src/main/scala/org/apache/spark/sql/execution/python/streaming/TransformWithStateInPySparkExec.scala (1 line):
- line 213: // TODO SPARK-50180: check if we can return true only if actual timers are registered,
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala (1 line):
- line 103: // TODO: Sparse representation might be ineffective if (newSize ~= newValues.size)
build/sbt-launch-lib.bash (1 line):
- line 7: # TODO - Should we merge the main SBT script with this library?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala (1 line):
- line 229: // TODO: support whole stage codegen too
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala (1 line):
- line 361: // TODO: move these into BytesToBytesMap
mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDs.scala (1 line):
- line 439: // TODO Generate RDD[Vector] from multivariate distributions.
core/src/main/scala/org/apache/spark/util/collection/SortDataFormat.scala (1 line):
- line 36: // TODO: Making Buffer a real trait would be a better abstraction, but adds some complexity.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala (1 line):
- line 500: // TODO: Optimise this logic.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushDownLeftSemiAntiJoin.scala (1 line):
- line 183: * TODO:
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala (1 line):
- line 37: * TODO: Merge this file with [[org.apache.spark.ml.util.SchemaUtils]].
python/pyspark/ml/param/_shared_params_code_gen.py (1 line):
- line 93: # TODO: How to correctly inherit instance attributes?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/DistinctKeyVisitor.scala (1 line):
- line 33: // TODO: Expand distinctKeys for redundant aliases on the same expression
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala (1 line):
- line 153: // TODO: Make CollectSet collation aware
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala (1 line):
- line 83: // TODO: ideally Aggregate should also be handled here, but its grouping expressions are
project/SparkBuild.scala (1 line):
- line 483: // TODO: move this to its upstream project.
common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java (1 line):
- line 131: // TODO: the strategy of `CountMinSketch` looks more advanced, should we follow it here?
connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala (1 line):
- line 111: // TODO (SPARK-37973) Directly call super.getDefaultReadLimit when scala issue 12523 is fixed
sql/connect/common/src/main/scala/org/apache/spark/sql/connect/Dataset.scala (1 line):
- line 732: // TODO we need to have a proper way of stabilizing the input data. The current approach does
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (1 line):
- line 1015: // TODO: Now, always set environmentContext to null. In the future, we should avoid setting
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala (1 line):
- line 82: // TODO: defaultEvaluator (follow-up PR)
mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala (1 line):
- line 355: // TODO: consolidate aggregates for summary statistics
python/pyspark/ml/util.py (1 line):
- line 1109: # TODO: We need to handle `RFormulaModel.pipelineModel` here after Pyspark RFormulaModel
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala (1 line):
- line 482: // TODO: reuse the key buffer and value buffer across records.
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala (1 line):
- line 268: // TODO: When visible is false(the task had failed), we should be asking the block managers to
core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala (1 line):
- line 96: // TODO: a non-blocking TransportClientFactory.createClient in future
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java (1 line):
- line 286: // TODO: consider pushing this in ColumnVector by adding a readBytes with a stride.
core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriterUtils.scala (1 line):
- line 109: // TODO: these don't seem like the right abstractions.
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala (1 line):
- line 71: // TODO ideally, we should get the output data ready first and then
mllib/src/main/scala/org/apache/spark/ml/Predictor.scala (1 line):
- line 49: // TODO: Support casting Array[Double] and Array[Float] to Vector when FeaturesType = Vector
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala (1 line):
- line 174: // TODO: to reuse the existing partition parameters for those partition specific options
core/src/main/scala/org/apache/spark/ExecutorAllocationClient.scala (1 line):
- line 116: * (TODO: add a new type like `ExecutorDecommissionInfo` for the
core/src/main/resources/org/apache/spark/ui/static/utils.js (1 line):
- line 279: // TODO: Reused stacktrace-details* style for convenience, but it's not really a stacktrace
python/pyspark/ml/functions.py (1 line):
- line 774: # TODO: adjust return type hint when Iterator[Union[pd.Series, pd.DataFrame]] is supported
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala (1 line):
- line 358: // TODO (hvanhovell) this is can be improved.
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java (1 line):
- line 67: * TODO: move this to the parquet-mr project. There are performance benefits of doing it
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeShuffleWithLocalRead.scala (1 line):
- line 67: // TODO: this method assumes all shuffle blocks are the same data size. We should calculate the
mllib/src/main/scala/org/apache/spark/mllib/linalg/EigenValueDecomposition.scala (1 line):
- line 50: // TODO: remove this function and use eigs in breeze when switching breeze version
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala (1 line):
- line 272: // TODO: Make this work for Stream-Stream joins, where we use multiple
python/pyspark/ml/torch/log_communication.py (1 line):
- line 182: # TODO:
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala (1 line):
- line 203: // TODO: it would be nicer to just make sure there are no null commands here
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java (1 line):
- line 1196: return 0; // TODO: Fix this behaviour (SPARK-48284)
python/pyspark/worker.py (1 line):
- line 2381: # TODO: Remove the following two lines and use `Process.pid()` when we drop JDK 8.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSetVariable.scala (1 line):
- line 55: // TODO: we need to group by the qualified variable name once other catalogs support it.
core/src/main/scala/org/apache/spark/deploy/ClientArguments.scala (1 line):
- line 104: // TODO: It wouldn't be too hard to allow users to submit their app and dependency jars
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCWriteBuilder.scala (1 line):
- line 39: // TODO (SPARK-32595): do truncate and append atomically.
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala (1 line):
- line 64: // TODO: Finish input output types.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala (1 line):
- line 816: // TODO (SPARK-26174): disallow it with a config.
core/src/main/scala/org/apache/spark/SparkContext.scala (1 line):
- line 2395: // TODO: Cache.stop()?
core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala (1 line):
- line 54: // TODO We cannot force the return type of `anyToWritable` be same as keyWritableClass and
mllib/src/main/scala/org/apache/spark/ml/r/AFTSurvivalRegressionWrapper.scala (1 line):
- line 70: // TODO: Support dot operator.
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java (1 line):
- line 407: // TODO: try to find space in previous pages
core/src/main/scala/org/apache/spark/Dependency.scala (1 line):
- line 183: // TODO: SPARK-35547: Push based shuffle is currently unsupported for Barrier stages
python/pyspark/sql/connect/udf.py (1 line):
- line 165: # TODO: PythonEvalType.SQL_BATCHED_UDF
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala (1 line):
- line 804: // TODO: Merge this and `NamedLambdaVariable`.
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClient.scala (1 line):
- line 259: loadPath: String, // TODO URI
python/pyspark/ml/classification.py (1 line):
- line 3864: # TODO: need to set metadata
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpCLIService.java (1 line):
- line 140: // TODO: check defaults: maxTimeout, keepalive, maxBodySize, bodyReceiveDuration, etc.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TableCapabilityCheck.scala (1 line):
- line 47: // TODO: check STREAMING_WRITE capability. It's not doable now because we don't have a
python/pyspark/sql/connect/client/artifact.py (1 line):
- line 248: # TODO: Support directory path.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DeduplicateRelations.scala (1 line):
- line 256: // TODO (SPARK-44754): we should handle all special cases here.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala (1 line):
- line 211: // TODO: SPARK-48701: PandasMode (all collations)
python/pyspark/ml/tuning.py (1 line):
- line 115: # TODO: duplicate evaluator to take extra params from input
core/src/main/resources/org/apache/spark/ui/static/historypage.js (1 line):
- line 142: // TODO: Replace hasOwnProperty with prototype.hasOwnProperty after we find it's safe to do.
sql/core/src/main/scala/org/apache/spark/sql/execution/InsertSortForLimitAndOffset.scala (1 line):
- line 36: * TODO: add a order preserving mode in the shuffle reader.
python/pyspark/sql/connect/proto/commands_pb2.pyi (1 line):
- line 1199: """TODO: Consider reusing Explain from AnalyzePlanRequest message.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala (1 line):
- line 490: * TODO: We could make nullability more precise in foldable cases (e.g., literal input).
sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala (1 line):
- line 145: // TODO: introduce a user defined type for serialized R data.
core/src/main/scala/org/apache/spark/status/AppStatusListener.scala (1 line):
- line 645: // TODO: can this really happen?
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/RowBasedHashMapGenerator.scala (1 line):
- line 53: // TODO: consider large decimal and interval type
core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala (1 line):
- line 132: // TODO: Make it tighter.
python/pyspark/sql/connect/client/core.py (1 line):
- line 1985: # TODO: Fix the code: change thread-local `ml_caches` to global `ml_caches`.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala (1 line):
- line 286: // TODO: this condition only guarantees to keep the shape after the plan has
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/EventTimeWatermark.scala (1 line):
- line 84: // TODO: Disallow updating the metadata once we remove the compatibility flag.
sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala (1 line):
- line 100: // TODO check if this works.
sql/api/src/main/java/org/apache/spark/sql/types/SQLUserDefinedType.java (1 line):
- line 29: // TODO: Should I used @Documented ?
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java (1 line):
- line 419: // TODO: best perf?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala (1 line):
- line 145: // TODO: implement support for decimal/datetime/interval types
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (1 line):
- line 347: // TODO: Shall use TIMESTAMPLOCALTZ_TYPE, keep AS-IS now for
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala (1 line):
- line 172: // TODO (SPARK-32593): JDBC support nested column and nested column pruning.
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (1 line):
- line 375: // TODO: Handle BroadcastPartitioning.
core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala (1 line):
- line 165: // TODO: differentiate between the intention to cache an RDD and whether it's actually cached
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala (1 line):
- line 658: // TODO: we should do this check earlier when we have capability API.
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala (1 line):
- line 181: // TODO: match more plan nodes here.
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala (1 line):
- line 338: // TODO: this is not right for DecimalType with precision > 18
mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala (1 line):
- line 303: // TODO: When we add a generic Boosting class, handle transform there? SPARK-7129
core/src/main/scala/org/apache/spark/rdd/RDDBarrier.scala (1 line):
- line 91: // TODO: [SPARK-25247] add extra conf to RDDBarrier, e.g., timeout.
mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala (1 line):
- line 32: // TODO: Move the utility methods to SQL.
sql/core/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala (1 line):
- line 97: // TODO Removes this check once `FileFormat` gets a general file filtering interface method.
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/HadoopCompressionCodec.java (1 line):
- line 44: // TODO supports ZStandardCodec
python/run-tests.py (1 line):
- line 64: # TODO: revisit for Scala 2.13
core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala (1 line):
- line 116: // TODO: SPARK-48789: the naming is confusing since this does not really reflect the whole
connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala (1 line):
- line 146: // TODO (SPARK-37973) Directly call super.getDefaultReadLimit when scala issue 12523 is fixed
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala (1 line):
- line 38: // TODO: move SchemaPruning into catalyst
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala (1 line):
- line 35: * TODO: This can be made generic to generate any type of random distribution, or any type of
mllib/src/main/scala/org/apache/spark/ml/tree/impl/BaggedPoint.scala (1 line):
- line 66: // TODO: implement weighted bootstrapping
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala (1 line):
- line 64: // TODO remove once covariance numerical issue resolved.
mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala (1 line):
- line 438: * TODO: Change to always do bootstrapping (simpler). SPARK-7130
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala (1 line):
- line 146: // TODO: remove operators from this list as support for avro encoding is added
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala (1 line):
- line 88: // TODO revisit this logic:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StateTypesEncoderUtils.scala (1 line):
- line 109: // TODO: validate places that are trying to encode the key and check if we can eliminate/
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala (1 line):
- line 1077: * TODO: remove this after we completely make hive as a data source.
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala (1 line):
- line 52: // TODO: computation of statistics may take seconds, so save it to KMeansModel in training
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala (1 line):
- line 86: }.asInstanceOf[A] // Kind of a hack, but safe. TODO: Tighten return type when possible.
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala (1 line):
- line 127: * TODO: If the OOM is not recoverable by rescheduling it on different node, then do
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala (1 line):
- line 728: // TODO: support BooleanType, DateType and TimestampType
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala (1 line):
- line 146: // TODO: Non-catalog paths for DSV2 are currently not well defined.
python/pyspark/pandas/sql_processor.py (1 line):
- line 296: # TODO: use a string builder
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala (1 line):
- line 112: // TODO: merge these 2 implementations once we refactor the `AggregateFunction` interface.
python/pyspark/sql/session.py (1 line): - line 1580: # TODO: Apply the logic below when self._jconf.arrowPySparkEnabled() is True
python/pyspark/sql/connect/session.py (1 line): - line 709: # TODO: Beside the validation on number of columns, we should also check
mllib/src/main/scala/org/apache/spark/ml/tree/impl/DTStatsAggregator.scala (1 line): - line 169: // TODO: Test BLAS.axpy
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala (1 line): - line 295: // TODO: with dynamic allocation off, handle edge cases if we end up with more running
core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala (1 line): - line 329: // TODO: Only track used files (SPARK-31974)
core/src/main/scala/org/apache/spark/deploy/client/StandaloneAppClient.scala (1 line): - line 163: // FIXME How to handle the following cases?
sql/api/src/main/scala/org/apache/spark/sql/Row.scala (1 line): - line 79: // TODO: Improve the performance of this if used in performance critical part.
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala (1 line): - line 186: // TODO: stop combining if we find that the reduction factor isn't high
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala (1 line): - line 338: * TODO: SPARK-20443 - expose blockSize as a param?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala (1 line): - line 806: // TODO (SPARK-44225): Move this into analyzer
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala (1 line): - line 70: * TODO: we should just have different traits for the different formats.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala (1 line): - line 894: // TODO: support whole stage codegen
python/pyspark/ml/torch/data.py (1 line): - line 91: # TODO: we can optimize this further by directly extracting
python/pyspark/ml/feature.py (1 line): - line 5854: TODO: Future extensions: The following functionality is planned for the future:
python/pyspark/pandas/_typing.py (1 line): - line 44: # TODO: use the actual type parameters.  [see the sketch below]
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/StreamingQueryListenerHelper.scala (1 line): - line 85: // TODO: Reuse the same method in StreamingForeachBatchHelper to avoid duplication.
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/connector/SupportsStreamingUpdateAsAppend.scala (1 line): - line 30: // TODO: design an official API for streaming output mode UPDATE which can do the upsert
core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala (1 line): - line 65: // TODO: We should consider increasing the number of this parameter over time
mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala (1 line): - line 327: // TODO: Implement this method.
core/src/main/scala/org/apache/spark/rdd/ShuffledRDD.scala (1 line): - line 39: // TODO: Make this return RDD[Product2[K, C]] or have some way to configure mutable pairs
core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala (1 line): - line 37: * TODO: Cache the hash values of each key? java.util.HashMap does that.
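The "use the actual type parameters" note in python/pyspark/pandas/_typing.py above points at the standard `typing` machinery for parameterized types. A minimal, generic sketch of declaring and using a real type parameter (illustrative only; the `Labeled` class is hypothetical and not a pandas-on-Spark type):

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class Labeled(Generic[T]):
    """A container with a real type parameter (hypothetical example)."""

    def __init__(self, label: str, value: T) -> None:
        self.label = label
        self.value = value

    def get(self) -> T:
        return self.value


# Type checkers now distinguish Labeled[int] from Labeled[str].
count: Labeled[int] = Labeled("rows", 42)
table: Labeled[str] = Labeled("name", "sales")
print(count.get() + 1, table.get().upper())  # 43 SALES
```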
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala (1 line): - line 333: // TODO: use a numerically stable approach to estimate cost
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (1 line): - line 65: // TODO: Currently, task to container is computed once (TaskSetManager) - which need not be
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala (1 line): - line 69: // TODO: improve error message for java bean encoder.
python/pyspark/cloudpickle/cloudpickle.py (1 line): - line 1342: # TODO: decorrelate reducer_override (which is tied to CPython's
core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (1 line): - line 611: // TODO: Support distributing R packages with standalone cluster
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSource.scala (1 line): - line 95: // SPARK-51779 TODO: Support stream-stream joins with virtual column families
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SerializerBuildHelper.scala (1 line): - line 483: // TODO replace this with `createSerializerForPrimitiveArray` as
core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala (1 line): - line 51: // TODO: are we sure we need to use a global lock in the following methods?
connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/InternalKafkaConsumerPool.scala (1 line): - line 132: // TODO: revisit the relation between CacheKey and kafkaParams - for now it looks a bit weird
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/OperationType.java (1 line): - line 46: // TODO: replace this with a Map?
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala (1 line): - line 375: // TODO: We can sort these blocks based on some policy (LRU/blockSize etc)
sql/core/src/main/scala/org/apache/spark/sql/classic/SQLContext.scala (1 line): - line 75: // TODO: move this logic into SparkSession
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala (1 line): - line 25: // TODO: Add code and support for ensuring that yarn resource 'tasks' are location aware !
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEPropagateEmptyRelation.scala (1 line): - line 91: // TODO: we can return the original query plan before broadcast.
mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala (1 line): - line 221: // TODO: In the future, also support normalizing by tree.rootNode.impurityStats.count?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveReferencesInAggregate.scala (1 line): - line 79: // TODO: currently we don't support LCA in `groupingExpressions` yet.
mllib/src/main/scala/org/apache/spark/ml/classification/ClassificationSummary.scala (1 line): - line 189: // TODO: Allow the user to vary the number of bins using a setBins method in
python/pyspark/mllib/linalg/__init__.py (1 line): - line 1555: # TODO: More efficient implementation:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala (1 line): - line 157: // TODO: Calculate memory usage more precisely.
mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala (1 line): - line 1086: // TODO: Generalize PeriodicGraphCheckpointer and use it here.
core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala (1 line): - line 60: // TODO: this currently doesn't work on P other than Tuple2!
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEShuffleReadExec.scala (1 line): - line 59: // TODO this check is based on assumptions of callers' behavior but is sufficient for now.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala (1 line): - line 392: // TODO support multi column NULL-aware anti join in future.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala (1 line): - line 55: // TODO: specialize it
core/src/main/scala/org/apache/spark/TestUtils.scala (1 line): - line 62: * TODO: See if we can move this to the test codebase by specifying
sql/core/src/main/scala/org/apache/spark/sql/classic/DataFrameNaFunctions.scala (1 line): - line 189: * TODO: This can be optimized to use broadcast join when replacementMap is large.
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (1 line): - line 351: // TODO: revisit it. If left side is much smaller than the right side, it may be better
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala (1 line): - line 193: // TODO: If we add ability to submit multiple jars they should also be added here
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala (1 line): - line 435: // TODO: add a more general rule to optimize join with OneRowRelation.
mllib/src/main/scala/org/apache/spark/ml/feature/UnivariateFeatureSelector.scala (1 line): - line 453: // TODO: Sparse representation might be ineffective if (newSize ~= newValues.size)
python/pyspark/sql/udf.py (1 line): - line 363: # TODO: PythonEvalType.SQL_BATCHED_UDF
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala (1 line): - line 103: // TODO SPARK-50180: check if we can return true only if actual timers are registered,
python/pyspark/core/status.py (1 line): - line 109: # TODO: fetch them in batch for better performance  [see the sketch below]
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala (1 line): - line 178: // TODO: revisit this. We can consider reordering predicates as well.
common/utils/src/main/scala/org/apache/spark/storage/StorageLevel.scala (1 line): - line 47: // TODO: Also add fields for caching priority, dataset ID, and flushing.
core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala (1 line): - line 115: // TODO: It would be nice to add a shutdown hook here that explains why the output is
core/src/main/scala/org/apache/spark/rdd/LocalRDDCheckpointData.scala (1 line): - line 49: // must cache any missing partitions. TODO: avoid running another job here (SPARK-8582).
python/pyspark/pandas/indexing.py (1 line): - line 663: self._psdf_or_psser._column_label, scol # TODO: dtype?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/UnsupportedExpressionInOperatorValidation.scala (1 line): - line 62: // TODO: check if we are resolving a lateral join condition once lateral join is supported.
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/Operation.java (1 line): - line 300: // TODO: make this abstract and implement in subclasses.
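The "fetch them in batch" note in python/pyspark/core/status.py above describes a common remote-call pattern: issue one request per chunk of ids rather than one request per id, so the per-call overhead is paid far less often. A minimal, generic sketch of that pattern (illustrative only; `fetch_many` is a hypothetical stand-in for whatever bulk API is available, not a PySpark function):

```python
from typing import Dict, Iterator, List, Sequence


def chunked(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most ``size`` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_many(ids: Sequence[int]) -> Dict[int, str]:
    """Hypothetical bulk lookup: one round trip returns many results."""
    return {i: f"status-{i}" for i in ids}


def fetch_statuses(ids: Sequence[int], batch_size: int = 100) -> List[str]:
    results: Dict[int, str] = {}
    for batch in chunked(ids, batch_size):  # one call per batch, not per id
        results.update(fetch_many(batch))
    return [results[i] for i in ids]


print(fetch_statuses(list(range(5)), batch_size=2))
```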
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecuteEventsManager.scala (1 line): - line 31: // TODO: Make this configurable
sql/catalyst/src/main/java/org/apache/spark/sql/connector/util/V2ExpressionSQLBuilder.java (1 line): - line 128: // TODO supports other expressions
core/src/main/scala/org/apache/spark/executor/Executor.scala (1 line): - line 739: // TODO: do not serialize value twice
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala (1 line): - line 33: // TODO: This is boxing. We should specialize.
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala (1 line): - line 158: // TODO: duplicate evaluator to take extra params from input
common/network-common/src/main/java/org/apache/spark/network/crypto/AuthClientBootstrap.java (1 line): - line 60: // TODO: right now this behaves like the SASL backend, because when executors start up
python/pyspark/ml/linalg/__init__.py (1 line): - line 1338: # TODO: More efficient implementation:
core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala (1 line): - line 91: // TODO: Right now, each split sends along its full data, even if later down the RDD chain it gets
sql/connect/common/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowSerializer.scala (1 line): - line 242: // TODO throw better errors on class cast exceptions.
core/src/main/scala/org/apache/spark/SparkConf.scala (1 line): - line 721: * TODO: consolidate it with `ConfigBuilder.withAlternative`.
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala (1 line): - line 61: * TODO: AggregateMode should have only two modes: Update and Merge, AggregateExpression
mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala (1 line): - line 808: // TODO: Implement this method.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveColumnDefaultInCommandInputQuery.scala (1 line): - line 46: // TODO (SPARK-43752): support v2 write commands as well.
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousQueuedDataReader.scala (1 line): - line 101: // TODO: The obvious generalization of this logic to multiple stages won't work. It's
sql/core/src/main/scala/org/apache/spark/sql/classic/Catalog.scala (1 line): - line 287: // TODO: The SHOW FUNCTIONS should tell us the function type (built-in, temp, persistent) and
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorNodeHealthTracker.scala (1 line): - line 139: // TODO - We need to update once Hadoop changes -
sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala (1 line): - line 32: * TODO reusing the CompletionIterator?
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/CLIServiceClient.java (1 line): - line 40: // TODO: provide STATIC default value
sql/connect/common/src/main/scala/org/apache/spark/sql/connect/columnNodeSupport.scala (1 line): - line 174: // TODO we should probably 'just' detect this particular scenario
streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala (1 line): - line 428: * TODO Should poll the executor number and wait for executors according to
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java (1 line): - line 46: // TODO: Use a more efficient format which doesn't depend on unsafe array.
sql/api/src/main/scala/org/apache/spark/sql/types/UpCastRule.scala (1 line): - line 45: // TODO: allow upcast from int/double/decimal to char/varchar of sufficient length
core/src/main/scala/org/apache/spark/scheduler/ExecutorResourcesAmounts.scala (1 line): - line 131: * TODO: as we consistently allocate addresses beginning from the "small" address, it can
python/pyspark/shuffle.py (1 line): - line 71: # TODO: support windows
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala (1 line): - line 178: // TODO: handle functions defined in Scala too - in Scala, even if a
python/pyspark/pandas/datetimes.py (1 line): - line 51: # TODO: Hit a weird exception
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/OperationManager.java (1 line): - line 269: //TODO: Since OperationLog is moved to package o.a.h.h.ql.session,
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StreamStreamJoinStatePartitionReader.scala (1 line): - line 108: // TODO after we persistent the StateStoreCheckpointID to the commit log, we can get it from
connector/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala (1 line): - line 103: // TODO if the buffer was kept around as a random-access structure,
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala (1 line): - line 42: * TODO: merge this with PruneFileSourcePartitions after we completely make hive as a data source.
sql/connect/common/src/main/scala/org/apache/spark/sql/connect/UdfToProtoUtils.scala (1 line): - line 89: f.inputEncoders.map(e => agnosticEncoderFor(e.get)) // TODO support Any and UnboundRow.
mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala (1 line): - line 421: // TODO: SPARK-15785 Support users supplied initial GMM.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ContinuousScanExec.scala (1 line): - line 39: // TODO: unify the equal/hashCode implementation for all data source v2 query plans.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/BooleanEqualityTypeCoercion.scala (1 line): - line 50: // TODO: Maybe these rules should go into the optimizer.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala (1 line): - line 38: // TODO: pack the UnsafeRows into single bytes array.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala (1 line): - line 47: // TODO [SPARK-50785]: Uncomment this when For Statement starts properly using local vars.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/ProjectResolver.scala (1 line): - line 83: // TODO: This validation function does a post-traversal. This is discouraged in
mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala (1 line): - line 57: // TODO: skip computation if both withMean and withStd are false
core/src/main/scala/org/apache/spark/BarrierCoordinator.scala (1 line): - line 55: // TODO SPARK-25030 Create a Timer() in the mainClass submitted to SparkSubmit makes it unable to
sql/api/src/main/scala/org/apache/spark/sql/ForeachWriter.scala (1 line): - line 96: // TODO: Move this to org.apache.spark.sql.util or consolidate this with batch API.
core/src/main/resources/org/apache/spark/ui/static/spark-dag-viz.js (1 line): - line 287: // Link each graph to the corresponding stage page (TODO: handle stage attempts)
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/TableSchema.java (1 line): - line 40: // TODO: remove this constructor
sql/core/src/main/scala/org/apache/spark/sql/execution/command/v2/V2CommandStrategy.scala (1 line): - line 28: // TODO: move v2 commands to here which are not data source v2 related.
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala (1 line): - line 261: // TODO: When we drop JDK 8, we can just use workerProcess.pid()
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala (1 line): - line 394: TODO: Add simplex constraints to allow alpha in (0,1).
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala (1 line): - line 1159: // TODO: it would be nicer to just make sure there are no null commands here
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java (1 line): - line 1604: // TODO: Add more collation-aware UTF8String operations here.
sql/core/src/main/java/org/apache/parquet/filter2/predicate/SparkFilterApi.java (1 line): - line 29: * TODO (PARQUET-1809): This is a temporary workaround; it is intended to be moved to Parquet.
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkParserUtils.scala (1 line): - line 39: // The following 2 lines are exactly what MySQL does TODO: why do we do this?
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/MicroBatchScanExec.scala (1 line): - line 42: // TODO: unify the equal/hashCode implementation for all data source v2 query plans.
core/src/main/scala/org/apache/spark/shuffle/ShuffleReader.scala (1 line): - line 29: * TODO: Add this back when we make the ShuffleReader a developer API that others can implement
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala (1 line): - line 48: // TODO: unify the equal/hashCode implementation for all data source v2 query plans.
core/src/main/resources/org/apache/spark/ui/static/executorspage.js (1 line): - line 286: // TODO: Replace hasOwnProperty with prototype.hasOwnProperty after we find it's safe to do.
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala (1 line): - line 227: // TODO: It's possible that only one skewed join in the query plan leads to extra shuffles and
sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java (1 line): - line 113: // TODO: It's possible to change UnsafeInMemorySorter to have multiple entries with same key,
core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala (1 line): - line 62: // TODO: Don't use Java serialization, use a more cross-version compatible serialization format.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (1 line): - line 858: // TODO: remove this `retainedHiveConfigs` after the `RelationConversions` is moved to
common/network-common/src/main/java/org/apache/spark/network/buffer/ManagedBuffer.java (1 line): - line 49: // TODO: Deprecate this, usage may require expensive memory mapping or allocation.
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala (1 line): - line 180: // TODO: duplicate evaluator to take extra params from input
sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java (1 line): - line 115: // TODO: this is extremely expensive.
mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala (1 line): - line 111: // TODO: Override methods such as aggregate, which only requires one Spark job.
mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala (1 line): - line 270: // TODO: When we add a generic Bagging class, handle transform there. SPARK-7128
core/src/main/scala/org/apache/spark/SecurityManager.scala (1 line): - line 353: // TODO: this really should be abstracted somewhere else.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (1 line): - line 358: // TODO: Throw when too much is given.
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala (1 line): - line 150: // TODO: Remove copy logic.
python/pyspark/streaming/dstream.py (1 line): - line 368: # TODO: uncomment this until we have ssc.pickleFileStream()
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (1 line): - line 325: // TODO: SPARK-51957: Fix partition column and EMTPY_DATA_SCHEMA naming conflict
mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala (1 line): - line 384: // TODO: Sparse representation might be ineffective if (newSize ~= newValues.size)
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala (1 line): - line 49: * TODO: implement the read logic.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/QueryPlanner.scala (1 line): - line 50: * TODO: RIGHT NOW ONLY ONE PLAN IS RETURNED EVER...
sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala (1 line): - line 149: // TODO: we should only collect properties that have getter and setter. However, some tests
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/FetchOrientation.java (1 line): - line 46: // TODO: Should this really default to FETCH_NEXT?
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala (1 line): - line 142: // TODO: we should do this check earlier when we have capability API.
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala (1 line): - line 1379: // TODO this might be too complex for no good reason. It might
core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala (1 line): - line 166: // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala (1 line): - line 183: // TODO: Check that options from the resolved relation match the relation that we are
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala (1 line): - line 375: // TODO: Support for loading the jars from an already downloaded location.
graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala (1 line): - line 68: // TODO: use a fixed random seed
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala (1 line): - line 145: // TODO When state store checkpoint format V2 is used, after state store checkpoint ID is
common/utils/src/main/scala/org/apache/spark/util/SparkClassUtils.scala (1 line): - line 86: // TODO: the value returned here isn't even quite right; it returns simple names
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala (1 line): - line 212: // TODO: we should have a better separation of row based and batch based scan, so that we
streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala (1 line): - line 412: // TODO: merge callsites with scopes so we can just reuse the code there
core/src/main/scala/org/apache/spark/deploy/Client.scala (1 line): - line 83: // TODO: We could add an env variable here and intercept it in `sc.addJar` that would
core/src/main/scala/org/apache/spark/scheduler/Task.scala (1 line): - line 103: // TODO SPARK-24874 Allow create BarrierTaskContext based on partitions, instead of whether
mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala (1 line): - line 105: case _ => // TODO: there is no RegressorParams, maybe add it in the future?
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala (1 line): - line 829: * TODO figure out return type.
python/pyspark/sql/worker/analyze_udtf.py (1 line): - line 281: # TODO: Remove the following two lines and use `Process.pid()` when we drop JDK 8.
streaming/src/main/scala/org/apache/spark/streaming/scheduler/StreamingListenerBus.scala (1 line): - line 98: // events (SPARK-12140). TODO Once SPARK-12140 is resolved we should set it to true.
launcher/src/main/java/org/apache/spark/launcher/CommandBuilderUtils.java (1 line): - line 315: // TODO: change to the correct directory once the assembly build is changed.
sql/connect/common/src/main/scala/org/apache/spark/sql/connect/SparkSession.scala (1 line): - line 691: // TODO: implements all methods mentioned in the scaladoc of [[SparkSession]]
mllib/src/main/scala/org/apache/spark/mllib/tree/model/DecisionTreeModel.scala (1 line): - line 204: // TODO: Fix this issue for real.
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala (1 line): - line 118: // TODO: Define a rigorous naming scheme.
mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala (1 line): - line 337: // TODO: When we add a generic Bagging class, handle transform there: SPARK-7128
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (1 line): - line 460: * TODO: Look to merge this rule with RewritePredicateSubquery.
core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala (1 line): - line 127: * TODO do we need a timeout parameter?
python/pyspark/pandas/mlflow.py (1 line): - line 55: # TODO: do something smarter, for example when there is a sklearn.Classifier (it should
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java (1 line): - line 38: // TODO: StorageLevel is serialized separately in here because StorageLevel is not available in
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala (1 line): - line 185: * TODO: Currently we don't allow deep correlation. Also, we don't allow mixing of
graphx/src/main/scala/org/apache/spark/graphx/package.scala (1 line): - line 34: // TODO: Consider using Char.
core/src/main/scala/org/apache/spark/scheduler/AccumulableInfo.scala (1 line): - line 48: // TODO: use this to identify internal task metrics instead of encoding it in the name
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala (1 line): - line 130: // TODO: Check if the paths coming in are already qualified and simplify.
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala (1 line): - line 286: * TODO: We need to clean it up by separating the logic of regularization out
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (1 line): - line 250: * TODO: we should remove the special handling for hive tables after completely making hive as a
mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala (1 line): - line 46: // TODO: Allow different IDF formulations.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateResolver.scala (1 line): - line 109: // TODO: This validation function does a post-traversal. This is discouraged in single-pass
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala (1 line): - line 80: // TODO: This can be further extended in the future.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala (1 line): - line 1217: // TODO: We need to figure out how these methods interact with our data source
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala (1 line): - line 166: // TODO: if you move this into the closure it reverts to the default values.
mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala (1 line): - line 137: require(doc.nonEmpty) // TODO: more rigorous on doc
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDataSource.scala (1 line): - line 113: // TODO (SPARK-27483): we should move this fallback logic to an analyzer rule.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala (1 line): - line 268: // TODO more implicit class for literal?
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala (1 line): - line 181: // TODO work around for set the log output to console, because the HiveContext