lib/elasticsearch-serverless/api/reindex.rb

# Licensed to Elasticsearch B.V. under one or more contributor
# license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright
# ownership. Elasticsearch B.V. licenses this file to you under
# the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
# Auto generated from commit f284cc16f4d4b4289bc679aa1529bb504190fe80
# @see https://github.com/elastic/elasticsearch-specification
#
module ElasticsearchServerless
  module API
    module Actions
      # Reindex documents.
      # Copy documents from a source to a destination.
      # You can copy all documents to the destination index or reindex a subset of the documents.
      # The source can be any existing index, alias, or data stream.
      # The destination must differ from the source.
      # For example, you cannot reindex a data stream into itself.
      # IMPORTANT: Reindex requires +_source+ to be enabled for all documents in the source.
      # The destination should be configured as wanted before calling the reindex API.
      # Reindex does not copy the settings from the source or its associated template.
      # Mappings, shard counts, and replicas, for example, must be configured ahead of time.
      # If the Elasticsearch security features are enabled, you must have the following security privileges:
      # * The +read+ index privilege for the source data stream, index, or alias.
      # * The +write+ index privilege for the destination data stream, index, or index alias.
      # * To automatically create a data stream or index with a reindex API request, you must have the +auto_configure+, +create_index+, or +manage+ index privilege for the destination data stream, index, or alias.
      # * If reindexing from a remote cluster, the +source.remote.user+ must have the +monitor+ cluster privilege and the +read+ index privilege for the source data stream, index, or alias.
      # If reindexing from a remote cluster, you must explicitly allow the remote host in the +reindex.remote.whitelist+ setting.
      # Automatic data stream creation requires a matching index template with data stream enabled.
      # The +dest+ element can be configured like the index API to control optimistic concurrency control.
      # Omitting +version_type+ or setting it to +internal+ causes Elasticsearch to blindly dump documents into the destination, overwriting any that happen to have the same ID.
      # Setting +version_type+ to +external+ causes Elasticsearch to preserve the +version+ from the source, create any documents that are missing, and update any documents that have an older version in the destination than they do in the source.
      # Setting +op_type+ to +create+ causes the reindex API to create only missing documents in the destination.
      # All existing documents will cause a version conflict.
      # IMPORTANT: Because data streams are append-only, any reindex request to a destination data stream must have an +op_type+ of +create+.
      # A reindex can only add new documents to a destination data stream.
      # It cannot update existing documents in a destination data stream.
      # By default, version conflicts abort the reindex process.
      # To continue reindexing if there are conflicts, set the +conflicts+ request body property to +proceed+.
      # In this case, the response includes a count of the version conflicts that were encountered.
      # Note that the handling of other error types is unaffected by the +conflicts+ property.
      # Additionally, if you opt to count version conflicts, the operation could attempt to reindex more documents from the source than +max_docs+ until it has successfully indexed +max_docs+ documents into the target or it has gone through every document in the source query.
      # NOTE: The reindex API makes no effort to handle ID collisions.
      # The last document written will "win" but the order isn't usually predictable so it is not a good idea to rely on this behavior.
      # Instead, make sure that IDs are unique by using a script.
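      # As a minimal sketch, a reindex that copies documents into a different index, creates only missing documents, and proceeds past version conflicts could look like this (the index names are illustrative and +client+ is assumed to be a configured +ElasticsearchServerless::Client+):
      # +
      # client.reindex(
      #   body: {
      #     conflicts: 'proceed',
      #     source: { index: 'my-index' },
      #     dest: { index: 'my-new-index', op_type: 'create' }
      #   }
      # )
      # +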
      # **Running reindex asynchronously**
      # If the request contains +wait_for_completion=false+, Elasticsearch performs some preflight checks, launches the request, and returns a task you can use to cancel or get the status of the task.
      # Elasticsearch creates a record of this task as a document at +_tasks/<task_id>+.
      # **Reindex from multiple sources**
      # If you have many sources to reindex it is generally better to reindex them one at a time rather than using a glob pattern to pick up multiple sources.
      # That way you can resume the process if there are any errors by removing the partially completed source and starting over.
      # It also makes parallelizing the process fairly simple: split the list of sources to reindex and run each list in parallel.
      # For example, you can use a bash script like this:
      # +
      # for index in i1 i2 i3 i4 i5; do
      #   curl -HContent-Type:application/json -XPOST localhost:9200/_reindex?pretty -d'{
      #     "source": {
      #       "index": "'$index'"
      #     },
      #     "dest": {
      #       "index": "'$index'-reindexed"
      #     }
      #   }'
      # done
      # +
      # **Throttling**
      # Set +requests_per_second+ to any positive decimal number (+1.4+, +6+, +1000+, for example) to throttle the rate at which reindex issues batches of index operations.
      # Requests are throttled by padding each batch with a wait time.
      # To turn off throttling, set +requests_per_second+ to +-1+.
      # The throttling is done by waiting between batches so that the scroll that reindex uses internally can be given a timeout that takes into account the padding.
      # The padding time is the difference between the batch size divided by the +requests_per_second+ and the time spent writing.
      # By default the batch size is +1000+, so if +requests_per_second+ is set to +500+:
      # +
      # target_time = 1000 / 500 per second = 2 seconds
      # wait_time = target_time - write_time = 2 seconds - .5 seconds = 1.5 seconds
      # +
      # Since the batch is issued as a single bulk request, large batch sizes cause Elasticsearch to create many requests and then wait for a while before starting the next set.
      # This is "bursty" instead of "smooth".
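      # With this client, the throttle is passed as a request parameter rather than in the body; a sketch with an illustrative value (+client+ assumed as above):
      # +
      # client.reindex(
      #   requests_per_second: 500,
      #   body: {
      #     source: { index: 'my-index' },
      #     dest: { index: 'my-new-index' }
      #   }
      # )
      # +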
      # **Slicing**
      # Reindex supports sliced scroll to parallelize the reindexing process.
      # This parallelization can improve efficiency and provide a convenient way to break the request down into smaller parts.
      # NOTE: Reindexing from remote clusters does not support manual or automatic slicing.
      # You can slice a reindex request manually by providing a slice ID and total number of slices to each request.
      # You can also let reindex automatically parallelize by using sliced scroll to slice on +_id+.
      # The +slices+ parameter specifies the number of slices to use.
      # Adding +slices+ to the reindex request just automates the manual process, creating sub-requests, which means it has some quirks:
      # * You can see these requests in the tasks API. These sub-requests are "child" tasks of the task for the request with slices.
      # * Fetching the status of the task for the request with +slices+ only contains the status of completed slices.
      # * These sub-requests are individually addressable for things like cancellation and rethrottling.
      # * Rethrottling the request with +slices+ will rethrottle the unfinished sub-request proportionally.
      # * Canceling the request with +slices+ will cancel each sub-request.
      # * Due to the nature of +slices+, each sub-request won't get a perfectly even portion of the documents. All documents will be addressed, but some slices may be larger than others. Expect larger slices to have a more even distribution.
      # * Parameters like +requests_per_second+ and +max_docs+ on a request with +slices+ are distributed proportionally to each sub-request. Combine that with the previous point about distribution being uneven and you should conclude that using +max_docs+ with +slices+ might not result in exactly +max_docs+ documents being reindexed.
      # * Each sub-request gets a slightly different snapshot of the source, though these are all taken at approximately the same time.
      # If slicing automatically, setting +slices+ to +auto+ will choose a reasonable number for most indices.
      # If slicing manually or otherwise tuning automatic slicing, use the following guidelines.
      # Query performance is most efficient when the number of slices is equal to the number of shards in the index.
      # If that number is large (for example, +500+), choose a lower number as too many slices will hurt performance.
      # Setting slices higher than the number of shards generally does not improve efficiency and adds overhead.
      # Indexing performance scales linearly across available resources with the number of slices.
      # Whether query or indexing performance dominates the runtime depends on the documents being reindexed and cluster resources.
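      # For instance, automatic slicing is requested through the +slices+ parameter documented below; a sketch (index names illustrative, +client+ assumed as above):
      # +
      # client.reindex(
      #   slices: 'auto',
      #   body: {
      #     source: { index: 'my-index' },
      #     dest: { index: 'my-new-index' }
      #   }
      # )
      # +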
      # **Modify documents during reindexing**
      # Like +_update_by_query+, reindex operations support a script that modifies the document.
      # Unlike +_update_by_query+, the script is allowed to modify the document's metadata.
      # Just as in +_update_by_query+, you can set +ctx.op+ to change the operation that is run on the destination.
      # For example, set +ctx.op+ to +noop+ if your script decides that the document doesn’t have to be indexed in the destination. This "no operation" will be reported in the +noop+ counter in the response body.
      # Set +ctx.op+ to +delete+ if your script decides that the document must be deleted from the destination.
      # The deletion will be reported in the +deleted+ counter in the response body.
      # Setting +ctx.op+ to anything else will return an error, as will setting any other field in +ctx+.
      # Think of the possibilities! Just be careful; you are able to change:
      # * +_id+
      # * +_index+
      # * +_version+
      # * +_routing+
      # Setting +_version+ to +null+ or clearing it from the +ctx+ map is just like not sending the version in an indexing request.
      # It will cause the document to be overwritten in the destination regardless of the version on the target or the version type you use in the reindex API.
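      # As a hedged sketch, a script that skips some documents by setting +ctx.op+ to +noop+ (the +archived+ field is illustrative, +client+ assumed as above):
      # +
      # client.reindex(
      #   body: {
      #     source: { index: 'my-index' },
      #     dest: { index: 'my-new-index' },
      #     script: {
      #       lang: 'painless',
      #       source: "if (ctx._source.archived == true) { ctx.op = 'noop' }"
      #     }
      #   }
      # )
      # +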
      # **Reindex from remote**
      # Reindex supports reindexing from a remote Elasticsearch cluster.
      # The +host+ parameter must contain a scheme, host, port, and optional path.
      # The +username+ and +password+ parameters are optional and when they are present the reindex operation will connect to the remote Elasticsearch node using basic authentication.
      # Be sure to use HTTPS when using basic authentication or the password will be sent in plain text.
      # There are a range of settings available to configure the behavior of the HTTPS connection.
      # When using Elastic Cloud, it is also possible to authenticate against the remote cluster through the use of a valid API key.
      # Remote hosts must be explicitly allowed with the +reindex.remote.whitelist+ setting.
      # It can be set to a comma-delimited list of allowed remote host and port combinations.
      # Scheme is ignored; only the host and port are used.
      # For example:
      # +
      # reindex.remote.whitelist: [otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*]
      # +
      # The list of allowed hosts must be configured on any nodes that will coordinate the reindex.
      # This feature should work with remote clusters of any version of Elasticsearch.
      # This should enable you to upgrade from any version of Elasticsearch to the current version by reindexing from a cluster of the old version.
      # WARNING: Elasticsearch does not support forward compatibility across major versions.
      # For example, you cannot reindex from a 7.x cluster into a 6.x cluster.
      # To enable queries sent to older versions of Elasticsearch, the +query+ parameter is sent directly to the remote host without validation or modification.
      # NOTE: Reindexing from remote clusters does not support manual or automatic slicing.
      # Reindexing from a remote server uses an on-heap buffer that defaults to a maximum size of 100mb.
      # If the remote index includes very large documents you'll need to use a smaller batch size.
      # It is also possible to set the socket read timeout on the remote connection with the +socket_timeout+ field and the connection timeout with the +connect_timeout+ field.
      # Both default to 30 seconds.
      # **Configuring SSL parameters**
      # Reindex from remote supports configurable SSL settings.
      # These must be specified in the +elasticsearch.yml+ file, with the exception of the secure settings, which you add in the Elasticsearch keystore.
      # It is not possible to configure SSL in the body of the reindex request.
      #
      # @option arguments [Boolean] :refresh If +true+, the request refreshes affected shards to make this operation visible to search.
      # @option arguments [Float] :requests_per_second The throttle for this request in sub-requests per second.
      #  By default, there is no throttle. Server default: -1.
      # @option arguments [Time] :scroll The period of time that a consistent view of the index should be maintained for scrolled search.
      # @option arguments [Integer, String] :slices The number of slices this task should be divided into.
      #  It defaults to one slice, which means the task isn't sliced into subtasks.
      #  Reindex supports sliced scroll to parallelize the reindexing process.
      #  This parallelization can improve efficiency and provide a convenient way to break the request down into smaller parts.
      #  NOTE: Reindexing from remote clusters does not support manual or automatic slicing.
      #  If set to +auto+, Elasticsearch chooses the number of slices to use.
      #  This setting will use one slice per shard, up to a certain limit.
      #  If there are multiple sources, it will choose the number of slices based on the index or backing index with the smallest number of shards. Server default: 1.
      # @option arguments [Time] :timeout The period each indexing operation waits for automatic index creation, dynamic mapping updates, and waiting for active shards.
      #  By default, Elasticsearch waits for at least one minute before failing.
      #  The actual wait time could be longer, particularly when multiple waits occur. Server default: 1m.
      # @option arguments [Integer, String] :wait_for_active_shards The number of shard copies that must be active before proceeding with the operation.
      #  Set it to +all+ or any positive integer up to the total number of shards in the index (+number_of_replicas+1+).
      #  The default value is +1+, which means it waits for each primary shard to be active. Server default: 1.
      # @option arguments [Boolean] :wait_for_completion If +true+, the request blocks until the operation is complete. Server default: true.
      # @option arguments [Boolean] :require_alias If +true+, the destination must be an index alias.
      # @option arguments [Hash] :headers Custom HTTP headers
      # @option arguments [Hash] :body request body
      #
      # @see https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-reindex
      #
      def reindex(arguments = {})
        request_opts = { endpoint: arguments[:endpoint] || 'reindex' }

        raise ArgumentError, "Required argument 'body' missing" unless arguments[:body]

        arguments = arguments.clone
        headers = arguments.delete(:headers) || {}

        body = arguments.delete(:body)

        method = ElasticsearchServerless::API::HTTP_POST
        path   = '_reindex'
        params = Utils.process_params(arguments)

        ElasticsearchServerless::API::Response.new(
          perform_request(method, path, params, body, headers, request_opts)
        )
      end
    end
  end
end
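
# A hedged end-to-end sketch of calling this action asynchronously against a
# remote source (host, credentials, and index names are illustrative; +client+
# is assumed to be a configured +ElasticsearchServerless::Client+):
# +
# response = client.reindex(
#   wait_for_completion: false,
#   body: {
#     source: {
#       remote: { host: 'https://otherhost:9200', username: 'user', password: 'pass' },
#       index: 'my-index'
#     },
#     dest: { index: 'my-new-index' }
#   }
# )
# task_id = response.body['task'] # poll or cancel via the tasks API
# +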