// The MIT License (MIT)
// Copyright (c) 2017-2020 Uber Technologies Inc.
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.
/*
Package global contains a global-load-balance-aware ratelimiter (when complete).

# High level overview

At a very high level, this will:
- collect usage metrics in-memory in "limiting" hosts (which use an in-memory ratelimiter)
- asynchronously submit these usage metrics to "aggregating" hosts, which will return per-key RPS data
- "limiting" hosts use this returned value to update their in-memory ratelimiters
"Aggregating" hosts will internally:
- receive data for key X from all "limiting" hosts
- compute request-weight per host+key in the cluster
- return the "limiting"-host's weighted portion of each ratelimit to allow

The exact logic that the "aggregating" hosts use is intentionally hidden from the
"limiting" hosts, so it can be changed without changing how they enforce limits.
The weight-calculation algorithm will (eventually) be controlled by a dynamicconfig value,
to allow shadowing and experimenting with different algorithms at runtime, or disabling
the global logic altogether.
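
For intuition only (the names here are hypothetical, not the real sub-package APIs),
the hidden aggregator-side behavior could be sketched as an interface like:

	// Hypothetical sketch only - see the sub-packages for the real interfaces.
	//
	// The aggregating side records usage reported by each limiting host and
	// answers with the caller's weighted share of each key's configured limit.
	// Because limiting hosts only see the returned RPS, this algorithm can be
	// swapped, shadowed, or disabled without changing the limiting hosts.
	type WeightAlgorithm interface {
		// Record stores that callerHost used `used` requests on `key`
		// since its previous update.
		Record(callerHost, key string, used int)
		// AllowedRPS returns the portion of key's configured limit that
		// callerHost should enforce locally, based on all hosts' usage.
		AllowedRPS(callerHost, key string, configuredRPS float64) float64
	}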
See sub-packages for implementation details.

# Data flow in the system

The overall planned system's data-flow is driven entirely by limiting hosts repeatedly
calling an "update(request, data)" API on aggregating hosts, and using the return value.
Aggregating hosts make no outbound requests at all:

	lim: a host limiting e.g. user RPS for domain ASDF
	agg: a host aggregating this data, to compute fair RPS allowances

	┌───┐             ┌────┐┌────┐┌────┐                   ┌────┐
	│lim│             │ring││agg1││agg2│                   │lim2│
	└─┬─┘             └─┬──┘└─┬──┘└─┬──┘                   └─┬──┘
	  │                 │     │     │                        │
	  │  shard(a,b,c)   │     │     │     update(a1,d37)     │
	  │───────────────>┌┴┐    │    ┌┴┐<──────────────────────│
	  │  [[a], [b,c]]  │ │    │    └┬┘──────────────────────>│
	  │<───────────────└┬┘    │     │                        ...
	  │                 │     │     │
	  │  update(a1)     │     │     │   got 1 request on "a" limit
	  │─────────────────────>┌┴┐    │
	  │                 │    │ │    │
	  │  update(b1,c2)  │    │ │    │   1 req on "b", 2 on "c"
	  │───────────────────────────>┌┴┐
	  │                 │    │ │   │ │
	  │  [b4/s, c9/s]   │    │ │   │ │  allow "b" 4 RPS, "c" gets 9
	  │<───────────────────────────└┬┘
	  │                 │    │ │    │
	  │  [a2/s]         │    │ │    │   2 RPS for "a"
	  │<─────────────────────└┬┘    │
	  │                 │     │     │
	┌─┴─┐             ┌─┴──┐┌─┴──┐┌─┴──┐
	│lim│             │ring││agg1││agg2│
	└───┘             └────┘└────┘└────┘

This package as a whole is only concerned with the high level request pattern in use:
- limiters enforce limits and collect data
- limiters asynchronously submit that data to all relevant aggregators, sharded by key
- aggregators decide per-limiter-and-key limits based on global data, and return it to limiters
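
Purely as an illustration of that pattern (hypothetical names; the real request and
response types and RPC definitions live in the sub-packages and may differ), one round
of the limiter-side loop could look roughly like:

	// Hypothetical sketch only - not the real RPC types.
	type UpdateRequest struct {
		// usage observed on this limiting host since its previous update, per limit key
		UsedPerKey map[string]int
	}
	type UpdateResponse struct {
		// per-key RPS this limiting host is allowed to enforce locally
		AllowedRPSPerKey map[string]float64
	}

	// submitOnce is one round of the limiter-side loop, run every few seconds:
	// collect usage, shard keys to their owning aggregator hosts (e.g. via the
	// ring), send each aggregator its share, and apply the returned RPS.
	func submitOnce(
		collect func() map[string]int, // usage since the last round, then reset
		ownerOf func(key string) string, // key -> aggregator host, e.g. a ring lookup
		call func(aggHost string, req UpdateRequest) (UpdateResponse, error),
		apply func(allowed map[string]float64), // update the in-memory limiters
	) {
		byOwner := map[string]map[string]int{}
		for key, used := range collect() {
			owner := ownerOf(key)
			if byOwner[owner] == nil {
				byOwner[owner] = map[string]int{}
			}
			byOwner[owner][key] = used
		}
		for aggHost, usage := range byOwner {
			resp, err := call(aggHost, UpdateRequest{UsedPerKey: usage})
			if err != nil {
				continue // keep the previous limits and try again next round
			}
			apply(resp.AllowedRPSPerKey)
		}
	}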

Currently:
- limiters are Frontends, but they could trivially be any Cadence service, as they only send outbound requests
- aggregators are (somewhat arbitrarily) in the History service, because it already has a ring implemented (Frontend does not)
- neither of these choices is fundamentally required, and they may change or live in multiple services later

The actual contents of the requests and responses, and how limits are decided, are
intentionally flexible and not defined here - see individual implementations / sub-packages.
These will likely evolve as needs change, and will ideally be pluggable,
dynamicconfig-selectable, and shadow-able at runtime without changing this overall structure.

# Planned use

The first use of this is intentionally focused on addressing imbalanced usage
of per-domain user/worker/ES ratelimits across different frontend hosts, e.g.
due to unfair load balancing, per-datacenter affinity, or other
occasionally-changing but otherwise mostly-stable routing decisions.

This will essentially imply 3 keys per domain, and will look something like
`domain-user-request` / `domain-worker-request` / etc. - the keys just have to be
unique per conceptual thing-with-a-configured-limit.
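
Purely as an illustration of that uniqueness requirement (the real key format is
decided elsewhere and may differ):

	// Hypothetical sketch only - keys just need to be unique per configured limit.
	func limitKey(domain, limitType string) string {
		// e.g. limitKey("some-domain", "user") -> "some-domain-user-request"
		return domain + "-" + limitType + "-request"
	}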

These frontend limits are generally configured up-front and are used to protect
the cluster (especially the database) from noisy neighbors and unplanned major
usage changes. True precision is not necessary (accepting 110 instead of 100 will
not collapse the cluster, but allowing *unlimited* could), but users plan around the
numbers we have given them, and significant or sustained differences from those numbers
can cause problems for them - particularly if their effective limit significantly
decreases due to a change in how our network decides to distribute their requests.

By detecting this imbalanced load we can fairly allow "hot" hosts more RPS than our
"limit / N hosts" limiters allow (essentially stealing unused quota from low-usage
hosts), and ultimately allow request rates closer to the intended limits as the
balance changes over time. It's better for our users (they get what we told them
they get), and better for us (no need to artificially raise the limit to work around
the load imbalance, nor keep changing it as load changes).
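
As a simplified illustration of that reallocation (a sketch only, not the real
aggregator algorithm): with a 100 RPS limit and 4 hosts, a naive "limit / N hosts"
split allows 25 RPS per host, while a usage-weighted split follows observed traffic:

	// Simplified sketch only - the real algorithm lives in the sub-packages.
	func weightedAllowances(limit float64, usage map[string]float64) map[string]float64 {
		total := 0.0
		for _, u := range usage {
			total += u
		}
		if total == 0 {
			total = 1 // avoid dividing by zero; no usage means no weight anyway
		}
		out := make(map[string]float64, len(usage))
		for host, u := range usage {
			// each host gets its observed share of the configured limit
			out[host] = limit * (u / total)
		}
		return out
	}

	// weightedAllowances(100, map[string]float64{"h1": 70, "h2": 10, "h3": 10, "h4": 10})
	// -> h1: 70 RPS, h2/h3/h4: 10 RPS each, instead of a flat 25 RPS per host.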

Weight-data loss is expected due to frontend or agg process losses, and allowed-limits
naturally trail real-time use due to the asynchronous nature - this is not *ideal* but
is expected to be acceptable, as it will adjust / self-repair "soon" as normal updates
occur.

More concretely, our planned target is something like:
- frontends (the limiting hosts) submit usage data every 3s
- history's aggregating algorithms "quickly" account for new load, and return allowed
limits to frontends
- within about 10s of a change, we want things to be ~90% accurate or better
compared to the "ground truth" (this implies ~0.5 weight per 3s update; see the sketch after this list)
- with stable traffic, >99% accuracy at allowing the intended RPS of a limit key,
regardless of how requests are distributed
- briefly going over or under the intended limit during changes or partial data
is allowed, but when in doubt the system should prefer incorrectly rejecting
requests over incorrectly allowing them
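
To make the "~0.5 weight per 3s update" implication concrete - as an illustrative
calculation only, assuming each update applies an exponential moving average with
weight 0.5 (one way to read that figure, not necessarily the real algorithm) - the
remembered usage converges toward a step change as 1 - 0.5^n after n updates:

	// Illustrative sketch only: convergence of a weight-0.5 moving average.
	func accuracyAfter(updates int) float64 {
		remainingError := 1.0
		for i := 0; i < updates; i++ {
			remainingError *= 0.5 // each update halves the remaining error
		}
		return 1 - remainingError
	}

	// accuracyAfter(3) == 0.875  (~9s at one update per 3s)
	// accuracyAfter(4) == 0.9375 (~12s)
	// so roughly 10s after a change the estimate is ~90% accurate.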

Future uses (with minor changes) may include things like:
- limiting cluster-wide database RPS across all hosts (one key, 50k+ RPS) despite
significantly-different load depending on server role and shard distribution
- different load-deciding algorithms, e.g. TCP Vegas or some other more sophisticated
weight-deciding algorithm (this first one is intentionally simple, and might prove
good enough)
- closed-source pluggable algorithms that share the same intent and request-flow
but little else (i.e. extendable by operators without new RPC definitions;
routing and usage patterns are often unique, so no one approach can be ideal everywhere)

# Out of scope

Uses explicitly out of scope for this whole structure include things like:
- perfect accuracy on RPS targets (requires synchronous decisions)
- rapid and significant usage changes, like a domain sending all requests to one
frontend host one second, then a different one the next, etc.
- brief spikes (e.g. lasting under ~10 seconds; these are broadly undesirable and can
behave worse than a simple RPS/hosts split would allow)
- guaranteed limiting below intended RPS (partial data may lead to split-brain-like
behavior, leading to over-allowing)
- fully automated target-RPS adjusting (e.g. limits derived from detected system load
rather than pre-configured RPS; this may be *possible* to implement in this system,
but it is not *planned* and may require significant changes)
*/
package global