turbonfs/inc/readahead.h
#ifndef __READAHEAD_H__
#define __READAHEAD_H__
#include <atomic>
#include <shared_mutex>
#include "aznfsc.h"
// Forward declarations.
struct nfs_inode;
struct nfs_client;
namespace aznfsc {
/**
* Readahead state for a Blob.
 * This maintains state to track the application read pattern and, based on
 * that, can decide if readahead will help. If so, it can also perform the
 * readahead read IOs for the given file, updating the cache appropriately.
 * The user can call the issue_readaheads() method to issue all readaheads
 * permitted at the time of the call. Only the READ RPCs are queued at libnfs
 * in the calling thread
* context. The actual send will be performed by the libnfs service thread of
* the selected nfs context, and the callback will also be called in the
* context of the same libnfs service thread.
* Following are some of its properties:
*
 * 1. The user must call issue_readaheads(), usually right after an
 *    application read and in the same context. This will issue the required
 *    readahead read requests after the application read.
 * 2. issue_readaheads() calls the private get_next_ra() method to find the
 *    offset of the next readahead read it should issue, and it'll issue all
 *    the suggested reads. Obviously it'll not issue any readahead reads for
 *    a random read pattern, or if there are already enough ongoing readahead
 *    reads.
* 3. It should *never* suggest readahead for the same offset more than once
* if the reads are issued at monotonically increasing offsets.
* 4. It should *never* suggest readahead for the same offset for which a read
* has been recently issued, if the reads are issued at monotonically
* increasing offsets.
 * 5. Since it tracks the read IO pattern, it should be made aware of *all*
 *    reads issued by the application. Also, it should be told when a
 *    readahead completes.
*
* TODO: Currently it only tracks a single reader stream. If it's used in a
* scope where multiple reader applications are performing reads, it
* may not be able to correctly detect sequential patterns, even if all
* those multiple reader streams are sequential by themselves.
*
 *       Another way of achieving the same is for the user to use multiple
 *       ra_state objects, one per reading context, e.g., by associating the
 *       ra_state with the issuing process (the pid returned by
 *       fuse_get_context()) and not with the file inode.
*
* Note that it would be ideal if fuse provided us info on the file
* pointer on whose behalf a specific IO is being issued. This would
* help us associate readahead state with a file pointer and not with
 *       the inode, as readahead state (like the read offset) is a property
 *       of the file pointer and not of the inode. In the absence of this,
 *       the above pid-based readahead state maintenance is the next best
 *       thing we can do.
*
* How does pattern detection work?
* ================================
 * The file is divided into 1GB logical sections. Every time access moves to a
 * new section, the pattern tracking variables are reset (this is skipped for
 * an ongoing sequential access). This is done to make sure we use the most
 * recent accesses to correctly detect the pattern and older accesses do not
 * muddle the pattern detection. The following pattern tracking variables are
 * maintained:
*
* - ra_bytes is the amount of readahead in bytes. We never keep more than
* ra_bytes of readahead reads ongoing.
* - min_byte_read and max_byte_read track the min and max bytes read by the
* application in the current section. max_byte_read-min_byte_read is called
* the access_range.
 * - num_reads and num_bytes_read are the total number of reads and the number
 *   of bytes read in the current section, respectively.
 * - If num_reads >= 3 and num_bytes_read/access_range > 0.7, the pattern is
 *   considered sequential. ((num_bytes_read * 100) / access_range) is called
 *   access_density as it represents how densely the reads are issued over a
 *   file range (see the worked example below). Note that this allows for some
 *   reordered reads due to multiple async reads handled by multiple threads,
 *   but at the same time it marks the pattern sequential only when the
 *   application is indeed reading sequentially. Note that random reads, or
 *   "jumping reads" after a fixed gap, will not meet the access_density
 *   threshold and hence will not qualify as sequential.
 * - The readahead window starts from max_byte_read+1 and is ra_bytes wide.
* - ra_ongoing is the number of readahead bytes which are still ongoing.
* - last_byte_readahead is the last byte of readahead read issued, which means
* next readahead is issued from last_byte_readahead+1. When max_byte_read
* crosses last_byte_readahead, last_byte_readahead is updated to
* max_byte_read, so that we never issue readahead for something that's
* already read recently.
* - Pattern tracking is reset when one of the following happens:
 *   - A new read from the application lies in a different section than
 *     max_byte_read (and the current access is not sequential). This ensures
 *     our pattern detection is based on recent data and historical accesses
 *     do not carry influence for a long time.
 *   - A new read starts after max_byte_read+ra_bytes. Such a large jump in
 *     the read offset hints at non-sequential access, hence the access
 *     pattern needs to be reviewed again and the sequential pattern must be
 *     proved afresh.
* - Following pattern tracking variables are reset:
* - min_byte_read
* - max_byte_read
* - num_reads
* - num_bytes_read
* - last_byte_readahead
 * - When pattern tracking is reset, it'll take at least 3 reads to detect the
 *   pattern again. Until then we won't recommend any new readaheads.
* Previously issued readaheads will continue and ra_ongoing is not reset.
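 *
 * Worked example (illustrative numbers, applying the access_density formula
 * described above):
 * If the application issues three 1MiB reads at offsets 0, 1MiB and 2MiB
 * within one section, then num_reads=3, num_bytes_read=3MiB, min_byte_read=0
 * and max_byte_read=3MiB-1, so access_range=3MiB and
 * access_density = (3MiB * 100) / 3MiB = 100 > 70 => sequential.
 * If the same three reads were instead at offsets 0, 10MiB and 20MiB,
 * access_range would be ~21MiB and access_density ~14 < 70, so the pattern
 * would not be considered sequential.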
*/
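
/*
 * Minimal usage sketch (illustrative only: the read handler below and the
 * issue_application_read()/get_ra_state() helpers are hypothetical, only the
 * ra_state calls are part of this class):
 *
 *   void handle_read(struct nfs_inode *inode, uint64_t off, uint64_t len)
 *   {
 *       ra_state& ra = get_ra_state(inode);    // hypothetical accessor
 *
 *       // Report the read *before* issuing it, then issue it, then let
 *       // ra_state queue whatever readahead READs it currently permits.
 *       ra.on_application_read(off, len);
 *       issue_application_read(inode, off, len);
 *       ra.issue_readaheads();
 *   }
 *
 * on_readahead_complete() must be called from the readahead READ callback as
 * each readahead read completes.
 */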
class ra_state
{
public:
/*
* Logical section size in bytes.
     * Every time access moves to a new section, pattern detection is reset
     * and the access pattern has to prove itself sequential again.
*/
const uint64_t SECTION_SIZE = (1024ULL * 1024 * 1024);
/*
* Access density is a percentage measure of how "packed" the reads are.
* If an application is reading all over the file (aka random reads)
     * or it's reading with periodic gaps between accesses, then the access
     * density will be low and we won't consider the pattern sequential.
*/
const int ACCESS_DENSITY_MIN = 70;
/**
* Initialize readahead state.
* nfs_client is for convenience, nfs_inode identifies the target file.
* It'll query the readaheadkb config from the mount options and use that
* for deciding the readahead size.
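     *
     * For example, a readaheadkb config of 16384 would correspond to a 16MiB
     * readahead window (ra_bytes), assuming the config maps directly to
     * ra_bytes.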
*
     * TODO: If we can pass the filename, add it too for better logging.
*/
ra_state(struct nfs_client *_client,
struct nfs_inode *_inode);
/**
* Hook for reporting an application read to ra_state.
* All application read requests MUST be reported so that the readahead
* engine has complete knowledge of the application read pattern and can
* provide correct recommendations on readahead.
* This must be called *before* issuing the read and not after the read
* completes.
*/
void on_application_read(uint64_t offset, uint64_t length)
{
assert(offset < AZNFSC_MAX_FILE_SIZE);
assert((offset + length) <= AZNFSC_MAX_FILE_SIZE);
if (length == 0) {
assert(0);
return;
}
assert((int64_t) length > 0);
std::unique_lock<std::shared_mutex> _lock(ra_lock_40);
const uint64_t curr_section = (max_byte_read / SECTION_SIZE);
const uint64_t this_section = ((offset + length) / SECTION_SIZE);
        // How far this new request is from the current last byte read.
const uint64_t read_gap =
std::abs((int64_t) (offset - max_byte_read));
bool reset_readahead = false;
/*
* Scaled ra_bytes is the ra_bytes scaled to account for global cache
* pressure. We use that to decide how much to readahead.
*/
const uint64_t ra_bytes_scaled = get_ra_bytes();
/*
* If this read is beyond ra_bytes away from the current last byte
* read, then this strongly indicates a non-sequential pattern.
         * Reset the readahead state, switching to random access, and let the
         * read pattern once again prove itself sequential.
*/
if (read_gap > ra_bytes_scaled) {
reset_readahead = true;
} else if (curr_section != this_section) {
/*
* Read is within the readahead window but the section changes.
* Since the section size is usually much larger than the readahead
             * window size, this usually means one of the following two:
* 1. this_section == "curr_section + 1" (likely for seq pattern).
* 2. this_section == "curr_section - 1"
*/
if (this_section != curr_section + 1) {
assert((this_section == (curr_section - 1)) ||
(max_byte_read == UINT64_MAX));
reset_readahead = true;
} else {
/*
* Common case of sequential reads progressing to the next
* section, don't reset pattern detector.
*/
reset_readahead = !is_sequential_nolock();
}
}
if (reset_readahead) {
num_reads = 1;
num_bytes_read = length;
min_byte_read = offset;
max_byte_read = offset + length - 1;
last_byte_readahead = 0;
} else {
num_reads++;
num_bytes_read += length;
max_byte_read = std::max(max_byte_read.load(), offset + length - 1);
min_byte_read = std::min(min_byte_read.load(), offset);
}
assert(max_byte_read >= min_byte_read);
/*
* Next readahead will be from last_byte_readahead+1, so if this read
* is past the current last_byte_readahead, update last_byte_readahead.
*/
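        // CAS loop: only ever advance last_byte_readahead (never move it
        // back), retrying if another thread updates it concurrently.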
while (last_byte_readahead < max_byte_read) {
uint64_t expected = last_byte_readahead;
last_byte_readahead.compare_exchange_weak(expected,
max_byte_read);
}
}
/**
* Returns the byte offset below which it's unlikely to get any application
* read (given that the application is doing (pseudo) sequential reads).
     * The caller can use this to release any unused buffers which could not
     * be released earlier because it wasn't safe to do so when release was
     * last attempted.
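     *
     * Illustrative use (the cache object and its release_below() call are
     * hypothetical; release_till() is the only ra_state call):
     *
     *   // Drop cached data below the offset the sequential reader has
     *   // safely moved past.
     *   cache->release_below(ra.release_till());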
*/
uint64_t release_till() const
{
/*
* Our active readahead window starts from max_byte_read and spans
* get_ra_bytes() bytes. For the case of sequential reads, anything
* less than "max_byte_read - SECTION_SIZE" is unlikely to be read
* again. The margin of SECTION_SIZE helps readers like fio which
         * issue a lot of parallel reads which together work towards a
         * sequential goal but may be issued out of order, up to SECTION_SIZE
         * apart.
*/
return (max_byte_read == UINT64_MAX) ? 0 :
std::max((int64_t) (max_byte_read - SECTION_SIZE), 0L);
}
/**
     * Returns true if the currently observed access pattern is sequential.
*/
bool is_sequential() const
{
std::shared_lock<std::shared_mutex> _lock(ra_lock_40);
return is_sequential_nolock();
}
/**
     * This will issue all readaheads as permitted by get_next_ra(). As these
     * readahead reads complete, the corresponding membuf(s) will be marked
     * uptodate so subsequent application reads return data from the cache.
     * If enough readahead reads are already dispatched, or the application
     * read pattern is seen to not benefit from readahead, then
     * issue_readaheads() will be a no-op.
     * It'll call on_readahead_complete() as readahead callbacks are invoked.
*
* It returns the number of readahead read RPCs dispatched.
*/
int issue_readaheads();
/**
* Hook for reporting completion of a readahead read.
     * This MUST be called for every readahead that get_next_ra() suggested,
     * and the length parameter MUST match what was passed to get_next_ra().
     * This must be called when the readahead read completes, whether it
     * succeeded or not.
*
* Note: If you don't pass the length parameter, it uses the def_ra_size
* set in the constructor. This is the recommended usage, but if you
     *       passed a length parameter to get_next_ra() then you MUST pass the
* same length to on_readahead_complete().
*
     * Note: This is not meant to be called by the user. It is made a public
     *       method as it's called from the global readahead callback.
*/
void on_readahead_complete(uint64_t offset, uint64_t length = 0)
{
if (length == 0) {
length = def_ra_size;
assert(length > 0);
}
// ra_ongoing is atomic, don't need the lock.
assert(ra_ongoing >= length);
ra_ongoing -= length;
}
/**
* Wait for ongoing readaheads to complete.
*/
void wait_for_ongoing_readahead() const;
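
    /**
     * Returns true if the byte range [offset, offset+length) overlaps the
     * current readahead window, i.e. the (scaled) ra_bytes wide range that
     * starts just after max_byte_read.
     */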
bool in_ra_window(uint64_t offset, uint64_t length) const
{
assert((int64_t) (offset + length) >= 0);
/*
* If last_byte_readahead is 0 it mostly means we are not doing
* readaheads which is mostly true for files which are being
* written and not read.
*/
if (last_byte_readahead == 0)
return false;
/*
* Scaled ra_bytes is the ra_bytes scaled to account for global cache
         * pressure. We use that to compute the extent of the readahead
         * window.
*/
const uint64_t ra_bytes_scaled = get_ra_bytes();
const uint64_t le = offset;
const uint64_t re = offset + length;
const uint64_t lra = max_byte_read + 1;
const uint64_t rra = max_byte_read + ra_bytes_scaled;
const bool ends_before = re <= lra;
const bool starts_after = le > rra;
return !ends_before && !starts_after;
}
/**
     * XXX This is provided as a quick hack to reset the readahead state on
     *     file open, so that accesses to the file through the new handle do
     *     not get confused by ra_state with accesses through the older
     *     handle. A new handle means a new access pattern, so we cannot
     *     continue using the readahead state from the older handle.
*/
void reset()
{
num_reads = 0;
num_bytes_read = 0;
min_byte_read = 0;
max_byte_read = UINT64_MAX;
}
/**
     * Returns the scaled readahead window size (in bytes) that the caller can
     * safely use.
*/
uint64_t get_ra_bytes() const;
/**
* This will run self tests to test the correctness of this class.
*/
static int unit_test();
private:
/**
* This private constructor is only to be called from unit_test().
*/
ra_state(int _ra_kib, int _def_ra_size_kib) :
ra_bytes(_ra_kib * 1024),
def_ra_size(std::min<uint64_t>(_def_ra_size_kib * 1024ULL, ra_bytes))
{
/*
         * Some sanity asserts.
         * Readahead of less than 128KB is not effective and more than 1GB
         * is unnecessary.
         * Readahead reads must also have a reasonable size.
*/
assert(_ra_kib >= 128 && _ra_kib <= 1024*1024);
assert(_def_ra_size_kib >= 8 && _def_ra_size_kib <= 16*1024);
AZLogInfo("[TEST] Readahead set to {} KiB with default RA size {} KiB",
_ra_kib, _def_ra_size_kib);
}
/**
* Returns the offset of the next readahead to issue. Caller must pass the
* length of the readahead it wants to issue.
     * A return value of 0 indicates "don't issue a readahead read"; this is
     * mostly caused by the recent application read pattern having been
     * identified as non-sequential, or by the ongoing readaheads already
     * amounting to ra_bytes.
*
     * If this function returns a non-zero value, then the caller SHOULD issue
     * a readahead read of 'length' bytes (or less) at the returned offset and
     * MUST call on_readahead_complete() with this 'length' when the readahead
     * read completes, to let ra_state know.
     * Note that the length argument to on_readahead_complete() MUST be
     * 'length' even if the readahead read ends up reading less.
*
* Note: If you don't pass the length parameter, it uses the def_ra_size
* set in the constructor. This is the recommended usage.
*
* Note: It doesn't track the file size, so it may recommend readahead
* offsets beyond eof. It's the caller's responsibility to handle
* that.
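     *
     * Illustrative call pattern (a sketch of what issue_readaheads() may do;
     * the issue_readahead_read() helper is hypothetical):
     *
     *   int64_t ra_off;
     *   while ((ra_off = get_next_ra()) != 0) {
     *       // Issue a READ of def_ra_size bytes at ra_off. Its completion
     *       // callback must call on_readahead_complete(ra_off).
     *       issue_readahead_read(inode, ra_off, def_ra_size);
     *   }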
*/
int64_t get_next_ra(uint64_t length = 0);
bool is_sequential_nolock() const
{
/*
         * Need a minimum of 3 reads from the current section to assess the
         * access pattern.
*/
if (num_reads < 3) {
return false;
}
const int64_t access_range = (max_byte_read - min_byte_read + 1);
assert(access_range > 0);
const int access_density = (num_bytes_read * 100) / access_range;
/*
         * This can happen in case of duplicate reads, which we do not treat
         * as sequential.
*/
if (access_density > 100) {
return false;
}
return (access_density > ACCESS_DENSITY_MIN);
}
/*
* The singleton nfs_client, for convenience.
*/
struct nfs_client *client;
/*
* File inode we are tracking the readaheads for.
*/
struct nfs_inode *inode;
/*
* Total readahead size in bytes, aka the "readahead window".
* Readahead reads recommended by us will always be less than
* "max_byte_read + ra_bytes".
*/
const uint64_t ra_bytes;
/*
* Default size of a readahead IO returned by get_next_ra() if length is
* not passed explicitly. This is initialized in the constructor and later
* used.
*/
const uint64_t def_ra_size;
/*
* Last byte of readahead read recommended by most recent call to
* get_next_ra(). Next readahead recommended will start at the next byte
* after this.
* This is reset when pattern detection is reset.
*/
std::atomic<uint64_t> last_byte_readahead = 0;
/*
* Smallest and largest byte read in the current section. These point to
     * the minimum and the maximum byte read, so if the application reads 3 bytes
* at offset 0, we will have:
* min_byte_read == 0, and
* max_byte_read == 2.
     * These are updated as the application reports its read calls
* through on_application_read().
* These are reset when pattern detection is reset.
*/
std::atomic<uint64_t> min_byte_read = 0;
std::atomic<uint64_t> max_byte_read = UINT64_MAX;
/*
* Current ongoing readahead bytes.
* This depends on application correctly informing us of readahead reads
* completing by calling on_readahead_complete().
* This is not reset when pattern detection is reset.
*/
std::atomic<uint64_t> ra_ongoing = 0;
/*
* Number of read calls and number of bytes read by those, in the current
* section.
* These are reset when pattern detection is reset.
*/
std::atomic<uint64_t> num_reads = 0;
std::atomic<uint64_t> num_bytes_read = 0;
/*
* Common memory pressure code will update this scaling factor to force
* all ra_state machines to slow down in case if high memory pressure.
*/
static std::atomic<double> ra_scale_factor;
/*
* Lock for safely accessing/updating above state.
*/
mutable std::shared_mutex ra_lock_40;
};
} // namespace aznfsc
#endif /* __READAHEAD_H__ */