
Related Work

 

Prefetching is an old idea. It has been studied extensively in various areas, including prepaging, prefetching of files, and prefetching of database objects. Prepaging has not had a major impact in computer architecture because of the tight time and complexity constraints on paging hardware and software. Prefetching of files and database objects, however, is a more promising endeavor for two reasons. First, since file and database accesses are less frequent than page accesses, the decision to prefetch need not be made as quickly. Second, the resource most needed to arrange intelligent prefetching, namely client CPU cycles, is the resource most abundant in distributed systems now and in the likely future.

Some researchers have looked into prefetching blocks within files. Sequential readahead [7, 17] is the simplest and yet the most successful approach. The work of Kotz and Ellis (see, for example, [14]) focuses on the use of prefetching to increase I/O bandwidth on MIMD shared-memory architectures. Their prefetching methods are geared to the patterns of many scientific and database applications running on multiprocessors. In contrast to these methods, our work looks for access patterns across files.

Some file prefetching methods require that each application inform the operating system of its future demands. This includes the TIP project by Patterson et al. [21, 22] and the work of Cao, Felten, Karlin and Li [3, 4]. These researchers consider the interaction between prefetching and caching more carefully. TIP uses application-disclosed access patterns to dynamically allocate file buffers between the competing demands of prefetching and caching, based on a cost-benefit model. Cao et al. allow applications to pass down both prefetching hints and caching hints. They then employ an integrated algorithm for prefetching and caching, which is shown to be theoretically near-optimal. These ``informed'' approaches possess an advantage over ours in that prefetching is driven not by deductions made after snooping, but rather by certain knowledge provided in advance by higher levels. There is no danger that disastrously incorrect speculative prefetching might trash the cache. On the other hand, these approaches require re-coding applications. Also, the prefetching mechanism must act within the interval between when the higher level learns of the need to do I/O and when it actually initiates I/O; this interval may not always be sufficient to perform the prefetch I/O.

Like ours, a number of other prefetching methods are completely transparent to clients. They use past accesses to predict future accesses. While we seek to build semantic structures, i.e., access trees, that are endowed with application-level meaning, most of the other approaches use a probabilistic method to model user behavior. In its most general form, a probabilistic method regards file references as a string of events, and uses information about the last C events to estimate the frequency distributions of the next L events (C and L are called the context order and lookahead size, respectively). For simplicity, however, existing methods fix either the context order or the lookahead at one. Unlike our tree-based approach, there is no attempt to build in an ``understanding'' of why files are likely to be referenced together.
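
To make this general form concrete, the following minimal Python sketch maintains, for each context of the last C references, frequency counts of the files observed within the following L references. The class and method names are ours for illustration only and do not correspond to any cited system.

    from collections import Counter, defaultdict, deque

    class ContextModel:
        """Order-C context model with lookahead L: the last C references
        form a context, and the model counts which files appear within
        the next L references after that context."""

        def __init__(self, context_order, lookahead):
            self.C = context_order
            self.L = lookahead
            self.window = deque(maxlen=context_order + lookahead)
            self.counts = defaultdict(Counter)  # context tuple -> follower counts

        def record(self, file_id):
            self.window.append(file_id)
            # Once the sliding window is full, credit its leading context
            # with every file in the lookahead portion.
            if len(self.window) == self.C + self.L:
                refs = list(self.window)
                context = tuple(refs[:self.C])
                for follower in refs[self.C:]:
                    self.counts[context][follower] += 1

        def predict(self, n=1):
            # Suggest the n most frequent followers of the current context.
            context = tuple(list(self.window)[-self.C:])
            return [f for f, _ in self.counts[context].most_common(n)]

In these terms, the existing methods discussed below operate with either C = 1 or L = 1.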

Some recent examples of work based on probabilistic methods include a paper by Curewitz et al. [6] on prefetching objects in an object-oriented database, and work by Griffioen and Appleton [9, 10] and by Kroeger and Long [15] on prefetching whole files. Both [6] and [15] adapt context modeling techniques used in data compression to predict the next access. Their work is inspired by the idea that a good data compressor should also be able to predict future data well. Griffioen and Appleton's work, in comparison, employs a ``probability graph'' that, for each file, accumulates frequency counts of all files accessed within the lookahead window.
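
As an illustration, such a probability graph might be maintained as sketched below. The prefetch threshold rule and all identifiers here are our assumptions, not details taken from [9, 10].

    from collections import Counter, defaultdict, deque

    class ProbabilityGraph:
        """A sketch in the spirit of Griffioen and Appleton's probability
        graph: each file accumulates frequency counts of the files
        accessed within a lookahead window after it."""

        def __init__(self, lookahead, threshold=0.25):
            self.window = deque(maxlen=lookahead)   # most recent references
            self.followers = defaultdict(Counter)   # file -> follower counts
            self.observed = Counter()                # file -> followers seen
            self.threshold = threshold

        def record(self, file_id):
            # The new reference counts as a follower of every file
            # still inside the lookahead window.
            for earlier in self.window:
                self.followers[earlier][file_id] += 1
                self.observed[earlier] += 1
            self.window.append(file_id)

        def prefetch_candidates(self, file_id):
            # Suggest followers whose estimated conditional probability
            # of access exceeds the threshold.
            n = self.observed[file_id]
            if n == 0:
                return []
            return [f for f, c in self.followers[file_id].items()
                    if c / n >= self.threshold]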

In the initial stages of our work, we considered probabilistic modeling. We implemented (in the simulator only) two extremely simple probabilistic methods, which we called ``stupid pairs'' and ``smart pairs.'' A stupid pair records, for each file, only the file that most recently followed it, and prefetches that single successor; a smart pair accumulates frequency counts of all observed successors and prefetches the most frequent one.
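
The following minimal Python sketch captures the two schemes as just described; the names and implementation details are ours.

    from collections import Counter, defaultdict

    class StupidPairs:
        """Remember, for each file, only the file that most recently
        followed it, and prefetch that single successor."""

        def __init__(self):
            self.last_successor = {}
            self.prev = None

        def record(self, file_id):
            if self.prev is not None:
                self.last_successor[self.prev] = file_id
            self.prev = file_id

        def predict(self, file_id):
            return self.last_successor.get(file_id)

    class SmartPairs:
        """Accumulate frequency counts of every observed successor and
        prefetch the historically most frequent one."""

        def __init__(self):
            self.successor_counts = defaultdict(Counter)
            self.prev = None

        def record(self, file_id):
            if self.prev is not None:
                self.successor_counts[self.prev][file_id] += 1
            self.prev = file_id

        def predict(self, file_id):
            counts = self.successor_counts[file_id]
            return counts.most_common(1)[0][0] if counts else None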

The stupid pair scheme worked better than the smart pair approach when applied to our traces. This unexpected result is explained by the locality of file accesses: often, a user works on one file or one group of files for some time, then moves on to similar operations with different files. Stupid pairs are well-equipped to handle this usage pattern, since they invariably prefetch the most recently used successor file. The seemingly superior intelligence of smart pairs actually becomes a liability when locality is strong: files no longer in active use may be given undue weight if they were heavily accessed in the past.

This phenomenon, that less information can be better provided it is more recent, identifies a common weakness of existing probabilistic methods: they do not address the problem of aged information, instead making predictions based on the total history of accesses. In addition, single-lookahead methods like [6] and [15] are less likely to be widely applicable: as I/O latencies grow in terms of CPU cycles, prefetches must begin ever further in advance if they are to complete in time. On the other hand, single-context-order methods like that of Griffioen and Appleton fail to make full use of historical file access information, and thus are unable to confidently infer accesses far into the future.

Like ours, Palmer and Zdonik's work on Fido [20] explicitly recognizes and maintains access patterns. Their work differs from ours in several important respects. First, their work was conducted in the context of object-oriented database systems, whereas our context is file systems. Second, they represent access patterns as strings of object identifiers, with no semantics involved, while we represent patterns with access trees. Third, they employ a specialized pattern memory, while we store pattern trees in virtual memory. Finally and most notably, Fido requires a separate training phase after each user session, while our mechanism is more nearly on-line: no off-line computation or periodic analysis is needed.

An issue related to prefetching is hoarding [11, 12, 16, 25]. Both prefetching and hoarding involve anticipatory file fetches: bringing files from remote servers into a local cache before they are needed. The two techniques are not identical, however. Hoarding is a scheme designed to increase the likelihood that a mobile client will be able to continue working during periods of total disconnection from file servers. Since hoarding is a relatively infrequent operation performed only at a client's request prior to disconnection, timing is not critical. Prefetching, on the other hand, is mainly concerned with improving performance, so timing is important. With prefetching, the file server is assumed to remain accessible, although the network connectivity may be weak. A cache miss is much more catastrophic during disconnected operation, hence hoarding is typically willing to overfetch substantially in order to enhance the availability of files. Despite the differences, our idea of uncovering and exploiting the semantic structure underlying file references also applies to the hoarding problem, as shown in [25].

