Replicate and Migrate Objects in the Runtime, not Cache Lines or Pages in Hardware

Manolis Katevenis (FORTH)

Tasks of parallel or parallelized programs cooperate by exchanging data, which are transferred from one local or cache memory to another. If we let hardware prefetchers and cache coherence perform these transfers, significant network bandwidth and energy are consumed, especially under directory-based coherence, and extra latency is incurred whenever prefetchers fail to predict software behavior correctly. Recent advances in programming models and runtime systems allow runtime libraries to know, during task scheduling and execution, when specific software objects should be transferred and from where to where; the runtime can thus manage locality explicitly and economize on network packets and energy.
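As an illustration only, the following minimal Python sketch shows how a runtime might derive explicit transfers at scheduling time from the objects a task declares it will read and write. The names `Runtime`, `ObjectInfo`, `register`, and `schedule`, as well as the rule "copy for reads, move for writes", are assumptions made for this sketch and do not describe any particular runtime system.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectInfo:
    size_bytes: int
    owner: int                                   # node holding the authoritative copy
    replicas: set = field(default_factory=set)   # nodes holding read-only copies


class Runtime:
    """Hypothetical runtime that tracks where each software object lives."""

    def __init__(self):
        self.table = {}  # object name -> ObjectInfo

    def register(self, name, size_bytes, owner):
        self.table[name] = ObjectInfo(size_bytes, owner)

    def schedule(self, node, reads=(), writes=()):
        """Return the explicit transfers needed before a task runs on `node`.

        Each transfer is (kind, object, source node, destination node, bytes).
        """
        transfers = []
        for name in reads:
            if name in writes:
                continue  # objects that are also written are handled below
            info = self.table[name]
            if node != info.owner and node not in info.replicas:
                # Read-only use: copy the whole object once.
                transfers.append(("replicate", name, info.owner, node, info.size_bytes))
                info.replicas.add(node)
        for name in writes:
            info = self.table[name]
            if node != info.owner:
                # Written object: move the whole object to the writer.
                transfers.append(("migrate", name, info.owner, node, info.size_bytes))
                info.owner = node
            # Any read-only copies are stale once the object is written.
            info.replicas.clear()
        return transfers


rt = Runtime()
rt.register("A", 4096, owner=0)
rt.register("B", 8192, owner=1)
first = rt.schedule(node=2, reads=["A"], writes=["B"])
second = rt.schedule(node=2, reads=["A"], writes=["B"])
```

In this sketch the first call yields one whole-object copy and one whole-object move, and the second call yields no transfers because both objects are already local to node 2.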

We argue that the runtime tables that hold this knowledge for explicit communication fulfill goals analogous to those of coherence directories, and can thus obviate hardware coherence. Furthermore, these runtime tables also serve functions analogous to those of page tables, so traditional virtual memory could perhaps be replaced by a simpler scheme used for protection purposes only. In such new systems, the runtime instructs the hardware to replicate or migrate entire (variable-size) "objects", rather than individual cache lines or pages one at a time. When a large data structure spans several such objects, inter-object pointers become a problem. We argue for a new breed of parallel data structures and algorithms that operate in units of objects larger than traditional small data-structure nodes, analogous to what the database community did long ago for disk-resident data.
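To make the idea of object-granularity data structures concrete, here is a small Python sketch of a list stored as a chain of multi-element chunks, in the spirit of the multi-record disk blocks used by databases. Linking chunks by object ID through a table, rather than by raw pointers, is one conceivable way to sidestep inter-object pointers; it is an assumption of this sketch, not something the text prescribes. `ChunkedList` and `CHUNK_CAPACITY` are hypothetical names.

```python
CHUNK_CAPACITY = 4  # hypothetical; a real chunk would hold far more elements


class ChunkedList:
    """A list whose unit of storage is a multi-element chunk ("object")."""

    def __init__(self):
        self.objects = {}   # object ID -> chunk; stands in for a runtime object table
        self.head = None    # object ID of the first chunk
        self.tail = None    # object ID of the last chunk
        self._next_id = 0

    def _new_chunk(self):
        oid = self._next_id
        self._next_id += 1
        # "next" stores an object ID, not a memory address.
        self.objects[oid] = {"items": [], "next": None}
        return oid

    def append(self, value):
        if self.tail is None:
            self.head = self.tail = self._new_chunk()
        elif len(self.objects[self.tail]["items"]) == CHUNK_CAPACITY:
            oid = self._new_chunk()
            self.objects[self.tail]["next"] = oid
            self.tail = oid
        self.objects[self.tail]["items"].append(value)

    def __iter__(self):
        oid = self.head
        while oid is not None:
            chunk = self.objects[oid]   # one table lookup per chunk, not per element
            yield from chunk["items"]
            oid = chunk["next"]


lst = ChunkedList()
for i in range(10):
    lst.append(i)
```

Traversing the ten elements here touches three chunks, so a runtime moving data at chunk granularity would issue three transfers instead of one per list node.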