Strings : Software Distributed Shared Memory

 

Applications for distributed memory systems are cumbersome to develop due to the need for programmers to handle communication primitives explicitly, just as coding in MPI. In addition, applications have to be tuned for each individual architecture to achieve reasonable performance. Since hardware shared memory machines do not scale well and are relatively expensive to build, software distributed shared memory (DSM) systems are gaining popularity for providing a logically shared memory over physically distributed memory. These software DSM systems combine programming advantages of shared memory and the cost advantages of distributed memory. The programmer is given the illusion of a large global address space encompassing all available memory, thereby eliminating the task of explicitly moving data between processes located on separate machines.

DSMs share data at the relatively large granularity of a virtual memory page and can suffer from a phenomenon known as "false sharing", wherein two processes simultaneously attempt to write to different data items that reside on the same page. If only a single writer is permitted, the page may ping-pong between the nodes. One solution to this problem is to ``hold" a freshly arrived page for some time before releasing it to another requester. Relaxed memory consistency models that allow multiple concurrent writers have also been proposed to alleviate this symptom. The systems ensure that all nodes see the same data at well defined points in the program, usually when synchronization occurs. Extra effort is required to ensure program correctness in this case. One technique that has been investigated to improve DSM performance is the use of multiple threads of control in the system. Up to now, the third generation DSM systems utilize relaxed consistency models and multithreading technologies.

Strings is built using POSIX threads, which can be multiplexed on kernel lightweight processes. The kernel can schedule these lightweight processes across multiple processors on symmetrical multiprocessors (SMPs) for better performance. Therefore, in Strings, each thread could be assigned to any processor on the SMP if there is no special request, and all local threads could run in parallel if there are enough processors. Strings is designed to exploit data parallelism by allowing multiple application threads to share the same address space on a node. Additionally, the protocol handler is multi-threaded. The overhead of interrupt driven network I/O is avoided by using a dedicated communication thread. Strings is designed to exploit data parallelism at the application level and task parallelism at the run-time level.

  • Execution Model

Strings starts a master process that forks child processes on remote nodes using rsh(). Each of these processes creates a dsm_server thread and a communication thread. The forked processes then register their listening ports with the master. The master process enters the application proper and creates shared memory regions. It then creates application threads on remote nodes by sending requests to the dsm_server threads on the respective nodes. Shared memory identifiers and global synchronization primitives are sent as part of the thread create call. The virtual memory subsystem is used to enforce coherent access to the globally shared regions.

  • Kernel Threads

Thread implementations can be either user-level, usually implemented as a library, or kernel-level in terms of light-weight processes. Kernel level threads are more expensive to create, since the kernel is involved in managing them. User level threads suffer from some limitations, since they are implemented as a user-level library, they cannot be scheduled by the kernel. If any thread issues a blocking system call, all associated threads will also be blocked. Also on a multi-processor system, user-level threads bound to a light-weight process can only on one processor at a time. User level threads do not allow the programmer to control their scheduling within the process, on the other hand kernel level threads can be scheduled by the operating system across multiple processors.

  • Shared Memory

    Strings implements shared memory by using the mmap() call to map a file to the bottom of the stack segment. With dynamically linked programs, it was found that mmap() would map the same page to different addresses on different processors. Allowing multiple application threads on the same node leads to a peculiar problem. Once a page has been fetched from a remote node, its contents must be written to the corresponding memory region, so the protection has to be changed to writable. At this time no other thread should be able to access this page. Suspending all kernel level threads can lead to a deadlock and also reduce concurrency. In Strings, every page is mapped to two different addresses. It is then possible to write to the shadow address without changing the protection of the primary memory region.

  • Memory Consistency Model

A release consistency model using an update protocol has been implemented. When a thread tries to write to a page, a twin copy of the page is created. When either a lock is released or a barrier is reached, the difference (diff) between the current contents and its twin are sent to threads that share the page. Multiple diffs are aggregated to decrease the number of messages sent.

Download

Team

  • Prof. Vipin Chaudhary
  • Dr. Sumit Roy
  • Darshan Thaker
  • Hai Jiang
  • Yanqing Ji

Related Publications

  • Cost-Performance Evaluation of SMP Clusters

  • D. Thaker, V. Chaudhary, G. Edjlali, and S. Roy
    In Proc. of the Intl. Conference on Parallel and Distributed Processing Techniques and Applications, pp. 718 - 724, Las Vegas, Nevada, June 28 - July 1, 1999.