Statement of Work
The goal of
this project is to evaluate the possibility of developing a high-performance
implementation of the Portals 3.0 API on TNet.
Background
The Portals
3.0 API was developed as a joint project between Sandia National
Laboratories and the Scalable Systems Lab at the University of New
Mexico. Like many other high-performance message-passing APIs (e.g.
Scheduled Transfer and Virtual Interface Architecture), the Portals
API supports OS-bypass. OS-bypass is motivated by the high cost,
in terms of time, of servicing interrupts during high-speed
communication. In OS-bypass, the relevant policies of the
OS are implemented in a control program that runs on the Network
Interface Card (NIC), thus eliminating the need to generate many
of the interrupts associated with high-speed communication. In addition
to OS-bypass, the Portals API also supports "application-bypass."
Application-bypass is motivated by the need to minimize memory copies
during communication. In application-bypass, the policies of the
application regarding message placement are implemented on the NIC.
Because the NIC is able to deliver messages to the correct location
based on the contents of the message, the application is able to
avoid a costly memory copy operation.
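As a rough illustration of application-bypass, the following C sketch shows
how a NIC control program might select a destination buffer from the contents
of an incoming message header. All names here (msg_header, match_entry,
deliver) and the match/ignore-bit scheme are assumptions made for this
example; they are not taken from the Portals 3.0 API or from TNet. The
memcpy call merely stands in for the NIC writing the payload directly into
application memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header carried by every incoming message. */
typedef struct {
    uint64_t match_bits;   /* tag chosen by the sender */
    size_t   length;       /* payload length in bytes */
} msg_header;

/* Hypothetical entry that the application posts before messages arrive. */
typedef struct {
    uint64_t match_bits;   /* tag this buffer accepts */
    uint64_t ignore_bits;  /* bits set here are excluded from the comparison */
    void    *buffer;       /* final destination in application memory */
    size_t   capacity;     /* size of the destination buffer in bytes */
} match_entry;

/*
 * Place the payload in the first posted buffer whose tag matches the header
 * and which is large enough. Returns the index of the entry used, or -1 if
 * no entry accepts the message.
 */
int deliver(const match_entry *entries, size_t n,
            const msg_header *hdr, const void *payload)
{
    assert(entries != NULL && hdr != NULL && payload != NULL);

    for (size_t i = 0; i < n; i++) {
        const match_entry *e = &entries[i];
        /* Compare only the bits that the entry does not ignore. */
        uint64_t diff = (hdr->match_bits ^ e->match_bits) & ~e->ignore_bits;

        if (diff == 0 && hdr->length <= e->capacity) {
            /* Stand-in for the NIC's direct write into the final buffer. */
            memcpy(e->buffer, payload, hdr->length);
            return (int)i;
        }
    }
    return -1;
}
```

Because the decision is made from the header alone, the application does not
need to receive the message into a temporary buffer and copy it afterwards.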
The company
Supercomputing Systems in Zurich designed a custom network called
TNet for the parallel computing project "Swiss-Tx" at the Swiss
Federal Institute of Technology in Lausanne (EPFL). On TNet, the
message-passing library MPI runs on top of the hardware-interpreted
Fast Communication Interface (FCI), which enables a direct
store from one processor into the memory of another. Because
the network interface card carries a large FPGA and 16 MB or more
of memory, it allows a flexible and fast implementation of any
communication protocol. Putting the time-critical parts of a protocol
into hardware makes it possible to optimize the latency and throughput
of high-performance networks.
Project Scope
The goal of
this project is to design and develop an initial implementation
of the Portals 3.0 API for TNet. We will start from the reference
implementation of the Portals 3.0 API. The Portals 3.0 reference
implementation uses a Network Abstraction Layer (NAL) to remain
independent of protection domains. That is, all calls to
functions in the NAL are implemented as callbacks, which may or
may not cross protection-domain boundaries. The three protection
domains of interest are the application, the OS (kernel), and the
domain defined by the control program on the NIC.
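The callback arrangement described above can be sketched in C as a table of
function pointers. This is only a minimal illustration under stated
assumptions: the names nal_t, loopback_state, loopback_send, and lib_put are
hypothetical and do not reproduce the interface of the Portals 3.0 reference
implementation. The loopback back end stays within a single protection
domain; a back end for the kernel or the NIC would provide a different send
function behind the same pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical abstraction layer: the library sees only this table. */
typedef struct nal nal_t;
struct nal {
    const char *domain;                                   /* label for the back end */
    int (*send)(nal_t *nal, const void *buf, size_t len); /* callback into the back end */
    void *state;                                          /* back-end private data */
};

/* Private data of a trivial back end that keeps the last message sent. */
typedef struct {
    char   data[64];
    size_t used;
} loopback_state;

/* Back-end callback: stores the message locally; returns 0, or -1 if too long. */
int loopback_send(nal_t *nal, const void *buf, size_t len)
{
    loopback_state *s = nal->state;

    if (len > sizeof s->data)
        return -1;
    memcpy(s->data, buf, len);
    s->used = len;
    return 0;
}

/*
 * Library-side code: it never names a back end directly, so the same
 * function works no matter which protection domain the callback crosses.
 */
int lib_put(nal_t *nal, const void *buf, size_t len)
{
    assert(nal != NULL && nal->send != NULL);
    return nal->send(nal, buf, len);
}
```

A caller would fill in a nal_t with the back end's function and state, for
example `nal_t nal = { "loopback", loopback_send, &st };`, and then invoke
`lib_put(&nal, "ping", 5)`.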
The primary
goal of this project is to design an implementation of the Portals
API that places as much of the functionality on the NIC as is feasible.
This design would define the goal of a full implementation. A secondary
goal is to develop a preliminary implementation of this design.
In the preliminary implementation, much of the Portals functionality
will remain in the application and OS domains and the NIC will have
minimal functionality.