GASNet Performance

The following graphs show performance of GASNet release v1.4, measured 11/2004.

Performance on newer systems

Jump to: Summary
elan: Itanium2-QsNet2/elan4, AlphaServer-QsNet1/elan3
gm: Itanium2-Myrinet/GM2
lapi: Power4-Federation/LAPI and GASNet vs MPI
shmem: Itanium2-Altix3000/SHMEM, CrayX1/SHMEM and GASNet vs MPI
udp: Itanium2-GigabitEthernet - GASNet vs MPI
vapi: Opteron-Infiniband/VAPI - GASNet vs MPI

Summary: GASNet performance across many networks, machines and conduits


elan-conduit: on 'MPP2' at PNNL

1.5 GHz Intel Itanium-2, Quadrics QsNet2/Elan4 interconnect, 8GB main mem


elan-conduit: on the Lemieux TSC at the Pittsburgh Supercomputing Center

Compaq AlphaServer SC, ES45 nodes, Quadrics QsNet1/Elan3 interconnect, double-rail (tested with a single rail only)
750 nodes, 4-way 1GHz Alpha, 4GB per node, libelan 1.3, OSF 5.1


gm-conduit: on the Berkeley CITRIS cluster

Itanium-2 Linux Cluster, LANai10.0 PCI-X, GM 2.0.8 
64-node, 2-way 1.3 GHz Itanium-2, 4GB 
PCI-X bus bandwidth: 482 MB/sec read, 501 MB/sec write


lapi-conduit: on the NPACI/SDSC 'DataStar' IBM SP

IBM Federation Interconnect, LAPI v., 8-way 1.5GHz Power4, 16GB main mem

Comparing GASNet lapi-conduit and IBM MPI back-to-back on the same Power4/Federation hardware:
Both GASNet and IBM MPI are built on top of the IBM LAPI interface. GASNet consistently and significantly outperforms IBM MPI because GASNet's lightweight semantics are a closer match to the operations exposed by the LAPI interface, and they eliminate MPI's tag/communicator matching and message-ordering enforcement, which generally impose additional copies and CPU overhead.

shmem-conduit: on the Altix 3000 'Ram' at ORNL

1.5 GHz Intel Itanium-2, 6MB L3, 256KB L2, 32K L1, 2 TB system memory (8GB/node)


shmem-conduit: on the Cray X1 'Phoenix' at ORNL

512 MSPs, 2MB cache per MSP, 16 GB per node


udp-conduit vs MPICH-p4 MPI: 

Comparing GASNet udp-conduit and MPICH-p4 MPI back-to-back on the same Gigabit Ethernet hardware:
1.3 GHz dual Itanium-2, 4GB, Broadcom NetXtreme BCM5701 Gigabit Ethernet, Linux 2.4.20, PCI-X
PCI-X bus bandwidth: 482 MB/sec read, 501 MB/sec write

Both layers pay a high latency cost due to the kernel crossings and store-and-forward routing required by Ethernet (which generally make Ethernet unsuitable network hardware for parallel scientific computing). However, GASNet consistently and significantly outperforms MPICH-p4. MPICH-p4 is built on TCP, a heavyweight, non-scalable, connection-based protocol that often imposes extra copies and includes reliability and ordering machinery designed for streaming applications. GASNet builds its operations directly on UDP, a lightweight unordered datagram protocol that is the lowest-level protocol portably available in the TCP/IP stack. UDP is unreliable, so GASNet achieves reliability using a protocol designed and tuned specifically for the needs of HPC (see AMUDP). GASNet semantics never guarantee ordering, and the implementation gains performance by not providing it.

vapi-conduit vs MVAPICH MPI: 

Comparing GASNet vapi-conduit and OSU MVAPICH MPI back-to-back on the same Infiniband hardware:
Dual 1.4 GHz Opteron, 1GB main memory, Linux 2.4.21-bigphys, Mellanox InfiniHost (Cougar) IB-4X HCAs, Mellanox drivers 3.1, firmware 3.0.0
MVAPICH numbers are from their own unmodified tester, re-normalized to bandwidth in MB/sec (MB = 2^20 bytes) and full round-trip latency. GASNet numbers are from the GASNet testsmall and testlarge benchmarks.

GASNet consistently and significantly outperforms MVAPICH because GASNet's one-sided put/get semantics are fundamentally a better match for the capabilities of the underlying RDMA hardware than MPI's two-sided message-passing semantics. GASNet puts and gets become simple, fully one-sided RDMA operations in the common case, and therefore reap the hardware's peak performance, whereas MPI pays a performance penalty to enforce its ordering and tag-matching semantics.

