
Talk About HPC Bang For Your Buck, How About Ka-Boom For The Server Room

An HPC product review featuring the Appro 1U Tetra GP-GPU server

Reviewing HPC hardware is not easy. You usually need to travel to a data center and look at a rack of servers while someone tells you where they landed on the Top500 list. One could review a single server, but they are all pretty much the same inside: they run Linux and use either AMD or Intel processors. In addition, testing a cluster takes time, because running meaningful programs that exercise the whole system must be done carefully. And finally, clusters are not sitting on the “shelf”; each one varies by customer depending on packaging, interconnect, processor, and storage choices.

Recently, Appro asked me if I wanted to review their Appro 1U Tetra GPU Server (model 1426G4). This system packs quite a bit of processing power into a small space. The base server uses the Intel 5520 chipset and has two six-core Intel Xeon X5670 processors (2.93GHz), 48GB of DDR3 memory (expandable to 96GB), and a 250GB 2.5-inch hard drive (up to 3.0TB of disk storage using six 2.5-inch SATA drive bays). There is also one available PCIe 2.0 x4 slot and a 1400 Watt power supply.

By itself this is a formidable piece of computing hardware. Add four, that’s right, four NVidia® Tesla™ M2050 (Fermi) cards and you have 12 CPU cores and 1,792 GPU cores in a single 1U server. This server is possibly the most powerful 1U box ever constructed. My plan was to put this system through some tests and get a feel for this level of dense computing. After all, it is not every day I get to crank out hundreds of GigaFLOPS in my basement.

Packaging

From the outside, the Tetra GP-GPU server looks like a standard 1U box. The small front panel has six SATA (2.5 inch) hard drive bays, a cold swap power supply, and the standard array of small switches and LEDs (i.e. power, reset, HDD activity, etc.). The Tetra also has a GPU power enable/disable switch that allows the server to run without supplying power to the GPUs (more on this feature below).

The rear of the unit has the standard connections, including the power cord, two USB ports, two Ethernet ports, an IPMI LAN port, VGA, a serial port, an ID LED, and a Gen 2 PCIe low-profile slot.
There is also a rack mounting kit with rails.

As soon as you pop the lid on this server you know something is different inside. The internal configuration is shown in Figure One below. The server motherboard is in the middle with two M2050 cards on either side. Twelve fans are employed to cool the system; they are temperature controlled and will increase in speed as needed. As with all Appro hardware, the construction is solid and clean. There are also instructions in the Quick Start Guide on how to remove the hard disks, the power supply, and the GPU cards.

Figure One: Internals of Appro 1U Tetra GPU Server (front is to the right)

Software

The system was ready to run out of the box. It had Red Hat 5.3 and both the NVidia drivers and the CUDA SDK (3.0) installed. (Note: there is now an updated CUDA SDK available.) The system booted as expected and when I logged in I found a familiar Red Hat server environment. The install was actually quite complete. In the past, I have often found myself installing RPMs (usually gfortran) in order to run some tests. This was the first time I did not have to install any extra software to run tests.

I was able to build the CUDA SDK examples by simply typing “make“. The programs ran without any issues. The first example program I ran was the deviceQuery diagnostic program. I wanted to make sure all four Tesla M2050 cards were recognized and working. As expected, all the devices reported in and the CUDA environment seemed to be working correctly. The truncated output is shown below. (Note: each Tesla M2050 has 448 cores, for a total of 1,792 cores.)

# deviceQuery
deviceQuery Starting…

 CUDA Device Query (Runtime API) version (CUDART static linking)

There are 4 devices supporting CUDA

Device 0: "Tesla M2050"
  CUDA Driver Version:                           3.0
  CUDA Runtime Version:                          3.0
  CUDA Capability Major revision number:         2
  CUDA Capability Minor revision number:         0
  Total amount of global memory:bytes
  Number of multiprocessors:                     14
  Number of cores:                               448
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  …
  (continues for all four GPUs)
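
If you want to script this kind of sanity check (for instance, confirming all four M2050s report in after a reboot), the same information is available from a few CUDA runtime calls. The short listing below is my own sketch, not part of the SDK; it assumes nvcc is on your path (e.g. nvcc -o countgpus countgpus.cu).

// countgpus.cu - quick check that all expected CUDA devices report in
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        fprintf(stderr, "No CUDA devices found\n");
        return 1;
    }
    printf("Found %d CUDA device(s)\n", n);
    for (int i = 0; i < n; i++) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("Device %d: %s, %d multiprocessors, %.0f MB global memory, compute %d.%d\n",
               i, p.name, p.multiProcessorCount,
               p.totalGlobalMem / (1024.0 * 1024.0), p.major, p.minor);
    }
    return 0;
}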

I ran some of the other example programs, including the classic nbody on all four GPUs.
The CUDA SDK also includes a memory bandwidth test that measures the transfer speed between the host and the Tesla M2050. I wanted to see if using “pinned” memory helped these transfers. (Pinned memory cannot be moved or swapped out by the OS and therefore provides faster reading/writing.) I ran bandwidthTest on one of the GPUs in both “un-pinned” (paged) and “pinned” modes. The results are shown below.

# ./bandwidthTest
./bandwidthTest Starting…

Running on…

 Device 0: Tesla M2050
 Quick Mode

 Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			3529.8

 Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			3085.9

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			79518.8

# ./bandwidthTest --memory=pinned
[bandwidthTest]
./bandwidthTest Starting…

Running on…

 Device 0: Tesla M2050
 Quick Mode

 Host to Device Bandwidth, 1 Device(s), Pinned memory, Write-Combined Memory Enabled
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			5746.0

 Device to Host Bandwidth, 1 Device(s), Pinned memory, Write-Combined Memory Enabled
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6231.8

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			79509.6

Note the increase in both the “Host to Device” (1.6 times faster) and the “Device to Host” (2.0 times faster) memory transfers. I could have spent more time on this benchmark, but there was much more to explore with this system.
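
The difference comes down to how the host buffer is allocated. The listing below is my own sketch (not the SDK code): it times the same 32MB host-to-device copy once from ordinary malloc memory and once from page-locked memory obtained with cudaMallocHost. Using cudaHostAlloc with the cudaHostAllocWriteCombined flag would correspond to the write-combined mode reported above.

// pinned_vs_paged.cu - compare paged vs pinned host-to-device copies
// build: nvcc -o pinned_vs_paged pinned_vs_paged.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

static float copy_mb_per_s(void *host, void *dev, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / (1024.0f * 1024.0f)) / (ms / 1000.0f);
}

int main(void)
{
    const size_t bytes = 32 * 1024 * 1024;   /* same 32 MB transfer as bandwidthTest */
    void *dev, *paged, *pinned;

    cudaMalloc(&dev, bytes);
    paged = malloc(bytes);                   /* ordinary pageable host memory */
    cudaMallocHost(&pinned, bytes);          /* page-locked (pinned) host memory */

    printf("Paged : %.1f MB/s\n", copy_mb_per_s(paged,  dev, bytes));
    printf("Pinned: %.1f MB/s\n", copy_mb_per_s(pinned, dev, bytes));

    free(paged);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}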

Appro provides an important piece of software called TetraGPU. This utility provides management and monitoring of the GPUs. Typing TetraGPU starts the local IPMI service and then begins monitoring the temperatures of the GPUs. Full local and remote IPMI management is available through TetraGPU. As shown below, the GPU temperature and the board temperature are given for each M2050.

# TetraGPU
Starting IPMI drivers:                                     [  OK  ]
Detected Board Model: X8DTG-D
GPU1      GPU2      GPU3      GPU4
41/30(ok) 44/30(ok) 39/29(ok) 46/30(ok)
41/30(ok) 44/30(ok) 39/29(ok) 46/30(ok)
41/30(ok) 44/30(ok) 39/29(ok) 46/30(ok)
...

There are many options with TetraGPU and a few are worth mentioning. First, it is possible to enable or disable power to a GPU using TetraGPU. Users may want to do this to save power or to take a GPU out of service for another reason. It should be noted that disabling power will not cause problems for a running system, but the OS still believes the device is present, and any attempt to use it will fail. If you plan to manage the GPUs this way (i.e. turn them on only when users need them), the Tetra server should be started with all GPUs powered on, then power them off and on as needed so the OS believes they are present. As mentioned above, there is also a front-panel GPU power switch that can be used to disable power to all the GPUs.
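
Because a powered-down GPU still looks present to the OS, it can be worth probing each device before handing it work. The listing below is my own sketch (not part of TetraGPU): it simply tries a small allocation on every device and reports any that do not respond.

// probegpus.cu - verify that each CUDA device actually responds
// build: nvcc -o probegpus probegpus.cu
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; i++) {
        void *p = NULL;
        cudaSetDevice(i);
        cudaError_t err = cudaMalloc(&p, 1024 * 1024);   /* try a 1 MB allocation */
        if (err == cudaSuccess) {
            printf("Device %d: responding\n", i);
            cudaFree(p);
        } else {
            printf("Device %d: NOT usable (%s)\n", i, cudaGetErrorString(err));
            cudaGetLastError();   /* clear the error before moving on */
        }
    }
    return 0;
}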

Another important feature of TetraGPU is temperature monitoring and auto-shutoff. If a GPU temperature gets too high, the GPU will be powered down automatically. If a GPU cannot be powered down for any reason and is overheating, TetraGPU will power down the entire server and log the event. This feature adds some comfort to managing a server with such a high compute (and heat) density.
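
How TetraGPU reads the temperatures is internal to the tool, but similar numbers can be pulled from a user program. The listing below shows the general idea using NVIDIA's NVML library; this is only a sketch, it assumes the NVML headers and library are installed (which may not have been the case on this CUDA 3.0-era stack), and the 90C threshold is just a placeholder, not TetraGPU's setting.

/* gputemp.c - poll GPU core temperatures via NVML
 * build: gcc -o gputemp gputemp.c -lnvidia-ml  (assumes NVML is installed) */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    const unsigned int LIMIT_C = 90;   /* example threshold only */
    unsigned int count = 0, temp = 0;

    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML initialization failed\n");
        return 1;
    }
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; i++) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        printf("GPU%u: %u C %s\n", i + 1, temp,
               temp > LIMIT_C ? "(over limit)" : "(ok)");
    }
    nvmlShutdown();
    return 0;
}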

If the node is to be part of a cluster, Appro also offers the Rocks+ and MOAB Cluster Suite. There is also a choice of Linux or Windows for the operating system.
