It's often said that the phone in our pocket is a more powerful computer than the supercomputers of the past. Let's see if that is true.
The Raspberry Pi contains a Broadcom BCM2835 system on chip. The CPU within that system is a single ARM11 core (ARMv6 architecture) clocked at 700MHz. Stacked on top of the system on chip is 512MB of RAM -- this arrangement is called "package-on-package". As well as the RPi, the BCM2835 SoC was also used in some phones; these days they would be the cheapest of smartphones.
The Whetstone benchmark was widely used in the 1980s to measure the performance of supercomputers. Its headline result is in MWIPS (millions of Whetstone instructions per second), treated here as roughly equivalent to millions of floating-point operations per second (MFLOPS). On that basis, running Whetstone on the Raspberry Pi gives about 380 MFLOPS. See Appendix 1 for the details.
Let's see what supercomputer comes closest to 380 MFLOPS. That would be the Cray X-MP/EA/164 supercomputer from 1988. That is a classic supercomputer: the X-MP was a 1982 revision of the 1975 Cray 1. So good was the revision work by Steve Chen that its performance rivalled the company's own later Cray 2. The Cray X-MP was the workhorse supercomputer for most of the 1980s. The EA series was the last version of the X-MP, and its major feature was a selectable address size -- either 24-bit or 32-bit -- which allowed the X-MP to run programs designed for the Cray 1 (24-bit), Cray X-MP (24-bit) or Cray Y-MP (32-bit).
Let's do some comparisons of the shipped hardware.
|Item||Cray X-MP/EA/164||Raspberry Pi Model B+|
|Price, adjusted for inflation||US$16m||A$38|
|Number of CPUs||1||1|
|Address size (bits)||24 or 32||32|
|RAM||64MW (64-bit words) = 512MB||512MB|
|Cooling||Air cooled, heatsinks, fans||Air cooled, no heatsink, no fan|
Neither unit comes with a power supply. The Cray does come with housing, famously including a free bench seat. The RPi requires third-party housing, typically for A$10; bench seats are not available as an option.
The Cray had the option of solid-state storage. A Secure Digital card is needed to contain the RPi's boot image and, usually, its operating system and data.
|SSD size||512MB||Third party, minimum of 4096MB|
|Price, adjusted for inflation||US$12m||A$20|
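To put those prices in perspective, here is a rough sketch in Python of what a dollar buys on each machine. It assumes both machines deliver roughly 380 MFLOPS, as argued above, and it deliberately leaves the currencies unconverted (US$ for the Cray, A$ for the RPi) because no exchange rate is given here. The helper name `unit_cost` is made up for illustration.

```python
# Figures copied from the tables above. Currencies are NOT converted:
# Cray prices are in US$, Raspberry Pi prices are in A$.
CRAY_PRICE_USD = 16_000_000       # Cray X-MP/EA/164, inflation-adjusted
RPI_PRICE_AUD = 38                # Raspberry Pi Model B+
APPROX_MFLOPS = 380               # assumption: same rough figure for both

CRAY_SSD_PRICE_USD = 12_000_000   # Cray solid-state storage option
CRAY_SSD_MB = 512
SD_CARD_PRICE_AUD = 20            # third-party SD card
SD_CARD_MB = 4096                 # minimum size quoted above


def unit_cost(price, units):
    """Hypothetical helper: price divided by the quantity it buys."""
    return price / units


print(f"Cray: US${unit_cost(CRAY_PRICE_USD, APPROX_MFLOPS):,.2f} per MFLOPS")
print(f"RPi:  A${unit_cost(RPI_PRICE_AUD, APPROX_MFLOPS):,.2f} per MFLOPS")
print(f"Cray SSD: US${unit_cost(CRAY_SSD_PRICE_USD, CRAY_SSD_MB):,.2f} per MB")
print(f"SD card:  A${unit_cost(SD_CARD_PRICE_AUD, SD_CARD_MB):,.4f} per MB")
```

Whatever exchange rate you care to apply, the gap is several orders of magnitude, so the missing currency conversion does not change the picture.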
Of course the Cray X-MP also had rotating disks. Each disk unit could contain 1.2GB and had a peak transfer rate of 10MBps. This was achieved by using a large number of platters to compensate for the low density of the recorded data, giving the typical "top loading washing machine" look of disks of that era. The disk was attached to an I/O channel. The channel could connect many disks, collectively called a "string" of disks. The Cray X-MP had two to four I/O channels, each capable of 13MBps.
In comparison, the Raspberry Pi's SDHC connector attaches one SDHC card at a speed of 25MBps. The performance of the SD cards themselves varies hugely, ranging from 2MBps to 30MBps.
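The storage figures above are easier to compare with a little arithmetic. The Python sketch below is only a back-of-the-envelope model: it takes 1.2GB as 1200MB, assumes the Cray's channels can all be driven at their full rate at once, and assumes an SD card can never go faster than the 25MBps connector. The function names are invented for this example.

```python
# Figures quoted in the text; all rates are in MB per second.
CRAY_DISK_MB = 1200        # one 1.2GB disk unit (taking 1GB = 1000MB)
CRAY_DISK_MBPS = 10        # peak transfer rate of one disk
CRAY_CHANNEL_MBPS = 13     # per I/O channel
CRAY_CHANNELS = (2, 4)     # minimum and maximum number of channels
SD_BUS_MBPS = 25           # Raspberry Pi SDHC connector
SD_CARD_MBPS = (2, 30)     # slowest and fastest cards mentioned


def seconds_to_read(size_mb, rate_mbps):
    """Time to stream size_mb at a steady rate_mbps."""
    return size_mb / rate_mbps


def effective_sd_rate(card_mbps, bus_mbps=SD_BUS_MBPS):
    """Assumption: the slower of the card and the connector wins."""
    return min(card_mbps, bus_mbps)


print("Cray, one full disk:", seconds_to_read(CRAY_DISK_MB, CRAY_DISK_MBPS), "s")
for n in CRAY_CHANNELS:
    print(f"Cray, {n} channels: {n * CRAY_CHANNEL_MBPS} MBps aggregate")
for card in SD_CARD_MBPS:
    rate = effective_sd_rate(card)
    print(f"RPi, {card}MBps card: {rate} MBps effective,",
          seconds_to_read(CRAY_DISK_MB, rate), "s for the same data")
```

On these assumptions a fully configured Cray still has more aggregate I/O bandwidth than the RPi's single connector, but a decent SD card outruns any single Cray disk.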
What is clear from the numbers is that the floating point performance of the Cray X-MP/EA has fared better with the passage of time than the other aspects of the system. That's because floating point performance was the major design goal of that era of supercomputers. Ignoring the floating point performance, the Raspberry Pi handily beats out every computer in the Cray X-MP range.
Would Cray have been surprised by these results? I doubt it. Seymour Cray left CDC when they decided to build a larger supercomputer. He viewed this as showing CDC as not "getting it": larger computers have longer wires, more electronics to drive the wires, more heat from the electronics, more design issues such as crosstalk and more latency. Cray's main design insight was that computers needed to be as small as possible. There's not much smaller you can make a computer than a system-on-chip.
So why aren't today's supercomputers systems-on-chip? The answer has two parts. Firstly, a chip that small offers too little surface area to remove the heat from. This is why "chip packaging" has moved to near the centre of chip design. Secondly, chip design, verification, and tooling (called "tape out") are astonishingly expensive for advanced chips. It's simply not affordable. You can afford a small variation on a proven design, but that is about the extent of the financial risk which designers care to take. A failed tape out was one of the causes of the downfall of the network processor design of Procket Networks.
Appendix 1. Whetstone benchmark
Compiling this for the RPi is simple enough. Since benchmark geeks care about the details, here they are.
$ diff -d -U 0 whets.c.orig whets.c
@@ -886 +886 @@
-#ifdef UNIX
+#ifdef linux

$ gcc --version | head -1
gcc (Debian 4.6.3-14+rpi1) 4.6.3
$ gcc -O3 -lm -s -o whets whets.c
Here's the run. This is using Raspbian updated to 2014-08-23 on a Raspberry Pi Model B+ with the "turbo" overclocking to 1000MHz (this runs the RPi between 700MHz and 1000MHz depending upon demand and the SoC temperature). The Model B+ has 512MB of RAM. The machine was in multiuser text mode. No swap was in use before or after the run.
$ uname -a
Linux raspberry 3.12.22+ #691 PREEMPT Wed Jun 18 18:29:58 BST 2014 armv6l GNU/Linux
$ cat /etc/debian_version
7.6
$ ./whets
##########################################
Single Precision C/C++ Whetstone Benchmark

Calibrate
       0.04 Seconds          1   Passes (x 100)
       0.19 Seconds          5   Passes (x 100)
       0.74 Seconds         25   Passes (x 100)
       3.25 Seconds        125   Passes (x 100)

Use 3849 passes (x 100)

Single Precision C/C++ Whetstone Benchmark

Loop content                      Result    MFLOPS      MOPS   Seconds

N1 floating point   -1.12475013732910156   138.651               0.533
N2 floating point   -1.12274742126464844   143.298               3.610
N3 if then else      1.00000000000000000             971.638     0.410
N4 fixed point      12.00000000000000000               0.000     0.000
N5 sin,cos etc.      0.49911010265350342               7.876    40.660
N6 floating point    0.99999982118606567   122.487              16.950
N7 assignments       3.00000000000000000             592.747     1.200
N8 exp,sqrt etc.     0.75110864639282227               3.869    37.010

MWIPS                                      383.470             100.373
It is worthwhile making the point that this took maybe ten minutes. Cray Research had multiple staff working on making benchmark numbers such as Whetstone as high as possible.
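For the benchmark geeks, the headline figure can be cross-checked from the rest of the output. The Python sketch below sums the per-loop times and recomputes MWIPS. It assumes that one "pass (x 100)" counts as ten million Whetstone instructions; that factor is my reading rather than something stated in the output, but it does reproduce the reported number.

```python
import math

# Values copied from the run above.
PASSES_X100 = 3849
LOOP_SECONDS = {
    "N1": 0.533, "N2": 3.610, "N3": 0.410, "N4": 0.000,
    "N5": 40.660, "N6": 16.950, "N7": 1.200, "N8": 37.010,
}
REPORTED_SECONDS = 100.373
REPORTED_MWIPS = 383.470

# Assumption: each "pass (x 100)" is 10 million Whetstone instructions.
MILLIONS_PER_PASS_X100 = 10

total_seconds = sum(LOOP_SECONDS.values())
mwips = PASSES_X100 * MILLIONS_PER_PASS_X100 / total_seconds

print(f"total seconds: {total_seconds:.3f} (reported {REPORTED_SECONDS})")
print(f"MWIPS:         {mwips:.3f} (reported {REPORTED_MWIPS})")

# The recomputed values should agree with the reported ones to 3 places.
assert math.isclose(total_seconds, REPORTED_SECONDS, abs_tol=1e-6)
assert math.isclose(mwips, REPORTED_MWIPS, abs_tol=1e-3)
```

Note that about three quarters of the run time goes to the two transcendental-function loops (N5 and N8), so the overall figure is dominated by how fast the maths library is on this machine.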