Adapteva ships Kickstarted baby supercomputer boards (51 gigaflops per watt)
Posted on 07/26/2013 7:39:31 PM PDT by Ernest_at_the_Beach
Upstart RISC processor and coprocessor designer Adapteva is shipping the first of its Parallella system boards, which mate its Epiphany multicore processors with ARM processors to create a spunky and reasonably peppy hybrid compute engine that doesn't cost much and is very energy efficient for certain kinds of processing.
It is not cheap to design and fab coprocessors or to make system boards that make use of them, so Adapteva's cofounder and CEO Andreas Olofsson fired up a project on fund-raising site Kickstarter last fall to raise the money to fab the chips, instead of going the traditional route of raising venture funding and trying to get design wins.
While Adapteva did not meet its pie-in-the-sky dream of raising $3m to fully fund a set of multicore Epiphany RISC coprocessors and Parallella system boards that make use of them, the company does have 4,965 backers who pledged $898,921 and have ordered over 6,300 boards using various Epiphany processors matched up with Zynq dual-core ARM Cortex-A9 processors from Xilinx, which peddles those ARM chips mashed up with its field programmable gate arrays (FPGAs).
The Epiphany core embodies the essence of Reduced Instruction Set Computing: it is a dual-issue core with a mere 35 instructions and 64 registers. It has one arithmetic-logic unit (ALU), one floating-point unit, and 32KB of static RAM on the other side of those registers. Each core also has a four-port router, and the mesh can be extended out to a 64x64 array of cores for a total of 4,096 cores.
Block diagram of the Epiphany RISC chip
The Epiphany-III chip is implemented in a 65 nanometer process and sports 16 cores, and the Epiphany-IV is implemented in a 28nm process and offers 64 cores. This latter chip delivers about 102 gigaflops of performance at 2 watts, or 51 gigaflops per watt. (Adapteva has chosen GlobalFoundries as its wafer baker, by the way.)
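The headline numbers are easy to check. A quick Python sketch, assuming each core's FPU retires one fused multiply-add (2 flops) per cycle at peak (the 2-flops-per-cycle figure is an assumption):

```python
# Back-of-the-envelope check of the Epiphany-IV figures quoted above.
# Assumption: each core retires one fused multiply-add (2 flops) per
# cycle at peak.
cores = 64
clock_hz = 800e6            # Epiphany-IV as currently shipped
flops_per_cycle = 2

peak_gflops = cores * clock_hz * flops_per_cycle / 1e9
watts = 2.0
print(peak_gflops)          # 102.4 gigaflops
print(peak_gflops / watts)  # 51.2 gigaflops per watt
```

That lands on the "about 102 gigaflops at 2 watts" quoted above.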
The Epiphany memory architecture allows any core to access the SRAM of any other core on the die because the SRAM is mapped as a single address space across the cores. This greatly simplifies memory management, and each core also has a direct memory access (DMA) unit that can prefetch data from external flash memory.
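A flat address space over a mesh of cores can be encoded straightforwardly. Here is a hypothetical Python sketch, loosely modeled on Epiphany's scheme; the exact bit widths are an assumption for illustration:

```python
# Sketch of a flat address space over a 64x64 core mesh (bit layout
# is an assumption, loosely modeled on Epiphany's scheme): the top
# 6 bits of a 32-bit address select the mesh row, the next 6 bits
# the column, and the low 20 bits an offset into that core's SRAM.
def global_addr(row, col, offset):
    assert 0 <= row < 64 and 0 <= col < 64
    assert 0 <= offset < (1 << 20)   # 1MB address window per core
    return (row << 26) | (col << 20) | offset

# Any core can touch core (2, 3)'s SRAM by dereferencing this address:
print(hex(global_addr(2, 3, 0x100)))  # 0x8300100
```

Because the core coordinates are just address bits, a remote load or store needs no special API; the router decodes the high bits and forwards the transaction across the mesh.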
How the computing elements of the Parallella board come together
At the moment, this DMA support is not extended to InfiniBand or Ethernet network adapters with Remote Direct Memory Access (RDMA) on top of those two network protocols, but Olofsson concedes to El Reg that this presents an interesting set of possibilities to link multiple coprocessors in a Parallella cluster together and have the Epiphany coprocessors share data directly over the network as they chew on data. (You would use the RDMA over Converged Ethernet, or RoCE, over the Ethernet links.)
The board does not have a SATA port or a fast InfiniBand or Ethernet link, but three of the four 10Gb/sec expansion ports can be ganged together for a maximum of 30Gb/sec of bandwidth for attaching other kinds of ports to the Parallella board. You would have to create the daughter card to do this and write its drivers.
The Parallella-16 ARM-FPGA-Epiphany triple hybrid board
The Epiphany-IV design is meant to scale to 64 cores at 1GHz and burn about 25 milliwatts per core. The current chip runs at 800MHz and delivers that 51 gigaflops per watt performance on the number-crunching work mentioned above. At 1GHz, the Epiphany-IV can do an estimated 70 gigaflops per watt.
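The same arithmetic applies to the 1GHz projection. Note that the per-core power budget alone implies a somewhat higher figure than the quoted ~70 gigaflops per watt; presumably (this is an assumption on my part) the quoted figure includes chip-level overhead beyond the cores:

```python
# Projected Epiphany-IV at 1GHz, using the per-core figures above.
cores = 64
clock_hz = 1e9
flops_per_cycle = 2           # one FMA per core per cycle (assumed)
watts_per_core = 0.025        # 25 milliwatts per core

peak_gflops = cores * clock_hz * flops_per_cycle / 1e9
core_watts = cores * watts_per_core
print(peak_gflops)               # 128.0 gigaflops
print(peak_gflops / core_watts)  # 80.0 for the cores alone; the ~70
                                 # figure would include chip overhead
```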
If you participated in the Kickstarter program, you will get a Parallella-16 board with a Zynq-7020 processor from Xilinx, which has two Cortex-A9 cores that run at 800MHz and an FPGA on the same package with 85,000 logic cells and 220 programmable digital signal processing slices. This board has one of the 16-core Epiphany-III processors on it as well, and sports 1GB of SDRAM main memory, a MicroSD card slot, four expansion connectors, a Gigabit Ethernet network interface card, and an HDMI connector.
If you want to buy a Parallella-16 board and you did not participate in the Kickstarter program, you can get one from the online store that Adapteva has set up, but you will get a Zynq-7010 processor instead, which has only 29,000 logic cells and 80 DSP slices on the FPGA side of the Xilinx chip.
The board will cost you $99, just like the base level of the Kickstarter support did, but it will take about twelve weeks to fulfill those orders because Adapteva is not pre-manufacturing boards. You will eventually be able to order the Zynq chip with the fatter FPGA, but pricing is not yet set for this upgrade.
A 42-node cluster of Parallella-16 boards from Adapteva
If you don't want to do much work at all and want to start playing with a baby cluster of these Parallella-16 system boards, Adapteva is selling those as well for $575. That includes four of the Parallella-16 cards with connectors, four 16GB SD cards loaded up with Canonical's Ubuntu Server 12.04, a power supply, and 20 metal standoff legs to screw the boards into a tower of computing power. The Parallella-16 card is a mere 3.4 inches by 2.1 inches.
The Parallella design required the Epiphany chip packaging to be redesigned, Olofsson tells El Reg, and the software drivers and SDK were also improved and made to work better with the FPGAs on the Xilinx chips. That stack includes a C compiler, a multicore debugger, the Eclipse IDE, an OpenCL SDK and compiler set, and runtime libraries.
Just for fun, Olofsson grabbed two 24-port Gigabit Ethernet switches and 42 of the Parallella boards to create a 42-node cluster that is about the size of a tower PC. It will cost around $5,000 and burn less than 500 watts (all in, including the three kinds of processing, memory, flash storage, and Ethernet ports).
Such a machine delivers around 1.1 teraflops of oomph, and shifting to the 64-core Epiphany-IV would push that up to 4.3 teraflops. That's not a lot of teraflops, and a bunch of GPU coprocessors can match that in a much smaller form factor, to be sure. But the Epiphany RISC coprocessor is more than twice as energy efficient, according to Adapteva.
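Those cluster figures line up with the per-chip numbers, again assuming 800MHz clocks and one FMA (2 flops) per core per cycle:

```python
# Sanity check of the 42-node cluster's aggregate throughput.
boards = 42
clock_ghz = 0.8               # assumed 800MHz, as on the single board
flops_per_cycle = 2           # one FMA per core per cycle (assumed)

def cluster_teraflops(cores_per_chip):
    return boards * cores_per_chip * clock_ghz * flops_per_cycle / 1000

print(cluster_teraflops(16))  # ~1.08 TF with the 16-core Epiphany-III
print(cluster_teraflops(64))  # ~4.30 TF with the 64-core Epiphany-IV
```

This counts only the Epiphany coprocessors; the ARM cores and FPGA fabric on each Zynq chip would add a little more.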
Adapteva still wants to be an exascale player in the high performance computing arena, and as El Reg has previously reported, has set its sights on creating two chips by 2018 to reach its exascale aspirations. One future Epiphany chip is an entry coprocessor with 1,000 cores on a die that delivers 2 teraflops of performance in a 2 watt thermal envelope. The second is a massive chip with 64,000 cores with 1MB of SRAM per core that can deliver 100 teraflops of floating point coprocessing at 100 watts. The plan is to have both chips deliver 1 teraflops per watt using the 7 nanometer wafer baking processes that are expected to be generally available by 2018.
The Kickstarter program for these future Epiphany chips will probably require some support from big government agencies. And with those kinds of performance and thermal numbers, the US Defense Advanced Research Projects Agency is probably sniffing around, and maybe the Department of Energy, too. ®
Tempting, isn't it?
We should put "A New American Revolution" on Kickstarter. I say we would need about a trillion.
Pull quote near the end: “the Epiphany RISC coprocessor is more than twice as energy efficient” as GPUs.
Simply have the EPA mandate these. Just like they did with the ultra-expensive LED lights in my kitchen remodel and my power-miser TV and my low-flow toilet and my dishwasher that doesn’t wash dishes and my alcohol contaminated gasoline that destroys my small engines.
Wonder if anyone is mucking around with the SLATEC math routines....for ARM?
It’s enough to give an old fart who was involved in chip design back in the days of 3.5 micron CMOS (think mid 80’s) a hard on. : )
Not really. It's a multi-core setup of Raspberry Pi fame. Underpowered and overpriced. Zynq is expensive, and it is pretty slow. Xilinx put Zynq there as a replacement of old PowerPC cores that it had in Virtex. The purpose of those cores is not supercomputing, but tight integration with the rest of the fabric, to be used for sequential algorithms that are less convenient to synthesize in logic.
The article correctly points out that speed-wise there is zero value in this setup. You would be better off with GPUs, especially considering that they are made in volume. The only savings here are consumed power - and, perhaps, the "system in a box" that comes ready to run.
This may be a good setup for universities, as an educational tool. Students do not need petaflops; they only need something that can run distributed tasks. It's also pretty affordable for that purpose.
If I were to build such a system for real use, I would use Intel CPU blades with fast and hot CPUs. Users of clusters are usually not much concerned about price, space, or energy - they are governments or large research facilities. There is overhead in splitting the task into slices, so a task may run faster on two blazing fast CPU cards than on a thousand handheld calculators. This design is closer to calculators.
Watching BB here...be up for awhile yet....Have a good rest.
It happens my friend in due time. :)
Likewise American patriot.
“Tempting, isn’t it?”
Yes, but I’ll wait to see if a RPi like ecosystem grows up around it.
I want one, but I don’t know what for. :^|