Heads up: Fujitsu tips its hand to reveal exascale Arm supercomputer processor

Heads up: Fujitsu tips its hand to reveal exascale Arm supercomputer processor – the A64FX
The Register ^ | 22 August 2018 | Chris Williams

Posted on 08/22/2018 10:51:33 AM PDT by ShadowAce

Hot Chips Fujitsu has unfurled the blueprints for its homegrown high-performance Arm-based processors dubbed A64FX, the brains of its Post-K supercomputer.

The designs were shown on Tuesday at a gathering of semiconductor engineers in Silicon Valley. The Post-K is a 1,000 peta-FLOPS monster – an exascale machine – that will supersede Japan's SPARC64-based K supercomputer. It is due to go online in 2021, and has just completed a round of trials that demonstrated the processors work – to some degree, at least.

Post-K hopes to be the world's fastest publicly known supercomputer by the time it's fully powered up and consuming 30 to 40MW. Today, the top slot is held by the US government's Summit machine that uses IBM POWER9 and Nvidia Volta GV100 processors, along with Mellanox networking gear, to max out at 188 peta-FLOPS.

Crucially, it will be an exascale Arm-compatible supercomputer, a significant milestone for the CPU architecture that's famous for being in practically everyone's phones, hard drives, smart cards, and other embedded electronics, and has dreams of driving laptops and servers.

So what does a Fujitsu-designed supercomputer Arm processor look like? Here's what we learned from Fujitsu's Toshio Yoshida at the Hot Chips engineering conference in Santa Clara: the A64FX has 8.8 billion 7nm FinFET transistors in a package with 594 pins, and 48 CPU cores plus four management cores. Each chip has a total of 32GB of high-bandwidth memory (HBM2), 16 PCIe 3.0 lanes, and a 1024GB/s total memory bandwidth, and hits at least 2.7 tera-FLOPS in terms of performance.

The 52 CPU cores are split into four clusters of 12 main cores plus one management core, each group has 8GB of HBM2 rated at 256GB/s, and 8MB of shared L2 cache. There is cache coherency across the clusters and the whole chip.

The chips are interconnected via Fujitsu's second-generation Tofu mesh-torus-like network. This interconnect can shift data, in and out of each processor chip, via 10 ports each with two lanes maxing out at 28Gbps each.

Caches and access speeds for the Fujitsu A64FX

The A64FX's cache hierarchy and speeds, for the 12 compute cores and management core per cluster, four clusters to a chip ... Source: Fujitsu
Click to enlarge

The CPU cores are 64-bit only – there's no 32-bit mode – and they use the Armv8.2-A instruction set. It supports Arm's 512-bit-wide SIMD scalable vector extension (SVE) that we described in detail, here. It means the chips can crunch vector and matrix calculations in hardware – a must for supercomputer and machine-learning applications. It also supports 16 and 8-bit integer math, as well as the usual floating-point precisions (FP16, 32, and 64), which are useful for AI inference code.

The A64FX is a superscalar, out-of-order execution beast, and first Armv8.2-A design, we're told. Folks who have done 32 and 64-bit Arm assembly programming will know the architecture has fixed-width instructions, typically one operation per instruction, as per the classic RISC school of thinking. Interestingly enough, the A64FX, by implementing SVE, has an instruction prefix for its four-operand fused-multiply-add instruction (FMA4) – an incredibly useful operation – that kinda reminds this vulture of x86 instruction prefixes.

To perform the calculation r0 = r3 + r1 * r2, you use two instructions that are merged into one at the pre-decode stage, and are performed in one step despite being fetched as two instructions. These are:

movprfx r0, r3      ; prefix next instruction
fma3 r0, r0, r1, r2 ; r0 = r3 + r1 * r2, the r3 substituted in

Each CPU core's execution unit can handle two 512-bit SIMD operations at once. The input data is packed into 512 bits and crunched in one go – like Intel's AVX512 operations on its server parts. So you could feed in four 8-bit values, four corresponding 8-bit coefficients or weights, which are multiplied to get four answers then added to a 32-bit offset, and written out to a register.

Fujitsu reckons its A64FX can hit 21.6 TOPS (trillion or tera operations per second) when doing 8-bit integer math; 10.8 TOPS with 16-bit integers; 5.4 TOPS with 32-bit; and 2.7 TOPS with 64-bit, all when performing integer SIMD. Overall, it's said the A64FX is at least 2.5 faster than Fujitsu's previous supercomputer processor – the SPARC64 XIfx – at HPC and AI work.

Nvidia's P4 and P40 accelerators for servers clock in at 22 and 47 TOPS with 8-bit integer, for what it's worth.

The L1 cache has a combined gathering mechanism that can fetch consecutive elements in arrays and copy them into a register. So, for example, you could use this to hoover up eight bytes spread over memory into one 64-bit register, each byte slotted into its own byte position in the register. The per-core four-way 64KB L1 data cache is read by the instruction engines at 230GB/s, and written back at 115GB/s. The L2 shared cache feeds data in at 115G/s, and receives at 57GB/s.

Pipeline stages of the A64FX ... Source: Fujitsu
Click to enlarge

Per-chip power usage is monitored and controlled on a per-millisecond basis, and down to the nanosecond per-core. Fujitsu claims its A64FX has mainframe-grade resilience, with ECC or duplication on all caches, parity checks within the execution units, instructions are retried if something is detected as going wrong, error recovery on the Tofu interconnect links, and 128,000 error checkers in total on the chip.

The whole shebang runs Linux, with a Lustre-based distributed file system and non-volatile memory for accelerating file input-output. The toolchain supports C, C++ and Fortran compilers, MPI, OpenMP, debuggers, and other utilities and languages.

You'll note there are no third-party accelerators: it's pure Arm, Fujitsu's way. The aim is to design a chip that runs supercomputer-style applications – simulations, science experiment analysis, machine learning, and other number crunching – with a higher performance-per-watt than general-purpose CPUs.

Yoshida didn't want to talk about clock speeds and individual chip power usage just yet, sadly. The machine is still years away from being finished, and all the specifications and implementation details have yet to be nailed down or revealed. "We will continue to develop Arm processors," he told the conference, though. Despite its delays, Fujitsu hasn't been put off Arm big iron.

And, er, yes, you might be able to play Crysis on it. ®

TOPICS: Computers/Internet
KEYWORDS: arm; supercomputer

Navigation: use the links below to view more comments.
first previous 1-20, 21-32 last

To: 2ndDivisionVet

If it’s a Bethesda game it will still get 30 FPS in populated areas.

21 posted on 08/22/2018 12:24:07 PM PDT by LambSlave

[ Post Reply | Private Reply | To 7 | View Replies]

To: freedumb2003

Will it implement the HCF command?

22 posted on 08/22/2018 12:45:45 PM PDT by Mr. K (No consequence of repealing Obamacare is worse than Obamacare itself.)

[ Post Reply | Private Reply | To 6 | View Replies]

To: ShadowAce

And I thought a PDP-8 was pretty slick....

23 posted on 08/22/2018 1:04:04 PM PDT by wouldilie

[ Post Reply | Private Reply | To 1 | View Replies]

To: wouldilie

The PDP-8 WAS pretty slick. I have about 150 IM6100 chips that someday I am going to turn into PDP-8 dev boards so we OFs can relive our youth. :-)

24 posted on 08/22/2018 2:06:15 PM PDT by beef

[ Post Reply | Private Reply | To 23 | View Replies]

To: LambSlave

More like 3FPS :p

25 posted on 08/22/2018 3:57:47 PM PDT by Bikkuri

[ Post Reply | Private Reply | To 21 | View Replies]

To: wouldilie

My High school had a PDP-8 back in the very early 70’s i first learned to programme on that machine. Had a card reader, and DEC terminals. No monitors. I/O was to tractor fed paper. The terminals could also read DEC tape (the yellow ribbon strips that had punch holes).

i still miss it.

26 posted on 08/22/2018 5:07:54 PM PDT by Calvinist_Dark_Lord ((I have come here to kick @$$ and chew bubblegum...and I'm all outta bubblegum! ~Roddy Piper))

[ Post Reply | Private Reply | To 23 | View Replies]

To: ShadowAce

Ridiculous. Nobody needs more than 640K.

27 posted on 08/22/2018 5:14:01 PM PDT by Billthedrill

[ Post Reply | Private Reply | To 1 | View Replies]

To: entropy12

Their is no temporal cure for Original Sin or its consequences.

28 posted on 08/22/2018 6:28:29 PM PDT by YogicCowboy ("I am not entirely on anyone's side, because no one is entirely on mine." - J. R. R. Tolkien)

[ Post Reply | Private Reply | To 9 | View Replies]

To: freedumb2003

I still have my dBase 4 manual and user’s guide. Just can’t bear to throw them away.

29 posted on 08/23/2018 5:46:06 AM PDT by Bloody Sam Roberts (Get in the Spirit! The Spirit of '76!)

[ Post Reply | Private Reply | To 14 | View Replies]

To: entropy12

You would think with all these super-fast computers, cancer would be cured by now...

Cancer is not a disease. It is a category of disease, with every one being different.

30 posted on 08/23/2018 7:06:33 AM PDT by ShadowAce (Linux - The Ultimate Windows Service Pack)

[ Post Reply | Private Reply | To 9 | View Replies]

To: ShadowAce

You missed my point badly. I never said anything about what cancer is or is not.

Since you failed to understand what my main point was...again it was...

Garbage in-——>faster computer——>faster garbage out.

If faster computers were the anti-dote for intelligence of human beings, we should be able to predict weather better 2 days ahead, be able to predict where earthquakes will happen, predict who will get cancer based on DNA and genetics, predict that Hillary was going to lose in spite of the polls, and so on and so on.

Super fast computers have failed in all of the above.

31 posted on 08/23/2018 9:12:15 AM PDT by entropy12 (Trump/Pence 2020)

[ Post Reply | Private Reply | To 30 | View Replies]

To: entropy12

...predict who will get cancer based on DNA and genetics...

Again, you are lumping "cancer" into a single thing. It's not.

If faster computers were the anti-dote for intelligence of human beings...

And your premise is incorrect. The faster, more powerful technology becomes, the stupider humanity gets because we don't have to think for ourselves. This is obviously a generalization, and I do not include any specific individual in that claim, but it sure seems observably true.

Super fast computers have failed in all of the above.

Given my last statement, this is not at all surprising as computers only do what the programmer tells tells them to do with the data that they have been provided. Since humanity is declining, our models/algorythms will also decline in quality.

I actually agree with 99% of what you are saying. I was just pointing out the one statement that was less than 100% accurate.

32 posted on 08/23/2018 2:35:49 PM PDT by ShadowAce (Linux - The Ultimate Windows Service Pack)

[ Post Reply | Private Reply | To 31 | View Replies]

Navigation: use the links below to view more comments.
first previous 1-20, 21-32 last

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.

Free Republic
Browse · Search

General/Chat
Topics · Post Article

FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794