Project Description

To meet the ever-increasing demand for storing, transferring and, above all, processing data, HPC servers need to improve their capabilities. Scaling the number of cores alone is no longer feasible because of rising utility costs and power consumption limits. While current HPC systems can offer petaflop performance, their architecture constrains their scalability and energy consumption. Extrapolating from the top HPC systems, such as China's Tianhe-2 supercomputer, sustaining exaflop performance would require an enormous 1 GW of power; a similar, though smaller, figure results even if we take the best system on the Green500 list as the starting point.
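The extrapolation above is a linear scaling of power with performance at constant energy efficiency. A minimal sketch of that arithmetic follows; the function name and the reference system (30 PFLOPS at 30 MW) are hypothetical round numbers chosen for illustration, not measured values for Tianhe-2 or any other machine.

```python
PETA = 10**15
EXA = 10**18
MEGA = 10**6
GIGA = 10**9


def power_at_target(perf_flops: float, power_watts: float, target_flops: float) -> float:
    """Scale a system's power draw linearly to a target performance level.

    Assumes energy efficiency (FLOPS per watt) stays constant while scaling,
    which is the simplification behind a straight extrapolation.
    """
    flops_per_watt = perf_flops / power_watts
    return target_flops / flops_per_watt


# Hypothetical reference system: 30 PFLOPS at 30 MW (illustrative only).
required_watts = power_at_target(30 * PETA, 30 * MEGA, 1 * EXA)
print(f"{required_watts / GIGA:.2f} GW")
```

With these illustrative inputs the result is on the order of a gigawatt; a more efficient reference system lowers the figure proportionally.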

Apart from improving transistor and integration technology, what is needed is to refine HPC application development and HPC architecture design. To this end, ECOSCALE will analyse the characteristics and trends of current and future applications in order to provide a hybrid MPI+OpenCL programming environment, a hierarchical architecture, a runtime system and middleware, and shared, distributed, reconfigurable hardware-based acceleration.

Technical Description

Driven by the characteristics and trends of future HPC applications, ECOSCALE will co-design a) novel HPC applications and b) the novel hierarchical UNIMEM+UNILOGIC architecture in order to reach exascale performance while achieving exascale-class energy efficiency. The UNILOGIC (Unified Logic) architecture, introduced for the first time within ECOSCALE, is an extension of the UNIMEM architecture proposed within the EUROSERVER project. UNIMEM provides a shared partitioned global address space, while UNILOGIC provides shared partitioned reconfigurable resources. The UNIMEM architecture gives the user the option to move tasks and processes close to the data instead of moving data around, and thus significantly reduces data traffic and the related energy consumption and delays. The proposed UNILOGIC+UNIMEM architecture partitions the design into several Worker nodes that communicate through a fat-tree communication infrastructure, similar to the one shown in the figure below.

[Figure: ECOSCALE partition]

These Worker nodes correspond to the partitions of the HPC application. Each Worker node is an entire sub-system including processing units, memory, and storage. Within a PGAS domain (consisting of several Workers), the proposed architecture offers a shared partitioned global address space and shared partitioned reconfigurable resources that can be accessed via regular load and store instructions, without any global cache-coherence mechanism.
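To make the idea of a shared partitioned global address space concrete, here is a toy software model in which each Worker owns one contiguous slice of a global address range and any address can be read or written through plain load and store calls. The class names, the equal-sized partitioning, and the byte-granular interface are assumptions made for this sketch; the actual UNIMEM address mapping is implemented in hardware and is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """One Worker node that owns a contiguous slice of the global address space."""
    worker_id: int
    size: int
    memory: bytearray = field(init=False)

    def __post_init__(self) -> None:
        self.memory = bytearray(self.size)


class PgasDomain:
    """Toy PGAS domain: global address -> (owning Worker, local offset)."""

    def __init__(self, num_workers: int, bytes_per_worker: int) -> None:
        self.bytes_per_worker = bytes_per_worker
        self.workers = [Worker(i, bytes_per_worker) for i in range(num_workers)]

    def locate(self, global_addr: int) -> tuple[int, int]:
        """Return (worker_id, local_offset) for a global address."""
        limit = len(self.workers) * self.bytes_per_worker
        if not 0 <= global_addr < limit:
            raise IndexError(f"address {global_addr} outside domain of {limit} bytes")
        return divmod(global_addr, self.bytes_per_worker)

    def store(self, global_addr: int, value: int) -> None:
        """Write one byte; the caller never names the owning Worker."""
        worker_id, offset = self.locate(global_addr)
        self.workers[worker_id].memory[offset] = value

    def load(self, global_addr: int) -> int:
        """Read one byte from whichever Worker owns the address."""
        worker_id, offset = self.locate(global_addr)
        return self.workers[worker_id].memory[offset]


domain = PgasDomain(num_workers=4, bytes_per_worker=1024)
domain.store(2050, 7)            # lands in Worker 2 at local offset 2
owner, offset = domain.locate(2050)
value = domain.load(2050)
```

The model keeps exactly one copy of each byte, held by its owning Worker, which is why no coherence protocol between Workers appears in the sketch.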

To further tackle the scalability problems of an exascale machine, ECOSCALE aims to decrease the number of interconnected compute nodes required (a compute node is called a Worker node in ECOSCALE), which is becoming a critical factor. This number depends on both the computing power and the energy efficiency of each node. The more computing power a compute node provides, the fewer compute nodes the HPC system requires. Conversely, the higher the energy efficiency, the more nodes can be used without exceeding the total power budget. The approach taken by several architectures, such as the EUROSERVER architecture, is to integrate several low-power CPUs in a single compute node. However, technology advancements (3D-stacking, tri-gates, etc.) are not enough by themselves to provide an exascale solution; novel architectural approaches are also required. Energy-efficient reconfigurable accelerators, which can provide significant energy optimizations for data-flow applications, can further improve the energy efficiency of a compute node.
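The trade-off described above can be written as two simple bounds: the performance target sets the minimum number of nodes, and the power budget sets the maximum. The sketch below assumes a fixed power budget and uses hypothetical per-node figures purely for illustration; none of the numbers come from ECOSCALE.

```python
def nodes_required(target_flops: int, flops_per_node: int) -> int:
    """Minimum node count that reaches the performance target (ceiling division)."""
    return -(-target_flops // flops_per_node)


def nodes_affordable(power_budget_watts: int, watts_per_node: int) -> int:
    """Maximum node count that fits inside the power budget."""
    return power_budget_watts // watts_per_node


def is_feasible(target_flops: int, flops_per_node: int,
                power_budget_watts: int, watts_per_node: int) -> bool:
    """A design is feasible when the required nodes fit in the power budget."""
    return (nodes_required(target_flops, flops_per_node)
            <= nodes_affordable(power_budget_watts, watts_per_node))


TARGET = 10**18            # 1 exaflop
BUDGET = 20 * 10**6        # hypothetical 20 MW power budget

# Hypothetical CPU-only node: 1 TFLOPS at 500 W.
cpu_only = is_feasible(TARGET, 10**12, BUDGET, 500)

# Hypothetical accelerated node: 10 TFLOPS at 200 W.
accelerated = is_feasible(TARGET, 10**13, BUDGET, 200)

print(cpu_only, accelerated)
```

In this illustration the CPU-only node needs far more nodes than the budget allows, while the more powerful and more efficient node both lowers the required count and raises the affordable count.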

While reconfigurable devices (such as FPGAs) have been shown to be significantly more energy efficient than traditional multi-core architectures (both CPUs and GPUs), delivering up to 25X better performance per watt, their use is still limited because porting an application to reconfigurable hardware is often prohibitively cumbersome. Matters are even harder in the HPC domain, where thousands or millions of reconfigurable accelerators have to be managed.

Therefore, ECOSCALE aims to ease this path by providing a novel methodology and architecture for automatically executing HPC applications on an HPC platform that supports thousands or millions of reconfigurable hardware blocks, while taking into account the projected trends and characteristics of HPC applications. Within this context, the project aims to link and extend various disconnected existing FPGA-based acceleration approaches and adapt them to work in an HPC environment, providing a novel framework from the ground up. To do so efficiently, we follow a holistic approach that provides solutions for all aspects of an HPC environment, ranging from architecture and runtime optimizations to high-level synthesis (HLS) and hardware virtualization.

[Figure: ECOSCALE framework]

The proposed HW design consists of a stack of three interdependent HW abstraction layers, as shown in the figure above. At the bottom, the hardware architecture provides the basic hardware components and functionality needed to use the HW resources (CPU, memory, reconfigurable hardware, etc.) of an HPC system efficiently. In the next layer, a middleware provides the primitives to reconfigure hardware blocks at runtime, and an HLS tool supplies the synthesized application tasks to the middleware. In the third layer, a runtime system schedules tasks inside a PGAS partition, provides the MPI primitives for communication between PGAS partitions, and decides at runtime which functions of the accelerated application should be implemented and executed in hardware.
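The text does not specify how the runtime system decides which functions go to hardware. As one possible illustration, the sketch below uses a simple greedy rule: a task runs in hardware when its estimated hardware time plus any reconfiguration overhead beats its software time and a reconfigurable slot is available. The Task fields, the place_tasks function, and all timing values are hypothetical and are not taken from the ECOSCALE runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical per-task cost estimates, in milliseconds."""
    name: str
    sw_time_ms: float        # estimated time when run on the CPU
    hw_time_ms: float        # estimated time when run on a reconfigurable block
    reconfig_time_ms: float  # one-off cost of loading the block


def place_tasks(tasks: list[Task], free_slots: int,
                loaded: frozenset[str] = frozenset()) -> dict[str, str]:
    """Greedily assign each task to "hardware" or "software", in list order."""
    placement: dict[str, str] = {}
    resident = set(loaded)
    for task in tasks:
        already_loaded = task.name in resident
        overhead = 0.0 if already_loaded else task.reconfig_time_ms
        hw_total = task.hw_time_ms + overhead
        slot_available = already_loaded or free_slots > 0
        if slot_available and hw_total < task.sw_time_ms:
            placement[task.name] = "hardware"
            if not already_loaded:
                free_slots -= 1          # a new block occupies one slot
                resident.add(task.name)
        else:
            placement[task.name] = "software"
    return placement


tasks = [
    Task("fft", sw_time_ms=100, hw_time_ms=10, reconfig_time_ms=50),
    Task("sort", sw_time_ms=20, hw_time_ms=15, reconfig_time_ms=50),
    Task("stencil", sw_time_ms=200, hw_time_ms=20, reconfig_time_ms=30),
]
placement = place_tasks(tasks, free_slots=1)
print(placement)
```

A greedy, in-order rule is used here only because it is easy to follow; a real runtime could weigh energy, data locality within the PGAS partition, or reuse of already-loaded blocks differently.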

Seventh Framework Programme