HPC4U - Introducing Quality of Service for GRIDS

   

Many companies now use internal computing resources, such as clusters, to run high-performance computing applications, which often take days of computation time. Because hardware is not always reliable, failures can cost these companies both computation time and money. Moreover, current resource management systems (RMS) give no guarantee that a job will complete by a given deadline; they offer users only best-effort planning.

HPC4U solutions

HPC4U addresses these issues with a software solution: a generic, modular Grid middleware that provides an increased level of fault tolerance and spans multiple administrative domains. HPC4U integrates Quality of Service through a Resource Management System called CCS, which implements Service Level Agreements (SLAs). The Quality of Service gives a clear, reliable statement of the service level for a submitted job, which is then fulfilled despite hardware or software failures.

HPC4U offers fault tolerance mechanisms for parallel and non-parallel applications at all levels:

-  checkpoint of the entire process (memory, registers, ...)
-  checkpoint of the network
-  snapshot of the storage

Business Case

The Swedish Meteorological and Hydrological Institute (SMHI) runs daily weather forecasts using the Hirlam (High Resolution Limited Area Model) meteorological forecast model. The main 48-hour forecasts are run every 6 hours, and the time window for computing each one is about one hour. The input data comes from several sources, for instance aircraft, balloon, satellite and ground observations. One 48-hour forecast produces about 3.2 GB of data.

Currently, to guarantee that a result is available in time, the weather forecast is run in parallel on two separate computers: a Linux cluster at the SMHI weather service and an SGI-3800 at the National Supercomputer Centre in Sweden. This is a highly redundant way of producing the forecast.

With HPC4U, the weather forecast will only need to be run once. The result will be guaranteed to be available in time thanks to the checkpointing and migration functionalities and the availability of alternative resources. To be sure that the forecast data is available on time, the user will specify an SLA for a deadline-bound job with all fault tolerance mechanisms enabled.
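As a rough illustration of what a deadline-bound SLA with fault tolerance could mean in practice, the sketch below checks whether a job can still meet its deadline after a number of failures, assuming each failure costs at most one checkpoint interval of lost work plus a restart overhead. Everything here is hypothetical: the `JobSLA` fields, the function names, and the 40-minute runtime and 5-minute intervals are assumptions made for the example and do not reflect the actual CCS or HPC4U SLA format. Only the one-hour time window comes from the SMHI case above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobSLA:
    """Hypothetical SLA record; field names are illustrative, not the CCS format."""
    job_name: str
    deadline: datetime
    estimated_runtime: timedelta
    checkpoint_interval: timedelta
    # The three fault tolerance levels listed earlier in the text.
    process_checkpoint: bool = True
    network_checkpoint: bool = True
    storage_snapshot: bool = True


def worst_case_finish(sla, start, restart_overhead, failures=1):
    """Latest finish time, assuming each failure loses at most one
    checkpoint interval of work plus the restart/migration overhead."""
    lost_per_failure = sla.checkpoint_interval + restart_overhead
    return start + sla.estimated_runtime + failures * lost_per_failure


def can_accept(sla, start, restart_overhead, failures=1):
    """True if the deadline still holds after the given number of failures."""
    return worst_case_finish(sla, start, restart_overhead, failures) <= sla.deadline


# Assumed numbers: a one-hour window (from the text), a 40-minute run,
# a checkpoint every 5 minutes, and 5 minutes to restart elsewhere.
start = datetime(2006, 1, 1, 0, 0)
overhead = timedelta(minutes=5)
sla = JobSLA(
    job_name="hirlam-48h",
    deadline=start + timedelta(hours=1),
    estimated_runtime=timedelta(minutes=40),
    checkpoint_interval=timedelta(minutes=5),
)

print(can_accept(sla, start, overhead, failures=1))
print(can_accept(sla, start, overhead, failures=3))
```

Under these assumed numbers the job tolerates one failure within the window but not three. A real resource management system would also have to account for the availability of alternative resources, which this sketch ignores.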

HPC4U Freeware Solution

A freeware version of the HPC4U solution is available for you to download and try out.



Cetic