The big data use case will leverage Mikelangelo’s software stack to provide big data services flexibly and with high performance. In the context of this use case, flexibility means on-demand access, customisability, and elasticity. To achieve this flexibility, we will integrate a big data stack with a cloud middleware. The cloud middleware, in turn, will use Mikelangelo’s hypervisor, which provides low I/O overhead and short boot durations. A flexible platform requires virtualisation, which with current hypervisors incurs a high I/O overhead. Thus, without Mikelangelo’s software stack, we cannot realise this use case.
The GWDG treats this use case as strategically important since, as an IT service provider, it has many clients who want to apply the big data paradigm. The GWDG provides IT services for the whole Max Planck Society, the University of Göttingen, and related institutes with large data sets, such as the Göttingen State and University Library. Most of these clients have not deployed any big data applications yet.
Following Figure 1, we motivate this use case from top to bottom: requirements from users and management lead to technical requirements, which in turn affect the cloud layer and the virtualisation layer.
The user requirements comprise experimentation with big data stacks, elasticity, and self-service access. Before GWDG’s users can use big data in production, they need to familiarise themselves with big data stacks to elicit concrete requirements. Two of this use case’s requirements correspond to the variety criterion of big data. First, GWDG’s clients want to try different big data stacks with custom configurations. Second, GWDG’s clients use diverse methods in diverse problem domains; these domains span all fields of science. Once the clients decide on a specific big data stack, they will require high performance, due to the high volume of their data and their time constraints. Many clients operate on sensor data, which arrives in bursts of large data sets with real-time constraints, resulting in a highly variable velocity. To handle these request bursts, our solution needs to feature elasticity. An additional requirement is to provide appropriate access to the clients: the access needs to fit their technical expertise, which ranges from IT experts to non-IT experts.
From a management perspective, we need to keep the administrative overhead low; thus, we require a self-service user interface. Furthermore, we face the following circular dependency when bootstrapping a big data service. On the one hand, if the GWDG does not provide a big data platform to its clients, the clients will not use big data on GWDG’s IT infrastructure, and thus will not be able to try out the big data paradigm to elicit their concrete requirements. On the other hand, the GWDG cannot provide a big data installation without knowing its clients’ concrete requirements. The GWDG could offer a big data stack on its cloud infrastructure ad hoc. However, any cloud technology relies on virtualisation, and due to its high I/O overhead, virtualisation currently does not perform well enough for big data installations in production.
The technical requirements for this use case span cloud management and virtualisation. On the topmost technical layer, the use case requires a cloud middleware that supports orchestration, to manage whole big data clusters. The orchestration service needs to allow clients to create and manage those clusters by themselves. Currently, OpenStack’s Savanna is the most advanced project that tackles some of the problems of our use case. Savanna integrates Apache Hadoop with OpenStack, to manage Hadoop clusters on an OpenStack-based cloud infrastructure. However, Savanna is at an early stage of development. The project still lacks integration with higher-level big data platforms, such as Apache’s Pig, Spark, and Hive, which GWDG’s clients frequently request. Furthermore, the elasticity and ease of access that Savanna provides do not yet suffice for our use case. Finally, the high I/O overhead of the underlying virtualisation technology constrains Savanna to small-scale applications. We plan to collaborate with the Savanna team to reach the goals of this use case, and to contribute our work to Savanna. In the virtualisation layer, the major problem lies in the high I/O overhead of current hypervisors. This overhead is the main reason why cloud providers do not deploy big data stacks on cloud infrastructure for production. Here, the use case requires state-of-the-art computational efficiency and an I/O efficiency that goes well beyond the current state of the art. This combination will allow our virtualised big data stack to handle high-volume data efficiently.
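To make the orchestration workflow concrete, the following minimal sketch shows how a client might request a Hadoop cluster through Savanna’s REST API. It assumes Savanna’s vanilla Hadoop plugin and a cluster template registered beforehand; the endpoint, token, and IDs are placeholders, and field names may differ between Savanna versions.

    import requests

    # Placeholders: a Keystone token, a tenant ID, and a cluster
    # template registered beforehand in Savanna.
    SAVANNA_URL = "http://savanna.example.org:8386/v1.0/<tenant-id>"
    TOKEN = "<keystone-token>"

    # Request a new Hadoop cluster from an existing cluster template.
    cluster_request = {
        "name": "demo-hadoop-cluster",
        "plugin_name": "vanilla",          # Savanna's plain Apache Hadoop plugin
        "hadoop_version": "1.2.1",
        "cluster_template_id": "<template-id>",
        "default_image_id": "<hadoop-image-id>",
    }

    response = requests.post(
        SAVANNA_URL + "/clusters",
        json=cluster_request,
        headers={"X-Auth-Token": TOKEN},
    )
    response.raise_for_status()
    print(response.json())

A self-service orchestration layer along these lines, extended with the higher-level platforms named above, is what this use case requires beyond Savanna’s current scope.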
We propose an architecture that satisfies all of this use case’s requirements by using Mikelangelo’s technologies. In contrast to the problem statement, we present the proposed solution in a bottom-up fashion, following Figure 1.
The virtualisation layer will be based on Mikelangelo’s hypervisor, which our consortium will optimise for I/O operations. The hypervisor will function as the technical foundation of our use case, since it will perform with high efficiency and provide the basis for elasticity in the cloud layer. The cloud layer will consist of a cloud middleware, such as OpenStack or OpenNebula. In the cloud layer, virtual machine templates will use OSv, which will further boost performance through para-virtualisation. In addition, OSv will allow for very short boot durations, since it has a minimal, customised disk image. To manage whole big data clusters on demand, we will integrate an orchestration service with the popular Apache big data stack. Specifically, the cloud middleware will work with Hadoop, HDFS, HBase, Hive, Mahout, Pig, and Spark, all of which are Apache projects.
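As an illustration of how the cloud layer could instantiate OSv-based worker nodes, the sketch below boots a virtual machine from an OSv image via OpenStack’s python-novaclient. The image name, flavor, and credentials are placeholders, and the exact client invocation depends on the OpenStack release in use.

    from novaclient import client as nova_client

    # Placeholder credentials; an actual deployment would read these
    # from its OpenStack configuration.
    nova = nova_client.Client(
        "2",                               # compute API version
        "<username>", "<password>",
        "<tenant-name>", "http://keystone.example.org:5000/v2.0",
    )

    # Look up the OSv-based Hadoop worker image and a suitable flavor.
    image = nova.images.find(name="osv-hadoop-worker")
    flavor = nova.flavors.find(name="m1.medium")

    # Boot one worker node; OSv's minimal disk image keeps boot
    # durations short, which supports elastic scaling.
    server = nova.servers.create(
        name="hadoop-worker-1",
        image=image.id,
        flavor=flavor.id,
    )
    print(server.status)

The orchestration service would issue such calls in bulk to grow or shrink a big data cluster in response to load.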
To target GWDG’s clients’ needs for usability, we will provide three different user interfaces. The first user interface will be a web-based GUI, which will target non-IT experts. The second interface will be a command-line interface, which will target power users and IT experts. The third interface will be a RESTful API, which will allow programmers to extend our software stack. These interfaces will differ in their ease of use and level of customisability. All three interfaces will be self-sufficient and allow for self-service. Self-service, in turn, will allow the GWDG to keep its administrative overhead minimal.
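To indicate the level of abstraction we aim for, the following sketch shows how the planned RESTful interface might be used from a client script. All endpoints and payload fields are hypothetical; the actual interface will be defined during the project.

    import requests

    # Hypothetical endpoint of the planned big data service's REST API.
    API = "https://bigdata.gwdg.example.org/api/v1"

    # Request an elastic Spark cluster; the service would translate this
    # into orchestration calls against the cloud middleware.
    spec = {
        "stack": "spark",      # one of the supported Apache stacks
        "workers": 4,          # initial cluster size
        "elastic": True,       # allow the service to scale with load
    }

    resp = requests.post(API + "/clusters", json=spec)
    resp.raise_for_status()
    cluster = resp.json()
    print("cluster id:", cluster["id"], "state:", cluster["state"])

The web-based GUI and the command-line interface would build on the same API, so that all three interfaces expose identical self-service functionality.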
This use case strives to achieve two major results: to win new clients who need a big data stack, and to create open-source software that manages a virtualised big data stack. Thus, first, the GWDG wants to establish big data as a new service. We envision that researchers at many Max Planck Institutes, at the University of Göttingen, and at related institutes, such as the Göttingen State and University Library, will use GWDG’s big data service. Through this use case, the GWDG strives to gain a competitive advantage by being the first organisation in Germany to offer a virtualised, high-performance big data stack to the public sector. Second, we envision that the software we develop for this use case will find uptake in the big data community at large. We are confident that most big data users will welcome a virtualised big data solution with low I/O overhead.