Storage Unification and Abstraction

Big data technologies have enabled organizations to store and process growing volumes of data. However, that data is often spread across many different storage systems within the organization. When data lives in different systems, and potentially in different physical locations, it is difficult to provide a consolidated, aggregated view of it in a performant and efficient manner. Implementing a data lake is a common solution to this problem, but it requires maintaining permanent copies of the data, which can be costly.

Alluxio, through its unified namespace feature, facilitates access to different systems and seamlessly bridges computation frameworks and underlying storage. Applications only need to interact with Alluxio to access data stored in any underlying storage system. Alluxio acts as a “virtual data lake” which provides an aggregated view of all data from different data sources, while not creating permanent copies of that data.
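The idea of a single namespace over several storage systems can be illustrated with a small toy model. The sketch below is not Alluxio's implementation or API. It only assumes that a unified namespace can be modeled as a table mapping global path prefixes to underlying storage locations, with the longest matching prefix winning. The class name MountTable, the host name, and the bucket name are all hypothetical.

```python
class MountTable:
    """Toy model of a unified namespace (illustrative only, not Alluxio code)."""

    def __init__(self):
        # Maps a global path prefix to an underlying storage URI.
        self._mounts = {}

    def mount(self, global_prefix, storage_uri):
        # Normalize trailing slashes; the root prefix stays "/".
        key = global_prefix.rstrip("/") or "/"
        self._mounts[key] = storage_uri.rstrip("/")

    def resolve(self, global_path):
        # Collect every mounted prefix that covers the path.
        candidates = [
            prefix for prefix in self._mounts
            if global_path == prefix
            or global_path.startswith(prefix.rstrip("/") + "/")
        ]
        if not candidates:
            raise KeyError(f"no storage mounted for {global_path}")
        # The longest prefix wins, so nested mounts override their parents.
        prefix = max(candidates, key=len)
        suffix = global_path if prefix == "/" else global_path[len(prefix):]
        return self._mounts[prefix] + suffix


# Hypothetical storage locations, used only for illustration.
table = MountTable()
table.mount("/", "hdfs://namenode:9000/data")
table.mount("/s3", "s3://my-bucket/warehouse")

print(table.resolve("/logs/app.log"))
print(table.resolve("/s3/sales/2020.csv"))
```

In this model the application only ever sees global paths such as /s3/sales/2020.csv. Which storage system actually holds the file is a detail of the table, not of the application.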


Figure: Alluxio efficiently unifies access to big data

There are several benefits to using Alluxio as a “virtual data lake”:

  • Unified access. Applications only need to interact with a single system and a single namespace for all of their data. They do not need to be concerned with how to access data from different systems, because any piece of data can be accessed simply by its global path.
  • No ETL. Alluxio transparently pulls in data from the existing storage system on demand, when the application requires it. Therefore, no explicit ETL or copying of the data is required.
  • Configuration Management. Different storage systems typically need specific configuration for access. Alluxio stores and manages the configuration for the storage systems, so applications no longer need to, which simplifies them. Because the storage configurations are managed by Alluxio, applications can even access data from systems that have conflicting configurations.
  • Modern, flexible architecture. Alluxio’s unified namespace promotes and supports the separation of compute from storage. This type of architecture gives modern data processing great flexibility in how resources are allocated.
  • Storage API Independence. Alluxio supports common storage interfaces, including HDFS and S3. Because of Alluxio’s unified namespace, applications can access all of the data via their desired interface, regardless of the API of the system that holds the source data.
  • Performance. Alluxio implements local caching and eviction strategies to provide fast local access to important and frequently used data, without maintaining permanent copies of the data.
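The "No ETL" and "Performance" points above can be sketched together as a read-through cache: data is fetched from the underlying storage only on first access, kept locally for later reads, and evicted when space runs out. The sketch below is a toy model, not Alluxio code. The source does not say which eviction strategies Alluxio uses, so least-recently-used (LRU) eviction is an assumption made here for illustration, and the names ReadThroughCache, fetch, and capacity are hypothetical.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy read-through cache with LRU eviction (illustrative only)."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch            # callable that reads from underlying storage
        self._capacity = capacity      # maximum number of cached entries
        self._entries = OrderedDict()  # ordered from least to most recently used

    def read(self, path):
        if path in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(path)
            return self._entries[path]
        # Cache miss: pull the data in on demand, with no up-front copy step.
        data = self._fetch(path)
        self._entries[path] = data
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry. Only the cached copy is
            # dropped; the underlying storage still holds the data.
            self._entries.popitem(last=False)
        return data


# A dictionary stands in for a remote storage system (hypothetical data).
remote_store = {"/sales/q1.csv": b"q1", "/sales/q2.csv": b"q2"}
cache = ReadThroughCache(fetch=remote_store.__getitem__, capacity=1)

cache.read("/sales/q1.csv")  # miss: fetched from remote_store
cache.read("/sales/q1.csv")  # hit: served from the local cache
cache.read("/sales/q2.csv")  # miss: fetched, and q1 is evicted
```

Because evicted entries can always be fetched again from the source system, the cache never has to act as a permanent copy of the data.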

As users face increasing data volumes, a growing number of storage technologies, and a growing number of applications, diverse storage systems are becoming difficult to manage with traditional approaches. Alluxio’s unified namespace enables a “virtual data lake” without maintaining permanent copies of the data. Applications access all of their files from different storage systems via Alluxio, just as they would within a traditional data lake. Users simply need to configure the underlying storage systems in Alluxio, and Alluxio transparently manages the data.
