Alluxio holds a unique place in the big data ecosystem, residing between storage systems (such as Amazon S3, Apache HDFS or OpenStack Swift) and computation frameworks and applications (such as Apache Spark or Hadoop MapReduce) to provide a central point of access with a memory-centric design. Alluxio works best when computation frameworks and distributed storage are decoupled and Alluxio is deployed alongside a cluster’s computation framework.
For user applications and computation frameworks, Alluxio is the layer that manages data access and provides fast storage, facilitating data sharing and locality between jobs, regardless of whether they are running on the same computation engine. As a result, Alluxio can bring an order of magnitude speed up for those big data applications while providing a common interface of data access.
For under storage systems, Alluxio bridges the gap between big data applications and traditional storage systems, and expands the set of workloads available to utilize the data. Since Alluxio hides the integration of under storage systems from applications, any under storage can back all the applications and frameworks running on top of Alluxio. Also, when mounting multiple under storage systems simultaneously, Alluxio can serve as a unifying layer for any number of varied data sources.
Alluxio’s design uses a single primary master and multiple workers. At a high level, Alluxio can be divided into three components: the master, workers, and clients. The master and workers together make up the Alluxio servers, which are the components a system admin would maintain and manage. The clients are used to talk to Alluxio servers by the applications, such as Spark or MapReduce jobs, Alluxio command-line users, or the FUSE layer.
Alluxio master service can be deployed as one primary master and several secondary masters for fault tolerance. When the primary master goes down, a secondary master is elected to become the primary master.
There is only one primary master in an Alluxio cluster. The primary master is responsible for managing the global metadata of the system, for example, the file system metadata (e.g. the namespace tree), the block metadata (e.g the block worker locations), as well as the workers metadata. Alluxio Clients interact with the primary master to read or modify this metadata. In addition, all workers periodically send heartbeat information to the primary master to maintain their participation in the cluster. The primary master does not initiate communication with other components; it only interacts with other components as a response to requests via RPC services. Additionally, the primary master also saves file system journals to a distributed persistent storage to allow for recovery of master state information.
The secondary master replays journals written by the primary master, periodically squashes the journal entries, and writes checkpoints for faster recovery in the future. It does not process any requests from any Alluxo components.
Alluxio workers are responsible for managing configurable local resources allocated to Alluxio (e.g. memory, SSDs, HDDs etc.). Alluxio workers store data as blocks and serve client requests that read or write data by reading or creating new blocks within its local resources. The worker is only responsible for the data within these blocks; the actual mapping from file to blocks is only stored by the master.
Also, Alluxio workers perform data operations on the under store (e.g. data transfer or under store metadata operations). This brings two important benefits: Data read from by one client is immediately available to other clients reading from the same worker, and clients don't need to know details about how to interact with each under store; the worker shields the client from the complexity of talking to various under storages.
Because RAM usually offers limited capacity, blocks in a worker can be evicted when space is full. Workers employ eviction policies to decide which data to keep in the Alluxio space. For more on this topic, please check out the documentation for Tiered Storage.
Users interact with Alluxio servers through Alluxio clients. Clients communicate with masters to carry out metadata operations and with workers to read data. Alluxio provides both an Alluxio Java client and a [Hadoop-compatible Java client. In addition to Java clients, there are Go and Python clients built on top of the Alluxio's REST API. Alluxio also supports the S3 API, so existing S3 clients can interact with Alluxio the same way they interact with S3.