1/ Problems with Scaling On-Premise Hadoop Clusters
Many companies have data stored in a Hadoop Distributed File System (HDFS) cluster in their on-premises environment.
As part of data-driven transformation efforts, the amount of data stored and the number of queries are growing fast. This puts more load on the HDFS systems.
As the number of frameworks for deriving insights from data has increased over the past few years, platform engineering teams at enterprises have been pushed to support new and popular frameworks on their already busy data lakes. In addition, data lakes have become the landing zone for all enterprise data. It is not uncommon to see Hadoop-based data lakes running at beyond 100% utilization. All of this leads to very large and busy Hadoop clusters.
SYMPTOMS THAT REQUIRE LEVERAGING A HYBRID CLOUD DEPLOYMENT
• Hadoop clusters are running beyond 100% CPU capacity
• Hadoop clusters are running close to 100% I/O capacity
• Hadoop clusters cannot be expanded due to high load on the master NameNode
• No additional capacity on the Hadoop cluster to support new frameworks
Despite these issues, the alternative of managing data across both the data center and the public cloud can be daunting to enterprises. However, as architectures move to a disaggregated compute and storage stack, there are many opportunities to leverage big data in the cloud and offload the HDFS systems. Enterprises can leverage data orchestration technologies to seamlessly implement hybrid cloud approaches and get the best of both worlds.
2/ Leveraging the Power of Hybrid Cloud Data Analytics
Problems in current approaches for managing data in hybrid environments
Today’s conventional wisdom states that the network latency between on-premises data centers and the cloud prevents you from running analytic workloads in the cloud while the data stays on-prem. As a result, most companies copy their data into a cloud environment and maintain that duplicate data. Compliance and data sovereignty requirements may even prevent organizations from copying data into the cloud at all. All of this means that it is challenging to make on-prem HDFS data both accessible and high performing.
TWO APPROACHES TO MANAGING DATA FOR ANALYTICS ACROSS CLOUD STACKS
There are two common approaches we see today in managing hybrid data across technology stacks.
1. Copying data from on-premise storage to cloud storage to run analytics
Typically, users run commands like Hadoop DistCp to copy data from Hadoop clusters to cloud stores like Google Cloud Storage. While this makes it easy to move data, this method creates several problems:
• As soon as the data is moved, it is already stale and out of sync, since data may continue to change on the on-premise cluster. There is no easy way to keep the copies in sync.
• As a result, users may only run read-only analytic workloads on the data copied into the cloud, limiting the value of hybrid deployments.
• Workloads may not work directly on cloud storage and may need application changes. In addition, performance may be significantly lower than in on-premise deployments.
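The staleness problem can be made concrete with a small sketch: after a DistCp-style copy, the only way to notice drift is to re-compare listings from both sides. The paths and metadata below are hypothetical placeholders, not real cluster data.

```python
# Illustrative sketch: detect files that have drifted between an
# on-premise HDFS listing and a copy made earlier with DistCp.
# Listings are modeled as {path: (size_bytes, modification_time)}.

def find_stale_files(source_listing, copy_listing):
    """Return paths that are missing from or differ in the copy."""
    stale = []
    for path, meta in source_listing.items():
        if copy_listing.get(path) != meta:
            stale.append(path)
    return sorted(stale)

source = {
    "/data/trades/2020-01-01.parquet": (1024, 1577836800),
    "/data/trades/2020-01-02.parquet": (2048, 1577923200),  # changed after copy
    "/data/trades/2020-01-03.parquet": (4096, 1578009600),  # new since copy
}
copy = {
    "/data/trades/2020-01-01.parquet": (1024, 1577836800),
    "/data/trades/2020-01-02.parquet": (2000, 1577900000),
}

print(find_stale_files(source, copy))
# -> ['/data/trades/2020-01-02.parquet', '/data/trades/2020-01-03.parquet']
```

Every such reconciliation pass has to re-list both sides, which is exactly the operational burden that a synchronized namespace avoids.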
2. Using cloud data services like NetApp ONTAP
Users can move file and object data in an automated, tool-driven manner using a product like NetApp ONTAP. There are a few challenges with this approach as well:
• These technologies can be expensive and only work with limited APIs, such as the network file system (NFS) API.
• After moving data, users still need to leverage cloud storage in addition to on-premise storage to run the analytics.
3/ Solution Overview
Alluxio is a data orchestration platform for analytics and machine learning applications. Counter to conventional
wisdom, you can create high performance hybrid cloud data analytics systems with Alluxio data orchestration. How? By
mounting the on-prem data stores into Alluxio. Alluxio data orchestration provides caching, API translation, and a unified
namespace to the applications.
Alluxio works with cloud stores like AWS S3, Google Cloud Storage, and Microsoft Azure to provide you with an enterprise hybrid cloud analytics strategy for bursting compute that spans on-prem and cloud data stores. By bringing the data to the analytics and machine learning applications, performance is the same as having the data co-located in the cloud. Also, the on-prem data stores are relieved of the computation, and the additional I/O overhead is minimized.
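As a minimal sketch of what mounting looks like from the application side: once an on-prem HDFS path is mounted into Alluxio, frameworks address it through an alluxio:// URI instead of hdfs://. The hostnames and paths below are hypothetical placeholders.

```python
# Sketch: addressing data in the Alluxio namespace. The default
# Alluxio master RPC port is 19998; it may differ per deployment.

def alluxio_uri(master_host, path, port=19998):
    """Build an alluxio:// URI for a path in the Alluxio namespace."""
    return f"alluxio://{master_host}:{port}{path}"

uri = alluxio_uri("alluxio-master.example.com", "/hdfs_mount/trades")
print(uri)  # alluxio://alluxio-master.example.com:19998/hdfs_mount/trades

# With PySpark, the same data would then be read as, e.g.:
#   df = spark.read.parquet(uri)
# No application change is needed beyond swapping the URI scheme.
```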
Alluxio brings your data to compute, on demand. It is a data orchestration platform for analytics and AI/ML workloads in the cloud, enabling data locality, data accessibility, and data elasticity. Alluxio is designed not for persistence but to address the concerns of data access as required by computational frameworks. It depends on the persistence layer below it, such as HDFS or GCS, as the system of truth. There are a few core capabilities it provides:
With Alluxio, your on-premise data gets moved closer to compute, directly co-located within the same instance as the Apache Spark or Presto executor / node that needs that piece of data, providing a highly distributed caching layer.
Once data from on-premise Hadoop clusters is in Alluxio, the same data can be accessed in many different ways using many different APIs, including the HDFS API, the S3 API, the POSIX API, and others. This means all existing applications built for analytical and AI workloads can run directly on this data without any changes to the applications themselves.
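To illustrate, here is the same logical file addressed through three different APIs. All hostnames, ports, and mount points below are hypothetical placeholders, and the proxy endpoint shape may vary between Alluxio versions.

```python
# Sketch: one file in the Alluxio namespace, three access routes.

LOGICAL_PATH = "/hdfs_mount/trades/2020-01-02.parquet"

# 1. HDFS-compatible API (e.g. from Spark, Presto, or MapReduce):
hdfs_style = f"alluxio://alluxio-master.example.com:19998{LOGICAL_PATH}"

# 2. S3-compatible API via the Alluxio proxy (e.g. with an S3 client
#    whose endpoint URL points at the proxy instead of AWS):
s3_endpoint = "http://alluxio-proxy.example.com:39999/api/v1/s3"

# 3. POSIX API via an Alluxio FUSE mount (plain open()/read() calls):
posix_path = f"/mnt/alluxio{LOGICAL_PATH}"

for route in (hdfs_style, s3_endpoint, posix_path):
    print(route)
```

The application picks whichever route matches the API it was already written against; the underlying data and namespace are the same.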
Alluxio can be elastically scaled along with the analytics frameworks, including in container orchestrated environments.
Data can also be replicated within the Alluxio cluster with each file being replicated as many times as needed.
COMPUTE-DRIVEN DATA ON-DEMAND
Any folder or bucket can be mounted into Alluxio, and the data from that location can immediately be pulled into Alluxio as the workload demands. Initially only the mounted folders' metadata is read; the data itself gets pulled only when the compute framework asks for a specific file. Data can also be prefetched, pinned, or expired depending on the workload.
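The mount, prefetch, pin, and expiry operations above map onto the Alluxio `fs` command-line interface. The sketch below assembles those commands for illustration; exact flags may vary between Alluxio versions, and the HDFS address is a hypothetical placeholder.

```python
# Sketch of the Alluxio CLI operations behind mount, prefetch,
# pin, and TTL-based expiry.

def alluxio_fs_cmd(*args):
    """Assemble an 'alluxio fs' shell command as an argument list."""
    return ["alluxio", "fs", *args]

commands = [
    # Mount an on-prem HDFS directory into the Alluxio namespace:
    alluxio_fs_cmd("mount", "/hdfs_mount", "hdfs://namenode.example.com:8020/data"),
    # Prefetch (warm) the data across Alluxio workers:
    alluxio_fs_cmd("distributedLoad", "/hdfs_mount/trades"),
    # Pin hot data so it is not evicted from the cache:
    alluxio_fs_cmd("pin", "/hdfs_mount/trades/latest"),
    # Expire cold data after one day (TTL in milliseconds):
    alluxio_fs_cmd("setTtl", "/hdfs_mount/trades/old", "86400000"),
]

for cmd in commands:
    print(" ".join(cmd))
    # In a live deployment these would be run with subprocess.run(cmd).
```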
4/ Example: How Leading Hedge Funds are Leveraging the Hybrid Cloud
The Challenge for Quantitative Hedge Funds
Quantitative hedge funds rely on financial models to manage their business and drive investment strategy. The ongoing
business challenge is to develop more powerful models so they can make intelligent investment decisions in a shorter
period of time and at the lowest possible cost. The development and testing of investment models relies on Machine
Learning techniques applied to vast amounts of data – the more data, the better the model. Data is collected from
thousands of public and proprietary sources and totals many petabytes. The speed at which this data is processed is
critical, as faster model runs enable multiple iterations and improved decision making.
Typically, model runs are performed on-premise, with a typical run taking about one hour on a few hundred to a few thousand data processing nodes. Apache Spark is commonly used as the compute framework, and data is typically stored using the Hadoop Distributed File System (HDFS). The workload profile can be variable, with periodic load bursts significantly higher than average. Because of the challenges of overprovisioning the infrastructure and the constraints around peak loads, many hedge funds leverage hybrid cloud data bursting.
But running computational frameworks like Apache Spark remotely against on-premises data presents challenges:
• For security reasons data may not be allowed to be stored in the cloud, requiring model data to be transferred
from the on-premise data center prior to each run.
• Due to the size of the data and the physical transfer requirement, model run time in the cloud can increase
significantly resulting in fewer models built per day.
• Any change to the model parameters requires a restart of the data loading process.
The Hybrid Cloud Analytics Solution with Alluxio & Google Cloud Platform
Alluxio running with compute frameworks on Google Cloud Platform solves the challenges listed above. An Alluxio cluster can be deployed on GCP, and data can be loaded into Alluxio once. Subsequent data requests by the application are served from Alluxio memory.
The Alluxio cluster provides temporary, non-persistent storage of the data in memory, so when the Alluxio instances are brought down, the data is effectively removed. Additionally, the data in Alluxio is encrypted (by the client), so even if the cluster is compromised, the data is still secure.
Leading Hedge Fund Example
A leading hedge fund with more than $50 billion under management turned to Alluxio for help with bursting Spark workloads in a public cloud to enable hybrid workloads for on-premise HDFS. With Alluxio, the hedge fund sees better performance, increased flexibility, and dramatically lower costs: the number of model runs per day increased by 4x, and the cost of compute was reduced by 95%.
The following image shows how they run hybrid cloud analytics, bursting additional Google Compute Engine VMs, directly
using on-prem data.
With Alluxio deployed, machine learning run time was reduced by 75%, and the number of model iterations per day increased from two to eight. As the data sets grow in size, Alluxio will be able to scale linearly to deliver the same performance. With the dramatic reduction in data access time enabling the use of spot instances, the company achieved a 95% reduction in the cost of compute. Alluxio integrated seamlessly with the existing infrastructure, presenting the same API to the application and requiring no changes to applications or storage. All security requirements were met, with data encrypted in Alluxio and no persistent storage in the cloud.
5/ Additional Resources
• Hybrid Cloud Analytics with Alluxio
• White Paper: Using Alluxio to Improve the Performance and Consistency of HDFS Clusters