Caches
Alluxio SDK cache
Alluxio provides a distributed cache layer that can be used between Embedded MPP and Object Storage to improve I/O performance. By caching data closer to the Embedded MPP workers, Alluxio reduces the latency of data access and relieves pressure on the underlying storage system.
The speed of the local cache storage is crucial to cache performance. The recommended approach is to attach NVMe SSDs (or other high-performance storage) to the workers of the cluster.
There are two main reasons to use the Alluxio cache:
Reduce data transfer costs from the Object Storage to the Embedded MPP. By reducing the number of remote table scans, caching lowers query latency and saves on egress and cloud storage API costs.
Improve performance in cases where data reading is a bottleneck. This happens when the storage is slow (for example, an on-premises HDFS) or when latency is high (for instance, when the Object Storage and the Embedded MPP are located in different cloud provider regions). Keep in mind that if the Object Storage already delivers very high performance and your local cache storage has a similar speed, the performance benefits may be minimal.
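The trade-off described above can be sketched with a simple hit/miss model. This is only an illustration: the function names and the timing values in the usage note are hypothetical placeholders, not measurements or part of the Embedded MPP, and a real workload depends on many more factors (concurrency, file formats, eviction).

```python
def effective_read_time(hit_rate, cache_time_ms, remote_time_ms):
    """Expected time per read, assuming a simple hit/miss model.

    hit_rate       -- fraction of reads served from the local cache (0..1)
    cache_time_ms  -- assumed time to serve a read from local cache storage
    remote_time_ms -- assumed time to serve a read from the Object Storage
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_time_ms + (1.0 - hit_rate) * remote_time_ms


def speedup(hit_rate, cache_time_ms, remote_time_ms):
    """Ratio of uncached read time to expected cached read time."""
    return remote_time_ms / effective_read_time(
        hit_rate, cache_time_ms, remote_time_ms
    )
```

For example, with made-up numbers, `speedup(0.5, 10, 100)` is about 1.8, while `speedup(0.5, 100, 100)` is exactly 1.0: when the local cache storage is no faster than the Object Storage, the cache does not reduce read time in this model.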
The Alluxio SDK cache is configured as follows:
Add the following properties to the additionalConfig property of the desired catalog (hive, iceberg or delta) in values.yaml.
hive:
  additionalConfig: [
    cache.enabled=true,
    cache.type=ALLUXIO,
    cache.alluxio.max-cache-size=xxxGB,
    cache.base-directory=file:///mnt/flash/data,
    hive.node-selection-strategy=SOFT_AFFINITY
  ]
Add the following volumeMount to templates/presto-template.yaml.
volumeMounts:
  - name: cache-volume
    mountPath: /opt/data/alluxio
Add the following volume to templates/presto-template.yaml.
volumes:
  - name: cache-volume
    hostPath:
      path: /opt/data/
This Alluxio SDK cache is completely transparent to users. To verify that the cache is working, check the directory set by cache.base-directory and see whether temporary files are created there. Additionally, Alluxio exports various JMX metrics while performing caching-related operations. Refer to "Monitoring Alluxio SDK" <https://prestodb.io/docs/current/cache/local.html#monitoring> for more information.
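The directory check can be scripted. The following is a minimal sketch, assuming Python is available wherever the cache directory is visible (for example on the worker node); the helper name summarize_cache_dir is hypothetical and not part of Alluxio or the Embedded MPP.

```python
import os


def summarize_cache_dir(path):
    """Return (file_count, total_bytes) for all files under path.

    A missing or empty directory yields (0, 0), which suggests that
    nothing has been cached there yet.
    """
    file_count = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            try:
                total_bytes += os.path.getsize(full_path)
                file_count += 1
            except OSError:
                # The file may have been evicted between listing and stat.
                continue
    return file_count, total_bytes
```

Call it with the local path behind cache.base-directory (without the file:// scheme), for instance `summarize_cache_dir("/mnt/flash/data")`, before and after running a few queries against the cached catalog. A growing file count or byte total indicates that the cache is being populated.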