Optimizing Memory and Storage Disaggregation for Data-intensive Systems

Description

Data-intensive systems such as big data and large machine learning (ML) systems experience serious scalability challenges due to the ever-increasing data demand from ML and analytics applications and the resource fragmentation caused by conventional monolithic server architecture. Memory and storage…

Data-intensive systems such as big data and large machine learning (ML) systems experience serious scalability challenges due to the ever-increasing data demand from ML and analytics applications and the resource fragmentation caused by conventional monolithic server architecture. Memory and storage disaggregation emerges as a pivotal technology to address these challenges by decoupling memory and storage resources from individual servers and managing and provisioning them to applications as a shared resource pool. This dissertation investigates several important aspects of memory and storage disaggregation and proposes novel solutions to support data-intensive applications.First, caching is a fundamental way to utilize disaggregated storage, but building a large disaggregated cache is challenging because the commonly-used fix-sized cache block allocation scheme is unable to provide good cache performance with low memory overhead for diverse cloud workloads with vastly different I/O patterns. The dissertation proposes a novel adaptive cache block allocation approach that dynamically adjusts cache block sizes based on changing I/O patterns. This approach significantly improves I/O performance while reducing memory usage, outperforming traditional fixed-size cache systems in diverse cloud workloads. Evaluation shows that it improves read latency by 20% and write latency by 9%. It also reduces the amount of I/O traffic to cloud block storage by up to 74% while achieving up to 41% memory savings with only 2 ms. Second, large ML applications such as large language model (LLM) inference are memory demanding, but to support them using disaggregated memory brings challenges to memory management since disaggregated memory has higher memory access latency compared to local memory. The dissertation proposes latency-aware memory aggregation which cautiously distributes memory accesses to minimize the latency gap between local and disaggregated memory. It also proposes NUMA-aligned tensor parallelism to further improve the computing efficiency. With these optimizations, LLM inference achieves substantial speedups. For example, first token latency improves by 61%, and end-to-end latency improves by 43% for a LLM inference task which uses a model of 66 billion parameters when the batch size is 8. Finally, to address the cost, power consumption, and volatility of DRAM, the dissertation proposes to incorporate flash memory into memory pools within the disaggregation framework. By establishing a tiered memory architecture which combines fast-tier local DRAM with slow-tier DRAM and flash memory in the memory pool and effectively migrates data based on hotness across memory tiers, this approach not only reduces expenses but also maintains the overall performance and scalability of data-intensive systems. For example, with 50% saving in memory cost, the performance degradation of training ResNet50 on ImageNet dataset is only 2.68%. Together, these contributions systematically optimize the use of memory and storage disaggregation to deliver more efficient, scalable, and cost-effective systems for supporting the data explosion in today’s and future computing systems.