[ISMM'25] EMD: Fair and Efficient Dynamic Memory De-bloating of Transparent Huge Pages
Parth Gangar, Ashish Panwar, K. Gopinath
24th ACM SIGPLAN International Symposium on Memory Management (ISMM) 2025.
Processors rely on huge pages to reduce the cost of virtual-to-physical address translation. However, huge pages are notorious
for creating memory bloat – a phenomenon wherein the OS ends up allocating more physical memory to an application than it actually requires.
This extra memory can be reclaimed by the OS via de-bloating at runtime. Unfortunately, we find that current OS-level solutions either lack support for
dynamic memory de-bloating, or suffer from performance and fairness pathologies while de-bloating.
We address these issues with EMD (Efficient Memory De-bloating). The key insight in EMD is that different regions in an application's address space
exhibit different amounts of memory bloat. Consequently, the tradeoff between memory efficiency and performance varies significantly within a given
application; e.g., we find that memory bloat is typically concentrated in specific regions, and de-bloating them leads to minimal performance impact.
Building on this insight, EMD employs a prioritization scheme for fine-grained, efficient, and fair reclamation of memory bloat. EMD improves performance
by up to 69% compared to HawkEye — a state-of-the-art OS-based huge page management system. EMD also eliminates fairness concerns associated with dynamic memory de-bloating.
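To make the region granularity concrete, here is a minimal user-space C sketch (not EMD itself, which de-bloats inside the kernel under its prioritization scheme) showing that Linux already lets huge-page backing and reclamation be steered per address-space region via madvise(); the region size and hot/cold split below are made up:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL << 20;                       /* a 64 MiB region */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ask the kernel to back this region with 2MB transparent huge pages. */
    madvise(buf, len, MADV_HUGEPAGE);
    memset(buf, 1, len);                           /* touching promotes it */

    /* Suppose only the first 8 MiB stays hot; the rest is bloat. Turn off
     * huge-page backing for the cold part and discard it so the kernel can
     * reclaim the memory. (MADV_DONTNEED discards contents; in-kernel
     * de-bloating instead splits huge pages and frees only unused 4KB pages.) */
    size_t hot = 8UL << 20;
    madvise(buf + hot, len - hot, MADV_NOHUGEPAGE);
    madvise(buf + hot, len - hot, MADV_DONTNEED);

    printf("cold part of the region de-bloated\n");
    munmap(buf, len);
    return 0;
}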
[ASPLOS'25] POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar
30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2025.
Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid
batching that combines the prefill and decode phases of different requests into the same batch. This approach optimizes linear operations but remains inefficient for
attention computation because existing attention kernels specialize execution independently for the prefill and decode phases.
In this paper, we present POD-Attention — the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization
of both compute and memory bandwidth by carefully allocating the GPU’s resources such that prefill and decode operations happen concurrently on the same multiprocessor.
POD-Attention speeds up attention computation by up to 59% (mean 28%), enabling higher-throughput and lower-latency LLM inference compared to the use of independently
optimized prefill and decode attention kernels.
[ASPLOS'25] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar
30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2025.
PagedAttention is a popular approach for dynamic memory allocation in LLM serving systems. It enables on-demand allocation of GPU memory to mitigate KV cache fragmentation — a phenomenon
that crippled the batch size (and consequently throughput) in prior systems. However, in trying to allocate physical memory at runtime, PagedAttention ends up changing the virtual memory
layout of the KV cache from contiguous to non-contiguous. Such a design leads to non-trivial programming and performance overheads.
We present vAttention — an approach that mitigates fragmentation in physical memory while retaining the virtual memory contiguity of the KV cache. We achieve this by decoupling the
allocation of virtual and physical memory using CUDA virtual memory management APIs. We also introduce various LLM-specific optimizations to address the limitations of CUDA virtual memory
support. Overall, vAttention is a simpler, portable, and performant alternative to PagedAttention: it supports various attention kernels out-of-the-box and improves LLM serving throughput
by up to 1.23× compared to the use of PagedAttention-based kernels of FlashAttention-2 and FlashInfer.
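As a rough sketch of the underlying idea, the C program below uses the CUDA driver's virtual memory management APIs (cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemSetAccess) to reserve one contiguous virtual range for a KV cache and attach physical memory to it on demand. This is not vAttention's allocator; the sizes and the number of mapped chunks are illustrative:

/* Compile with: cc vmm_sketch.c -lcuda */
#include <cuda.h>
#include <stdio.h>

#define CHECK(x) do { CUresult r = (x); if (r != CUDA_SUCCESS) { \
    printf("CUDA error %d at %s\n", (int)r, #x); return 1; } } while (0)

int main(void)
{
    CUdevice dev; CUcontext ctx;
    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    CUmemAllocationProp prop = {0};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t gran = 0;   /* physical allocation granularity (typically 2 MiB) */
    CHECK(cuMemGetAllocationGranularity(&gran, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    /* 1. Reserve contiguous virtual space for the maximum KV cache size. */
    size_t va_size = 64 * gran;               /* hypothetical upper bound  */
    CUdeviceptr kv_cache;
    CHECK(cuMemAddressReserve(&kv_cache, va_size, 0, 0, 0));

    /* 2. As the sequence grows, map one physical chunk at a time.        */
    CUmemAccessDesc access = {0};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    for (size_t off = 0; off < 4 * gran; off += gran) {  /* first 4 chunks */
        CUmemGenericAllocationHandle h;
        CHECK(cuMemCreate(&h, gran, &prop, 0));
        CHECK(cuMemMap(kv_cache + off, gran, 0, h, 0));
        CHECK(cuMemSetAccess(kv_cache + off, gran, &access, 1));
        /* Attention kernels see one contiguous buffer starting at kv_cache. */
    }

    printf("reserved %zu bytes virtual, mapped %zu bytes physical\n",
           va_size, 4 * gran);
    return 0;
}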
[OSDI'24] Taming Throughput-Latency Trade-off in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, Ramachandran Ramjee
18th USENIX Symposium on Operating Systems Design and Implementation (OSDI) 2024.
Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt and produces the first output token; the second is decode, which
generates the rest of the output tokens one at a time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode
iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and,
consequently, for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations, which makes it challenging to achieve both high throughput and low latency.
We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills, which splits a prefill request into
near-equal-sized chunks, and creates stall-free schedules that add new requests to a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput
with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations, resulting in minimal pipeline bubbles.
Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. Compared to vLLM, we achieve 2.6× higher serving capacity for
Mistral-7B on a single A100 GPU and up to 3.7× higher serving capacity for the Yi-34B model on two A100 GPUs. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to a 5.6× gain in end-to-end serving capacity.
The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.
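The following C sketch illustrates the stall-free batching idea (it is not the Sarathi-Serve scheduler): every iteration admits all ongoing decodes first and spends the leftover token budget on a bounded chunk of a pending prefill, so decodes are never paused; the token budget, decode count, and prompt length are hypothetical:

#include <stdio.h>

struct batch {
    int num_decode_tokens;        /* ongoing requests, 1 token each        */
    int prefill_chunk_tokens;     /* prompt tokens taken from one prefill  */
};

/* Build one iteration's batch: admit decodes first, then fill the leftover
 * token budget with a bounded chunk of a pending prefill.                 */
static struct batch build_batch(int token_budget, int active_decodes,
                                int *pending_prefill_tokens)
{
    struct batch b = { active_decodes, 0 };
    int budget_left = token_budget - active_decodes;

    if (budget_left > 0 && *pending_prefill_tokens > 0) {
        int chunk = *pending_prefill_tokens < budget_left
                        ? *pending_prefill_tokens : budget_left;
        b.prefill_chunk_tokens = chunk;
        *pending_prefill_tokens -= chunk;
    }
    return b;
}

int main(void)
{
    int prompt = 1500;   /* prompt tokens of a newly arrived request */
    for (int iter = 0; iter < 5; iter++) {
        struct batch b = build_batch(/*token_budget=*/512,
                                     /*active_decodes=*/32, &prompt);
        printf("iter %d: %d decode tokens + %d prefill tokens\n",
               iter, b.num_decode_tokens, b.prefill_chunk_tokens);
    }
    return 0;
}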
[MLSys'24] VIDUR: A Large-Scale Simulation Framework for LLM Inference
Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, Alexey Tumanov
7th Annual Conference on Machine Learning and Systems (MLSys) 2024.
Optimizing the deployment of large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring the large configuration space formed
by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur – a large-scale, high-fidelity, easily-extensible simulation framework for LLM
inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads
by estimating several metrics of interest such as latency and throughput. We validate the fidelity of Vidur on several LLMs and show that it estimates inference latency with less than 9% error across the range. Further, we
present Vidur-Search, a configuration search tool that helps optimize LLM deployment. Vidur-Search uses Vidur to automatically identify the most cost-effective deployment configuration that meets application performance
constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, in contrast to a deployment-based exploration which would require 42K GPU hours – costing
218K dollars. Source code for Vidur is available at https://github.com/microsoft/vidur.
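As a toy illustration of the profile-then-predict approach (not Vidur's actual performance models), the C sketch below interpolates profiled per-operator runtimes to estimate an iteration's latency; all profile points and model dimensions are made up:

#include <stdio.h>

/* Profiled runtime (ms) of one operator at increasing token counts. */
struct profile { int tokens[4]; double ms[4]; };

static double predict(const struct profile *p, int tokens)
{
    /* Piecewise-linear interpolation over the profiled points. */
    if (tokens <= p->tokens[0]) return p->ms[0];
    for (int i = 1; i < 4; i++) {
        if (tokens <= p->tokens[i]) {
            double f = (double)(tokens - p->tokens[i - 1]) /
                       (p->tokens[i] - p->tokens[i - 1]);
            return p->ms[i - 1] + f * (p->ms[i] - p->ms[i - 1]);
        }
    }
    return p->ms[3];
}

int main(void)
{
    /* Hypothetical profiles for two operator classes of one layer. */
    struct profile mlp  = { {128, 512, 2048, 8192}, {0.10, 0.22, 0.80, 3.10} };
    struct profile attn = { {128, 512, 2048, 8192}, {0.05, 0.15, 0.70, 2.90} };

    int batch_tokens = 1024, num_layers = 32;
    double iter_ms = num_layers *
                     (predict(&mlp, batch_tokens) + predict(&attn, batch_tokens));
    printf("estimated iteration latency: %.2f ms\n", iter_ms);
    return 0;
}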
[PETS'24] SIGMA: Secure GPT Inference with Function Secret Sharing
Kanav Gupta, Neha Jawalkar, Ananta Mukherjee, Nishanth Chandran, Divya Gupta, Ashish Panwar, Rahul Sharma
24th Privacy Enhancing Technologies Symposium (PETS) 2024.
Secure 2-party computation (2PC) enables secure inference that offers protection for both proprietary machine learning (ML) models and sensitive inputs to them. However, the existing secure inference
solutions suffer from high latency and communication overheads, particularly for transformers. Function secret sharing (FSS) is a recent paradigm for obtaining efficient 2PC protocols with a
preprocessing phase. We provide Sigma, the first end-to-end system for secure transformer inference based on FSS. By constructing new FSS-based protocols for complex machine learning functionalities,
such as Softmax, GeLU and SiLU, and also accelerating their computation on GPUs, Sigma improves the latency of secure inference of transformers by 12-19× over the state-of-the-art that uses
preprocessing and GPUs. We present the first secure inference of generative pre-trained transformer (GPT) models. In particular, Sigma executes Meta's Llama2 (available on HuggingFace) with 13
billion parameters in 38 seconds and GPT2 in 1.5 seconds.
[IEEE CAL'24] Address Scaling: Architectural Support for Fine-Grained Thread-Safe Metadata Management
Deepanjali Mishra, Konstantinos Kanellopoulos, Ashish Panwar, Akshitha Sriraman, Vivek Seshadri, Onur Mutlu, Todd C Mowry
IEEE Computer Architecture Letters 2024.
In recent decades, software systems have grown significantly in size and complexity. As a result, such systems are more prone to bugs, which can cause performance and correctness
problems. Using run-time monitoring tools is one approach to mitigating these problems. However, these tools maintain metadata for every byte of application data they monitor,
which introduces performance overheads due to additional metadata accesses. We propose Address Scaling, a new hardware framework that performs fine-grained metadata management to
reduce metadata access overheads in run-time monitoring tools. Our mechanism is based on the observation that different run-time monitoring tools maintain metadata at varied
granularities. Our key insight is to maintain the data and its corresponding metadata within the same cache line, to preserve locality. Address Scaling improves the performance of
Memcheck, a dynamic monitoring tool that detects memory-related errors, by 3.55× and 6.58× for sequential and random memory access patterns respectively, compared to the state-of-the-art
systems that store the metadata in a memory region that is separate from the data.
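The locality insight can be mimicked in software as in the C sketch below, where each 64-byte cache line holds 56 bytes of application data and 8 bytes of Memcheck-style validity metadata, so data and metadata accesses hit the same line. Address Scaling realizes this transparently in hardware; the layout and helper functions here are only an illustrative analogue:

#include <stdint.h>
#include <stdio.h>

#define LINE_DATA_BYTES 56

struct __attribute__((aligned(64))) scaled_line {
    uint8_t  data[LINE_DATA_BYTES];   /* application bytes                      */
    uint64_t meta;                    /* 1 metadata bit per data byte (56 used) */
};

static void store_byte(struct scaled_line *lines, size_t addr, uint8_t v)
{
    struct scaled_line *l = &lines[addr / LINE_DATA_BYTES];
    size_t off = addr % LINE_DATA_BYTES;
    l->data[off] = v;
    l->meta |= 1ULL << off;           /* mark the byte as initialized           */
}

static int is_initialized(struct scaled_line *lines, size_t addr)
{
    struct scaled_line *l = &lines[addr / LINE_DATA_BYTES];
    return (int)((l->meta >> (addr % LINE_DATA_BYTES)) & 1ULL);
}

int main(void)
{
    static struct scaled_line heap[1024];   /* 56 KiB of "monitored" data */

    store_byte(heap, 12345, 0xab);
    printf("byte 12345 initialized? %d\n", is_initialized(heap, 12345));
    printf("byte 12346 initialized? %d\n", is_initialized(heap, 12346));
    return 0;
}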
[Arxiv] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee
Arxiv 2023.
Large Language Model (LLM) inference consists of two distinct phases – prefill phase which processes the input prompt and decode phase which generates output tokens
autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates
one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further
inefficiency due to bubbles.
We present SARATHI to address these challenges. SARATHI employs chunked-prefills, which splits a prefill request into equal-sized chunks, and decode-maximal batching,
which constructs a batch using a single prefill chunk and populates the remaining slots with decodes. During inference, the prefill chunk saturates GPU compute, while
the decode requests 'piggyback' and cost up to an order of magnitude less than a decode-only batch. Chunked-prefills allows constructing multiple decode-maximal
batches from a single prefill request, maximizing the coverage of decodes that can piggyback.
Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles. Our techniques yield
significant improvements in inference performance across models and hardware. For the LLaMA-13B model on an A6000 GPU, SARATHI improves decode throughput by up to 10×, and
accelerates end-to-end throughput by up to 1.33×. For LLaMA-33B on an A100 GPU, we achieve 1.25× higher end-to-end throughput and up to 4.25× higher decode throughput. When
used with pipeline parallelism on GPT-3, SARATHI reduces bubbles by 6.29×, resulting in an end-to-end throughput improvement of 1.91×.
[MICRO'21] Trident: Harnessing Architectural Resources for All Page Sizes in x86 Processors
Venkat Sri Sai Ram, Ashish Panwar, Arkaprava Basu
54th IEEE/ACM International Symposium on Microarchitecture (MICRO) 2021.
Intel and AMD processors have long supported two large page sizes, 1GB and 2MB, to reduce address translation overheads for applications with large memory footprints. However, previous
works on large pages have primarily focused on 2MB pages, partly due to a lack of evidence on the usefulness of 1GB pages to real-world applications. Consequently, micro-architectural resources
devoted to 1GB pages have gone underutilized for a decade.
We quantitatively demonstrate where 1GB pages can be valuable, especially when employed in conjunction with 2MB pages. Unfortunately, the lack of application-transparent dynamic allocation of
1GB pages is to blame for their under-utilization on today's systems. To address this, we design and implement Trident in Linux to fully harness the micro-architectural resources devoted to
all page sizes in current x86 hardware by transparently allocating 1GB, 2MB, and 4KB pages as suitable at runtime. Trident speeds up eight memory-intensive applications by 18%, on average, over Linux's use of 2MB pages. We then propose Tridentpv, an
extension to Trident that virtualizes 1GB pages via copy-less promotion and compaction in the guest OS. Overall, this paper shows that adequate software enablement brings practical relevance to even GB-sized pages, and motivates micro-architects to continue
enhancing hardware support for all large page sizes.
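For context, the C sketch below shows the explicit hugetlbfs path for allocating 1GB and 2MB pages on Linux, which only works if pages of those sizes have been reserved beforehand; Trident's point is to make such allocations transparent and dynamic so applications never need code like this:

/* Requires pre-reserved huge pages, e.g.:
 *   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB  (21 << MAP_HUGE_SHIFT)
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB  (30 << MAP_HUGE_SHIFT)
#endif

/* Map an anonymous hugetlb region of the requested page size. */
static void *huge_alloc(size_t size, int size_flag)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | size_flag,
                   -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

int main(void)
{
    void *gb = huge_alloc(1UL << 30, MAP_HUGE_1GB);   /* one 1GB page  */
    void *mb = huge_alloc(2UL << 20, MAP_HUGE_2MB);   /* one 2MB page  */
    printf("1GB page: %s, 2MB page: %s\n",
           gb ? "mapped" : "unavailable", mb ? "mapped" : "unavailable");
    if (gb) munmap(gb, 1UL << 30);
    if (mb) munmap(mb, 2UL << 20);
    return 0;
}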
[PACT'21] nuKSM: NUMA-aware Memory De-duplication on Multi-socket Servers
Akash Panda, Ashish Panwar, Arkaprava Basu
30th International Conference on Parallel Architectures and Compilation Techniques (PACT) 2021.
An operating system has many memory management goals, including reducing memory access latency and reducing memory footprint. These goals can conflict with each
other when independent subsystems optimize for them in silos.
In this work, we report one such conflict that appears between memory de-duplication and NUMA (non-uniform memory access) management. Linux’s memory de-duplication
subsystem, namely KSM, is NUMA unaware. Consequently, while de-duplicating pages across NUMA nodes, it can place de-duplicated pages in a manner that leads to significant
performance variations, unfairness, and priority subversion. To address this, we introduce NUMA-aware KSM, a.k.a. nuKSM, which makes judicious decisions about the placement of
de-duplicated pages to reduce NUMA effects, mitigate unfairness, and avoid priority subversion. Independent of the NUMA effects, we observed that KSM scales poorly to systems with
larger memory sizes due to its centralized design. Thus, we extended nuKSM to adopt a decentralized design.
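For reference, the C sketch below shows the existing user-facing KSM interface on Linux: regions become merge candidates only after madvise(MADV_MERGEABLE), and KSM itself must be enabled via /sys/kernel/mm/ksm/run. nuKSM keeps this interface unchanged and alters the in-kernel placement of merged pages, so nothing below is nuKSM-specific:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16UL << 20;                       /* 16 MiB */
    void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (a == MAP_FAILED || b == MAP_FAILED) { perror("mmap"); return 1; }

    memset(a, 7, len);                             /* identical contents ...   */
    memset(b, 7, len);                             /* ... are merge candidates */

    /* Register both regions as mergeable; KSM will scan and de-duplicate. */
    if (madvise(a, len, MADV_MERGEABLE) || madvise(b, len, MADV_MERGEABLE))
        perror("madvise(MADV_MERGEABLE)");

    printf("regions registered for KSM merging\n");
    return 0;
}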
[ASPLOS'21] Fast Local Page-Tables for Virtualized NUMA Servers with vMitosis
Ashish Panwar, Reto Achermann, Arkaprava Basu, Abhishek Bhattacharjee, K. Gopinath, Jayneel Gandhi
26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2021.
Increasing memory heterogeneity mandates careful data placement to hide the non-uniform memory access (NUMA) effects on applications. While NUMA optimizations have focused on application data
for decades, they have ignored the placement of kernel data structures due to their small memory footprint; this is evident in typical OSes that pin kernel data structures in memory. In this paper, we
show that careful placement of kernel data structures is gaining importance in the context of page-tables: their sub-optimal placement causes severe slowdown (up to 3.1×) on virtualized NUMA servers.
In response, we present vMitosis – a system for explicit management of two-level page-tables, i.e., the guest and extended page-tables, on virtualized NUMA servers. vMitosis enables faster address
translation by migrating and replicating page-tables. It supports two prevalent virtualization configurations: first, where the hypervisor exposes the NUMA architecture to the guest OS, and second,
where such information is hidden from the guest OS. vMitosis is implemented in Linux/KVM, and our evaluation on a recent 1.5TiB 4-socket server shows that it effectively eliminates NUMA effects
on 2D page-table walks, resulting in a speedup of 1.8-3.1× for single-socket, and 1.06-1.6× for multi-socket workloads.
[ASPLOS'20] Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines
Reto Achermann, Ashish Panwar, Abhishek Bhattacharjee, Timothy Roscoe, Jayneel Gandhi
25th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2020.
Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on such multi-socket machines suffer non-uniform bandwidth and latency
when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of
how to place page-tables amongst sockets. We make the case for explicit page-table allocation policies and show that page-table placement is becoming crucial to overall performance.
We propose Mitosis to mitigate NUMA effects on page-table walks by transparently replicating and migrating page-tables across sockets without application changes. This reduces the
frequency of accesses to remote NUMA nodes when performing page-table walks. Mitosis uses two components: (i) a mechanism to efficiently enable, and (ii) policies to effectively
control, page-table replication and migration. We implement Mitosis in Linux and evaluate its benefits on real hardware. Mitosis improves performance for large-scale
multi-socket workloads by up to 1.34× by replicating page-tables across sockets. Moreover, it improves performance by up to 3.24× in cases when the OS migrates a process across
sockets, by enabling cross-socket page-table migration.
[ASPLOS'19] HawkEye: Efficient Fine-grained OS Support for Huge Pages
Ashish Panwar, Sorav Bansal, K. Gopinath
24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2019.
Effective huge page management in operating systems is necessary to mitigate address translation overheads. However, this continues to be a difficult
area in OS design. Recent work on Ingens [55] uncovered some interesting pitfalls in current huge page management strategies. Using both page access patterns
discovered by the OS kernel and fine-grained data from hardware performance counters, we expose problematic aspects of current huge page management strategies.
In our system, called HawkEye/Linux, we demonstrate alternate ways to address issues related to performance, page fault latency and memory bloat; the primary
ideas behind HawkEye's management algorithms are async page pre-zeroing, de-duplication of zero-filled pages, fine-grained page access tracking and measurement of address
translation overheads through hardware performance counters. Our evaluation shows that HawkEye is more performant, robust, and better suited to handling diverse workloads
than current state-of-the-art systems.
[ASPLOS'18] Making Huge Pages Actually Useful
Ashish Panwar, Aravinda Prasad, K. Gopinath
23rd ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2018.
The virtual-to-physical address translation overhead, a major performance bottleneck for modern workloads, can be effectively alleviated with huge pages.
However, since huge pages must be backed by contiguous physical memory, OSs have not been able to use them well because of the memory fragmentation problem, despite hardware
support for huge pages having been available for nearly two decades.
This paper presents a comprehensive study of the interaction of fragmentation with huge pages in the Linux kernel. We observe that when huge pages are used, problems such
as high CPU utilization and latency spikes occur because of unnecessary work (e.g., useless page migration) performed by memory-management subsystems due to the poor
handling of unmovable (i.e., kernel) pages. This behavior is even more harmful in virtualized systems, where unnecessary work may be performed in both the guest and host OSs.
We present Illuminator, an efficient memory manager that provides various subsystems, such as the page allocator, the ability to track all unmovable pages. It allows subsystems to
make informed decisions and eliminate unnecessary work, which in turn leads to cost-effective huge page allocations. Illuminator reduces the cost of compaction (by up to 99%),
improves application performance (by up to 2.3×) and reduces the maximum latency of the MySQL database server (by 30×).
[APSys'16] A Case for Protecting Huge Pages from the Kernel
Ashish Panwar, Naman Patel, K. Gopinath
7th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys) 2016.
Controlling memory fragmentation is critical for leveraging the benefits of huge page support offered by modern architectures. The division of free memory into non-contiguous
regions over time restricts huge page allocations in the long run. Compaction is a popular mechanism to recover contiguous blocks of free memory on demand. However, its success rate
in accumulating free regions at huge page granularity (e.g., 2MB on x86-64) is low in most situations due to the presence of unmovable kernel memory. Hence, a prudent page placement
algorithm is required to control fragmentation and protect huge pages against kernel memory allocations, in order to sustain system performance over long periods of time.
In this work, we explore the interaction of kernel pages with fragmentation avoidance and recovery mechanisms in Linux. Our analysis shows that the stock kernel memory layout
thwarts the progress of memory compaction. Furthermore, compaction can potentially induce more fragmentation depending on where kernel memory was placed by the underlying
page allocator. We discuss the scope for optimization in the current Linux framework and show how effective fragmentation management can yield up to 20% performance improvement
and up to 27% energy savings with the help of additional huge pages.
[HiPC'15] Towards Practical Page Placement for a Green Memory Manager
Ashish Panwar and K. Gopinath
22nd IEEE International Conference on High Performance Computing (HiPC) 2015.
The increased performance demands of modern applications have resulted in large memory modules and higher-performance processors in computing systems. Power consumption becomes an important concern when these
resources go underutilized in a running system, e.g., during idle periods or lighter workloads. CPUs have come a long way in optimizing away unnecessary power consumption in both hardware and
software for such scenarios, through solutions like Dynamic Voltage/Frequency Scaling. However, support for memory power optimization is still missing in modern operating systems despite hardware support
being available for many years in the form of multiple power states and techniques like Partial Array Self-Refresh.
In this work, we explore the behavior of the Linux memory manager and report that, even at 10% memory utilization, there are references to all physical memory banks in a long-running
system due to random page allocation and ignorance of memory bank boundaries. These references can be consolidated onto a subset of memory banks by using page migration techniques.
Unfortunately, migration of large contiguous blocks is often restricted due to the presence of unmovable pages, primarily owned by the kernel.
We provide techniques for utilizing the hardware-facilitated Partial Array Self-Refresh by introducing bank awareness in the existing buddy allocation framework of the Linux memory
manager, as well as for improving page migration support for large contiguous blocks. Through a set of simple changes to the Linux VM, we have been able to significantly reduce the number of referenced
memory banks. The memory-hotplug framework, which relies on page migration of large contiguous blocks, also shows significant improvement in terms of the number of removable memory
sections. Benchmark results show no performance degradation in the modified kernel, which makes the proposed solution desirable.