Subject: Transcendent memory ("tmem") for Linux
From: http://xenbits.xensource.com/linux-2.6.18-xen.hg (tip 908:baeb818cd2dc)
Patch-mainline: n/a

Tmem, when called from a tmem-capable (paravirtualized) guest, makes use
of otherwise unutilized ("fallow") memory to create and manage pools of
pages that can be accessed from the guest either as "ephemeral" pages or
as "persistent" pages. In either case, the pages are not directly
addressable by the guest, only copied to and fro via the tmem interface.
Ephemeral pages are a nice place for a guest to put recently evicted clean
pages that it might need again; these pages can be reclaimed synchronously
by Xen for other guests or other uses. Persistent pages are a nice place
for a guest to put "swap" pages to avoid sending them to disk. These pages
retain data as long as the guest lives, but count against the guest memory
allocation.

This patch contains the Linux paravirtualization changes to complement
the tmem Xen patch (xen-unstable c/s 19646). It implements "precache"
(ext3 only as of now), "preswap", and limited "shared precache" (ocfs2
only as of now) support. CONFIG options are required to turn on the
support (but in this patch they default to "y"). If the underlying Xen
does not have tmem support or has it turned off, this is sensed early to
avoid nearly all hypercalls.

Lots of useful prose about tmem can be found at
http://oss.oracle.com/projects/tmem

Signed-off-by: Dan Magenheimer
Acked-by: jbeulich@novell.com

--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ head-2010-05-12/Documentation/transcendent-memory.txt	2010-03-24 14:09:47.000000000 +0100
@@ -0,0 +1,176 @@
+Normal memory is directly addressable by the kernel, of a known
+normally-fixed size, synchronously accessible, and persistent (though
+not across a reboot).
+
+What if there was a class of memory that is of unknown and dynamically
+variable size, is addressable only indirectly by the kernel, can be
+configured either as persistent or as "ephemeral" (meaning it will be
+around for a while, but might disappear without warning), and is still
+fast enough to be synchronously accessible?
+
+We call this latter class "transcendent memory" and it provides an
+interesting opportunity to more efficiently utilize RAM in a virtualized
+environment. However, this "memory but not really memory" may also have
+applications in NON-virtualized environments, such as hotplug-memory
+deletion, SSDs, and page cache compression. Others have suggested ideas
+such as allowing use of highmem memory without a highmem kernel, or use
+of spare video memory.
+
+Transcendent memory, or "tmem" for short, provides a well-defined API to
+access this unusual class of memory. (A summary of the API is provided
+below.) The basic operations are page-copy-based and use a flexible
+object-oriented addressing mechanism. Tmem assumes that some "privileged
+entity" is capable of executing tmem requests and storing pages of data;
+this entity is currently a hypervisor and operations are performed via
+hypercalls, but the entity could be a kernel policy, or perhaps a
+"memory node" in a cluster of blades connected by a high-speed
+interconnect such as hypertransport or QPI.
+
+Since tmem is not directly accessible and because page copying is done
+to/from physical pageframes, it is more suitable for in-kernel memory needs
+than for userland applications. However, there may be yet undiscovered
+userland possibilities.
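To make the page-copy-based, handle-addressed model described above concrete,
here is a minimal illustrative sketch (not part of the patch) of one put/get
round trip. It assumes the tmem_new_pool()/tmem_put_page()/tmem_get_page()
wrappers this patch adds in mm/tmem.h and the Xen pfn_to_mfn() translation
used by mm/precache.c; the object id and index values are arbitrary examples.

/*
 * Illustrative sketch only: create a private ephemeral pool, copy one
 * page into tmem, and later try to copy it back into an empty pageframe.
 * Assumes this file sits next to mm/tmem.h and that pfn_to_mfn() is
 * available, as it is for mm/precache.c in this patch.
 */
#include <linux/mm.h>
#include "tmem.h"

static int tmem_roundtrip_example(struct page *page, struct page *empty_page)
{
	int pool_id;
	u64 obj = 42;	/* e.g. an inode number */
	u32 index = 7;	/* e.g. a page offset within that file */
	int ret;

	/* private ephemeral pool: uuid is ignored, no flags */
	pool_id = tmem_new_pool(0, 0, 0);
	if (pool_id < 0)
		return 0;	/* tmem absent or pool creation refused */

	mb();	/* ensure page is quiescent, as mm/precache.c does */
	ret = tmem_put_page(pool_id, obj, index,
			    pfn_to_mfn(page_to_pfn(page)));
	if (ret != 1)
		return 0;	/* put rejected: caller keeps its own copy */

	/* later: a hit copies the data back and returns 1, a miss does not */
	ret = tmem_get_page(pool_id, obj, index,
			    pfn_to_mfn(page_to_pfn(empty_page)));
	return ret == 1;
}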
+
+With the tmem concept outlined and its broader potential hinted at, we
+will now look at two existing examples of how tmem can be used by the
+kernel.
+
+"Cleancache" can be thought of as a page-granularity victim cache for clean
+pages that the kernel's pageframe replacement algorithm (PFRA) would like
+to keep around, but can't since there isn't enough memory. So when the
+PFRA "evicts" a page, it first puts it into the cleancache via a call to
+tmem. And any time a filesystem reads a page from disk, it first attempts
+to get the page from cleancache. If it's there, a disk access is eliminated.
+If not, the filesystem just goes to the disk as usual. Cleancache is
+"ephemeral", so whether a page is kept in cleancache (between the "put" and
+the "get") depends on a number of factors that are invisible to
+the kernel.
+
+"Frontswap" is so named because it can be thought of as the opposite of
+a "backing store". Frontswap IS persistent, but for various reasons may not
+always be available for use, again due to factors that may not be visible to
+the kernel. (But, briefly, if the kernel is being "good" and has shared its
+resources nicely, then it will be able to use frontswap, otherwise it will
+not.) Once a page is put, a get on the page will always succeed. So when the
+kernel finds itself in a situation where it needs to swap out a page, it
+first attempts to use frontswap. If the put works, a disk write and
+(usually) a disk read are avoided. If it doesn't, the page is written
+to swap as usual. Unlike cleancache, whether a page is stored in frontswap
+vs. swap is recorded in kernel data structures, so when a page needs to
+be fetched, the kernel does a get if it is in frontswap and reads from
+swap if it is not in frontswap.
+
+Both cleancache and frontswap may optionally be compressed, trading roughly
+a 2x reduction in space against roughly a 10x cost in access performance.
+Cleancache also has a sharing feature, which allows different nodes in a
+"virtual cluster" to share a local page cache.
+
+Tmem has some similarity to IBM's Collaborative Memory Management, but
+creates more of a partnership between the kernel and the "privileged
+entity" and is not very invasive. Tmem may be applicable for KVM and
+containers; there is some disagreement on the extent of its value.
+Tmem is highly complementary to ballooning (aka page granularity hot
+plug) and memory deduplication (aka transparent content-based page
+sharing) but still has value when neither is present.
+
+Performance is difficult to quantify because some benchmarks respond
+very favorably to increases in memory and tmem may do quite well on
+those, depending on how much tmem is available, which may vary widely
+and dynamically depending on conditions entirely outside of the
+system being measured. Ideas on how best to provide useful metrics
+would be appreciated.
+
+Tmem is supported starting in Xen 4.0 and is in Xen's Linux 2.6.18-xen
+source tree. It is also released as a technology preview in Oracle's
+Xen-based virtualization product, Oracle VM 2.2. Again, Xen is not
+necessarily a requirement, but currently provides the only existing
+implementation of tmem.
+
+Lots more information about tmem can be found at:
+  http://oss.oracle.com/projects/tmem
+and there was a talk about it on the first day of Linux Symposium in
+July 2009; an updated talk is planned at linux.conf.au in January 2010.
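The cleancache behaviour described above (consult tmem before going to disk)
reduces, in this patch's terms, to the precache_get() test that the fs/mpage.c
and fs/btrfs/extent_io.c hunks below add to the read path. A simplified,
hypothetical helper (ignoring the blocksize checks the real hooks make) shows
the shape of that decision:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/precache.h>	/* added by this patch */

/*
 * Hypothetical condensation of the readpage hooks added below: try to
 * fill @page from the ephemeral precache and only fall back to a disk
 * read when the get misses (or precache is disabled, in which case the
 * stub precache_get() simply returns 0).
 */
static int read_page_with_precache(struct address_space *mapping,
				   struct page *page,
				   int (*read_from_disk)(struct page *))
{
	if (!PageUptodate(page) &&
	    precache_get(mapping, page->index, page) == 1) {
		/* hit: tmem copied the data into @page, no disk I/O needed */
		SetPageUptodate(page);
		return 0;
	}
	return read_from_disk(page);	/* miss: read from disk as usual */
}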
+Tmem is the result of a group effort, including Dan Magenheimer,
+Chris Mason, Dave McCracken, Kurt Hackel and Zhigang Wang, with helpful
+input from Jeremy Fitzhardinge, Keir Fraser, Ian Pratt, Sunil Mushran,
+Joel Becker, and Jan Beulich.
+
+THE TRANSCENDENT MEMORY API
+
+Transcendent memory is made up of a set of pools. Each pool is made
+up of a set of objects. And each object contains a set of pages.
+The combination of a 32-bit pool id, a 64-bit object id, and a 32-bit
+page id uniquely identifies a page of tmem data, and this tuple is called
+a "handle." Commonly, the three parts of a handle are used to address
+a filesystem, a file within that filesystem, and a page within that file;
+however, an OS can use any values as long as they uniquely identify
+a page of data.
+
+When a tmem pool is created, it is given certain attributes: It can
+be private or shared, and it can be persistent or ephemeral. Each
+combination of these attributes provides a different set of useful
+functionality and also defines a slightly different set of semantics
+for the various operations on the pool. Other pool attributes include
+the size of the page and a version number.
+
+Once a pool is created, operations are performed on the pool. Pages
+are copied between the OS and tmem and are addressed using a handle.
+Pages and/or objects may also be flushed from the pool. When all
+operations are completed, a pool can be destroyed.
+
+The specific tmem functions are called in Linux through a set of
+accessor functions:
+
+int (*new_pool)(struct tmem_pool_uuid uuid, u32 flags);
+int (*destroy_pool)(u32 pool_id);
+int (*put_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
+int (*get_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
+int (*flush_page)(u32 pool_id, u64 object, u32 index);
+int (*flush_object)(u32 pool_id, u64 object);
+
+The new_pool accessor creates a new pool and returns a pool id,
+which is a non-negative 32-bit integer. If the flags parameter
+specifies that the pool is to be shared, the uuid is a 128-bit "shared
+secret"; otherwise it is ignored. The destroy_pool accessor destroys the
+pool. (Note: shared pools are not supported until security implications
+are better understood.)
+
+The put_page accessor copies a page of data from the specified pageframe
+and associates it with the specified handle.
+
+The get_page accessor looks up a page of data in tmem associated with
+the specified handle and, if found, copies it to the specified pageframe.
+
+The flush_page accessor ensures that subsequent gets of a page with
+the specified handle will fail. The flush_object accessor ensures
+that subsequent gets of any page matching the pool id and object
+will fail.
+
+There are many subtle but critical behaviors for get_page and put_page:
+- Any put_page (with one notable exception) may be rejected and the client
+  must be prepared to deal with that failure. A put_page copies, NOT moves,
+  data; that is, the data exists in both places. Linux is responsible for
+  destroying or overwriting its own copy, or alternatively managing any
+  coherency between the copies.
+- Every page successfully put to a persistent pool must be found by a
+  subsequent get_page that specifies the same handle. A page successfully
+  put to an ephemeral pool has an indeterminate lifetime and even an
+  immediately subsequent get_page may fail.
+- A get_page to a private pool is destructive, that is, it behaves as if
+  the get_page were atomically followed by a flush_page.
A get_page + to a shared pool is non-destructive. A flush_page behaves just like + a get_page to a private pool except the data is thrown away. +- Put-put-get coherency is guaranteed. For example, after the sequence: + put_page(ABC,D1); + put_page(ABC,D2); + get_page(ABC,E) + E may never contain the data from D1. However, even for a persistent + pool, the get_page may fail if the second put_page indicates failure. +- Get-get coherency is guaranteed. For example, in the sequence: + put_page(ABC,D); + get_page(ABC,E1); + get_page(ABC,E2) + if the first get_page fails, the second must also fail. +- A tmem implementation provides no serialization guarantees (e.g. to + an SMP Linux). So if different Linux threads are putting and flushing + the same page, the results are indeterminate. --- head-2010-05-12.orig/fs/btrfs/extent_io.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/fs/btrfs/extent_io.c 2010-04-15 09:41:13.000000000 +0200 @@ -10,6 +10,7 @@ #include #include #include +#include #include "extent_io.h" #include "extent_map.h" #include "compat.h" @@ -2030,6 +2031,13 @@ static int __extent_read_full_page(struc set_page_extent_mapped(page); + if (!PageUptodate(page)) { + if (precache_get(page->mapping, page->index, page) == 1) { + BUG_ON(blocksize != PAGE_SIZE); + goto out; + } + } + end = page_end; lock_extent(tree, start, end, GFP_NOFS); @@ -2146,6 +2154,7 @@ static int __extent_read_full_page(struc cur = cur + iosize; page_offset += iosize; } +out: if (!nr) { if (!PageError(page)) SetPageUptodate(page); --- head-2010-05-12.orig/fs/btrfs/super.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/fs/btrfs/super.c 2010-04-15 09:41:20.000000000 +0200 @@ -39,6 +39,7 @@ #include #include #include +#include #include "compat.h" #include "ctree.h" #include "disk-io.h" @@ -477,6 +478,7 @@ static int btrfs_fill_super(struct super sb->s_root = root_dentry; save_mount_options(sb, data); + precache_init(sb); return 0; fail_close: --- head-2010-05-12.orig/fs/buffer.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/fs/buffer.c 2010-03-24 14:09:47.000000000 +0100 @@ -41,6 +41,7 @@ #include #include #include +#include static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); @@ -276,6 +277,11 @@ void invalidate_bdev(struct block_device invalidate_bh_lrus(); invalidate_mapping_pages(mapping, 0, -1); + + /* 99% of the time, we don't need to flush the precache on the bdev. + * But, for the strange corners, lets be cautious + */ + precache_flush_inode(mapping); } EXPORT_SYMBOL(invalidate_bdev); --- head-2010-05-12.orig/fs/ext3/super.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/fs/ext3/super.c 2010-03-24 14:09:47.000000000 +0100 @@ -38,6 +38,7 @@ #include #include #include +#include #include @@ -1370,6 +1371,7 @@ static int ext3_setup_super(struct super } else { ext3_msg(sb, KERN_INFO, "using internal journal"); } + precache_init(sb); return res; } --- head-2010-05-12.orig/fs/ext4/super.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/fs/ext4/super.c 2010-04-15 09:41:30.000000000 +0200 @@ -39,6 +39,7 @@ #include #include #include +#include #include #include "ext4.h" @@ -1784,6 +1785,8 @@ static int ext4_setup_super(struct super EXT4_INODES_PER_GROUP(sb), sbi->s_mount_opt); + precache_init(sb); + return res; } --- head-2010-05-12.orig/fs/mpage.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/fs/mpage.c 2010-04-15 09:41:38.000000000 +0200 @@ -27,6 +27,7 @@ #include #include #include +#include /* * I/O completion handler for multipage BIOs. 
@@ -286,6 +287,13 @@ do_mpage_readpage(struct bio *bio, struc SetPageMappedToDisk(page); } + if (fully_mapped && + blocks_per_page == 1 && !PageUptodate(page) && + precache_get(page->mapping, page->index, page) == 1) { + SetPageUptodate(page); + goto confused; + } + /* * This page will go to BIO. Do we need to send this BIO off first? */ --- head-2010-05-12.orig/fs/ocfs2/super.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/fs/ocfs2/super.c 2010-03-24 14:09:47.000000000 +0100 @@ -41,6 +41,7 @@ #include #include #include +#include #include #define MLOG_MASK_PREFIX ML_SUPER @@ -2260,6 +2261,7 @@ static int ocfs2_initialize_super(struct mlog_errno(status); goto bail; } + shared_precache_init(sb, &di->id2.i_super.s_uuid[0]); bail: mlog_exit(status); --- head-2010-05-12.orig/fs/super.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/fs/super.c 2010-05-12 08:57:07.000000000 +0200 @@ -38,6 +38,7 @@ #include #include #include +#include #include #include "internal.h" @@ -105,6 +106,9 @@ static struct super_block *alloc_super(s s->s_qcop = sb_quotactl_ops; s->s_op = &default_op; s->s_time_gran = 1000000000; +#ifdef CONFIG_PRECACHE + s->precache_poolid = -1; +#endif } out: return s; @@ -195,6 +199,7 @@ void deactivate_super(struct super_block vfs_dq_off(s, 0); down_write(&s->s_umount); fs->kill_sb(s); + precache_flush_filesystem(s); put_filesystem(fs); put_super(s); } @@ -221,6 +226,7 @@ void deactivate_locked_super(struct supe spin_unlock(&sb_lock); vfs_dq_off(s, 0); fs->kill_sb(s); + precache_flush_filesystem(s); put_filesystem(fs); put_super(s); } else { --- head-2010-05-12.orig/include/linux/fs.h 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/include/linux/fs.h 2010-03-24 14:09:47.000000000 +0100 @@ -1377,6 +1377,9 @@ struct super_block { /* Granularity of c/m/atime in ns. Cannot be worse than a second */ u32 s_time_gran; +#ifdef CONFIG_PRECACHE + u32 precache_poolid; +#endif /* * The next field is for VFS *only*. 
No filesystems have any business --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ head-2010-05-12/include/linux/precache.h 2010-03-24 14:09:47.000000000 +0100 @@ -0,0 +1,55 @@ +#ifndef _LINUX_PRECACHE_H + +#include +#include + +#ifdef CONFIG_PRECACHE +extern void precache_init(struct super_block *sb); +extern void shared_precache_init(struct super_block *sb, char *uuid); +extern int precache_get(struct address_space *mapping, unsigned long index, + struct page *empty_page); +extern int precache_put(struct address_space *mapping, unsigned long index, + struct page *page); +extern int precache_flush(struct address_space *mapping, unsigned long index); +extern int precache_flush_inode(struct address_space *mapping); +extern int precache_flush_filesystem(struct super_block *s); +#else +static inline void precache_init(struct super_block *sb) +{ +} + +static inline void shared_precache_init(struct super_block *sb, char *uuid) +{ +} + +static inline int precache_get(struct address_space *mapping, + unsigned long index, struct page *empty_page) +{ + return 0; +} + +static inline int precache_put(struct address_space *mapping, + unsigned long index, struct page *page) +{ + return 0; +} + +static inline int precache_flush(struct address_space *mapping, + unsigned long index) +{ + return 0; +} + +static inline int precache_flush_inode(struct address_space *mapping) +{ + return 0; +} + +static inline int precache_flush_filesystem(struct super_block *s) +{ + return 0; +} +#endif + +#define _LINUX_PRECACHE_H +#endif /* _LINUX_PRECACHE_H */ --- head-2010-05-12.orig/include/linux/swap.h 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/include/linux/swap.h 2010-03-24 14:09:47.000000000 +0100 @@ -183,8 +183,61 @@ struct swap_info_struct { struct block_device *bdev; /* swap device or bdev of swap file */ struct file *swap_file; /* seldom referenced */ unsigned int old_block_size; /* seldom referenced */ +#ifdef CONFIG_PRESWAP + unsigned long *preswap_map; + unsigned int preswap_pages; +#endif }; +#ifdef CONFIG_PRESWAP + +#include +extern int preswap_sysctl_handler(struct ctl_table *, int, void __user *, + size_t *, loff_t *); +extern const unsigned long preswap_zero, preswap_infinity; + +extern struct swap_info_struct *get_swap_info_struct(unsigned int type); + +extern void preswap_shrink(unsigned long); +extern int preswap_test(struct swap_info_struct *, unsigned long); +extern void preswap_init(unsigned); +extern int preswap_put(struct page *); +extern int preswap_get(struct page *); +extern void preswap_flush(unsigned, unsigned long); +extern void preswap_flush_area(unsigned); +#else +static inline void preswap_shrink(unsigned long target_pages) +{ +} + +static inline int preswap_test(struct swap_info_struct *sis, unsigned long offset) +{ + return 0; +} + +static inline void preswap_init(unsigned type) +{ +} + +static inline int preswap_put(struct page *page) +{ + return 0; +} + +static inline int preswap_get(struct page *get) +{ + return 0; +} + +static inline void preswap_flush(unsigned type, unsigned long offset) +{ +} + +static inline void preswap_flush_area(unsigned type) +{ +} +#endif /* CONFIG_PRESWAP */ + struct swap_list_t { int head; /* head of priority-ordered swapfile list */ int next; /* swapfile to be used next */ --- head-2010-05-12.orig/kernel/sysctl.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/kernel/sysctl.c 2010-03-24 14:09:47.000000000 +0100 @@ -1274,6 +1274,17 @@ static struct ctl_table vm_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, +#ifdef 
CONFIG_PRESWAP + { + .procname = "preswap", + .data = NULL, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = preswap_sysctl_handler, + .extra1 = (void *)&preswap_zero, + .extra2 = (void *)&preswap_infinity, + }, +#endif #ifdef CONFIG_MEMORY_FAILURE { .procname = "memory_failure_early_kill", --- head-2010-05-12.orig/mm/Kconfig 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/mm/Kconfig 2010-03-24 14:09:47.000000000 +0100 @@ -287,3 +287,31 @@ config NOMMU_INITIAL_TRIM_EXCESS of 1 says that all excess pages should be trimmed. See Documentation/nommu-mmap.txt for more information. + +# +# support for transcendent memory +# +config TMEM + bool + help + In a virtualized environment, allows unused and underutilized + system physical memory to be made accessible through a narrow + well-defined page-copy-based API. If unsure, say Y. + +config PRECACHE + bool "Cache clean pages in transcendent memory" + depends on XEN + select TMEM + help + Allows the transcendent memory pool to be used to store clean + page-cache pages which, under some circumstances, will greatly + reduce paging and thus improve performance. If unsure, say Y. + +config PRESWAP + bool "Swap pages to transcendent memory" + depends on XEN + select TMEM + help + Allows the transcendent memory pool to be used as a pseudo-swap + device which, under some circumstances, will greatly reduce + swapping and thus improve performance. If unsure, say Y. --- head-2010-05-12.orig/mm/Makefile 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/mm/Makefile 2010-03-24 14:09:47.000000000 +0100 @@ -17,6 +17,9 @@ obj-y += init-mm.o obj-$(CONFIG_BOUNCE) += bounce.o obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o +obj-$(CONFIG_TMEM) += tmem.o +obj-$(CONFIG_PRESWAP) += preswap.o +obj-$(CONFIG_PRECACHE) += precache.o obj-$(CONFIG_HAS_DMA) += dmapool.o obj-$(CONFIG_HUGETLBFS) += hugetlb.o obj-$(CONFIG_NUMA) += mempolicy.o --- head-2010-05-12.orig/mm/filemap.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/mm/filemap.c 2010-03-24 14:09:47.000000000 +0100 @@ -33,6 +33,7 @@ #include #include /* for BUG_ON(!in_atomic()) only */ #include +#include #include /* for page_is_file_cache() */ #include "internal.h" @@ -119,6 +120,16 @@ void __remove_from_page_cache(struct pag { struct address_space *mapping = page->mapping; + /* + * if we're uptodate, flush out into the precache, otherwise + * invalidate any existing precache entries. 
We can't leave + * stale data around in the precache once our page is gone + */ + if (PageUptodate(page)) + precache_put(page->mapping, page->index, page); + else + precache_flush(page->mapping, page->index); + radix_tree_delete(&mapping->page_tree, page->index); page->mapping = NULL; mapping->nrpages--; --- head-2010-05-12.orig/mm/page_io.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/mm/page_io.c 2010-04-15 09:41:45.000000000 +0200 @@ -111,6 +111,13 @@ int swap_writepage(struct page *page, st return ret; } + if (preswap_put(page) == 1) { + set_page_writeback(page); + unlock_page(page); + end_page_writeback(page); + goto out; + } + bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write); if (bio == NULL) { set_page_dirty(page); @@ -179,6 +186,12 @@ int swap_readpage(struct page *page) return ret; } + if (preswap_get(page) == 1) { + SetPageUptodate(page); + unlock_page(page); + goto out; + } + bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read); if (bio == NULL) { unlock_page(page); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ head-2010-05-12/mm/precache.c 2010-03-24 14:09:47.000000000 +0100 @@ -0,0 +1,140 @@ +/* + * linux/mm/precache.c + * + * Implements "precache" for filesystems/pagecache on top of transcendent + * memory ("tmem") API. A filesystem creates an "ephemeral tmem pool" + * and retains the returned pool_id in its superblock. Clean pages evicted + * from pagecache may be "put" into the pool and associated with a "handle" + * consisting of the pool_id, an object (inode) id, and an index (page offset). + * Note that the page is copied to tmem; no kernel mappings are changed. + * If the page is later needed, the filesystem (or VFS) issues a "get", passing + * the same handle and an empty pageframe. If successful, the page is copied + * into the pageframe and a disk read is avoided. But since the tmem pool + * is of indeterminate size, a "put" page has indeterminate longevity + * ("ephemeral"), and the "get" may fail, in which case the filesystem must + * read the page from disk as before. Note that the filesystem/pagecache are + * responsible for maintaining coherency between the pagecache, precache, + * and the disk, for which "flush page" and "flush object" actions are + * provided. And when a filesystem is unmounted, it must "destroy" the pool. + * + * Two types of pools may be created for a precache: "private" or "shared". + * For a private pool, a successful "get" always flushes, implementing + * exclusive semantics; for a "shared" pool (which is intended for use by + * co-resident nodes of a cluster filesystem), the "flush" is not guaranteed. + * In either case, a failed "duplicate" put (overwrite) always guarantee + * the old data is flushed. + * + * Note also that multiple accesses to a tmem pool may be concurrent and any + * ordering must be guaranteed by the caller. + * + * Copyright (C) 2008,2009 Dan Magenheimer, Oracle Corp. 
+ */ + +#include +#include +#include "tmem.h" + +static int precache_auto_allocate; /* set to 1 to auto_allocate */ + +int precache_put(struct address_space *mapping, unsigned long index, + struct page *page) +{ + u32 tmem_pool = mapping->host->i_sb->precache_poolid; + u64 obj = (unsigned long) mapping->host->i_ino; + u32 ind = (u32) index; + unsigned long mfn = pfn_to_mfn(page_to_pfn(page)); + int ret; + + if ((s32)tmem_pool < 0) { + if (!precache_auto_allocate) + return 0; + /* a put on a non-existent precache may auto-allocate one */ + ret = tmem_new_pool(0, 0, 0); + if (ret < 0) + return 0; + printk(KERN_INFO + "Mapping superblock for s_id=%s to precache_id=%d\n", + mapping->host->i_sb->s_id, tmem_pool); + mapping->host->i_sb->precache_poolid = tmem_pool; + } + if (ind != index) + return 0; + mb(); /* ensure page is quiescent; tmem may address it with an alias */ + return tmem_put_page(tmem_pool, obj, ind, mfn); +} + +int precache_get(struct address_space *mapping, unsigned long index, + struct page *empty_page) +{ + u32 tmem_pool = mapping->host->i_sb->precache_poolid; + u64 obj = (unsigned long) mapping->host->i_ino; + u32 ind = (u32) index; + unsigned long mfn = pfn_to_mfn(page_to_pfn(empty_page)); + + if ((s32)tmem_pool < 0) + return 0; + if (ind != index) + return 0; + + return tmem_get_page(tmem_pool, obj, ind, mfn); +} +EXPORT_SYMBOL(precache_get); + +int precache_flush(struct address_space *mapping, unsigned long index) +{ + u32 tmem_pool = mapping->host->i_sb->precache_poolid; + u64 obj = (unsigned long) mapping->host->i_ino; + u32 ind = (u32) index; + + if ((s32)tmem_pool < 0) + return 0; + if (ind != index) + return 0; + + return tmem_flush_page(tmem_pool, obj, ind); +} +EXPORT_SYMBOL(precache_flush); + +int precache_flush_inode(struct address_space *mapping) +{ + u32 tmem_pool = mapping->host->i_sb->precache_poolid; + u64 obj = (unsigned long) mapping->host->i_ino; + + if ((s32)tmem_pool < 0) + return 0; + + return tmem_flush_object(tmem_pool, obj); +} +EXPORT_SYMBOL(precache_flush_inode); + +int precache_flush_filesystem(struct super_block *sb) +{ + u32 tmem_pool = sb->precache_poolid; + int ret; + + if ((s32)tmem_pool < 0) + return 0; + ret = tmem_destroy_pool(tmem_pool); + if (!ret) + return 0; + printk(KERN_INFO + "Unmapping superblock for s_id=%s from precache_id=%d\n", + sb->s_id, ret); + sb->precache_poolid = 0; + return 1; +} +EXPORT_SYMBOL(precache_flush_filesystem); + +void precache_init(struct super_block *sb) +{ + sb->precache_poolid = tmem_new_pool(0, 0, 0); +} +EXPORT_SYMBOL(precache_init); + +void shared_precache_init(struct super_block *sb, char *uuid) +{ + u64 uuid_lo = *(u64 *)uuid; + u64 uuid_hi = *(u64 *)(&uuid[8]); + sb->precache_poolid = tmem_new_pool(uuid_lo, uuid_hi, TMEM_POOL_SHARED); +} +EXPORT_SYMBOL(shared_precache_init); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ head-2010-05-12/mm/preswap.c 2010-03-24 14:09:47.000000000 +0100 @@ -0,0 +1,182 @@ +/* + * linux/mm/preswap.c + * + * Implements a fast "preswap" on top of the transcendent memory ("tmem") API. + * When a swapdisk is enabled (with swapon), a "private persistent tmem pool" + * is created along with a bit-per-page preswap_map. When swapping occurs + * and a page is about to be written to disk, a "put" into the pool may first + * be attempted by passing the pageframe to be swapped, along with a "handle" + * consisting of a pool_id, an object id, and an index. 
Since the pool is of + * indeterminate size, the "put" may be rejected, in which case the page + * is swapped to disk as normal. If the "put" is successful, the page is + * copied to tmem and the preswap_map records the success. Later, when + * the page needs to be swapped in, the preswap_map is checked and, if set, + * the page may be obtained with a "get" operation. Note that the swap + * subsystem is responsible for: maintaining coherency between the swapcache, + * preswap, and the swapdisk; for evicting stale pages from preswap; and for + * emptying preswap when swapoff is performed. The "flush page" and "flush + * object" actions are provided for this. + * + * Note that if a "duplicate put" is performed to overwrite a page and + * the "put" operation fails, the page (and old data) is flushed and lost. + * Also note that multiple accesses to a tmem pool may be concurrent and + * any ordering must be guaranteed by the caller. + * + * Copyright (C) 2008,2009 Dan Magenheimer, Oracle Corp. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "tmem.h" + +static u32 preswap_poolid = -1; /* if negative, preswap will never call tmem */ + +const unsigned long preswap_zero = 0, preswap_infinity = ~0UL; /* for sysctl */ + +/* + * Swizzling increases objects per swaptype, increasing tmem concurrency + * for heavy swaploads. Later, larger nr_cpus -> larger SWIZ_BITS + */ +#define SWIZ_BITS 4 +#define SWIZ_MASK ((1 << SWIZ_BITS) - 1) +#define oswiz(_type, _ind) ((_type << SWIZ_BITS) | (_ind & SWIZ_MASK)) +#define iswiz(_ind) (_ind >> SWIZ_BITS) + +/* + * preswap_map test/set/clear operations (must be atomic) + */ + +int preswap_test(struct swap_info_struct *sis, unsigned long offset) +{ + if (!sis->preswap_map) + return 0; + return test_bit(offset % BITS_PER_LONG, + &sis->preswap_map[offset/BITS_PER_LONG]); +} + +static inline void preswap_set(struct swap_info_struct *sis, + unsigned long offset) +{ + if (!sis->preswap_map) + return; + set_bit(offset % BITS_PER_LONG, + &sis->preswap_map[offset/BITS_PER_LONG]); +} + +static inline void preswap_clear(struct swap_info_struct *sis, + unsigned long offset) +{ + if (!sis->preswap_map) + return; + clear_bit(offset % BITS_PER_LONG, + &sis->preswap_map[offset/BITS_PER_LONG]); +} + +/* + * preswap tmem operations + */ + +/* returns 1 if the page was successfully put into preswap, 0 if the page + * was declined, and -ERRNO for a specific error */ +int preswap_put(struct page *page) +{ + swp_entry_t entry = { .val = page_private(page), }; + unsigned type = swp_type(entry); + pgoff_t offset = swp_offset(entry); + u64 ind64 = (u64)offset; + u32 ind = (u32)offset; + unsigned long mfn = pfn_to_mfn(page_to_pfn(page)); + struct swap_info_struct *sis = get_swap_info_struct(type); + int dup = 0, ret; + + if ((s32)preswap_poolid < 0) + return 0; + if (ind64 != ind) + return 0; + if (preswap_test(sis, offset)) + dup = 1; + mb(); /* ensure page is quiescent; tmem may address it with an alias */ + ret = tmem_put_page(preswap_poolid, oswiz(type, ind), iswiz(ind), mfn); + if (ret == 1) { + preswap_set(sis, offset); + if (!dup) + sis->preswap_pages++; + } else if (dup) { + /* failed dup put always results in an automatic flush of + * the (older) page from preswap */ + preswap_clear(sis, offset); + sis->preswap_pages--; + } + return ret; +} + +/* returns 1 if the page was successfully gotten from preswap, 0 if the page + * was not present (should never happen!), and -ERRNO for a specific error */ +int preswap_get(struct 
page *page) +{ + swp_entry_t entry = { .val = page_private(page), }; + unsigned type = swp_type(entry); + pgoff_t offset = swp_offset(entry); + u64 ind64 = (u64)offset; + u32 ind = (u32)offset; + unsigned long mfn = pfn_to_mfn(page_to_pfn(page)); + struct swap_info_struct *sis = get_swap_info_struct(type); + int ret; + + if ((s32)preswap_poolid < 0) + return 0; + if (ind64 != ind) + return 0; + if (!preswap_test(sis, offset)) + return 0; + ret = tmem_get_page(preswap_poolid, oswiz(type, ind), iswiz(ind), mfn); + return ret; +} + +/* flush a single page from preswap */ +void preswap_flush(unsigned type, unsigned long offset) +{ + u64 ind64 = (u64)offset; + u32 ind = (u32)offset; + struct swap_info_struct *sis = get_swap_info_struct(type); + int ret = 1; + + if ((s32)preswap_poolid < 0) + return; + if (ind64 != ind) + return; + if (preswap_test(sis, offset)) { + ret = tmem_flush_page(preswap_poolid, + oswiz(type, ind), iswiz(ind)); + sis->preswap_pages--; + preswap_clear(sis, offset); + } +} + +/* flush all pages from the passed swaptype */ +void preswap_flush_area(unsigned type) +{ + struct swap_info_struct *sis = get_swap_info_struct(type); + int ind; + + if ((s32)preswap_poolid < 0) + return; + for (ind = SWIZ_MASK; ind >= 0; ind--) + (void)tmem_flush_object(preswap_poolid, oswiz(type, ind)); + sis->preswap_pages = 0; +} + +void preswap_init(unsigned type) +{ + /* only need one tmem pool for all swap types */ + if ((s32)preswap_poolid >= 0) + return; + preswap_poolid = tmem_new_pool(0, 0, TMEM_POOL_PERSIST); +} --- head-2010-05-12.orig/mm/swapfile.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/mm/swapfile.c 2010-03-24 14:09:47.000000000 +0100 @@ -587,6 +587,7 @@ static unsigned char swap_entry_free(str swap_list.next = p->type; nr_swap_pages++; p->inuse_pages--; + preswap_flush(p->type, offset); } return usage; @@ -1029,7 +1030,7 @@ static int unuse_mm(struct mm_struct *mm * Recycle to start on reaching the end, returning 0 when empty. */ static unsigned int find_next_to_unuse(struct swap_info_struct *si, - unsigned int prev) + unsigned int prev, unsigned int preswap) { unsigned int max = si->max; unsigned int i = prev; @@ -1055,6 +1056,12 @@ static unsigned int find_next_to_unuse(s prev = 0; i = 1; } + if (preswap) { + if (preswap_test(si, i)) + break; + else + continue; + } count = si->swap_map[i]; if (count && swap_count(count) != SWAP_MAP_BAD) break; @@ -1066,8 +1073,12 @@ static unsigned int find_next_to_unuse(s * We completely avoid races by reading each swap page in advance, * and then search for the process using it. All the necessary * page table adjustments can then be made atomically. + * + * if the boolean preswap is true, only unuse pages_to_unuse pages; + * pages_to_unuse==0 means all pages */ -static int try_to_unuse(unsigned int type) +static int try_to_unuse(unsigned int type, unsigned int preswap, + unsigned long pages_to_unuse) { struct swap_info_struct *si = swap_info[type]; struct mm_struct *start_mm; @@ -1100,7 +1111,7 @@ static int try_to_unuse(unsigned int typ * one pass through swap_map is enough, but not necessarily: * there are races when an instance of an entry might be missed. */ - while ((i = find_next_to_unuse(si, i)) != 0) { + while ((i = find_next_to_unuse(si, i, preswap)) != 0) { if (signal_pending(current)) { retval = -EINTR; break; @@ -1267,6 +1278,8 @@ static int try_to_unuse(unsigned int typ * interactive performance. 
*/ cond_resched(); + if (preswap && pages_to_unuse && !--pages_to_unuse) + break; } mmput(start_mm); @@ -1611,7 +1624,7 @@ SYSCALL_DEFINE1(swapoff, const char __us spin_unlock(&swap_lock); current->flags |= PF_OOM_ORIGIN; - err = try_to_unuse(type); + err = try_to_unuse(type, 0, 0); current->flags &= ~PF_OOM_ORIGIN; if (err) { @@ -1663,9 +1676,14 @@ SYSCALL_DEFINE1(swapoff, const char __us swap_map = p->swap_map; p->swap_map = NULL; p->flags = 0; + preswap_flush_area(type); spin_unlock(&swap_lock); mutex_unlock(&swapon_mutex); vfree(swap_map); +#ifdef CONFIG_PRESWAP + if (p->preswap_map) + vfree(p->preswap_map); +#endif /* Destroy swap account informatin */ swap_cgroup_swapoff(type); @@ -1821,6 +1839,7 @@ SYSCALL_DEFINE2(swapon, const char __use unsigned long maxpages; unsigned long swapfilepages; unsigned char *swap_map = NULL; + unsigned long *preswap_map = NULL; struct page *page = NULL; struct inode *inode = NULL; int did_down = 0; @@ -2021,6 +2040,12 @@ SYSCALL_DEFINE2(swapon, const char __use } } +#ifdef CONFIG_PRESWAP + preswap_map = vmalloc(maxpages / sizeof(long)); + if (preswap_map) + memset(preswap_map, 0, maxpages / sizeof(long)); +#endif + error = swap_cgroup_swapon(type, maxpages); if (error) goto bad_swap; @@ -2059,6 +2084,9 @@ SYSCALL_DEFINE2(swapon, const char __use else p->prio = --least_priority; p->swap_map = swap_map; +#ifdef CONFIG_PRESWAP + p->preswap_map = preswap_map; +#endif p->flags |= SWP_WRITEOK; nr_swap_pages += nr_good_pages; total_swap_pages += nr_good_pages; @@ -2082,6 +2110,7 @@ SYSCALL_DEFINE2(swapon, const char __use swap_list.head = swap_list.next = type; else swap_info[prev]->next = type; + preswap_init(type); spin_unlock(&swap_lock); mutex_unlock(&swapon_mutex); error = 0; @@ -2098,6 +2127,7 @@ bad_swap_2: p->swap_file = NULL; p->flags = 0; spin_unlock(&swap_lock); + vfree(preswap_map); vfree(swap_map); if (swap_file) filp_close(swap_file, NULL); @@ -2316,6 +2346,10 @@ int valid_swaphandles(swp_entry_t entry, base++; spin_lock(&swap_lock); + if (preswap_test(si, target)) { + spin_unlock(&swap_lock); + return 0; + } if (end > si->max) /* don't go beyond end of map */ end = si->max; @@ -2326,6 +2360,9 @@ int valid_swaphandles(swp_entry_t entry, break; if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD) break; + /* Don't read in preswap pages */ + if (preswap_test(si, toff)) + break; } /* Count contiguous allocated slots below our target */ for (toff = target; --toff >= base; nr_pages++) { @@ -2334,6 +2371,9 @@ int valid_swaphandles(swp_entry_t entry, break; if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD) break; + /* Don't read in preswap pages */ + if (preswap_test(si, toff)) + break; } spin_unlock(&swap_lock); @@ -2560,3 +2600,98 @@ static void free_swap_count_continuation } } } + +#ifdef CONFIG_PRESWAP +/* + * preswap infrastructure functions + */ + +struct swap_info_struct *get_swap_info_struct(unsigned int type) +{ + BUG_ON(type > MAX_SWAPFILES); + return swap_info[type]; +} + +/* code structure leveraged from sys_swapoff */ +void preswap_shrink(unsigned long target_pages) +{ + struct swap_info_struct *si = NULL; + unsigned long total_pages = 0, total_pages_to_unuse; + unsigned long pages = 0, unuse_pages = 0; + int type; + int wrapped = 0; + + do { + /* + * we don't want to hold swap_lock while doing a very + * lengthy try_to_unuse, but swap_list may change + * so restart scan from swap_list.head each time + */ + spin_lock(&swap_lock); + total_pages = 0; + for (type = swap_list.head; type >= 0; type = si->next) { + si = swap_info[type]; + 
total_pages += si->preswap_pages; + } + if (total_pages <= target_pages) { + spin_unlock(&swap_lock); + return; + } + total_pages_to_unuse = total_pages - target_pages; + for (type = swap_list.head; type >= 0; type = si->next) { + si = swap_info[type]; + if (total_pages_to_unuse < si->preswap_pages) + pages = unuse_pages = total_pages_to_unuse; + else { + pages = si->preswap_pages; + unuse_pages = 0; /* unuse all */ + } + if (security_vm_enough_memory_kern(pages)) + continue; + vm_unacct_memory(pages); + break; + } + spin_unlock(&swap_lock); + if (type < 0) + return; + current->flags |= PF_OOM_ORIGIN; + (void)try_to_unuse(type, 1, unuse_pages); + current->flags &= ~PF_OOM_ORIGIN; + wrapped++; + } while (wrapped <= 3); +} + + +#ifdef CONFIG_SYSCTL +/* cat /sys/proc/vm/preswap provides total number of pages in preswap + * across all swaptypes. echo N > /sys/proc/vm/preswap attempts to shrink + * preswap page usage to N (usually 0) */ +int preswap_sysctl_handler(ctl_table *table, int write, + void __user *buffer, size_t *length, loff_t *ppos) +{ + unsigned long npages; + int type; + unsigned long totalpages = 0; + struct swap_info_struct *si = NULL; + + /* modeled after hugetlb_sysctl_handler in mm/hugetlb.c */ + if (!write) { + spin_lock(&swap_lock); + for (type = swap_list.head; type >= 0; type = si->next) { + si = swap_info[type]; + totalpages += si->preswap_pages; + } + spin_unlock(&swap_lock); + npages = totalpages; + } + table->data = &npages; + table->maxlen = sizeof(unsigned long); + proc_doulongvec_minmax(table, write, buffer, length, ppos); + + if (write) + preswap_shrink(npages); + + return 0; +} +#endif +#endif /* CONFIG_PRESWAP */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ head-2010-05-12/mm/tmem.h 2010-03-24 14:09:47.000000000 +0100 @@ -0,0 +1,84 @@ +/* + * linux/mm/tmem.h + * + * Interface to transcendent memory, used by mm/precache.c and mm/preswap.c + * Currently implemented on XEN, but may be implemented elsewhere in future. + * + * Copyright (C) 2008,2009 Dan Magenheimer, Oracle Corp. 
+ */ + +#ifdef CONFIG_XEN +#include + +/* Bits for HYPERVISOR_tmem_op(TMEM_NEW_POOL) */ +#define TMEM_POOL_MIN_PAGESHIFT 12 +#define TMEM_POOL_PAGEORDER (PAGE_SHIFT - TMEM_POOL_MIN_PAGESHIFT) + +extern int xen_tmem_op(u32 tmem_cmd, u32 tmem_pool, u64 object, u32 index, + unsigned long gmfn, u32 tmem_offset, u32 pfn_offset, u32 len); +extern int xen_tmem_new_pool(u32 tmem_cmd, u64 uuid_lo, u64 uuid_hi, u32 flags); + +static inline int tmem_put_page(u32 pool_id, u64 object, u32 index, + unsigned long gmfn) +{ + return xen_tmem_op(TMEM_PUT_PAGE, pool_id, object, index, + gmfn, 0, 0, 0); +} + +static inline int tmem_get_page(u32 pool_id, u64 object, u32 index, + unsigned long gmfn) +{ + return xen_tmem_op(TMEM_GET_PAGE, pool_id, object, index, + gmfn, 0, 0, 0); +} + +static inline int tmem_flush_page(u32 pool_id, u64 object, u32 index) +{ + return xen_tmem_op(TMEM_FLUSH_PAGE, pool_id, object, index, + 0, 0, 0, 0); +} + +static inline int tmem_flush_object(u32 pool_id, u64 object) +{ + return xen_tmem_op(TMEM_FLUSH_OBJECT, pool_id, object, 0, 0, 0, 0, 0); +} + +static inline int tmem_new_pool(u64 uuid_lo, u64 uuid_hi, u32 flags) +{ + BUILD_BUG_ON((TMEM_POOL_PAGEORDER < 0) || + (TMEM_POOL_PAGEORDER >= TMEM_POOL_PAGESIZE_MASK)); + flags |= TMEM_POOL_PAGEORDER << TMEM_POOL_PAGESIZE_SHIFT; + return xen_tmem_new_pool(TMEM_NEW_POOL, uuid_lo, uuid_hi, flags); +} + +static inline int tmem_destroy_pool(u32 pool_id) +{ + return xen_tmem_op(TMEM_DESTROY_POOL, pool_id, 0, 0, 0, 0, 0, 0); +} +#else +struct tmem_op { + u32 cmd; + s32 pool_id; /* private > 0; shared < 0; 0 is invalid */ + union { + struct { /* for cmd == TMEM_NEW_POOL */ + u64 uuid[2]; + u32 flags; + } new; + struct { /* for cmd == TMEM_CONTROL */ + u32 subop; + u32 cli_id; + u32 arg1; + u32 arg2; + void *buf; + } ctrl; + struct { + u64 object; + u32 index; + u32 tmem_offset; + u32 pfn_offset; + u32 len; + unsigned long pfn; /* page frame */ + } gen; + } u; +}; +#endif --- head-2010-05-12.orig/mm/truncate.c 2010-05-12 08:55:24.000000000 +0200 +++ head-2010-05-12/mm/truncate.c 2010-04-15 09:41:48.000000000 +0200 @@ -16,6 +16,7 @@ #include #include #include +#include #include #include /* grr. 
try_to_release_page, do_invalidatepage */ @@ -51,6 +52,7 @@ void do_invalidatepage(struct page *page static inline void truncate_partial_page(struct page *page, unsigned partial) { zero_user_segment(page, partial, PAGE_CACHE_SIZE); + precache_flush(page->mapping, page->index); if (page_has_private(page)) do_invalidatepage(page, partial); } @@ -108,6 +110,10 @@ truncate_complete_page(struct address_sp clear_page_mlock(page); remove_from_page_cache(page); ClearPageMappedToDisk(page); + /* this must be after the remove_from_page_cache which + * calls precache_put + */ + precache_flush(mapping, page->index); page_cache_release(page); /* pagecache ref */ return 0; } @@ -215,6 +221,7 @@ void truncate_inode_pages_range(struct a pgoff_t next; int i; + precache_flush_inode(mapping); if (mapping->nrpages == 0) return; @@ -290,6 +297,7 @@ void truncate_inode_pages_range(struct a pagevec_release(&pvec); mem_cgroup_uncharge_end(); } + precache_flush_inode(mapping); } EXPORT_SYMBOL(truncate_inode_pages_range); @@ -428,6 +436,7 @@ int invalidate_inode_pages2_range(struct int did_range_unmap = 0; int wrapped = 0; + precache_flush_inode(mapping); pagevec_init(&pvec, 0); next = start; while (next <= end && !wrapped && @@ -486,6 +495,7 @@ int invalidate_inode_pages2_range(struct mem_cgroup_uncharge_end(); cond_resched(); } + precache_flush_inode(mapping); return ret; } EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);
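Finally, the swap-side counterpart implemented by the mm/page_io.c hunks
earlier in this patch can be summarized with a similar hypothetical sketch:
the kernel first offers the page to the persistent preswap pool and only
issues block I/O when tmem declines the put (on write) or the preswap_map
says the page is not in tmem (on read). The helper names below are
illustrative only; the real logic lives inline in swap_writepage() and
swap_readpage().

#include <linux/swap.h>
#include <linux/pagemap.h>

/*
 * Hypothetical condensation of the swap_writepage()/swap_readpage()
 * changes in mm/page_io.c above: try the persistent preswap pool first
 * and fall back to the swap device when tmem declines or misses.
 */
static int write_swap_page_with_preswap(struct page *page,
					int (*write_to_swapdev)(struct page *))
{
	if (preswap_put(page) == 1) {
		/* accepted by tmem: complete "writeback" without a bio */
		set_page_writeback(page);
		unlock_page(page);
		end_page_writeback(page);
		return 0;
	}
	return write_to_swapdev(page);	/* rejected: swap to disk as usual */
}

static int read_swap_page_with_preswap(struct page *page,
				       int (*read_from_swapdev)(struct page *))
{
	if (preswap_get(page) == 1) {
		/* preswap_map said the page is in tmem; data now in @page */
		SetPageUptodate(page);
		unlock_page(page);
		return 0;
	}
	return read_from_swapdev(page);
}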