Subject: Transcendent memory ("tmem") for Linux
From: http://xenbits.xensource.com/linux-2.6.18-xen.hg (tip 908:baeb818cd2dc)
Patch-mainline: n/a

Tmem, when called from a tmem-capable (paravirtualized) guest, makes
use of otherwise unutilized ("fallow") memory to create and manage
pools of pages that can be accessed from the guest either as
"ephemeral" pages or as "persistent" pages. In either case, the pages
are not directly addressable by the guest; they are only copied to
and from the guest via the tmem interface. Ephemeral pages are a nice
place for a guest to put recently evicted clean pages that it might
need again; these pages can be reclaimed synchronously by Xen for
other guests or other uses. Persistent pages are a nice place for a
guest to put "swap" pages to avoid sending them to disk. These pages
retain data as long as the guest lives, but count against the guest
memory allocation.

This patch contains the Linux paravirtualization changes to
complement the tmem Xen patch (xen-unstable c/s 19646). It
implements "precache" (ext3 only as of now), "preswap",
and limited "shared precache" (ocfs2 only as of now) support.
CONFIG options are required to turn on
the support (but in this patch they default to "y"). If
the underlying Xen does not have tmem support or has it
turned off, this is sensed early to avoid nearly all
hypercalls.

Lots of useful prose about tmem can be found at
http://oss.oracle.com/projects/tmem

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Acked-by: jbeulich@novell.com

---
 Documentation/transcendent-memory.txt |  176 ++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.c                  |    9 +
 fs/btrfs/super.c                      |    2
 fs/buffer.c                           |    6 +
 fs/ext3/super.c                       |    2
 fs/ext4/super.c                       |    3
 fs/mpage.c                            |    8 +
 fs/ocfs2/super.c                      |    2
 fs/super.c                            |    5
 include/linux/fs.h                    |    3
 include/linux/precache.h              |   55 ++++++++++
 include/linux/swap.h                  |   53 +++++++++
 kernel/sysctl.c                       |   11 ++
 mm/Kconfig                            |   28 +++++
 mm/Makefile                           |    3
 mm/filemap.c                          |   11 ++
 mm/page_io.c                          |   13 ++
 mm/precache.c                         |  138 +++++++++++++++++++++++++
 mm/preswap.c                          |  182 ++++++++++++++++++++++++++++++++++
 mm/swapfile.c                         |  143 +++++++++++++++++++++++++-
 mm/tmem.h                             |   84 +++++++++++++++
 mm/truncate.c                         |   10 +
 22 files changed, 943 insertions(+), 4 deletions(-)

--- /dev/null
+++ b/Documentation/transcendent-memory.txt
@@ -0,0 +1,176 @@
+Normal memory is directly addressable by the kernel, of a known
+normally-fixed size, synchronously accessible, and persistent (though
+not across a reboot).
+
+What if there was a class of memory that is of unknown and dynamically
+variable size, is addressable only indirectly by the kernel, can be
+configured either as persistent or as "ephemeral" (meaning it will be
+around for awhile, but might disappear without warning), and is still
+fast enough to be synchronously accessible?
+
+We call this latter class "transcendent memory" and it provides an
+interesting opportunity to more efficiently utilize RAM in a virtualized
+environment. However, this "memory but not really memory" may also have
+applications in NON-virtualized environments, such as hotplug-memory
+deletion, SSDs, and page cache compression. Others have suggested ideas
+such as allowing use of highmem memory without a highmem kernel, or use
+of spare video memory.
+
+Transcendent memory, or "tmem" for short, provides a well-defined API to
+access this unusual class of memory. (A summary of the API is provided
+below.) The basic operations are page-copy-based and use a flexible
+object-oriented addressing mechanism. Tmem assumes that some "privileged
+entity" is capable of executing tmem requests and storing pages of data;
+this entity is currently a hypervisor and operations are performed via
+hypercalls, but the entity could be a kernel policy, or perhaps a
+"memory node" in a cluster of blades connected by a high-speed
+interconnect such as HyperTransport or QPI.
+
+Since tmem is not directly accessible and because page copying is done
+to/from physical pageframes, it is more suitable for in-kernel memory
+needs than for userland applications. However, there may be yet
+undiscovered userland possibilities.
+
+With the tmem concept outlined and its broader potential hinted at, we
+will overview two existing examples of how tmem can be used by the
+kernel.
+
+"Cleancache" can be thought of as a page-granularity victim cache for clean
|
|
+pages that the kernel's pageframe replacement algorithm (PFRA) would like
|
|
+to keep around, but can't since there isn't enough memory. So when the
|
|
+PFRA "evicts" a page, it first puts it into the cleancache via a call to
|
|
+tmem. And any time a filesystem reads a page from disk, it first attempts
|
|
+to get the page from cleancache. If it's there, a disk access is eliminated.
|
|
+If not, the filesystem just goes to the disk like normal. Cleancache is
|
|
+"ephemeral" so whether a page is kept in cleancache (between the "put" and
|
|
+the "get") is dependent on a number of factors that are invisible to
|
|
+the kernel.
|
|
+
|
|
+"Frontswap" is so named because it can be thought of as the opposite of
|
|
+a "backing store". Frontswap IS persistent, but for various reasons may not
|
|
+always be available for use, again due to factors that may not be visible to
|
|
+the kernel. (But, briefly, if the kernel is being "good" and has shared its
|
|
+resources nicely, then it will be able to use frontswap, else it will not.)
|
|
+Once a page is put, a get on the page will always succeed. So when the
|
|
+kernel finds itself in a situation where it needs to swap out a page, it
|
|
+first attempts to use frontswap. If the put works, a disk write and
|
|
+(usually) a disk read are avoided. If it doesn't, the page is written
|
|
+to swap as usual. Unlike cleancache, whether a page is stored in frontswap
|
|
+vs swap is recorded in kernel data structures, so when a page needs to
|
|
+be fetched, the kernel does a get if it is in frontswap and reads from
|
|
+swap if it is not in frontswap.
|
|
+
|
|
+Both cleancache and frontswap may be optionally compressed, trading off 2x
|
|
+space reduction vs 10x performance for access. Cleancache also has a
|
|
+sharing feature, which allows different nodes in a "virtual cluster"
|
|
+to share a local page cache.
|
|
+
|
|
+Tmem has some similarity to IBM's Collaborative Memory Management, but
|
|
+creates more of a partnership between the kernel and the "privileged
|
|
+entity" and is not very invasive. Tmem may be applicable for KVM and
|
|
+containers; there is some disagreement on the extent of its value.
|
|
+Tmem is highly complementary to ballooning (aka page granularity hot
|
|
+plug) and memory deduplication (aka transparent content-based page
|
|
+sharing) but still has value when neither are present.
|
|
+
|
|
+Performance is difficult to quantify: some benchmarks respond very
+favorably to increases in memory, and tmem may do quite well on those,
+but how much tmem is available can vary widely and dynamically,
+depending on conditions completely outside of the system being
+measured. Ideas on how best to provide useful metrics would be
+appreciated.
+
+Tmem is supported starting in Xen 4.0 and is in Xen's Linux 2.6.18-xen
+source tree. It is also released as a technology preview in Oracle's
+Xen-based virtualization product, Oracle VM 2.2. Again, Xen is not
+necessarily a requirement, but currently provides the only existing
+implementation of tmem.
+
+Lots more information about tmem can be found at:
+   http://oss.oracle.com/projects/tmem
+and there was a talk about it on the first day of the Linux Symposium
+in July 2009; an updated talk is planned at linux.conf.au in January
+2010. Tmem is the result of a group effort, including Dan Magenheimer,
+Chris Mason, Dave McCracken, Kurt Hackel and Zhigang Wang, with helpful
+input from Jeremy Fitzhardinge, Keir Fraser, Ian Pratt, Sunil Mushran,
+Joel Becker, and Jan Beulich.
+
+THE TRANSCENDENT MEMORY API
+
+Transcendent memory is made up of a set of pools. Each pool is made
+up of a set of objects. And each object contains a set of pages.
+The combination of a 32-bit pool id, a 64-bit object id, and a 32-bit
+page id uniquely identifies a page of tmem data, and this tuple is
+called a "handle." Commonly, the three parts of a handle are used to
+address a filesystem, a file within that filesystem, and a page within
+that file; however, an OS can use any values as long as they uniquely
+identify a page of data.
+
+When a tmem pool is created, it is given certain attributes: It can
+be private or shared, and it can be persistent or ephemeral. Each
+combination of these attributes provides a different set of useful
+functionality and also defines a slightly different set of semantics
+for the various operations on the pool. Other pool attributes include
+the size of the page and a version number.
+
+Once a pool is created, operations are performed on the pool. Pages
+are copied between the OS and tmem and are addressed using a handle.
+Pages and/or objects may also be flushed from the pool. When all
+operations are completed, a pool can be destroyed.
+
+The specific tmem functions are called in Linux through a set of
+accessor functions:
+
+int (*new_pool)(struct tmem_pool_uuid uuid, u32 flags);
+int (*destroy_pool)(u32 pool_id);
+int (*put_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
+int (*get_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
+int (*flush_page)(u32 pool_id, u64 object, u32 index);
+int (*flush_object)(u32 pool_id, u64 object);
+
+The new_pool accessor creates a new pool and returns a pool id
+which is a non-negative 32-bit integer. If the flags parameter
+specifies that the pool is to be shared, the uuid is a 128-bit "shared
+secret"; otherwise it is ignored. The destroy_pool accessor destroys
+the pool. (Note: shared pools are not supported until security
+implications are better understood.)
+
+The put_page accessor copies a page of data from the specified pageframe
+and associates it with the specified handle.
+
+The get_page accessor looks up a page of data in tmem associated with
+the specified handle and, if found, copies it to the specified pageframe.
+
+The flush_page accessor ensures that subsequent gets of a page with
+the specified handle will fail. The flush_object accessor ensures
+that subsequent gets of any page matching the pool id and object
+will fail.
+
+There are many subtle but critical behaviors for get_page and put_page:
+- Any put_page (with one notable exception) may be rejected and the client
+  must be prepared to deal with that failure. A put_page copies, NOT moves,
+  data; that is, the data exists in both places. Linux is responsible for
+  destroying or overwriting its own copy, or alternately managing any
+  coherency between the copies.
+- Every page successfully put to a persistent pool must be found by a
+  subsequent get_page that specifies the same handle. A page successfully
+  put to an ephemeral pool has an indeterminate lifetime and even an
+  immediately subsequent get_page may fail.
+- A get_page to a private pool is destructive; that is, it behaves as if
+  the get_page were atomically followed by a flush_page. A get_page
+  to a shared pool is non-destructive. A flush_page behaves just like
+  a get_page to a private pool except the data is thrown away.
+- Put-put-get coherency is guaranteed. For example, after the sequence:
+        put_page(ABC,D1);
+        put_page(ABC,D2);
+        get_page(ABC,E)
+  E may never contain the data from D1. However, even for a persistent
+  pool, the get_page may fail if the second put_page indicates failure.
+- Get-get coherency is guaranteed. For example, in the sequence:
+        put_page(ABC,D);
+        get_page(ABC,E1);
+        get_page(ABC,E2)
+  if the first get_page fails, the second must also fail.
+- A tmem implementation provides no serialization guarantees (e.g. to
+  an SMP Linux). So if different Linux threads are putting and flushing
+  the same page, the results are indeterminate.
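
To make the API semantics above concrete, here is a minimal usage
sketch of the accessors, written against the prototypes in the new
documentation file. It is illustrative only and is not part of the
patch: the tmem_accessors table, the layout of struct tmem_pool_uuid,
the object id 42, and the convention that put_page/get_page return 1
on success (which matches what the mm/precache.c code below assumes)
are all assumptions made for the sketch.

    /* illustrative sketch only -- not part of the patch */
    #include <linux/mm.h>
    #include <linux/gfp.h>
    #include <linux/errno.h>

    struct tmem_pool_uuid { u64 lo, hi; };      /* layout assumed */

    struct tmem_accessors {                     /* the six accessors above */
            int (*new_pool)(struct tmem_pool_uuid uuid, u32 flags);
            int (*destroy_pool)(u32 pool_id);
            int (*put_page)(u32 pool_id, u64 object, u32 index,
                            unsigned long pfn);
            int (*get_page)(u32 pool_id, u64 object, u32 index,
                            unsigned long pfn);
            int (*flush_page)(u32 pool_id, u64 object, u32 index);
            int (*flush_object)(u32 pool_id, u64 object);
    };

    static int tmem_demo(struct tmem_accessors *t)
    {
            struct tmem_pool_uuid uuid = { 0, 0 }; /* ignored: private pool */
            struct page *page = alloc_page(GFP_KERNEL);
            unsigned long pfn;
            int pool_id;

            if (!page)
                    return -ENOMEM;
            pfn = page_to_pfn(page);

            /* flags == 0: a private, ephemeral pool is assumed here */
            pool_id = t->new_pool(uuid, 0);
            if (pool_id < 0)
                    goto out;

            /* a put may always be rejected; a failed put is not an error */
            if (t->put_page(pool_id, 42, 0, pfn) == 1) {
                    /* ephemeral pool: this get may fail even immediately */
                    if (t->get_page(pool_id, 42, 0, pfn) == 1)
                            ; /* data copied back; a private get also flushes */
            }
            t->flush_object(pool_id, 42);       /* drop any remaining pages */
            t->destroy_pool(pool_id);
    out:
            __free_page(page);
            return 0;
    }

In the patch itself these operations are reached through the Xen
hypercall wrappers added in mm/tmem.h below.
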
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -10,6 +10,7 @@
 #include <linux/swap.h>
 #include <linux/writeback.h>
 #include <linux/pagevec.h>
+#include <linux/precache.h>
 #include "extent_io.h"
 #include "extent_map.h"
 #include "compat.h"
@@ -1990,6 +1991,13 @@ static int __extent_read_full_page(struc

 	set_page_extent_mapped(page);

+	if (!PageUptodate(page)) {
+		if (precache_get(page->mapping, page->index, page) == 1) {
+			BUG_ON(blocksize != PAGE_SIZE);
+			goto out;
+		}
+	}
+
 	end = page_end;
 	while (1) {
 		lock_extent(tree, start, end, GFP_NOFS);
@@ -2117,6 +2125,7 @@ static int __extent_read_full_page(struc
 		cur = cur + iosize;
 		page_offset += iosize;
 	}
+out:
 	if (!nr) {
 		if (!PageError(page))
 			SetPageUptodate(page);
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -39,6 +39,7 @@
 #include <linux/miscdevice.h>
 #include <linux/magic.h>
 #include <linux/slab.h>
+#include <linux/precache.h>
 #include "compat.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -607,6 +608,7 @@ static int btrfs_fill_super(struct super
 	sb->s_root = root_dentry;

 	save_mount_options(sb, data);
+	precache_init(sb);
 	return 0;

 fail_close:
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -41,6 +41,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/precache.h>

 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);

@@ -277,6 +278,11 @@ void invalidate_bdev(struct block_device
 	invalidate_bh_lrus();
 	lru_add_drain_all();	/* make sure all lru add caches are flushed */
 	invalidate_mapping_pages(mapping, 0, -1);
+
+	/* 99% of the time, we don't need to flush the precache on the bdev.
+	 * But, for the strange corners, let's be cautious.
+	 */
+	precache_flush_inode(mapping);
 }
 EXPORT_SYMBOL(invalidate_bdev);

--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -36,6 +36,7 @@
 #include <linux/quotaops.h>
 #include <linux/seq_file.h>
 #include <linux/log2.h>
+#include <linux/precache.h>

 #include <asm/uaccess.h>

@@ -1367,6 +1368,7 @@ static int ext3_setup_super(struct super
 	} else {
 		ext3_msg(sb, KERN_INFO, "using internal journal");
 	}
+	precache_init(sb);
 	return res;
 }

--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -38,6 +38,7 @@
 #include <linux/ctype.h>
 #include <linux/log2.h>
 #include <linux/crc16.h>
+#include <linux/precache.h>
 #include <asm/uaccess.h>

 #include <linux/kthread.h>
@@ -1941,6 +1942,8 @@ static int ext4_setup_super(struct super
 			EXT4_INODES_PER_GROUP(sb),
 			sbi->s_mount_opt, sbi->s_mount_opt2);

+	precache_init(sb);
+
 	return res;
 }

--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -27,6 +27,7 @@
 #include <linux/writeback.h>
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
+#include <linux/precache.h>

 /*
  * I/O completion handler for multipage BIOs.
@@ -271,6 +272,13 @@ do_mpage_readpage(struct bio *bio, struc
 		SetPageMappedToDisk(page);
 	}

+	if (fully_mapped &&
+	    blocks_per_page == 1 && !PageUptodate(page) &&
+	    precache_get(page->mapping, page->index, page) == 1) {
+		SetPageUptodate(page);
+		goto confused;
+	}
+
 	/*
 	 * This page will go to BIO. Do we need to send this BIO off first?
 	 */
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -41,6 +41,7 @@
 #include <linux/mount.h>
 #include <linux/seq_file.h>
 #include <linux/quotaops.h>
+#include <linux/precache.h>

 #define MLOG_MASK_PREFIX ML_SUPER
 #include <cluster/masklog.h>
@@ -2385,6 +2386,7 @@ static int ocfs2_initialize_super(struct
 		mlog_errno(status);
 		goto bail;
 	}
+	shared_precache_init(sb, &di->id2.i_super.s_uuid[0]);

 bail:
 	mlog_exit(status);
--- a/fs/super.c
+++ b/fs/super.c
@@ -31,6 +31,7 @@
 #include <linux/mutex.h>
 #include <linux/backing-dev.h>
 #include <linux/rculist_bl.h>
+#include <linux/precache.h>
 #include "internal.h"


@@ -112,6 +113,9 @@ static struct super_block *alloc_super(s
 		s->s_maxbytes = MAX_NON_LFS;
 		s->s_op = &default_op;
 		s->s_time_gran = 1000000000;
+#ifdef CONFIG_PRECACHE
+		s->precache_poolid = -1;
+#endif
 	}
 out:
 	return s;
@@ -183,6 +187,7 @@ void deactivate_locked_super(struct supe
 		 * inodes are flushed before we release the fs module.
 		 */
 		rcu_barrier();
+		precache_flush_filesystem(s);
 		put_filesystem(fs);
 		put_super(s);
 	} else {
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1426,6 +1426,9 @@ struct super_block {
 	/* Granularity of c/m/atime in ns.
 	   Cannot be worse than a second */
 	u32		s_time_gran;
+#ifdef CONFIG_PRECACHE
+	u32		precache_poolid;
+#endif

 	/*
 	 * The next field is for VFS *only*. No filesystems have any business
--- /dev/null
+++ b/include/linux/precache.h
@@ -0,0 +1,55 @@
+#ifndef _LINUX_PRECACHE_H
+#define _LINUX_PRECACHE_H
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+#ifdef CONFIG_PRECACHE
+extern void precache_init(struct super_block *sb);
+extern void shared_precache_init(struct super_block *sb, char *uuid);
+extern int precache_get(struct address_space *mapping, unsigned long index,
+		struct page *empty_page);
+extern int precache_put(struct address_space *mapping, unsigned long index,
+		struct page *page);
+extern int precache_flush(struct address_space *mapping, unsigned long index);
+extern int precache_flush_inode(struct address_space *mapping);
+extern int precache_flush_filesystem(struct super_block *s);
+#else
+static inline void precache_init(struct super_block *sb)
+{
+}
+
+static inline void shared_precache_init(struct super_block *sb, char *uuid)
+{
+}
+
+static inline int precache_get(struct address_space *mapping,
+		unsigned long index, struct page *empty_page)
+{
+	return 0;
+}
+
+static inline int precache_put(struct address_space *mapping,
+		unsigned long index, struct page *page)
+{
+	return 0;
+}
+
+static inline int precache_flush(struct address_space *mapping,
+		unsigned long index)
+{
+	return 0;
+}
+
+static inline int precache_flush_inode(struct address_space *mapping)
+{
+	return 0;
+}
+
+static inline int precache_flush_filesystem(struct super_block *s)
+{
+	return 0;
+}
+#endif
+
+#endif /* _LINUX_PRECACHE_H */
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -186,8 +186,61 @@ struct swap_info_struct {
 	struct block_device *bdev;	/* swap device or bdev of swap file */
 	struct file *swap_file;		/* seldom referenced */
 	unsigned int old_block_size;	/* seldom referenced */
+#ifdef CONFIG_PRESWAP
+	unsigned long *preswap_map;
+	unsigned int preswap_pages;
+#endif
 };

+#ifdef CONFIG_PRESWAP
+
+#include <linux/sysctl.h>
+extern int preswap_sysctl_handler(struct ctl_table *, int, void __user *,
+	size_t *, loff_t *);
+extern const unsigned long preswap_zero, preswap_infinity;
+
+extern struct swap_info_struct *get_swap_info_struct(unsigned int type);
+
+extern void preswap_shrink(unsigned long);
+extern int preswap_test(struct swap_info_struct *, unsigned long);
+extern void preswap_init(unsigned);
+extern int preswap_put(struct page *);
+extern int preswap_get(struct page *);
+extern void preswap_flush(unsigned, unsigned long);
+extern void preswap_flush_area(unsigned);
+#else
+static inline void preswap_shrink(unsigned long target_pages)
+{
+}
+
+static inline int preswap_test(struct swap_info_struct *sis, unsigned long offset)
+{
+	return 0;
+}
+
+static inline void preswap_init(unsigned type)
+{
+}
+
+static inline int preswap_put(struct page *page)
+{
+	return 0;
+}
+
+static inline int preswap_get(struct page *page)
+{
+	return 0;
+}
+
+static inline void preswap_flush(unsigned type, unsigned long offset)
+{
+}
+
+static inline void preswap_flush_area(unsigned type)
+{
+}
+#endif /* CONFIG_PRESWAP */
+
 struct swap_list_t {
 	int head;	/* head of priority-ordered swapfile list */
 	int next;	/* swapfile to be used next */
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1330,6 +1330,17 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= scan_unevictable_handler,
 	},
+#ifdef CONFIG_PRESWAP
+	{
+		.procname	= "preswap",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= preswap_sysctl_handler,
+		.extra1		= (void *)&preswap_zero,
+		.extra2		= (void *)&preswap_infinity,
+	},
+#endif
 #ifdef CONFIG_MEMORY_FAILURE
 	{
 		.procname	= "memory_failure_early_kill",
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -347,3 +347,31 @@ config NEED_PER_CPU_KM
 	depends on !SMP
 	bool
 	default y
+
+#
+# support for transcendent memory
+#
+config TMEM
+	bool
+	help
+	  In a virtualized environment, allows unused and underutilized
+	  system physical memory to be made accessible through a narrow
+	  well-defined page-copy-based API. If unsure, say Y.
+
+config PRECACHE
+	bool "Cache clean pages in transcendent memory"
+	depends on XEN
+	select TMEM
+	help
+	  Allows the transcendent memory pool to be used to store clean
+	  page-cache pages which, under some circumstances, will greatly
+	  reduce paging and thus improve performance. If unsure, say Y.
+
+config PRESWAP
+	bool "Swap pages to transcendent memory"
+	depends on XEN
+	select TMEM
+	help
+	  Allows the transcendent memory pool to be used as a pseudo-swap
+	  device which, under some circumstances, will greatly reduce
+	  swapping and thus improve performance. If unsure, say Y.
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -19,6 +19,9 @@ obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.

 obj-$(CONFIG_BOUNCE)	+= bounce.o
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_TMEM)	+= tmem.o
+obj-$(CONFIG_PRESWAP)	+= preswap.o
+obj-$(CONFIG_PRECACHE)	+= precache.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA)	+= mempolicy.o
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/precache.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"

@@ -116,6 +117,16 @@ void __remove_from_page_cache(struct pag
 {
 	struct address_space *mapping = page->mapping;

+	/*
+	 * if we're uptodate, flush out into the precache, otherwise
+	 * invalidate any existing precache entries. We can't leave
+	 * stale data around in the precache once our page is gone.
+	 */
+	if (PageUptodate(page))
+		precache_put(page->mapping, page->index, page);
+	else
+		precache_flush(page->mapping, page->index);
+
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -111,6 +111,13 @@ int swap_writepage(struct page *page, st
 		return ret;
 	}

+	if (preswap_put(page) == 1) {
+		set_page_writeback(page);
+		unlock_page(page);
+		end_page_writeback(page);
+		goto out;
+	}
+
 	bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
 	if (bio == NULL) {
 		set_page_dirty(page);
@@ -179,6 +186,12 @@ int swap_readpage(struct page *page)
 		return ret;
 	}

+	if (preswap_get(page) == 1) {
+		SetPageUptodate(page);
+		unlock_page(page);
+		goto out;
+	}
+
 	bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
 	if (bio == NULL) {
 		unlock_page(page);
--- /dev/null
+++ b/mm/precache.c
@@ -0,0 +1,138 @@
+/*
+ * linux/mm/precache.c
+ *
+ * Implements "precache" for filesystems/pagecache on top of the transcendent
+ * memory ("tmem") API.  A filesystem creates an "ephemeral tmem pool"
+ * and retains the returned pool_id in its superblock.  Clean pages evicted
+ * from pagecache may be "put" into the pool and associated with a "handle"
+ * consisting of the pool_id, an object (inode) id, and an index (page offset).
+ * Note that the page is copied to tmem; no kernel mappings are changed.
+ * If the page is later needed, the filesystem (or VFS) issues a "get", passing
+ * the same handle and an empty pageframe.  If successful, the page is copied
+ * into the pageframe and a disk read is avoided.  But since the tmem pool
+ * is of indeterminate size, a "put" page has indeterminate longevity
+ * ("ephemeral"), and the "get" may fail, in which case the filesystem must
+ * read the page from disk as before.  Note that the filesystem/pagecache are
+ * responsible for maintaining coherency between the pagecache, precache,
+ * and the disk, for which "flush page" and "flush object" actions are
+ * provided.  And when a filesystem is unmounted, it must "destroy" the pool.
+ *
+ * Two types of pools may be created for a precache: "private" or "shared".
+ * For a private pool, a successful "get" always flushes, implementing
+ * exclusive semantics; for a "shared" pool (which is intended for use by
+ * co-resident nodes of a cluster filesystem), the "flush" is not guaranteed.
+ * In either case, a failed "duplicate" put (overwrite) always guarantees
+ * the old data is flushed.
+ *
+ * Note also that multiple accesses to a tmem pool may be concurrent and any
+ * ordering must be guaranteed by the caller.
+ *
+ * Copyright (C) 2008,2009 Dan Magenheimer, Oracle Corp.
+ */
+
+#include <linux/precache.h>
+#include <linux/module.h>
+#include "tmem.h"
+
+static int precache_auto_allocate;	/* set to 1 to auto_allocate */
+
+int precache_put(struct address_space *mapping, unsigned long index,
+		struct page *page)
+{
+	u32 tmem_pool = mapping->host->i_sb->precache_poolid;
+	u64 obj = (unsigned long) mapping->host->i_ino;
+	u32 ind = (u32) index;
+	unsigned long mfn = pfn_to_mfn(page_to_pfn(page));
+	int ret;
+
+	if ((s32)tmem_pool < 0) {
+		if (!precache_auto_allocate)
+			return 0;
+		/* a put on a non-existent precache may auto-allocate one */
+		ret = tmem_new_pool(0, 0, 0);
+		if (ret < 0)
+			return 0;
+		mapping->host->i_sb->precache_poolid = tmem_pool = ret;
+		pr_info("Mapping superblock for s_id=%s to precache_id=%d\n",
+			mapping->host->i_sb->s_id, tmem_pool);
+	}
+	if (ind != index)
+		return 0;
+	mb(); /* ensure page is quiescent; tmem may address it with an alias */
+	return tmem_put_page(tmem_pool, obj, ind, mfn);
+}
+
+int precache_get(struct address_space *mapping, unsigned long index,
+		struct page *empty_page)
+{
+	u32 tmem_pool = mapping->host->i_sb->precache_poolid;
+	u64 obj = (unsigned long) mapping->host->i_ino;
+	u32 ind = (u32) index;
+	unsigned long mfn = pfn_to_mfn(page_to_pfn(empty_page));
+
+	if ((s32)tmem_pool < 0)
+		return 0;
+	if (ind != index)
+		return 0;
+
+	return tmem_get_page(tmem_pool, obj, ind, mfn);
+}
+EXPORT_SYMBOL(precache_get);
+
+int precache_flush(struct address_space *mapping, unsigned long index)
+{
+	u32 tmem_pool = mapping->host->i_sb->precache_poolid;
+	u64 obj = (unsigned long) mapping->host->i_ino;
+	u32 ind = (u32) index;
+
+	if ((s32)tmem_pool < 0)
+		return 0;
+	if (ind != index)
+		return 0;
+
+	return tmem_flush_page(tmem_pool, obj, ind);
+}
+EXPORT_SYMBOL(precache_flush);
+
+int precache_flush_inode(struct address_space *mapping)
+{
+	u32 tmem_pool = mapping->host->i_sb->precache_poolid;
+	u64 obj = (unsigned long) mapping->host->i_ino;
+
+	if ((s32)tmem_pool < 0)
+		return 0;
+
+	return tmem_flush_object(tmem_pool, obj);
+}
+EXPORT_SYMBOL(precache_flush_inode);
+
+int precache_flush_filesystem(struct super_block *sb)
+{
+	u32 tmem_pool = sb->precache_poolid;
+	int ret;
+
+	if ((s32)tmem_pool < 0)
+		return 0;
+	ret = tmem_destroy_pool(tmem_pool);
+	if (!ret)
+		return 0;
+	pr_info("Unmapping superblock for s_id=%s from precache_id=%d\n",
+		sb->s_id, tmem_pool);
+	sb->precache_poolid = -1;
+	return 1;
+}
+EXPORT_SYMBOL(precache_flush_filesystem);
+
+void precache_init(struct super_block *sb)
+{
+	sb->precache_poolid = tmem_new_pool(0, 0, 0);
+}
+EXPORT_SYMBOL(precache_init);
+
+void shared_precache_init(struct super_block *sb, char *uuid)
+{
+	u64 uuid_lo = *(u64 *)uuid;
+	u64 uuid_hi = *(u64 *)(&uuid[8]);
+	sb->precache_poolid = tmem_new_pool(uuid_lo, uuid_hi, TMEM_POOL_SHARED);
+}
+EXPORT_SYMBOL(shared_precache_init);
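
To see how a filesystem consumes this API, here is a hedged sketch of
the read-side hook, modeled on the ext3/mpage changes elsewhere in
this patch; it is illustrative only, and myfs_readpage and
myfs_read_from_disk are hypothetical names, not functions from the
patch.

    #include <linux/pagemap.h>
    #include <linux/precache.h>

    /* hypothetical: read one pagecache page, trying precache first */
    static int myfs_readpage(struct file *file, struct page *page)
    {
            /* precache_get() returns 1 and fills the page if tmem
             * still had the data for this (inode, offset) handle */
            if (!PageUptodate(page) &&
                precache_get(page->mapping, page->index, page) == 1) {
                    SetPageUptodate(page);
                    unlock_page(page);
                    return 0;       /* disk read avoided */
            }
            return myfs_read_from_disk(file, page); /* hypothetical fallback */
    }

The matching put needs no per-filesystem code: once precache_init(sb)
has run at mount time, __remove_from_page_cache() (see the mm/filemap.c
hunk above) puts every uptodate page into the precache as it is evicted.
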
--- /dev/null
+++ b/mm/preswap.c
@@ -0,0 +1,182 @@
+/*
+ * linux/mm/preswap.c
+ *
+ * Implements a fast "preswap" on top of the transcendent memory ("tmem") API.
+ * When a swapdisk is enabled (with swapon), a "private persistent tmem pool"
+ * is created along with a bit-per-page preswap_map.  When swapping occurs
+ * and a page is about to be written to disk, a "put" into the pool may first
+ * be attempted by passing the pageframe to be swapped, along with a "handle"
+ * consisting of a pool_id, an object id, and an index.  Since the pool is of
+ * indeterminate size, the "put" may be rejected, in which case the page
+ * is swapped to disk as normal.  If the "put" is successful, the page is
+ * copied to tmem and the preswap_map records the success.  Later, when
+ * the page needs to be swapped in, the preswap_map is checked and, if set,
+ * the page may be obtained with a "get" operation.  Note that the swap
+ * subsystem is responsible for: maintaining coherency between the swapcache,
+ * preswap, and the swapdisk; for evicting stale pages from preswap; and for
+ * emptying preswap when swapoff is performed. The "flush page" and "flush
+ * object" actions are provided for this.
+ *
+ * Note that if a "duplicate put" is performed to overwrite a page and
+ * the "put" operation fails, the page (and old data) is flushed and lost.
+ * Also note that multiple accesses to a tmem pool may be concurrent and
+ * any ordering must be guaranteed by the caller.
+ *
+ * Copyright (C) 2008,2009 Dan Magenheimer, Oracle Corp.
+ */
+
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/sysctl.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/proc_fs.h>
+#include <linux/security.h>
+#include <linux/capability.h>
+#include <linux/uaccess.h>
+#include "tmem.h"
+
+static u32 preswap_poolid = -1; /* if negative, preswap will never call tmem */
+
+const unsigned long preswap_zero = 0, preswap_infinity = ~0UL; /* for sysctl */
+
+/*
+ * Swizzling increases objects per swaptype, increasing tmem concurrency
+ * for heavy swaploads.  Later, larger nr_cpus -> larger SWIZ_BITS
+ */
+#define SWIZ_BITS	4
+#define SWIZ_MASK	((1 << SWIZ_BITS) - 1)
+#define oswiz(_type, _ind)	((_type << SWIZ_BITS) | (_ind & SWIZ_MASK))
+#define iswiz(_ind)		(_ind >> SWIZ_BITS)
+
+/*
+ * preswap_map test/set/clear operations (must be atomic)
+ */
+
+int preswap_test(struct swap_info_struct *sis, unsigned long offset)
+{
+	if (!sis->preswap_map)
+		return 0;
+	return test_bit(offset % BITS_PER_LONG,
+		&sis->preswap_map[offset/BITS_PER_LONG]);
+}
+
+static inline void preswap_set(struct swap_info_struct *sis,
+		unsigned long offset)
+{
+	if (!sis->preswap_map)
+		return;
+	set_bit(offset % BITS_PER_LONG,
+		&sis->preswap_map[offset/BITS_PER_LONG]);
+}
+
+static inline void preswap_clear(struct swap_info_struct *sis,
+		unsigned long offset)
+{
+	if (!sis->preswap_map)
+		return;
+	clear_bit(offset % BITS_PER_LONG,
+		&sis->preswap_map[offset/BITS_PER_LONG]);
+}
+
+/*
+ * preswap tmem operations
+ */
+
+/* returns 1 if the page was successfully put into preswap, 0 if the page
+ * was declined, and -ERRNO for a specific error */
+int preswap_put(struct page *page)
+{
+	swp_entry_t entry = { .val = page_private(page), };
+	unsigned type = swp_type(entry);
+	pgoff_t offset = swp_offset(entry);
+	u64 ind64 = (u64)offset;
+	u32 ind = (u32)offset;
+	unsigned long mfn = pfn_to_mfn(page_to_pfn(page));
+	struct swap_info_struct *sis = get_swap_info_struct(type);
+	int dup = 0, ret;
+
+	if ((s32)preswap_poolid < 0)
+		return 0;
+	if (ind64 != ind)
+		return 0;
+	if (preswap_test(sis, offset))
+		dup = 1;
+	mb(); /* ensure page is quiescent; tmem may address it with an alias */
+	ret = tmem_put_page(preswap_poolid, oswiz(type, ind), iswiz(ind), mfn);
+	if (ret == 1) {
+		preswap_set(sis, offset);
+		if (!dup)
+			sis->preswap_pages++;
+	} else if (dup) {
+		/* failed dup put always results in an automatic flush of
+		 * the (older) page from preswap */
+		preswap_clear(sis, offset);
+		sis->preswap_pages--;
+	}
+	return ret;
+}
+
+/* returns 1 if the page was successfully gotten from preswap, 0 if the page
+ * was not present (should never happen!), and -ERRNO for a specific error */
+int preswap_get(struct page *page)
+{
+	swp_entry_t entry = { .val = page_private(page), };
+	unsigned type = swp_type(entry);
+	pgoff_t offset = swp_offset(entry);
+	u64 ind64 = (u64)offset;
+	u32 ind = (u32)offset;
+	unsigned long mfn = pfn_to_mfn(page_to_pfn(page));
+	struct swap_info_struct *sis = get_swap_info_struct(type);
+	int ret;
+
+	if ((s32)preswap_poolid < 0)
+		return 0;
+	if (ind64 != ind)
+		return 0;
+	if (!preswap_test(sis, offset))
+		return 0;
+	ret = tmem_get_page(preswap_poolid, oswiz(type, ind), iswiz(ind), mfn);
+	return ret;
+}
+
+/* flush a single page from preswap */
+void preswap_flush(unsigned type, unsigned long offset)
+{
+	u64 ind64 = (u64)offset;
+	u32 ind = (u32)offset;
+	struct swap_info_struct *sis = get_swap_info_struct(type);
+	int ret = 1;
+
+	if ((s32)preswap_poolid < 0)
+		return;
+	if (ind64 != ind)
+		return;
+	if (preswap_test(sis, offset)) {
+		ret = tmem_flush_page(preswap_poolid,
+			oswiz(type, ind), iswiz(ind));
+		sis->preswap_pages--;
+		preswap_clear(sis, offset);
+	}
+}
+
+/* flush all pages from the passed swaptype */
+void preswap_flush_area(unsigned type)
+{
+	struct swap_info_struct *sis = get_swap_info_struct(type);
+	int ind;
+
+	if ((s32)preswap_poolid < 0)
+		return;
+	for (ind = SWIZ_MASK; ind >= 0; ind--)
+		(void)tmem_flush_object(preswap_poolid, oswiz(type, ind));
+	sis->preswap_pages = 0;
+}
+
+void preswap_init(unsigned type)
+{
+	/* only need one tmem pool for all swap types */
+	if ((s32)preswap_poolid >= 0)
+		return;
+	preswap_poolid = tmem_new_pool(0, 0, TMEM_POOL_PERSIST);
+}
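
The oswiz/iswiz swizzle above is easiest to see with numbers. A worked
example using the SWIZ_BITS value of 4 from the file (the swap entry
values here are made up for illustration):

    /* SWIZ_BITS == 4, so SWIZ_MASK == 0xf; say type == 1, offset == 0x1234 */
    /*   oswiz(1, 0x1234) == (1 << 4) | (0x1234 & 0xf) == 0x14   object id  */
    /*   iswiz(0x1234)    ==  0x1234 >> 4              == 0x123  page index */

So each swap type is spread across 16 tmem objects keyed by the low
bits of the swap offset, which lets concurrent puts for nearby offsets
land in different tmem objects and improves concurrency under heavy
swap load.
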
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -588,6 +588,7 @@ static unsigned char swap_entry_free(str
 		swap_list.next = p->type;
 	nr_swap_pages++;
 	p->inuse_pages--;
+	preswap_flush(p->type, offset);
 	if (p->flags & SWP_BLKDEV) {
 		struct gendisk *disk = p->bdev->bd_disk;
 		if (disk->fops->swap_slot_free_notify)
@@ -1055,7 +1056,7 @@ static int unuse_mm(struct mm_struct *mm
  * Recycle to start on reaching the end, returning 0 when empty.
  */
 static unsigned int find_next_to_unuse(struct swap_info_struct *si,
-					unsigned int prev)
+				unsigned int prev, unsigned int preswap)
 {
 	unsigned int max = si->max;
 	unsigned int i = prev;
@@ -1081,6 +1082,12 @@ static unsigned int find_next_to_unuse(s
 			prev = 0;
 			i = 1;
 		}
+		if (preswap) {
+			if (preswap_test(si, i))
+				break;
+			else
+				continue;
+		}
 		count = si->swap_map[i];
 		if (count && swap_count(count) != SWAP_MAP_BAD)
 			break;
@@ -1092,8 +1099,12 @@ static unsigned int find_next_to_unuse(s
  * We completely avoid races by reading each swap page in advance,
 * and then search for the process using it.  All the necessary
 * page table adjustments can then be made atomically.
+ *
+ * if the boolean preswap is true, only unuse pages_to_unuse pages;
+ * pages_to_unuse==0 means all pages
 */
-static int try_to_unuse(unsigned int type)
+static int try_to_unuse(unsigned int type, unsigned int preswap,
+			unsigned long pages_to_unuse)
 {
 	struct swap_info_struct *si = swap_info[type];
 	struct mm_struct *start_mm;
@@ -1126,7 +1137,7 @@ static int try_to_unuse(unsigned int typ
 	 * one pass through swap_map is enough, but not necessarily:
 	 * there are races when an instance of an entry might be missed.
 	 */
-	while ((i = find_next_to_unuse(si, i)) != 0) {
+	while ((i = find_next_to_unuse(si, i, preswap)) != 0) {
 		if (signal_pending(current)) {
 			retval = -EINTR;
 			break;
@@ -1293,6 +1304,8 @@ static int try_to_unuse(unsigned int typ
 		 * interactive performance.
 		 */
 		cond_resched();
+		if (preswap && pages_to_unuse && !--pages_to_unuse)
+			break;
 	}

 	mmput(start_mm);
@@ -1637,7 +1650,7 @@ SYSCALL_DEFINE1(swapoff, const char __us
 	spin_unlock(&swap_lock);

 	current->flags |= PF_OOM_ORIGIN;
-	err = try_to_unuse(type);
+	err = try_to_unuse(type, 0, 0);
 	current->flags &= ~PF_OOM_ORIGIN;

 	if (err) {
@@ -1689,9 +1702,14 @@ SYSCALL_DEFINE1(swapoff, const char __us
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
 	p->flags = 0;
+	preswap_flush_area(type);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	vfree(swap_map);
+#ifdef CONFIG_PRESWAP
+	if (p->preswap_map)
+		vfree(p->preswap_map);
+#endif
 	/* Destroy swap account informatin */
 	swap_cgroup_swapoff(type);

@@ -1886,6 +1904,7 @@ SYSCALL_DEFINE2(swapon, const char __use
 	unsigned long maxpages;
 	unsigned long swapfilepages;
 	unsigned char *swap_map = NULL;
+	unsigned long *preswap_map = NULL;
 	struct page *page = NULL;
 	struct inode *inode = NULL;
 	int did_down = 0;
@@ -2088,6 +2107,12 @@ SYSCALL_DEFINE2(swapon, const char __use
 		}
 	}

+#ifdef CONFIG_PRESWAP
+	preswap_map = vmalloc(maxpages / sizeof(long));
+	if (preswap_map)
+		memset(preswap_map, 0, maxpages / sizeof(long));
+#endif
+
 	error = swap_cgroup_swapon(type, maxpages);
 	if (error)
 		goto bad_swap;
@@ -2126,6 +2151,9 @@ SYSCALL_DEFINE2(swapon, const char __use
 	else
 		p->prio = --least_priority;
 	p->swap_map = swap_map;
+#ifdef CONFIG_PRESWAP
+	p->preswap_map = preswap_map;
+#endif
 	p->flags |= SWP_WRITEOK;
 	nr_swap_pages += nr_good_pages;
 	total_swap_pages += nr_good_pages;
@@ -2149,6 +2177,7 @@ SYSCALL_DEFINE2(swapon, const char __use
 		swap_list.head = swap_list.next = type;
 	else
 		swap_info[prev]->next = type;
+	preswap_init(type);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	atomic_inc(&proc_poll_event);
@@ -2168,6 +2197,7 @@ bad_swap_2:
 	p->swap_file = NULL;
 	p->flags = 0;
 	spin_unlock(&swap_lock);
+	vfree(preswap_map);
 	vfree(swap_map);
 	if (swap_file) {
 		if (did_down) {
@@ -2373,6 +2403,10 @@ int valid_swaphandles(swp_entry_t entry,
 		base++;

 	spin_lock(&swap_lock);
+	if (preswap_test(si, target)) {
+		spin_unlock(&swap_lock);
+		return 0;
+	}
 	if (end > si->max)	/* don't go beyond end of map */
 		end = si->max;

@@ -2383,6 +2417,9 @@ int valid_swaphandles(swp_entry_t entry,
 			break;
 		if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
 			break;
+		/* Don't read in preswap pages */
+		if (preswap_test(si, toff))
+			break;
 	}
 	/* Count contiguous allocated slots below our target */
 	for (toff = target; --toff >= base; nr_pages++) {
@@ -2391,6 +2428,9 @@ int valid_swaphandles(swp_entry_t entry,
 			break;
 		if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
 			break;
+		/* Don't read in preswap pages */
+		if (preswap_test(si, toff))
+			break;
 	}
 	spin_unlock(&swap_lock);

@@ -2617,3 +2657,98 @@ static void free_swap_count_continuation
 		}
 	}
 }
+
+#ifdef CONFIG_PRESWAP
+/*
+ * preswap infrastructure functions
+ */
+
+struct swap_info_struct *get_swap_info_struct(unsigned int type)
+{
+	BUG_ON(type >= MAX_SWAPFILES);
+	return swap_info[type];
+}
+
+/* code structure leveraged from sys_swapoff */
+void preswap_shrink(unsigned long target_pages)
+{
+	struct swap_info_struct *si = NULL;
+	unsigned long total_pages = 0, total_pages_to_unuse;
+	unsigned long pages = 0, unuse_pages = 0;
+	int type;
+	int wrapped = 0;
+
+	do {
+		/*
+		 * we don't want to hold swap_lock while doing a very
+		 * lengthy try_to_unuse, but swap_list may change
+		 * so restart scan from swap_list.head each time
+		 */
+		spin_lock(&swap_lock);
+		total_pages = 0;
+		for (type = swap_list.head; type >= 0; type = si->next) {
+			si = swap_info[type];
+			total_pages += si->preswap_pages;
+		}
+		if (total_pages <= target_pages) {
+			spin_unlock(&swap_lock);
+			return;
+		}
+		total_pages_to_unuse = total_pages - target_pages;
+		for (type = swap_list.head; type >= 0; type = si->next) {
+			si = swap_info[type];
+			if (total_pages_to_unuse < si->preswap_pages)
+				pages = unuse_pages = total_pages_to_unuse;
+			else {
+				pages = si->preswap_pages;
+				unuse_pages = 0; /* unuse all */
+			}
+			if (security_vm_enough_memory_kern(pages))
+				continue;
+			vm_unacct_memory(pages);
+			break;
+		}
+		spin_unlock(&swap_lock);
+		if (type < 0)
+			return;
+		current->flags |= PF_OOM_ORIGIN;
+		(void)try_to_unuse(type, 1, unuse_pages);
+		current->flags &= ~PF_OOM_ORIGIN;
+		wrapped++;
+	} while (wrapped <= 3);
+}
+
+
+#ifdef CONFIG_SYSCTL
+/* cat /proc/sys/vm/preswap provides total number of pages in preswap
+ * across all swaptypes.  echo N > /proc/sys/vm/preswap attempts to shrink
+ * preswap page usage to N (usually 0) */
+int preswap_sysctl_handler(ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	unsigned long npages;
+	int type;
+	unsigned long totalpages = 0;
+	struct swap_info_struct *si = NULL;
+
+	/* modeled after hugetlb_sysctl_handler in mm/hugetlb.c */
+	if (!write) {
+		spin_lock(&swap_lock);
+		for (type = swap_list.head; type >= 0; type = si->next) {
+			si = swap_info[type];
+			totalpages += si->preswap_pages;
+		}
+		spin_unlock(&swap_lock);
+		npages = totalpages;
+	}
+	table->data = &npages;
+	table->maxlen = sizeof(unsigned long);
+	proc_doulongvec_minmax(table, write, buffer, length, ppos);
+
+	if (write)
+		preswap_shrink(npages);
+
+	return 0;
+}
+#endif
+#endif /* CONFIG_PRESWAP */
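
The sysctl above is driven from userspace. A hedged sketch of the
cat/echo interaction described in the handler's comment, assuming the
entry is wired into the vm table from the kernel/sysctl.c hunk earlier
in this patch so that it appears as /proc/sys/vm/preswap:

    /* userspace sketch: read the preswap page count, then ask the
     * kernel to shrink preswap to zero pages */
    #include <stdio.h>

    int main(void)
    {
            unsigned long npages;
            FILE *f = fopen("/proc/sys/vm/preswap", "r");

            if (!f || fscanf(f, "%lu", &npages) != 1)
                    return 1;
            fclose(f);
            printf("pages currently in preswap: %lu\n", npages);

            f = fopen("/proc/sys/vm/preswap", "w");
            if (!f)
                    return 1;
            fprintf(f, "0\n");  /* shrink preswap as far as possible */
            fclose(f);
            return 0;
    }
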
--- /dev/null
+++ b/mm/tmem.h
@@ -0,0 +1,84 @@
+/*
+ * linux/mm/tmem.h
+ *
+ * Interface to transcendent memory, used by mm/precache.c and mm/preswap.c
+ * Currently implemented on XEN, but may be implemented elsewhere in future.
+ *
+ * Copyright (C) 2008,2009 Dan Magenheimer, Oracle Corp.
+ */
+
+#ifdef CONFIG_XEN
+#include <xen/interface/xen.h>
+
+/* Bits for HYPERVISOR_tmem_op(TMEM_NEW_POOL) */
+#define TMEM_POOL_MIN_PAGESHIFT   12
+#define TMEM_POOL_PAGEORDER       (PAGE_SHIFT - TMEM_POOL_MIN_PAGESHIFT)
+
+extern int xen_tmem_op(u32 tmem_cmd, u32 tmem_pool, u64 object, u32 index,
+	unsigned long gmfn, u32 tmem_offset, u32 pfn_offset, u32 len);
+extern int xen_tmem_new_pool(u32 tmem_cmd, u64 uuid_lo, u64 uuid_hi, u32 flags);
+
+static inline int tmem_put_page(u32 pool_id, u64 object, u32 index,
+	unsigned long gmfn)
+{
+	return xen_tmem_op(TMEM_PUT_PAGE, pool_id, object, index,
+		gmfn, 0, 0, 0);
+}
+
+static inline int tmem_get_page(u32 pool_id, u64 object, u32 index,
+	unsigned long gmfn)
+{
+	return xen_tmem_op(TMEM_GET_PAGE, pool_id, object, index,
+		gmfn, 0, 0, 0);
+}
+
+static inline int tmem_flush_page(u32 pool_id, u64 object, u32 index)
+{
+	return xen_tmem_op(TMEM_FLUSH_PAGE, pool_id, object, index,
+		0, 0, 0, 0);
+}
+
+static inline int tmem_flush_object(u32 pool_id, u64 object)
+{
+	return xen_tmem_op(TMEM_FLUSH_OBJECT, pool_id, object, 0, 0, 0, 0, 0);
+}
+
+static inline int tmem_new_pool(u64 uuid_lo, u64 uuid_hi, u32 flags)
+{
+	BUILD_BUG_ON((TMEM_POOL_PAGEORDER < 0) ||
+		(TMEM_POOL_PAGEORDER >= TMEM_POOL_PAGESIZE_MASK));
+	flags |= TMEM_POOL_PAGEORDER << TMEM_POOL_PAGESIZE_SHIFT;
+	return xen_tmem_new_pool(TMEM_NEW_POOL, uuid_lo, uuid_hi, flags);
+}
+
+static inline int tmem_destroy_pool(u32 pool_id)
+{
+	return xen_tmem_op(TMEM_DESTROY_POOL, pool_id, 0, 0, 0, 0, 0, 0);
+}
+#else
+struct tmem_op {
+	u32 cmd;
+	s32 pool_id; /* private > 0; shared < 0; 0 is invalid */
+	union {
+		struct {	/* for cmd == TMEM_NEW_POOL */
+			u64 uuid[2];
+			u32 flags;
+		} new;
+		struct {	/* for cmd == TMEM_CONTROL */
+			u32 subop;
+			u32 cli_id;
+			u32 arg1;
+			u32 arg2;
+			void *buf;
+		} ctrl;
+		struct {
+			u64 object;
+			u32 index;
+			u32 tmem_offset;
+			u32 pfn_offset;
+			u32 len;
+			unsigned long pfn;	/* page frame */
+		} gen;
+	} u;
+};
+#endif
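
mm/tmem.h notes that tmem "may be implemented elsewhere in future". As
a hedged sketch of what that would take, a non-Xen backend only has to
supply the same small set of inline operations with the same copy-based
semantics; the kernel-managed store mentioned in the comments is purely
hypothetical:

    /* hypothetical non-Xen backend: same entry points, same semantics */
    static inline int tmem_put_page(u32 pool_id, u64 object, u32 index,
            unsigned long pfn)
    {
            /* would copy the pageframe into a kernel-managed store (for
             * example compressed RAM) rather than issue a hypercall;
             * must return 1 on success and 0 to decline the put */
            return 0;       /* stub: always decline */
    }

    static inline int tmem_get_page(u32 pool_id, u64 object, u32 index,
            unsigned long pfn)
    {
            return 0;       /* stub: page never present */
    }

Note that declining every put and failing every get is a legal (if
useless) implementation under the API rules in the documentation file
above, which is what makes the callers' fallback paths safe.
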
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -16,6 +16,7 @@
 #include <linux/pagemap.h>
 #include <linux/highmem.h>
 #include <linux/pagevec.h>
+#include <linux/precache.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
				   do_invalidatepage */
@@ -51,6 +52,7 @@ void do_invalidatepage(struct page *page
 static inline void truncate_partial_page(struct page *page, unsigned partial)
 {
 	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
+	precache_flush(page->mapping, page->index);
 	if (page_has_private(page))
 		do_invalidatepage(page, partial);
 }
@@ -108,6 +110,10 @@ truncate_complete_page(struct address_sp
 	clear_page_mlock(page);
 	remove_from_page_cache(page);
 	ClearPageMappedToDisk(page);
+	/* this must be after the remove_from_page_cache which
+	 * calls precache_put
+	 */
+	precache_flush(mapping, page->index);
 	page_cache_release(page);	/* pagecache ref */
 	return 0;
 }
@@ -215,6 +221,7 @@ void truncate_inode_pages_range(struct a
 	pgoff_t next;
 	int i;

+	precache_flush_inode(mapping);
 	if (mapping->nrpages == 0)
 		return;

@@ -292,6 +299,7 @@ void truncate_inode_pages_range(struct a
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 	}
+	precache_flush_inode(mapping);
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);

@@ -434,6 +442,7 @@ int invalidate_inode_pages2_range(struct
 	int did_range_unmap = 0;
 	int wrapped = 0;

+	precache_flush_inode(mapping);
 	pagevec_init(&pvec, 0);
 	next = start;
 	while (next <= end && !wrapped &&
@@ -492,6 +501,7 @@ int invalidate_inode_pages2_range(struct
 		mem_cgroup_uncharge_end();
 		cond_resched();
 	}
+	precache_flush_inode(mapping);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);