The Angry Dome- Adventures in ZFS: arc, l2arc, dedupe, and streaming workloads

S	M	T	W	H	F	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Adventures in ZFS: arc, l2arc, dedupe, and streaming workloads

by: Jeremy
file under: geekery
at: Jun 28 2011 11:25
0 Comments (post new)

The ARC

ZFS uses what's known as "adaptive replacement cache" (almost always just called "the arc") to hold both metadata and filesystem data in fast storage, which can dramatically speed up read operations for cached objects. When you start using zfs, this all happens behind the scenes in memory; Solaris or OpenIndiana dedicates a chunk of RAM for the ARC, and it reduces the size of the ARC when memory pressure demands it. Ben Rockwood wrote a good introduction to the ARC and a tool you can use to examine its state, so if you're interested in more details be sure to check that out.

Now, in addition to the ARC that sits in RAM, ZFS also has a facility to use level two adaptive replacement cache ("l2arc") on other "fast" storage. You can, for example, attach a fast SSD to a pool as l2arc, and ZFS will start using it as secondary cache. It won't be as fast as RAM, of course, but it's still potentially much faster than spinning rust, especially for random I/O.

In most cases you can just ignore the ARC, and happily reap the benefits of faster reads from cache. Adding additional RAM or assigning additional l2arc drives enables the ARC to cache more; a nice bonus for sure, but it's not the end of the world if you run out of cache.

There is, however, a critical exception to this: ARC becomes absolutely vital when dedupe is enabled. Before you even think about turning dedupe on, you'd better start thinking about the size of your ARC.

Dedupe and the ARC

When you turn on dedupe, you add a massive chunk of metadata known as the dedupe table ("DDT") into the equation. The dedupe table is where the magic happens; ZFS uses the table to identify duplicated blocks. Any writes with dedupe enabled will require lookups to this table first.

The reason you need to be thinking about ARC when the DDT is in play is this: the DDT is stored in the ARC. If your ARC can't fit the entire DDT, then every single time you try to write or read data, zfs will have to retrieve the DDT from spinning rust. The nature of the table makes this even more of a disaster, since it's a whole lot of small, random I/O - which is something normal hard drives are very bad at.

So heed this warning: if you turn on dedupe without enough RAM to cache the DDT in ARC, your write speeds can decrease by an order of magnitude (or more).

So, how much memory do you need in order to effectively use dedupe? The common answer on mailing lists or IRC is "as much as you can afford," and in practice that's probably the best advice you'll get. There are calculations you can make based on data retrieved from undocumented commands, but as a starting point you should count on at least 1 GB of ARC per 1 TB of data.

One frustrating aspect of this scenario is that it's very difficult to see what your DDT is really doing. ARC data in general is not exposed to the user by tools that come with OI or Solaris; indeed, one must use third party tools such as arc_summary, arcstat, or sysstat to see what's going on at all.

One thing you can do to potentially save yourself a world of pain is to ensure you have SSDs for l2arc. We really want the DDT in RAM, but having it on SSD will prevent the system from being completely useless if memory is exhausted, so it's a great idea to have SSD l2arc devices assigned to any pool that you want to dedupe. Unlike the in-memory ARC, we have some visibility into the l2arc directly provided by the zpool iostat utility:

zpool iostat -v [pool]

Caching strategy

When I added l2arc to my system and turned on dedupe, I paid very close attention to my cache usage. There are two tunables at the zfs dataset level which determine what ends up in the ARC: 'primarycache' and 'secondarycache'.

The values possible for these options are 'all', 'off', and 'metadata.' You can use this to selectively decide whether you want caching on your different layers of ARC; 'primarycache' is RAM, and 'secondarycache' is l2arc. The ddt is "metadata," so the most paranoid approach is to set both of these to "metadata" which will ensure that the DDT always has room to exist.

The problem with this conservative approach is that you lose all the benefits of caching filesystem operations. I attempted to cache only 'metadata' in the primarycache and then cache 'all' in secondarycache, but that doesn't do what you might expect; it turns out that as currently implemented you cannot cache something in secondarycache that is not cached in primarycache first. That means that if you want any filesystem caching anywhwere, you must use 'primarycache = all'. You can then reserve the l2arc for metadata cache if you desire. I settled on all/all, since I noticed that my l2arc was barely used at all when reserved for metadata.

Tuning the ARC for streaming workloads

Even with both caches set to "all," I noticed that my l2arc wasn't filling very quickly. The reason for this is that, by default, the l2arc will only cache random I/O; a sane strategy, since it speeds up the most costly operations. But in my case, with lots of streaming workloads and lots of l2arc, I was missing out on some potential performance gains.

You can set a tuneable that changes this behavior in /etc/system: set zfs:l2arc_noprefetch = 0

After a reboot, streaming workloads will be cached.

Priming the cache

With the streaming caching enabled, virtually every file you read will now be cached for future use. The effect is that, for reads of cached data, your performance will be close to that of running directly off of SSD.

I use this primarily for video games to improve load times. I export my zfs storage via NFS, and on my Linux workstation I install games onto that NFS filesystem and run them in wine.

A couple of things about the l2arc: it doesn't retain the cache through reboots, and less frequently accessed data will "fall off" the cache if it becomes full. This means that, if you want to consistently have some subset of your data cached in the l2arc, you need to read it all in regularly. I do this with a daily cron job:

36 6 * * * sh -c 'tar -cvf - /vault/games/RIFT > /dev/null 2> /dev/null'

Doing so dramatically improves read performance for the cached directories, and it ensures that they will always be in the cache (even when I haven't played RIFT for several days).

New Comment

Comments |Back

The Angry Dome