The conclusion of the 3.9 merge window

By Jonathan Corbet
March 5, 2013

By the time that Linus released the 3.9-rc1 kernel prepatch and closed the merge window for this cycle, he had pulled a total of 10,265 non-merge changesets into the mainline repository. That is just over 2,000 changes since last week's summary. The most significant user-visible changes merged at the end of the merge window include:

The block I/O controller now has full hierarchical control group support.
The NFS code has gained network namespace support, allowing the operation of per-container NFS servers.
The Intel PowerClamp driver has been merged; PowerClamp allows the regulation of a CPU's power consumption through the injection of forced idle states.
The device mapper has gained support for a new "dm-cache" target that is able to use a fast drive (like a solid-state device) as a cache in front of slower storage devices. See Documentation/device-mapper/cache.txt for details.
RAID 5 and 6 support for the Btrfs filesystem has been merged at last.
Btrfs defragmentation code has gained snapshot awareness, meaning that sharing of data between snapshots will no longer be lost when defragmentation runs.
Architecture support for the Synopsys ARC and ImgTec Meta architectures has been added.
New hardware support includes:
- Systems and processors: Marvell Armada XP development boards, Ralink MIPS-based system-on-chip processors, Atheros AP136 reference boards, and Google Pixel laptops.
- Block: IBM RamSan PCIe Flash SSD devices and Broadcom BCM2835 SD/MMC controllers.
- Display: TI LP8788 backlight controllers.
- Miscellaneous: Kirkwood 88F6282 and 88F6283 thermal sensors, Marvell Dove thermal sensors, and Nokia "Retu" watchdog devices.

Changes visible to kernel developers include:

The menuconfig configuration tool now has proper "save" and "load" buttons.
The rework of the IDR API has been merged, simplifying code that uses IDR to generate unique integer identifiers. Users throughout the kernel tree have been updated to the new API.
The hlist_for_each_entry() iterator has lost the unused "pos" parameter.

At this point, the stabilization period for the 3.9 kernel has begun. If the usual pattern holds, the final 3.9 release can be expected sometime around the beginning of May.

The conclusion of the 3.9 merge window

Posted Mar 5, 2013 19:20 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

> The device manager has gained support for a new "dm-cache" target that is able to use a fast drive (like a solid-state device) as a cache in front of slower storage devices.

How close are we to being able to use the "cache drives"[1]?

[1]http://www.ocztechnology.com/ocz-synapse-cache-sata-iii-2...

"cache drives"

Posted Mar 5, 2013 22:56 UTC (Tue) by ntl (subscriber, #40518) [Link]

I don't see why you couldn't use the OCZ drives from your link in a dm-cache configuration today, assuming the vendor hasn't crippled them in some way.

"cache drives"

Posted Mar 6, 2013 3:03 UTC (Wed) by drag (subscriber, #31333) [Link]

Ewww... OCZ drives are bad news from what I am told.

"cache drives"

Posted Mar 6, 2013 5:33 UTC (Wed) by drdabbles (subscriber, #48755) [Link]

I've used many of them in many situations. They are just as good as any other. Turn on TRIM/DISCARD on a recent Intel 320 SSD and the drive will die due to a "known issue".

"cache drives"

Posted Mar 7, 2013 11:25 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

They sometimes die spontaneously due to firmware glitches. This is unusual for a hard drive (never had any of my rotating rust do this in 20+ years), but you get used to it after a while. The media isn't actually damaged; the firmware has just stopped talking to you. If removed and re-inserted they will typically wake up, and suddenly they're OK again until next time.

"cache drives"

Posted Mar 6, 2013 17:07 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

Half of the drive is purely for cache. It's 64GB, but it's only 32GB when mounted (the rest being cache). There's no device for the "other half". I'd like to use that part of the drive in dm-cache, not the part I can use.

"cache drives"

Posted Mar 6, 2013 18:02 UTC (Wed) by k3ninho (subscriber, #50375) [Link]

>it's only 32GB when mounted (the rest being cache)
The rest is intended to be used as the NAND cells wear out. The device has twice as much silicon as it needs so that it can be useful for longer, given the high read-erase patterns of scratch caches. The rest is not cache.

K3n.

"cache drives"

Posted Mar 6, 2013 18:30 UTC (Wed) by ntl (subscriber, #40518) [Link]

Yeah, I suspected something like this. So the drive exposes only half of its advertised capacity to the host. Assuming the touted "intelligent caching algorithms" are implemented in the drive and not the Windows-only software that OCZ provides, maybe it would compare favorably in a dm-cache configuration to conventional SSDs. Maybe not.

Anyway, the drives in the link seem to be EOL.

"cache drives"

Posted Mar 7, 2013 11:09 UTC (Thu) by Tobu (subscriber, #24111) [Link]

The caching is in their Windows software; the SSD has no visibility into the HDD.

"cache drives"

Posted Mar 6, 2013 18:23 UTC (Wed) by Tobu (subscriber, #24111) [Link]

The small print on that page says it has 50% over-provisioning, so not visible outside of the SSD firmware. That must be there to compensate for the dataplex caching driver doing something not SSD-friendly, like heavy in-place rewriting (which would otherwise prevent the FTL from doing its work refreshing cells and spreading writes). bcache at least only writes whole erase blocks (1MB default bucket size), so it doesn't need that kind of overprovisioning.

flicker-free suspend/resume on Intel

Posted Mar 5, 2013 23:23 UTC (Tue) by sebas (subscriber, #51660) [Link]

I've read that flicker-free suspend and resume (i.e. no vt switching for S3) for the Intel KMS driver was planned for this release. Short of going through the huge git log, does anybody know off-hand if this changeset made it in?

flicker-free suspend/resume on Intel

Posted Mar 6, 2013 0:42 UTC (Wed) by marduk (subscriber, #3831) [Link]

It might be in 3.9-rc1 already. I didn't see any vt switching when resuming from suspend (but I never noticed it before either)

flicker-free suspend/resume on Intel

Posted Mar 6, 2013 11:44 UTC (Wed) by blackwood (subscriber, #44174) [Link]

Nope, missed the 3.9 cycle. Originally I had merged the core pm/fbdev patches for 3.9, but due to too much config madness and broken compilations I've taken them out again. The drm/i915 part itself wasn't ready in time; we still have outstanding issues. Should all land in 3.10.

flicker-free suspend/resume on Intel

Posted Mar 6, 2013 1:26 UTC (Wed) by neilbrown (subscriber, #359) [Link]

Well if you look in kernel/power/suspend.c, in suspend_prepare(), you can see an unconditional call to pm_prepare_console(), and in kernel/power/console.c pm_prepare_console() always calls vt_move_to_console(SUSPEND_CONSOLE, 1), where
#define SUSPEND_CONSOLE (MAX_NR_CONSOLES-1)

This will always switch console unless disable_vt_switch is set, and that only gets set by a call to pm_set_vt_switch(0).

This is only called by drivers/video/geode/gxfb_core.c and drivers/video/geode/lxfb_core.c, which default it to 0 unless it is set by a module parameter:

module_param(vt_switch, int, 0);
MODULE_PARM_DESC(vt_switch, "enable VT switch during suspend/resume");

So it looks like, with 3.9-rc1, you always get a vt switch at suspend/resume unless you have a GEODE video controller.

(note to self: I really should use that pm_set_vt_switch() call for my omap3 display on the GTA04 instead of commenting out the call to pm_prepare_console())

flicker-free suspend/resume on Intel

Posted Mar 6, 2013 1:30 UTC (Wed) by Tobu (subscriber, #24111) [Link]

I don't use it myself, but Phoronix mentioned that this merge has part of Intel's "fastboot" work, which seems to be about reusing BIOS-allocated frame buffers.

flicker-free suspend/resume on Intel

Posted Mar 6, 2013 1:37 UTC (Wed) by Tobu (subscriber, #24111) [Link]

Also, here is airlied's pull request. It doesn't name fastboot, but does give a shout-out to “the worst news site ever”, the one I've just linked to :P

flicker-free suspend/resume on Intel

Posted Mar 6, 2013 11:47 UTC (Wed) by blackwood (subscriber, #44174) [Link]

Getting fastboot merged is a long road, we've pulled in preparatory infrastructure since 3.7. Currently I'm optimistic that 3.10 will have something which can actually avoid the initial modeset in some configurations. But there are still a lot of loose pieces to nail down.

flicker-free suspend/resume on Intel

Posted Mar 7, 2013 13:12 UTC (Thu) by sebas (subscriber, #51660) [Link]

Thanks all for the answers, that's exactly what I wanted to know.

Looking forward to flicker-free suspend/resume. I suspend and resume approximately 100 times more often than I reboot, so I'm not sure why everybody cares so much about flicker-free boot now that suspend states work stably and fast. I also hope to get a nice speedup from this, as the vt switch takes a long time here and is quite blocking.

Now if only I could convince NetworkManager not to purposefully drop the wifi connection on suspend/resume, because that's apparently not necessary anymore with this nice ultrabook hardware. With ifup/ifdown, the connection is not dropped and the machine is immediately online. Did anybody try that?

flicker-free suspend/resume on Intel

Posted Mar 7, 2013 14:03 UTC (Thu) by johill (subscriber, #25196) [Link]

The "immediately" online behaviour isn't really immediate, you will typically have been disconnected from the AP. In fact, this disconnection is now going to be enforced for 3.10 because the other behaviour doesn't gain a whole lot (you still need to reconnect etc.) and trying to keep the connection causes a huge amount of issues, particularly with USB hardware (that can be unplugged while suspended.)

The "ultrabook" behaviour you refer to can be achieved with WoWLAN (http://wireless.kernel.org/en/users/Documentation/WoWLAN) and network scanning offload (while in suspend), but this doesn't exist in Linux yet.

However, a much easier solution could be implemented in NetworkManager: instead of scanning on all channels when resuming, it could scan just the channel that it was previously connected on. If that finds the previously connected AP, it would be able to reconnect almost immediately (delay of less than half a second), vs. a full scan that can take up to 10-15 seconds. This should be a fairly simple NetworkManager change.

flicker-free suspend/resume on Intel

Posted Mar 7, 2013 17:18 UTC (Thu) by raven667 (subscriber, #5198) [Link]

It's also not as if this kind of problem is unique to Linux or NetworkManager; I experience very similar behavior on my Mac OS X laptop when resuming from suspend, so this is a common problem with wireless. The stated solution probably won't speed up connection when changing location but keeping the same SSID, as you will likely be connecting to a different AP on a different channel and it'll have to scan anyway.

flicker-free suspend/resume on Intel

Posted Mar 7, 2013 20:39 UTC (Thu) by johill (subscriber, #25196) [Link]

Well, yes. Although you could scan the channels in the right order, i.e. the ones that you'd expect the same network (SSID) on first. Say your corporate installation -- it's probably only going to use a handful of channels (1,6,11,36,149 or so), so if you know from "experience" that (almost) all of the APs for the network are there, you can scan those first.

But ultimately you're right, in the general case you have to scan all channels and that simply takes a while. It shouldn't be more than a few seconds since you're not connected; here it takes ~3.5 seconds to scan all the channels my card supports, but I know that it can be (much) slower depending on the device.

flicker-free suspend/resume on Intel

Posted Mar 7, 2013 23:25 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

well, you should first check the channel you were last using, and then do the normal scan you would do if someone just gave the SSID and told you to connect to it.

flicker-free suspend/resume on Intel

Posted Mar 8, 2013 0:10 UTC (Fri) by johill (subscriber, #25196) [Link]

Yeah, that's what I suggested first (and indeed Dan has now opened a bug to implement it in NM); however, it is possible to break it down further, as I was trying to describe later:

1) scan the last channel
2) scan the known channels for the last SSID
3) scan all (remaining) channels

The intermediate step only makes sense if those known channels are a relatively small subset of all channels, but with typical installations they will be. In fact, for many networks like your home network 1) and 2) will be exactly the same because you only have a single AP, but for typical enterprise networks 2) will still be better than 3).
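
The three-step ordering above can be sketched in a few lines. This is only an illustration of the idea, not NetworkManager code: the function names `scan_order` and `find_ap`, and the `probe` callback that stands in for "scan one channel and report whether the AP answered", are all hypothetical.

```python
def scan_order(last_channel, known_channels, all_channels):
    """Return the channels in the order they would be scanned on resume.

    1) the channel we were last connected on,
    2) channels where this SSID has been seen before,
    3) every remaining channel the card supports.
    Each channel appears only once.
    """
    ordered = []
    seen = set()
    for channel in [last_channel, *known_channels, *all_channels]:
        if channel is not None and channel not in seen:
            seen.add(channel)
            ordered.append(channel)
    return ordered


def find_ap(last_channel, known_channels, all_channels, probe):
    """Scan channel by channel and stop at the first one where probe() succeeds.

    probe(channel) is a hypothetical callback returning True when the
    wanted AP was found on that channel.
    """
    for channel in scan_order(last_channel, known_channels, all_channels):
        if probe(channel):
            return channel
    return None
```

For a home network with a single AP, steps 1 and 2 collapse into one probe; for the enterprise case, the handful of known channels is tried before falling back to the full list.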

flicker-free suspend/resume on Intel

Posted Mar 8, 2013 0:56 UTC (Fri) by mgedmin (subscriber, #34497) [Link]

Somewhat tangentially there's a very interesting article about DHCP tricks that MacOS X does to reconnect faster.

I'd love to see Network Manager pick up something from that.

Instead, currently when I suspend a laptop via ssh -t foo sudo pm-suspend, I get to enjoy a lapsed DHCP lease on resume, for a few hours. According to strace that's because dhcpd expiration timer (i.e. select() with a timeout) simply stops while the laptop is suspended. Connection still works, but the wifi router doesn't resolve the laptop's hostname until dhcpd finally wakes up.

flicker-free suspend/resume on Intel

Posted Mar 8, 2013 9:07 UTC (Fri) by johill (subscriber, #25196) [Link]

This would be easier in connman as it has DHCP built into it rather than calling out to dhclient (running as a daemon)

flicker-free suspend/resume on Intel

Posted Mar 8, 2013 15:35 UTC (Fri) by njwhite (subscriber, #51848) [Link]

Ah, I've run into this with wpa_supplicant on Debian. Have you found a solution to it? Is there a straightforward way to get dhclient to wake up when the computer resumes?

flicker-free suspend/resume on Intel

Posted Mar 14, 2013 0:46 UTC (Thu) by kevinm (guest, #69913) [Link]

If CLOCK_MONOTONIC in Linux actually worked according to POSIX spec, then a timer based on this clock could be used to fix this problem.

(POSIX says that CLOCK_MONOTONIC measures time elapsed since "an unspecified point in the past", and further "This point does not change after system start-up time.". On Linux however, CLOCK_MONOTONIC stops while the system is suspended, which is clearly nonconforming).
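
As a side note that goes beyond the comment above: Linux also offers CLOCK_BOOTTIME (since 2.6.39), which behaves like CLOCK_MONOTONIC but keeps counting while the system is suspended. A minimal sketch of how a lease timer could use it follows; the helper names are made up for illustration, and this is not how dhclient actually works.

```python
import time


def time_spent_suspended():
    """Seconds the system has spent suspended since boot, or None.

    Computed as CLOCK_BOOTTIME - CLOCK_MONOTONIC. Returns None on
    platforms where Python does not expose CLOCK_BOOTTIME.
    """
    if not hasattr(time, "CLOCK_BOOTTIME"):
        return None
    boottime = time.clock_gettime(time.CLOCK_BOOTTIME)
    monotonic = time.clock_gettime(time.CLOCK_MONOTONIC)
    return boottime - monotonic


def lease_remaining(lease_seconds, acquired_at, now):
    """Seconds left on a lease, given timestamps taken from the same clock.

    If both timestamps come from a clock that keeps running during
    suspend, the lease correctly appears expired after a long sleep.
    """
    return max(0.0, lease_seconds - (now - acquired_at))
```

With timestamps taken from CLOCK_BOOTTIME, a one-hour lease acquired just before a two-hour suspend shows zero seconds remaining on resume; with CLOCK_MONOTONIC timestamps it would still appear to have nearly the full hour left.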

The conclusion of the 3.9 merge window

Posted Mar 6, 2013 6:03 UTC (Wed) by bergwolf (subscriber, #55931) [Link]

> The device manager has gained support for a new "dm-cache" target that is able to use a fast drive (like a solid-state device) as a cache in front of slower storage devices.

What about the EnhanceIO driver that is aiming at the 3.10 merge window? I assume that they are similar code accomplishing a similar job, no?

http://lwn.net/Articles/538435/

The conclusion of the 3.9 merge window

Posted Mar 6, 2013 12:33 UTC (Wed) by Tobu (subscriber, #24111) [Link]

I think the new "dm-cache" is more similar to the original "dm-cache", first posted to LKML in 2006. Both are dm targets with pluggable tiering policies. EnhanceIO has some older dm-cache code by way of Flashcache, but it doesn't have the pluggable policies (I don't know if Flashcache slashed it or if there was some convergent evolution later).

The conclusion of the 3.9 merge window

Posted Mar 8, 2013 2:27 UTC (Fri) by msnitzer (subscriber, #57232) [Link]

dm-cache has no relation to the original dm-cache (or flashcache).

The conclusion of the 3.9 merge window

Posted Mar 6, 2013 13:48 UTC (Wed) by ebirdie (subscriber, #512) [Link]

And what about Bcache? I haven't seen any news about which release it is aiming for, but according to its git repo it seems to be under active development.

http://bcache.evilpiepirate.org/
https://lwn.net/Articles/501632/

>Bcache is a Linux kernel block layer cache. It allows one or more fast disk drives such as flash-based solid state drives (SSDs) to act as a cache for one or more slower hard disk drives.

The conclusion of the 3.9 merge window

Posted Mar 6, 2013 14:48 UTC (Wed) by Tobu (subscriber, #24111) [Link]

It seems to be the fastest of the lot, and an earlier version is in use at Google. Currently the maintainer is spoon-feeding some BIO and AIO rework through maintainer reviews, and I have no idea how much of the supporting infrastructure is left to merge at this point. It's not terribly user-friendly at the moment: no dm target that would allow in-place migration, and you have to stick to a maintainer tree for continued data access (even though a pass-through shim would be sufficient for non-writeback uses).

The conclusion of the 3.9 merge window

Posted Mar 7, 2013 8:05 UTC (Thu) by ebirdie (subscriber, #512) [Link]

Thank you for the insightful view. I have had a bookmark on Bcache and an itch to try it when circumstances allow. Now that I'm aware of dm-cache I think I'll choose ease of use over top-notch performance.

The conclusion of the 3.9 merge window

Posted Mar 8, 2013 2:37 UTC (Fri) by msnitzer (subscriber, #57232) [Link]

Have you benchmarked bcache vs dm-cache vs enhanceio? If so I'd love to see what you found. We have an "all-caches" branch here:
https://github.com/jthornber/linux-2.6/tree/all-caches

But we definitely need to update the bcache and enhanceio code.

Also, please note that there are some dm-cache fixes that will likely be sent to Linus for 3.9-rc2 that Alasdair (DM maintainer) has staged here: http://people.redhat.com/agk/patches/linux/editing/series...

The conclusion of the 3.9 merge window

Posted Mar 8, 2013 17:30 UTC (Fri) by Lennie (subscriber, #49641) [Link]

The user unfriendliness exists to protect your data when using write-back.

It is kinda annoying, I know.

But if a normal block device were used as a backing store, there would be nothing to prevent accidental use even though there might be dirty data on the cache device.

The EnhanceIO developers use some udev scripts to prevent this, I haven't looked at how they do it. I guess that could work, the EnhanceIO developers said they haven't seen any problems yet. But I can definitely see why the bcache developer made his choice.

If a cache system were integrated into the filesystem, the filesystem could record something that would prevent it from being mounted, unless the user forces it in some way, when the cache device is never coming back.

It is kinda interesting to see there are 6 ways/ideas floating around to do caching or caching related things now for the Linux kernel:
- dm-cache, now in the kernel I guess
- Facebook Flashcache
- Google bcache
- EnhanceIO was based on Flashcache I believe
- If I'm not mistaken, in recent kernels there is code in the VFS which keeps track of which data is hot
- btrfs developers are looking at doing something in btrfs, if I remember correctly they have expressed some interest in the VFS solution

The dm-cache, EnhanceIO and bcache developers have 'spoken' on the mailing list and one even mentioned he didn't see any problem in having several implementations in the mainline kernel.

I'm not so sure Linus would even accept those patches. :-)

It is interesting and maybe sometimes seems a bit painful to see that many different efforts. They obviously all have their strengths and weaknesses of course.

The conclusion of the 3.9 merge window

Posted Mar 8, 2013 17:49 UTC (Fri) by Lennie (subscriber, #49641) [Link]

Had a quick look at dm-cache; it seems they actually have 3 things which can all be stored on different devices: the backing store, the cache and the metadata.
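
To make the three-device layout concrete, here is a rough sketch that assembles a device-mapper table line for the cache target. The field order reflects my reading of Documentation/device-mapper/cache.txt and should be checked against that file before use; the function name `dm_cache_table` and the device paths in the usage note are placeholders, not anything from the kernel tree.

```python
def dm_cache_table(length_sectors, metadata_dev, cache_dev, origin_dev,
                   block_size_sectors=512, features=("writeback",),
                   policy="default", policy_args=()):
    """Build a table line for the dm "cache" target.

    Assumed layout (verify against cache.txt):
      <start> <length> cache <metadata dev> <cache dev> <origin dev>
      <block size> <#feature args> [feature args] <policy>
      <#policy args> [policy args]
    """
    fields = [
        0, length_sectors, "cache",
        metadata_dev,           # fast device holding the metadata
        cache_dev,              # fast device holding cached blocks
        origin_dev,             # slow backing store
        block_size_sectors,
        len(features), *features,
        policy,
        len(policy_args), *policy_args,
    ]
    return " ".join(str(field) for field in fields)
```

Calling `dm_cache_table(41943040, "/dev/mapper/metadata", "/dev/mapper/ssd", "/dev/mapper/origin")` yields a string of the kind that would be passed to `dmsetup create ... --table`; the metadata, cache and origin devices are three separate arguments, which is what allows each to live on a different device.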

The conclusion of the 3.9 merge window

Posted Mar 6, 2013 6:08 UTC (Wed) by bergwolf (subscriber, #55931) [Link]

> The hlist_for_each_entry() iterator has lost the unused "pos" parameter.

Yet another nightmare for out-of-tree modules that aim at supporting multiple kernel versions. sigh...

The conclusion of the 3.9 merge window

Posted Mar 6, 2013 8:43 UTC (Wed) by renox (subscriber, #23785) [Link]

>> The hlist_for_each_entry() iterator has lost the unused "pos" parameter.
>Yet another nightmare for out-of-tree modules that aim at supporting multiple kernel versions. sigh...

And a perfect example of why the kernel doesn't keep a stable internal ABI: otherwise you have to keep unused parameters, aka bloat.

The conclusion of the 3.9 merge window

Posted Mar 6, 2013 18:02 UTC (Wed) by marduk (subscriber, #3831) [Link]

>> The hlist_for_each_entry() iterator has lost the unused "pos" parameter.

Unused parameters are undesirable.

>Yet another nightmare for out-of-tree modules that aim at supporting multiple kernel versions. sigh...

Out-of-tree modules are undesirable.

The conclusion of the 3.9 merge window

Posted Mar 6, 2013 22:04 UTC (Wed) by ovitters (subscriber, #27950) [Link]

If the module is added to the kernel the porting will be done for you. Plus the various kernel versions automatically will either have or won't have that parameter.

Or in other words: pretty well known that there is no stability guarantee and the developers prefer to include your module/driver in the kernel. Seems that this often also gets you a good review (thus better quality).

The conclusion of the 3.9 merge window

Posted Mar 9, 2013 23:04 UTC (Sat) by meyert (subscriber, #32097) [Link]

Shouldn't the cgroup hierarchy support be removed in the long term? Or am I mixing this up?

The conclusion of the 3.9 merge window

Posted Mar 10, 2013 5:15 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

I thought it was arbitrary hierarchy (CPU tree != mem tree) that was going to be dropped.