Linux Not Fully Prepared for 4096-Byte Sector Hard Drives
Posted on 02/16/2010 6:57:42 AM PST by ShadowAce
Recently, I bought a pair of the new Western Digital Caviar Green drives. These drives mark a transition point from 512-byte sectors to 4096-byte sectors. A number of articles have been published recently about this, explaining the benefits and some of the challenges we'll face during the transition. Reportedly, Linux should be unaffected by some of the pitfalls of this transition, but my own experimentation has shown that Linux is just as vulnerable to the potential performance impact as Windows XP. Despite this issue having been known for a long time, the basic Linux tools for partitioning and formatting drives have not caught up.
The problem most likely to hit you with one of these drives is very slow write performance, caused by improper logical-to-physical sector alignment. OSes like Linux store data in 4K blocks (or multiples of 4K), which matches the physical sector size well. However, nothing stops you from creating a partition that starts on an odd-numbered 512-byte logical sector. With that misalignment, a 4K logical block straddles two physical sectors, so the drive must read each affected 4K physical sector, merge in the changed 512-byte slices, and write it back.
WD claims to have done some studies and found that Windows XP was hardest hit. By default, the first primary partition starts on LBA block 63, which obviously is not a multiple of 8. They provide a utility to shift partitions by 512 bytes to line them up. WD also tested other OS's and declared both MacOS X and Linux to be "unaffected". I don't know about MacOS, but with regard to Linux, they are not entirely correct. Following are the results of my experimentation.
The first thing I did was test the performance effect itself. It has been suggested that WD might internally offset block addresses by 1 so that LBA 63 maps to LBA 64. This way, Windows XP partitions would not really be misaligned. I performed a test that demonstrates that WD has not done this. I've included the source code to my test at the end of the article. This program does random 4K block writes to the drive at a selectable 512-byte alignment. So if I pass 0 to the program, it runs the test on 4K boundaries. If I pass 1, the test is on 4K boundaries plus 512. The effects of this test are amplified by the use of O_SYNC, which insists that all writes hit the disk immediately, but it demonstrates the problem. Note that I realize that all my testing is "quick and dirty," but I'm just trying to demonstrate a point, not analyze it in painful detail.
1000 random aligned 4K writes consistently take between 7 and 8 seconds.
1000 random unaligned 4K writes consistently take between 22 and 24 seconds.
Now, this just demonstrates the problem we already know about. But how does it affect filesystems? Next: formatting the drives.
I have two drives, /dev/sdc and /dev/sdd, both identical Green drives. I partitioned them as follows:
For /dev/sdd, I used fdisk to add a Linux (0x83) primary partition, taking up the whole disk, using fdisk defaults. By default, the partition starts at LBA 63.
For /dev/sdc, I used fdisk the same as with sdd, but after creating the partition, I realigned it. I did this by entering expert mode ("x"), then setting the start sector ("b") to 64.
Once that was finished, I formatted both drives using the command "time mke2fs /dev/sdc1" (and sdd1).
/dev/sdc, which was aligned, took 5m 45.716s to format.
/dev/sdd, which was not aligned, took 19m 53.609s to format.
That's more than a factor-of-three difference!
Now to file tests. I ran two tests. The first was to copy one large file. I have a Windows XP disk image for qemu-kvm that takes up 18308968KiB. I copied the file (from my much faster 7200 RPM drives in a RAID1 configuration) to one drive, then the other, then reran the first copy to rule out buffering effects.
$ time cp winxp.img /mnt/sdc   # ALIGNED
real    5m9.360s
user    0m0.090s
sys     0m20.420s

$ time cp winxp.img /mnt/sdd   # UNALIGNED
real    13m26.943s
user    0m0.110s
sys     0m19.350s
Pretty striking difference. I didn't really expect this. Since this is one large file, and it can be written linearly to the disk, I expected that we would see a very slight performance hit. I think this is something that itself should be investigated. There's no reason for long contiguous writes to get hit this hard, and it's something that the kernel developers need to look into and fix. To complete the testing, I next tried random writes. I have some stuff I've been working on for school, lots of small files of all sorts of different sizes. So I decided to copy that stuff recursively.
$ time cp -r "Computer Architecture/" /mnt/sdc   # ALIGNED
real    42m9.602s
user    0m0.680s
sys     1m59.070s

$ time cp -r "Computer Architecture/" /mnt/sdd   # UNALIGNED
real    138m54.610s
user    0m0.660s
sys     2m15.630s
This performance hit, roughly a factor of 3.3, is surprisingly consistent across operations. And it is severe. I've read people guessing there would be a 30% performance loss, but taking 3.3 times as long (about 230% more time) is exceptionally bad.
In conclusion, these drives are on the market now. We've known about this issue for a LONG time, and now it's here, and we haven't fully prepared. Some distros, like Ubuntu, use "parted", which has a very nice "--align optimal" option that does the right thing. But parted doesn't cover everything, and we still rely on tools like fdisk for the rest. Anyone manually formatting drives based on the popular how-tos at the top of Google searches is going to cause themselves a major performance hit, because any mention of this alignment issue and how to fix it is conspicuously absent.

I've done a lot of googling on this topic, and as far as I can tell, the issue has not really been taken seriously. There's plenty of discussion about aligning partitions for SSDs and VMware volumes, but nothing about the issue relating to these new hard drives, and no fix for fdisk. Most of the drives still being sold today have 512-byte sectors, so lots of people will say "not my problem," but it will become your problem soon, since all the hard disk manufacturers have been very eager to make the switch. This time next year you may have trouble buying a drive without 4K sectors, and you're going to want all your Linux distros to handle them properly.
Don’t buy the lame ass ‘green drive’. Problem solved. :)
Thanks for the heads up. This is an interesting problem, and one that is good to be aware of.
Except that of course the sector sizing has nothing to do with "green"-ness, which is just new-age marketing hype...
I’ve seen the same problems with virtual machines (VMware) and with some storage arrays and filers like NetApp. Most Linux systems (and Windows 2000/2003) start the primary partition at sector (logical block) 63. The reason for this is tied to historical disk geometry, and it results in the controller performing additional work. Windows Server 2008 and Vista default to a byte offset of 1,048,576 (1 MiB, i.e. sector 2048), which I think would alleviate a great deal of the problem.
As you say, not a major problem for most people, but something the Linux community needs to address. LinuxCon is being held in Boston this year. You should send them a note.
I bought the 1TB Caviar Green Drive because it cost less than the Black drive. I haven’t had a single problem. But then again I’m running 64-bit Windows 7 with 4GB of RAM and a Quad-Core processor.
Well, the RAM and processor don’t really matter in this case—it’s how the OS interfaces with the drive.
I just looked at the ‘green’ drive, saw 5400RPM, and moved on to the next one. I didn’t even know there was anything different about it at all.
Maybe it uses 5% less power. ("Than what?")
I hate marketing hype.
I just figured that since hard drives are slow enough anyway that I’d leave the idea of making my computer even slower to ‘save the planet’ to the envirotards.
This sounds like an old trick for optimizing databases. Often in databases it’s best to use 64KB clusters, but if you do that you have to make sure to align your partition on a 64KB boundary, and RAID stripes on a 64KB boundary, or performance will suffer for exactly the same reason as here.
Finally, an enterprise headache hits the home user. :)
That’s nothing! My home theater system goes to 40!