Slow sequential single file reads on ZFS
I built a 5x 16TB RAIDz2, filled it with data, then I discovered the following.
Sequentially reading a single file from the file system gave me around 40MB/s. Reading multiple files in parallel brought the total throughput into the hundreds of megabytes per second, where I'd expect it. This is really weird. The 5 disks show 100% utilization during single file reads. Writes are supremely fast, whether single threaded or parallel. Reading directly from each disk gives >200MB/s.
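For anyone who wants to reproduce the single-file measurement, here is a minimal sketch of a single-threaded sequential read timer. The function name and the 1 MiB block size are my own choices for illustration, not what was used for the numbers above. Run it on a file that is larger than RAM or has not been read recently, otherwise the ARC will inflate the result.

```python
import time


def sequential_read_mb_per_s(path, block_size=1024 * 1024):
    """Read `path` front to back in a single thread and return MB/s.

    Hypothetical helper: block_size (1 MiB) is an arbitrary choice.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python's own buffering; OS/ZFS caching still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / 1e6 / elapsed if elapsed > 0 else float("inf")


# Example call (the path is a placeholder):
# print(sequential_read_mb_per_s("/tank/some/large/file"))
```

Running several of these at once on different files gives the parallel case.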
Splitting the RAIDz2 into two RAIDz1s, or into one RAIDz1 and a mirror, improved reads to 100 and something MB/s. Better but still not where it should be.
I have an existing RAIDz1 made of 4x 8TB disks on the same machine. That one reads at 250-350MB/s. I made an equivalent 4x 16TB RAIDz1 from the new drives and that read at about 100MB/s. Much slower.
All of this was done with ashift=12 and default recordsize. The disks' datasheets say their block size is 4096.
I decided to try RAIDz2 with ashift=13 even though the disks really say they've got 4K physical block size. Lo and behold, the single file reads went to over 150MB/s. 🤔
Following from there, I got full throughput when I increased the recordsize to 1M. This produces full throughput even with ashift=12. My existing 4x 8TB RAIDz1 pools with ashift=12 and recordsize=128K read single files fast.
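To put numbers on how recordsize, ashift and the vdev width interact, here is a rough sketch of the RAIDZ allocation arithmetic. It follows my understanding of how OpenZFS sizes a RAIDZ allocation (data sectors, plus parity per row, rounded up to a multiple of parity + 1) and assumes uncompressed, full-size records. The function name and returned fields are made up for illustration. It does not explain the slowdown by itself, it only shows how little of each record lands on one disk.

```python
def raidz_alloc(psize, ndisks, nparity, ashift):
    """Estimate how one record of `psize` bytes is laid out on a RAIDZ vdev.

    Assumption: mirrors the OpenZFS allocation-size calculation as I
    understand it; treat the numbers as an approximation.
    """
    sector = 1 << ashift
    ndata = ndisks - nparity
    data = -(-psize // sector)           # data sectors, rounded up
    rows = -(-data // ndata)             # stripe rows needed, rounded up
    total = data + nparity * rows        # add parity sectors
    # Allocations are padded to a multiple of (nparity + 1) sectors.
    total = -(-total // (nparity + 1)) * (nparity + 1)
    return {
        "data_sectors": data,
        "rows": rows,
        "total_sectors": total,
        "data_kib_per_disk": psize / ndata / 1024,
    }


K = 1024
for label, args in [
    ("5x raidz2, ashift=12, 128K", (128 * K, 5, 2, 12)),
    ("5x raidz2, ashift=13, 128K", (128 * K, 5, 2, 13)),
    ("5x raidz2, ashift=12, 1M", (1024 * K, 5, 2, 12)),
    ("4x raidz1, ashift=12, 128K", (128 * K, 4, 1, 12)),
]:
    print(label, raidz_alloc(*args))
```

With 128K records on the 5-disk RAIDz2, each of the 3 data disks holds only about 43 KiB per record, while a 1M record puts about 341 KiB on each disk.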
Here's a diff of the queue dump of the old and new drives. The left side is a WD 8TB from the existing RAIDz1, the right side is one of the new HC550 16TB drives:
< max_hw_sectors_kb: 1024
---
> max_hw_sectors_kb: 512
20c20
< max_sectors_kb: 1024
---
> max_sectors_kb: 512
25c25
< nr_requests: 2
---
> nr_requests: 60
36c36
< write_cache: write through
---
> write_cache: write back
38c38
< write_zeroes_max_bytes: 0
---
> write_zeroes_max_bytes: 33550336
Could the max_*_sectors_kb being half on the new drives have something to do with it?
Can anyone make any sense of any of this?
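If someone wants to compare their own drives the same way, a queue dump like the one diffed above can be collected with a short script. This is a sketch that assumes the dump comes from the Linux block layer settings under /sys/block/<device>/queue. The function names and the device names in the usage comment are placeholders.

```python
from pathlib import Path


def dump_queue(device, sys_block="/sys/block"):
    """Return {setting: value} for every readable file in <device>/queue."""
    queue = Path(sys_block) / device / "queue"
    settings = {}
    for entry in sorted(queue.iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories
        try:
            settings[entry.name] = entry.read_text().strip()
        except OSError:
            continue  # some sysfs entries are not readable
    return settings


def diff_queues(a, b):
    """Return only the settings whose values differ, as (a_value, b_value)."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Example (device names are placeholders):
# for name, values in diff_queues(dump_queue("sda"), dump_queue("sdf")).items():
#     print(name, values)
```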
OK, I think it may have to do with the odd number of data drives. If I create a raidz2 with 4 of the 5 disks, even with ashift=12, recordsize=128K, the performance in sequential single thread read is stellar. What's not clear is why this doesn't affect, or not as much, the 4x 8TB-drive raidz1.
Would you use zfs and raid-z when there is only 1 file on your disk?
Would you build 4 ticket counters when your concert hall has only 1 seat? Would you build a 4 lane highway when there is only 1 car in your country?
:-)
Yes, yes I would use ZFS if I had only one file on my disk.
Ok :-)
Then you probably shouldn't optimize it for the use of many files (which is the default, of course).