Slow sequential single file reads on ZFS

lightrush@lemmy.ca to Selfhosted@lemmy.world – 22 points

I built a 5x 16TB RAIDz2, filled it with data, then I discovered the following.

Sequentially reading a single file from the file system gave me around 40MB/s. Reading multiple files in parallel brought the total throughput into the hundreds of megabytes per second, which is where I'd expect it. This is really weird. All 5 disks show 100% utilization during single-file reads. Writes are supremely fast, whether single-threaded or parallel. Reading directly from each disk gives >200MB/s.
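For anyone who wants to reproduce the single-file measurement, here is a minimal sketch of a single-threaded sequential read timer. The function name and the 1 MiB read size are my own choices for illustration, not what was used for the numbers above. The file needs to be larger than what ARC and the page cache can hold, otherwise a second run measures memory instead of the disks.

```python
import time

def sequential_read_mb_per_s(path, block_size=1024 * 1024):
    """Read `path` start to finish in one thread; return (bytes_read, MB/s)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 skips Python's own buffer layer; OS and ZFS caches still apply.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length timing interval on tiny files.
    rate = total / elapsed / 1e6 if elapsed > 0 else float("inf")
    return total, rate
```

Usage would look like `sequential_read_mb_per_s("/tank/some/large/file")`, where the path is a placeholder for a large file on the pool under test.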

Splitting the RAIDz2 into two RAIDz1s, or into one RAIDz1 and a mirror, improved reads to a bit over 100MB/s. Better, but still not where it should be.

I have an existing RAIDz1 made of 4x 8TB disks on the same machine. That one reads at 250-350MB/s. I made an equivalent 4x 16TB RAIDz1 from the new drives and that read at about 100MB/s. Much slower.

All of this was done with ashift=12 and default recordsize. The disks' datasheets say their block size is 4096.

I decided to try RAIDz2 with ashift=13 even though the disks really say they've got 4K physical block size. Lo and behold, the single file reads went to over 150MB/s. 🤔

Following from there, I got full throughput when I increased the recordsize to 1M. This produces full throughput even with ashift=12. My existing 4x 8TB RAIDz1 pools with ashift=12 and recordsize=128K read single files fast.
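To put numbers on how ashift and recordsize change what each disk is asked to read, here is a rough sketch of the RAIDz sector arithmetic. It assumes one full-size, uncompressed record and follows my understanding of how OpenZFS sizes a RAIDz allocation (data sectors, plus parity per row, rounded up to a multiple of parity+1). The function name and returned keys are made up for illustration. It does not explain the slowdown by itself, it only shows how small the per-disk pieces are.

```python
def raidz_layout(recordsize, ashift, ndisks, nparity):
    """Approximate per-record sector layout on a RAIDz vdev (no compression)."""
    sector = 1 << ashift
    data_disks = ndisks - nparity
    # Sectors needed to hold the record's data (ceiling division).
    data_sectors = -(-recordsize // sector)
    # Rows the record spans; each row holds one sector per data disk.
    rows = -(-data_sectors // data_disks)
    # One parity sector per parity disk per row.
    parity_sectors = nparity * rows
    total = data_sectors + parity_sectors
    # Allocation is padded up to a multiple of (nparity + 1) sectors.
    total = -(-total // (nparity + 1)) * (nparity + 1)
    return {
        "data_sectors": data_sectors,
        "max_sectors_per_disk": rows,
        "max_kib_per_disk": rows * sector // 1024,
        "allocated_sectors": total,
    }
```

Under these assumptions, the 5-disk RAIDz2 with ashift=12 puts at most 44 KiB of a 128K record on any one disk, while recordsize=1M raises that to 344 KiB per disk.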

Here's a diff of the queue dump of the old and new drives. The left side is a WD 8TB from the existing RAIDz1, the right side is one of the new HC550 16TB drives:

< max_hw_sectors_kb: 1024
---
> max_hw_sectors_kb: 512
20c20
< max_sectors_kb: 1024
---
> max_sectors_kb: 512
25c25
< nr_requests: 2
---
> nr_requests: 60
36c36
< write_cache: write through
---
> write_cache: write back
38c38
< write_zeroes_max_bytes: 0
---
> write_zeroes_max_bytes: 33550336

Could the max_*_sectors_kb being half on the new drives have something to do with it?
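In case someone wants to run the same comparison on their own drives, a diff like the one above can be sketched by reading every attribute file under each device's sysfs queue directory. This is not the exact method used for the dump above, and the function names are made up for illustration.

```python
from pathlib import Path

def read_queue_attrs(queue_dir):
    """Return {attribute_name: value} for the regular files in a queue directory."""
    attrs = {}
    for entry in sorted(Path(queue_dir).iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories such as scheduler tunables
        try:
            attrs[entry.name] = entry.read_text().strip()
        except OSError:
            continue  # skip attributes that cannot be read
    return attrs

def diff_queue_attrs(left, right):
    """Return {name: (left_value, right_value)} for attributes that differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}
```

Usage would be along the lines of `diff_queue_attrs(read_queue_attrs("/sys/block/sda/queue"), read_queue_attrs("/sys/block/sdb/queue"))`, where `sda` and `sdb` are placeholder device names for an old and a new drive.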


Can anyone make any sense of any of this?


OK, I think it may have to do with the odd number of data drives. If I create a raidz2 with 4 of the 5 disks, even with ashift=12 and recordsize=128K, sequential single-threaded read performance is stellar. What's not clear is why this doesn't affect the 4x 8TB raidz1, which also has three data drives, or at least not as much.

Would you use zfs and raid-z when there is only 1 file on your disk?

Would you build 4 ticket counters when your concert hall has only 1 seat? Would you build a 4 lane highway when there is only 1 car in your country?

:-)

Yes, yes I would use ZFS if I had only one file on my disk.

Ok :-)

Then you probably shouldn't optimize it for the use of many files (which is the default, of course).