Calgary RHCE

A linux and open source technology blog.


Troubleshooting performance of VDO and NFS

October 20, 2018 By Andrew Ludwar

In setting up a local virtualization environment a little while back, I thought I'd try the recently GA'd VDO capabilities in the RHEL 7.5 kernel. These provide data compression and de-duplication natively in the Linux kernel (through a kernel module), and are the result of Red Hat's Permabit acquisition. Considering a virtualization data store is a prime candidate for the de-duplication use-case, I was anxious to reclaim some of my storage budget 🙂 . I was also curious to see what the extra overhead was like (if any), and to understand the general performance characteristics of VDO.

I found VDO quite easy to set up; I followed this guide basically verbatim. Given my NFS virtualization back-end stores mostly the same OS images, I was happy to see excellent deduplication stats on my VDO device:

[root@nfs vms]# vdostats --si
Device                    Size      Used Available Use% Space saving%
/dev/mapper/vdo0          2.5T    239.3G      2.3T   9%           60%
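
For context, the setup itself is only a few commands. Roughly, it looks something like this (a sketch rather than a copy of the guide – the backing device, logical size, and mount point here are placeholders I've assumed):

# Create the VDO volume on top of the backing device
vdo create --name=vdo0 --device=/dev/sdb --vdoLogicalSize=5T

# Format it, telling mkfs not to discard blocks since the volume starts out empty
mkfs.xfs -K /dev/mapper/vdo0

# Mount with discard so freed blocks are returned to VDO for reuse
mount -o discard /dev/mapper/vdo0 /vms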

A few months and several spawned VMs later, I noticed some slowness whenever I was doing high-IO work like database updates or copying several gigs of data to disk all at once (when my Satellite server downloads 50+ GB for a new content repo and indexes it, for example). My other VMs would notice a bit of a slowdown during this time. Given I hadn't done much tuning in this environment, it was probably time to look into it. I've also been debating upgrading my home lab to 10G networking, and this seemed to line up with what I was seeing for storage performance. I thought I was finally being bottlenecked by the network, given I've got an SSD array in my NFS server with a 4-port 1GbE NIC in LACP. But before I went crazy buying 10G networking gear, I looked at tuning what I had.

The official documentation is fairly good at explaining the performance characteristics of VDO and what you might want to tune. I also went into NFS server/client tuning, as I hadn't done much of this either. Given there are a few things at play here (disk hardware performance, network performance, VDO optimization, NFS optimization), I quickly went down a few rabbit holes and realised I needed to do some basic benchmarking and baselining so I could understand which areas in this stack were actually performing well, and which ones were candidates for more tuning. In addition to the VDO tuning docs, here's what I used for reference:

  • How to increase the number of threads created by the NFS daemon in RHEL 4, 5 and 6?
  • How can I improve the performance of my RHEL NFS server?
  • Initial baseline data collection for NFS client streaming I/O performance
  • High I/O wait to NFS share on NAS
  • RHEL network interface dropping packets

Firstly, it's important to troubleshoot things in isolation, and to use a benchmarking method that's complementary to that isolation. I used the iperf3 utility for network benchmarking and the fio utility for disk benchmarking. With these I'd do a series of sequential read, sequential write, random read, and random read/write tests, both on local disk filesystems and over the network filesystem. For reference, here are the fio commands:

# Sequential read
# fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=read --size=500m --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
 
# Sequential write
# fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=write --size=500m --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
 
# Random read
# fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randread --size=500m --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --direct=1 --numjobs=1 --runtime=60 --group_reporting
 
# Random read and write
# fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randrw --size=500m --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --direct=1 --numjobs=1 --runtime=60 --group_reporting
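
For the network side, the iperf3 runs look something like this (the server hostname is a placeholder):

# On the NFS server
iperf3 -s

# On a client: 60-second test with 4 parallel streams, then repeat with -R for the reverse direction
iperf3 -c nfs.example.com -t 60 -P 4
iperf3 -c nfs.example.com -t 60 -P 4 -R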

After reading the above guides and doing some basic investigating and benchmarking, this is what I ended up tuning first:

Overall the networking looked alright. I saw some dropped packets, but I'd been doing a fair amount of cable pulling, stopping/starting hosts, and bringing the VPN up and down. The NICs on the NFS server and clients weren't using their full ring buffers, so I changed this. I don't think this was the cause of the dropped packets, but the tuning couldn't hurt.

[root@nfs]# ethtool -g eno1
Ring parameters for eno1:
Pre-set maximums:
RX: 2047
RX Mini: 0
RX Jumbo: 0
TX: 511
Current hardware settings:
RX: 200
RX Mini: 0
RX Jumbo: 0
TX: 511
 
# ethtool -G eno1 rx 2047
# ethtool -G eno2 rx 2047
# ethtool -G eno3 rx 2047
# ethtool -G eno4 rx 2047
 
# ethtool -g eno1
Ring parameters for eno1:
Pre-set maximums:
RX: 2047
RX Mini: 0
RX Jumbo: 0
TX: 511
Current hardware settings:
RX: 2047
RX Mini: 0
RX Jumbo: 0
TX: 511
 
 
[root@curie]# ethtool -g enp1s0
Ring parameters for enp1s0:
Pre-set maximums:
RX: 511
RX Mini: 0
RX Jumbo: 0
TX: 511
Current hardware settings:
RX: 200
RX Mini: 0
RX Jumbo: 0
TX: 511
 
# ethtool -G enp2s0 rx 511
# ethtool -G enp1s0 rx 511
 
# ethtool -g enp1s0
Ring parameters for enp1s0:
Pre-set maximums:
RX: 511
RX Mini: 0
RX Jumbo: 0
TX: 511
Current hardware settings:
RX: 511
RX Mini: 0
RX Jumbo: 0
TX: 511
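
Worth remembering that ethtool -G changes don't survive a reboot. With the legacy network scripts on RHEL 7, one way to persist them is an ETHTOOL_OPTS line in each interface's ifcfg file (this assumes the interfaces are managed that way; a NetworkManager dispatcher script would do the job too):

# /etc/sysconfig/network-scripts/ifcfg-eno1
ETHTOOL_OPTS="-G eno1 rx 2047"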

I also increased the default number of NFS threads on the NFS server. Considering I've got 15+ VMs, and each VM looks to use 2-3 NFS threads depending on its number of disks, raising the default of 8 to 20 should help with concurrent disk activity:

# egrep COUNT /etc/sysconfig/nfs
RPCNFSDCOUNT=20
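
The new value doesn't apply to a running server on its own; on RHEL 7 restarting nfs-server picks it up, or rpc.nfsd can bump the count on the fly. The "th" line in /proc/net/rpc/nfsd is also a quick way to confirm how many threads are running. A rough sketch:

# Apply the new thread count
systemctl restart nfs-server
# ...or adjust it live without a restart
rpc.nfsd 20

# The first field after "th" is the number of running nfsd threads
grep ^th /proc/net/rpc/nfsd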

Similarly, the VDO device I created had only one thread allocated in several places, so I upped these as well and doubled the block map cache size:

vdo modify --all --vdoLogicalThreads=4 --vdoPhysicalThreads=4 --vdoBioThreads=6 --vdoCpuThreads=6 --vdoAckThreads=2 --blockMapCacheSize=256M
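
As far as I understand the vdo manager, changes made with vdo modify only take effect the next time the volume is started, so the device needs a stop/start to pick up the new thread counts. Something like the following, with the filesystem on top quiesced first (mount point assumed):

# Unmount whatever sits on top of the VDO device, then restart it
umount /vms
vdo stop --name=vdo0
vdo start --name=vdo0
mount /dev/mapper/vdo0 /vms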

After these changes, I still wasn't seeing any significant improvement. I was getting fairly abysmal speeds even on a local SSD filesystem on the NFS server, without going over the network at all. I started to isolate this and began to suspect a hardware/SSD issue. After updating the P420i storage controller firmware in my DL360p, making sure the RAID controller cache was disabled, and confirming SSD Smart Path was on, it started to dawn on me. These six SSD drives had previously been used in a RAID5 configuration that saw a ton of heavy disk IO; I had dedicated this host to an OpenStack environment and done several builds hammering these disks. RAID5 is not an optimal SSD configuration: parity calculation is expensive, and it adds a lot of disk IO and wear that wouldn't be present in a RAID0 or RAID10 configuration. Essentially, I've got worn SSDs. I needed to do a secure erase of these SSDs to return their cells to as close to factory condition as one can get. These disks are approximately 3 years old and had never had this done.
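
The exact erase procedure depends on the controller – drives sitting behind a Smart Array would be handled through HP's own tooling – but for an SSD presented directly to the OS, an ATA enhanced secure erase with hdparm looks roughly like this (illustrative only; the device name and password are placeholders):

# Check the drive's security state (it must not be frozen)
hdparm -I /dev/sdX | grep -A8 Security
# Set a temporary security password, then issue the enhanced secure erase
hdparm --user-master u --security-set-pass tmppass /dev/sdX
hdparm --user-master u --security-erase-enhanced tmppass /dev/sdX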

After doing an enhanced secure erase, I saw my local disk speeds come back up about five-fold. This was more in line with the newer SSDs in other servers. Disk tests over the network saw the same speed increase, so it looks like my problem was entirely hardware related :). As I'm now turning all the VMs back on, I'm seeing a much quicker response during the high-IO activities. There are more than 16 NFS threads consistently in use now, and I'm monitoring to see the change in VDO-related performance. I still need to research a utility to get accurate VDO stats; I think this will likely be a PCP module. But at first glance, with not much concrete data to go on yet, I *think* the VDO tuning has helped as well.
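
In the meantime, vdostats itself can provide much more detail than the one-line summary; sampling something like this periodically at least gives raw counters (block map cache hits, dedupe advice stats, and so on) to compare before and after:

# Detailed per-volume counters
vdostats --verbose /dev/mapper/vdo0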

While I learned a bit about VDO and NFS performance tuning, it looks like I might need to spend that 10G networking budget on new SSDs instead. There’s diminishing life left in these.

Filed Under: cloud, enterprise, networking, open source, performance tuning, storage Tagged With: hardware, I/O, linux kernel, open source, performance tuning, private cloud, software defined storage, SSD, storage, virtualization

Migrating RHEV storage domains

October 19, 2018 By Andrew Ludwar

Recently, I've been doing some troubleshooting in my virtualization environment, specifically with the NFS storage backing it. To isolate an issue, I needed to migrate all the VM disks off the main data store to another one. I hadn't performed this kind of activity before, but found it to be quite easy. I quickly built an additional NFS server, added it into RHV, and migrated the VM disks to it with just a few clicks and some waiting for the copying to complete. This article was basic, but helpful.

I selected all my VM disks on the original NFS data store and clicked “Move”. Since I only had two data stores, the other was auto-selected and I just had to hit “OK”.

RHEV-move

They slowly started moving over:

RHEV Copying

After a little while, my previous NFS data store was empty and the new store contained all the VM disks:

RHEV Domains

RHEV VMs

Next up, I’ll elaborate on why I’m doing this troubleshooting :).

Filed Under: cloud, open source, performance tuning, storage Tagged With: hardware, open source, private cloud, software defined storage, SSD, storage

Reliable, resilient storage with GlusterFS

August 18, 2014 By Andrew Ludwar

A need came up lately for some inexpensive, resilient storage that was easily expandable and spanned multiple datacentres. Having recently been playing with GlusterFS and Swift in our OpenStack trial, I was quick to point out that this was a strong use-case for a GlusterFS architecture. We needed something right away, and something that wasn't terribly expensive, which Gluster caters to quite well. Typically we would purchase a SAN technology for this, but after some recent discussions on cost and business agility, we decided against that and opted to try a popular open source alternative that's been gaining both momentum and adoption at the enterprise level.

I began researching the latest best-practice architecture for GlusterFS. It had been about a year since I'd given it a deep investigation, so I started re-reading their documentation. I was pleasantly surprised to see that quite a bit more work had been done on it since I last checked in with the project. They've recently moved a lot of documentation into GitHub, and redesigned their website as well. I found some good information on their website here, but found the most useful information in GitHub here.

Our storage needs for this project were very simple. We needed resilient, highly available storage that could expand quickly, but we didn't need a ton of IOPS. Databases and other I/O-intensive applications are out of scope for this project, so what we're mostly looking for out of this architecture is basic file storage for things like:

  • ISOs
  • Config file backups
  • Restored file location
  • Static web files
  • FTP/Dropbox/OwnCloud file storage

After perusing Gluster.org's documentation, I came across the architecture that would best suit our needs: distributed replicated volumes. We wanted two servers in each datacentre to have a local copy of this data, and also to have two copies stored in another datacentre for resiliency. This way we can suffer a server loss, a server loss in each datacentre, or an entire datacentre loss and still have our data available. The documentation even gives you the command syntax to create this architecture; perfect! I wanted to change the architecture slightly, but that wasn't a big deal. There's enough detail in the documentation that I was able to understand how to do this.

  1. I racked 2 CentOS 6.5 servers in each datacentre (for a total of 4 – four Dell R420s decommissioned from another project; commodity hardware)
  2. Installed glusterfs-server on them all via the GlusterFS repo
  3. Made sure their DNS was resolvable for both forward and reverse
  4. Chopped up their RAID10 1TB hard disk with LVM leaving approx 80GB for the OS, and the rest for /gfs/brick1

…and I was ready to run their provided command – slightly modified of course.  Since I wanted the data to essentially be replicated across all 4 nodes, I changed the replica count from 2 to 4.  This means if I ever need to expand, I’ll need to do it 4 nodes at a time.  I then ran this command on one of the cluster members to create the gluster volume named ‘gfs’:

gluster volume create gfs replica 4 transport tcp server1:/gfs/brick1 server2:/gfs/brick1 server3:/gfs/brick1 server4:/gfs/brick1
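
Worth noting: gluster volume create only succeeds once the nodes have been joined into a trusted storage pool, and the volume must be started before clients can mount it. Roughly, run from server1:

# Join the other nodes into the trusted storage pool
gluster peer probe server2
gluster peer probe server3
gluster peer probe server4
gluster peer status

# Start the new volume and confirm the brick layout
gluster volume start gfs
gluster volume info gfs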

For a visual reference, this is what this architecture looks like:

GlusterFS – Distributed Replicated Volume

That's it! I could now install the glusterfs and glusterfs-fuse client packages on any node that needed to access this volume, and mount it with:

mount -t glusterfs server1:/gfs /mnt
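
For a persistent mount that doesn't depend solely on server1 being reachable at mount time, the fstab entry can list backup volfile servers (a sketch; the option name has varied slightly between GlusterFS releases):

# /etc/fstab
server1:/gfs  /mnt  glusterfs  defaults,_netdev,backup-volfile-servers=server2:server3:server4  0 0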

After successfully mounting the volume, I did a few file-writing tests with bash for loops and dd, just to test the speed and confirm that created files were replicating.  I then did a few failover tests: shutting down an interface on a node, rebooting a node completely, and shutting down two nodes in a datacentre.  Since I am using the native glusterfs client to connect to the volume, failover and self-healing are handled automatically, transparently to the user.  After the 42-second timeout, services failed over nicely, and files were replicated across the active hosts.  When the downed nodes came back up and rejoined the volume, the files they had missed were automatically copied over.  Perfect!  And just like that – I was done.

I was surprised at how simple this was to architect and set up.  When I need additional disk space, I'll rack another 4 nodes into the architecture and expand it using the gluster volume add-brick command documented here.
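
With a replica count of 4, that expansion means adding bricks four at a time; roughly (hypothetical server names):

# Add four more bricks, then rebalance existing data across the new distribution
gluster volume add-brick gfs server5:/gfs/brick1 server6:/gfs/brick1 server7:/gfs/brick1 server8:/gfs/brick1
gluster volume rebalance gfs start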

In another article, I plan to do some tweaking of the 42-second TCP timeout – Gluster provides documentation on all the options and tunables you can set.  It's also worth looking at an IP-based load-balancing technology (either an F5 or CTDB), which should increase service availability even further.  In yet another article, I'd also like to explore the new geo-replication technology that spans WANs and the internet.  For now, I have a 1TB highly available and resilient file storage volume to play with.
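
For reference ahead of that article, the knob in question is the network.ping-timeout volume option (the value below is just illustrative):

# Lower the 42-second default before a downed node is declared dead
gluster volume set gfs network.ping-timeout 10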

 

Filed Under: devops, enterprise, open source, storage Tagged With: devops, gluster, software defined storage