tiresome
David,
CentOS 6's latest version of util-linux is 2.17. I believe GPT support was added in 2.30. This is a real thing you have to deal with, not a theoretical problem. One of the OSs you quoted is in fact old enough to be affected.
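(If you want to check what a given box is actually running before assuming a GPT-aware fdisk, something like this works; package names differ by release:)

    # RHEL/CentOS - the package is util-linux-ng on 6, util-linux on 7
    rpm -qa | grep util-linux
    # Ubuntu
    dpkg -s util-linux | grep ^Version
    # parted reads GPT labels even where the stock fdisk cannot
    parted -l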
We could provide warnings, or we could use universal tools instead that would work all the time. I'm not sure why this suggestion keeps getting shot down. Why is the more problematic and less universal tool + warnings a better solution?
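To make "universal tools" concrete, the kind of sequence I mean looks roughly like this - a sketch only, device and partition numbers are examples, and growpart comes from cloud-utils/cloud-guest-utils, so confirm it is installed before relying on it:

    # print the label type and layout; parted understands both MBR and GPT everywhere
    parted -s /dev/sdb print
    # grow the last partition to fill the enlarged virtual disk
    growpart /dev/sdb 1
    # if that partition is an LVM PV, pick up the new space
    pvresize /dev/sdb1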
The default installers in both RHEL/CentOS 7 and all the versions of Ubuntu Linux you mentioned now support dm-raid as well as dm-crypt transparently from the installer. The likelihood of one-offs is growing. Not to mention we get what we get from customers. (I just remembered there is a LUKS volume in RIA that I know of as well - and not one I built.)
I also didn't notice any Linux version restrictions on your wiki page. It just says "Linux Drive Space Management." We might actually have more VMs running Linux OSs older than the ones you listed. We are still doing drive expansions on older systems, aren't we?
VMware is one thing; VMware + Commvault + whatever is another. I recall from memory that some of these issues arose during Commvault's snapshot/coalesce functions. As I said, I'm not alone. Phil Fiacco is aware of these events. Josh Simoneau and Ian Mathews are both senior-level engineers on Phil's team who have held the opinion that we should not be doing this. When I spin up an additional meeting we can/will go into additional technical detail. Certain tweaks can be made to make this a less dangerous methodology, but I say avoid it.
VMware appliances, as demonstrated here: https://kb.vmware.com/s/article/2126276, are made up of a slew of drives to avoid this exact problem. Literally 11 disks. I'm not sure where you are getting your information, but it certainly is not the universally accepted methodology by VMware.
Lastly, I have not asked that we fix all VMs that we have. I am asking that we don't break them further. I am also not saying that we need to make a full-disk PV in all cases. I stated that for new builds we should adopt these standards. If you have a partitioned disk with sda1, sda2, and sda3 and you want to expand sda3, make good backups and do it. That makes sense. If you need to expand sda2 or sda1, you might create an sda4 (though I don't recommend it) and extend within the same virtual drive; but if you have a choice - and since it has already been determined that data grows at this location - optimize things now and for future operations.
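For a brand-new data disk, the standard I'm describing is nothing more than this (hypothetical VG/LV names):

    # whole-disk PV - no partition table to fight with on the next expansion
    pvcreate /dev/sdb
    vgcreate data_vg /dev/sdb
    lvcreate -n data_lv -l 100%FREE data_vg
    mkfs.ext4 /dev/data_vg/data_lv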
In regards to downtime... yeah, not necessarily required, but since fscks should be conducted before and after resize2fs, for instance, you should more often than not schedule downtime anyway. --- These fsck operations are missing from the document as well.
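What I'd expect wrapped around an offline grow of an ext filesystem, roughly (mount point and LV name are examples):

    umount /data
    e2fsck -f /dev/data_vg/data_lv    # check before
    resize2fs /dev/data_vg/data_lv    # grow to fill the LV
    e2fsck -f /dev/data_vg/data_lv    # verify after
    mount /data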
These drive expansions don't even check inode tables. I know of 5 Linux installations in Pittsford where the number of files is so extremely large that drive expansions add space without additional inodes, making these expansions useless.
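Checking takes seconds, e.g. (paths are examples):

    # IUse% at 100% means more blocks won't help
    df -i /data
    # ext filesystems fix the inode count at mkfs time
    tune2fs -l /dev/data_vg/data_lv | grep -i inode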
Lowering risk and protecting data should be the focus. I would not rely on Commvault to bail us out here. I was on the phone with JP recently discussing Beacon's restore failures - doesn't sound pretty.
It's the weekend and I am tired of campaigning here.
w. matthew schlueter
infrastructure architect - strategy
------------------------------------
Matt,
I actually don't disagree with you as much as I'm afraid you think I do. You've brought up some good points, and we can definitely improve a couple of our steps with those in mind. That said, please bear in mind I didn't create this process to cover every possible circumstance ever, but to cover what we're likely to see with our supported systems. Those include RedHat/CentOS 6 and 7, and Ubuntu 16.04 and newer. In that context, we're not going to see super old versions of fdisk, and we're unlikely to even see any examples of machines without LVM. Still, since you brought up examples we currently have in our datacenters, we should include a warning: if you don't see LVM clearly marked, escalate instead of attempting to expand what is assumed to be a native Linux partition. I'll also update some of the steps at the beginning to better check for LVM in use.
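Something along these lines for that check, as a sketch (device and LVM names will vary):

    # confirm the filesystem actually sits on LVM before touching anything
    lsblk -o NAME,TYPE,FSTYPE,SIZE,MOUNTPOINT
    pvs; vgs; lvs
    # if TYPE never shows "lvm" and pvs/lvs come back empty, stop and escalate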
I get that having multiple vmdks in a VG isn't ideal, but it's literally how VMware creates and manages their own appliances. They support it, so we can too. The whole reason for going this direction is that we have seen very recently how expanding existing drives can cause problems. It's not theoretical; it's what started this whole process and conversation in the first place. So I created a relatively safe process that avoids needing to modify existing drives and has the added bonus of requiring zero downtime and no reboots. Ideally, it would be great if we could take the time and effort required to rebuild every Linux system in the company with partitionless PVs, but between the existing configurations and Synoptek's rate of acquisitions, we're not likely to stop seeing this sort of setup in the near future. Given this environment, I believe that adding drives to expand capacity is the safest and least disruptive option we have, if we have to standardize on something.
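For reference, the zero-downtime path boils down to roughly this (device, VG, and LV names are examples, not what the document uses verbatim):

    # rescan so the guest sees the newly added vmdk without a reboot
    for h in /sys/class/scsi_host/host*; do echo "- - -" > "$h/scan"; done
    # whole-disk PV on the new drive, then grow the existing VG, LV, and filesystem online
    pvcreate /dev/sdc
    vgextend data_vg /dev/sdc
    lvextend -r -l +100%FREE /dev/data_vg/data_lv    # -r grows the filesystem too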
I appreciate your feedback and expertise, and hope we can find an agreeable solution to settle on.
David Bugg
Systems Engineer
303.713.3109 o
515.418.6553 c
[email protected]