author: "Issif" date: 2021-06-17T12:00:00+02:00 title: "FinOps" description: "What I learned from my FinOps experience" categories: ["cloud", "finops"]
During my previous professional experience, I worked for 3 years as a FinOps expert on AWS. The role was created when I arrived, so I had complete freedom to set up the model I wanted, with its methodology and tools. Before I forget what I know, I told myself it might be interesting for some people if I put down in writing what remains in my head, literature in French on the subject being all the rarer (if the article is well received, I may do an English translation).
It also became apparent, while talking with my clients and other experts (often from consulting firms), that my vision could differ from theirs and from what is generally found in articles on the subject. I think it comes from the fact that I am above all an OPS, managing production and its inherent constraints. My recommendations and advice have always had as a backdrop the real operation of the platforms, their performance and their availability (working for an outsourcer, if the platform went down, it was my colleagues and I who solved the problems, possibly on call 😉).
Having only performed audits for AWS and lacking advanced knowledge of the others, I will use a lot of terms and advice specific to this Cloud Provider in this article, but I think the general principles are easily transposable to GCP and Azure.
- Foreword
- Summary
- Definition
- Finding the right balance between HA, performance and costs
- Costs and billing models
- FinOps, a process
- Tooling
- Invoices
- AWS Cost Explorer/Trusted Advisor
- Scripts
- SaaS
- Websites
- Diagrams
- Metrology
- Many possible actions
- Tags
- Conclusion
The first task is of course to define the term FinOps itself. It is obviously the contraction of two terms (like all these trendy SomethingOps 😛). In our case: Financial + Operations. Not very telling; it sounds more like accounting than Cloud architecture.
So here is my definition, with the important parts highlighted on purpose:
FinOps **analyzes Cloud architectures** and their **associated billing**, in order to identify **areas for cost containment and/or savings**, while ensuring a **low level of risk** and taking **business issues** into account.
Let's take each of the highlighted parts:
- analyzes architectures: The work of FinOps is above all an engineering job; if we settle for pointing at lines in an Excel sheet to find the services that are too expensive and cut them, it won't work.
- associated billing: This is stating the obvious, but to target areas for improvement, you need to know which services are billed and to what extent (cutting costs on a service billed at $100 remains less interesting than saving 2% on a service billed at $1,000, for example).
- areas for cost containment and/or savings: The distinction is very important in my opinion. We often seek only to reduce costs, but we must also think about controlling how they evolve. I have often explained that a growing Cloud Provider bill is not necessarily abnormal or worrying; it can simply be the logical consequence of increased activity. You have more traffic, therefore more consumption, but above all more revenue, which is expected. Where it becomes a problem is when the bill increases while revenue drops, or when both grow but at very different orders of magnitude (costs exploding relative to income).
- a low level of risk: When we implement FinOps actions, we often think of changing Instance (VM) types, or re-architecting portions of the platform. It is therefore very important to keep in mind that the platform must provide a service of the same quality (or even better), and you must therefore be very careful in the choices and their implementation.
- business issues: A Cloud platform is after all only a tool, a means of delivering a service; it is part of a much larger context that must be taken into account. I am thinking, for example, of the company's maturity with the technologies, or of legal and regulatory constraints (PCI-DSS, HADS, etc.).
With this definition in place, we now need to explain what we are aiming for when doing FinOps, since that is what will guide our choices.
It is a question of finding the right balance between 3 criteria:
- High availability
- Performance
- Costs
The 3 obviously influence each other.
For example:
- If HA is a dominant criterion, costs will inevitably increase and performance may decrease (multi-regions, multi-cloud).
- If we favor performance, larger machines will have to be paid for, but HA may also be sacrificed (I had a client who ran everything in a single AWS AZ because the latencies between AWS data centers had an impact on their compute grid).
- If we absolutely seek to lower costs, we will sacrifice performance and/or high availability.
I deliberately left a 4th criterion, security, out of the previous list, because it is particular and has its full importance. It potentially impacts the other 3 criteria and is part of a larger framework that is very often impossible to neglect (PCI-DSS, HADS, etc.). It is therefore necessary to ensure that the recommendations are in line with good security practices, or even with the law.
The good news in all of this is that we are in the context of a Cloud platform, which comes with its constraints but also its advantages, flexibility being at the top of the list.
I mentioned before that the whole art of FinOps is to find the right balance between performance, HA and costs, and what is magical with the Cloud is that this balance is not meant to be immutable and universal. The flexibility of Cloud Providers makes it possible to adjust it as desired and when desired.
Here is a list of elements that can lead to adapting our choice of the dominant criterion or criteria, with a few examples to explain them:
- Environment: More importance is often given to the availability of the production environment than to the staging environment.
- Period: In e-commerce, we will go all out during the sales period. A billing application only needs to be beefed up at the beginning or end of the month, etc.
- Data criticality: We can sometimes afford to lose cache, but not business data.
- Criticality of the application: An SSO will be central, while a worker that goes down only creates delay, and its task is supposed to be resumed afterwards.
- Business challenges: Compliance with availability or performance SLAs.
- Budget: Fundraising often makes it possible to make the desired changes; conversely, a hole in the cash flow will require you to sacrifice this or that criterion.
It is extremely important to bear in mind what the invoices are made up of, but also the overall context in which they fit.
I therefore distinguish 2 axes: Direct costs and Indirect costs.
We will not be able to act on both in the same way, or even at all for certain elements. And it is very important to take them all into account when choosing the actions to be taken.
I classify as Direct costs everything that appears directly on the Cloud Provider's bill, separated into 2 categories:
- Resources: These are the costs in the invoices that are easiest to understand and estimate. Generally billed by the second or the hour, they characterize the "fixed" elements composing the infrastructure (Instances, Volumes, etc.). The rule is simple: the bigger you take, the more you pay, and vice versa. We can therefore easily estimate the possible savings when we act on them.
- Consumption: These are all the costs in the invoices that are more difficult to estimate, because they do not depend on our choices alone and can vary completely from one period to another. This covers bandwidth, snapshots, total number of requests, etc. Savings estimates are often less precise, but by knowing your business well and taking a few margins, you can get there.
When we think of reducing our Cloud bill, we often forget the additional costs that do not appear on the bill strictly speaking, but which are nevertheless there when we want to implement actions:
- Migration costs: If you plan to replace components or move to a new architecture, a large number of man-days (JH) will have to be spent on the subject, and this will have a cost for your organization, both financial and human. The return on investment must therefore be worth it.
- Development costs: In the same vein, less OPS and more DEV: if you change technology, there will surely be code to write and therefore man-days to spend, on features that do not necessarily bring added value to the end customer.
- Training: A point never to neglect; any change will be better experienced if it is accompanied by training, whether internal or not (it is often more expensive with external consultants, of course).
- Outsourcing: If you do not manage your platform yourself, or if you have signed a contract for 24/7 support, the financial scope may be impacted.
- Licenses: This is a particularly valid point for migrations from On-Premise to Cloud, as the billing models for software vendor licenses can change a lot, and not necessarily downwards.
These Indirect costs are very often more difficult to evaluate, even almost impossible when you are an external auditor, but it is important to take into account the fact that they exist and can drastically change the choice of actions to be taken.
Now that we have the destination, let's see the path. In my opinion, FinOps is a process that will take place throughout the life of the platform and begins at its conception. It often appears that thinking about an architecture by integrating the FinOps aspects from the start, leads to creating designs that are closer to best practices.
This process has several characteristics. It is:
- Continuous
- Iterative (one action after another)
- Of growing (even exponential) complexity
This process may vary from one company to another, but there are generally 3 key steps, which are repeated:
Analysis > Recommendations > Implementation > Analysis > ...
Cloud architectures being elastic, the associated billing is just as elastic. Added to this, the fact that everything is virtualized and paid for by consumption makes the architectures mutable (change of machine type, replacement of building blocks by other services, etc.). It then becomes obvious that carrying out FinOps actions once a quarter, for example, will be less relevant and, worse, will lag behind the expenses.
It is important to integrate FinOps among all the other processes that allow a platform to be managed correctly, or even to integrate checkpoints in the same way that we set up monitoring for the systems themselves. The Cloud Providers have understood this well and provide ways to check compliance with a budget (without ever blocking possible expenses; we are only talking about alerts). This fits into the cost containment part of my definition.
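To make this concrete, here is a minimal sketch of such a budget alert created through the AWS Budgets API with boto3. The account ID, the amount and the e-mail address are placeholders, and this is only one way of doing it (the console works just as well):

```python
import boto3

# Sketch: a monthly cost budget with an e-mail alert at 80% of the limit.
# Account ID, amount and address are placeholders; this only alerts, it never blocks spending.
budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "monthly-cost",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
            ],
        }
    ],
)
```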
This does not preclude more in-depth audits that can be carried out, not on a day-to-day basis, but all the same regularly, and which are often used to identify areas for greater savings.
Once a FinOps analysis has been carried out and recommendations have been established, the temptation can be strong to reduce the bill immediately by carrying out all the actions at once. It is rarely a good idea. By carrying out several actions in parallel, it becomes more difficult to determine which of them have had the desired effect and, worse, some may contradict each other without it being possible to determine how. By iterating, action by action, it becomes easy to determine which ones have worked well and therefore to repeat them over time or in other environments.
The first action (see the rest of the article) is generally to track down unnecessary expenses, the waste. It is a relatively simple action, which can be done manually at first and then automated. Once this is in place, other actions can be carried out, necessarily more complicated than the previous ones, and so on. It quickly appears that the ratio of possible savings to investment (time + money) decreases over time: the further we advance in the solutions found to limit costs or save money, the more complex they become to implement. I even think the trend is exponential. The order of magnitude of the investment is not at all the same between automating the deletion of detached volumes and overhauling a whole section of the architecture.
What can we use to conduct a FinOps analysis?
This is of course a non-exhaustive list, but it is a question that has often been asked of me, so here it is.
The invoices are, of course, the first documents to study; they are often simplified and give a global overview as well as trends. They are also what your accounting department receives (this will help when communicating with them, to explain possible spikes when necessary).
AWS provides 2 tools:
- Cost Explorer: This is going to be your new best friend. The tool improves every year and allows you to analyze and graph all your spending history, filter by tags, services, regions, etc. It really is essential and you really have to learn how to use it. The difficulty is understanding what you read, because unlike the invoices, the details follow the AWS API terminology and not a simplification of the terms as in the invoices received at the end of the month.
- Trusted Advisor: When you subscribe to Business-level support, in addition to providing good information related to the security of your platform, the tool gives you information on the financial waste present in the account. Needing a high level of support for this to be active can be a hindrance, but fortunately the items it reports are relatively easy to obtain by hand or via a script.
There are a plethora of scripts and even complete applications on the web to help analyze costs and find areas for improvement. You can also write your own scripts to highlight specific points related to your infrastructure. The advantage of automation is of course to save time, but above all to get regular results and therefore to track the effects of your modifications.
Here are some examples (the first two are mine):
And a search on Github with some other tools: Results
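To give an idea of what such a script can look like, here is a minimal sketch in Python with boto3 that lists detached EBS volumes, one of the classic sources of waste discussed later. The region is an assumption; adapt or loop over regions as needed:

```python
import boto3

# Minimal sketch: list EBS volumes that are not attached to any instance.
# The region is an assumption; adapt it (or loop over all regions) to your setup.
ec2 = boto3.client("ec2", region_name="eu-west-3")

detached = []
paginator = ec2.get_paginator("describe_volumes")
for page in paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}]):
    detached.extend(page["Volumes"])

total_gb = sum(v["Size"] for v in detached)
print(f"{len(detached)} detached volumes, {total_gb} GB of storage billed for nothing")
for v in detached:
    print(v["VolumeId"], v["Size"], "GB", v["VolumeType"], v["CreateTime"])
```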
The craze for FinOps necessarily comes with its share of online services to help with the task; here are two that I tested (I will not do a comparison, as I have not touched them for a long time now):
These tools do more than FinOps, but as their costs fall into what I call Indirect costs, they must therefore be taken into account in your calculations to estimate the savings.
The web is teeming with documentation and articles like this one; don't hesitate to spend hours searching to learn the techniques in vogue for limiting costs, or even to decode the sometimes abstruse AWS documentation.
Here are a few examples of tools that one can find and use frequently:
- https://instances.vantage.sh
- https://calculator.aws/#/ (official calculator provided by AWS)
- https://observablehq.com/@rustyconover/aws-ec2-spot-instance-simulator
FinOps is technical work; even if we go through the invoices carefully, it is important to have a view of the platform in terms of architecture, to understand how things are designed, from the Instances to the managed services. The ideal is to also have the actual flows and not just a block diagram, flows being often very difficult to categorize and even more difficult to understand in terms of billing.
Here is a diagram of some flows with the directions that AWS charges you for:
{{< fancybox path="/img/2021/06" file="aws-flux.png" caption="AWS Stream" gallery="Screenshot" >}}
All Cloud Providers provide their own more or less advanced metrology solutions. In the case of AWS, the solution is called CloudWatch. Although practical, to assess the proper use of resources it is necessary to have your own metrology system, with finer granularity but above all more metrics. Take the case of CPU usage, which is admittedly rarely the limiting factor on web services (memory usage or I/O are more worth watching): you have no information about the execution contexts (user, system, steal, iowait, ...). In the case of RAM, your Cloud Provider will not provide you with any details on its use, so it is difficult, as is, to judge whether its size is right for your needs.
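As an illustration, here is a hedged sketch of how one might pull the CPU history of an instance from CloudWatch with boto3 (the instance ID, region and period are placeholders). It only gives averages and maxima, which is precisely why a finer, self-hosted metrology is worth having:

```python
import boto3
from datetime import datetime, timedelta

# Sketch: fetch two weeks of CPU statistics for one instance from CloudWatch.
# The instance ID is a placeholder; CloudWatch will not give you user/system/steal
# breakdowns or memory usage, hence the need for your own metrology.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-3")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=3600,  # one point per hour
    Statistics=["Average", "Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), round(point["Maximum"], 1))
```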
The following list of categorized FinOps actions is absolutely not intended to be complete, each service having its specificities and the billing models also evolving over time. Once again, even though it is very AWS-oriented, it can be fully or partially transposed to other Cloud Providers.
In order to be able to better assess the relevance of the recommended actions but also to be able to prioritize them, I have got into the habit of categorizing them according to 3 criteria:
- Complexity: This is often a good way to help estimate the time that will be needed for implementation.
- Risk: Our goal, as explained in my definition, is not to degrade production, but any action carries its share of risk, especially when switching from one piece of architecture to another, or when downsizing an Instance type.
- Expected savings: Often in absolute value, and to be weighed against the other two criteria. It may make more sense to do a simple, risk-free action that saves $100 than a complicated redesign that saves $200 but takes 10 man-days (remember, Indirect costs).
When the FinOps audit is for a third-party client, it will be all the more important to allow them to assess the actions they want to carry out or have carried out, since they alone hold the purse strings.
The first action we think of is obviously to reduce waste. The flexibility of the Cloud often leads to creating resources and then leaving them lying around in a corner.
Here is a list of some common waste:
- Old Snapshots: Several scenarios: either these are Snapshots of obsolete AMIs, or of EBS volumes that no longer exist. There may also be old EBS Snapshots still in place; in that case, since they are backed up incrementally, deleting them will bring a smaller gain on the bill than expected, as the more recent ones still need a reference.
- Old AMIs: Unless absolutely necessary, it is not useful to keep old AMIs, with obsolete systems and packages. Deleting them does not save anything strictly speaking, but frees up Snapshots which can then be deleted.
- Detached EIPs: The gain is minimal, but if they really are useless, you might as well free them.
- Detached EBS: This often comes from deleting Instances without deleting the volumes they were using. If AutoScaling was also active, the number of volumes can pile up quickly.
- EBS attached but with no I/O: The classic case of the volume mounted on `/logs` with no application writing to it.
- ELB/ALB/NLB without Instances or without requests: It happens that `xLB` do not point to anything, or that the Instances behind them have been Unhealthy for a very long time, without any impact elsewhere.
- EC2/RDS on old generations: With each new family released by AWS, prices are lower, sometimes for better performance or options.
- EC2 stopped for a long time: The compute is not billed in this case, only the storage; it can be legitimate or not. Up to you.
- S3 Buckets with objects in the Reduced Redundancy Storage class: AWS has deprecated this storage class, making it slightly more expensive than Standard (by a thousandth of a dollar per GB); on petabytes or even terabytes, that adds up quickly (true story with a music streaming platform).
- S3 Buckets that are public but not behind a CloudFront distribution: Simply going out to the internet through a CloudFront distribution reduces bandwidth costs, cache enabled or not. This is valid in all cases, in front of `xLB` or `EC2` as well.
- RDS with no connection for a long time: If the database is only used from time to time, it is always possible to stop it (for 7 days maximum, after which it restarts and has to be stopped again).
- RDS using unnecessary PIOPS: For every GB of GP2/GP3, AWS guarantees a floor of 3 IOPS, which means that with 350 GB you already get more (1,050 IOPS) than the 1,000 IOPS that PIOPS storage requires you to provision at minimum, for much less money.
To help you identify these elements, the APIs and the metrics you have will be very useful.
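For example, here is a hedged sketch (boto3) of how one might combine the API and CloudWatch to spot attached volumes with no write I/O over the last week, the famous `/logs` volume nobody writes to. The region and the 7-day window are assumptions:

```python
import boto3
from datetime import datetime, timedelta

# Sketch: flag attached ("in-use") EBS volumes with zero write operations over 7 days.
ec2 = boto3.client("ec2", region_name="eu-west-3")
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-3")

volumes = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["in-use"]}]
)["Volumes"]

for volume in volumes:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="VolumeWriteOps",
        Dimensions=[{"Name": "VolumeId", "Value": volume["VolumeId"]}],
        StartTime=datetime.utcnow() - timedelta(days=7),
        EndTime=datetime.utcnow(),
        Period=86400,  # one point per day
        Statistics=["Sum"],
    )
    total_writes = sum(p["Sum"] for p in stats["Datapoints"])
    if total_writes == 0:
        print(f"{volume['VolumeId']} ({volume['Size']} GB): no write I/O in 7 days")
```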
With the advent of the Cloud, its flexibility and the speed with which resources can be obtained, we have almost forgotten the reflexes of our predecessors, forced to squeeze the last drop out of machines purchased for several years. I strongly believe that before starting type changes, especially Scale Up (increasing the type), you have to work at the OS and application level to get the maximum out of them. Solving everything with infra has never been relevant (the famous more RAM, always more RAM, instead of dealing with the memory leak).
In the case of managed resources, especially RDS, many settings are available to configure the engines (size of buffers, allocation of temporary tables, maximum number of connections, etc.). Adjusting them to the real business need can sometimes change performance completely.
For EC2, it is even more true, because you can play with many more settings, such as:
- the sysctl parameters: All Unix-like OSes allow you to set kernel parameters, including many limits. By adjusting these values, you can often push back the usable capacities.
- the configuration of the applications: Most applications, especially in the open-source web world, can be configured down to the smallest detail to handle their load (worker_processes or keepalive for Nginx, for example).
Another option, much less widespread because more difficult to implement, is to use binaries compiled for the CPU architecture of the EC2 Instances used. This may seem crazy, especially since it brings great complexity and falls completely outside the scope of such practical package managers. But after all, using binaries compiled outside of any package tree is a very common thing: it's called a Container 😉. AWS provides the CPU model associated with each Instance type; it is up to you to find the right compilation flags, and to remember to change them at each type change (for example, Blender rendering on CPU can save up to 30% of time, from experience).
Using managed services has a cost, that is undeniable, but it has to be put into perspective with the Indirect costs. Between a managed PostgreSQL and a PostgreSQL installed by hand on a rented server, the Cloud Provider solution may be more expensive; but if you add to your in-house solution the financial and human costs of managing backups, possible restorations, high availability, etc., without necessarily having the ability to change the size of the machine or extend its storage, the balance no longer tips the same way. Once again, stopping at material costs alone (which relate to Direct costs) and not taking in the entire financial perimeter can prove counterproductive.
I generally advise starting by doing everything manually, whether it's analyzing costs in the Cost Explorer, finding waste or estimating possible savings. This is in order to really understand how things work and to be able to spot areas for improvement at a glance once the habit is formed. Once broken in, it is time to set up procedures and, if possible, automated tools to limit drift over time. Automated procedures for cleaning up and shutting down environments outside working hours are classics; there are many examples on the net. As mentioned above, this is also where Budget alerts can help, even if they can often prove too rigid (a sales period can blow the counters compared to the previous month without this being worrying, quite the contrary). One aspect that should not be neglected at this stage (at any stage, in fact) is training the teams (in-house or at the client's) on what FinOps is, what it brings, and on the procedures put in place in the company as part of this process.
This is the most complex, but often the most effective and rewarding method: the redesign. Cloud Providers are constantly releasing new features and new services; what is impossible today may be possible in 1 or 2 years. Integrating FinOps when designing a platform is important in order to be efficient, but it is even more important during redesigns, which can often be dictated by reasons of rationalization and/or savings. This is all the more relevant since you then have real experience of the behavior of your platform, its needs and its constraints.
We can consider, in the case of a redesign:
- migrating some functions to Lambda
- using DynamoDB as a NoSQL database
- using an ALB in front of Lambda rather than API Gateway
You have to keep in mind, once again, that all these actions have Indirect costs (man-days to migrate the code from one technical solution to another, training, etc.) and can "lock" you to a Cloud Provider; here we are touching on company policy.
The ability to adapt the fleet of machines to the load is a basic feature; it is almost the primary reason for the Cloud's existence. Over time, either natively or via external functions, we can drive AutoScaling groups to do many things (a sketch of the first point follows the list), such as:
- Remove all or part of the machines outside working hours
- Change the size of the Instances according to the periods
- Scale In/Out (remove/add machines) according to business metrics and not just the CPU (which is rarely a relevant metric, I remind you)
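As an illustration of the first point, here is a minimal sketch using the Auto Scaling API with boto3. The group name, sizes and cron expressions (UTC) are assumptions to adapt to your own environments:

```python
import boto3

# Sketch: scheduled actions to shut a non-production Auto Scaling group at night
# and bring it back in the morning, on working days.
autoscaling = boto3.client("autoscaling", region_name="eu-west-3")

# Scale to zero every weekday at 20:00 UTC
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="staging-web",          # placeholder
    ScheduledActionName="stop-at-night",
    Recurrence="0 20 * * 1-5",
    MinSize=0,
    MaxSize=0,
    DesiredCapacity=0,
)

# Scale back up every weekday at 06:00 UTC
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="staging-web",
    ScheduledActionName="start-in-the-morning",
    Recurrence="0 6 * * 1-5",
    MinSize=2,
    MaxSize=4,
    DesiredCapacity=2,
)
```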
If you are ready to commit for the long term (1 or 3 years), reserving resources is the most direct way to obtain at least 30% savings (with an upfront payment or not) on services such as:
- EC2
- RDS
- Elasticache
- DynamoDB
The cases of DynamoDB and Elasticache are a little particular, the first requiring you to reserve Read/Write capacity rather than "hardware" resources, and the second imposing an upfront payment in all cases.
As a reminder, reservations work with time units; they are not specific machines with a label bearing your name in an AWS datacenter.
Let's take an example to make things clear.
Imagine that you reserve, for 1 year, 1 `m5.xlarge` "Instance" for EC2. You then get 4 units (see this table) per month in the `m5` family. Taking a month as an average of 732 hours, you will therefore be entitled to the reduced price for:
- 732h of `m5.xlarge`
- 1464h of `m5.large` (i.e. 2 Instances)
- 366h of `m5.2xlarge` (i.e. 1 Instance for half the month)
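To make the arithmetic explicit, here is a tiny sketch using relative weights within the family. The weights are relative to the reserved size, as implied by the example above; the official normalization table remains the reference for your own family:

```python
# Reserved hours are a budget of "size-weighted hours" within a family.
# Weights below are relative to the reserved size (m5.xlarge here), matching the
# example above; check AWS's normalization factors for the official values.
HOURS_PER_MONTH = 732
weights = {"m5.large": 0.5, "m5.xlarge": 1, "m5.2xlarge": 2}

budget = 1 * weights["m5.xlarge"] * HOURS_PER_MONTH  # one m5.xlarge reserved

for size, weight in weights.items():
    print(f"{budget / weight:.0f} hours of {size} at the reserved rate")
# → 1464 hours of m5.large, 732 of m5.xlarge, 366 of m5.2xlarge
```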
The units are limited to a family; it is exactly the same for Savings Plans of the EC2 Instance type (the savings are exactly the same, to the decimal point, as for Reservations). If you don't want to be constrained, and want to think only in terms of compute units and not of "family", Savings Plans of the Compute type are the new way to make reservations at AWS, regardless of family or type. The discounts apply whatever the type, which allows you to change families whenever you want. This comes with smaller savings, though.
AWS makes its unsold capacity available (sort of) at knockdown prices (up to -70%): what they call Spot Instances. The principle is simple: you set a maximum price you are willing to pay, and as long as the current price is lower and AWS has capacity, you are likely to get your Instance. If one of the criteria is no longer met, AWS warns you 2 minutes before reclaiming the Instance. To mitigate the risks, you can create fleets, that is to say groups of Instances of different types.
It's an extremely powerful way to save money, provided that your use case tolerates losing nodes; it is therefore very often used to run tests within a CI, for example.
Another aspect to take into account, which makes it possible to limit the risks even more, is mixing On-Demand Instances (paid at the advertised public price) and Spot Instances. This is now possible with Launch Templates, which define what the Instances of an Auto Scaling group should look like. You define the number of On-Demand Instances that will constitute your base, then the percentage of On-Demand above this base relative to Spot Instances. If your On-Demand base matches your reservations, the savings are even greater, with controlled risk.
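As an illustration, here is a hedged sketch of what the relevant part of such an Auto Scaling group creation could look like with boto3. The names, subnets, types and proportions are placeholders, not a recommendation:

```python
import boto3

# Sketch: an Auto Scaling group mixing a fixed On-Demand base (ideally covered
# by reservations) with Spot Instances above it. All names/values are placeholders.
autoscaling = boto3.client("autoscaling", region_name="eu-west-3")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-mixed",
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=4,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",  # placeholders
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "web",    # placeholder
                "Version": "$Latest",
            },
            # Several types to give Spot more pools to draw from
            "Overrides": [
                {"InstanceType": "m5.large"},
                {"InstanceType": "m5a.large"},
                {"InstanceType": "m4.large"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 2,                  # the base, e.g. your reservations
            "OnDemandPercentageAboveBaseCapacity": 25,  # 25% On-Demand / 75% Spot above the base
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```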
Good practice is to isolate environments (prod, staging, etc.) in different AWS accounts. Isolating accounts from each other allows, among other things:
- compartmentalizing for security
- clearly separating costs, which makes analysis easier (an outgoing flow is an outgoing flow; AWS will not tell you which VPC it comes from, making it difficult to re-invoice between cost centers in that case)
Besides that, it is very interesting to group all these accounts within a single organization, with an identified payer account (called the consolidating account) holding no resources (this simplifies the analyses: that way it does not appear twice, as a regular account and as the consolidating account).
There are several advantages to doing this, with the payer account aggregating the entire bill:
- reservations are pooled at the consolidating account level (unless deliberately configured otherwise), which means that units unused on one side can be used on the other; nothing is paid for nothing
- you benefit from volume-based pricing (volume in the sense of quantity), known as Blended Costs
- AWS provides a single consolidated invoice of all costs, which will please the accounting department, who will not have to manage N PDFs.
The Blended Costs are a way of aggregating all the volumes of certain services and thus benefiting from lower price brackets.
Let's take S3 as an example to explain this.
Imagine two accounts A and B linked to a consolidating account Z. The S3 usages are respectively:
- 50TB in eu-west-3 for account A
- 30TB in eu-west-3 for account B
If the accounts were billed separately, they would pay:
- account A: 50TB x $0.024/GB = $1,200
- account B: 30TB x $0.024/GB = $720
- total: $1,920
With Blended Costs, the consolidating account Z will actually pay:
50TB x $0.024/GB + 30TB x $0.023/GB = $1,890
The underlying principle being that for S3, in eu-west-3, the first 50 TB cost `$0.024/GB` and `$0.023/GB` beyond that. So we save $30, without any action other than having linked accounts A and B to Z.
(see HERE for S3 prices)
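The same arithmetic as a small sketch, using the eu-west-3 S3 Standard tiers quoted above (1 TB counted as 1,000 GB to keep the example's round numbers):

```python
# Sketch of the S3 Standard tier arithmetic above (eu-west-3 prices, per GB).
TIER_1_TB = 50        # first 50 TB
PRICE_TIER_1 = 0.024  # $/GB
PRICE_TIER_2 = 0.023  # $/GB beyond 50 TB

def s3_standard_cost(tb: float) -> float:
    tier1 = min(tb, TIER_1_TB) * 1000 * PRICE_TIER_1
    tier2 = max(tb - TIER_1_TB, 0) * 1000 * PRICE_TIER_2
    return tier1 + tier2

separate = s3_standard_cost(50) + s3_standard_cost(30)  # accounts billed apart
blended = s3_standard_cost(50 + 30)                     # consolidated under Z
print(f"{separate:.0f} vs {blended:.0f}, saving {separate - blended:.0f}")
# → 1920 vs 1890, saving 30
```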
Now we come to the crux of the matter: Tags. To understand costs, or even anticipate them, it is essential to categorize things as much as possible, and this means tagging every resource that can be tagged (network flows cannot be, for example).
Here are some useful tags:
environment
project
businessUnit
costCenter
scope
team
There is a very important point to know: tags are case-sensitive, at both the key and the value level, so you have to be careful not to end up with `prod`, `Prod`, `production`, etc., which represent the same value but create noise, or even hide certain resources from your reports.
To help you catch up, in case you have resources created by hand, AWS provides a tool that is ultimately quite little known, the [Tag Editor](https://docs.aws.amazon.com/ARG/latest/userguide/tag-editor.html).
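These inconsistencies can also be spotted by script via the Resource Groups Tagging API; here is a hedged sketch with boto3, where the `environment` tag key and the region are assumptions (and, the key being case-sensitive too, you may want to repeat it for key variants):

```python
import boto3
from collections import Counter

# Sketch: inventory every value used for the "environment" tag in one region,
# to spot prod/Prod/production style inconsistencies in reports.
tagging = boto3.client("resourcegroupstaggingapi", region_name="eu-west-3")

values = Counter()
paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate(TagFilters=[{"Key": "environment"}]):
    for resource in page["ResourceTagMappingList"]:
        for tag in resource["Tags"]:
            if tag["Key"] == "environment":
                values[tag["Value"]] += 1

for value, count in values.most_common():
    print(f"{value}: {count} resources")
```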
This article is ultimately only a sketch of everything that can be done, without really going into the details of how to do it. It only represents a synthesis of my reflections, the conclusions of which are certainly debatable (and perhaps obsolete by now, the Cloud world moving so fast). You may have noticed that something is missing from this article: what about FinOps when using Kubernetes? At the time when I was still doing FinOps audits, although using K8S in production, I had never had to seriously dig into the question. However, I think the general principles I mentioned can be applied, in part, to an environment with shared resources such as Kubernetes.
Thank you for reading this long article to the end; I hope it has been useful to someone. It has at least allowed me to get out of my head something that had been lying around for a long time.
Enjoy