@DinisCruz
Last active December 20, 2024 16:18
Raw data created when writing the "Empowering Devs, DevOps and Finance Teams with GenAI-driven Cloud Cost Intelligence" article https://www.linkedin.com/pulse/empowering-devs-devops-finance-teams-genai-driven-cloud-dinis-cruz-sxybe/

Cloud Cost Analysis: Connecting Technical Usage to Business Value

Core Concept

This project aims to create a system that connects cloud usage costs to specific projects and business owners, driven by the finance team. The key objective is not cost reduction (though that may be a side effect) but rather to maximize usage efficiency and empower development teams to improve their pipelines.

Project Overview

Key Objectives

  • Connect cloud spend with business usage and strategy
  • Map project owners and business sponsors to cloud costs
  • Create feedback loops between technical and business teams
  • Empower development teams to optimize resource usage
  • Enable efficient environment management

Unique Value Proposition

The finance team's perspective drives this project because they:

  • Have access to the cloud billing data
  • Are responsible for paying the bills
  • Have visibility into organizational structure
  • Understand project funding and business cases

Technical Implementation

Data Collection Pipeline

  1. Initial Data Collection

    • Create service to fetch data from cloud provider APIs
    • Store raw data in blob storage/file system (not databases)
    • Collect data 5-6 times daily (due to billing data delay)
    • Only download delta since last collection
    • Store first transformation in cloud storage
  2. Data Processing Workflow

    • Implement CI/CD pipeline for data processing
    • Use FastAPI for invoking each step
    • Log detailed information about pipeline execution
    • Create clean, efficient, and cost-effective solutions
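
To make the collection steps above concrete, here is a minimal sketch assuming AWS, where the billing delta comes from the Cost Explorer API and the raw response goes straight to blob storage (S3). The bucket name, key layout, tag key, and collection window are hypothetical placeholders, not part of the original design.

# Minimal sketch (assumes AWS): fetch a delta of cost data and store the raw
# JSON in blob storage, with no database involved. Bucket/key names are
# hypothetical placeholders.
import json
from datetime import date, timedelta

import boto3

RAW_BUCKET = "cloud-cost-raw-data"  # hypothetical bucket for raw billing dumps

def collect_cost_delta(start: date, end: date) -> None:
    ce = boto3.client("ce")  # AWS Cost Explorer (note: each call is itself billed)
    s3 = boto3.client("s3")

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"},
                 {"Type": "TAG", "Key": "project"}],  # assumes a 'project' cost tag
    )

    # Store the untouched provider response so every later transformation can
    # run "offline" against this record.
    key = f"raw/aws/cost-and-usage/{end.isoformat()}.json"
    s3.put_object(Bucket=RAW_BUCKET, Key=key,
                  Body=json.dumps(response["ResultsByTime"]).encode())

# Example: run a handful of times a day, only asking for the window since the last run.
collect_cost_delta(start=date.today() - timedelta(days=1), end=date.today())

Because the billing data is delayed and the API calls themselves cost money, a scheduler would invoke this five or six times a day rather than continuously.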

Data Transformations

  1. High-Level Analysis

    • Process top-level user information
    • Analyze usage patterns
    • Create feedback loop for tagging resources
  2. Detailed Architecture Analysis

    • Gather additional cloud provider information
    • Connect compute resources to networks
    • Map system dependencies
    • Build relationship graphs (nodes and edges)
  3. Business Context Integration

    • Collect project status information
    • Connect to issue tracking systems (JIRA, GitHub, GitLab)
    • Link costs to actual development activity
    • Integrate business strategy information
    • Include risk register data
  4. Organization Structure

    • Integrate org chart information
    • Map responsibilities and accountability
    • Define ownership of resources
    • Track impact on different departments
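
As an illustration of the "relationship graphs (nodes and edges)" item above, here is a minimal sketch of how the billing, architecture, and organisational data could be merged into one simple in-memory graph. Every resource, project, and owner name is hypothetical, and a graph database or networkx would work just as well.

# Minimal sketch: build a nodes-and-edges graph that links cloud resources to
# projects, owners, and business sponsors. All names are hypothetical.
from collections import defaultdict

nodes: dict[str, dict] = {}                          # node_id -> attributes
edges: defaultdict[str, list] = defaultdict(list)    # node_id -> [(relation, node_id)]

def add_node(node_id: str, **attrs) -> None:
    nodes[node_id] = attrs

def add_edge(src: str, relation: str, dst: str) -> None:
    edges[src].append((relation, dst))

# Cloud usage (from the billing data)
add_node("ec2:i-0abc123", kind="compute", monthly_cost=412.50)
# Architecture (from the extra cloud provider calls)
add_node("vpc:vpc-9def", kind="network")
add_edge("ec2:i-0abc123", "runs_in", "vpc:vpc-9def")
# Business context (from project and org-chart data)
add_node("project:checkout-v2", kind="project", sponsor="Head of Payments")
add_edge("ec2:i-0abc123", "funded_by", "project:checkout-v2")
add_edge("project:checkout-v2", "accountable", "org:finance-sponsor-01")

# Walk the graph: which project and sponsor does each resource charge to?
for resource, links in edges.items():
    for relation, target in links:
        if relation == "funded_by":
            print(f"{resource} -> {nodes[target]['sponsor']} ({target})")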

Project Lifecycle Management

  1. Cost Categorization

    • Development costs
    • Prototyping costs
    • Testing costs
    • Production costs
  2. Environment Management

    • Create environments on demand
    • Scale resources based on actual needs
    • Implement efficient spin-up/spin-down processes
    • Optimize for different stages (dev, test, prod)
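
A minimal sketch of the cost categorisation above, assuming each billing line item carries an "environment" tag; the tag key, stage names, and fallback rule are assumptions.

# Minimal sketch: bucket cost line items into lifecycle stages using an assumed
# "environment" tag; untagged spend is surfaced for the tagging feedback loop
# rather than silently dropped.
from collections import Counter

LIFECYCLE_STAGES = {"dev": "development", "proto": "prototyping",
                    "test": "testing", "prod": "production"}

def categorise(line_items: list[dict]) -> Counter:
    totals: Counter = Counter()
    for item in line_items:
        env = item.get("tags", {}).get("environment", "").lower()
        stage = LIFECYCLE_STAGES.get(env, "untagged")   # feedback-loop candidate
        totals[stage] += item["cost"]
    return totals

# Example with hypothetical line items
items = [{"cost": 120.0, "tags": {"environment": "prod"}},
         {"cost": 35.5,  "tags": {"environment": "test"}},
         {"cost": 80.0,  "tags": {}}]
print(categorise(items))  # Counter({'production': 120.0, 'untagged': 80.0, 'testing': 35.5})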

Technical Architecture Considerations

Infrastructure Design

  1. LocalStack Integration

    • Use LocalStack to replicate cloud environments
    • Enable local development and testing
    • Support multi-cloud deployment scenarios
    • Abstract cloud provider specifics
  2. Deployment Flexibility

    • Support multi-region deployment
    • Enable multi-availability zone setups
    • Allow for multi-cloud scenarios
    • Implement clean environment deployments
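
A minimal sketch of the LocalStack idea above: the same boto3 storage code is pointed either at AWS or at a local endpoint purely through configuration, so application code does not need to know which one it is talking to. The endpoint URL assumes LocalStack's default edge port, and the bucket name is made up.

# Minimal sketch: the same storage code runs against AWS or LocalStack,
# selected purely by configuration. Assumes LocalStack's default edge port
# (4566); the bucket name is hypothetical.
import os
import boto3

def storage_client():
    endpoint = os.environ.get("AWS_ENDPOINT_URL")      # e.g. "http://localhost:4566" for LocalStack
    return boto3.client("s3", endpoint_url=endpoint)   # endpoint_url=None means real AWS

s3 = storage_client()
s3.create_bucket(Bucket="cost-pipeline-dev")            # same call against either target
s3.put_object(Bucket="cost-pipeline-dev", Key="hello.json", Body=b"{}")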

CI Pipeline Optimization

  1. Environment Management

    • Automatic resource provisioning
    • Efficient teardown procedures
    • Right-sized resource allocation
    • Cost-effective testing environments
  2. Development Workflow

    • Identify minimal viable setup
    • Support component-level testing
    • Enable integration testing
    • Provide production-like environments

Business Integration

Stakeholder Analysis

  1. Business Strategy Alignment

    • Connect to strategic pillars
    • Integrate with risk management
    • Support value generation projects
    • Enable risk reduction initiatives
  2. Data Sources

    • Business systems integration
    • Strategy documentation
    • Risk registers
    • Project management tools

Personalization Layer

  1. Stakeholder Personas

    • Consider cultural context
    • Account for language preferences
    • Assess technical expertise
    • Understand business knowledge
    • Map RACI responsibilities
  2. GenAI Integration

    • Transform technical insights for different personas
    • Create personalized reports
    • Generate context-aware documentation
    • Support multiple languages and cultures
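
A minimal sketch of the persona transformation above: the graph-derived insight and a persona description are folded into a single prompt and handed to an LLM. The client, model name, and persona fields are illustrative assumptions rather than a fixed choice.

# Minimal sketch: turn the same cost insight into persona-specific reporting by
# putting the persona description and the graph-derived facts into the prompt.
# Model name and persona fields are illustrative assumptions.
from openai import OpenAI

def personalised_report(insights: str, persona: dict) -> str:
    prompt = (
        f"You are preparing a cloud cost update for a {persona['role']} "
        f"({persona['raci']}) who prefers {persona['language']} and has "
        f"{persona['technical_level']} technical depth.\n"
        f"Rewrite the following insights for that reader:\n{insights}"
    )
    client = OpenAI()  # assumes an API key in the environment
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

report = personalised_report(
    insights="Test environments for project checkout-v2 cost 3x production last month.",
    persona={"role": "business sponsor", "raci": "accountable",
             "language": "Portuguese", "technical_level": "low"},
)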

Business Model Potential

Service Offering

  1. Value Proposition

    • Cost-effective alternative to in-house development
    • Specialized expertise and tooling
    • Continuous improvement and updates
    • Best practices implementation
  2. Deployment Model

    • Run within client environment
    • Maintain data privacy
    • Ensure security compliance
    • Support multiple cloud providers
  3. Client Benefits

    • Better resource utilization
    • Improved development practices
    • Enhanced business alignment
    • Reduced technical debt

Implementation Considerations

Security and Privacy

  1. Data Protection

    • Keep sensitive data within client environment
    • Implement secure access controls
    • Support data isolation requirements
    • Enable private cloud deployments
  2. Operational Security

    • Clean environment deployments
    • Account isolation
    • Minimal permanent infrastructure
    • Secure configuration management

Technical Requirements

  1. Infrastructure Needs

    • Minimal DNS entries
    • Essential storage requirements
    • Clean cloud account deployment
    • Automated setup and teardown
  2. Integration Points

    • Cloud provider APIs
    • Business systems
    • Development tools
    • Monitoring systems

Future Considerations

  1. Scalability

    • Support for multiple cloud providers
    • Cross-region deployment
    • Multi-account management
    • Enterprise-scale operations
  2. Evolution

    • Enhanced automation capabilities
    • Advanced analytics
    • Deeper business integration
    • Expanded platform features

Cloud Cost Analysis: Raw Transcript and Ideas

Initial Concept

I want to write an article that shows the workflow that we are handling with ChatGPT and other algorithms describing workflows. The one I want to describe is the workflow of a finance team using the detailed cloud bill to understand what's going on and, more importantly, to connect the usage to not just the technical team doing it, but the business owners of that.

One of the amazing assets that exists is the cloud bill itself - and I come from the cybersecurity side of things, which is always worried about what's going on, who's doing what, and whether any malicious activity is happening in our cloud environment - so that visibility is really important.

Core Project Ideas

Business Connection and Project Mapping

Starting from the project, the objective is that this project is going to deliver new service Y to customers, which will have positive benefits of either X revenue, more users, more brand awareness, or whatever metrics the business is trying to achieve.

If you start from there, what happens is you need to then map out who are the project owners, who actually are responsible for it, and also who are the business sponsors, because ultimately they're the ones who should own the budget.

Data Collection and Storage

If you look at the CI pipeline, the bits that need to be put in place - and, in a way, the bits that need to be funded to develop this, because fundamentally this is a project - you need a way to go and get the data from Amazon or another cloud provider. Now, every cloud provider does provide APIs to dump the usage data. Ironically, they charge for it, but that's a different story.

You need a pipeline that goes and retrieves the information and puts it on a database. Now, here is usually where things start to go wrong, because I'm a big fan of very clean, actually efficient, cheap, but effective solutions. Usually, the first temptation is to take the data and put it in the database. I think that would be a massive mistake, because as soon as you do that, the database is expensive, needs to be run, you have state, you're actually adding a lot of complexity to the system.

So my preference here is: you go in, you basically make a REST call, grab the information from the cloud provider, and then put it in a blob store, in storage, in a file system, where you now have a really nice record of the cost data that exists in the cloud provider. And then you can do all further analysis in, effectively, offline mode, because you already have the information.

Data Collection Timing

Unfortunately, most of the cloud providers, as far as I understand, don't provide real-time data about billing - which is something they get away with but shouldn't - so the data is delayed. And then they also charge for making those calls. There's a cost associated with each call, which can become substantial if you make them all the time.

So that actually means that you don't need to make this call in real time because the data won't be there in real time. So you can literally probably just connect and collect the data, let's say, five or six times a day. And you can also be efficient by only downloading the data since the last time you collected it. So again, you reduce the amount of information that comes from the cloud provider.

Data Transformation and Analysis

This workflow of taking data, creating a schema, and transforming the data into that schema is fundamentally the key element of this. In practice, we have things we call ETL/ELT: extract, transform, load, or extract, load, transform. It's very important that at each of these steps you have a CI pipeline. In this case we'll develop a FastAPI service that allows each step to be invoked on its own, and each one logs in detail exactly what is happening in the pipeline.
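
As a rough sketch of that shape - one FastAPI endpoint per pipeline step, each logging what it did - something like the following could work; the endpoint names and step bodies are hypothetical placeholders.

# Minimal sketch: a FastAPI service where each ELT step is invoked on its own
# endpoint and logs what it is doing. Step names and bodies are hypothetical.
import logging

from fastapi import FastAPI

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cost-pipeline")

app = FastAPI(title="cloud-cost pipeline")

@app.post("/steps/extract")
def extract():
    log.info("extract: pulling billing delta from the cloud provider")
    # ... call the collection code and write raw JSON to blob storage ...
    return {"step": "extract", "status": "ok"}

@app.post("/steps/transform")
def transform():
    log.info("transform: mapping raw billing data into the cost schema")
    # ... read the raw file, apply the schema, write the transformed file ...
    return {"step": "transform", "status": "ok"}

@app.post("/steps/load-graph")
def load_graph():
    log.info("load-graph: adding nodes and edges for projects and owners")
    return {"step": "load-graph", "status": "ok"}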

And this is where it gets interesting, because the next transformation, we probably will have two sets of transformations:

  • We're going to have one that most likely can give us high-level data about what's going on, you know, the top-level users, the top-level information
  • And, in fact, you know, this is where we start to see how well things are tagged. We're going to have to create a feedback loop to tag things
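
A minimal sketch of that first, high-level pass, using hypothetical field names: aggregate spend per service and keep a list of untagged resources to feed the tagging feedback loop.

# Minimal sketch: high-level pass over transformed billing records, with
# hypothetical field names. Produces top spenders plus the untagged resources
# that the tagging feedback loop should chase.
from collections import defaultdict

def high_level_pass(records: list[dict]) -> tuple[list, list]:
    spend_by_service: defaultdict[str, float] = defaultdict(float)
    untagged: list[str] = []
    for rec in records:
        spend_by_service[rec["service"]] += rec["cost"]
        if not rec.get("tags"):                          # candidate for the feedback loop
            untagged.append(rec["resource_id"])
    top = sorted(spend_by_service.items(), key=lambda kv: kv[1], reverse=True)
    return top, untagged

top, untagged = high_level_pass([
    {"service": "AmazonEC2", "cost": 910.0, "resource_id": "i-0abc", "tags": {"project": "checkout-v2"}},
    {"service": "AmazonRDS", "cost": 640.0, "resource_id": "db-01", "tags": {}},
])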

Additional Data Requirements

There'll be another set of information that we don't have, that we're going to have to go and get. This is information that is not in the billing data. And this will be, for example, information to do with:

  • The architecture
  • What's connected to what
  • You might have, oh, I've got VPC usage, then I have EC2 usage, for example
  • Compute usage and network usage

We will need to go back to the cloud provider and make a number of more requests to start to connect those dots and saying, all right, you know:

  • Where is this compute located?
  • Which system is actually using it?
  • Which system is using this resource?
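
A minimal sketch of those extra lookups, assuming AWS and boto3: describe each EC2 instance that showed up in the bill and record which VPC and subnet it sits in, so it can later become edges in the graph. The instance ID is hypothetical.

# Minimal sketch (assumes AWS/boto3): enrich billed compute resources with
# architecture facts that are not in the billing data, e.g. which VPC an
# instance runs in. Instance IDs here are hypothetical.
import boto3

def describe_instances(instance_ids: list[str]) -> dict[str, dict]:
    ec2 = boto3.client("ec2")
    result: dict[str, dict] = {}
    response = ec2.describe_instances(InstanceIds=instance_ids)
    for reservation in response["Reservations"]:
        for inst in reservation["Instances"]:
            result[inst["InstanceId"]] = {
                "vpc": inst.get("VpcId"),
                "subnet": inst.get("SubnetId"),
                "tags": {t["Key"]: t["Value"] for t in inst.get("Tags", [])},
            }
    return result

# Later, these facts become edges in the graph: instance -> runs_in -> vpc
architecture = describe_instances(["i-0abc123"])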

Business Context Integration

The importance of getting information from the business cannot be overstated; it's usually where things go wrong, and it's usually where we miss the big trick. We can use the same pipeline - this time with a slightly different cadence, because it can probably run once a day - where an API goes to existing business data points or systems and starts to collect business information about what's happening, including information about the business strategy.

So: what are the big pillars, what are the big things the business is trying to do, and even what risks the business has. The risk register - including cybersecurity - is also a really good source of information about why things are happening.

One of the key concepts that we should have is that projects that are being funded by the company should aim not just at generating revenue, not just at creating a value, but you can also have projects that are designed to reduce risk, eliminate a risk, or basically reduce the exposure to a particular kind of event that might occur.

Data Storage Approach

And if you think about it, the workflow is the same. You have an API that you're going to make some calls to another system to grab that information, and then you grab that information and you save it in the cloud storage, which basically means that we now have a JSON representation of that strategy.

And, of course, it would be great in the future if that was more natively supported by the systems. But I would expect that, in the beginning, this would be a somewhat crude transformation from Excel spreadsheets or from systems like JIRA, Monday, or others that actually hold that information.
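
A minimal sketch of that crude first version: read the strategy pillars from a CSV export (the file name and column names are assumptions) and save a JSON representation to the same blob storage used for the billing data.

# Minimal sketch: a crude first transformation of business-strategy data from
# a spreadsheet export into JSON in blob storage. File name, column names, and
# bucket are assumptions.
import csv
import json

import boto3

def load_strategy(csv_path: str = "strategy-pillars.csv") -> list[dict]:
    with open(csv_path, newline="") as fh:
        return [
            {"pillar": row["pillar"], "owner": row["owner"], "risk_ref": row.get("risk_ref", "")}
            for row in csv.DictReader(fh)
        ]

pillars = load_strategy()
boto3.client("s3").put_object(
    Bucket="cloud-cost-raw-data",                 # hypothetical bucket
    Key="business/strategy/pillars.json",
    Body=json.dumps(pillars).encode(),
)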

Organizational Structure Integration

Now we start to have some nice data in our graph. We have:

  • The data from the cloud usage
  • The data from the projects
  • The data from the business
  • And then it's all connected

One more piece of information that is very important is also the org chart, or at least the part of the org chart that's relevant to the data. And the org chart will allow us to connect the dots. It allows us to start seeing:

  • Who is responsible
  • Who's accountable
  • Who owns it
  • Who is impacted by this

Because ultimately the cost of what's happening is going to be given to them.

Project Lifecycle Understanding

Another element to take into account is the life cycle of a project. So, what I mean by that is that we need to start distinguishing what is the cost in:

  • Development
  • Prototyping
  • Testing
  • Production

So, in a way, the cost in production is what will be the long-term cost of the application, where the cost in development is what is the setup that needs to be created while there's active development of the project.

Team Empowerment

Now we'll start to be able to get to the point where we are now connecting the costs to the actual teams, and the objective is to empower those teams to make better use of their budget, to make better use of, in a way, the money they're spending, and to give them the accountability to really think heavily about the resources they're using.

They need to know that if you start a database, then that is going to have an extended cost over development, over testing, and in production. And you start to understand the power of these feedback loops by empowering the dev teams and the product teams to start to make good decisions based on their costs.

And again, to stress: the objective is not cost reduction - although cost reduction is exciting, right? The objective is to empower the teams to make good decisions and change their architectures to reflect much leaner and more effective solutions, so that you then have much better architecture and you can accelerate development.

Platform Creation and Reuse

The other very good side effect of this workflow is that it starts to reward the creation of platforms and of infrastructure that isn't over-fitted to specific use cases, and it prevents teams from reinventing things over and over again. The point is that if somebody has figured out a good pattern - for example, the one we are describing here, where you have a workflow for collecting data - you then want a situation where other teams that need the same pattern can reuse it.

Environment Management

What I have to focus on in this project is the idea and the workflows of creating environments on demand for development or testing. The principle is simple:

  • You should only have resources active and charging when you need them
  • You should only have the resources at the scale that you need at that moment in time
  • Just because you need a ridiculously large database in production doesn't mean you should have one in test

This CI pipeline - this focus on making sure you have environments that can spin up when you need them and spin down when you don't - is actually really hard to do, partly because cloud environments are not very well designed for this. In a way, they really feel designed for you to spin something up, spend time making it work, and then just stay there. That is also a problem if any of these things are done manually, which is what we want to avoid, because when something is done manually it is not repeatable.

And there's a ridiculous number of little corner cases you need to address in order to be able to spin something up and then spin it down again. The advantage of this is cost, but more importantly, the advantage is that you improve the engineering pipeline and you improve the ability to deploy the solution anywhere.

So you actually have more business models to sell your solution if needed, but also you have a much better ability to:

  • Deploy
  • Test
  • Check variations

And, super critically, to have a lot of resilience, because you can then start thinking more strategically about where you deploy it:

  • Multi-region
  • Multi-environment
  • Multi-availability zone
  • Even multiple clouds

All of this becomes realistic because you can just spin up and spin down the solution.
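
As a small illustration of the spin-down side, here is a sketch assuming AWS, boto3, and an "environment" tag: it stops every running instance tagged as a test or QA environment. A real teardown would cover far more resource types.

# Minimal sketch (assumes AWS/boto3 and an "environment" tag): stop all
# instances belonging to test/QA environments so they are not charging while idle.
import boto3

def spin_down_test_environments() -> list[str]:
    ec2 = boto3.client("ec2")
    response = ec2.describe_instances(
        Filters=[{"Name": "tag:environment", "Values": ["test", "qa"]},
                 {"Name": "instance-state-name", "Values": ["running"]}]
    )
    instance_ids = [inst["InstanceId"]
                    for res in response["Reservations"]
                    for inst in res["Instances"]]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids

stopped = spin_down_test_environments()
print(f"Stopped {len(stopped)} test/qa instances")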

Business Case for Implementation

And the interesting part of this is that, because we're driving this from the finance team, suddenly it becomes almost a simple business decision to say, look:

  • The current team already spends X amount per week, per month, on environments that they don't spin down
  • Let's fund the resources for the team to fix that - to improve their CI pipeline

This is ultimately one of the key objectives here: to empower the teams to have the resources and funding required to work on what are sometimes called non-functional requirements - the environments that the developers use to test their solution.

Development Evolution

And the other nice side effect - and I have a lot of experience here, I have seen this many times - is that when you go into this workflow of continuously improving your CI pipeline, you start to go on this interesting quest: what is the smallest possible setup and environment that I need to run:

  • My service
  • My application
  • My infrastructure

Because what's super critical here is to start to allow each team, when they're testing their own components, to, in a way, to see how that component interacts with the wider range of microservices. So that means that you might have a team that is working on a particular service, and they should be able to spin up, in a way, all the dependencies required by that service. And sometimes it might even go one level deep.

But the power of that is that suddenly that team, when they run the integration tests, when they run their confirmations, if everything is fine with their changes, they actually are hitting what are most likely very good representations of the final solution. So you start to have a massive halo effect, because:

  • You now reduce cost a lot
  • You now allow much more dynamic spinning up and spinning down of environments
  • Especially test or QA environments

Eventually these could all run on a laptop, in a smaller environment, or in a Kubernetes cluster, which is always a good option, because you can now have a workflow that spins up the key components you have, which have all been developed in this workflow.

LocalStack Integration

One tool that I've used recently, which actually helped us a lot, is called LocalStack, because LocalStack replicates some of the cloud environments. I believe they support AWS quite well, and I believe they also have some Azure and GCP coverage. What that allows is development where your code doesn't know whether it's talking to a full-blown production environment or to a local one. So that helps a lot. It also helps with immediate cloud deployment everywhere.

GenAI Integration

If we now take a step back and we see what we have, we now have a way, when this works effectively, to connect the cloud spend with the business usage, and especially with the business strategy. And this is where the second part of the GenAI comes along.

So up until now, the GenAI has been helping us to:

  • Create the briefs
  • Make sure everybody understands what's happening
  • Probably create some of the schemas
  • It might help with some of the data transformations

But I would argue that, because the data in these cases is probably quite resilient, we might not need it there. Where it becomes super critical is in the next part, which is the transformation of the insights we've got into actually relevant materials for the persona, i.e. the target audience, the stakeholder.

And this is where, in a way, we have to start thinking, not necessarily on a particular individual, but what is the persona that that individual has in this particular project, where we should take into account:

  • The person's culture
  • The person's language
  • The person's technical skills
  • The person's business skills
  • Basically, their knowledge

But also, more importantly, what is their role in that particular project.

And this is key, because you might have somebody who, in one project, sits in a particular corner of the RACI model: responsible, accountable, consulted, or informed. So, what is that person actually doing in this project?

So, the power of GenAI is that we can:

  • Start to collect this data
  • Collect the graph of the relationships
  • Collect the mappings
  • Collect the insights

And then we say, OK, now, translate this for this person, for this persona. And that's where the GenAIs are very good, because we can fit all of this in the prompt.

Business Model Potential

The final part of this is that everything we talk about here is actually a service that could be provided by a startup. Because if you build infrastructure and you build the workflows to connect these dots, the reality is that it will be more cost-effective for a particular company to use that startup, that service, than to build themselves.

So there's a lot of interesting business models that you can have here, where ultimately, from the finance team, or I guess from other parts of the business, but let's say if you're doing it for the finance team, you now can have a service that helps companies to:

  • Have a much better grip and much better control of the cloud costs
  • More importantly, they're able to connect the cloud spend, which is ultimately the development and engineering spend, with the business objectives and what they're trying to do

And this is where the infrastructure and the development that I've been describing become very critical for this startup as well. Because there will be a lot of friction if you need to send the data outside the company.

But if you can run this whole thing in the client environment, you just give it:

  • Some of the pipelines
  • Some of the transformations
  • Especially some of the business prompts, workflows, and UIs

But because they run it in their own environment, you immediately have:

  • A lot of security
  • A lot of privacy
  • Far fewer problems getting somebody to adopt the solution

Because there's very little risk of data being extracted and sensitive business information actually being exposed.

Clean Environment Deployment

Right. There's one final thing I forgot to say before, which is one of the ultimate objectives, I would argue, for the whole spin up and spin down is the ability to:

  • Rotate regions effectively
  • And, even more importantly, to be able to deploy everything in a clean AWS or other cloud environment - i.e. a fresh cloud account

The cloud providers these days make it very easy to create separate accounts. So one of the really powerful things is to be able to deploy the whole thing from scratch in such an environment. Usually, there have to be some leftovers:

  • There might be a pinned main DNS entry
  • Maybe a couple of storage resources

But the key is that you should be able to deploy the entire solution on a clean, pristine cloud account, which also means that this is the client deployment model.

“I need your help to transcribe and to capture the key points of the idea I'm about to describe today.” 00:12

ChatGPT said: ChatGPT Sure, go ahead and describe your idea. I'll transcribe and capture the key points for you.

You said: “Okay, so I want to write an article that shows the workflow that we are handling with ChatGPT and other algorithms describing workflows. The one I want to describe is the workflow of a finance team using the detailed cloud bill to basically understand what's going on and, more importantly, to connect the usage to not just the technical team doing it, but the business owners of that.” 00:39

ChatGPT said: ChatGPT Got it. So, you'd like to write an article about the workflow of a finance team that uses a detailed cloud bill to understand cloud usage and, importantly, to connect this usage to the business owners, not just the technical team.

You said: “Correct, because one of the amazing assets that exists in the cloud bill, and I come from the cybersecurity side of things, which is always worried about what's going on and who's doing what, and any malicious activities happening in the mix of our cloud environment, and that's really important. Thank you.” 00:19

You said: “That is correct. And in a way, so there's multiple parts of this. There is, you know, fundamentally a number of graphs and maps eventually, but graphs mainly that would need to be created. And what the element of this idea is, we're starting from the finance team who has access to the data. And in a way, it's the one paying it. So the finance team also has access to the organization and the projects that they're funding. So if you start at the end, the key objective here is to make sure that when the team is delivering a particular project and they make a business case for that project, they have an accurate representation of the cloud costs that A, they project using during development, and then during execution, and the actual budget and the actual cloud costs being analyzed. Basically, it has to be.” 01:03

ChatGPT said: ChatGPT So, the key idea is to leverage the finance team's access to cloud cost data to create graphs and maps that connect these costs to specific projects and business objectives, ensuring accurate representation of projected and actual cloud costs.

You said: “The objective of the project is that this project is going to deliver new services to customers to be Y, which will have positive benefits of either X revenue or more users or more brand awareness of whatever the metrics the business is trying to do.” 00:17

ChatGPT said: ChatGPT Right. So,

You said: “If you start from there, what happens is you need to then map out who are the project owners, who actually are responsible for it, and also who are the business sponsors, because ultimately they're the ones who should own the budget.” 00:17

ChatGPT said: ChatGPT Exactly. Mapping out the project owners and business sponsors is crucial since they're ultimately responsible for the budget.

You said: “that information, what is actually happening in the cloud costs. The advantage of the cloud costs is that they represent what happens. So basically what we want to do is we want to connect to the lifecycle of the project to start putting the allocation of costs right from the beginning. Now in practice, it's probably better to start with reality, but the key objective ultimately, or the key, I will say, work that we want is for these capabilities to be done right from the beginning of a project. So you start to track what's going on.” 00:49

ChatGPT said: ChatGPT So, the goal is to connect cloud costs to the project's lifecycle from the start, ensuring costs are allocated and tracked accurately from the beginning.

You said: “Yes, but in reality, a company already has the projects going on, so in practice, you would start with existing projects and use that to develop the technology.” 00:15

ChatGPT said: ChatGPT Absolutely. Starting with existing projects allows you to develop and refine the technology for tracking and allocating costs before rolling it out to new projects.

You said: “The key element of this is to have a CI pipeline, a continuous integration or probably even a continuous deployment workflow that handles the key parts of this. Now, when you look at GenAI, one of the premises that for me is very important is that because this is a very deterministic element, we need to be very careful where we actually put the GenAI workflows and where we have black magic or something that just happens or stuff that gets built that we don't understand. I think the real power of DLLMs is the ability to help us to develop simple solutions, solutions that are efficient, that are effective, and that's what we need to do here.” 00:42

ChatGPT said: ChatGPT So, integrating a CI/CD pipeline to handle the key parts of cost tracking is crucial. The real power of GenAI lies in developing simple, efficient, and effective solutions, avoiding unnecessary complexity.

You said: “Correct. And what I mean helping to develop is being a partner, is being an aid in the creation of these actual schemas, exactly, for example, what I'm doing right now, because I'm using GenAI to create the briefs that we are actually going to operate on.” 00:18

ChatGPT said: ChatGPT Exactly. GenAI can be a valuable partner in developing these efficient workflows and schemas, just like we're doing now with this brief.

You said: “So if you look at the CI pipeline, the bits that need to be put in place, and in a way, the bits that need to be funded to develop this, because fundamentally, this is a project, is you need a way to go and get the data from Amazon or a cloud provider. Now, every cloud provider does provide APIs to dump the user's data. Ironically, they charge for it, but that's a different story. So you need a pipeline that goes and retrieves the information and puts it on a database. Now, here is usually where things start to go wrong, because I'm a big fan of very clean, actually efficient, cheap, but effective solutions. So usually, the first temptation is to take the data and put it in the database. I think that would be a massive mistake, because as soon as you do that, the database is expensive, needs to be run, you have state, you're actually adding a lot of complexity to the system. So my preface here is you go in, you basically make a REST call, grab the information from the cloud provider, and then put it in a blob store, put it in the storage in a file system, where you now have, in a way, a really nice record of the data that exists in the cloud provider costs. And then you can do every further analysis in a way that offline mode, because you already got the information.” 01:38

ChatGPT said: ChatGPT So, the idea is to fetch usage data from cloud providers via APIs and store it in a blob storage or file system, avoiding complexities like databases. This approach keeps it efficient and cost-effective, allowing for offline analysis.

You said: “So that's the step one. Step one is go and get the data, have a service that collects. And here there's an important point to make, which is, unfortunately, most of the cloud providers, as far as I understand, don't provide real-time data about billing, which, you know, it's actually something to get away with. You shouldn't, but the data is delayed. And then, B, they also charge for making those calls. There's a cost associated on the call, which can become substantial if you make this all the time. So that actually means that you don't need to make this call in real time because the data won't be there in real time. So you can literally probably just connect and collect the data, let's say, five or six times a day. And you can also be efficient by only downloading the data since the last time you collected it. So again, you reduce the amount of information that comes from the cloud provider.” 00:59

ChatGPT said: ChatGPT Exactly. Since cloud billing data isn't real-time and comes with a cost for API calls, it's efficient to collect this data a few times a day, only fetching new data since the last collection to minimize costs and data volume.

You said: “Okay, but what's important is that that first transformation is also stored in our file storage, cloud-based of course, which means that we can now do our next set of analysis on top of that data. This already dramatically reduces the amount of data we need to process and also already gives us a much better dataset to operate than the original one.” 00:30


ChatGPT said: ChatGPT Absolutely. Storing that first transformation in cloud-based file storage allows for the next stage of analysis on a refined dataset, significantly reducing the data volume and improving the efficiency of further processing.

You said: “This workflow of taking data, creating a schema, transforming the data into that schema is fundamentally the key element of this. In practice, we have things we call ELT, extract, transform, load, extract, load, transform. It's very important at each of these steps. You have a CI pipeline, you have FAST, in this case we'll develop a FAST API that allows to invoke each step each time, and each one is actually very nice log and detailed on exactly what's happening in the pipeline.” 00:38

ChatGPT said: ChatGPT Exactly. Implementing a CI pipeline with each step logged and detailed is crucial. Using a FastAPI to invoke each step ensures clarity and efficiency in the Extract, Load, Transform process, forming a robust workflow.

You said: “And this is where it gets interesting, because the next transformation, we probably will have two sets of transformations. We're going to have one that most likely can give us high-level data about what's going on, you know, the top-level users, the top-level information, and, in fact, you know, this is where we start to see how well things are tagged. We're going to have to create a feedback loop to tag things, for example. So you've got one side of analysis that we should have all the information required, but in most cases, there'll be another set of information that we don't have, that we're going to have to go and get. So this is information that is not in the billing data. And this will be, for example, information to do with the architecture, to do with what's connected to what, to do with, you know, you might have, oh, I got a VPC user, then I have EC2 user, for example, or compute users and network users, that we will need to go back to the cloud provider and make a number of more requests to start to connect those dots and saying, all right, you know, where is this compute located? You know, which system is actually using it, or which system is using this resource?” 01:16

ChatGPT said: ChatGPT Exactly. The next transformation will involve two sets: high-level analysis and a deeper dive to gather missing architectural details. We'll have to make additional requests to the cloud provider to connect the dots, like identifying which systems are using specific resources.

You said: “Can you summarize, so far, the project?” 00:03

ChatGPT said: ChatGPT Sure! Here's a summary of the project so far:

Objective: Connect cloud usage costs to specific projects and business owners for better financial tracking and decision-making.

Data Collection: Use a CI/CD pipeline to periodically fetch cloud usage data via APIs and store it in a cloud-based blob storage.

Initial Transformation: Store the raw data in a structured format for offline analysis, keeping the process simple and efficient.

Further Analysis: Perform high-level and detailed analyses, identifying top-level usage patterns and gathering additional architectural details from the cloud provider to connect resources with specific systems.

GenAI Integration: Use GenAI to aid in developing and refining these workflows, ensuring solutions are simple, efficient, and effective.

This project aims to streamline cloud cost management and provide actionable insights to business owners and technical teams.

You said: “Very cool. One element to add is that this project is also driven by the finance team. And one of the key elements of this is not to reduce cost, but that will be a side effect. The cost is to maximize usage, and more important, is to empower the dev teams, or the DevOps teams, or the development engineers, to actually be able to spend time improving their pipelines. Because that will be one of the side effects of all of this, is that we're going to find out that a large reason why a lot of the cost exists and a lot of the things don't get shut down is because the CI pipeline of those projects does not have enough maturity to set up and destroy things. So things just stay in place.” 00:47

ChatGPT said: ChatGPT Absolutely! One of the key elements here is that the finance team is driving this project, not just to reduce costs, but to empower the DevOps teams by improving their CI/CD pipelines. This will ultimately allow them to optimize their workflows and better manage resources.

You said: “Cool. So, if we continue where we were on the transformation, we got to the point where we now started to have some context, i.e. our graph. Again, we need to build some nodes and edges, some relationships between this. It's time to have a bit of information about the resources consumed and what's happening. And this now becomes the critical part where we now need to create a feedback loop with the teams themselves so that they start to connect and give us information about what is actually happening in the projects, what state we are in the projects, what objectives are being achieved, what milestones. And some of this information we might actually get almost free if the team has a good issue ticketing system, using JIRA, using GitLab, using Google issues, GitHub issues. But it's very critical that we start to connect the costs with actual activity that is happening by the dev team.” 01:04

ChatGPT said: ChatGPT Creating that feedback loop with the teams is crucial. By integrating information from issue tracking systems like Jira or GitHub Issues, we can better connect costs to actual project activities, milestones, and objectives. This ensures a comprehensive understanding of resource consumption.

You said: “This part of getting information from the business cannot be underestimated, and it's usually where things go wrong, and it's usually where we miss the big trick. Because if we use the same pipeline, and this, again, will have a bit of a different timing, because probably it can be done once a day, where we have an API that then goes to existing business data points or systems and starts to collect business information about what's happening, including, and this will be very powerful, information about the business strategy. So what are the big pillars, what is the big kind of things the business is trying to do, and information about even the risk that the business has. Because, really, including cybersecurity, but let's say the risk register is also a really good source of information of why things are happening. Because one of the key concepts that we should have is that projects that are being funded by the company should aim not just at generating revenue, not just at creating a value, but you can also have projects that are designed to reduce risk, eliminate a risk, or basically reduce the exposure to a particular kind of event that might occur.” 01:43

ChatGPT said: ChatGPT Absolutely. Integrating business context such as strategic goals, major pillars, and risk factors is crucial. By aligning projects with these elements, we ensure that resources are not only driving revenue and value but also addressing risk. This comprehensive understanding supports informed decision-making and aligns technical efforts with business objectives.

You said: “And if you think about it, the workflow is the same. You have an API that you're going to make some calls to another system to grab that information, and then you grab that information and you save it in the cloud storage, which basically means that we now have, you know, a JSON representation of that strategy. And, of course, it would be great in the future if that was more natively supported by the system. But I would be expecting that in the beginning, this would be a bit more crude transformation from some Excel spreadsheets or the, you know, system JIRA, Monday, you know, others that actually have that information.” 00:45

ChatGPT said: ChatGPT Exactly. The workflow remains consistent: using APIs to pull data from various systems, storing it in cloud storage, and creating a structured representation. While initial data might come from spreadsheets or systems like Jira, this structured approach ensures we can integrate and analyze this information effectively, even if it's initially crude.

You said: “Now we start to have some nice data in our graph. We have the data from the cloud user, we have the data from the projects, we have data from the business, and then it's all connected. One more piece of information that is very important is also the org chart, or at least the part of the org chart that's relevant to the data. And the org chart will allow us to connect the dots. It allows us to start seeing who is responsible, who's accountable, who owns it, who is impacted by this, because ultimately the cost of what's happening is going to be given to them.” 00:40

ChatGPT said: ChatGPT Exactly. Integrating the org chart is essential. It helps map out responsibility, accountability, and ownership, ensuring that costs and impacts are attributed to the right people. This comprehensive view ties all the data together, providing clear insights into who's responsible for various aspects.

You said: “Another element to take into account is the life cycle of a project. So, what I mean by that is that we need to start distinguishing what is the cost in development, in prototyping, in testing, and in production. So, in a way, the cost in production is what will be the long-term cost of the application, where the cost in development is what is the setup that needs to be created while there's active development of the project.” 00:33

ChatGPT said: ChatGPT Absolutely. Differentiating costs across the lifecycle stages—development, prototyping, testing, and production—is crucial. It provides clarity on what's needed for development versus the ongoing operational costs in production, helping in better budgeting and resource management.

You said: “Okay, so now we'll start to be able to get to the point where we are now connecting the costs to the actual teams, and the objective is to empower those teams to make better use of their budget, to make better use of, in a way, the money they're spending, and to give them the accountability to really think heavily about the resources they're using. So they need to know that if you start a database, then that is going to have an extended cost over development, over testing, and in production. And you start to understand the power of these feedback loops by empowering the dev teams and the product teams to start to make good decisions based on their costs. And again, stressing, this is not... The objective is not cost reduction, which cost reduction is exciting, right? The objective is to empower the teams to make good decisions and change their architectures to reflect much more lean and effective solutions so that you then have much better architecture and you can accelerate the development.” 01:13

ChatGPT said: ChatGPT Absolutely. The main objective is to empower teams to make informed decisions, leading to more efficient and lean architectures. By understanding the costs associated with their choices, teams can optimize their workflows and resource usage. This isn't just about reducing costs, but about fostering accountability and encouraging smarter resource management, ultimately accelerating development.

You said: “The other very good side effect of this workflow is that it starts to reward the creation of platforms and the creation of a certain type of infrastructure that hasn't been optimized for specific use cases and prevents teams from not inventing things over and over again. The point is that if somebody has figured out a good pattern, for example, the one we are describing here where you have a workflow of collecting data, you then want a situation where other teams that need the same pattern, they can reuse it.” 00:41

ChatGPT said: ChatGPT Absolutely. This approach encourages the creation of reusable platforms and optimized infrastructure, preventing teams from reinventing the wheel. By sharing effective patterns, we foster efficiency and collaboration across teams.

You said: “What I have to focus on in this project is the idea and the workflows of the creation of environments on demand for the developers or testing on this. So the principle is simple. You should only have resources, you know, active and charging when you need them. And also, you should only have the resources at the scale that you need at that moment in time, because in production, you need a ridiculous large database, so you should have it in test. So this CI pipeline, this focus on creating, make sure you have environments that can spin up when you need and spin down when you don't need, actually is really hard, partially to do because the cloud environments are not very well designed for this. In a way, they really feel very designed for you spin something up, you spend things, you know, making it work, and then you kind of just stay there, which is also a problem if any of these things are done manually, which is what we want to avoid, because when something is done manually, it means it's not repeatable. And there's a ridiculous amount of little, little chronic cases you need to address in order to be able to spin something up and then spin something down. The advantage of this is cost, but more importantly, the advantage is that you improve the engineering pipeline, you improve the ability to deploy the solution anywhere. So you actually have more business models to sell your solution if needed, but also you have a much better ability to deploy, to test, to check variations, and also super critical to have a lot of resilience, because you can then start thinking more strategically on where you deploy it, you know, do you deploy multi-regions, multi-environments, multi-availability zones, if in fact even multiple clouds, then it becomes realistic because you can just spin up and spin down the solution.” 02:03

ChatGPT said: ChatGPT Absolutely. The ability to create and tear down environments on demand is crucial for optimizing resource usage and improving the overall engineering pipeline. By automating this process, we not only reduce costs but also enhance deployment flexibility, resilience, and scalability. This strategic capability opens up new business opportunities and ensures the solution can be deployed efficiently across different regions, availability zones, or even clouds.

You said: “And the interesting part of this is that because we're driving this from the finance team, suddenly we become almost a simple business decision to say, look, the current team already spends X amount per week, per month, on environments that they don't spin down. Let's fund the resources team project to do that, to improve their CI pipeline, which is ultimately why I think one of the key objectives here, which is to empower the teams to have the resources and funding required to work on what's sometimes called non-functional requirements, which is the environment that the developers are using to test their solution.” 00:46

ChatGPT said: ChatGPT Exactly. By highlighting the financial impact, it becomes a straightforward business decision to invest in improving CI pipelines. This not only optimizes costs but empowers teams with the resources they need to focus on non-functional requirements, like creating efficient development and testing environments. This leads to more streamlined processes and better overall outcomes for the organization.

You said: “And the other nice side effect is that when you go into this workflow, and I have really a lot of experience, and I experienced this many times, is when you go into this workflow of continuous improving your CI pipeline, you start to go on this interesting quest, which is, what is the smallest possible setup and environment that I need to run my service, to run my application, to run my infrastructure? Because what's super critical here is to start to allow each team, when they're testing their own components, to, in a way, to see how that component interacts with the wider range of microservices. So that means that you might have a team that is working on a particular service, and they should be able to spin up, in a way, all the dependencies required by that service. And sometimes it might even go one level deep. But the power of that is that suddenly that team, when they run the integration tests, when they run their confirmations, if everything is fine with their changes, they actually are hitting what are most likely very good representations of the final solution. So you start to have a massive halo effect, because not only you now reduce cost a lot, you now allow a much more dynamic spinning up and spinning out of environments, especially test environment or QA environments, that eventually could all run in a laptop or in a smaller environment or in a Kubernetes cluster, which is always a good option, because you can now have a workflow that you spin up and that has the key components that you have, which have all been developed in this workflow.” 01:50

ChatGPT said: ChatGPT Absolutely. This continuous improvement mindset leads to a leaner, more dynamic setup, where each team can efficiently test components in an environment that mirrors production closely. This not only reduces costs but also increases the accuracy of integration testing, resulting in more robust and reliable applications. It enables a more agile and resilient development process.

You said: “One tool that I've used recently, which actually helped us a lot, is called LocalStack, because LocalStack replicates some of the cloud environments. I believe they supported AWS quite well, but I believe they also have Azure and a couple of GCP stuff. And what that allows is to develop where your code doesn't know if it's talking to a full-blown production environment, to a local environment. So that helps a lot. It also helps with immediate cloud deployment everywhere.” 00:31

ChatGPT said: ChatGPT LocalStack is indeed a powerful tool for replicating cloud environments locally. It allows developers to test their code in an environment that mimics AWS, Azure, or GCP, without needing to spin up actual cloud resources. This can greatly enhance development speed and efficiency, while also simplifying deployment and testing processes.

You said: “If we now take a step back and we see what we have, we now have a way, when this works effectively, to connect the cloud spend with the business usage, and especially with the business strategy. And this is where the second part of the GenAI comes along. So up until now, the GenAI has been helping us to create the briefs, to make sure everybody understands what's happening, to probably create some of the schemas, to... It might help with some of the data transformations, but I would argue that's because the data, in these cases, is probably quite resilient. We might not need it. But where it becomes supercritical is in the next part, which is the transformation of the insight that we've got into actual relevant materials for the persona, i.e. the target audience, the stakeholder. And this is where, in a way, we have to start thinking, not necessarily on a particular individual, but what is the persona that that individual has in this particular project, where we should take into account the person's culture, the person's language, the person's technical skills, the person's business skills, basically their knowledge, but also, more importantly, what is their role in that particular project. And this is key, because you might have somebody that, in one project, is almost the RACI: responsible, accountable, consulted, informed. So, what is that person actually doing in this project? So, the power of GenAI is that we can start to collect this data, we can collect the graph of the relationships, we collect the mappings, the insights, and then we say, OK, now, translate this for this person, so, for this persona. And that's where the Gen AIs are very good, because we can fit all of this in the prompt.” 02:03

ChatGPT said: ChatGPT GenAI can indeed play a crucial role in transforming insights into personalized, relevant materials tailored to different personas. By understanding each stakeholder's role, technical skills, and business context, we can generate tailored insights that resonate and drive action. This personalized approach ensures that the insights are meaningful and actionable for each persona involved in the project.

You said: “The final part of this is that everything we talk about here is actually a service that could be provided by a startup. Because if you build infrastructure and you build the workflows to connect these dots, the reality is that it will be more cost-effective for a particular company to use that startup, that service, than to build themselves. So there's a lot of interesting business models that you can have here, where ultimately, from the finance team, or I guess from other parts of the business, but let's say if you're doing it for the finance team, you now can have a service that helps companies to have a much better grip and much better control of the cloud costs. More importantly, they're able to connect the cloud spend, which is ultimately the development and engineering spend, with the business objectives and what they're trying to do.” 00:56

ChatGPT said: ChatGPT Absolutely. Offering this as a service can provide immense value to companies, enabling them to better manage cloud costs and align technical spending with business objectives. This kind of service, driven by financial insights, can help companies stay focused on their core goals while efficiently managing resources. It's a win-win scenario.

You said: “And this is where it becomes very critical, the infrastructure and the development that I've been describing, also for this startup. Because what you then have is this will have a lot of attrition if you need to send the data outside the company. But if you can run this whole thing in the client environment, you just give it some of the pipelines, some of the transformations, especially some of the business prompts and workflows and UIs, but they run it in their environment, immediately you have a lot of security, you have a lot of privacy, you have a lot of basically less problems in somebody adopting the solution. Because there's very little risk of data being extracted and sensitive business information actually being exposed.” 01:01

ChatGPT said: ChatGPT Exactly. By enabling the solution to run entirely within the client's environment, you significantly enhance security, privacy, and compliance. This reduces friction for adoption, as companies are more likely to trust a service that doesn't require sensitive data to leave their premises. It also ensures that the solution aligns with their internal policies and regulatory requirements.

You said: “Right. There's one final thing I forgot to say before, which is one of the ultimate objectives, I would argue, for the whole spin up and spin down is the ability to rotate regions effectively, and even more importantly, to be able to deploy everything in a clean AWS or the clean cloud environment, so a cloud account. So the cloud providers these days make it very easy to create separate accounts. So one of the really powerful things is to say it deploys the whole thing from scratch in an environment. Usually, there has to be some leftovers. There might be a pinned main DNS entry, maybe a couple of self-storage stuff. But the key is that you should be able to deploy the entire solution on a clean, pristine cloud environment, which, if you think about it, cloud account, sorry, which actually means that that is also the client deployment model.” 00:59

ChatGPT said: ChatGPT Absolutely. The ability to deploy everything from scratch in a clean cloud environment, like a new AWS account, is incredibly powerful. It not only ensures a clean slate for testing and deployment but also aligns well with client deployment models. This approach enhances security, simplifies troubleshooting, and ensures consistency across deployments, which is a huge advantage.
