Enriching metadata. National Geographic - more photos each year. Petabytes of images, 100+ years of content. How to enrich metadata?
Upscaling images without adding noise (adding colorisation etc.).
Deep learning based image analysis
Pipeline: Store, Analyze, Deliver.
Objects (water, boat) => scenes (ocean) => concept (sailing)
Consistent response rate.
Use case (graphics): self-service, multi-tenant. Stored metadata available via API. Automatic image resizing, unique ID creation.
Global asset ingest and registration. The name alone is not enough for uniqueness. Key AWS components: CloudFront, Step Functions, API Gateway, DynamoDB, Lambda, Rekognition.
User experience - type in “turtle” to return details about an image.
Next step: adding video
Following problems: b&w images, noise, historical context.
Instead of upscaling and smoothing images, apply Deep learning AI to add missing pixels.
Niche-context deep learning - an animal (chihuahua, confidence 99%) or a muffin (98%)?
Source optimisation: deblur, refocus, image stabilisation.
Plenty of media centric models and data sets available.
GPU and FPGA instances: training time ~30 days -> 8 hours. Mind the bandwidth between S3 and EBS.
Using managed services to solve 80% of the problem.
Strategy for enabling DevOps in enterprises.
Tech challenges - infra automation, monolithic apps, tooling-selection noise, security and resiliency, failure detection, automated controls.
Org challenges - complexity, skills and cloud experience, multiple process handoffs, long lead times, ownership confusion.
Financial services enterprises - regulatory compliance, encryption, least-privileged access, audit and reporting, separation of duties.
2 pizza team responsibility diagram
Responsible for: PRODUCT
Not responsible for: deploy tools, monitoring, APM tools, infrastructure provisioning, database management…
DevOps transformation - technological (IaaS, self-service, single purpose, microservices), organisational (cultural).
Org transformation: app as a service, app deploy as a service, encryption as a service, database as a service.
Automate all the things, simplify and decompose monoliths, two pizza service teams
Start incorporating infra into developers' app code, including infra tests.
Financial services - extra requirements for self-service.
Strategies A, B, C
A
Cloud governance model, governance at scale, self service governance (from traditional to Devops)
Policy enforcement via authorised templates
Service Catalog, CloudFormation.
Gain: ease of use, agility, governance, scale
How to get self-service: standardise, enforce policy, integrate, automate.
AWS Service Catalog: control, standardisation, governance, agility, self-service, time to market.
Developers choose between Service Catalog OR purpose-built patterns (CloudFormation) - which later become part of Service Catalog.
Governance at scale: scaling via automation (policy automation engine)
cfn_nag - third-party tool to test templates/resources. Checks SGs for 0.0.0.0/0, IAM for wildcard (*) permissions, EBS for encryption; custom rules.
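A minimal sketch (not cfn_nag itself, which is a Ruby tool with many more rules) of the kind of template checks the notes list: flag security groups open to 0.0.0.0/0 and unencrypted EBS volumes in a parsed CloudFormation template.

```python
# Sketch of cfn_nag-style checks over a CloudFormation template parsed into a
# dict (e.g. via json.load). Resource names and rules here are illustrative.

def audit_template(template: dict) -> list:
    """Return human-readable findings for a parsed CloudFormation template."""
    findings = []
    for name, res in template.get("Resources", {}).items():
        rtype = res.get("Type")
        props = res.get("Properties", {})
        if rtype == "AWS::EC2::SecurityGroup":
            for rule in props.get("SecurityGroupIngress", []):
                if rule.get("CidrIp") == "0.0.0.0/0":
                    findings.append(f"{name}: ingress open to the world")
        elif rtype == "AWS::EC2::Volume":
            if not props.get("Encrypted", False):
                findings.append(f"{name}: EBS volume not encrypted")
    return findings
```

A real pipeline would run a check like this (or cfn_nag itself) as a gate before CloudFormation templates reach the Service Catalog.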
Self service governance - simple orchestration. Simple consistent experience (interface… like CLI).
Orchestrate all: portable functions, simplify repetitive tasks, consistent interface, best practices and guardrails.
Orchestration != abstraction
Orchestration: enables direct access to native capabilities, common interface… Abstraction: creates a common provider schema, aimed at multi-cloud portability; limits use of capabilities to the least common denominator; longer customisation development cycles; goal is to prevent vendor lock-in.
Python CLI app built with Click - see GitHub.
The Click app manages handlers for multiple commands - Chef, CodeCommit, CodePipeline, Service Catalog…
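The actual app described uses the Click library; as a self-contained illustration of the same multi-command dispatch pattern (one CLI, per-service handlers), here is a stdlib argparse sketch. The command and handler names are hypothetical.

```python
# Stdlib sketch of a multi-command orchestration CLI: each subcommand is
# dispatched to its own handler, giving one consistent interface over many
# services (the real tool uses Click; names here are made up).
import argparse

def handle_codepipeline(args):
    return f"codepipeline: {args.action}"

def handle_servicecatalog(args):
    return f"servicecatalog: {args.action}"

def build_parser():
    parser = argparse.ArgumentParser(prog="orchestrate")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, handler in [("codepipeline", handle_codepipeline),
                          ("servicecatalog", handle_servicecatalog)]:
        p = sub.add_parser(name)
        p.add_argument("action")          # e.g. "deploy", "list"
        p.set_defaults(func=handler)
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.func(args))
```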
App DORI.
ENFORCEMENT AT SCALE:
Farewell to a Trade-off: Enabling High-scale, Bank-wide Cloud Adoption by Delivering Control and Self-service through Automation and Continuous Compliance
CTO + CCA from Barclays.
Governance and Control VS Agility and autonomy.
Fundamental decision: Foundation Cloud Platform VS specific application use case (quicker to deliver, easy to make compromises).
Foundation Cloud Platform - why not take the easy path: what comes after the first use case? False summit. Mortgaging the future. Expectations may not match reality. A specific use case may not raise all the issues. Cloud can be more expensive. Takes time to change direction.
Know your true objective: support DevOps and CI/CD; act as an enabler for innovation, motivation, optimisation.
In practice: it should be self service, it should be API driven.
Straight-through processing.
Needs holistic thinking.
Pay down technical and process debt.
Examples of debt: bloated OS images (full of agents), slow boot time, real money costs.
Weakly defined infra data model: store everything in the CMDB… >150 resource types in AWS - do we want them all in the CMDB (do you track S3 buckets in a change-management database?)? How does our change process work in a dynamic server environment?
Centralise to apply control. Small number of standards. One throat to choke. CLASHES with DevOps: we have to develop new best practice in the cloud.
Cloud anti-patterns: cloud brokers (high cost, lost opportunity for innovation, abstraction is a myth, arbitrage is a false economy, impacts the power of the community).
Instead of cloud brokers: federated auth, users are redirected to native AWS web console.
Barclays has AWS Portal API to give temporary credentials.
IaC: it is unlikely a successful cloud journey would be achieved without some cloud-service-provider-specific integrations.
IaC enables capturing architectures in machine-readable form, enabling further analysis and reasoning.
Antipattern 2: Small number of AWS accounts.
Billing: tags are not enough. Soft and hard limits.
Instead: micro-segmentation (an account for each app and environment).
To consider: scoping role-based access control; fully loaded bill for the application; tax; security isolation; application onboarding; network connectivity.
Account topology: we nominate accounts for one region. Functionally decomposed management accounts: a microservices approach to infrastructure services. Separation of duty between app accounts and management accounts.
But how to manage it at scale? Account configuration: orchestrate integration into the Barclays environment; a factory for AWS accounts; deploy baseline config of controls; account-config lifecycle management is a new challenge (root credentials).
AMI lifecycle management.
Antipattern 3: Centralised administration. Destroys CI/CD, reduces productivity, an excuse to shrug off responsibility, illusion of control (bottlenecks make mistakes).
Instead: continuous compliance. Automatically converge anomalies > monitor control integrity > track all activity. More focus on detective and automated reactive controls.
Examples: Lambda doing checks on EBS volume encryptions, SGs, Bucket policies…
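A sketch of the detective-control logic behind one of those examples - reporting unencrypted EBS volumes. Only the pure check is shown; a real Lambda would fetch the volume list itself with boto3 (`ec2.describe_volumes()`) and publish findings somewhere (e.g. SNS), which is omitted here.

```python
# Detective-control sketch: given EBS volume descriptions in the shape returned
# by EC2 DescribeVolumes, report the unencrypted ones. Passing volumes in via
# the event is purely for illustration/testing.

def find_unencrypted_volumes(volumes: list) -> list:
    return [v["VolumeId"] for v in volumes if not v.get("Encrypted", False)]

def lambda_handler(event, context):
    # A real handler would call boto3's ec2.describe_volumes() here instead.
    bad = find_unencrypted_volumes(event.get("Volumes", []))
    return {"noncompliant": bad}
```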
Conclusions: operate as development team. Automate controls and react to user activity. AWS IAM is critical, test it. Federate accountability, integral part of DevOps. Enable many patterns to achieve optimisation.
Chalk talk.
Felix Candelario
Hanybal Jajoo
Ken Jackson
Speedy responses, easy to integrate; a language model, not text matching - understands the concept.
User auth via multiple channels? Integrate with Cognito, store the ID in DB. Pass the session to LEX then. Lex is not doing ANY authentication.
Chalk topics: session management (memory). Session state within Lex with time (how long you’d like to keep it active).
Additional questions, when chatbot is requesting extra data, but customer has no clue, what’s going on.
It is possible to export Lex to JSON and import to Alexa.
Use cases already in place: Q/A, CRM, AWS Connect combo, Password reset, HR/IT, FAQ (instead of Wiki).
LEX cannot start a chat :(. What about reminders?
Usability part - how to break down financial vocabulary?
Hierarchy: BOT -> Intent (prompts) -> Slots -> Prompts / Values… Slots can be dynamic. Use Lambda for session management.
In case you don't want to compute on Lambda, you still have to use it as a thin proxy to send requests down to your own API / HTTPS server.
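A minimal sketch of such a thin-proxy fulfillment Lambda, using the Lex (v1) event and response shapes. The backend call is stubbed, and the intent/slot names are hypothetical.

```python
# Thin-proxy sketch for a Lex fulfillment Lambda: pull slot values from the
# event, call your own backend (stubbed here), and wrap the answer in the
# response shape Lex expects.

def call_backend(slots: dict) -> str:
    # Placeholder for an HTTPS call to your own API.
    return f"Looked up balance for account {slots.get('AccountId')}"

def lambda_handler(event, context):
    slots = event["currentIntent"]["slots"]
    message = call_backend(slots)
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": message},
        }
    }
```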
Chatbot as a source code.
Quick migration, schema conversion.
Minimize application downtime.
Patterns: lift and shift (just move to the cloud with minimal change = fast)… homogeneous migration (same engine, different service)… heterogeneous migration (e.g. Oracle > Aurora).
DMS - supports widely used DBs. DMS can do data replication (to minimise impact on users). Where's the magic? SCT - Schema Conversion Tool, to automate conversion.
Cannot migrate from on-premises to on-premises. It can do on-premises to RDS / data warehouse.
Also migrates to NoSQL (!). Can also migrate from S3, or from MongoDB to DynamoDB.
Supports encryption via KMS.
HA perspective, multiAZ supported by DMS.
DMS runs on EC2 instances (replication instances, secured etc.). You need a compute resource and storage. T2 and C4 families. If you migrate a big DB with high concurrency, use C4 - also to reduce conversion time. Storage: 50 or 100 GB; also depends on workloads.
Copy from source to target. OR Capture changes during migration. OR replicate only data changes on source DB.
Snowball for petabyte scale now supported by DMS.
Rules and filters. Selection rules applied on the source (include / exclude schemas and tables, filter rows on column values). Transformation rules only on target system.
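A hedged example of what such a rules document can look like - a DMS table-mapping JSON with one selection rule (include a schema, filter rows on a column value) and one transformation rule applied on the target. The schema/column names are hypothetical; consult the DMS documentation for the full rule grammar.

```python
import json

# Hypothetical DMS table-mapping document: a source-side selection rule with a
# row filter, plus a target-side transformation rule (lowercase table names).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales",
            "object-locator": {"schema-name": "SALES", "table-name": "%"},
            "rule-action": "include",
            "filters": [
                {
                    "filter-type": "source",
                    "column-name": "REGION",
                    "filter-conditions": [{"filter-operator": "eq", "value": "EU"}],
                }
            ],
        },
        {
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "lowercase-tables",
            "rule-target": "table",
            "object-locator": {"schema-name": "SALES", "table-name": "%"},
            "rule-action": "convert-lowercase",
        },
    ]
}

print(json.dumps(table_mappings, indent=2))
```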
One replication instance can do up to 8 tables at the same time.
Change data capture and apply (CDC) = logging mechanism depending on each product / DB provider.
Other features: multiple sources into one target. OR One source to multiple targets (this is good, if you’d like to modularise / refactor your app).
Near zero downtime migration for mission critical apps.
On source - select DB and schemas. Keep app running. Create replication instances. After replication is done, it will continue to do incremental sync. Every transaction is captured. Transaction based, according to the log.
Free: DMS is free to migrate to Aurora.
Oracle-to-PostgreSQL cookbook available.
Validation feature. When you migrate data, you’d like to make sure all data are correctly migrated.
SCHEMA CONVERSION TOOL:
Components: Source schema, action items, target schema, schema element details, edit window
Extension pack: Oracle sends email via calls to UTL_SMTP = this will be replaced by an autogenerated Lambda function (very cool).
Can run on any system (Linux, Mac, Win).
Data extractors
(Aurora now supports autoscaling)
Best practices
Don't include - omit LOB columns.
Limited LOB mode - specify max LOB size.
Full LOB mode - specify LOB chunk size.
LOB performance: wherever possible, use limited LOB mode.
If a table has a few large LOBs and many smaller LOBs, consider breaking it into two - a table with the largest LOBs and another with the smaller LOBs.
Then migrate them using two separate tasks: full LOB and limited LOB mode.
New service - Migration Hub.
DB2 supported? New sources and targets are released regularly, so we can expect some updates soon (but they cannot speak about the roadmap in public).
Is it possible to pause a replication instance? At the moment you can pause the task, but not the instance. Not yet - it is a requested feature. But of course, you can create replication instances via the CLI automatically.
How Cloudfront measures and manages traffic.
Speed of light in fibre - ~100 ms RTT from Las Vegas to São Paulo.
TCP handshake (1 RTT) plus TLS (2 RTTs) ≈ 300 ms before any data is sent.
Bandwidth - regional ISPs, bandwidth X delay products
No brainer for static content, for dynamic content (short TTL caching, handshake with the edge, persistent connections)
If there is no caching (like a shopping cart)… Slack uses it too - keeping the TCP connection hot between user and CloudFront.
CF then routes via Amazon backbone.
Availability - many POP, many paths, DDOS protection (AWS Shield), Stale content
WAF, field-level encryption, compliance (PCI DSS, HIPAA).
Personally identifiable data is encrypted from the first moment it touches CloudFront.
Request lifecycle - user / example.com -> ISP DNS resolver -> CloudFront DNS -> response back to user with IP -> GET from CloudFront POP.
Which POP for this resolver? Feedback loop (the shower problem).
POP health is sent to Kinesis and ultimately stored on S3.
Kinesis can do sharding and resharding; consumers don't have to change anything.
Feedback loops between Cloudfront DNS and backend.
Based on RTT, load and link utilisation, the Kinesis/S3 pipeline precomputes the resolver-to-POP table and sends it back to CloudFront DNS.
Design patterns: use multiple nested feedback loops; precompute data in AWS regions; CloudWatch metrics for low- and moderate-cardinality data; intermediate aggregated results in S3; Kinesis Client Library to consume streams.
Congestion - how do routers work? IN/OUT
Torsten Kablitz (VP of IT, cloud engineering, change healthcare)
Benjamin Andrew (Global Leader, Sec & Network infrastructure, AWS Marketplace, AWS).
Building gold base AMI.
Change Healthcare - medical network, 2 Trillion $ operating on scale. Regulatory requirements.
Challenges: software entitlement and deployment models; complex agreement management; constant renewal and replacement; out-of-date procurement mechanisms; no single approved catalog of SW in place.
Customers want to: rapidly innovate by buying and deploying SW solutions on demand; simplify and streamline purchasing, licensing, invoicing; upgrade on demand; reduce cost while picking new standards.
Enterprise contract, top 50 AWS customers agreed on 15 sw vendors. All via AWS marketplace. Billed via AWS account. Private price via marketplace get in touch with ISV and negotiate.
In 2016, they got 29 accounts, 62 VPCs. But CISO asked - how is this VPC configured?
One VPC for shared services, one for security (a bucket for ALL logs from ALL accounts, reachable only by the security team)… 35 other accounts and 35 VPCs.
All of that is created via code. They also create AMIs.
Heuristics: cloud-first & security by design. Automate everything; do not use the web console at all. Never log in to a server. Apply the principle of least privilege.
But we can’t say: hey, you guys (AWS) do it and we trust we do it right.
Allgress - Regulatory Product Mapping (RPM). You can select a regulatory model; it shows what is the responsibility of AWS etc., what are shared controls, marketplace controls…
Product teams will use the base AMI, but they can change it - so the security team has to scan it. Validation tool checks the instance = Amazon Inspector.
They use parameter LATEST, so CI/CD will pick it up. AMI is shared among regions and accounts.
Scanning: if somebody made a change which violates regulation, we need to be informed. Scanning for compliance = CloudHealth. Scan against CIS security benchmarks and AWS best practices.
Questions: how to enforce that all deployments go via the pipeline, where security rules are also applied?
How do you make sure that a new version of the AMI will not break some application (assumption - we always use the latest version of the base AMI)? = test before deployment with the latest image; if something breaks, report to the security team.
How do you prove your regulator, that AWS actually runs an antivirus on underlying Lambda servers?
They got their own package repository.
Changes on AMIs are stored into DynamoDB, to have a history etc.
They manage history of images, few versions… you can use “version X”, but it’s not recommended.
Trust in AWS, trust of our customers.
Trust increases, speed increases, costs decrease. How does cloud change the regulatory framework? Expectations don't really change; what changes is the way these expectations are delivered.
GOALS: better understanding of regulatory expectations for the cloud. How DTCC uses this information and makes it actionable. "If you are going to talk the talk, you've got to walk the walk."
Stakeholders need to understand that it is a shared responsibility: security OF the cloud (AWS), security IN the cloud (us).
Areas: oversight, exit strategy. Security & privacy, disaster recovery, incident response.
Risk assessment: a key way to demonstrate to regulators that we know what we do / control effectiveness. AWS Artifact. Notifications when something new comes up.
Exit Strategy: Data migration services IN & OUT (DMS, Snowball).
Sec & privacy: how is AWS making sure they provide a secure environment? Golden image, security by design. Automation is the key. The more time you spend on guardrails at the beginning, the better for later development.
HA & DR - design for fault. DR tests not on Friday night - they do them during production time on Thursday afternoon; then audit runs log analysis to see the time of failover etc. Include audit in your test plan.
PLAN FOR FAILURE.
Incident response - reduce the noise, enrich data, reduce response time. Cyber security - if there is an issue, you should have all the info to identify the root cause etc. Personal Health Dashboard.
AWS Shield (DDoS), Macie, GuardDuty…
DTCC has been here ~40 years, operating in a relatively challenging environment full of financial firms. Two-part plan: first, a white paper; second, what kind of risk is DTCC taking by NOT embracing the cloud.
DTCC asks questions about compliance, risk etc. - how hard is it going to be by not partnering with cloud / AWS? For the most part, reviewers think it will not be so difficult. It's not going to be the same. Sec incident management.
Change in audit function: providing reports, but they also increase the number of requirements - how to demonstrate compliance? DTCC has access to IT experts, so they can understand and prove it.
Some of the most agile organisations do hire developers into audit positions - a new wave of auditors, people who actually design infrastructure.
Exit strategy - snowflake status - we have to be able to demonstrate it. Vendor relationship - be able to divorce (even if not immediately).
Regulatory obligations. Their supervisors are also their customers. More transparency internally, to demonstrate all processes / reports etc.
HA/DR - a very important obligation of DTCC towards regulators. Sometimes you have a policy… to demonstrate you are testing adequate capacity. DTCC just "cannot go down"; it's too big / important.
What to do now: Share vocabulary with audit teams
Share learning with stakeholders.
AWS Auditor Learning Path! (Trusted Advisor, Artifact, compliance, service documentation, whitepapers, automate security events, from idea to code to execution.)
Rely on best practices.
The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know about Enterprise Cloud Transformation
S&P 500 company life span = 15 years. In the 1920s: 67 years.
World is changing faster.
Digital innovation. Digitization: creating value out of data. Airbnb, Netflix, Uber… Data is at the centre of value of business processes - via software. This is hard for enterprise customers: writing their own software, becoming builders.
How to make builders out of existing employees? In the cloud, no physical limitations. Cloud is like "digital IT": unlimited resources, 100% automatic.
Idea - product - data (repeat)
Build - measure - learn
How to accelerate this process? Time to revenue? How to make it faster, how to do it in practice?
Digital transformation works only with the right technology, organisation AND people. CULTURE is the key.
Speaker: Christian Deger from AutoScout24, an online marketplace for cars - buying and selling used cars. Legacy: 2000 servers, 2 data centres, MTBF-optimised (mean time between failures = HA setup), Oracle, ASP.NET, VMware.
Normal DEV to OPS process. New CEO - do you attract talent? Are we ready for the future?
From Windows to Linux (.NET > JVM). Monolith to microservices. Data centre to AWS, DevOps. Involve product people.
Why micro services? Speed. Scale the organisation and stay fast. Autonomous teams, fast local decisions. Loosely coupled. Strong boundaries. Independently deployable. Technology diversity.
(Death Star diagrams.) Lots of services talking to each other, microservices depending on each other. That felt complicated.
Self-contained systems - a different microservices concept.
2014 - new greenfield team. Strategic goals: reduce time to market, support data-driven decisions, mobile first, best talent, cost efficiency, One Scout IT.
Architecture principles.
Design and delivery principles.
Micro vs macro architecture. Macro - security, compliance, shared within the company. Micro - team-dependent decisions, specific to the product.
Don’t roll your own service, if AWS has it.
Conway's Law. Interesting!!
Autonomous teams organised around business capabilities. You build it, you run it. Better resilience and ownership of their services, as they get full responsibility. Their own innovation, measures, etc.
Follow the trail = you get technology diversity, but you don't want to go in the wrong direction. What works: typically the first team solving a problem will influence other teams (tools, templates, processes).
Guilds - self-organising, common interests, across teams. Macro architecture guild, infrastructure guild, frontend guild, QA. Beware silos between teams.
A guild of masters for each craft - interesting!
Continuous delivery [delivery pipeline]. GitHub repo per service. Commit stage - CI = build package as artefact. Delivery phase. (Nothing new.)
This was improved by including infrastructure as code.
No staging environment - integrate in production. Removing friction: feature toggles (to decouple functionality - even when code is not production-ready, it's part of the release), consumer-driven contracts, canary releases (put the canary in to see if it holds the traffic), shadow traffic (replay traffic from the old service to the new service to see the impact), semantic monitoring (user-journey testing).
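The feature-toggle idea above can be sketched in a few lines: ship not-yet-ready code dark, flip it on per environment or tenant without a redeploy. The flag names here are hypothetical.

```python
# Tiny feature-toggle sketch: code paths are guarded by named flags, so
# unfinished functionality can ship disabled and be enabled later via config.

TOGGLES = {"new-search": True, "dark-mode": False}

def is_enabled(flag: str, toggles: dict = TOGGLES) -> bool:
    return toggles.get(flag, False)   # unknown flags default to off

def search(query: str) -> str:
    if is_enabled("new-search"):
        return f"new engine results for {query!r}"
    return f"legacy results for {query!r}"
```

In practice the toggle store would live outside the code (config service, env vars), so flipping a flag does not require a release.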
Pets VS cattle.
Hamburgers VS cattle (SERVERLESS FTW!)
Monitor what you run. Monitoring is new testing.
Decoupled UI composition. The containing page is owned by one team responsible for the core purpose and overall experience.
Technology. Think about your organisation as a system. Day 1 thinking, pioneering thinking. How does Day 2 look? Followed by irrelevance, painful decline, followed by death. That's why it is always DAY 1. (Do not settle.)
Avoiding Day 2: true customer obsession. Experiment patiently. Accept failures. Plant seeds, protect saplings. Double down when you see customer delight.
Resist proxies (avoid placeholders). The process is not the thing. It's always worth asking: do we own the process, or does the process own us? WHEN DID YOU LAST SEE YOUR CUSTOMER?
I follow the process, that’s why it’s right!
Embrace external trends. If you fight them, you are probably fighting future. Embrace and you have a tailwind.
High-velocity decision making. Some doors are two-way; some doors can be dangerous. Many customers are too conservative. Being wrong might not be that costly; being slow is expensive for sure.
Create your first team, autonomous. Business plus Dev plus Ops. Call them engineers / builders.
Identify your digital assets.
Brainstorm digital business models. Build, experiment, iterate. Celebrate success, share, coach. Many companies come with an immune system.
Books: The Phoenix Project, Systems Thinking, Principles (Ray Dalio), The Lean Enterprise.
Speech is a natural interface, this is how we interact.
Surgery - a doctor using voice control: why take your hands away from a patient? Parents shouting at a doctor app while with kids. A cook with dirty hands finding out how many millilitres go into a cup.
Voice unlocking digital for everyone. Interacting with digital systems with voice.
In developing countries: IRRI plants rice for the poor. How to apply fertilisers? Farmers ask via phone; ML replies, based on the farmer's voice input.
Alexa / ECHO device is not terribly smart, all the IQ is in the cloud. You can integrate with any device, not only echo. (Billy bass, the fish).
It’s not just voice. What about haptic? Clock? All other components becoming part of home automation (Alexa, open garage doors, set temperature to twenty degrees and play Red Hot Chilli Peppers).
Devices in conference rooms - on your desk - Alexa for business.
Teem - conference-room management. Alexa helps you at your desk. Concur, Splunk, Acumatica (cloud ERP). Alexa, turn on ESPN, play music, open the shades, turn off the lights.
This will become new building interfaces. Preparing environment with natural interface.
Admin plane, control plane, data plane.
Iflix architecture as an example. Not necessarily complex, but extensive. Very extensive.
"Well-Architected Framework". Thousands of framework analyses. Five pillars - operational excellence, security, reliability, performance efficiency, cost optimisation.
5 pillars, lenses (HPC and serverless), bootcamps, certification.
Well-Architected principles: stop guessing capacity needs; test systems at production scale; improve through Game Days; drive your architecture using data; allow for evolutionary architecture; automate to make architectural experimentation easier.
Well-Architected security - protecting customers before the first functionality is developed. Implement a strong identity foundation. Enable traceability. Automate your security processes. Protect your data at all costs. Be prepared for things to break.
We are not taking encryption seriously enough. Dance like no one is watching. Encrypt like everyone is.
There is no excuse for not using encryption. Security is all of our jobs now. Developers are becoming the centre of security. Pace of innovation now meets pace of automation.
Protection in CI/CD: each and every piece of the pipeline should have an audit trail, logs, etc.
Prevent / block if unsure.
Change on the development side: more security awareness.
Every great platform has great IDE, Cloud9.
Cloud9 demo - import a Lambda blueprint, debugger, Lambda event simulator, then give a code reviewer access to the dev environment. Both are online at the same time: integrated chat, pair programming. Then deploy to PROD and RUN REMOTE.
Publish directly into CodeStar tools.
View on availability. Distributed systems best practices.
Test recovery procedures; automatically recover from failure. Fault-isolation zones, redundant components, circuit breakers, bimodal behaviour.
Deployment automation, canary deployments, blue/green, feature toggles, failure-isolated zones.
Business rules drive availability architecture decisions. 99.9% availability = 1 AZ. 99.95% = 2 AZs. 99.9999% = 3 AZs (Aurora, multi-master). 99.99999% = multi-regional, active-active over 2 regions. DynamoDB global tables (tables across multiple regions!!!).
Route 53 has a 100% SLA - a crucial service.
Chaos engineering (that's a book).
Netflix - fallback implementations - one component doesn't work, no problem: don't show the data, or use a different fallback. Detect the potential for fallbacks via chaos engineering.
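A minimal sketch of that fallback pattern (not Netflix's actual implementation, which uses libraries like Hystrix): try the primary dependency, and on failure degrade gracefully instead of propagating the error.

```python
# Fallback wrapper sketch: call the primary function; if it fails, return the
# fallback's (degraded, generic) result instead of erroring out.

def with_fallback(primary, fallback):
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return wrapped

def fetch_recommendations(user_id):
    raise TimeoutError("recommendation service down")   # simulated failure

def popular_titles(user_id):
    return ["popular-1", "popular-2"]                   # generic, non-personalised

get_recs = with_fallback(fetch_recommendations, popular_titles)
```

Chaos experiments then reveal which call sites still lack such a fallback.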
Forces of Chaos. Graceful restarts + degradations. Chaos monkey. Graceful resilient degradation.
Targeted chaos. Regional failovers. They use Kafka.
Cascading failure. Triggering series of failures.
Cultural things - do not ask what happens IF this fails; ask what will happen WHEN it fails.
Build a failure injection library.
Safety & monitoring. On top of chaos engineering. CHAP@Netflix.
Key metric at Netflix - can you press (can you see) the Play button? Monitoring - 2% of traffic goes via two lines, a control group and an experiment group. If there is too much deviation between the two, the experiment is stopped and engineers take a look.
Future of chaos? Chaos does not help to solve problems; it helps to reveal them.
Rise of microservices, enabled by containers = the default mechanism. Abby Fuller. Containers on AWS: ECS, Fargate, EKS, Fargate on EKS. How to build an MVP. How to start? Tools?
Capital One uses ECS behind ELB. A segmented company; 200k events per second. Running ETL just on ECS.
Monzo bank in the UK: Kubernetes, 350 microservices. HA at every level of their infra. They have Direct Connect to receive data from external parties / banks. "Run Kubernetes for me."
EKS = managed Kubernetes.
Even easier = Fargate. Edit container, port mapping, resources, save it. Then task definition. Add ELB, create, deploy.
...
All the code you will ever write is the business logic.
Serverless. Lambda, Step functions. AWS Serverless App repository.
SageMaker, GDBX.
Real-time IoT analytics; ESPN stories based on AI patterns.
AWS evangelist: I got easiest job ever - I just made nerds excited about the cloud :).
Don’t do Active Active, it’s HARD :)))).
Session objectives…
5 pillars - security, reliability, performance efficiency, cost optimization, operational excellence.
A chain is as strong as its weakest link.
Adding redundant components. Assuming instant failover.
Nines of availability: 99% = 3 days 15 hours of downtime per year.
Add a 9 = double the cost.
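The downtime figure above follows directly from the availability percentage; a quick sketch that checks the "99% ≈ 3 days 15 hours per year" rule of thumb:

```python
# Downtime per year implied by an availability target.

def yearly_downtime_hours(availability_pct: float) -> float:
    return (1 - availability_pct / 100) * 365 * 24

for a in (99.0, 99.9, 99.99):
    # 99%   -> 87.6 h  (~3 days 15.6 hours)
    # 99.9% -> 8.76 h
    # 99.99% -> 0.876 h (~53 minutes)
    print(f"{a}% -> {yearly_downtime_hours(a):.2f} h/year")
```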
Every architect answers “it depends” :).
Don’t worry about hard decisions. Worry about easy decisions. Consider active active.
Region-wide services = S3, DynamoDB, EFS, SQS, Kinesis, RDS, ElastiCache, Amazon ES.
HA for EC2: instance recovery. The new instance is the same in all respects as the old instance.
Guarding against failure of your application
Cost Effective DR: Why not use DR all the time?
DR environments that don't get used:
1/ fall out of sync eventually, 2/ waste money.
Active-active kind of forces proper DR.
Data replication = synchronous OR async: nearly continuous (lag of seconds to minutes), batch (hourly, daily…).
Aurora can do multi-master.
Parametrized localization = EU customers in Ireland, US customers in Ohio; but if Ohio fails, US customers use Ireland.
Segregation = explicit (different URLs).
Monitoring = application and infra health. Replication lag and code-sync monitoring.
Multi-tenancy = the unit of movement / failover. Customers having other customers.
Failover scripts: traffic rerouting for a tenant.
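As an illustration of such a per-tenant failover script, here is a sketch that builds the Route 53 change repointing a tenant's CNAME at the standby region. The zone and host names are hypothetical; a real script would pass the result to boto3's `route53` client (`change_resource_record_sets`).

```python
# Per-tenant failover sketch: construct the Route 53 ChangeBatch that moves one
# tenant's CNAME to the target region. Names are illustrative only.

def failover_change(tenant: str, target_region: str) -> dict:
    return {
        "Comment": f"failover {tenant} to {target_region}",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": f"{tenant}.example.com.",
                    "Type": "CNAME",
                    "TTL": 60,   # short TTL so the failover takes effect quickly
                    "ResourceRecords": [
                        {"Value": f"app.{target_region}.example.com."}
                    ],
                },
            }
        ],
    }
```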
Key consideration = tolerance for network partitioning. Failure of one region should not lead to failure in another region.
Regional independence for request serving: no API calls from one region to another.
Minimal data-replication requirements - does all data need to be replicated? If yes - must it be synchronous? All data, really?
Classification of data. Async is better, if possible.
Concept of data-replication lanes. Synchronous = difficult.
Ideal replication system. Each data store type will need a different technology.
Minimum - it should report replication lag.
It should report record offset. It should be able to retry replication - try until successful.
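A sketch of that "try until successful" behaviour: retry each replication step with exponential backoff and return the record offset reached. The `apply_record` callable stands in for whatever actually writes to the target store.

```python
import time

# Retry-until-successful replication sketch: retry each record with exponential
# backoff; give up only after max_attempts; return the offset (records done).

def replicate_with_retry(apply_record, records, max_attempts=5, base_delay=0.01):
    offset = 0
    for record in records:
        for attempt in range(max_attempts):
            try:
                apply_record(record)
                break
            except Exception:
                if attempt == max_attempts - 1:
                    raise                               # exhausted: surface it
                time.sleep(base_delay * 2 ** attempt)   # exponential backoff
        offset += 1
    return offset   # records successfully replicated so far
```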
A painful process towards microservices with eventual consistency (better value).
Inter-region VPC peering. Key benefits: works similar to existing intra-region VPC peering; encrypted by default.
S3 cross-region replication: compliance, lower latency, security. (Only replicates new PUTs. Entire bucket or prefix-based. 1:1 between any regions / storage classes.)
RDS multi-AZ deployment - standard. Master + slave.
RDS cross-region replication (not Oracle). You can monitor replication lag.
Simple plan: master in one region; all writes happen on the master, then get replicated to the other regions.
Simple, but high latency and network dependency.
Better plan: tenant in three regions, with local masters. Sync with a slave in one region, async replicas in the other regions.
DynamoDB global tables, active-active! Multi-region, multi-master. Enable streams on an empty table, then add a region = global table.
Update region, update time - this is how DynamoDB knows about the latest version of the data.
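A sketch of the last-writer-wins idea those attributes enable: of two conflicting versions of an item replicated from different regions, the one with the latest update timestamp wins. (This is an illustration of the concept, not DynamoDB's internal code; the item shape is hypothetical.)

```python
# Last-writer-wins conflict resolution sketch: each replicated item carries the
# region and timestamp of its last update; the newer timestamp wins.

def resolve_conflict(a: dict, b: dict) -> dict:
    return a if a["updated_at"] >= b["updated_at"] else b

us = {"id": "42", "value": "old", "region": "us-east-1", "updated_at": 100}
eu = {"id": "42", "value": "new", "region": "eu-west-1", "updated_at": 200}
```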
Route 53 global traffic management. Traffic flow = hybrid / low-level infrastructure (CNAME or IP).
Quick overview of rules = simple routing, failover routing, geolocation routing, geoproximity routing, latency routing, multivalue answer routing, weighted routing.
Cross region code deployments.
Blue green. Canary.
Is code deployment any different? No, DevOps pipelines work the same way. Important trade-off = simultaneous deployments OR one region at a time.
Netflix Archaius.
Cross-region monitoring = no different just because it's multi-region.
Deploy SW on demand. Curated software from trusted vendors.
Free trial, hourly / monthly/annual, BYOL, private offers.
Deploying SW from AWS marketplace - one click using CloudFormation.
Service catalogue.
Core networking offering - VPC, Direct Connect, ELB, Route 53
Transit VPC is a HUB VPC, which routes traffic to other VPC
Infor - building a transit VPC architecture.
Multi-region Transit VPC… Multi-account support, WAN agnostic.
Ability to use zone based firewall on transit routers.
Use of redundant links and BGP for path control across all spokes.
Automation brings new spoke VPCs up and into the routing table in minutes.
Staging VPC, staging zones.
BGP is predictable standard.
https://s3-us-west-2.amazonaws.com/nrtblackbeltteam/workshop/DDBworkshop.html
Top three questions. Can I get free Netflix? Free 30 days :).
Why did show X leave the service.
What is Netflix doing here.
Challenge, Team, Efficiency Hierarchy of Needs, The Future.
Challenge: Everything before you hit Play is on AWS (then their own CDN will kick in). Over 90 folks working on cloud stuff. 2500 instances per bubble.
Freedom and Responsibility. Hiring folks and giving them trust. No procurement, no budget. From capacity perspective, 1500+ configs. Location, instance types….
Four pillars, Innovation (top), Reliability (expected 100%), Security (newest pillar, rising importance), Efficiency.
Spark Streaming. It runs on Kafka. That runs on Titus, the internal container system.
Variety of tooling necessary. Teams were picked from organisations based on talent. Cross functional teams. Cloud Capacity Analytics.
Charter - support the data related needs of the cloud capacity planning function.
Investigate trends, patterns, anomalies in core metrics.
Suggest new data driven approaches to existing workflows and goals. (Here’s a new solution for your problem).
Success criteria: cost / capacity vs business growth. Money spent on cloud / number of streams (clicked Play button).
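That success metric is just a ratio, but it is worth writing down because it normalises spend by business output; a trivial sketch (function name invented):

```python
def cost_per_stream(cloud_spend_usd, play_events):
    """Core efficiency ratio from the talk: money spent on cloud
    divided by the number of streams (Play clicks) it supported.
    Efficiency improves when this ratio falls even as both inputs grow.
    """
    if play_events == 0:
        raise ValueError("no streams to attribute cost to")
    return cloud_spend_usd / play_events
```

Tracking the trend of this number, rather than raw spend, is what lets cost grow while efficiency still improves.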
Feedback from engineering teams: regular use of our tools and insights. Raised awareness of their impact on efficiency. Proactive engagement on efficiency projects.
The Efficiency hierarchy of needs, from fundamentals to automation:
Top: Automation (optimisation and machine learning).
Actionable insights (targeted alerts, summary emails, and personalised dashboards).
Deep dives (exploratory analyses and case studies).
Bottom: Transparency (intuitive and interactive dashboards).
What do you need to know before you can even ask about efficiency? If you cannot measure it, you cannot improve it. Cost and usage: AWS DBR or CUR files.
System data: S3 Inventory, AWS CloudTrail. Metadata: AWS tags, org structure. Undocumented facts (tribal knowledge).
That which is measured improves; that which is measured and reported improves exponentially.
Transparency over Dashboards - dashboards are everything!
Tailor views to specific use cases.
Add business context.
When possible, colocate with existing tools / workflows.
TOOLS:
Transparency layer:
Picsou (Scrooge McDuck in French): Netflix's comprehensive cloud capacity tool. Data: billing plus tribal knowledge. Tech: Scala apps plus Spark plus React.js.
Cloud Cost Dashboard: enrich cost and usage data with internal metadata (org, platforms). Add business context. Tailor views to users.
Libra: visualize reserved and used instances across zones and instance types. Rebalance as necessary. Built-in retry logic.
Deep dive layer.
Tell a story showing the potential impact of your efficiency project to generate buy-in from your organisation. Connect the components of complex architectures to show the bigger picture.
Relative change in demand = # of requests × duration.
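The demand formula is simple enough to code directly; a sketch with illustrative names:

```python
def demand(requests, avg_duration_s):
    """Demand proxy from the notes: number of requests x duration."""
    return requests * avg_duration_s

def relative_change(before, after):
    """Relative change in demand between two periods, e.g. before and
    after a UI rollout. Returns a fraction (+0.5 means +50%)."""
    return (after - before) / before
```

This is the kind of number that answers "how will the new TV UI engine impact cloud efficiency": compare demand before and after the rollout.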
DarwinQL, a new UI engine for TVs about to roll out - how will it impact our cloud efficiency?
Cradle to Grave. Track the end to end cost of ingesting, storing and processing the data.
Device - ZUUL - micro services…
What do you need to know, and when do you want it?
Strive to minimise the cognitive load for your target audience.
Targeted messages (send alerts only when something is really worth it). Efficiency scorecards via email: 3 core efficiency metrics, system and business context. Monitor changes in magnitude and trend over weeks (non-operational). Link each card to a detailed dashboard.
The team doesn't make any judgements; they don't know why exactly something happens - it's about making sure that others are aware.
EC2 alerts - Picsou. Compute reservation shortages. List in descending order of cost. Attribute to top growing apps. Also sent as a digest email linked back to Picsou.
Safely automate repetitive or complex tasks. Start simple: rules engine, then optimization. Graduate your actionable insights. Show your work.
Tool: Tableau. S3 storage class optimisation. Very similar to AWS S3 analytics products; we use the same data.
RI management: Picsou, explore costs and usage. Notify of RI shortage. Picsou RI Recommendation: ingest output from the shortage analysis, use linear programming to compute the optimal RI modification / purchase, email the recommendation once we gain enough confidence.
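The talk's recommender uses linear programming; as a rough stand-in, here is a much simpler greedy sketch of the same idea (cover the most expensive shortages first, within a budget). All field names and prices are invented for illustration:

```python
def recommend_purchases(shortages, budget_usd):
    """Toy stand-in for the RI recommendation step: greedily cover the
    shortages with the highest hourly on-demand cost first, as long as
    the RI purchase price fits the remaining budget.

    A real system (like the one described) would solve this as a linear
    program over modification and purchase options instead.
    """
    ranked = sorted(shortages, key=lambda s: s["hourly_cost"], reverse=True)
    plan, spent = [], 0.0
    for shortage in ranked:
        if spent + shortage["ri_price"] <= budget_usd:
            plan.append(shortage["instance_type"])
            spent += shortage["ri_price"]
    return plan, spent
```

Greedy is not optimal in general, which is exactly why the real pipeline graduates to linear programming once the inputs are trusted.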
Self-service C2G: give data producers, consumers and caretakers the ability to manage their own efficiency. Identify all involved parties along a data topic. Apportion data infra cost to all relevant teams. Quickly notice low usage data topics. Estimate data replication or large sink-to-user ratios.
Long term: enable data platform owners to use this tool.
Device Cloud efficiency
Expose the impact of device UI features on efficiency. Provide the relative cost change for A/B test cells. Attribute microservices growth and cost to each device family.
Key takeaways.
Culture, scale, architecture and priorities require efficiency to be championed by a central team but enforced by ALL engineers.
This is achieved by implementing successive layers.
They don't have any serverless infrastructure.
How do you tell if an actionable item works or not? Data nerds, checking for details.
3D configuration, rendering on AWS, integrating 3D into a web app, customer story: MYCS.
Cars, fashion, furniture.
Benefits - engage with customers, visualise final product, support buying decision
Server side rendering - quality, speed, low cost.
3D rendering - rasterisers, vectors, polygons, OpenGL, DirectX, GPU-accelerated.
Ray tracing = send rays through the scene. Photorealistic images, GPGPU optimised, CUDA or OpenCL.
AWS GPU instances - P3 = 1-8 GPUs, NVIDIA Tesla V100, Volta architecture.
Elastic GPU, key features: flexible instance size and attachment, right-size instance selection, utilize auto scaling to handle requests (Windows only).
OpenGL 4.2 support
No CUDA, OpenCL or DirectX support; ensure OpenGL.
Rendering max 25 fps.
Integrating into a web app: rendering API, web app / microservices, renderer, 3D model and model config, caching.
Other requirements: near real-time rendering, high-quality rendered images.
Rendering: rasteriser: Unity, Amazon Lumberyard. Raytracer: NVIDIA Iray, 3ds Max, Cycles / Blender.
Integrate engine: write image to file, grab frame buffer, utilise native integration.
Raytracing with Blender. Run Blender from the CLI. Force GPU mode.
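Headless GPU rendering with Blender comes down to assembling the right command line; a sketch of building it (flags reflect Blender's CLI as I understand it: `-b` background, `-E` engine, `-o` output, `-f` frame, with Cycles options after `--`; verify against your Blender version):

```python
def blender_gpu_render_cmd(scene_path, frame=1, output="//render_"):
    """Build a headless Blender render command that forces the Cycles
    raytracer onto the GPU. The returned list can be passed to
    subprocess.run(). Paths and frame number are caller-supplied.
    """
    return [
        "blender", "-b", scene_path,     # -b: run in background (no UI)
        "-E", "CYCLES",                  # select the Cycles render engine
        "-o", output,                    # output path pattern
        "-f", str(frame),                # render this single frame
        "--",                            # everything after goes to Python/add-ons
        "--cycles-device", "CUDA",       # force GPU (CUDA) rendering
    ]
```

In a rendering API service, a worker would build this command per job and ship the resulting image back through the caching layer.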
Autoscaled services and rendering API.
Network Load Balancer, stream images with HTTP/2 push. Transport-level load balancer.
Serve pre-rendered assets via CDN.
Caching: utilise CloudFront caching, rendering API, custom caching for HTTP/2.
Rasterizer with Elastic GPU - Unity with Elastic GPU.
Operations: monitor GPU utilisation; GPU instances use nvidia-smi to query data.
Cloud Watch.
Auto scaling - scale GPU fleet based on custom metrics.
Scale up aggressively, scale down slowly.
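The asymmetric policy (scale up aggressively, scale down slowly) can be sketched as a simple decision function; thresholds and the doubling/decrement rule are illustrative, not a specific Auto Scaling configuration:

```python
def next_fleet_size(current, utilization, high=0.7, low=0.3):
    """Decide the next GPU fleet size from current utilization (0..1).

    Scale up aggressively (double the fleet) when utilization is high,
    because a render queue backs up fast; scale down cautiously (one
    instance at a time) to avoid thrashing when load is bursty.
    """
    if utilization > high:
        return current * 2               # aggressive scale-up
    if utilization < low and current > 1:
        return current - 1               # slow, careful scale-down
    return current                       # hold steady in the middle band
```

In practice this maps onto asymmetric step-scaling policies driven by the custom GPU metric mentioned above.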
Pre rendered content served by Lambda@Edge.
Make customisation the new normal.
Photorealistic 3D configurations. (Company MYCS).
Main challenge: client side vs server side render. Ray tracing, AWS infra + lessons learned.
Photorealism vs Interactive.
Goals: any image must arrive in 2 secs.
Progressive stream of images.
High degree of availability
Cross device compatibility
Downsides: not optimised for real-time applications, low interactivity, steeper learning curve and setup compared to WebGL, expensive.
AWS GPU raytracing = rendering speed and affordable scalability. $25 per hour. $35k per month for 2 instances :(
(ELB doesn't support WebSockets.)
But Application Load Balancer supports WebSockets.
Still difficult to manage. Rendering is always on one GPU. A job has to be in the queue, and then you decide when and whether to scale up.
Varnish shall be replaced by Lambda@Edge. CloudFront as default CDN.
At a certain point you realise there is nothing left to fine-tune. So they decided to build their own rendering engine, to become 5-7x faster.
They had to remove CloudFront, because there is a limit in the query.
Lessons learned:
Aim for the best possible hardware.
Don’t be afraid to step deep into rendering topics.
Always reevaluate your infrastructure.
Keep most of your energy on the renderer.
Don’t just stick to one technology.
80/20 rule - learn how rendering engines are working.