- Aurora zero-ETL integration into Redshift: Aurora MySQL flavour only
- Redshift Serverless only
- Create an "integration" on the Aurora side
- Create a database from the integration on the Redshift side (sketch below)
- Cannot replicate deletes
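Rough sketch (not from the session) of the Redshift-side step via the Data API; the workgroup, database name and integration identifier are placeholders:

```python
import boto3

rsd = boto3.client("redshift-data")

# The Aurora-side "integration" surfaces in Redshift as a database you create
# from the integration identifier (visible in the Redshift console / system views).
rsd.execute_statement(
    WorkgroupName="analytics-wg",   # Redshift Serverless target
    Database="dev",
    Sql="CREATE DATABASE aurora_zeroetl FROM INTEGRATION '<integration-id>';",
)
```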
- Tracks files and ingests
- Multi-AZ: the cluster sits across AZs
- Each query runs in a single AZ, but queries are distributed across different AZs
- This is for traditional (provisioned) Redshift; Serverless already supports this
- Integrated with IAM
Streaming ingestion: Kinesis Data Streams or Amazon MSK (managed Kafka) topic -> Redshift
- Uses your cluster compute
- Takes advantage of materialised view
- Can multi-cast to views
- Auto refresh, restart ability
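A hedged sketch of the materialised-view pattern via the Data API; cluster, user, role ARN and stream name are placeholders, and the SQL is approximated from the streaming-ingestion docs:

```python
import boto3

rsd = boto3.client("redshift-data")
target = dict(ClusterIdentifier="analytics-cluster", Database="dev", DbUser="awsuser")

# 1) An external schema maps the Kinesis stream into Redshift.
rsd.execute_statement(
    Sql="CREATE EXTERNAL SCHEMA kds FROM KINESIS "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';",
    **target,
)

# 2) The materialised view does the ingestion on the cluster's own compute;
#    AUTO REFRESH keeps it current, and several views can read the same stream.
rsd.execute_statement(
    Sql='CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS '
        'SELECT approximate_arrival_timestamp, json_parse(kinesis_data) AS payload '
        'FROM kds."clickstream";',
    **target,
)
```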
- Feature of Lake Formation
- Pause / unpause clusters takes a few minutes (comparable to Databricks, not as good as Snowflake) - pause/resume sketch below
- Data sharing works with RA3 or Serverless; can share between Prod & Dev accounts
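Pause/resume is a plain API call (the cluster identifier is a placeholder):

```python
import boto3

redshift = boto3.client("redshift")

# Pausing stops compute billing (storage still charged); resuming takes a few minutes.
redshift.pause_cluster(ClusterIdentifier="dev-cluster")
# ... later ...
redshift.resume_cluster(ClusterIdentifier="dev-cluster")
```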
- Focus on encryption, security, no data egress/exfiltration
- Integration with a 3rd-party IdP
- AWS Clean Rooms: up to 5 parties can share / upload their data
- Only 1 collaborator can run queries
- Create a "collaboration"
- Associate a table with optionally filtered columns to a collaboration
- Since Glue 4.0
- 3 ways to run Glue: Spark, Python shell and Ray
- Is this EMR in the background?
- Glue Data Quality: auto profiling & rule generation
- Rules based on (py)deequ
- Brings DataBrew-style low-code/no-code functionality to Glue
- Has version control integration
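A minimal sketch of a Glue Data Quality ruleset in DQDL, registered against a catalog table; database/table names are placeholders, and the auto profiler recommends rules in the same language:

```python
import boto3

glue = boto3.client("glue")

# DQDL ruleset - the kind of rules the auto profiler recommends.
ruleset = """
Rules = [
    RowCount > 0,
    IsComplete "customer_id",
    ColumnValues "age" between 0 and 120
]
"""

glue.create_data_quality_ruleset(
    Name="customers-dq",
    Description="Basic completeness and range checks",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "customers"},
)
```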
- Can now do Athena SQL OR Athena Spark
- Create Workgroup
- Create Notebook - looks like Jupyter
- Is NOT EMR or Glue, it's a serverless Spark service that autoscales
- Low management overhead
- No config for GPU or other cluster tweaks
- No Spark UI, Ganglia or other performance-tuning observability
- Use case is ad-hoc queries, but it does allow jobs
- Has an API for sessions/jobs - sketch below
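Sketch of that API for Athena Spark; the workgroup name and DPU sizing are placeholders, and in practice you would poll the session until it is idle before submitting code:

```python
import boto3

athena = boto3.client("athena")

# Start a Spark session in a Spark-enabled workgroup...
session = athena.start_session(
    WorkGroup="spark-workgroup",
    EngineConfiguration={"MaxConcurrentDpus": 20},
)

# ...then submit PySpark code blocks as "calculations".
calc = athena.start_calculation_execution(
    SessionId=session["SessionId"],
    CodeBlock="spark.range(10).show()",
    Description="smoke test",
)
print(calc["CalculationExecutionId"])
```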
- Data mesh: producers -> DataZone -> consumers
- A bundling of AWS Glue, Athena, Spark, S3
- Business Catalog - enterprise-wide data search
- Filter by "Domain", "Business Glossary"
- Access notifications
- Domain profile is SSO-integrated - kind of like a separate identity, in the way SageMaker feels like a completely different SaaS from the AWS Console
- Data Wrangler things
- Can use Presto on EMR
- Can import sagemaker_datawrangler (aws/sagemaker-python-sdk)
- Offline vs online feature store
- The online store can use data preparation to generate features from live inference API request data + offline historic data
- The unique key is event timestamp + record ID, e.g. customer ID
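Feature Store sketch with the SageMaker SDK - feature group name, bucket and role are placeholders; note the record ID + event time pair acting as the unique key:

```python
import time
import pandas as pd
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

df = pd.DataFrame({
    "customer_id": pd.Series(["c-001", "c-002"], dtype="string"),  # record identifier
    "spend_30d": [120.5, 33.0],
    "event_time": [time.time(), time.time()],                      # event timestamp
})

fg = FeatureGroup(name="customers", sagemaker_session=Session())
fg.load_feature_definitions(data_frame=df)   # infer feature types from the frame
fg.create(
    s3_uri="s3://my-bucket/offline-store",   # offline store: historic, point-in-time queries
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    enable_online_store=True,                # online store: low-latency reads at inference time
    role_arn=role,
)
while fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)
fg.ingest(data_frame=df, max_workers=1, wait=True)
```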
- Run notebook as a job
- Studio -> Jupyter scheduling extension -> (if scheduled) EventBridge -> SageMaker Pipeline -> SageMaker training job ("run now" goes straight to the training job)
- SageMaker "Spaces" for collab - separate EFS, share/edit the same file
- Async inference
- request -> process from s3 to s3 -> send sns notification
- can also send a manifest for bulk processing
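Sketch of an async endpoint with the SDK; `model` is an already-built sagemaker Model, and the bucket/SNS ARNs are placeholders:

```python
from sagemaker.async_inference import AsyncInferenceConfig

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-bucket/async-output/",   # results written S3 -> S3
        notification_config={
            "SuccessTopic": "arn:aws:sns:eu-west-1:123456789012:infer-ok",
            "ErrorTopic": "arn:aws:sns:eu-west-1:123456789012:infer-failed",
        },
    ),
)

# The request just references an S3 object; the response lands in output_path
# and SNS fires, so the next step can be event-driven.
response = predictor.predict_async(input_path="s3://my-bucket/async-input/payload.json")
```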
- Tagging is best practice, tag all the things
- team, cost centre, project, enable detailed cost allocation
- automatic tagging at domain, user and space level
- Tag propagation can lag; tag-based policies (e.g. on Describe* calls) can fail if you don't wait
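Tagging any SageMaker resource is one call (the ARN is a placeholder); per the notes, tags set at domain/user/space level are applied to resources created under them:

```python
import boto3

sm = boto3.client("sagemaker")

sm.add_tags(
    ResourceArn="arn:aws:sagemaker:eu-west-1:123456789012:domain/d-example",
    Tags=[
        {"Key": "team", "Value": "ml-platform"},
        {"Key": "cost-centre", "Value": "cc-1234"},
        {"Key": "project", "Value": "churn"},
    ],
)
```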
Experiments
- New UI makes experiment tracking easier
- Integrated into the SDK
from sagemaker.experiments.run import Run, load_run
from sagemaker.session import Session
from sagemaker.utils import unique_name_from_base

exp_name = unique_name_from_base("exp-4")
with Run(experiment_name=exp_name, sagemaker_session=Session()) as run:  # run_name is auto-generated if omitted
    run.log_parameters({})
- An experiment has many "runs" (formerly trials)
- AutoML / Pipelines / prep / Feature Store -> Experiments (log, track, analyze) -> Model Registry -> Inference
- Use jobs within experiments; don't loop and manually log params in the notebook - make use of the jobs' automatic parameter logging (load_run sketch below)
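Inside the training/processing script launched as a job, the current run is picked up with load_run(), so parameters and metrics attach to the experiment without notebook-side loops (metric/parameter names are illustrative):

```python
from sagemaker.experiments.run import load_run
from sagemaker.session import Session

with load_run(sagemaker_session=Session()) as run:   # resolves the run the job was launched under
    run.log_parameter("learning_rate", 0.01)
    run.log_metric(name="val:accuracy", value=0.93)
```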
Deploying ML models for inference
- 4 types of Inference
- Realtime inference - always running, can be costly, can use multi-model to save at the expense of brittle infra
- Can do unload/load to scale
- Serverless inference
- Could suffer from cold start (config sketch after this list)
- Batch inference - can support multiple container pipeline
- Async inference - max 15min, max 1G payload
- Can be used similarly to batch or live with minimal resource usage and chain to event-driven next step
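For the serverless option above, a minimal config sketch; memory/concurrency values are placeholders and `model` is an already-built sagemaker Model:

```python
from sagemaker.serverless import ServerlessInferenceConfig

predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,   # also sets proportional vCPU
        max_concurrency=5,
    )
)
```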
A/B Testing with Production Variants
- Per variant: model name, instance type, instance count
variant1 = production_variant(model_name=..., instance_type=..., initial_instance_count=...)
- New feature: shadow testing
- Multicast to 2 variants: the old variant returns its response to the requester, the new (shadow) variant's responses go to S3
- Blue/green deployments support automated rollback
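Sketch of a weighted two-variant endpoint with the SDK; model names, endpoint name and the 90/10 split are placeholders:

```python
from sagemaker.session import Session, production_variant

session = Session()

variant_a = production_variant(
    model_name="churn-model-v1",
    instance_type="ml.m5.large",
    initial_instance_count=1,
    variant_name="VariantA",
    initial_weight=0.9,
)
variant_b = production_variant(
    model_name="churn-model-v2",
    instance_type="ml.m5.large",
    initial_instance_count=1,
    variant_name="VariantB",
    initial_weight=0.1,
)

# One endpoint, traffic split 90/10 between the variants.
session.endpoint_from_production_variants(
    name="churn-ab-endpoint",
    production_variants=[variant_a, variant_b],
)
```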
Multi-model Endpoints (MME)
- Models are loaded dynamically on request & cached
- NVIDIA Triton Server up to 15 GPU instance types
- For advanced use cases
- Supports popular NLP & CV libs like Hugging Face, TensorFlow
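MME sketch with the SDK's MultiDataModel; `model` provides the shared container, the S3 prefix holds the individual artifacts, and names/payload are placeholders:

```python
from sagemaker.multidatamodel import MultiDataModel

mme = MultiDataModel(
    name="nlp-mme",
    model_data_prefix="s3://my-bucket/mme-models/",  # one .tar.gz per model under this prefix
    model=model,                                     # shared serving container / framework stack
)
predictor = mme.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")

# Models are loaded on first request and cached; pick one per request by artifact name.
predictor.predict(data=payload, target_model="sentiment-v3.tar.gz")
```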
Model Monitor
- Now supports Batch Transform job ( Async not supported :( )
- Not AWS Batch
- Has tricky setup requirement
- Templatize the Batch Transform Job
- Takes baseline and uses Clarify to continually compare
- Real-time inference example - sketch below
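Data-quality flavour of the real-time example: baseline from the training data, then an hourly schedule against a capture-enabled endpoint (role, bucket and endpoint names are placeholders):

```python
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Baseline statistics/constraints derived from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline",
)

# Hourly comparison of captured endpoint traffic against that baseline.
monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-data-quality",
    endpoint_input="churn-ab-endpoint",              # endpoint must have data capture enabled
    output_s3_uri="s3://my-bucket/monitor/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```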
Pipelines and autopilot integration, simpler https://aws.amazon.com/blogs/machine-learning/launch-amazon-sagemaker-autopilot-experiments-directly-from-within-amazon-sagemaker-pipelines-to-easily-automate-mlops-workflows/
Jumpstart
- library of shared models, notebooks & solutions
- share within same region/account ( all domains )
SageMaker Role Manager!!!
- Domain on top of profile runs as the "SM execution role", which can be hard to govern traditionally
- Training / Processing needs a pass-role for compute/service; the ephemeral cluster uses the pass-role
- Define least privilege
SageMaker Model Cards (metadata/info for models)
- Not yet well integrated with Model Monitor
- Prefer Model Registry for now
- Could be used as a central audit history of models, including external models and documents, e.g. recording the purpose of a model - a step up from Excel fun
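A minimal, heavily hedged sketch of registering a card via boto3 - the content field names here are from memory and may not match the current model-card JSON schema exactly:

```python
import json
import boto3

sm = boto3.client("sagemaker")

content = {
    "model_overview": {
        "model_description": "Churn propensity model, retrained monthly.",  # record the model's purpose
    }
}

sm.create_model_card(
    ModelCardName="churn-model-card",
    Content=json.dumps(content),
    ModelCardStatus="Draft",
)
```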
SageMaker Model Dashboard
- The Model Dashboard integrates Model Monitor
- Mel also has a CloudWatch "SageMaker-Monitoring-Dashboard"
- Can be shared via a public URL
- CloudWatch can now do cross-account observability
Partha Arthi Melli
!!! Need to use the Jupyter v3 SageMaker domain setting !!!
Q's
Is it valid to load a client payload + historic features?
I want to use Feature Store for multiple data sources, including S3 / Glue tables / Snowflake. The Python Snowflake connector covers direct data access - no need for Feature Store there - but what about further processing of Snowflake data? We might have features from the data lake, features read directly from Snowflake, and features derived from Snowflake.
What is STAM? (Specialist Technical Account Management)