- Aurora zero-ETL integration into Redshift: Aurora MySQL flavour only
- Redshift Serverless only
- Create an "integration" on the Aurora side
- Create a database from the integration on the Redshift side (sketch below)
- Cannot replicate deletes
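Rough sketch (not from the session) of the Redshift-side step via the Data API; the workgroup, database name and integration identifier are placeholders:

```python
import boto3

rsd = boto3.client("redshift-data")

# The Aurora-side "integration" surfaces in Redshift as a database you create
# from the integration identifier (visible in the Redshift console / system views).
rsd.execute_statement(
    WorkgroupName="analytics-wg",   # Redshift Serverless target
    Database="dev",
    Sql="CREATE DATABASE aurora_zeroetl FROM INTEGRATION '<integration-id>';",
)
```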
- Tracks files and ingests
- Multi-AZ: the cluster sits across AZs
- Each query runs in a single AZ, but queries are distributed across different AZs
- This is for traditional (provisioned) Redshift; Serverless already supports this
- Integrated with IAM
Streaming ingestion: Kinesis Data Streams or Amazon MSK (managed Kafka) topic -> Redshift
- Uses your cluster compute
- Takes advantage of materialised view
- Can multi-cast to views
- Auto refresh, restart ability
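A hedged sketch of the materialised-view pattern via the Data API; cluster, user, role ARN and stream name are placeholders, and the SQL is approximated from the streaming-ingestion docs:

```python
import boto3

rsd = boto3.client("redshift-data")
target = dict(ClusterIdentifier="analytics-cluster", Database="dev", DbUser="awsuser")

# 1) An external schema maps the Kinesis stream into Redshift.
rsd.execute_statement(
    Sql="CREATE EXTERNAL SCHEMA kds FROM KINESIS "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';",
    **target,
)

# 2) The materialised view does the ingestion on the cluster's own compute;
#    AUTO REFRESH keeps it current, and several views can read the same stream.
rsd.execute_statement(
    Sql='CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS '
        'SELECT approximate_arrival_timestamp, json_parse(kinesis_data) AS payload '
        'FROM kds."clickstream";',
    **target,
)
```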
- Feature of Lake Formation
- Pause / unpause clusters takes a few minutes (comparable to Databricks, not as good as Snowflake) - pause/resume sketch below
- Data sharing works with RA3 or Serverless; can share between Prod & Dev accounts
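Pause/resume is a plain API call (the cluster identifier is a placeholder):

```python
import boto3

redshift = boto3.client("redshift")

# Pausing stops compute billing (storage still charged); resuming takes a few minutes.
redshift.pause_cluster(ClusterIdentifier="dev-cluster")
# ... later ...
redshift.resume_cluster(ClusterIdentifier="dev-cluster")
```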
- Focus on encryption, security, no data egress/exfiltration
- Integration with a 3rd-party IdP
- AWS Clean Rooms: up to 5 parties can share / upload their data
- Only 1 collaborator can run queries
- Create a "collaboration"
- Associate a table with optionally filtered columns to a collaboration
- Since Glue 4.0
- 3 ways to run Glue: Spark, Python shell and Ray
- Is this EMR in the background?
- Glue Data Quality: auto profiling & rule generation
- Rules based on (py)deequ
- Brings DataBrew-style low-code/no-code functionality to Glue
- Has version control integration
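A minimal sketch of a Glue Data Quality ruleset in DQDL, registered against a catalog table; database/table names are placeholders, and the auto profiler recommends rules in the same language:

```python
import boto3

glue = boto3.client("glue")

# DQDL ruleset - the kind of rules the auto profiler recommends.
ruleset = """
Rules = [
    RowCount > 0,
    IsComplete "customer_id",
    ColumnValues "age" between 0 and 120
]
"""

glue.create_data_quality_ruleset(
    Name="customers-dq",
    Description="Basic completeness and range checks",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "customers"},
)
```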
- Can now do Athena SQL OR Athena Spark
- Create Workgroup
- Create Notebook - looks like Jupyter
- Is NOT EMR or Glue, it's a serverless Spark service that autoscales
- Low management overhead
- No config for GPU or other cluster tweaks
- No Spark UI, Ganglia or other performance-tuning observability
- Use case is ad-hoc queries, but it does allow jobs
- Has an API for sessions/jobs - sketch below
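Sketch of that API for Athena Spark; the workgroup name and DPU sizing are placeholders, and in practice you would poll the session until it is idle before submitting code:

```python
import boto3

athena = boto3.client("athena")

# Start a Spark session in a Spark-enabled workgroup...
session = athena.start_session(
    WorkGroup="spark-workgroup",
    EngineConfiguration={"MaxConcurrentDpus": 20},
)

# ...then submit PySpark code blocks as "calculations".
calc = athena.start_calculation_execution(
    SessionId=session["SessionId"],
    CodeBlock="spark.range(10).show()",
    Description="smoke test",
)
print(calc["CalculationExecutionId"])
```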
- Data mesh: producers -> DataZone -> consumers
- A bundling of AWS Glue, Athena, Spark, S3
- Business Catalog - enterprise-wide data search
- Filter by "Domain", "Business Glossary"
- Access notifications
- Domain profile is SSO-integrated - kind of like a separate identity, in the way SageMaker feels like a completely different SaaS from the AWS Console
- Data Wrangler things
- Can use Presto on EMR
- Can import sagemaker_datawrangler (aws/sagemaker-python-sdk)
- Offline vs online feature store
- The online store can use data preparation to generate features from live inference API request data + offline historic data
- The unique key is event timestamp + record ID, e.g. customer ID
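Feature Store sketch with the SageMaker SDK - feature group name, bucket and role are placeholders; note the record ID + event time pair acting as the unique key:

```python
import time
import pandas as pd
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

df = pd.DataFrame({
    "customer_id": pd.Series(["c-001", "c-002"], dtype="string"),  # record identifier
    "spend_30d": [120.5, 33.0],
    "event_time": [time.time(), time.time()],                      # event timestamp
})

fg = FeatureGroup(name="customers", sagemaker_session=Session())
fg.load_feature_definitions(data_frame=df)   # infer feature types from the frame
fg.create(
    s3_uri="s3://my-bucket/offline-store",   # offline store: historic, point-in-time queries
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    enable_online_store=True,                # online store: low-latency reads at inference time
    role_arn=role,
)
while fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)
fg.ingest(data_frame=df, max_workers=1, wait=True)
```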
- Run notebook as a job
- Studio -> Jupyter scheduling extension -> (if scheduled) EventBridge -> SageMaker Pipeline -> SageMaker training job ("run now" goes straight to the training job)
- SageMaker "Spaces" for collab - separate EFS, share/edit the same file
- Async inference
- request -> process from s3 to s3 -> send sns notification
- can also send a manifest for bulk processing
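Sketch of an async endpoint with the SDK; `model` is an already-built sagemaker Model, and the bucket/SNS ARNs are placeholders:

```python
from sagemaker.async_inference import AsyncInferenceConfig

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-bucket/async-output/",   # results written S3 -> S3
        notification_config={
            "SuccessTopic": "arn:aws:sns:eu-west-1:123456789012:infer-ok",
            "ErrorTopic": "arn:aws:sns:eu-west-1:123456789012:infer-failed",
        },
    ),
)

# The request just references an S3 object; the response lands in output_path
# and SNS fires, so the next step can be event-driven.
response = predictor.predict_async(input_path="s3://my-bucket/async-input/payload.json")
```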
- Tagging is best practice, tag all the things
- team, cost centre, project, enable detailed cost allocation
- automatic tagging at domain, user and space level
- Tag propagation can lag; tag-based policies (e.g. on Describe* calls) can fail if you don't wait
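Tagging any SageMaker resource is one call (the ARN is a placeholder); per the notes, tags set at domain/user/space level are applied to resources created under them:

```python
import boto3

sm = boto3.client("sagemaker")

sm.add_tags(
    ResourceArn="arn:aws:sagemaker:eu-west-1:123456789012:domain/d-example",
    Tags=[
        {"Key": "team", "Value": "ml-platform"},
        {"Key": "cost-centre", "Value": "cc-1234"},
        {"Key": "project", "Value": "churn"},
    ],
)
```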
Experiments
- New UI makes experiment tracking easier
- Integrated into the SDK
from sagemaker.experiments.run import Run, load_run
from sagemaker.session import Session
from sagemaker.utils import unique_name_from_base

exp_name = unique_name_from_base("exp-4")
with Run(experiment_name=exp_name, sagemaker_session=Session()) as run:  # run_name is auto-generated if omitted
    run.log_parameters({})
- An experiment has many "runs" (formerly trials)
- AutoML / Pipelines / prep / Feature Store -> Experiments (log, track, analyze) -> Model Registry -> Inference
- Use jobs within experiments; don't loop and manually log params in the notebook - make use of the jobs' automatic parameter logging (load_run sketch below)
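Inside the training/processing script launched as a job, the current run is picked up with load_run(), so parameters and metrics attach to the experiment without notebook-side loops (metric/parameter names are illustrative):

```python
from sagemaker.experiments.run import load_run
from sagemaker.session import Session

with load_run(sagemaker_session=Session()) as run:   # resolves the run the job was launched under
    run.log_parameter("learning_rate", 0.01)
    run.log_metric(name="val:accuracy", value=0.93)
```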
Deploying ML models for inference
- 4 types of Inference
- Realtime inference - always running, can be costly, can use multi-model to save at the expense of brittle infra
- Can do unload/load to scale
- Serverless inference
- Could suffer from cold start (config sketch after this list)
- Batch inference - can support multiple container pipeline
- Async inference - max 15min, max 1G payload
- Can be used similarly to batch or live with minimal resource usage and chain to event-driven next step
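For the serverless option above, a minimal config sketch; memory/concurrency values are placeholders and `model` is an already-built sagemaker Model:

```python
from sagemaker.serverless import ServerlessInferenceConfig

predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,   # also sets proportional vCPU
        max_concurrency=5,
    )
)
```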
A/B Testing with Production Variants
- Per variant: model name, instance type, instance count
variant1 = production_variant(model_name=..., instance_type=..., initial_instance_count=...)
- New feature: shadow testing
- Multicast to 2 variants: the old variant returns its response to the requester, the new (shadow) variant's responses go to S3
- Blue/green deployments support automated rollback
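Sketch of a weighted two-variant endpoint with the SDK; model names, endpoint name and the 90/10 split are placeholders:

```python
from sagemaker.session import Session, production_variant

session = Session()

variant_a = production_variant(
    model_name="churn-model-v1",
    instance_type="ml.m5.large",
    initial_instance_count=1,
    variant_name="VariantA",
    initial_weight=0.9,
)
variant_b = production_variant(
    model_name="churn-model-v2",
    instance_type="ml.m5.large",
    initial_instance_count=1,
    variant_name="VariantB",
    initial_weight=0.1,
)

# One endpoint, traffic split 90/10 between the variants.
session.endpoint_from_production_variants(
    name="churn-ab-endpoint",
    production_variants=[variant_a, variant_b],
)
```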
Multi-model Endpoints (MME)
- Models are loaded dynamically on request & cached
- NVIDIA Triton Server up to 15 GPU instance types
- For advanced use cases
- Supports popular NLP & CV libs like Hugging Face, TensorFlow
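MME sketch with the SDK's MultiDataModel; `model` provides the shared container, the S3 prefix holds the individual artifacts, and names/payload are placeholders:

```python
from sagemaker.multidatamodel import MultiDataModel

mme = MultiDataModel(
    name="nlp-mme",
    model_data_prefix="s3://my-bucket/mme-models/",  # one .tar.gz per model under this prefix
    model=model,                                     # shared serving container / framework stack
)
predictor = mme.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")

# Models are loaded on first request and cached; pick one per request by artifact name.
predictor.predict(data=payload, target_model="sentiment-v3.tar.gz")
```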
Model Monitor
- Now supports Batch Transform job ( Async not supported :( )
- Not AWS Batch
- Has tricky setup requirement
- Templatize the Batch Transform Job
- Takes baseline and uses Clarify to continually compare
- Real-time inference example - sketch below
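Data-quality flavour of the real-time example: baseline from the training data, then an hourly schedule against a capture-enabled endpoint (role, bucket and endpoint names are placeholders):

```python
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Baseline statistics/constraints derived from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline",
)

# Hourly comparison of captured endpoint traffic against that baseline.
monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-data-quality",
    endpoint_input="churn-ab-endpoint",              # endpoint must have data capture enabled
    output_s3_uri="s3://my-bucket/monitor/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```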
Pipelines and autopilot integration, simpler https://aws.amazon.com/blogs/machine-learning/launch-amazon-sagemaker-autopilot-experiments-directly-from-within-amazon-sagemaker-pipelines-to-easily-automate-mlops-workflows/
Jumpstart
- library of shared models, notebooks & solutions
- share within same region/account ( all domains )
SageMaker Role Manager!!!
- Domain on top of profile runs as the "SM execution role", which can be hard to govern traditionally
- Training / Processing needs a pass-role for compute/service; the ephemeral cluster uses the pass-role
- Define least privilege
SageMaker Model Cards (metadata/info for models)
- Not yet well integrated with Model Monitor
- Prefer Model Registry for now
- Could be used as a central audit history of models, including external models and documents, e.g. recording the purpose of a model - a step up from Excel fun
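A minimal, heavily hedged sketch of registering a card via boto3 - the content field names here are from memory and may not match the current model-card JSON schema exactly:

```python
import json
import boto3

sm = boto3.client("sagemaker")

content = {
    "model_overview": {
        "model_description": "Churn propensity model, retrained monthly.",  # record the model's purpose
    }
}

sm.create_model_card(
    ModelCardName="churn-model-card",
    Content=json.dumps(content),
    ModelCardStatus="Draft",
)
```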
SageMaker Model Dashboard
- The Model Dashboard integrates Model Monitor
- Mel also has a CloudWatch "SageMaker-Monitoring-Dashboard"
- Can be shared via a public URL
- CloudWatch can now do cross-account observability
Partha Arthi Melli
!!! Need to use the Jupyter v3 SageMaker domain setting !!!
Q's
Is it valid to load a client payload + historic features?
I want to use Feature Store for multiple data sources, including S3 / Glue tables / Snowflake. The Python Snowflake connector covers direct data access - no need for Feature Store there - but what about further processing of Snowflake data? We might have features from the data lake, features read directly from Snowflake, and features derived from Snowflake.
What is STAM? (Specialist Technical Account Management)