AWS re:Invent 2022 Recap

AWS re:Invent recap, Feb 2023

Redshift

Aurora to Redshift serverless “zero ETL” transaction replication

  • MySQL flavour only
  • Redshift serverless only
  • Create an “integration” on the Aurora side
  • Create a database from the integration on the Redshift side (see the sketch below)
  • Cannot replicate deletes
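
A minimal sketch of the Redshift side of this, assuming the Aurora-side integration already exists; the workgroup name, database, and integration ID are placeholders:

```python
import boto3

# Hedged sketch: run the "create database from integration" step through the
# Redshift Data API. All names/IDs are placeholders.
rsd = boto3.client("redshift-data")

rsd.execute_statement(
    WorkgroupName="my-serverless-workgroup",  # Redshift Serverless only
    Database="dev",
    Sql="CREATE DATABASE zero_etl_db FROM INTEGRATION '<integration-id>';",
)
```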

Auto copy from S3 (looks similar to Snowflake’s Snowpipe and Databricks Auto Loader)

  • Tracks files landing on an S3 prefix and ingests them as they arrive (see the sketch below)
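
A hedged sketch of what a copy job looks like, based on the preview syntax shown at the time; the table, bucket, and role ARN are placeholders:

```python
import boto3

# Hedged sketch: a COPY with "JOB CREATE ... AUTO ON" keeps watching the S3
# prefix and ingests new files automatically. Names/paths are placeholders.
rsd = boto3.client("redshift-data")

rsd.execute_statement(
    WorkgroupName="my-serverless-workgroup",
    Database="dev",
    Sql="""
        COPY public.sales
        FROM 's3://my-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS CSV
        JOB CREATE sales_auto_copy AUTO ON;
    """,
)
```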

Redshift Multi-AZ

  • Cluster sits across two AZs
  • Queries are distributed to a single, but possibly different, AZ
  • This is for provisioned Redshift; Serverless already supports it

Dynamic Data Masking

  • Integrated with IAM (see the sketch below)
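
A hedged sketch of the masking-policy DDL; table, column, and role names are placeholders, so check the Redshift dynamic data masking docs for the exact syntax:

```python
import boto3

# Hedged sketch: define a masking policy, then attach it to a column for a
# given role. Names are placeholders.
rsd = boto3.client("redshift-data")

rsd.batch_execute_statement(
    WorkgroupName="my-serverless-workgroup",
    Database="dev",
    Sqls=[
        """CREATE MASKING POLICY mask_credit_card
           WITH (credit_card VARCHAR(256))
           USING ('XXXX-XXXX-XXXX-XXXX'::VARCHAR(256));""",
        """ATTACH MASKING POLICY mask_credit_card
           ON public.customers(credit_card)
           TO ROLE analyst;""",
    ],
)
```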

Streaming ingestion support

Kinesis Data Streams or Amazon MSK (managed Kafka) topic -> Redshift

  • Uses your cluster’s compute
  • Takes advantage of materialised views
  • Can multi-cast to multiple views
  • Auto refresh, with restart ability (see the sketch below)
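
A hedged sketch of the Kinesis flavour: an external schema over the stream plus an auto-refreshing materialised view. The role ARN and stream name are placeholders, and the payload parsing may need from_varbyte depending on encoding:

```python
import boto3

# Hedged sketch of Redshift streaming ingestion from Kinesis.
rsd = boto3.client("redshift-data")

rsd.batch_execute_statement(
    WorkgroupName="my-serverless-workgroup",
    Database="dev",
    Sqls=[
        """CREATE EXTERNAL SCHEMA kds
           FROM KINESIS
           IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';""",
        """CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
           SELECT approximate_arrival_timestamp,
                  json_parse(kinesis_data) AS payload
           FROM kds."my-stream";""",
    ],
)
```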

Data Sharing

  • Feature of Lake Formation
  • Pause / unpause clusters takes a few minutes (comparable to Databricks, not as good as Snowflake)
  • Works with RA3 or Serverless; can share between Prod & Dev accounts (see the sketch below)
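
A hedged sketch of the producer side of a datashare; the consumer namespace GUID and object names are placeholders:

```python
import boto3

# Hedged sketch: create a datashare, add objects, and grant it to a consumer
# namespace. The consumer then runs CREATE DATABASE ... FROM DATASHARE.
rsd = boto3.client("redshift-data")

rsd.batch_execute_statement(
    WorkgroupName="producer-workgroup",
    Database="dev",
    Sqls=[
        "CREATE DATASHARE sales_share;",
        "ALTER DATASHARE sales_share ADD SCHEMA public;",
        "ALTER DATASHARE sales_share ADD TABLE public.sales;",
        "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>';",
    ],
)
```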

AWS Clean Rooms

  • Focus on encryption, security, no data egress/exfiltration
  • Integration with 3rd party IdP
  • Up to 5 parties can share / upload their data
  • Only 1 collaborator can run queries
  • Create a "collaboration"
  • Associate a table, with optionally filtered columns, to a collaboration (see the sketch below)
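
A hedged sketch against the Clean Rooms API; the member abilities, account IDs, and names are illustrative, so check the CreateCollaboration API reference for exact shapes:

```python
import boto3

# Hedged sketch: create a "collaboration" where only the creator can run
# queries. All values are placeholders.
cr = boto3.client("cleanrooms")

cr.create_collaboration(
    name="audience-overlap",
    description="Shared audience overlap analysis",
    creatorDisplayName="PartyA",
    creatorMemberAbilities=["CAN_QUERY", "CAN_RECEIVE_RESULTS"],
    members=[
        {
            "accountId": "222222222222",
            "displayName": "PartyB",
            "memberAbilities": [],  # contributes data, cannot query
        },
    ],
    queryLogStatus="ENABLED",
)
```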

AWS Glue for Ray

  • Since Glue 4.0
  • 3 ways to run Glue: Spark, Python, and Ray (minimal Ray example below)
  • Is this EMR in the background?
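
For context, the kind of script a Glue "Ray" job runs is plain Ray code; nothing Glue-specific is assumed in this sketch:

```python
import ray

# Plain Ray: Glue 4.0 Ray jobs execute scripts like this on a managed cluster.
ray.init()

@ray.remote
def square(x: int) -> int:
    # Each task can be scheduled on any worker in the Glue-managed Ray cluster.
    return x * x

print(ray.get([square.remote(i) for i in range(10)]))
```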

AWS Glue Data Quality

  • Auto profiling & rule generation
  • Rules are based on PyDeequ
  • "Glue Data Quality" brings DataBrew-style low-code / no-code functionality to Glue
  • Has version control integration (see the ruleset sketch below)
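
A hedged sketch of defining a ruleset in DQDL (the Deequ-derived rules language) via boto3; the ruleset name, database, and table are placeholders:

```python
import boto3

# Hedged sketch: register a DQDL ruleset against a Glue catalog table.
glue = boto3.client("glue")

glue.create_data_quality_ruleset(
    Name="orders_ruleset",
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
    Ruleset="""Rules = [
        RowCount > 0,
        IsComplete "order_id",
        IsUnique "order_id"
    ]""",
)
```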

Athena for Apache Spark

  • Can now run Athena SQL OR Athena Spark
  • Create Workgroup
  • Create Notebook - looks like Jupyter
  • Is NOT EMR or Glue; it's a serverless Spark service that autoscales
  • Low management overhead
  • No config for GPU or other cluster tweaks
  • No Spark UI, Ganglia, or other performance-tuning observability
  • Use case is ad-hoc queries, but it does allow jobs
  • Has an API for jobs (see the sketch below)
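
A hedged sketch of that API: start a session in a Spark-enabled workgroup, then submit a calculation. The workgroup name is a placeholder and the EngineConfiguration fields may differ:

```python
import boto3

# Hedged sketch of the Athena-for-Spark jobs API.
athena = boto3.client("athena")

session = athena.start_session(
    WorkGroup="spark-workgroup",                  # Spark-enabled workgroup
    EngineConfiguration={"MaxConcurrentDpus": 4},
)

athena.start_calculation_execution(
    SessionId=session["SessionId"],
    CodeBlock="spark.range(10).show()",           # PySpark snippet to run
)
```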

Amazon DataZone

  • Data mesh: producers -> DataZone -> consumers
  • A bundling of AWS Glue, Athena, Spark, and S3
  • Business Catalog - enterprise-wide data search
    • Filter by "Domain", "Business Glossary"
  • Access notifications

SageMaker

  • Domain Profile is SSO-integrated, kind of like a separate identity, where SageMaker behaves like a completely different SaaS to the AWS Console.

  • Data Wrangler things

    • Can use Presto on EMR
    • Can import sagemaker_datawrangler
    • aws/sagemaker-python-sdk
    • Offline vs online feature store
      • Online can use data preparation to generate features from live inference API request data + offline historic data
    • The unique ID is timestamp + ID, e.g. customer ID (see the Feature Store sketch after this list)
  • Run notebook as a job

    • Studio -> Jupyter scheduling extension -> (if scheduled) EventBridge -> SageMaker Pipeline -> (if run now, straight here) SageMaker training job
  • SageMaker "Spaces" for collaboration - separate EFS, share/edit the same file

  • Async inference

    • Request -> process from S3 to S3 -> send SNS notification
      • Can also send a manifest for bulk processing
  • Tagging is the best practice, tag all the things

    • Team, cost centre, project; enables detailed cost allocation
    • Automatic tagging at domain, user and space level
    • Tagging can lag, and policies like 'describe' can fail if you don't wait
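
A hedged Feature Store sketch for the offline/online point above: ingest rows keyed by record ID + event time. It assumes a feature group named "customers" already exists with both stores enabled:

```python
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

# Hedged sketch: ingest into an existing feature group. Writes hit the
# online store immediately and flow through to the offline (S3) store.
session = sagemaker.Session()
fg = FeatureGroup(name="customers", sagemaker_session=session)

df = pd.DataFrame(
    {
        "customer_id": ["c-001", "c-002"],           # record identifier
        "event_time": ["2023-02-01T00:00:00Z"] * 2,  # event-time feature
        "lifetime_value": [120.5, 87.0],
    }
)

fg.ingest(data_frame=df, max_workers=1, wait=True)
```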

Experiments

  • New UI easier experiment tracking
  • Integrated into the SDK
```python
from sagemaker.experiments.run import Run, load_run
from sagemaker.session import Session
from sagemaker.utils import unique_name_from_base

exp_name = unique_name_from_base("exp-4")
with Run(experiment_name=exp_name, sagemaker_session=Session()) as run:
    run.log_parameters({})
```
  • Experiment has many "runs" (formerly trials)

  • AutoML / Pipelines / prep + Feature Store -> Experiments (Log, Track, Analyze) -> Model Registry -> Inference

  • Use jobs within experiments; don't loop and manually log params in the notebook. Make use of the jobs' automatic param logging

Deploying ML models for inference

  • 4 types of Inference
    • Realtime inference - always running, can be costly; multi-model can save cost at the expense of brittle infra
      • Can do unload/load to scale
    • Serverless inference
      • Could suffer from cold starts
    • Batch inference - can support a multiple-container pipeline
    • Async inference - max 15 min processing, max 1 GB payload (see the sketch below)
      • Can be used similarly to batch, or live with minimal resource usage, and chained to an event-driven next step
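
A hedged sketch of the async option using the Python SDK: responses land in S3 and SNS is notified. `model` is assumed to be an existing sagemaker.model.Model; the bucket and topic ARNs are placeholders:

```python
from sagemaker.async_inference import AsyncInferenceConfig

# Hedged sketch: deploy an existing `model` behind an async endpoint.
async_config = AsyncInferenceConfig(
    output_path="s3://my-bucket/async-results/",
    max_concurrent_invocations_per_instance=4,
    notification_config={
        "SuccessTopic": "arn:aws:sns:us-east-1:123456789012:infer-success",
        "ErrorTopic": "arn:aws:sns:us-east-1:123456789012:infer-errors",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=async_config,
)
```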

A/B Testing with Production Variants

  • Each variant specifies a model name, instance type, and instance count:

```python
variant1 = production_variant(model_name="my-model", instance_type="ml.m5.xlarge", initial_instance_count=1)
```
  • New feature: shadow testing
    • Multicast to 2 variants: the old one returns its response to the requester, the new one writes its response to S3. Blue/green deployment supports automated rollback
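
A fuller hedged sketch of weighted A/B variants behind one endpoint, using sagemaker.session.production_variant; the model and endpoint names are placeholders for already-created Models:

```python
from sagemaker.session import Session, production_variant

# Hedged sketch: 90/10 traffic split across two existing Models.
variant_a = production_variant(
    model_name="model-a",
    instance_type="ml.m5.xlarge",
    initial_instance_count=1,
    variant_name="VariantA",
    initial_weight=0.9,
)
variant_b = production_variant(
    model_name="model-b",
    instance_type="ml.m5.xlarge",
    initial_instance_count=1,
    variant_name="VariantB",
    initial_weight=0.1,
)

Session().endpoint_from_production_variants(
    name="ab-test-endpoint",
    production_variants=[variant_a, variant_b],
)
```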

Multi-Model Endpoints (MME)

  • Models are loaded dynamically on request & cached
  • NVIDIA Triton Inference Server, up to 15 GPU instance types
    • For advanced use cases
  • Supports popular NLP & CV libraries like Hugging Face and TensorFlow (see the sketch below)
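
A hedged sketch with the Python SDK's MultiDataModel: many artifacts under one S3 prefix served from one endpoint. `model` is assumed to be an existing Model supplying the shared container; names and paths are placeholders:

```python
from sagemaker.multidatamodel import MultiDataModel

# Hedged sketch: one endpoint, many model artifacts loaded on demand.
mme = MultiDataModel(
    name="my-multi-model",
    model_data_prefix="s3://my-bucket/models/",  # artifacts live under here
    model=model,                                 # existing Model (container)
)
predictor = mme.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

# target_model picks the artifact under the prefix; it is loaded on first
# request and cached on the instance afterwards. `payload` must match the
# container's expected input format.
predictor.predict(payload, target_model="model-a.tar.gz")
```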

Model Monitor

Simpler Pipelines and Autopilot integration: https://aws.amazon.com/blogs/machine-learning/launch-amazon-sagemaker-autopilot-experiments-directly-from-within-amazon-sagemaker-pipelines-to-easily-automate-mlops-workflows/

Jumpstart

  • Library of shared models, notebooks & solutions
  • Share within the same region/account (all domains)

SageMaker Governance

SageMaker Role Manager!!!

  • Domain on top of profile is running as the "SM execution role", which can be hard to govern traditionally.
  • Training / Processing needs a pass-role for compute/service. The ephemeral cluster uses the pass-role.

  • Define least-privilege access

SageMaker Model Cards (metadata/info for models)

  • Not yet well integrated with Model Monitor
  • Prefer the Model Registry for now
  • Could be used as a central audit history of models, including external models and documents, e.g. to record the purpose of a model. A step up from Excel fun

SageMaker Model Dashboard

  • Model dashboard integrates model monitor
  • Model Monitor also has a CloudWatch SageMaker-Monitoring-Dashboard
    • Can be shared via public URL
  • CloudWatch can now do cross-account observability

Partha Arthi Melli

!!! Need to use the Jupyter v3 SageMaker domain setting !!!

Q's

Is it valid to load a client payload + historic features?

I want to use Feature Store for multiple data sources, including S3 / Glue tables / Snowflake. For direct data access the Python Snowflake connector is enough; we don't need Feature Store there. But what about further processing of Snowflake data? We might have features from the data lake, direct from Snowflake, and also derived from Snowflake.

What is STAM? Specialist Technical Account Management
