Practice Exam Attempt (3), taken carefully with notes
Got a 16/20 on the latest practice exam. Missed questions 5, 7, 12, and 13.
Summary of why missed each question:
5
always favor the simpler method (cbt (the Cloud Bigtable CLI tool) vs
the HBase CLI) … I feel uneasy with this answer
7
prefer the simpler method. Pub/Sub is more than capable; Kafka is
overkill. This question showed my lack of knowledge of Pub/Sub's
capabilities
12
I was unaware of Google Cloud IAM best practices: the principle
of least privilege; use predefined IAM roles when possible. I
do not have strong enough knowledge of IAM.
13
a misunderstanding of the question and not knowing what a MID
value is. Prefer the simple solution.
Learning:
MID values can be looked up in the Google Knowledge Graph
base64-encode image data sent to the Vision API
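A minimal sketch of that base64 step against the Vision API REST endpoint (the file name and API key below are placeholders):

    import base64
    import json
    import requests  # assumes the requests library is installed

    # Inline image bytes must be base64 encoded in the JSON request body.
    with open("image.jpg", "rb") as f:  # placeholder file
        content = base64.b64encode(f.read()).decode("utf-8")

    body = {"requests": [{
        "image": {"content": content},
        "features": [{"type": "LABEL_DETECTION", "maxResults": 5}],
    }]}
    resp = requests.post(
        "https://vision.googleapis.com/v1/images:annotate",
        params={"key": "YOUR_API_KEY"},  # placeholder credential
        data=json.dumps(body),
    )
    print(resp.json())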
Questions:
storage files for a data pipeline
what is a temporary table vs permanent?
A temporary table is a randomly named table saved in a special
dataset. Temporary tables are used to cache query results. A
temporary table has a lifetime of approximately 24
hours. Temporary tables are not available for sharing, and are
not visible using any of the standard list or other table
manipulation methods. You are not charged for storing temporary
tables.
A permanent table can be a new or existing table in any dataset
to which you have access. If you write query results to a new
table, you are charged for storing the data. When you write query
results to a permanent table, the tables you’re querying must be
in the same location as the dataset that contains the destination
table.
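A hedged sketch of writing query results to a permanent table with the google-cloud-bigquery client (project, dataset, and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.my_table"  # placeholder destination

    # Setting a destination turns the otherwise temporary result table
    # into a permanent table; you are then charged for its storage.
    job_config = bigquery.QueryJobConfig(destination=table_id)
    sql = """
        SELECT name, COUNT(*) AS n
        FROM `bigquery-public-data.usa_names.usa_1910_current`
        GROUP BY name
    """
    client.query(sql, job_config=job_config).result()  # waits for completion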
auto detect schema?
BigQuery samples up to 100 rows from one input file and detects the schema from that
not easy to change table schemas later
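A similar sketch of loading with schema auto-detection (bucket and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        autodetect=True,  # BigQuery infers the schema from a sample of rows
        source_format=bigquery.SourceFormat.CSV,
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/data.csv",          # placeholder source
        "my-project.my_dataset.my_table",   # placeholder destination
        job_config=job_config,
    )
    load_job.result()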
Bigtable
single-row transactions
Cloud Bigtable's Key Visualizer lets you spot hot rows or overloaded nodes
see if the key schema is balanced
Pub/sub vs kafka
Kafka can store messages for arbitrary amounts of time
Pub/Sub can only store messages for 7 days
Pub/Sub has no ordering guarantees; can use a timestamp attribute to help
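A minimal sketch of attaching a timestamp attribute at publish time so a subscriber can reorder (project and topic names are placeholders):

    import time
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

    # Pub/Sub attributes are string key/value pairs, so ship an event
    # timestamp the subscriber can sort on.
    future = publisher.publish(
        topic_path,
        b"event payload",
        event_time=str(int(time.time())),
    )
    print(future.result())  # server-assigned message ID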
Cloud Machine Learning Engine
“error” vs. “failed”
The Operation object will include one of two keys on completion:
The “response” key is present if the operation was
successful. Its value should be google.protobuf.Empty, as none
of the Cloud ML Engine long-running operations have response
objects.
The “error” key is present if there was an error. Its value is a Status object.
The job object
SUCCEEDED, FAILED, CANCELLED
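A hedged sketch of telling the two apart; the payload below is illustrative, not a real response:

    # An Operation reports completion via an "error" or "response" key;
    # a Job reports a terminal state: SUCCEEDED, FAILED, or CANCELLED.
    operation = {
        "name": "projects/my-project/operations/abc123",  # illustrative
        "done": True,
        "error": {"code": 3, "message": "invalid training input"},  # Status object
    }
    if operation.get("done"):
        if "error" in operation:
            print("operation failed:", operation["error"]["message"])
        else:
            # "response" is google.protobuf.Empty for ML Engine operations
            print("operation succeeded")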
supervised ML fits well here
Bayesian optimization is used for hyperparameter tuning
a job is a series of operations, thus you don't care about the operations themselves.
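A sketch of the hyperparameter section of the ML Engine training input (all values are placeholders; the service runs Bayesian optimization over the declared params):

    training_input = {
        "scaleTier": "BASIC",
        "hyperparameters": {
            "goal": "MAXIMIZE",
            "hyperparameterMetricTag": "accuracy",  # metric your trainer reports
            "maxTrials": 20,
            "maxParallelTrials": 2,
            "params": [{
                "parameterName": "learning_rate",
                "type": "DOUBLE",
                "minValue": 0.0001,
                "maxValue": 0.1,
                "scaleType": "UNIT_LOG_SCALE",
            }],
        },
    }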
Dataprep features
people spend a long time on data preparation
an easy-to-use pandas-like tool
cleanse, enrich, validate
will require a data engineer to run these operations (can be a replacement for pandas work)
traditional discovery (query to explore)
many supported types
can be integrated and run continuously
runs Dataflow behind the scenes
keeps track of the operations that we have done
once done you can start the large Dataflow job
yes, you can provide a template for conversion
Cloud IAM
see notes in products
predefined roles are preferable. limit to minimum permissions.
Cloud Vision ML API
base64-encode data sent to the API. Definitely the simplest option.
MID is a single field; there is more to the API
Bigtable storage type (HDD vs. SSD) cannot be changed, but compute (number of nodes) can be
how to snapshot images
snapshots are incremental diffs against the original image, taking
up much less space.
went with the simplest option; was possibly lucky
BigQuery cache enable
The prefetch cache (A.K.A. the “Smart cache”) predicts the data
that a component could request by analyzing the dimensions,
metrics, filters, and date range properties and controls on the
report. Data Studio then stores (prefetches) as much of the data as
possible that could be used to answer the predicted queries. When a
query can’t be answered by the query cache, Data Studio tries to
answer it using this prefetched data. If the query can’t be
answered by the prefetch cache, the data will come from the
underlying data set.
must be an owner to enable it
a lightning bolt icon shows that it is enabled
dimension vs filter
a filter reduces the data
a dimension is equivalent to a GROUP BY operation
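As a rough, illustrative analogy in pandas:

    import pandas as pd

    df = pd.DataFrame({"country": ["US", "US", "DE"], "sales": [10, 20, 5]})
    filtered = df[df["sales"] > 5]               # filter: drops rows
    grouped = filtered.groupby("country").sum()  # dimension: acts like GROUP BY
    print(grouped)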
Encryption at rest
three options
default encryption (GCP only), supplied and controlled by Google
applies to all services
Cloud Key Management Service (key stored on Google, controlled by the user, can be supplied)
use with BigQuery, Storage, and Compute Engine
customer-supplied key (Google never sees it)
limited to Storage and Compute Engine
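A hedged sketch of the customer-supplied case with the google-cloud-storage client (bucket and object names are placeholders):

    import os
    from google.cloud import storage

    key = os.urandom(32)  # 256-bit AES key; Google never stores it, so keep it safe
    client = storage.Client()
    bucket = client.bucket("my-bucket")  # placeholder

    blob = bucket.blob("secret.txt", encryption_key=key)
    blob.upload_from_string("hello")

    # Reading the object back requires supplying the same key.
    blob = bucket.blob("secret.txt", encryption_key=key)
    print(blob.download_as_string())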
Data Engineer Exam
Provisional Result is a PASS
Reflection:
general knowledge of each service will only answer 5 or so questions
Coursera material is useful as an introduction (the documentation is a much better and denser resource)
much more focused than expected
only 10 questions were like the practice exam
not testing new features (for instance clustering or GIS in BigQuery)
case studies match the online ones (10 questions); read them, but not very important to the questions
Concepts (number of questions relevant approximate):
BigQuery (20)
Dataflow (15)
Dataproc (5-10)
Bigtable (5-10)
IAM Permissions (10) - always remember principle of least privilege
Data Transfer (1-2)
Encryption (3-4)
ML Questions Basic (5)
Pub/Sub (5)
Concepts ALWAYS emphasized (there are usually 2 technically correct answers that boil down to):
simplicity (fewer words … seriously) and not tedious (avoid answers with manual steps that can't be automated)
The documentation on each service is intimidating, but it is the most
important thing to read. Important things to focus on: Coursera
material is only a high-level view of each service, BUT Qwiklabs
(in Coursera) are the way to learn how to use Google Cloud.
features (usually the front page)
concepts (explain limitations of service, performance, and architecture)
Cloud Functions
Automatically scales, highly available and fault tolerant. No servers
to provision.
file storage events
events (pub/sub)
http
Firebase, and Google Assistant
stackdriver logging
Languages: Python 3.7 and Node.js.
Access and IAM. VPC access to Cloud Functions: you cannot connect a
VPC to Cloud Functions. IAM controls the invocation of the
function: --member allows you to control which users can invoke the
function; an IAM check makes sure the identity is appropriate.
Persistent Disk
Durable and high-performance block storage. SSD and HDD available. Can
be attached to any Compute Engine instance or used as Google Kubernetes
Engine storage volumes. Transparently resized and easy backups. Both can
be up to 64 TB in size. More expensive per GB than Cloud Storage. No
charge for IO.
zonal persistent disk: HDD and SSD (efficient, reliable block storage)
regional persistent disk: HDD and SSD, replicated in two zones
local SSD: high-performance transient local block storage
Persistent disk performance is predictable and scales linearly with
provisioned capacity until the limits for an instance's provisioned
vCPUs are reached.
Cloud Storage for Firebase
Free to get started. Uses Google Cloud Storage behind the scenes. Easy
way to provide access to files to users based on
authentication. Trigger functions to process these files. Client
SDKs provide reliable uploads on spotty connections. Targets
mobile.
Cloud Filestore is a managed file storage service (NAS).
Connects to Compute Engine and Kubernetes Engine instances. Low
latency file operations. Performance equivalent to a typical HDD. Can
get SSD performance for a premium.
Size must be between 1 TB and 64 TB. Priced per gigabyte per hour. About
5x more expensive than object storage. About 2-3x more expensive than
Blob storage.
Cloud Filestore exists in the zone that you are using.
Storage Transfer Service allows you to quickly import online data
into Cloud Storage. You can also set up a repeating schedule for
transferring data, as well as transfer data within Cloud Storage,
from one bucket to another.
schedule one time transfer operations or recurring transfer operations
delete existing objects in the destination bucket if they don't
have a corresponding object in the source
delete source objects after transferring them
schedule periodic synchronization from source to destination (with filters)
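A hedged sketch of those options as a transfer job spec; field names follow the Storage Transfer Service REST API, bucket names are placeholders:

    transfer_job = {
        "description": "nightly bucket sync",  # illustrative
        "transferSpec": {
            "gcsDataSource": {"bucketName": "source-bucket"},
            "gcsDataSink": {"bucketName": "dest-bucket"},
            "transferOptions": {
                # delete destination objects with no matching source object
                "deleteObjectsUniqueInSink": True,
                # or: delete source objects once they have been transferred
                # "deleteObjectsFromSourceAfterTransfer": True,
            },
        },
    }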
Transfer Appliance
install storage locally, move data onto it, and ship it to Google
two rackable appliances capable of 100 TB + 100 TB
standalone 500 TB - 1 PB storage to transfer.
worth it for more than 20 TB of data.
capable of high upload speeds (> 1 GB per second)
BigQuery Data Transfer Service
The BigQuery Data Transfer Service automates data movement from
Software as a Service (SaaS) applications such as Google Ads and
Google Ad Manager on a scheduled, managed basis. Your analytics
team can lay the foundation for a data warehouse without writing
a single line of code.
data sources
campaign manager, cloud storage, google ad
manager, google ads, google play, youtube
channel reports, youtube content owner reports.
Cloud SQL
Fully managed MySQL and PostgreSQL service. Sustained usage
discount. Data replication between zones in a region.
Fully managed MySQL Community Edition databases in the cloud.
Second Generation instances support MySQL 5.6 or 5.7, and provide
up to 416 GB of RAM and 10 TB data storage, with the option to
automatically increase the storage size as needed.
First Generation instances support MySQL 5.5 or 5.6, and provide up
to 16 GB of RAM and 500 GB data storage.
Create and manage instances in the Google Cloud Platform Console.
Instances available in US, EU, or Asia.
Customer data encrypted on Google’s internal networks and in
database tables, temporary files, and backups.
Support for secure external connections with the Cloud SQL Proxy or
with the SSL/TLS protocol.
Support for private IP (beta; private services access).
Data replication between multiple zones with automatic failover.
Import and export databases using mysqldump, or import and export
CSV files.
Support for MySQL wire protocol and standard MySQL connectors.
Automated and on-demand backups, and point-in-time recovery.
Instance cloning.
Integration with Stackdriver logging and monitoring.
Cloud Bigtable
High throughput and consistent. Sub-10 ms latency. Scales to billions
of rows and thousands of columns. Can store TBs and PBs of data. Each
row consists of a key. Large amounts of single-keyed data with low
latency. High read and write throughput. Apache HBase API.
Replication among zones. Key-value map.
key to sort among rows
column families for combinations of columns
Performance
row keys should be evenly spread among nodes
How to choose a row key:
reverse domain names (domain names should be written in reverse), e.g. com.google
string identifiers (do not hash them)
timestamps in the row key (not as the leading part)
row keys can store multiple things - keep in mind that keys are sorted (lexicographically)
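A hedged sketch of combining those pieces into a key (the names and the # separator are illustrative):

    # Keys sort lexicographically, so lead with an evenly distributed field
    # and avoid a raw timestamp prefix (it hot-spots a single node).
    def make_row_key(domain: str, metric: str, ts: int) -> bytes:
        reversed_domain = ".".join(reversed(domain.split(".")))  # "maps.google.com" -> "com.google.maps"
        return "#".join([reversed_domain, metric, str(ts)]).encode()

    print(make_row_key("maps.google.com", "latency", 1546300800))
    # b'com.google.maps#latency#1546300800'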
Cloud Datastore
SQL-like query language. ACID transactions. Fully managed.
Eventually consistent. Not for relational data but for storing objects.
Atomic transactions. Cloud Datastore can execute a set of
operations where either all succeed, or none occur.
High availability of reads and writes. Cloud Datastore runs in
Google data centers, which use redundancy to minimize impact from
points of failure.
Massive scalability with high performance. Cloud Datastore uses a
distributed architecture to automatically manage scaling. Cloud
Datastore uses a mix of indexes and query constraints so your
queries scale with the size of your result set, not the size of
your data set.
Flexible storage and querying of data. Cloud Datastore maps
naturally to object-oriented and scripting languages, and is
exposed to applications through multiple clients. It also provides
a SQL-like query language.
Balance of strong and eventual consistency. Cloud Datastore ensures
that entity lookups by key and ancestor queries always receive
strongly consistent data. All other queries are eventually
consistent. The consistency models allow your application to
deliver a great user experience while handling large amounts of
data and users.
Encryption at rest. Cloud Datastore automatically encrypts all data
before it is written to disk and automatically decrypts the data
when read by an authorized user. For more information, see
Server-Side Encryption.
Fully managed with no planned downtime. Google handles the
administration of the Cloud Datastore service so you can focus on
your application. Your application can still use Cloud Datastore
when the service receives a planned upgrade.
Cloud Firestore
NoSQL database built for global apps. Compatible with the Datastore
API. Automatic multi-region replication. ACID transactions. Query
engine. Integrated with Firebase services.
Firebase Realtime Database
Real-time syncing of JSON data. Can collaborate across devices with
ease.
Could this be used for Jupyter notebooks? Probably not due to
restrictions… can't see exactly what text has changed.
Networking
Virtual Private Cloud (VPC)
A private space within Google Cloud Platform. A single VPC can span
multiple regions without communicating across the public Internet. Can
allow for single connection points between a VPC and on-premise
resources. VPCs can be applied at the organization level outside of
projects. No shutdown or downtime when adding IP space and subnets.
Get private access to Google services such as Storage, big data
services, etc. without having to give a public IP.
Cloud CDN
Low-latency, low-cost content delivery using Google's global
network. Recently ranked the fastest CDN. 90 cache sites. Always close
to users. Cloud CDN comes with SSL/TLS.
anycast (single IP address)
HTTP/2 support, including server push
HTTPS
invalidation
take down cached content in minutes
logging with Stackdriver
serve content from Compute Engine and Cloud Storage buckets. Can
mix and match.
Cloud Console
web interface for working with Google Cloud resources
Cloud shell
command line management for web browser
Cloud mobile app
available on Android and iOS
Separate billing accounts for managing payment for projects
Cloud Deployment Manager for managing Google Cloud infrastructure
Cloud APIs for all GCP services
API Platform and Ecosystems
Apigee API Platform
Many, many features around APIs. Features: design, secure, deploy,
monitor, and scale APIs. Enforce API policies, quota management,
transformation, authorization, and access control.
create API proxies from Open API specifications and deploy them in
the cloud.
protect APIs: OAuth 2.0, SAML, TLS, and protection from traffic spikes
dynamic routing, caching, and rate-limiting policies
publish APIs to a developer portal where developers can
explore them
measure performance and usage, integrating with Stackdriver.
Free trial, then $500 quickstart and larger ones later. Monetization.
Healthcare, Banking, Sense (protect from attacks).
API Monetization
Tools for creating billing reports for users. Flexible report models
etc. This is through apigee
$300/month to start. Engagement, operational metrics, business
metrics.
Google Cloud Endpoints allows a shared backend. Cloud endpoints
annotations. Will generate client libraries for the different
languages.
Nginx based proxy. Open API specification and provides insight with
stackdriver, monitoring, trace, and logging.
Control who has access to your API and validate every call with JSON
web tokens and Google API keys. Integration with Firebase
Authentication and Auth0.
Less than 1ms per call.
Generate API keys in GCP console and validate on every API call.
Developer Portal
Have dashboards and places for developers to easily test the API.
Developer Tools
Cloud SDK
cli for GCP products
gcloud manages authentication, local configuration, developer
workflow, and interactions with Cloud Platform APIs
bq
BigQuery through the command line
kubectl
management of kubernetes
gsutil
command line access to manage cloud storage buckets and
objects
Cloud Dataflow
Fully managed service for transforming and reacting to data.
automated resource management
Cloud Dataflow automates provisioning and management of processing
resources to minimize latency and maximize utilization; no more
spinning up instances by hand or reserving them.
dynamic work rebalancing
automated and optimized work partitioning that dynamically
rebalances lagging work.
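A minimal Apache Beam sketch; with runner="DataflowRunner" (plus project/region options) the service provisions and rebalances work automatically. The local DirectRunner is used here so the example stays self-contained:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    opts = PipelineOptions(runner="DirectRunner")  # swap for DataflowRunner on GCP
    with beam.Pipeline(options=opts) as p:
        (p
         | beam.Create(["to be or not", "to be"])
         | beam.FlatMap(str.split)            # split lines into words
         | beam.combiners.Count.PerElement()  # (word, count) pairs
         | beam.Map(print))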
Cloud Composer
Integrates with gcloud composer. Cloud SQL is used to store the
Airflow metadata. App Engine serves the web interface. Cloud
Storage is used for storing Python plugins, DAGs, etc. All running
inside of GKE. Stackdriver is used for collecting all logs.
Google Cloud Natural Language reveals the structure and meaning of
text both through powerful pretrained machine learning models in an
easy to use REST API and through custom models that are easy to build
with AutoML Natural Language (beta). Learn more about Cloud AutoML.
You can use Cloud Natural Language to extract information about
people, places, events, and much more mentioned in text documents,
news articles, or blog posts. You can use it to understand sentiment
about your product on social media or parse intent from customer
conversations happening in a call center or a messaging app. You can
analyze text uploaded in your request or integrate with your document
storage on Google Cloud Storage.
Google Cloud Speech-to-Text enables developers to convert audio to
text by applying powerful neural network models in an easy-to-use
API. The API recognizes 120 languages and variants to support your
global user base. You can enable voice command-and-control, transcribe
audio from call centers, and more. It can process real-time streaming
or prerecorded audio, using Google’s machine learning technology.
Google Cloud Text-to-Speech enables developers to synthesize
natural-sounding speech with 30 voices, available in multiple
languages and variants. It applies DeepMind’s groundbreaking research
in WaveNet and Google’s powerful neural networks to deliver high
fidelity audio. With this easy-to-use API, you can create lifelike
interactions with your users, across many applications and devices.
Cloud Translation offers both an API that uses pretrained models and
the ability to build custom models specific to your needs, using
AutoML Translation.
The Translation API provides a simple programmatic interface for
translating an arbitrary string into any supported language using
state-of-the-art Neural Machine Translation. It is highly responsive,
so websites and applications can integrate with Translation API for
fast, dynamic translation of source text from the source language to a
target language (such as French to English). Language detection is
also available in cases where the source language is unknown. The
underlying technology is updated constantly to include improvements
from Google research teams, which results in better translations and
new languages and language pairs.
Cloud Vision offers both pretrained models via an API and the ability
to build custom models using AutoML Vision to provide flexibility
depending on your use case.
Cloud Vision API enables developers to understand the content of an
image by encapsulating powerful machine learning models in an
easy-to-use REST API. It quickly classifies images into thousands of
categories (such as, “sailboat”), detects individual objects and faces
within images, and reads printed words contained within images. You
can build metadata on your image catalog, moderate offensive content,
or enable new marketing scenarios through image sentiment analysis.
allAuthenticatedUsers (any signed-in Google account)
allUsers (anyone on the web)
Resource:
you can grant access to <service>.<resource> resources
Permissions:
you can grant access based on <service>.<resource>.<verb>.
Roles are collections of permissions. Three kinds of roles in Cloud IAM.
primitive roles: Owner, Editor, Viewer
predefined roles: finer access than primitive
roles. roles/pubsub.publisher provides access to only publish
messages to a cloud pub/sub topic.
custom roles: custom roles specific to the organization.
Cloud IAM policies bind members -> roles.
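A sketch of that binding structure as the policy looks in practice (members and project are placeholders):

    # A Cloud IAM policy is a list of bindings: one role -> many members.
    policy = {
        "bindings": [
            {
                "role": "roles/pubsub.publisher",  # predefined role, least privilege
                "members": [
                    "user:alice@example.com",
                    "group:data-eng@example.com",
                    "serviceAccount:etl@my-project.iam.gserviceaccount.com",
                ],
            },
            {"role": "roles/viewer", "members": ["group:analysts@example.com"]},
        ]
    }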
Resource Hierarchy
You can set a Cloud IAM policy at any level in the resource hierarchy:
the organization level, the folder level, the project level, or the
resource level. Resources inherit the policies of the parent
resource. If you set a policy at the organization level, it is
automatically inherited by all its children projects, and if you set a
policy at the project level, it’s inherited by all its child
resources. The effective policy for a resource is the union of the
policy set at that resource and the policy inherited from higher up in
the hierarchy.
Best Practices:
Mirror your IAM policy hierarchy structure to your organization
structure.
Use the security principle of least privilege to grant IAM roles,
that is, only give the least amount of access necessary to your
resources.
grant roles to groups when possible
grant roles at the smallest scope needed
use billing roles for administrative oversight
prefer predefined roles (not primitive)
the Owner role allows modifying the IAM policy, so grant it carefully
use Cloud Audit Logs to monitor changes to IAM policy
Service Accounts:
belong to an application or virtual machine instead of user
Each service account is associated with a key pair, which is
managed by Google Cloud Platform (GCP). It is used for
service-to-service authentication within GCP. These keys are
rotated automatically by Google, and are used for signing for a
maximum of two weeks.
You should only grant the service account the minimum set of
permissions required to achieve their goal.
Compute Engine instances need to run as service accounts