Anjaiah Methuku anjijava16

import org.apache.spark.sql.types._
// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
conda info
conda update -n base -c defaults conda
conda create --name data_ingestion python=3.6
(OR)
conda create --name data_ingestion
conda activate data_ingestion
conda list
https://www.esg-global.com/validation/esg-technical-review-analyzing-the-performance-of-mapr-db
https://medium.com/hackernoon/interacting-with-mapr-db-58c4f482efa1
https://www.linkedin.com/pulse/hbase-mapr-db-designed-distribution-scale-speed-chaaranpall-lambba/
https://stackoverflow.com/questions/30254134/difference-between-mapr-db-and-hbase
Understand the unique processing characteristics of stream processing:
This includes the difference between event time and processing time, sliding and tumbling windows, late-arriving data and watermarks,
and missing data.
i. Event time is the time at which something occurred at the place where the data is generated.
ii. Processing time is the time at which data arrives at the endpoint where it is ingested.
iii. Sliding windows are used when you want to show how an aggregate, such as the average of the last three values, changes over time,
and you want to update that stream of averages each time a new value arrives in the stream.
iv. Tumbling windows are used when you want to aggregate data over a fixed period of time, for example, over the last one minute.
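The two window types described above can be sketched in plain Python. This is a minimal illustration, not tied to any streaming framework; the sample events, the function names, and the 60-second window size are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample events: (event_time_in_seconds, value).
events = [(5, 10.0), (20, 20.0), (61, 30.0), (75, 40.0), (130, 50.0)]


def tumbling_window_sums(events, size_s=60):
    """Sum values in fixed, non-overlapping windows keyed by window start (event time)."""
    sums = defaultdict(float)
    for event_time, value in events:
        # Each event belongs to exactly one window.
        window_start = (event_time // size_s) * size_s
        sums[window_start] += value
    return dict(sorted(sums.items()))


def sliding_average(values, n=3):
    """Emit the average of the last n values each time a new value arrives."""
    averages = []
    for i in range(len(values)):
        # Until n values have arrived, average whatever is available.
        recent = values[max(0, i - n + 1): i + 1]
        averages.append(sum(recent) / len(recent))
    return averages


tumbling = tumbling_window_sums(events)
sliding = sliding_average([value for _, value in events])
print(tumbling)
print(sliding)
```

The tumbling function assigns each event to one window by its event time, while the sliding function produces a new aggregate for every arriving value, so consecutive results overlap.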
i. GCS transfer tools (for small transfers, up to a few TB)
gsutil
rsync: fast multi-threaded mode (gsutil -m)
ii. Transfer Service
Tools: UI, client libraries, HTTP REST API
Transfer Service for cloud data:
Transfer Service enables you to quickly and securely transfer data into Google Cloud Storage from a variety of online sources, such as Amazon S3 and Azure Blob Storage, or to move data between Cloud Storage buckets.
# az vm create command to create a Linux VM:
az vm create \
--resource-group learn-85594f60-ef0f-4f1e-ad12-08bf2ea66630 \
--name myvmanji \
--image UbuntuLTS \
--admin-username azureuser \
--generate-ssh-keys
#Run the following az vm extension set command to configure Nginx on your VM:
https://app.pluralsight.com/library/courses/preparing-google-cloud-professional-data-engineer-exam-1/recommended-courses ---> ML
https://app.pluralsight.com/profile/author/vitthal-srinivasan
https://app.pluralsight.com/profile/author/james-wilson
https://app.pluralsight.com/profile/author/janani-ravi
Table :
====================
CREATE EXTERNAL TABLE tweets (
  createddate string,
  geolocation string,
  tweetmessage string,
  user_name struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 'gs://iwinner-data/json_data';
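For reference, here is a rough sketch of the kind of input line this table definition would read, assuming the JSON keys mirror the column names and each record sits on a single line. All field values and the helper name `to_json_line` are hypothetical.

```python
import json

# Hypothetical record; keys mirror the column names in the table definition.
record = {
    "createddate": "2020-01-01 00:00:00",
    "geolocation": "0.0,0.0",
    "tweetmessage": "hello",
    "user_name": {  # nested object maps to the struct column
        "geoenabled": False,
        "id": 1,
        "name": "Example User",
        "screenname": "example",
        "userlocation": "Nowhere",
    },
}


def to_json_line(rec):
    """Serialize one record as a single compact JSON line."""
    return json.dumps(rec, separators=(",", ":"))


line = to_json_line(record)
print(line)
```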
Query :
Database pioneer and Turing Award winner Jim Gray coined a famous adage: when you have lots of data, bring [machine learning] computations to the data, rather than data to the computations.
According to him, nothing is closer to the data than the database, so the computations have to be done inside the database.
Now all major cloud and database vendors are:
🔸 offering SQL data pipelines in the data warehouse
🔸 expanding their in-database ML offerings
Running ML and analytics in the data warehouse avoids moving data out of it, which tends to make them cheaper and more efficient.
C:\Users\anjai>gcloud config get-value project
iwinner-data
Updates are available for some Cloud SDK components. To install them,
please run:
$ gcloud components update