# Import the English language class
from spacy.lang.en import English
# Create the nlp object
nlp = English()
Here is a list of some terms associated with Hadoop. You'll learn more about these terms and how they relate to Spark in the rest of the lesson.
- Hadoop - an ecosystem of tools for big data storage and data analysis. Hadoop is an older system than Spark but is still used by many companies. The major difference between Spark and Hadoop is how they use memory. Hadoop writes intermediate results to disk whereas Spark tries to keep data in memory whenever possible. This makes Spark faster for many use cases.
- Hadoop MapReduce - a system for processing and analyzing large data sets in parallel.
- Hadoop YARN - a resource manager that schedules jobs across a cluster. The manager keeps track of what computer resources are available and then assigns those resources to specific tasks.
- Hadoop Distributed File System (HDFS) - a big data storage system that splits data into chunks and stores the chunks across a cluster of computers.
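The MapReduce idea described above can be sketched in a few lines of plain Python. This is only a single-machine illustration of the map, shuffle, and reduce steps, not Hadoop's actual API; the function names and the sample `chunks` are made up for the example, with each string standing in for one chunk of a file stored on HDFS.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Hypothetical input: each string stands in for one chunk of a larger file.
chunks = [
    "spark keeps data in memory",
    "hadoop writes data to disk",
    "spark and hadoop",
]

# On a real cluster each chunk would be mapped on a different machine in parallel.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each chunk is mapped independently, the map step is the part that can be spread across machines; only the shuffle needs to bring matching keys together.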
As Hadoop matured, other tools were developed t
sudo (SuperUser DO): run programs / commands with administrative privileges
apt-get
sudo apt-get update: refreshes the local package index so the system knows whether newer package versions are available. Usually the first command to run on a Debian-based system after a fresh install.
sudo apt-get upgrade: upgrades all installed packages that have updates available.
DATA ACCOUNT_TABLE;
    INFILE DATALINES DELIMITER=',';
    INPUT FirstName $ LastName $ Age Gender $;
    DATALINES;
x,y,23,Male
z,w,45,Female
a,b,64,Male
;
RUN;
Operators in the where clause
- Equal (=)
- Greater Than (>)
- Less Than (<)
- Greater Than or Equal (>=)
- Less Than or Equal (<=)
- Not Equal (<>)
- BETWEEN: matches values within a range (both endpoints included)
- LIKE: matches text against a pattern
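The operators above can be tried out with Python's built-in `sqlite3` module. This is a small sketch under assumptions: the `accounts` table, its column names, and the `first_names` helper are invented for the example, and the three rows reuse the sample values from the SAS data step.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (first_name TEXT, last_name TEXT, age INTEGER, gender TEXT)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?, ?)",
    [("x", "y", 23, "Male"), ("z", "w", 45, "Female"), ("a", "b", 64, "Male")],
)

def first_names(where_clause):
    """Return the first names matching a WHERE clause, sorted alphabetically.

    The clause is interpolated directly, which is fine for fixed demo strings
    but should never be done with user input.
    """
    query = f"SELECT first_name FROM accounts WHERE {where_clause} ORDER BY first_name"
    return [row[0] for row in conn.execute(query)]

equal = first_names("age = 45")
not_equal = first_names("age <> 45")
greater_or_equal = first_names("age >= 45")
between = first_names("age BETWEEN 23 AND 45")  # both endpoints are included
like = first_names("gender LIKE 'M%'")          # % matches any run of characters

print(equal, not_equal, greater_or_equal, between, like)
```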
# This basically installs some dependencies, adds two SQL scripts and runs a provided SH script.
FROM apache/airflow:2.0.0-python3.7
USER root
# INSTALL TOOLS
RUN apt-get update \
    && apt-get -y install libaio-dev \
    && apt-get -y install postgresql-client
RUN mkdir extra
USER airflow
O(1)
Constant - no loops
O(log n)
Logarithmic - searching algorithms are usually O(log n) when the data is sorted (e.g. binary search)
O(n)
Linear - for loops, while loops through n items
O(n log(n))
Log Linear - usually sorting operations
O(n^2)
Quadratic - every element in a collection needs to be compared to every other element. Two nested loops
O(2^n)
Exponential - recursive algorithms that make two or more recursive calls on inputs only slightly smaller than n (e.g. naive recursive Fibonacci)
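To make a few of these growth rates concrete, the sketch below counts how many steps different approaches take on the same input. The function names and input sizes are chosen just for illustration; the counts are loop iterations, not timings.

```python
def linear_search_steps(items, target):
    """O(n): check items one by one until the target is found."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps

def binary_search_steps(sorted_items, target):
    """O(log n): halve the search range on every iteration (needs sorted input)."""
    steps = 0
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps

def pair_comparisons(items):
    """O(n^2): two nested loops compare every element with every other element."""
    comparisons = 0
    for i in range(len(items)):
        for j in range(len(items)):
            if i != j:
                comparisons += 1
    return comparisons

data = list(range(1024))                         # already sorted
linear_steps = linear_search_steps(data, 1023)   # worst case: target is last
binary_steps = binary_search_steps(data, 1023)   # roughly log2(1024) steps
quadratic_steps = pair_comparisons(list(range(10)))

print(linear_steps, binary_steps, quadratic_steps)
```

On 1024 sorted items the linear search needs 1024 steps for the last element, while the binary search needs about a dozen; the nested loops need n * (n - 1) comparisons even for a 10-item list.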
// constants won't change. They're used here to set pin numbers:
const int buttonPin = 2; // the number of the pushbutton pin
const int ledPin = 13; // the number of the LED pin
// variables will change:
int buttonState = 0; // variable for reading the pushbutton status
void setup() {
// initialize the LED pin as an output: