# Install IPython
> sudo easy_install ipython==1.2.1
# Launch pyspark with IPython
> PYSPARK_DRIVER_PYTHON=ipython pyspark
# Check the Spark version
In [1]: sc.version
# Example Output: u'1.3.0'
# RDD in PySpark
In [2]: integer_RDD = sc.parallelize(range(10), 3)
# Check partitions
# 1.
# Gather all data on the driver:
In [3]: integer_RDD.collect()
# Example Output: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# 2.
# Maintain the splitting into partitions
In [4]: integer_RDD.glom().collect()
# Example Output: [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
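The partition boundaries above can be sketched in plain Python. This is a simulation, not Spark's actual code: the slicing formula is an assumption, and the exact boundaries Spark produces can vary between versions.

```python
def split_into_partitions(data, num_partitions):
    # One plausible slicing scheme: partition i gets the slice from
    # i*n//p to (i+1)*n//p, so sizes differ by at most one element.
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

print(split_into_partitions(list(range(10)), 3))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

With this scheme, 10 elements in 3 partitions come out as sizes 3, 3, and 4, matching the `glom()` output shown above.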
# Read text files
# 1.
# From the local filesystem:
In [5]: text_RDD = sc.textFile("file:///home/cloudera/testfile2")
# 2.
# From HDFS:
In [6]: text_RDD = sc.textFile("/user/cloudera/input/testfile1")
> text_RDD.take(1)  # outputs the first line
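`textFile()` gives an RDD with one element per line of the file, so `take(1)` returns a list holding the first line. A plain-Python stand-in (the file and its contents here are made up for illustration):

```python
import os
import tempfile

# Write a small throwaway file to read back, standing in for
# /user/cloudera/input/testfile1.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("first line\nsecond line\n")
    path = f.name

# textFile() splits on newlines; model that as a list of lines.
with open(path) as f:
    lines = [line.rstrip("\n") for line in f]

print(lines[:1])  # like take(1): a list with the first line
os.remove(path)
```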
# coalesce
In [7]: sc.parallelize(range(10), 4).glom().collect()
# Example Output: [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9]]
In [8]: sc.parallelize(range(10), 4).coalesce(2).glom().collect()
# Example Output: [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
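`coalesce()` reduces the partition count by combining existing partitions rather than doing a full shuffle. The grouping below is a plain-Python sketch of that idea; the exact way Spark groups partitions is an assumption here, not its actual implementation.

```python
def coalesce_partitions(partitions, target):
    # Merge runs of adjacent partitions into `target` groups and
    # flatten each group, mimicking a shuffle-free coalesce.
    n = len(partitions)
    merged = []
    for i in range(target):
        start = i * n // target
        end = (i + 1) * n // target
        merged.append([x for part in partitions[start:end] for x in part])
    return merged

parts = [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9]]
print(coalesce_partitions(parts, 2))
# [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
```

Merging adjacent partitions is why coalesce is cheaper than `repartition()`: elements stay where they are and only partition boundaries move.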
Hi Dimitar,
This question is related to the code: advanced-join-in-spark.py
I have a question about the Advanced Join in Spark assignment (Coursera). I am using Python 3, and the automated grader is marking my answer wrong. I suspect I am missing something in the output format. I am running my code on a Windows machine on a standalone Spark cluster (version 2.2).
Here is the sample output I am getting:
[('CNO', 100), ('XYZ', 100), ('BOB', 100), ('ABC', 100), ('MAN', 100), ('DEF', 100), ('CAB', 100), ('NOX', 100), ('BAT', 100)]
Any idea what is wrong here?
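One possibility worth checking (this is a guess, not a confirmed diagnosis): `collect()` makes no ordering guarantee, so two correct runs can emit the same pairs in different orders, and a grader that compares output text will reject one of them. A sketch of making the order deterministic before printing, using the sample pairs above (`results` is a hypothetical stand-in for the real `collect()` call):

```python
# Stand-in for the list returned by the join's collect(); values are
# copied from the sample output above.
results = [('CNO', 100), ('XYZ', 100), ('BOB', 100), ('ABC', 100),
           ('MAN', 100), ('DEF', 100), ('CAB', 100), ('NOX', 100),
           ('BAT', 100)]

# Sorting by key gives the same output on every run.
for key, value in sorted(results):
    print(key, value)
```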