In this document, I'm using a data file containing 40 million records. The file is a text file with one record per line.
The following Scala code is run in a spark-shell:
val filename = "<path to the file>"
val file = sc.textFile(filename)
file.count()
This simply reads the file and counts the number of lines. The number of partitions and the time taken to read the file can be read in the Spark UI.
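The same information can also be checked directly from the shell. The snippet below is just a sketch: the timing includes job scheduling overhead, so it is only an approximation of what the UI reports.
// Number of partitions of the RDD
println(file.partitions.length)
// Rough timing of the count
val start = System.nanoTime
file.count()
println(s"count took ${(System.nanoTime - start) / 1e9} s")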
In the tables below, when a test says "Read from ... + repartition", the file is repartitioned before counting the lines. Repartitioning is required in real-life applications when the initial number of partitions is too low, as it ensures that all the cores available on the cluster are used. The code used in this case is the following:
val filename = "<path to the file>"
val file = sc.textFile(filename).repartition(460)
file.count()
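Note that repartition triggers a full shuffle of the data, which accounts for the extra time of the repartitioned runs in the tables below. A quick sanity check (just a sketch) confirms the resulting number of partitions:
println(file.partitions.length)   // should print 460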
Tests are run on a Spark cluster with 3 c4.4xlarge workers (16 vCPUs each, i.e. 48 cores in total).
In the following tables, we combine several test parameters:
- File is either uncompressed or compressed. We test 2 compression formats: GZ (very common, fast, but not splittable) and BZ2 (splittable but very CPU expensive).
- File is read from an EBS drive or from S3. S3 is what will be used in real life, but the EBS drive serves as a baseline to assess the performance of S3.
- With or without repartitioning. In a real use case, repartitioning is mandatory to achieve good parallelism when the initial partitioning is not adequate. This does not apply to uncompressed files as they already generate enough partitions. When repartitioning, I asked for 460 partitions as this is the number of partitions created when reading the uncompressed file (see the sketch after this list).
- Spark 2.0.1 vs Spark 1.6.0. I tested the latest version available with this client's Chef scripts (1.6.0) and the latest version available from Apache (2.0.1).
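As an illustration, here is roughly what the compressed "Read from S3 + repartition" variant looks like. The bucket and file names are placeholders, and the s3a:// scheme depends on which S3 connector the cluster is configured with (older setups may need s3n://).
// Hypothetical S3 path - GZ files are decompressed transparently through the Hadoop codec
val gzFile = sc.textFile("s3a://my-bucket/data/records.txt.gz")
println(gzFile.partitions.length)     // 1, since GZ is not splittable
// Shuffle into 460 partitions so that all 48 cores are used
gzFile.repartition(460).count()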
Spark 2.0.1:
# | Format | File size | Test | Time | # partitions |
---|---|---|---|---|---|
1 | Uncompressed | 14.4 GB | Read from EBS | 3 s | 460 |
2 | Uncompressed | 14.4 GB | Read from S3 | 13 s | 460 |
3 | GZ | 419.3 MB | Read from EBS | 47 s | 1 |
4 | GZ | 419.3 MB | Read from EBS + repartition | 1.5 min | 460 |
5 | GZ | 419.3 MB | Read from S3 | 44 s | 1 |
6 | GZ | 419.3 MB | Read from S3 + repartition | 1.4 min | 460 |
7 | BZ2 | 236.3 MB | Read from EBS | 55 s | 8 |
8 | BZ2 | 236.3 MB | Read from EBS + repartition | 1.2 min | 460 |
9 | BZ2 | 236.3 MB | Read from S3 | 1.1 min | 8 |
10 | BZ2 | 236.3 MB | Read from S3 + repartition | 1.5 min | 460 |
Spark 1.6.0:
# | Format | File size | Test | Time | # partitions |
---|---|---|---|---|---|
1 | Uncompressed | 14.4 GB | Read from EBS | 3 s | 460 |
2 | Uncompressed | 14.4 GB | Read from S3 | 19 s | 460 |
3 | GZ | 419.3 MB | Read from EBS | 44 s | 1 |
4 | GZ | 419.3 MB | Read from EBS + repartition | 1.7 min | 460 |
5 | GZ | 419.3 MB | Read from S3 | 46 s | 1 |
6 | GZ | 419.3 MB | Read from S3 + repartition | 1.8 min | 460 |
7 | BZ2 | 236.3 MB | Read from EBS | 53 s | 8 |
8 | BZ2 | 236.3 MB | Read from EBS + repartition | 1.2 min | 460 |
9 | BZ2 | 236.3 MB | Read from S3 | 57 s | 8 |
10 | BZ2 | 236.3 MB | Read from S3 + repartition | 1.1 min | 460 |
Spark version - Measurements are very similar between Spark 1.6 and Spark 2.0. This makes sense as this test uses RDDs, so neither Catalyst nor Tungsten can perform any optimizations.
EBS vs S3 - S3 is slower than the EBS drive (#1 vs #2). Performance of S3 is still very good, though, with a combined throughput of about 1.1 GB/s (14.4 GB read in 13 s). Also, keep in mind that EBS drives have drawbacks: data is not shared between servers (it has to be replicated manually) and IOPS can be throttled.
Compression - GZ files are not ideal because they are not splittable and therefore require repartitioning. BZ2 files suffer from a similar problem: although they are splittable, they are compressed so heavily that you get very few partitions (8, in this case, on a cluster with 48 cores). The other problem is that reading BZ2 files is slow compared to uncompressed files, since BZ2 decompression is very CPU expensive. In the end, uncompressed files clearly outperform compressed files: reading compressed files is CPU bound while reading uncompressed files is I/O bound, and I/O throughput is excellent here (the 14.4 GB file splits into 460 partitions of roughly 32 MB each, so all cores have work to do).
Given this, I recommend storing files on S3 uncompressed. This achieves great performance while providing safe storage.
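For reference, writing data back to S3 uncompressed is simply the default behaviour of saveAsTextFile (the path below is a placeholder); a compression codec class would have to be passed explicitly to produce GZ or BZ2 output.
// Hypothetical output path - no codec argument means plain, uncompressed text
file.saveAsTextFile("s3a://my-bucket/data/records-uncompressed")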