David Portabella dportabella

  • Lausanne, Switzerland
dportabella / ExampleExecuteScalaFuturesInSerial.scala
Last active June 1, 2023 20:40
Explanation of how to execute Scala futures in serial, one after the other
/*
Execute Scala futures in serial, one after the other.
This gist explains the solution given in
http://www.michaelpollmeier.com/execute-scala-futures-in-serial-one-after-the-other-non-blocking
The three examples produce the same result:
---
done: 10
done: 20
done: 30
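A minimal sketch of the approach this gist describes, not the gist's own code: `foldLeft` chains each future onto the previous one with `flatMap`, so a task starts only after the one before it has completed, and the results are accumulated in input order. The names `work` and `runSerially` and the inputs 1, 2, 3 are hypothetical and chosen only for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SerialFuturesSketch {
  // Hypothetical task: multiplies its input by 10 and reports when it is done.
  def work(i: Int): Future[Int] = Future {
    val result = i * 10
    println(s"done: $result")
    result
  }

  // f(item) is only called inside flatMap, i.e. after the previous future
  // has completed, so the futures run one after the other.
  def runSerially[A, B](items: Seq[A])(f: A => Future[B]): Future[Vector[B]] =
    items.foldLeft(Future.successful(Vector.empty[B])) { (acc, item) =>
      acc.flatMap(results => f(item).map(results :+ _))
    }

  def main(args: Array[String]): Unit = {
    val all = runSerially(Seq(1, 2, 3))(work)
    // Blocking here is only for the demo; the chain itself is non-blocking.
    println(Await.result(all, 10.seconds))
  }
}
```

The key point is that the futures must be created lazily inside the fold. Building a `Seq[Future[Int]]` up front would start all of them immediately, and sequencing them afterwards would only order the results, not the execution.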
dportabella / ExampleExecuteScalaFuturesInSerial.scala
Created September 13, 2016 20:16
Example of how to execute Scala futures in serial, one after the other, without collecting the results of the futures
/*
Example of how to execute Scala futures in serial, one after the other, without collecting the results of the futures.
See this gist instead if you need to collect the results of the futures (it also explains how foldLeft works here):
https://gist.github.com/dportabella/4e7569643ad693433ec6b86968f589b8
*/
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
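When the results are not needed, the fold can carry a plain `Future[Unit]` instead of an accumulator. The following is a sketch under that assumption, not the gist's actual code; `task` and `runInOrder` are hypothetical names.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SerialUnitFuturesSketch {
  // Hypothetical side-effecting task that produces no useful result.
  def task(name: String): Future[Unit] = Future {
    println(s"finished $name")
  }

  // Each step waits for the previous future and discards its (Unit) value,
  // so nothing is accumulated along the way.
  def runInOrder[A](items: Seq[A])(f: A => Future[Unit]): Future[Unit] =
    items.foldLeft(Future.successful(())) { (previous, item) =>
      previous.flatMap(_ => f(item))
    }

  def main(args: Array[String]): Unit = {
    val done = runInOrder(Seq("a", "b", "c"))(task)
    Await.result(done, 10.seconds) // blocking only for the demo
  }
}
```

If any task fails, the `flatMap` chain short-circuits: the remaining tasks are never started and the returned future carries the failure.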
dportabella / deserialize_hadoop_sequence_file.scala
Last active November 8, 2016 21:42
How to deserialize a Hadoop result sequence file outside Hadoop (or the output of a Spark saveAsObjectFile outside Spark)
// libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3"
import java.io.{ByteArrayInputStream, ObjectInputStream}
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
val f = "/path/to/part-00000"
val reader = new SequenceFile.Reader(new Configuration(), SequenceFile.Reader.file(new Path(f)))
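The preview stops before the values are decoded. Assuming each value read from the sequence file holds a Java-serialized object (the `ByteArrayInputStream` and `ObjectInputStream` imports suggest this), the decoding step can be sketched with the JDK alone. The helper names `deserialize` and `serialize` are hypothetical, and `serialize` exists only to make the demo self-contained.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object JavaDeserializeSketch {
  // Turn bytes produced by Java serialization back into an object.
  // The cast is unchecked: the caller must know the expected type.
  def deserialize[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T]
    finally in.close()
  }

  // Demo-only helper that stands in for the bytes stored in the file.
  def serialize(obj: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj)
    finally out.close()
    buffer.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val bytes = serialize(Array("a", "b", "c"))
    val restored = deserialize[Array[String]](bytes)
    println(restored.mkString(", "))
  }
}
```

When the bytes come from a Hadoop `BytesWritable`, note that the array returned by `getBytes` may be longer than the valid data; `copyBytes` or `getLength` gives the exact extent. Deserializing this way also requires the serialized classes to be on the classpath.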
dportabella / RunTestOnMultipleGithubRepos
Created November 8, 2016 21:13
An example Scala script that runs a test on all GitHub projects with a given name, including their forks and branches (requires Ammonite: brew install ammonite-repl)
#!/usr/bin/env amm
/* To run this script:
* $ chmod +x ./RunTestOnMultipleGithubRepos
* $ ./RunTestOnMultipleGithubRepos
*/
import ammonite.ops._
import scalaj.http._
import $ivy.`org.eclipse.jgit:org.eclipse.jgit:4.5.0.201609210915-r`, org.eclipse.jgit.api.Git
dportabella / DeserializeHadoopSequenceFileWithoutClassDeclaration.scala
Last active November 8, 2016 22:32
How to deserialize a Hadoop result sequence file outside Hadoop (or the output of a Spark saveAsObjectFile outside Spark) without having the class declaration
// resolvers += "dportabella-3rd-party-mvn-repo-releases" at "https://github.com/dportabella/3rd-party-mvn-repo/raw/master/releases/"
// libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3"
// libraryDependencies += "com.github.dportabella.3rd-party-mvn-repo" % "jdeserialize" % "1.0.0"
import java.io._
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
import org.unsynchronized.jdeserialize
dportabella / FilterArchive.scala
Created February 8, 2017 09:39
Example of filtering a WARC archive using Spark and storing the result back in a WARC archive
package application
import java.io._
import java.util
import org.apache.spark.rdd.RDD
import org.archive.format.warc.WARCConstants.WARCRecordType
import org.archive.io.warc.WARCRecordInfo
import org.warcbase.spark.archive.io.ArchiveRecord
import org.warcbase.spark.matchbox.RecordLoader
dportabella / dist.scala
Last active March 24, 2017 19:14
Compute the distance in km between two postal codes
// using build.sbt: libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
// using Ammonite: import $ivy.`org.apache.sis.core:sis-referencing:0.7`, org.apache.sis.distance.DistanceUtils
case class Coordinates(lat: Double, lon: Double)
def readCoordinates(file: String): Map[String, Coordinates] = {
  def parseLine(line: String): (String, Coordinates) = {
    val c = line.split("\t")
    (c(0) + "-" + c(1), Coordinates(c(9).toDouble, c(10).toDouble))
  }
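The preview ends before the distance is computed; the header comment points at Apache SIS `DistanceUtils` for that step. As a dependency-free illustration, the sketch below swaps in the haversine formula, which is not necessarily what the gist uses. It assumes a spherical Earth with a mean radius of 6371 km; `distanceKm` is a hypothetical name and the sample coordinates are approximate.

```scala
object HaversineSketch {
  final case class Coordinates(lat: Double, lon: Double)

  // Assumption: mean Earth radius in km; a spherical model, not an ellipsoid.
  val EarthRadiusKm = 6371.0

  // Great-circle distance between two points given in decimal degrees.
  def distanceKm(a: Coordinates, b: Coordinates): Double = {
    val dLat = math.toRadians(b.lat - a.lat)
    val dLon = math.toRadians(b.lon - a.lon)
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(a.lat)) * math.cos(math.toRadians(b.lat)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusKm * math.asin(math.sqrt(h))
  }

  def main(args: Array[String]): Unit = {
    // Approximate, illustrative coordinates only.
    val lausanne = Coordinates(46.52, 6.63)
    val zurich = Coordinates(47.37, 8.54)
    println(f"${distanceKm(lausanne, zurich)}%.1f km")
  }
}
```

With a `Map[String, Coordinates]` such as the one `readCoordinates` returns, the distance between two postal codes would be a lookup of both keys followed by one call to `distanceKm`.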