David Portabella (dportabella)
Lausanne, Switzerland
@dportabella
dportabella / dist.scala
Last active March 24, 2017 19:14
Compute the distance in km between two postal codes
// using build.sbt: libraryDependencies += "org.apache.sis.core" % "sis-referencing" % "0.7"
// using Ammonite: import $ivy.`org.apache.sis.core:sis-referencing:0.7`, org.apache.sis.distance.DistanceUtils
case class Coordinates(lat: Double, lon: Double)

def readCoordinates(file: String): Map[String, Coordinates] = {
  def parseLine(line: String): (String, Coordinates) = {
    val c = line.split("\t")
    // key is "countryCode-postalCode"; columns 9 and 10 hold latitude and longitude
    (c(0) + "-" + c(1), Coordinates(c(9).toDouble, c(10).toDouble))
  }
  scala.io.Source.fromFile(file).getLines().map(parseLine).toMap  // completes the truncated preview
}
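The preview cuts off before the distance computation itself; a minimal sketch of how the pieces might fit together, assuming a GeoNames-style postal-code dump as input (the file path and postal codes below are placeholders):

val coords = readCoordinates("/path/to/allCountries.txt")  // GeoNames postal-code dump (assumption)
val a = coords("CH-1015")
val b = coords("CH-8001")
// great-circle (haversine) distance in km between the two coordinates
val km = org.apache.sis.distance.DistanceUtils.getHaversineDistance(a.lat, a.lon, b.lat, b.lon)
println(f"distance: $km%.1f km")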
@dportabella
dportabella / FilterArchive.scala
Created February 8, 2017 09:39
Example of filtering a WARC archive using Spark and storing the result back to a WARC archive
package application
import java.io._
import java.util
import org.apache.spark.rdd.RDD
import org.archive.format.warc.WARCConstants.WARCRecordType
import org.archive.io.warc.WARCRecordInfo
import org.warcbase.spark.archive.io.ArchiveRecord
import org.warcbase.spark.matchbox.RecordLoader
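The preview stops at the imports; a minimal sketch of the loading-and-filtering step with warcbase (the input path and the URL predicate are placeholder assumptions; writing the result back via WARCRecordInfo is omitted here):

object FilterArchive {
  def filtered(sc: org.apache.spark.SparkContext): RDD[ArchiveRecord] =
    RecordLoader.loadArchives("/path/to/input.warc.gz", sc)  // path is a placeholder
      .filter(r => r.getUrl.contains("example.org"))         // keep only matching records
}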
@dportabella
dportabella / DeserializeHadoopSequenceFileWithoutClassDeclaration.scala
Last active November 8, 2016 22:32
How to deserialize a Hadoop result sequence file outside Hadoop (or a Spark saveAsObjectFile outside Spark) without having the class declaration
// resolvers += "dportabella-3rd-party-mvn-repo-releases" at "https://github.com/dportabella/3rd-party-mvn-repo/raw/master/releases/"
// libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3"
// libraryDependencies += "com.github.dportabella.3rd-party-mvn-repo" % "jdeserialize" % "1.0.0"
import java.io._
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
import org.unsynchronized.jdeserialize
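The preview ends at the imports; a minimal sketch of the idea, assuming the file was written by Spark's saveAsObjectFile (NullWritable keys, BytesWritable values holding Java-serialized data) and that jdeserialize's command-line entry point is used to dump each record:

val reader = new SequenceFile.Reader(new Configuration(), SequenceFile.Reader.file(new Path("/path/to/part-00000")))
val key = NullWritable.get()
val value = new BytesWritable()
while (reader.next(key, value)) {
  // write the raw serialization stream to a temp file and let jdeserialize
  // reconstruct the class structure without the original class declaration
  val tmp = File.createTempFile("record-", ".bin")
  val out = new FileOutputStream(tmp)
  out.write(value.getBytes, 0, value.getLength)
  out.close()
  jdeserialize.main(Array(tmp.getPath))
}
reader.close()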
@dportabella
dportabella / RunTestOnMultipleGithubRepos
Created November 8, 2016 21:13
An example Scala script that runs a test on all GitHub projects with a given name, including their forks and branches (requires Ammonite: brew install ammonite-repl)
#!/usr/bin/env amm
/* To run this script:
* $ chmod +x ./RunTestOnMultipleGithubRepos
* $ ./RunTestOnMultipleGithubRepos
*/
import ammonite.ops._
import scalaj.http._
import $ivy.`org.eclipse.jgit:org.eclipse.jgit:4.5.0.201609210915-r`, org.eclipse.jgit.api.Git
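A minimal sketch of the per-repository step; the clone URL, working directory, and `sbt test` as the test command are placeholder assumptions:

// clone one repo with JGit, then run the test command inside the clone via ammonite.ops
val url = "https://github.com/someuser/somerepo.git"  // placeholder
val dir = pwd/"workspace"/"somerepo"
Git.cloneRepository().setURI(url).setDirectory(dir.toIO).call()
implicit val wd = dir
%('sbt, 'test)  // shell out with the clone as the working directory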
@dportabella
dportabella / deserialize_hadoop_sequence_file.scala
Last active November 8, 2016 21:42
How to deserialize a Hadoop result sequence file outside Hadoop (or a Spark saveAsObjectFile outside Spark)
// libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3"
import java.io.{ByteArrayInputStream, ObjectInputStream}
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
val f = "/path/to/part-00000"
val reader = new SequenceFile.Reader(new Configuration(), SequenceFile.Reader.file(new Path(f)))
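The preview ends at the reader; a minimal sketch of the record loop, relying on the imports above and assuming the saveAsObjectFile layout (NullWritable keys, BytesWritable values) with the original classes on the classpath:

val key = NullWritable.get()
val value = new BytesWritable()
while (reader.next(key, value)) {
  // each value holds a Java-serialized batch of objects written by saveAsObjectFile
  val in = new ObjectInputStream(new ByteArrayInputStream(value.copyBytes()))
  println(in.readObject())  // needs the original class declarations on the classpath
}
reader.close()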
@dportabella
dportabella / ExampleExecuteScalaFuturesInSerial.scala
Created September 13, 2016 20:16
Example of how to execute Scala futures serially, one after the other, without collecting their results
/*
Example of how to execute Scala futures serially, one after the other, without collecting their results.
See this gist instead if you need to collect the results of the futures (it also explains how foldLeft works here):
https://gist.github.com/dportabella/4e7569643ad693433ec6b86968f589b8
*/
*/
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
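A minimal sketch of the foldLeft-chaining technique the gist demonstrates (the task body is a placeholder; the inputs match the "done: 10/20/30" output shown in the sibling gist):

import scala.concurrent.{Await, Future}

def task(n: Int): Future[Unit] = Future { Thread.sleep(100); println(s"done: $n") }

// chain the futures so each one starts only after the previous has completed
val serial = List(10, 20, 30).foldLeft(Future.successful(())) { (acc, n) =>
  acc.flatMap(_ => task(n))
}
Await.result(serial, Duration.Inf)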
@dportabella
dportabella / ExampleExecuteScalaFuturesInSerial.scala
Last active June 1, 2023 20:40
Explanation of how to execute Scala futures serially, one after the other
/*
Execute Scala futures serially, one after the other.
This gist explains the solution given in
http://www.michaelpollmeier.com/execute-scala-futures-in-serial-one-after-the-other-non-blocking
The three examples produce the same result:
---
done: 10
done: 20
done: 30
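A minimal sketch of a generic helper in the spirit of the linked post, serializing the futures while collecting their results (the name runInSerial is an assumption):

import scala.concurrent.{ExecutionContext, Future}

def runInSerial[A, B](items: Seq[A])(f: A => Future[B])(implicit ec: ExecutionContext): Future[Seq[B]] =
  items.foldLeft(Future.successful(Seq.empty[B])) { (acc, item) =>
    // flatMap guarantees f(item) is not started until acc has completed
    for { results <- acc; r <- f(item) } yield results :+ r
  }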
@dportabella
dportabella / build.sbt
Last active May 31, 2016 02:07
sbt project for the Spark distribution examples
val sparkVersion = "1.6.1"
val hbaseVersion = "0.98.7-hadoop2"

name := "spark-examples"

version := sparkVersion

javacOptions ++= Seq("-source", "1.8", "-target", "1.8", "-Xlint")

initialize := {
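The preview cuts off inside initialize; a sketch of the dependency block such a build typically carries, reusing the versions declared above (the exact module list is an assumption):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"    % sparkVersion % "provided",
  "org.apache.hbase" %  "hbase-client" % hbaseVersion
)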
@dportabella
dportabella / PomDependenciesToSbt
Last active May 7, 2022 16:25
Script to convert Maven dependencies (and exclusions) from a pom.xml to sbt dependencies. You can also run it online at http://goo.gl/wnHCjE
#!/usr/bin/env amm
// This script converts Maven dependencies from a pom.xml to sbt dependencies.
// It is based on the answers of George Pligor and Mike Slinn on http://stackoverflow.com/questions/15430346/
// - install https://github.com/lihaoyi/Ammonite
// - make this script executable: chmod +x PomDependenciesToSbt
// - run it from your shell (e.g. bash):
//   $ ./PomDependenciesToSbt /path/to/pom.xml
import scala.xml._
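A minimal sketch of the core conversion using the standard POM element names (exclusions and the real script's output formatting are omitted):

val pom = XML.loadFile("/path/to/pom.xml")
val deps = (pom \ "dependencies" \ "dependency").map { d =>
  val g = (d \ "groupId").text
  val a = (d \ "artifactId").text
  val v = (d \ "version").text
  s""""$g" % "$a" % "$v""""   // one sbt dependency per POM <dependency> element
}
println(deps.mkString("libraryDependencies ++= Seq(\n  ", ",\n  ", "\n)"))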
@dportabella
dportabella / ExampleScalaAck.scala
Created August 22, 2014 22:46
This example Scala script applies a regex to all files recursively. It uses Apache Tika's UniversalEncodingDetector to process only text files, and a regex that finds all lines containing the word "super", except where it is part of a larger word such as "superstition" or "supernatural".
import java.io.File
import org.apache.tika.detect._
import org.apache.tika.metadata._
import org.apache.tika.mime._
import org.apache.tika.io._
import org.apache.tika.parser.txt._
import resource._
def recursiveListFiles(f: File): List[File] = {
  val these = f.listFiles.toList
  // completes the truncated preview: recurse into subdirectories
  these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
}
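A minimal sketch of the matching step described above (the root directory is a placeholder; the Tika text-file check from the imports is omitted here):

val Word = """(?<![A-Za-z])super(?![A-Za-z])""".r  // "super" not inside a larger word
for {
  file <- recursiveListFiles(new File("/path/to/root")) if file.isFile
  line <- scala.io.Source.fromFile(file).getLines()
  if Word.findFirstIn(line).isDefined
} println(s"$file: $line")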