@coltfred
Created December 15, 2015 03:34

Ideally I'd be able to write this with a single pass over the data, but as far as I know that isn't possible:

def separate[A, B](r: RDD[A \/ B]): (RDD[A], RDD[B]) = ???
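
For reference, here is a minimal sketch of the two-pass version, assuming scalaz's `\/` and a cached input so the upstream lineage is not recomputed for each side. The `Separate` object name is hypothetical. This is the baseline I'd like to improve on, not a single-pass answer.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag
import scalaz.{\/, -\/, \/-}

object Separate {
  // Two passes over the (cached) input: one keeps the lefts, one keeps the rights.
  // The ClassTag bounds are required by RDD.collect(PartialFunction).
  def separate[A: ClassTag, B: ClassTag](r: RDD[A \/ B]): (RDD[A], RDD[B]) = {
    // cache() marks the RDD for persistence, so the second traversal
    // reads the materialized partitions instead of recomputing upstream stages.
    val cached = r.cache()
    val lefts  = cached.collect { case -\/(a) => a }
    val rights = cached.collect { case \/-(b) => b }
    (lefts, rights)
  }
}
```

Caching trades memory (or disk) for the second scan. The caller would still need to `unpersist` the input once both sides have been consumed.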

I'd settle for something like the following, where the As are dumped to a file and the Bs stay in the RDD. It's similar to observeW from scalaz-stream.

def observeLefts[A, B](r: RDD[A \/ B], filename: String): RDD[B] = ???

The best I can find is carrying it as a tuple to the end and then either using multiple outputs or manually handling the writers in a mutable map, both of which were discussed in this SO answer.
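
A rough sketch of the multiple-outputs variant, assuming Hadoop's old-API `MultipleTextOutputFormat` and Spark's `saveAsHadoopFile`. The names `KeyedOutputFormat`, `WriteBoth`, `writeBoth`, and the `lefts`/`rights` subdirectory names are all hypothetical, and the SO answer may differ in its details. It does write both sides in one pass, but only to files: the Bs do not survive as an RDD, so it falls short of `observeLefts`.

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.rdd.RDD
import scalaz.{\/, -\/, \/-}

// Routes each record into a subdirectory named after its key,
// and replaces the key with NullWritable so only the value is written.
class KeyedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}

object WriteBoth {
  // Tags every element with the side it came from, then lets the
  // output format split the records by that tag while writing.
  def writeBoth[A, B](r: RDD[A \/ B], dir: String): Unit =
    r.map {
      case -\/(a) => ("lefts", a.toString)
      case \/-(b) => ("rights", b.toString)
    }.saveAsHadoopFile(dir, classOf[String], classOf[String], classOf[KeyedOutputFormat])
}
```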

Can I do better?
