@victorkurauchi
Created October 11, 2017 16:14
Using a Python lambda to transform a dataset
#!/usr/bin/env python
# -*- coding: utf-8 -*-
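import re

# `sc` is assumed to be the SparkContext that the pyspark shell provides.
dataset = sc.textFile("file:///home/cloudera/massa_de_exemplo_wol.txt")
categories = []
regex = r"(\/[^, z]*\.[a-z]\w)+"
# Take the first path segment of the first regex match on each line.
# Assumes every line matches; re.search returns None otherwise.
result = dataset.map(lambda line: re.search(regex, line).group(0).split('/')[1])
# print result
# An RDD cannot be iterated directly on the driver; collect() returns a list.
for category in result.collect():
    if category not in categories:
        categories.append(category)
# print categories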
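
The extraction and de-duplication steps can be tried locally without Spark. The following is a minimal sketch under assumptions: the sample lines are made up (the layout of the real input file is not shown here), and the helper names `first_segment` and `distinct_categories` are hypothetical. It uses the same regex as the script and skips lines with no match instead of raising an error.

```python
import re

REGEX = r"(\/[^, z]*\.[a-z]\w)+"  # same pattern as in the script above


def first_segment(line):
    """Return the first path segment of the first match, or None if nothing matches."""
    match = re.search(REGEX, line)
    if match is None:
        return None
    # group(0) starts with "/", so index 0 of the split is an empty string.
    return match.group(0).split("/")[1]


def distinct_categories(lines):
    """Collect distinct categories, preserving first-seen order."""
    categories = []
    for line in lines:
        category = first_segment(line)
        if category is not None and category not in categories:
            categories.append(category)
    return categories


# Hypothetical sample lines, for illustration only.
sample = [
    "10.0.0.1, GET /esporte/futebol/index.html, 200",
    "10.0.0.2, GET /noticias/brasil/capa.html, 200",
    "10.0.0.3, GET /esporte/tenis/ranking.html, 200",
    "a line without any path",
]

print(distinct_categories(sample))
```

In Spark itself, `result.distinct().collect()` would do the de-duplication on the cluster rather than on the driver, although the order of the returned values is then not guaranteed.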