@dipanjanS
Created April 6, 2019 22:34
from pyspark.sql.functions import regexp_extract

# Parse the raw 'value' column of each log line into structured fields.
logs_df = base_df.select(regexp_extract('value', host_pattern, 1).alias('host'),
                         regexp_extract('value', ts_pattern, 1).alias('timestamp'),
                         regexp_extract('value', method_uri_protocol_pattern, 1).alias('method'),
                         regexp_extract('value', method_uri_protocol_pattern, 2).alias('endpoint'),
                         regexp_extract('value', method_uri_protocol_pattern, 3).alias('protocol'),
                         regexp_extract('value', status_pattern, 1).cast('integer').alias('status'),
                         regexp_extract('value', content_size_pattern, 1).cast('integer').alias('content_size'))

# Preview the parsed DataFrame and print its shape as (row count, column count).
logs_df.show(10, truncate=True)
print((logs_df.count(), len(logs_df.columns)))
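The regular-expression patterns referenced above (host_pattern, ts_pattern, method_uri_protocol_pattern, status_pattern, content_size_pattern) are defined elsewhere in the accompanying notebook. Below is a minimal sketch of what such definitions might look like for Apache Common Log Format data; the exact expressions are assumptions for illustration, not the gist's original values.

# Assumed pattern definitions (not part of this gist) for Common Log Format
# lines such as:
#   199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
host_pattern = r'(^\S+\.[\S+\.]+\S+)\s'                           # remote host at the start of the line
ts_pattern = r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]'   # bracketed timestamp
method_uri_protocol_pattern = r'\"(\S+)\s(\S+)\s*(\S*)\"'         # "METHOD /endpoint PROTOCOL"
status_pattern = r'\s(\d{3})\s'                                   # three-digit HTTP status code
content_size_pattern = r'\s(\d+)$'                                # content size at the end of the line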