gwenshap / gist:3997968
Created November 2, 2012 00:57
Split a file using AWK
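# Split ../catalog_caption.unl into files of 300,000 records each (cc_unl_1, cc_unl_2, ...); records are separated by tab-tab-newline, fields by tabs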
nohup awk 'BEGIN{RS="\t\t\n"; FS="\t";}NR%300000==1{x="cc_unl_"++i;}{print > x}' ../catalog_caption.unl &
gwenshap / gist:4022222
Created November 6, 2012 02:41
Example of AWK and Shell quote weirdness
# Create a comma-separated list of single-quoted file names:
ls -l /u01/ORACLE/work/cat_caption_parts | awk '{printf ("'"'"'%s'"'"',",$9)}'
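# Each '"'"' sequence closes the single-quoted awk program, emits a literal single quote from inside double quotes, and reopens the single quote,
# so awk actually runs printf ("'%s',",$9) and the output looks like: 'file1','file2','file3',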
gwenshap / gist:4121450
Created November 20, 2012 21:50
Column Statistics
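-- For each table column: number of distinct values, estimated rows per value ("Card"), density, index column position,
-- predicate usage counters, nulls, histogram buckets, days since last analyze, and details of any index on the column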
select /*+ LEADING(tab) */ col.column_name, num_distinct as "Distinct",
       decode(num_distinct, 0, null,
              round((tab.num_rows - col.num_nulls) / col.num_distinct,
                    case when (tab.num_rows - col.num_nulls) / col.num_distinct < 50 then 1 else 0 end)) as "Card",
       ceil((tab.num_rows - col.num_nulls) * col.density) as "Dens",
       indc.column_position || nvl2(indc.index_name, chr(64 + dense_rank() over(order by index_name)), '`') as "i",
       nullif(cusg.equality_preds,0) "eq",
       nullif(cusg.equijoin_preds,0)||'/'|| nullif(cusg.nonequijoin_preds,0) "join",
       nullif(cusg.range_preds,0)||'/'|| nullif(cusg.like_preds,0)||'/'|| nullif(cusg.null_preds,0) "r/l/n",
       num_nulls as "nulls", num_buckets as "Hist",
       round(sysdate - col.last_analyzed, 1) as "old", avg_col_len as "len",
       indc.index_name,
       case when indc.type# = 2 then 'B'
            when bitand(indc.property, 1) = 1 then 'U'
            when bitand(indc.flags, 32) = 32 then 'C' || indc.prefix_length END as "F",
       round(indc.leaf_blocks*8/1024,1) as "Mb", indc.distinct_keys as "Keys",
       indc.AVG_DATA_BLOCKS_PER_KE
gwenshap / gist:4121628
Created November 20, 2012 22:17
track PGA
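-- Summarize PGA memory by snapshot time (dt), schema, machine and module from the track_pga snapshot table:
-- MB used and allocated, session counts (including sessions logged on within the last 30 minutes),
-- each group's share of memory and sessions, and growth in allocated memory between snapshots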
select * from (
select grouping_id(module) g,dt, schemaname,machine,decode(grouping_id(module),1,'{------}',module) module, round(sum(pga_used_mem)/1024/1024) as mb_used, round(sum(pga_alloc_mem)/1024/1024) as mb_alloc,
count(*) CNT,count(case when logon_time > dt - interval '30' minute then 1 end) cnt_30min,
round(ratio_to_report(sum(pga_alloc_mem)) over(partition by dt,grouping_id(machine,module))*100) "%mem",
round(ratio_to_report(count(*)) over(partition by dt,grouping_id(machine,module))*100) "%cnt", round(sum(diff_alloc)/1024/1024) MB_grow,
count(*)-lag(count(*)) over (partition by schemaname,machine,module order by dt) ses_added
from (
select pga_alloc_mem-lag(pga_alloc_mem) over (partition by spid,sid,serial# order by dt) diff_alloc,
t.* from track_pga t
)
gwenshap / gist:4122337
Created November 21, 2012 00:50
unshared cursors
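-- Unpivot the reason-flag columns of v$sql_shared_cursor to count child cursors per sql_id and non-sharing reason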
select * from
(select sql_id, nonshared_reason, count(*) from v$sql_shared_cursor
unpivot
(nonshared_value for nonshared_reason in (
UNBOUND_CURSOR as 'UNBOUND_CURSOR',
SQL_TYPE_MISMATCH as 'SQL_TYPE_MISMATCH',
OPTIMIZER_MISMATCH as 'OPTIMIZER_MISMATCH',
OUTLINE_MISMATCH as 'OUTLINE_MISMATCH',
STATS_ROW_MISMATCH as 'STATS_ROW_MISMATCH',
LITERAL_MISMATCH as 'LITERAL_MISMATCH',
gwenshap / gist:7044525
Created October 18, 2013 17:01
Move CM Database to MySQL
These are the steps to follow in order to migrate the SCM database from PostgreSQL to MySQL - let me know if you need assistance or have any further questions.
1. Stop all cluster services, including the Cloudera Manager monitoring services and Agents on all hosts, using the Cloudera Manager UI.
2. Invoke the API call to /api/v4/cm/deployment. For example, this directs the resulting output to a file for later use:
$ curl -v -u admin:password http://your_cloudera_manager_host:7180/api/v4/cm/deployment > path/to/file
Note: It is critical that this file be retained and protected, as it will be needed to re-import the functional configuration into the new database.
3. Take a fail-safe backup of the embedded PostgreSQL database, or make a tarball (or other copy) of the /var/lib/cloudera-scm-server-db/ directory and store it in a safe location. Either will accomplish the needed outcome; pick whichever is most convenient for your environment. Also back up the contents of the directory /etc/cloudera-scm-server/*
4. On the serv
gwenshap / gist:7226124
Created October 30, 2013 02:06
POM for hadoop project with multiple components
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.hadooprecipes.recipes</groupId>
  <artifactId>hadoop-recipes</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>Samples</name>
gwenshap / gist:7226166
Created October 30, 2013 02:11
pom for geo-spatial project
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>com.foobar.tps.spatial-bar</groupId>
    <artifactId>spatial-bar</artifactId>
    <version>1.0.13-SNAPSHOT</version>
  </parent>
gwenshap / cleancsv.py
Created March 13, 2014 17:43
Python for cleaning CSVs from in-record newlines
#!/usr/bin/python
import csv
import sys
import argparse
import io
csv.field_size_limit(sys.maxsize)
parser = argparse.ArgumentParser(description='Clean csv of in-line newlines')
parser.add_argument('infile', help='Path to input CSV file')
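The preview stops after the argument parsing. Below is a minimal sketch of the general approach, assuming Python 3 and that the goal is to collapse newlines inside quoted fields into spaces; the clean_file function and the output-path argument are illustrative, not part of the original script.
#!/usr/bin/env python3
# Sketch only: read a CSV whose quoted fields may contain embedded newlines
# and write it back out with those newlines replaced by spaces.
import csv
import sys

csv.field_size_limit(sys.maxsize)

def clean_file(in_path, out_path):
    with open(in_path, newline='') as fin, open(out_path, 'w', newline='') as fout:
        reader = csv.reader(fin)   # the csv module keeps newlines inside quoted fields within one logical row
        writer = csv.writer(fout)
        for row in reader:
            # collapse in-record newlines so every output record is a single physical line
            writer.writerow([field.replace('\r', ' ').replace('\n', ' ') for field in row])

if __name__ == '__main__':
    clean_file(sys.argv[1], sys.argv[2])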
gwenshap / gist:9699072
Created March 22, 2014 00:03
Using Hive to Efficiently Merge Data
-- create temporary table for merging
CREATE EXTERNAL TABLE staging.ItemFulfillment_merged (
  ItemFulfillment_b13aFilingOptionFedEx STRING,
  ItemFulfillment_b13aStatementDataFedEx STRING,
  ItemFulfillment_backupEmailAddressFedEx STRING,
  ItemFulfillment_backupEmailAddressUps STRING,
  ItemFulfillment_internalId STRING,
  ItemFulfillment_itemList_item_item_internalId STRING,