gwenshap / gist:3997968
Created November 2, 2012 00:57
Split a file using AWK
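# Split ../catalog_caption.unl into files of 300,000 records each (cc_unl_1, cc_unl_2, ...); records are separated by tab-tab-newline, fields by tabs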
nohup awk 'BEGIN{RS="\t\t\n"; FS="\t";}NR%300000==1{x="cc_unl_"++i;}{print > x}' ../catalog_caption.unl &
gwenshap / gist:4022222
Created November 6, 2012 02:41
Example of AWK and Shell quote weirdness
# Create a comma-separated list of single-quoted file names:
ls -l /u01/ORACLE/work/cat_caption_parts | awk '{printf ("'"'"'%s'"'"',",$9)}'
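# Each '"'"' sequence closes the single-quoted awk program, emits a literal single quote from inside double quotes, and reopens the single quote,
# so awk actually runs printf ("'%s',",$9) and the output looks like: 'file1','file2','file3',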
gwenshap / gist:4121450
Created November 20, 2012 21:50
Column Statistics
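-- For each table column: number of distinct values, estimated rows per value ("Card"), density, index column position,
-- predicate usage counters, nulls, histogram buckets, days since last analyze, and details of any index on the column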
select /*+ LEADING(tab) */ col.column_name, num_distinct as "Distinct",
       decode(num_distinct, 0, null,
              round((tab.num_rows - col.num_nulls) / col.num_distinct,
                    case when (tab.num_rows - col.num_nulls) / col.num_distinct < 50 then 1 else 0 end)) as "Card",
       ceil((tab.num_rows - col.num_nulls) * col.density) as "Dens",
       indc.column_position || nvl2(indc.index_name, chr(64 + dense_rank() over(order by index_name)), '`') as "i",
       nullif(cusg.equality_preds,0) "eq",
       nullif(cusg.equijoin_preds,0)||'/'|| nullif(cusg.nonequijoin_preds,0) "join",
       nullif(cusg.range_preds,0)||'/'|| nullif(cusg.like_preds,0)||'/'|| nullif(cusg.null_preds,0) "r/l/n",
       num_nulls as "nulls", num_buckets as "Hist",
       round(sysdate - col.last_analyzed, 1) as "old", avg_col_len as "len",
       indc.index_name,
       case when indc.type# = 2 then 'B'
            when bitand(indc.property, 1) = 1 then 'U'
            when bitand(indc.flags, 32) = 32 then 'C' || indc.prefix_length END as "F",
       round(indc.leaf_blocks*8/1024,1) as "Mb", indc.distinct_keys as "Keys",
       indc.AVG_DATA_BLOCKS_PER_KE
gwenshap / gist:4121628
Created November 20, 2012 22:17
track PGA
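-- Summarize PGA memory by snapshot time (dt), schema, machine and module from the track_pga snapshot table:
-- MB used and allocated, session counts (including sessions logged on within the last 30 minutes),
-- each group's share of memory and sessions, and growth in allocated memory between snapshots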
select * from (
select grouping_id(module) g,dt, schemaname,machine,decode(grouping_id(module),1,'{------}',module) module, round(sum(pga_used_mem)/1024/1024) as mb_used, round(sum(pga_alloc_mem)/1024/1024) as mb_alloc,
count(*) CNT,count(case when logon_time > dt - interval '30' minute then 1 end) cnt_30min,
round(ratio_to_report(sum(pga_alloc_mem)) over(partition by dt,grouping_id(machine,module))*100) "%mem",
round(ratio_to_report(count(*)) over(partition by dt,grouping_id(machine,module))*100) "%cnt", round(sum(diff_alloc)/1024/1024) MB_grow,
count(*)-lag(count(*)) over (partition by schemaname,machine,module order by dt) ses_added
from (
select pga_alloc_mem-lag(pga_alloc_mem) over (partition by spid,sid,serial# order by dt) diff_alloc,
t.* from track_pga t
)
gwenshap / gist:4122337
Created November 21, 2012 00:50
unshared cursors
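-- Unpivot the reason-flag columns of v$sql_shared_cursor to count child cursors per sql_id and non-sharing reason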
select * from
(select sql_id, nonshared_reason, count(*) from v$sql_shared_cursor
unpivot
(nonshared_value for nonshared_reason in (
UNBOUND_CURSOR as 'UNBOUND_CURSOR',
SQL_TYPE_MISMATCH as 'SQL_TYPE_MISMATCH',
OPTIMIZER_MISMATCH as 'OPTIMIZER_MISMATCH',
OUTLINE_MISMATCH as 'OUTLINE_MISMATCH',
STATS_ROW_MISMATCH as 'STATS_ROW_MISMATCH',
LITERAL_MISMATCH as 'LITERAL_MISMATCH',
gwenshap / gist:7044525
Created October 18, 2013 17:01
Move CM Database to MySQL
These are the steps to follow in order to migrate the SCM database from PostgreSQL to MySQL - let me know if you need assistance or have any further questions.
1. Stop all cluster services, including the Cloudera Manager monitoring services and Agents on all hosts, using the Cloudera Manager UI.
2. Invoke the API call to /api/v4/cm/deployment. For example, this directs the resulting output to a file for later use:
$ curl -v -u admin:password http://your_cloudera_manager_host:7180/api/v4/cm/deployment > path/to/file
Note: It is critical that this file be retained and protected, as it will be needed to re-import the functional configuration into the new database.
3. Take a fail-safe backup of the embedded PostgreSQL database, or make a tarball (or other copy) of the /var/lib/cloudera-scm-server-db/ directory and store it in a safe location. Either will accomplish the needed outcome; pick whichever is most convenient for your environment. Also back up the contents of the directory /etc/cloudera-scm-server/*
4. On the serv
gwenshap / gist:7226124
Created October 30, 2013 02:06
POM for hadoop project with multiple components
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.hadooprecipes.recipes</groupId>
  <artifactId>hadoop-recipes</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>Samples</name>
gwenshap / gist:7226166
Created October 30, 2013 02:11
pom for geo-spatial project
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>com.foobar.tps.spatial-bar</groupId>
    <artifactId>spatial-bar</artifactId>
    <version>1.0.13-SNAPSHOT</version>
  </parent>
gwenshap / cleancsv.py
Created March 13, 2014 17:43
Python for cleaning CSVs from in-record newlines
#!/usr/bin/python
import csv
import sys
import argparse
import io
csv.field_size_limit(sys.maxsize)
parser = argparse.ArgumentParser(description='Clean csv of in-line newlines')
parser.add_argument('infile', help='Path to input CSV file')
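The preview stops after the argument parsing. Below is a minimal sketch of the general approach, assuming Python 3 and that the goal is to collapse newlines inside quoted fields into spaces; the clean_file function and the output-path argument are illustrative, not part of the original script.
#!/usr/bin/env python3
# Sketch only: read a CSV whose quoted fields may contain embedded newlines
# and write it back out with those newlines replaced by spaces.
import csv
import sys

csv.field_size_limit(sys.maxsize)

def clean_file(in_path, out_path):
    with open(in_path, newline='') as fin, open(out_path, 'w', newline='') as fout:
        reader = csv.reader(fin)   # the csv module keeps newlines inside quoted fields within one logical row
        writer = csv.writer(fout)
        for row in reader:
            # collapse in-record newlines so every output record is a single physical line
            writer.writerow([field.replace('\r', ' ').replace('\n', ' ') for field in row])

if __name__ == '__main__':
    clean_file(sys.argv[1], sys.argv[2])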
gwenshap / gist:9699072
Created March 22, 2014 00:03
Using Hive to Efficiently Merge Data
-- create temporary table for merging
CREATE EXTERNAL TABLE staging.ItemFulfillment_merged (
  ItemFulfillment_b13aFilingOptionFedEx STRING,
  ItemFulfillment_b13aStatementDataFedEx STRING,
  ItemFulfillment_backupEmailAddressFedEx STRING,
  ItemFulfillment_backupEmailAddressUps STRING,
  ItemFulfillment_internalId STRING,
  ItemFulfillment_itemList_item_item_internalId STRING,