
@qxj
qxj / hadoop_avro_job.sh
Created June 9, 2015 07:46
If input files are serialized with Avro, deserialize them with org.apache.avro.mapred.AvroAsTextInputFormat in Hadoop streaming.
#!/usr/bin/env bash
# @(#) norm.sh Time-stamp: <Julian Qian 2015-06-09 15:35:35>
# Copyright 2015 Julian Qian
# Author: Julian Qian <[email protected]>
# Version: $Id: norm.sh,v 0.1 2015-06-08 18:03:30 jqian Exp $
#
day=$(date +%Y%m%d -d yesterday)
input=/user/hive/warehouse/query_log/ds=$day/hr=00
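The preview stops after the input path. A minimal Python sketch of the streaming invocation such a script typically assembles — the jar locations, output directory, and identity mapper are hypothetical placeholders, not taken from the gist:

```python
import datetime

# Yesterday's partition, mirroring the date arithmetic in the shell script.
day = (datetime.date.today() - datetime.timedelta(days=1)).strftime("%Y%m%d")
input_path = "/user/hive/warehouse/query_log/ds=%s/hr=00" % day

# Hypothetical jar paths; adjust for your cluster layout.
cmd = [
    "hadoop", "jar", "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
    "-libjars", "avro-mapred-1.7.7-hadoop2.jar",
    "-inputformat", "org.apache.avro.mapred.AvroAsTextInputFormat",
    "-input", input_path,
    "-output", "/tmp/query_log_text/%s" % day,
    "-mapper", "/bin/cat",       # identity mapper: just dump records as text
    "-reducer", "NONE",
]
print(" ".join(cmd))
```

With AvroAsTextInputFormat each Avro record reaches the mapper as a JSON-encoded text line, so ordinary streaming mappers can process it.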
@qxj
qxj / crontab.sh
Created May 16, 2015 02:04
Collect *.cron files in the directory, then APPEND them to the original crontab
#!/usr/bin/env bash
# @(#) crontab.sh Time-stamp: <Julian Qian 2015-05-15 18:14:28>
# Copyright 2015 Julian Qian
# Author: Julian Qian <[email protected]>
# Version: $Id: crontab.sh,v 0.1 2015-05-14 10:53:03 jqian Exp $
#
# Collect *.cron files in the directory, then APPEND them to the original crontab
# TODO: fix potential conflicts when more than one crontab.sh instance is running concurrently.
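The merge logic the script describes can be sketched in a few lines of Python — collect every `*.cron` file and append only entries not already in the current crontab. The directory layout and file names below are illustrative, not from the gist:

```python
import glob
import os
import tempfile

def merge_cron_entries(cron_dir, current_lines):
    """Collect lines from every *.cron file under cron_dir and append any
    entries not already present in the current crontab lines."""
    merged = list(current_lines)
    seen = set(current_lines)
    for path in sorted(glob.glob(os.path.join(cron_dir, "*.cron"))):
        with open(path) as fp:
            for line in fp:
                line = line.rstrip("\n")
                if line and line not in seen:
                    merged.append(line)
                    seen.add(line)
    return merged

# Demo with a throwaway directory standing in for the gist's cron directory.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "backup.cron"), "w") as fp:
    fp.write("0 3 * * * /usr/local/bin/backup.sh\n")
existing = ["0 1 * * * /usr/local/bin/rotate-logs.sh"]
print(merge_cron_entries(tmp, existing))
```

A real version would feed the merged list back through `crontab -` and, per the TODO above, take a lock file first to avoid concurrent writers clobbering each other.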
@qxj
qxj / kafka.md
Last active August 29, 2015 14:20 — forked from ashrithr/kafka.md

Introduction to Kafka

Kafka acts as a kind of write-ahead log (WAL): it records messages to a persistent store (disk) and allows subscribers to read and apply these changes to their own stores in a system-appropriate time frame.

Terminology:

  • Producers send messages to brokers
  • Consumers read messages from brokers
  • Messages are sent to a topic
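The terminology above can be made concrete with a toy in-memory model — not Kafka's API, just an illustration of the log-plus-offsets idea: producers append to an ordered topic log, and each consumer reads at its own pace by tracking its own offset:

```python
class TopicLog:
    """Toy append-only log illustrating Kafka's model: producers append
    messages to a topic; each consumer tracks its own read offset."""
    def __init__(self):
        self.messages = []   # persistent, ordered message log
        self.offsets = {}    # consumer name -> next offset to read

    def produce(self, message):
        self.messages.append(message)

    def consume(self, consumer, max_messages=10):
        start = self.offsets.get(consumer, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[consumer] = start + len(batch)
        return batch

log = TopicLog()
log.produce("event-1")
log.produce("event-2")
print(log.consume("billing"))    # reads both messages
print(log.consume("audit", 1))   # an independent consumer, one at a time
```

Because the broker only stores the log and each subscriber owns its offset, slow consumers never block fast ones — the property the WAL analogy is pointing at.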
@qxj
qxj / lda_gibbs.py
Last active August 29, 2015 14:19 — forked from mblondel/lda_gibbs.py
"""
(C) Mathieu Blondel - 2010
License: BSD 3 clause
Implementation of the collapsed Gibbs sampler for
Latent Dirichlet Allocation, as described in
Finding scientific topics (Griffiths and Steyvers)
"""
@qxj
qxj / lr.py
Created April 26, 2015 12:21
Python logistic regression (with L2 regularization)
#!/usr/bin/env python
# -*- coding: utf-8; tab-width: 4; -*-
# @(#) lr.py
# http://blog.smellthedata.com/2009/06/python-logistic-regression-with-l2.html
#
from scipy.optimize.optimize import fmin_cg, fmin_bfgs, fmin
import numpy as np
def sigmoid(x):
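The preview cuts off at `sigmoid`. A self-contained sketch in the same spirit — L2-penalized logistic loss, its gradient, and a BFGS fit on toy data, matching the gist's scipy imports. The {-1, +1} label convention and the toy problem are my assumptions, not taken from the linked post:

```python
import numpy as np
from scipy.optimize import fmin_bfgs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_l2(beta, X, y, alpha):
    """Negative log-likelihood with L2 penalty; labels y in {-1, +1}."""
    margins = y * X.dot(beta)
    return np.sum(np.log(1.0 + np.exp(-margins))) + 0.5 * alpha * beta.dot(beta)

def grad_nll_l2(beta, X, y, alpha):
    margins = y * X.dot(beta)
    # d/dbeta of log(1 + exp(-m)) is -y*x*sigmoid(-m)
    return -X.T.dot(y * sigmoid(-margins)) + alpha * beta

# Toy 1-D problem: class is the sign of x, so the fitted weight should be positive.
rng = np.random.RandomState(0)
X = rng.randn(100, 1)
y = np.where(X[:, 0] > 0, 1.0, -1.0)
beta = fmin_bfgs(nll_l2, np.zeros(1), fprime=grad_nll_l2,
                 args=(X, y, 0.1), disp=False)
print(beta)
```

The L2 term `0.5 * alpha * beta.dot(beta)` keeps the weights finite even when the classes are perfectly separable, which is exactly the case the unregularized loss cannot handle.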
@qxj
qxj / example.sh
Last active October 20, 2015 07:05
Tail log files and publish the text stream to a remote Kafka server.
#!/bin/bash
log_agent.py publish --file '/home/work/log/weblog/web/pp-stats_*.log' --file '/home/work/log/weblog/donatello/web_*.log' --status ~/log_agent.status --throttling 1000 --monitor 10.161.19.223:12121
@qxj
qxj / avro_cli_test.php
Created March 23, 2015 08:03
Avro PHP library benchmark
<?php
require_once('../lib/avro.php');
function create_record() {
    $rec = array('member_id' => 1392, 'member_id2' => 999);
    for ($i = 0; $i < 20; $i++) {
        $rec['field' . $i] = 'test_value' . $i;
    }
    return $rec;
}
[unix_http_server]
file=/tmp/supervisor.sock ; path to your socket file
[supervisord]
logfile=/var/log/supervisord/supervisord.log ; supervisord log file
logfile_maxbytes=50MB ; maximum size of logfile before rotation
logfile_backups=10 ; number of backed up logfiles
loglevel=error ; info, debug, warn, trace
pidfile=/var/run/supervisord.pid ; pidfile location
nodaemon=false ; run supervisord as a daemon
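This fragment only covers the daemon itself; the work supervisord actually manages lives in `[program:x]` sections. A hedged example — the program name and every path below are hypothetical, not from the original config:

```ini
[program:log_agent]                                       ; hypothetical program name
command=/usr/bin/python /home/work/bin/log_agent.py       ; hypothetical path
autostart=true                                            ; start when supervisord starts
autorestart=true                                          ; restart on unexpected exit
stdout_logfile=/var/log/supervisord/log_agent.out.log
stderr_logfile=/var/log/supervisord/log_agent.err.log
```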
worker_processes 2;
error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;
events {
    worker_connections 1024;
    use epoll;
}
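The preview ends after the events block; in a full nginx.conf an `http` block with at least one `server` usually follows. A sketch with a hypothetical upstream and paths, not taken from the original file:

```nginx
http {
    include /etc/nginx/mime.types;
    access_log /var/log/nginx/access.log;

    server {
        listen 80;
        location / {
            proxy_pass http://127.0.0.1:8080;  # hypothetical backend
        }
    }
}
```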
import io
import avro.schema
import avro.io
import lipsum
import random
from kafka.client import KafkaClient
from kafka.producer import SimpleProducer, KeyedProducer
g = lipsum.Generator()
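The imports suggest the rest of this gist builds lipsum-filled records, Avro-encodes them, and publishes via kafka-python's old `SimpleProducer` API. A runnable stand-in that keeps the shape of that pipeline — JSON replaces Avro binary encoding so the sketch has no external dependencies, and the broker calls are shown only as comments; record fields and the topic name are my assumptions:

```python
import io
import json
import random

WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

def fake_record():
    """Stand-in for the gist's lipsum-generated payload."""
    return {
        "member_id": random.randint(1, 10000),
        "text": " ".join(random.choice(WORDS) for _ in range(8)),
    }

def encode(record):
    # JSON stands in for Avro binary encoding here
    # (avro.io.DatumWriter + BinaryEncoder in the original imports).
    buf = io.BytesIO()
    buf.write(json.dumps(record).encode("utf-8"))
    return buf.getvalue()

payload = encode(fake_record())
print(len(payload))

# With a broker available, the gist's old kafka-python API would be roughly:
#   client = KafkaClient("localhost:9092")
#   producer = SimpleProducer(client)
#   producer.send_messages(b"test-topic", payload)
```

Swapping `encode` for a real Avro writer only changes the serialization step; the produce loop is unaffected, which is the point of keeping encoding and transport separate.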