@YoshihitoAso
Created May 16, 2013 06:40
[Treasure Data][Fluentd] Notes from sending Apache's access_log to Treasure Data
Prerequisite: create a Treasure Data (TD) account.
▼ Install fluentd (td-agent) on the server to be monitored
$ curl -OL http://toolbelt.treasure-data.com/sh/install-redhat.sh
$ chmod 755 install-redhat.sh
$ ./install-redhat.sh
$ rm -f install-redhat.sh
$ service td-agent start
$ chkconfig td-agent on
▼ Change group and permissions on the log paths td-agent reads (chgrp, chmod)
$ sudo chgrp td-agent /var/log/httpd/
$ sudo chgrp td-agent /var/log/messages
$ sudo chgrp td-agent /var/log/secure
$ sudo chgrp td-agent /var/log/cron
$ sudo chmod g+rx /var/log/httpd/
$ sudo chmod g+rx /var/log/messages
$ sudo chmod g+rx /var/log/secure
$ sudo chmod g+rx /var/log/cron
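The chgrp/chmod pairs above all do the same thing per path: hand the path to the td-agent group and grant that group read and execute. A minimal Python sketch of that step, demonstrated on a temporary directory with the current gid (run it against the /var/log paths as root with the td-agent group's gid instead):

```python
import os
import stat
import tempfile

def grant_group_rx(path, gid):
    """Equivalent of "chgrp GROUP path; chmod g+rx path"."""
    os.chown(path, -1, gid)  # -1 leaves the file's owner unchanged
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IRGRP | stat.S_IXGRP)  # g+rx

# Demonstrate on a temp directory so the sketch runs without root.
demo = tempfile.mkdtemp()
grant_group_rx(demo, os.stat(demo).st_gid)
print(bool(os.stat(demo).st_mode & stat.S_IRGRP))
```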
▼ Install the Treasure Data Toolbelt on the client (a Windows PC in my case)
Get the Windows installer from:
http://toolbelt.treasure-data.com/win
Set up your account:
$ td account -f
Enter your Treasure Data credentials.
Email:
Password (typing will be hidden):
Check the API key:
$ td apikey:show
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
▼ Edit the td-agent.conf settings
# tail apache access_log
<source>
  type tail
  format apache
  path /var/log/httpd/access_log
  tag td.testdb.www_access
</source>

<match td.*.*>
  type tdlog
  apikey XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  auto_create_table
  buffer_type file
  buffer_path /var/log/td-agent/buffer/td
  use_ssl true
</match>
apikey is the key obtained above.
Create the database beforehand (e.g. td db:create testdb).
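The `format apache` setting tells fluentd how to split each access_log line into fields (host, user, method, path, code, size, referer, agent), which is why the queries later can select `v['agent']` and `v['path']`. A rough sketch of that parsing in Python, using a regex approximating the combined log format (the exact built-in pattern may differ):

```python
import re

# Approximation of the combined-log pattern behind fluentd's "format apache";
# group names mirror the field keys that end up in Treasure Data.
APACHE_RE = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<code>\d+) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

line = ('127.0.0.1 - - [16/May/2013:06:24:30 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"http://example.com/" "Mozilla/5.0"')

record = APACHE_RE.match(line).groupdict()
print(record['path'], record['code'], record['agent'])
```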
▼ Restart the service
Once the configuration is done, restart the service:
$ service td-agent restart
Generate some actual access; after about five minutes the data should be stored:
$ td tables
+----------+------------+------+-------+--------+
| Database | Table | Type | Count | Schema |
+----------+------------+------+-------+--------+
| testdb | www_access | log | 175 | |
+----------+------------+------+-------+--------+
1 row in set
▼ Sample queries
Try issuing queries like the following.
○ Aggregate by user agent:
$ td query -w -d testdb "SELECT v['agent'] AS agent, COUNT(1) AS cnt FROM www_access GROUP BY v['agent'] ORDER BY cnt DESC LIMIT 3"
---- Hive log output like the following appears -----
Job 2909409 is queued.
Use 'td job:show 2909409' to show the status.
queued...
started at 2013-05-16T06:24:30Z
Hive history file=/mnt/hive/tmp/932/hive_job_log__1147327208.txt
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Defaulting to jobconf value of: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201305140230_3827, Tracking URL = http://ip-10-143-152-77.ec2.internal:50030/jobdetails.jsp?jobid=job_201305140230_3827
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305140230_3827
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2013-05-16 06:24:48,177 Stage-1 map = 0%, reduce = 0%
2013-05-16 06:24:53,259 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.59 sec
2013-05-16 06:24:54,289 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.59 sec
2013-05-16 06:24:55,344 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.59 sec
2013-05-16 06:24:56,363 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.59 sec
2013-05-16 06:24:57,383 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.59 sec
2013-05-16 06:24:58,404 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.59 sec
2013-05-16 06:24:59,422 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 5.09 sec
2013-05-16 06:25:00,441 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 5.09 sec
2013-05-16 06:25:01,461 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 5.09 sec
2013-05-16 06:25:02,480 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 7.95 sec
2013-05-16 06:25:03,500 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 7.95 sec
2013-05-16 06:25:04,519 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 7.95 sec
2013-05-16 06:25:05,528 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 7.95 sec
2013-05-16 06:25:06,538 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 10.45 sec
2013-05-16 06:25:07,547 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 10.45 sec
2013-05-16 06:25:08,559 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 10.45 sec
2013-05-16 06:25:09,576 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.02 sec
2013-05-16 06:25:10,585 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.02 sec
2013-05-16 06:25:11,595 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.02 sec
2013-05-16 06:25:12,604 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.02 sec
MapReduce Total cumulative CPU time: 13 seconds 20 msec
Ended Job = job_201305140230_3827
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201305140230_3828, Tracking URL = http://ip-10-143-152-77.ec2.internal:50030/jobdetails.jsp?jobid=job_201305140230_3828
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305140230_3828
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2013-05-16 06:25:20,184 Stage-2 map = 0%, reduce = 0%
2013-05-16 06:25:25,239 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.83 sec
2013-05-16 06:25:26,257 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.83 sec
2013-05-16 06:25:27,267 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.83 sec
2013-05-16 06:25:28,287 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.83 sec
2013-05-16 06:25:29,301 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.4 sec
2013-05-16 06:25:30,325 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.4 sec
2013-05-16 06:25:31,334 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.4 sec
MapReduce Total cumulative CPU time: 3 seconds 400 msec
Ended Job = job_201305140230_3828
finished at 2013-05-16T06:25:32Z
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 4 Cumulative CPU: 13.02 sec HDFS Read: 537 HDFS Write: 1070 SUCCESS
Job 1: Map: 1 Reduce: 1 Cumulative CPU: 3.4 sec HDFS Read: 2179 HDFS Write: 357 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 420 msec
OK
MapReduce time taken: 52.696 seconds
Time taken: 52.916 seconds
Status : success
Result :
+--------------------------------------------------------------------------------------------------------------+-----+
| agent | cnt |
+--------------------------------------------------------------------------------------------------------------+-----+
| Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31 | 167 |
| check_http/v1.4.15 (nagios-plugins 1.4.15) | 5 |
| facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) | 2 |
+--------------------------------------------------------------------------------------------------------------+-----+
…so Facebook's crawler has been visiting (・_・;)
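The GROUP BY / ORDER BY cnt DESC / LIMIT 3 pattern above is just a count-and-rank aggregation. A minimal in-memory sketch of the same user-agent ranking, with made-up records standing in for rows of www_access:

```python
from collections import Counter

# Made-up records; only the 'agent' field matters for this query.
records = (
    [{'agent': 'Mozilla/5.0'}] * 167
    + [{'agent': 'check_http/v1.4.15'}] * 5
    + [{'agent': 'facebookexternalhit/1.1'}] * 2
)

# SELECT v['agent'], COUNT(1) AS cnt ... GROUP BY ... ORDER BY cnt DESC LIMIT 3
top3 = Counter(r['agent'] for r in records).most_common(3)
for agent, cnt in top3:
    print(agent, cnt)
```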
○ Top paths:
$ td query -w -d testdb \
"SELECT v['path'] AS path, COUNT(1) AS cnt \
FROM www_access \
GROUP BY v['path'] ORDER BY cnt DESC LIMIT 3"
○ Access ranking for a given day:
$ td query -w -d testdb \
"SELECT v['referer'] AS referer, COUNT(1) AS cnt \
FROM www_access \
WHERE \
TD_TIME_RANGE(time, '2013-05-16', '2013-05-17', 'PDT') \
GROUP BY v['referer'] ORDER BY cnt DESC LIMIT 3"
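TD_TIME_RANGE filters on the record's unix `time` column, with the start bound inclusive and the end bound exclusive. A sketch of the equivalent check for one day, assuming a fixed UTC-7 offset for 'PDT' (ignoring DST transitions):

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7))  # assumption: fixed UTC-7, no DST handling
start = datetime(2013, 5, 16, tzinfo=PDT).timestamp()
end = datetime(2013, 5, 17, tzinfo=PDT).timestamp()

def in_day(t):
    # Mirrors TD_TIME_RANGE(time, start, end, tz): start inclusive, end exclusive
    return start <= t < end

print(in_day(start), in_day(end))
```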
For other samples, see the official docs:
http://docs.treasure-data.com/articles/analyzing-apache-logs