This document is a translation of the LTSV FAQ (Japanese).
LTSV (Labeled Tab-Separated Values) is a text format specification, like CSV, TSV, and JSON. It is useful for httpd access logging.
The specification is available at http://ltsv.org .
Is LTSV just a log format?
Yes!
For example, the following access log line can be converted to LTSV:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
like this:
host:127.0.0.1<TAB>ident:-<TAB>user:frank<TAB>time:[10/Oct/2000:13:55:36 -0700]<TAB>req:GET /apache_pb.gif HTTP/1.0<TAB>status:200<TAB>size:2326<TAB>referer:http://www.example.com/start.html<TAB>ua:Mozilla/4.08 [en] (Win98; I ;Nav)
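A minimal Ruby sketch of this conversion, assuming the labels shown in the record above. The helper name `combined_to_ltsv` and the regular expression are illustrative assumptions, not part of the LTSV specification; the pattern only covers well-formed lines without escaped quotes inside the quoted fields.

```ruby
# Hypothetical helper: convert one Apache 'combined' log line to LTSV.
# The pattern is an assumption that covers the common case only.
COMBINED_PATTERN = /\A(\S+) (\S+) (\S+) (\[[^\]]+\]) "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"\z/
LABELS = %w[host ident user time req status size referer ua].freeze

def combined_to_ltsv(line)
  match = COMBINED_PATTERN.match(line.chomp)
  return nil unless match # the line does not look like combined format

  # Pair each label with its captured value and join the fields with TABs.
  LABELS.zip(match.captures).map { |label, value| "#{label}:#{value}" }.join("\t")
end

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 ' \
       '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
puts combined_to_ltsv(line)
```

Lines that do not match return `nil`, so a caller can decide whether to skip or report them.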
The 'combined' log format, which is common for Apache access_log, has a couple of drawbacks:
- it's inconvenient to parse
- it's hard to add new fields
Everyone has been using this format only because everyone else has been using it. But recently someone noticed that the above problems can be solved by LTSV, which requires only very small changes. That's why many people are excited about this format.
For more details about this excitement, see the following page. (Japanese)
- easy to parse
- in ruby:
Hash[gets.chomp.split("\t").map{|f| f.split(":", 2)}]
- no dedicated parser is required
- no dedicated formatter is required to output data
- you can set it up in the built-in log format configuration of Apache/nginx
- thanks to labeled values, the parsed data is easy to process
- the row-oriented format makes it easy to integrate with other programs
For more detail, see the following URLs (both Japanese):
Imagine you have an LTSV log like this:
host:127.0.0.1<TAB>ident:-<TAB>user:frank<TAB>
And you have a hundred scripts that parse the logs and do something with them:
    #!/usr/bin/env ruby
    while gets
      # chomp drops the trailing newline so it does not end up in the last field
      record = Hash[$_.chomp.split("\t").map{|f| f.split(":", 2)}]
      # do something for the record
    end
One day, you notice that the log doesn't contain a timestamp, and you want to add one:
time:[10/Oct/2000:13:55:36 -0700]<TAB>host:127.0.0.1<TAB>ident:-<TAB>user:frank<TAB>
Does this change affect the hundred scripts? Do they fail to parse the new data? No.
If the log used the combined format and was parsed with regular expressions, none of the scripts would work after the change.
With LTSV, the scripts are not affected no matter where you insert the time field. Additionally, if a script accepts an arbitrary number of fields, it can use the timestamp as soon as you add the time field to the record.
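To make the point concrete, here is a small Ruby sketch that parses the record before and after the time field is added. The helper name `parse_ltsv` is only an illustration; it uses the same one-liner approach as the scripts in this FAQ.

```ruby
# Parse one LTSV line into a Hash of label => value.
def parse_ltsv(line)
  Hash[line.chomp.split("\t").map { |f| f.split(":", 2) }]
end

old_line = "host:127.0.0.1\tident:-\tuser:frank\t\n"
new_line = "time:[10/Oct/2000:13:55:36 -0700]\thost:127.0.0.1\tident:-\tuser:frank\t\n"

old_record = parse_ltsv(old_line)
new_record = parse_ltsv(new_line)

# Lookups by label keep working wherever the new field was inserted.
puts old_record["host"] == new_record["host"]
# A script that wants the timestamp can read it right away.
puts new_record["time"]
# A script that never asks for "time" is not affected by its presence.
puts old_record.key?("time")
```

Because fields are looked up by label instead of by position, the insertion point of the new field makes no difference to existing code.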
Compared to the 'combined' log:
- it's a bit less readable than the combined log
- but do you really think the combined log is readable?
- record size increases by the length of the field names
I think these points are either not critical or can be solved.
- While JSON and MessagePack are good for labeling data, data in those formats is not easy to parse.
- Generating Apache/nginx logs in JSON format takes non-trivial work.
The advantage of LTSV is that users can migrate from a less extensible log format without effort.
The specification of LTSV is as follows:
- do not use a colon ":" in a key, because the colon is the delimiter between key and value in LTSV.
- each field is delimited by a TAB.
That's all.
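As a sketch of how little a writer needs, the following Ruby helper builds one LTSV line from a Hash. The name `to_ltsv` is hypothetical, and rejecting TAB and newline characters is my own assumption about how to handle data the format cannot represent; the rules above only mention the colon and the TAB delimiter.

```ruby
# Hypothetical formatter: build one LTSV line from a Hash.
# Labels must not contain ":"; labels and values must not contain TAB or
# newline characters (an assumption, since no escaping is defined here).
def to_ltsv(record)
  record.map do |label, value|
    label = label.to_s
    value = value.to_s
    raise ArgumentError, "label must not contain ':': #{label.inspect}" if label.include?(":")
    if (label + value).match?(/[\t\r\n]/)
      raise ArgumentError, "TAB or newline in field: #{label.inspect}"
    end

    "#{label}:#{value}"
  end.join("\t")
end

puts to_ltsv(host: "127.0.0.1", status: 200, req: "GET / HTTP/1.0")
```

Values may still contain colons, because a reader only splits each field on its first colon.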
The LTSV specification doesn't define escaping.
Here are the reasons:
- parsing becomes harder if escaping is defined strictly.
- "Hey, I wonder if some string like User-Agent contains a TAB character ..." -> "Never."
- Kazuho Oku mentioned this in his blog (Japanese). According to the blog, Apache HTTP Server escapes all control characters in the log due to a vulnerability.
You can find many implementations in various languages at ltsv.org, but you don't need any of them to parse LTSV.
Just use this tiny script:
    #!/usr/bin/env ruby
    while gets
      # chomp drops the trailing newline so it does not end up in the last field
      record = Hash[$_.chomp.split("\t").map{|f| f.split(":", 2)}]
      p record
    end
Pretty easy! I don't need an escape specification, but there is some discussion about extended specifications such as strict-LTSV.
Of course!
Since it's just a format specification like CSV, TSV, and JSON, you can use LTSV anywhere.
Yes. If you apply LTSV to access logs, I recommend using the labels listed under "Recommendation for labeling" at ltsv.org.
You can use a filter like ltsview. It's pretty easy to implement a filter.
If you have this kind of filter, you can tail the log with formatting:
$ tail -f access_log | ltsview
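To show how small such a filter can be, here is a Ruby sketch of a viewer. It is not the real ltsview; the name `view_ltsv` and the output layout are assumptions made for illustration.

```ruby
require "stringio"

# Hypothetical viewer: print each record as aligned "label: value" lines,
# followed by a separator line.
def view_ltsv(input, output = $stdout)
  input.each_line do |line|
    fields = line.chomp.split("\t").map { |f| f.split(":", 2) }
    # Pad labels to the longest one in this record so the values line up.
    width = fields.map { |label, _| label.length }.max || 0
    fields.each { |label, value| output.puts "#{label.ljust(width)}: #{value}" }
    output.puts "---"
  end
end

# Demo on an in-memory record; pass $stdin instead to use it in a pipe.
view_ltsv(StringIO.new("host:127.0.0.1\tstatus:200\n"))
```

Replacing the demo call with `view_ltsv($stdin)` turns the script into a filter that can sit at the end of a pipe such as the `tail -f` command above.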
If you want to read a log in the combined format, you can write a filter that converts LTSV to the combined format.
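A minimal Ruby sketch of such a converter, assuming the labels from the example record earlier in this FAQ (host, ident, user, time, req, status, size, referer, ua). The name `ltsv_to_combined` and the choice to print "-" for missing fields are illustrative assumptions.

```ruby
# Hypothetical converter from one LTSV line back to 'combined' format.
def ltsv_to_combined(line)
  r = Hash[line.chomp.split("\t").map { |f| f.split(":", 2) }]
  r.default = "-" # labels that are absent from the record print as "-"
  format('%s %s %s %s "%s" %s %s "%s" "%s"',
         r["host"], r["ident"], r["user"], r["time"], r["req"],
         r["status"], r["size"], r["referer"], r["ua"])
end

ltsv = "host:127.0.0.1\tident:-\tuser:frank\ttime:[10/Oct/2000:13:55:36 -0700]\t" \
       "req:GET /apache_pb.gif HTTP/1.0\tstatus:200\tsize:2326\t" \
       "referer:http://www.example.com/start.html\tua:Mozilla/4.08 [en] (Win98; I ;Nav)"
puts ltsv_to_combined(ltsv)
```

Extra labels in the record are ignored, so the converter keeps working when new fields are added to the log.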
Since the LTSV specification is based on the UNIX philosophy, LTSV is row-oriented, self-describing, and open for extension. That's why you can implement an LTSV filter very easily.
This is my personal opinion:
- In access_log, the request URI, User-Agent, and Referer should account for a large portion of the total data size, so the size of the labels doesn't matter.
- If your system is so big that adding labels generates a huge amount of data, you should already have another solution that can process/ingest massive logs.
- processing with MapReduce (Hadoop or Amazon EMR), storing data in a DWH, etc.
- If you import the log via fluentd, for example, the size overhead of the labels goes away.
Hatena, a large web service company in Japan (more than 1M users), has used LTSV for 3 years. This suggests that the label size doesn't matter if your system is smaller than Hatena's.
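If you want to check the overhead on your own logs, a rough Ruby sketch like the following counts how many bytes of a line are labels. The helper name `label_overhead` is hypothetical, and the sample record is the short example from this FAQ, so real logs with longer URLs and User-Agent strings will show different numbers.

```ruby
# Hypothetical helper: measure how many bytes of one LTSV line are labels.
def label_overhead(line)
  record = line.chomp
  # Each field costs its label plus one byte for the ":" separator.
  label_bytes = record.split("\t").sum { |f| f.split(":", 2).first.to_s.bytesize + 1 }
  { label_bytes: label_bytes, total_bytes: record.bytesize }
end

line = "host:127.0.0.1\tident:-\tuser:frank\ttime:[10/Oct/2000:13:55:36 -0700]\t" \
       "req:GET /apache_pb.gif HTTP/1.0\tstatus:200\tsize:2326\t" \
       "referer:http://www.example.com/start.html\tua:Mozilla/4.08 [en] (Win98; I ;Nav)"
stats = label_overhead(line)
puts format("%d of %d bytes (%.1f%%) are labels",
            stats[:label_bytes], stats[:total_bytes],
            100.0 * stats[:label_bytes] / stats[:total_bytes])
```

Running it over a sample of a real access log gives a quick estimate of how much the labels add before you decide whether the size matters.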
No.
LTSV is just a format specification, and it's not a subproject of any other software. The fluentd community is so excited because fluentd is popular for processing Apache and nginx access logs.
If you use LTSV, the fluentd configuration becomes simple and DRY. This solves a tough problem for administrators (especially in long-term operation).
A blogger wrote a Perl script:
404 Blog Not Found:perl - Apache Combined Log を LTSV に (Japanese)
You can use this script without any external library.
As with other open specifications, no one makes decisions for the specification. Anyone who is interested in LTSV can take any action. You may think of me as a leader or a committee member because I wrote this FAQ, but I don't have any special privileges in the LTSV community or over the specification.
Though @stanaka owns the ltsv.org domain and is a central person in the community, the ltsv.org repository is public and anyone can join the community.
The internet is interesting!
Search for 'ltsv' on Twitter. I recommend a plain keyword search rather than the hashtag. The word LTSV is easy to search for :)
I would appreciate it if you helped promote LTSV globally :)
For example, you can translate documents and send them to @stanaka, or just send a pull request. You can also submit an entry to Hacker News.
As I mentioned above, there is no decision maker, so you can do anything! Please write something on your blog, implement a parser and tweet it with the hashtag #ltsv, or write an English document.
Enjoy!