@csabahenk
Last active December 22, 2015 20:39
Some fun with Python and JSON

This document will help you to introspect complex Python data.

Requirements

We need some JSON goodness to start with.

Required

  • JSONPickle: a Python module that lets you dump complete Python objects and structures as JSON

Optional

This writeup relies on the following tools because I fancy them, but you can replace them with alternatives (should there be any you fancy more).

  • My choice of JSON parser is YAJL-Ruby.
    • The primary reason to love it is its ability to parse JSON streams, i.e. a concatenated series of JSON objects (e.g. {}[]; most JSON parsers barf on this because of "trailing characters", and can't take subsequent action on {} and []).
    • The other reason to love it is its very lightweight and intuitive API. Much more likable than that of the similar Python binding!

Of course, in turn, you need Yajl and Ruby installed for this.
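
To make the notion of a JSON stream concrete, here is a minimal Python 3 sketch using only the standard library. The name `iter_json_stream` is made up for this illustration and is not part of any library. It first shows a plain parser rejecting the stream, then walks the stream with `json.JSONDecoder.raw_decode`, which returns each parsed value together with the offset where parsing stopped.

```python
import json

def iter_json_stream(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value

stream = '{}[]{"a": 1}'

try:
    json.loads(stream)  # a plain parser rejects the concatenated stream
except json.JSONDecodeError as err:
    print("json.loads failed:", err.msg)

print(list(iter_json_stream(stream)))
```

This is only meant to show the idea; it needs the whole stream in memory as one string, which a streaming parser such as Yajl does not.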

  • My choice of JSON formatter is underscore-cli.
    • The primary reason to love it is that it formats the data with awareness of width. Most JSON formatters would format {"a":1,"b":2} either as

      ```json
      {"a": 1, "b": 2}
      ```

      or

      ```json
      {
        "a": 1,
        "b": 2
      }
      ```

      depending on whether indentation was asked for or not. underscore-cli chooses the layout based on the data and the screen size: it applies one-line formatting to small data and indented formatting to larger data, thus achieving optimal readability.
    • Other reasons to love it:
      • it features a functional API for manipulating and extracting information from JSON
      • it features colorized output

It's written for Node.js, so you'll need to have that installed.
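
The width-aware behaviour described above can be approximated in a few lines of Python 3. This is a rough sketch of the idea, not underscore-cli's actual algorithm: `format_width_aware` and its parameters are hypothetical names, and the rule used here (keep a value on one line whenever its compact form fits in the remaining width) is an assumption.

```python
import json
import shutil

def format_width_aware(value, width=None, indent=2, level=0, used=0):
    """Return `value` as JSON: on one line if it fits in `width`, else indented.

    `level` is the current nesting depth and `used` is the number of columns
    already taken on the current line (indentation plus any "key": prefix).
    """
    if width is None:
        width = shutil.get_terminal_size().columns
    one_line = json.dumps(value)
    is_container = isinstance(value, (dict, list))
    # Scalars, empty containers and anything short enough stay on one line.
    if not is_container or not value or used + len(one_line) <= width:
        return one_line
    inner = " " * (indent * (level + 1))
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():  # keys are assumed to be strings
            prefix = inner + json.dumps(key) + ": "
            lines.append(prefix + format_width_aware(item, width, indent, level + 1, len(prefix)))
        opener, closer = "{", "}"
    else:
        for item in value:
            lines.append(inner + format_width_aware(item, width, indent, level + 1, len(inner)))
        opener, closer = "[", "]"
    return opener + "\n" + ",\n".join(lines) + "\n" + " " * (indent * level) + closer

small = {"a": 1, "b": 2}
print(format_width_aware(small, width=80))  # fits, so it stays on one line
print(format_width_aware(small, width=10))  # too wide, so it gets indented
```

The decision is made per value, so a large document can still keep its small nested objects on a single line.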

Set up JSON dumping

Add the following code to your sitecustomize.py (or usercustomize.py):

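# Written for Python 2 (uses dict.iteritems and the print statement).
import os
import time
from gzip import GzipFile
from threading import Lock, current_thread
from string import Template
import distutils.dir_util as du
import jsonpickle

class Jlog(object):

    def __init__(self, ftmp):
        # ftmp is a string.Template pattern for the log path
        self.pathtemp = Template(ftmp)
        # serializes writes coming from different threads
        self.lock = Lock()
        # registry of open log files, keyed by resolved path
        self.jreg = {}

    # values for the path template; callables are evaluated on each lookup
    params = {'pid': os.getpid}

    def getpath(self):
        # resolve the path template against the current parameter values
        pd = {}
        for k,v in self.params.iteritems():
            if callable(v):
                v = v()
            pd[k] = v
        return self.pathtemp.substitute(pd)

    @property
    def jhandle(self):
        # open the log file (and create its directory) on first use
        path = self.getpath()
        if not self.jreg.get(path):
            du.mkpath(os.path.dirname(path))
            self.jreg[path] = GzipFile(path, 'wb')
        return self.jreg[path]

    def jlog(self, *a, **kw):
        # positional arguments are stored under the keys data00, data01, ...
        for i in range(len(a)):
            kw["data%02d" % i] = a[i]
        d = {'data': kw, 'time': time.time(), 'pid': os.getpid(), 'thread': current_thread().getName()}
        with self.lock:
            self.jhandle.write(jsonpickle.encode(d))
            self.jhandle.flush()
        # print a summary to stdout: each value is replaced by its encoded length
        ld = {}
        for k,v in d['data'].iteritems():
            ld[k] = [len(jsonpickle.encode(v))]
        d['data'] = ld
        print 'JLOG ' + jsonpickle.encode(d)

jlog = Jlog("/tmp/jlog/${pid}.json.gz").jlog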

This sets up a canonical Jlog instance (called jlog) for any Python script you execute. Then in your Python code you can just add

from sitecustomize import jlog
...
jlog(<args>, <keywords>)

For example:

from threading import Thread
from sitecustomize import jlog

class Rectangle(object):
    def __init__(self,h,w):
        self.height = h
        self.width = w

def sayhirect(h,w):
    jlog("hello", "shape", shape=Rectangle(h,w))

t = Thread(target=sayhirect, args=(3,4))
t.start()
sayhirect(5,6)

This prints a summary to stdout as a hint of what's going on, something like:

JLOG {"pid": 19869, "data": {"shape": [60], "data00": [7], "data01": [7]}, "thread": "MainThread", "time": 1378916569.103762}
JLOG {"pid": 19869, "data": {"shape": [60], "data00": [7], "data01": [7]}, "thread": "Thread-1", "time": 1378916569.100782}
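
The bracketed numbers in this summary are the lengths of each value's JSON encoding, as computed by `len(jsonpickle.encode(v))` in `jlog`. The Python 3 sketch below reproduces them with the standard `json` module, on the assumption that for these simple values its default output has the same length as the jsonpickle output. The `shape` dictionary is written out by hand in its encoded form rather than produced by jsonpickle, and `summarize` is a hypothetical helper name.

```python
import json

def summarize(data):
    """Map each key to a one-element list holding its encoded length."""
    return {key: [len(json.dumps(value))] for key, value in data.items()}

logged = {
    # hand-written stand-in for the jsonpickle encoding of Rectangle(3, 4)
    "shape": {"py/object": "__main__.Rectangle", "width": 4, "height": 3},
    "data00": "hello",
    "data01": "shape",
}
print(summarize(logged))  # {'shape': [60], 'data00': [7], 'data01': [7]}
```

So `[7]` stands for the seven characters of `"hello"` including its quotes, which makes the summary a cheap way to spot unexpectedly large payloads.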

The actual log is written to /tmp/jlog/<pid>.json.gz, so in this particular example, to /tmp/jlog/19869.json.gz, as a gzipped JSON stream. The JSON stream looks like:

{"pid": 19869, "data": {"shape": {"py/object": "__main__.Rectangle", "width": 4, "height": 3}, "data00": "hello", "data01": "shape"}, "thread": "Thread-1", "time": 1378916569.100782}{"pid": 19869, "data": {"shape": {"py/object": "__main__.Rectangle", "width": 6, "height": 5}, "data00": "hello", "data01": "shape"}, "thread": "MainThread", "time": 1378916569.103762}

We can get it by zcat(1)-ing the file. But that's not the best way to view it.
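
If zcat is not at hand, the same text can be recovered from Python. A minimal Python 3 sketch, standard library only: `write_stream` and `read_stream_text` are hypothetical helper names, and the records are written with `json` instead of jsonpickle to keep the example self-contained.

```python
import gzip
import json
import os
import tempfile

def write_stream(path, records):
    """Write records back to back, with no separator, into a gzip file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record))

def read_stream_text(path):
    """Return the decompressed text, much like `zcat path` would print it."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return fh.read()

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.json.gz")
    write_stream(demo_path, [{"thread": "Thread-1"}, {"thread": "MainThread"}])
    demo_text = read_stream_text(demo_path)

print(demo_text)
```

The result is one long line of concatenated objects, which is exactly why a nicer viewer is desirable.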

Read the JSON dump like a pro

So basically we want to feed the JSON dump to underscore-cli to get a royal view; alas, it can't handle JSON streams. A little snippet of Ruby with Yajl to the rescue!

#!/usr/bin/env ruby

require 'yajl'

sel = $*.map { |i|
  case i
  when /\A(-?\d+)\.\.(\.?)(-?\d+)\Z/
    Range.new *([$1, $3].map { |j| Integer j } << ($2 == "."))
  else
    Integer i
  end
} 

w = []
Yajl::Parser.new.parse(STDIN) { |o| w << o }
w = w.values_at *sel unless sel.empty?

Yajl::Encoder.encode w, STDOUT

Save it as jwrap.rb, place it in your $PATH and make it executable (or not, but the following examples assume that). jwrap.rb wraps the elements of the JSON stream into a single JSON array and writes that to stdout. Besides:

  • When passed integer arguments, it selects only the objects at the given indices; negative indices are accepted. Thus jwrap.rb 0 1 selects the first two objects, while jwrap.rb -1 selects the last object.
  • You can also pass command line arguments in Ruby range syntax, where i..j represents an inclusive range (e.g. 3..6 consists of 3, 4, 5, 6), while i...j represents an exclusive range (e.g. 3...6 consists of 3, 4, 5); negative range bounds are accepted. Thus jwrap.rb -10..-1 will select the last ten objects.

Beware that this code is optimized for simplicity not efficiency -- if you happen to jlog gigabytes of data, you'll have to come up with a smarter version. It will do for us for everyday introspection.
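
For readers more at home in Python, the selection rules can be sketched as follows (Python 3, standard library only). `parse_selector` and `select` are hypothetical names, and this is an approximation of what jwrap.rb does, not a port: positions that fall outside the list are simply skipped here, whereas Ruby's `values_at` treats them differently (for instance, it pads with nil past the end).

```python
import re

# Same shape as the Ruby pattern: "i..j" is inclusive, "i...j" is exclusive.
RANGE_RE = re.compile(r"(-?\d+)\.\.(\.?)(-?\d+)")

def parse_selector(arg):
    """Turn "3", "-1", "3..6" or "3...6" into an int or (first, last, exclusive)."""
    match = RANGE_RE.fullmatch(arg)
    if match:
        return int(match.group(1)), int(match.group(3)), match.group(2) == "."
    return int(arg)

def select(items, args):
    """Pick elements by index or range; with no arguments keep everything."""
    if not args:
        return list(items)
    n = len(items)
    picked = []
    for arg in args:
        sel = parse_selector(arg)
        if isinstance(sel, int):
            indices = [sel + n if sel < 0 else sel]
        else:
            first, last, exclusive = sel
            # Negative bounds count from the end of the list.
            first = first + n if first < 0 else first
            last = last + n if last < 0 else last
            indices = range(first, last if exclusive else last + 1)
        # Assumption of this sketch: out-of-range positions are dropped.
        picked.extend(items[i] for i in indices if 0 <= i < n)
    return picked

entries = list(range(20))
print(select(entries, ["-10..-1"]))  # the last ten entries
print(select(entries, ["0", "1"]))   # the first two entries
```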

Back to our example above, we can now do the following:

$ zcat /tmp/jlog/19869.json.gz | jwrap.rb | underscore print

which gives:

[
  {
    "pid": 19869,
    "data": {
      "shape": { "py/object": "__main__.Rectangle", "width": 4, "height": 3 },
      "data00": "hello",
      "data01": "shape"
    },
    "thread": "Thread-1",
    "time": 1378916569.100782
  },
  {
    "pid": 19869,
    "data": {
      "shape": { "py/object": "__main__.Rectangle", "width": 6, "height": 5 },
      "data00": "hello",
      "data01": "shape"
    },
    "thread": "MainThread",
    "time": 1378916569.103762
  }
]

or if we are interested only in the last dump entry:

$ zcat /tmp/jlog/19869.json.gz | jwrap.rb -1 | underscore print

which gives:

[
  {
    "pid": 19869,
    "data": {
      "shape": { "py/object": "__main__.Rectangle", "width": 6, "height": 5 },
      "data00": "hello",
      "data01": "shape"
    },
    "thread": "MainThread",
    "time": 1378916569.103762
  }
]