Last week we refactored some Python code that parses CSV files and loads records into a database. We ended up with two general groups of code.
One group (lines 24-43) is a series of small functions that extract various data types (booleans, timestamps, etc.) from CSV fields. The other group is a large function (lines 46-92) that extracts individual fields and builds a model object. I expressed frustration that this latter function was so long.
    from datetime import datetime

    # get_boolean, get_text, and get_date are the small extractor
    # functions described above; Nessus is the model class.

    def build_nessus_object(row, heading_map):
        """process a row of data and return a Nessus object"""
        # parameters which are processed as-is from the csv
        normal_params = {
            'plugin': 'Plugin',
            'pluginName': 'Plugin Name',
            'family': 'Family',
            'severity': 'Severity',
            'IPAddress': 'IP Address',
            'protocol': 'Protocol',
            'port': 'Port',
            'repository': 'Repository',
            'MACaddress': 'MAC Address',
            'DNSname': 'DNS Name',
            'NetBIOSname': 'NetBIOS Name',
            'solution': 'Solution',
            'seeAlso': 'See Also',
            'cve': 'CVE',
            'exploitEase': 'Exploit Ease',
            'exploitFrameworks': 'Exploit Frameworks'
        }
        params = {}
        for param_name, field_name in normal_params.iteritems():
            params[param_name] = row[heading_map.get(field_name)]
        # parameters which require input text to be transformed in some way
        special_params = {
            'exploit': get_boolean(row[heading_map.get('Exploit?')]),
            'pluginText': get_text(row[heading_map.get('Plugin Text')], 65535),
            # why 10k? db is varchar(1000) ?
            'synopsis': get_text(row[heading_map.get('Synopsis')], 10000),
            # XXX: should this be 65K? underlying field in db is mysql text.
            'description': get_text(row[heading_map.get('Description')], 10000),
            'firstDiscovered': get_date(row[heading_map.get('First Discovered')]),
            'lastObserved': get_date(row[heading_map.get('Last Observed')]),
            'vulnPublicationDate': get_date(row[heading_map.get('Vuln Publication Date')]),
            'importedDate': datetime.now()
        }
        params.update(special_params)
        return Nessus(**params)
I also wasn't happy with the different treatment of fields: the "normal" fields that are parsed based on a magic hash, and the "special" fields that are parsed manually. I thought it would make maintenance harder. Finally, I objected to the presence of so many magic strings in the code (the Python csv library essentially loads each line into a hash table keyed by the field name).
Over the weekend I decided to write a Scala version of the code using the PureCSV library. PureCSV uses a technique called "automatic typeclass derivation" (aka "scrap your boilerplate") to reduce the parser function to a single line. The code is here.
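To give a flavor of what "automatic typeclass derivation" means, here's the idea in miniature. This is a toy sketch of my own (Parse, parseAs, and the pair instance are invented names), not PureCSV's actual machinery: you supply a typeclass instance for each field type, and the compiler assembles instances for composite types from them.

    trait Parse[A] { def parse(s: String): A }

    object Parse {
      implicit val parseString: Parse[String] =
        new Parse[String] { def parse(s: String) = s }
      implicit val parseInt: Parse[Int] =
        new Parse[Int] { def parse(s: String) = s.trim.toInt }

      // Derived instance: a Parse for any pair, built from the instances
      // for its two components. Shapeless generalizes this step to
      // arbitrary case classes via their generic (HList) representation.
      implicit def parsePair[A, B](implicit pa: Parse[A], pb: Parse[B]): Parse[(A, B)] =
        new Parse[(A, B)] {
          def parse(s: String) = {
            val Array(a, b) = s.split(",", 2)
            (pa.parse(a), pb.parse(b))
          }
        }
    }

    def parseAs[A](s: String)(implicit p: Parse[A]): A = p.parse(s)

    parseAs[(String, Int)]("host-1,443")  // ("host-1", 443)

Nobody wrote a parser for (String, Int); the compiler derived one from the pieces in scope. PureCSV does the same thing at the scale of a whole record.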
The Scala port has three main parts:
1. The Nessus model object. This is similar to the Python version, but with stronger type safety -- we ensure the ipAddress is a valid IP address, the TCP/UDP port number is valid, the seeAlso field only contains valid URLs, etc. (There's a sketch of this just after the list.)
2. String converters. These are similar to the Python get_text, get_date, and get_boolean functions. They're a little longer than their Python equivalents because PureCSV requires that field converters support reading and writing. The Scala versions also handle more error cases than the Python versions. (An example converter appears further below.)
3. The CSV parser itself. This is a single line of code:

       val reader = CSVReader[Nessus]
The compiler, using the Shapeless library, is able to generate all of the calls needed to build the CSVReader (the equivalent of the 50-line build_nessus_object() Python function above). Note that for this technique to work, the order of the fields in the model class has to match the order in the CSV file. I thought that was a reasonable restriction. On the plus side, it also keeps magic strings out of the code.
Essentially, because all of the StringConverters (in part 2) are marked implicit, they're available to be wired in whenever the compiler needs them. And because we have a StringConverter for every type in the Nessus model, we're covered. The compiler (plus Shapeless) can generate all the boilerplate for us.
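A converter for dates might look like the sketch below. It assumes PureCSV's safe StringConverter interface (tryFrom/to, as shown in the library's README) and an invented date format; the real converters in the repo will differ in the details.

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter
    import purecsv.safe.converter.StringConverter
    import scala.util.Try

    // Because this value is implicit, the compiler wires it in
    // wherever the model has a LocalDate field.
    implicit val dateConverter: StringConverter[LocalDate] =
      new StringConverter[LocalDate] {
        private val fmt = DateTimeFormatter.ofPattern("MMM d, yyyy")
        // reading: CSV cell -> LocalDate, with failures captured in a Try
        override def tryFrom(str: String): Try[LocalDate] =
          Try(LocalDate.parse(str, fmt))
        // writing: LocalDate -> CSV cell; PureCSV requires both directions
        override def to(date: LocalDate): String = fmt.format(date)
      }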
It's really a joy to program this way. You get the power of a strong static type system, much earlier detection of errors, and minimal boilerplate. The nature of PureCSV forces the StringConverters to be modular, pure functions, which also makes them trivial to test.
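As a final illustration, here's roughly what using and testing the derived reader looks like. This assumes the readCSVFromFileName entry point from PureCSV's README, and "scan.csv" is a placeholder filename.

    import purecsv.safe._

    // With the safe API each row comes back as a Try, so malformed
    // rows are values to inspect rather than exceptions to catch.
    val rows = CSVReader[Nessus].readCSVFromFileName("scan.csv")
    val (parsed, failed) = rows.partition(_.isSuccess)

    // And because each converter is a pure function, it can be
    // exercised in complete isolation:
    assert(dateConverter.tryFrom("Jan 5, 2016").isSuccess)
    assert(dateConverter.tryFrom("not a date").isFailure)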