So I paged through the docs at lunch. (The gauntlet has been thrown, so I WILL take a gander at the source, this weekend perhaps. But it's been a long time, and I knew remarkably less then about everything, so I may not recall enough to get much out of it. We'll see.)
The kind of workflow implied by the docs isn't what I'm after at all... however, I don't think that really matters. As you said, with the right modularity, who cares?
There's a lot here, so even if you just give it a once-over and make sure you don't see anything that looks like a deal-breaker, we don't need to try to tackle any of this yet, I don't think.
I figured I'd start with what's probably (to my layman's eyes) the biggest obstacle.
- Realtime vs. file-based:
- BLUF: Realtime isn't on my radar at all, but I don't really know how much that matters.
- My real goal is to digest historical data and produce stats/images for decision makers (including myself), more than anything else.
- I'm mocking stuff up now in IPython, kludging together data to try to get a handle on the best way of doing it, and I have yet to even ask "The Performance Question". Maybe once I figure out how to get the data into the format I actually need, it'll be simple to optimize and work perfectly fine (even better without the overhead of an IPython notebook, I imagine). But it's also possible that the whole approach I'm taking is going to be incapable of breaking 2 FPS or something.
- Never mind performance; being able to intelligently digest new data while throwing old data away may be non-trivial. Or it may be super simple, I have no idea. (There's a rough sketch of what I'm picturing right below.)
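For what it's worth, here's the kind of thing I have in mind for the "digest new, throw away old" part. Purely a sketch; every name in it (SlidingWindow, window_seconds, the record tuples) is invented for illustration:

```python
# Keep a fixed time window of parsed records in a deque and evict
# anything older than the window as new records arrive. Stats get
# computed over whatever is currently in self.records.
from collections import deque

class SlidingWindow:
    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.records = deque()  # (timestamp, record) pairs, oldest first

    def add(self, timestamp, record):
        self.records.append((timestamp, record))
        cutoff = timestamp - self.window
        # Old records fall off the left end; amortized O(1) per add.
        while self.records and self.records[0][0] < cutoff:
            self.records.popleft()

    def __len__(self):
        return len(self.records)
```

If that's all it takes, great. It only gets hairy if the stats themselves can't be recomputed cheaply over the window, and I genuinely don't know yet whether that's the case.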
Another possibly difficult issue:
- Wireshark Dissectors
- Very much want to leverage them
- Part of my 'kitchen sink' philosophy above is the idea that I have no idea what I'll want to look or filter on next week. Maybe I WANT to look at spanning-tree BPDUs (which wouldn't show up if we're only showing IP traffic, for instance) and LACP packets? Who knows? For that reason, so far, I've outsourced all of the field extraction to TShark.
- Any field or property that Wireshark can understand or interpret is readily available, and that's some pretty crucial functionality, IMO. I'm not sure if there's some other way we can leverage their dissectors (some library we can hook into somehow?), but I know I definitely want those available.
- I'm not sure how this impacts the discussions above about the different data types, etc. In theory TShark (or whatever) could be "just another IO option" (rough sketch right after this list).
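To make "just another IO option" concrete, this is roughly all I'm doing now. The wrapper function itself is made up, but the tshark flags (-r, -T fields, -e, -E, -Y) and the example field names are the real ones:

```python
# Shell out to tshark, ask it for whatever dissected fields we care
# about, and yield one row per packet. Any Wireshark display-filter
# field name works for -e, and any display filter works for -Y.
import subprocess

def tshark_fields(pcap_path, fields, display_filter=None):
    cmd = ["tshark", "-r", pcap_path, "-T", "fields", "-E", "separator=\t"]
    for field in fields:
        cmd += ["-e", field]
    if display_filter:
        cmd += ["-Y", display_filter]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        yield line.rstrip("\n").split("\t")

# e.g. every spanning-tree BPDU, which IP-only tooling would never see:
# for row in tshark_fields("capture.pcap",
#                          ["frame.time_epoch", "eth.src"],
#                          display_filter="stp"):
#     print(row)
```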
Easier stuff:
- Different data types:
- Shouldn't be a problem
- I'm only ever (at this job) going to care about digesting PCAPs
- But there's no reason I can think of to expect the MGEN data to be THAT dissimilar from the PCAP data. Probably different timestamps, etc., but with documentation and/or example code that'll be workable.
- Ultimately, as long as the actual data wrangling is sufficiently abstract, you should be able to give it whatever you want (a sketch of what I mean follows this list).
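Here's a sketch of the abstraction I mean: normalize every source into one record type, so everything downstream only ever sees records. The Record fields are my guess at a reasonable common denominator, the MGEN reader is a stub because I haven't looked at real MGEN output, and the tshark field names are, I believe, the standard ones (it builds on the tshark_fields() wrapper sketched earlier):

```python
# One normalized record type; per-source readers are just generators
# that yield it. Downstream code never knows or cares where records
# came from.
from dataclasses import dataclass

@dataclass
class Record:
    timestamp: float  # epoch seconds
    src: str
    dst: str
    length: int       # bytes on the wire
    proto: str

def records_from_pcap(pcap_path):
    # Built on the tshark_fields() wrapper sketched above.
    fields = ["frame.time_epoch", "ip.src", "ip.dst",
              "frame.len", "_ws.col.Protocol"]
    for ts, src, dst, length, proto in tshark_fields(pcap_path, fields):
        yield Record(float(ts), src, dst, int(length), proto)

def records_from_mgen(log_path):
    # Same generator contract; parsing details TBD once we actually
    # see real MGEN logs.
    raise NotImplementedError
```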
- Pipeline vs. Interactive:
- Shouldn't be a problem, but probably means more work. Mostly just makes modularity important, I think.
- In line with the stuff above, I'm more interested in building a library of functions/classes that can quickly and sensibly get me to a human-readable interpretation of some arbitrary data: bps over time, packet loss for a given flow, packet loss of a given flow as a percentage of the whole, etc. (There's a sketch of the bps-over-time case at the end of this list.)
- I'm terrible at predicting what I'll need/want ahead of time. Or I'm just chronically dissatisfied with what's available. Either way, I much prefer the "kitchen sink" or "toolbox" approach, so that I can build what I need on-the-fly, rather than have to plan it out in advance.
- No reason NOT to make frequent tasks as accessible as they possibly can be, however.
- Likewise, there's no reason that something that does everything I want to do in an IPython notebook can't also tie into STDIN/STDOUT and be a perfectly well-behaved and clearly-defined console application. That's all IO, and I have every expectation that will be easy to keep modular.
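And here's the bps-over-time sketch I promised, doubling as an example of the notebook/console duality: the stats function only sees an iterable of (timestamp, bytes) pairs, so the exact same code runs inside a notebook or as a STDIN/STDOUT filter. All naming is hypothetical, not working tooling:

```python
# Bits-per-second over time, bucketed. In a notebook you'd call
# bps_over_time() on records directly and plot the result; on the
# command line it's a plain filter.
import sys
from collections import defaultdict

def bps_over_time(samples, bucket_seconds=1.0):
    """samples: iterable of (epoch_timestamp, byte_count) pairs."""
    buckets = defaultdict(int)
    for ts, nbytes in samples:
        bucket = int(ts // bucket_seconds) * bucket_seconds
        buckets[bucket] += nbytes * 8
    return {b: bits / bucket_seconds for b, bits in sorted(buckets.items())}

if __name__ == "__main__":
    # Console mode: "timestamp<TAB>bytes" lines on STDIN,
    # "bucket<TAB>bps" lines on STDOUT.
    pairs = ((float(ts), float(nbytes))
             for ts, nbytes in (line.split("\t")[:2] for line in sys.stdin))
    for bucket, bps in bps_over_time(pairs).items():
        print(f"{bucket:.0f}\t{bps:.0f}")
```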