
@rentalcustard
Created July 31, 2012 14:27
Stateful CSV parsing sketch - pseudocode-ish.
class CSVParser
  def initialize(line_handler = nil)
    @line_handler = line_handler
    @buffer = ""
    @lines = []
  end

  # Accept an arbitrary chunk of data and handle it depending on state.
  # Naive sketch: every newline ends a record; a real parser would also
  # track quoting state so newlines inside quoted fields don't split rows.
  def receive(chunk)
    @buffer << chunk
    while (line = @buffer.slice!(/\A[^\n]*\n/))
      current_line = line.chomp.split(",") # naive field split
      @line_handler.handle(current_line) if @line_handler
      @lines << current_line
    end
  end

  def data
    @lines
  end
end
#parsing CSV line-by-line
class LineHandler
  def handle(line)
    puts line.inspect # or whatever
  end
end
class LinewiseCSVParserClient
  def parse_csv
    parser = CSVParser.new(LineHandler.new)
    input = SomeInput.new # any source that yields chunks of data
    input.each_chunk do |data|
      parser.receive(data)
    end
  end
end
#parsing CSV as a whole
class WholeFileCSVParserClient
  def parse_csv
    parser = CSVParser.new
    parser.receive(File.read("some_path"))
    parser.data
  end
end
#if above is too much code:
class CSVParser
  def self.parse(io)
    parser = new
    parser.receive(io.read)
    parser.data
  end
end
@JEG2 commented Jul 31, 2012

OK, I think I understand where you're going here. I agree that this approach can work.

So now for the important questions. What do we gain from this approach? Why do it? What limitations in the current model does this overcome?

@rentalcustard (Author)

The big win of this approach, for me, is that you remove the coupling to an IO-like object from CSV. But this is really ideal-world stuff, since a big part of being in stdlib is needing to retain some kind of backward compatibility, which entails being able to pass paths to most methods, and hence CSV being responsible for setting up IO objects and then speaking to them.

The big problem that sparked my thinking about CSV in Ruby was the 1.8 lib, which deals exclusively with paths except for the parse and parse_line methods. This seemed backwards, given that my approach to the problem would begin with something like the above. I think we can all agree the 1.8 API needed work.

My problem with the 1.9 API is purely that IO is so pervasive. It's nice to have a method or two that deal with IO/Reader objects, for utility, but it obscures the core responsibility of the module. It means that any source of CSV data we want to use either needs to implement gets, or else must find the one method (parse) that allows passing raw data - and in the latter case, only parsing the whole file at once is supported.

@JEG2 commented Jul 31, 2012

I hear ya, but what's wrong with needing an object that implements gets()? That's just programming to an interface, right? Can you help me understand why it's evil?

@rentalcustard (Author)

Right - bear in mind that almost all of my complaints about CSV are with the 1.8 implementation, which takes a primitive and constructs its dependency, so the coupling is much, much tighter.

There's still coupling in needing an object that implements gets(), and I still think it's slightly evil. Take the example of a streaming HTTP request: for that to implement gets, it needs to buffer until it sees a newline, which might not arrive in every chunk of data it receives. So to talk to a CSV parser that requires the gets interface, we need our streaming HTTP request to be wrapped in something that does the buffering. If instead we can just push arbitrary chunks of data to the CSV parser and let it handle them according to its own constraints, we're not pushing any additional code requirements onto our callers.
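
For concreteness, a rough sketch (names hypothetical) of the adapter such a client would be forced to write just to satisfy gets:

# Hypothetical adapter: buffers chunks from a callback-driven source so a
# gets-based parser can consume it line by line.
class GetsAdapter
  def initialize
    @buffer = ""
    @ready  = [] # complete lines waiting to be handed out
  end

  # Called from the HTTP client's on-data callback with an arbitrary chunk.
  def receive(chunk)
    @buffer << chunk
    while (line = @buffer.slice!(/\A[^\n]*\n/))
      @ready << line
    end
  end

  # Satisfies the parser's interface - but note the impedance mismatch:
  # IO#gets blocks until a line is available, which a callback-driven
  # client can't do, so this returns nil when no complete line is ready.
  def gets
    @ready.shift
  end
end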

Programming to an interface is good, but it's a way of reducing coupling. Not having coupling at all is even better!

@arp commented Jul 31, 2012

Having needed to parse really huge CSV files from S3 that don't fit in memory, streaming parsed data while reading the next chunk, I think the proposed "newline-agnostic" approach is a really useful feature. I will need to implement such logic for my case anyway.

However, I'm not sure this should be part of the CSV library itself. A "uniform text" to "string iterator" FIFO is something that turns out to be useful any time someone needs to stream parsed newline-separated data.

@JEG2 commented Jul 31, 2012

CSV is a line-oriented data format, so the fact is that I need lines to work with. If I'm going to be handed anything other than lines, someone needs to do some buffering. I think adding that buffering to CSV is a worse sin, and if we introduce a third object to do the buffering, I'll just be programming to that interface instead. In fact, there is already an interface for buffering to a newline in Ruby: it's called gets(). That's my opinion, anyway.

@rentalcustard (Author)

I disagree: the line-oriented nature of CSV is something particular to CSV (and other line-oriented data formats). Hence, parsers for these formats need to deal with lines, whereas parsers for other formats do not. Pushing the line-oriented nature of the format onto callers is a sin because they shouldn't need to care about the constraints of your format. Consider the following pseudocode:

def parse_some_data(stream)
  data_type = get_data_type_from_stream(stream)
  parser = parser_for_data_type(data_type)
  stream.each_chunk do |data|
    parser.receive(data)
  end
end

In this case, when the parser requires lines, I'll need to wrap it in an adapter which takes data chunks, buffers them, and passes on lines. But whether that's needed depends on the parser, so to determine it, I need knowledge of the parser's implementation, and that knowledge leaks into my code. Why not make CSV responsible for the fact that it's line-oriented, and extract a collaborator which buffers arbitrary data chunks into lines, which other parsers for line-oriented formats can then use?
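
Such a collaborator could be as small as this sketch (names hypothetical):

# Hypothetical collaborator: buffers arbitrary chunks and invokes a
# callback once per complete line, for any line-oriented parser to reuse.
class LineBuffer
  def initialize(&on_line)
    @on_line = on_line
    @buffer  = ""
  end

  def receive(chunk)
    @buffer << chunk
    while (line = @buffer.slice!(/\A[^\n]*\n/))
      @on_line.call(line.chomp)
    end
  end
end

CSVParser#receive could then delegate its buffering to a LineBuffer and concern itself only with parsing fields.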

@JEG2 commented Jul 31, 2012

Streams in Ruby are typically Sockets or IO objects. Both have gets(). If I copy that, how am I not reinventing the wheel? Or do you have another stream example that I'm missing?

@rentalcustard (Author)

Is it not conceivable that there would be another source of data which doesn't implement gets? Or that, in some situation, I'd like to bypass gets and receive data as soon as it's ready rather than waiting for each newline?

@arp commented Jul 31, 2012

In comparison, C's libcsv acts in a newline-agnostic manner: you just feed chunks of data to it, and it runs callbacks for every field/record parsed. This makes the library very streaming-friendly and also eliminates the need to deal with different newline formats outside of libcsv. With buffering outside the library, one would have to deal with Mac/Win/*nix newlines before spoon-feeding the lines.

@JEG2 commented Jul 31, 2012

Sure. But then one of us needs to recreate the code in gets(). And you are saying that should be CSV.

New angle: How do I know, when I am handed a chunk, if it doesn't have a newline because more is coming or because it's the end (without a final newline)?

@rentalcustard (Author)

> Sure. But then one of us needs to recreate the code in gets(). And you are saying that should be CSV.

Or in a "collaborator which buffers arbitrary data chunks into lines, which other parsers for line-oriented formats can then use?"

> New angle: How do I know, when I am handed a chunk, if it doesn't have a newline because more is coming or because it's the end (without a final newline)?

You don't, and you don't need to. You just fire callbacks when you get a new record. The "end" is when you stop getting data, and under the evented model you don't do anything at the end, because you're not returning anything. If you need to handle the case where the end of the file is reached without a closing newline, you expose a "finish" method or similar, which the caller can call when the input stream has been fully read.
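
A minimal sketch of that finish method, assuming the buffering receive in the gist above:

# Flush the final record when the input ends without a trailing newline.
def finish
  return if @buffer.empty?
  current_line = @buffer.chomp.split(",") # same naive field split as receive
  @line_handler.handle(current_line) if @line_handler
  @lines << current_line
  @buffer = ""
end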

@arp commented Jul 31, 2012

Yes, there's some functionality duplication either way: either buffering or newline handling. Sounds like a CSV quantum uncertainty principle :)

In libcsv, detecting the final chunk is left to the developer: you call csv_fini() as the last operation, which tells libcsv that the most recently fed chunk was actually the last one. In Ruby, this can be achieved with blocks, finalizing stream consumption just after the yield.
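
A rough sketch of that block form (method name hypothetical), which finalizes for the caller once the block returns:

class CSVParser
  # Yield the parser for feeding, then flush automatically - the Ruby
  # analogue of libcsv's feed-then-csv_fini pairing.
  def self.stream(line_handler = nil)
    parser = new(line_handler)
    yield parser   # caller feeds chunks via parser.receive
    parser.finish  # assumes the finish method sketched above
    parser.data
  end
end

# Usage:
# CSVParser.stream(LineHandler.new) do |parser|
#   input.each_chunk { |data| parser.receive(data) }
# end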

@JEG2 commented Jul 31, 2012

Yeah, something like finish() would be required to detect the case I mentioned.

So yeah, we all agree that the evented model works. I'm not convinced it's superior to the current approach, where it's easy for me to do things like support Ruby's normal suite of iterators. I agree that it works, though.
