class CSVParser
  def initialize(line_handler = nil)
    @line_handler = line_handler
    @buffer = ""
    @lines = []
  end

  def receive(chunk)
    # Buffer the chunk and emit a line each time a newline shows up.
    # (Naive sketch: splits fields on bare commas, ignores quoting.)
    @buffer << chunk
    while (newline = @buffer.index("\n"))
      current_line = @buffer.slice!(0..newline).chomp.split(",")
      @line_handler.handle(current_line) if @line_handler
      @lines << current_line
    end
  end

  def data
    @lines
  end
end

# parsing CSV line-by-line
class LineHandler
  def handle(line)
    puts line.inspect # or whatever
  end
end

class LinewiseCSVParserClient
  def parse_csv
    parser = CSVParser.new(LineHandler.new)
    input = SomeInput.new
    input.each_chunk do |data|
      parser.receive(data)
    end
  end
end

# parsing CSV as a whole
class WholeFileCSVParserClient
  def parse_csv
    parser = CSVParser.new
    parser.receive(File.read("some_path"))
    parser.data
  end
end

# if the above is too much code:
class CSVParser
  def self.parse(io)
    parser = new
    parser.receive(io.read)
    parser.data
  end
end
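As a quick usage sketch of the convenience API (with `StringIO` standing in for a file; the inline parser here is a naive comma-splitting stand-in so the example runs on its own and doesn't clobber the class above):

```ruby
require "stringio"

# Naive stand-in parser so this sketch is self-contained: buffers chunks,
# splits complete lines on commas (no quoted-field handling).
class TinyCSVParser
  def initialize
    @buffer = +""
    @lines = []
  end

  def receive(chunk)
    @buffer << chunk
    while (i = @buffer.index("\n"))
      @lines << @buffer.slice!(0..i).chomp.split(",")
    end
  end

  def data
    @lines
  end

  def self.parse(io)
    parser = new
    parser.receive(io.read)
    parser.data
  end
end

rows = TinyCSVParser.parse(StringIO.new("a,b,c\n1,2,3\n"))
p rows  # => [["a", "b", "c"], ["1", "2", "3"]]
```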
Streams in Ruby are typically Sockets or IO objects. Both have gets(). If I copy that, how am I not reinventing the wheel? Or do you have another stream example that I'm missing?
Is it not conceivable that there would be another source of data which doesn't implement gets? Or that I'd like to bypass gets and receive data as soon as it's ready rather than on each newline in some situation?
In comparison, C's libcsv is newline-agnostic: you just feed it chunks of data, and it runs callbacks for every field/record parsed. This makes the library very streaming-friendly and also eliminates the need to deal with different newline formats outside of libcsv. With buffering outside the library, one would need to deal with Mac/Win/Nix newlines before spoon-feeding it lines.
Sure. But then one of us needs to recreate the code in gets(). And you are saying that should be CSV.
New angle: How do I know, when I am handed a chunk, if it doesn't have a newline because more is coming or because it's the end (without a final newline)?
> Sure. But then one of us needs to recreate the code in gets(). And you are saying that should be CSV.
Or in a "collaborator which buffers arbitrary data chunks into lines, which other parsers for line-oriented formats can then use?"
> New angle: How do I know, when I am handed a chunk, if it doesn't have a newline because more is coming or because it's the end (without a final newline)?
You don't, and you don't need to. You just call callbacks when you get a new record. The 'end' is when you stop getting data; under the evented model, you don't do anything at the end, because you're not returning. If you need to handle the case where the end of the file is reached and there isn't a closing newline, you expose a "finish" method or similar, which the caller can call when the input stream is fully read.
Yes, there's functionality duplication either way: either buffering or newline handling. Sounds like a CSV quantum uncertainty principle :)
In libcsv, detecting the final chunk is left to the developer: you call csv_fini() as the last operation, which makes libcsv assume that the most recently fed chunk was the last one. In Ruby, this can be achieved by using blocks and finalizing stream consumption just after yield.
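A sketch of that block pattern in Ruby (illustrative names; the parser here is a minimal line buffer, and finish plays the role of csv_fini):

```ruby
# Minimal line-buffering reader to illustrate csv_fini-style finalization:
# chunks are fed inside the block, and finish runs right after yield,
# treating whatever remains in the buffer as the final record.
class ChunkLineReader
  def initialize
    @buffer = +""
    @lines = []
  end

  def receive(chunk)
    @buffer << chunk
    while (i = @buffer.index("\n"))
      @lines << @buffer.slice!(0..i).chomp
    end
  end

  def finish
    @lines << @buffer unless @buffer.empty?  # trailing line, no newline
    @buffer = +""
    @lines
  end

  # The block feeds chunks; when it returns, the input is assumed exhausted.
  def self.read_lines
    reader = new
    yield reader
    reader.finish
  end
end

lines = ChunkLineReader.read_lines do |r|
  r.receive("a,b\n1,")
  r.receive("2")  # last chunk arrives without a trailing newline
end
p lines  # => ["a,b", "1,2"]
```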
Yeah, something like finish() would be required, to detect the case I mentioned.
So yeah, we all agree that the evented model works. I'm not convinced it's superior to the correct approach where it's easy for me to do things like support Ruby's normal suite of iterators. I agree that it works though.
I disagree: The line-oriented nature of CSV is something which is particular to CSV (and other line-oriented data formats). Hence, parsers for these formats need to deal with lines, whereas parsers for other formats do not. Pushing the line-oriented nature of the format onto callers is a sin because they shouldn't need to care about the constraints of your format. Consider the following pseudocode:
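(A reconstruction of the idea, with hypothetical names; the point is that a generic caller just forwards raw chunks and knows nothing about lines:)

```ruby
# Hypothetical generic client: it neither knows nor cares whether the
# parser's format is line-oriented; it just forwards raw chunks.
class Client
  def initialize(input, parser)
    @input = input
    @parser = parser
  end

  def run
    @input.each_chunk { |chunk| @parser.receive(chunk) }
    @parser.data
  end
end
```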
In this case, when the parser requires lines, I'll need to wrap it in an adapter which takes data hunks, buffers them, and passes on lines. But this is something which depends on the parser, so to determine this, I need knowledge of the parser's implementation, and that knowledge leaks into my code. Why not make CSV responsible for the fact that it's line-oriented, and extract a collaborator which buffers arbitrary data chunks into lines, which other parsers for line-oriented formats can then use?
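That collaborator might look something like this (a sketch; the names and the receive_line protocol are invented for illustration):

```ruby
# Sketch of the extracted collaborator: accepts arbitrary data chunks,
# buffers them, and hands complete lines to any line-oriented consumer
# that responds to receive_line.
class LineBuffer
  def initialize(consumer)
    @consumer = consumer
    @buffer = +""
  end

  def receive(chunk)
    @buffer << chunk
    while (i = @buffer.index("\n"))
      @consumer.receive_line(@buffer.slice!(0..i).chomp)
    end
  end

  # Flush a final line that arrived without a trailing newline.
  def finish
    @consumer.receive_line(@buffer) unless @buffer.empty?
    @buffer = +""
  end
end
```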