Created August 27, 2012 15:32
19:01 <mst> http://paste.scsys.co.uk/206434
19:02 <mst> thoughts?
16:32 <ingy> my first thought is that SineSwiper is wrong in his assessment that parsing is better with lexing
16:32 <ingy> with a separate lexing step
16:33 <ingy> this is a major problem with coffeescript
16:33 <ingy> which I am currently trying to fix
16:35 <ingy> I think SS thinks that RecDescent means that you parse too deeply and waste time, but that is simply a matter of writing good grammars vs poor ones.
16:35 <ingy> My grammars are careful never to need much lookahead
16:36 <ingy> because I was thinking about that when I wrote them
16:36 <ingy> anyway, I think it's weird that he didn't talk to me about it
16:37 <ingy> the thing he should do is write some simple failing tests
16:37 <ingy> it is entirely possible that Pegex has flaws. almost certain in fact.
16:38 <ingy> but unless he can come up with test cases that I can't work around with good grammar writing, I am certainly not going to entertain lexing
... in #cdent
16:39 <@ingy> mst just made me aware of people griping that Pegex is not a Lex/Parse setup
16:39 <@ingy> and hi mst :)
16:40 <@ingy> Pegex does both lexing and parsing at the same time
16:41 <@ingy> so coffeescript does a full lex, then a lexical analysis, then finally a grammar parse
16:41 <@ingy> with lots of code in every stage
16:42 <@ingy> the pegex way is to do this all at once, and unless I'm sorely mistaken, Pegex parsing of CoffeeScript will be faster, and will also work around all the heinous corners that coffee has painted itself into
16:43 <@ingy> func() if bool # is required to be on one line in coffee
16:44 <@ingy> this sucks when that expression grows past 80 columns
16:45 <@ingy> but the lexer would assign a newline to be a terminator token, and the parser isn't expecting that
16:45 <@ingy> which is not to say that the parser can't be smarter
16:46 <@ingy> but in my experience requiring the lexing to be completely separate from the parsing throws away too much context, and leaves you having to invent ways to deal with the resulting problems
16:50 * sevvie smiles.
16:51 < sevvie> It just sounds like you lex well, to me.
16:53 < sevvie> It does bring up the question: what benefits does a balanced lexer and parser provide? (For all intents and purposes, I know nothing.)
16:54 <@ingy> mst: sevvie just made me realize that pegex could be set up to be "just a lexer"
16:54 <@ingy> I should have examples of this
16:55 <@ingy> doing it both ways
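The scannerless argument in the log above — that matching token regexes inline while parsing lets the grammar's context decide what a newline means — can be sketched in a few lines. This is a toy illustration in Python, not Pegex's actual API; the class and rule names are made up:

```python
import re

class ScannerlessParser:
    """Toy scannerless parser. Token regexes are applied at the current
    position *during* parsing, so grammar context decides whether a
    newline is plain whitespace or a statement terminator."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def match(self, pattern):
        m = re.compile(pattern).match(self.text, self.pos)
        if m:
            self.pos = m.end()
            return m.group(0)
        return None

    def parse_statement(self):
        call = self.match(r'\w+\(\)')
        # While looking for a postfix-if, any whitespace (newlines
        # included) is insignificant -- so the `if` may sit on the
        # next line, unlike with a separate lexer that has already
        # stamped the newline as a TERMINATOR token:
        if self.match(r'[ \t\n]*if\b'):
            self.match(r'[ \t\n]*')
            cond = self.match(r'\w+')
            return ('if', cond, ('call', call))
        # No postfix-if follows, so here a newline *does* terminate:
        self.match(r'[ \t]*\n?')
        return ('call', call)

parser = ScannerlessParser("func()\n  if ready")
print(parser.parse_statement())  # ('if', 'ready', ('call', 'func()'))
```

A standalone lexer would have to tokenize the first newline before knowing whether an `if` follows; here the decision is simply deferred until the grammar asks for it.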
Yeah, sorry about not chatting about it earlier; thought you weren't online.
As a matter of context, I'm trying to take Pg's original SQL lexer/parser (flex/bison code) and convert it into Perl with its own interchange format. Because of the sheer size of the parsing code (545 grammar rules), this Pg parser project forces a balancing act between straight C-to-Perl conversion and writing entirely new code. I'm fine with wholesale conversions of the various types, and I had to convert huge batches of C code anyway. But if I start writing new code, I'm throwing away the work the Pg crew has already figured out for me, so I'm trying to keep that to a minimum.
So far, I've successfully converted the thing into a working Eyapp module. I used that because I didn't know any better, and because it was the closest Perl analogue to Bison/yacc anyway. There are still some bugs here and there, but it's good enough to start re-converting into a more modern parser like Pegex. (And hopefully Pegex doesn't compile things into 8MB .pm files, like Eyapp does.)
Right now I'm in the middle of converting the "Lexer" piece to Pegex, under the assumption that it will all end up as one piece. (I think you have "#include" support somewhere; otherwise, it'll just be one large *.pgx file.) My goal is to 'warp' the lexer so that the tokens the parser expects become grammar rules. That should at least reduce the time I'd need to spend messing with the parser itself, though I still have to figure out how the AST/Receiver modules will play out.
I'm finding that I can replace huge sections of "old skool" lexer code with more modern Perl REs. (For example, the old scan.l file relies on start/end state changes for lexing that can be replaced with a single start/content/end RE.) However, I don't think I'll get away with many of those kinds of optimizations on the parser, beyond maybe taking some of the large OR blocks and splitting them into subgroups.
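The start/content/end collapse described above can be shown with a small sketch. Python is used here purely so the regex behavior is easy to run; the pattern carries over to Perl almost verbatim. Note this covers only the non-nested case — Pg actually allows nested `/* */` comments, which a flat regex cannot match:

```python
import re

# A flex scanner typically handles /* ... */ by switching into an
# exclusive start condition on "/*", accumulating content rule by rule,
# and popping the state on "*/".  A single start/content/end regex
# collapses all of that into one match.  Non-greedy ".*?" ensures the
# first "*/" ends the comment; DOTALL lets "." cross newlines.
C_COMMENT = re.compile(r'/\*(.*?)\*/', re.DOTALL)

sql = "SELECT 1 /* a\n   multi-line\n   comment */ + 2"
body = C_COMMENT.search(sql).group(1)      # the comment's content
stripped = C_COMMENT.sub(' ', sql)         # SQL with comments blanked
```

One match replaces an entire state machine: the "start" is the literal `/\*`, the "content" is the captured lazy run, and the "end" is `\*/`.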
Anyway, if you could fill in the gaps on the Syntax POD (see pull request), that would be very helpful. I still can't figure out what those rule modifiers do.