@wilmoore
Last active April 5, 2022 17:01
Income Sources :: Books :: REAL Full-Stack Software Engineer :: Manuscript 1 - Crafting Text Parsers in TypeScript

⪼ Made with 💜 by realpolyglot.dev

Introduction

Hello, Winners. You've made an excellent decision in picking up this manuscript as part of your journey toward becoming a confident, self-sufficient full-stack software engineer. You can rest assured that a LOT of hard work has gone into distilling the concepts presented to you. You should expect NO fluff and NO hand-waving here.

The best books I've read on software development left me with knowledge:

  • that inspired me to put it into practice immediately,
  • that I could use across languages and platforms,
  • that I could lean on time and time again throughout my career.

If you read no further than this, I would be pleased to know that I've helped you become more discerning when choosing the best software books to invest in, so you waste neither time nor money.

Goals

There are 3 important goals I had in mind when coming up with the concept for this series of manuscripts:

  1. Help you learn non-trivial software development concepts that go beyond the typical hello world or todo app.
  2. Present the information to you in an easy-to-read format that won't bore you to sleep.
  3. Provide real-world value without taking up a lot of space while keeping you engaged.

A Word of Encouragement

The world of software development can be rough. Imposter syndrome is real, but you can take control and eliminate it by crafting non-trivial software solutions on your own.

As Peter Thiel says, competition is for losers. Find your unique purpose in life, look for problems or gaps in the world, then build creative solutions. Of course, you'll need to be equipped with the knowledge to solve real-world problems in order to achieve this. Once you are well-equipped, you'll start to notice unique problems only you are interested in solving, because you are now confident that you can solve them.

Why Full-Stack?

At the time of this writing, there are very few resources on full-stack development. One of the best depictions of what the full stack comprises is outlined by Jeff Casimir, who breaks it down into the following components:

  • Product Management
  • UI/UX
  • Front-End Engineering
  • Back-End Engineering
  • Developer Operations (DevOps)
  • Data Science

This one manuscript alone will not provide all of full-stack development to you; however, this manuscript is terse in that it covers a lot without taking up too much space. My goal is to keep you engaged and provide you with value so you are more inclined to purchase and read the next manuscript in the series.

If you are going to eventually be full-stack, there are several non-trivial topics you'll need to understand. This is one of them. Becoming truly full-stack is no easy feat, but if you trust the process and make incremental progress, you will get there.

Why TypeScript?

JavaScript is going to be around for a long time; however, as programs become more complex, you are going to want a type system. I've written A LOT of vanilla JavaScript and I understand most, if not all, of the weird gotchas; even so, I promise you, TypeScript is much nicer.

Working with a well-designed type system makes writing programs faster, easier to debug, and easier to extend in the future.

Why Text Parsers?

As a software engineer, you'll eventually come across a problem that requires building a parser for a data format no existing library handles.

Without the ability to parse text, we wouldn't be able to build web scraping tools, compilers, full-text search engines, JSON & CSV parsers, or natural language processing programs. You'll find text parsing a lot in data science as well.

Let's consider a source code compiler for a moment. A compiler is a set of programs that first reads source code (text), breaks it up into tokens, then it tries to make contextual sense of those tokens given the order in which they are found. There are more steps to it than that, but that is out of scope for now. Ultimately, a compiler tries to understand the source code then it converts it into a running program.

Writing compilers is a non-trivial task. Writing fast compilers is even more difficult. Fortunately, this manuscript is not about compilers, but I had to mention them because textbooks on compilers are usually the only place you'll find information about how to build a text parser. That's unfortunate, because text parsing is a very important topic in computer science.

Chapter 1 :: You've been lied to about "Regular Expressions"

Large, complex problems become easier to understand, and ultimately to solve, once you've broken them down into smaller chunks.

Lexical analysis, also called lexing or tokenization, is the task of breaking a series of text into smaller components based on a set of rules. The first step in a compiler is the lexical analysis step. Lexical analysis is driven by something called a grammar. A grammar is just a list (or an array) of rules. Each rule is given a name and a regular expression matching a pattern in the source text.
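To make that concrete, here is a minimal sketch of what such a grammar might look like in TypeScript. The `Rule` shape and the rule names are illustrative assumptions on my part, not a fixed API:

```typescript
// A grammar is just a named list of rules: each rule pairs a name with a
// regular expression that matches one kind of pattern in the source text.
interface Rule {
  name: string;
  pattern: RegExp;
}

// A hypothetical grammar for height expressions like "3 FT 10 IN".
const grammar: Rule[] = [
  { name: "number", pattern: /^\d+/ },      // one or more digits
  { name: "unit",   pattern: /^(FT|IN)/ },  // a unit keyword
  { name: "space",  pattern: /^\s+/ },      // whitespace between tokens
];
```

Each pattern is anchored with `^` so a tokenizer can repeatedly test rules against the start of the remaining input.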

For example, you might want to write a program that identifies height in a block of text such as "3 FT 10 IN". You could write a fairly simple (naive) regular expression to do that.

/\b\d+ FT \d+ IN/gm

The above regular expression, read in plain English, says: "Look for a word boundary; immediately following that word boundary, look for one or more numeric characters; then expect a single literal space character; then "FT"; then another literal space; one or more numeric characters; a space; then "IN"."
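Here is how that naive pattern might be applied in TypeScript; the sample text is mine, not from the manuscript:

```typescript
// The naive height pattern: digits, " FT ", digits, " IN", at word boundaries.
const heightPattern = /\b\d+ FT \d+ IN\b/gm;

const text = "He stands 3 FT 10 IN tall; his brother is 5 FT 2 IN.";

// With the g flag, match returns every non-overlapping match in the string.
const matches = text.match(heightPattern);
// → ["3 FT 10 IN", "5 FT 2 IN"]
```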

The power of regular expressions is that you can "express" all of the above with just a few symbols. That is why I like to think of regular expressions as a domain-specific language for matching textual patterns.

That being said, the more complicated the regular expression, the more difficult it is to read, debug, or update. The more rigid the pattern, the simpler the regular expression can be. The more flexible the pattern, the more complicated the regular expression must be.

For example, the text "3 FT 10 IN" is fairly simple; however, what if we were required to match all of:

  • 3 FT 10 IN
  • 3FT 10IN
  • 3FT10IN
  • 3' 10"
  • 3 ' 10 "
  • 3'10"

The regular expression could no longer be naive. It would have to be a more complex regular expression:

/\b(?<nfeet>\d+)\s*(?<feet>FT|')\s*(?<ninches>\d+)\s*(?<inches>IN|")/gm
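As a quick sketch of putting the named capture groups to work, the sample strings below are my own illustrations:

```typescript
// Named capture groups let us pull the numeric feet and inches values out
// of any of the supported spellings in a single pass. Note the quantifier
// sits inside the group (\d+), so the group captures all the digits.
const height = /\b(?<nfeet>\d+)\s*(?<feet>FT|')\s*(?<ninches>\d+)\s*(?<inches>IN|")/;

const m = `3'10"`.match(height);
const feet = m?.groups?.nfeet;     // "3"
const inches = m?.groups?.ninches; // "10"
```

The same expression matches `3 FT 10 IN`, `3FT10IN`, and `3'10"` alike, because every separator is an optional run of whitespace and every unit has two spellings.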

This is where things start to get weird. There are several places where characters are optional, and if we were to introduce even one more variable, this regular expression would go from somewhat hairy to completely unmanageable.

If you haven't already come across the following quote, keep programming long enough and you will:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

That quote has been echoed by hundreds, if not thousands, of programmers over the years, advising others to stay away from regular expressions. While I disagree, I can certainly empathize with the sentiment, especially as someone who has had to debug someone else's arcane regular expression.

The answer to all of this is simple: keep your regular expressions simple. Don't try to match everything in a single pass. Grammars for text parsers are generally broken up into tiny patterns, each matching the smallest run of characters that provides enough context to be meaningful.
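The "many tiny patterns" approach might be sketched like this in TypeScript; the rule names and `tokenize` helper are my own illustration, not code from the manuscript:

```typescript
// Instead of one giant regular expression, try each small anchored rule in
// turn against the start of the remaining input, emitting one token per match.
type Token = { name: string; value: string };

const rules: [string, RegExp][] = [
  ["number", /^\d+/],
  ["feet",   /^(?:FT|')/],
  ["inches", /^(?:IN|")/],
  ["space",  /^\s+/],
];

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let rest = input;
  while (rest.length > 0) {
    // Find the first rule whose pattern matches at the current position.
    const hit = rules
      .map(([name, pattern]) => ({ name, m: rest.match(pattern) }))
      .find(({ m }) => m !== null);
    if (!hit || !hit.m) throw new Error(`Unexpected input: ${rest}`);
    tokens.push({ name: hit.name, value: hit.m[0] });
    rest = rest.slice(hit.m[0].length); // consume the matched characters
  }
  return tokens;
}

// tokenize(`3'10"`) yields number, feet, number, inches tokens.
```

Each rule stays trivially readable on its own, and supporting a new spelling means touching one small pattern rather than one sprawling expression.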

