Writing parsers is fun



At its simplest, a parser is just a program that takes some input, such as a file, and converts it into something usable. Do more with a parser, though, and you can create amazing things.

I have wanted to write some sort of parsing program for a while. While wondering what to build, I thought: why not a TailwindCSS clone? TailwindCSS is a CSS library where everything is a class. You use a class like w-4, and the CSS it outputs is width:1rem;. This does not sound super hard, since you just have to hard-code some default values and you are good. I wanted to make it harder, so I made my own scripting language for it.
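To show what the hard-coded route might look like, here is a minimal Python sketch. The names SPACING_STEP_REM and width_class_to_css are made up for illustration, and the 0.25rem step is an assumption inferred from w-4 mapping to width:1rem;.

```python
# Assumption: each spacing step is 0.25rem, so w-4 becomes 1rem.
SPACING_STEP_REM = 0.25


def width_class_to_css(class_name: str) -> str:
    """Translate a class such as 'w-4' into a CSS width declaration."""
    prefix = "w-"
    if not class_name.startswith(prefix):
        raise ValueError(f"not a width class: {class_name!r}")
    # float() raises ValueError if the suffix is not a number.
    steps = float(class_name[len(prefix):])
    # The :g format drops a trailing '.0', so 1.0 prints as 1.
    return f"width:{steps * SPACING_STEP_REM:g}rem;"


print(width_class_to_css("w-4"))
```

Every new kind of class needs another function like this one, which is the repetition a small rule language avoids.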

My scripting language looks like this:

// Width

.w-{}% => width: {}%
.w-{x}/{y} => width: {x / y * 100}%
.w-{}px => width: {}px
.w-{} => width: {*0.25}rem

// Max Width

.mw-{}% => max-width: {}%
.mw-{x}/{y} => max-width: {x / y * 100}%
.mw-{}px => {{
    max-width: {}px
    width: {}px
}}
.mw-{} => max-width: { * .25}rem

(The multi-line block after .mw-{}px => is only there to show that a rule can output more than one declaration.)
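One way the left-hand side of a rule could be matched against a class name is to turn it into a regex. The Python sketch below is a guess at that step, not the actual implementation: compile_pattern and match_class are hypothetical names, it assumes placeholders only capture plain numbers, and it gives the anonymous {} placeholder the made-up name "value". Evaluating the right-hand side is left out.

```python
import re
from typing import Dict, Optional

# Assumption: a placeholder only ever captures a plain decimal number.
NUMBER = r"\d+(?:\.\d+)?"


def compile_pattern(pattern: str) -> "re.Pattern[str]":
    """Turn a rule pattern such as '.w-{x}/{y}' into a regex with named groups."""
    pattern = pattern.lstrip(".")  # class names are matched without the leading dot
    parts = []
    pos = 0
    for m in re.finditer(r"\{(\w*)\}", pattern):
        # Literal text between placeholders must match exactly.
        parts.append(re.escape(pattern[pos:m.start()]))
        # "value" is a hypothetical name for the anonymous {} placeholder.
        name = m.group(1) or "value"
        parts.append("(?P<" + name + ">" + NUMBER + ")")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


def match_class(pattern: str, class_name: str) -> Optional[Dict[str, float]]:
    """Return the captured numbers if class_name fits the pattern, else None."""
    m = compile_pattern(pattern).fullmatch(class_name)
    if m is None:
        return None
    return {name: float(text) for name, text in m.groupdict().items()}


caps = match_class(".w-{x}/{y}", "w-1/2")
if caps is not None:
    print(caps["x"] / caps["y"] * 100)  # the expression from the width rule
```

Because fullmatch is used, w-12px does not accidentally match the shorter .w-{} rule.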

You can easily write a parser like this yourself, and you should. You will learn how much actually goes into parsers for file types and programming languages.

There are two parts to parsing: lexing the input into tokens, and then acting on the tokens the lexer gives you.

If you want to make a lexer the easy way, you can just write a regex pattern for each type of token. To store the output of your lexer, use a list of tokens. These tokens should be enums, because then each one can carry both its type and the actual content it matched in a programmatic way. To check whether a pattern matches at the start of your remaining input, just add a ^ to the start of the regex. If it matches, remove however many bytes the match was from the start of the contents and add what it matched to your list of tokens. Repeat until the input is empty.
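The loop described above can be sketched in Python roughly like this. The token kinds, the patterns, and the names TokenKind, Token, and lex are all assumptions picked to fit the rule syntax shown earlier. Python enums cannot carry data the way enums in some other languages can, so this sketch pairs an enum with a small dataclass instead.

```python
import re
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class TokenKind(Enum):
    ARROW = auto()
    LBRACE = auto()
    RBRACE = auto()
    NUMBER = auto()
    IDENT = auto()
    SYMBOL = auto()


@dataclass
class Token:
    kind: TokenKind  # what type of token this is
    text: str        # the exact content that was matched


# Every pattern starts with ^ so it can only match at the start of the
# remaining input. Order matters: the first pattern that matches wins.
# A kind of None means "match it, but do not keep it".
TOKEN_PATTERNS = [
    (None, re.compile(r"^\s+")),        # whitespace
    (None, re.compile(r"^//[^\n]*")),   # line comments
    (TokenKind.ARROW, re.compile(r"^=>")),
    (TokenKind.LBRACE, re.compile(r"^\{")),
    (TokenKind.RBRACE, re.compile(r"^\}")),
    (TokenKind.NUMBER, re.compile(r"^(?:\d+\.\d+|\.\d+|\d+)")),
    (TokenKind.IDENT, re.compile(r"^[A-Za-z_][A-Za-z0-9_]*")),
    (TokenKind.SYMBOL, re.compile(r"^[.\-/*%:;+]")),
]


def lex(source: str) -> List[Token]:
    """Split source into tokens by repeatedly matching at the front."""
    tokens = []
    rest = source
    while rest:
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.search(rest)
            if m:
                if kind is not None:
                    tokens.append(Token(kind, m.group()))
                rest = rest[m.end():]  # drop the matched text from the front
                break
        else:
            # No pattern matched: report the error instead of looping forever.
            raise SyntaxError(f"unexpected character {rest[0]!r}")
    return tokens


for token in lex(".w-{}px => width: {}px"):
    print(token.kind.name, repr(token.text))
```

Re-slicing the string on every match is simple but copies the remaining input each time. For large files, tracking an index into the original string would avoid that.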

Lexing is easy; parsing what your lexer gives you is the hard part, because you have to think of every possibility and report an error for the ones that are invalid. Say you are coding in C and you have the string int foo. From this alone, you can't tell whether it is a function or a variable. You have to keep track of what you have lexed so far to see if you are already inside a function, because then it's a variable.

What if you have the string "hello 'world'"? Now you have to make sure that a ' does not end the ". Depending on the language, either ' or " can delimit a string, so you have to keep track of which one opened it. Escaped characters are similar: in "hello \"world\"", you see a " after a \, so you have to keep track of what came before. But checking only the previous character is not enough, because \\\" does not translate to \\". The first two \ form a single escaped backslash, so for the third \ there was effectively no \ before it, and it is the one that escapes the ".
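The string cases above can be handled by consuming a backslash together with the character after it, rather than looking backwards. Here is a small Python sketch of that idea. The name read_string is hypothetical, and it assumes a backslash simply makes the next character literal (no \n-style translation).

```python
from typing import Tuple


def read_string(source: str, start: int = 0) -> Tuple[str, int]:
    """Read a quoted string that begins at source[start].

    Returns the string's contents and the index just past the closing quote.
    """
    quote = source[start]  # remember which quote opened the string
    if quote not in ("'", '"'):
        raise ValueError("expected a quote character")
    chars = []
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            if i + 1 >= len(source):
                break  # a lone backslash at the very end: unterminated
            # Consume the backslash and the next character as one unit, so
            # an escaped backslash can never escape whatever follows it.
            chars.append(source[i + 1])
            i += 2
        elif ch == quote:
            return "".join(chars), i + 1
        else:
            # This includes the other kind of quote, which is just text here.
            chars.append(ch)
            i += 1
    raise ValueError("unterminated string")


print(read_string('"hello \'world\'"')[0])
print(read_string(r'"hello \"world\""')[0])
print(read_string(r'"\\\""')[0])
```

Because each backslash is eaten together with the character it escapes, the scanner never has to ask what came before the current character.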