Lertulo ([info]lertulo) wrote,
@ 2008-10-06 19:38:00
Tokenizing
A short article for an easy step. The first part of compiling or interpreting a script is breaking the script's text up into tokens. If you're using lex (which you probably should), that's pretty easy--but since "easy" is no fun, I'll be doing it the old-fashioned, hand-coded way.

A clever tokenizer won't run too far ahead of the actual compiler: that's called a streaming tokenizer, and it's marginally more complex than a tokenize-everything-and-remember-it-all-at-once tokenizer. I'm using the latter, since I've used up all my "easy is no fun"-ness in the first paragraph.
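
To make the difference concrete, here is a minimal Python sketch of both styles. The token pattern is a deliberately tiny stand-in (a run of word characters, or any single symbol), and the names `stream_tokens` and `all_tokens` are made up for illustration; none of this is the tokenizer described in this post.

```python
import re
from typing import Iterator, List

# Hypothetical toy pattern: a run of word characters, or one non-space symbol.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def stream_tokens(text: str) -> Iterator[str]:
    """Streaming style: produce each token only when the consumer asks for it."""
    for match in _TOKEN.finditer(text):
        yield match.group()


def all_tokens(text: str) -> List[str]:
    """Eager style: tokenize everything up front and remember it all at once."""
    return list(stream_tokens(text))
```

With the streaming version the compiler pulls tokens one at a time with `next()`, so nothing past the current token has been scanned yet; the eager version hands back a plain list that can be indexed freely, which is the approach taken here.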

The basic tokenization loop looks more or less like this:

   as long as there's content left in the input script {

      skip whitespace (including EOLs)
      if we hit "//", skip to the end of the line and restart this loop
      if we hit "/*", keep reading until we hit "*/" then restart this loop
      if we ran into the end of the script, quit now

      if the next char is a digit, look for number sequences
         don't forget to look for hex and octal radixes ("0x5E13", "0777")
         don't forget to look for decimals and exponents ("15.37", "27e+5")
         remember to look for special cases ("0" and "0.3", which look octal-ish but aren't)
         then record the number into our output and restart this loop

      see if the next 3 characters match a 3-character token (like ">>=");
         if so, record that token into our output and restart this loop

      likewise, see if the next 2 characters match a 2-character token ("+=", "<<")

      likewise, see if the next character matches a 1-character token (":", ";" etc)
 
      if the next character is an apostrophe, crack character sequences like 'x'
         remember to handle encodings like '\t' for tab, '\n' for newline etc
         and of course '\x7f', '\127', '\035' should be supported too

      if the next character is a quotation mark, try to pull a whole string
         this is pretty easy--just skip the ", then keep reading until we hit another one
         again, look for \ prefixes, and don't be fooled by \"

      okay, the next word must be plain text--either a keyword or something like a variable.
      scan forward until we run out of legal characters for either, and accumulate the text.
      then match against known keywords ("for", "return" etc)
   }
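
The loop above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the actual implementation behind this post: the keyword set, the operator table, the `(kind, text)` token shape, and the `tokenize` name are all invented for the example, and literals are stored as raw text rather than decoded values.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text); a hypothetical token shape

KEYWORDS = {"for", "while", "if", "else", "return"}  # assumed subset

# Assumed operator table, sorted longest first so ">>=" is tried before ">>" and ">".
OPERATORS = sorted(
    [">>=", "<<=", "+=", "-=", "==", "!=", "<=", ">=", "<<", ">>",
     "+", "-", "*", "/", "=", "<", ">", "(", ")", "{", "}", ":", ";", ","],
    key=len, reverse=True)

# Hex, or decimal digits with an optional fraction and exponent.
NUMBER = re.compile(r"0[xX][0-9a-fA-F]+|[0-9]+(?:\.[0-9]*)?(?:[eE][+-]?[0-9]+)?")
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(src: str) -> List[Token]:
    tokens: List[Token] = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c.isspace():                      # skip whitespace, including EOLs
            i += 1
            continue
        if src.startswith("//", i):          # line comment: skip to end of line
            end = src.find("\n", i)
            i = n if end == -1 else end
            continue
        if src.startswith("/*", i):          # block comment: skip past "*/"
            end = src.find("*/", i + 2)
            if end == -1:
                raise SyntaxError("unterminated /* comment")
            i = end + 2
            continue
        m = NUMBER.match(src, i)             # numbers start with a digit
        if m:
            tokens.append(("number", m.group()))
            i = m.end()
            continue
        op = next((o for o in OPERATORS if src.startswith(o, i)), None)
        if op is not None:                   # 3-, then 2-, then 1-character tokens
            tokens.append(("operator", op))
            i += len(op)
            continue
        if c in "'\"":                       # character or string literal
            j = i + 1
            while j < n and src[j] != c:
                j += 2 if src[j] == "\\" else 1   # step over \" and friends
            if j >= n:
                raise SyntaxError("unterminated literal")
            tokens.append(("char" if c == "'" else "string", src[i:j + 1]))
            i = j + 1
            continue
        m = WORD.match(src, i)               # keyword or variable-like name
        if m is None:
            raise SyntaxError(f"unexpected character {c!r} at offset {i}")
        word = m.group()
        tokens.append(("keyword" if word in KEYWORDS else "identifier", word))
        i = m.end()
    return tokens
```

Each `continue` plays the role of "restart this loop". Sorting the operator table by length is one simple way to get the 3-then-2-then-1 matching order without writing three separate checks.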


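The "don't forget" notes about radixes and backslash encodings are where most of the fiddly work hides, so here is a small Python sketch of turning literal text into values. It assumes C-style rules (a leading "0x" means hex, a leading "0" on an integer means octal, octal escapes take at most three digits); the helper names `parse_number` and `decode_escapes` are hypothetical.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"
OCT_DIGITS = "01234567"
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", "'": "'", '"': '"'}


def parse_number(text: str):
    """Convert number-token text to an int or float, assuming C-style radixes."""
    lower = text.lower()
    if lower.startswith("0x"):            # check hex first: "0x5e13" contains an "e"
        return int(lower[2:], 16)
    if "." in lower or "e" in lower:      # "0.3", "15.37", "27e+5"
        return float(lower)
    if len(lower) > 1 and lower[0] == "0":
        return int(lower, 8)              # "0777" is octal, but plain "0" is not
    return int(lower)


def decode_escapes(body: str) -> str:
    """Decode backslash escapes in the text between the quotes of a literal."""
    out = []
    i = 0
    while i < len(body):
        c = body[i]
        if c != "\\":
            out.append(c)
            i += 1
            continue
        i += 1                            # look at the character after the backslash
        if i >= len(body):
            raise ValueError("dangling backslash")
        e = body[i]
        if e == "x":                      # \x7f: hex digits follow
            j = i + 1
            while j < len(body) and body[j] in HEX_DIGITS:
                j += 1
            if j == i + 1:
                raise ValueError("\\x needs at least one hex digit")
            out.append(chr(int(body[i + 1:j], 16)))
            i = j
        elif e in OCT_DIGITS:             # \127, \035: up to three octal digits
            j = i
            while j < len(body) and j < i + 3 and body[j] in OCT_DIGITS:
                j += 1
            out.append(chr(int(body[i:j], 8)))
            i = j
        elif e in SIMPLE_ESCAPES:         # \t, \n, \" and so on
            out.append(SIMPLE_ESCAPES[e])
            i += 1
        else:
            raise ValueError(f"unknown escape \\{e}")
    return "".join(out)
```

The order of the checks in `parse_number` matters: testing for "e" before "0x" would send a hex literal such as "0x5E13" down the floating-point path.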
And that's it. No magic involved--just some simple text cracking. The result is that we can stop worrying about the text file that the user supplied; instead, we have a much more programmatically accessible array of tokens. The compiler will start pawing through those tokens to get its work done--in the next post.

