Woo, yet another weak parser implementation.
I'm having a lot of fun porting my old JQuery code to ClojureScript, inspite of my setbacks. It's nothing very exciting though, so I'm going to write about something else today. A few weeks ago I was asked by Corp. CAD IS to have a look at some stuff. The stuff was a zip file containing the accumulated AutoCAD® customization files of an entire firm we acquired recently. The vague task I was given was to "convert" the files to be compatible with the latest version of AutoCAD.
The former firm had a CAD standards document that enumerated some of the tools they had collected, but there wasn't any obvious structure to their collection, or any central entry point. Since they had built a CUIX file I decided that was the best place to find out what they were actually using. In case you didn't know, a CUIX file is really just a zip file full of CUI files (xml), images, and other resource files. The naming convention seems odd, but it aligns with all the OpenOffice convention (doc -> docx, xls -> xlsx, etc. which are also zipped XML files).
I started by just browsing some of the XML, and I noticed an interesting
pattern: most of the command macros had the form roughly:
(if (not c:foo) (load "some.lsp"));foo which has the effect of demand loading
a lisp file when the user tries to use a command. While I had recently been
using the Python ElementTree XML library to process AutoCAD tool palettes on
another task, for this I just wrote a script to dump any contiguous string
containing ".lsp". I didn't need a parser because I wasn't making modifications
to, or trying to understand the structure of, the CUI files.
I then wrote another script to copy all lisp files in the found list into one directory (used), and report on any missing files, and copy everything else to another (unused). About half of the lisp files they had were not apparently used. There still existed the possibility that the unused lisps were libraries required by the used lisps though.
I decided I needed to have more information about the contents of the lisps, but there were dozens in each folder ranging in length from a few lines to hundreds. I decided that I would just write a script to analyze the contents of all the lisps. First I looked around online for a Python module to either do the parsing for me or at least make building a parser simple. My main problems where that I only wanted specific information, and I didn't need the full parse results. Everything I found was either too complex, or didn't suit my needs.
So I decided to write my own parser. It's only lisp, how hard could it be?
Honestly it wasn't that hard. It probably isn't that good either, but it did what I wanted. I wrote it in a few hours across two days time, while doing other work. In this post I'll just show the tokenizer portion, since the parser (like this post) ran a bit long.
I started out "tokenizing" just by reading a line at a time, doing some text replacements, then splitting on whitespace:
substitutions = ( # Expand quote literal to quote fn ("'(", "(quote "), # Expand parens with white-space for splitting (')', ' ) '), ('(', ' ( '), ) def modify(line): """Perform replacements listed in global dictionary.""" for k, v in substitutions: if k in line: line = line.replace(k, v) return line def tokenize(infile): """Cheaty str.split tokenization.""" tokens = list() # Clean and append all lines into one big string for line in infile: # Strip empty lines and comments line = line.strip() if not line or line == ';': continue # Tokenize the line tokens.extend([token for token in # AutoLisp is case-insensitive modify(line).lower().split() if token]) return tokens
It was surprisingly fast, and worked well enough on simple examples, but made
garbage out of real code. The problem was that comments don't just appear on
lines by themselves. They can, and in typical Autolisp style usually do, appear
at the ends of lines. The comment character in lisp is the semi-colon
is also sometimes used in text. But people also sometimes put text quotes in
comments (especially when commenting out code). What I needed was a more
sophisticated way to pre-process each line to remove all comments.
I built the following function to do the job. Sophisticated would be an over-statement, but it does use one cute trick. There are three cases I want to cover:
- The line is empty or starts with a
;- skip line
- The line is non-empty and has a
;but no quote - trim tail including
- The line contains quotes, and one or more
;- do special stuff
Case 2 can actually also handle quotes on the condition that the last quote is
before the first
;. For special stuff I first break the line into parts on
;, then starting with the first part collect parts until all quotes are
matched. The trick is given an example line:
(foo "evil;string" "an;other") ; badness it is broken into
badness]. We start with one quote in
the first part, then add two in the second giving 3 (which is not even). As soon
as the number of quotes does become even whatever is left is the comment.
def scrub_comment(line): """Carefully clean comments from line.""" if not line or line == ';': # Just skip blank lines and comments return semi = line.find(';') if semi > 0: if '"' not in line or line.rfind('"') < semi: # Keep everything up to but not including semi line = line[:semi] else: # Find left most semi not in a string literal... parts = line.split(';') first, rest = parts, parts[1:] accepted = [first] quotes = 0 while rest: quotes += first.count('"') # If we have collected an even number of quotes on left if quotes%2 == 0: break first, rest = rest, rest[1:] accepted.append(first) # Put back all the semis that belonged there line = ';'.join(accepted) return line def tokenize(infile): """Cheaty str.split tokenization.""" tokens = list() # Clean and append all lines into one big string for line in infile: # Modified line filter line = scrub_comment(line.strip()) if not line: continue # Tokenize the line tokens.extend([token for token in # AutoLisp is case-insensitive modify(line).lower().split() if token]) return tokens
Don't worry, the "parser" implementation has even more wat.