I imagine most readers have had to write a lexer at some point. If not, for clarity, we should start out by saying that a lexer is a program that, given an arbitrary sequence of characters, breaks it down into lexical items for a later stage to consume. The lexer itself is not too difficult to construct. You just need a set of rules that divide the characters into two categories: trivial characters and special characters. The trivial characters are stored in a buffer, and when a special character appears, the buffered characters are crammed into a single “lexical symbol”. In the vast majority of cases, special characters also become lexical symbols, but ones consisting only of a qualitative essence, without the data sequence that ‘trivial’ symbols carry. There is also a third category, namely the characters that the lexer chooses to ignore, to pass over in silence. We will not discuss this category.
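The buffering scheme just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the character sets `SPECIAL` and `IGNORED`, the token kind `"WORD"`, and the function name `lex` are all hypothetical choices made for the example, not part of any particular lexer.

```python
from typing import List, Optional, Tuple

# Hypothetical character classes, chosen only for this illustration.
SPECIAL = set(" (),;.!?:")   # each becomes a token with no data attached
IGNORED = {"\r", "\x00"}     # passed over in silence

# A token is (kind, data); special tokens carry no data.
Token = Tuple[str, Optional[str]]


def lex(text: str) -> List[Token]:
    tokens: List[Token] = []
    buffer: List[str] = []

    def flush() -> None:
        # Cram the buffered trivial characters into one lexical symbol.
        if buffer:
            tokens.append(("WORD", "".join(buffer)))
            buffer.clear()

    for ch in text:
        if ch in IGNORED:
            continue                  # skipped without touching the buffer
        if ch in SPECIAL:
            flush()
            tokens.append((ch, None))  # qualitative essence only
        else:
            buffer.append(ch)          # trivial character
    flush()                            # emit whatever is left at the end
    return tokens
```

For instance, `lex("hi (you)!")` yields a `WORD` token for `hi`, a space token, an opening parenthesis, a `WORD` token for `you`, a closing parenthesis, and an exclamation mark.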
In many cases, lexers are used as a precursor to compilers or interpreters, operating on source code. The moment of compiling or interpreting the source code can be considered the moment when it comes to life, when it becomes a structure, in a process akin to transforming nucleic acids from information embedded in molecules into the shape of some creature’s body, or even its (sometimes downright annoying) behaviour.
For the code (in its textual form) to become a functioning program, it must first be disassembled by the lexer and transformed into a list of lexical symbols. Without disassembly, it would be impossible for anything ever to happen to the code. In a way, the same thing happens to you as you read these words. An intracranial lexer figures out which characters on the sheet of paper (or the screen, or who knows what other medium) are trivial and which are special, and then groups the trivial ones together and somehow manages to operate with the special ones. Sometimes, there will also be characters that the lexer simply has to gloss over, not knowing how (or why) to interpret them.
Unlike programming languages, written human language has a relatively small number of special characters, as its content is, to a large extent, trivial. For ease of explanation, in what follows, we will refer to the Romanian language, which, again, most readers of this online publication probably speak fluently. Even the few special lexical symbols almost always have the same role, that of delimitation, the only difference between them being the intensity of this delimitation.
So, in ascending order of delimiting intensity, we have, first of all, the hyphen, which seems so innocuous in its expressive depth that a surprisingly high percentage of communicators misuse it. Next comes the space, the most frequently used special symbol. Then the comma and the dash – after all, there is not much difference between the two. Then the semicolon. Finally, equal in delimiting intensity, we have the period, the exclamation mark, the question mark, and the colon.
But the symbol that finally requires more processing in the lexer’s wake, and that elevates written human language to a rank somewhat closer to a programming language, is the parenthesis. This is because a parenthesis is something other than a mere delimiter: it represents the iterative creation of an execution subspace, whose domain of existence continues until the ending symbol is encountered.
In this way, the language space itself takes on a new dimension; it becomes a potentially recursive space, a discourse stack. Every opening parenthesis opens up a new level and, although deep nesting rarely happens in practice, there is no limit to the number of subspaces that can be superimposed. The rules of the syntactic system concerning this construction are as follows:
- Any linguistic material starts at subspace level 0.
- The minimum subspace level is 0.
- All linguistic material ends at subspace level 0.
A corollary of rule #2 is that a parenthesis cannot end before it begins, and a corollary of rule #3 is that a parenthesis cannot start without ending.
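The three rules and their corollaries amount to a simple counter. The following Python sketch is one hedged way to check them; the function name `respects_subspace_rules` is hypothetical, and the sketch assumes that only “(” and “)” affect the level.

```python
def respects_subspace_rules(text: str) -> bool:
    level = 0                      # rule 1: start at subspace level 0
    for ch in text:
        if ch == "(":
            level += 1             # open a new subspace
        elif ch == ")":
            level -= 1             # close the current subspace
            if level < 0:          # rule 2: the level never drops below 0,
                return False       # i.e. a parenthesis ended before it began
    return level == 0              # rule 3: end at level 0,
                                   # i.e. nothing was started without ending
```

Under these assumptions, `respects_subspace_rules("a (b (c) d) e")` is true, while `")("` fails on rule 2 and `"(a"` fails on rule 3.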
In addition to parentheses, there is another lexical token that serves to create a subspace in the information structure, although not an iterative one: the quotation mark. Unlike parentheses, the logic of which must be implemented post-lexically, quotation marks short-circuit the lexer itself, commanding it to interpret every character it encounters as trivial until the closing quotation mark. Consequently, a parenthesis appearing between quotation marks is not a special symbol and does not change the subspace level of the text.
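The short-circuit can be illustrated by giving the lexer a single flag. This is a sketch under assumptions, not a definitive design: it assumes the straight double quote both opens and closes a quotation, and the names `QUOTE`, `SPECIAL`, and `lex_with_quotes` are invented for the example.

```python
from typing import List, Optional, Tuple

QUOTE = '"'              # assumed: the same mark opens and closes a quotation
SPECIAL = set("() ")     # hypothetical set of special characters

Token = Tuple[str, Optional[str]]


def lex_with_quotes(text: str) -> List[Token]:
    tokens: List[Token] = []
    buffer: List[str] = []
    in_quote = False

    def flush() -> None:
        if buffer:
            tokens.append(("WORD", "".join(buffer)))
            buffer.clear()

    for ch in text:
        if ch == QUOTE:
            flush()
            tokens.append((QUOTE, None))
            in_quote = not in_quote    # enter or leave the quoted subspace
        elif in_quote or ch not in SPECIAL:
            buffer.append(ch)          # inside quotes everything is trivial
        else:
            flush()
            tokens.append((ch, None))
    flush()   # an unterminated quotation is simply flushed, not reported
    return tokens
```

For the input `a "(b" c)`, the opening parenthesis ends up inside the `WORD` token `(b`, so the only parenthesis token the later stages see is the final `)`.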
From this, a problem arises: how can we use the symbol that ends a quotation as a trivial character? Unfortunately, the current system does not provide a solution and is therefore incomplete. But we can draw inspiration from languages that have addressed this issue, once again a range of programming languages. There we find the symbol “\”, whose role is to turn the single symbol that follows it from special into trivial, and vice versa. This also happens at the level of the lexer, so that, for example, the sequence “\(” (excluding the quotation marks, of course (these have been added because your lexer is not programmed in such a way as to take the “\” symbol into account)) is not treated as the start of a parenthesis subspace, but merely as the “(” symbol itself. It is interesting that a language like Common Lisp allows us to use parentheses (“(” or “)”) inside symbols such as function or variable names, as long as we force the lexer to give them the trivial treatment (via “\”), even though the parenthesis as a special symbol can be considered the very foundation of that language:
(defun \( (n)
  (let ((\) '()))
    (loop for i from 1 to (1- n)
          when (zerop (mod n i))
            do (push i \)))
    (nreverse \))))

(defun \(\) (n)
  (> (reduce #'+ (\( n)) n))

(defun \)\) (n)
  (let ((count 0)
        (\(\(\( 1))
    (loop while (< count n)
          do (when (\(\) \(\(\()
               (incf count))
             (when (< count n)
               (incf \(\(\()))
    \(\(\())

(defun \)\( (n)
  (\)\) n))

(defun \)\)\) (n)
  (let ((\) (\)\( n)))
    (if (and (integerp \)) (>= \) 0))
        (code-char \))
        (error ")"))))

(defun \(\(\) ()
  (loop
    (format t "~D" (\)\)\) 7))
    (finish-output)))

(\(\(\))
The above program can be used to test one’s intracranial interpreter to detect whether or not it possesses the stack overflow concept. In any case, to preserve the proper functioning of the interpreter, it is recommended to run the code until the three rules of the syntactic system mentioned above are respected.
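For readers who prefer to see the escape rule outside Lisp, here is a minimal Python sketch of a lexer that honours “\”. It is an illustration under assumptions: the names `SPECIAL`, `ESCAPE`, and `lex_with_escape` are hypothetical, and only the special-to-trivial direction of the rule is implemented (the reverse direction, as in “\n”, is left out).

```python
from typing import List, Optional, Tuple

SPECIAL = set("() ")     # hypothetical set of special characters
ESCAPE = "\\"            # the backslash

Token = Tuple[str, Optional[str]]


def lex_with_escape(text: str) -> List[Token]:
    tokens: List[Token] = []
    buffer: List[str] = []
    escaped = False

    def flush() -> None:
        if buffer:
            tokens.append(("WORD", "".join(buffer)))
            buffer.clear()

    for ch in text:
        if escaped:
            buffer.append(ch)      # forced trivial, whatever it is
            escaped = False
        elif ch == ESCAPE:
            escaped = True         # affects the next character only
        elif ch in SPECIAL:
            flush()
            tokens.append((ch, None))
        else:
            buffer.append(ch)
    flush()   # a trailing backslash with nothing after it is dropped
    return tokens
```

Fed the text `(defun \(\) (n))`, this sketch produces a single `WORD` token whose data is `()`, sitting between ordinary parenthesis tokens that still balance.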