Parsing SQL for Excel
Relation for Excel has been around for some time, and its ability to combine Excel spreadsheets is useful in many applications. However, the relational algebra syntax is not everyone's cup of tea, and using the more common SQL language directly in Excel would open up the application to more users.
Therefore, parsing SQL for Excel was an interesting summer project programmed in VBA. The plan was to convert an SQL SELECT query into a list of relational algebra statements that would then be executed automatically.
Just want the file? See Relation for Excel
There are many texts about the differences between declarative and imperative programming languages. Relational for Excel is an imperative language. It gives a list of statements that act on relations (tables) in exactly the order defined in the formula. SQL is a declarative language: the query describes the result table to be reached, but not the steps that lead there. I can already say that you never really understand the difference between declarative and imperative until you try to implement it.
The simplest form of an SQL query is
SELECT * FROM t1
which translates to
These are still easy to translate
SELECT foo, bar FROM t1
relFilter(t1, "P foo::bar")
SELECT foo, bar FROM t1 WHERE foo > 2
relFilter(t1, "S %foo > 2", "P foo::bar"
But we can already see that the statements follow a specific order. It gets more complicated when we add expressions. We have to make sure that both the old and the new columns must be present at the time of selection.
SELECT foo * 3 AS triplefoo FROM t1 WHERE foo > 2 AND triplefoo < 100
relFilter(t1, "E triplefoo %foo *3", "S AND(%foo > 2, %triplefoo < 100)", “P triplefoo")
If there are joins, the names are qualified but can be referenced without namespace.
SELECT t1.foo, t2.bar FROM t1 JOIN t2 ON t1.bar = t2.bar WHERE foo > 2
relFilter(t1, "E t1.foo foo", "t1.bar bar", t2, "E t2.bar bar", “J %t1.bar=t2.bar", "P t1.foo::t2.bar")
Expressions can contain an aggregation:
SELECT SUM(foo) AS f FROM t1
relFilter(t1, "P foo SUM", “R foo_sum f")
SELECT SUM(foo*2) AS f FROM t1
relFilter(t1, "E foo_temp %foo * 2”, "P foo_temp SUM", “R foo_temp_sum f")
SELECT 2*SUM(foo) AS f FROM t1
relFilter(t1, "P foo SUM", "E f %foo_sum*2", "P f")
SELECT 2*SUM(foo*2) AS f FROM t1
relFilter(t1, "E foo_temp %foo * 2”, "P foo_temp SUM", “E f %foo_temp_sum*2", "P f")
As long as there is only one aggregation, all expressions are valid.
We define a function relSql(code, t1, ... t9), where code is the SQL query and t1 to t9 are tables in the form of ranges or relations.
For the SQL query, we allow the following syntax:
|query||SELECT columnlist FROM name (join)* (selection)? (having)? (order)?|
|columnlist||* | (name | expression AS name (, name | expression AS name)*)|
|expression||(name | string | number | operator | function) +|
|function||name [(] expression (, expression)* [)]|
|join||(NATURAL|LEFT|RIGHT|OUTER) JOIN name (ON expression)|
|order||ORDER BY namelist|
|amelist||name (ASC|DESC)? (, name (ASC|DESC))*|
We make some design decisions:
- The SQL works only pure relational: There are no duplicates and empty rows are ignored.
- We use auto-group: the grouping is defined directly with the defined columns. GROUP BY is therefore not needed. We still need HAVING because this statement acts on a selection after aggregation.
- The columns are not typed. The values are interpreted either as text or as numbers, depending on the context of the operator.
- For comparisons, the following inequality is defined: empty < numbers < text
- The single quote for text constants works very well in the Excel context
- We interpret expressions directly and support the algebraic operators, the text concatenation operator & and the parenthesizes
- We support as simple set of functions for math (ABS COS EXP INT LN LOG MOD POW ROUND SGN SIN SQRT TAN), text (LEFT LEN LOWER MID REPLACE RIGHT TRIM UPPER) and aggregation (AVG COUNT MAX MEDIAN MIN STDEV SUM), that could be expanded.
- An expression can mix functions and aggregators, but there can only be one aggregator at a time.
- The HAVING statement can only have access to aggregated columns that are visible (could be changed).
- The ORDER statement can only have access to visible columns at the end (could be changed).
- Keywords must be written in uppercase
Note: Why don't we use ranges directly in the code? In Excel ranges can have a name, so the function would be more readable. The short answer is 'recalculation'. While it is possible to do it, the automatic recalculation would not be possible in Excel. The Excel calculation model is based on the propagation of changes of cells via references to other cells. If the range is within a text of a code, Excel would not know that it needs to recalculate the function when a source cell has changed its value.
We define a function relTokenizeSql that reads the code and identifies the tokens.
We define a parser function relParseSql that parses the tokens and creates the relational algebra.
We define a compiler function for expressions prelCompileExpression that compiles the expressions into RPN.
We define an interpreter function for RPN prelRPN.
And finally we define the relSql function that is used by the end user.
The functions above are provided as public to detect possible errors.
The tokenizer is a straightforward state machine. It reads the code letter by letter and creates a list of tokens. Each token gets a first character to define its type.
- Operators !
- Unary left operators <
- Aggregators +
- Functions *
- Numbers #
- Text $
- Names @
- Comma ,
- Open parenthesis (
- Close parenthesis )
- The SQL keywords have no type and are taken literally.
The first character is used quick identification in the next step.
The list is then combined in a string with vbTab as separator. We can assume that the tab character is not used within an Excel cell.
Throughout the project we will work with strings as intermediate results, because passing arrays between functions in Excel is quite complicated.
The parser relParseSql(code) itself uses a state machine that builds the elements of the query by carefully checking that the syntax is correct.
The first result is then a naive implementation. We use one that ensures that all columns are present when they are needed.
- Create all relations, rename them with the t1 prefix
- Extend the relations for all expressions that will be used
- If there is an aggregator in the expression, split in 3: extend expression inside, project aggregation, extend expression outside
- Join, if possible naturally
- Proceed the selection
- Aggregate in projection
- Apply the having statement
- Aggregate a second time and remove the prefixes if possible
The naive implementation is slow because it performs calculations for extensions that may not be needed. It also makes calculations on rows that may be excluded at an early step. There is a potential for optimization in the degree of O(n^2).
We did therefore some optimizations:
- Split SELECT for all expressions combined with AND
- Move SELECT to the front if possible
- Move EXTEND to the back if possible
- Reconnect SELECT statements that now follow each other
Note that all the optimization is done without knowledge of the content or structure of the tables. We only know the code at this time.
The relation instruction list is then combined to a string with vbCRLF as separator.
The parser for expressions is quite simple. The tokens are already present. We first use a state machine to check if the expression is syntactically correct. We also count the arguments in the functions and create an argc'' variable for functions that have a variable arity (for the moment this is only the empty function for the IN operator).
We use the shunting yard algorithm to handle precedence and create the RPN code.
The list is then combined to a string with vbTab as the separator. The expression is prefixed with a question mark so that prelFilter can identify it as an RPN expression (unlike the Excel expressions normally used in relFilter).
We can now put all pieces together. The relSql parses the code and passes then the compiled relation code to prelFilter.
There is a cache for long results. Intermediate results in cells with more than 32K text now show a hash identifier that can be used as an argument to other formulas.
The functions now support line breaks in cells. vbCRLF is internally replaced by vbCR.