Parsing SQL for Excel

Relation for Excel has been around for some time, and its ability to combine Excel spreadsheets is useful in many applications. However, the relational algebra syntax is not everyone's cup of tea, and using the more common SQL language directly in Excel would open up the application to more users.

Therefore, parsing SQL for Excel was an interesting summer project programmed in VBA. The plan was to convert an SQL SELECT query into a list of relational algebra statements that would then be executed automatically.

Just want the file? See Relation for Excel

There are many texts about the differences between declarative and imperative programming languages. Relational for Excel is an imperative language. It gives a list of statements that act on relations (tables) in exactly the order defined in the formula. SQL is a declarative language: the query describes the result table to be reached, but not the steps that lead there. I can already say that you never really understand the difference between declarative and imperative until you try to implement it.

The simplest form of an SQL query is

SELECT * FROM t1

which translates to

relFilter(t1)

These are still easy to translate

SELECT foo, bar FROM t1
relFilter(t1, "P foo::bar")

SELECT foo, bar FROM t1 WHERE foo > 2
relFilter(t1, "S %foo > 2", "P foo::bar"

But we can already see that the statements follow a specific order. It gets more complicated when we add expressions. We have to make sure that both the old and the new columns must be present at the time of selection.

SELECT foo * 3 AS triplefoo FROM t1 WHERE foo > 2 AND triplefoo < 100
relFilter(t1, "E triplefoo %foo *3", "S AND(%foo > 2, %triplefoo < 100)", “P triplefoo")

If there are joins, the names are qualified but can be referenced without namespace.

SELECT t1.foo, t2.bar FROM t1 JOIN t2 ON t1.bar = t2.bar WHERE foo > 2
relFilter(t1, "E t1.foo foo", "t1.bar bar", t2, "E t2.bar bar", “J %t1.bar=t2.bar", "P t1.foo::t2.bar")

Expressions can contain an aggregation:

SELECT SUM(foo) AS f FROM t1
relFilter(t1, "P foo SUM", “R foo_sum f")
SELECT SUM(foo*2) AS f FROM t1
relFilter(t1, "E foo_temp %foo * 2”, "P foo_temp SUM", “R foo_temp_sum f")
SELECT 2*SUM(foo) AS f FROM t1
relFilter(t1, "P foo SUM", "E f %foo_sum*2", "P f")
SELECT 2*SUM(foo*2) AS f FROM t1
relFilter(t1, "E foo_temp %foo * 2”, "P foo_temp SUM", “E f %foo_temp_sum*2", "P f")

As long as there is only one aggregation, all expressions are valid.

Specification

We define a function relSql(code, t1, ... t9), where code is the SQL query and t1 to t9 are tables in the form of ranges or relations.

For the SQL query, we allow the following syntax:

query	SELECT columnlist FROM name (join)* (selection)? (having)? (order)?
columnlist	* \| (name \| expression AS name (, name \| expression AS name)*)
expression	(name \| string \| number \| operator \| function) +
name	[A-Za-z][A-Za-z_0-9]*
string	([^']\|)'*
number	[0-9][0-9.]*
operator	[-+*/()&<>=]OR\|AND\|NOT\|IN
function	name [(] expression (, expression)* [)]
join	(NATURAL\|LEFT\|RIGHT\|OUTER) JOIN name (ON expression)
selection	WHERE expression
having	HAVING expression
order	ORDER BY namelist
amelist	name (ASC\|DESC)? (, name (ASC\|DESC))*

We make some design decisions:

The SQL works only pure relational: There are no duplicates and empty rows are ignored.
We use auto-group: the grouping is defined directly with the defined columns. GROUP BY is therefore not needed. We still need HAVING because this statement acts on a selection after aggregation.
The columns are not typed. The values are interpreted either as text or as numbers, depending on the context of the operator.
For comparisons, the following inequality is defined: empty < numbers < text
The single quote for text constants works very well in the Excel context
We interpret expressions directly and support the algebraic operators, the text concatenation operator & and the parenthesizes
We support as simple set of functions for math (ABS COS EXP INT LN LOG MOD POW ROUND SGN SIN SQRT TAN), text (LEFT LEN LOWER MID REPLACE RIGHT TRIM UPPER) and aggregation (AVG COUNT MAX MEDIAN MIN STDEV SUM), that could be expanded.
An expression can mix functions and aggregators, but there can only be one aggregator at a time.
The HAVING statement can only have access to aggregated columns that are visible (could be changed).
The ORDER statement can only have access to visible columns at the end (could be changed).
Keywords must be written in uppercase

Note: Why don't we use ranges directly in the code? In Excel ranges can have a name, so the function would be more readable. The short answer is 'recalculation'. While it is possible to do it, the automatic recalculation would not be possible in Excel. The Excel calculation model is based on the propagation of changes of cells via references to other cells. If the range is within a text of a code, Excel would not know that it needs to recalculate the function when a source cell has changed its value.

Implementation

We define a function relTokenizeSql that reads the code and identifies the tokens.
We define a parser function relParseSql that parses the tokens and creates the relational algebra.
We define a compiler function for expressions prelCompileExpression that compiles the expressions into RPN.
We define an interpreter function for RPN prelRPN.
And finally we define the relSql function that is used by the end user.

The functions above are provided as public to detect possible errors.

Tokenizer

The tokenizer is a straightforward state machine. It reads the code letter by letter and creates a list of tokens. Each token gets a first character to define its type.

Operators !
Unary left operators <
Aggregators +
Functions *
Numbers #
Text $
Names @
Comma ,
Open parenthesis (
Close parenthesis )
The SQL keywords have no type and are taken literally.

The first character is used quick identification in the next step.

The list is then combined in a string with vbTab as separator. We can assume that the tab character is not used within an Excel cell.

Throughout the project we will work with strings as intermediate results, because passing arrays between functions in Excel is quite complicated.

Parser

The parser relParseSql(code) itself uses a state machine that builds the elements of the query by carefully checking that the syntax is correct.

The first result is then a naive implementation. We use one that ensures that all columns are present when they are needed.

That means:

Create all relations, rename them with the t1 prefix
Extend the relations for all expressions that will be used
If there is an aggregator in the expression, split in 3: extend expression inside, project aggregation, extend expression outside
Join, if possible naturally
Proceed the selection
Aggregate in projection
Apply the having statement
Aggregate a second time and remove the prefixes if possible

The naive implementation is slow because it performs calculations for extensions that may not be needed. It also makes calculations on rows that may be excluded at an early step. There is a potential for optimization in the degree of O(n^2).

We did therefore some optimizations:

Split SELECT for all expressions combined with AND
Move SELECT to the front if possible
Move EXTEND to the back if possible
Reconnect SELECT statements that now follow each other

Note that all the optimization is done without knowledge of the content or structure of the tables. We only know the code at this time.

The relation instruction list is then combined to a string with vbCRLF as separator.

Expression parser

The parser for expressions is quite simple. The tokens are already present. We first use a state machine to check if the expression is syntactically correct. We also count the arguments in the functions and create an argc'' variable for functions that have a variable arity (for the moment this is only the empty function for the IN operator).

We use the shunting yard algorithm to handle precedence and create the RPN code.

The list is then combined to a string with vbTab as the separator. The expression is prefixed with a question mark so that prelFilter can identify it as an RPN expression (unlike the Excel expressions normally used in relFilter).

The function

We can now put all pieces together. The relSql parses the code and passes then the compiled relation code to prelFilter.

More enhancements in 3.0

There is a cache for long results. Intermediate results in cells with more than 32K text now show a hash identifier that can be used as an argument to other formulas.

The functions now support line breaks in cells. vbCRLF is internally replaced by vbCR.