Skip to content

AST-based SQLite driver #164

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 37 commits into from
Closed

AST-based SQLite driver #164

wants to merge 37 commits into from

Conversation

JanJakes
Copy link
Collaborator

This PR is part of the Advanced MySQL support project and builds on the previous work in Exhaustive MySQL parser.

This is a WIP, currently stacked on top of #157.

@JanJakes
Copy link
Collaborator Author

@adamziel Looking deeper into this, I think that translating specific AST subnodes is pretty straightforward, but what's an open question is the driver/translator architecture, and I'm kind of going back and forth on multiple options.

Splitting the driver & translator
I'm not sure whether having all in one large class is a good idea, and I'm exploring ways to split the code a bit. That said, there is the large switch/case statement in the prototype that doesn't need to know much about the AST structure — it just handles some nodes specifically, and passes many other ones down to the translator again, recursively. This has some advantages, such as we don't need to worry much about where a specific grammar rule can appear.

Splitting top-level constructs
That said, some kind of distinction makes a lot of sense to me, at least at the highest level. There is a big difference whether we're running a SELECT statement, or an UPDATE statement, or CREATE TABLE, etc. These top-level statements have pretty different semantics, they can return different things, and some special constructs make sense only in some cases (e.g., SQL_CALC_FOUND_ROWS works only with SELECT statements). Therefore, I'm pretty confident that on the top level, it makes sense to split these use cases a bit into separate classes.

To add a bit more context, even though the top-level statement rule can be re-used in a compound statement for stored procedures, stored functions, triggers, and events, it's probably good to distinguish that case from the top-level use case. E.g., an ALTER TABLE ... statement on the top-level may result in directly executing multiple SQLite queries, while when the same is passed as a procedure body, we may need to say it's unsupported.

Splitting further?
However, going a little bit deeper, any kind of distinction can be problematic. There are surely some grammar rules that can appear only in the context of others — tableElementList can never appear in the SELECT clause, for instance — but would it make any sense to implement such rules in CREATE/UPDATE translator class? Or, should there be one huge switch to handle all the grammar rules, with maybe the exception of the top-level ones, as suggested above?

Translating vs. interpreting
Another way of looking at organizing the code is the difference between translating MySQL syntax into SQLite syntax vs. emulating a MySQL statement with multiple different SQLite statements or PHP calls. We can't completely separate these two, but I think it would be nice to be clear about where we are just translating, and where some logic is actually executed. Again, it seems to me that the top-level vs. deeper level could help here — each top-level statement could be handled by a "handler" class that would interpret the query in full, and these handlers could use a generic "translator" within them, whose task would only be to translate some MySQL constructs into SQLite syntax.

class WP_SQLite_Create_Table_Translator {
	public function translate_create_table( WP_Parser_Node $ast ) {
		// CREATE TABLE ... LIKE ...
		// CREATE TABLE ... (LIKE ...)
		$has_like = $ast->get_descendant_token( WP_MySQL_Lexer::LIKE_SYMBOL );
		if ($has_like) {
			// Here we will need to run a query to get the table structure,
			// and then create the table:
			return $this->create_table_like( $ast );
		}

		// CREATE TABLE ... ( elements ) options
		// CREATE TABLE ... AS ...
		$elements_ast = $ast->get_child( 'tableElementList' );
		$has_as = $ast->get_descendant_token( WP_MySQL_Lexer::AS_SYMBOL );
		if ( $elements_ast && $has_as ) {
			throw new \Exception('"CREATE TABLE" with both column/constraint definition and "AS" clause is not supported');
		}
		
		// SQLite supporst both ( elements ) and "AS", so we can just rewrite
		// the query and run it. Rewriting can be done either partly here,
		// or fully in the translator.
		$query = $this->translator->translate( $ast );
		return $this->connection->query( $query );

		// Or, we rewrite the top-level `CREATE TABLE` here and nested nodes in the translator:
		$query = new WP_SQLite_Expression(
			array_merge(
				array( WP_SQLite_Token_Factory::raw( 'CREATE TABLE' ) ),
				array(
					$this->translate(
						$ast->get_child( 'tableName' )->get_child()
					),
				),
				array( WP_SQLite_Token_Factory::raw( '(' ) ),
				$this->translator->translate( $ast->get_child( 'tableElementList' ) ),
				array( WP_SQLite_Token_Factory::raw( ')' ) )
			)
		);
		return $this->connection->query( $query );

I'm not yet sure what's the optimal solution, but it feels like at least a bit of a concern separation would be useful. It would also be nice if we could solidify some AST node translation by unit testing the translated fragment as well (not only by the end-to-end tests that are run against SQLite), as such test would give immediate feedback and great diffs.

@adamziel
Copy link
Collaborator

adamziel commented Nov 15, 2024

Is there any downside to starting with a single giant class and splitting once that no longer serves the purpose?

Separation of concerns is a useful principle but it cuts both ways. Take Enterprise FizzBuzz. Great separation of concerns, extremely unreadable codebase. Too many concerns were defined and separated.

Therefore, I propose starting with just a single concern: Running MySQL queries in SQLite. Why? Because we don't really need modular code here. I don't think we'll ever need to use, say, SELECT statement translation as a component in any other context – therefore is it really useful to separate it?

I only propose that approach as a starting point. Perhaps it will make things unnecessarily complex and we'll have to look for other ways to slice it. I'm just not too excited about preemptively decoupling things that may not need decoupling at all.

The HTML Tag processor, for example, is just a single class even though it tackles a complex problem. It's a huge advantage – most methods are private so we can adjust them without breaking BC, you never have to analyze a dependency graph or wonder where should a specific piece of logic go, you always have full access to the entire state, and you can extend it easily.

I've tried splitting that logic a few times in the current iteration of the plugin and the outcome was always messy. Making a distinction between translating and interpreting is tempting, but it complicates the design. Perhaps things are different now that we have AST – if yes, then we should hit the wall really soon with a single class approach.

@JanJakes
Copy link
Collaborator Author

@adamziel Thanks for your insights! I did not consider that separation of concerns also creates new public APIs as a side effect. That's a very important point. Your proposal makes a lot of sense to me. Somehow, it seemed natural to me that the HTML Tag Processor, or the MySQL lexer, for instance, is handled by a single class, but I wasn't sure about the driver. I guess for parser-like logic, a single huge class is quite common, but in the case of the driver, I started to see too many different things (DB connection, query translation, query execution, etc.), and got myself caught overthinking it.

Is there any downside to starting with a single giant class and splitting once that no longer serves the purpose?

The only downside is that the giant class can will have multiple responsibilities, and it can get a bit difficult to read. For instance, some state variables will make sense for selects, but not for inserts, etc. But we can mitigate that with good naming, docs, and making the state as simple as possible.

Separation of concerns is a useful principle but it cuts both ways. Take Enterprise FizzBuzz. Great separation of concerns, extremely unreadable codebase. Too many concerns were defined and separated.

Wow, what a codebase! 😂 And a nice demonstration of what over-architecting can cause.

Therefore, I propose starting with just a single concern: Running MySQL queries in SQLite.

💯

I only propose that approach as a starting point. Perhaps it will make things unnecessarily complex and we'll have to look for other ways to slice it. I'm just not too excited about preemptively decoupling things that may not need decoupling at all.

Makes sense!

It's a huge advantage – most methods are private so we can adjust them without breaking BC

This! 💯


Looking at the architecture, I'd like to simplify also the current draft with WP_SQLite_Expression and WP_SQLite_Query_Builder. I think maybe we could make it work without them.

The WP_SQLite_Query_Builder takes the expression and stringifies it, joining all strings with a space and wrapping sub-expressions with parentheses. It covers many use cases out of the box, but it's not really a definition of the SQLite SQL dialect or anything like that, and it's not very flexible — for instance, a list of column definitions is delimited by a comma, not a space, and maybe the current implementation would put spaces even into places where they aren't expected. Apart from that, it performs some operations on some tokens (quoting identifiers, for instance), which could be done by the driver itself.

For WP_SQLite_Expression, it may be possible to replace it within the driver with return types of the recursive translate calls — either we return a token, or we could return an array (~ the expression). In any case, if we'll need to keep this one for some reason, I'd consider renaming it to something like WP_SQLite_Node to avoid confusion with SQL Language Expressions or any hints that it maps to some SQLite construct.

I will quickly try out these simplifications to see if they could make sense.

Base automatically changed from recursive-descent-parser to develop November 20, 2024 07:39
$this->assertSame( null, $root->get_descendant_token( 123 ) );

// Test multiple descendant methods.
// @TODO: Consider breadth-first search vs depth-first search.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamziel I initially kept the logic from the first prototype (which was traversing descendants breadth-first), and added this note to look into it later, but now it seems to me it may be more useful and natural to do a dept-first search (which would return tokens in the order from the original query).

I'm still not 100% sure which one we need more — usually, I just want to know "is there this node or that token somewhere in this subtree", and in that case, both searches work. But sometimes it may be useful to check for a next token or more, for instance, and then get_descendant_tokens() would be more natural with a depth-first traversal. If really needed, we could also provide both. What do you think?

@JanJakes
Copy link
Collaborator Author

@adamziel Another thought regarding WP_Parser_Node: We now provide child and descendant methods. Should we also provide parent methods? It can be useful, but I think it has a major flaw. I'll explain:

Sometimes, it could be useful in the driver, if we want to translate some constructs in a way "if I'm in the context of createTable, do this, otherwise do that". This is a simple way to express some specific nuances, but it's not the only way — we could also handle it from the top within the createTable node instead, constructing the statement a bit more top-down (but it can be more verbose at times).

That said, I think there's a problem — adding parent references to WP_Parser_Node would probably cause memory leaks. PHP GC uses reference counting, which means that it only frees the memory occupied by an object when there are no more references to it, and that circular parent-child references would never allow for garbage collection. Weak refs were only introduced in PHP 7.4, so I guess currently, it doesn't make sense to do this. It also complicates the tree a bit and makes subtrees not "independent". There is the alternative of saving some context (node stack) in the driver during the translation, for instance.

@adamziel
Copy link
Collaborator

adamziel commented Nov 23, 2024

@JanJakes what would be a specific code example where a parent reference would help? I get "do this, do that," but what's "this" and "that"?

@adamziel
Copy link
Collaborator

Also the HTML and the XML API solve the same problem with the get_ breadcrumbs a method that gives you a stack of parent elements names. Maybe the same would use here for AST notes?

@JanJakes
Copy link
Collaborator Author

@JanJakes what would be a specific code example where a parent reference would help? I get "do this, do that," but what's "this" and "that"?

@adamziel For now, I've run into the INTEGER PRIMARY KEY "without" vs "with" autoincrement — in the former case, I'd like to translate it as INT PRIMARY KEY while in the latter one as INTEGER PRIMARY KEY due to the "bug" that's preserved in SQLite for back compatibility. In the AST, it would look something like this:

createStatement
   |- createTable
     |- tableElementList
       |- tableElement          <-- Here I can reliably determine PRIMARY KEY
       |  |- columnDefinition
       |    |- fieldDefinition  <-- Here I can reliably determine AUTOINCREMENT
       |     |- dataType        <-- Here I want to know about PRIMARY KEY and AUTOINCREMENT
       |     |- columnAttribute <-- Here PRIMARY KEY and AUTOINCREMENT can be specified
       |     ...
       |- tableElement
         |- tableConstraintDef  <-- But PRIMARY KEY can also be specified here (AUTOINCREMENT not)

Without the tableConstraintDef option for primary keys, it would be simple-enough to handle this case in fieldDefinition in a top-down manner. Something like:

case 'fieldDefinition':
	$has_primary_key   = ...;
	$has_autoincrement = ...;
	if ( $has_primary_key ) {
		$children       = $ast->get_children();
		$data_type_node = array_shift( $children );
		$data_type      = $this->translate( $data_type_node );
		if ( 'INTEGER' === $data_type ) {
			$data_type = $has_autoincrement ? 'INTEGER' : 'INT';
		}
		return $data_type . $this->translate_sequence( $children );
	}
	return $this->translate_sequence( $ast->get_children() );

Here, the $has_primary_key can peek below for descendant token, but it has no way to verify the tableConstraintDef, that would be in a different tableElement. I would either need a reference to the parent tableElement to use it to search for a primary key, or I would need to save it somehow ($this->set_context('primary_key', ...)) and then read it below.


However, in the case of CREATE TABLE, I'm also wondering whether it may make more sense to fully parse the create table statement (into columns, constraints, etc.), and only then construct a query from that, like we do in the current translator. It's granular changes on specific AST nodes vs reading the full MySQL statement and generating a new SQLite statement from that. I don't know yet which of these options will be ultimately more convenient and reliable.

@adamziel
Copy link
Collaborator

Two ideas I got to avoid a parent reference:

  • Could the driver or parser class keep an array like $child => $parent and expose a method such as get_parent_of?
  • The query execution process could keep track of the stack of nodes it entered

@adamziel
Copy link
Collaborator

adamziel commented Nov 23, 2024

It's granular changes on specific AST nodes vs reading the full MySQL statement and generating a new SQLite statement from that. I don't know yet which of these options will be ultimately more convenient and reliable.

Are these actually different? You need to traverse the tree node by node in either case. Or would the latter mean doing one pass to build a flat index of, say, modifiers and options, and then doing the translation in the second pass? Because I don't think that index could be flat - we'd still need a recursive data structure to account for subqueries.

@JanJakes
Copy link
Collaborator Author

It's granular changes on specific AST nodes vs reading the full MySQL statement and generating a new SQLite statement from that. I don't know yet which of these options will be ultimately more convenient and reliable.

Are these actually different? You need to traverse the tree node by node in either case. Or would the latter mean doing one pass to build a flat index of, say, modifiers and options, and then doing the translation in the second pass? Because I don't think that index could be flat - we'd still need a recursive data structure to account for subqueries.

@adamziel I think they are a bit different:

  1. The first approach with only the necessary granular changes often relies on the default case that just recursively translates all children and concatenates them with a space. At the moment, this often works thanks to the similarity of the two SQL dialects, so even tokens are reused (SELECT, parentheses, etc.). In a way, we're just taking the MySQL query and doing some adjustments, where needed. Over time, we'll probably cover the translation more and more on each node, and will rely a bit less on the shape of the original MySQL query.
  2. The second approach would be oriented more towards reading data and then constructing a whole new query, rather than doing only the necessary changes. The index of the elements could be flat, at least in this case. More generally, even with this approach, the recursion is still possible — if I know that I have a list of some definitions, I can still get them with $definition = $this->translate( ... ), in case they can contain expressions, subqueries, etc.

Looking at CREATE TABLE in particular, and the grammar branch that contains tableElementList (the last below),

createTable:
    TEMPORARY_SYMBOL? TABLE_SYMBOL ifNotExists? tableName (
        LIKE_SYMBOL tableRef
        | OPEN_PAR_SYMBOL LIKE_SYMBOL tableRef CLOSE_PAR_SYMBOL
        | (OPEN_PAR_SYMBOL tableElementList CLOSE_PAR_SYMBOL)? createTableOptions? partitionClause? duplicateAsQueryExpression?
    )
;

we would be able to get a list of all column and constraint definitions, organize them somehow, and then generate the new CREATE TABLE query. The question is whether we should do that, or if we should just go with option 1.


Here's maybe a good example for thinking about this:

A PRIMARY KEY constraint can be defined either within a particular column definition, or in a different table element, separately from the column definition. Now:

  1. With option 1, we'd keep the definition where it was in MySQL, not normalizing it in any way, and relying on the fact that both MySQL and SQLite support defining PRIMARY KEY in both places. When we would run into a specific constraint that can be defined in both places in MySQL, but only in a single one in SQLite, we'd just handle that specific case.
  2. With option 2, we would first create a list of all columns and constraints, and then generate a query from that. Inline constraints would be normalized to separate constraint definitions (or the other way round), etc.

As a consequence, option 1 fixes only the particular cases that SQLite supports differently, leaving the rest mostly to their similarities. If there is a specific keyword or modifier we need to ignore, it needs to be done explicitly.

With option 2, we would be picking and allow-listing the things we support, possibly ignoring any other ones. In many cases, I think option 2 may be a bit more verbose, unless there are many exceptions and edge cases that would have to be handled in option 1.

Ultimately, we can do a little bit of both — sometimes we can be more descriptive for a specific node, we can decompose it and fully reconstruct in an option 2 fashion, while still leveraging option 1 as well by translating some subnodes recursively. The decision may be needed on a case-by-case, but it is a question of what would be a useful guideline to deciding this.


And here's a particular example:

In MySQL, a PRIMARY KEY with AUTOINCREMENT can be defined in both of these ways:
a) CREATE TABLE t (id INT PRIMARY KEY AUTO_INCREMENT);
b) CREATE TABLE t (id INT AUTO_INCREMENT, PRIMARY KEY (id) );

While in SQLite, b) is a syntax error. It means that we have to rewrite it to something like a) because AUTOINCREMENT in SQLite can be only in the column definition and only after a PRIMARY KEY constraint. And here we're touching on the questions — handling this particular case vs reading and normalizing the full statement.


Another interesting example:

  • In MySQL, we can write both PRIMARY KEY AUTO_INCREMENT and AUTO_INCREMENT PRIMARY KEY.
  • In SQLite, it always needs to be PRIMARY KEY AUTOINCREMENT, the other one is a syntax error.

Going back to the two options, with option 1, we need to catch and handle this case, while with option 2, it would probably just work due to the full statement reading and normalization, at the likely cost of more verbosity.

Ultimately, it may also work that we only normalize the PRIMARY KEY & AUTOINCREMENT definitions, leaving the rest as-is. At the moment, I don't know yet if there will be more of such cases in this particular statement.

@adamziel
Copy link
Collaborator

Monumental work here @JanJakes, and a tremendous update. Mad props.

That said, maybe the path to get this merged, e.g., with a feature flag, is to focus on point 1, and do the rest in other PRs.

I would love that! It feels like a good spot to park, do a round or a few rounds of review, and merge. Let's just rewire the CI tests to the old driver to avoid failing checks on trunk.

$this->pdo->query( 'PRAGMA encoding="UTF-8";' );

$valid_journal_modes = array( 'DELETE', 'TRUNCATE', 'PERSIST', 'MEMORY', 'WAL', 'OFF' );
if ( defined( 'SQLITE_JOURNAL_MODE' ) && in_array( SQLITE_JOURNAL_MODE, $valid_journal_modes, true ) ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How easy/hard would if be to avoid referencing any global constants and variables in this driver? Ideally we could use it outside of WordPress as a drop-in PDO replacement and it's not easy to see all the configuration options when we rely on those constants. Can we source this information from a constructor argument? If we need a bit of dedicated code to connect it with WordPress, we could have a separate function that glues all the WordPress-specific bits with new WP_SQLite_Driver()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just marking that this is copy-pasted from the legacy translator. It will be definitely great to refactor this to an explicit set of config options 👍

@adamziel
Copy link
Collaborator

Superseded by #179

@adamziel adamziel closed this May 30, 2025
@JanJakes JanJakes mentioned this pull request Jun 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants