[Tech Blog] Retrospection of making automatic query binding extraction feature

Hello, I'm Hamill, a backend developer at Ntuple.

Today, I'd like to reflect on the process of implementing the 'automatic binding extraction feature' included in the SyncTree 4.0.1 version released last May, focusing on the 'query function enhancement'. Additionally, I'll delve into PHP internal code, specifically PDO, which many PHP developers have used directly or indirectly when manipulating database data. Before we start...

1. This post includes content from a presentation I gave at the 'ModernPHP User Group' last December.

2. To explain the SyncTree 'query binding extraction feature', I also cover the following topics. Since I'm discussing them to a degree where it's easy to understand, if you need additional information, please refer to other resources!

- PDO (PHP Data Objects)

- PDO Scanner operation method

- Basic structure of compiler and lexer

‍

Shortcomings of the existing query feature

SyncTree provides powerful features called 'Storage' and 'Query', which allow you to manage and use desired databases and queries as blocks. The 'Query' menu, which allows users to store/manage frequently used queries, was one of the convenient features. However, it was regrettable that manual binding had to be specified. If there were many binding details, users had to create blocks one by one while viewing the entered query, which could be quite cumbersome if there were many bindings. Therefore, we decided to provide a feature that automatically extracts binding information when users specify bindings in the query so that blocks are generated automatically when used.

‍

What should we do?

‍

How can we extract binding information from user input queries? The idea itself seems simple. When users input queries according to predefined binding rules, we can simply parse the query and extract only the necessary parts.

What considerations should we take into account? First, we need to ignore the binding rule strings included in the query. In this case, how can we distinguish between the binding rules included in the string, such as :or , and the binding information specified by the user? The first suggestion that came up was this:

Shouldn't we just remove all the strings and extract the binding rules from the remaining query?

After listening to the suggestions, a few backend developers brainstormed for a while, and the opinion "How does PDO solve this?" came up. Before implementing the binding extraction feature, we became curious about how PDO extracts binding information from queries and decided to look into the PHP internal code.

PDO (PHP Data Objects) is a PHP extension that provides a consistent interface for accessing databases, regardless of the type of database being used. It standardizes the way of controlling various databases. According to the official documentation:"PDO provides a consistent way to perform queries and retrieve data from different databases." And as described in the opentutorial:"PDO (PHP Data Objects) standardizes the way of controlling various databases. There are different types of databases, each requiring different drivers to interact with. Prior to PDO, APIs for controlling databases varied depending on the type of driver used. With PDO, you can control databases using the same method."

In other words, PDO stands for PHP Data Objects, which is a PHP extension that provides a consistent interface to multiple databases.

If anyone has experience specifying bindings using PDO, please take a moment to recall. It's okay if you don't. Typically, we don't often specify bindings directly as we commonly use well-implemented ORMs. We're just exploring ideas here. The official documentation provides guidance on two methods, although there are technically three (we won't discuss the third method, which involves using stored procedures, here).

Named Placeholders Method

The first method involves directly mapping using named placeholders as shown below:

On the 7th line, ':calories' and ':colour' are the names being mapped, and it can be confirmed that values are being passed using 'bindParam'.

This method is defined in the PHP internal code as an enum named PDO_PLACEHOLDER_NAMED.

‍

Question mark placeholders Method

The second method involves specifying '?' and its corresponding order:

Similarly, on the 7th line, we can see '?' appearing twice, and through the first parameter of 'bindParam', it is specified which value goes into which position.

This method is also defined in the PHP internal code as an enum named PDO_PLACEHOLDER_POSITIONAL. We have briefly examined the two binding methods mentioned in the official documentation.

‍

How does PDO parse queries?

Returning to the main point, during the brainstorming session for implementation, the suggestion to "take a look at the PHP code" came up, and we examined the code responsible for parsing queries. Please also take a look at the GitHub link provided here.

What caught your eye?

Firstly, if you noticed that the source code file extension is .re, you might be familiar with compilers. This file serves as a specification file for lexical analysis. The file itself consists mainly of the Scanner, placeholder structures, and functions scan and pdo_parse_params.

Here's the scan function: This is all of it, without any omitted parts.

At first glance, it seems like most of the code is commented out except for saving the current token position. What does this code do?

The upper part appears to define regular expressions, while the lower part seems to implement the rules defined above, with macro functions being called on the right side of each rule.

What does this code signify? (Of course, some may have already inferred its meaning just by looking at this part.)

Knowledge 1: Basic Structure of a Scanner

As indicated by the file name (parser) or function name (scan), this code represents what we commonly know as a Lexer, Scanner, or Parser, responsible for lexical/syntactic analysis. Lexers and parsers typically consist of the following input structure:

Definition Section
Rule Section
User-defined Subroutine Section

Knowledge 2: Interpreter / Compiler's logic

To better understand, let's briefly examine the basic logical structure of an interpreter or compiler. The process typically proceeds from top to bottom as follows:

Source Program (Raw Program)
Lexical Analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
Target: Code Generation
Target Program

Role of Lexer
‍

In the logical structure of a compiler, the lexer receives the written code as input and breaks it down into tokens, returning them as output.

The lexer is generated through lexer generation libraries like Lex, where user-defined regular expressions and execution code are compiled once more to create the lexer executable. When this generated file is executed, the execution code passes through the lexer, producing a series of tokens.

In this process, the token selection rules correspond to the regular expressions briefly seen in the scan function within the pdo_sql_parser.re file.

In essence, the pdo_sql_parser.re file can be considered as the lexical analysis specification, and the file with the .re extension is compiled into a .c file. This compiled file then undergoes another compilation process to ultimately function as a feature of the PHP interpreter. By examining the Makefile.frag.w32 file located in the same directory as the aforementioned file, it can be inferred that the .c file is compiled again:

‍

Now let's examine the Scanner of the PDO source file in the three structures mentioned above.

The basic structure of the definition part is as follows:

Let's examine the actual code based on this structure:

Now, let's look at each regular expression that appears:

BINDCHR = [:][a-zA-Z0-9_]+;

It extracts combinations of letters, numbers,_ following a :. When tested with a simple query, it extracts as shown below: (https://www.phpliveregex.com/)

‍

QUESTION = [?];

It retrieves the question mark ?.

‍

ESCQUESTION = [?][?];

Here, we have two question marks ??.

‍

COMMENTS = ("/"([^]+|[]+[^/])[]"/"|"--"[^\r\n]*);

This represents the SQL comment rule.

‍

SPECIALS = [:?"'-/];

It includes :, ?, ", ', -, /.

‍

MULTICHAR = [:]{2,};

It extracts strings with two or more : colons.

‍

ANYNOEOF = [\001-\377];

It represents ASCII codes from 1 to 255, excluding null.

‍

The basic structure of the rules section is as follows:

Regular expressions represent the token patterns to be recognized, and the executable code that the lexer should execute when a token is recognized is placed next to them, forming a pair.

Looking at the code again, we can see that the executable code associated with the names defined in the previous definition part is placed next to them.

Although we have passed over it, the executable code is actually defined as macros in the file header.

In other words, it saves the position and returns the pre-defined token code (1~5).

The sections of code excluding the definition and rule parts constitute the user-defined subroutine section.

Since the code comprises around 350 lines, it's difficult to cover everything here. Therefore, let's focus only on the parts related to bindings, which are the main topic of this discussion.

‍

When looking at the name and internal data of the 'placeholder' structure, it can be anticipated that it contains binding-related information.

Now, let's examine the pdo_parse_params function. Placeholder variables of type placeholder are being assigned.

‍

Next, the 'while' loop progresses by returning token codes until it reaches the end of the given statement.

Here are the explanations for the lines you provided:

Line 100: When the token code is PDO_PARSER_BIND, PDO_PARSER_BIND_POS, or PDO_PARSER_ESCAPED_QUESTION, the execution enters the conditional statement. (i.e., excluding regular string tokens.)

Line 101: However, when using the question mark method (PDO_PLACEHOLDER_POSITIONAL) and there are two consecutive question marks (??), they are treated as a string and the execution continues to the next character.

Line 127: If it's not an exception condition, it's considered a binding, so the binding count is incremented.

Please note that the line numbers (100, 101, 127) correspond to the line numbers in the source code at the commit point with the hash "a6a80ee".

Line 140: If the query type supports both named and positional parameter styles, an error message "mixed named and positional parameters" is raised. If anyone has encountered such an error, this could have been the cause.

In summary:

1. First, simple strings or invalid binding styles are excluded from processing.

2. Then, when encountering PDO_PARSER_BIND or PDO_PARSER_BIND_POS, they are considered as bindings, and the count is incremented accordingly.

‍

After confirming the operation of PDO, the processing we need to implement can be summarized as follows:

Ignore binding rule strings contained within strings. In other words, remove the strings.
Remove parts where two or more consecutive ':' are used
Remove parts where two question marks ('??') are used.
Retrieve the remaining parts of the query where binding rules are applied.

Here's the code implementing the described processing:

Executon Result

..... 5 / 5 (100%)Time: 00:00.009, Memory: 6.00 MB

Sql Parser (SqlParser\Tests\SqlParser)

✔ Can extract binding names from user queries. with data set "Binding enclosed in double quotes is not extracted"

✔ Can extract binding names from user queries. with data set "Binding enclosed in single quotes is not extracted"

✔ Can extract binding names from user queries. with data set "Consecutive colons are not extracted"

✔ Can extract binding names from user queries. with data set "Consecutive question marks are not extracted"

✔ Can extract binding names from user queries. with data set "Normal binding"

OK (5 tests, 5 assertions)

‍

Furthermore, you might wonder if other parsers follow a similar approach. You can confirm that by examining ext/json/json_scanner.re, where scanning is implemented using re2c, similar to what we've seen in the PDO parser. (Link: php-src/ext/json/json_scanner.re at master · php/php-src (github.com))

‍

Perhaps it could have been an issue that was easily implemented and quickly forgotten, but the time we spent discussing and verifying together was valuable. Also, while examining the PHP internal code, it was a meaningful time to confirm that the direction of the initial opinions was somewhat correct.

Please continue to follow SyncTree's journey for improving backend development productivity. Thank you for reading until the end!

[Tech Blog] Retrospection of making automatic query binding extraction feature

Shouldn't we just remove all the strings and extract the binding rules from the remaining query?

Knowledge 1: Basic Structure of a Scanner

Knowledge 2: Interpreter / Compiler's logic

Role of Lexer
‍

Now let's examine the Scanner of the PDO source file in the three structures mentioned above.

Keywords

Related Posts

Seoul

Hong Kong

[Tech Blog] Retrospection of making automatic query binding extraction feature

Shouldn't we just remove all the strings and extract the binding rules from the remaining query?

Knowledge 1: Basic Structure of a Scanner

Knowledge 2: Interpreter / Compiler's logic

Role of Lexer‍

Now let's examine the Scanner of the PDO source file in the three structures mentioned above.

Keywords

Related Posts

Seoul

Hong Kong

Role of Lexer
‍