The Relational Database Management System (RDBMS) is built with numerous complex algorithms and data structure just to store and retrieve information properly. The complexity is almost akin to an operating system that functions in sync with many features almost in real time. Modern RDBMS has built-in facility for memory management, file buffering, network communication support etc. They form the basic architecture of the RDBMS package. The article provides a glimpse of what goes behind the scene when a user submits a query until the result is obtained from the database.
An RDBMS package is typically a database server that serves multiple clients via communication pathways under the aegis of network protocol such as such as socket, pipes etc. In a standalone database application client communicates with the database via programmatic interfaces. In such a case the database server becomes part of the client application or vice versa. Sometimes the database is contained within the embedded system as a slave to the host system. Generally, in a large database application, the RDBMS server is separated from the concern on the application by hosting the server in a remote location. The business logic interacts with the database server via network as per requirement. Regardless, the logic for query processing remains the same be it an embedded database application, network application or a standalone application.
Applications connect to the database using a set of protocols called database connectors. The Open Database Connectivity (ODBC) is a well-known database connector that an application can use to connect to almost any database. There are also vendor specific database connectors for an RDBMS such as MySQL. MySQL supports connectors for Java (JDBC), PHP, Python, .NET etc. These implementations mostly support communication over network protocols. These connectors are designed (API) to transfer SQL commands to the database server and retrieve information upon request by the client. The connectors typically consist of database driver and client access APIs.
Queries are nothing more than questions put to the database according to the syntax and semantics of standard query language called SQL (Structured Query Language). The database server understands the language and replies back as per the query submitted. According to the semantics of SQL, queries can be of two types. The first type of query is a Data Definition Language (DDL) query, which is typically used to create and do things with the dataabse such as creating and altering tables, defining indexes, managing constraints, etc. A second type of query called the Data Manipulation Query (DML) is used to work on the data of the database. This includes actions such as SELECT querying, updating, and deleting data in the database tables.
A typical SELECT query syntax may be written as follows. The square bracket () represents optional parameters and the lowercase notation depicts user-defined variables.
SELECT [ DISTINCT ] columns
[ WHERE expression ]
[ GROUP BY columns ]
[ HAVING expression ]
[ ORDER BY columns ] ;
- The DISTINCT keyword removes the duplicate records in the final result.
- The FORM clause forms a projection on the references that appear in the other clauses.
- The WHERE applies the expression on the referenced table.
- The GROUP BY clause groups the result according to the specified attribute.
- The HAVING clause applies filter on the groups.
- The ORDER BY clause sorts the result.
Once the client submits a database query statement via network protocol to the database server, it is first interpreted and then executed. The interpretation is meant to decipher the meaning of the query. This is done by parsing the SQL statement and breaking it into elements before executing. The interpretation of the query is a two-step process: one, in the logical plan it describes what the query is supposed to do and secondly, in the physical plan, it describes how to implement the query.
The physical plan of the query is handled by the database system’s query execution engine. A tree structure is created where each node represents query operator with number of children. These children represent a number of tables involved in the operation. The query is passed through several phases before execution such as parsing, validation, optimization, plan generation/compilation and finally execution.
- Parsing breaks the SQL statement into parts, validates it and translate the logical query (SQL query) into a query tree according to the syntactical scheme of the relational algebra. This is the logical plan.
- The logical query is then translated into a physical plan. There can be many such plans, but the query optimizer finds the best one, say, according to the estimated execution performance. This is done by taking on the relational algebra tree into optimizer’s search space and expanding it by forming alternative execution plans and then finally choosing the best among them. The result is akin to the code-generation part the compiling of SQL. The critical resources to optimize the code is obtained from the database system’s catalog that contains the information about number if tuples, and many other things such as stored relations referenced by the query etc. The optimizer finally copies the optimal plan from the memory structure and send it to the query execution engine. The query execution engine executes the plan using database relation as input and generates new table with rows and columns that matches the query criteria.
Note that the plan is always optimal or near optimal within the search space of the optimizer. The interpretation of a SQL query by the RDBMS is not that simple after all. Optimization is a costly affair because it analyses on alternative execution plans. A single query can have an infinite number of possibilities. Therefore, it consumes additional processing time impacting on both the query optimizer, query execution engine and overall database response time.
This is just a glimpse of an overall SQL query execution process. In short, parsing breaks the SQL statement into elements which is then passed through validation phase to validate for errors and check the syntax according to the SQL standard and identify query operation. The parser transfers the query into an intermediate form, recognized by the optimizer which generates efficient query execution plan. The execution engine then takes the optimized query and executes the query. The result thus obtained from the execution is finally returned to the client.
# # #