homedark

Writing a Custom SQLite Function (in C) - Part 1

Feb 02, 2023

A powerful feature of SQLite is the ability to call application-defined functions from within SQL. This functionality leverages the fact that SQLite is embedded within the application and thus has far greater capacity for integration than an external database.

Your language's SQLite library probably offers some facility for registering a function with SQLite. But in most cases, these are heavily abstracted, hiding all internal SQLite details and, in some cases, limiting what you're able to accomplish.

In this short series, we're going to add a user defined function to SQLite in C. We'll go through the process slowly and in a way which, I hope, a C novice can follow.

Setup

It's pretty easy to get setup. The first thing we'll do is download and unzip the complete SQLite source code. Head over to the SQLite download page and search for "Alternative Source Code Formats". Download the source code. I'll be using https://www.sqlite.org/2022/sqlite-src-3400100.zip, but there's a good chance the latest version will work fine. Next we'll run ./configure and make. The entire sequence looks like:

wget https://www.sqlite.org/2022/sqlite-src-3400100.zip
unzip sqlite-src-3400100.zip
cd sqlite-src-3400100/
./configure
make

In addition to other things, this will generate what's known as the "amalgamation" source code, which is the entire SQLite source code in a single file (sqlite3.c) with a single header (sqlite3.h). If you have any problems building this from source, you can grab the pre-generated amalgamation from the same download page, just search for "sqlite-amalgamation" (it's usually the first (pre-release) or second (stable release) link).

We'll copy sqlite3.c and sqlite3.h into a new empty working directory. For me, that's ~/code/play/sqlite. To this, we'll add a file, main.c. In the end, you should have a directory with 3 files: sqlite3.h, sqlite3.c and main.c. You'll only touch main.c, open it now.

Opening a Connection

The first thing we'll do is open a new SQLite database and then close it:

#include <stdio.h>
#include <stdlib.h>
#include "sqlite3.h"

int main(int argc, char ** argv) {
  sqlite3 *db;

  int flags = SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE;
  int rc = sqlite3_open_v2("db.sqlite", &db, flags, NULL);

  if (rc != SQLITE_OK) {
    printf("Failed to open: %s\n", sqlite3_errmsg(db));
    sqlite3_close_v2(db);
    exit(1);
  }

  sqlite3_close_v2(db);
  exit(0);
}

If you go through the above code, it's hopefully understandable. We're using 3 SQLite functions. sqlite3_open_v2 opens a connection to our database and, because we're specifying the SQLITE_OPEN_CREATE flag, will create the database if necessary. sqlite3_close_v2 closes the connection. sqlite3_errmsg returns the last error message that happened on the provided connection.

If you're new (or like me rusty) with C, there's one thing about the above that must be understood: return values. A typical pattern in C, and certainly in SQLite, is for functions to return a return code. If we need to "return" multiple values, such as sqlite3_open_v2 which needs to "return" a return code (hence rc) and a connection, we pass a reference to our connection and let the function initialize it.

Those familiar with Go (or other modern languages) will probably consider this pattern inferior to returning multiple values (and I'd agree!). But even in Go, passing "output" parameters is quite common: think json.Unmarshal which takes a reference to the data to populate, or the Scan methods of sql.Row and sql.Rows which take references to empty values to populate.

We can compile and run the above:

cc sqlite3.c main.c -o sqlite_udf
chmod a+x sqlite_udf # make it executable
./sqlite_udf  # execute it

This should create an empty db.sqlite file in the same directory. If you want, you can check the error handling by changing the path, "db.sqlite", to something that doesn't exist, like /invalid/fail, re-compiling and running it again.

Refactoring

Before we start writing a custom function, let's clean things up a little to streamline the rest of our effort. Let's create a helper to print errors, and let's add another function, run, so that we don't have to worry about closing our connection in every error case:

#include <stdio.h>
#include <stdlib.h>
#include "sqlite3.h"

// Must define functions before we can use them
// An alternative would be to move the full run and print_error up here
// but that can get pretty tangled as we add more and more functions that call
// one and other.
static int run(sqlite3*);
static void print_error(const char*, sqlite3*);

int main(int argc, char ** argv) {
  sqlite3 *db;
  int flags = SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE;
  int rc = sqlite3_open_v2("db.sqlite", &db, flags, NULL);
  if (rc != SQLITE_OK) {
    print_error("sqlite_open: %s", db);
    exit(1);
  }
  int status = run(db);
  sqlite3_close_v2(db);
  exit(status);
}

static void print_error(const char *format, sqlite3 *db) {
  printf(format, sqlite3_errmsg(db));
  printf("\n");
}

static void run(sqlite3 *db) {
  // our main code will go here
}

Note that in C, we usually explicitly declare our functions before we can use them. Hence we provide the function headers for run and print_error. An alternative approach is to simply define the entire function before we use it. So we could put print_error and run and any other function that we'll create before they are needed.

Increment User Defined Function

We're going to do this slowly and incrementally. We'll eventually write a slightly fancier function, but for now, we're keeping it simple. Inside of our new run function, we're going to use the sqlite3_create_function function to register a custom function with SQLite:

static int run(sqlite3 *db) {
  int rc = sqlite3_create_function(db, "increment", 1, SQLITE_UTF8, NULL, &increment, NULL, NULL);
  if (rc != SQLITE_OK) {
    print_error("create_function: %s", db);
    return 2;
  }
  return 0;
}

We're registering our (not-yet-created) increment function, that takes 1 argument and naming it "increment". The eventual result will be that we'll be able to do: select increment(value_or_column). The first argument to sqlite3_create_function is the database connection (sqlite3*). Functions are registered on the connection, not the database. If we open multiple connections and want our user defined function available in all of them, we need to call sqlite3_create_function on each connection.

Now the skeleton of our increment function:

// add this header with the run and print_error headers
static void increment(sqlite3_context*, int, sqlite3_value**);

// add this function anywhere else in the file
static void increment(sqlite3_context *context, int argc, sqlite3_value **argv) {

}

The sqlite3_context* is used to store results that will be passed back to SQLite and can be used for a few other things (like getting the underlying db connection (of type sqlite3*). argc is the number of arguments that were passed into our function, and argv are the actual arguments. When we called sqlite3_create_function we specified that our function needed 1 argument. SQLite will enforce this so, in our case, argc will always be equal to 1. But we could have specified -1 to indicate that the number of argument is variable, in which case we'd use argc to figure out the number of arguments that were passed to our function.

Parameters to our function are given as sqlite3_value*. To increment, we'll need an integer. We can use the sqlite3_value_int function to turn an sqlite3_value* into an int. As you can probably guess, there are functions for all of the SQLite supported types, such as sqlite3_value_double, sqlite3_value_text and so on:

static void increment(sqlite3_context *context, int argc, sqlite3_value **argv) {
  // we can safely access argv[0] because told SQLite that our increment
  // function takes 1 argument when we called sqlite3_create_function.
  int value = sqlite3_value_int(argv[0]) + 1;
  // TODO: return
}

We can then use the sqlite3_result_int to store the result into the context. Just like sqlite3_value_*, there are also sqlite3_result_* for each supported type. Thus, the entire increment function is:

static void increment(sqlite3_context *context, int argc, sqlite3_value **argv) {
  // we can safely access argv[0] because told SQLite that our increment
  // function takes 1 argument when we called sqlite3_create_function.
  int value = sqlite3_value_int(argv[0]) + 1;
  sqlite3_result_int(context, value);
}

Testing It (By Executing A Statement)

Our increment function will work, but it'd be nice to see it in action. To do so, we need to execute a statement, such as select increment(?1) and read the result. This is, as you can probably guess, the most important thing we can do in SQLite. So while it isn't directly related to writing a custom function, it's worth learning.

To execute a statement, any statement, we must do multiple things. First we have to prepare the statement. Next we optionally bind values to the statement. Then we step through the results in a loop, until the statement execution is done. Finally, we must cleanup after ourselves. This prepare -> bind -> step loop -> cleanup is a common pattern in SQLite (for example, it's the same thing that must be done when PostgreSQL executes a statement).

Starting from the beginning, we must prepare a statement:

static int run(sqlite3 *db) {
  // existing sqlite3_create_function code
  // ...

  sqlite3_stmt *stmt;
  rc = sqlite3_prepare_v2(db, "select increment(9000)", -1, &stmt, 0);
  if (rc != SQLITE_OK) {
    print_error("statement prepare: %s", db);
    return 2;
  }


  return 0;
}

The third parameter is the length of the SQL statement to execute. -1 tells SQLite to find the end of the string (zero-terminated). This is a common pattern, where we can either specify the length of the string or let SQLite infer it. Being able to pass in an explicit length allows us to use larger buffers and tell SQLite to only care about the first N bytes (think something like an Go slice; but unsafe since we could always specify a length greater than our buffer).

sqlite3_prepare_v2 only prepares a single statement. The last parameter is an output parameter that, after execution, would point to the remainder of our input. In this case, we only have a single statement and thus don't expect any remainder.

After the function executes, we expect rc == SQLITE_OK and stmt to be initialized. At this point, we need to bind our values. We could make our life easier and simply test select increment(9000), but to demonstrate binding, we'll do select increment(?1). Thus we need to bind one value:

if (sqlite3_bind_int64(stmt, 1, 9000) != SQLITE_OK) {
  print_error("bind: %s", db);
  return 2;
}

There are sqlite3_bind_* versions for all of the SQLite supported types. The 2nd parameter is the bind index. Bind indexes start at 1.

We're now ready to step through the result. Normally, we'd step in a loop, waiting for the step function to return SQLITE_DONE. But we'll cheat a little for our specific use-case and rely on the fact that we only expect a single row:

int status;
rc = sqlite3_step(stmt);
if (rc == SQLITE_ROW) {
  printf("increment(%d) == %d", 9000, sqlite3_column_int(stmt, 0));
  status = 0;
} else if (rc == SQLITE_DONE) {
  // very weird, should not be possible
  printf("Unexpected empty result");
  status = 3;
} else {
  print_error("step: %s", db);
  status = 2;
}
sqlite3_finalize(stmt);
return status;

Again, the sqlite3_step function is what we use to iterate through each row of our result. Here we cheat a little as we know to expect a single row (thus, don't iterate). The sqlite3_column_int function reads an int from the current row as the specified index (here our indexes start at 0). There are various sqlite3_column_* functions to read different types of values from.

Finally, we have to call sqlite3_finalize on the prepared statement to clean up any resources associated with it. In the next part we'll look more closely at resource allocation and lifetime as these are critical and complicated topics we need to understand. Our simple example only deals with an integer, but what if we were dealing with a heap-allocated value (like a string)? Who would be responsible for freeing that memory?

Complete Code

Our completed code looks like:

#include <stdio.h>
#include <stdlib.h>
#include "sqlite3.h"


static int run(sqlite3*);
static void print_error(const char*, sqlite3*);
static void increment(sqlite3_context*, int, sqlite3_value**);

int main(int argc, char ** argv) {
  sqlite3 *db;
  int flags = SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE;
  int rc = sqlite3_open_v2("db.sqlite", &db, flags, NULL);
  if (rc != SQLITE_OK) {
    print_error("sqlite_open: %s", db);
    exit(1);
  }
  int status = run(db);
  sqlite3_close_v2(db);
  exit(status);
}

static void print_error(const char *format, sqlite3 *db) {
  printf(format, sqlite3_errmsg(db));
  printf("\n");
}

static int run(sqlite3 *db) {
  int rc = sqlite3_create_function(db, "increment", 1, SQLITE_UTF8, NULL, &increment, NULL, NULL);
  if (rc != SQLITE_OK) {
    print_error("create_function: %s", db);
    return 2;
  }

  sqlite3_stmt *stmt;
  rc = sqlite3_prepare_v2(db, "select increment(?1)", -1, &stmt, 0);
  if (rc != SQLITE_OK) {
    print_error("statement prepare: %s", db);
    return 2;
  }

  if (sqlite3_bind_int64(stmt, 1, 9000) != SQLITE_OK) {
    print_error("bind: %s", db);
    return 2;
  }

  int status;
  rc = sqlite3_step(stmt);
  if (rc == SQLITE_ROW) {
    printf("increment(%d) == %d", 9000, sqlite3_column_int(stmt, 0));
    status = 0;
  } else if (rc == SQLITE_DONE) {
    // very weird, should not be possible
    printf("Unexpected empty result");
    status = 3;
  } else {
    print_error("step: %s", db);
    status = 2;
  }
  sqlite3_finalize(stmt);
  return status;
}

static void increment(sqlite3_context *context, int argc, sqlite3_value **argv) {
  int value = sqlite3_value_int(argv[0]) + 1;
  sqlite3_result_int(context, value);
}

Follow Up Exercise

There are two additional enhancements you might be interested in exploring on your own. The first is expanding our increment function to take a 2nd (possibly optional) parameter to indicate how much to increment our value by (defaulting to 1). The second is to add type strictness. As-is, the return value for increment('teg') will be 1, as sqlite3_value_int will return 0 when called on a non-integer value (or a value that cannot be coerced to an integer (e.g. passing the a string argument of '9002' would return 9003)). We could use the sqlite3_value_type function on argv[0] to determine the type and handle non-integer values. In such cases, instead of using the sqlite3_result_int to store an integer value in our context, we could use sqlite3_result_error to store an error.