Introduction to PostgreSQL Arrays

Dec 08, 2014

I'm getting the opportunity to play with arrays in PostgreSQL and thought that it might be useful to share what I've learnt so far (which is another way of saying this is not a comprehensive overview and possibly an incorrect one).

We start by creating a posts table:

-- uppercase SQL is lolz
create table posts (
  id serial primary key,
  title text not null,
  post text not null,
  tags text[]
);

From the above, you can see the familiar syntax used to declare an array type. Here we declare tags as an array of text. You can create an array for any built-in or user-defined type.

You insert an array using braces:

insert into posts (title, post, tags)
values ('Teg is great', 'blah blah', '{"fact", "serious"}');

The single and double quote syntax for text isn't great. For numbers and booleans, it's better: {1,5,894,1} and {true, false, true, true, true}.

Square brackets are used to access a specific index. <blink> indexes start at 1 </blink>. When reading, null is returned for values outside of the array's length:

-- return 1, 'fact'
select id, tags[1] from posts;
-- return 1, null
select id, tags[1000] from posts;

We can set to any index, which will increase the array's length automatically. However, there appears to be some optimizations for handling sparse arrays. For example, I have a table of 40 000 rows. Each row has a column with an bool[] and 8 values. The table size is 4.5MB. If I push a 9th value into the array, the table grows to 9MB. A 10th value pushes the size up to 14MB. But if I push diretly into [100] or even [100000] the growth is constant.

It's possible to specify the size of an array when we create the column, but this isn't actually enforced. So we could have done tags text[3] but we'd still be able to set a value at any index.

There are a number of handy functions and operators, such as append and prepend, the ability to get the array_length and || concat arrays together, or arrays to individual values.

We can use any to find all posts that have a specific tag:

select * from posts
where 'sand' = any(tags);

all can be used to find records where an array only contains a specific value (possibly multiple times).

Neither any nor all leverage indexes. Thankfully, it's pretty easy to set things up so that we can use an index for a very common case: getting rows where a specific value is present within our array (like the any example above). First, we create a GIN index:

create index posts_tags on posts using GIN(tags);

Next, we use the contains operator:

select * from posts where tags @> array['worm'];

With a dummy posts table of only 10000 rows, explain analyze reported that the un-indexed query using any took 2.9ms vs the 0.491ms for the query using the index.

The only way I know of doing a not contains is to use a not in/exists:

select count(*)
from posts
where id not in (
  select id
    from posts
    where tags @> array['spice']
);

If you're going to be querying against specific array positions, you can always create a normal index on those positions:

create index post_first_tag on posts((tags[1]));

Which allows you to efficiently query the first element:

select * from posts where tags[1] = 'spice';

The most powerful function that I've used so far is unnest which turns an array into a result set. As a trivial example, this could let you flatten out a result:

select id, unnest(tags) from posts

It could also be used for more powerful purposes, such as being used in a join.

Beyond this, it's worth saying that arrays can be multi-dimensional (text[][]) and, it's difficult (but not impossible) to remove specific value (either by index or by value).