PostgreSQL: Fun with Create From Template

elephante_paper.jpg

Did you know that when you "CREATE DATABASE" you are not actually "creating" anything? Instead you are actually copying from an existing database. This existing database is called, template1This is the default behavior of PostgreSQL, but you can actually tell PostgreSQL to use another database as the template.


  -- CREATE DATABASE WITH TEMPLATE statement
  
  CREATE DATABASE new_db WITH TEMPLATE orig_db OWNER you;

This is a very quick way of making a copy of your database. There is one gotcha with this command, the orig_db needs to be idle during the copying process. This problem is the main reason why we can't use this as a general "COPY DATABASE" command, but you should keep it in mind for writing or reviewing new code that requires migrations.

Here are the basic steps:

1. Create a "master" database that stays in sync with the master branch of your code.

2. When you checkout a new branch create a new database where the TEMPLATE is the master branch database from step 1.

3.  Make your project connect to that new database.

4. Write/Review your new code using the new database.

Here are some scripts that you can use in your Rails projects to help with this process:

1. Add git hook (.git/hooks/post-checkout)

 #!/usr/bin/env ruby
 require 'pg' 
 require 'git'
 
 branch = Git.open('.').current_branch
 conn   = PG.connect(dbname: 'postgres')
 
 begin
   conn.exec("CREATE DATABASE proj_#{branch} WITH TEMPLATE proj_master OWNER me;" )
 rescue => e
   puts e
 end
  
 begin
   conn.exec("CREATE DATABASE proj_test_#{branch} WITH TEMPLATE proj_test_master OWNER me;")
 rescue => e
   puts e
 end

This git hook simply creates a new development and test database from your proj_master and proj_test_master database templates. The first time you checkout a new branch, it might take and extra couple of seconds, but the next time you checkout it will throw a database already exists exception. (Warning, if you use a branch name like feature/something the creation of the database will break because of the / character.)

2. Edit your config/database.yml

<% branch = Git.open('.').current_branch %> 

development:
  adapter: postgresql
  database: proj_<%= branch %>
  host: localhost
  pool: 5
  timeout: 5000
  
test:
  adapter: postgresql
  database: proj_test_<%= branch %>
  host: localhost
  pool: 5
  timeout: 5000

This database.yml file edit forces Rails to connect to the newly created branch database. 

There are some downsides to this approach, for example, having to restart your Rails project to connect to the new database. The benefit though is that you can work/review multiple branches and be confident that your local database is in the state you want it to be.


Using PostgreSQL Explain

Have you ever wanted to know how a database thinks or what steps it will take to find your data? Well there is a command for this, EXPLAIN. 

When a database receives a query, it decides how best to retrieve the data. The database creates a query plan based on different factors including the structure of the query and the properties of your data. As developers, we can use the EXPLAIN command to see what query plan the database is going to use for a query.

The EXPLAIN command is very simple:


  -- EXPLAIN statement
  
  EXPLAIN select * FROM users;

The output of this command will be the query plan, more specifically the output of this command will be a tree of plan nodes.


                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on users  (cost=0.00..10.40 rows=40 width=1672)

In the case of this basic query, the output is a single plan node.

Because there is no WHERE clause, the database has chosen todo a sequential scan (Seq Scan) on the entire table. As you might expect there are different scan nodes to describe different table access methods, for example, index scans and bitmap index scans.

Lets look at the numbers inside the parenthesis. The cost has two numbers, the first number (0.00) is the estimated startup cost. It basically tries to estimate how long it will take before the output phase of the query begins. For example, the time it takes to do the sorting in a sort node. The second number (10.40), is the estimated total cost.

Rows and width are somewhat self explanatory. Rows, is the estimated number of rows the output will produce. Finally, the width, which is the estimated average width of rows in bytes.

It is important to realize that all these numbers are estimates, including the cost numbers. The cost numbers are actually arbitrary units determined by the planner.

Lets take a look at a slightly more complicated output.


-- EXPLAIN SELECT * FROM users WHERE unique < 50;

                        QUERY PLAN                        
----------------------------------------------------------
 Bitmap Heap Scan on users (cost=5.07..229.20, rows=101, width=0)
   Recheck Cond: (unique1 < 50)
   -> Bitmap Index Scan on users_unique1 (cost=0.00..5.04 rows=101 width=0)
     Index Cond: (unique1 < 50)

The database planner has decided to break up this query in two steps. First the database will visit the indexes to find the location of the rows matching the WHERE condition (Bitmap Index Scan), then it will fetch the actual rows (The Bitmap Heap Scan). The important thing to remember when you get a tree with multiple nodes, is that the top node will include the cost of all its child nodes. 

All these costs estimates are great, but wouldn't you like to see the actual time it takes for a query to run? Well we can do this with the ANALYZE option.

WARNING: Using (ANALYZE) runs the actual query.


  -- EXPLAIN (ANALYZE) SELECT * FROM users;
                       
                       QUERY PLAN                                               
-----------------------------------------------------------
 Seq Scan on agents  (cost=0.00..12.23 rows=123 width=2072) (actual time=0.099..0.399 rows=123 loops=1)
 Planning time: 3.135 ms
 Execution time: 0.487 ms

Notice the new information we get, actual time, Planning time and Execution time. It is important to know that actual time is measured in milliseconds while the cost is still estimates. The same goes for rows, one is an estimate and the other is actual rows. Planning time is the time it took for the database to generate the query plan.

This is just the very basic information on the EXPLAIN command and the best way to learn it is by playing, so open up your database console and query away. Notice all the different scan nodes you get, change your tables and see if you get a different plan. Remove an index, what happens?

Get curious!