PostgreSQL: Fun with Create From Template

elephante_paper.jpg

Did you know that when you "CREATE DATABASE" you are not actually "creating" anything? Instead you are actually copying from an existing database. This existing database is called, template1This is the default behavior of PostgreSQL, but you can actually tell PostgreSQL to use another database as the template.


  -- CREATE DATABASE WITH TEMPLATE statement
  
  CREATE DATABASE new_db WITH TEMPLATE orig_db OWNER you;

This is a very quick way of making a copy of your database. There is one gotcha with this command, the orig_db needs to be idle during the copying process. This problem is the main reason why we can't use this as a general "COPY DATABASE" command, but you should keep it in mind for writing or reviewing new code that requires migrations.

Here are the basic steps:

1. Create a "master" database that stays in sync with the master branch of your code.

2. When you checkout a new branch create a new database where the TEMPLATE is the master branch database from step 1.

3.  Make your project connect to that new database.

4. Write/Review your new code using the new database.

Here are some scripts that you can use in your Rails projects to help with this process:

1. Add git hook (.git/hooks/post-checkout)

 #!/usr/bin/env ruby
 require 'pg' 
 require 'git'
 
 branch = Git.open('.').current_branch
 conn   = PG.connect(dbname: 'postgres')
 
 begin
   conn.exec("CREATE DATABASE proj_#{branch} WITH TEMPLATE proj_master OWNER me;" )
 rescue => e
   puts e
 end
  
 begin
   conn.exec("CREATE DATABASE proj_test_#{branch} WITH TEMPLATE proj_test_master OWNER me;")
 rescue => e
   puts e
 end

This git hook simply creates a new development and test database from your proj_master and proj_test_master database templates. The first time you checkout a new branch, it might take and extra couple of seconds, but the next time you checkout it will throw a database already exists exception. (Warning, if you use a branch name like feature/something the creation of the database will break because of the / character.)

2. Edit your config/database.yml

<% branch = Git.open('.').current_branch %> 

development:
  adapter: postgresql
  database: proj_<%= branch %>
  host: localhost
  pool: 5
  timeout: 5000
  
test:
  adapter: postgresql
  database: proj_test_<%= branch %>
  host: localhost
  pool: 5
  timeout: 5000

This database.yml file edit forces Rails to connect to the newly created branch database. 

There are some downsides to this approach, for example, having to restart your Rails project to connect to the new database. The benefit though is that you can work/review multiple branches and be confident that your local database is in the state you want it to be.


Using PostgreSQL Explain

Have you ever wanted to know how a database thinks or what steps it will take to find your data? Well there is a command for this, EXPLAIN. 

When a database receives a query, it decides how best to retrieve the data. The database creates a query plan based on different factors including the structure of the query and the properties of your data. As developers, we can use the EXPLAIN command to see what query plan the database is going to use for a query.

The EXPLAIN command is very simple:


  -- EXPLAIN statement
  
  EXPLAIN select * FROM users;

The output of this command will be the query plan, more specifically the output of this command will be a tree of plan nodes.


                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on users  (cost=0.00..10.40 rows=40 width=1672)

In the case of this basic query, the output is a single plan node.

Because there is no WHERE clause, the database has chosen todo a sequential scan (Seq Scan) on the entire table. As you might expect there are different scan nodes to describe different table access methods, for example, index scans and bitmap index scans.

Lets look at the numbers inside the parenthesis. The cost has two numbers, the first number (0.00) is the estimated startup cost. It basically tries to estimate how long it will take before the output phase of the query begins. For example, the time it takes to do the sorting in a sort node. The second number (10.40), is the estimated total cost.

Rows and width are somewhat self explanatory. Rows, is the estimated number of rows the output will produce. Finally, the width, which is the estimated average width of rows in bytes.

It is important to realize that all these numbers are estimates, including the cost numbers. The cost numbers are actually arbitrary units determined by the planner.

Lets take a look at a slightly more complicated output.


-- EXPLAIN SELECT * FROM users WHERE unique < 50;

                        QUERY PLAN                        
----------------------------------------------------------
 Bitmap Heap Scan on users (cost=5.07..229.20, rows=101, width=0)
   Recheck Cond: (unique1 < 50)
   -> Bitmap Index Scan on users_unique1 (cost=0.00..5.04 rows=101 width=0)
     Index Cond: (unique1 < 50)

The database planner has decided to break up this query in two steps. First the database will visit the indexes to find the location of the rows matching the WHERE condition (Bitmap Index Scan), then it will fetch the actual rows (The Bitmap Heap Scan). The important thing to remember when you get a tree with multiple nodes, is that the top node will include the cost of all its child nodes. 

All these costs estimates are great, but wouldn't you like to see the actual time it takes for a query to run? Well we can do this with the ANALYZE option.

WARNING: Using (ANALYZE) runs the actual query.


  -- EXPLAIN (ANALYZE) SELECT * FROM users;
                       
                       QUERY PLAN                                               
-----------------------------------------------------------
 Seq Scan on agents  (cost=0.00..12.23 rows=123 width=2072) (actual time=0.099..0.399 rows=123 loops=1)
 Planning time: 3.135 ms
 Execution time: 0.487 ms

Notice the new information we get, actual time, Planning time and Execution time. It is important to know that actual time is measured in milliseconds while the cost is still estimates. The same goes for rows, one is an estimate and the other is actual rows. Planning time is the time it took for the database to generate the query plan.

This is just the very basic information on the EXPLAIN command and the best way to learn it is by playing, so open up your database console and query away. Notice all the different scan nodes you get, change your tables and see if you get a different plan. Remove an index, what happens?

Get curious!

Add Business Value with Interactors

At FiveStreet, we have been focusing heavily on the 'Mise en place' of our objects ( thanks to Dave Bock for the phrase ). This has been key to our early success. We have found that focusing on the placement of objects allows us to worry more about building new features while adding less fear of breaking stuff. One of our strategies for achieving this is by using interactor objects.

Interactors can be defined as the use cases of your application, for example adding a comment. They are also a great place for objects that shouldn't know about each to interact. For example, in the following code example you will see that when adding a comment we also interact with a external object, the MixpanelWrapper object. What follows is a dead simple example of how we use them in our application.
#app/interactors/agent/add_comment.rb
  class Agent::AddComment
    attr_reader :comment

    def initialize(agent, comment)
      @agent = agent
      @comment = comment
    end

    def allowed?
      touch_log = TouchLog.find(comment.touch_log_id)
      return false if !@comment.valid? || 
        touch_log.agent_id != @agent.id

      true
    end

    def run
      return false if !allowed?

      MixpanelWrapper.
        track("Agent: Add Comment To Prospect", { 
        :distinct_id => @agent.email })

      @comment.save
    end

  end
The previous block of code is a sample of how we build interactors. All of our interactors have two main methods, #allowed? and #run. In the allowed? method we handle all of our validations. The run method runs the use case and more!


# The Comment Controller

class Agent::CommentsController < ApplicationController
  def create
    comment = Comment.new(params[:comment])
    action = Agent::AddComment.new(current_agent, comment)
    if action.run
      flash_success_on_redirect "Comment was added."
      redirect_to prospect_path(action.comment.touch_log)
    else
      flash_error_on_redirect "Please enter a comment."
      redirect_to prospect_path(action.comment.touch_log)
    end
  end
end

This controller is pretty basic, the newish thing here is calling Agent::AddComment.new(current_agent, comment) instead of the normal comment = Comment.new(params[:comment]); comment.save.
At a high level what are the benefits:
1. Easy to Read/Understand: When I started looking at the code base for the first I knew exactly what was happening and where to find each use case.

2. Testing is easier: Thanks to the separation of concerns we gain.

3. Low 'learning' curve: There is nothing amazingly complex going on, you just pass the objects into the interactor and make the magic happen.

What are the negatives:
1. More files: Yeah there are more files but in the end it is easier to search.

2. More prep-time: It is not the default Rails way of doing things, but adding an extra file shouldn't be hard.

How does this add business value:
As mentioned we have been extremely happy with this approach and these are some of the main reasons:
1. First, in a sense this reminds me of Rails. Interactors allow us to have such a great separation of concern that we are able to worry more about building new features and less about breaking shit. How so? Well each interactor lives on its own, it is separated from everything else.

2. Being in its own world lets us easily remove code that we don't need. This is especially useful in the moments where that amazing new feature actually wasn't that amazing.

3. It also allows us to reuse use-cases easier. If I wanted to post a comment anywhere else in the system I can just call this object again.

4. Testing is 10 times easier, this is especially nice in those moments where you have a deadline and you don't have time to write the 'perfect' method. It's ok you got your test and you can always refactor.

5. In the future when new people join the company their biggest requirement will be to know what Agent::AddComment is supposed to do?

Refactoring the Law of Demeter

It is Friday, this means refactoring day!

Here is the code I tackled today:

class Crm::CreateLead

  attr_reader :username, :lead

  def initialize(username, lead)
    @username = username
    @lead = lead
  end

  def generate_data
    property = lead.property
    street = property && property.street && 
      property.street.encode(:xml => :text)
    city = property && property.city && 
      property.city.encode(:xml => :text)
    zipcode= property && property.zip && 
      property.zip.encode(:xml => :text)
    state= property && property.state && 
      property.state.encode(:xml => :text)
  end

  #There is more to the class ...

end

As you can see in the generate_data method, we continuously checked if there was a property and if the property had a specific attribute, if it did then we encoded it. Whenever you see this kind of foo && foo.something it is a warning sign that you might be doing something wrong, something doesn't belong or something is missing. In this case, I realized we were breaking the Law of Demeter. For an excellent post describing the law, you should read Demeter: It's not just a good idea, It's the law by Avid Grimm.

The culprit for all this mess was on line 11 property = lead.property. The problem was not that we were trying to access the property, it was that we were trying to perform actions on the attributes of the property (lines 12 through 15). In addition, and the bigger problem, is that lead.property returns a property object, while property.street returns a string object. In the words of Peter Van Rooijen (read Avdi's blog), we were not only trying to play with our toys but we were also trying to take them apart.

My first pass at cleaning this up was to just pass in the property. After all the law say that I can play with toys that were given to me.

class Crm::CreateLead

  attr_reader :username, :lead, :property

  def initialize(username, lead, property)
    @username = username
    @lead = lead
    @property = property
  end

  def generate_data
    street = property.street && property.street.encode(:xml => :text)
    city = property.city && property.city.encode(:xml => :text)
    zipcode= property.zip&& property.zip.encode(:xml => :text)
    state= property.state&& property.state.encode(:xml => :text)
  end

  #There is more to the class ...

end

This was a little better, but I still had to check if there is a street and then try to encode that string. To make this nicer I added some methods to the Property class.
class Crm::CreateLead
 
class Crm::CreateLead 
  attr_reader :username, :lead, :property

  def initialize(username, lead, property)
    @username = username
    @lead     = lead
    @property = property
  end

  def generate_data
    street  = property.encode_street
    city    = property.encode_city
    zipcode = property.encode_zip
  end

  #There is more to the class ...

end

class Property
  
  def encode_street
    @street.to_s.encode(:xml => :text)
  end

  def encode_city
    @city.to_s.encode(:xml => :text)
  end

  # more encode methods

end

Ahhh much nicer now.

Web APIs and the lost update problem

In the world of web API development there is always a concern for the lost update problem. This is when two users access the same record and try to perform an update concurrently.

For example:

1. Julio performs a

   GET request on /users/1
   
2. Maria also performs a 
   
   GET request on /users/1
   
3. Julio then performs an edit to update the record

4. Maria also performs an edit, but since she requested /users/1 before Julio updated the record, she would override any changes Julio did. 

Julio's edits would be lost :(

There are a couple of ways around this, but I want to show you how to use precondition HTTP headers and entity tags to solve the problem.

An entity tag is a string used to identify the version of a resource. The ETag HTTP header carries an entity tag as a value and typically looks something like:

ETag: "686897696a7c876b7e"

Precondition HTTP headers are the headers that start with If, for example, If-Match and If-None-Match. You can set these in your HTTP request, and if the conditions are not met the request will fail. Browsers may use these headers to decide if they should use a resource that is stored locally or if the server should return a new version of a resource.

For example:

1. Browser sends GET request to http://example.com/users/1

curl -I "http://example.com/users/1"

2. Server responds with

HTTP/1.1 200 OK
Etag: "0c4aeda935"
(OTHER HEADERS)

2. Browser caches the response locally and assigns the ETag value to it

3. Browser needs that resource (/users/1) again so it will send a GET request with a IF-None-Matches header that has the ETag value.

curl -I "http://example.com/users/1" --header 'If-None-Match: "0c4aeda935"'

4. If there is a record with that URL /users/1 and it has the same ETag value, the server will respond with

HTTP/1.1 304 Not Modified

With this 304 Not Modified response the browser knows that the resource has not changed, so it can just reuse the local copy.

Now knowing that we can set precondition to our HTTP request, lets revisit our lost update problem.

1. Julio performs a GET

Client Request
curl -I "http://example.com/users/1"

Server Response
HTTP/1.1 200 OK
Etag: "0c4aeda935300c"
(OTHER HEADERS)

2. Maria also performs a GET

Client Request
curl -I "http://example.com/users/1"

Server Response
HTTP/1.1 200 OK
Etag: "0c4aeda935300c"
(OTHER HEADERS)

3. Julio then performs an edit to the record and sends the request

Client Request
curl -X PUT "http://example.com/users/1"

4. Maria also performs edit. Again she doesn’t have the latest copy of the resource with Julio’s edits.

Client Request
curl -X PUT "http://example.com/users/1" --header 'If-Match: "0c4aeda935300c"'

Server Response
HTTP/1.1 412 Precondition Failed.

In this new version of our example, the request by Maria to update /users/1 is also sending the If-Match header. This is telling the server to perform this update ONLY if the resource at the specified URL has the same ETag. In other words, only update the resourse, if the resource hasn't changd. Because Julio edited the resource the ETag value will NOT match. In response Maria gets the HTTP 412 Precondition Failed message.

At this point the client can be creative and provide a solution for merging the edits. In the end the goal was to not lose Julio’s edits and it didn’t.