I Don't Quite Understand ~ A Blog About Programming

What's In That Gem? / Chronic - 18 Dec 2016

In this post I’m going to do a little dive into a gem and talk a bit about how the gem is constructed and some design features that went into it. I’d like to turn this into a series on different gems every time but we’ll see how this one goes.

Chronic

chronic is a natural language date and time parsing library. The first version was originally uploaded to RubyGems.org on September 6, 2006. Needless to say it’s had a lot of work on it over the years, but it’s basic structure hasn’t changed a whole lot since then. It may seem like magic at first, but once we dig into it a bit we can pull back the curtain and see what’s really going on.

Interface

Chronic’s main public interface has two methods, one called parse that parses a string of text into a time object, and another construct that builds a time object given a day, hour, month, time, etc. The construct method is fairly straightforward, so we’ll mostly be focusing on what parse is doing.

Parsing

#lib/chronic.rb

def self.parse(text, options = {})
  Parser.new(options).parse(text)
end

The parse method takes a string of text and a hash of various options that you can use to configure how chronic parses the text. The options are used to generate a new Parser object and then the text is passed in to be parsed. Let’s take a peek inside the Parser object.

#lib/chronic/parser.rb

def parse(text)
  tokens = tokenize(text, options)
  span = tokens_to_span(tokens, options.merge(:text => text))
  
  puts "+#{'-' * 51}\n| #{tokens}\n+#{'-' * 51}" if Chronic.debug

  guess(span, options[:guess]) if span
end

For the most part we only care about the first two lines in this method. tokenize is responsible for cleaning up our text and making sure it’s in a standardized format for the rest of the library to have a reliable base to work off of. This is essentially the lexing step, tagging each token with a type based on various rules. Then finally tokens_to_span is the evaluation step turning our tokens into a Span that hopefully contains our correct time object.

tokenize

This method does the standardization, and lexing on the text.

#lib/chronic/parser.rb

def tokenize(text, options)
  text = pre_normalize(text)
  tokens = text.split(' ').map { |word| Token.new(word) }
  [Repeater, Grabber, Pointer, Scalar, Ordinal, Separator, Sign, TimeZone].each do |tok|
    tok.scan(tokens, options)
  end
  tokens.select { |token| token.tagged? }
end

The pre_normalize method is for the most part a big list of regex substitutions. Turning commas, quotes, etc into spaces, ‘numbers’ into numbers, and ordinal words like ‘third’ into ‘3rd’. Here’s a sample of it.

#lib/chronic/parser.rb

def pre_normalize(text)
  #...
  text.gsub!(/['"]/, '')
  text.gsub!(/,/, ' ')
  text.gsub!(/^second /, '2nd ')
  text.gsub!(/\bsecond (of|day|month|hour|minute|second)\b/, '2nd \1')
  text = Numerizer.numerize(text)
  text.gsub!(/([\/\-\,\@])/) { ' ' + $1 + ' ' }
  text.gsub!(/(?:^|\s)0(\d+:\d+\s*pm?\b)/, ' \1')
  #...
  text
end

After pre_normalize you can see it’s a simple mapping, split on spaces, that generates Tokens from the text. After prenormalization and splitting the tokens, each Tag type is checked against the tokens to see if they meet the rules for that tag. If they match, the Token is given that tag and we move to the next tag type. The base class for all of the tags is the Tag class. It has a method named scan_for, but to understand that we need to look at how it’s getting called.

#lib/chronic/tags/grabber.rb

def self.scan(tokens, options)
  tokens.each do |token|
    token.tag scan_for_all(token)
  end
end

def self.scan_for_all(token)
  scan_for token, self,
  {
    /last/ => :last,
    /this/ => :this,
    /next/ => :next
  }
end

The scan method called from tokenize is where the tag class starts it’s journey. It simply iterates through the tokens and adds a tag using scan_for_all which subsequently calls the base Tag class’s scan_for method. We can see what’s passed in, the token, self, and a hash of the rules for Grabber. Below we can see what happens in the base Tag class.

#lib/chronic/tag.rb

def scan_for(token, klass, items={}, options = {})
  case items
  when Regexp
    return klass.new(token.word, options) if items =~ token.word
  when Hash
    items.each do |item, symbol|
      return klass.new(symbol, options) if item =~ token.word
    end
  end
  nil
end

scan_for can take either a hash or a regular expression object. It basically just checks if the token matches against either the regex or the hash, whichever one is passed in. Finally a little metaprogramming thrown in generates a new object based on the tag (klass) type. In this case if we had a string like ‘last tuesday of september’, scan_for would return a Grabber object and add it to the Token’s list of tags.

The last thing we need to do to tokenize our text is just to filter out the tokens that have no tags associated with them after the tagging process. I should also mention there are different types of tags that are more complex than Grabber mostly Repeaters which allow for stuff like ‘third week in april’, it checks for most tokens that contain a named value like month names, season names, or day names.

Evaluation using tokens_to_span

Phew, that was a lot. Now that we have our tokens tagged, we can begin to parse out their meanings into time objects. To do this we need to first get a Dictionary of definitions that maps a set of tokens to a specific action needed to parse our text. This is the basics of how a parser works on programming languages as well. We’re getting there!

The basic outline of tokens_to_span looks like this:

#lib/chronic/parser.rb

def tokens_to_span(tokens, options)
  definitions = definitions(options)
  
  #...more definition types iterated

  definitions[:anchor].each do |handler|
    if handler.match(tokens, definitions)
      good_tokens = tokens.select { |o| !o.get_tag Separator }
      return handler.invoke(:anchor, good_tokens, self, options)
    end
  end
  #...
end

A basic definition class looks like this:

#lib/chronic/definition.rb

class AnchorDefinitions < SpanDefinitions
  def definitions
    [
      Handler.new([:separator_on?, :grabber?, :repeater, :separator_at?, :repeater?, :repeater?], :handle_r),
      Handler.new([:grabber?, :repeater, :repeater, :separator?, :repeater?, :repeater?], :handle_r),
      Handler.new([:repeater, :grabber, :repeater], :handle_r_g_r)
    ]
  end
end

Each Handler class takes an array or nested array of types that correspond to our tag types. The question marks mean that type is optional in our set of tokens.

We’ve gathered our definitions and we iterate over the tokens determining if one of the handlers (actions) in the definition matches our specific set of tokens. The match method on the Handler class looks at the tokens and determines if they match against the definition, returning true or false. Finally when a match has been found the correct action method gets called using the second argument to our Handler init method and the Handler’s invoke method. For instance :handle_r_g_r is the name of a method to run when we have a :repeater, :grabber, and :repeater as our set of tokens. Let’s see what that looks like:

#lib/chronic/handlers.rb

# Handle repeater/grabber/repeater
def handle_r_g_r(tokens, options)
  new_tokens = [tokens[1], tokens[0], tokens[2]]
  handle_r(new_tokens, options)
end

It takes in the tokens and the options and figures out the need steps to build a time object out of our string. In this case it’s just reordering the tokens. Many of the methods in the Handlers module parse out a little bit in steps and then pass it off to another method for more parsing. In this case handle_r which itself passes things off to further more methods. Eventually a Span is returned. Finally based of the :guess option chronic will return the start, middle, or end of the span, giving us a nice time object.

Obviously this isn’t going into every nitty gritty detail, but hopefully it helps get you pointed in the right direction when looking at the Chronic codebase. It’s not magic. The real secret of Chronic’s magical abilities lie in the extensive work done by contributers to figure out many of the various ways dates and times can be written and how exactly to parse them. Surprisingly, many natural language processing projects are just big collections of rules that turn a complex natural language object into something well defined and usable by a computer.