What's In That Gem? / Chronic - 18 Dec 2016
In this post I’m going to do a little dive into a gem and talk a bit about how it’s constructed and some of the design decisions that went into it. I’d like to turn this into a series covering a different gem each time, but we’ll see how this one goes.
Chronic
chronic is a natural language date and time parsing library. The first version was uploaded to RubyGems.org on September 6, 2006. Needless to say, it’s had a lot of work put into it over the years, but its basic structure hasn’t changed a whole lot since then. It may seem like magic at first, but once we dig into it a bit we can pull back the curtain and see what’s really going on.
Interface
Chronic’s main public interface has two methods: parse, which parses a string of text into a time object, and construct, which builds a time object from a year, month, day, hour, and so on. The construct method is fairly straightforward, so we’ll mostly be focusing on what parse is doing.
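Before digging in, here’s a quick sketch of that interface in use. The exact return values depend on the current date and time zone, so the comments are illustrative rather than literal:

#usage sketch (illustrative)
require 'chronic'

Chronic.parse('tomorrow at noon')
#=> a Time object for noon of the following day
Chronic.parse('3rd wednesday in november')
#=> a Time object for that date
Chronic.parse('not a date at all')
#=> nil when nothing can be parsed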
Parsing
#lib/chronic.rb
def self.parse(text, options = {})
  Parser.new(options).parse(text)
end
The parse method takes a string of text and a hash of options that you can use to configure how chronic parses the text. The options are used to build a new Parser object, and then the text is passed in to be parsed. Let’s take a peek inside the Parser object.
#lib/chronic/parser.rb
def parse(text)
  tokens = tokenize(text, options)
  span = tokens_to_span(tokens, options.merge(:text => text))
  puts "+#{'-' * 51}\n| #{tokens}\n+#{'-' * 51}" if Chronic.debug
  guess(span, options[:guess]) if span
end
For the most part we only care about the first two lines in this method. tokenize is responsible for cleaning up our text and putting it into a standardized format so the rest of the library has a reliable base to work from; this is essentially the lexing step, tagging each token with a type based on various rules. Then tokens_to_span is the evaluation step, turning our tokens into a Span that hopefully contains our correct time object.
tokenize
This method does the standardization and lexing of the text.
#lib/chronic/parser.rb
def tokenize(text, options)
  text = pre_normalize(text)
  tokens = text.split(' ').map { |word| Token.new(word) }
  [Repeater, Grabber, Pointer, Scalar, Ordinal, Separator, Sign, TimeZone].each do |tok|
    tok.scan(tokens, options)
  end
  tokens.select { |token| token.tagged? }
end
The pre_normalize method is, for the most part, a big list of regex substitutions: stripping quotes, turning commas into spaces, converting number words into digits, and rewriting ordinal words like ‘third’ as ‘3rd’. Here’s a sample of it.
#lib/chronic/parser.rb
def pre_normalize(text)
  #...
  text.gsub!(/['"]/, '')
  text.gsub!(/,/, ' ')
  text.gsub!(/^second /, '2nd ')
  text.gsub!(/\bsecond (of|day|month|hour|minute|second)\b/, '2nd \1')
  text = Numerizer.numerize(text)
  text.gsub!(/([\/\-\,\@])/) { ' ' + $1 + ' ' }
  text.gsub!(/(?:^|\s)0(\d+:\d+\s*pm?\b)/, ' \1')
  #...
  text
end
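To get a feel for what that does, here’s a tiny standalone sketch using only the substitutions shown above (the real method has many more rules and also runs Numerizer to turn number words into digits):

#illustrative sketch, not part of the gem
def mini_pre_normalize(text)
  text = text.dup
  text.gsub!(/['"]/, '')         # drop quotes
  text.gsub!(/,/, ' ')           # commas become spaces
  text.gsub!(/^second /, '2nd ') # a leading 'second' is an ordinal, not a unit of time
  text.gsub!(/\bsecond (of|day|month|hour|minute|second)\b/, '2nd \1')
  text
end

mini_pre_normalize('second of may at 5pm')
#=> "2nd of may at 5pm"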
After pre_normalize, you can see it’s a simple mapping over the text split on spaces, generating a Token for each word. Once the text is normalized and split into tokens, each Tag type is checked against the tokens to see whether they meet the rules for that tag. If a token matches, it’s given that tag and we move on to the next tag type. The base class for all of the tags is the Tag class. It has a method named scan_for, but to understand it we need to look at how it’s getting called.
#lib/chronic/tags/grabber.rb
def self.scan(tokens, options)
  tokens.each do |token|
    token.tag scan_for_all(token)
  end
end

def self.scan_for_all(token)
  scan_for token, self,
    {
      /last/ => :last,
      /this/ => :this,
      /next/ => :next
    }
end
The scan method called from tokenize is where the tag class starts its journey. It simply iterates through the tokens and adds a tag using scan_for_all, which in turn calls the base Tag class’s scan_for method. We can see what’s passed in: the token, self, and a hash of the rules for Grabber. Below we can see what happens in the base Tag class.
#lib/chronic/tag.rb
def self.scan_for(token, klass, items = {}, options = {})
  case items
  when Regexp
    return klass.new(token.word, options) if items =~ token.word
  when Hash
    items.each do |item, symbol|
      return klass.new(symbol, options) if item =~ token.word
    end
  end
  nil
end
scan_for can take either a hash or a regular expression object. It basically just checks whether the token’s word matches the regex, or one of the hash’s keys, whichever was passed in. Finally, a little metaprogramming generates a new tag object based on the klass that was handed in. In this case, if we had a string like ‘last tuesday of september’, scan_for would return a Grabber object for ‘last’, which scan then adds to that Token’s list of tags.
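Stripped of the surrounding machinery, the matching idea is small enough to sketch in a few lines. This is just an illustration of the pattern, not the gem’s actual classes:

#illustrative sketch, not part of the gem
GRABBER_PATTERNS = {
  /last/ => :last,
  /this/ => :this,
  /next/ => :next
}

# Return the symbol for the first pattern that matches the word, or nil.
def grabber_for(word)
  GRABBER_PATTERNS.each do |pattern, symbol|
    return symbol if pattern =~ word
  end
  nil
end

grabber_for('last')    #=> :last
grabber_for('tuesday') #=> nil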
The last thing we need to do to tokenize our text is filter out the tokens that have no tags associated with them after the tagging process. I should also mention there are tag types more complex than Grabber, most notably the Repeaters, which allow for things like ‘third week in april’; they cover most tokens that contain a named value, like month names, season names, or day names.
Evaluation using tokens_to_span
Phew, that was a lot. Now that we have our tokens tagged, we can begin to parse their meanings out into time objects. To do this we first need a Dictionary of definitions that maps a set of tokens to a specific action needed to parse our text. This is basically how parsers for programming languages work as well. We’re getting there!
The basic outline of tokens_to_span looks like this:
#lib/chronic/parser.rb
def tokens_to_span(tokens, options)
  definitions = definitions(options)
  #...more definition types iterated
  definitions[:anchor].each do |handler|
    if handler.match(tokens, definitions)
      good_tokens = tokens.select { |o| !o.get_tag Separator }
      return handler.invoke(:anchor, good_tokens, self, options)
    end
  end
  #...
end
A basic definition class looks like this:
#lib/chronic/definition.rb
class AnchorDefinitions < SpanDefinitions
  def definitions
    [
      Handler.new([:separator_on?, :grabber?, :repeater, :separator_at?, :repeater?, :repeater?], :handle_r),
      Handler.new([:grabber?, :repeater, :repeater, :separator?, :repeater?, :repeater?], :handle_r),
      Handler.new([:repeater, :grabber, :repeater], :handle_r_g_r)
    ]
  end
end
Each Handler class takes an array (or nested array) of types that correspond to our tag types; a question mark means that type is optional in our set of tokens.
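A simplified version of that matching logic might look something like the sketch below. The real Handler also deals with nested patterns and separators, so treat this as an illustration of the idea rather than the gem’s implementation:

#illustrative sketch, not part of the gem
# pattern: an array of tag names; a trailing '?' marks an element as optional
def pattern_match?(pattern, tag_names)
  p_idx = t_idx = 0
  while p_idx < pattern.length
    expected = pattern[p_idx].to_s
    optional = expected.end_with?('?')
    expected = expected.chomp('?')
    if tag_names[t_idx].to_s == expected
      t_idx += 1               # consumed a token
    elsif !optional
      return false             # a required element is missing
    end
    p_idx += 1
  end
  t_idx == tag_names.length    # every token must be accounted for
end

pattern_match?([:grabber?, :repeater, :repeater], [:repeater, :repeater]) #=> true
pattern_match?([:repeater, :grabber, :repeater], [:repeater, :repeater])  #=> false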
We’ve gathered our definitions, and now we iterate over the handlers in each definition, determining whether one of them matches our specific set of tokens. The match method on the Handler class looks at the tokens and determines whether they match the definition, returning true or false. When a match is found, the correct action method gets called, using the second argument to the Handler initializer and the Handler’s invoke method. For instance, :handle_r_g_r is the name of the method to run when we have a :repeater, :grabber, and :repeater as our set of tokens. Let’s see what that looks like:
#lib/chronic/handlers.rb
# Handle repeater/grabber/repeater
def handle_r_g_r(tokens, options)
  new_tokens = [tokens[1], tokens[0], tokens[2]]
  handle_r(new_tokens, options)
end
It takes in the tokens and the options and figures out the steps needed to build a time object out of our string. In this case it’s just reordering the tokens. Many of the methods in the Handlers module parse out a little bit at each step and then pass things off to another method for more parsing; in this case handle_r, which itself passes things off to still more methods. Eventually a Span is returned. Finally, based on the :guess option, Chronic will return the start, middle, or end of the span, giving us a nice time object.
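To see that last step from the outside: the :guess option is passed straight through Chronic.parse, and turning guessing off hands you the Span itself instead of a single Time (the accepted values have shifted a bit between Chronic versions, so check the README for the one you’re using):

#usage sketch (illustrative)
Chronic.parse('may 27th')
#=> a Time somewhere inside the matched span
Chronic.parse('may 27th', :guess => false)
#=> a Chronic::Span covering the whole day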
Obviously this isn’t going into every nitty-gritty detail, but hopefully it helps point you in the right direction when looking at the Chronic codebase. It’s not magic. The real secret of Chronic’s magical abilities lies in the extensive work done by contributors to figure out the many ways dates and times can be written and exactly how to parse them. Surprisingly, many natural language processing projects are just big collections of rules that turn a complex natural language expression into something well defined and usable by a computer.