Engineering Builds a Twitter Bot

April 9, 2009 By Daniel

Twitter is the new hotness. It seems every site you visit has some kind of "follow us on Twitter" link or built-in Twitter functionality. People are hooking up cat doors and office chairs to the Twitter API, so it's only natural that Sparkart Engineering wants to play with this "exploding" new technology.

The idea was simple. As anyone in support will tell you, our fans have lots of followers (stalkers), but with Twitter our clients can now stalk their fans right back! Twitter has built-in search, from its acquisition of Summize, that allows any Tom, Dick, or Harry bot to search tweets for specific terms. If you haven't already, you should check out the advanced search features, which let you search for terms with positive or negative attitudes, near places, between dates, or in tweets that ask a question. Somewhat arbitrarily, we built our bot to search for tweets that contain the terms sparkart or UFC, and in a couple of hours our first bot was born.

Since new technology deserves more new technology, our Twitter bot can't be written in Ruby; that would be too easy for team engineering. To make the project more interesting and fun, we decided to write the bot in Erlang. Erlang, according to Wikipedia, is:

A general-purpose concurrent programming language and runtime
system. The sequential subset of Erlang is a functional language, with
strict evaluation, single assignment, and dynamic typing. For concurrency
it follows the Actor model. It was designed by Ericsson to support
distributed, fault-tolerant, soft-real-time, non-stop applications.

Wow, that's a mouthful. Erlang is starting to become the new "hot" language among startups, especially where concurrency is a major requirement. Amazon.com, Facebook, GitHub, Heroku, MochiWeb, and Last.fm are all using it internally. Let's dive into some code and see if we can take the title of "nerdiest post on the blog". Warning: really, really technical content ahead...

First, some Erlangisms to make the code below easier to understand. Variables in Erlang start with capital letters; symbols that start with a lowercase letter are called atoms (think Ruby symbols). Square brackets denote lists and curly brackets denote tuples (ordered sequences).
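To make those four Erlangisms concrete, here is a minimal sketch. The module name erlang_basics and the function demo/0 are made up for illustration and are not part of the bot.

```erlang
-module(erlang_basics).
-export([demo/0]).

%% Hypothetical demo module; nothing here comes from the bot itself.
demo() ->
    Term = "sparkart",           % variable: starts with a capital letter
    Status = ok,                 % atom: starts with a lowercase letter
    Terms = [Term, "ufc"],       % list: square brackets
    Result = {Status, Terms},    % tuple: curly brackets, fixed size
    {is_atom(Status), length(Terms), tuple_size(Result)}.
```

Calling erlang_basics:demo() should return {true, 2, 2}: Status is an atom, the list holds two strings, and the tuple has two elements.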

  -module(twitter_search).

  -export([run/0, search/1]).

  -include_lib("xmerl/include/xmerl.hrl").

  -define(SEARCH_URL,
    "http://search.twitter.com/search.atom?rpp=50&q=").

  -define(SEARCH_STORAGE, "results-file").

  run() ->
    inets:start(),
    open_table(?SEARCH_STORAGE),
    SearchTerms = ["sparkart", "ufc"],
    [ store_result( Term, search_for( Term ) ) || Term <- SearchTerms ],
    timer:sleep(2000),
    close_table().

  search_for(Query) ->
    Xml = search(Query),
    retrieve_names_from(Xml).

  search(Query) ->
    URL = search_url_for(Query),
    { ok, {_Status, _Headers, Body }} = http:request(URL),
    Body.

  search_url_for(Query) ->
    ?SEARCH_URL ++ Query.

  retrieve_names_from(Xml) ->
    { Body, _Rest } = xmerl_scan:string(Xml),
    Names = xmerl_xpath:string("//name/text()", Body),
    UserNames = [ strip_out_full_name(Author) || {_xmlText,[{name, _},{author, _},{entry, _},{feed, _}],
                                              _,[],Author,text} <- Names],
    lists:usort(UserNames).

  strip_out_full_name(Author) ->
    [Name | _Rest ] = string:tokens(Author, " "),
    Name.

  store_result(Key, Results) -> dets:insert(?MODULE, {Key, Results}).

  open_table(File) ->
    io:format("dets opened:  ~p~n", [File]),
    io:format("dets name:  ~p~n", [?MODULE]),
    case dets:open_file(?MODULE, [{file, File},{type, bag}]) of
      {ok, ?MODULE} ->
        true;
      {error, _Reason} ->
        io:format("cannot open dets table~n"),
        exit(eDetsOpen)
    end.

  close_table() -> dets:close(?MODULE).

Let's start breaking this bad boy down so it makes sense to more people than the person who wrote it.

-module(twitter_search). 

-export([run/0, search/1]). 

-include_lib("xmerl/include/xmerl.hrl"). 

-define(SEARCH_URL, 
  "http://search.twitter.com/search.atom?rpp=50&q="). 

-define(SEARCH_STORAGE, "results-file").

The first line defines the namespace for the functions in our bot. The module name needs to match the file name, just like model files and classes in Rails. All functions in Erlang are namespaced, so run becomes twitter_search:run(), preventing similarly named functions from colliding with each other. The export directive tells the runtime which functions, and at which arity, are available outside of the module. The arity of a function is the number of parameters it takes, so the twitter_search module exports run with no parameters and search with one. The include_lib line is like a Ruby require; we need xmerl to do an XPath search later. The two define lines create constants: one for the Twitter search API URL and one for the path to a file for persistent storage of the results.
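If arity and exports still feel abstract, a tiny sketch may help. Everything in it (the arity_demo module, greet, shout, and the DEFAULT_NAME macro) is invented for illustration; only the directives themselves mirror what the bot uses.

```erlang
-module(arity_demo).
%% greet/0 and greet/1 share a name but are two distinct functions.
-export([greet/0, greet/1]).

%% A macro, used in code as ?DEFAULT_NAME (like ?SEARCH_URL in the bot).
-define(DEFAULT_NAME, "sparkart").

greet() ->
    greet(?DEFAULT_NAME).

greet(Name) ->
    shout("hello " ++ Name).

%% Not exported, so it is callable only from inside this module.
shout(Text) ->
    string:to_upper(Text).
```

From the shell, arity_demo:greet() and arity_demo:greet("ufc") both work, while arity_demo:shout("ufc") fails with an undef error because shout/1 is not in the export list.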

run() ->
  inets:start(),
  open_table(?SEARCH_STORAGE),
  SearchTerms = ["sparkart", "ufc"],
  [ store_result( Term, search_for( Term ) ) || Term <- SearchTerms ],
  timer:sleep(2000),
  close_table().

The run function is the entry point into our system. It starts the inets application so we can make HTTP requests, opens our persistent backing store, performs the searches, and finally closes the backing store. The line [ store_result( Term, search_for( Term ) ) || Term <- SearchTerms ] is an Erlang list comprehension. In English: for every item in the search terms list, it binds that element to the variable Term and then runs the search_for and store_result functions.
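Here is a stripped-down sketch of the same comprehension shape, written with the double pipe that real Erlang source requires. The module comprehension_demo and both of its functions are hypothetical; lengths/1 stands in for the store_result and search_for pair so the example runs without the network.

```erlang
-module(comprehension_demo).
-export([lengths/1, print_each/1]).

%% Build a new list: one {Term, Length} tuple per search term.
lengths(SearchTerms) ->
    [ {Term, length(Term)} || Term <- SearchTerms ].

%% When only the side effect matters (as in run/0, which throws the
%% resulting list away), lists:foreach/2 is a common alternative.
print_each(SearchTerms) ->
    lists:foreach(fun(Term) -> io:format("~s~n", [Term]) end, SearchTerms).
```

comprehension_demo:lengths(["sparkart", "ufc"]) should give back [{"sparkart",8},{"ufc",3}], and print_each/1 prints each term on its own line and returns ok.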

search(Query) ->
  URL = search_url_for(Query),
  { ok, {_Status, _Headers, Body }} = http:request(URL),
  Body.

Search builds the URL for our search, including the query we are looking for, and fetches it. Let's explore the line { ok, {_Status, _Headers, Body }} = http:request(URL), since it makes use of pattern matching, which is a fundamental aspect of Erlang programming. First, throw out any notion you may have that the result of http:request(URL) gets assigned to that mess on the left-hand side of the equals sign. What's really happening is that http:request(URL) is called and the result is checked to see if it matches the tuple on the left-hand side. The tuple has an atom (think Ruby symbol) as the first element, followed by another tuple holding the HTTP status, HTTP headers, and body of the response. The underscore before a variable name says that we don't really care about that value. If the result matches the tuple, the variable Body is bound to the HTTP response body; if it doesn't, the process fails with a badmatch error.
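The matching behavior is easy to try without making an HTTP request. In this sketch the module match_demo and its two functions are made up, and the response tuples are hand-built on the assumption that they have the {ok, {Status, Headers, Body}} shape described above.

```erlang
-module(match_demo).
-export([body_of/1, safe_body_of/1]).

%% Same style as search/1: assert the shape and crash if it is wrong.
body_of(Response) ->
    {ok, {_Status, _Headers, Body}} = Response,   % badmatch on anything else
    Body.

%% A gentler variant: pass an error tuple along instead of crashing.
safe_body_of(Response) ->
    case Response of
        {ok, {_Status, _Headers, Body}} -> {ok, Body};
        {error, Reason} -> {error, Reason}
    end.
```

match_demo:body_of({ok, {200, [], "<feed/>"}}) returns "<feed/>", while match_demo:body_of({error, timeout}) raises a badmatch error. Whether to crash or handle the error is a design choice; the bot takes the crash-early route.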

retrieve_names_from(Xml) ->
  { Body, _Rest } = xmerl_scan:string(Xml),
  Names = xmerl_xpath:string("//name/text()", Body),
  UserNames = [ strip_out_full_name(Author) || {_xmlText,[{name, _},{author, _},{entry, _},{feed, _}],
                                        _,[],Author,text} <- Names],
  lists:usort(UserNames).

The retrieve_names_from function parses the XML returned from our search and uses an XPath expression to find the text of all of the name nodes. We use another list comprehension to match the inner text of the name nodes and create a list of Twitter usernames. Finally, the list of usernames is sorted and duplicates are removed. This list comprehension is a little more advanced than the one in run(). Instead of executing the strip_out_full_name function for every element in the Names list, we use Erlang pattern matching to keep only the text nodes whose parents are name, author, entry, and feed, in other words the inner text of the author's name node; elements that don't match the pattern are silently skipped. The format of the tuple is based on the parser's internal representation of the node, defined in the xmerl header file.
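The filtering trick is easier to see with simpler data. In this sketch the module filter_demo is hypothetical, {text, Value} tuples are a simplified stand-in for the xmerl text node tuples, and the sample values assume name text of the form "username (Full Name)".

```erlang
-module(filter_demo).
-export([user_names/1]).

%% Only {text, Value} tuples match the generator pattern; anything
%% else in Nodes is skipped without raising an error.
user_names(Nodes) ->
    Names = [ first_word(Value) || {text, Value} <- Nodes ],
    lists:usort(Names).   % sort and drop duplicates in one step

%% Same idea as strip_out_full_name/1: keep the first space-separated token.
first_word(Value) ->
    [First | _Rest] = string:tokens(Value, " "),
    First.
```

Given [{text, "bob (Bob B)"}, {comment, "skip me"}, {text, "alice (Alice A)"}, {text, "bob (Bob B)"}], user_names/1 should return ["alice","bob"]: the comment tuple is filtered out, and the duplicate bob collapses into one entry.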

Simple, right? One bright and shiny new bot in 56 lines of code. Hopefully this served as a good introduction to our first bot and to Erlang, the language it was written in. Feel free to drop questions, comments, kudos, and criticisms in the comments.

Note: list comprehensions must use a double pipe (||) to be syntactically correct. Double pipes are reserved in markdown, so if a comprehension in a listing shows a single pipe, read it as a double pipe.
