General architecture

  ..........................................
  .              file upload               .
  .   ---------->-->-->-->-->------------  .       ------------------
  .  |  webapp  |           | libstorage | ->->-> | storage hardware |
  .   ----------<--<--<--<--<------------  .       ------------------
  .      |           uri                   . --<--<--<--<--<--<--<--<--<
  ...... v .................................                            ^
         |                                                              |
         v .. The frontend uses the HTTP POST method to push metadata   ^
         |    received from the user input. Data collected will be      |
         v    formatted in a defined JSON format.                       ^
         |                                                       ----------
         v                                                      | HTTP API |
  ---------------                                                ----------
 |  xmpp server  |            indexer                                |
 | (pubsub node) |            ...................................... ^ ......
  ---------------             .                                      |      .
    |    |    |               .   -------------                -----------  .
    v    v     -->-->-->-->-->-- | atom parser |              | Query API | .
    |    |                    .   -------------                -----------  .
    v   ---------------       .     v                                ^      .
    |  | other indexer |      .   ----------------          --------------  .
    v   ---------------       .  | term extractor | ->-->- |   indexer    | .
    |                         .   ----------------          --------------  .
  ---------------             ...............................................
 | other indexer |
  ---------------
                              

Frontend

The frontend is a webapp with client-side and server-side components. It is mainly responsible for collecting user input and publishing the collected data via Pubsub.

The frontend also uses a storage API, which stores each uploaded resource and returns a URI for the uploaded file. This URI is sent through Pubsub together with the other metadata fields.
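The upload path can be sketched roughly as follows. This is only an illustration under stated assumptions: the payload is taken to be an Atom entry (because that is what the indexer parses), and `store_resource`, `build_entry`, `handle_upload` and the `publish` callback are hypothetical names that stand in for the real storage API and Pubsub client.

```python
import hashlib
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)


def store_resource(data: bytes) -> str:
    """Hypothetical stand-in for the storage API: returns a URI for the data."""
    return "urn:sha1:" + hashlib.sha1(data).hexdigest()


def build_entry(entry_id: str, title: str, resource_uri: str) -> str:
    """Build an Atom entry (as text) describing one uploaded resource."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # The URI returned by the storage API travels with the other metadata.
    ET.SubElement(entry, f"{{{ATOM_NS}}}link",
                  {"rel": "enclosure", "href": resource_uri})
    return ET.tostring(entry, encoding="unicode")


def handle_upload(data: bytes, entry_id: str, title: str, publish) -> str:
    """Store the file first, then publish its metadata (assumed order)."""
    uri = store_resource(data)
    publish(build_entry(entry_id, title, uri))
    return uri
```

Passing `publish` as a plain callable keeps the sketch independent of any particular XMPP library; in a real deployment it would wrap a publish request to the Pubsub node.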

Indexer

The indexer receives document metadata from the frontend (read from the Pubsub channel) and indexes it in a database. Its flow is summarized in the following diagram::

          ----------
         | metadata |
          ----------
              |
              v
              |                                              The indexer
   .....................................................................
   .                                                                   .
   .  -----------------  Responsible for receiving and parsing         .
   . | XMPP Client     | commands from the caller and sending          .
   . |                 | feedback.                                     .
   . |  -------------  |                                               .
   . | | Atom parser | ----> Read tags and fields from the Atom        .
   . |  -------------  |     document. Collected data will be          .
   .  -----------------      passed to the Indexer API.                .
   .        |                                                          .
   .        V                                                          .
   .        |                                                          .
   .  --------------  Implements commands that will be called by the   .
   . | Indexer core | XMPP client. Currently only two methods:         .
   . |  ----------  | cervo_indexer_{index,remove}_document().         .
   . | |   iddb   | |                                                  .
   . |  ----------  |                                                  .
   . | |  Xapian  | |                                                  .
   . |  ----------  |                                                  .
   .  --------------                                                   .
   .....................................................................

Details

The indexer itself is an XMPP client. After receiving a notification from a Pubsub service, it reads the content of the Atom document, looks for terms and fields in it, and then indexes the whole received document together with the extracted data.
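The parsing step described above might look like the sketch below. It is not the indexer's actual code: `parse_entry` is a hypothetical name, and the rule used for terms (lowercased word tokens from every text field except the id) is an assumption made for illustration.

```python
import re
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_entry(xml_text):
    """Return (atom_id, fields, terms) for one Atom entry."""
    entry = ET.fromstring(xml_text)
    atom_id = entry.findtext(ATOM + "id")
    # Fields: direct children that carry text, keyed by local tag name.
    fields = {}
    for child in entry:
        if child.text and child.text.strip():
            fields[child.tag.replace(ATOM, "")] = child.text.strip()
    # Terms: lowercased word tokens from every field except the id.
    terms = set()
    for name, value in fields.items():
        if name != "id":
            terms.update(re.findall(r"\w+", value.lower()))
    return atom_id, fields, sorted(terms)


SAMPLE = (
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    "<id>urn:example:doc-1</id>"
    "<title>Free Software</title>"
    "<summary>software for all</summary>"
    "</entry>"
)
```

Calling `parse_entry(SAMPLE)` yields the id, a dictionary of fields, and a sorted list of terms, which is the kind of data the indexer core would then store.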

Indexer API

Internally, the Indexer API uses the iddb API to relate an Atom id to a Xapian::docid. iddb uses a Berkeley DB key/value store with a hash table backend to map the id found in the Atom document to Xapian's internal identifier: the string (Atom id) is the key and the unsigned int (Xapian docid) is the value.
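The mapping kept by iddb can be illustrated with a small sketch. Here a plain dictionary stands in for the Berkeley DB hash table, and the class and method names (`IdDb`, `put`, `get`, `remove`) are hypothetical; only the shape of the data (string key, unsigned int value) comes from the description above.

```python
import struct


class IdDb:
    """Maps an Atom id (string key) to a docid (unsigned int value)."""

    def __init__(self):
        # A dict stands in for the Berkeley DB hash table in this sketch.
        self._store = {}

    def put(self, atom_id, docid):
        # Keys and values are kept as bytes, as a key/value store would do.
        self._store[atom_id.encode("utf-8")] = struct.pack("=I", docid)

    def get(self, atom_id):
        raw = self._store.get(atom_id.encode("utf-8"))
        return None if raw is None else struct.unpack("=I", raw)[0]

    def remove(self, atom_id):
        # True if the id was known and has been dropped, False otherwise.
        return self._store.pop(atom_id.encode("utf-8"), None) is not None
```

A lookup table of this kind is presumably what lets a document be removed by its Atom id: the caller only knows the Atom id, while the search database addresses documents by docid.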

Dependencies for the current indexer code

  • pkg-config
  • make
  • gcc
  • g++
  • libtaningia-dev
  • libdb4.8-dev