ready to build your app

As a software development company, we very often work on complex applications that need to handle lots of data. Recently, on one of our projects, we’ve faced a challenge – We had a lot of data on many levels and we had to be able to operate directly on these documents.

Do you want to know how we resolved our problem?

Join us on our journey to find out.

But first, let me introduce our background.

Our technologies

At Brainhub we specialize in building apps with JavaScript and we do so using the following technologies:

  • On the backend:

    • NodeJS as the platform
    • Express or Koa as an HTTP library/framework
    • Various DBMS which fit a particular need, mostly MongoDB and Redis
    • RabbitMQ for queuing
    • Consul for microservices registration
    • GitLab CI or Circle CI as a continuous integration tool
    • Google App Engine, Bluemix or own servers for deployment
  • On the web frontend:

  • On the mobile frontend:

  • On the desktop frontend:

What kind of problem we have

We have a lot of data on many levels, which means, in a document model, many levels of nested documents. Moreover, we have to be able to operate directly on these nested documents (children, grandchildren, great-grandchildren etc.).

We have to create an API not only for our frontend but also for external integrations. The user should be able to send a JSON schema, which is later used for validation of provided data when creating or updating, and it’s also used to join documents from various collections.

An example of a simple JSON schema:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "profiles",
  "description": "Profile Schema",
  "type": "object",
  "properties": {
    "id": {
      "type": "string",
      "description": "ID"
    },
    "address": {
      "type": "string",
      "description": "Address"
    },
    "email": {
      "type": "string",
      "description": "E-mail address"
    },
    "firstname": {
      "type": "string",
      "description": "First name"
    },
    "lastname": {
      "type": "string",
      "description": "Last name"
    },
    "transactions": {
      "type": "array",
      "description": "List of transactions connected to the profile",
      "items": {
        "title": "transactions",
        "type": "object",
        "properties": {
          "id": {
            "type": "string",
            "description": "ID"
          },
          "orderTotal": {
            "type": "string",
            "description": "Total value of the transaction"
          },
          "invoices": {
            "type": "array",
            "description": "List of invoices related to the transaction",
            "items": {
              "title": "invoices",
              "type": "object",
              "properties": {
                "id": {
                  "type": "string",
                  "description": "ID"
                },
                "discountPercent": {
                  "type": "string",
                  "description": "Discount percent"
                },
                "itemNo": {
                  "type": "string",
                  "description": "Item ID"
                }
              }
            }
          }
        }
      }
    }
  }
}

 
So there are the following solutions:

  • Document database with foreign keys inside the documents:
    • Operating on data with many queries
    • Operating with a single query if a DBMS permits it
  • Relational database:
    • Manual serialization/deserialization
    • ORM
  • Graph database
  • Multi-model document-graph database

What type of DBMS we expect

We would like a DBMS which satisfies:

  • Open source
  • High performance
  • Good support for JavaScript/NodeJS
  • Good community
  • Supporting ACID (atomicity, consistency, isolability, durability)

Other useful features are:

  • Multi-model
  • Powerful query language

Potential DBMS to choose

We love open source solutions, so we eliminated DBMS such as Oracle, SQL Server, DB2 and for licensing issues MySQL.

We made a comparison of many No-SQL DBMS (not only for this project but also to have some overview for other projects, the data is as of February 18th, 2018):

Original file

We took some DBMS from the top of the rank above but eliminating:

  • LevelDB – is a great DBMS but designed for text storage, no document storage, which we need
  • PouchDB – though it is higher in this rank than CouchDB, we decided to consider CouchDB instead because PouchDB is generally designed for backend-frontend synchronization, which we don’t need in our app, but CouchDB is a DBMS used typically on the backend side
  • Memcached – allows caching data like Redis but after more detailed research, it looks like Redis is the undisputed winner over Memcached. Moreover, we’re looking for a document-based DBMS, so we don’t want to consider more key-value DBMS than Redis
  • Firebase Realtime Database – not an Open Source and generally seems not to be enough good to take the risk of having some bugs which we cannot fix
  • Neo4j – designed especially for Java but not NodeJS and is not document-based

Moreover, among the SQL DBMS, we decided to include only PostgreSQL in our research because it makes it possible to store JSON-like data, which we need.

Based on the table above and other research, DBMS that seem to suit our needs the most are:

  • MongoDB – the most popular document DBMS:
    • Advantages:
      • NodeJS developers have the greatest knowledge of this DBMS
      • Great clustering options
    • Disadvantages:
      • Missing joins – very important for our data; MongoDB Aggregation Framework is very limited and hard to debug
      • Missing transactions – we need ACID (though planned in MongoDB 4.0)
      • Missing expressive, dedicated query language – queries only in JSON
      • Issues can be reported only in Jira but not GitHub, which is less user-friendly for the Open Source community
  • CouchDB:
    • Advantages:
      • RESTful API
    • Disadvantages:
      • Missing transactions – we need ACID
  • RethinkDB:
    • Advantages:
    • Disadvantages:
      • Relatively slow performance
  • ArangoDB:
    • Advantages:
      • Great community – very helpful team on Slack
      • Multi-model
      • AQL
      • Transactions
      • Sharding and replication
    • Disadvantages:
      • Still not very popular, so it’s practically impossible to find developers experienced in ArangoDB
      • Relatively slow writes
  • Redis:
    • Advantages:
      • Super fast
      • Popular – among the non-relational databases – only MongoDB is more popular
      • The most stars on GitHub among all DBMS
    • Disadvantages:
      • Not designed for durable persistence (as default everything is kept inside RAM)
      • No query language
      • Very limited queries (only very basic operators like get or incr`)
  • PostgreSQL:
    • Advantages:
      • Very popular
      • Supports SQL
      • Supports many kinds of data like multi-dimensional arrays and user-defined types
      • Is proved to be very mature in production
    • Disadvantages:
      • Running JavaScript on PostgreSQL is not out-off-the-box (requires an extension) and JavaScript is never the main language used on PostgreSQL server because it has to be always embedded inside the SQL syntax
      • Though user-defined functions and data types are very useful, their syntax seems to be from an old epoch like Fortran or Pascal
      • PostgreSQL json and jsonb types have low performance in comparison with document databases
      • Sharding in PostgreSQL is not so easy – it requires either using a fork of the official database engine or an implementation in the application layer
      • Like other relational DBMS requires a schema which slows down creation of a simple MVP microservice

Why we chose ArangoDB

ArangoDB seems to be something like MongoDB (we have the most experience in MongoDB) with some extra features. Of course it lacks some MongoDB features like the Aggregation Framework but, in reality, this one is not lacking but replaced with something more user-friendly – AQL + joins.

ArangoDB like MongoDB provides clustering, though the ArangoDB clustering has not proven to work stably on production. One of the key factors was a very active community. It has a very low ratio of open issues to the total number of issues. Moreover, everyone can easily access ArangoDB Slack where the support team is very helpful, and also in Stackoverflow they give adequate responses.

Another reason was that ArangoDB is a multi-model DBMS, which is useful as we were planning to extend our documents with using graphs.

How we have used ArangoDB

We have used the following ArangoDB features:

  • AQL
  • Transactions
  • Admin UI (only for some debugging purposes)

We are potentially planning to use in the future:

  • Graphs

In the ArangoDB shell, we found a very useful feature which doesn’t exist in MongoDB. To learn AQL, no data in the collections was needed because it’s possible to type something like this:

1
db._query('for i in [1,2,3] return i * i')

 
Because one of the requirements was to build the data from many collections using the provided JSON schema, we were looking for an ArangoDB query builder.

We found something which was rather unpopular and lacked many features, so we created our own ArangoDB query builder.

We created an abstract interface, so when replacing ArangoDB into another DBMS, only the inner implementation would be changed.

An example code of our query builder:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
const QueryBuilder = () => {
  const priv = {
    // private fields and methods
  };

  const pub = {
    getQueryTree() {
      return priv.queryTree;
    },

    fromSchema(schema) {
      priv.mainCollectionName = schema.title;
      priv.queryTree.loop = `FOR ${schema.title}Item in ${schema.title}`;
      priv.queryTree.sorting = `SORT ${schema.title}Item.id`;
      // some more code

      return pub;
    },

    withLimit(offset, count) {
      // some code
    },

    byId(id) {
      // some code
    },

    byIdentifiers(identifiers) {
      // some code
    },

    byParentId(collectionName, id) {
      // some code
    },

    toAQL() {
      return [
        priv.toString(),
        priv.bindings,
      ];
    },
  };

  return pub;
};

export default QueryBuilder;

 
We have created some JavaScript code which runs on the ArangoDB server, and we use this code for most transactions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
// this function is run on the ArangoDB server, and, thus, it cannot use all es6 features
const dbProcedure = (params) => {
  const db = require('internal').db;

  const updateProperObject = () => {
    const collection = db._collection(params.myCollectionName);

    return collection.updateByExample({ id: params.newObject.id }, params.newObject);
  };

  const removeObjects = (collectionName) => {
    params.childrenIdsToBeRemoved[collectionName].forEach((id) => {
      db._collection(collectionName).removeByExample({ id });
    });
  };

  /*
   * The remaining public and private methods
   */


  const actions = {
    create,
    update,
    remove,
    override,
  };

  return actions[params.action]();
};

export default dbProcedure;

 
It was pretty cool that we were able to use some popular libraries like Lodash even on the database server.

ArangoDB provides an HTTP framework named Foxx which simplifies creating microservices that connect to ArangoDB.

However, we decided not to use Foxx because we didn’t want our microservices to be dependent on a database (like it is while using MongoDB + Mongoose or direct connecting between frontend and CouchDB REST API). Instead, we created abstract models which internally use ArangoDB.

This approach proved to be a good choice because later we had to replace ArangoDB into Redis in some critical places of the system in order to remove bottlenecks that were hindering the overall performance.

Our benefits from using ArangoDB

Thanks to AQL, writing and debugging queries was much easier than when using the MongoDB Aggregation Framework. Also, the transactions were very helpful. Even running JavaScript code on the database server was easier than in MongoDB because it was possible to use some libraries like Lodash.

Our drawbacks from using ArangoDB

Unfortunately, ArangoDB was not fast enough to handle a very large number of write/reads in a short period of time (however, even with other DBMS, e.g. MongoDB or MySQL, writing data to the hard disk would be too slow). That’s why in some microservices we decided to replace ArangoDB into Redis, which works in memory.

However, even in this situation, using ArangoDB was a better choice than MongoDB + Mongoose because Mongoose models usually make the entire architecture dependent on DBMS.

On the other hand, we created an abstract model using ArangoDB, so replacing DBMS into another was relatively easy.

Take note that MongoDB can be used with such an abstract model as well – so I just say that ArangoDB + our abstract models can be much better than MongoDB + Mongoose.

Importance of key characteristics of ArangoDB

Piotr Sroczkowski

Piotr Sroczkowski is a JavaScript Full Stack Developer at Brainhub (a software house building awesome Node.js web and mobile apps).