While looking for good references to improve my software architecture skills, I came across the book “Designing Data-Intensive Applications” by Martin Kleppmann. As soon as I finished the last page, I did a simple exercise: I tried to recall the databases mentioned throughout the previous 624 pages. Checking personal notes or the book itself was strictly forbidden.

https://dataintensive.net/

Since I could easily remember more than 20 products, my immediate conclusion was that I needed to narrow down my studies. Before trying to understand what could be useful in my future projects, I had to come up with a method for choosing a focus. Maybe the most cited technologies? That’s when I remembered one of the most straightforward but useful applications of Apache Spark: counting words!

I converted the Kindle book (purchased through Amazon.com) to a .txt file and loaded the contents into an Apache Spark server using Python. After experimenting with a couple of other strategies (most frequent capitalised words, TF-IDF), I settled on the Index section and extracted the capitalised expressions that start a new line.
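The extraction step can be sketched roughly as follows. This is not the original Spark job: it uses only the Python standard library so that it runs anywhere, and the `SAMPLE_INDEX` text, its page numbers, the regular expression, and the function name are all assumptions made up for illustration.

```python
import re

# Hypothetical excerpt standing in for the book's Index section;
# the entries and page numbers are invented for this example.
SAMPLE_INDEX = """\
ActiveMQ (messaging), 137, 444
aborts (transactions), 222, 224
Apache Kafka, 137, 174
  Kafka Connect (database integration), 457, 461
B-trees (indexes), 79-83
ZooKeeper (coordination service), 370-373
"""

# One capitalised word, optionally followed by more capitalised words.
CAPITALISED = re.compile(r"^[A-Z][\w-]*(?: [A-Z][\w-]*)*")


def capitalised_line_starts(text):
    """Return the distinct capitalised expressions that begin a line, in order."""
    seen = []
    for line in text.splitlines():
        # Indented sub-entries start with whitespace, so they never match.
        match = CAPITALISED.match(line)
        if match and match.group(0) not in seen:
            seen.append(match.group(0))
    return seen


candidates = capitalised_line_starts(SAMPLE_INDEX)
print(candidates)
```

Note that an entry such as “B-trees” survives this filter even though it is not a product, which is why a manual pass over the candidates is still needed.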

The outcome was a list of 342 expressions, which I reviewed manually to take entries such as “R-trees” and “ETL” out of the results. Since this job forced me to recall the meaning of each name, and to search for the official website whenever I was still in doubt, I decided not to write an automation script for it.

Once the list was narrowed to 72 items, a straightforward word counter did the job. For every product whose name has more than one word, I queried the book for the single most meaningful word; e.g., the count for “Apache Kafka” is the number of times “Kafka” is mentioned. “(IBM) System R” had to be treated as a single expression so that it would not be mixed up with other kinds of systems. “(Google) Bigtable”, in the book, sometimes refers to the “Bigtable” data model, first introduced by Google’s database and later implemented in other products. In the end, I decided to count both cases in favor of Google’s product.
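The counting rules above could look something like the sketch below. Again, this is plain Python rather than the Spark job I actually ran; the `SEARCH_TERMS` mapping, the `SAMPLE_TEXT`, and the function name are hypothetical and only illustrate the idea of searching for one meaningful term per product.

```python
import re

# Hypothetical mapping from the product name shown in the ranking
# to the term that is actually searched for in the book text.
SEARCH_TERMS = {
    "Apache Kafka": "Kafka",
    "Apache ZooKeeper": "ZooKeeper",
    "(IBM) System R": "System R",   # kept as one expression
    "(Google) Bigtable": "Bigtable",  # product and data model counted together
}

# Made-up text standing in for the book's contents (Index excluded).
SAMPLE_TEXT = (
    "Kafka stores messages in a log. ZooKeeper is used by Kafka. "
    "System R pioneered SQL, unlike any other system. "
    "The Bigtable data model was adopted by others; Bigtable itself came from Google."
)


def count_mentions(text, terms):
    """Count case-sensitive, whole-word occurrences of each search term."""
    counts = {}
    for product, term in terms.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        counts[product] = len(pattern.findall(text))
    # Most mentioned first; ties keep their original order.
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))


ranking = count_mentions(SAMPLE_TEXT, SEARCH_TERMS)
for product, mentions in ranking.items():
    print(f"({mentions}) {product}")
```

Matching “System R” as a whole, case-sensitive expression is what keeps a generic phrase like “any other system” out of the count.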

In a few cases, it’s hard to draw a clear line between what is a data store and what is not. Apache Lucene, a dependency of both Elasticsearch and Apache Solr, was also added to the list.

A label such as “(46) Apache ZooKeeper” means that ZooKeeper is mentioned 46 times in the book (without counting the Index section).

None of the logos are owned or were created by me, so I don’t take responsibility for their eccentric design.

https://zookeeper.apache.org/

https://www.postgresql.org/

https://www.mysql.com/

https://kafka.apache.org/

https://cassandra.apache.org/

https://www.oracle.com/database/index.html

https://www.mongodb.com/

http://basho.com/products/

https://hbase.apache.org/

https://www.microsoft.com/en-us/sql-server

https://www.voltdb.com/

https://aws.amazon.com/dynamodb/

https://lucene.apache.org/

http://www.project-voldemort.com/voldemort/

https://couchdb.apache.org/

https://coreos.com/etcd/

https://www.datomic.com/

https://www.ibm.com/analytics/us/en/db2/

https://cloud.google.com/spanner/

https://www.elastic.co/products/elasticsearch

https://www.couchbase.com/

https://redis.io/

https://engineering.linkedin.com/teams/data/projects/espresso

https://cloud.google.com/bigtable/

https://www.rethinkdb.com/

http://leveldb.org/

https://www.ibm.com/it-infrastructure/z/ims

http://www.mcjones.org/System_R/

https://lucene.apache.org/solr/

https://rocksdb.org/

https://www.rabbitmq.com/

https://www.vertica.com/

https://azure.microsoft.com/en-us/services/storage/

https://eventstore.org/

https://www.sap.com/products/hana.html

https://hornetq.jboss.org/

https://aws.amazon.com/s3/

https://neo4j.com/

https://bookkeeper.apache.org/distributedlog/

https://activemq.apache.org/

https://www.memcached.org/

https://www.teradata.co.uk/

https://www.foundationdb.org/

https://github.com/HugoTian/Bayou

https://developer.ibm.com/messaging/ibm-mq/

https://en.wikipedia.org/wiki/NonStop_SQL

https://github.com/etsy/statsd

http://hyperdex.org/

https://github.com/pinterest/terrapin

https://github.com/lyogavin/Pistachio

http://zeromq.org/

http://www.paraccel.com/

https://github.com/jamesdabbs/brubeck.py

http://www.lmdb.tech/doc/

https://www.memsql.com/

http://druid.io/

https://www.consul.io

https://firebase.google.com/docs/database/

https://github.com/nathanmarz/elephantdb

https://bookkeeper.apache.org/

https://www.ibm.com/cloud/websphere-application-platform

https://azure.microsoft.com/en-us/services/service-bus/

https://qpid.apache.org/

https://franz.com/agraph/allegrograph/

http://hawq.apache.org/

https://titan.thinkaurelius.com/

https://aws.amazon.com/redshift/

https://www.objectivity.com/products/infinitegraph/

https://nats.io/

https://ramcloud.atlassian.net/wiki/spaces/RAM/overview?mode=global

https://msdn.microsoft.com/en-us/library/ms711472%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396

https://www.softwareag.com/corporate/products/webmethods_integration/application_integration/webmethods_adapters.html

If it isn’t obvious yet, yes, you should get yourself a copy of Martin’s book. Well worth the investment of time and money.