While looking for good references to improve my software architecture skills, I came across the book “Designing Data-Intensive Applications,” written by Martin Kleppmann. As soon as I finished the last page, I did a simple exercise: I tried to recall the databases mentioned throughout the previous 624 pages. Checking personal notes or the book itself was strictly forbidden.
Since I could easily remember more than 20 products, my immediate conclusion was that I needed to narrow down my studies. Before trying to understand what could be useful in my future projects, I had to come up with a method for choosing a focus. Maybe the most cited technologies? That’s when I remembered one of the most straightforward but useful applications of Apache Spark: counting words!
I converted the Kindle book (purchased through Amazon.com) to a .txt file and loaded the contents into an Apache Spark server using Python. After experimenting with a couple of other strategies (most frequent capitalised words, TF-IDF), I settled on the Index section and extracted the capitalised expressions that start a new line.
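The extraction step can be sketched roughly as follows. This is a minimal illustration in plain Python rather than Spark, and not the script I actually ran: the regular expression, the function name `index_candidates`, and the sample index lines (including their page numbers) are assumptions made up for the example.

```python
import re

# Assumed pattern: a line that begins with one or more capitalised tokens
# (letters, digits, underscores and hyphens allowed inside a token).
CAPITALISED_START = re.compile(r"^((?:[A-Z][\w\-]*)(?: [A-Z][\w\-]*)*)")


def index_candidates(index_text):
    """Return the sorted, de-duplicated capitalised expressions that start a line."""
    candidates = set()
    for line in index_text.splitlines():
        match = CAPITALISED_START.match(line)
        if match:
            candidates.add(match.group(1))
    return sorted(candidates)


# Invented sample resembling an index; indented sub-entries and
# lowercase entries are skipped because they do not start with a capital.
sample_index = """\
Apache Kafka, 137, 452
  log compaction, 456
ETL (extract-transform-load), 92
R-trees, 87
replication, 151
ZooKeeper, 370
"""

candidates = index_candidates(sample_index)
print(candidates)
```

Note that a filter like this still lets through capitalised non-products such as “ETL” and “R-trees”, which is why a manual pass over the results is still needed.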
The outcome was a list of 342 words, which I checked manually to take expressions such as “R-trees” and “ETL” out of the results. Since this job forced me to recall the meaning of each name, and to search for official websites when still in doubt, I decided not to write an automation script.
Once the list was narrowed to 72 items, a straightforward word counter did the job. For every product whose name has more than one word, I queried the book for the single most meaningful word; for example, the count for “Apache Kafka” is the number of times “Kafka” is mentioned. “(IBM) System R” had to be treated as a single expression so that it would not get mixed up with other kinds of systems. “(Google) Bigtable”, in the book, sometimes refers to the “Bigtable” data model, first proposed by Google’s database and later implemented in other products. In the end, I decided to count both cases in favor of Google’s product.
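A counter along those lines might look like the sketch below. Again, this is plain Python standing in for the Spark job, and the helper names `count_mentions` and `count_products`, the `search_terms` mapping, and the sample sentence are all hypothetical. Matching is case-sensitive and restricted to whole words, so that “System R” is not confused with an ordinary lowercase “system”.

```python
import re


def count_mentions(text, term):
    """Count case-sensitive, whole-word occurrences of term in text."""
    pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")
    return len(pattern.findall(text))


def count_products(text, search_terms):
    """Map each product name to the number of times its search term appears."""
    return {
        product: count_mentions(text, term)
        for product, term in search_terms.items()
    }


# Invented sample text; the real input would be the book without the Index section.
sample_text = (
    "Kafka stores messages in a log. ZooKeeper coordinates brokers, "
    "and Kafka relies on ZooKeeper. System R was an early relational system."
)

# Product name -> the most meaningful word or expression to search for.
search_terms = {
    "Apache Kafka": "Kafka",
    "Apache ZooKeeper": "ZooKeeper",
    "(IBM) System R": "System R",
}

counts = count_products(sample_text, search_terms)
for product, count in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"({count}) {product}")
```

Keeping the search term separate from the display name makes it easy to adjust a single entry, such as deciding what should count as a mention of “Bigtable”, without touching the counting logic.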
In a few cases, it’s hard to draw a clear line between what is a data store and what is not. Apache Lucene, a dependency of both Elasticsearch and Apache Solr, was also added to the list.
An entry such as “(46) Apache ZooKeeper” means that ZooKeeper is mentioned 46 times in the book (not counting the Index section).
None of the logos are owned or were created by me, so I don’t take responsibility for their eccentric design.
If it isn’t obvious yet, yes, you should get yourself a copy of Martin’s book. Well worth the investment of time and money.