I was recently faced with the problem of finding an apartment in Berlin. Having gone through the same search before, I decided to automate the task and write software that alerts me to the best deals. In this article, I explain how I built the foundations of this platform.
The platform I’ve written is a Go application deployed to Google Cloud using Terraform. It is also continuously deployed from a private GitHub repository.
After some quick research, I came up with the following list of platforms to monitor:
A few hours later, I had a Go binary that did everything I needed to run the application locally. It uses a web scraping framework called Colly to browse all the platforms' listings, extract basic attributes, and export them to CSV files in the local filesystem.
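The export step can be sketched with the standard library alone. This is a minimal illustration, not the application's actual code: the `Listing` type, its fields, the column names, and the sample row are all hypothetical, and the Colly scraping step is left out.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Listing holds the basic attributes of one advertised property.
// The type and its fields are hypothetical, chosen for illustration.
type Listing struct {
	Platform string
	Title    string
	RentEUR  int
	URL      string
}

// toCSV renders the listings as CSV text with a header row.
func toCSV(listings []Listing) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	if err := w.Write([]string{"platform", "title", "rent_eur", "url"}); err != nil {
		return "", err
	}
	for _, l := range listings {
		record := []string{l.Platform, l.Title, strconv.Itoa(l.RentEUR), l.URL}
		if err := w.Write(record); err != nil {
			return "", err
		}
	}
	// Flush pushes buffered rows into sb; Error reports any write failure.
	w.Flush()
	return sb.String(), w.Error()
}

func main() {
	// Made-up sample row; a real run would receive listings from the scraper.
	listings := []Listing{
		{"example-platform", "2 rooms, Kreuzberg", 950, "https://example.com/listing/1"},
	}
	out, err := toCSV(listings)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

The `encoding/csv` writer quotes fields that contain commas, so free-text titles do not break the column layout.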
Since I didn’t want to keep the application running locally, my first choice would have been to get a cheap instance on Google Cloud. Once I had this rented virtual machine, I could write a startup script to compile the app from GitHub and set up a crontab to scrape the platforms on a daily basis.
That would probably have been the best decision for this specific project, but could I use this personal problem as an opportunity to explore how Google Cloud services integrate?
Since I had worked on several projects involving some sort of scraping application in the past, I believed it was worth the effort: I could easily reuse this setup in the future.
My architecture started with a few premises:
My hypothesis was that I didn’t need a virtual machine running 24/7, so the setup should not cost as much as a full month of uptime. In fact, my application was able to download all the properties I was interested in in under 3 minutes, so I expected something significantly lower.
Exploring the latest Google Cloud services led me to Cloud Run, a service that “run(s) stateless containers on a fully managed environment or in your own GKE cluster.” Still classified as a beta product by Google Cloud, it is built on top of Knative and Kubernetes. Its key selling point is the pricing model: it charges in increments of milliseconds rather than hours of runtime.
With a few tweaks, my Go application was wrapped in a Docker container so that Cloud Run could run it. Once it gets an HTTP POST request, it collects attributes from all the advertised properties and publishes them as CSV files to a Google Cloud Storage bucket. For my use case, I created two ways to hit this endpoint: an Internet-accessible address, so I can trigger it whenever I want, and Cloud Scheduler, which is configured to hit it once a day.
The application is fairly simple: it consists of an HTTP server with a single endpoint. On every hit, it scrapes all the platforms and saves results in CSVs inside a Storage bucket.
Other application files can be found in this Gist. All feedback is appreciated, as this is one of my first Go projects.
With the permissions in place, use Terraform to set up the rest of the infrastructure.
$ cd deployment
$ terraform init
$ terraform apply
The initial deployment may take about five minutes since Terraform waits for Cloud Run to build and start before configuring Cloud Scheduler.
Since Cloud Run is still in beta, with API endpoints in the alpha stage, I was not able to declare all the infrastructure in Terraform files. As a temporary workaround, I’ve written a couple of auxiliary Bash scripts that call the Cloud API through its CLI. Fortunately, all of this happens in the background when a developer runs terraform apply.
Every day, without any human interaction, Cloud Scheduler triggers the application, which creates a new folder containing CSV files with the most recently available apartments in my city.
Not all the services in use are available in the official calculator. Either way, I’ve made a rough estimate for my personal use, assuming an unrealistically high rate of one deployment per day.
For comparison, an f1-micro instance (0.6 GB of RAM) running for a full month on Google Cloud is included in the free tier, while a g1-small instance (1.7 GB) would cost US$ 13.80 per month. The actual cost could also turn out lower or higher, depending on how accurate my initial assumptions were and on further optimizations.