
But seriously, how is Data Science Retreat?

Jul 02, 2016 | 7 minutes read

Are you reading this in the future?

After Data Science Retreat, I decided to work full-time on an idea I had for my portfolio project. It is called Operation “Serenata de Amor” (Love Serenade) and has gathered thousands of supporters. Check it out:

After recommending the course to multiple people in recent weeks and explaining the same things over and over again, I decided to put them together into an article. I’m currently attending the 7th batch of Data Science Retreat, a full-time bootcamp held in Berlin. This batch started on April 18, 2016, and will end by July 15. The report that follows describes this specific batch. It doesn’t necessarily describe the first batch, and things will probably be different in a couple of years.

There are still two weeks to go, but since my final project already has good traction and things have started to calm down, I have more time to think about how I plan to use data science in the future. Writing about it is an excellent way to reflect on the experience. This course wasn’t my first contact with the subject, but it certainly shaped the Data Scientist I am today.

In the first month, we had full-time classes (roughly 8 hours per day) on almost every one of the 20 business days. Subjects were many: data cleaning, Jupyter Notebooks, Python, R, SQL, Apache Spark, streaming, and big data architectures.

For 3 days, DSR asked us to pair up and work on a Kaggle problem so we could compete with each other in an internal competition.

During another 2 days, we were expected to start working on our portfolio projects. Meaning: having ideas for data science projects by looking at public datasets or getting inspiration from existing applications.

For graduation, DSR requires you to present a portfolio project: something you made, end-to-end, leveraging different data science techniques learned during the bootcamp. Want to use Python or R, the languages taught in the course? Excellent. Scala, Julia, or a language acquired after-hours at home? They will love it. The goal is to show you are capable of having good project ideas, finding and cleaning datasets, coming up with and testing hypotheses, creating a predictive model using Machine Learning, and finally presenting everything to technical and non-technical audiences.

The whole course is structured around this portfolio project, which is expected to be presented on the final day. Everything seems designed to give you the knowledge and the opportunity to experiment before the big event.

With the project, you should demonstrate five abilities:

  1. Coming up with a good question
  2. Coming up with a good business case
  3. Finding publicly available data to back your experiments
  4. Finding existing technology to solve the problem
  5. Building a solution that can be verified immediately

If you decided to work on a problem already hosted on Kaggle, you would fail requirements 1 and 3: the question would have already been posed by another stakeholder, along with good datasets.

If you were to build a recommender system, an accurate validation might require collecting data over multiple days or weeks of usage by distinct users, neglecting abilities 3 and 5.

Requirement number 4 exists to hold you back from creating a whole new deep learning framework and spending more than three months on the portfolio project.

Although not explicitly listed during classes as an official requirement, I would add another based on my experience listening to proposals: leverage machine learning as a core feature. When colleagues proposed data science projects with minimal or no machine learning, the feedback from mentors was negative, classifying them as “not challenging enough”.

You are expected to show you can handle a data science project end to end, if necessary. Depending on your previous experience and the size of the company you join after DSR, you may even land a leadership role if you can demonstrate this ability.

DSR has a list of about 20 mentors, but the main one is Jose Quesada, the director of the company behind the course. Everyone has a 30-minute one-on-one with him every Friday. It is the best opportunity to give feedback about classes and teachers while having someone to ask about the data science questions that crossed your mind during the week.

After doing more in-depth research on their website, I learned the course follows a teaching technique called the Meerkat Method. You are pushed beyond your current capabilities so that your limits expand. It can be frustrating if you are not aware that data science is a fast-growing field with a significant amount of knowledge already published.

When looking for mentors other than Jose, the best way to reach them is via email. Halfway through the course, a list of names, emails, and specialties was posted in our Slack chat. If you want to engage with them, you have two options: approaching the few who had lunch with us during the course (where many excellent conversations happened) or using the info from the list to reach out yourself. I would recommend keeping the introduction message short and direct: the mentors have other full-time jobs, so it must grab their attention.

Although a few official mentors gave me useful insights for my portfolio project and career, the ones who helped most during classes were my colleagues. None of the 8 people in my batch was a beginner: we had specialists in AI, data warehousing, dataviz, economics, physics, aerospace engineering, business intelligence, and computer science. I have personally learned a lot by osmosis.

In the second month of the bootcamp, time was split approximately equally between classes and personal time for our portfolio projects. The classes covered advanced R and Python, dataviz, the theory and practice of presentations, geographical data, deep learning, and practice HR interviews. In advanced R and Python, we practiced algorithm optimization (Cython included). In deep learning, the classes focused on computer vision.

I’d highlight the presentation practice as one of my favorite parts of this second month.

On the first scheduled presentation day, we were told on a Friday evening to prepare a 20-minute presentation for Tuesday morning, on a subject of our own choice.

On the second day, we were given 3 hours to go from “Hey everyone! Think about something to present…” to “Recording? Ok…”. Challenging, but exciting to get it done.

In this third month, the subjects are fewer. We had classes on model pipelines, recommender systems, and technical communication. Since everyone is now super focused on their projects, these final classes have to be interesting to grab people’s attention.

I’d call attention to the model pipelines class. It changed the way I see data science. Coming from a software development career, I am used to reading code when I want to learn a new language or paradigm. At least as of this writing, the majority of open source repositories tagged with data science and Machine Learning are far from following the most basic software engineering best practices. With this class, I started to see the light at the end of the tunnel: a data engineering pipeline that is actually well written and maintained. I had the feeling of learning Haskell in a world of Fortran.
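To give a flavor of what “well written” looks like in this context, here is a minimal sketch of a model pipeline using scikit-learn, one of the tools taught in the course. The dataset and steps are my own choices for illustration, not the class’s exact material:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundle preprocessing and modeling into one object, so the exact
# same transformations run at training and at prediction time.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline.fit(X_train, y_train)
print("test accuracy:", pipeline.score(X_test, y_test))
```

The point is less the model and more the structure: each step is named, swappable, and the whole thing can be versioned and tested like any other piece of software.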

As I have mentioned, a lot is learned from colleagues going through the same challenges as you, each at a different pace. So telling you what happened in my batch may not help you understand what you’re going to find in the next one.

From the incredible seven people I’ve met, I learned how to version control Jupyter Notebooks, different techniques with deep learning, Scala, graph analyses with Neo4j and how to sing “Happy Birthday” in Vietnamese.
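The notebook version-control trick, as I understood it, boils down to stripping volatile output before committing, so diffs show only code and markdown changes. Tools like nbstripout automate this; here is a minimal sketch of the idea in plain Python (my own reconstruction, not necessarily the exact technique my colleagues used):

```python
import json

def strip_outputs(nb: dict) -> dict:
    """Remove outputs and execution counts so notebooks diff cleanly in git."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# A tiny in-memory notebook, following the nbformat cell layout.
nb = {
    "cells": [
        {"cell_type": "code", "source": "1 + 1",
         "outputs": [{"data": {"text/plain": "2"}}], "execution_count": 3},
        {"cell_type": "markdown", "source": "# Title"},
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
}

clean = strip_outputs(nb)
print(json.dumps(clean["cells"][0], indent=2))
```

Registered as a git filter (which is what `nbstripout --install` does for you), this kind of cleanup runs automatically on every commit.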

The batch I attended had 8 people of 7 different nationalities. People I will keep as friends for years to come. :)