Let's talk about the software perspective. If coding is not your strength, there are dozens of other ways to contribute.
As a global citizen, Serenata de Amor may sound like an excellent way of contributing to the fight against corruption. As a hacker, it may be the great opportunity you were looking for to dive into Data Science, Machine Learning and all those nice things people say they are doing but no one really knows how.
It all begins with serenata-de-amor, a GitHub repository for data collection and data analyses.
Think about suspicious activities in the Quota for Exercising Parliamentary Activity (aka CEAP). The Bureau Act forbids nepotism: a congressperson may not request reimbursement for an expense made in a company owned by them or by any relative up to the third degree. There's an Issue for that, and it's tagged as analysis. What data needs to be collected? The proposed analysis will link to Issues with the data collection tag.
Want to contribute with your keyboard but without coding? Fine. Find an Issue and comment on it. Give suggestions on how to approach a problem, agree or disagree with others (always presenting good arguments) and ask questions.
Write a Python script (meant to be run from the command line) and place it in the src folder. You may want to read the data/2016-12-06-reimbursements.xz file, which contains reimbursements made by congresspeople using CEAP money.
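If you use pandas, pd.read_csv infers xz compression from the file extension. As a dependency-free sketch, here is how such a file can be read with the standard library alone; the file name and columns below are invented for illustration, not the real dataset's schema:

```python
import csv
import lzma
import os
import tempfile

# Hypothetical stand-in for the real dataset: a tiny xz-compressed CSV
path = os.path.join(tempfile.mkdtemp(), "2016-12-06-reimbursements.xz")

with lzma.open(path, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["congressperson_name", "total_net_value"])
    writer.writerow(["Jane Doe", "1250.00"])

# lzma.open decompresses transparently, so csv can read it as usual
with lzma.open(path, "rt", newline="") as f:
    rows = list(csv.DictReader(f))

print(rows[0]["congressperson_name"])  # Jane Doe
```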
Since data is always changing and we expect every piece of code to be reproducible, datasets are versioned with a YYYY-MM-DD- prefix. Also, we love CSVs and their simplicity: accessible to everyone, even from Microsoft Excel. Finally, compressing the dataset is good practice, and xz is our favorite format.
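The naming convention can be captured in a tiny helper (hypothetical, not part of the project's code):

```python
from datetime import date

def versioned_name(dataset, day=None):
    """Build a dataset file name following the YYYY-MM-DD- prefix convention."""
    day = day or date.today()
    return f"{day:%Y-%m-%d}-{dataset}.xz"

print(versioned_name("reimbursements", date(2016, 12, 6)))
# 2016-12-06-reimbursements.xz
```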
Once you finish the data collection, open a Pull Request and attach the YYYY-MM-DD-*.xz file to it. A team member will upload it to our Amazon S3 bucket and add it to the serenata_toolbox/datasets.py script, located in the Toolbox project (wait for it…).
As a data scientist, after extracting knowledge from data and creating predictive models, your job is to make yourself understood by your peers.
We use Jupyter Notebooks to show our research to the world. It's a tool where Markdown, code, its output, graphs and images mix in a single document, which can easily be exported to an HTML or PDF file with the same content the data scientist sees.
In the develop folder, we expect notebooks to follow the [YYYY-MM-DD]-[GITHUB_USERNAME]-[SHORT_ANALYSIS_TITLE] naming convention. Since it is too hard to collaborate on plain *.ipynb files via GitHub, we also ask contributors to generate *.py versions of their notebooks. Jonathan Whitmore suggests not just this repository's whole folder structure, but also a hook to automatically generate *.py files every time you save the notebook. Easy peasy.
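Such a hook usually lives in jupyter_notebook_config.py; the snippet below is a common nbconvert-based sketch of the idea, not the project's exact configuration:

```python
# jupyter_notebook_config.py (a sketch, not the project's exact config)
import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """After each save, convert the notebook to a *.py script via nbconvert."""
    if model["type"] != "notebook":
        return  # act only on notebooks, not plain text files
    directory, filename = os.path.split(os_path)
    check_call(["jupyter", "nbconvert", "--to", "script", filename], cwd=directory)

c.FileContentsManager.post_save_hook = post_save
```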
In the notebook itself, we just ask you for three things necessary in any essay:
The deliver folder is where analyses that explicitly draw conclusions are placed. We still don't have any.
Last but not least, the data folder contains datasets, and not just single datasets but multiple versions of them. Though git-ignored (because it can very quickly grow to gigabytes), compressed versions of each file can be fetched using the serenata_toolbox.datasets module (wait for it, it's about to come).
from serenata_toolbox.datasets import fetch_latest_backup
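The real fetch_latest_backup downloads files from our S3 bucket; the idea of "latest" rests on the date prefix, which can be sketched with a toy helper (hypothetical, not the toolbox's actual code):

```python
def latest_versions(filenames):
    """Given YYYY-MM-DD-<name>.xz file names, keep only the newest of each dataset.

    ISO dates compare correctly as strings, so the 10-character prefix
    is enough to decide which version is more recent.
    """
    newest = {}
    for name in filenames:
        day, dataset = name[:10], name[11:]
        if dataset not in newest or day > newest[dataset][:10]:
            newest[dataset] = name
    return sorted(newest.values())

print(latest_versions([
    "2016-11-19-reimbursements.xz",
    "2016-12-06-reimbursements.xz",
    "2016-12-06-companies.xz",
]))
# ['2016-12-06-companies.xz', '2016-12-06-reimbursements.xz']
```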
A simple Python library containing code shared between more than one of the project's repositories. It is not published on the Python Package Index, so it must be installed via its Git URL.
With the files currently placed there, you are able to fetch datasets either from our Amazon S3 bucket or directly from their source, the Brazilian Chamber of Deputies' site. Although we did our best to make everything accessible, downloading and processing the datasets from the Lower House may take hours on old computers. Fortunately, we ingest 3GB and output a single 11MB compressed file.
The great mind behind the whole project. She is a command line tool with a single important script, run with $ python rosie/main.py. It processes new reimbursements made with the National Congress' money and outputs a tabular file: each row is a reimbursement, and the columns contain Rosie's judgment for each suspicious activity.
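A hypothetical slice of that output might look like the rows below (the column names are invented here to illustrate the shape, one boolean per suspicion):

```python
# One row per reimbursement, one boolean column per suspicion
rows = [
    {"document_id": 1001, "meal_price_outlier": True,  "over_monthly_subquota": False},
    {"document_id": 1002, "meal_price_outlier": False, "over_monthly_subquota": False},
]

# A reimbursement deserves a closer look if any suspicion column is True
suspects = [r["document_id"] for r in rows
            if any(v for k, v in r.items() if k != "document_id")]
print(suspects)  # [1001]
```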
The contents of this repository are generated from research made and published in datasciencebr/serenata-de-amor notebooks. When we believe an irregularity classifier is ready for a production system, it's moved to Rosie.
Here we are a bit stricter about what gets added. Rosie will output the irregularities of each expense and we must trust her judgment. Every contribution is backed by at least one Jupyter Notebook, ships with automated tests and implements the scikit-learn interface for estimators.
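That interface boils down to fit() learning from data and predict() flagging suspicions. The minimal sketch below follows that shape with a made-up threshold rule; the class name and logic are hypothetical, not one of Rosie's real classifiers:

```python
class PriceOutlierClassifier:
    """Toy estimator following the scikit-learn fit/predict convention."""

    def fit(self, X, y=None):
        # Learn the mean and standard deviation of the training prices
        mean = sum(X) / len(X)
        variance = sum((x - mean) ** 2 for x in X) / len(X)
        self.mean_, self.std_ = mean, variance ** 0.5
        return self  # fit returns self, as scikit-learn expects

    def predict(self, X):
        # Flag values more than 3 standard deviations above the mean
        return [x > self.mean_ + 3 * self.std_ for x in X]

clf = PriceOutlierClassifier().fit([30, 35, 40, 32, 38])
print(clf.predict([36, 500]))  # [False, True]
```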
Doing Machine Learning, researching suspicious activities and finding cases to request formal clarifications are useless if we can't show them to our friends, right? Jarbas is Serenata de Amor Operation's front end to the world.
Developed with a mix of Elm in the front seat and Python's Django in the back seat, it provides search functionality and a way of looking at expenses one by one. For each reimbursement, you see all the data we collected about it, its sources and how Rosie classified it for each of the irregularities she is able to judge. You don't need to trust us; we are transparent end-to-end in the process. Check data sources, verify our analyses and improve our tests.
People can collaborate in any of the repositories, trying to reproduce analyses or suggesting new approaches. Want something different? Propose ideas or fork the project. Create new sibling projects borrowing our work. You don’t need to ask permission.
After all, we are all looking for the best way to elect our representatives. Question them and bring new minds into this world. One does not need to receive a salary from the government to do politics.
Originally published at medium.com/serenata on Dec 12, 2016.