Konstantinos Vougas was at the top of the leader-board for InnoCentive's first Prodigy Challenge, the Predictive Data Analysis Challenge. The Prodigy is a Solution Test Tool that provides rapid feedback to Solvers and displays the performance of the top ten Solvers since the Challenge was posted. We asked Konstantin to share his experience of working on this new Challenge type in particular and on InnoCentive in general.
Hello Konstantin. Thank you for taking the time to talk to us about working on the first Prodigy Challenge. Just so we have a bit of background, can you tell us how you learned about InnoCentive and what made you become an InnoCentive Solver?
I am always on the lookout for new opportunities in the field of science and I am attracted to the internet as a means of communication. Therefore, as soon as I discovered InnoCentive from a Google search, I knew that this was for me. What InnoCentive does is globalize R&D and I think that's great.
Have you worked on any Challenges prior to the Predictive Data Analysis Challenge? If so, can you tell us about them?
Although I have been checking the challenges posted on InnoCentive for the past couple of years, the Prodigy Challenge was the very first challenge I decided to undertake and formally submit a solution for. I guess I jumped straight into the middle of the ocean...
What attracted you to the Prodigy Challenge?
Ever since I was a child I have had a soft spot for scripting, mathematics and challenging problems. As soon as I read the details of the challenge, I knew it fit my criteria. A huge dataset with more than a million features, of which only a few carried the information (or part of it) needed for a good prediction, buried within experimental noise, missed measurements and a small number of observations: it sounded just about right, and I just couldn't say no.
What was it like working on the Prodigy Challenge and the rapid feedback it provided? Did the Prodigy make you feel more confident about your solutions? Did it make you feel that you were engaged with other Solvers by knowing (in a strange way) who your competition was? Did it help you avoid incorrect answers?
The Prodigy is, for me, the best feature ever introduced into a challenge, although it is definitely not for the faint-hearted. I noticed the challenge a month after it was posted. People had already started scoring above the threshold. Initially I was excited and full of momentum. After my initial attempts on the Prodigy, which crashed and burned really badly, I got frustrated and the 0.14 threshold started to seem like a distant dream.
You know, you are working hard on an idea that flashed in your brain the previous night and sounds just about perfect. So perfect, actually, that you are in a hurry to wrap everything up and have a go in the Prodigy to finally catch the dream and enter the top 10. You are excited, you enter the data, press enter and... boom, your score is 0 and all your excitement turns to dust. I thought of quitting more than once, but I was determined to give everything I had to it, so I just picked up the pieces and started working all over again. After a lot of hard work I entered the top 10, and then I got even more excited, which carried me to first place. When I got there, I celebrated my "victory" and relaxed for about a day. Then I realized that other people were also working hard and were catching up with me day by day. The anxiety to maintain first place became the dominant force, driving me to work even harder, to push the solution even further, to make it even better, and to make sure that I remained at the top right through to the end. The Prodigy definitely kept me on the right track, offering me instant feedback on every decision I took on my methodology. I do not think that this challenge would have received any serious solutions had it not been for the Prodigy. Apart from being an invaluable testing tool, the Prodigy is a feature that drives and boosts productivity and efficiency to the highest possible extent.
Without sharing anything proprietary, can you tell us your inspiration for the solution?
The difficulty of the problem immediately came down to feature selection. With such a large number of features and only a relatively small number of observations, a combination of techniques such as the 'least absolute shrinkage and selection operator' (LASSO) and 'Principal Component Analysis' (PCA) provided an efficient means of zeroing in on the most information-rich features. The machine learner of choice was a kernel-based variant of the well-established Support Vector Machine (SVM) called the 'Relevance Vector Machine' (RVM), which uses Bayesian inference to obtain parsimonious solutions for regression and classification. A kernel-based learner was used to compensate for the fact that the RVM (just like the SVM) is a linear predictor, whereas the relationship of the features to the 'Y' trait was not likely to be linear. Out of the reduced feature set, the feature groups of the highest possible predictive value were determined through well-established algorithms such as 'best-first selection' and 'backward elimination'. All the models were evaluated by multiple iterations of k-fold cross-validation. Finally, the best models were combined, using the 'ensemble learning' paradigm, to provide one powerful predictor.
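Editor's note: to make the LASSO step mentioned above more concrete, here is a minimal sketch of how an L1 penalty can drive the coefficients of uninformative features to exactly zero. It is not Konstantin's code (he reports working in R, and his solution is proprietary). The toy data, the function names and the penalty value are all assumptions made for illustration, and the tiny full-factorial design has more observations than features, unlike the Challenge data.

```python
# Toy LASSO feature selection by cyclic coordinate descent (standard library only).
# All names, data and the penalty value are hypothetical, chosen for illustration.

def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 if it falls inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, sweeps=100, tol=1e-10):
    """Minimise (1/(2n)) * ||y - Xb||^2 + penalty * ||b||_1 over the coefficients b."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - Xb while every coefficient is still zero
    for _ in range(sweeps):
        largest_change = 0.0
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(v * v for v in column) / n
            if scale == 0.0:
                continue  # a constant-zero feature carries no information
            # Covariance of feature j with the residual computed without feature j.
            rho = sum(v * (r + v * coef[j]) for v, r in zip(column, residual)) / n
            updated = soft_threshold(rho, penalty) / scale
            delta = updated - coef[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(column, residual)]
                coef[j] = updated
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    return coef


# Hypothetical data: 16 observations of 4 features coded as +1/-1 (a full
# factorial design). Only features 0 and 2 influence the response.
X = [[1.0 if (i >> j) & 1 else -1.0 for j in range(4)] for i in range(16)]
y = [3.0 * row[0] - 2.0 * row[2] for row in X]

coef = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print("coefficients:", coef)
print("selected features:", selected)
```

On this toy data the two uninformative features receive coefficients of exactly zero, while the informative ones are kept but shrunk by the penalty (from 3 and -2 to 2.5 and -1.5). In a real analysis the penalty would normally be tuned, for example by the k-fold cross-validation mentioned above.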
You were clearly at the top of the leader-board, proving your aptitude for computational analysis. Is this your primary occupation or a hobby? Would this acknowledgment have any effect on your career choices?
Actually, data mining and knowledge discovery started as a hobby, with roots in my passion for scripting and my love for mathematics. My main occupation is currently unrelated to the field of computational analysis, as I work as a research technician in the genomics and proteomics division of the Foundation of Biomedical Research of the Academy of Athens (http://www.bioacademy.gr). During the last 4 years, though, I have made an effort to provide data analysis and bioinformatic support to various groups and projects, and the results of this effort have now started to come out. I definitely intend to make bioinformatics and computational analysis my full-time occupation, and hopefully this acknowledgment will back up my decision.
This Challenge was related to genomics; do you have any experience in this field?
As I mentioned earlier I am currently providing computational analysis support to various groups mainly working on genomics. My primary tasks are to perform quality control and knowledge discovery on projects concerned with a) differential gene expression (DNA-microarrays), b) genomic copy number variation (aCGH) and c) Whole Genome Association Studies (SNP Genotyping). Finally, I am currently working on distributed computing and GPU utilization from within R, which is the statistical computing environment I use 99% of the time.
What was your experience of working with InnoCentive? Would you work on another Challenge?
It was a very fulfilling experience because through it, I had the opportunity to expand my horizons. I learned new things, not only about computing and science, but about myself and my limits. I will definitely do it again.