Showing posts with label scientific research.

Thursday, July 22, 2010

Call for coherent, systematic field summaries

The overview

As a Ph.D. student, figuring out what has already been done by past researchers is by far the hardest problem I have encountered. The current expectation and requirement is that every student will browse dozens of conferences and journals going back decades. At the end of this long and tedious process, there are several common pitfalls. First, missing even a single paper in a single conference can lead to months of wasted work spent developing an idea that already exists. Second, even if every relevant paper is indeed located, students almost always read each paper in isolation, and rarely draw the links between papers. This is critically important, as described below.

The details

I recently attended the International Computer Vision Summer School (2010). In one session, students were asked to read 3 papers and trace the ideas in them back as far as possible in the literature, in the form of a tree. After reviewing the submissions, the organizer was quick to point out that there was very little overlap among the trees submitted for these 3 papers. He went on to explain that the 3 papers had presented essentially the same concept, just from slightly different angles and using different terminology and notation. Thus, the trees for the 3 papers should in fact have been identical! This was an excellent concrete example of the problem - it takes an "already expert" to extract these deep and extremely important connections from a literature review. Leaving this to every new student is not only fruitless (they will rarely extract the correct information anyway), it is also enormously replicated effort! There should be a system in place where efforts are pooled to do this completely and correctly a single time.


Previous attempts


There are occasionally "survey papers" written. These come very close to a good solution. However, they suffer from a lack of viewpoints, as well as a lack of frequency. These two issues are directly addressed in the following section.


The proposal


There are two phases to my proposal for action: first the "catch up" phase, followed by the "maintenance" phase.


"Phase 1: Catch Up"


The "catch up" phase is the longest and most difficult, but also the most helpful.
I propose that the general population of a field nominate and elect a committee of experts - the most qualified, accomplished, and knowledgeable people in the given field. These people would be charged with two tasks. The first is splitting the field into an appropriate number of subfields. I can only speak for my field of Computer Vision; one could break it down into "Structure from Motion", "Object Recognition", "SLAM", etc. There should be 3-5 experts in each of these subfields on the committee. Each sub-committee is then charged with producing a survey paper of the work in its area, starting as far back as possible and continuing to the present year. This will certainly be a large document with many, many references; however, it is important not to get lost in the task of listing references. The connections between papers, and following the evolution of each idea, are the central point of this whole project. The payment for this exercise is an overwhelming sense of advancing the state of the art of scientific research procedures, as well as a resume line item indicating that you are a recognized expert.


"Phase 2: Maintenance"


This is the easy phase! The process must be performed yearly (or at some other regular interval). Again, a committee must be selected. However, all that must be done is a short review of what has happened in the sub-field in the last year. References should NOT, for the most part, be more than one year old. This keeps the reviews linear and sequential, making them extremely easy to follow.




Potential problems


After initial conversations with some field experts, it is apparent that a project like this could raise political issues. You may get people complaining, "Why is my paper not included in the survey!?" You may also expose parallels that the original authors did not realize, making them feel "foolish". In my opinion, the progress of the field and the rapid absorption of young researchers are much more important than protecting individuals from this type of silly whining.

Potential benefits

If students could read a couple of these documents and be fully caught up on the state of the art of their sub-field, many new doors would open. First, people would not be so restricted to a single sub-field. It would be possible to stay current in multiple sub-fields simply by reading through these documents when they are released each year. I have seen many cases where a solution from an outside field was adapted to a problem in the field with amazing results. Second, students could move forward confident that their work is actually on a track the field is interested in. They could also be certain that their work has not been previously attempted. The time savings, multiplied by the number of students, is incredible. By applying the correct resources (the experts) in the correct places (a directed effort toward these systematic summaries), a much more efficient community can certainly be achieved.

Friday, May 14, 2010

Creating a Common Research Language

As a research engineer, I find the sharing and dissemination of ideas throughout the field absolutely critical. These ideas are tightly coupled to their implementations. As an explanatory example, consider a very simple invention, the pencil. One can explain the concept of a pencil in a single sentence: "A pencil is a writing implement usually constructed of a narrow, solid pigment core inside a protective casing." Now, suppose you have an incremental improvement to the pencil - you want to make it one color for the first half of the pigment and a second color for the second half! It is a very simple new concept. If you were the creator of the original pencil, this small change would be extremely fast and easy to produce - simply use two different pigments in your pencil core production process. If you are NOT the creator of the original pencil, this is now an extremely complex task - you must address questions like "How do I compact graphite?", "How do I keep the core from slipping inside the casing?", and "What kind of wood do I use for the casing?". These questions have already been addressed, and likely studied in detail, by the original creator of the pencil. That is of no use to you, though. Unless you are personal friends with this person, it is not likely that he will let you use his pencil factory to try out your new idea, so you must start from scratch.

This contrived pencil example is shockingly similar to the daily situations encountered by engineering researchers. The major difference is that in software there is a simple solution! The situation is all too frequent: someone develops an algorithm and spends years perfecting its implementation, and now I want to use that algorithm as a step in my research. The path rarely strays from:

1) Look online and find that there is no publicly available implementation of the algorithm published by the author.

2) Email the author asking them to share their implementation with you.

3a) The author has agreed! Now you realize the code has been written without a single regard for future users - there are no comments and no standardized workflow, basically rendering the code useless to anyone but the original author.

3b) More commonly, the author will not respond, or will give you a "sorry, I can't share that code" type of response. Either way, you must now move forward with nothing.

4) Decide whether to take your research in a different direction, or stay and fight by implementing the algorithm yourself.

5) Spend countless hours and days fighting with the nuances that were left out of the publication of the algorithm, all of which the author has surely already addressed in their implementation.

(Note that step 5 is where the majority of graduate students' time is spent - REdoing past work!)

Having experienced the above path on countless occasions, I have seen three major problems that arise when implementations are not shared:

1) Massive time expense for newcomers to a field.

Consider that you are a new PhD student. There are two options: 1) No students in your lab have worked in the area you plan to work in. In this case, you must start absolutely from scratch. 2) Your research interests were shared by previous students in your lab. The good news: you get to start with implementations of several important algorithms! The bad news: each algorithm was implemented by a different student, in a different language. If you are lucky enough that the language is common, the implementations certainly did not use the same libraries. If you are extremely lucky and they did use the same libraries, the code is likely written in a way that does not encourage reuse - either no comments or no reasonable API. You might as well be in case 1 :(

2) Clique-ish research groups.

In the very rare case that a lab has multiple students working together on parts of a unified problem, the effects of the pencil analogy are intensely amplified. After a few years of multiple people working together on a code base, incremental changes are extremely easy to pump out quickly. This leads to an even more intimidating barrier for a newcomer to the field. This, in turn, leads to a less diversified outlook on problems, as the same people are almost exclusively and continually working on certain problems.

3) ``Bad'' re-implementations.

When it has been determined that the way to proceed is to re-implement an existing algorithm, many things start to go wrong. Besides spending massive amounts of time that should be spent elsewhere, one must also consider the quality of the re-implementation. The original researcher spent months or years completely dedicated to this particular algorithm. You intend to simply use it as a small piece in a much larger puzzle. The implementation you create in a week will absolutely not compare to the original in speed, correctness, or reusability. This leads to inaccurate comparisons in research results, as well as overall lower quality and speed of future research.

Scientific computing languages are no different from spoken languages. Spoken languages developed thousands of years ago, and each region's society settled on a particular language so that everyone could communicate with one another - likely without any intentional coordination. Here, the people are the researchers, and the regions are the fields of research. If laypeople could reach this arrangement without even intentional consideration, why can researchers not do the same in the face of a very serious problem?

My field is computer vision and scientific data analysis. The languages of choice are Matlab and C++. The libraries (sub-languages) of choice are CGAL, VTK, ITK, and VXL. This exposition is not intended to claim a particular solution (though many of you know which way I lean :) ). If one of these were chosen as the "accepted" language for this type of research, the field would be able to accelerate at an amazing rate.

If a researcher is able to say "I have implemented all of my research as VTK filters", the next researcher can simply use that work as a building block for the next year's research. If, instead, the researcher says "I wrote all of my classes from scratch by myself 20 years ago and my students and I have used them ever since", there is a serious problem.