The genealogical research process might be a good model for the rest of the research world.

One of my hobbies is exploring the technical aspects of genealogy research. By this I mean everything except the locating of sources. Locating sources is important, of course; indeed, many people seem to treat it as the only activity in genealogy. But I am continually impressed by how well professional genealogists understand what to do with the sources they find.

Genealogical Research Process

In genealogy, the sources are structurally straightforward: they are documents that reference particular individuals. The conclusions are also structurally simple: they are people, relationships between people, and events involving people. This simplicity allows a fairly robust description of the path from source to conclusion.

Sources contain information; information can be used as evidence to support or reject putative facts. A census administered on 12 November 1930 might contain the information that there was a John Doe who was 54 years old on that day. This information contains evidence to support the putative fact that John Doe was born between 13 November 1875 and 12 November 1876. Evidence can also arise from things other than positive information; a child not listed on a census of a family is evidence that that child was either not yet born or already dead at the time of the census. Evidence is never conclusive: people make mistakes and lie, and strange-sounding explanations are sometimes correct.
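The date arithmetic in the census example can be sketched in a few lines of Python. This is only an illustration: the function names `birth_range` and `_years_before` are my own, and the sketch assumes the stated age is exact and counted in completed years.

```python
from datetime import date, timedelta


def _years_before(d: date, years: int) -> date:
    """Return the same calendar day `years` earlier."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February.
        return d.replace(year=d.year - years, day=28)


def birth_range(observed_on: date, age_years: int) -> tuple[date, date]:
    """Earliest and latest birth dates consistent with an age on a given day."""
    # Latest: the person turned `age_years` on the observation day itself.
    latest = _years_before(observed_on, age_years)
    # Earliest: one day too late to have already turned `age_years + 1`.
    earliest = _years_before(observed_on, age_years + 1) + timedelta(days=1)
    return earliest, latest


earliest, latest = birth_range(date(1930, 11, 12), 54)
print(earliest, latest)  # 1875-11-13 1876-11-12
```

The result is a range rather than a single date, which is a reminder that the census entry is evidence for a putative fact, not the fact itself.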

Once enough evidence has been gathered, we can use it to construct a proof argument. This takes the form of a set of mutually consistent conclusions with a list of supporting evidences and a list of all contrary evidences. Each contrary piece of evidence is accompanied by our best explanation as to why it is not correct.

One of the nice things about genealogical research from a technical standpoint is that every portion of this can be easily stored in a canonical data structure. I had hopes that GEDCOMX would be this canonical structure when it was first announced. It now looks like it is only going to be half of it. Sources, information, and putative facts can be modeled as nodes of a graph. Evidences (pro and con) are the edges between these nodes, labeled with explanations where appropriate. A proof argument is then a collection of putative facts and a selection of explanations offering a single consistent view of the past.
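To make the graph idea concrete, here is a minimal Python sketch. It is not GEDCOMX or any existing format; the class and field names (`Node`, `Evidence`, `ProofArgument`, `unexplained_contrary`) are my own assumptions, the link from a source to the information it contains is left out for brevity, and the conflicting record is invented purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Node:
    kind: str   # one of "source", "information", "fact"
    label: str


@dataclass(frozen=True)
class Evidence:
    information: Node                  # the information used as evidence
    fact: Node                         # the putative fact it bears on
    supports: bool                     # True for pro, False for con
    explanation: Optional[str] = None  # why contrary evidence is discounted


@dataclass
class ProofArgument:
    facts: list[Node]
    evidences: list[Evidence] = field(default_factory=list)

    def unexplained_contrary(self) -> list[Evidence]:
        """Contrary evidence against accepted facts that lacks an explanation."""
        return [
            e for e in self.evidences
            if e.fact in self.facts and not e.supports and not e.explanation
        ]


census_age = Node("information", "John Doe was 54 years old on 12 November 1930")
# Hypothetical conflicting record, invented for this example only.
other_age = Node("information", "A later record gives John Doe's birth year as 1874")
birth = Node("fact", "John Doe was born between 13 November 1875 and 12 November 1876")

argument = ProofArgument(
    facts=[birth],
    evidences=[
        Evidence(census_age, birth, supports=True),
        Evidence(other_age, birth, supports=False),
    ],
)
print(len(argument.unexplained_contrary()))  # 1
```

A check like `unexplained_contrary` is the kind of thing a canonical structure makes cheap: the proof argument is incomplete until every contrary edge carries an explanation.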

Do Thou Likewise

As a researcher in computer science, I often wish we had the same kind of structure as genealogy. Our sources and conclusions are so much more diverse that we can’t achieve the same simplicity. But could we do a better job at separating the various pieces of research?

What if computer scientists had to separately identify sources (e.g., algorithms or experiments), the information they contain (e.g., big-O or statistical analyses), the putative facts they evidence (e.g., “Floyd-Warshall is faster than Johnson for sparse graphs”), and both positive and negative evidences? Such a structure may not be as easy to read as the current free-form paper model (though it could be easier to read than some papers are), but it would have the advantage of clarity: no more hiding behind fancy prose or burying the lead in a poor outline. Such structure would also enable smaller contributions: a researcher could, for example, add a new source to an existing structure to strengthen (or weaken) its conclusions without needing to create an entire publication from the work.
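As a rough sketch of what one such record might look like, the example claim can be written down with its evidence attached. The layout and the names `Evidence` and `tally` are my own assumptions; the running times are the standard textbook bounds, with Johnson's algorithm assumed to use a Fibonacci heap.

```python
from typing import NamedTuple


class Evidence(NamedTuple):
    source: str       # where the information comes from
    information: str  # what the source says
    supports: bool    # True if it supports the claim, False if it contradicts it
    note: str         # reasoning that connects the information to the claim


claim = "Floyd-Warshall is faster than Johnson for sparse graphs"

evidences = [
    Evidence(
        source="asymptotic analysis of both algorithms",
        information="Floyd-Warshall: O(V^3); Johnson: O(V^2 log V + V E)",
        supports=False,
        note="With E = O(V), Johnson's bound is O(V^2 log V), which grows slower than V^3.",
    ),
]


def tally(items: list[Evidence]) -> tuple[int, int]:
    """Count (supporting, contrary) pieces of evidence."""
    pro = sum(1 for e in items if e.supports)
    return pro, len(items) - pro


print(tally(evidences))  # (0, 1)
```

A small contribution would then be a single new entry appended to `evidences`, which either strengthens or weakens the claim without a full paper around it.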

I don’t know that the particular elements of the genealogical research process are necessarily the right elements in other fields. But I am attracted by the fact that they are identified as distinct. I hope some day my field will develop the same maturity.