Unfinished thoughts about using multiple tools to work on the same data.

One of the current difficulties with genealogical software is its plurality. There are many tools available for assisting in locating and recording ancestors, each with distinct features and interfaces. This in itself is a good thing: it means you can find a tool that suits your workflow or use multiple tools that support distinct elements of your process. But it is also a real problem.

The problem comes when you try to move data between tools. The basic model here is simple: one tool exports your current genealogy, the other tool imports it. And the problem isn’t the lack of a usable format for these exports: GEDCOM, though it seems almost universally disliked, is a de facto standard format and most, if not all, tools can handle it. The problem derives instead from certain types of data.

Classes of problem data

The first kind of problem data I’ll call private. Sometimes there’ll be a datum that a tool shouldn’t include in an export. Private data might exist because the tool aggressively enforces its intellectual property over the datum or because the user wants to hide certain facts to guard against identity theft, invasion of privacy, or blackmail.

The second kind of problem data I’ll call custom. One tool might support recording “friend of” relationships where another only handles familial ties. A custom datum can be exported, but importing it is more problematic. The introduction of custom data is a natural part of innovation in tool design.

The third kind of problem data I’ll call non-normative. The same non-normative datum might be exported by distinct tools in distinct formats. Non-normative data might arise when two different tools independently introduce the same or similar custom functionality.

Sharing with problem data

Consider two tools, A and B. Let x_p refer to a private datum created by tool X, x_c refer to a custom datum created by tool X, and x_n refer to a non-normative datum created by tool X. Ideally, we’d like to be able to export record (a, a_p, a_c, a_n) to tool B and back to tool A and recognize what was changed and what was not without losing anything. There are several considerations in this ideal.

Transferring Private Data: A exports to tool B, dropping a_p. Data is edited in B and then exported back to tool A. How do we know if the re-imported data should have a_p added back in or not (i.e., how do we tell an edit from a new record)?
Merging Non-Normative Data: If a_n has changed, what (if anything) should B do about b_n? Can tools even tell the difference between custom and non-normative data?
Merging with Custom Data: If a record has changed only in a_c, should B consider the record to have changed at all? Should A → B → A preserve a_c, drop a_c, or should both behaviors be permitted? Can we merge records that are sub- or super-sets of one another?

Given these issues, there are several possible data transfer models.

Share What is Kept

Each tool only exports data it understands, ignoring other data on import; but it also exports a list of what data it understands.

Let A* be the set of data that A understands, B* the set that B understands, a be an existing record in A and b be a record exported from tool B. A considers b to refer to the same record as a if b ∩ A* = a ∩ B*.
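A minimal sketch of that comparison, assuming a record is a mapping from field names to values and that A* and B* are sets of field names. The field names and the helper names (restrict, names_same_record) are made up for illustration.

```python
def restrict(record, understood):
    """Keep only the fields of `record` that appear in `understood`."""
    return {k: v for k, v in record.items() if k in understood}


def names_same_record(a, b, a_understands, b_understands):
    """True if b ∩ A* equals a ∩ B*."""
    return restrict(b, a_understands) == restrict(a, b_understands)


# Hypothetical field sets: each tool has one field the other lacks.
A_STAR = {"name", "birth", "friend_of"}
B_STAR = {"name", "birth", "occupation"}

a = {"name": "Ada Smith", "birth": "1850", "friend_of": "Bo Jones"}
b = {"name": "Ada Smith", "birth": "1850", "occupation": "weaver"}
b_edited = {**b, "birth": "1851"}  # B changed a field both tools understand

unchanged_matches = names_same_record(a, b, A_STAR, B_STAR)
edited_matches = names_same_record(a, b_edited, A_STAR, B_STAR)
```

In this sketch b matches a, but b_edited does not: an edit to a shared field makes the record look new, which is the edit-versus-new-record question raised earlier.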

Keep but Ignore

The XML model: if you don’t understand a tag, keep it around but otherwise ignore it.
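One way this might look, again assuming records are field-to-value mappings. The Tool class and its method names are hypothetical, not taken from any existing program.

```python
class Tool:
    """A tool that edits fields it understands and carries the rest along."""

    def __init__(self, understood):
        self.understood = set(understood)
        self.known = {}    # fields this tool can display and edit
        self.unknown = {}  # fields kept verbatim but otherwise ignored

    def load(self, record):
        self.known = {k: v for k, v in record.items() if k in self.understood}
        self.unknown = {k: v for k, v in record.items() if k not in self.understood}

    def edit(self, field, value):
        if field not in self.understood:
            raise KeyError(f"this tool does not understand {field!r}")
        self.known[field] = value

    def export(self):
        # Unknown fields are written back out exactly as they came in.
        return {**self.unknown, **self.known}


from_a = {"name": "Ada Smith", "birth": "1850", "friend_of": "Bo Jones"}

tool_b = Tool({"name", "birth"})  # B has no notion of "friend_of"
tool_b.load(from_a)
tool_b.edit("birth", "1851")
back_to_a = tool_b.export()
```

Here the custom datum survives A → B → A untouched, though B has no way to tell whether its own edit made that datum stale.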

Share Edit Logs

Instead of sharing data itself, share a replayable list of edits. Ignore edits that are not supported by this tool, but keep them around for the next export.
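A rough sketch of the idea, assuming each edit is a (record id, field, new value) triple and that "supported" means the tool understands the field. Real edit logs would need deletions, ordering guarantees, and more; this only shows the replay-and-retain behavior.

```python
def replay(log, supported):
    """Rebuild a tool's view by replaying only the edits it supports."""
    state = {}
    for record_id, field, value in log:
        if field in supported:
            state.setdefault(record_id, {})[field] = value
    return state


# Hypothetical field sets for two tools.
A_SUPPORTED = {"name", "birth", "friend_of"}
B_SUPPORTED = {"name", "birth"}

log_from_a = [
    ("r1", "name", "Ada Smith"),
    ("r1", "birth", "1850"),
    ("r1", "friend_of", "Bo Jones"),
]

# B shows only what it supports...
view_in_b = replay(log_from_a, B_SUPPORTED)

# ...but exports the whole log, unsupported edits included, plus its own edit.
log_from_b = log_from_a + [("r1", "birth", "1851")]
view_in_a = replay(log_from_b, A_SUPPORTED)
```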

Accumulate ID Sets

Require a persistent write-once data model. Keep, with each datum, a set of all (tool, datum number) pairs and export these sets with the serialization of the data. On import, re-use data already in the tool and create new data for those not already present.
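Here is one possible reading of that model as code. The Store class, its method names, and the use of sequential datum numbers are all assumptions made for the sketch.

```python
class Store:
    """Write-once store: values never change, only their ID sets grow."""

    def __init__(self, tool):
        self.tool = tool
        self.data = {}  # local datum number -> (value, set of (tool, number))

    def create(self, value, ids=()):
        number = len(self.data) + 1
        self.data[number] = (value, set(ids) | {(self.tool, number)})
        return number

    def export(self):
        return [(value, frozenset(ids)) for value, ids in self.data.values()]

    def load(self, exported):
        for value, ids in exported:
            mine = [n for tool, n in ids if tool == self.tool and n in self.data]
            if mine:
                # Already present: re-use it and remember the other tools' IDs.
                self.data[mine[0]][1].update(ids)
            else:
                # Not seen before: create it, carrying the foreign IDs along.
                self.create(value, ids)


a = Store("A")
a.create("Ada Smith, born 1850")

b = Store("B")
b.load(a.export())           # B gets its own number for A's datum
b.create("Ada Smith, weaver")

a.load(b.export())           # A re-uses datum 1 and creates one new datum
```

After the round trip, A holds two data rather than three, and its first datum knows it is also B's datum 1.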

Globally-Unique IDs

Require each datum be given a globally-unique ID. Data files store the ID and the original serialization created by the originating tool. If a datum is edited, it is given a new ID and serialized afresh with a list of IDs it is a newer version of.
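A small sketch of what such a datum might carry, using Python's uuid module for the IDs. The Datum fields, the revise and current helpers, and the serialization strings are placeholders I made up, not a proposed file format.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Datum:
    origin_tool: str        # tool that produced the serialization
    serialization: str      # stored verbatim, never reinterpreted
    supersedes: tuple = ()  # IDs this datum is a newer version of
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


def revise(old, tool, serialization):
    """An edit yields a new datum with a fresh ID pointing back at the old one."""
    return Datum(tool, serialization, supersedes=(old.id,))


def current(data):
    """Data that no other datum in the collection claims to supersede."""
    superseded = {i for d in data for i in d.supersedes}
    return [d for d in data if d.id not in superseded]


original = Datum("A", "birth: 1850")
edited = revise(original, "B", "birth: 1851")
latest = current([original, edited])
```

Because the old datum is never modified, a tool that only knows the original ID can still recognize the edited datum as its descendant.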

None of these inherently solves the sharing problem. This is more a list of things I’m thinking about than it is a proposed solution.