Splitting a Record
© 14 Aug 2014 Luther Tychonievich
Licensed under Creative Commons: CC BY-NC-ND 3.0

Applying the Principle of Sensible Disbelief to derive polygenea.

 

An Example

It is common in existing family history data to store conclusion-oriented data, such as the following example person record which references two other person records that I do not show:

Person:
    id: 1
    name:
        as-written: Charlotte Ward
        surname: Ward
        givenname: Charlotte
    birth:
        father: 2
        mother: 3
    sources:
        - Kelleys Island, Erie, Ohio, pp. 370–371 
          no. 22, 1876
        - US Census 1900, Kelleys Island, Erie, 
          Ohio, United States, sheet 5B, family 114

I’m using YAML-like syntax because it is easier to read than many other data formats. I suspect an implementation would use XML or JSON instead.

Now let’s remove the parts that violate the principle of sensible disbelief. People might contend that this person’s name is recorded incorrectly, so we’ll have to pull that out.

Person:
    id: 1
    birth:
        father: 2
        mother: 3
    sources:
        - Kelleys Island, Erie, Ohio, pp. 370–371 
          no. 22, 1876
        - US Census 1900, Kelleys Island, Erie, 
          Ohio, United States, sheet 5B, family 114

We could get the parents wrong, so they need to be separate too.

Person:
    id: 1
    birth:
    sources:
        - Kelleys Island, Erie, Ohio, pp. 370–371 
          no. 22, 1876
        - US Census 1900, Kelleys Island, Erie, 
          Ohio, United States, sheet 5B, family 114

Maybe those sources are about different people, so we have to get rid of all but one source. We can leave one, though, since a person with no source could never have entered our research to begin with.

Person:
    id: 1
    birth:
    source: Kelleys Island, Erie, Ohio, pp. 370–371
            no. 22, 1876

The style of citation is bad too, or at the very least a point of disagreement. Pull it out and replace it with a reference to another node.

Person:
    id: 1
    birth:
    source: 4

How about the word “Person”? Maybe you thought it was a person but it was really a pet, or a city, or a house, or an imaginary friend…. We do know it was something, but what kind of thing is a sensible object of disbelief. We’d best store just a “thing” and let the personness be a subject of discussion.

Thing:
    id: 1
    birth:
    source: 4

And now that we realize it might not be a person, it also might not be born, right?

Thing:
    id: 1
    source: 4

And here we are left with the core atomic element of a person, place, pet, event, or almost anything else: the Thing node, which contains exactly two fields: a unique ID, and a pointer to exactly one source node.
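As a minimal sketch of how small that node is, here is one way it might look in Python. The class name and field types are my assumptions for illustration, not part of any polygenea specification.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Thing:
    """Atomic node: something a source mentions, with no claims attached."""
    id: int      # unique node ID
    source: int  # ID of exactly one source node


# Thing 1 from the example: all we record is that source 4 mentions it.
charlotte = Thing(id=1, source=4)

# The node carries exactly two fields and nothing else.
field_names = [f.name for f in fields(Thing)]
print(field_names)
```

Everything else we believe about the thing (its type, its name, its parents) has to live in other nodes that point back at this ID.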

So what about all those parts we pulled away? They all go off into nodes of their own. The sources, for example, become Citation nodes:

Citation:
    id: 4
    type: Christening record
    place: Kelleys Island, Erie, Ohio, United States
    page: 370–371
    date: 1876
    number: 22
    lang: en

Citation:
    id: 5
    county: Erie
    date: 1900-06-07
    district: 33
    household: 114
    lang: en
    line: 70
    sheet: 5B
    state: Ohio
    supervisors district: 12
    type: census
    document: Twelfth Census of the United States
    township: Kelleys Island
    village: Kelleys Island
    schedule: 1 – Population

The set of attributes of a citation is unbounded; anything you know about a cited document you can put in the citation.
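One way to model an unbounded attribute set is an ID plus a free-form mapping. The sketch below assumes string keys and string values; the `Citation` class and its `attributes` field are hypothetical names I chose for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Citation:
    """A source node: an ID plus any number of descriptive attributes."""
    id: int
    attributes: Dict[str, str] = field(default_factory=dict)


christening = Citation(4, {
    "type": "Christening record",
    "place": "Kelleys Island, Erie, Ohio, United States",
    "page": "370–371",
    "date": "1876",
    "number": "22",
    "lang": "en",
})

census = Citation(5, {"type": "census", "date": "1900-06-07", "sheet": "5B"})

# Nothing limits the keys; learning more about the document just adds entries.
census.attributes["line"] = "70"
census.attributes["household"] = "114"
```

A fixed schema would be easier to validate, but it would also force every kind of document into the same set of fields.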

The birth that we pulled out earlier is another Thing node:

Thing:
    id: 6
    source: 4

That the birth cites source 4 and not source 5 is not something we can determine from the original record alone; we have to actually know something about the content of the two sources.

Most of the other parts we removed are Property nodes; for example, we have the type of each thing:

Property:
    id: 7
    of: 1
    key: type
    value: person
    source: 4

Property:
    id: 8
    of: 6
    key: type
    value: birth
    source: 4

as well as fields like name:

Property:
    id: 9
    of: 1
    key: name
    value: Charlotte Ward
    source: 4

That source 4 used “Charlotte Ward” and not “Ward, Charlotte” or “First name: Charlotte; Last name: Ward” or some such is not something we can determine from the record alone; even with the “as-written” part of the name in the record we don’t know if it was written that way in source 4 or source 5.

Let’s assume that the separated name parts in the original example resulted from indirect evidence; that is to say, they were not identified as separate in the source, but we inferred them based on the name-writing conventions of the period. (We can’t tell whether the evidence was direct or indirect from the given data; we’d have to ask the researcher, and trust memory, or re-perform the research ourselves.) That phrase “based on” suggests a rule or trend, a pattern that usually holds which we can use to derive new information. To store a rule in the data, we list its antecedents and consequents: if the antecedents are matched by other nodes, the consequent nodes can be derived.

Rule:
    id: 10
    antecedents:
        - Citation:
            lang: en
            date: between(1800, 1900)
        - Node:
        - Property:
            of: antecedent #2
            key: name
            value: regex(^([^,]+) (\S+)$)
            source: antecedent #1
    consequents:
        - Property:
            of: antecedent #2
            key: surname
            value: group 2 of value of antecedent #3
        - Property:
            of: antecedent #2
            key: givenname
            value: group 1 of value of antecedent #3

This rule states that if you have a citation-style source in the English language, created between 1800 and 1900, and it is the source of a name property where the name value is a multi-word string with no commas (that’s what that regex means, in case you are not fluent in regular expressions), then you may derive a surname and givenname property.
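To make the matching concrete, here is a rough Python sketch of checking this one rule against a citation and a name property. The function name and the dict-based node layout are assumptions for illustration, and I am treating `between(1800, 1900)` as inclusive of both endpoints.

```python
import re

# The same regex as in the rule: comma-free text, a space, then a final word.
NAME_PATTERN = re.compile(r"^([^,]+) (\S+)$")


def apply_name_rule(citation, prop):
    """Return derived (key, value) pairs, or [] if the antecedents do not match."""
    # Antecedent 1: an English-language citation dated 1800-1900 (assumed inclusive).
    year = citation.get("date", "")[:4]
    if citation.get("lang") != "en" or not year.isdigit():
        return []
    if not 1800 <= int(year) <= 1900:
        return []
    # Antecedent 3: a name property whose source is that citation.
    if prop["key"] != "name" or prop["source"] != citation["id"]:
        return []
    match = NAME_PATTERN.match(prop["value"])
    if match is None:
        return []
    # Consequents: surname is group 2, givenname is group 1.
    return [("surname", match.group(2)), ("givenname", match.group(1))]


citation_4 = {"id": 4, "lang": "en", "date": "1876"}
property_9 = {"id": 9, "of": 1, "key": "name",
              "value": "Charlotte Ward", "source": 4}

derived = apply_name_rule(citation_4, property_9)
print(derived)  # [('surname', 'Ward'), ('givenname', 'Charlotte')]

# A comma-first rendering of the name does not match, so nothing is derived.
inverted = dict(property_9, value="Ward, Charlotte")
print(apply_name_rule(citation_4, inverted))  # []
```

A real rule engine would interpret the stored Rule node rather than hard-code one rule as a function; this only shows what the antecedent checks amount to.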

I am not suggesting that the rule syntax above is ideal; however, the idea of using functions to make more general rules holds. I anticipate that rules would usually be generated either via user-friendly rule-generation wizards (I may write more about those later) or by a relatively small set of users willing to write them by hand.

Using the rule we can create an inference; inferences match the rule up with some concrete antecedents and assert that we believe the rule holds in a particular case.

Inference:
    id: 11
    antecedents:
        - 4
        - 1
        - 9

What if we are wrong and the rule does not apply in this case? We’d add a Property of the inference node that asserts it is false, with a source explaining why we don’t believe it.

The inference is now the source of the consequents of the rule:

Property:
    id: 12
    of: 1
    key: surname
    value: Ward
    source: 11

Property:
    id: 13
    of: 1
    key: givenname
    value: Charlotte
    source: 11

Now, recall that we had two things, a person and a birth:

Thing:
    id: 1
    source: 4

Thing:
    id: 6
    source: 4

Property:
    id: 7
    of: 1
    key: type
    value: person
    source: 4

Property:
    id: 8
    of: 6
    key: type
    value: birth
    source: 4

How do we connect them together? Using a Connection:

Connection:
    id: 14
    from: 1
    description: is-child-in
    to: 6
    source: 4

We’d likewise have connections for the father and mother, like so:

Connection:
    id: 15
    from: 2
    description: is-father-in
    to: 6
    source: 4

Connection:
    id: 16
    from: 3
    description: is-mother-in
    to: 6
    source: 4

Connections have the same number of fields as properties, but the value is a reference instead of a string.
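As a sketch of how these connections might be read back, the snippet below gathers everyone attached to the birth event. The `Connection` class and the `roles_in` helper are hypothetical names of mine; `from_id` and `to_id` stand in for `from` and `to` because `from` is a reserved word in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    id: int
    from_id: int      # the node playing the role
    description: str  # the role itself, e.g. "is-child-in"
    to_id: int        # the node the role is played in
    source: int       # the node that supports this claim


connections = [
    Connection(14, 1, "is-child-in", 6, 4),
    Connection(15, 2, "is-father-in", 6, 4),
    Connection(16, 3, "is-mother-in", 6, 4),
]


def roles_in(event_id, conns):
    """Map each role description to the node that plays it in the given event."""
    return {c.description: c.from_id for c in conns if c.to_id == event_id}


roles = roles_in(6, connections)
print(roles)  # {'is-child-in': 1, 'is-father-in': 2, 'is-mother-in': 3}
```

Because each connection carries its own source, the father link could be disputed without touching the mother link.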

We’ve seen a bunch of nodes that use citation 4 as a source; we’d also create a bunch that use citation 5 as a source, such as:

Thing:
    id: 17
    source: 5

Property:
    id: 18
    of: 17
    key: type
    value: person
    source: 5

Property:
    id: 19
    of: 17
    key: surname
    value: Ward
    source: 5

Property:
    id: 20
    of: 17
    key: givenname
    value: Charlotte
    source: 5

Then to get back to the original two-source record we’d record the idea that thing 1 and thing 17 are the same thing.

Match:
    id: 21
    same:
        - 1
        - 17

Some tools and data models (for example LifeLines, DeadEnds, and the now-defunct new.familysearch) do make match actions explicit in the data, but many do not.

Node 21, a match, is semantically the union of Thing 1 and Thing 17. It has two properties that assert it is a person (7 and 18) with two different sources, suggesting that we have two sources for the personness of this thing. It likewise has two sources asserting the surname of “Ward”, one from a cited source and one from indirect evidence represented by an inference node. And so on.
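The union can be sketched as a grouping of property claims by key and value across the matched things. The tuple layout and the `merged_view` helper below are my assumptions for illustration; the node IDs are the ones used in this post.

```python
from collections import defaultdict

# (id, of, key, value, source) for the Property nodes shown in this post.
properties = [
    (7, 1, "type", "person", 4),
    (9, 1, "name", "Charlotte Ward", 4),
    (12, 1, "surname", "Ward", 11),
    (13, 1, "givenname", "Charlotte", 11),
    (18, 17, "type", "person", 5),
    (19, 17, "surname", "Ward", 5),
    (20, 17, "givenname", "Charlotte", 5),
]

match_21 = {1, 17}  # the "same" list of Match node 21


def merged_view(same, props):
    """Group the sources of every claim made about any of the matched things."""
    view = defaultdict(list)
    for _id, of, key, value, source in props:
        if of in same:
            view[(key, value)].append(source)
    return dict(view)


view = merged_view(match_21, properties)
print(view[("type", "person")])   # [4, 5]: two citations
print(view[("surname", "Ward")])  # [11, 5]: an inference and a citation
```

Nothing is copied or merged destructively here, so rejecting the match later just means ignoring node 21.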

Polygenea

Although it may not be immediately evident, there are only two kinds of nodes missing from the set of nodes introduced above: note or comment nodes for containing arbitrary text that one researcher might wish to share with another, and belief sets for representing which nodes a particular researcher considers part of that researcher’s genealogy.

So what does this version of genealogical data give us?

There are probably more, but those come readily to mind.

Many of these benefits come because polygenea stores more information than do conclusion-oriented data models. Most data models simply do not have information regarding indirect inferences; only a few record match actions explicitly; and many don’t match sources to individual claims. Since that information is not in the original data, a fully-automated other-model-to-polygenea converter is not possible. Clever use of change logs might recover some of the information, but much of it was never entered into the computer before.



