Handling of duplicates by automated Family Tree profile creation mechanisms
We understand from the BYU Record Linking Lab (RLL) that their various census and NUMIDENT projects use the standard FamilySearch (FS) duplicate-flagging algorithm in their decision-making.
FamilySearch is clearly walking a tightrope in duplicate checking for its web interface and for interactive users of partner solutions: it has to minimise the number of duplicates that go unflagged (and thus, probably, unmerged) without encouraging inappropriate merges.
The key point here is that merging, while not easy, is far simpler and less disruptive than undoing a merge. I therefore assume that FS are obliged to tune the standard duplicate algorithm to minimise false positives, i.e. flagging a profile as a duplicate when it is not actually a match. I suspect that they mind false negatives less, i.e. failing to flag a duplicate that is in fact a match.
In my view BYU RLL’s profile creation automation, and other automated bulk insert mechanisms such as GEDCOM import, have quite different needs. It seems to me that these mechanisms really have to avoid false negatives, because avoiding the insertion of duplicates is fundamental both to the integrity of Family Tree (FT) and to the experience of its other users. Meanwhile, such automated mechanisms have the option of setting aside any profile that shows even a slight chance of a match, to be handled separately/manually, so false positives should not be a big concern.
So, I propose that FS provide a differently configured duplicate-flagging algorithm for use by automated mechanisms.
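To make the distinction concrete, here is a minimal Python sketch of what "differently configured" could mean. It assumes the duplicate algorithm yields a match score between 0 and 1 for each candidate pair, which I do not know to be the case; the names `DuplicatePolicy`, `INTERACTIVE` and `AUTOMATED`, and the threshold values, are purely hypothetical and are not taken from any FS system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DuplicatePolicy:
    """Hypothetical configuration: scores at or above the threshold are flagged."""
    name: str
    flag_threshold: float


# Illustrative values only. A high threshold flags few pairs (few false
# positives); a low threshold flags many pairs (few false negatives).
INTERACTIVE = DuplicatePolicy("interactive", 0.90)
AUTOMATED = DuplicatePolicy("automated", 0.40)


def is_flagged(score: float, policy: DuplicatePolicy) -> bool:
    """Return True if a match score counts as a possible duplicate."""
    return score >= policy.flag_threshold


def triage(candidate_scores: list[float], policy: DuplicatePolicy) -> str:
    """Decide what to do with a new profile given its match scores.

    Returns "manual_review" if any existing profile is flagged under the
    policy, otherwise "create".
    """
    if any(is_flagged(score, policy) for score in candidate_scores):
        return "manual_review"
    return "create"


# A borderline score of 0.55 passes unflagged under the interactive policy
# but is held back under the stricter automated policy.
print(triage([0.55], INTERACTIVE))
print(triage([0.55], AUTOMATED))
```

The point of the sketch is only that the same scoring can serve both audiences: the interactive policy keeps inappropriate merge suggestions rare, while the automated policy holds back anything doubtful for a person to look at.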
A pre-create ‘check for duplication’ API (or the option to ‘reject if duplicate’ on the Create Person API) is also needed in my view. At present automated mechanisms appear to have to first create the profile and then later check it for matches, which feels backwards to me.
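The check-before-create flow I have in mind can be sketched as follows. This is a toy in-memory model, not the FS API: `ToyTree`, `find_possible_duplicates`, the `reject_if_duplicate` flag and the `DuplicateRejected` exception are all hypothetical names, and the matching rule (same name ignoring case, birth year within two years) is a stand-in for whatever the real algorithm does.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Person:
    name: str
    birth_year: int


class DuplicateRejected(Exception):
    """Raised when a create is refused because possible duplicates exist."""

    def __init__(self, matches: list[Person]):
        super().__init__(f"{len(matches)} possible duplicate(s) found")
        self.matches = matches


@dataclass
class ToyTree:
    """Hypothetical stand-in for the tree; stores profiles in a list."""
    people: list[Person] = field(default_factory=list)

    def find_possible_duplicates(self, candidate: Person) -> list[Person]:
        # Toy rule: same name (case-insensitive), birth year within 2 years.
        return [
            p for p in self.people
            if p.name.casefold() == candidate.name.casefold()
            and abs(p.birth_year - candidate.birth_year) <= 2
        ]

    def create_person(self, candidate: Person,
                      reject_if_duplicate: bool = False) -> Person:
        # The duplicate check runs BEFORE anything is written to the tree.
        if reject_if_duplicate:
            matches = self.find_possible_duplicates(candidate)
            if matches:
                raise DuplicateRejected(matches)
        self.people.append(candidate)
        return candidate


def bulk_insert(tree: ToyTree, candidates: list[Person]):
    """Create clean candidates; hold the rest back for manual handling."""
    created, held = [], []
    for candidate in candidates:
        try:
            created.append(tree.create_person(candidate, reject_if_duplicate=True))
        except DuplicateRejected:
            held.append(candidate)
    return created, held


tree = ToyTree()
created, held = bulk_insert(tree, [
    Person("John Smith", 1850),
    Person("john smith", 1851),  # held: looks like the first profile
    Person("Mary Jones", 1900),
])
print(len(created), len(held))
```

With this ordering a doubtful profile never enters the tree at all, so there is nothing to merge or clean up afterwards; it simply lands in the held list for separate handling.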