Continuing problems with Possible Duplicates suggestions

LegacyUser · January 31, 2020

Paul said: FamilySearch is continuing to create an incredible amount of extra work for me with its current algorithm for Possible Duplicates. I commonly have to dismiss up to 20 suggestions when I am dealing with relatives named John WRIGHTSON. I am offered John WRIGHT individuals as possible dups regardless of where in the world they might have lived.

The example below is another one that shows how the algorithm is far too loose. Only the middle names of the individual and her father match and - well, yes, her mother's first name does match, too. But different surname AND part of the world. PLEASE save me hours and hours of unplanned work by cutting out these bizarre suggestions.

BTW - please don't suggest I ignore them - otherwise, I can be sure (from previous experience) another user WILL accept such an offering at face value and complete the merge!

LegacyUser · January 3, 2020

Juli said: All I can think of is that the "UT" in the proposed duplicate's birthplace was misinterpreted as "UK".... (Hmm. It was last modified by FS in 2012, meaning it comes straight from an extraction program, and whatever place standard has been applied -- if there is one -- hasn't ever been looked at by an actual human being.)

Unplanned work, indeed.

Does anyone at FS see the reasons for dismissal we put in? If we get sufficiently scathing, will they listen? (My record list length so far is five: wrong name, wrong century, wrong country, wrong husband, and wrong child.)

LegacyUser · January 3, 2020

Adrian Bruce said: The thing is that we have pretty definitive statements about the birth - one in the UK, the other in the USA. Since no-one can be born in two places, this should be a red-flag on the merge suggestion.

Of course, it might be that Coalville, UT has been confused with Coalville, Leicestershire, England - it shouldn't be, though, because "Coalville, Summit" goes straight to the Utah version in the Standard Place-names query.

LegacyUser · January 3, 2020

Christine said: Hey, your possible (not!) duplicate in Coalville, Summit, Utah is my relative! Thanks for not merging her.

LegacyUser · January 3, 2020

Damon Macdonald said: It appears that GQSW-JL4 and MSNG-PID are no longer possible duplicates because they don't score high enough. Usually this happens as more information is added. In this case it appears the name, and birth were added first, and then the mother and then the father - but it's hard to tell from the history. I wonder if the duplicate appeared when only some of the information was entered?
These cases can be hard to track down because they are fleeting - as the data changes they disappear. If you remember the order you went through adding Edith it might be useful - for example, did you add her as a child to her father? Was the mother already a spouse of the father? Do you remember if you just added the name first, and then added the birth information later?
Thanks for your help in making FamilySearch a great collaborative effort!

LegacyUser · January 3, 2020

Paul said: Damon

Looking at the change log I appear to have added birth, then mother and father - though I would usually add children to an existing father, especially when working from a census source.

I marked the possible duplicate suggestion as "Not a match" after my original post, so this should stop it coming up now.

I appreciate this is a very difficult area for the programmers to get right, but some tweaking should be able to at least cut out the "highly unlikely" suggestions (as in this case) or downright impossible, as in many others.

(Thank you for your continuing interest in the issue. I now see you have contributed to previous threads on the matter, which I have either instigated or to which I have added comments.)

LegacyUser · January 3, 2020

Adrian Bruce said: What concerns me, Paul, is that the suggested duplicate algorithm might consist of a hundred "calculations", each of which is reasonable and innocuous in itself. It's only when the software adds up a myriad of "Remotely possibles" that it gets a "Possible". How can anyone adjust such calculations if each is reasonable and innocuous? I don't know and I don't envy anyone trying. I suspect it's really, genuinely hard to understand the net result of those algorithms.

That's why I've been suggesting that at the end, another set of plausibility checks is done that are simple and don't rely on each other. Things like, "Does the proposed dupe die before the original is born?" or the reverse, "Is the proposed dupe around after the original has died?" One fundamental principle here is that the system believes what's on file. If not, then why are we wasting our time inputting data?

I will say that I might be on totally the wrong track about how the software works - if so, my suggestion is a waste of arranging electrons in files. Don't mind that. But I do worry about why I have never seen (so far as I remember) any comment on the proposal that I (and others) have made for plausibility / gatekeeper / audit checks, whatever you want to call them...
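The independent gatekeeper checks proposed above could look something like the following. This is only a minimal sketch: the `Person` record and the `plausibility_failures` function are hypothetical names made up for illustration, and nothing here reflects how FamilySearch actually works. Each check stands alone, trusts what is on file, and stays silent when data is missing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    # Hypothetical minimal record; a field is None when nothing is on file.
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    birth_country: Optional[str] = None


def plausibility_failures(a: Person, b: Person) -> List[str]:
    """Return reasons why a and b cannot be the same person.

    Every check is independent of the others and is skipped when
    either record lacks the data it needs.
    """
    reasons = []
    # One record ends before the other begins.
    if a.death_year is not None and b.birth_year is not None:
        if a.death_year < b.birth_year:
            reasons.append(f"{a.name} died before {b.name} was born")
    if b.death_year is not None and a.birth_year is not None:
        if b.death_year < a.birth_year:
            reasons.append(f"{b.name} died before {a.name} was born")
    # Nobody can be born in two countries.
    if a.birth_country and b.birth_country:
        if a.birth_country != b.birth_country:
            reasons.append("birth countries differ")
    return reasons


# Made-up records, for illustration only.
first = Person("A", birth_year=1800, death_year=1850, birth_country="England")
second = Person("B", birth_year=1860, birth_country="United States")
print(plausibility_failures(first, second))
```

A suggestion would only be shown when the returned list is empty, and the reasons for any rejection could be logged for the programmers to review.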

LegacyUser · January 3, 2020

Paul said: I believe there have been genuine attempts to get a better balance when producing these possible dupes (and general record hints). Maybe it is just a perception, but I'm sure things did improve a while back (yes, I know that's a vague statement), but I am now finding the situation very trying again.

LegacyUser · January 3, 2020

Adrian Bruce said: I'm also certain that work has been done because I believe the software guys when they say that they were doing stuff.

The fact that people still find issues suggests either that we are recalibrating our expectations rapidly or that things are genuinely so seriously and utterly complex on multiple levels, that no human can understand the whole picture. Hence my suggestion for independent plausibility checks at the end of the hinting algorithm. (And rejections in those plausibility checks could feed back to the programmers).

LegacyUser · January 12, 2020

Paul said: I know this sounds very selfish, but could someone at FamilySearch perform a tweak on the algorithm so individuals with the surname WRIGHT do not get presented as possible duplicates for those named WRIGHTSON?

The offering below is yet another example I have encountered whilst adding details to WRIGHTSON IDs. The birth details are already inputted yet I am still being offered an Ernest N WRIGHT, with a birth in Indiana!

I would say, from experience, that only about 2% of WRIGHTSON records are indexed as WRIGHT, yet I have to spend a good deal of time checking and dismissing these (often bizarre) suggestions. Thanks for the record hints - but certainly in cases like this - NO THANKS!
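To illustrate why WRIGHT keeps being offered for WRIGHTSON, here is a small sketch using Python's standard `difflib` module. It is an assumption that the matcher uses any string-similarity measure of this kind, and the helper names and the 0.75 threshold are hypothetical. The point is only that a generic similarity score rates the truncated surname nearly as highly as a genuine one-letter spelling variant.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def is_truncation(a: str, b: str) -> bool:
    """True when one name is a strict prefix of the other."""
    a, b = a.casefold(), b.casefold()
    return a != b and (a.startswith(b) or b.startswith(a))


def surnames_compatible(a: str, b: str, threshold: float = 0.75) -> bool:
    # Hypothetical rule: fuzzy similarity alone is not enough when one
    # surname is merely the other with the ending cut off.
    if is_truncation(a, b):
        return False
    return similarity(a, b) >= threshold


print(similarity("WRIGHT", "WRIGHTSON"))  # 0.8
print(similarity("Rachel", "Rachil"))     # about 0.83
print(surnames_compatible("WRIGHT", "WRIGHTSON"))  # False
```

A hard rejection like this would also hide the small share of records where WRIGHTSON really was indexed as WRIGHT, so a gentler variant might lower the score for truncated surnames rather than block the suggestion outright.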

LegacyUser · January 12, 2020

Christine said: I also have a question about possible duplicates NOT showing up! I did a general search by Britton, place Gloucestershire, England, no dates. I was hoping to find more Britton relatives who my ggg grandfather may have done temple work for in the Logan and Salt Lake Temples in the late 1800s. And I did find A LOT. Most of those early records had not been merged to later records. I could tell they were the same person because birth places were the same (when listed) and baptism and confirmation dates were the same. I then merged them when I was certain, but sometimes I couldn't tell and will have to go to the Family History Library when I am in Salt Lake again to look up those original records to see who they are and what relationship he listed to them. (another thread but I would love to have those relationships extracted!!!)

My concern: Anyone who found those early records with only baptism and confirmation listed could have done endowments and sealings for all those people.

My question: Why, when we enter Wrightson in England and get Wright in Indiana as a possible duplicate do we not get a possible duplicate for Rachel Britton KWNF-Q1K and Rachil Britton KWJD-LC4? I restored the merged Rachel Brittons so you all could see what I mean, although on these both records show all temple work completed, same dates. Many of the others had only baptism and confirmation done until I merged those into a record that had endowment completed.

LegacyUser · January 12, 2020

David Newton said: That's fairly easy to answer: it's not the same date, the name is too dissimilar, and the other person appears to be an orphaned entry. So in other words the combination of those three factors takes the match below the algorithm's trigger level.

One's 1866 and the other is 1868, hence it not being the same date. Don't know if I'm correct, but I strongly suspect that the matching algorithm ignores ordinances. After all the ordinances are stored in a different database and the matching algorithm has to be used by everyone, not just Mormons who can see the ordinances.

LegacyUser · January 12, 2020

David Newton said: I've just added KWJD-LC4 as a sibling of KWNF-Q1K and ping it popped up in the suggested duplicates as a 4 star match. Can't remember what the quality threshold for matches is, but it appears my supposition about it previously being below that threshold and not triggering the algorithm was correct.

LegacyUser · January 12, 2020

Christine said: I just merged my Rachel and Rachil Brittons again, since when I logged back in it showed me that Rachil Britton needed temple ordinances and I could reserve them. So if anyone looks, they are re-merged. I get why the ordinances might not be seen in FS for comparing duplicates, but it seems that the exact same birthplace, a name that is the same except for one letter, and a birth date that is the same except for a two-year difference should have triggered a possible duplicate.

Ah David, I didn't notice your response about adding Rachil as a sister. That is why she showed up as needing temple ordinances in my record. I would think, however, that an exact birth date and place might show as a possible duplicate, since every time I add an additional child to a family with the same name as an earlier child (a previous infant with that name having died) but a different birth date and year, it shows as 'relationship already exists' and I have to say it is not the same person before I can add them. Hmmmm.

LegacyUser · January 12, 2020

David Newton said: I suspect it was being a disconnected orphan record which meant it didn't trigger. Family structure does seem to be fairly heavily weighted by the algorithm, even if that family structure is sometimes just the person and their parents. Family structure definitely helps with record matches.

LegacyUser · January 23, 2020

Paul said: No apologies for continuing to highlight this problem, because of how time-consuming it is to dismiss the large number of incorrect suggestions I get. Can't ignore them as there's a good chance another user will be stupid enough to merge in cases like these.

I wonder how many "Johns" were born in Yorkshire on this date? Quite a few, I should have thought. So, once again, FamilySearch, PLEASE don't continue to offer ludicrous "possible duplicates" like this:

LegacyUser · January 23, 2020

Paul said: Wondered why I just had the one suggestion - looked again and found a John Bell, a John Fox and a John Brotherton also offered as poss dupes - just because christened on same day and in same county, albeit 60+ miles away from Barningham.

As fast as I dismiss one, another appears! Now I have a "John" (no surname, except for his father - named Sharrock) - not even the same county this time (but 90 miles away in Lancashire).

LegacyUser · January 23, 2020

Adrian Bruce said: :-(

Born 1817, both named John, both mothers named Elizabeth.

Everything else is wrong or missing (no mother's surname or father on one, for instance). Lack of data must *not* be treated as a match unless a lot more stuff matches.

I think that David Newton suggested that names on family relationships appear to be being prioritised over anything else. That was certainly the case when I had a suggestion for duplicates for my GG-GPs - wrong continent, both had been dead for several years but, hey, it's John and Ann Bruce!! Yes, I did raise it here. Similarity of names (not actually a match!) appears to be the only reason here.

John Turner in fact didn't have a standard christening place, so he could presumably have been born in the USA. I just standardized him.

LegacyUser · January 23, 2020

Adrian Bruce said: Duh - yes it was David's suggestion - just up the page!

LegacyUser · January 23, 2020

Juli said: I think part of the problem is that there's machine learning involved with bad input: the stupid people who take the computer at its word teach the algorithm Bad Things. (You would think people would realize that "same name, must be same person" just doesn't fly, but there are unbelievable numbers of people who believe it unquestioningly.)

LegacyUser · January 23, 2020

Justin Masters said: I was going to suggest that. From my anecdotal experience, it seems that when a bad source is attached, it takes the information in that source into account in its possible duplicate calculation.

LegacyUser · January 23, 2020

David Newton said: It's the names plus the christening date that appear to be sufficient to trigger the matching algorithm here. Both MNJQ-SLN and MRHG-G4N were christened on 30th November 1817.

I know in the past I've had instances where the names were the same, the place was the same and the month was the same but the algorithm wasn't triggered. Why? One was 3rd of the month and one was 8th of the month (obviously one of those at least being a transcription error). Heck, I've even had that happen when the parents have been merged into each other and the two duplicates are now apparent siblings and as primed as possible to be merged with each other!

LegacyUser · January 23, 2020

Adrian Bruce said: Ah - missed that the dates were exactly the same.

Still - if the surnames are different, really different, and it's a male, shouldn't that have stopped the proposal? One wonders if there are any negative factors in the potential dupe algorithm?? (i.e. are there any aspects that reduce the score for matching?) I find it difficult to see how else this got through...
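The question about negative factors, together with the earlier point that missing data must not count as a match, can be made concrete with a small scoring sketch. Everything here is an assumption for illustration: the field names, the weights and the `compare` function are hypothetical, the two records are made up, and the real algorithm is not public in this thread. Agreement adds to the score, disagreement subtracts, and a missing value does neither.

```python
from typing import Dict, Optional

# Hypothetical weights, chosen only to illustrate the idea.
WEIGHTS = {
    "given_name": 2.0,
    "surname": 3.0,
    "christening_date": 2.0,
    "mother_given_name": 1.0,
}

Record = Dict[str, Optional[str]]


def compare(a: Record, b: Record, penalise_mismatch: bool = True) -> float:
    """Score two records field by field.

    A field missing from either record is neutral. A mismatch subtracts
    the field's weight only when penalise_mismatch is True.
    """
    score = 0.0
    for field, weight in WEIGHTS.items():
        value_a, value_b = a.get(field), b.get(field)
        if value_a is None or value_b is None:
            continue  # lack of data is neither a match nor a mismatch
        if value_a == value_b:
            score += weight
        elif penalise_mismatch:
            score -= weight
    return score


# Made-up records: same given name and christening date, different surname.
one = {"given_name": "John", "surname": "Wrightson",
       "christening_date": "1817-11-30", "mother_given_name": "Elizabeth"}
two = {"given_name": "John", "surname": "Bell",
       "christening_date": "1817-11-30", "mother_given_name": None}

print(compare(one, two, penalise_mismatch=False))  # 4.0
print(compare(one, two))                           # 1.0
```

With positive-only scoring the pair reaches 4.0 on name and date alone, while a surname penalty pulls the same pair down to 1.0, which is the kind of difference that would decide whether a suggestion crosses a threshold.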

LegacyUser · January 23, 2020

David Newton said: What should be an overwhelming halt to these "matches"? Distance between the two events. In 1817 it was pretty much physically impossible to travel between Barningham and Leeds in one day. Google Maps says 19 to 20 hours to walk it.

That should indeed be enough to stop the matching algorithm in its tracks. Distance between events matters more than anything else.
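A distance check of this kind can be sketched with the haversine great-circle formula. The coordinates below are approximate values assumed for Barningham and Leeds, and the `same_day_conflict` helper and its 40 km daily travel limit are hypothetical choices for illustration, not anything FamilySearch is known to use.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def same_day_conflict(place_a, place_b, max_km_per_day: float = 40.0) -> bool:
    """True when two same-day events are further apart than one day's travel.

    max_km_per_day is an assumed limit for travel on foot or by horse.
    """
    return great_circle_km(*place_a, *place_b) > max_km_per_day


# Approximate (latitude, longitude) pairs, assumed for this example.
BARNINGHAM = (54.49, -1.87)
LEEDS = (53.80, -1.55)

print(round(great_circle_km(*BARNINGHAM, *LEEDS)))  # straight-line km
print(same_day_conflict(BARNINGHAM, LEEDS))         # True
```

The straight-line figure is shorter than any road route, so the check errs on the lenient side. It also trusts the places on file, so it is only as good as the standardized locations it is given.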

LegacyUser · January 23, 2020

Ryan Torchia said: Given how badly Italian surnames were botched in US records, and the level of uncertainty over birthplace, I'd rather have looser Possible Duplicate and Suggested Hint filters. It's still the responsibility of the editor to verify that they're really the same before blindly accepting the suggestion.

Even alerts for ones that aren't duplicates can be helpful. I've run across a lot of records that include details from multiple people with the same or similar names. When they're suggested as duplicates to a correct record, it gives me the opportunity to fix the bad one so that it doesn't get mistaken as a duplicate.

I just wish the Possible Duplicate warning was visible in the tree view instead of only when viewing a profile directly.

LegacyUser · January 23, 2020

Ryan Torchia said: That assumes the locations are accurate and trustworthy. I've found often they're not. There are a lot of people pages whose only source for birthplace is a census record from decades later, which is often wrong, probably because they didn't fill it out personally, or moved when they were too young to remember. We're talking different states or countries, not a matter of miles. Bad or conflicting sources, transcription errors, data that was added from a source that was for another person and never got erased after it was detached, cruft from bad merges -- it all happens. I'd rather be alerted of it than have the bad data in the profile prevent us from discovering the good data.

LegacyUser · January 24, 2020

Adrian Bruce said: "still the responsibility of the editor to verify that they're really the same before blindly accepting the suggestion."
Perfectly correct. But it is clear that many simply don't do any sensible verification at all and simply accept the proposed merge. Indeed, one of the regular contributors here once related the case of a training course that he'd been on where the instructor actually said that the quality of merge suggestions was so good that there was no need to check. There followed a serious argument I believe.

One possible way forward would be to explicitly brand those merge suggestions with a lower score as "These are unlikely but you may want to review them just in case"

LegacyUser · January 24, 2020

Paul said: Ryan

You obviously do not have the same negative experience as I do with these possible duplicates. At present, it takes too much of my time in Family Tree dismissing some of these crazy suggestions. In the last case I illustrated a christening date had already been inputted, yet I still received suggestions for individuals with totally different surnames who were christened on the same day and even many miles away from the location where my relative's christening took place.

But that is just part of the problem. Believe it or not, there are a number of users who will accept these suggestions and - without any checks on how unlikely or downright impossible a match they might be - just go ahead and merge the two individuals.

If you had experienced the problems I had - sometimes spending days trying to sort out a particularly bad situation, whereby a user merged every "James Young" possible duplicate put on a person page by FamilySearch - I'm sure you would quickly lose enthusiasm for the current algorithm.

The current situation might help in eliminating many genuine duplicates, but it is also making individuals disappear - persons who were born, married and died in one area, yet become "non-existent" by being merged with a totally unrelated person who lived elsewhere.

LegacyUser · January 24, 2020

David Newton said: For an illustration of the fallout of this sort of stupid activity see profile LDY1-5JF. As I write this, so far I've detached 5 sources from that American who one user apparently thought lived in America, Norfolk, Staffordshire, Middlesex, Monmouthshire, Berkshire, Cheshire and Hampshire at pretty much the same time. Still at least half a dozen more junk sources to detach.

I came to deal with this through a Hampshire source attachment. Apparently said user thought that the wife named Sarah or the second wife named Harriet mysteriously transformed into Dinah for the Hampshire record.

No legitimate reason for these sources to be attached to the profile. Typical mess caused by a user who can at best be characterised as oblivious and naive and less charitably can (and probably legitimately, unfortunately) simply be thought of as stupid. No regard for geography. Completely oblivious to the unlikely travel pattern this would entail. Completely implausible mother name mismatches in some cases (see above for example). Not realising that there's a decent chance that the dates mean at least one pregnancy of far less than 9 months.

I do not have much patience at all with a user who makes a pattern of bad edits so egregious as to show a complete disregard for anything approaching the truth. The really bad part is this example is nowhere near the worst to have been referred to on this forum!

LegacyUser · January 24, 2020

Justin Masters said: One other thing to consider...

Someone (or some people) may be purposely making such edits. It wouldn't be the first time that people who are against the Church of Jesus Christ of Latter-day Saints have engaged in such activity.

The emphasis on sources since then has been helpful, but it wouldn't stop such activity.

LegacyUser · January 24, 2020

David Newton said: Oh and apparently they lived in Somerset as well.

Active · Last Updated February 21