Data conflict: "This person had xx children. Most people have..."
So what if "most people have" up to 12 children? Why should that give a DATA CONFLICT message? Have had this 2 days in a row now. Yesterday it was husband's great-grandparents (French-Canadian Catholics). Today it was my great-grandfather who had several children by each of 2 wives (only 8 survived). I can see flagging an unusual occurrence, but it's not something that's actually conflicting data.
Comments
-
@JanetStory, thank you for your feedback. This is the place where possible errors and inconsistencies are put up for evaluation. Could you suggest a better wording? I can definitely pass this along.
0 -
I think the idea is that the conflict is between the data in the profile and population statistics. I have no idea what kind of cutoff FamilySearch uses for the algorithm, but a common one is to treat any value more than two standard deviations above the mean for whatever is being measured as not abnormal, but far enough outside the norm to warrant a second look.
It can be hard to determine what the cutoff should be. Too low and everything gets flagged (too many false positives). Too high and nothing gets flagged (too many false negatives). It's a tricky balance.
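For anyone curious what that kind of check looks like mechanically, here is a minimal sketch. To be clear, this is not FamilySearch's actual algorithm; the function name, cutoff, and population numbers are all invented for illustration:

```python
# Hypothetical sketch of a standard-deviation-based flag.
# The mean, standard deviation, and cutoff are illustrative placeholders,
# not FamilySearch's real statistics or threshold.

def flag_unusual_value(value, population_mean, population_std_dev, cutoff=2.0):
    """Return True if the value is more than `cutoff` standard deviations above the mean."""
    if population_std_dev <= 0:
        return False  # no spread in the reference data, nothing meaningful to compare
    z_score = (value - population_mean) / population_std_dev
    return z_score > cutoff

# Example with made-up numbers: 12 children vs. a population averaging 4 with a std dev of 3.
print(flag_unusual_value(12, population_mean=4.0, population_std_dev=3.0))  # True (z is about 2.7)
```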
2 -
I'm not quarreling with the idea of flagging an unusual number of children, but I don't like the "most people" wording. It's too — don't know what you'd call it, maybe normative? Also, maybe refine the algorithm so that it doesn't flag when children are spread over multiple marriages or where there are twin, triplet etc. births — might be more kids, but only one pregnancy.
0 -
As just another user, who tends to like to get into rambling philosophical blathering, I've been thinking about alternatives to "most" and haven't been able to come up with anything better. Some options that came to mind were:
- The majority of people….
- It was uncommon for people….
- It was an exception for people….
- Few people have over….
I couldn't come up with anything clearer or more concise. If you have any ideas, I'm sure the programmers would be open to reading them.
However, I will also say that I don't see why anyone would put a moral value on a word when it is being used as a strictly mathematical term, even when the programmers are stretching its mathematical meaning. "Most" just means "the majority" or "over 50%." Here the programmers are adjusting the meaning to be "over a certain number of standard deviations above the norm of the applicable social statistics at the time."
Maybe to see how the term is actually being used and to get comfortable with its use, we could look at a really detailed profile evaluator that looks at everything about a person and generates similar statements:
- This person lived to be 105. Most people only lived to be 85.
- This person ran a marathon. Most people never ran more than one mile.
- This person graduated from high school. Most people only completed sixth grade.
- This person had cancer five times. Most people only get cancer once.
- This person had 15 residences. Most people only had 4.
I am all in favor of continuing to refine the algorithm, but there will always be outliers. Otherwise the whole exercise is useless. And such a routine will always make use of the statistical fact that all of nature, including human behavior, can be standardized, and when it is, it will most often produce a standard bell curve. (Number of children is a bit of an exception, because it is an integer that cannot be less than zero.) The decision is then where to draw a useful line:
At 1, 2, 3, or 4 standard deviations?
Too low is useless. "This person had 3 children. Most people have fewer than 2" (in the USA, according to 2023 statistics) gives "most" the meaning "over 50%."
Too high is useless. "This person had 50 children. Most people have fewer than 49" gives "most" the meaning "almost never."
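To make the trade-off concrete, here is a toy sketch with invented child counts (not FamilySearch data or its real threshold), showing how raising the cutoff from 1 to 4 standard deviations shrinks the number of flagged profiles:

```python
# Toy illustration of how the choice of cutoff changes the number of flags.
# The child counts below are invented for demonstration only.
child_counts = [0, 1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12]

mean = sum(child_counts) / len(child_counts)
variance = sum((x - mean) ** 2 for x in child_counts) / len(child_counts)
std_dev = variance ** 0.5

for cutoff in (1, 2, 3, 4):
    flagged = [x for x in child_counts if x > mean + cutoff * std_dev]
    print(f"cutoff = {cutoff} std devs: {len(flagged)} profile(s) flagged -> {flagged}")
```

With this made-up sample, a 1-sigma cutoff flags a couple of large families, 2 sigma flags only the largest, and 3 or 4 sigma flags nothing, which is exactly the too-many versus too-few balance described above.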
Another thing I have noticed, and I'm not sure there is an easy fix for it, is that people seem to be taking the term "Data Conflict" to mean that something is wrong. All it means is that two pieces of data don't appear to agree at first glance. It doesn't mean that either of them is wrong. Both can easily be right, but the apparent conflict needs to be explained. But what other term could we use?
- Data Disagreement
- Unusual Occurrence
- Atypical Situation
- Possible Error
- Explain This
- Double Check Your Work In Light of The Following Exception to The General Practices of This Community At This Time.
Again, if you can think of a better term that is short, clear, and translatable into 39 languages, list it here and explain why it is better; I'm sure the programmers will appreciate your interest in improving the routine.
1 -
I think the problem is not with any particular message's wording, but with the entire presentational framework: "score" means either judgement or competition. Both of these things automatically put most people on the defensive, because they're protecting themselves from the possibility of being judged a failure or of losing the competition.
The problem is at least partly inherent: if you gamify a system, people will perceive it as a game.
I think the best solution would be to throw away all of the current framework and start over: how can the system best present its suggestions for further investigation? I think adding them to Research Suggestions would be best: we've all been well-trained to take those with a grain of salt, so it wouldn't raise people's hackles the way the schoolmarm does.
The other part of the scoring system, the impedance of edits, I would throw away entirely. (Burn with fire, as the modern meme goes.)
0 -
Maybe call it just: Data Consistency Check. No quality inference. No score. Just a polite: "You might want to double-check this."
I am seeing that the DQS does point out things that can be hard to find. I had a flag yesterday that one child in the family was born 88 miles away from all the rest. It took some searching, but I did find the incorrectly standardized place (Nettland, Kvinnherad, Hordaland somehow got linked to the standard Nettland, Sogn og Fjordane, Norway) and fixed it. It is also leading me to add alternate names that really should be on the profile but that I just hadn't bothered with, in order to clear the "last name is different from" flags. The biggest short-term headache is all the tagging I didn't bother doing.
As far as editing impedance goes, people have been begging for years for a way to make other people leave correct data alone. I think the DQS requiring a quick second click on "edit" is at least a step in the right direction, but it probably doesn't go far enough. As a potential change for now, what would people think of the routine comparing the last user to edit an item with the currently signed-in user and, if they are the same, suppressing that "check the sources before editing" alert? I think I'll give that its own post.
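A minimal sketch of what that suppression logic could look like (the function and the user-ID values are hypothetical, not anything from FamilySearch's actual code):

```python
# Hypothetical sketch of the suggestion above: skip the "check the sources
# before editing" alert when the person editing is the same person who made
# the last change to that item. Names and IDs are invented for illustration.

def should_show_edit_alert(item_last_edited_by: str, current_user_id: str) -> bool:
    """Show the alert only when someone other than the last editor is editing."""
    return item_last_edited_by != current_user_id

# Example usage with made-up user IDs.
print(should_show_edit_alert("user_123", "user_123"))  # False: same user, suppress the alert
print(should_show_edit_alert("user_123", "user_456"))  # True: different user, show the alert
```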
1