Report: a proportion of records in a film was assigned bad dates. Prevention and fixes...
[Edited on 10/03.]
Film 005263680 contains 2̶2̶4̶1̶ 1494 christening records from a Colombia (Valle del Cauca) parish, 1935 to 1938.
[Added: 1494 = 2241 - 747. I realized that the sequential numbers for baptism events in that volume start at 748. This also means that most recorded events do have assigned dates; there are no large portions of missing records.]
If we run a couple of reports on the range of birth dates, we get
1) Perhaps these numbers should not add up to more than 1494, but this is a relatively minor concern.
2) Why would a volume of baptisms from 1935 to 1938 contain births from before 1920? (Catholic baptism of grown-ups was very rare at that time and place.) And why would it contain births from after 1939?
For some errors, perhaps the computer recognizer, at some point, treated the sequence number written by the parish as the event date.
[Added: for instance, these records of people supposedly born in the 1600s: https://www.familysearch.org/search/record/results?q.filmNumber=5263680&c.birthLikeDate1=on&f.birthLikeDate0=1600 ]
I do not know how to retrieve the records that have birthyear="", which would help to track down the cause.
Fortunately, it should be easy to
- prevent this from happening, by restricting the OCR to a very specific area of the image,
- run a report at the end of each OCR pass, to see if dates and places make sense,
- correct the mistake by retrieving the affected records and rewriting them with a regular expression.
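
To make the second point concrete, here is a rough sketch of what such an end-of-run report could look like. It is only an illustration: the record layout, the field names and the 15-year allowance (which gives the 1920 cutoff mentioned above) are my own assumptions, not anything taken from the FamilySearch software.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout; these field names are assumptions,
# not the actual FamilySearch schema.
@dataclass
class Record:
    sequence_number: int        # number the parish wrote next to the entry
    birth_year: Optional[int]   # year produced by the recognizer, None if blank

def audit_birth_years(records, event_start, event_end, max_age_at_baptism=15):
    """Sort records into buckets so that implausible birth years stand out."""
    earliest = event_start - max_age_at_baptism
    report = {"ok": [], "missing": [], "matches_sequence": [], "out_of_range": []}
    for rec in records:
        if rec.birth_year is None:
            report["missing"].append(rec)
        elif earliest <= rec.birth_year <= event_end:
            report["ok"].append(rec)
        elif rec.birth_year == rec.sequence_number:
            # Implausible year that equals the parish sequence number:
            # the recognizer probably read the wrong field.
            report["matches_sequence"].append(rec)
        else:
            report["out_of_range"].append(rec)
    return report

# Made-up sample records for a volume covering 1935 to 1938.
sample = [
    Record(748, 1936),
    Record(1602, 1602),   # year equals the sequence number
    Record(900, 1975),    # born after the volume ends
    Record(901, None),    # no birth year at all
]
report = audit_birth_years(sample, event_start=1935, event_end=1938)
for bucket, items in report.items():
    print(bucket, len(items))
```

The records in the "matches_sequence" and "out_of_range" buckets would be the candidates for review or for an automatic rewrite. A birth year that falls inside the plausible range is left alone even if it happens to equal the sequence number, since that can be a coincidence.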
I see similar problems in other films, like https://www.familysearch.org/search/film/004442413?cat=2015238
In this example, the OCR finds the word "Castrillón", but instead of treating it as a surname, the software interprets it as the birth place and associates it with a town in Spain. The software is overreaching: it reads where it should not and makes unfounded associations.
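
A similar check could catch this kind of place error. The sketch below flags any record whose birth place text is identical to one of the surnames found in the same record. Again, the dictionary keys and the sample names are invented for illustration, and I am assuming the indexed data keeps the raw place text next to the names.

```python
def flag_surname_as_place(records):
    """Return records whose birth place text matches one of their own surnames."""
    suspects = []
    for rec in records:
        place = rec.get("birth_place_text")
        if not place:
            continue
        # Compare case-insensitively so "CASTRILLÓN" and "Castrillón" match.
        surnames = {s.casefold() for s in rec.get("surnames", [])}
        if place.casefold() in surnames:
            suspects.append(rec)
    return suspects

# Made-up records; only the first one repeats a surname as the place.
records = [
    {"surnames": ["Castrillón", "Gómez"], "birth_place_text": "Castrillón"},
    {"surnames": ["Gómez"], "birth_place_text": "Cali"},
    {"surnames": ["Pérez"], "birth_place_text": None},
]
suspects = flag_surname_as_place(records)
print(len(suspects), "record(s) need review")
```

This does not prove that the place is wrong (some people are born in a town that shares their surname), so the flagged records are only candidates for a human check.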