Could FamilySearch do anything to prevent incorrectly formatted dates appearing in indexed records?
Not having any experience of programming, I will take the risk of making myself look stupid here, but I raise this question following the post at https://community.familysearch.org/en/discussion/153153/why-are-thousands-of-danish-1921-census-dates-of-birth-indexed-wrongly#latest.
As most of us know, the problem referred to is not that unusual in itself (see example below), albeit it does not usually affect every record in a whole collection. So, my simple question relates to the complexity of running a check for such dates appearing in a collection before FamilySearch puts them online. Is it so difficult to write a piece of code that would expose such errors, thus allowing for corrective work before a collection goes online and researchers are faced with the resultant problems?
The date range I have inputted shows at least 908 errors of this kind in this collection.
Perhaps anyone with programming experience could advise of the feasibility of any action that could prevent dates appearing in this form in future.
Answers
-
What do the actual records look like? I have seen a similar appearance in records I work in and for those it has always been the same problem which is not incorrect dates and not incorrect indexing. It is strictly incorrect auto-interpretation of what the indexed record says.
In the ones I see, the problem arises because the date was indexed with just the month and day like this:
Any rational person seeing this would understand that this clearly means she died 6 March 1890. It takes a computer program to turn this into:
This is another example of on-the-fly auto-standardization at its worst.
Or when you go to the full record, do you still have the incorrect years and this is a different issue?
4 -
Just about to close down for the evening when I saw your response. Will return tomorrow, but meanwhile provide the link to the example I was illustrating.
A quick look reveals some appear as in your example (day & month only indexed) and others with two digits only for the year in question. There are probably different reasons why the dates end up looking like this, but I do hope one factor is not the one-size-fits-all project instruction to only index what you see!
@LY4 makes what appears to me to be a sensible suggestion, at https://community.familysearch.org/en/discussion/153153/why-are-thousands-of-danish-1921-census-dates-of-birth-indexed-wrongly#latest
1 -
" ... It takes a computer program to turn this into: ..." Reminds me of
"To err is human. To really foul things up requires a computer". 😉
The problems are multiple. Firstly the suggestions for the 1921 Danish issue are fine except that it's a point solution for a very specific collection where we can put bounds on the dates. In other words, it's a census and virtually no-one will be older than 100y. The Cheshire Parish Registers give us a much wider set of bounds - from 1538 onwards according to FMP and the worst-case scenario is a burial of a centenarian in 1538.
The Church Books of @Gordon Collett have a much tighter range of 1815 onwards.
So bound checking based on the collection is liable to be ... complex if it's done too far down the line, especially if the meta-data is no longer available because I guess that contains the date range of the collection.
Personally, I believe some sort of windowing is appropriate - indexer says Death Date is 06 March and Event Date is a Burial on 2 Apr 1890. Conclusion one is that the ddmm bit is 6 March. (Index what you see. Then believe the indexer). Conclusion two is that if it's a Burial Event and you're looking at a Death Date with only a ddmm completed, then the Death Date must be within 12m before the Burial. It's getting tricky... And it doesn't deal with delayed burials - though they will record the year, surely.
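To make the second conclusion concrete, here is a minimal sketch in Python. It assumes the indexer recorded only a day and month for the death and a full date for the burial, and it picks the latest matching date on or before the burial. The function name `infer_death_date` is hypothetical, not anything FamilySearch is known to use, and delayed burials of more than a year are not handled.

```python
from datetime import date
from typing import Optional


def infer_death_date(day: int, month: int, burial: date) -> Optional[date]:
    """Return the latest date with this day and month on or before the burial.

    Hypothetical helper: tries the burial year first, then the year before.
    Returns None when neither year gives a usable date, so a person can review it.
    """
    for year in (burial.year, burial.year - 1):
        try:
            candidate = date(year, month, day)
        except ValueError:
            # For example 29 February in a year that is not a leap year.
            continue
        if candidate <= burial:
            return candidate
    return None


# Death indexed as "06 March", burial on 2 Apr 1890.
print(infer_death_date(6, 3, date(1890, 4, 2)))
# Death in late December, burial early the following January.
print(infer_death_date(28, 12, date(1891, 1, 5)))
```

The original day and month would still be kept exactly as indexed; the inferred date is an extra field, which fits the "index what you see, then believe the indexer" idea.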
Maybe we need to get the indexer to index not just what they see but also what they interpret that to be.
But then what about data already on file?
Getting too late now...
1 -
Sorry Paul - re running a check before the data goes online. Yes - but who deals with any detected problems? I think it's too late by then.
1 -
Thank you for your responses, Gordon and Adrian.
From the evidence, it appears the problem does (at least for the most part) relate to dates that have been indexed without a year and/or where the year has been indexed with only the last two digits.
As strongly expressed in comments on other threads, I do believe that the “index exactly what you see” project instruction has to allow for some flexibility – especially when it is hindering researchers from finding records that indeed are to be found in FamilySearch, but not by using search criteria (primarily a year range) that one would expect to produce relevant results.
I have just spent a couple of days in checking out records of the 1939 National Register collection in Find My Past. I find at least half the records / pages I have downloaded have been written using just two digits for the year, yet I have had no problem in finding records for, say, a relative born in 1900, even if the event is shown as “00”. I appreciate (as with the 1921 collection discussed on the other thread, and similar collections) this should be an easier matter to “resolve” than (as Adrian illustrates) a collection covering several centuries. However, Find My Past and other websites appear to have found a way around this issue: whereby it is far easier to locate records in their databases than when using FamilySearch.
The two main ways of tackling the problem appear to lie, respectively, in the indexing and post-indexing processes. With regards to indexing, the rigid rule of only indexing exactly what is written must surely be relaxed for cases where there is no ambiguity whatsoever about the actual date. Failing that, it would probably be better not to index a date without a (full) year at all. And, even if (as Adrian implies) the programming (post indexing) that would “reset” a date (say from 12 July 00 to 12 July 1900) is not necessarily too easy to implement, surely if the programmers on other websites have found a way around this, it must be workable within FamilySearch for at least some of its indexed collections?
My argument concerning this issue connects with that I have previously made in relation to indexed material in general: the end product must act as a helpful finding aid. Sadly, even many indexers do not see this to be of overriding importance. Some time ago, a response found in the Indexing category of Community was made quite firmly to the effect of indexers not having to take into consideration the later usefulness of their work, as it was essential to comply with project instructions, regardless of any later outcome.
To me, this issue provides a perfect example of an essential need for a change in attitude / practice when it comes to FamilySearch indexing projects. It just can’t be right if records continue to be omitted from a Results page when the researcher has entered a perfectly acceptable date range, from which “the system” is failing to pick out relevant, available records.
Urgently addressing the issue with the 1921 Danish census collection will indicate FamilySearch’s interest in at least taking some steps to deal with the current, wide-ranging problem, which is hindering the research of even experienced FS users, let alone those who have far less knowledge of search strategies.
0 -
I have to make a partial apology here because, as illustrated, an Exact search for a 1925 death of this individual has produced a result for a record indexed as being buried in 0025. Interestingly, the current algorithm allows for the record to be displayed if an 1825-1825 date range is inputted - but not for a 1725-1725 input. I believe 2025 produced no results as that is in the future, as a similar test for someone buried in 2015 did show a result whether 1815-1815, 1915-1915 or 2015-2015 was entered.
However, one would still have to know this was a matching record for John Robbin, because nowhere in the record is there an indication of the century in which the event occurred:
0 -
Regarding indexing instructions, my example came from an indexing project done for the Norwegian archives by FamilySearch, My Heritage, and Ancestry and those three in return were able to post the resulting index. I would assume the instructions came from the archive and the requirements were to fit the requirements of the archive's database. There the record looks like this:
Here neither the death date nor the burial date contain the year. This actually makes sense because on the record, the year is not recorded against the person, only at the top of the page:
It looks like FamilySearch tacked on the year to the burial day and month only to avoid errors when a person died in one year but was not buried until the next.
But then they should have fixed the search result display routine to be able to cope with the lack of the year.
1 -
Ah, but I'm wrong in making an assumption about the year in which the event occurred! The example below clearly illustrates John Hoole was buried on 05 April 1840, but (without originally seeing the second record) I had wrongly assumed (guessed) he was buried in 1805, instead of 1840! Likewise, the February 0025 in my earlier example I had guessed represented February 1825, but was actually for a 25 February 1687 event! Without there being two records for each example in FamilySearch I would not have been able to get the years (even centuries) right.
Looking back to my comments in my previous post sets my head spinning. I have obviously completely failed to work out how the search (range of dates) algorithm works!
0 -
It's the same old story. You can't trust anything without an image of the actual record. This is even more true with all the recent auto-standardization problems in the indexes and the on-the-fly display problems in the search results lists.
2 -
I think the reason searches by date still work on other sites is that they haven't tried to simplify by overcomplicating, the way FS has. On other sites, searching for an event in "1902" involves matching the text string "1902" against the text strings in the database, using an algorithm that tells the computer that the string "02" should be considered a match to "1902", "1802", "2002", and "February" (among others). It's only in FS's thoroughly-corrupted database that a bot's bad decisions render hundreds of thousands of records unfindable by normal means.
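As a rough illustration of the kind of tolerant matching described here, the Python sketch below treats an indexed year under 100 as "last two digits only" and compares it with the last two digits of the year searched for. This is only an assumption about how such matching could work, not a description of any site's real search code; `year_matches` is a made-up name, and the "02 might mean February" case is left out.

```python
def year_matches(search_year: int, indexed: str) -> bool:
    """Return True if an indexed year string could refer to search_year.

    Hypothetical rule: an indexed value below 100 (such as "02" or "0025")
    is compared on its last two digits only; anything else must match exactly.
    """
    text = indexed.strip()
    if not text.isdecimal():
        # Not a plain number (for example "Feb"), so this sketch gives up.
        return False
    value = int(text)
    if value < 100:
        return value == search_year % 100
    return value == search_year


for indexed in ["02", "1902", "1802", "0025"]:
    print(indexed, year_matches(1902, indexed))
```

The cost of matching this loosely is more false positives, since "02" will be offered for 1802, 1902 and 2002 alike, but the record at least stays findable.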
2 -
@Gordon Collett mentioned
"... Here neither the death date nor the burial date contain the year ... "
Yes, things can work like that providing, if necessary, the year is concatenated with the ddmm date. The key is surely to concatenate when it's needed and make everything consistent. Here, in FS, it's... Well...
@Paul W mentioned the 1939 Register on FMP. I've actually always assumed that the year of birth was always 2-digits. It works, either way. In fairness, the 1939 software can be easily coded. My grandpa is on the image with a birth-date in column 7 of "16th Dec" and a birth year in the next column of 95. He appears in the "transcription" (extract actually) with a birth-date of "16 Dec 1895"
Because the 1939 Register is like a (UK) census and is conducted on one day, and lists living people, the 2-digit year can be put through a "windowing" algorithm (I'm using my Y2K terminology here) - if the 2-digit year is 39 or less, stick "19" on the front of it; if the 2-digit year is 40 or higher, stick "18" on the front of it. (I should tweak that algorithm to cope with people reaching 100 in 1939 after the Register was taken, but let it pass).
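The windowing rule just described is short enough to write out. This is a sketch in Python with a made-up function name, `expand_1939_year`, and it deliberately leaves out the tweak for people who turned 100 after the Register was taken.

```python
def expand_1939_year(two_digit: int) -> int:
    """Expand a two-digit birth year from the 1939 Register to four digits.

    Rule as stated in the post: 39 or less gets "19" in front,
    40 or higher gets "18" in front.
    """
    if not 0 <= two_digit <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    century = 1900 if two_digit <= 39 else 1800
    return century + two_digit


print(expand_1939_year(95))  # the "95" from the grandpa example
print(expand_1939_year(0))   # a year written as "00"
```

The pivot of 39 only works because everyone listed was alive on a single known day; it is a property of this one collection, not a general rule.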
Where it gets, in fairness, more complicated is with a collection based on a long date - like a parish register where dates might range from the 1500s to the 1900s.
0 -
@Gordon Collett - I've no idea what "Bosted" means in Norwegian but in the local dialect of Staffordshire, it means "broken" (a variant of "bust" I guess). Hope it's not too inappropriate for a burial record!
0 -
@Paul W - re your original point about whether FS could do anything to detect those...
It should surely be possible to detect such years - I find it difficult to believe that there are any genealogically relevant records on FS that need genuine years in the range 00-99.
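For what it's worth, a detection pass of this kind could be very small. The Python sketch below assumes each indexed date is available as a plain text string with the year, if any, at the end; the names `date_problem` and `flag_records` are hypothetical, and a real collection would need to cope with more date formats than this.

```python
import re

# A run of digits at the very end of the text is taken to be the year.
TRAILING_NUMBER = re.compile(r"(\d+)\s*$")


def date_problem(indexed_date: str):
    """Return a short reason if the date looks suspicious, otherwise None."""
    match = TRAILING_NUMBER.search(indexed_date)
    if match is None:
        return "missing year"
    if int(match.group(1)) < 100:
        return "year below 100"
    return None


def flag_records(indexed_dates):
    """Collect (date text, reason) pairs for every suspicious date."""
    flagged = []
    for text in indexed_dates:
        problem = date_problem(text)
        if problem is not None:
            flagged.append((text, problem))
    return flagged


sample = ["6 March 1890", "March 0006", "6 March", "12 July 00"]
for text, problem in flag_records(sample):
    print(f"{text!r}: {problem}")
```

This only finds the problem dates; it says nothing about what the right year is, which is the harder question.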
But I am still unclear about what point to make such a check. I firmly believe that the original indexer - who is the only one to see the whole context - is best placed to concatenate (say) 1890 to "6 March" to get a death date of "6 March 1890". I am also perfectly happy that they enter both the original ("6 March") and the interpretation ("6 March 1890"). That way we preserve the "index what you see" idea, and we give the interpretation to the only person who can interpret it. As it is, we appear to end up with things like "March 0006" (i.e. March in the year 6 AD).
So personally, I'd prefer the check at the indexing stage. However, that doesn't fix the existing data...
Re the "Index what you see" principle. I have, as you may suspect, issues with it. How does one index newspaper death notices where the death is recorded as "30th Ult" and the proposed burial as "2nd Inst" - i.e. 30th of last month and 2nd of this month? Assume for the purposes of the exercise that the newspaper's publication date is available...
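If the publication date really is available, the arithmetic itself is simple, as this Python sketch suggests. It assumes the text is in the form "30th Ult" or "2nd Inst" and nothing more elaborate; `resolve_relative_day` is a made-up name, and the result would sit alongside the text as indexed, not replace it.

```python
import re
from datetime import date
from typing import Optional

RELATIVE_DAY = re.compile(r"\s*(\d{1,2})(?:st|nd|rd|th)?\s+(ult|inst)\.?\s*", re.IGNORECASE)


def resolve_relative_day(text: str, published: date) -> Optional[date]:
    """Turn "30th Ult" or "2nd Inst" into a full date, given the publication date.

    "Inst" means the month of publication; "Ult" means the month before it.
    Returns None if the text is not in this form or the day does not exist.
    """
    match = RELATIVE_DAY.fullmatch(text)
    if match is None:
        return None
    day = int(match.group(1))
    year, month = published.year, published.month
    if match.group(2).lower() == "ult":
        month -= 1
        if month == 0:
            # A January newspaper saying "Ult" refers to the previous December.
            year, month = year - 1, 12
    try:
        return date(year, month, day)
    except ValueError:
        return None


published = date(1878, 7, 1)  # hypothetical publication date
print(resolve_relative_day("30th Ult", published))
print(resolve_relative_day("2nd Inst", published))
```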
0 -
On the points you make above:
Firstly, yes, I was surprised with the inconsistency in how the year is shown in the 1939 National Register records. I had previously only seen them recorded as 98, 00, 13, etc., but came across quite a few pages with the full year written.
Secondly, I still believe if this had been a FamilySearch indexing project the instruction would have been to index only what was recorded. That would have meant an additional post-indexing exercise to avoid getting 0001 - 0099 years of birth once the records went online. As you suggest, not a great problem with a collection of this type, but far more difficult when the year span includes a number of centuries: "78"? - now would that be 1678, 1778, 1878....? All very well if you do have access to images, but that is often not the case, of course.
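The "78" question can be made concrete with a few lines of Python. The sketch below lists every year inside a collection's date range that ends in the two digits indexed; `candidate_years` is a hypothetical name and the end year of 1950 in the examples is invented purely for illustration.

```python
def candidate_years(two_digit: int, first_year: int, last_year: int) -> list:
    """List every year in the collection's range ending in the given two digits."""
    return [
        year
        for year in range(first_year, last_year + 1)
        if year % 100 == two_digit
    ]


# A register running from 1538 (end year assumed): several centuries fit.
print(candidate_years(78, 1538, 1950))
# A collection starting in 1815 (end year assumed): only one year fits.
print(candidate_years(78, 1815, 1950))
```

When only one candidate comes back the year could be filled in automatically; when there are several, the record needs the image, the page heading or a person to settle it.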
Thirdly, still in connection with "project instructions", the PI to only index a day and month when the year is not actually written (no matter how "obvious" that might be) has led to the mess illustrated above. In one of the examples shown, I did find the FS film reference in the Citation, but the film was restricted, so I could not check the original record. You also provide an excellent illustration of FS indexing problems, with the "30th Ult" and "2nd Inst" example! I'm afraid I wring my hands at the refusal of indexers / project managers to accept / index a date as it is clearly, factually meant to be - just because of its format in the document. Thus, I have seen advice under the Indexing category of Community that "the last day of March" should not be indexed as "31 March" and from a document dated, say, 9 July 1878, it should not be assumed that a death that took place "yesterday" had taken place on 8 July 1878!
I know I keep repeating myself, but if indexing projects are to prove fit for purpose (in helping researchers find dates relating to their ancestors' "vitals", etc.) the work should not be carried out in such a way that renders the "finished product" as almost completely meaningless. ("Yes, he was baptised on 6 March, but what year - what century, even?")
Sadly, the folks that manage FS Indexing make themselves completely unavailable for contact or any discussion on the ramifications of their current practices. (Not just my thoughts, but those of the exasperated indexers who report the impossibility of getting "official rulings" on the correct way to index items in a particular batch, especially when the PIs are ambiguous and/or clash with the generic version.)
But, whilst primarily the problems reported can be traced to indexing procedures, I still believe there needs to be far more in the way of post-indexing checks (including added coding) - not just for the sake of researchers, but to ensure indexers have not wasted their valuable time on projects that provide no real value when the data is added to FamilySearch's online database.
All very well to make points here, but how do these thoughts ever get communicated to those who can make the changes that will ensure getting consistently good search results at FamilySearch is in line with what one can achieve on other websites?
2 -
@Paul W said:
"... I have seen advice under the Indexing category of Community that "the last day of March" should not be indexed as "31 March" and from a document dated, say, 9 July 1878, it should not be assumed that a death that took place "yesterday" had taken place on 8 July 1878! ..."
Oh... 😥
0