Analysis & Suggestions
I analyzed how well the AI-based full-text indexing system handled my surname, Kneeland.
The good news is that there are 27,000+ newly discoverable, correctly interpreted references to Kneeland in the current set of documents.
I then started looking for known variations that I have seen previously in human-interpreted cursive records (e.g., Census). Pretty much as expected, there were several hundred each of variations like Knuland and Knelland.
During the above process, I discovered that this data collection contained numerous "Index" pages for court and land records that were either fully sorted or at least sorted by first character, so some of those pages included multiple instances of Kneeland. Seeing the source page with the search term highlighted, next to the interpreted text, allowed me to see the kinds of mistakes the AI system was making. The following are the errors I saw:
K -> A, H, R (X seems possible but not seen so far)
n -> r, u
ee -> ei, el, ie, u (ii seems possible but not seen so far)
l -> (no variations seen so far, but e, i seem possible)
a -> o
n -> r, u
d -> (no variations seen so far)
These variations occurred singly or in any combination. I ran searches and counts on many of the variations and found hundreds of occurrences of some, down to just handfuls of occurrences of others. Overall, there were 5,000+ incorrectly indexed occurrences, or about 15% of all Kneeland references.
Another type of error involved incorrectly running words together. Specifically, I saw instances of a cursive Kneill being concatenated with a pre-printed "and". I didn't see it in this dataset, but elsewhere I have seen "kneel and" (as in kneel and pray) incorrectly returned as Kneeland.
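To show how quickly these combinations multiply, here is a minimal Python sketch that builds every misread spelling from the substitution list above. The table holds only the misreadings I actually saw; the names CONFUSIONS and variants are made up for this illustration and have nothing to do with how the indexing system works internally.

```python
from itertools import product

# Per-position misreadings observed for "Kneeland", copied from the list above.
# Each entry is (correct text, misreadings seen so far).
CONFUSIONS = [
    ("K", ["A", "H", "R"]),
    ("n", ["r", "u"]),
    ("ee", ["ei", "el", "ie", "u"]),
    ("l", []),
    ("a", ["o"]),
    ("n", ["r", "u"]),
    ("d", []),
]


def variants(confusions, max_errors=None):
    """Return the set of misread spellings with 1..max_errors substitutions.

    max_errors=None means no upper limit on the number of substitutions.
    """
    # For each position, the choices are the correct text plus its misreadings.
    options = [[correct] + wrong for correct, wrong in confusions]
    found = set()
    for combo in product(*options):
        errors = sum(
            1 for chosen, (correct, _) in zip(combo, confusions) if chosen != correct
        )
        if errors == 0:
            continue  # skip the correct spelling itself
        if max_errors is not None and errors > max_errors:
            continue
        found.add("".join(combo))
    return found


single = variants(CONFUSIONS, max_errors=1)
everything = variants(CONFUSIONS)
print(len(single), len(everything))
print(sorted(single))
```

With this table there are 12 single-error spellings (Knuland and Knelland among them) and 359 spellings in total once the errors are allowed to combine, which is far more than anyone would want to search for by hand.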
Suggestions:
Better contextual awareness of letter combinations that are unusual, such as Hr or Hn.
Better contextual awareness of the overall document/list structures. On a fully sorted list, Ku would not appear before Kn. Or on the partially sorted lists, you would not have H or R mixed in with the K names.
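As a rough illustration of the list-structure idea, the Python sketch below flags suspicious entries on an index page. It assumes the page has already been transcribed into a simple list of names in page order; the function names flag_odd_initials and flag_out_of_order are hypothetical, and a real system would presumably apply this kind of check inside the recognition step rather than afterwards.

```python
from collections import Counter


def flag_odd_initials(names):
    """On a first-letter-sorted page, return names whose initial differs
    from the most common initial on the page."""
    initials = Counter(name[0].upper() for name in names if name)
    if not initials:
        return []
    common, _ = initials.most_common(1)[0]
    return [name for name in names if name and name[0].upper() != common]


def flag_out_of_order(names):
    """On a fully sorted page, return interior names that break the order.

    An entry is flagged only when its two neighbours are in order with each
    other but the entry does not fit between them. The first and last
    entries on the page are not checked in this simple version.
    """
    keys = [name.lower() for name in names]
    flagged = []
    for i in range(1, len(keys) - 1):
        prev, cur, nxt = keys[i - 1], keys[i], keys[i + 1]
        if prev <= nxt and not (prev <= cur <= nxt):
            flagged.append(names[i])
    return flagged


partly_sorted = ["Kelly", "Hneeland", "Kneeland", "Rneeland", "Knox"]
fully_sorted = ["Knapp", "Kuland", "Kneeland", "Knight", "Knox"]
print(flag_odd_initials(partly_sorted))
print(flag_out_of_order(fully_sorted))
```

In the first example the H and R entries stand out among the K names, and in the second Kuland is flagged because it cannot sit between Knapp and Kneeland. Flagged entries would be candidates for a second look, not automatic corrections.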
As an adjunct, I would suggest that an enhanced search engine also be developed that accounts for the types of errors that handwriting and optical character recognition systems make (e.g., h -> li). In its simplest form, it would look for and return instances with an error in any one position and then expand out from there. This is basically what we as humans have to do manually using wildcards with the existing search engine.
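A minimal sketch of that "one error first, then expand" search, in Python, might look like the following. It assumes the index can be treated as a plain list of transcribed terms and uses a small made-up confusion table drawn from the errors I listed plus the h -> li example; CONFUSION_TABLE, one_error_variants and search are hypothetical names, not part of any existing search engine.

```python
# Hypothetical confusion table: correct text -> misreadings. Illustrative only.
CONFUSION_TABLE = {
    "K": ["A", "H", "R"],
    "n": ["r", "u"],
    "ee": ["ei", "el", "ie", "u"],
    "a": ["o"],
    "h": ["li"],
}


def one_error_variants(word, table):
    """Return every spelling that differs from word by one substitution."""
    found = set()
    for i in range(len(word)):
        for correct, wrongs in table.items():
            if word.startswith(correct, i):
                for wrong in wrongs:
                    found.add(word[:i] + wrong + word[i + len(correct):])
    found.discard(word)
    return found


def search(query, indexed_terms, table, max_errors=2):
    """Return {term: error_count} for indexed terms reachable from query.

    Exact matches come first (0 errors), then one-error variants, then the
    search expands outward one substitution at a time up to max_errors.
    Later rounds may substitute text that an earlier round already changed,
    which is acceptable for a sketch.
    """
    indexed = set(indexed_terms)
    hits = {}
    if query in indexed:
        hits[query] = 0
    seen = {query}
    frontier = {query}
    for errors in range(1, max_errors + 1):
        expanded = set()
        for word in frontier:
            expanded |= one_error_variants(word, table)
        expanded -= seen  # keep only spellings not reached in an earlier round
        seen |= expanded
        for word in expanded & indexed:
            hits[word] = errors
        frontier = expanded
    return hits


index = ["Kneeland", "Knuland", "Hnuland", "Knelland", "Neyland", "Kneelond"]
for term, errors in sorted(search("Kneeland", index, CONFUSION_TABLE).items(),
                           key=lambda item: (item[1], item[0])):
    print(errors, term)
```

Ranking results by error count keeps the most likely matches at the top, and a variation such as Neyland, which is not a misreading under this table, is left out of this search.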
The existing fuzzy search engines operate more like the original Soundex systems, so they pick up variations like Neeland and Neyland but are not designed to pick up the kind of variations being generated by mis-read cursive script and OCR errors.
The other thing that would be hugely useful with any search engine is a set of sort options.
Finally, as I think some others have suggested, we need the ability to make corrections, both for the benefit of follow-on researchers, and hopefully as feedback to the AI system so that it improves in the future. I would suggest this include a global search and replace for instances like Knuland -> Kneeland.
Bottom line is this is a huge positive step forward. I'm eagerly anticipating it getting even better. Thanks for all you have done, are doing, and will do in the future. And thanks for the opportunity to contribute.
Kurt Kneeland
Comments
Thank you for your feedback. Machine reading of handwriting is an imperfect science and transcriptions contain errors. This is to be expected. These are data issues that are certain to improve over time. For the present, we are focused on making the Full Text Search process, which is limited by the quality of the transcriptions, useful and user friendly for researchers. As you point out, it is a huge positive step and its potential is immense.