Handwriting OCR can replace indexer, cutting indexing process in half, and might even be accurate en
LegacyUser
✭✭✭✭
Michael W. McCormick, AG® said: Start using handwriting OCR as a first indexing pass. The result would still queue up for human review. Maybe even make the index immediately searchable without waiting for the review process, and have it update instantly when the human review is done or when it is corrected by those who find it by searching or browsing.
-
David Newton said: No it can't. The error rate is far too high and handwriting is far too variable. You have no idea what you are talking about so please stop talking about it.
-
Lundgren said: Michael,
Thank you for your contributions to this forum. FamilySearch is actively working on machine learning and record indexing.
You can see some of the results of this effort in the obituaries that have been added to the record search. They indicate that they have been indexed by machines.
We are getting better at this and as the system matures you should see more records like this become available.
-
Juli said: Joining the chorus: no, current optical character recognition software CANNOT read handwriting. It can't even properly manage printed text, for heaven's sake.
-
David Newton said: Thing is, that's not handwriting OCR. That's typewritten OCR. Typewritten OCR is an entirely different beast, and whilst it can have a dreadful error rate with dodgy scans, provided the scan it is working from is of good quality it can get recognition quality levels well over 90%. That's still an error rate of about 1 in 10 to 1 in 20, but it's good enough to be at least somewhat useful.
-
Lundgren said: This is a sample of a search that has records in it that were indexed using machine learning:
https://www.familysearch.org/search/r...
If you click into the details, you can see that some of them indicate that they are created by machine.
As David mentioned, these are not from handwritten documents.
Work is being done to make handwritten documents possible as well. FamilySearch will make sure the quality is good enough to allow users to find their ancestors before the records are made available on the site.
-
TManning said: The records I have seen indexed by machine were not accurate enough for my taste. The ones I have seen were obituaries. They frequently had miscellaneous relatives identified as siblings or other such problems. Any user just blithely attaching sources without actually referring to the original document would be adding people to the tree with entirely wrong relationships since add buttons were right there to click.
-
Lundgren said: I've experienced incorrect relationships and names as well.
-
David Newton said: Yep. There are some fonts which give OCR engines real indigestion. Fraktur is notorious in that regard, which means older printed texts in German are a pain in the neck to deal with.
-
Adrian Bruce said: "Any user just blithely attaching sources without actually referring to the original document"
A hugely important point - there is an argument that says a material error rate is acceptable provided the user actually reads the source and discards the garbage. Unfortunately, lots of people don't; they just accept the computer index.
-
Adrian Bruce said: For an example of the problems, just look at the thread on https://getsatisfaction.com/familysea... where a human has indexed "Earl of Elgin" as "Early Elyin". And the thing is, looking at the image - that's not a stupid error. The point is that the background processing in the human brain that proposes and discards different readings is based on context, and that context simply isn't there for machine learning of handwriting outside very specific and very limited scenarios.
-
Michael W. McCormick, AG® said: Please ban those who respond rudely. When someone gives an idea and you think it is impossible or even detrimental, remember the golden rule. I have seen several tests and studies, but it isn’t worth my time to argue.
-
Michael W. McCormick, AG® said: I have thick enough skin that it only hurts and I can move on, but the type of negative replies common on this forum is unacceptable. Users should remember the golden rule and stop assuming they know more than a person who posts a suggestion about the topic. They may know less but that is irrelevant. Just be nice.
-
Michael W. McCormick, AG® said: I know FamilySearch has been studying this as an option for several years and has been talking to BYU who also has been studying this for several years. I have seen studies resulting anywhere from 70-95% accuracy depending on handwriting difficulty, machine learning used, etc. It has not been discussed super publicly, but BYU and other groups have talked about it every 2-3 years at RootsTech. Universities and other groups worldwide have been researching it for many years. In the last few years I have seen organizations begin to offer handwriting recognition for document transcription in actual solutions the public can access or that can be purchased. It is still not mainstream though. FamilySearch doesn’t always adopt something until years after it is viable. They introduced the phone app years after it was commonplace. Same for many other features. So I know they’ll do it when they feel it is ready, but that doesn’t mean others aren’t already making it work.
BYU says they think it’ll be ready soon. https://www.facebook.com/114517842399...
-
Michael W. McCormick, AG® said: University of Innsbruck makes this available to users. https://transkribus.eu/Transkribus/
-
David Newton said: 70% accuracy. So slightly better than the flip of a coin. Useless for anything.
70 characters in that first paragraph. At a 70% accuracy rate that would be 49 characters correct and 21 incorrect.
95% accuracy is still somewhat marginal. 99% accuracy is beginning to get somewhere and 99.9% accuracy is very much useful. Typewritten OCR is still around 98% to 99% accuracy at the page level. Guess what? It's been at or around that level of accuracy for a very, very long time. The low-hanging fruit has been harvested: significant improvement is hard and takes a lot of work to find.
It's the Pareto principle writ large.
Handwriting OCR is not ready for mainstream use at all except in extremely specialised areas. What is one of those areas? Recognition of the amount of a cheque when paying it in. I have seen such OCR at work and it does the job correctly. General handwriting OCR? No way.
-
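As a rough sketch of the accuracy arithmetic in David Newton's post above: the expected number of wrong characters is the text length times the error rate, and if each character is assumed to be recognised independently (a simplifying assumption; real OCR errors tend to cluster), a field of n characters comes out entirely correct with probability p^n. The function names and the 8-character name length are illustrative only and do not come from the thread.

```python
def expected_errors(num_chars: int, char_accuracy: float) -> float:
    """Expected count of wrong characters in a text of num_chars characters."""
    return num_chars * (1.0 - char_accuracy)


def whole_field_accuracy(num_chars: int, char_accuracy: float) -> float:
    """Probability that every character of a field is right.

    Assumes each character is recognised independently with the same
    accuracy, which is a simplification made for this sketch.
    """
    return char_accuracy ** num_chars


if __name__ == "__main__":
    # The accuracy levels discussed in the thread; name length 8 is an assumption.
    for accuracy in (0.70, 0.95, 0.99, 0.999):
        wrong = expected_errors(70, accuracy)
        name_ok = whole_field_accuracy(8, accuracy)
        print(f"{accuracy:.1%}: about {wrong:.1f} wrong of 70 characters, "
              f"8-letter name fully correct with probability {name_ok:.2f}")
```

Under these assumptions the gap between per-character accuracy and per-name accuracy widens quickly, which matters for an index that is searched by exact or near-exact name.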
David Newton said: Golden rule? You mean tell people the truth?
I'd definitely call that a golden rule. I'd definitely call that extremely important. In fact I'd call that paramount.
Handwriting OCR is not possible at a useful level of accuracy. It may never be possible at a useful level of accuracy. Worth research? Yes very much so. Worth productising? Definitely not at the moment.
-
joe martel said: Thanks Michael. You are correct, and character recognition is an important aspect of recognizing and getting more records in front of humans. Lundgren has pertinent info here too. And as you mentioned, there is a lot of work going on, especially in handwritten pattern recognition.
To come up to speed with what's happening in this space, the posters here could start with a Google search for terms like "handwriting pattern recognition conferences", ICPR, ICFHR and IAPR, and look at the published work and participants.
And you could assume a lot of unpublished research is happening. (Think Google Translate for, say, Kanji.)
(Many here on this community appreciate questions like the one you have posed. I wish everyone could be polite, but sometimes the best approach is to ignore those who choose to make it difficult for the well-meaning to want to participate.)
-
A van Helsdingen said: I agree that many responses here have been quite rude.
While handwritten OCR is still not good enough, it is showing great promise. The indexing site VeleHanden (which specialises in Dutch projects) currently has a project where indexers review and correct the OCR readout of documents going back as far as 1578. I believe they are using the aforementioned Transkribus.
Professional genealogist Yvette Hoittink predicted that handwritten OCR would be the biggest "trend" for (Dutch) genealogy in the 2020s. https://www.dutchgenealogy.nl/ten-tre...
-
Paul said: Michael
I understand how you might have been hurt by the blunt responses here, but please re-read your initial post and immediately following comments. You are definitely implying your idea is already ready to be introduced in relation to handwriting. Even those who have been enthusiastic and/or polite about the issue seem to all be in agreement that it is most certainly not.
Yes, a good idea for the future, but surely not for immediate implementation, as you are suggesting.
-
Tom Huber said: Michael,
There is a big difference between OCR of machine produced records (typeset, typewritten, etc.) and OCR of handwritten records. The latter just isn't here yet, while great advances have been made in the area of machine-produced records.
This is not a negative response but one that is factual, and one understood by someone whose entire career was in this industry. You may not like the response, but it is accurate and honest.
Please do not be as negative as you imagine the rejection from others to be. Accept what is available and if you know of a program (not technology) that can accurately decipher and produce a text based upon a wide array of hand-made records, including those that mix languages, have different characters, or are as bad as the proverbial doctor's handwriting, I would love to know what it is.
-
Juli said: Michael, I don't believe my response was rude in the slightest. I'm sorry that you feel people are being negative to you, but the statement in your title ("Handwriting OCR can replace indexer") is so completely and absolutely untrue that we all feel "triggered" by it.
-
Michael W. McCormick, AG® said: In hindsight I see it was a poorly worded title that would trigger people. I only meant that right now 2 people see everything indexed by FamilySearch: the indexer and the reviewer. Someday some collections can be run through handwriting OCR first and then a human would do the review. Staff I have spoken with say this may show up soon, but won’t say what soon means. BYU is releasing it as a selective public beta this year.
-
Tom Huber said:
On the other hand, automatic transcription might be one of the highest selling points of Transkribus, but the success of this operation is dependent on each document’s needs. Each text has its unique characteristics and requires special personalized treatment. Users must be able to provide to the system accurate human-made transcription of the document that desires to transcribe, and after that, they must build a model that is intelligent enough in order to decode the handwriting types that the document includes. In short, we are concluding that this platform does not have one but several selling points. The technology that Transkribus platform is handling can provide it is entirely advance HTR technology for academic and research needs, while the increased accessibility of the transcribed documents through the servers at the University of Innsbruck ensures that this project will keep bringing together the scientific and the technological world.
-- https://medium.com/@filotasliakos/tra... (emphasis added)
Unfortunately, the review (part of which I have included above) does not actually show or compare handwritten examples with the results the program produced for them. The requirement for "accurate human-made transcription" is vague at best and suggests that an original image of a handwritten document must be transcribed before it can be processed.
In looking through that review, it would take a person many hours to decide if they could actually use the software, and that is time I do not have available.
So the question is: have you, Michael, actually downloaded and used the program, and if you have, what were the results?
To me, this sounds a lot like machine learning is involved, something that does exist, but has its own set of issues.
There is a program that deals with voice recognition, something that has been around for a very long time (decades), but in each case, the learning requires training the "machine" to recognize the commands.
There are algorithms that deal with artificial intelligence -- something that I played with in the 1980s, but something that I quickly came to realize as not being overly self-creative. Note that at that time, the resulting program could look like it was "intelligent" but not in the truest sense of the word: no imagination, no intuition, only the ability to follow its programming and modify its code accordingly.
Until we actually see more than an academic review of the program without any real-world test results, I consider this akin to the student who brought me a book on how to play chess and told me he wanted to program a computer to play it.
Indexers have a lot of challenges and one of them is documents with faded ink that takes a lot of processing to even make decipherable. Then there is the human-based intuition that is needed to actually transcribe (index) that document in a meaningful way.
It is my feeling, that unless you have actually used the program on a wide variety of documents, you are dreaming of a future that has yet to arrive. And just as computers have become very powerful, they have yet to achieve true artificial intelligence, which is what will be needed to make handwritten auto-transcription possible.
-
Tom Huber said: My reply here still stands. Transkribus is not something that is "ready for prime time" and certainly not a program to download and easily spend a few minutes learning how to use so that the average user can put the program through its paces.
I was hoping for some real-world reviews, not the extensive academic review that can be boiled down to this:
Transkribus from a technical perspective might not be perfect yet but the growth that the platform has shown into the last decade guarantees the future prosperity of the software. Last but not least, it is fair to acknowledge that Transkribus project is indeed a powerful stepping stone...
-- from the last paragraph of the above-cited academic review.
-
Juli said: Just to illustrate how useless 70% accuracy is, here's David's first line at about that level:
7ooo auuracy. 5o sIightIy beHer than the fllq of a ioin. UieIeis far aiuthlng.
(And before you complain "but the computer would never guess 'auuracy' for 'accuracy' because it can look it up in a dictionary and know which one is an actual word", keep in mind that we're talking about transcribing things that are primarily names, with a few numbers thrown in: nothing that can be looked up in any dictionary.)
-
terry blair said: I know nothing about the Transkribus program or how hard it is to set up and train, or its accuracy rate, but I would like to propose a test: compare the transcription results of the program to the human generated transcription of the Cuyahoga Co., OH death records from 1900 through 1907. The advantage of this data set is that it is a large data set in reasonably good order; it has been photographed and is available on FSFT; the entries in the set have apparently been made by a limited number of clerks, thus the handwriting for large groups of the entries is the same; and it has been completely indexed. The results of such a test will determine how ready the program is to be used for other data sets where large blocks of entries have been made in the same hand.
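One common way to score a comparison like the one proposed above is the character error rate: the edit distance between the machine transcription and the human reference, divided by the length of the reference. The following is a minimal sketch of that metric, not a description of how Transkribus or FamilySearch measure accuracy; the function names are illustrative, and the sample pair is the "Earl of Elgin" misreading mentioned earlier in this thread.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete ref_char
                current[j - 1] + 1,                   # insert hyp_char
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the human reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    human = "Earl of Elgin"
    other = "Early Elyin"
    print(f"character error rate: {character_error_rate(human, other):.2f}")
```

Averaging this rate over every entry of an already indexed collection would give one number per handwriting style or clerk, which is the kind of evidence the proposed test is asking for.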
-
Michael W. McCormick, AG® said: I can read the sentence example. Our brains often correct words with 30% or so mixed up letters. Of course that doesn’t mean I’d find a badly misspelled name in a search. Wildcards are necessary for searching anyway, but this might require new wildcard strategies. Good thing that we are not actually stuck at 70% accuracy.
One technique used is to provide the handwriting recognition with a training database that only improves as humans correct errors. There actually are name dictionaries. We used one in the defunct FamilySearch Indexing desktop program even. It would mark things red that were not in the dictionary so the user would know to double check they got it right. I have a book on my shelf at home listing hundreds of German given names. It is really handy when I see one I’m not already familiar with. It wouldn’t be a 100% solution, but would help.
-
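The name-dictionary check described above can be sketched in a few lines. This is only an illustration of the idea of flagging names that are not in a list and suggesting near matches; it is not the code of the FamilySearch Indexing program. The tiny name list, the function name and the 0.6 similarity cutoff are assumptions made for the example. The suggestions come from `difflib.get_close_matches` in the Python standard library.

```python
import difflib

# Illustrative stand-in for a real name dictionary (assumption, not real data).
KNOWN_GIVEN_NAMES = {
    "Johann", "Heinrich", "Friedrich", "Wilhelm", "Margaretha", "Catharina",
}


def check_name(name, known_names=KNOWN_GIVEN_NAMES, max_suggestions=3):
    """Return (in_dictionary, suggestions) for a transcribed given name.

    A name missing from the dictionary is only flagged for a second look;
    it may still be a correct reading of a rare name.
    """
    if name in known_names:
        return True, []
    suggestions = difflib.get_close_matches(
        name, sorted(known_names), n=max_suggestions, cutoff=0.6
    )
    return False, suggestions


if __name__ == "__main__":
    for transcribed in ("Heinrich", "Heinrlch", "Xqzvw"):
        known, suggestions = check_name(transcribed)
        status = "ok" if known else "flag for review"
        print(f"{transcribed}: {status} {suggestions}")
```

As noted in the post, this is not a complete solution: a misread name can land on another valid name, and a valid but unlisted name will be flagged.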
Justin Masters said: I think most people recognize (or should recognize) that OCR for printed text is VASTLY different than handwritten recognition. (and I'm honestly somewhat surprised that printed text OCR isn't better, as I'd approach it this way - don't try to trace outlines and determine font matches - go for the middle of the strokes/lines, and compare that "path" or "pattern" to known characters)
I'm thrilled to see that BYU and FamilySearch are working hard on this problem. I can guarantee you that this is MUCH bigger than a genealogical issue.
There is TONS of money involved with legal documents, titles/deeds for property, financial documents, etc. that would KILL to have handwritten OCR capabilities, and it's likely that those industries will be making significant headway in making handwritten OCR a reality (if ever). I think that without AI technology, this isn't going to be possible.
Just think of the skills, techniques, past and contextual learning necessary for *US* to do handwriting recognition, and that has to be emulated (and improved). And I think we'll have to accept that there are going to be handwriting samples that will NEVER get OCR'ed, simply because they're illegible to everybody.
Hopefully, what we'll see is vast improvements in contextual recording, and I'm not even sure how to attack such a problem. And what I mean by that is this:
Given an illegible name on a document, collect what other info we can glean from it (time, place, other people in proximity or listed on the document, including non-related and distantly related people who might be witnesses or doing their job in recording), then perform something akin to a FAN+ analysis (Friends, Associates and Neighbors). Now, take all that out one level to all THOSE people and run a FAN analysis on them, and then rate the relationships and context (does it make sense for that person to be there at that place/time with those people?) and provide a score.
Bonus points for diagramming those relationships (along with a UI that allows us to easily attach them all quickly), and perhaps use those attachments to further provide a weighting towards future analysis. (But wait... how do we "un-score" an improper collection of "facts" that end up not being what was "predicted" or "assigned"?)
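A very rough sketch of the contextual scoring idea above, for a single level of FAN analysis only: each candidate identity for an illegible name is scored by whether the person was alive in the document year, whether the place fits, and how many of the other people on the document are already known associates. Everything here is hypothetical: the `Candidate` structure, the equal weights and the sample people are invented for illustration, and the second-level expansion and the "un-scoring" question raised in the post are not addressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One possible identity for an illegible name (hypothetical structure)."""
    name: str
    birth_year: int
    death_year: int
    places: frozenset
    associates: frozenset  # known friends, associates and neighbors


def context_score(candidate, doc_year, doc_place, doc_people):
    """Plausibility in [0, 1] that the candidate is the person on the document."""
    # Someone not alive in the document year is ruled out entirely.
    if not (candidate.birth_year <= doc_year <= candidate.death_year):
        return 0.0
    place_score = 1.0 if doc_place in candidate.places else 0.0
    # Share of the other people on the document who are known associates.
    if doc_people:
        fan_score = len(candidate.associates & doc_people) / len(doc_people)
    else:
        fan_score = 0.0
    # Equal weights are an arbitrary choice made for this sketch.
    return 0.5 * place_score + 0.5 * fan_score


def rank_candidates(candidates, doc_year, doc_place, doc_people):
    """Return (score, name) pairs, best candidate first."""
    scored = [
        (context_score(c, doc_year, doc_place, doc_people), c.name)
        for c in candidates
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    people_on_document = frozenset({"Mary Smith", "John Doe", "Anna Roe"})
    candidates = [
        Candidate("Thomas Smith (b. 1820)", 1820, 1890,
                  frozenset({"Cleveland"}), frozenset({"Mary Smith", "John Doe"})),
        Candidate("Thomas Smith (b. 1900)", 1900, 1960,
                  frozenset({"Cleveland"}), frozenset({"Mary Smith"})),
    ]
    for score, name in rank_candidates(candidates, 1870, "Cleveland", people_on_document):
        print(f"{score:.2f}  {name}")
```

A score like this could only rank suggestions for a human to confirm; it does not establish that any candidate is the right person.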
This discussion has been closed.