Open-sourcing FamilySearch's OCR/HTR software and trained models

Hello! I came across FamilySearch after seeing their transcribed data sets in NARA. For the past year I have been working on something similar for a personal project that involves automatic transcription of large volumes of mining records in Alaska, which the Dept. of Natural Resources scanned in the early to mid-2000s and made available via a book/page interface on their website. As we all know, being able to search only by book and page, and having to search grantor/grantee indexes manually, is cumbersome, especially when you are looking for something that has not been indexed.
I wrote a program that downloaded all the TIF files from all the books of all the recording districts in Alaska, and started looking at the various software out there that would let me convert the TIF images to searchable text. None of the free or open-source software does HTR that well out of the box, and most of the commercial pre-trained software charges by the page. At my volume, and for a personal project, that is not feasible.
So I've been working on training Tesseract on the handwriting in the various books/districts, as well as looking at what Google, Amazon, etc. offer via their APIs. I've gotten to a point where it's better than the software I've tested that offers a yearly or perpetual license with unlimited scans. But from what I've seen, whatever FamilySearch is using blows it out of the water.
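For anyone wanting to put a number on "better than" when comparing engines on the same pages, one common yardstick is character error rate (CER): the edit distance between an engine's output and a hand-checked transcription, divided by the length of the hand-checked text. The sketch below is only illustrative and assumes Python. The sample strings are made up, and the function names `edit_distance` and `cer` are hypothetical helpers, not part of Tesseract or any vendor API.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    # At the start of each outer iteration, prev[j] is the distance
    # between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed per reference character."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example strings, not real output from any engine.
truth = "Fairbanks Recording District"
engine_a = "Fairbanks Recordinq District"
engine_b = "Fairbonks Recordinq Distrct"
print(cer(truth, engine_a), cer(truth, engine_b))
```

A lower CER means fewer character-level mistakes against the hand-checked text, so the same handful of verified pages can be reused to score every engine being compared.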
I wonder if this technology will be open-sourced (with training data) or made available to those helping to verify transcriptions, or if there is any way to participate in its development, as I seem to have spent the last year reinventing the wheel.
Thanks!
Answers
-
Flagging for moderator response (not sure who the appropriate mod would be for this).
0 -
Is there a better forum/group to ask this question in? Thanks.
0 -
@Sky287 I am not connected with FamilySearch.
My suggestion would be to contact the FamilySearch Library, as it seems to me that the Library, or perhaps more accurately the Library management structure, is ultimately responsible for FamilySearch matters. However, whether you would get any response is another matter.
If you are on Facebook, there is a FamilySearch Library page where I think you would be able to send a message (I am not on Facebook, so I can't say definitely): https://www.facebook.com/familysearchlibrary/
https://www.familysearch.org/en/library/
There are email addresses in this help article:
https://www.familysearch.org/en/help/helpcenter/article/how-do-i-request-a-correction-to-the-familysearch-catalog
0 -
Thank you for the suggestion, I will give it a shot!
0