ML-generated metadata for microfilm scans

Bryan3867 · October 7, 2024

Hi, this is an idea developed through my use of online microfilm scans, e.g. the "Metrical books" of Ukrainian Catholic church records. Beyond the basic ability to associate a reel with a parish/place, I am finding that locating the specific images that relate to a place, a time period, etc. is too time-consuming to make a search successful. The basic issue is likely similar for other data sources: thousands of pages are only loosely organized into film scans. Within a film scan, the image order and the dates/places reflect what was likely a decades- or centuries-long process of documenting, collecting, copying, collating, warehousing, loss/damage recovery, etc. As well, the existing reel indexes of places are at best a hint to a probable subset of related images; other images from the same place probably also occur in other reels not indexed as such.

With a background in data science and machine learning (ML), I see some simple improvements in (or first creation of) metadata for reels and images that are possible by applying simple ML techniques (e.g. image recognition). A project to build an ML-enabled metadata generation tool could succeed through incremental development of basic ML-model training resources for image recognition, e.g.

handwritten names of places or other important metadata, often appearing by the same hand across many images e.g. in the page header
dates related to the page

More complex developments could follow, such as generating metadata (place, date, and names) for individual records within an image.
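
As a rough illustration of the place-name step, here is a minimal sketch in Python. It assumes a handwriting recognizer has already produced a noisy text string from the page header; the recognizer itself is not shown. The place list, the function name, and the cutoff value are all hypothetical, and this is not an existing FamilySearch tool. It only shows how imperfect recognizer output could be matched against a list of known places for a reel.

```python
from difflib import get_close_matches

# Hypothetical list of place names already associated with one reel.
KNOWN_PLACES = ["Lviv", "Ternopil", "Stryi", "Drohobych"]


def normalize_place(raw_text, known_places=KNOWN_PLACES, cutoff=0.7):
    """Map noisy recognizer output to the closest known place name.

    Returns None when nothing is similar enough, so that uncertain
    headers can be left for manual review instead of being guessed.
    """
    # Compare case-insensitively, but return the original spelling.
    lookup = {place.lower(): place for place in known_places}
    matches = get_close_matches(
        raw_text.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


if __name__ == "__main__":
    print(normalize_place("Ternopi1"))  # a header misread with a digit
    print(normalize_place("xyzzy"))     # nothing close enough
```

Returning None rather than the nearest guess is a deliberate choice in this sketch: a wrong place label in the metadata is probably worse than a missing one.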

What I would like to know is: (1) are there any ongoing projects to support the user community in this way; and (2) are there any opportunities to develop tools such as I describe through improved (e.g. automated) access to film scan images? Such access is essential to developing and implementing this kind of process at scale.

Thanks for any advice.

Ashlee C. · October 7, 2024

@Bryan3867 Thank you for your interest. If I understand your query correctly, the answer is yes: FamilySearch is already using AI to help improve record searching. You can get involved in the project by helping with Quick Name Review and Full Name Review. You can read about it here. Scroll down until you see the section titled "What does it mean to review a name?"

Bryan3867 · October 7, 2024

I understand that as I view individual images I can suggest new or changed information about what is on that image, if that is what you meant. But my suggestion is about automating that review process, so that manual review is not the only way to improve the data.

Ashlee C. · October 8, 2024

I think you are referring to the index editor, which allows users to make changes to the indexes of individual records.

I am referring to Get Involved, which is located at the top of the FamilySearch homepage. Among the opportunities to help are these:

These opportunities start with a computer that reads the records searching for names and places. The users then help to review what the computer has done, which teaches the computer to do better as well.

Another AI project, Full Text Search, can be found in FamilySearch Labs. Many FamilySearch collections are already included in this lab project, and more are added all the time. The computer searches for keywords across multiple collections. You can learn more about Full Text Search in this Community group - Full Text Search Feedback.

Does this answer your question?

Bryan3867 · October 9, 2024

Thanks for the suggestion. I checked out the available experiments, and the closest in intent is "Expand your search with Full Text". I will test it to see what types of documents have been included, but my use case is more specific: a platform capability to auto-generate index data for specific document sets (e.g. the "Metrical books" microfilms). The index data would be based on typed and handwritten text in the documents, and on grouping document images into sets according to how the authors structured them. For example, images that do not contain the indexed text would be included if the sequence of images in an image group associates them with others that do. A side benefit would be the development of methods for assessing and documenting how document sets are structured, as input to the indexing process.
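
To make the sequence-based association concrete, here is a minimal sketch in Python. It rests on one assumption that would need checking against real reels: an image with no recognized header belongs to the same set as the most recent image that had one. The function name and the sample labels are hypothetical.

```python
def propagate_labels(labels):
    """Carry the most recent recognized label forward over unlabeled images.

    `labels` holds one entry per image, in reel order: a place name
    where a header was recognized, or None where nothing was found.
    Images before the first recognized header stay None.
    """
    filled = []
    current = None
    for label in labels:
        if label is not None:
            current = label  # a new header starts a new set
        filled.append(current)
    return filled


if __name__ == "__main__":
    reel = [None, "Stryi", None, None, "Lviv", None]
    print(propagate_labels(reel))
```

A real tool would likely need extra signals (e.g. a change of handwriting or page layout) to decide where a set ends, since this simple forward fill would mislabel any reel where unrelated pages were collated together.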

I will suggest this through the "suggest an idea" link. My initial goal here is just to get connected with anyone on the development team who can help me pursue this as a project.

Ashlee C. · October 9, 2024

Thank you for the idea. Posting in Suggest an Idea is a great way to get your idea to the engineers.
