Place to attach text transcription of record from "difficult to read" images.

Chuck Dosh · February 23, 2022

Various images of records written in German Kurrent style are **difficult** to read.

Please consider a way to attach small text files to records for others' benefit. It is useful not only as a readable rendition of the original record but also as a template for other records of that time, locale and author's handwriting idiosyncrasies.

I have transcribed dozens of them to text and would like to share the results. Some are family, some not. As time goes on I expect these dozens to grow into the hundreds.

Once a complete record is transcribed (it might take an hour), useful details emerge, like "step" relations, cause of death, widow status, etc. - facts not usually seen in the original transcription. Too often I then find errors in the original transcription, like son and father reversed.

----

This idea relates to an image-to-text transcription, not a language translation - see the example below.

Of course, parts of some records are indecipherable; an underscore character could mark the spots where the reading is uncertain.
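As a rough illustration of the underscore convention, here is a minimal Python sketch that locates the uncertain stretches in a transcribed line, so that a later reader knows where to look on the image. The function name `uncertain_spans`, and the rule that any run of one or more underscores marks an unreadable stretch, are assumptions made for this example; nothing like this exists in FamilySearch as far as this thread establishes.

```python
import re


def uncertain_spans(line: str) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of each run of underscores.

    Assumption: a run of one or more underscores marks text the
    transcriber could not read.
    """
    return [m.span() for m in re.finditer(r"_+", line)]


# One line from the sample marriage transcription in this post.
sample = "Anna Elisabethe Christina Baumannina ____: Ezeckiel Baumanns"

for start, end in uncertain_spans(sample):
    # Show a little context around each gap so it is easy to find on the image.
    print(f"uncertain at {start}-{end}: ...{sample[max(0, start - 12):end + 12]}...")
```

Keeping the gaps as plain underscores means the transcription stays an ordinary text file, while a simple search like this can still list the places that need a second pair of eyes.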

This is not a German Kurrent only issue.

------

Example below

Record: https://www.familysearch.org/ark:/61903/1:1:QPDN-G4JQ

Corresponding image: Image #515 lower left: https://www.familysearch.org/ark:/61903/3:1:3Q9M-CS4Q-9Q2H

Sample marriage transcription.

ab 21te Novts: wurde Joh: Nickolaus Dosch, Nickolaus

Doscher, heis gewusstem Gemeindsmanns, mit seine Ehefrau Anna

Dorothe, geb: Kronmüllerin erzeugten, ehelich ledigen Sohn 36 Jahre mit

Anna Elisabethe Christina Baumannina ____: Ezeckiel Baumanns

gewusstem Gemeinds manns and zu Sonderreith hinterlassen

Nichmittag, Bauminnin 30 J. alt, aus 3 maligen Aufbieten östentlich gelaeunt.

Julia Szent-Györgyi · February 25, 2022

I'm not sure where in FS's data structures a group of transcriptions would fit. They do not currently contain any: the database that Search - Records and the hinting system use contains indexes only, not transcriptions. (An index contains only the key names, dates, and maybe a few other facts about an event. It is meant as a means of finding the document. The fact that FS's structures practically force one to use these finding aids as stand-ins for the documents is a side effect of the fact that computers can't read handwriting.)

I normally put transcriptions in the Notes/Description field of the source citation, but of course this presupposes that I've made the association between the page/image and a Family Tree profile, which normally means that the person is related to someone I'm researching.

genthusiast · February 26, 2022

I have always liked the idea of having an image linked digital transliteration/copy option (a reproduction word by word, line by line that could be linked/bonded to the image). The idea is definitely worth considering again (assuredly already has been but has not been implemented for probably various reasons).

Along the lines of the AI indexed records - this could be identified by a green section box (instead of the AI blue section box in the example above) - and along the lines of researcher index edits - could be indicated by green word/segment highlights/boxes (instead of whatever color AI indicates for word segments - yellow in the example image above).

@Chuck Dosh For ease of explanation I may borrow your example above and generate some images for illustration (check back later). Although your idea lists only a complete record Transcription - I am expanding it to include Translation(s). Obviously I agree - the record should be transcribed correctly before translated.

Option 1: As far as where those translations would end up - it depends upon the type of database/platform being used for the image indexes - but generally I think a text field/blob could be added to the database and another 'translation' column added (hopefully since the record language should already be identified - all that would be needed is the 'researcher' selecting what language they are translating into - thus allowing possibility of multiple translation into different languages - and the possibility of multiple 'researchers' contributing to that one-per-language translation - for example only one English translation which could be contributed to).
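To make Option 1 a little more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and the `save_translation` helper are all hypothetical; the actual FamilySearch database structure is not known here. The point it illustrates is the "one translation per language" rule: the primary key on (record_id, language) means later contributors revise the single shared English translation rather than adding a second one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transcription (
    record_id TEXT PRIMARY KEY,   -- hypothetical: the record's identifier
    language  TEXT NOT NULL,      -- language of the original record
    body      TEXT NOT NULL       -- the transcription as a text blob
);
CREATE TABLE translation (
    record_id TEXT NOT NULL REFERENCES transcription(record_id),
    language  TEXT NOT NULL,      -- language being translated into
    body      TEXT NOT NULL,
    PRIMARY KEY (record_id, language)  -- at most one translation per language
);
""")


def save_translation(record_id: str, language: str, body: str) -> None:
    """Insert the translation, or overwrite the existing one for that language."""
    conn.execute(
        "INSERT OR REPLACE INTO translation (record_id, language, body) "
        "VALUES (?, ?, ?)",
        (record_id, language, body),
    )


rid = "QPDN-G4JQ"
conn.execute(
    "INSERT INTO transcription VALUES (?, ?, ?)",
    (rid, "de", "ab 21te Novts: wurde Joh: Nickolaus Dosch, Nickolaus"),
)

# Two researchers contribute to the same English translation (placeholder text).
save_translation(rid, "en", "first draft (placeholder)")
save_translation(rid, "en", "revised draft (placeholder)")

count = conn.execute(
    "SELECT COUNT(*) FROM translation WHERE record_id = ?", (rid,)
).fetchone()[0]
body = conn.execute(
    "SELECT body FROM translation WHERE record_id = ? AND language = ?",
    (rid, "en"),
).fetchone()[0]
print(count, body)
```

A real system would presumably also keep an edit history and contributor names; those are left out to keep the sketch short.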

Option 2: In practicality though - I doubt Familysearch will do this since they have enough just to index the records for searchability - the power in this idea though is that the 'researcher' would be contributing the translation. So - as Julia mentions - it is probably easier for the 'researcher' to attach their transcript to the record Description/Notes. But examining that - perhaps a Description/Notes template/selection could be specifically developed for translations there or nearby. Perhaps one with the AI generated segments/box fields for each word/line - thus binding it to the image/green highlights - then each translation could be referenced - perhaps by green pad/record icon instead of blue/grey - when viewing the image index. But then it would have to be determined where to fit that on the 'record' page - perhaps another caret/chevron for 'Translation(s)/Transliteration(s)' on the left - above Citation?

genthusiast · February 26, 2022

Option 2 graphic above - showing proposed method of indicating Transcription/Translation(s) tabs (in previous post). Perhaps different colored text could indicate whether Transcription/Translation were incomplete (red text for not complete, green for complete?).

Below using Source Box for this example since https://www.familysearch.org/ark:/61903/1:1:QPDN-G4JQ is not currently attached in Tree - but would be similar ... there would need to be a template for:

1. Transcription (original language) which once completed could then unlock ...

2. options to add template/field for other language translations

Julia Szent-Györgyi · February 26, 2022

There is a technique or presentation used by many of the online newspaper repositories, where the scan (the PDF or image file) is overlaid on top of the OCR-generated text. The text is not generally editable, but it can be copied-and-pasted, and searched, just like any other document, so it appears to the user as if the image itself can be selected and searched.
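One plausible way to picture that hidden text layer is a list of words, each carrying the rectangle it occupies on the scan, so that a text search can answer with image regions to highlight. The sketch below is only an illustration in Python: the `Word` class, the `find_boxes` function, and the pixel coordinates are all made up for the example, and it does not describe how any particular newspaper site stores its data.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized or transcribed word and its box on the image (pixels)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def find_boxes(layer: list[Word], query: str) -> list[tuple[int, int, int, int]]:
    """Return the (x, y, width, height) box of every word containing the query."""
    q = query.casefold()
    return [(w.x, w.y, w.width, w.height) for w in layer if q in w.text.casefold()]


# Hypothetical text layer for one line of a record; coordinates are invented.
layer = [
    Word("Nickolaus", 40, 10, 90, 18),
    Word("Dosch,", 135, 10, 60, 18),
    Word("Nickolaus", 200, 10, 90, 18),
]

hits = find_boxes(layer, "dosch")
print(hits)  # the viewer would draw a highlight over each returned box
```

The same structure would work whether the words come from OCR or from a human transcriber; only the source of the `text` values changes.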

OCR technology is continually advancing, and they're working on getting it to "read" handwriting, but most of what's out there currently is Not There Yet. The newspaper sites are full of gibberish generated by character-recognition run amok. However, the speed and capacity of the automated process so far outstrips what a human could do that none of the repositories employ human transcribers (that I know of).

A means of generating a text overlay containing a transcription of an image (be it That Dratted German Handwriting or something easier) would be interesting, and potentially useful, but I'm not sure it'd be the best use of FS's most highly-skilled volunteers. Indexing is slow enough already.

genthusiast · February 26, 2022

@Julia Szent-Györgyi

Yes the binding of the transcription/translation to the image would be much like OCR / AI processes FamilySearch is currently using. I didn't include an example graphic showing that or what I envision exactly (first AI graphic is the closest) - but you can see that whenever Edit is available on any record - though that process involves human/manual segmenting/highlighting of the relevant image. So the idea I am illustrating here - on the hidden Transcription/Translation(s) tab would contain the word/line segment fields that are AI generated and any entry by human transcribers/translators. Of course for a finished translation there might not be word for word binding - maybe more like sentence for sentence - so maybe an intermediate word for word binding might be yet another step. Or else minimally a way to highlight Name, Date, Place, Relationship words. So, the features mostly already exist in FamilySearch platform - as shown above - just a few tweaks/additions here and there ... viola (yep the instrument)!

Julia Szent-Györgyi · February 26, 2022

Transcription would be work-intensive enough. Translation introduces many extra complexities, and it makes no sense whatsoever to "bind" it in any way to the image. Word-for-word translations are often impossible, and basically useless anyway.

(Here's a snippet of word-for-word translation: [Us] [father-our] [who] [you-singular-informal-are] [the] [heavens-in].... You probably recognize what it's trying to be, but the key word there is "trying".)

I suggest not going off onto the tangent of translation at all. The idea of image-associated transcription -- whether computer-generated or human-input -- has merit, but there is no need to complicate it with translations. Armed with a correct transcription, people can apply their choice of machine translators, dictionaries, and word-lists to interpret the document, capturing the nuances that a translation can easily obscure.

genthusiast · February 26, 2022

@Julia Szent-Györgyi

"Translation introduces many extra complexities, and it makes no sense whatsoever to "bind" it in any way to the image."

I disagree (imagine that). A translation - especially a finished one (meaning the translation is produced by a person competent in both languages and accounts for nuances) - should obviously have as its source the original document. Thus the translation can be bound to that document. Yes - if not word for word - perhaps sentence by sentence (as mentioned above).

Yes I recognized

The tangent of translation is so much related that I thought it should be addressed. (apologies @Chuck Dosh )

MaureenE123 · February 26, 2022

The online platform for Australian digitised newspapers has a facility to correct the text derived by OCR, which anyone can do.

Here is an article, https://trove.nla.gov.au/newspaper/article/182019371, which shows the OCR text on the left-hand side of the webpage, and here is an article about how to correct text if required: https://trove.nla.gov.au/help/become-voluntrove/text-correction

FamilySearch perhaps could introduce a facility like this in respect of record images, where either OCR indexing could be corrected, or transcribed indexes could be corrected or expanded, or indeed images which have no index attached could have a transcription index added.

With the increased use of indexes produced by the computer, this would seem a very valuable addition.

genthusiast · February 26, 2022

Images above are not passing image moderation very successfully - so here is a snippet of the first:

My initial use of transliteration appears to be incorrect - I probably meant transcription there.

Julia Szent-Györgyi · February 26, 2022

I apologize, I wasn't clear: I don't think it makes any sense to bind a translation to the words on a page. I'm not sure even sentence-for-sentence is worth the effort; paragraph-by-paragraph is plenty, and sometimes, even that gets awkward.

(Keep in mind that I'm fully bilingual. [Typical grocery list on our fridge: tej, kenyér, cornstarch, zeller, hagyma, sliced cheese, cereal.] I am very familiar with the ins and outs and pitfalls of translation.)

But as I said: combining an idea about transcriptions with all of the complications of translation does nobody any favors. They're two different topics and should be considered separately.

genthusiast · February 26, 2022

@Julia Szent-Györgyi

"(Keep in mind that I'm fully bilingual. [Typical grocery list on our fridge: tej, kenyér, cornstarch, zeller, hagyma, sliced cheese, cereal.] I am very familiar with the ins and outs and pitfalls of translation.)"

You seem to have missed my comments above that perhaps a word by word translation might be an intermediate step? If the language can be translated at all obviously there must be some interpretation of words. tej=milk?, kenyér=bread?, zeller=celery?, hagyma=onion? I don't understand your point in using words that can be translated individually - and could obviously be bound as individual words? If sentences cannot be translated effectively then I doubt paragraphs can. Yes I agree there could be complexities/nuances - but words make up sentences and sentences convey ideas that can be grouped into paragraphs (which should be aspects of the same idea). And to translate one must have a method to do that equivalent in the other language ...? A document should make this process easier for someone proficient in both. (deja vu - no diacritics) It is the trivial case that a translation should bind to the document/paragraph - but yes perhaps that is the level at which it should be.

"But as I said: combining an idea about transcriptions with all of the complications of translation does nobody any favors. They're two different topics and should be considered separately."

(Again apologies @Chuck Dosh) But no I don't think discussion of ideas does nobody any favors - it's all part of the same process/idea being discussed here. Once words are digitally reproduced - creating a transcription - the next logical step (if not fluent in that result) is to translate it. Yes, you have a very good point that once the transcription exists one can use another application of choice to translate it. But the discussion/expansion of the idea to include translation within the platform is very relevant. After all we are concerned here about reproducing content of genealogical documents. I see many requests here in Community for translation ...

If the translation discussion that I have contributed is not wanted - ignore that part and accept the relevant parts. To me they are so closely related - that the contributed ideas are relevant to both (thus my so frequent use of Transcription/Translation(s)). For example - where in the platform to put said Transcription (the ideas are so similar they can be discussed together). So if you don't like my word Translation - just replace it in your reading wherever it appears - with Transcription.

Note: As a result of Julia's complaint - I removed my expansion on image copy processing - but still retain the expansion to include Translation. But here is the conclusion I came to:

Conclusion: This would involve a lot of development just to replace the user entering their Transcript/Translation in Source: Notes, Source Box: Notes or in Memories>Documents (size limitations there are known). So no, I don't expect FamilySearch will pursue these. But I'm just an idea guy - so the exercise of thinking about these types of platform development paths is fun ... there are further features I have thought about in relation to this idea ...

Julia Szent-Györgyi · February 26, 2022

Sentences can generally be translated just fine, but they don't always map to the original: what can be said in one sentence in one language might be most effectively rendered as two (or more) sentences in another language. (Consider for example those paragraph-length sentences that German Officialese is so good at.)

It's not discussion that does nobody any favors, it's combining two ideas into one discussion that's the problem.

That Australian newspaper site that Maureen posted about is nicely done, although I found it somewhat disconcerting when the image automatically moved if I went to a different section of the transcription. It is of course OCR-based, full of errors like 'l' instead of 'i' and quote marks for stray specks in the newsprint, but it does make me wonder: has anybody done anything remotely like it for images of handwritten text? With the current state of OCR, the transcription would likely have to start out blank, or with just the pre-printed parts filled in (for printed forms filled out by hand), but the large-scale parsing could likely be automated. (What I mean by that is the way the newspaper site had the one article highlighted, for example.)

(And the grocery list wasn't meant as an example of translation difficulty, but as a demonstration of the level of bilingualism observable in my day-to-day life. I write things on the list in whatever language happens to come to mind at that moment.)

genthusiast · February 27, 2022

Sorry for multitasking with ideas - perhaps it's just ADHD...

From my experience newspaper print can generally be OCRed fairly well - but I have seen tons of gibberish too. Perhaps AI involvement has improved results. The FamilySearch AI example I included first (Brazilian handwritten record) shows it can largely segment records (though the graphic shows overlap from what I can tell). It also apparently indicates AI can determine names fairly well. I have seen examples of FamilySearch AI interpretations being gibberish too. So I don't really know how good AI has gotten at interpreting handwriting.

That's why in the above contributions - I think AI could handle the segmenting of words in the document and record segmentation - but then include a transcription option for human entry on those 'fields' (much like indexing). Of course I'm arguing that the same structure could be used to do a translation (once the transcription is produced) - and therefore should be considered at the same time as well. But if other platforms are preferred for translation then ignore that. I just know there are tons of translation requests here in Community - and if built into the platform - interested/proficient translators could go ahead and work on those.

The graphic of the Source Box never came through above - so you can't see the side-by-side transcript/translation fields I was proposing (just use your imagination). But the main point is to allow a transcription field/blob in the database - then that can be presented in the platform wherever it fits (Sources: text field, Source Box: text field, and Documents are all good options). But if word segments were converted to fields then some other database structure or object might need to be used. I would prefer the Sources option - I don't recall the Source Box size limitations, if those came into play.

MaureenE123 · February 27, 2022

"It is of course OCR-based, full of errors like 'l' instead of 'i' and quote marks for stray specks in the newsprint, but it does make me wonder: has anybody done anything remotely like it for images of handwritten text?"

FS video https://www.familysearch.org/rootstech/rtc2021/session/insights-in-archives-and-computer-assisted-indexing or https://www.youtube.com/watch?v=k78HooANi60


New · Last Updated February 23, 2022
