Filter Last Names
I love how search results filters have been moved forward in the new Search interface. The search bar on the filter menu is a very nice enhancement. Please add a filter: Filter by Last Names similar in concept to Filter by Collection.
This filter would produce an extremely useful list of name permutations: stems, soundalikes, lookalikes. So often brick walls are due to spelling changes.
Here is a use case. Inwinkelried is a very rare surname with just 32 exact matches in FS historical records, but 648,436 non-exact matches. How to find Inwinkelried and variant needles in a haystack this large? Doing this by hand is not practical.
Inwinkelried 32
Imwinkelreid 20
Winkelried 371
Winkelreid 57
Winkler 451,796
Winklereid 7
Winkleried 9
. . .
Couldn't this be implemented easily using the filter functionality?
[Edited to remove a tangent.]
Comments
-
The biggest problem with this idea is that different languages have completely different patterns, and FS has no way of knowing what language it should be "thinking" in for any particular search. As far as I know, the search algorithm for surnames uses a variation on Soundex, which is optimized for the American melting pot, but often fails miserably on said pot's less-common ingredients.
Also, are you aware of wildcards? (Asterisk for any number of characters, including none, and question mark for a single character.)
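For anyone who has not used them, the wildcard semantics described above can be tried out offline. This is only a sketch: it uses Python's `fnmatch` module, whose `*` and `?` behave as described here, and the pattern and sample names are illustrative. It does not reproduce whatever extra rules the FS search may apply.

```python
from fnmatch import fnmatchcase

# Illustrative pattern and names; fnmatch only approximates the site's
# wildcard behaviour (it also gives "[" a special meaning).
pattern = "*winkel*"
names = ["Inwinkelried", "Imwinkelried", "Winkelried", "Winkler"]

def matches(name: str, pattern: str) -> bool:
    # Lower-case both sides so the comparison ignores case.
    return fnmatchcase(name.lower(), pattern.lower())

# "*" may match nothing at all, so Winkelried is a hit; Winkler is not,
# because it does not contain "winkel".
hits = [n for n in names if matches(n, pattern)]
print(hits)
```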
1 -
How FS does matching, their version of a Soundex, is a rabbit hole we do not need to go down right now. Often, I find it much more immediately important to get a list of names in the results.
Yes, I have used wildcards here since I first arrived. This idea is in addition to phonetic matching and wildcards or (dreaming) regular expressions.
This idea is that, given record search results, the user could see a list of names and count of records matching each name. Like the Collections filter, but for names.
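To make the request concrete, here is a minimal sketch of the kind of list I mean, assuming the search results were available as simple (given name, surname) pairs. The records and the function name `surname_facet` are made up for illustration; this is not how FS stores or filters anything.

```python
from collections import Counter

# Hypothetical search results: (given name, surname) pairs.
results = [
    ("Anna", "Inwinkelried"),
    ("Josef", "Imwinkelried"),
    ("Maria", "Inwinkelried"),
    ("Karl", "Winkelried"),
]

def surname_facet(records):
    """Return (surname, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter(surname for _given, surname in records)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for surname, count in surname_facet(results):
    print(f"{surname} {count}")
```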
0 -
Hi, sorry, me again.
How about this?
See: https://community.familysearch.org/en/discussion/104666/new-search-tips-and-tricks#latest
@dontiknowyou It finds lots of Imwinkelried, and also some Juwinkelried! Is that a real name, or did somebody misread 'In' as 'Ju'?
2 -
Might need to work on the matching criterion ....
1 -
On a connected note: I was surprised to discover that when I set the preferences in the new search to Data Sheet format, customize it to remove all additional data (leaving only the person and their basic vitals in the results), and then choose to export the data, the entire data set is exported, not the reduced set.
I had presumed that the export would be what I had customized.
1 -
Questions about spelling of surnames are a big part of any surname study. Are the spelling differences variants (evolution) or deviants (phonetic spelling, transcription errors, typos)? Building trees is one way to find out. Building trees in FT tackles most deviants without any extra effort by the researcher, leveraging consilience in the surrounding tree.
That's why I want a list of surnames in a set of search results.
0 -
I like this idea better than my (negate) wildcard idea. Surely a computer can parse and match similar names, which implies that such a filter is feasible, and it would be more powerful than the negate idea. (upvote)
FS has no way of knowing what language it should be "thinking" in for any particular search.
FamilySearch website does have a language setting (bottom of FamilySearch pages) next to COOKIES PREFERENCES. It does know what language the user would like to 'read'.
I was surprised to discover that when I set the preferences in the new search to Data Sheet format, then customize it to remove all additional data, leaving only the person and their basic vitals in the results, If I then choose to export the data, the entire data is exported, not the reduced set.
Export of search results to spreadsheet does export everything. You can hide the columns you don't wish to see.
Building trees is one way to find out. Building trees in FT tackles most deviants without any extra effort by the researcher, leveraging consilience in the surrounding tree.
Are you saying you like to add trees as Unconnected persons?
1 -
I'm convinced that it does need to be a filter within FS. Because, although it's true that you could achieve the same result by exporting the data and then processing it at the user's end, the export would have to be huge to ensure that it included all possible variations.
The only alternative I can envisage might be to allow a simple form of macro scripting to enable the user to "Extract Data" limited to a predefined set.
This might also provide what another user has asked for, sorting results.
Thinking as I'm typing, though, it could be implemented quite easily as a fixed set of routines at the FS end, provided within the preferences section.
It could commandeer the Preferences - Format - Data Sheet - Customize Data Sheet function as a basis and offer it as an option alongside Export Search Results. This would allow the user to select only the data required (with an additional option to use Full Name or to separate First Names and Surname).
I think that the capability to choose "Sort results A-Z" needs to be in the Format section though. The user shouldn't have to export results just to see them sorted alphabetically.
2 -
I'm convinced that it does need to be a filter within FS.
Then please use the Upvote button on the opening post.
About phonetic matching. Whatever FS is using, it isn't the original Soundex (wikipedia). Phonetic indexing algorithms (wikipedia) are a hot research topic; commercial applications are growing rapidly. https://forebears.io has its own algorithm (a trade secret?) and returns lists of names (see screenshot). I use this site to plan searches on FS. But forebears.io lists are only an approximate solution, because they are not built on FS historical records.
Steve Morse is a developer of phonetic indexing algorithms that take language origins into account. His https://stevemorse.org demonstration of the algorithm on Ellis Island and other immigrant passenger databases is a goldmine of names and misspellings for genealogy and surname research.
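For reference, here is a minimal sketch of the original American Soundex mentioned above. It says nothing about what FS actually runs; it only illustrates how a purely phonetic code behaves on the names in this thread.

```python
# Letter groups of classic American Soundex.
GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}

def soundex(name):
    """Return the 4-character classic Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]

for n in ["Inwinkelried", "Imwinkelried", "Winkelried", "Winkler"]:
    print(n, soundex(n))
```

In this sketch Inwinkelried and Imwinkelried share a code, while Winkelried gets the same code as Winkler. The first letter is kept verbatim and only the first few consonants count, which is one reason a list of the names actually returned is still needed.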
Returning to FS and the Search interface. Regardless of the algorithm that FS uses in Search, Find, and Record Hints, many users of Search still need a list of all names returned in search results.
0 -
OK, it sparked my interest, so I thought I'd see what I could do with what's currently available.
What I did: set search to display 100 per page. In More Options, Preferences, Format, select Data Sheet, then Customize Data Sheet, and turn off all additional data.
Search for surname *??winke?rie?
Now instead of exporting results, I used the mouse to select everything from Name, to the end of the last person record, then copied and pasted into a text file. I ran this file through the unix stream editor 'sed' with a rough and ready match to pull out all the surnames and then passed them through sort and another unix tool 'uniq' to get the unique results.
There were 14 variants matching the wildcard criterion above in the first 100 search results. I then copied content from the remaining 3 pages of results into the same text file, and did the same again. Here's the result..
Boero-Imwinkelried
Boeroimwinkelried
Fenwinkebried
Fenwinkelried
Finwinkelried
Imwinkebried
Imwinkebrier
Imwinkeiried
Imwinkelried
Inewinkelried
Inwinkelried
Irwinkelried
Ivowinkelried
Junwinkelried
Juwinkebried
Juwinkelried
Smwinkelried
Surwinkelried
Tenwinkelried
Timwinkelried
Tinwinkelried
Truwinkebriel
Ynswinkelried
Zuwinkelried
These three managed to avoid detection until I started looking:
Bmwinkelried
Inwinkelrie
Imwinkeirted
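For anyone without sed to hand, the extract, sort and uniq steps described above can be sketched in a few lines of Python. The pasted text below is made up, and the "rough and ready" match is my own assumption (any word containing "winke"); the real Data Sheet layout may need a different pattern.

```python
import re

# Hypothetical pasted text; the real Data Sheet layout may differ.
pasted = """\
Name
Anna Maria Imwinkelried
Josef Inwinkelried
Peter Juwinkelried
Anna Imwinkelried
"""

# Rough match: any word (hyphens allowed) containing "winke", case-insensitive.
SURNAME = re.compile(r"[A-Za-z-]*winke[A-Za-z-]*", re.IGNORECASE)

def unique_surnames(text):
    # The set plays the role of `uniq`; sorted() plays the role of `sort`.
    return sorted(set(SURNAME.findall(text)))

print(unique_surnames(pasted))
```

Note that this pattern is looser than the wildcard search itself, so it only tidies up what the search already returned.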
2 -
Basically what I do, but even with regular expressions, grep, sed, awk, uniq, vi search and replace, even perl scripts, it is still a tedious chore. Which is exactly why extracting a list of names needs to be a tool built in, so everyone can use it.
Here is one of my lists of variants and deviants (Guild of One Name Studies jargon):
Amhof
Amhuf
Amhuff
Earhuff
Einhoff
Einhuff
Emhof
Emhoff
Emhoof
Emhough
Emhuf
Emhuff
Emkoff
Emoff
Emtruff
Enchoff
Enhoff
Enhuff
Erhuff
Ernhoff
Ernhuff
Eruhuff
Euhuff
Eunphuff
Finhoff
Hemenhoff
Hemhoff
Hemhuff
Hemmhoff
Heunnhoff
Humoff
Imhaf
Imhoaf
Imhof
Imhoff
Imhoof
Imhooff
Im Hooff
Imhuf
Immhof
Immhoff
Immhooff
Inhofe
Inhoff
Inhoof
Iruhoff
Iuhoff
Iuhuff
Omhoff
Omhoof
Omhuff
Umhoff
Umhoof
Umhuff
Ymoff
. . . And still very incomplete. After generating this list I added several more spellings, and I know I have under-sampled the variations split in two, similar to Im Hoof. Iruhoff and Iuhoff strike me as training data for an evil twin of phonetic indexing: visual indexing. "ru" is a common misreading of "m" and of course "u" is a common misreading of "n".
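A minimal sketch of what "visual indexing" could look like, using only the two misreadings named above. The table `LOOKALIKES` and the function name are hypothetical; a real table would be far larger and would depend on the script and handwriting style.

```python
# Confusion pairs taken from the examples above: the letter as written,
# and what a transcriber might read instead.
LOOKALIKES = {"m": ["ru"], "n": ["u"]}

def lookalike_variants(name):
    """Return spellings that differ from `name` by one misread letter."""
    variants = set()
    for i, ch in enumerate(name):
        for replacement in LOOKALIKES.get(ch.lower(), []):
            variants.add(name[:i] + replacement + name[i + 1:])
    return sorted(variants)

print(lookalike_variants("Imhoff"))  # ['Iruhoff']
print(lookalike_variants("Inhoff"))  # ['Iuhoff']
```

Each generated spelling could then be fed back into a search as an extra candidate.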
0 -
By the way, I am hoping FamilySearch uses or soon will use Family Tree as training data for its name indexing algorithms.
So far, FT hints do not seem to know about look-alike transcription errors. I am having to search historical records by hand to find look-alike spelling variations such as Lerois for Lewis. I generate a list of possible variations mostly by crawling around on the FS research wiki, reading pages about reading handwriting.
There are two ways FS could leverage contributor work to build such pattern matching algorithms:
- Compare names on FT to names on attached historical records.
- Collect our individual edits to FS historical record indexing errors. (We are of course also training the next generation of OCR indexing.)
Providing lists of names in search results would support this very important infrastructure work.
0 -
Keeping the topic live, and reiterating my earlier "it does need to be a filter" observation.
1: It is not possible to 'automate' (Mac speak), create a macro (Windows speak), or write a script (generic) to extract the data at the user's end. The reasons are primarily those of browser security. macOS specifically prevents scripts run through the 'Automation' function from manipulating the web page elements that hold the data the OP requires. I presume that Windows and Android will implement the same restrictions.
2: The ability to search records is not available via the API. Although in theory it 'might' be possible to fabricate a mechanism to automatically collect and process data obtained from a machine-conducted search using the normal web interface, to attempt to conduct such an activity would be more likely to cause significant adverse reaction, if not damage.
Conclusion: The desired result can only be accomplished by implementing a filter at the FS end, i.e. in the 'improved' search, or its improved replacement....
1 -
Going back a ways in the discussion, someone said:
FamilySearch website does have a language setting (bottom of FamilySearch pages) next to COOKIES PREFERENCES. It does know what language the user would like to 'read'.
The interface language has absolutely no bearing on the language of the records one is searching through. I leave the interface set to English, but the names are in Latin, Hungarian, German, Slovak, or sometimes a mixture (like Schuszter), never in English.
It'd be nice if one could do true phonetic pattern-matching, but sometimes, the intended phonetics are impossible to determine. For example, the family now pronounces my great-grandmother's surname as [hɛjtlɛr], because that's what the usual spelling of "Heitler" comes out to in Hungarian, but the original German would've been more like [haɪtlɛr], and I don't know which one she or her parents used. My mother and her sisters do not have the kind of ear for language that would allow them to remember -- or even notice -- such a detail. (Remember that scene in The Little Mermaid where the crab whispers "are-ee-el" and the prince automatically hears "air-ee-el"?)
And then there's misreading-based pattern matching: my aforementioned great-grandmother's baptism was originally indexed as Keiszer. This is a mix of "understandable/usual" (K versus H), "not impossible" (s versus t), and "huh?" (z versus lower-case L). I shudder to imagine the size of the database required to make any sense of the possibilities, and there are some patterns that are so entirely context-dependent that I'm not sure a unified approach can ever even work. For example, in English, that B-like thing is probably a B, but in German, it's much more likely to be ß. Similarly, in Latin letters, 'e' is highly unlikely to be mistaken for 'n', whereas in That Dratted German Script, the two letters are functionally identical.
What it boils down to is that I'm not sure a unified database approach to finding patterns in names has any utility. I'm thinking that what we need is a highly flexible character-level search (such as regular expressions) combined with Wiki pages compiling useful search terms and strategies for each different context. Of course, determining useful context categories also gets very fuzzy very quickly.... Language Is Hard, no matter which way you turn it.
2 -
To put some of this in layman's terms:
1. It'd be nice to have a surname filter, provided by FamilySearch, with the processing done on the FamilySearch end.
2. This still could not provide full search results because language patterns may not be effectively captured by the originating Search.
Conclusion: For those languages where the Search can capture and sort possible variants etc. - this would be a very helpful feature. I recommend an upvote.
Comment: The Search> More Options> Preferences
Language Options:
Translated Text (should help understanding in desired translated language)
Original Text (should help if the original language is spoken)
0 -
I still want this feature. It is very important for assembling the tree because spelling differences, both variants and deviants, contribute so much to brick walls.
Here is just a tiny portion of the variants list generated for just one surname study in which I participate.
0