Search needs Regular Expressions!
LegacyUser
✭✭✭✭
Steve Tomer said: I'd like to be able to use regular expressions to do searching. I'm guessing it's probably allowed by the back-end database, just not by the user interface. I know you can use the wild cards * and ?, but this is insufficient in many circumstances.
Here's a concrete example:
I'm looking for the surname Greenbank, including its variants.
Acceptable are:
Greenbank
Greenbanke
Greenbanck
Greenbanks
Greenback (This is a common mis-indexing of Greenbank)
Not acceptable:
Greenbaum
etc.
So I'm limited to Greenba* (too many hits, since it also matches Greenbaum) and Greenban* (too few, since it misses Greenback).
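As a rough illustration of the Greenbank case, here is a minimal Python sketch of a pattern that accepts the listed variants and rejects Greenbaum. The pattern and the helper name `matches_greenbank` are assumptions made for this example; nothing here reflects how FamilySearch actually searches.

```python
import re

# Hypothetical pattern: "Greenba" followed by nk, nke, nks, nck, or ck.
GREENBANK = re.compile(r"Greenba(?:nk[es]?|nck|ck)", re.IGNORECASE)


def matches_greenbank(surname: str) -> bool:
    """Return True only if the whole surname is one of the accepted variants."""
    return GREENBANK.fullmatch(surname.strip()) is not None


names = ["Greenbank", "Greenbanke", "Greenbanck", "Greenbanks", "Greenback", "Greenbaum"]
accepted = [n for n in names if matches_greenbank(n)]
print(accepted)
```

Using `fullmatch` rather than `search` keeps longer names that merely start with "Greenbank" out of the results.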
Another example:
The surname Voltrauer is in my family. This name has been written and indexed in so many different ways that searching for them all is unbelievably challenging.
The first character is either V, W, or F, depending on whether the record was written in German, Hungarian, or Latin.
The second character is O, but often misindexed as A.
There can be one or two L characters next.
The next character T can also be the two characters DE.
The next two characters are almost always RA.
The next character is U, but frequently misindexed as N or M.
The last two characters are usually indexed as ER.
Using the current wild cards, all I can do is ??L*RA?ER, which generates way too many hits on tons of things that aren't Voltrauer variants. But with regular expressions, getting exactly what you're looking for is so much easier:
[WVF][oa]l{1,2}(de|t)ra[unm]er
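For readers who want to try the idea, the following Python sketch encodes the rules listed above (V/W/F, then O or A, one or two Ls, T or DE, RA, then U/N/M, then ER). The test spellings are made up from those rules for illustration and are not taken from real records; the function name is likewise hypothetical.

```python
import re

# One reading of the rules in the post; matching is case-insensitive.
VOLTRAUER = re.compile(r"[VWF][oa]l{1,2}(?:t|de)ra[unm]er", re.IGNORECASE)


def is_voltrauer_variant(surname: str) -> bool:
    """Return True if the whole surname fits the variant rules."""
    return VOLTRAUER.fullmatch(surname.strip()) is not None


# Illustrative spellings only, constructed from the rules above.
candidates = ["Voltrauer", "Wolltraner", "Falderamer", "Waltrauer", "Voltbauer", "Volkrauer"]
hits = [name for name in candidates if is_voltrauer_variant(name)]
print(hits)
```

The last two candidates are rejected because "Voltbauer" has BA where RA is required and "Volkrauer" has K where T or DE is required.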
I know that regular expressions are not in general use by the novice user, but for a power user, they're amazing. Incidentally, Microsoft Word and Excel have them for searches, so it's not that uncommon.
Thanks,
Steven Tomer
Comments
Jeff Wiseman said: Hi Steve!
I'm no expert in how the system works, but sometimes I'd like the GREP capability too (Global search Regular Expression Print).
However, then I realized that the general Search function here should NOT use anything as explicit as regular expressions. The problem is that you are searching a database that is shot full of incorrectly indexed data. Add to that the multiple languages and foreign location types that are there, and it's probably better to let FS evolve a general search algorithm. An RE search could easily drop many appropriate records simply because of letter mis-indexing.
HOWEVER,
I think that your idea might really be successfully applied to the FILTERS area. In other words, after the main search is completed and you get all of the variations on your initial search criteria, you could THEN turn on an RE filter to fine-tune and reduce your results (sort of like the source category filter does now).
Often I come up with a list of records in a search where there is a huge percentage of a specific item that I would like to EXCLUDE from the search results with some kind of filter before I start walking through the list and examining each item. Parsing using REs could be used to accomplish this.
Also there is a question as to exactly which fields you want to parse. If you put in such a filter, you would need to be able to identify which fields it applied to (given name, surname, both names in correct order of language, etc.)
So yeah, we probably shouldn't have such rigid criteria added to the general search engine, which is already looking for dozens of variants across multiple languages and is even struggling with that at present. But a new Regular Expression type of FILTER on the search results sounds really interesting.
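A minimal sketch of the filter idea in Python, assuming the main search has already returned a small list of results. The `Record` type, its field names, and `regex_filter` are hypothetical and are not part of any FamilySearch API; the `fields` argument covers the question of which fields to parse, and `exclude=True` covers dropping unwanted items.

```python
import re
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical, simplified search result.
    given_name: str
    surname: str


def regex_filter(records, pattern, fields=("surname",), exclude=False):
    """Keep records where any chosen field fully matches the pattern.

    With exclude=True the logic is inverted: matching records are dropped.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    kept = []
    for rec in records:
        hit = any(rx.fullmatch(getattr(rec, name)) for name in fields)
        if hit != exclude:
            kept.append(rec)
    return kept


results = [
    Record("John", "Greenbank"),
    Record("Mary", "Greenbaum"),
    Record("Ann", "Greenback"),
]
wanted = regex_filter(results, r"Greenba(?:nk|ck)")
without_baum = regex_filter(results, r"Greenbaum", exclude=True)
print([r.given_name for r in wanted])
print([r.given_name for r in without_baum])
```

Because the filter runs only on results the main search has already found, the fuzzy matching of the general search is left untouched.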
Lundgren said: Steve,
It is something to consider. We would have to evaluate the amount of load it would add to the search system to apply regexes to billions of records.
Thank you for the feedback.
Justin Masters said: I would think that allowing regular expressions would be EXTREMELY difficult to implement because of the need to do input validation, lest someone enter characters that allow them to dump a database and get access to everything.
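On the validation point, one hedged approach is to treat the user's pattern purely as input to the regex engine in application code and never paste it into a database query. The sketch below assumes that design; the length limit and the function name `compile_user_pattern` are made up for illustration.

```python
import re

MAX_PATTERN_LENGTH = 100  # hypothetical limit chosen for this example


def compile_user_pattern(text: str) -> re.Pattern:
    """Validate a user-supplied pattern by compiling it, rejecting bad input."""
    if len(text) > MAX_PATTERN_LENGTH:
        raise ValueError("pattern too long")
    try:
        return re.compile(text, re.IGNORECASE)
    except re.error as exc:
        raise ValueError(f"invalid regular expression: {exc}") from exc


pattern = compile_user_pattern(r"Greenba(?:nk|ck)")
print(pattern.fullmatch("Greenback") is not None)
```

This only checks syntax and length. It does not guard against patterns that are valid but very slow to evaluate, which would need separate limits.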
Tom Huber said: Very few users will know what to do with regular expression searches. Even after spending over fifty years in the computer industry, I have forgotten the details and would have to research how to use this kind of search. In addition, many users (including my wife) are completely lost with respect to using much of FamilySearch, including searches.
In addition, FamilySearch has limited resources and while this kind of search could be easy to set up, if my memory serves me, the value added may suggest that resources should be better spent elsewhere.
Note, I am not shooting down the idea, and if it were available, it would be great. In fact, there may be separate programs (found in the solutions gallery) that can search the massive databases and that support regular expressions.
In addition, FamilySearch makes its APIs available and it is something you may want to explore. See https://www.familysearch.org/developers/
Tom Huber said: The link to the solutions gallery is at the bottom of this and most pages on FamilySearch.org.
Jeff Wiseman said: Tom,
As I pointed out in my initial reply, it might be useful as a filter. Anyone not familiar with it would just leave it alone. Also, as a FILTER it would be operating on a much SMALLER subset of records.
I wouldn't want RE searching as the main search capability since it is in conflict with how the main search operates. The main search is operating on vague data, much of which has errors so the idea is to extrapolate to try and find related sources that are not well labeled or indexed. REs assume that everything is spelled correctly to start with. That is why I think it would work better as a filter.
But since I can't even get the search to find records that I KNOW are in the database, I'm certainly no expert :-)
(I usually end up going through the catalog and finding them. Since they are indexed, I can then attach them as such. I have found records where the name was indexed correctly--it just wasn't in the name fields!)
Tom Huber said: Mm, yes. One of the problems is that the original records (images) are not being searched. What is searched are the volunteer-created indexes, which contain varying levels of errors.
I don't index because, on the projects I tried, the process was convoluted to the point where I could not get the correct field in the image to line up with the appropriate index field.
I don't know if that is still going on, but I let others do the indexing and am thankful for the work they do.
Lundgren said: Jeff,
You said:
> (I usually end up going through the catalog and finding them. Since they are indexed, I can then attach them as such. I have found records where the name was indexed correctly--it just wasn't in the name fields!)
If you can give us some samples of these records, we would like to see them to understand the problem.
Jeff Wiseman said: Lundgren,
The issue with the misplaced index data was an exception and a ways back now, so I don't have any examples I can go to right now. I will try to remember to make a note if/when I come across more.
I have also occasionally gone into Ancestry and been able to find a source that, ironically, was from FamilySearch. By using the exact name from the index and putting that into the FS search engine (with the exact-match boxes checked on the full name), I was then able to find the source in FS and attach it to the FS PID.
Again, I can't remember which specific source it was. I just remembered these cases because they were unusual.
This discussion has been closed.