Problems and tasks for search
=============================

Path and Lucene hit selection
-----------------------------

Fixes
'''''

- default text bpv function creates searchable text bpv's, both when we're
  just getting the text from the node's URI and when we're getting it from
  the object's first property. Problem cases: search on a call number, or on
  a title (in both cases we get nodes we don't really want for this reason).
  ** SOLVED ** requires testing
- non-repetition and correct results for searching words in places (unravel
  this)
- in some cases Lucene hits that should produce results just don't; try "16"
  (should give results from handwritten notes; doesn't)
- sometimes we would actually prefer the path description to go with the
  value of a property instead of a default text bpv, for example, call
  number or title (see Reference A)
- with snowball stemming, hits on literals but not on text BPVs (try
  "positiva") ** SOLVED ** make sure TextBPV functions are marked
  multilingual as necessary
- possible bug: something wrong with selecting and making similar routes:
  changing results with searches on years ** SOLVED **
- possible bug: what if Lucene's highlighting function returns a partial
  string that cuts our tag in half, leaving only the opening tag?

Additions
'''''''''

- deal appropriately with links between images in associated documentation
  section; this means creating SAM rules for main-level SOCs, too (see
  Reference B)
- deal with search within collection-level descriptions (see Reference B)
- do DO-based ANDing (instead of blind ORing) when we have multiple search
  terms, and limit number of terms
- merge routes that have the same members?
- for search to function correctly we need to complete modelling:
  publications, other objects, other photos, monochrome ** SOLVED **

Friendly route descriptions
---------------------------

- acceptable friendly path descriptions: plural forms of nouns; better names
  for things (such as primary supports); remove certain path references
  (such as "contenido de")
- why is it that sometimes parts of a path just don't appear in friendly
  results? See for example: search for "positivo" ** SOLVED ** : it happens
  when there's no inverse property label, or when the path is marked as
  "equivalent meaning"

New route description rules
===========================

Type of element:

(A) Noun phrase head (number and type of objects found in a main-level SOC)
(B) Adjectival phrase

Types of adjectival phrase (B):

(1) preposition + noun phrase
(2) past participle + (preposition +) noun phrase
(3) relative possessive pronoun ("cuyo") + noun phrase + verb + object
(4) relative pronoun ("que") + verb + (preposition +) noun phrase

Characteristics of noun phrases in adjectival phrases (B) of types (1), (2)
and (4):

- (') indefinite article (always singular)
- (.) singular definite article
- (..) plural definite article
- (#) numeral adjective
- (G) general noun
- (P) proper noun
- (.GP) singular definite article, general noun and proper noun describing
  preceding general noun
- (_que_contiene_T) contained adjectival phrase following the noun: "que
  contiene" + search term
- (_de_T_de_Q) contained adjectival phrase for dimensions, as in "de 3 cm de
  ancho"
- (_>[]) contained adjectival phrase (element type B)

Characteristics of noun phrases in adjectival phrases (B) of type (3):

- (_es_P) general noun, no article, verb is "es", object is proper noun
  (only singular subject)
- (_contienen_T) general noun, no article, verb is "contiene", object is
  search term
- (_contiene_T_en_su_G) general noun, no article, verb is "contiene", object
  is search term, indirect object refers to a property where the term was
  found
- (_es_medido_en_P) just for expressing units of measurement

Highlighting rules:

- If P is the node with the hit, it gets highlighting
- T always gets highlighting

X Innermost adjectival phrase must be B3_es_P, B3_contienen_T, or B1 or B2
with _que_contiene_T

Combining routes may require different descriptions to avoid redundancy (see
inscriptions and titles)

Note that "contiene" phrases only make sense for things that can be seen as
text or as having text (inscriptions, names, titles, descriptions, seals)

Don't forget that we need new krrrs to jump over complex link constructs
(inscriptions, seals, content components)

Rules so far
============

- Special signals are required to determine how to set up the tail (es,
  contiene, de T de Q, es medido en)
- These signals are attached to KRRRs and/or properties
- Different rules for when we have a single hit or multiple hits
- The single hit rule must work together with the rule for creating a
  particular designation of a thing
- "Cuyo" applies when there's only one property allowed; "cuyo hermano" =
  only one brother; though sometimes "cuyo" isn't even an option, even when
  the cardinality is 1

Roadmap
-------

Remove same-RS recursion check from SAM generation*
Increase allowed SAM path length*
Re-verify path selection*
Purge repeated main-level-SOC pass-through
Plan metonymy and inference for search*
Plan NLG descriptions*
Program traversal, metonymy and poor man's inference*
Review all rulesets and SOCs for appropriate krrrs, krrr options and slots on textBPVFunctions
Program simple NLG path descriptions
Program complex NLG path descriptions
Include collections and complex values in SAM paths
Re-verify path selection
Complete NLG path descriptions
Program algorithm for ordering groups in search results

Planning issues
---------------

- title *** OK

  afmt.image_as_content.has_title | rule swv.inscribed_title.has_title_text
  con el título | *first rule capturesSearchHits
  *special BPV function needed for referring term in search
  cuyo título contiene |

- width *** OK

  afmt.photographic_print.has_primary_support | afmt.primary_support.has_dimensions | swv.dimensions_description.has_width
  con soporte primario | | de 16.8 cm de ancho
  *second and third rules should be condensed in one phrase
  *special BPV function needed for referring term in search

- dates *** Almost OK

  afmt.image_as_content.has_date swv.time_range_over_unit.spans
  tomada en 1874 | *first rule capturesSearchHits
  *special BPV function needed for referring term in search

  afmt.photigraphic_print.exposicion swv.exhibition.has_exhibition_date swv.time_range.starts swv.full_date.has_year
  expuesta públicamente en 2005 | *month requires full traversal of complex values

- search for "inscription"
- search for "cm"

Simplified NLG
--------------

- Possible phrases for every property or multiple krrr
- The phrases contain codes allowing their adjustment for the gender and
  number of in nodes
- The phrases may have slots for a referring term of a specific node or the
  search query term
- The phrases belong to one of the following categories:

  - ONE_TAIL_ON_HIT: Single tail, no further chaining possible
  - MANY_TAILS_ON_HIT: Set of tails, no further chaining possible
  - ONE_TAIL_CHAINING: Single tail, further chaining is possible
  - MANY_TAILS_CHAINING: Many tails, further chaining is possible

Planning rules
--------------

1. In actual paths (not rules or routes) find tree structures from results
   to hits
2. Traverse the trees from trunk outward
3. At each bifurcation, if there are any hits on a TextBPV AND hits from
   anywhere further along a path that bifurcates from this point along a
   slot indicated for that TextBPVFunction, suppress the hit on the TextBPV
4. Keep a TextBPV ONLY if there are no hits along paths that begin with a
   slot that it indicates
5. A slot marked capture means that a hit coming from a path along that slot
   will not actually be described as such; rather, the highlighting from
   that hit will be taken into account, the capturing TextBPVFunction will
   be re-executed, and the hit will appear to come from the node associated
   with that function.

For 1 and 2, one can simply index all paths by the krrrs they traverse; then
for each krrr, if there is more than one path that traverses it, see if there
is more than one path or hit at the next level; if there is, consider it a
bifurcation in the terms mentioned above; if not, leave it alone.

The above can actually be forgone if we just make function-generated
TextBPV's unsearchable. This is fine considering how multiple search terms
are handled (no phrase searching).

To avoid apparently-not-useful paths that traverse associated documentation
relations, perhaps make SAMRules unable to go beyond a point with a resource
that is in a main-level SOC. (Note that we may have to implement this on
concrete paths, not on rules, since the SAMRules don't really know for
certain exactly what they're going to find.)
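Planning rules 3 and 4 above can be sketched roughly as follows. This is only an illustration under assumptions: the ``Hit`` record, its fields and the function name are hypothetical and not part of the existing code, and a concrete path is reduced to the tuple of KRRR names it traverses from the result node to the hit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    # Hypothetical record: KRRR names traversed from the result node to the hit node.
    path: tuple
    # True when the hit is on a function-generated TextBPV.
    is_text_bpv: bool = False
    # KRRR names (slots) the TextBPV function draws its component strings from.
    slots: frozenset = frozenset()


def suppress_redundant_text_bpv_hits(hits):
    """Drop a TextBPV hit when another hit lies further along one of its slots."""
    kept = []
    for hit in hits:
        if hit.is_text_bpv and any(
            len(other.path) > len(hit.path)
            and other.path[: len(hit.path)] == hit.path
            and other.path[len(hit.path)] in hit.slots
            for other in hits
        ):
            # Rule 3: a deeper hit exists along an indicated slot, so suppress this one.
            continue
        # Rule 4: no hit begins with one of this TextBPV's slots, so keep it.
        kept.append(hit)
    return kept
```

For example, a TextBPV hit on the node reached by ``("has_title",)`` with slot ``has_title_text`` would be dropped if there is also a hit at ``("has_title", "has_title_text")``, while an unrelated TextBPV hit under ``("has_date",)`` would be kept. Rule 5 (capturing slots and re-executing the function with highlighting) is not covered by this sketch.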
Multiple search terms
---------------------

- On its own Lucene is doing simple ORing; phrase search appears not to work
- We will strip non-word characters out of the search (perhaps except for
  \*), divide up the query into terms, perform a separate search for each
  term, and then combine the results; thus terms will be ANDed, not within
  the same phrase but within the same document.
- This means we have no need for searching within code-generated TextBPVs
  for now; we will need it in the future, when we implement phrase search

Language changes
----------------

In MultipleKRRRs, instead of equivalentMeaning, we'll have traverseInSearch,
menotymizeInSearch and compactInSearch (the last option being a sort of poor
man's inference)

In MultipleKRRRs and Properties, we'll have adjPhrase elements, with types of
phrases and text. The types will be hardcoded in Java and will have varying
chaining and use properties. The text will include special codes for
agreement with gender and number.

In TextBPVFunctions, we have slots that indicate the KRRRs that component
strings are coming from. These are needed in order to know when to re-execute
functions with highlighting. Slots may have the CaptureSearchResults option,
which will make the NLG take them instead of results along the slots' path.
(For appropriate referring term construction.)
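The per-term strategy described under "Multiple search terms" above could look something like the sketch below. It is an assumption-laden illustration, not the actual implementation: ``search_one_term`` stands in for whatever runs a single-term Lucene search and returns document identifiers, and ``MAX_TERMS`` is a made-up value for the planned limit on the number of terms.

```python
import re

MAX_TERMS = 5  # hypothetical limit on the number of search terms


def split_query(query, max_terms=MAX_TERMS):
    """Strip non-word characters (keeping '*') and split the query into terms."""
    cleaned = re.sub(r"[^\w*]+", " ", query)
    return cleaned.split()[:max_terms]


def search_all_terms(query, search_one_term):
    """Run one search per term and AND the results at the document level."""
    terms = split_query(query)
    if not terms:
        return set()
    results = None
    for term in terms:
        docs = set(search_one_term(term))  # search_one_term is a stand-in callable
        results = docs if results is None else results & docs
    return results
```

Because the intersection is taken over documents rather than over positions in the text, two terms match whenever both occur somewhere in the same document, which is the "ANDed, but not within the same phrase" behaviour described above.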
Interface and boilerplate code
------------------------------

- make temporary group and appropriate server objects
- filter Lucene input to avoid errors
- make interface, including 0 results message
- make highlighting mechanism (includes interface and back-end components;
  note complexity involving passing highlighting _through_ a textBPV
  function)

Wishlist
--------

- make rdf:type relations traversable for creating search results (search
  for "inscription")
- include "union or intersection" button for search groups

Highlighting
------------

- all textBPV functions require full mention of the slots they include
- each textBPV function determines slot dependencies, through a recursive
  function
- rule manager indexes textBPV functions by their dependencies
- When executing textBPV functions, lists of concrete textBPV dependencies
  are drawn up
- Highlighting creates a set of BPVs with hits and their values as modified
  to highlight the hit
- When templates are executed with highlighting, they check which fields use
  a function dependent on a function associated with a hit
- The results for those fields are checked for dependencies on concrete
  BPVs
- Said templates re-execute textBPVFunctions as necessary, to get versions
  that include the highlighting
- Variable desc templates also add fields as necessary, to show fields with
  highlighting

** problem: slots can only designate default bpv functions from _other_
rulesets. But places has a slot for a function from the same ruleset

possible solution: transfer slot declarations from .def language to code in
Ruby blocks (?)

***Remaining issues***
----------------------

- Merge publication and other objects
- Finish modelling associated documentation, incl. "misma composición" (same
  composition: Briquet, ibero), "reprografía" (reprography: charnay), "misma
  serie" (same series: ibero), "mismo negativo" (same negative: óvalo)
- Flesh out modelling institutions, esp. dependencies of the UNAM
- Model taken to and from
- Add more materials
- Fix diverse errors in associated documentation (...only 11 photos from
  calendar?)
- Review all content, compare with Excel files
- Issues with vectors (dates, measurements, locations in objects)
- Metonymy and phrases for multiple krrrs in branches following complex
  values
- Order of groups*
- Multiple hits on the same result set in groups: combining adj phrases,
  order of adj phrases within group description*
- Multiple search terms*
- Limiting number of search terms*
- Finish interface details*
- SAM rules that traverse complex values
- Search on collection field
- Search in collection record
- Search results summary
- Highlighting*
- Improve specification of heads of phrases following chaining phrases: take
  into account actual number and gender for intermediate nodes
- Special bpv function type for referring terms in adj phrases (for places,
  may also help with article problems)
- Article problems for: institutions, places
- Article (and class name) problems for tail terms in chaining phrases in
  associated documentation
- Fundamental synonyms, esp. institutional and place acronyms such as AGN,
  DF, SINAFO, UNAM
- Some unexpected stemming errors/incorrect languages on TextBPVs
- search on term* gives funny highlighting and reference terms
- no highlighting in group list for search on call number
- accent problem in descriptions

Make it work roadmap
--------------------

- onlySearchShowable fields ** OK
- grouping in interface on search results ** OK
- selecting two groups in search results ** OK
- highlighting in full desc on search results ** OK
- combine routes for single term search ** OK
- route and group ranking, ordering ** OK
- ungrouped search results (interface and server) ** OK
- search results summary ** OK
- everything collection ** OK
- multiple search terms
- prepare explanation
- search box details -- little bug
- error: group to group then back group by text doesn't change -- little bug
- text for designation of group of type of result in full description state
  -- little bug
- reduce region jumpiness further -- little bug
- descriptions get cut off when highlighted -- medium bug
- remove grouping and ordering options from list of all main-level SOCs
- final details of adj phrase configuration -- medium bug
- switch "todos los registros" for collection name in side panel in Desc
  view when coming from complete catalogue -- little bug
- problems with title ordering: see ordered list, titles beginning with
  "México" -- little bug
- redundant paths when there's one single hit for places, in which the query
  appears in places contained in each other, for example, "Ribera" -- little
  bug
- stemming: palacio and palacios don't stem to the same thing -- little bug
- some places that aren't colonias get "col" prepended to their name at
  times -- medium bug
- arrow keys are moving the whole screen again -- arg! -- little bug

Fixed bugs:

- complete catalogue appears as a collection in record -- big bug
- titles of different types aren't combined in groups of search results