Solr Plugin

Enterprise Search Engine for Foswiki based on Solr

About Solr

Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface.

Installation

The installation procedure below assumes that Solr and Foswiki are installed on the same server running Linux.

Foswiki plugin installation

You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.

Open configure, and open the "Extensions" section. "Extensions Operation and Maintenance" Tab -> "Install, Update or Remove extensions" Tab. Click the "Search for Extensions" button. Enter part of the extension name or description and press search. Select the desired extension(s) and click install. If an extension is already installed, it will not show up in the search results.

You can also install from the shell by running the extension installer as the web server user: (Be sure to run as the webserver user, not as root!)
cd /path/to/foswiki
perl tools/extension_installer <NameOfExtension> install

If you have any problems, or if the extension isn't available in configure, then you can still install manually from the command-line. See https://foswiki.org/Support/ManuallyInstallingExtensions for more help.

Download Solr

The current plugin requires Solr 5.0.0 or later. Download it from the Apache Archive.

Extract software, create user, install system service

tar xzf solr-5.x.x.tgz solr-5.x.x/bin/install_solr_service.sh --strip-components 2 
./install_solr_service.sh ./solr-5.x.x.tgz
service solr stop

Secure Solr access

... so that it only listens on the local loopback interface

  • edit /var/solr/solr.in.sh
  • add SOLR_OPTS="$SOLR_OPTS -Djetty.host=localhost"

Optionally relocate logs

... from /var/solr/logs to /var/log/solr

mv /var/solr/logs /var/log/solr

  • edit /var/solr/solr.in.sh
  • disable garbage collection logs ... GC_LOG_OPTS
  • set SOLR_LOGS_DIR=/var/log/solr
  • edit /var/solr/log4j.properties
  • set solr.log=/var/log/solr
  • set log level from INFO to WARN: log4j.rootLogger=WARN, file, CONSOLE

Install Foswiki configuration set

cd /var/solr/data
cp -r <foswiki-dir>/solr/cores .
mkdir configsets
cd configsets
ln -s <foswiki-dir>/solr/configsets/foswiki_configs 
chown -R solr:solr /var/solr

Updating from a previous configuration set

An updated SolrPlugin might come with a newer configuration set, i.e. newer schema.xml or solrconfig.xml files. Make sure that the files coming with an update are installed on the Solr server as well. This is taken care of automatically when the foswiki_configs directory is linked into the Solr server's configsets directory. Note, however, that any local changes you made to these files will be overwritten by the update. You can either create a config set of your own and adjust the core definition to make use of it, or merge your changes into the standard foswiki_configs set of files.

Start solr service again

service solr start

Test

cd <foswiki-dir>/tools
./solrindex topic=Main.WebHome

... should produce Indexing Main.WebHome

cd <foswiki-dir>/bin
./rest /SolrPlugin/search

... should return a JSON response from Solr showing the recently indexed topic
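
To check the Solr server itself, independently of the Foswiki rest handler, its select endpoint can be queried over HTTP from the same machine. The Python sketch below only builds such a URL. The base URL (Solr's default port 8983) and the core name foswiki are assumptions that need to be adjusted to your installation, and the function name is made up for illustration.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10):
    """Return the URL of a JSON select request against one Solr core."""
    # urlencode takes care of escaping characters such as ':' in the query
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


if __name__ == "__main__":
    # hypothetical core name "foswiki"; adjust to the core you installed
    print(build_select_url("http://localhost:8983/solr", "foswiki", "web:Main", rows=5))
```

Opening the resulting URL with curl on the server should return a JSON document from Solr, provided the core name matches your setup.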

Skin integration

SolrPlugin comes with a skin overlay - called solr - that will replace the upper left search boxes in PatternSkin with a solr-driven auto-suggest search box. To switch that on use

   * Set SKIN = solr, pattern

in your SitePreferences.

Note that you do not need to enable the solr skin overlay if you are using NatSkin, as it comes with support for SolrPlugin out of the box.

Commandline scripts

There is a set of tools to interact with the Solr index from the command line. These can be used to index Foswiki manually, as we did in the tests above, as well as to search for or delete specific documents in the index.

The set of tools comes in two variants: one for normal single-host Foswiki installations and one for virtual hosting using VirtualHostingContrib.

The virtual-hosting aware scripts have a prefix virtualhost-... and take an optional host=<domain> parameter to specify the virtual domain to interact with. When it is not specified, the script is executed for each domain in turn as configured in VirtualHostingContrib. The only exception is solrjob (see below).

solrindex / virtualhosts-index

cd <foswiki-dir>/tools
./solrindex ...

| *Parameter* | *Description* | *Default* |
| web="..." | the web to be indexed; if undefined all webs will be indexed | all |
| topic="<web>.<topic>" | the topic to be indexed; use this parameter to index one specific topic | |
| mode="full/delta" | mode of operation: full will unconditionally index all content as specified by web or topic; delta will only index content that has changed since the last time the script was run | delta |
| optimize="on/off" | optimize the Solr database by de-fragmenting its internal segments for better performance; this is normally not required unless a full indexing of larger chunks of content is performed; note that optimizing the Solr index might require considerable time and I/O resources on the filesystem of the server | off |

solrdelete / virtualhosts-delete

cd <foswiki-dir>/tools
./solrdelete ...

| *Parameter* | *Description* | *Default* |
| <lucene-query> | delete topics matching the query | do nothing |

For instance to empty your index completely use:

./solrdelete '*:*'

solrjob

cd <foswiki-dir>/tools
./solrjob ...

This tool is a wrapper around solrindex and will use either solrindex or virtualhost-solrindex depending on the host command-line parameter. It is mainly used in cronjobs or with iwatch (see below). In contrast to solrindex, a locking and throttling strategy is used to prevent multiple indexers from being started simultaneously. This is useful when firing up the indexer as part of iwatch monitoring filesystem changes in the Foswiki store. As these events often come in bundles firing rapidly in a short period of time, only one indexing process is spawned for a given time span, defined by the throttle parameter to solrjob.

| *Parameter* | *Description* | *Default* |
| -f / --file <file-path> | index the topic that the given file points to | |
| -h / --host <virtual-domain> | specifies the virtual domain to operate on (only makes sense when running VirtualHostingContrib); specify all to perform the operation on all known virtual hosts | |
| -m / --mode full/delta | mode of operation (see solrindex above) | delta |
| -t / --throttle <seconds> | number of seconds to wait until the indexing process is started; note that any other calls to solrjob are prevented from entering the indexing loop as well | 5 |
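
The locking and throttling behaviour described above can be illustrated with a small Python sketch. It is not the code of solrjob; the lock file location and the function name are made up for illustration. The first caller takes the lock, waits for the throttle interval and then starts the indexer once, while callers arriving in the meantime return immediately.

```python
import os
import tempfile
import time


def run_throttled(lock_path, throttle_seconds, index_func):
    """Run index_func once after a delay; return False if another caller holds the lock."""
    try:
        # O_EXCL makes the call fail if the lock file already exists
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # somebody else is already waiting or indexing
    try:
        os.close(fd)
        time.sleep(throttle_seconds)  # let a burst of change events settle
        index_func()
        return True
    finally:
        os.remove(lock_path)  # release the lock even if indexing failed


if __name__ == "__main__":
    # hypothetical lock location; throttle of 0 keeps the demo fast
    lock = os.path.join(tempfile.mkdtemp(), "solrjob.lock")
    started = run_throttled(lock, 0, lambda: print("indexing ..."))
    print("indexer started:", started)
```

A lock file of this kind stays behind if the process is killed before it can remove it, so a production script needs some way to detect and clean up stale locks.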

rest /SolrPlugin/search

cd <foswiki-dir>/bin
./rest /SolrPlugin/search ...

Parameter Description Default

Setting up an indexing strategy

Before you can use SolrSearch and get back results, you need to index your content completely, and then do so repeatedly to keep up with changes in the Foswiki content base. This can be achieved in several ways:

  1. full indexing: index all of the content from start to end
  2. delta indexing: index topics that changed since the last time (delta) indexing was performed
  3. realtime indexing: monitor changes in the Foswiki store and fire up indexing as close to the actual change event as possible
  4. online indexing: index content changes as part of the content being saved

We will discuss these strategies and outline their advantages. A combination of several of them then makes up the recommended indexing strategy for Foswiki content.

Full indexing

./solrindex mode=full optimize=on

This will crawl all webs, topics and attachments and submit them to the Solr server, which will build up the search index. This can take a considerable amount of time depending on the amount of content and the number of users registered on your site, so you may prefer to do it at a quiet time.

Note that full indexing is required the first time you install SolrPlugin. From then on you can use delta indexing to update the index incrementally as content changes in Foswiki.

It is recommended to perform a full indexing no more than once a week, or preferably at longer intervals.

Delta indexing

./solrindex 

This will inspect the whole content base and check for changes since the last time the content was added to Solr. Any updated content will be added to the index as required. The delta indexing procedure will also scan the whole index and delete those documents whose original topic has been removed from the Foswiki content base.

Delta indexing is a relatively fast operation that is best performed every 15 minutes or so. Don't shorten the intervals of delta indexing too much as that would create additional load on the server where no content is found to be delta-indexed.
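
As an illustration of what a delta run has to decide, the following Python sketch compares modification times in the content base with the times at which topics were last indexed. It is a simplified model under assumed data structures (two dictionaries keyed by topic name, with numeric timestamps), not the plugin's actual implementation.

```python
def plan_delta(store_mtimes, index_times):
    """Return (topics to index, topics to delete from the index).

    store_mtimes: topic name -> last modification time in the content base
    index_times:  topic name -> time the topic was last added to the index
    """
    # new topics and topics changed after their last indexing run
    to_index = sorted(
        topic
        for topic, mtime in store_mtimes.items()
        if topic not in index_times or mtime > index_times[topic]
    )
    # documents whose original topic no longer exists in the content base
    to_delete = sorted(topic for topic in index_times if topic not in store_mtimes)
    return to_index, to_delete


if __name__ == "__main__":
    store = {"Main.WebHome": 100, "Main.NewTopic": 250, "Main.Edited": 300}
    index = {"Main.WebHome": 200, "Main.Edited": 200, "Main.Removed": 150}
    print(plan_delta(store, index))
```

Even when nothing has changed, both sides have to be walked completely, which is why very short delta intervals mainly add load.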

Realtime indexing

This mode of operation requires a separate service to be installed such as iwatch. Iwatch is a tool using the inotify kernel service of Linux systems to trigger a script based on events happening on the filesystem such as "file-open", "file-delete", "file-changed", "file-moved" etc. Iwatch lets us hook in the solrjob script (see above) while watching events in the Foswiki data store at <foswiki-dir>/data/

Note that this is only a "near-realtime" indexing behavior, as the script used to perform the indexing is configured to throttle the procedure for a given amount of time, defaulting to 5 seconds. So any change to the content will show up in the index roughly 5 seconds after the event.

The service is configured by placing the configuration below at /etc/iwatch/iwatch.xml:

<?xml version="1.0" ?>
<!DOCTYPE config SYSTEM "/etc/iwatch/iwatch.dtd" >

<config>
  <guard email="root@localhost" name="IWatch"/>
  <watchlist>
    <title>Foswiki</title>
    <contactpoint email="root@localhost" name="Administrator"/>
    <path type="recursive" filter=".*\.txt$" alert="off" syslog="on" exec="su <httpd-user> -c '<foswiki-dir>/tools/solrjob'"><foswiki-dir>/data</path>
    <path type="regexception">\.tmp|\.sw\w|\.svn|\.lease|\.lock|,$|\.changes|,v|^_[0-9]|^log|^Temporary|^UnitTestCheck</path>
  </watchlist>
</config>

For VirtualHostingContrib use:

<?xml version="1.0" ?>
<!DOCTYPE config SYSTEM "/etc/iwatch/iwatch.dtd" >

<config>
  <guard email="root@localhost" name="IWatch"/>
  <watchlist>
    <title>Foswiki</title>
    <contactpoint email="root@localhost" name="Administrator"/>

    <!-- watch directories shared among all virtual domains -->
    <path type="recursive" filter=".*\.txt$" alert="off" syslog="on" exec="su <httpd-user> -c '<foswiki-dir>/tools/solrjob --host all'"><foswiki-dir>/data/System</path>
    <!-- <path type="recursive" filter=".*\.txt$" alert="off" syslog="on" exec="su <httpd-user> -c '<foswiki-dir>/tools/solrjob --host all'"><foswiki-dir>/data/Applications</path> -->

    <!-- watch each virtual domain for changes -->
    <path type="recursive" filter=".*\.txt$" alert="off" syslog="on" exec="su <httpd-user> -c '<foswiki-dir>/tools/solrjob --host <domain1>'"><vhosts-dir>/<domain1>/data</path>
    <!-- <path type="recursive" filter=".*\.txt$" alert="off" syslog="on" exec="su <httpd-user> -c '<foswiki-dir>/tools/solrjob --host <domain2>'"><vhosts-dir>/<domain2>/data</path> -->

    <path type="regexception">\.tmp|\.sw\w|\.svn|\.lease|\.lock|,$|\.changes|,v|^_[0-9]|^log|^Temporary|^UnitTestCheck</path>
  </watchlist>
</config>

Make sure to replace

  • <foswiki-dir>
  • <httpd-user>
  • <domainX>
  • <vhosts-dir>

with the appropriate values on your platform.

Note that in the latter iwatch.xml example for virtual hosting, webs that are shared among all domains (via soft links) must be watched separately, as changes to those directories don't appear as changes to the domains' directories. These are typically the System web and, if you installed WikiWorkbenchContrib and would like to share all wiki apps among all virtual domains, the Applications web.

Online indexing

This mode of operation refers to a way to update the search index immediately as part of the save operation performed by Foswiki on behalf of the user.

The biggest advantage here is that changes to the content base will immediately show up in the search index reflecting the exact changes being made to the content base.

There are a couple of flags to switch on/off online indexing in your configuration.

Enable / disable indexing content as part of a save operation:

$Foswiki::cfg{SolrPlugin}{EnableOnSaveUpdates} = 0;

Enable/disable updates when a new attachment has been uploaded:

$Foswiki::cfg{SolrPlugin}{EnableOnUploadUpdates} = 0;

Enable/disable updates when a topic or attachment has been moved or deleted:

$Foswiki::cfg{SolrPlugin}{EnableOnRenameUpdates} = 1;

Setting up cronjobs

The crontab entries below perform

  • a full indexing every Saturday midnight and
  • a delta indexing every 15 minutes

0 0 * * 6 <foswiki-dir>/tools/solrjob --mode full
*/15 * * * * <foswiki-dir>/tools/solrjob --mode delta 

Tip: add --host all to index all virtual hosts, or --host <hostname> to index a single virtual host.

Recommendations

By now we have several ways to keep the Solr index up to date with changes in Foswiki.

Each of the above methods has its own pros and cons. Also, your own business requirements might significantly influence how and when to schedule crawling the content. Some of the criteria to keep in mind are:

  • size of content base
  • speed of indexing content determined by server resources
  • interactive performance as perceived by the user
  • real-time requirements for updates in search results
  • changes in access control structures such as:
    • new users being registered to Foswiki,
    • changing membership in user groups,
    • changing clearance of user groups for specific content

What to keep in mind for full indexing

Changes in access control structures in particular might affect clearance to content on a broader scale. The indexing procedure caches the current authorization for a specific piece of content along with it. A change to access control, independent of any change to the content itself, therefore renders the access control cached in the Solr index incorrect until this content is indexed again. This is not a problem when the ACL of a single document is altered, as this document is re-indexed as part of the change event. No such re-indexing is triggered automatically when a user group changes or is granted more or less authorization for content. Such changes are only reflected the next time a full indexing is performed.

Access control structures might change entirely outside of Foswiki when using LdapContrib, where users and groups are defined in a remote LDAP database. These user and group records immediately affect how Foswiki grants access to documents (there is some caching involved here as well, but let's ignore this for now). Only after the affected documents have been indexed again will a search on the index include or exclude content consistently with what users can access when visiting the page directly.

Therefore a regular full indexing is required, presumably once a week or once a day during off times.

The runtime of a full indexing run depends on the size of your content base as well as the size of the user base. Both directly affect indexing throughput. It is strongly recommended to schedule full indexing during off times when the system isn't otherwise in use. Also, make sure that two full indexing runs don't overlap, as that would keep increasing the load on the servers involved.

In those cases where a full indexing run over the whole content base exceeds off times (e.g. it starts on Friday night and hasn't finished by Monday morning), you will need to add more server resources. There are multiple ways to do so. Step one would be to use separate servers for Foswiki and Solr. Please read up on how to scale Solr beyond the single-node installation outlined in the configuration above.

Correctness of search index

A search index might show "incorrect" results for example when the content it indexes doesn't actually exist anymore. So users get a positive search hit but won't be able to access the content anymore: both content base and search index are out of sync. Keeping the search index "correct" is of importance for any indexing strategy.

A search index might also be "incorrect" when it doesn't reflect the access rights a user has on the content itself. That is, the search engine shall only return search results for content that the user has clearance for. No search result shall ever be returned for content that the user isn't allowed to access or even to know exists.

In SolrPlugin any Foswiki ACLs are added to the Solr database while content is indexed. So ACLs are checked as an additional filter on any search operation that an authenticated user might perform.
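
A minimal sketch of such an additional filter, assuming it is expressed as a Lucene filter query on the access_granted field, with the special value all standing for unrestricted content. The function name is hypothetical, the restriction to alphanumeric WikiNames is an assumption made here to keep the query well-formed, and the filter generated by SolrPlugin itself may differ in its details.

```python
def acl_filter(wiki_name):
    """Build a Lucene filter that matches unrestricted content and content
    that explicitly lists the given user in access_granted."""
    # assumption: WikiNames are plain alphanumeric words, so no escaping is needed
    if not wiki_name.isalnum():
        raise ValueError("unexpected characters in WikiName: %r" % wiki_name)
    return "access_granted:(all OR %s)" % wiki_name


if __name__ == "__main__":
    print(acl_filter("JohnDoe"))
```

Because the filter is evaluated against what was stored at indexing time, it can only be as current as the last indexing run of each document.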

Correctness of the search index as we discuss it here is more concerned with the time it takes to bring the index back in sync after a content change in Foswiki, that is, until the change has been indexed and added to the Solr database.

There are two general categories for indexing content that we want to compare now:

  • online indexing: index content as part of the interaction performed by the user
  • offline indexing: perform content indexing independent from the user interacting with the system online

Offline indexing is performed by the solrindex script as well as the solrjob wrapper. Both might be used in a cronjob or iwatch as described above.

Looking at online indexing there is a price in doing so that we should keep in mind before switching it on.

Indexing will be part of a save, delete or rename operation performed by the user and thus directly increase the perceived time for the user to interact with the system while applying content changes.

You may decide for yourself how to trade interactive performance against the negative side effects of an "incorrect" search index. It is recommended to accept a short period of time during which the search index is not quite up to date rather than slowing down the interactive performance of the system by hooking the indexing procedure into the online operations of Foswiki.

Using Solr for WebSearch, WebChanges and WikiUsers

It is recommended to replace Foswiki's default AutoViewTemplatePlugin with AutoTemplatePlugin. This will allow you to replace the default WebSearch, WebChanges and SiteChanges as well as WikiUsers with a Solr-driven interface for better usability and performance.

Configure AutoTemplatePlugin by adding the following {ViewTemplateRules}

$Foswiki::cfg{Plugins}{AutoTemplatePlugin}{ViewTemplateRules} = {
...
  'WebChanges' => 'WebChangesView',
  'SiteChanges' => 'SiteChangesView',
  'WebSearch' => 'SolrSearchView',
  'WikiUsers' => 'SolrWikiUsersView',
...
};

The SolrWikiUsersViewTemplate implements a person search driven by Solr. It allows you to facet on properties as defined in the UserForm such as:

  • filter by location
  • filter by profession
  • filter by organization

There is a specific configuration option for Foswiki to detect which topics are actually user profile pages.

$Foswiki::cfg{SolrPlugin}{PersonDataForm} = '(*UserForm)';

Any topic that has a UserForm attached to it will participate in the person search interface at %USERWEB%.WikiUsers. Note that the value of {SolrPlugin}{PersonDataForm} specifies a Solr filter query that can be customized and extended as required. For example, to also include any topic that has a PersonTopic DataForm attached to it use:

$Foswiki::cfg{SolrPlugin}{PersonDataForm} = '(*PersonTopic OR *UserForm)';

Finally, you'll need to make this configuration accessible in wiki applications such as the WikiUsers view template. Add '{SolrPlugin}{PersonDataForm}' to the {AccessibleCFG} list as in

$Foswiki::cfg{AccessibleCFG} = [
    '{ScriptSuffix}',
    '{LoginManager}',
    '{AuthScripts}',
...
    '{SolrPlugin}{PersonDataForm}',
];

Macros

SolrPlugin comes with a set of search macros tailored to the extensive capabilities of Solr's responses to search queries. All of them make use of the same set of options to render a response as listed in SOLRSEARCH.

SOLRSEARCH

This is the most important macro. It allows you to interact with the Solr server and display results within wiki applications. An example search looks like this:
%SOLRSEARCH{"test"
  format="   1 $web.$topic$n"
  sort="date desc"
}%

This will list the 10 most recently changed topics that match the string "test".

To list the 20 most recently changed topics that have the string "test" in their name use:
%SOLRSEARCH{"topic_search:test"
  format="   1 $web.$topic$n"
  sort="date desc"
  rows="20"
}%

SOLRSEARCH allows you to use the full power of the Lucene query language. This works with syntactically correct boolean queries like "title:foo OR body:foo". Consult the Lucene Query Syntax guide to learn more about how to form more complicated queries.

SOLRSEARCH also allows you to run a query in dismax mode. The dismax query parser only supports a subset of the Lucene syntax, but is highly tolerant of all sorts of strange user input. The query syntax it uses is familiar to many search engine users, and supports +/- and quotes for grouping words. The edismax mode adds several more powerful features, though still short of what is offered by the full Lucene syntax.

| *Parameter* | *Description* | *Default* |
| id | a search can optionally be cached for the time of the current request, for example using id="solr1"; further calls to %SOLRFORMAT can make use of the cached solr response to render it independently of the location of the %SOLRSEARCH call on the wiki page | |
| search | query string: depending on the search type this can either be free-form text (type=dismax), a valid lucene query (type=standard) or a combination of both (edismax) | *:* |
| type | dismax/edismax/standard: query type | standard |
| fields | list of fields to be returned in the result; by default all fields in solr documents are returned; communication between Foswiki and the solr search can be optimized by specifying only those fields that you are interested in while rendering the response | *, score |
| *Flags:* |||
| jump | on/off: jump to the topic specified explicitly in the search string | on |
| lucky | on/off: jump to the first result found | off |
| highlight | switch on/off highlighting of found terms | off |
| spellcheck | switch on/off spellchecking to propose alternative spellings in case no search result was found | off |
| *Pagination:* |||
| start | integer index within the result from where to start listing results | 0 |
| rows | maximum number of documents to return | 10 |
| *Filter parameters:* |||
| web | filter by web: this can be any webname | all |
| contributor | filter by contributor to a topic | |
| filter | lucene query to filter results | |
| extrafilter | additional lucene filter query (see SolrSearchBaseTemplate for the difference between filter and extrafilter) | |
| reverse | on/off: reverses sorting if switched on; note: this overrides the sorting order specified in sort | off |
| sort | sorting expression; examples: score desc, date desc, createdate, topic_sort | |
| *Dismax parameters:* |||
| boostquery | a raw query string (in the solr query syntax) that will be included with the user's query to influence the score; example: type:topic^1000 will boost results of type topic | see solrconfig.xml and SolrSearchBaseTemplate |
| queryfields | list of fields and their boosts giving each field a significance when a term was found in them; the format supported is fieldOne^2.3 fieldTwo fieldThree^0.4, which indicates that fieldOne has a boost of 2.3, fieldTwo has the default boost, and fieldThree has a boost of 0.4; this indicates that matches in fieldOne are much more significant than matches in fieldTwo, which are more significant than matches in fieldThree | see solrconfig.xml and SolrSearchBaseTemplate |
| phrasefields | list of fields and their boosts similar to queryfields; this parameter may contain fields and boosts that phrases (specified in quotes) are matched against; boosting those fields higher than their counterpart in queryfields allows you to prefer phrase matches over separate word matches | see solrconfig.xml and SolrSearchBaseTemplate |
| *Faceting:* |||
| facets | list of facets to be rendered during search; each facet can be a title=name pair specifying the facet name and the title label used to display it in the result; example: %MAKETEXT{"Webs"}%=web, %MAKETEXT{"Topic type"}%=field_TopicType_lst | |
| facetquery | query to be used for a facet query | |
| facetoffset | used to page through a list of facets being returned by a search | |
| facetlimit | maximum number of values to be displayed per facet; this is a list of pairs name=integer specifying a per-facet limit; example: 50, tag=100, contributor=10, category=10 will constrain the global limit of facet values to be returned to 50, tags to 100, and list the top 10 contributors in the hit set as well as the 10 most used categories | 100 |
| facetmincount | minimum frequency of a facet to be included in the result | 1 |
| facetprefix | prefix string of a facet to be included | |
| facetdatestart | part of a date facet describing the start of a time interval | NOW/DAY-7DAYS |
| facetdateend | part of a date facet describing the end of a time interval | NOW/DAY+1DAYS |
| facetdateother | part of a date facet describing the time intervals excluding the one specified with facetdatestart and facetdateend | before |
| hidesingle | comma separated list of facets to be hidden if there's only one choice left | |
| disjunctivefacets | list of facets that are queried using OR; so searching within one facet will expand the search instead of drilling down | facet values are combined using AND |
| combinedfacets | list of facets where values are queried in each of them using OR; for example listing field_ProjectMembers_lst and field_ProjectManager_s will result in a lucene filter of the form field_ProjectMembers_lst:WikiGuest OR field_ProjectManager_s:WikiGuest | |
| *Formatting results:* |||
| correction | format string for corrections proposed by the spellchecker | Did you mean <a href='$url'>$correction</a> |
| header | format string prepended to the result | |
| format | format string used to render each hit in the result set | |
| separator | format string used to separate hit results rendered using format | |
| footer | format string appended to the result | |
| header_interesting | format string prepended to more-like-this queries (see %SOLRSIMILAR) | |
| format_interesting | format string used to render more-like-this results | |
| separator_interesting | format string used to separate hit results in more-like-this queries | |
| footer_interesting | format string appended to more-like-this queries | |
| include_interesting | regular expression terms must match in a more-like-this result | |
| exclude_interesting | regular expression terms must not match in a more-like-this result | |
| header_<facet> | format string prepended to a facet result | |
| format_<facet> | format string used to render a facet value | |
| separator_<facet> | format string used to separate facet values | |
| footer_<facet> | format string appended to facet results | |
| include_<facet> | regular expression facet values must match to be displayed | |
| exclude_<facet> | regular expression facet values must not match to be displayed | |
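
To make the facetlimit syntax concrete, here is a small Python sketch that splits such a value into the global limit and the per-facet limits. The function is hypothetical and only mirrors the description in the table above; it is not taken from the plugin.

```python
def parse_facetlimit(spec, default_limit=100):
    """Split a facetlimit value into (global limit, {facet name: limit})."""
    global_limit = default_limit
    per_facet = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if "=" in item:
            # name=integer pair: limit for one specific facet
            name, value = item.split("=", 1)
            per_facet[name.strip()] = int(value)
        else:
            # bare integer: limit applied to all facets not listed explicitly
            global_limit = int(item)
    return global_limit, per_facet


if __name__ == "__main__":
    print(parse_facetlimit("50, tag=100, contributor=10, category=10"))
```

With the example from the table this yields a global limit of 50 and individual limits for tag, contributor and category; a value without a bare integer keeps the default of 100.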

SOLRFORMAT

When a solr response has been cached using the id parameter to SOLRSEARCH, it can be reused by subsequent calls to %SOLRFORMAT.

%SOLRSEARCH{"test" 
  id="solr1"
  facets="web,contributor"
  facetlimit="web=10, contributor=10"
}%

<noautolink>
*Results:*
%SOLRFORMAT{"solr1"
  format="   1 [[$web.$topic][$topic]]$n"
}%

*Webs:*
%SOLRFORMAT{"solr1"
  format_web="   * $key ($count)$n"
}%

*Contributors:*
%SOLRFORMAT{"solr1"
  format_contributor="   * $key ($count)$n"
  exclude_contributor="UnknownUser|AdminGroup|AdminUser|RegistrationAgent|TestUser"
}%
</noautolink>

SOLRSIMILAR

SOLRSIMILAR returns a list of topics similar to the current one.

| *Parameter* | *Description* | *Default* |
| "..." | query string referencing the document(s) to return similar ones for | id:System.SolrPlugin |
| like | list of fields used to compute similarity | category, tag |
| fields | list of fields and their boost value to be included in result items | web, topic, title, score |
| filter | restricts results to those matching this filter | type:topic |
| include | switches on/off inclusion of the matched document found in the query parameter | off |
| limit | maximum number of results to return | 100 |
| boost | | |
| mintermfrequency | | |
| mindocumentfrequency | | |
| mindwordlength | | |
| maxdwordlength | | |

SOLRSCRIPTURL

Returns a link to a SolrSearch with the given parameters pre-set.

| *Parameter* | *Description* | *Default* |
| "..." or search | search string to render a link for | |
| id | get a link to the search defined by SOLRSEARCH | |
| topic | name of the search topic to jump to | WebSearch |
| union | a list of fields whose values can be selected in a union (using an "or" operator) | |
| multivalued | a list of fields that may be searched by multiple values | |
| start | | |
| sort | | |
| <field_name> | any field defined in solr's schema.xml | |


---+++ Rest interface

---++++ search

---++++ terms

---++++ similar

---++++ autocomplete

---+++ Commandline tools

---++++ solrstart

---++++ solrindex

---++++ solrdelete

---+++ Perl interface

---++++ registerIndexTopicHandler()

---++++ registerIndexAttachmentHandler()

Solr indexing schema

SolrPlugin comes with a custom schema to index general Foswiki data as defined in the <solr-home-dir>/conf/schema.xml file. It offers support for generic DataForm values, so adding any new DataForm definition lets you use its formfields for faceting directly without changing configurations or having to reindex the content.

The process of indexing content is configured on the Foswiki side which will crawl all webs, topics and their attachments thus creating lucene documents which are then sent over to the solr server. A lucene document is made up of fields of a certain type which defines the way the document should be processed by the solr server. This is configured in the schema.xml file.

While the schema is able to cover all Foswiki related data it is still kept generic enough to be used for non-wiki content as well.

Field types

This is the list of the most common field types used in the default schema. See the schema.xml for more exotic field types like point and location, useful for spatial search.

Type Description
string not analyzed, but indexed/stored verbatim
boolean boolean values (true, false)
binary the data should be sent/retrieved in as Base64 encoded strings
int, float, long, double default numeric field types. for faster range queries, consider the tint/tfloat/tlong/tdouble types
date the format for this date field is of the form 1995-12-31T23:59:59Z, and is a more restricted form of the canonical representation of dateTime. The trailing "Z" designates UTC time and is mandatory. Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z All other components are mandatory. Note: for faster range queries, consider the tdate type
text_ws a text field that only splits on whitespace for exact matching of words
text a general text field that has reasonable, generic cross-language defaults: it tokenizes with StandardTokenizer, removes stop words from case-insensitive "stopwords.txt", and down cases. At query time only, it also applies synonyms.
text_generic same as text but also splits words on case change while generating word parts. a general unstemmed text field - good if one does not know the language of the field. this field type is usful when searching for parts of a WikiWord
text_prefix substring decomposition starting at the front of the string
text_suffix substring decomposition starting at the back of the string
text_spell generic text analysis for spell checking
text_sort this is a text field suitable for sorting alphabetically
text_rev a general unstemmed text field that indexes tokens normally and also reversed, to enable more efficient leading wildcard queries.
type a list of strings used to analyse different media types; these are derived from the system's mime types table to generate meaningful values; for example a gif image would be of type "gif", "image" and "attachment"
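
The values of the type field can be pictured with a small sketch based on Python's standard mimetypes module. It is only an approximation of the idea described above: the set of media classes and the function name type_values are assumptions, and the plugin's real mapping (for example for office documents) is richer than this.

```python
import mimetypes

# Assumption: only these major MIME classes are added as a type value.
MEDIA_CLASSES = {"image", "video"}


def type_values(filename):
    """Return illustrative type values for an attachment file name."""
    values = []
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext:
        values.append(ext)                 # verbatim file extension
    mime, _encoding = mimetypes.guess_type(filename)
    if mime:
        major = mime.split("/", 1)[0]
        if major in MEDIA_CLASSES:
            values.append(major)           # e.g. "image" for image/gif
    values.append("attachment")            # every file here is an attachment
    return values


print(type_values("logo.gif"))  # ['gif', 'image', 'attachment']
```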

Fields

Name Type Multivalued Stored Description
access_granted string multivalued   this field controls access of users to this topic or attachment in the search index; every query is augmented with an ACL check against this field; only users listed in this field are allowed view rights; special value is "all" when there are no view restrictions
attachment string multivalued stored list of all attachment names of this topic
author string   stored the name of the person that changed the document most recently
author_title string   stored title name of the person that changed the document most recently
catchall text_generic multivalued stored copy-field that gathers content from (almost) all fields; this is the default search field for the "standard" query parser; note that fields to be queried can be configured per request using the "dismax" handler
category string multivalued stored list of categories this document is in; note: this field will only be used if Foswiki:Extensions/ClassificationPlugin is installed; it will populate it with the list of all categories up to TopCategory; content of this field is copied to category_search as well (see copy fields below)
comment text_generic   stored comment field of an attachment
concept string multivalued stored support for uima processing chain
container_id string   stored id of containing document, e.g. the topic this is a comment or attachment for
container_title string   stored title name of containing document
container_topic string   stored topic of containing document
container_url string   stored url of containing document
container_web string   stored web of containing document
contributor string multivalued stored list of users that contributed to this topic at some point in time
createauthor string   stored author of the initial version of this document
createauthor_title string   stored title name of the initial author of this document
createdate tdate   stored date when the initial version of this document was created
date tdate   stored time the document was last changed
form string   stored name of the form attached to the current topic
icon string   stored icon to identify the rendition for this document
id string   stored unique identifier for each document; this is the external id usable in applications; there's an internal solr document id not related to this field
language string   stored language of the current document; this may be specified explicitly using the CONTENT_LANGUAGE preference, or set to "detect" to let the solr update chain detect the language automatically
macro string multivalued   list of wiki macros being used in this topic
name string   stored filename of an attachment
outgoing string multivalued stored list of all outgoing links; this information is used to detect backlinks
parent string   stored parent topic of the current topic
phonetic phonetic multivalued   holds the phonetic analysis of the most important search fields
charnorm text_charnorm multivalued   result of the character normalization analysis
preference string multivalued stored this field catches all topic preferences. each preference is captured in a dynamic field as well (see dynamic fields below)
sentence text_generic multivalued stored support for uima processing chain
size tint   stored size of an attachment in bytes
spell text_spell multivalued   used for spellchecking
state string     used by comments or any other application that tracks specific states of a document, such as "new", "unapproved", "approved", "draft", "unpublished", "published", ...
text_prefix text_prefix multivalued   holds substring analysis of the most important search fields, starting at the front
text_suffix text_suffix multivalued   holds substring analysis of the most important search fields, starting at the back
summary text_generic   stored this is a plain-text summary of the topic text
tag string multivalued stored list of tags assigned to this document; note: this field will only be used if Foswiki:Extensions/ClassificationPlugin is installed; content of this field is copied to tag_search as well (see copy fields below)
text text_generic     document text
thumbnail string   stored url to thumbnail representation of this document; mostly used for images
timestamp tdate   stored time when the document was added to the index
title string   stored title of a document; a topic title is read from a TopicTitle formfield, a TOPICTITLE preference variable or defaults to the topic name itself; for attachments this is the filename with the extension stripped off
topic string   stored name of the topic
type type   stored holds the type facet of the document; this is "image" for all kinds of images, "video" for all kinds of videos, "topic" for Foswiki topics and the verbatim file extension for everything else; note: plugins like Foswiki:Extensions/MetaCommentPlugin might use specific types as well (like "comment" in this case)
url string   stored url used to access the document being indexed
version float     current version of the topic
webcat string   stored combined web-category facet
web string   stored name of the web this document is located in
webtopic string   stored concatenation of the web and topic part
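
The ACL check on access_granted can be pictured as an extra filter added to every query. The sketch below is an assumption about what such a filter might look like in Lucene query syntax; the function name acl_filter is hypothetical, and the plugin may construct its check differently.

```python
def acl_filter(user=None):
    """Build an illustrative filter restricting results to viewable documents."""
    allowed = ["all"]          # documents without view restrictions
    if user:
        allowed.append(user)   # documents that list this user explicitly
    terms = " OR ".join(f'"{name}"' for name in allowed)
    return f"access_granted:({terms})"


print(acl_filter("JohnDoe"))  # access_granted:("all" OR "JohnDoe")
```

A string like this could be passed to Solr as a filter query (the fq parameter) alongside the user's own search terms.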

Dynamic fields

Dynamic fields are generated based on the content properties of the document to be indexed. They are specified in schema.xml using a wildcard pattern. When a document is indexed, a field whose name matches the pattern is created with the settings of that pattern. Dynamic fields make it possible to analyze fields in a specific way based on their name, and to cover fields that aren't known in advance, such as the names of all formfields of any DataForm that could ever be defined.

When SolrPlugin is about to index a DataForm attached to a topic, it tries to guess the data type of each formfield. Normally, Foswiki does not specify any type information within a DataForm definition. Exceptions are (1) date: these are mapped to a *_dt field and (2) checkbox, select, radio, textboxlist: these are potentially multi-value fields and are thus indexed in a *_lst field.

Every other formfield is stored in a *_s field as well as in a *_search field. The former captures the exact content, while the latter analyzes the text more thoroughly and is optimized for fuzzy searching.

DataForm formfields are mapped to Lucene document fields by prepending the field_ prefix, which prevents name clashes with other dynamic fields generated on the fly. For example, a formfield ProjectManager will be stored in field_ProjectManager_s and field_ProjectManager_search. Likewise, a select+multi formfield ProjectMembers will be stored in field_ProjectMembers_lst, as it is a multivalued field.

If a formfield name already ends in one of the suffixes below (_i, _l, _f, _dt, etc.), this suffix is used instead of any heuristics that try to derive the best field type for the Lucene field. That way DataForm fields, although untyped by Foswiki, can nevertheless be indexed type-specifically.

Similarly, topic preferences are indexed using a preference_ prefix.
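
The naming rules above can be summarized in a short sketch. It only mirrors the rules as described in this section; the function name solr_field_names, the exact suffix list and the way the formfield type is passed in are assumptions for illustration, not the plugin's actual code.

```python
# Formfield types that may hold several values (see the rules above).
MULTI_VALUE_TYPES = {"checkbox", "select", "radio", "textboxlist"}

# Assumption: the suffixes of the dynamic fields listed in the table below.
KNOWN_SUFFIXES = ("_i", "_l", "_f", "_d", "_b", "_s", "_std",
                  "_t", "_dt", "_lst", "_search", "_sort")


def solr_field_names(name, formfield_type="text", prefix="field_"):
    """Map a DataForm formfield to the Solr field name(s) it is indexed in."""
    if name.endswith(KNOWN_SUFFIXES):
        return [prefix + name]              # an explicit suffix wins over heuristics
    base_type = formfield_type.split("+", 1)[0]   # "select+multi" -> "select"
    if base_type == "date":
        return [f"{prefix}{name}_dt"]
    if base_type in MULTI_VALUE_TYPES:
        return [f"{prefix}{name}_lst"]
    return [f"{prefix}{name}_s", f"{prefix}{name}_search"]


print(solr_field_names("ProjectManager"))
print(solr_field_names("ProjectMembers", "select+multi"))
print(solr_field_names("Budget_i"))
```

Passing prefix="preference_" would produce the corresponding names for topic preferences.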

Name Type Multivalued Stored Description
*_i tint   stored fields with a _i suffix are indexed as an integer number
*_l tlong   stored fields with a _l suffix are indexed as a long integer
*_f tfloat   stored fields with a _f suffix are indexed as a float
*_d tdouble   stored fields with a _d suffix are indexed as a double precision float
*_b boolean   stored true, false
*_s string   stored dynamic field for unanalyzed text
*_std string not stored dynamic field for standard analysis, i.e. stopwords not being removed
*_t text_generic   stored generic text
*_dt tdate   stored a dateTime value
*_lst string multivalued stored this field is used for any multi-valued formfield in DataForms, like select, radio, checkbox, textboxlist
preference_* string   stored preference values such as preference_NAMEOFPREFERENCE_t
*_search text_generic   stored generic text, optimized for searching
*_sort text_sort   stored text optimized for sorting alphabetically
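
Values sent to a *_dt field have to follow the date format described under Field types. As a minimal sketch, assuming the value is available as a timezone-aware Python datetime, it can be rendered like this (the helper name to_solr_date is hypothetical):

```python
from datetime import datetime, timedelta, timezone


def to_solr_date(dt):
    """Format an aware datetime as a Solr date string in UTC."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_solr_date(datetime(1995, 12, 31, 23, 59, 59, tzinfo=timezone.utc)))
# prints 1995-12-31T23:59:59Z

# A local time with a +01:00 offset is converted to UTC before formatting.
cet = timezone(timedelta(hours=1))
print(to_solr_date(datetime(1996, 1, 1, 0, 59, 59, tzinfo=cet)))
```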

Copy fields

Finally, after all field types have been defined, some fields are created by copying a source field to a destination field using the copyField feature of Solr. So while most of a Lucene document is created explicitly by the crawler and indexer, some fields are created automatically to facilitate specific search applications. The destination fields are then analysed using the field and dynamic field definitions given above.

Source Destination
attachment catchall
attachment charnorm
attachment phonetic
attachment spell
category catchall
category category_search
category charnorm
category phonetic
comment catchall
comment charnorm
comment phonetic
comment spell
concept catchall
concept charnorm
concept phonetic
concept spell
field_* catchall
field_* charnorm
field_* phonetic
field_* spell
form catchall
form charnorm
form phonetic
form spell
name catchall
name charnorm
name phonetic
name spell
name name_std
name name_search
tag catchall
tag charnorm
tag phonetic
tag tag_search
text catchall
text charnorm
text phonetic
text spell
text text_prefix
text text_std
text text_suffix
title catchall
title charnorm
title phonetic
title spell
title title_first_letter
title title_prefix
title title_search
title title_sort
title title_std
title title_suffix
topic catchall
topic charnorm
topic phonetic
topic spell
topic topic_search
topic topic_sort
topic topic_std
type catchall
type charnorm
type phonetic
web spell
webtopic webtopic_search
web web_search
web web_sort
web web_std
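
The effect of these rules can be simulated in a few lines. The sketch below uses a small subset of the table above and only models the copying step, not the analysis Solr applies to each destination afterwards; the function name copied_fields is made up for this example.

```python
# A subset of the source/destination pairs listed in the table above.
COPY_RULES = [
    ("title", "catchall"),
    ("title", "title_search"),
    ("title", "title_sort"),
    ("topic", "catchall"),
    ("topic", "topic_search"),
    ("web", "web_search"),
]


def copied_fields(doc, rules=COPY_RULES):
    """Return the destination fields that the copy rules would populate."""
    result = {}
    for source, dest in rules:
        if source not in doc:
            continue
        value = doc[source]
        values = value if isinstance(value, list) else [value]
        result.setdefault(dest, []).extend(values)
    return result


doc = {"title": "Project Plan", "topic": "ProjectPlan", "web": "Main"}
print(copied_fields(doc))
```

A destination such as catchall collects values from several sources, which is why it has to be multivalued.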

Templates

Structure of SolrSearchBaseTemplate

Replacing WebSearch and WebChanges

Creating custom search interfaces

Dependencies

Name Version Description
Foswiki::Contrib::JQMomentContrib >=1.0 Required
Foswiki::Contrib::JQPhotoSwipeContrib >=1.0 Required
Foswiki::Contrib::JQSerialPagerContrib >=2.0 Required
Foswiki::Contrib::JQTwistyContrib >=1.0 Required
Foswiki::Contrib::StringifierContrib >=1.20 Required
Foswiki::Plugins::AutoTemplatePlugin >=1.0 Optional
Foswiki::Plugins::ClassificationPlugin >=1.0 Optional
Foswiki::Plugins::DBCachePlugin >=1 Optional
Foswiki::Plugins::FilterPlugin >=2.0 Required
Foswiki::Plugins::FlexWebListPlugin >=1.91 Required
Foswiki::Plugins::ImagePlugin >=3.0 Required
Foswiki::Plugins::JQueryPlugin >=6.00 Required
Cache::Cache >0 Required
HTML::Entities >=3.64 Required
JSON::XS >=2.231 Required
LWP::UserAgent >=5.820 Required
Moo >=2.00 Required
Types::Standard >=1.00 Required
XML::Easy >0 Required
Foswiki::Plugins::TopicTitlePlugin >1.00 Required for Foswiki < 2.2

Change History

31 Jan 2019: reduce amount of presumably unrelated search results; improved language detection in solr; added fields name_std and name_search for better searchability of attachments; don't display wiki markup in search result summaries; added field macro to capture use of wiki macros
10 Oct 2018: mime types are now multivalued, e.g. an image is now tagged type: ["gif", "image", "attachment"]; better support for attachments listed in the autosuggest drop down box; the rudimentary type mapping is now based on the system mime types table and not using a typemap file in solr's config anymore; removed dependency on Image::Magick; fixed error exceeding the max string length in solr; the form name will now be used when no TopicType field is present to construct the TopicType facet; fixed support for ALLOWWEBVIEW = *
13 Aug 2018: new alphabetical navigation for wiki users; fixed searching for summary; replaced jquery.scrollto with native scroll api; make number of items suggested configurable in jquery.autosuggest drop-down box
07 Jun 2018: new index fields author_title, createauthor_title, title_first_letter; added support for indexing arbitrary meta data; added support for ListyPlugin; added toggle "exact search" to search interface; depending on new TopicTitlePlugin now; fixed keyboard interaction of autosuggest box; fixed sorting facet values by title; much improved relevancy sorting
09 Jan 2018: added support for jquery.i18n; improved solr schema for better findability; fixed solr sidebar in subwebs
18 Sep 2017: replacing text_substring with text_prefix and text_suffix to improve substring matching; truncate document values larger than 32k to prevent solr from crashing; use flexbox for people search interface; fixed creating urls to ImagePlugin rest interface to generate thumbnail previews
23 Jan 2017: converted WebServices::Solr to Moo; fixed documentation for iwatch realtime indexing; documentation of SOLRSCRIPTURL macro; using jquery.i18n for javascript translations now; new facet filter to search in facet values; improved indexing of user profile pages and their thumbnail image; indexing image geometry now; improved jquery.autosuggest widget; improved ToggleFacetWidget; improved boosting of query ingredients; mapping all office documents to a combined attachment type (document, presentation, spreadsheet, chart, ...); better support for plenv in system services and cron jobs
18 Oct 2015: fixed backwards compatibility with pre-unicode Foswiki; bring back solr::queryfields in SolrSearchBaseTemplate; fixed language facet to properly match language tags to their name; improved layout of search results as well as autosuggestion widget; removed workflow facet from default search; fixed icon mapping for topics that don't come with an icon defined in their TopicType; don't try to encode html entities without a code point in utf8; don't remove all macros from topic text, just some; removed dependency on MimeIconsPlugin as we are using fontawesome now; improved formula for sorting results by reference; fixed sorting in ajax-solr; fixed exposing/hiding parameters in ajax-solr; improved findability of content; i.e. when containing stop words only in the title; removed unused /browse search handler from solr config
01 Oct 2015: improve default layout of search results; moved unsafe inline-javascript into a js file of its own
21 Sep 2015: cache stringified attachments using Cache::FileCache now and added api to purge/clear cache regularly; removed IndexExtensions config parameter to let the stringifier decide on supported file formats; added support for Foswiki:Extensions/LikePlugin boosting search results by social preferences
17 Jul 2015: added support for Foswiki-2.0; indexing workflow and state facets supporting Foswiki:Extensions/WorkflowPlugin; added author_url to solr schema; added google image and video mime types mapping them to "image" while indexing
27 Feb 2015: upgraded to solr-5.0.0
29 Sep 2014: moved to jsrender for templating, replacing the deprecated jquery.tmpl
29 Aug 2014: fix mailto links in WikiUsers view template; fully specify rest security; fixed creating of working area for timestamps db; improved indexing of list values; fixed encoding error in SOLRSEARCH/FORMAT; use SOLR_EXTRAFILTER preference setting in auto-suggest widget as well; fixed applying strings and defaults in solrDictionary class; fixed applying extra-filters in SolrSearch; harvest facet headings for translations
28 May 2014: implemented new ACL style compatible with Foswiki >= 1.2
14 Jul 2013: added support for PiwikPlugin
14 Mar 2013: improved indexing performance; added configurable http timeouts talking to the solr backend; fixed language mappings for multilingual content; fixes due to latest changes in jquery.moment
17 Oct 2011: fixed WebServices::Solr to only encode to utf8 if needed; fixed handling character encoding on a pure utf8 foswiki; fixed schema for spell correction
29 Sep 2011: improved schema.xml: replaced StandardTokenizer with WhitespaceTokenizer, using new ClassicTokenizer and ClassicFilter to feed the spellchecker, switched spellchecker to JaroWinklerDistance and lowered the frequency threshold for a term to be added to the spellchecker; building the spellchecker when optimizing the index now; fixed detecting the content language
28 Sep 2011: added multilanguage support per document; fixed default values in %SOLRSIMILAR; speeding up indexing by better caching ACLs; implemented mapping facet values to any other label during query time; added Language facet to default search interface
26 Sep 2011: improved default boosting in dismax to prefer topic hits a lot stronger than attachments; improved default cache settings for better default performance; added support to distribute updates and search in a master-slave setup; added boostquery, queryfields, phrasefields parameter to customize boosting and sorting; improved default schema while documenting it
21 Sep 2011: upgrading to solr-3.4.0; fixed utf8 handling; added jump and i-feel-lucky options; made hidesingle configurable per facet; added disjunctivefacets and combinedfacets; fixed handling of date fields; support new ui::autocomplete in JQueryPlugin; using type-specific icons in Foswiki:Extensions/MimeIconPlugin if installed; fixed quoting lucene queries; indexing outgoing links to support fast backlinks; adding fields createauthor, language and collection to schema; disabling phonetic boost in schema by default; be more robust in case of malformed DataForm definitions; copying every string field into a search field also to allow exact as well as fuzzy search; enhancing normalizeWebTopicName to create uniform web names using dots, not slashes everywhere; fixed parsing inline topic permissions; externalized sidebar pager into a new plugin of its own: Foswiki:Extensions/JQSerialPagerContrib; upgrading to WebService::Solr-0.14 ... which now requires CPAN:XML::Easy instead of CPAN:XML::Generator; lots of improvements to SolrSearchBaseTemplate; now supporting Foswiki:Extensions/InfiniteScrollContrib in SolrSearch; documentation improvements
19 Apr 2011: shipping a multicore setup by default; added support for Foswiki:Extensions/VirtualHostingContrib; fixed utf8 recoding; some usability improvements to faceted search interface; fixing illegal control characters in output (Oliver Schaub)
16 Dec 2010: added state field to schema used for approval workflows; added solrjob to ease cronjobbing indexing; added docu how to use iwatch for almost-realtime indexing; fixed dependencies to include Foswiki:Extensions/FilterPlugin as well; fixed mapping facet values to their display title in search interface; fixed delta updates not properly removing outdated attachment entries when these were moved/renamed; and some minor html improvements
03 Dec 2010: fixed solr-based WebChanges and SiteChanges using PatternSkin
01 Dec 2010: adjustments due to changes in stringifier api; fixed removal of deleted webs from search index
22 Nov 2010: fixes integration with pattern skin
18 Nov 2010: initial public release

PackageForm

Author Michael Daum
Version 7.30
Release 31 Jan 2019
Description Enterprise Search Engine for Foswiki based on Solr
Repository https://github.com/foswiki/SolrPlugin
Copyright © 2009-2019, Michael Daum http://michaeldaumconsulting.com
License GPL (GNU General Public License)
Home Foswiki:Extensions/SolrPlugin
Support Foswiki:Support/SolrPlugin
Topic revision: r1 - 31 Jan 2019, ProjectContributor