Indexing – schema.xml

Maximum token count

By default, documents are full-text indexed in their entirety. For very large documents this consumes a corresponding amount of memory and CPU time. For certain documents one may assume that all relevant search terms appear near the beginning of the document, e.g. in the table of contents or in the foreword. In such a case it makes sense to limit full-text indexing. In the example below, a filter (“LimitTokenCountFilter”) in the field type definition limits full-text indexing to the first 1000 tokens. The filter counts tokens in the order they occur, not distinct terms. This setting can be used for any field type.

<fieldType name="text_stemming_de" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="1000"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt"/>
            <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" types="lang/wdfftypes.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <!-- Optionally, you can use another stem filter with a different degree of stemming aggressiveness:
                 <filter class="solr.GermanStemFilterFactory"/>
                 <filter class="solr.GermanLightStemFilterFactory"/>
            -->
            ...
      </analyzer>
      ...
</fieldType>

Limiting the maximum token count does not affect performance per se: nothing changes when a small document is indexed. Resource usage is capped only when very large documents with a correspondingly high number of tokens actually enter the indexing process.
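
Because Solr applies analyzer filters in the order they are listed, the position of the limit filter determines which tokens count toward the limit. The following is only a sketch under assumptions: the field type name “text_limited_example”, the tokenizer and the limit of 5000 are hypothetical and not part of the delivered configuration. Here the stop word filter runs first, so stop words do not use up the token budget.

```xml
<!-- Hypothetical field type; name, tokenizer and limit are assumptions for illustration -->
<fieldType name="text_limited_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Stop words are removed before counting, so they do not count toward the limit -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt"/>
    <!-- Only the first 5000 remaining tokens are passed on to the index -->
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="5000" consumeAllTokens="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- No limit is needed on the query side -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the limit filter is placed directly after the tokenizer instead, as in the example further above, every raw token counts toward the limit, including stop words.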

Complete indexing of very large documents consumes memory and processing time and slows down the overall indexing process!

If you expect very large documents in your environment, your specific performance requirements must be reflected by an adequate hardware infrastructure or by an appropriate search architecture, e.g. a distributed search environment with several shards [Wiki DSearch].
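
For orientation only, plain Solr allows a distributed search over several shards to be configured as a default of a dedicated request handler in solrconfig.xml. The handler name “/distrib”, the host names and the core name below are hypothetical, and whether this approach fits your installation is an assumption that should be checked against [Wiki DSearch].

```xml
<!-- Hypothetical sketch for solrconfig.xml; handler name, hosts and core name are assumptions -->
<requestHandler name="/distrib" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Comma-separated list of the shards that are queried and merged -->
    <str name="shards">host1:8983/solr/core1,host2:8983/solr/core1</str>
  </lst>
</requestHandler>
```

A separate handler is used in this sketch because the Solr documentation advises against setting the shards parameter as a default of the standard request handler, which may send queries into an infinite loop.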

The combination of basic authentication with distributed search is not possible because of a known Solr defect (see https://issues.apache.org/jira/browse/SOLR-15237).

Fine tuning for search modes

The following full-text search modes are distinguished. Each mode is handled by its own internal fields, which are indexed and can be further customized using Solr mechanisms:

Fields which are used by standard search queries

This search is handled by the internal fields “fulltext” and “unitedmetadata”. The standard full-text search supports language-dependent word stemming (see also the language setting in solrcore.properties in chapter Basic configuration – solrcore.properties) and eliminates common stop words. Solr provides different word stemmers, and the default stop word list can be customized. With language-dependent stemming, the standard search query uses additional internal fields, e.g. “fulltext_en” and “unitedmetadata_en” for documents that have been detected to be English.
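
To illustrate how stemming and stop word removal are typically combined in Solr, the following sketch shows a field type for English text. The name “text_stemming_en_example” and the stop word file path are assumptions, and the field type actually used behind “fulltext_en” may be defined differently.

```xml
<!-- Hypothetical field type; name and stop word file path are assumptions -->
<fieldType name="text_stemming_en_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Customize the stop word list by editing the referenced file -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <!-- A less aggressive alternative would be solr.EnglishMinimalStemFilterFactory -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```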

Phrase search

A phrase search is handled by the internal fields “fulltext_phrase” and “unitedmetadata_phrase”. A phrase search is triggered by using double quotation marks that enclose a phrase in the full-text search field.

Wildcard search

The wildcard fields are required to support wildcard search. The asterisk and the question mark wildcards are supported in the full-text search field. In the Solr configuration of the assembly release no dedicated wildcard fields are activated; a standard installation uses the phrase search fields instead. It is possible to activate dedicated wildcard fields, which are optimized for queries with leading wildcards. To do so, enable the “stored” and “indexed” properties of the fields “fulltext_wildcard” and “unitedmetadata_wildcard” in the schema.xml configuration file.
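
As a sketch of that activation, the two wildcard field definitions would then look roughly as follows. The field names and the type are taken from the schema.xml excerpt in this chapter; that no further attributes need to change is an assumption.

```xml
<!-- Dedicated wildcard fields switched on; both attributes set to "true" -->
<field name="fulltext_wildcard" type="text_wildcard" multiValued="true" indexed="true" stored="true" />
<field name="unitedmetadata_wildcard" type="text_wildcard" multiValued="true" indexed="true" stored="true" />
```

Documents that were indexed before such a change do not contain these fields in the index and would have to be indexed again before the dedicated wildcard fields can match them.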

To save memory or CPU resources, indexing for the individual search modes can be selectively activated or deactivated in the “fields” section of the configuration file “schema.xml”. A search mode is deactivated by setting the field attributes “indexed” and “stored” of the corresponding fields to “false”:

...
<fields>
      ...
      <field name="fulltext_phrase" type="text" multiValued="true" indexed="false" stored="false" />
      <field name="unitedmetadata_phrase" type="text" multiValued="true" indexed="false" stored="false" />
      <field name="fulltext_wildcard" type="text_wildcard" multiValued="true" indexed="false" stored="false" />
      <field name="unitedmetadata_wildcard" type="text_wildcard" multiValued="true" indexed="false" stored="false" />
      ...
</fields>
...

By deactivating any of the indexes the corresponding search feature will become unavailable!

Customization scenarios are conceivable in which one index is shared for several purposes, e.g. using either the phrase search index or the standard full-text index for both search modes. However, such a customization must always be worked out according to project-specific requirements and should only be implemented in cooperation with your contact at T-Systems (see also section Internal configuration management and customization).