Indexing – schema.xml

Maximum token count

By default, full-text indexing is applied to complete documents. To improve indexing performance, you can limit the number of tokens that are indexed per document (maximum token count).

If all relevant search terms are located at the beginning of a document, e.g. in the table of contents or in the foreword, it makes sense to limit full-text indexing. In the example below, a filter in the field type definition restricts indexing to the first 1000 tokens (maxTokenCount). This setting can be used for any field type.

<fieldType name="text_stemming_de" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="1000"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt"/>
            <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" types="lang/wdfftypes.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <!-- Optionally, you can use another stem filter with a different, more or less aggressive stemming approach:
                 <filter class="solr.GermanStemFilterFactory"/>
                 <filter class="solr.GermanLightStemFilterFactory"/>
            -->
            ...
      </analyzer>
      ...
</fieldType>

Setting the maximum token count is only useful if many large documents (containing many tokens) enter the indexing process. It is not worthwhile if mostly small documents are indexed.
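
Because the setting can be used for any field type, the same filter can be added to other analyzer chains. The following is a minimal sketch only: the field type name "text_limited", the choice of tokenizer, and the limit of 5000 are assumptions made for illustration and are not part of the delivered configuration.

```xml
<!-- Hypothetical field type; name, tokenizer and limit are chosen for illustration only -->
<fieldType name="text_limited" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Pass on at most the first 5000 tokens of the token stream.
         consumeAllTokens="false" (the default) stops reading the input once the limit is reached. -->
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="5000" consumeAllTokens="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- No limit at query time: search terms are short and must all be analyzed -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The filter is placed only in the index analyzer, so the limit affects what is written to the index but not how queries are parsed.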

Complete indexing of large documents consumes memory and processing time and slows down the overall indexing process!

If you expect very large documents in your environment, your specific performance requirements must be reflected by an adequate hardware infrastructure or by an appropriate search architecture, e.g. a distributed search environment with several shards [Wiki DSearch].
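
As a rough illustration of a distributed search with several shards, a search handler can be given a default "shards" parameter. Note that this fragment belongs in solrconfig.xml, not in schema.xml. The host names and the core name are hypothetical placeholders, and the actual setup for your environment may differ from this sketch.

```xml
<!-- solrconfig.xml: sketch of a search handler that distributes queries over two shards -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Hypothetical hosts and core name; replace with the shards of your environment -->
    <str name="shards">solr1.example.com:8983/solr/core1,solr2.example.com:8983/solr/core1</str>
  </lst>
</requestHandler>
```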

Basic authentication cannot be combined with distributed search because of a known Solr defect (see https://issues.apache.org/jira/browse/SOLR-15237).

Fine tuning for search modes

The following full-text related search modes can be distinguished. Each mode is handled by its own internal fields, which are subject to indexing and can be further customized using Solr mechanisms:

  • Fields which are used by standard search queries

    This search is handled by the internal fields “fulltext” and “unitedmetadata”. The standard full-text search supports language-dependent word stemming (see also the language setting in solrcore.properties, chapter Basic configuration – solrcore.properties) and eliminates common stop words. Solr provides different word stemmers, and the default stop word list can be customized. With language-dependent stemming, the standard search query uses additional internal fields, e.g. “fulltext_en” and “unitedmetadata_en” for documents that have been detected as English.

  • Phrase search

    A phrase search is handled by the internal fields “fulltext_phrase” and “unitedmetadata_phrase”. A phrase search is triggered by using double quotation marks that enclose a phrase in the full-text search field.

  • Wildcard search

    Wildcard search requires wildcard fields. The full-text search field supports the asterisk and the question mark as wildcards. In the Solr configuration of the assembly release, no special wildcard fields are defined; in a standard installation, the phrase search fields are used instead. You can activate special wildcard fields that are optimized for queries with leading wildcards. To do so, enable the “stored” and “indexed” properties of the fields “fulltext_wildcard” and “unitedmetadata_wildcard” in the schema.xml configuration file.
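
Activating the special wildcard fields could look like the following sketch. The field names and the field type "text_wildcard" are taken from the field definitions in schema.xml; it is an assumption that all other attributes can stay as delivered.

```xml
<fields>
    ...
    <!-- Special wildcard fields, activated by setting indexed and stored to "true" -->
    <field name="fulltext_wildcard"
           type="text_wildcard" multiValued="true" indexed="true" stored="true" />
    <field name="unitedmetadata_wildcard"
           type="text_wildcard" multiValued="true" indexed="true" stored="true" />
    ...
</fields>
```

Documents that were indexed before the change do not contain these fields, so a re-index is usually needed before the new fields can answer queries.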

To reduce memory and CPU usage, you can selectively activate or deactivate indexing for the different search modes. To deactivate a search mode, set the “indexed” and “stored” attributes of its fields to “false” in the “fields” section of the “schema.xml” configuration file.

...
<fields>
    ...
     <field name="fulltext_phrase"
          type="text" multiValued="true" indexed="false" stored="false" />
     <field name="unitedmetadata_phrase"
          type="text" multiValued="true" indexed="false" stored="false" />
     <field name="fulltext_wildcard"
          type="text_wildcard" multiValued="true" indexed="false" stored="false" />
     <field name="unitedmetadata_wildcard"
          type="text_wildcard" multiValued="true" indexed="false" stored="false" />
    ...
</fields>
...

If you deactivate an index, the corresponding search mode is no longer available for full-text search.

Customization scenarios are conceivable in which one index is shared for several purposes, e.g. using either the phrase search index or the standard full-text index for both search modes. However, such a customization always depends on project-specific requirements and should only be implemented in cooperation with your contact at T-Systems (see also section Internal configuration management and customization).