StaticSearch indexer

You must run the StaticSearch indexing process whenever your site content changes. This is typically done after a build, as part of your site’s deployment process.

During indexing, StaticSearch extracts words from all the HTML files in a build directory (./build/) and creates a new directory (./build/search/) containing index data, JavaScript, and CSS files.

Installing StaticSearch #

StaticSearch can be run without installation:

terminal

npx staticsearch

You can also install it globally:

terminal

npm install staticsearch -g

then run using:

terminal

staticsearch

The remainder of this tutorial shows global staticsearch commands, but you can still prepend npx if necessary.

StaticSearch help #

View StaticSearch command line help using:

terminal

staticsearch --help

Additional help options are available from the CLI:

CLI            description
-v, --version  show application version
-?, --help     show CLI help
-E, --helpenv  show .env/environment variable help
-A, --helpapi  show Node.js API help

Using the StaticSearch Node.js API #

You can configure and run StaticSearch from any Node.js project. This is useful when you want to index a site as part of your build process, perhaps within a Publican publican.config.js configuration file.

To use StaticSearch, install it as a dependency:

terminal

npm install staticsearch

then import the module, set configuration options, and run the .index() method in your JavaScript code:

index.js example

// example search index
import { staticsearch } from 'staticsearch';

// configuration
staticsearch.buildDir = './dest/';
staticsearch.searchDir = './dest/index/';
staticsearch.buildRoot = '/blog/';
staticsearch.wordWeight.title = 20;

// run indexer
await staticsearch.index();

When an option is not explicitly set, StaticSearch falls back to the matching environment variable, then to the default value.

Run your application as normal to index a site, e.g. node index.js
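The fallback order described above can be sketched as a small helper. This only illustrates the precedence and is not StaticSearch's internal code: resolveOption() and its parameters are hypothetical names.

```javascript
// Illustrative only: resolveOption() is a hypothetical helper,
// not part of the StaticSearch API.
function resolveOption(explicitValue, envName, defaultValue, env = process.env) {
  // 1. a value set explicitly in code wins
  if (explicitValue !== undefined) return explicitValue;
  // 2. otherwise use the environment variable when it is defined
  if (env[envName] !== undefined) return env[envName];
  // 3. otherwise fall back to the default
  return defaultValue;
}

const exampleEnv = { BUILD_DIR: './dest/' };

console.log(resolveOption('./public/', 'BUILD_DIR', './build/', exampleEnv)); // ./public/
console.log(resolveOption(undefined, 'BUILD_DIR', './build/', exampleEnv));   // ./dest/
console.log(resolveOption(undefined, 'BUILD_DIR', './build/', {}));           // ./build/
```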

Configuring StaticSearch #

StaticSearch can index most sites without additional configuration. However, you can set options using CLI arguments, environment variables, or the Node.js API. This section describes the configuration parameters.

Load environment files #

All StaticSearch options can be set using environment variables, e.g.

terminal

export BUILD_DIR=./dest/
staticsearch

Variables can also be defined in a file and loaded with --env <file>. Create a file with your configuration values, e.g.

example .env

# StaticSearch environment variables
BUILD_DIR=./dest/
SEARCH_DIR=./dest/index/
BUILD_ROOT=/blog/

Then load the file on the command line:

terminal

staticsearch --env .env

Note that CLI arguments take precedence over environment variables.
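To make the precedence concrete, here is a rough sketch of how a file like the example above could be read and then overridden by a CLI argument. It is an assumption about the general approach, not StaticSearch's actual loader: parseEnv() and cliArgs are hypothetical names.

```javascript
// Hypothetical sketch: parse simple KEY=VALUE lines, skipping blanks and # comments.
function parseEnv(text) {
  const vars = {};
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith('#')) continue;
    const eq = trimmed.indexOf('=');
    if (eq < 1) continue; // ignore lines without a KEY= part
    vars[trimmed.slice(0, eq).trim()] = trimmed.slice(eq + 1).trim();
  }
  return vars;
}

const envFile = `
# StaticSearch environment variables
BUILD_DIR=./dest/
SEARCH_DIR=./dest/index/
BUILD_ROOT=/blog/
`;

const fromEnv = parseEnv(envFile);

// values that would come from the command line, e.g. --builddir ./public/
const cliArgs = { BUILD_DIR: './public/' };

// CLI arguments are spread last so they override environment values
const settings = { ...fromEnv, ...cliArgs };

console.log(settings.BUILD_DIR);  // ./public/
console.log(settings.SEARCH_DIR); // ./dest/index/
```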

File indexing options #

The following options control how static site files are parsed:

CLI                    ENV                   API                   description
-b, --builddir         BUILD_DIR             .buildDir             website directory (./build/)
-s, --searchdir        SEARCH_DIR            .searchDir            index data directory (./build/search/)
-d, --domain           SITE_DOMAIN           .siteDomain           site domain (http://localhost)
-r, --root             BUILD_ROOT            .buildRoot            site root path (/)
-i, --indexfile        SITE_INDEXFILE        .siteIndexFile        default index file (index.html)
-f, --ignorerobotfile  SITE_PARSEROBOTSFILE  .siteParseRobotsFile  parse robots.txt Disallows (true)
-m, --ignorerobotmeta  SITE_PARSEROBOTSMETA  .siteParseRobotsMeta  parse robots meta noindex (true)

The build directory (--builddir | BUILD_DIR | .buildDir) is an absolute or relative path to the directory where your static website files are built, e.g. ./build/.

The search directory (--searchdir | SEARCH_DIR | .searchDir) is an absolute or relative path to the directory where the search index JavaScript and JSON data files are generated. You can use any path, but it should normally be inside your build directory, e.g. ./build/search/. When no search directory is set, it will default to a search sub-directory of the build directory.

If your pages use links with fully qualified URLs such as https://mysite.com/path/, you should set the domain (--domain | SITE_DOMAIN | .siteDomain) so they can be identified, e.g. https://mysite.com.

The web root path is presumed to be /, so a file named ./build/index.html is your home page. You can set it to another path, such as /blog/ if necessary (--root | BUILD_ROOT | .buildRoot). The file at ./build/index.html is then presumed to have the URL http://site.com/blog/index.html.
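The mapping from a built file to its URL can be sketched as follows. fileToUrl() is a hypothetical helper written for illustration, assuming simple relative paths. It is not how StaticSearch necessarily resolves URLs internally.

```javascript
// Hypothetical helper: map a built file to its public URL.
function fileToUrl(filePath, { buildDir = './build/', root = '/', domain = 'http://localhost' } = {}) {
  // strip a leading "./" and make sure the build directory ends with "/"
  const clean = (p) => p.replace(/^\.\//, '');
  const dir = clean(buildDir).replace(/\/?$/, '/');
  const file = clean(filePath);
  if (!file.startsWith(dir)) {
    throw new Error(`${filePath} is not inside ${buildDir}`);
  }
  // path of the file relative to the build directory
  const relative = file.slice(dir.length);
  // normalize the root so it starts and ends with a single "/"
  const rootPath = '/' + root.replace(/^\/+|\/+$/g, '');
  const base = rootPath === '/' ? '/' : rootPath + '/';
  return domain.replace(/\/+$/, '') + base + relative;
}

console.log(fileToUrl('./build/index.html'));
// http://localhost/index.html

console.log(fileToUrl('./build/index.html', { root: '/blog/', domain: 'http://site.com' }));
// http://site.com/blog/index.html
```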

The HTML index file used as the default for directory paths is presumed to be index.html. You can change this to another filename if necessary (--indexfile | SITE_INDEXFILE | .siteIndexFile), e.g. default.htm.

StaticSearch parses the robots.txt file in the root of the build directory, e.g.

example robots.txt

User-agent: *
Disallow: /secret/

User-agent: staticsearch
Disallow: /personal/

In this case, StaticSearch will not index any HTML file in the /secret/ or /personal/ directories. This can be disabled using --ignorerobotfile | SITE_PARSEROBOTSFILE=false | .siteParseRobotsFile=false.
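The behaviour above can be approximated with a short sketch. It assumes simple prefix matching and ignores wildcards and Allow rules, so it is only an illustration: parseDisallows() and isBlocked() are hypothetical names, not StaticSearch functions.

```javascript
// Hypothetical sketch: collect Disallow paths that apply to "*" or "staticsearch".
function parseDisallows(robotsTxt, agents = ['*', 'staticsearch']) {
  const disallowed = [];
  let applies = false;      // does the current group apply to one of our agents?
  let inAgentLines = false; // are we inside a run of consecutive User-agent lines?
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep < 0) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      const match = agents.includes(value.toLowerCase());
      // consecutive User-agent lines share the rules that follow them
      applies = inAgentLines ? applies || match : match;
      inAgentLines = true;
    } else {
      inAgentLines = false;
      if (field === 'disallow' && applies && value) disallowed.push(value);
    }
  }
  return disallowed;
}

// Simplified check: a URL path is blocked when it starts with a disallowed path.
const isBlocked = (urlPath, rules) => rules.some((rule) => urlPath.startsWith(rule));

const robotsTxt = `
User-agent: *
Disallow: /secret/

User-agent: staticsearch
Disallow: /personal/
`;

const rules = parseDisallows(robotsTxt);
console.log(rules);                                   // [ '/secret/', '/personal/' ]
console.log(isBlocked('/secret/plan.html', rules));   // true
console.log(isBlocked('/blog/post.html', rules));     // false
```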

StaticSearch parses HTML meta tags. A page containing noindex in the content attribute of either of the following meta tags is not indexed:

example index.html

<meta name="robots" content="noindex">
<!-- OR -->
<meta name="staticsearch" content="noindex">

This can be disabled by setting --ignorerobotmeta | SITE_PARSEROBOTSMETA=false | .siteParseRobotsMeta=false.
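As a rough illustration of the meta tag check, the sketch below scans an HTML string with regular expressions. A real indexer would more likely use an HTML parser, and this version does not skip tags inside HTML comments. hasNoindex() is a hypothetical name.

```javascript
// Hypothetical sketch: detect noindex in robots or staticsearch meta tags.
function hasNoindex(html) {
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  return metaTags.some((tag) => {
    // read the name and content attributes in whichever order they appear
    const name = /\bname\s*=\s*["']?([^"'\s>]+)/i.exec(tag)?.[1]?.toLowerCase();
    const content = /\bcontent\s*=\s*["']([^"']*)["']/i.exec(tag)?.[1]?.toLowerCase() ?? '';
    return (name === 'robots' || name === 'staticsearch') && content.includes('noindex');
  });
}

console.log(hasNoindex('<meta name="robots" content="noindex">'));        // true
console.log(hasNoindex('<meta name="staticsearch" content="noindex">'));  // true
console.log(hasNoindex('<meta name="robots" content="index, follow">'));  // false
```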

Example: index HTML files in the ./dest/ directory, ignore robots.txt restrictions, and write search index files to ./dest/search/:

terminal

staticsearch --builddir ./dest/ --searchdir ./dest/search/ --ignorerobotfile
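The same settings could be applied from the Node.js API. This sketch assumes the property names listed in the options table accept the values shown, and that the staticsearch package is installed in a project that supports ES modules and top-level await.

```javascript
// Assumed API equivalent of the CLI example above.
import { staticsearch } from 'staticsearch';

staticsearch.buildDir = './dest/';
staticsearch.searchDir = './dest/search/';
staticsearch.siteParseRobotsFile = false; // ignore robots.txt Disallow rules

// run indexer
await staticsearch.index();
```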

Document indexing options #

StaticSearch attempts to locate your page’s main content; you would not normally want to index text in the header, footer, and navigation which appears on every page. It checks for content in:

  1. the HTML <main> element. Any <nav> or <menu> blocks within that are ignored.

  2. the HTML <body> when no <main> element can be found. Content is ignored in <header>, <footer>, <nav>, and <menu> elements (or any element with an ID or class attribute containing header, footer, etc.).

If this is not suitable, you can set alternative elements using CSS selectors to locate your main content:

CLI         ENV                API                description
-D, --dom   PAGE_DOMSELECTORS  .pageDOMSelectors  nodes to include
-X, --domx  PAGE_DOMEXCLUDE    .pageDOMExclude    nodes to exclude

The --dom | PAGE_DOMSELECTORS | .pageDOMSelectors value defines a comma-delimited list of CSS selectors where the main content is located, e.g. 'article.primary, .secondary, aside'.

The --domx | PAGE_DOMEXCLUDE | .pageDOMExclude value defines a comma-delimited list of CSS child selectors to exclude from selected nodes, e.g. 'nav, menu, .private'.

Example: index content in #main and .secondary elements but exclude all <nav> and <div class="related"> elements within those:

terminal

npx staticsearch --dom '#main,.secondary' --domx 'nav,div.related'

Notes:

  1. In the example above, pages without #main or .secondary elements will not be indexed.

  2. Be careful not to index the same elements more than once. In the example above, content inside a .secondary block would be indexed twice if it were contained inside a #main block.

  3. Ensure excluded nodes are valid child selectors. Consider this example:

    npx staticsearch --dom '.main' --domx 'body nav'
    

    It would not exclude the <nav> in the following HTML because it couldn’t find a child body element inside .main:

    <article class="main">
      <p>main content</p>
      <nav>navigation</nav>
    </article>
    

Word indexing options #

The following options control how words are indexed:

CLI               ENV                 API                      description
-l, --language    LANGUAGE            .language                language (en)
-c, --wordcrop    WORDCROP            .wordCrop                crop word letters (7)
-s, --stopwords   STOPWORDS           .stopWords               comma-separated list of stop words
--weightlink      WEIGHT_LINK         .wordWeight.link         word weight for inbound links (5)
--weighttitle     WEIGHT_TITLE        .wordWeight.title        word weight for main title (10)
--weightdesc      WEIGHT_DESCRIPTION  .wordWeight.description  word weight for description (8)
--weighth2        WEIGHT_H2           .wordWeight.h2           word weight for H2 headings (6)
--weighth3        WEIGHT_H3           .wordWeight.h3           word weight for H3 headings (5)
--weighth4        WEIGHT_H4           .wordWeight.h4           word weight for H4 headings (4)
--weighth5        WEIGHT_H5           .wordWeight.h5           word weight for H5 headings (3)
--weighth6        WEIGHT_H6           .wordWeight.h6           word weight for H6 headings (2)
--weightemphasis  WEIGHT_EMPHASIS     .wordWeight.emphasis     word weight for bold and italic (2)
--weightalt       WEIGHT_ALT          .wordWeight.alt          word weight for alt tags (1)
--weightcontent   WEIGHT_CONTENT      .wordWeight.content      word weight for content (1)

The default --language | LANGUAGE | .language is English (en). This provides word stemming and stop word lists to reduce the size of the index and provide fuzzier searching. Setting any other language indexes every word without stemming or stop words (further languages may be supported in future releases).

By default, --wordcrop | WORDCROP | .wordCrop is set to 7: only the first 7 letters of any word are considered important. Therefore, the words “consider”, “considered”, and “considering” are effectively identical (and indexed as conside). You can change this limit if necessary.
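The effect of cropping can be shown with a few lines of code. This illustrates cropping only: cropWord() is a hypothetical helper, the lowercasing is an assumption, and the real indexer also applies stemming and stop words for English.

```javascript
// Illustration only: crop a word to its first `wordCrop` letters.
function cropWord(word, wordCrop = 7) {
  return word.toLowerCase().slice(0, wordCrop);
}

const words = ['consider', 'considered', 'considering'];
const cropped = words.map((w) => cropWord(w));

console.log(cropped);               // [ 'conside', 'conside', 'conside' ]
console.log(new Set(cropped).size); // 1 (all three share one index entry)
console.log(cropWord('static'));    // static (shorter words are unchanged)
```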

You can add further stop words (words omitted from the index) using --stopwords | STOPWORDS | .stopWords. For example, a site about “Acme widgets” probably mentions them on every page. The words are of little practical use in the search index so the stop words 'acme,widget' could be set.

The --weight | WEIGHT_ | .wordWeight values define the score allocated to a word according to its location in a page. The defaults:

word location           score
title / h1 heading      10
description             8
h2 heading              6
h3 heading              5
h4 heading              4
h5 heading              3
h6 heading              2
emphasis (bold/italic)  2
main content            1
alt tags                1
inbound link            5

Consider a page with the word “static” in the title, <h2> heading, and an <em>. The page scores 18 (10 + 6 + 2) for “static”, so it will appear above pages scoring less in search results.

In addition, any other page linking to it using the word “static” adds a further 5 points to the score.
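The arithmetic above can be reproduced with a small sketch that uses the default weights. scoreWord() is a hypothetical helper for illustration. It simply adds the weight of each location where a word appears and does not claim to match the indexer's exact algorithm.

```javascript
// Default weights, keyed by the .wordWeight property names.
const wordWeight = {
  title: 10, description: 8, h2: 6, h3: 5, h4: 4, h5: 3, h6: 2,
  emphasis: 2, content: 1, alt: 1, link: 5,
};

// Hypothetical helper: total the weights for every location where a word appears.
function scoreWord(locations, weights = wordWeight) {
  return locations.reduce((total, location) => total + (weights[location] ?? 0), 0);
}

// "static" in the title, an <h2> heading, and an <em>
console.log(scoreWord(['title', 'h2', 'emphasis'])); // 18

// plus one inbound link using the word "static"
console.log(scoreWord(['title', 'h2', 'emphasis', 'link'])); // 23

// the same page with the title weight raised to 20
console.log(scoreWord(['title', 'h2', 'emphasis'], { ...wordWeight, title: 20 })); // 28
```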

Example: change the language to Spanish, crop words to 6 characters, and set the title score to 20:

terminal

staticsearch --language es --wordcrop 6 --weighttitle 20

Logging options #

The following option controls logging verbosity:

CLI             ENV       API        description
-L, --loglevel  LOGLEVEL  .logLevel  logging verbosity (2)

Set:

Next steps #

Once a site has been indexed, StaticSearch provides three options for implementing search on your site:

  1. the web component – an HTML-only search widget

  2. the bind module – allows you to bind elements to search functionality using HTML and/or JavaScript

  3. the search API – create your own search UI using the JavaScript API.