StaticSearch indexer


1,780 words, 9-minute read

You must run the StaticSearch indexing process whenever your site content changes. You would typically run the indexer following a build as part of your site’s deployment process.

During indexing, StaticSearch extracts words from all the HTML files in a build directory (such as ./build/) and creates a new directory (./build/search/) containing index data, JavaScript, and CSS files.

Installing StaticSearch #

You can run StaticSearch without installation:

terminal

npx staticsearch

You can also install the module globally:

terminal

npm install staticsearch -g

then run using:

terminal

staticsearch

This tutorial shows global staticsearch commands, but you can still prepend npx if necessary.

StaticSearch help #

View StaticSearch command line help using:

terminal

staticsearch --help

Additional help options are available from the CLI:

| CLI | description |
| --- | --- |
| -v, --version | show application version |
| -?, --help | show CLI help |
| -E, --helpenv | show .env/environment variable help |
| -A, --helpapi | show Node.js API help |

Using the StaticSearch Node.js API #

You can configure and run StaticSearch from any Node.js project. This is useful when you want to index a site as part of your build process, perhaps within a Publican publican.config.js configuration file.

To use StaticSearch, install it as a dependency:

terminal

npm install staticsearch

then import the module, set configuration options, and run the .index() method in your JavaScript code:

index.js example

// example search index
import { staticsearch } from 'staticsearch';

// configuration
staticsearch.buildDir = './mysite/';
staticsearch.searchDir = './mysite/index/';
staticsearch.buildRoot = '/blog/';
staticsearch.wordWeight.title = 20;

// run indexer
await staticsearch.index();

When an option is not explicitly set, StaticSearch falls back to an environment variable, then the default value.

Run your application as normal to index your site, e.g. node index.js
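
The fallback order described above can be pictured as a small lookup. The following is a minimal sketch of the idea only, not StaticSearch's own code: the resolveOption helper and its parameters are hypothetical, and a plain object stands in for the real environment.

```javascript
// Hypothetical helper (not part of StaticSearch): pick the first defined value
// in the order explicit setting -> environment variable -> default.
function resolveOption(explicit, envName, defaultValue, env = process.env) {
  if (explicit !== undefined) return explicit;
  if (env[envName] !== undefined) return env[envName];
  return defaultValue;
}

// A stand-in environment object, so the example does not depend on the shell
const env = { BUILD_DIR: './mysite/' };

const buildDir = resolveOption(undefined, 'BUILD_DIR', './build/', env);
const searchDir = resolveOption('./mysite/index/', 'SEARCH_DIR', './build/search/', env);
const indexFile = resolveOption(undefined, 'SITE_INDEXFILE', 'index.html', env);

console.log(buildDir, searchDir, indexFile);
```

Here buildDir comes from the environment, searchDir from the explicit setting, and indexFile from the default.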

Indexer configuration #

StaticSearch can index most sites without configuration, but you can set options as CLI arguments, environment variables, or Node.js API properties.

Load environment files #

You can set StaticSearch options using environment variables, e.g.

terminal

export BUILD_DIR=./mysite/
staticsearch

You can also define variables in a file, e.g.

example .env

# StaticSearch environment variables
BUILD_DIR=./mysite/
SEARCH_DIR=./mysite/index/
BUILD_ROOT=/blog/

Then import this file on the command line:

terminal

staticsearch --env .env

Note that CLI arguments take precedence over environment variables.
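
To illustrate how a .env file and CLI arguments might combine, here is a rough sketch. The parseEnv function is a hypothetical, simplified parser (KEY=value lines, blank lines, and # comments only), and the cliArgs object is a stand-in for parsed command line arguments. StaticSearch's real parsing may differ.

```javascript
// Hypothetical parser for simple KEY=value lines; skips blank lines and # comments.
function parseEnv(text) {
  const vars = {};
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith('#')) continue;
    const eq = trimmed.indexOf('=');
    if (eq < 1) continue; // ignore lines without a key
    vars[trimmed.slice(0, eq).trim()] = trimmed.slice(eq + 1).trim();
  }
  return vars;
}

const fileVars = parseEnv(`
# StaticSearch environment variables
BUILD_DIR=./mysite/
SEARCH_DIR=./mysite/index/
BUILD_ROOT=/blog/
`);

// Stand-in for a value passed on the command line, e.g. --builddir ./other/
const cliArgs = { BUILD_DIR: './other/' };

// Later spreads win, so CLI values override values from the file
const settings = { ...fileVars, ...cliArgs };
console.log(settings);
```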

File indexing options #

The following options control how StaticSearch parses and indexes HTML files:

| CLI | ENV | API | description |
| --- | --- | --- | --- |
| -b, --builddir | BUILD_DIR | .buildDir | static site directory (./build/) |
| -s, --searchdir | SEARCH_DIR | .searchDir | search index data directory (./build/search/) |
| -d, --domain | SITE_DOMAIN | .siteDomain | site domain (http://localhost) |
| -r, --root | BUILD_ROOT | .buildRoot | site root path (/) |
| -i, --indexfile | SITE_INDEXFILE | .siteIndexFile | default index file (index.html) |
| -f, --ignorerobotfile | SITE_PARSEROBOTSFILE | .siteParseRobotsFile | parse robots.txt Disallows (true) |
| -m, --ignorerobotmeta | SITE_PARSEROBOTSMETA | .siteParseRobotsMeta | parse robots meta noindex (true) |

The build directory (--builddir | BUILD_DIR | .buildDir) is an absolute or relative path to the directory where you built your static site, e.g. ./build/.

The search directory (--searchdir | SEARCH_DIR | .searchDir) is an absolute or relative path to the directory where you want StaticSearch’s code and index files generated. You can use any path, but it should normally be inside your build directory, e.g. ./build/search/. If you don’t define a search directory, it defaults to a search subdirectory of the build directory.

If your pages use links with fully qualified URLs such as https://mysite.com/path/, you should set the domain (--domain | SITE_DOMAIN | .siteDomain) so StaticSearch can identify internal links.
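
One way to picture why the domain matters: a fully qualified link can only be recognised as internal if its host matches the site domain. The sketch below uses the standard URL class; the isInternalLink function is hypothetical and is not how StaticSearch is known to do it.

```javascript
// Hypothetical check: resolve the link against the site domain and compare hosts.
function isInternalLink(href, siteDomain = 'http://localhost') {
  try {
    return new URL(href, siteDomain).host === new URL(siteDomain).host;
  } catch {
    return false; // unparseable URLs are treated as external
  }
}

const sameSite = isInternalLink('https://mysite.com/path/', 'https://mysite.com');
const relative = isInternalLink('/about/', 'https://mysite.com');
const otherSite = isInternalLink('https://example.org/', 'https://mysite.com');

console.log(sameSite, relative, otherSite); // true true false
```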

StaticSearch presumes the web root path is / – so the file ./build/index.html is your home page. You can set it to another path, such as /blog/ (--root | BUILD_ROOT | .buildRoot). The file at ./build/index.html is then presumed to have the URL path /blog/index.html.
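
The mapping from a file in the build directory to its URL path can be sketched as below. This is an illustration under stated assumptions, not StaticSearch's implementation: fileToUrlPath is a hypothetical helper, and it assumes the file path starts with the build directory path.

```javascript
// Hypothetical mapping from a file inside the build directory to a URL path.
// Assumes `file` begins with `buildDir`.
function fileToUrlPath(file, buildDir = './build/', buildRoot = '/') {
  const strip = (p) => p.replace(/^\.\//, ''); // drop a leading "./"
  const relative = strip(file).slice(strip(buildDir).length); // path inside build dir
  const root = buildRoot.endsWith('/') ? buildRoot : buildRoot + '/';
  return root + relative.replace(/^\/+/, '');
}

const home = fileToUrlPath('./build/index.html');
const blogHome = fileToUrlPath('./build/index.html', './build/', '/blog/');
const blogPost = fileToUrlPath('./build/post/one/index.html', './build/', '/blog/');

console.log(home);     // /index.html
console.log(blogHome); // /blog/index.html
console.log(blogPost); // /blog/post/one/index.html
```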

StaticSearch presumes the HTML index file used as the default for directory paths is index.html. You can change this to another filename (--indexfile | SITE_INDEXFILE | .siteIndexFile), e.g. default.htm.

StaticSearch parses the robots.txt file in the root of the build directory and does not index the paths it disallows. The following example omits every file in the /secret/ and /personal/ directories:

example robots.txt

User-agent: *
Disallow: /secret/

User-agent: staticsearch
Disallow: /personal/

You can disable robots.txt parsing with --ignorerobotfile | SITE_PARSEROBOTSFILE=false | .siteParseRobotsFile=false.
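
As a rough sketch of the behaviour described above, the following code collects Disallow paths from groups that apply to * or staticsearch, then checks whether a URL path starts with any of them. The function names are hypothetical, and the simple prefix matching is an assumption; StaticSearch's own matching rules may be stricter or looser.

```javascript
// Hypothetical parser: collect Disallow paths that apply to "*" or "staticsearch".
function disallowedPaths(robotsTxt, agents = ['*', 'staticsearch']) {
  const paths = [];
  let group = [];      // user agents named in the current group
  let inRules = false; // true once a rule has followed the User-agent lines
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep < 0) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      if (inRules) { group = []; inRules = false; } // a new group starts here
      group.push(value.toLowerCase());
    } else if (field === 'allow' || field === 'disallow') {
      inRules = true;
      const applies = group.some((agent) => agents.includes(agent));
      if (field === 'disallow' && value && applies) paths.push(value);
    }
  }
  return paths;
}

// Assumption: a path is blocked when it starts with a disallowed prefix.
const isBlocked = (urlPath, paths) => paths.some((p) => urlPath.startsWith(p));

const rules = disallowedPaths(`User-agent: *
Disallow: /secret/

User-agent: staticsearch
Disallow: /personal/`);

console.log(rules); // [ '/secret/', '/personal/' ]
console.log(isBlocked('/personal/diary.html', rules)); // true
```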

StaticSearch parses HTML meta tags. A page is not indexed when it includes noindex in the content attribute of a robots or staticsearch meta tag:

example index.html

<meta name="robots" content="noindex">
<!-- OR -->
<meta name="staticsearch" content="noindex">

You can disable meta tag parsing with --ignorerobotmeta | SITE_PARSEROBOTSMETA=false | .siteParseRobotsMeta=false.
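
The noindex check can be approximated with a few regular expressions. This is a deliberately simple sketch: hasNoindex is a hypothetical function, and a real indexer would more likely use an HTML parser than pattern matching.

```javascript
// Hypothetical check: does any robots or staticsearch meta tag contain "noindex"?
function hasNoindex(html) {
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  return metaTags.some((tag) => {
    const name = /\bname\s*=\s*["']?([^"'\s>]+)/i.exec(tag);
    const content = /\bcontent\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (!name || !content) return false;
    const isRobotsTag = ['robots', 'staticsearch'].includes(name[1].toLowerCase());
    return isRobotsTag && content[1].toLowerCase().includes('noindex');
  });
}

const robotsTag = hasNoindex('<meta name="robots" content="noindex, nofollow">');
const searchTag = hasNoindex('<meta name="staticsearch" content="noindex">');
const otherTag = hasNoindex('<meta name="description" content="noindex tips">');

console.log(robotsTag, searchTag, otherTag); // true true false
```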

Example: index HTML files in the ./mysite/ directory, ignore robots.txt restrictions, and write search index files to the default location, ./mysite/search/:

terminal

staticsearch --builddir ./mysite/ --ignorerobotfile

Document indexing options #

StaticSearch attempts to locate your page’s primary content. You would not normally want to index text in headers, footers, and navigation that is repeated on every page. The indexer checks for content in:

  1. the HTML <main> element, but it ignores child content in <nav> and <menu> elements.

  2. when there’s no <main> element, the indexer uses the <body> element, but ignores child content in <header>, <footer>, <nav>, and <menu> elements (or elements with an ID or class containing header, footer, etc).

If this is not suitable, you can set alternative elements using CSS selectors to locate your main content:

| CLI | ENV | API | description |
| --- | --- | --- | --- |
| -D, --dom | PAGE_DOMSELECTORS | .pageDOMSelectors | nodes to include |
| -X, --domx | PAGE_DOMEXCLUDE | .pageDOMExclude | nodes to exclude |

Specify the location of your main content by setting --dom | PAGE_DOMSELECTORS | .pageDOMSelectors to a comma-delimited list of CSS selectors, e.g. 'article.primary, .secondary, aside'.

Then exclude any nodes within the main content by setting --domx | PAGE_DOMEXCLUDE | .pageDOMExclude to a comma-delimited list of CSS (child) selectors e.g. 'nav, menu, .private'.

Example: index content in #main and .secondary elements but exclude all <nav> and <div class="related"> elements within them:

terminal

npx staticsearch --dom '#main,.secondary' --domx 'nav,div.related'

Notes:

  1. Pages without #main or .secondary elements are not indexed.

  2. Be careful not to index the same elements more than once. Content inside a .secondary block that’s within a #main block is indexed twice which affects relevancy scores.

  3. Ensure excluded nodes are valid child selectors. Consider this example:

    npx staticsearch --dom '.main' --domx 'body nav'
    

    It would not exclude a <nav> inside the .main element, because the exclusion selector is matched within .main and there is no child body element to find there.

Word indexing options #

The following options control word indexing:

| CLI | ENV | API | description |
| --- | --- | --- | --- |
| -l, --language | LANGUAGE | .language | language (en) |
| -c, --wordcrop | WORDCROP | .wordCrop | crop word letters (7) |
| -S, --stopwords | STOPWORDS | .stopWords | comma-separated list of stop words |
| -W, --ignorestopdefault | STOPWORDS_DEFAULT | .stopWordsDefault | use language default stop words (true) |
| --weightlink | WEIGHT_LINK | .wordWeight.link | word weight for inbound links (5) |
| --weighttitle | WEIGHT_TITLE | .wordWeight.title | word weight for main title (10) |
| --weightdesc | WEIGHT_DESCRIPTION | .wordWeight.description | word weight for description (8) |
| --weighth2 | WEIGHT_H2 | .wordWeight.h2 | word weight for H2 headings (6) |
| --weighth3 | WEIGHT_H3 | .wordWeight.h3 | word weight for H3 headings (5) |
| --weighth4 | WEIGHT_H4 | .wordWeight.h4 | word weight for H4 headings (4) |
| --weighth5 | WEIGHT_H5 | .wordWeight.h5 | word weight for H5 headings (3) |
| --weighth6 | WEIGHT_H6 | .wordWeight.h6 | word weight for H6 headings (2) |
| --weightemphasis | WEIGHT_EMPHASIS | .wordWeight.emphasis | word weight for bold and italic (2) |
| --weightalt | WEIGHT_ALT | .wordWeight.alt | word weight for alt tags (1) |
| --weightcontent | WEIGHT_CONTENT | .wordWeight.content | word weight for content (1) |

The default --language | LANGUAGE | .language is English (en) which provides word stemming and stop word lists to reduce the size of the index and provide fuzzier searching.

StaticSearch automatically removes common stop words considered insignificant to the meaning of text, such as “and”, “the”, and “but” in English. It includes stop words courtesy of Stopwords ISO.

In some cases, such as smaller sites, you may wish to omit the default stop words using --ignorestopdefault | STOPWORDS_DEFAULT | .stopWordsDefault.

You can set custom stop words using --stopwords | STOPWORDS | .stopWords. For example, a site about “Acme widgets” probably mentions them on every page. The words are of little practical use in the search index so you could set the stop words 'acme,widget'.
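
The effect of stop words can be sketched with a simple filter. The code below is illustrative only: removeStopWords is a hypothetical function, the three-word default list is a tiny stand-in for the real language list, and the tokenizer handles plain ASCII text only.

```javascript
// Tiny stand-in for the language's default stop word list
const defaultStopWords = new Set(['and', 'the', 'but']);
// Custom stop words, as a comma-separated string like the STOPWORDS option
const customStopWords = new Set('acme,widget'.split(','));

// Hypothetical filter: split text into lowercase words and drop stop words.
function removeStopWords(text, useDefaults = true) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word
      && !customStopWords.has(word)
      && !(useDefaults && defaultStopWords.has(word)));
}

const words = removeStopWords('The Acme widget and the install guide');
const withoutDefaults = removeStopWords('The Acme widget guide', false);

console.log(words);           // [ 'install', 'guide' ]
console.log(withoutDefaults); // [ 'the', 'guide' ]
```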

By default, --wordcrop | WORDCROP | .wordCrop is 7: only the first 7 letters of any word are considered important. Therefore, the words “consider”, “considered”, and “considering” are effectively identical (and indexed as conside). You can change this limit if necessary.
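
Word cropping amounts to keeping the first few letters of each word. A minimal sketch, assuming a hypothetical cropWord helper and ignoring any stemming the indexer may also apply:

```javascript
// Hypothetical helper: keep only the first `limit` letters of a word.
function cropWord(word, limit = 7) {
  return word.toLowerCase().slice(0, limit);
}

const cropped = ['consider', 'considered', 'considering'].map((w) => cropWord(w));
const shorter = cropWord('considering', 6);

console.log(cropped); // [ 'conside', 'conside', 'conside' ]
console.log(shorter); // consid
```

A lower limit makes more words collapse into the same index entry, which shrinks the index but makes matches less precise.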

The --weight | WEIGHT_ | .wordWeight values define scores allocated to a word according to its location in a page. The defaults:

| word location | score |
| --- | --- |
| title / h1 heading | 10 |
| description | 8 |
| h2 heading | 6 |
| h3 heading | 5 |
| h4 heading | 4 |
| h5 heading | 3 |
| h6 heading | 2 |
| emphasis (bold/italic) | 2 |
| main content | 1 |
| alt tags | 1 |
| inbound link | 5 |

Consider a page with the word “static” in the title, an <h2> heading, and an <em>. The page scores 18 (10 + 6 + 2) for “static”, so it will appear above pages scoring 17 or less.

Any other page linking to it using the word “static” adds a further 5 points to the score.
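
The arithmetic above can be reproduced with a short sketch. The weights are the documented defaults; the scoreWord function and the location names passed to it are hypothetical and only illustrate the sum, not how StaticSearch stores or ranks results.

```javascript
// Default weights, keyed like the .wordWeight API properties
const wordWeight = {
  title: 10, description: 8, h2: 6, h3: 5, h4: 4, h5: 3, h6: 2,
  emphasis: 2, content: 1, alt: 1, link: 5,
};

// Hypothetical scorer: add the weight for every location where the word appears.
function scoreWord(locations, weights = wordWeight) {
  return locations.reduce((total, loc) => total + (weights[loc] ?? 0), 0);
}

const pageScore = scoreWord(['title', 'h2', 'emphasis']);
const withInboundLink = scoreWord(['title', 'h2', 'emphasis', 'link']);
const boostedTitle = scoreWord(['title', 'h2', 'emphasis'], { ...wordWeight, title: 20 });

console.log(pageScore, withInboundLink, boostedTitle); // 18 23 28
```

Raising the title weight to 20, as in --weighttitle 20, lifts the same page from 18 to 28.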

Example: change the language to Spanish, crop words to 6 characters, and set the title score to 20:

terminal

staticsearch --language es --wordcrop 6 --weighttitle 20

Logging options #

The following option controls logging verbosity:

| CLI | ENV | API | description |
| --- | --- | --- | --- |
| -L, --loglevel | LOGLEVEL | .logLevel | logging verbosity (2) |

Set:

Next steps #

After indexing your site for the first time, you can add StaticSearch search functionality using any of these options:

  1. a web component – add search using a single HTML <static-search> tag

  2. a bind module – add search by binding HTML elements to search functionality using HTML or JavaScript

  3. a search API – create your own search UI using the JavaScript API.