Introduction - Workertown

What is it?

@workertown/search provides a text search engine via a REST API - think of it as a self-hosted Algolia or Elasticsearch... just without the front-end.

You index documents, and then you can search for them via their text content.

It is best suited to small and medium-sized datasets, given the memory constraints found in most edge runtime environments. Examples of what it can be used for:

A search engine for a simple SaaS product
A search engine for a documentation site
A search engine for a blog

Getting started

Installation

You can install @workertown/search via npm/yarn/pnpm:

npm install @workertown/search

Creating a server

In your main file (e.g. worker.ts), import the search factory function and call it.

import { search } from "@workertown/search";

//...or `import search from "@workertown/search";`

const server = search();

//...probably `export default server;`

Like all Workertown services, the created server is a Hono-based instance with a fetch method.

The search function accepts a single argument, an optional options object. This options object allows you to customise the search service to fit your needs (see configuration for a full set of options).

import { search } from "@workertown/search";

// These are the default values...
const server = search({
  auth: {
    apiKey: {
      env: {
        apiKey: "SEARCH_API_KEY", // Environment variable for the API key
      },
    },
    basic: {
      env: {
        username: "SEARCH_USERNAME", // Environment variable for the admin username
        password: "SEARCH_PASSWORD", // Environment variable for the admin password
      },
    },
    jwt: {
      env: {
        jwksUrl: "SEARCH_JWKS_URL", // Environment variable for the JWKS URL
        secret: "SEARCH_JWT_SECRET", // Environment variable for the fixed JWT secret
        issuer: "SEARCH_JWT_ISSUER", // Environment variable for the JWT issuer
        audience: "SEARCH_JWT_AUDIENCE", // Environment variable for the JWT audience
      },
    },
  }, // See the "Authentication" section for all of the available options in `auth`
  basePath: "/", // Base path for the server to serve endpoints from
  endpoints: {
    v1: {
      admin: "/admin", // Base path for the server to serve admin endpoints from
      documents: "/docs", // Base path for the server to serve documents endpoints from
      search: "/search", // Base path for the server to serve search endpoints from
      suggest: "/suggest", // Base path for the server to serve suggest endpoints from
      tags: "/tags", // Base path for the server to serve tag endpoints from
    },
    public: "/", // Base path for the server to serve public endpoints from
  },
  env: {
    cache: "SEARCH_CACHE", // Environment variable for the cache KV binding (Cloudflare Workers only)
    database: "SEARCH_DB", // Environment variable for the D1 database binding (Cloudflare Workers only)
  },
  search: {
    scanRange: 1000, // The maximum number of records to take from storage at one time to search across
    stopWords: new Set([
      /* ... */
    ]), // Set of words to ignore when matching search results
  },
});
//...
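
As a rough illustration of what the stopWords option is for, the sketch below drops stop words from a piece of text before matching. It is not the package's implementation: the tokenize helper and the short word list are hypothetical, and the real default list is elided above.

```typescript
// Hypothetical stop-word list, for illustration only.
const stopWords = new Set<string>(["a", "an", "and", "is", "the", "this"]);

// Hypothetical helper: lower-case the text, split it on non-alphanumeric
// characters, then drop empty tokens and stop words.
function tokenize(text: string, ignore: Set<string>): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0 && !ignore.has(token));
}

const terms = tokenize("This is a test document.", stopWords);
console.log(terms); // only "test" and "document" survive the filter
```

A smaller set of terms means fewer spurious matches on very common words.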

Concepts

Tenants

@workertown/search supports multi-tenant setups, allowing you to run separate search indexes for different "tenants" (e.g. different customers, different projects, etc). A tenant is simply identified by a unique string name.

If you do not need multi-tenancy, you can simply use a "default" tenant.

Indexes

All documents are stored within an index, to allow searches to be narrowed. An index is identified by a unique string name.

Searching within a single index is not required, but it is recommended.
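
To make the relationship between tenants and indexes concrete, here is a minimal sketch of how a query's scope could be selected. The IndexedRecord shape and the selectScope function are hypothetical and exist only for this illustration; they are not part of the @workertown/search API.

```typescript
// Hypothetical record shape: a document plus the tenant and index it belongs to.
interface IndexedRecord {
  tenant: string;
  index: string;
  document: { id: string; [field: string]: unknown };
}

// Select the records a query may see: always one tenant, optionally one index.
function selectScope(records: IndexedRecord[], tenant: string, index?: string): IndexedRecord[] {
  return records.filter(
    (record) => record.tenant === tenant && (index === undefined || record.index === index),
  );
}

const records: IndexedRecord[] = [
  { tenant: "default", index: "blog", document: { id: "1", title: "Hello, world!" } },
  { tenant: "default", index: "docs", document: { id: "2", title: "Getting started" } },
  { tenant: "acme", index: "blog", document: { id: "3", title: "Acme news" } },
];

const wholeTenant = selectScope(records, "default"); // documents 1 and 2
const singleIndex = selectScope(records, "default", "blog"); // document 1 only
console.log(wholeTenant.length, singleIndex.length);
```

Naming an index shrinks the set of documents a search has to consider, which is why it is the recommended way to search.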

Documents

A document is a single "thing" that can be indexed and searched for via some (or all) of its text content. A document can be anything you want it to be - it has no predefined structure.

An example document might look like this:

{
  "id": "1",
  "title": "Hello, world!",
  "content": "This is a test document."
}

Documents can be optionally tagged, to allow for more fine-grained searches.

Searches

A search is a query that is run against a tenant to find matching documents. A search can be run across a single index, or across all data within a tenant. When searching, you specify which fields within your documents to search across, allowing you to narrow down the matching results in denormalised datasets. You can also optionally specify tags to narrow the search.

Fuzzy matching, prefix matching and exact matching are all optionally supported within a search query.
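
The three matching modes can be sketched for a single term as follows. This is only an illustration of the concepts, not how the package or its underlying search library implements them; termMatches, editDistance and the maxEdits parameter are hypothetical names.

```typescript
type MatchMode = "exact" | "prefix" | "fuzzy";

// Levenshtein edit distance between two strings (row-by-row dynamic programming).
function editDistance(a: string, b: string): number {
  let previous: number[] = [];
  for (let j = 0; j <= b.length; j++) previous.push(j);
  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = previous[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      current.push(Math.min(previous[j] + 1, current[j - 1] + 1, substitution));
    }
    previous = current;
  }
  return previous[b.length];
}

// Decide whether an indexed term matches a query term under the given mode.
function termMatches(query: string, term: string, mode: MatchMode, maxEdits = 1): boolean {
  const q = query.toLowerCase();
  const t = term.toLowerCase();
  if (mode === "exact") return t === q;
  if (mode === "prefix") return t.startsWith(q);
  return editDistance(q, t) <= maxEdits; // fuzzy: tolerate a few typos
}

console.log(termMatches("hello", "Hello", "exact")); // true
console.log(termMatches("hel", "Hello", "prefix")); // true
console.log(termMatches("helo", "Hello", "fuzzy")); // true (one missing letter)
console.log(termMatches("hel", "Hello", "exact")); // false
```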

Suggestions

A suggestion is a query that is run against a tenant to find matching terms within a dataset. A suggestion can be run across a single index, or across all data within a tenant. When suggesting, you specify which fields within your documents to search across, allowing you to narrow down the matching results in denormalised datasets. You can also optionally specify tags to narrow the search.

Fuzzy matching, prefix matching and exact matching are all optionally supported within a suggestion query.
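
The difference from a search is that a suggestion returns terms rather than documents. A minimal sketch of that idea, using prefix matching only, might look like this; suggestTerms and the Doc type are hypothetical and not part of the package's API.

```typescript
type Doc = { [field: string]: unknown };

// Collect the distinct terms in the given fields that start with the query.
function suggestTerms(docs: Doc[], fields: string[], query: string): string[] {
  const prefix = query.toLowerCase();
  const found = new Set<string>();
  for (const doc of docs) {
    for (const field of fields) {
      const value = doc[field];
      if (typeof value !== "string") continue; // only text fields are considered
      for (const term of value.toLowerCase().split(/[^a-z0-9]+/)) {
        if (term.length > 0 && term.startsWith(prefix)) found.add(term);
      }
    }
  }
  return Array.from(found).sort();
}

const docs: Doc[] = [
  { id: "1", title: "Hello, world!", content: "This is a test document." },
  { id: "2", title: "Testing at the edge", content: "Hello again." },
];

const suggestions = suggestTerms(docs, ["title", "content"], "te");
console.log(suggestions); // terms beginning with "te" from both fields
```

Restricting the fields argument to "title" alone would leave out terms that only appear in "content".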

How does it work?

Search at the edge

Running a text search index on the edge is complicated - we don't have the luxury of copious amounts of RAM to use or disk space to store indexes on directly. @workertown/search manages to provide similar functionality to a traditional search engine by combining edge-compatible storage with the wonderful MiniSearch package.

In short, anything that is indexed is stored in an edge-compatible storage layer (database) - when a query is run, we load the whole index into memory and then instantiate MiniSearch to actually perform the search. This may seem wasteful when it comes to resources, but this simple access pattern can make "hot" indexes/search terms highly cacheable.
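
The load-then-search access pattern can be sketched as below. This is a simplified illustration under stated assumptions: DocumentStore and MemoryStore are hypothetical stand-ins for the storage layer (kept synchronous for brevity, where real storage would be asynchronous), and a naive substring scorer takes the place of the real search library.

```typescript
interface StoredDocument {
  id: string;
  index: string;
  data: { [field: string]: string };
}

// Hypothetical stand-in for the edge-compatible storage layer.
interface DocumentStore {
  loadIndex(index: string): StoredDocument[];
}

class MemoryStore implements DocumentStore {
  private readonly rows: StoredDocument[];
  constructor(rows: StoredDocument[]) {
    this.rows = rows;
  }
  loadIndex(index: string): StoredDocument[] {
    return this.rows.filter((row) => row.index === index);
  }
}

// Naive scorer: count how many query terms occur in the chosen fields.
function score(doc: StoredDocument, fields: string[], terms: string[]): number {
  const text = fields.map((field) => doc.data[field] ?? "").join(" ").toLowerCase();
  return terms.filter((term) => text.includes(term)).length;
}

function runSearch(store: DocumentStore, index: string, fields: string[], query: string): string[] {
  const loaded = store.loadIndex(index); // 1. load the whole index into memory
  const terms = query.toLowerCase().split(/\s+/).filter((term) => term.length > 0);
  return loaded
    .map((doc) => ({ id: doc.id, score: score(doc, fields, terms) })) // 2. match in memory
    .filter((hit) => hit.score > 0)
    .sort((a, b) => b.score - a.score) // 3. best matches first
    .map((hit) => hit.id);
}

const store = new MemoryStore([
  { id: "1", index: "blog", data: { title: "Hello, world!", content: "This is a test document." } },
  { id: "2", index: "blog", data: { title: "Edge search", content: "Searching at the edge is a test of patience." } },
  { id: "3", index: "docs", data: { title: "Test plan", content: "Not in the blog index." } },
]);

console.log(runSearch(store, "blog", ["title", "content"], "test document"));
```

Nothing built during the search outlives the call, which is what makes caching the result for a given index and query straightforward.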

To keep the number of documents loaded for a search as small as possible, @workertown/search also provides a "tagging" mechanism that lets you group similar documents together and search only within that group.
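
A minimal sketch of narrowing by tags follows. It assumes a document qualifies only if it carries every requested tag, which is an assumption made for this illustration rather than documented behaviour; TaggedDocument and filterByTags are hypothetical names.

```typescript
interface TaggedDocument {
  id: string;
  tags: string[];
}

// Assumption: a document qualifies only if it carries every requested tag.
function filterByTags<T extends TaggedDocument>(docs: T[], tags: string[]): T[] {
  return docs.filter((doc) => tags.every((tag) => doc.tags.indexOf(tag) !== -1));
}

const tagged: TaggedDocument[] = [
  { id: "1", tags: ["guide", "en"] },
  { id: "2", tags: ["guide", "fr"] },
  { id: "3", tags: ["changelog", "en"] },
];

const englishGuides = filterByTags(tagged, ["guide", "en"]); // document 1 only
const untouched = filterByTags(tagged, []); // no tags requested: nothing is filtered out
console.log(englishGuides.length, untouched.length);
```

Only the documents that survive this filter would need to be loaded and searched.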

What are the limitations?

Because we need to load the whole index/search space into memory, there are some limitations to be aware of when deciding whether @workertown/search is right for you.

@workertown/search is best suited to projects where:

The number of total documents within an index is relatively small (in the hundreds or thousands)
The size of a given document is not more than a few kilobytes
You can utilise tags to limit the number of documents that need to be loaded into memory
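
A back-of-the-envelope estimate can help you judge whether a dataset fits these guidelines. The sketch below only multiplies document count by average size; every figure in it, including the memory budget, is a made-up example, and the extra memory needed by the in-memory index itself is not accounted for.

```typescript
// Raw size of the documents that must be loaded for one search.
// This ignores the extra memory the in-memory index itself needs.
function estimateLoadedBytes(documentCount: number, averageDocumentBytes: number): number {
  return documentCount * averageDocumentBytes;
}

// Hypothetical figures, for illustration only.
const documentCount = 5000;
const averageDocumentBytes = 4 * 1024; // 4 KiB per document
const memoryBudgetBytes = 128 * 1024 * 1024; // assumed runtime budget of 128 MiB

const loadedBytes = estimateLoadedBytes(documentCount, averageDocumentBytes);
const shareOfBudget = loadedBytes / memoryBudgetBytes;

console.log(`${(loadedBytes / (1024 * 1024)).toFixed(1)} MiB of raw documents`);
console.log(`${(shareOfBudget * 100).toFixed(1)}% of the assumed budget`);
```

If the estimate takes up a large share of the budget available in your runtime, narrowing searches by index or tags becomes more important.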

The "EJECT" button

Things don't always work out... and software doesn't always scale with your business, or stand the test of time. That's OK - it's actually a good thing (mostly)!

If you've been running @workertown/search in production, getting your data out to move it to another platform should be relatively straightforward. All persisted data lives in the storage layer, so getting "at" your data is as simple as using whatever backup/dumping/querying tools are available for your storage of choice.

That's the beauty of how simple @workertown/search is - the search index itself only ever lives in memory, and so cannot be (and doesn't need to be) backed up or exported directly.