Getting started with Full-Text Search (part 1 of 5)

A Tutorial: How a Simple Full-Text Search Could Work—Based on an Example

On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” …​ I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

— Charles Babbage
Passages from the Life of a Philosopher
Tip

The sources for this part are available to you at https://github.com/esurovtsev/getting-started-with-fulltext-search/tree/part-1.

Introduction

You will not find any quick, easy-to-apply advice on deploying Solr or Elasticsearch here, nor how to trigger your first search queries using one of these. Instead we will touch on some other topics: the fundamental principles behind full-text search and the kinds of problems you may encounter in creating your own search engine.

But you ask: “Why would I need to know the inner-workings of full-text search?”

After all, it isn’t likely that you will ever have to build your own search engine from scratch. There are plenty of time-tested, out-of-the-box search engines on the market for you to use.

So why can’t you simply trigger a search request on Solr and get great results? Of course, you can!

But if your search is a bit more complex than a default request with some filtering added, then—sooner or later—you will discover that the search system is not magic. And to get the best results, you will need to understand how it behaves and how to make it perform to meet your requirements.

In other words, at some point, there’s a big chance that you will need to understand exactly how your system gathers results, and the purpose of this or that portion of the code. You will need to learn what tokenization is; how ranking works; why stemming is required; and, finally, when normalization (what’s that, by the way?) should be applied.

So, in this article, we will give you a brief introduction to each of these principles, explaining the basic components involved in nearly every full-text search process.

How will we keep all this technical stuff interesting along the way?

Simple! We’ll code something real, creating our own small search engine as we go. By the time we finish, our search engine will be capable of performing full-text searches and of returning an adequate result set.

Note

The search engine we will build here will not have complete and comprehensive full-text search capabilities. We will keep the examples brief and uncomplicated so that the main points being illustrated remain clear to you.

What will it look like?

By the end of this article we will have created a shell application capable of performing a full-text search through a list of specified text files. The search will return a result-set displaying the names of those files with contents logically related to the query supplied.

Output
Getting started with Full-Text Search
==========================================
type `help` for showing available options or `exit` to finish the demo.

shell:>search --type SCORING --query "massive glass hill"
Results found: 27
---------------------------------------
The Glass Mountain.txt
The Blue Mountains.txt
The Grateful Beasts.txt
The Iron Stove.txt
The Glass Axe.txt
The Story of Big Klaus and Little Klaus.txt
...

If you want to fully understand the concepts being presented and see the whole process of the evolution of the search engine application, we strongly recommend that you read this article from start to finish. If you do, you will clearly grasp of concepts and methods used as you see them introduced step-by-step into the application.

If, on the other hand, you’re a “too long; didn’t read” person, then simply scroll to the end of the article. There, in the “Tutorial Conclusion,” you can get a very brief synopsis of the entire article.

So, let’s get started.

Getting Started

Before we really get underway, we need to lay some groundwork. In the end, we will have a CLI (Command-Line Interface) application that will accept user queries and return the relevant result-set to the user. However, we don’t really want to spend too much time implementing the shell itself. We would rather use something readily available and easily adaptable to our requirements. After all, this is an article about search technologies, and not about how to create an effective CLI utility.

Tip

The source for the final application are available to you at https://github.com/esurovtsev/getting-started-with-fulltext-search.

Speaking of technologies: for building the CLI we will use Spring Boot and Spring Shell. We have chosen to use Kotlin as our programming language as we believe it will be a great fit for our use-case. Even more importantly, we find it is a pleasure to do something in a Kotlin way. (However, feel free to use your favorite development tool and adapt the examples according to your wishes.)

So, how should we get started?

First, we will create an empty Maven project and define dependencies on Spring Boot, Spring Shell, Kotlin and a couple of other libraries we will need.

Next, and before we start working on the search engine itself, we will check that our initial setup works properly and assure ourselves that we have a working CLI utility.

Tip

In order to validate any changes that you made recently to your project, you may simply rebuild the whole application and run it again.

mvn clean build
java -jar target/getting-started-with-fulltext-search-0.1.0-SNAPSHOT.jar

Since we have not yet added any commands, only built-in commands are currently available. To get a list of them, just type help and then press Enter.

Output
Getting started with Full-Text Search
==========================================
type `help` for showing available options or `exit` to finish the demo.

shell:>help
AVAILABLE COMMANDS

Built-In Commands
clear: Clear the shell screen.
exit, quit: Exit the shell.
help: Display help about available commands.
script: Read and execute commands from a file.
stacktrace: Display the full stacktrace of the last error.

All seems to be working fine. To exit the shell, type the exit command and press Enter again.

Okay! Now we are ready to get started working on our new search engine.

Stories

Let’s begin by formulating the concept on which we will be working.

We have put together (in the repository linked above) a set of fairytales (this is the content of two books available through Project Gutenberg—“The Yellow Fairy Book” and “Stories by English Authors”) with each story in a separate text file. In total, you should have 56 files when retrieved from the repository.

By the end of this project, we expect to have an application that we can use to get a list of stories matching the text search query we supplied. Or, speaking in more technical language, we will have full-text search capabilities across the 56 files and our result-set will be the names of the files where the contents of the files match our search request. For this case, we assume that the file name is a unique identifier of a story (which is technically correct since every filename in the folder is unique and defines a specific story).

Although it doesn’t really matter exactly where the files are to be located, for simplicity, let’s place all available content into the resources/stories folder. Later we will get the location of the files as part of our service’s configuration.

Output
>ls -1 getting-started-with-fulltext-search/src/main/resources/stories
A Story about a Darning-needle.txt
Alphege, or the Green Monkey.txt
Blockhead Hans.txt
Fairer-than-a-Fairy.txt
Hermod and Hadvor.txt
How Six Men travelled through the Wide World.txt
How to tell a True Princess.txt
In the Land of Souls.txt
Minions of the Moon.txt
Mr. Lismore and the Widow.txt
Prince Ring.txt
Story of the Emperor's New Clothes.txt
The Blue Mountains.txt
The Box Tunnel.txt
The Boy and the Wolves, or the Broken Promise.txt
The Cat and the Mouse in Partnership.txt
The Crow.txt
...

Configuration

As we get started, let’s externalize the application’s configuration. This will allow us to gather all of the settings for the application from a config file, or even from the command-line. By doing this, we can easily change the parameters for our application without have to rebuild the whole program to do so.

We already have some data we want to store in our configuration. Specifically, we want the configuration to remember where the raw content files (that is, the stories in which we are going to search) are located.

Since we have selected Spring Framework as our main building tool, the default configuration should be stored in the application.yaml file.

Application.yaml
com.searching-bits.get-started-fulltext-search.data-dir: target/classes/stories

How do we go about getting these configuration data out of application.yaml?

Well, Spring’s standard way would be to use the @Value annotation on the component’s property where you want to initialize the value from the application.yaml file data.

However, since our expectations are that we will have multiple configuration parameters in the future, we will put them all together into a separate component. It is more convenient to have a single object responsible for all the settings than it is to spread different parameters across different parts of the application.

In the configuration component we will create two methods: one to get the path to the contents to be searched, and another that simply resupplies all of the predefined application parameters.

Configuration.kt
const val DATA_DIR_KEY = "com.searching-bits.get-started-fulltext-search.data-dir"

@Component
class Configuration {
    @Value("\${$DATA_DIR_KEY}")
    private lateinit var dataDir: String // (1)

    fun getAllProperties(): Map<String, String> =
      mapOf(DATA_DIR_KEY to File(dataDir).absolutePath) // (2)

    fun getDataDir(): String = getAllProperties()[DATA_DIR_KEY] ?: "" // (3)
}
  1. Let Spring to initialize a class member with the config property value.

  2. Get the configuration map with all the properties. The map will be later used to display the current configuration to the console.

  3. Return the location where all original “raw” content is stored.

Also, for making new features available to the end-user, we would need to define new shell commands. In this case, we need the one that displays all current system settings.

Commands.kt
@ShellMethod("Show current configuration") // (1)
fun config(): List<String> = // (2)
    config.getAllProperties().entries.map { "${it.key}: ${it.value}" }
  1. Spring annotation @ShellMethod defines a new shell command, a default value of the annotation contains a description for the command.

  2. The command name will be derived from the name of the function, where Java style camelCase format will be automatically transformed to dash-case style.

Now, let’s check how our newly added command works. To do that, we need to run our application and then ask for help.

Output
Getting started with Full-Text Search
==========================================
type `help` for showing available options or `exit` to finish the demo.

shell:>help
AVAILABLE COMMANDS

Built-In Commands
clear: Clear the shell screen.
exit, quit: Exit the shell.
help: Display help about available commands.
script: Read and execute commands from a file.
stacktrace: Display the full stacktrace of the last error.

Commands
config: Show current configuration

As we can see, there is the new command config listed in the shell. Let’s try it and see what it does.

Output
shell:>config
com.searching-bits.get-started-fulltext-search.data-dir: target/classes/stories

Documents

Are we ready to start working on the search?

Hmm, probably not!

We’re still missing two things: 1) we need a list of all the files, and 2) we need to have the ability to retrieve the contents of each individual story, along with the story’s unique identifier. (Remember: the file name is the story’s unique identifier.)

In order to meet these requirements, we will create a new component responsible for providing each document’s data. We will call this component DocumentService.

DocumentService.kt
@Service
class DocumentService(private val config: Configuration) { // (1)
    fun findAllIds(): List<String> =
        File(config.getDataDir()).listFiles().map { it.name } // (2)

    fun findById(documentId: String): String? =
        this
            .findAllIds()
            .findLast { it == documentId } // (3)
            ?.let { File("${config.getDataDir()}/$it") } // (4)
            ?.takeIf { it.exists() && it.isFile && it.canRead() } // (5)
            ?.let { it.readText() } // (6)
}
  1. Inject the current configuration component into the service.

  2. Get the list of files from dataDir and return their names back.

  3. Search for a file name we need in the list of all file names.

  4. Create a file object based on the found name.

  5. Check if the given file exists and can be read.

  6. Load the content of the file and return it back to caller.

Let’s make sure that everything works as we expect. But in order to do that, we have to first add another command.

Commands.kt
@ShellMethod("Lists documents or document content")
fun document(
  @ShellOption(help = "Document ID or 'all' for getting list of all documents")
  id: String
): List<String> =
  if (id == "all") {
    documentService.findAllIds()
  } else {
    listOf(documentService.findById(id) ?: "document not found")
  }

Okay. Let’s see how it works.

Output
shell:>document all
The Six Swans.txt
The Snow-daughter and the Fire-son.txt
The Blue Mountains.txt
The Golden Crab.txt
The Flower Queen's Daughter.txt
The Glass Mountain.txt
The Nightingale.txt
Mr. Lismore and the Widow.txt
How to tell a True Princess.txt
The Witch.txt
The Nixy.txt
…

shell:>document "The Iron Stove.txt"
THE IRON STOVE(7)

(7) Grimm.

Once upon a time when wishes came true there was a king's son who was
enchanted by an old witch, so that he was obliged to sit in a large iron
stove in a wood. There he lived…
Eugene Surovtsev
Eugene Surovtsev
Search Enthusiast
comments powered by Disqus

Related