How does the indexer work? (What is tokenizing?)
SearchWP’s indexer is not a crawler. That is to say it does not operate like many search engines (e.g. Google) and it does not load pages of your website to index content. The indexer pulls data directly from the database. This is an intentional design decision that has many pros and a few cons.
The biggest advantage of this implementation is that the indexer does not have to deal with malformed HTML; instead it plucks the content of each post in full and indexes that. Connecting to the database that powers the website is also advantageous compared to crawling a site, which requires scraping each page for new links as it is loaded.
Once the content has been retrieved from the database, the indexer tokenizes everything. SearchWP creates ‘tokens’ out of the site content so as to make it searchable. There are a number of reasons for this design decision as well, and with that come some pros and cons.
SearchWP creates tokens which boil down to the following:
- no punctuation
- no white space (spaces, tabs, etc.)
- no special characters
This provides a solid baseline upon which to build a search algorithm that works great for a very large percentage of websites. Add to that any number of SearchWP’s extensions and that percentage increases all the more.
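To make the idea of tokenizing concrete, here is a minimal sketch in Python. It is not SearchWP's actual code: the function name `tokenize`, the lowercasing step, and the exact definition of a "special character" are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Split text into tokens with no punctuation, white space, or special characters."""
    # Assumption: lowercasing and splitting on every non-word character
    # (plus underscore) approximates the rules listed above.
    return [token for token in re.split(r"[\W_]+", text.lower()) if token]

print(tokenize("Hello, world! It's  2-for-1\ttoday."))
```

Note how the punctuation inside "It's" and "2-for-1" splits those strings into several tokens.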
There is an exception to these rules: the regex whitelist. SearchWP runs all content through a list of regular expressions which catch some common string patterns that would otherwise be destroyed by the tokenizing process, such as dates, initials, phone numbers, IP addresses, and SKUs, to name a few. Any regex whitelist matches are ‘excused’ from the tokenization process and indexed as-is. You have full control over the regex whitelist patterns used by SearchWP by way of a hook.
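The whitelist idea itself (not the hook) can be sketched as follows. The two patterns and the function name `tokenize_with_whitelist` are hypothetical and are not SearchWP's real pattern list; the sketch only shows how matches could be set aside before the rest of the content is tokenized.

```python
import re

# Hypothetical whitelist patterns, for illustration only.
WHITELIST = [
    r"\b\d{1,3}(?:\.\d{1,3}){3}\b",  # IPv4-like addresses, e.g. 192.168.0.1
    r"\b\d{4}-\d{2}-\d{2}\b",        # ISO-style dates, e.g. 2020-01-31
]

def tokenize_with_whitelist(text):
    """Keep whitelist matches as-is, then tokenize whatever remains."""
    kept = []
    for pattern in WHITELIST:
        kept.extend(re.findall(pattern, text))
        # Remove the matches so the tokenizer cannot break them apart.
        text = re.sub(pattern, " ", text)
    rest = [token for token in re.split(r"[\W_]+", text.lower()) if token]
    return kept + rest

print(tokenize_with_whitelist("Server 192.168.0.1 went down on 2020-01-31."))
```

Without the whitelist step, the IP address would be indexed as four separate numbers and could not be found as a whole.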
TL;DR: Since SearchWP tokenizes all content, searching for phrases (e.g. multiple words wrapped in quotes with the intention of limiting results to those that contain the quoted phrase as-is) is not and cannot be supported.
The primary reason SearchWP has implemented tokenization as a foundational part of the indexing and searching process is performance. One of the main reasons MySQL (the database engine powering your WordPress site) ends up being a bottleneck with search is that it was not designed for large-scale full-text searches. Breaking up all of the content into single tokens helps with the performance of searches quite a bit, so that’s the approach SearchWP takes. It’s also the basis for the weighting system.
As a result, SearchWP’s index does not keep track of the order in which tokens are arranged per post, so it is not able to perform phrase-based searches as you can in popular search engines such as Google or in alternative search technologies such as Elasticsearch.
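A small sketch shows why an index without token order cannot answer phrase searches. It assumes, purely for illustration, that the index stores a count per token per post; the source does not describe SearchWP's real index layout, and `index_post` is a hypothetical name.

```python
import re
from collections import Counter

def index_post(text):
    """Reduce a post to token counts; the original word order is discarded."""
    tokens = [token for token in re.split(r"[\W_]+", text.lower()) if token]
    return Counter(tokens)

first = index_post("Dog bites man")
second = index_post("Man bites dog")

# Both posts produce identical index entries, so the phrase "dog bites man"
# cannot be told apart from "man bites dog" once indexing is done.
print(first == second)
```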
SearchWP’s algorithm was designed so that results with the most matches for each search token bubble up to the top, using the weight system configured on the SearchWP settings screen. With this algorithm, chances are very good that the results you’d expect from a “quoted search phrase” will bubble up to the top of search results based on that system alone. The observed drawback is that, unlike a search system that supports quoted search phrases, SearchWP does not omit results that don’t contain the phrase exactly as-is. One way to further restrict the search results in SearchWP is to enforce AND logic only, using the searchwp_and_logic_only hook, which tells SearchWP to only return results that have all of the submitted search terms.
By default, SearchWP may take multiple passes at the index when performing searches. The first pass is an AND logic pass which checks for results that have all search terms present. If it finds a set of results on that first pass, those results are displayed. If no results are found, SearchWP will fall back to an OR logic pass which will find results with any search terms present. Using searchwp_and_logic_only, you can effectively omit the OR logic pass and restrict SearchWP to only use AND logic when performing searches.
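The two-pass behavior can be sketched like this. The `search` function, its `and_logic_only` parameter, and the dictionary-based index are hypothetical stand-ins for illustration; they mirror the behavior described above, not SearchWP's implementation or the hook's actual signature.

```python
def search(index, terms, and_logic_only=False):
    """index maps a post ID to the set of tokens indexed for that post."""
    wanted = set(terms)

    # First pass (AND logic): posts that contain every search term.
    and_hits = [post_id for post_id, tokens in index.items() if wanted <= tokens]
    if and_hits or and_logic_only:
        return and_hits

    # Fallback pass (OR logic): posts that contain any search term.
    return [post_id for post_id, tokens in index.items() if wanted & tokens]

index = {
    1: {"red", "shoes"},
    2: {"red", "hat"},
    3: {"blue", "hat"},
}

print(search(index, ["red", "hat"]))                          # AND pass succeeds
print(search(index, ["red", "scarf"]))                        # falls back to OR
print(search(index, ["red", "scarf"], and_logic_only=True))   # OR pass skipped
```

With AND logic enforced, a search with no post containing every term returns nothing rather than falling back to partial matches.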