Document Processing in SearchWP

Last updated June 6, 2017

When Media is enabled in any engine configuration, SearchWP will attempt to extract the plain text contained within documents uploaded to the Media library.

NOTE: Files MUST BE present in the Media library of the WordPress Dashboard. There is no way around this.

SearchWP will extract the text from many common file types including:

  • Plain text
  • CSV
  • Rich text (RTF)
  • PDFs (that have readable text*)
  • Office Documents (.docx, .xlsx, .pptx, NOT .doc)
  • OpenOffice Documents (.odt, .ods, .odp)

* To verify your PDF has readable text, try to copy a sentence to your clipboard and paste it somewhere. If you cannot select or paste it, the PDF does not have readable text.
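
As a rough pre-upload check, the list above can be turned into a small helper. This is only a sketch: the function name is hypothetical, the extensions are transcribed from the list above (with .txt assumed for "Plain text"), and nothing here is read from SearchWP itself. A matching extension also says nothing about whether a PDF actually contains readable text.

```php
<?php
/**
 * Hypothetical helper: returns true when a file name ends in one of the
 * extensions listed above. The list is transcribed by hand from this
 * article (".txt" is an assumption for "Plain text"); it is not queried
 * from SearchWP.
 */
function my_is_searchwp_parsable_extension( $filename ) {
    $supported = array( 'txt', 'csv', 'rtf', 'pdf', 'docx', 'xlsx', 'pptx', 'odt', 'ods', 'odp' );

    // pathinfo() returns an empty string when the name has no extension.
    $extension = strtolower( pathinfo( $filename, PATHINFO_EXTENSION ) );

    return in_array( $extension, $supported, true );
}
```

For example, my_is_searchwp_parsable_extension( 'report.DOCX' ) returns true because the comparison is case-insensitive, while my_is_searchwp_parsable_extension( 'legacy.doc' ) returns false.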

How document processing works

As soon as a file is uploaded to your Media library, SearchWP adds it to the indexer queue like any other post (Post, Page, Custom Post Type, etc.). The indexer uses the information extracted by WordPress to determine whether it is a document that SearchWP can read.

Tip: Documents take longer to parse than standard posts (Posts, Pages, Custom Post Types) because of the extra work being done to extract the text. Please allow adequate time for the indexer to work.

Once the indexer has processed the document, the parsed text is stored and subsequently indexed. You can control how much weight is given to the extracted document content by setting the appropriate field in your engine configuration.

PDF metadata (when applicable) is also extracted and stored with its own weight per engine configuration.

How results appear

SearchWP does not customize the way search results are shown; the existing results template is used. As with native WordPress search, Media results are shown like any other post type. Media entries have a permalink and a title, which search results templates commonly use for each result, including documents returned by SearchWP.

However, search results templates often include post excerpts (which Media entries store as the Caption), and this field is rarely populated, especially for documents. An example Media result might look something like this:

[Screenshot: Not a very useful Media excerpt]

Automatically generate an excerpt

You can automatically generate an excerpt by installing the Term Highlight extension and using its built-in functionality to search all post content for a contextual excerpt, changing the result to look like this:

[Screenshot: Note the improved caption which was automatically pulled using Term Highlight]

To automatically generate an excerpt for Media after installing Term Highlight, add the following to your theme’s functions.php:

NOTE: The if condition at the top of this snippet calls is_search() to check whether this is the default search results page. If you are using a Supplemental Engine, you will need to replace the is_search() condition with one that matches your Supplemental Engine Page Template.

<?php
/**
 * Replace the excerpt of Media (attachment) results with a contextual
 * excerpt generated by the Term Highlight extension.
 */
function searchwp_term_highlight_auto_excerpt( $excerpt ) {
    global $post;

    // Leave the excerpt alone unless Term Highlight is active, this is the
    // default search results page, and the current result is a Media entry.
    if ( ! function_exists( 'searchwp_term_highlight_get_the_excerpt_global' ) || ! is_search() || 'attachment' !== get_post_type( $post ) ) {
        return $excerpt;
    }

    // prevent recursion
    remove_filter( 'get_the_excerpt', 'searchwp_term_highlight_auto_excerpt' );
    $global_excerpt = searchwp_term_highlight_get_the_excerpt_global( $post->ID, null, get_search_query() );
    add_filter( 'get_the_excerpt', 'searchwp_term_highlight_auto_excerpt' );

    return $global_excerpt;
}
add_filter( 'get_the_excerpt', 'searchwp_term_highlight_auto_excerpt' );

With that in place, the excerpt will be filtered and, if a Media result is being output, Term Highlight will find an appropriate excerpt to use in the results template.
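
For a Supplemental Engine, the note above calls for swapping the is_search() check for one that matches your page template. The following is a minimal sketch of one way that might look, not a definitive implementation. It assumes a template file named page-supplemental-search.php and a URL parameter named swpquery that carries the search terms; both names are placeholders, as is the function name. Because get_search_query() only reflects the main query's s variable, the sketch reads the terms from the URL instead.

```php
<?php
/**
 * Sketch: Term Highlight excerpts for Media results on a Supplemental
 * Engine page. The template name and query parameter are assumptions;
 * replace them with the ones your site actually uses.
 */
function my_supplemental_auto_excerpt( $excerpt ) {
    global $post;

    $template  = 'page-supplemental-search.php'; // hypothetical template file
    $query_var = 'swpquery';                     // hypothetical URL parameter

    // Leave the excerpt alone unless Term Highlight is active, the current
    // page uses the Supplemental Engine template, and the result is Media.
    if ( ! function_exists( 'searchwp_term_highlight_get_the_excerpt_global' )
        || ! is_page_template( $template )
        || 'attachment' !== get_post_type( $post ) ) {
        return $excerpt;
    }

    // Read the search terms from the URL, since get_search_query() only
    // knows about the main query.
    $terms = isset( $_GET[ $query_var ] )
        ? sanitize_text_field( wp_unslash( $_GET[ $query_var ] ) )
        : '';

    // prevent recursion
    remove_filter( 'get_the_excerpt', 'my_supplemental_auto_excerpt' );
    $global_excerpt = searchwp_term_highlight_get_the_excerpt_global( $post->ID, null, $terms );
    add_filter( 'get_the_excerpt', 'my_supplemental_auto_excerpt' );

    return $global_excerpt;
}

// Register the filter. The guard only matters if this file is ever loaded
// outside WordPress; inside functions.php add_filter() always exists.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'get_the_excerpt', 'my_supplemental_auto_excerpt' );
}
```

This variant relies on the global $post being set for each result, so the results loop in your template needs to call setup_postdata() (or equivalent) before the excerpt is output.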

Other closely related KB articles

Customizing parsed content — You have complete control over parsed document content, and you can customize it if you’d like.

Linking to file instead of Attachment page — Many themes do not account for Attachment templates, so it often makes sense to link directly to the file.

Attributing Post Parent — SearchWP allows you to attribute keyword weight to the post parent, which is very useful when handling documents in search results.

“SearchWP Failed to Index X Posts” — SearchWP’s document processing can sometimes cause a post to fail to index. Find out common causes and fixes in this KB article.
