Automatically Append Contextual PDF Content Snippets to Excerpts in Search Results

Last updated December 18, 2018

One of SearchWP’s most powerful features is the ability to attribute the result weight of one post to another. To put it another way: when you attach WordPress Media to a post, that post is the ‘parent’ of the Media file. You can tell SearchWP that when it finds search result weight for a Media file, it should not link to the Attachment page itself (which not many people use anyway) but instead transfer that weight to the parent.

Transfer search weight to parent

 

When you’ve configured SearchWP in this way, Media is still searched like any other post type, but Media entries are never linked directly on search results pages because all of their keyword weight has been transferred to the parent. This results in a more natural workflow: a visitor is directed to the post in which a PDF is linked instead of to the PDF itself, for example.

Automatically appending contextual PDF snippets to the excerpt

You can take this integration one step further by automatically appending a contextual snippet from each attached PDF to your post excerpt on search results pages. This tells your readers, before they click through to the parent post, that there was a hit on a PDF linked within that post. Making this customization is as easy as adding the following filter to your active theme’s functions.php:

(Pay particular attention to the line near the end of the function that appends to $excerpt, as that defines the markup used in the output)

<?php
function maybe_append_searchwp_pdf_excerpt( $excerpt ) {
    global $post;

    $pdf_excerpt_length = 15; // number of words in PDF excerpt

    // only act on search results, and only when there is a current post to inspect
    if ( is_search() && $post && ! post_password_required() ) {
        // prep our 'environment'
        // set up common words
        $common_words = array();
        if ( class_exists( 'SearchWP' ) ) {
            $searchwp     = SearchWP::instance();
            $common_words = $searchwp->common;
        }

        // grab the terms
        $terms = explode( ' ', get_search_query() );
        $terms = array_map( 'sanitize_text_field', $terms );

        // if we're on a search page, we want to check to see if the current result
        // has a PDF with any of the search terms in the content
        // first we need to backtrack and find all of the PDFs that are attached to this post
        // since their weight has been attributed to this post
        $attached_pdfs = get_attached_media( 'application/pdf', $post->ID );

        foreach ( $attached_pdfs as $attached_pdf ) {
            // check to make sure there is file content to scan
            $pdf_content = get_post_meta( $attached_pdf->ID, 'searchwp_content', true );
            if ( $pdf_content ) {
                // find the first applicable search term (not a common word, and long enough)
                $flag = false;
                foreach ( $terms as $term ) {
                    if ( ! in_array( $term, $common_words ) && absint( apply_filters( 'searchwp_minimum_word_length', 3 ) ) <= strlen( $term ) ) {
                        $flag = $term;
                        break;
                    }
                }

                // without an applicable term there is nothing to match against
                if ( false === $flag ) {
                    continue;
                }

                // escape the term so regex metacharacters in the query are matched literally
                $pattern = '/\b' . preg_quote( $flag, '/' ) . '\b(?!([^<]+)?>)/i';

                // our haystack is the PDF content
                $haystack    = explode( ' ', $pdf_content );
                $pdf_excerpt = '';

                // build our contextual excerpt
                foreach ( $haystack as $haystack_key => $haystack_term ) {
                    if ( preg_match( $pattern, $haystack_term ) ) {
                        // our buffer is going to be 1/3 the total number of words in hopes of snagging one or two more
                        // highlighted terms in the second and third thirds
                        $buffer = (int) floor( ( $pdf_excerpt_length - 1 ) / 3 ); // -1 to accommodate the search term itself

                        // find the start point
                        $start     = 0;
                        $underflow = 0;
                        if ( $haystack_key < $buffer ) {
                            // the match occurred too early to get a proper first buffer
                            $underflow = $buffer - $haystack_key;
                        } else {
                            // there is enough room to grab a proper first buffer
                            $start = $haystack_key - $buffer;
                        }

                        // find the end point
                        $end = count( $haystack );
                        if ( $end > ( $haystack_key + ( $buffer * 2 ) ) ) {
                            $end = $haystack_key + ( $buffer * 2 );
                        }

                        // if we had an underflow (e.g. the first buffer wasn't fully included) grab more at the end
                        // (array_slice() stops at the end of the array if this overshoots)
                        $end += $underflow;

                        $pdf_excerpt = array_slice( $haystack, $start, $end - $start );
                        $pdf_excerpt = implode( ' ', $pdf_excerpt );
                        break;
                    }
                }

                // append our PDF-specific excerpt to the main excerpt
                if ( ! empty( $pdf_excerpt ) ) {
                    $pdf_label = get_the_title( $attached_pdf->ID ); // the PDF label will be the title of the PDF post
                    // this line defines the output markup; the label and snippet are escaped as plain text
                    $excerpt .= '<br /><br /><strong>' . esc_html( $pdf_label ) . '</strong>: ' . esc_html( $pdf_excerpt );
                }
            }
        }
    }

    return $excerpt;
}
add_filter( 'get_the_excerpt', 'maybe_append_searchwp_pdf_excerpt' );
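The windowing arithmetic in the filter above can be hard to follow inline, so here is a minimal sketch of it as a standalone function that runs outside WordPress. The function name pdf_excerpt_window() and the sample text are hypothetical and are not part of SearchWP; the logic is intended to mirror the filter's start, end, and underflow calculation, with the term passed through preg_quote() before it is used in the pattern.

```php
<?php
/**
 * Hypothetical helper (not part of SearchWP): returns a window of words
 * around the first word in $content that matches $term.
 */
function pdf_excerpt_window( $content, $term, $length = 15 ) {
    $words   = explode( ' ', $content );
    $buffer  = (int) floor( ( $length - 1 ) / 3 ); // -1 for the term itself
    $pattern = '/\b' . preg_quote( $term, '/' ) . '\b/i';

    foreach ( $words as $key => $word ) {
        if ( ! preg_match( $pattern, $word ) ) {
            continue;
        }

        // One buffer of words before the match, or fewer if it is near the start.
        $start     = max( 0, $key - $buffer );
        $underflow = max( 0, $buffer - $key );

        // Two buffers after the match, plus any words the start could not supply,
        // capped at the end of the content.
        $end = min( count( $words ), $key + ( $buffer * 2 ) + $underflow );

        return implode( ' ', array_slice( $words, $start, $end - $start ) );
    }

    return ''; // no word matched the term
}

// Made-up sample content for illustration.
$text = 'one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen';

echo pdf_excerpt_window( $text, 'eight', 6 ), "\n"; // prints "seven eight nine"
```

With the default length of 15 the buffer is 4 words, so a match in the middle of the content yields 4 words before the term and 8 words starting at the term. A match near the beginning shifts the missing leading words to the end of the window instead.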

When this filter is added, your standard excerpt is still shown on search results pages. If a post has attached PDFs that contain a search term, each of those PDFs is also called out by title, with a contextual excerpt from the PDF that includes at least one of the search terms. These callouts are appended to the original excerpt, so you don’t lose that valuable information in the process.
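If you expect to adjust the callout markup, one option is to build it in a small helper so it lives in one place. The following is only a sketch: format_pdf_callout() is a hypothetical name, the sample strings are made up, and htmlspecialchars() stands in for WordPress's esc_html() so that the snippet runs outside WordPress.

```php
<?php
/**
 * Hypothetical helper: builds the callout appended to the excerpt.
 * Inside WordPress you would likely use esc_html() instead of htmlspecialchars().
 */
function format_pdf_callout( $label, $snippet ) {
    return '<br /><br /><strong>'
        . htmlspecialchars( $label, ENT_QUOTES, 'UTF-8' )
        . '</strong>: '
        . htmlspecialchars( $snippet, ENT_QUOTES, 'UTF-8' );
}

// Made-up sample data for illustration.
$excerpt  = 'Our annual report is now available.';
$excerpt .= format_pdf_callout( 'Annual Report 2018', 'revenue grew in Q3 & Q4' );

echo $excerpt, "\n";
```

Because the callout is concatenated onto the existing excerpt, the original excerpt text is left untouched and the PDF title and snippet follow it after two line breaks.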
