How does PDF parsing and indexing work?

Last updated February 6, 2015

One of SearchWP’s most popular features is its ability to index PDF content. Unfortunately, PDF parsing can be a complex, server-intensive process, but SearchWP aims to make it as easy as possible for each customer. It is important to understand how SearchWP parses and indexes PDFs so support staff can best assist you should you find any problems with PDF parsing, indexing, or results.

What PDFs can SearchWP index?

SearchWP can index any piece of content with a post ID, and that includes everything in the Media library of your WordPress install. (Please note that ‘post’ in this case means a WordPress post object, with a lowercase p, not the ‘Post’ with a capital P that appears in the WordPress admin menu.) This is an integral part of SearchWP’s search algorithm, and it is also the reason that SearchWP cannot index PDFs that are not stored in the Media library.

There are some file management plugins out there that aim to help in organizing or restricting access to media, but that means the files themselves are stored outside the Media library, usually referenced by a custom database table. SearchWP cannot parse or index the content of these files. WordPress core is making strides in providing better organization in the Media library which will hopefully help lessen the reliance on these other plugins that abstract PDF storage outside of the Media library.

In order for PDFs to be parsed and indexed they must have readable text. That statement may sound strange, but there is a big difference between a PDF with embedded text and a PDF that contains only an image of text, which cannot be read by a machine. You can test your PDFs by dragging to highlight the text as you would on any web page. If the text highlights and you can copy it to your clipboard, SearchWP can likely read it as well. If you cannot select and copy the text, SearchWP will not be able to index the content.
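
The same check can be scripted when there are many files to test. The sketch below is a minimal example and is not part of SearchWP. It assumes the pdftotext command-line tool (shipped with Xpdf and Poppler) is installed and on your PATH; the function names and the 20-character threshold are hypothetical choices made for illustration.

```python
import subprocess


def has_readable_text(extracted: str, min_chars: int = 20) -> bool:
    """Return True if the text holds at least min_chars non-whitespace
    characters. The threshold is an arbitrary, illustrative value."""
    # split() with no arguments drops spaces, newlines and form feeds,
    # so page-break characters alone do not count as content.
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def extract_with_pdftotext(pdf_path: str) -> str:
    """Run pdftotext and return whatever text it finds.

    Assumes the pdftotext binary is installed. The "-" argument sends
    the extracted text to standard output instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    import sys

    # Usage: python check_pdfs.py first.pdf second.pdf
    for path in sys.argv[1:]:
        readable = has_readable_text(extract_with_pdftotext(path))
        print(f"{path}: {'readable' if readable else 'no text found'}")
```

A file reported as having no text is probably a scanned image. This is only a rough indicator: SearchWP’s own extraction may succeed or fail on a given file independently of what pdftotext finds.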

How does PDF indexing work?

Since PDFs have a post ID, they are included in SearchWP’s indexing process. When the indexer picks up a PDF the content is first filtered through the searchwp_external_pdf_processing hook, which allows you to facilitate your own PDF processing if you so choose. This is how the Xpdf Integration extension works (more on that later).

If no PDF content is found via that hook, SearchWP applies its own series of PDF extraction processes to the file. SearchWP will take up to three passes at each PDF. The first pass attempts to extract PDF content using a PHP 5.3+ compatible process that usually has a great success rate. Since SearchWP mirrors WordPress’ system requirements, it will fall back to two PHP 5.2 compatible processes if you are running PHP 5.2. If you are not sure which version of PHP you are running, you can find out in the System Information pane at the bottom of the Advanced settings screen, which is linked from the bottom right of the main SearchWP settings screen.
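
The order of operations described above can be pictured as a simple fallback chain. The following is a conceptual sketch only, not SearchWP’s actual code (which is PHP): every name in it is hypothetical, and it assumes that a pass which raises an error or returns nothing is skipped in favor of the next one.

```python
from typing import Callable, Iterable, Optional

# An extractor takes a file path and returns the text it found ("" if none).
Extractor = Callable[[str], str]


def extract_pdf_content(
    path: str,
    external_hook: Optional[Extractor],
    builtin_passes: Iterable[Extractor],
) -> str:
    """Return the first non-empty text found, or "" if every step fails."""
    # Step 1: custom processing (the role played by the
    # searchwp_external_pdf_processing hook) gets the first chance.
    if external_hook is not None:
        content = external_hook(path).strip()
        if content:
            return content

    # Step 2: try each built-in pass in order, stopping at the first success.
    for extractor in builtin_passes:
        try:
            content = extractor(path).strip()
        except Exception:
            # Assumption: a failing pass is skipped silently.
            continue
        if content:
            return content

    # Nothing worked: the caller ends up with empty content.
    return ""


# Stand-in extractors used only to demonstrate the chain.
def crashing_pass(path: str) -> str:
    raise RuntimeError("parser ran out of memory")


def empty_pass(path: str) -> str:
    return ""


def working_pass(path: str) -> str:
    return "  Annual report 2014  "


if __name__ == "__main__":
    passes = [crashing_pass, empty_pass, working_pass]
    print(extract_pdf_content("report.pdf", None, passes))  # Annual report 2014
```

When a hook supplies content, none of the built-in passes run at all, which matches the description of custom processing taking precedence over the built-in extraction.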

Note: PDF content extraction depends heavily on the capabilities of your server and the size of your PDF. If you are attempting to index a 100-page PDF on a shared host, the process will likely fail simply because a PDF that large cannot be parsed in a shared hosting environment. Please keep this in mind as you evaluate how you intend to use SearchWP’s PDF indexing capabilities.

When the content of the PDF has been extracted, it is saved as post meta (a Custom Field) called searchwp_content.

How does searching PDF content work?

During indexing, the content of the PDF is stored as post meta (a Custom Field) named searchwp_content. This field can be used like any other custom field, but it is treated specially when it comes to SearchWP’s settings: the Media tab has its own dedicated field for it.

This directly controls the weight as though you set a Custom Field weight for searchwp_content. You do not need to set both.

How can I determine what content has been indexed?

SearchWP automatically adds a meta box titled SearchWP File Content to Media edit screens. It contains a text field with all of the content extracted from your PDF.

You can navigate to this page by going to your Media library, finding your PDF, and clicking the Edit link to view more details. The SearchWP File Content box will be at the bottom of that details screen.

What happens when PDF indexing fails?

There are a number of reasons PDF indexing can fail. The primary reason is that the built-in, PHP-based PDF content extraction method was not able to extract content from your PDF, resulting in an empty SearchWP File Content box. This failure occurs silently by design. Alternatively, you may see an error notice at the top of WordPress admin screens indicating that a certain number of PDF entries failed to be indexed. You are given the option to reintroduce these problematic posts to the indexer, as they may have failed simply because the server ran out of resources at that particular time. If reintroduction fails, there is likely a lower-level problem that will prevent the PDF from being indexed automatically.

If neither SearchWP alone nor SearchWP with Xpdf Integration can extract content from your PDFs, or your PDFs do not have readable text, it is recommended that you manually populate the SearchWP File Content box. The content you enter by hand will be indexed as though it had been extracted by SearchWP during indexing.
