Xpdf Integration

Current version: 1.3.0

Warning: This Extension requires the use of exec() and also requires you to install Xpdf command line tools yourself.
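
Whether exec() is usable on a given server can be checked up front. The following is a minimal sketch, not code from the Extension itself; my_exec_is_available is a hypothetical helper name, and the check assumes exec() would be blocked through PHP's disable_functions setting:

```php
<?php
// Hypothetical helper: true when exec() exists and is not listed in disable_functions.
function my_exec_is_available() {
    if ( ! function_exists( 'exec' ) ) {
        return false;
    }
    // disable_functions is a comma-separated list of function names.
    $disabled = array_map( 'trim', explode( ',', (string) ini_get( 'disable_functions' ) ) );
    return ! in_array( 'exec', $disabled, true );
}

echo my_exec_is_available() ? 'exec() is available' : 'exec() is disabled on this server';
echo PHP_EOL;
```

Run it through your web server rather than the command line, since the two can use different PHP configurations.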

SearchWP can extract plain text from PDF files uploaded to your WordPress website. Out of the box, SearchWP attempts to do this using only PHP, but because the PDF format is complex and varied, content is sometimes not extracted accurately. Enter Xpdf.

Xpdf provides a set of command line tools that must be installed on your server for this Extension to work. Installation instructions are included below.

With the Xpdf Integration Extension, you can offload the PDF processing that PHP would otherwise do to Xpdf’s command line tools, which are extremely fast and accurate at extracting content from your PDFs. After activating the Extension, follow the installation instructions below. Once the tools are installed, SearchWP will hand the PDF content extraction process to Xpdf.

Installing Xpdf command line tools

Using this extension you can utilize Xpdf to extract the content from your PDFs.

IMPORTANT: Xpdf command line tools are not provided in this Extension download. You must follow these instructions to download the command line tools and upload them to a non-public (outside your Web root) location.

You can download the Xpdf command line tools for both Windows and Linux at http://www.xpdfreader.com/download.html.

Installation

Once you have downloaded the command line tools for your server type:

  1. Extract xpdf-tools-linux-4.03.tar.gz (the version number may be different)
  2. Upload the pdftotext binary (found in either the bin32 or bin64 directory after extracting, depending on your server architecture) to a non-public location, outside your Web root
  3. Upload the pdfinfo binary (found in either the bin32 or bin64 directory after extracting, depending on your server architecture) to a non-public location, outside your Web root
  4. Ensure that both pdftotext and pdfinfo have execute permissions for the PHP user on your server
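
Step 4 can be checked from PHP before moving on. This is a minimal sketch, assuming the example paths used on this page; my_check_xpdf_binary is a hypothetical helper and is not part of SearchWP or Xpdf:

```php
<?php
// Hypothetical helper: reports whether a path points to an executable file.
function my_check_xpdf_binary( $path ) {
    if ( ! is_file( $path ) ) {
        return 'not a file'; // Missing, or a folder was given instead of the binary.
    }
    if ( ! is_executable( $path ) ) {
        return 'not executable'; // The current PHP user lacks execute permission.
    }
    return 'ok';
}

// Example paths only: replace them with the locations you uploaded to.
foreach ( array( '/home/johndoe/pdftotext', '/home/johndoe/pdfinfo' ) as $binary ) {
    echo $binary . ': ' . my_check_xpdf_binary( $binary ) . PHP_EOL;
}
```

The result reflects the user running the script, so run it through your web server to test the permissions of the PHP user.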

The last step is to tell SearchWP Xpdf Integration where you installed pdftotext and pdfinfo. To do this:

Add the following to your SearchWP Customizations plugin, replacing the example paths (/home/johndoe/pdftotext and /home/johndoe/pdfinfo) with the actual paths to the pdftotext and pdfinfo binaries (not the folder) on your server.

// Tell SearchWP the location of the pdftotext binary.
add_filter( 'searchwp_xpdf_path', function() {
    return '/home/johndoe/pdftotext'; // Full absolute path to the binary NOT A FOLDER, NOT A URL.
} );

// Tell SearchWP the location of the pdfinfo binary.
add_filter( 'searchwp_pdfinfo_path', function() {
    return '/home/johndoe/pdfinfo'; // Full absolute path to the binary NOT A FOLDER, NOT A URL.
} );

That’s it!

Adding PDF password support in Xpdf Integration

Xpdf supports parsing password-protected (read: not encrypted) PDFs, and you can supply the password using the searchwp_xpdf_command filter. This filter lets you directly manipulate the command that is executed to run Xpdf. Because Xpdf accepts an option for a password, you can append it like so:

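<?php
// Append the PDF user password to the Xpdf command.
function my_searchwp_xpdf_command( $cmd, $filename ) {
    // escapeshellarg() quotes the password so special characters are safe in the shell.
    return $cmd . ' -upw ' . escapeshellarg( 'password' );
}
add_filter( 'searchwp_xpdf_command', 'my_searchwp_xpdf_command', 10, 2 );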
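
If different PDFs use different passwords, the second filter argument can be used to pick one. The following is a sketch under assumptions: it assumes $filename holds the path or name of the PDF being processed, and the file names, passwords, and function names are all hypothetical.

```php
<?php
// Hypothetical map of PDF file names to their user passwords.
function my_searchwp_xpdf_passwords() {
    return array(
        'annual-report.pdf' => 'report-secret',
        'price-list.pdf'    => 'prices-secret',
    );
}

// Append -upw only when a password is known for this file.
function my_searchwp_xpdf_command_per_file( $cmd, $filename ) {
    $passwords = my_searchwp_xpdf_passwords();
    $key       = basename( $filename ); // Assumes $filename identifies the PDF.
    if ( ! isset( $passwords[ $key ] ) ) {
        return $cmd; // No password known: leave the command untouched.
    }
    return $cmd . ' -upw ' . escapeshellarg( $passwords[ $key ] );
}

// Register the filter only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'searchwp_xpdf_command', 'my_searchwp_xpdf_command_per_file', 10, 2 );
}
```

Use this in place of the single-password callback rather than alongside it, so the option is not appended twice.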

Manually Testing Xpdf Integration

After uploading and activating the Xpdf Integration Extension and defining the paths to pdftotext and pdfinfo, you can manually confirm that Xpdf text extraction works as expected on specific PDFs uploaded to your Media library. Go to the SearchWP Settings screen (Settings > SearchWP) and find the Xpdf Integration link under Extensions.

On the Xpdf Integration Testing screen, enter the ID of the PDF you’d like to test.

To find the ID, navigate to your Media section and click the Edit link of your PDF. The ID appears in the URL, directly after post=

After submitting a valid ID you will be given a detailed log of the steps taken by the Xpdf Integration Extension as well as any failure points that may have occurred. You’re also shown the exact content Xpdf extracted from the PDF.

If the log displays a point of failure, please include that in any support requests you submit.

Xpdf Integration error codes

If Xpdf had any issue running, one of the error codes listed below will be indicated in the log.

Exit Code   Description
---------   -----------
0           Command executed successfully
1           Catch-all for any unspecified error
2           Permissions issue, check to ensure www-data (or your web server user) has permissions to execute pdftotext
11          Segmentation fault. Are you using the proper Xpdf binary for your server environment?
126         There is a permissions problem executing pdftotext from the web server user. Please check with your host to ensure proper permissions.
127         Your server was not able to execute pdftotext. Please check with your host to ensure the web server user can execute pdftotext.
139         There is a permissions problem executing pdftotext from the web server user. Please check with your host to ensure proper permissions.
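
To see where such exit codes come from, you can run pdftotext yourself through exec() and inspect the status it returns. This is a minimal sketch and not the Extension's actual implementation; my_run_command is a hypothetical helper, and both paths are placeholders to replace with your own:

```php
<?php
// Hypothetical helper: run a command and return its exit code and output lines.
function my_run_command( $command ) {
    $output    = array();
    $exit_code = null;
    exec( $command, $output, $exit_code ); // The third argument receives the exit status.
    return array(
        'exit_code' => $exit_code,
        'output'    => $output,
    );
}

// Placeholder paths. The trailing '-' asks pdftotext to write the text to stdout.
$command = escapeshellarg( '/home/johndoe/pdftotext' ) . ' '
    . escapeshellarg( '/path/to/file.pdf' ) . ' - 2>&1';
$result  = my_run_command( $command );

echo 'Exit code: ' . $result['exit_code'] . PHP_EOL;
```

On a typical Linux shell, a path that does not exist is reported as 127 and a file without execute permission as 126, which lines up with the table above.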

Changelog

1.3.0

  • [New] Adds support for pdfinfo to extract PDF metadata
  • [Update] Updated updater

1.2.0

  • [New] Adds support for SearchWP 4

1.1.6

  • [New] Display notice when exec() is not available as it is necessary
  • [Update] Updated updater

1.1.5

  • [Change] Xpdf is now XpdfReader which resulted in a change in command formatting. YOU MUST UPDATE pdftotext AS WELL. Please view the Xpdf Integration documentation for a link to the XpdfReader website to download an updated version.
  • [Update] Updated updater

1.1.3

  • [Update] Updated updater
  • [Change] Updated minimum required SearchWP version

1.1.2

  • [Improvement] Better exit code handling

1.1.1

  • [New] New filter: searchwp_xpdf_command allowing manipulation of Xpdf command

1.1

  • [Improvement] Added support for auto-updates based on SearchWP license key

0.7.2

  • [Fix] Better handling of Windows directory separators

0.7

  • Initial release
