Using Apache Tika for Document Processing

SearchWP has built-in support for document processing, but there are some cases where alternate methods are preferred. One example is offloading PDF parsing (which can be a resource-intensive job for PHP) to a purpose-built binary like Xpdf.

Another popular application that can parse documents and extract their content is Apache Tika. All hooks should be added to your custom SearchWP Customizations Plugin.

If your server has Tika available, you can tell SearchWP to use it to parse PDF documents like so:

	<?php

	// Use Apache Tika to extract PDF content in SearchWP.
	add_filter( 'searchwp\parser\pdf', function( $content, $args ) {

		// Ensure this path is updated to match your Tika installation path!
		$path_to_tika = '/srv/bin/tika-app-1.18.jar';

		// Build the command, escaping both paths so spaces or shell characters are safe.
		$cmd = 'java -jar ' . escapeshellarg( $path_to_tika ) . ' -t ' . escapeshellarg( $args['file'] );
		@exec( $cmd, $output, $exitCode );

		// If there was a problem, note it in the debug log and keep the existing content.
		if ( $exitCode ) {
			do_action( 'searchwp\debug\log', 'Error running Tika, exit code: ' . $exitCode );

			return $content;
		}

		// exec() collects the output as an array of lines; join them into one string.
		return implode( "\n", $output );
	}, 20, 2 );


Apache Tika is a very capable application that can also parse additional document types. The snippet above uses Tika to parse PDFs; the parser filters for other document types can be customized in the same way.

Apache Tika may also have better support for your Office documents, in which case you can customize the parsed content with the searchwp\document\content filter.
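
The Tika call itself does not depend on the document type, so it can be pulled out into a small helper and reused from more than one filter. The following is a minimal sketch, not part of SearchWP: the function names my_searchwp_tika_command and my_searchwp_tika_extract are hypothetical, and the jar path passed in must match your own Tika installation.

```php
<?php

// Hypothetical helper: build the shell command that asks Tika for plain text (-t).
// Both paths are escaped so spaces or shell characters cannot break the command.
function my_searchwp_tika_command( $path_to_tika, $file ) {
	return 'java -jar ' . escapeshellarg( $path_to_tika ) . ' -t ' . escapeshellarg( $file );
}

// Hypothetical helper: run Tika against a file and return the extracted text,
// or null if the file cannot be read or the command exits with an error.
function my_searchwp_tika_extract( $path_to_tika, $file ) {
	if ( ! is_readable( $file ) ) {
		return null;
	}

	$output    = array();
	$exit_code = 0;
	exec( my_searchwp_tika_command( $path_to_tika, $file ), $output, $exit_code );

	if ( 0 !== $exit_code ) {
		return null;
	}

	// exec() collects the output as an array of lines; join them into one string.
	return implode( "\n", $output );
}
```

A callback on the searchwp\document\content filter could call my_searchwp_tika_extract() and fall back to the content SearchWP already parsed whenever the helper returns null. The arguments that filter passes to its callback are not shown here, so check them before relying on a particular file path key.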

