Gary Sieling is a software developer interested in dev-ops, database technologies, and machine learning. He has a computer science degree from the Rochester Institute of Technology. He has worked on many products in the legal and regulatory industries, and has built and supported several data warehousing applications. Gary is a DZone MVB, is not an employee of DZone, and has posted 62 posts at DZone.

Extracting PDF Text with Scala

05.09.2013

This example uses Apache Tika's PDFParser to extract the text contents of a PDF for use in other systems. It demonstrates some basic differences from Java: multi-line (triple-quoted) strings (hooray!), used here so that the backslashes in Windows paths need no escaping, wildcard imports, primitive arrays, and what implementing an interface looks like. The big downside is that the Eclipse Scala plugin doesn’t seem to be able to fill in interface methods on an object, so every ContentHandler method has to be written out by hand.
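To make the string point concrete, here is a minimal sketch comparing a triple-quoted literal with the equivalent ordinary literal. The object name `RawStringDemo` is made up for illustration; the path is the one used in the program.

```scala
// Hypothetical demo object, not part of the original program.
object RawStringDemo {
  // Triple-quoted: backslashes are taken literally, so a UNC path reads naturally.
  val raw = """\\nas\Files\Data\pacer2\"""

  // Ordinary literal: every backslash has to be doubled, as in Java.
  val escaped = "\\\\nas\\Files\\Data\\pacer2\\"

  def main(args: Array[String]): Unit =
    println(raw == escaped) // prints "true"
}
```

The full program follows.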

import java.io._
 
import org.apache.tika.parser.pdf._
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.xml.sax._
 
object pdfHandler extends ContentHandler {
	// Print only the slice [start, start + length) that the parser reports;
	// the rest of the buffer may hold unrelated data.
	def characters(ch: Array[Char], start: Int, length: Int): Unit = {
		println(new String(ch, start, length))
	}

	// The remaining ContentHandler callbacks are not needed for text extraction.
	def endDocument(): Unit = {}

	def endElement(uri: String, localName: String, qName: String): Unit = {}

	def endPrefixMapping(prefix: String): Unit = {}

	def ignorableWhitespace(ch: Array[Char], start: Int, length: Int): Unit = {}

	def processingInstruction(target: String, data: String): Unit = {}

	def setDocumentLocator(locator: Locator): Unit = {}

	def skippedEntity(name: String): Unit = {}

	def startDocument(): Unit = {}

	def startElement(uri: String, localName: String, qName: String, atts: Attributes): Unit = {}

	def startPrefixMapping(prefix: String, uri: String): Unit = {}
}
 
object pdf extends App {
	val folder = """\\nas\Files\Data\pacer2\"""
	val subfolder = """\00\00\gov.uscourts.rid.6064\"""
	val file = """gov.uscourts.rid.6064.20.0.pdf"""
 
	val pdf: PDFParser = new PDFParser()

	val stream: InputStream = new FileInputStream(folder + subfolder + file)
	val handler: ContentHandler = pdfHandler
	val metadata: Metadata = new Metadata()
	val context: ParseContext = new ParseContext()

	// Close the stream even if parsing throws.
	try {
		pdf.parse(stream, handler, metadata, context)
	} finally {
		stream.close()
	}
}

Output:

UNITED STATES DISTRICT COURT
FOR THE DISTRICT OF RHODE ISLAND
...
It is hereby agreed by and between the parties that the above-captioned matter be
dismissed, with prejudice, no interest, no costs.
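The handler in the program has to spell out all eleven ContentHandler methods, and it prints text instead of handing it to another system. One possible way around both, sketched here as an assumption rather than the author's approach, is to extend `org.xml.sax.helpers.DefaultHandler` from the JDK, which supplies empty implementations, and override only `characters`. The names `TextCollector` and `TextCollectorDemo` are made up for illustration.

```scala
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper: collects the reported text instead of printing it.
class TextCollector extends DefaultHandler {
  private val buffer = new StringBuilder

  // Append only the slice [start, start + length) that the parser reports.
  override def characters(ch: Array[Char], start: Int, length: Int): Unit =
    buffer.appendAll(ch, start, length)

  // Everything collected so far.
  def text: String = buffer.toString
}

object TextCollectorDemo {
  def main(args: Array[String]): Unit = {
    val collector = new TextCollector
    // Simulate a parser callback: the text sits in the middle of a larger buffer.
    val chars = "xxUNITED STATESxx".toCharArray
    collector.characters(chars, 2, 13)
    println(collector.text) // prints "UNITED STATES"
  }
}
```

Because DefaultHandler implements ContentHandler, an instance of `TextCollector` could be passed to `pdf.parse` in place of `pdfHandler`, and `text` read once parsing finishes.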


Published at DZone with permission of Gary Sieling, author and DZone MVB. (source)