PDF Text Parser: Converting PDF to Text in Java using PDFBox

Sunday, January 25, 2009

Converting PDF to text is useful in many applications, from search engines that index PDF documents to other data processing tasks. I was looking for a Java-based API to convert PDF to text, in other words a PDF text parser in Java. After going through many articles, the PDFBox project came to my rescue. PDFBox is a library that can handle different types of PDF documents, including encrypted PDFs, and extract their text. It also has a command-line utility to convert PDF documents to text.

One of my projects needed a reusable Java class to convert PDF documents to text, and the Java code below does that using the PDFBox Java API. It takes two command-line parameters: the input PDF file and the output text file, to which the text parsed from the PDF document is written.

This code was tested with PDFBox 0.7.3, although it should work with other versions of PDFBox as well. It can easily be integrated into other Java applications or used as a command-line utility. The steps to run it are given below.

Listing 1: PDFTextParser.java

1: /*
2:  * PDFTextParser.java
3:  * Author: S.Prasanna
4:  *
5:  */
6:
7: import org.pdfbox.cos.COSDocument;
8: import org.pdfbox.pdfparser.PDFParser;
9: import org.pdfbox.pdmodel.PDDocument;
10: import org.pdfbox.pdmodel.PDDocumentInformation;
11: import org.pdfbox.util.PDFTextStripper;
12:
13: import java.io.File;
14: import java.io.FileInputStream;
15: import java.io.PrintWriter;
16:
17: public class PDFTextParser {
18:
19:     PDFParser parser;
20:     String parsedText;
21:     PDFTextStripper pdfStripper;
22:     PDDocument pdDoc;
23:     COSDocument cosDoc;
24:     PDDocumentInformation pdDocInfo;
25:
26:     // PDFTextParser Constructor
27:     public PDFTextParser() {
28:     }
29:
30:     // Extract text from PDF Document
31:     String pdftoText(String fileName) {
32:
33:         System.out.println("Parsing text from PDF file " + fileName + "....");
34:         File f = new File(fileName);
35:
36:         if (!f.isFile()) {
37:             System.out.println("File " + fileName + " does not exist.");
38:             return null;
39:         }
40:
41:         try {
42:             parser = new PDFParser(new FileInputStream(f));
43:         } catch (Exception e) {
44:             System.out.println("Unable to open PDF Parser.");
45:             return null;
46:         }
47:
48:         try {
49:             parser.parse();
50:             cosDoc = parser.getDocument();
51:             pdfStripper = new PDFTextStripper();
52:             pdDoc = new PDDocument(cosDoc);
53:             parsedText = pdfStripper.getText(pdDoc);
54:         } catch (Exception e) {
55:             System.out.println("An exception occurred in parsing the PDF Document.");
56:             e.printStackTrace();
57:             return null;
58:         } finally {
59:             // Release the document on success as well as on failure
60:             try {
61:                 if (cosDoc != null) cosDoc.close();
62:                 if (pdDoc != null) pdDoc.close();
63:             } catch (Exception e1) { e1.printStackTrace(); }
64:         }
65:         System.out.println("Done.");
66:         return parsedText;
67:     }
68:
69:     // Write the parsed text from PDF to a file
70:     void writeTexttoFile(String pdfText, String fileName) {
71:
72:         System.out.println("\nWriting PDF text to output text file " + fileName + "....");
73:         try {
74:             PrintWriter pw = new PrintWriter(fileName);
75:             pw.print(pdfText);
76:             pw.close();
77:         } catch (Exception e) {
78:             System.out.println("An exception occurred in writing the pdf text to file.");
79:             e.printStackTrace();
80:         }
81:         System.out.println("Done.");
82:     }
83:
84:     // Extracts text from a PDF Document and writes it to a text file
85:     public static void main(String args[]) {
86:
87:         if (args.length != 2) {
88:             System.out.println("Usage: java PDFTextParser <InputPDFFile> <OutputTextFile>");
89:             System.exit(1);
90:         }
91:
92:         PDFTextParser pdfTextParserObj = new PDFTextParser();
93:         String pdfToText = pdfTextParserObj.pdftoText(args[0]);
94:
95:         if (pdfToText == null) {
96:             System.out.println("PDF to Text Conversion failed.");
97:         }
98:         else {
99:             System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
100:             pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
101:         }
102:     }
103: }

Explanation:

The above code takes two command-line parameters, the input PDF file and the output text file. The pdftoText method on line 31 handles the text parsing, and the writeTexttoFile method on line 70 writes the parsed text to the output file.

Compiling and Running the code:

I used PDFBox 0.7.3 to compile and run the above code, so you need to add its jars to your Java project settings.

1. Download PDFBox 0.7.3 from here.
2. Unzip PDFBox-0.7.3.zip.
3. Under the PDFBox-0.7.3 folder, add the jar in the lib directory (PDFBox-0.7.3.jar) and the jars in the external directory (other external packages used by PDFBox-0.7.3) to the classpath. The code should then compile and run.

Note: I used JDK 1.6 to compile the above code.
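
Several of the errors reported in the comments below are NoClassDefFoundError failures caused by a jar missing from the classpath. As a rough aid, here is a minimal sketch of a classpath check using only the standard library. The ClasspathCheck class and its missing method are hypothetical names made up for this example; the three class names in main are taken from the listing and from the stack traces in the comments, and the exact set your PDFBox version needs may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: reports which classes cannot be loaded.
class ClasspathCheck {

    // Returns the names from classNames that are not loadable from the classpath.
    static List<String> missing(String[] classNames) {
        List<String> notFound = new ArrayList<String>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate and load the class, do not run static initializers
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException e) {
                notFound.add(name);
            } catch (LinkageError e) {
                // for example NoClassDefFoundError when a dependency of the class is absent
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Assumption: one representative class per jar, as seen in this post
        String[] required = {
            "org.pdfbox.util.PDFTextStripper",
            "org.fontbox.afm.AFMParser",
            "org.apache.log4j.Logger"
        };
        List<String> notFound = missing(required);
        if (notFound.isEmpty()) {
            System.out.println("All checked classes were found on the classpath.");
        } else {
            System.out.println("Missing from the classpath: " + notFound);
        }
    }
}
```

Run it with the same classpath you use for PDFTextParser; any name it prints points to a jar that still needs to be added.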

55 comments:

Arun Kumar C said...

I also had a similar problem. Although I am able to extract text, I am not able to do it with the formatting information which is crucial for me. I had spent so much time to figure out how to get text. It would have been better if I had seen this post before I started luking for Java API's :-(

Prasanna Seshadri said...

Hi,

Thanks for saying so, glad that you landed on this page.

Rajat said...

Thanks for sharing this useful information prasanna.

Adjie said...

hi prasanna.. nice post.

it works fine when i use it as it was. but when i change the method pdftoText from default to public static. and i called it from another class, the compiler said

Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at TA.document.PDFHandler.getText(PDFHandler.java:45)
at TA.eksperimen.main(eksperimen.java:33)
Caused by: java.lang.ClassNotFoundException: org.fontbox.afm.FontMetric
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
... 13 more

FYI i rename PDFTextParser(String filename) to PDFHandler(File file)

and i called it from eksperimen.java with syntax
file = new File("C:\\Users\\root\\Documents\\indexDocs\\presentation_tips.pdf");
tes = PDFHandler.getText(file); System.out.println("PDFHandler\n" + tes);

Adjie said...

Sorry.. i rename the class PDFTextParser to PDFHandler

and also String pdftoText(String fileName) to getText(File file).

so can u help me to point out my mistake..?

Prasanna Seshadri said...

Hi,

From the error you are getting, it looks like some jars are missing from your classpath. You need to add the jars I mentioned. Let me know how it goes.

manish said...

does PDFBox 0.7.3 work on jdk 1.4.2

Psybuck said...

Thanks for this post. Helped me a lot. I havent touched Java in almost three years and was lost as to how I would use this.

Prasanna Seshadri said...

Hi Manish,

PDFBox should work in JDK 1.4.2, but I haven't tried it.

Prasanna Seshadri said...

Psybuck, thanks for saying so.

manish said...

thanx for the reply ,
i used ur code as it is ,i am getting error for this line
PrintWriter pw = new PrintWriter(fileName); bcoz the PrintWriter constructor cannot take a String object as parameter. What to do, can u help me

Prasanna Seshadri said...

Oh, got your problem. From Java 5 onward, PrintWriter takes a String argument for the file name. In 1.4.2, you need to create a FileWriter from the file name and then create the PrintWriter from that FileWriter, so change that part accordingly.
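
A minimal sketch of that change, assuming nothing beyond the standard java.io classes. The LegacyTextWriter class and its writeText method are hypothetical names for this example; in the listing, the same two-step construction would replace the PrintWriter line inside writeTexttoFile.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical helper showing the pre-Java 5 way to open a PrintWriter on a file.
class LegacyTextWriter {

    // Writes text to fileName by wrapping a FileWriter in a PrintWriter.
    static void writeText(String text, String fileName) throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(fileName));
        try {
            pw.print(text);
        } finally {
            // Closing the PrintWriter also closes the underlying FileWriter
            pw.close();
        }
    }
}
```

The FileWriter form also compiles on later JDKs, so it can be used unchanged if the code has to build on both.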

manish said...

thanx it worked but my original formatting in pdf was lost in txt file ,the alignment and all what to do

manish said...

HI,
i need some help in pdfbox ,can i get all data from pdf file with out losing formatting information

Prasanna Seshadri said...

Hi,

I am not sure how you want the text to be, but if you have all text extracted, then it should be fine.

manish said...

hi prasanna ,
i am getting the text correctly but the actual pdf file looks different .here is the link to the pdf file .Can i get the same design as pdf in my text file

manish said...

sorry here is the link to my pdf
http://www.wikifortio.com/338703/MPRPOV.pdf

Anonymous said...

Hello,

I was trying to do this, but I got an error that said this:

Exception in thread "main" java.lang.NoClassDefFoundError: org.apache.pdfbox.pdmodel.font.PDTrueTypeFont
at java.lang.Class.initializeClass(libgcj.so.81)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:101)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:62)
at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:123)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:191)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:173)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:330)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:254)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:210)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:143)

I can see the class inside org.apache.pdfbox.pdmodel.font.PDTrueTypeFont, but it's not getting resolved. I added the jar file as an external jar to my Eclipse project settings. Can anyone please tell me what I might be doing wrong?

Thanks,
SSD.

Anonymous said...

Adjie

Our problems are similar. The solution is to include the fontbox.jar file under /external in the pdfbox download.

Thanks,
SSD.

Grainne said...

Useful entry. I can run this all fine on a windows environment, but when i transfer this to a linux server there is a problem in the pdfbox parser and returns that the header is corrupt.
Are there any dependencies in the parser to adobe reader or anything like that?
Thanks,
Grainne

fatih seker said...

thank you very much. i would buy you a cup of coffee if you were around :)

Prasanna Seshadri said...

Hi Grainne,

I think Windows or Linux doesn't matter, because the behaviour depends on the jars. One suggestion would be to try the latest versions of the jars for the dependent packages.

Prasanna Seshadri said...

Hi Seker,

Thanks for saying so, I would love to have that coffee :)

ayush said...

I tried to run the code that you provided but it does not seem to be working
I added the jar in lib path
but it gave an error
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:350)
at org.pdfbox.pdmodel.font.PDFont.getAverageFontWidthFromAFMFile(PDFont.java:313)
at org.pdfbox.pdmodel.font.PDSimpleFont.getAverageFontWidth(PDSimpleFont.java:231)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:276)
at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at PDFTextParser.pdftoText(PDFTextParser.java:48)
at PDFTextParser.main(PDFTextParser.java:93)
Caused by: java.lang.ClassNotFoundException: org.fontbox.afm.AFMParser
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
... 14 more


I am using JDK 1.6

Please help its urgent
Regards

Ayush

Anonymous said...

Hi,
Is there any way I can convert a pdf document directly from a web link instead of a file on the local system?
Appreciated,

Anonymous said...

where to give d pdf file name and output filename

Prasanna Seshadri said...

The code as posted cannot read from a web link, but it can be enhanced to do that.
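
One possible enhancement, sketched here with the standard library only: download the linked document to a temporary file first, then pass that file's path to pdftoText as usual. The PdfDownloader class and downloadToTempFile method are hypothetical names for this example, and the sketch does no checking that the downloaded content is really a PDF.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

// Hypothetical helper: copies the content behind a link into a temporary local file.
class PdfDownloader {

    // Downloads the resource at link and returns the temporary file holding it.
    static File downloadToTempFile(String link) throws IOException {
        File tmp = File.createTempFile("download", ".pdf");
        InputStream in = URI.create(link).toURL().openStream();
        try {
            OutputStream out = new FileOutputStream(tmp);
            try {
                byte[] buffer = new byte[8192];
                int n;
                // Copy the stream in chunks until the end is reached
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            } finally {
                out.close();
            }
        } finally {
            in.close();
        }
        return tmp;
    }
}
```

The returned file's path (tmp.getPath()) can then be given to pdftoText in place of args[0].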

Prasanna Seshadri said...

Hi,

You can see the way in which the java class is run, use the command line arguments for giving input pdf and output text file.

Yati said...

Hi Prasanna,

I have run your code but end up to these errors. Could you help me please...Thanks in advance

org.pdfbox.exceptions.WrappedIOException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:62)
at PDFTextParser.main(PDFTextParser.java:107)
java.lang.NullPointerException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:117)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:62)
at PDFTextParser.main(PDFTextParser.java:107)
PDF to Text Conversion failed.

Prasanna Seshadri said...

Hi,

Did you use the same JDK and the jars mentioned in the blog, let me know.

S Marvasti said...

Great post. I have a better implementation which preserves formatting information. Needless to say, the code is proprietary to the company that I work for, but I can give you guys hints. There is a TextPosition object which you can access by subclassing PDFTextStripper.
I also recommend using the latest Apache incubator project.

You can then override a method called writePage in PDFTextStripper.

Calling
List textBoxes = getCharactersByArticle().get(0);
Collections.sort(textBoxes, new TextPositionComparator());

See javadocs for PDFTextStripper.

After getting this list in the writePage() method of PDFTextStripper (PDFBox 0.8), you have a sorted list of text objects with positions, which you can print out as you like using your own logic for determining spacing.


If you want I can help you out... email me... though detailed help will require donation.

Prasanna Seshadri said...

Hi,

Thanks for the offer, I appreciate it a lot. I will definitely get back to you when I rework this code for enhancement.

lilian said...

hi,

i want to work with pdfbox and i will lose my nerves soon, because it won't work. maybe you can help me.
i copied the jars from external and lib into a separate folder (named pdfbox) in my eclipse-plugin-folder. then i added these jars to my classpath and i see them under "referenced libraries".
the source code reports no error. but when running this code i get the message "noclassdeffounderror: org/pdfbox/util/splitter".

Opita said...

It works like a charm :)

Thank u very much.

Anonymous said...

I am able to run the code and the text is generated out of a pdf file.

But when the pdf size is large, say 200mb, Java OutOfMemory Exception is thrown.

Any idea how do i go about this ?

My second question is can i get the contents of a chosen page only as text, without loading the whole file ?

Prasanna Seshadri said...

Hi,

This code is just meant to help understand the PDF parsing API. For big files, you can try a different stream to avoid OutOfMemory errors.

Anonymous said...

Hi,

I am new to this pdfbox.

I have downloaded the PDFBox-0.7.3 and i set the CLASSPATH using Environment variable in control panel/system.

i dont know how to compile and run the code. can you suggest me.

i tried using javac PDFTextParser.java it giving error message.

Sunil said...

Hello,

I have a pdf which is basically scanned from a piece of printed paper. Does PDFBox support reading text from such PDF?

atul said...

when i executed the above code then i got the following error. could u pls help me out.

Parsing text from PDF file c:\\test.pdf....

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Logger

at org.pdfbox.pdfparser.BaseParser.(BaseParser.java:70)

at test.pdftoText(test.java:33)

at test.main(test.java:84)

Kishan said...

Hi Prasanna,

First of all: Thanks for such a good post.

Problem: I used JDK1.6 and the jars mentioned in the blog.

The program Compiled fine. But when I tried to run it with the following command:

java PDFTextParser sample1.pdf aa.txt

or

java PDFTextParser sample1.pdf aa


I am getting the following exception.

Parsing text from PDF file sample1.pdf....
An exception occured in parsing the PDF Document.
org.pdfbox.exceptions.WrappedIOException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:49)
at PDFTextParser.main(PDFTextParser.java:91)
java.lang.NullPointerException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:117)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:49)
at PDFTextParser.main(PDFTextParser.java:91)
PDF to Text Conversion failed.

can you please help me to resolve the problem.

Thanks in advance.

Prateek said...

Hi prasanna !
when i am including the jar from the 'lib' in PDFBox-0.7.3 in the classpath, the error i am getting is:

Exception in thread "main" java.lang.NoClassDefFoundError: and
Caused by: java.lang.ClassNotFoundException: and
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)

please solve it out..
looking forward for a reply soon !

-prtk

ochingwa said...

thanks very much for your post,it is very helpful.
am getting some exceptions concerning pdftextparser.i would like more help to know how to resolve the.the following are the exceptions:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
at org.apache.pdfbox.pdfparser.BaseParser.(BaseParser.java:58)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:846)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
at brainuppdfparser.BrainupPdfParser.main(BrainupPdfParser.java:35)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 5 more

meenal said...

hello,
i am getting around 100 errors with this code. when i unzipped PDFBox-0.7.3, fontbox.jar was not there. and these errors, i am including. please help...!!!

C:\Program Files\Java\jdk1.6.0_14\bin\PDFBox-0.7.3\org\pdfbox\pdmodel\common\PDRectangle.java:38: package org.fontbox.util does not exist
import org.fontbox.util.BoundingBox;
^
C:\Program Files\Java\jdk1.6.0_14\bin\PDFBox-0.7.3\org\pdfbox\pdmodel\font\PDFont.java:33: package org.fontbox.afm does not exist
import org.fontbox.afm.AFMParser;
^
C:\Program Files\Java\jdk1.6.0_14\bin\PDFBox-0.7.3\org\pdfbox\pdmodel\font\PDFont.java:35: package org.fontbox.afm does not exist
import org.fontbox.afm.FontMetric;

and so on

meenal said...

hi again. the previous program was solved, but here's a new one.
Please help...

An exception occured in parsing the PDF Document.
org.pdfbox.exceptions.WrappedIOException: org.pdfbox.util.operator.ShowTextGlyph

at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:45)
at PDFTextParser.main(PDFTextParser.java:87)
java.lang.ClassNotFoundException: org.pdfbox.util.operator.ShowTextGlyph
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:169)
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:122)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:45)
at PDFTextParser.main(PDFTextParser.java:87)
PDF to Text Conversion failed.

Rahul Patil said...

Thanks!
Its really a good post..

bgung said...

i just found this link...
it says... pdfbox isn't it??
so ... it's online pdf2txt w/ pdfbox?

http://www.fileformat.info/convert/doc/pdf2txt.htm

pradeep said...

Hi, First of all thankyou for u r effort to made this. By this i can get conver the pdf into text file but the format is completely changed, can i make the text file without losing the format?? in pdf i have table like format, it is not necessary to print table in text file but i want table like format for displaying.

Thankyou,
Pradeep

Anonymous said...

check it

AHK said...

Hi Guys
Im having these 2 Exceptions

org.pdfbox.wrappedioexception.

and

java.lang.NullPointerException

Abed said...

Hi very good post in fact!.

Could you give me a hand with this error in order to know what is happening ?

C:\Program Files\Java\jdk1.6.0_19\bin>java PDFTextParser corte300121112009Price.pdf salidacorte.txt
Parsing text from PDF file corte300121112009Price.pdf....
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:350)
at org.pdfbox.pdmodel.font.PDFont.getAverageFontWidthFromAFMFile(PDFont.java:313)
at org.pdfbox.pdmodel.font.PDSimpleFont.getAverageFontWidth(PDSimpleFont.java:231)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:276)
at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at PDFTextParser.pdftoText(PDFTextParser.java:61)
at PDFTextParser.main(PDFTextParser.java:101)
Caused by: java.lang.ClassNotFoundException: org.fontbox.afm.AFMParser
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 14 more

chandra said...

Hello Prashant,

Thank you very much for sharing this article.
I still need some more help, I have been able to convert a PDF file into a text file successfully, however my pdf file is as follows:

Column1 Column2
Ram is 1000
manager
Sam is 2000
supervisor

So when its extracted into a text file it appears as:
Column1
Column2
Ram is
manager
1000
Sam is
supervisor
2000

The challenge i am facing is to determine if "Ram is manager " should go to Column1 and not
"Ram is " ---> Column1
"manager " ---> Column2

Hope I was able to explain my problem clearly.
If anyone has dealt with this issue, I would really appreciate it if they could share their solution. I have already wasted a week's time exploring the pdfbox and itext APIs.

Thanks to all in advance.

Chandra
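
One way to approach the column problem described above, sketched under assumptions: if each piece of text is available together with its x/y position (for example from the TextPosition objects mentioned in an earlier comment), it can be assigned to a column by its x coordinate, and a line with nothing in the second column can be treated as a continuation of the previous row. The Fragment class, the toRows method, the column boundary and all coordinates below are hypothetical and not part of PDFBox. The sketch also assumes that y grows upwards on the page and that text on the same line has exactly the same y value; real documents would need a tolerance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rebuilds two-column rows from positioned text fragments.
class ColumnGrouper {

    // Stand-in for a piece of text with a position (not a PDFBox class).
    static class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns rows of {column1, column2}; text left of columnBoundary goes to column 1.
    static List<String[]> toRows(List<Fragment> fragments, float columnBoundary) {
        List<Fragment> sorted = new ArrayList<Fragment>(fragments);
        // Top of the page first (larger y), then left to right
        Collections.sort(sorted, new Comparator<Fragment>() {
            public int compare(Fragment a, Fragment b) {
                int byY = Float.compare(b.y, a.y);
                return byY != 0 ? byY : Float.compare(a.x, b.x);
            }
        });

        List<String[]> rows = new ArrayList<String[]>();
        int i = 0;
        while (i < sorted.size()) {
            float lineY = sorted.get(i).y;
            StringBuilder left = new StringBuilder();
            StringBuilder right = new StringBuilder();
            // Collect every fragment on this line into its column
            while (i < sorted.size() && sorted.get(i).y == lineY) {
                Fragment f = sorted.get(i++);
                StringBuilder target = f.x < columnBoundary ? left : right;
                if (target.length() > 0) target.append(' ');
                target.append(f.text);
            }
            if (right.length() > 0 || rows.isEmpty()) {
                // A value in column 2 marks the start of a new row
                rows.add(new String[] { left.toString(), right.toString() });
            } else {
                // Column 1 only: continuation of the previous row's first column
                String[] last = rows.get(rows.size() - 1);
                last[0] = last[0].isEmpty() ? left.toString() : last[0] + " " + left;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<Fragment>();
        page.add(new Fragment("Column1", 50f, 700f));
        page.add(new Fragment("Column2", 300f, 700f));
        page.add(new Fragment("Ram is", 50f, 680f));
        page.add(new Fragment("1000", 300f, 680f));
        page.add(new Fragment("manager", 50f, 660f));
        page.add(new Fragment("Sam is", 50f, 640f));
        page.add(new Fragment("2000", 300f, 640f));
        page.add(new Fragment("supervisor", 50f, 620f));
        for (String[] row : toRows(page, 200f)) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

With the made-up coordinates in main, "Ram is" and "manager" end up together in the first column of one row, with "1000" in the second. The rule that a new row starts whenever the second column has a value is specific to this table layout and would need adjusting for others.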

Sweta said...

Hi !!

I want to know if one can extract the bibliography data from pdf documents like research papers using pdfbox ?? I need it for my thesis !!

Thanks

sankar said...

I NEED TO GET FONT NAME AND SIZE FROM THE PDF TEXT FILES..ANYBODY HELP TO ME?


Copyright © 2008 Prasanna Seshadri, www.prasannatech.net, All Rights Reserved.
No part of the content or this site may be reproduced without prior written permission of the author.