PDF Text Parser: Converting PDF to Text in Java using PDFBox

Sunday, January 25, 2009

Converting PDF to text is an interesting task with uses in many applications, from search engines indexing PDF documents to other data processing tasks. I was looking for a Java-based API to convert PDF to text, in other words a PDF text parser in Java, and after going through many articles, the PDFBox project came to my rescue. PDFBox is a library that can handle different types of PDF documents, including encrypted PDFs, and extract text from them. It also has a command line utility to convert PDF to text documents.

I needed a reusable Java class to convert PDF documents to text in one of my projects, and the Java code below does that using the PDFBox Java API. It takes two command line parameters: the input PDF file and the output text file, to which the parsed text from the PDF document will be written.

This code was tested with PDFBox 0.7.3. It should work with other versions of PDFBox as well, though the package names may differ (newer Apache releases use org.apache.pdfbox instead of org.pdfbox). It can be easily integrated with other Java applications and can also be used as a command line utility. The steps to run this code are given below.

Listing 1: PDFTextParser.java

1: /*
2:  * PDFTextParser.java
3:  * Author: S.Prasanna
4:  *
5:  */
6:
7: import org.pdfbox.cos.COSDocument;
8: import org.pdfbox.pdfparser.PDFParser;
9: import org.pdfbox.pdmodel.PDDocument;
10: import org.pdfbox.pdmodel.PDDocumentInformation;
11: import org.pdfbox.util.PDFTextStripper;
12:
13: import java.io.File;
14: import java.io.FileInputStream;
15: import java.io.PrintWriter;
16:
17: public class PDFTextParser {
18:
19:     PDFParser parser;
20:     String parsedText;
21:     PDFTextStripper pdfStripper;
22:     PDDocument pdDoc;
23:     COSDocument cosDoc;
24:     PDDocumentInformation pdDocInfo;
25:
26:     // PDFTextParser Constructor
27:     public PDFTextParser() {
28:     }
29:
30:     // Extract text from PDF Document
31:     String pdftoText(String fileName) {
32:
33:         System.out.println("Parsing text from PDF file " + fileName + "....");
34:         File f = new File(fileName);
35:
36:         if (!f.isFile()) {
37:             System.out.println("File " + fileName + " does not exist.");
38:             return null;
39:         }
40:
41:         try {
42:             parser = new PDFParser(new FileInputStream(f));
43:         } catch (Exception e) {
44:             System.out.println("Unable to open PDF Parser.");
45:             return null;
46:         }
47:
48:         try {
49:             parser.parse();
50:             cosDoc = parser.getDocument();
51:             pdfStripper = new PDFTextStripper();
52:             pdDoc = new PDDocument(cosDoc);
53:             parsedText = pdfStripper.getText(pdDoc);
54:         } catch (Exception e) {
55:             System.out.println("An exception occurred in parsing the PDF Document.");
56:             e.printStackTrace();
57:             try {
58:                 if (cosDoc != null) cosDoc.close();
59:                 if (pdDoc != null) pdDoc.close();
60:             } catch (Exception e1) {
61:                 e1.printStackTrace(); // report the failure to close, not the parse error again
62:             }
63:             return null;
64:         }
65:         System.out.println("Done.");
66:         return parsedText;
67:     }
68:
69:     // Write the parsed text from PDF to a file
70:     void writeTexttoFile(String pdfText, String fileName) {
71:
72:         System.out.println("\nWriting PDF text to output text file " + fileName + "....");
73:         try {
74:             PrintWriter pw = new PrintWriter(fileName);
75:             pw.print(pdfText);
76:             pw.close();
77:         } catch (Exception e) {
78:             System.out.println("An exception occurred in writing the pdf text to file.");
79:             e.printStackTrace();
80:         }
81:         System.out.println("Done.");
82:     }
83:
84:     // Extracts text from a PDF Document and writes it to a text file
85:     public static void main(String args[]) {
86:
87:         if (args.length != 2) {
88:             System.out.println("Usage: java PDFTextParser <input PDF file> <output text file>");
89:             System.exit(1);
90:         }
91:
92:         PDFTextParser pdfTextParserObj = new PDFTextParser();
93:         String pdfToText = pdfTextParserObj.pdftoText(args[0]);
94:
95:         if (pdfToText == null) {
96:             System.out.println("PDF to Text Conversion failed.");
97:         }
98:         else {
99:             System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
100:             pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
101:         }
102:     }
103: }
Explanation:

The above code takes two command line parameters: the input PDF file and the output text file. The method pdftoText in line 31 handles the text parsing, and the writeTexttoFile method in line 70 writes the parsed text to the output file.

Compiling and Running the code:

I used PDFBox 0.7.3 to compile/run the above code, so you need to add those jars in your java project settings.

1. Download PDFBox 0.7.3 from here.
2. Unzip PDFBox-0.7.3.zip.
3. Under the PDFBox-0.7.3 folder, add the jar in the lib directory (PDFBox-0.7.3.jar) and the jars in the external directory (other packages used by PDFBox 0.7.3, such as FontBox) to the classpath, both when compiling and when running the code.

Note: I used JDK 1.6 to compile the above code.
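
If you would rather not type every jar by hand, the classpath from step 3 can be assembled with a small helper. This is only a sketch: the class name ClasspathBuilder and the relative folder names in main are assumptions for illustration, not part of PDFBox, so adjust them to wherever you unzipped the archive.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class ClasspathBuilder {

    // Collects every .jar directly inside the given directories and joins the
    // paths with the platform's classpath separator (";" on Windows, ":" elsewhere).
    static String buildClasspath(String... directories) {
        List<String> entries = new ArrayList<String>();
        for (String dirName : directories) {
            File[] files = new File(dirName).listFiles();
            if (files == null) {
                continue; // directory is missing or unreadable, so skip it
            }
            Arrays.sort(files); // predictable order
            for (File file : files) {
                if (file.isFile() && file.getName().toLowerCase().endsWith(".jar")) {
                    entries.add(file.getPath());
                }
            }
        }
        entries.add("."); // current directory, where PDFTextParser.class is expected
        StringBuilder classpath = new StringBuilder();
        for (String entry : entries) {
            if (classpath.length() > 0) {
                classpath.append(File.pathSeparator);
            }
            classpath.append(entry);
        }
        return classpath.toString();
    }

    public static void main(String[] args) {
        // Assumed layout: the unzipped PDFBox-0.7.3 folder is in the working directory.
        System.out.println(buildClasspath("PDFBox-0.7.3/lib", "PDFBox-0.7.3/external"));
    }
}
```

The printed string is what you pass to the -classpath option of both javac and java.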

113 comments:

Arun Kumar C said...

I also had a similar problem. Although I am able to extract text, I am not able to do it with the formatting information, which is crucial for me. I had spent so much time figuring out how to get the text. It would have been better if I had seen this post before I started looking for Java APIs :-(

Prasanna Seshadri said...

Hi,

Thanks for saying so, glad that you landed on this page.

Rajat said...

Thanks for sharing this useful information prasanna.

Adjie said...

hi prasanna.. nice post.

it works fine when i use it as it was. but when i change the method pdftoText from default to public static. and i called it from another class, the compiler said

Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at TA.document.PDFHandler.getText(PDFHandler.java:45)
at TA.eksperimen.main(eksperimen.java:33)
Caused by: java.lang.ClassNotFoundException: org.fontbox.afm.FontMetric
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
... 13 more

FYI i rename PDFTextParser(String filename) to PDFHandler(File file)

and i called it from eksperimen.java with syntax
file = new File("C:\\Users\\root\\Documents\\indexDocs\\presentation_tips.pdf");
tes = PDFHandler.getText(file); System.out.println("PDFHandler\n" + tes);

Adjie said...

Sorry.. i rename the class PDFTextParser to PDFHandler

and also String pdftoText(String fileName) to getText(File file).

so can u help me to point out my mistake..?

Prasanna Seshadri said...

Hi,

From the error you are getting, it looks like some jars are missing from your classpath. You need to add the jars I mentioned. Let me know how it goes.

manish said...

does PDFBox 0.7.3 work on jdk 1.4.2

Psybuck said...

Thanks for this post. Helped me a lot. I havent touched Java in almost three years and was lost as to how I would use this.

Prasanna Seshadri said...

Hi Manish,

PDFBox should work in JDK 1.4.2, but I haven't tried it.

Prasanna Seshadri said...

Psybuck, thanks for saying so.

manish said...

thanx for the reply ,
i used ur code as it is ,i am getting error for this line
PrintWriter pw = new PrintWriter(fileName); bcoz the PrintWriter constructor cannot take a String object as parameter. What to do, can u help me

Prasanna Seshadri said...

Oh, got your problem. From Java 5 onwards, PrintWriter takes a String argument for the file name. In 1.4.2, you need to create a FileWriter from the file name and then create a PrintWriter from that, so change that part accordingly.
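
The change described above could look like the sketch below. The class name TextFileWriter and the boolean return value are assumptions for illustration; only the FileWriter(String) and PrintWriter(Writer) constructors, which exist in JDK 1.4.2, matter here.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical stand-in for the writeTexttoFile method in the listing.
class TextFileWriter {

    // Writes the text to fileName using only constructors available before
    // Java 5. Returns true if the text was written without an I/O error.
    static boolean writeTextToFile(String pdfText, String fileName) {
        PrintWriter pw = null;
        try {
            pw = new PrintWriter(new FileWriter(fileName));
            pw.print(pdfText);
            return !pw.checkError(); // checkError() flushes and reports write failures
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        } finally {
            if (pw != null) {
                pw.close();
            }
        }
    }
}
```

Closing the writer in a finally block also releases the file handle when the write fails part-way.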

manish said...

thanx it worked but my original formatting in pdf was lost in txt file ,the alignment and all what to do

manish said...

HI,
i need some help in pdfbox ,can i get all data from pdf file with out losing formatting information

Prasanna Seshadri said...

Hi,

I am not sure how you want the text to be, but if you have all text extracted, then it should be fine.

manish said...

hi prasanna ,
i am getting the text correctly but the actual pdf file looks different .here is the link to the pdf file .Can i get the same design as pdf in my text file

manish said...

sorry here is the link to my pdf
http://www.wikifortio.com/338703/MPRPOV.pdf

Anonymous said...

Hello,

I was trying to do this, but I got an error that said this:

Exception in thread "main" java.lang.NoClassDefFoundError: org.apache.pdfbox.pdmodel.font.PDTrueTypeFont
at java.lang.Class.initializeClass(libgcj.so.81)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:101)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:62)
at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:123)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:191)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:173)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:330)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:254)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:210)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:143)

I can see the class inside org.apache.pdfbox.pdmodel.font.PDTrueTypeFont, but it's not getting resolved. I added the jar file as an external jar to my Eclipse project settings. Can anyone please tell me what I might be doing wrong?

Thanks,
SSD.

Anonymous said...

Adjie

Our problems are similar. The solution is to include the fontbox.jar file under /external in the pdfbox download.

Thanks,
SSD.
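
Since several of the errors in this thread are NoClassDefFoundError, a quick way to see which jar is missing is to probe the classpath before parsing anything. The sketch below is an assumption-laden illustration: DependencyCheck is a made-up class name, and the two class names in main are simply the ones that appear in the stack traces here.

```java
// Hypothetical helper, not part of PDFBox.
class DependencyCheck {

    // Returns true if the named class can be located on the current classpath.
    // The class is not initialized, so no static code in it runs.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, DependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            return false; // found, but one of its own dependencies could not be loaded
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack traces in the comments.
        String[] required = {
            "org.pdfbox.util.PDFTextStripper",
            "org.fontbox.afm.AFMParser"
        };
        for (String name : required) {
            System.out.println((isOnClasspath(name) ? "found:   " : "MISSING: ") + name);
        }
    }
}
```

Run it with the same -classpath value you use for PDFTextParser; any line marked MISSING points at a jar that still needs to be added.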

Grainne said...

Useful entry. I can run this all fine on a windows environment, but when i transfer this to a linux server there is a problem in the pdfbox parser and returns that the header is corrupt.
Are there any dependencies in the parser to adobe reader or anything like that?
Thanks,
Grainne

fatih seker said...

thank you very much. i would buy you a cup of coffee if you were around :)

Prasanna Seshadri said...

Hi Grainne,

I don't think Windows versus Linux matters here, because this is related to the jars. One suggestion would be to try the latest versions of the jars for the dependent packages.

Prasanna Seshadri said...

Hi Seker,

Thanks for saying so, I would love to have that coffee :)

ayush said...

I tried to run the code that you provided but it does not seem to be working
I added the jar in lib path
but it gave an error
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:350)
at org.pdfbox.pdmodel.font.PDFont.getAverageFontWidthFromAFMFile(PDFont.java:313)
at org.pdfbox.pdmodel.font.PDSimpleFont.getAverageFontWidth(PDSimpleFont.java:231)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:276)
at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at PDFTextParser.pdftoText(PDFTextParser.java:48)
at PDFTextParser.main(PDFTextParser.java:93)
Caused by: java.lang.ClassNotFoundException: org.fontbox.afm.AFMParser
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
... 14 more


I am using JDK 1.6

Please help its urgent
Regards

Ayush

Anonymous said...

Hi,
Is there any way I can convert a pdf document directly from a web link instead of a file on the local system?
Appreciated,

Anonymous said...

where to give d pdf file name and output filename

Prasanna Seshadri said...

The code as posted has no way to do that for a web link, but it can be enhanced to do so.
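
One possible enhancement, sketched here with the standard library only, is to copy the document behind the link into a temporary file and then hand that file to pdftoText. The class name PdfDownloader and the temp-file prefix are assumptions for illustration, and the sketch does no checking that the downloaded bytes really are a PDF.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

// Hypothetical helper, not part of PDFBox or of the listing above.
class PdfDownloader {

    // Copies the content behind the given link into a temporary file
    // and returns that file. The caller is responsible for deleting it.
    static File downloadToTempFile(String link) throws IOException {
        InputStream in = URI.create(link).toURL().openStream();
        try {
            File target = File.createTempFile("download", ".pdf");
            OutputStream out = new FileOutputStream(target);
            try {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
            } finally {
                out.close();
            }
            return target;
        } finally {
            in.close();
        }
    }
}
```

The path of the returned file (file.getPath()) can then be passed as the input file name to pdftoText.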

Prasanna Seshadri said...

Hi,

You can see from the post how the Java class is run. Use the command line arguments to give the input PDF file and the output text file.

Yati said...

Hi Prasanna,

I have run your code but end up to these errors. Could you help me please...Thanks in advance

org.pdfbox.exceptions.WrappedIOException
at org.pdfbox.util.PDFStreamEngine.(init)(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.(init)(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:62)
at PDFTextParser.main(PDFTextParser.java:107)
java.lang.NullPointerException
at org.pdfbox.util.PDFStreamEngine.(init)(PDFStreamEngine.java:117)
at org.pdfbox.util.PDFTextStripper.(init)(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:62)
at PDFTextParser.main(PDFTextParser.java:107)
PDF to Text Conversion failed.

Prasanna Seshadri said...

Hi,

Did you use the same JDK and the jars mentioned in the blog? Let me know.

S Marvasti said...

Great post. I have a better implementation that preserves formatting. Needless to say, the code is proprietary to the company I work for, but I can give you guys hints. There is a TextPosition object which you can access by subclassing PDFTextStripper.
I also recommend using the latest Apache incubator project.

You can then override a method called writePage in PDFTextStripper.

Calling
List textBoxes = getCharactersByArticle().get(0);
Collections.sort(textBoxes, new TextPositionComparator());

See javadocs for PDFTextStripper.

After getting this list in the writePage() method of PDFTextStripper (PDFBox 0.8), you have a sorted list of text objects with positions, which you may print out as you like using your own logic for determining spacing.


If you want I can help you out... email me... though detailed help will require donation.

Prasanna Seshadri said...

Hi,

Thanks for the offer, I appreciate it a lot. I will definitely get back to you when I rework this code for enhancement.

lilian said...

hi,

i want to work with pdfbox and i will lose my nerves soon, because it won't work. maybe you can help me.
i copied the jars from external and lib into a separate folder (named pdfbox) in my eclipse plugin folder. then i added these jars to my classpath and i see them under "Referenced Libraries".
the source code reports no error, but when i run this code i get the message "noclassdeffounderror: org/pdfbox/util/splitter".

Opita said...

It works like a charm :)

Thank u very much.

Anonymous said...

I am able to run the code and the text is generated out of a PDF file.

But when the pdf size is large, say 200mb, Java OutOfMemory Exception is thrown.

Any idea how do i go about this ?

My second question is: can I get the contents of a chosen page only as text, without loading the whole file?

Prasanna Seshadri said...

Hi,

This code is just meant for understanding the PDF parsing API. For big files, you can try a different stream to avoid OutOfMemory errors.

Anonymous said...

Hi,

I am new to this pdfbox.

I have downloaded the PDFBox-0.7.3 and i set the CLASSPATH using Environment variable in control panel/system.

i dont know how to compile and run the code. can you suggest me.

i tried using javac PDFTextParser.java it giving error message.

Sunil said...

Hello,

I have a pdf which is basically scanned from a piece of printed paper. Does PDFBox support reading text from such PDF?

atul said...

when i executed the above code then i got the following error. could u pls help me out.

Parsing text from PDF file c:\\test.pdf....
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Logger
at org.pdfbox.pdfparser.BaseParser.(BaseParser.java:70)
at test.pdftoText(test.java:33)
at test.main(test.java:84)

Kishan said...

Hi Prasanna,

First of all: Thanks for such a good post.

Problem: I used JDK1.6 and the jars mentioned in the blog.

The program Compiled fine. But when I tried to run it with the following command:

java PDFTextParser sample1.pdf aa.txt

or

java PDFTextParser sample1.pdf aa


I am getting the following exception.

Parsing text from PDF file sample1.pdf....
An exception occured in parsing the PDF Document.
org.pdfbox.exceptions.WrappedIOException
at org.pdfbox.util.PDFStreamEngine.(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:49)
at PDFTextParser.main(PDFTextParser.java:91)
java.lang.NullPointerException
at org.pdfbox.util.PDFStreamEngine.(PDFStreamEngine.java:117)
at org.pdfbox.util.PDFTextStripper.(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:49)
at PDFTextParser.main(PDFTextParser.java:91)
PDF to Text Conversion failed.

can you please help me to resolve the problem.

Thanks in advance.

Prateek said...

Hi prasanna !
when i am including the jar from the 'lib' in PDFBox-0.7.3 in the classpath, the error i am getting is:

Exception in thread "main" java.lang.NoClassDefFoundError: and
Caused by: java.lang.ClassNotFoundException: and
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)

please solve it out..
looking forward for a reply soon !

-prtk

ochingwa said...

thanks very much for your post,it is very helpful.
I am getting some exceptions concerning PDFTextParser. I would like some more help to know how to resolve them. The following are the exceptions:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
at org.apache.pdfbox.pdfparser.BaseParser.(BaseParser.java:58)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:846)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
at brainuppdfparser.BrainupPdfParser.main(BrainupPdfParser.java:35)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 5 more

meenal said...

hello,
i am getting around 100 errors with this code. when i unzipped PDFBox-0.7.3, fontbox.jar was not there. and these errors, i am including. please help...!!!

C:\Program Files\Java\jdk1.6.0_14\bin\PDFBox-0.7.3\org\pdfbox\pdmodel\common\PDRectangle.java:38: package org.fontbox.util does not exist
import org.fontbox.util.BoundingBox;
^
C:\Program Files\Java\jdk1.6.0_14\bin\PDFBox-0.7.3\org\pdfbox\pdmodel\font\PDFont.java:33: package org.fontbox.afm does not exist
import org.fontbox.afm.AFMParser;
^
C:\Program Files\Java\jdk1.6.0_14\bin\PDFBox-0.7.3\org\pdfbox\pdmodel\font\PDFont.java:35: package org.fontbox.afm does not exist
import org.fontbox.afm.FontMetric;

and so on

meenal said...

hi again. the previous program was solved, but here's a new one.
Please help...

An exception occured in parsing the PDF Document.
org.pdfbox.exceptions.WrappedIOException: org.pdfbox.util.operator.ShowTextGlyph

at org.pdfbox.util.PDFStreamEngine.(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:45)
at PDFTextParser.main(PDFTextParser.java:87)
java.lang.ClassNotFoundException: org.pdfbox.util.operator.ShowTextGlyph
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:169)
at org.pdfbox.util.PDFStreamEngine.(PDFStreamEngine.java:122)
at org.pdfbox.util.PDFTextStripper.(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:45)
at PDFTextParser.main(PDFTextParser.java:87)
PDF to Text Conversion failed.

Rahul Patil said...

Thanks!
Its really a good post..

bgung said...

i just found this link...
it says... pdfbox isn't it??
so ... it's online pdf2txt w/ pdfbox?

http://www.fileformat.info/convert/doc/pdf2txt.htm

pradeep said...

Hi, first of all thank you for your effort in making this. With this I can convert the PDF into a text file, but the format is completely changed. Can I make the text file without losing the format? In the PDF I have a table-like format. It is not necessary to print the table in the text file, but I want a table-like format for displaying.

Thankyou,
Pradeep

Anonymous said...

check it

AHK said...

Hi Guys
Im having these 2 Exceptions

org.pdfbox.wrappedioexception.

and

java.lang.NullPointerException

Abed said...

Hi very good post in fact!.

Could you give me a hand with this error in order to know what is happening ?

C:\Program Files\Java\jdk1.6.0_19\bin>java PDFTextParser corte300121112009Price.pdf salidacorte.txt
Parsing text from PDF file corte300121112009Price.pdf....
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:350)
at org.pdfbox.pdmodel.font.PDFont.getAverageFontWidthFromAFMFile(PDFont.java:313)
at org.pdfbox.pdmodel.font.PDSimpleFont.getAverageFontWidth(PDSimpleFont.java:231)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:276)
at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at PDFTextParser.pdftoText(PDFTextParser.java:61)
at PDFTextParser.main(PDFTextParser.java:101)
Caused by: java.lang.ClassNotFoundException: org.fontbox.afm.AFMParser
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 14 more

chandra said...

Hello Prashant,

Thank you very much for sharing this article.
I still need some more help, I have been able to convert a PDF file into a text file successfully, however my pdf file is as follows:

Column1 Column2
Ram is 1000
manager
Sam is 2000
supervisor

So when its extracted into a text file it appears as:
Column1
Column2
Ram is
manager
1000
Sam is
supervisor
2000

The challenge i am facing is to determine if "Ram is manager " should go to Column1 and not
"Ram is " ---> Column1
"manager " ---> Column2

Hope I was able to explain my problem clearly.
If anyone has dealt with this issue, I would really appreciate it if they could share their solution. I have already wasted a week's time exploring the pdfbox and itext APIs.

Thanks to all in advance.

Chandra
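
One way to approach the two-column problem above, building on the earlier hint about text positions, is to regroup fragments by their coordinates. The sketch below is not PDFBox code: Fragment is a hypothetical stand-in for a positioned piece of text, the coordinates and the column boundary are made-up values, and it assumes each value in the second column lines up with the first line of its cell. How the positions are obtained from PDFBox is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class ColumnGrouper {

    // Hypothetical positioned text fragment; a stand-in, not a PDFBox class.
    static class Fragment {
        final String text;
        final float x; // left edge of the fragment
        final float y; // distance from the top of the page
        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Builds {column1, column2} rows. Every fragment at or right of
    // columnBoundary opens a row; every fragment left of it is appended to
    // the last row that starts at or above the fragment's own y position.
    static List<String[]> toRows(List<Fragment> fragments, float columnBoundary) {
        List<Fragment> sorted = new ArrayList<Fragment>(fragments);
        Collections.sort(sorted, new Comparator<Fragment>() {
            public int compare(Fragment a, Fragment b) {
                return Float.compare(a.y, b.y); // top to bottom
            }
        });

        List<String[]> rows = new ArrayList<String[]>();
        List<Float> rowTops = new ArrayList<Float>();
        for (Fragment f : sorted) {
            if (f.x >= columnBoundary) {
                rows.add(new String[] {"", f.text});
                rowTops.add(f.y);
            }
        }
        for (Fragment f : sorted) {
            if (f.x >= columnBoundary || rows.isEmpty()) {
                continue;
            }
            int index = 0;
            for (int i = 0; i < rowTops.size(); i++) {
                if (rowTops.get(i) <= f.y) {
                    index = i;
                }
            }
            String[] row = rows.get(index);
            row[0] = row[0].isEmpty() ? f.text : row[0] + " " + f.text;
        }
        return rows;
    }
}
```

With fragments for "Ram is" and "1000" on one line and "manager" on the next line in the left column, the first data row comes out as "Ram is manager" and "1000". Tables whose second-column values are vertically centred in the cell would need a different rule.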

Sweta said...

Hi !!

I want to know if one can extract the bibliography data from pdf documents like research papers using pdfbox ?? I need it for my thesis !!

Thanks

sankar said...

I NEED TO GET FONT NAME AND SIZE FROM THE PDF TEXT FILES..ANYBODY HELP TO ME?

Anant said...

Hi,
Thanks a lot.
I was stuck from last 3 weeks.
Anant Kumar

Kevin Chua (凯文) said...

Hi,

I managed to extract the content but can i know why all my text display in this format :
"uni0045uni006Euni0067uni006Cuni0069uni0073uni0068uni0020uni0057uni0069uni006Buni0069uni0070uni0065uni0064uni0069uni0061uni0020uni0072uni0069uni0067uni0068uni0074uni0020uni006Euni006Funi0077 " ?

Can teach me how to get back the original text value but not in this format ?
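
The "uniXXXX" tokens above look like glyph names that carry a four-digit hexadecimal character code, so the quoted sample can be turned back into readable text after extraction. The sketch below is a workaround on the output string, not a fix inside PDFBox, and GlyphNameDecoder is a made-up class name. It is blunt: any ordinary word that happens to contain "uni" followed by four hex digits would be converted too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper, not part of PDFBox.
class GlyphNameDecoder {

    // Matches "uni" followed by exactly four hexadecimal digits.
    private static final Pattern UNI_NAME = Pattern.compile("uni([0-9A-Fa-f]{4})");

    // Replaces each "uniXXXX" token with the character whose code is hex XXXX
    // and leaves all other text untouched.
    static String decode(String extracted) {
        Matcher matcher = UNI_NAME.matcher(extracted);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            char c = (char) Integer.parseInt(matcher.group(1), 16);
            matcher.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

For example, GlyphNameDecoder.decode("uni0045uni006Euni0067") returns "Eng".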

kiran said...

Hi,

I tried executing your code as it is..but the following exception popups help..

Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at PDFTextParser.pdftoText(PDFTextParser.java:52)
at PDFTextParser.main(PDFTextParser.java:95)

I did include all the jar files including org.font...etc.
My jvm version is 1.5

Anonymous said...

Hi prasanna,
nice code ! it works good ! Do you have maybe an example to index a pdf-files with the pdfbox ?
crombix.

Prasanna Seshadri said...

Not sure about indexing, the idea is to have a simple text parser for PDF documents.

Anonymous said...

hi, can help me? i am new. using netbean. how do i using the source code provided?

Coder Xpert said...

Hi....
I have read ur blog and I like ur work in pdf and java.
I am new in this field. I want to make a program that take a PDF file in input and as an out put it shows all Headings of the Document.
For example there is a document with 5 pages than my program should show all the different type of headings in that 5 pages.
what i need,
1. that is there any object/tag for heading in pdf heading.
2. Or any other way to do this.

I will be very thank full to u for ur time

Vignesh said...

Thanks a lot buddy..how can we use the command line parameters in this to convert pdf to html

crys said...

am using netbeans 6.9...pls help me know where to include the jar files , to find the jar files in pdfbox library and set the external directory to the classpath

Prasanna Seshadri said...

For people who are reporting compilation errors: if you follow the above procedure exactly, it should work fine. Please make sure that the packages you use are the same ones I used to compile, because the jars may have changed in newer versions.

Anonymous said...

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
at org.apache.pdfbox.pdfparser.BaseParser.(BaseParser.java:58)
at de.fhwedel.swp.indexier06.Main.main(Main.java:188)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 2 more

I dont know what the program want???

Rana said...

I want to write text to PDF file with position x,y,width and height as we do it for read using objTextStripperbyArea.getTextForRegion.

Anonymous said...

Thanks for your blog. I have a problem extracting text, the bold letters in the pdf file goes to the end of the corresponding line. I have tried by setting pdfTextStripper.setSortByPosition(true), but it makes some other contents misplaced. Is there any other options to make this work?

LordMax said...

Hi to all

For error message:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
at org.apache.pdfbox.pdfparser.BaseParser.(BaseParser.java:58)

The jar commons-logging-x.x.x.jar is missing.
You can find it here:
http://commons.apache.org/logging/download_logging.cgi

Anonymous said...

It works with following commands:

compile:
javac -classpath ./PDFBox-0.7.3/lib/PDFBox-0.7.3.jar:/ PDFTextParser.java

execute:
java -classpath ../PDFBox-0.7.3/lib/PDFBox-0.7.3.jar:../PDFBox-0.7.3/external/bcmail-jdk14-132.jar:../PDFBox-0.7.3/external/bcprov-jdk14-132.jar:../PDFBox-0.7.3/external/checkstyle-all-4.2.jar:../PDFBox-0.7.3/external/junit.jar:../PDFBox-0.7.3/external/lucene-demos-2.0.0.jar:../PDFBox-0.7.3/external/ant.jar:../PDFBox-0.7.3/external/FontBox-0.1.0-dev.jar:../PDFBox-0.7.3/external/lucene-core-2.0.0.jar:. PDFTextParser ECN\ 001\ \(LDN120508\).pdf kernelsource1.txt

Jani Verkkomäki

avi said...

hiiiii....
i am avinash...
i am using your code in my program...but it shows some errors can u tell me how to solve these error....pls tell me as soon as possible..very urgent...


Parsing text from PDF file c:/simple.pdf....
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:350)
at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at javaapplication12.PDFTextParser.pdftoText(Main.java:54)
at javaapplication12.Main.main(Main.java:97)
Caused by: java.lang.ClassNotFoundException: org.fontbox.afm.AFMParser
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 13 more

Thank you.

Prasanna Seshadri said...

Looks like the class org.fontbox.afm.AFMParser is missing from your classpath, which is what the java.lang.NoClassDefFoundError is reporting.

Use the version of PDFBox mentioned in this article, together with the FontBox jar that ships with it.

Grewal (Nitin) said...

I want to extract the font size of the text, as a single line can contain two different fonts.

shravu said...

Hi,

I need help developing a tool that can compare two PDF files and highlight the non-matching content. Could you please suggest how to do it?

Please do reply.

Thanks in advance.

Hasan Rahman said...

Hello, I am facing the problem shown in the following snippet. I use the same jar and JDK as you suggest, but I still get the error below. I am trying to extract Bangla text from a PDF file. Any suggestions? Please help.

G:\PDFBOX>javac PDFTextParser.java

G:\PDFBOX>java PDFTextParser book.pdf test.txt
Parsing text from PDF file book.pdf....
An exception occured in parsing the PDF Document.
org.pdfbox.exceptions.WrappedIOException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:59)
at PDFTextParser.main(PDFTextParser.java:102)
java.lang.NullPointerException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:117)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:59)
at PDFTextParser.main(PDFTextParser.java:102)
PDF to Text Conversion failed.

Wei said...

This program runs great for me. Thanks!
However, I cannot extract any filled-in text from my PDF file. The filled-in text is all missing, so essentially I get a blank PDF converted to a text file.
Has anyone experienced the same, and what solution do you have to get around it?
Thanks a million!


Wei

vaishali said...

Hi Prasanna Seshadri

Thanks for this post. I can convert PDFs using this code, but I am getting lots of "null" strings in place of characters.

Can anyone suggest a fix?

Output:
Interest nullearinnull loans and
nullorronullinnulls and interest free loans 1nullnull21null 1nullnull2nullnull 11null1nullnull nullnullnullnull1 3nullnull 10null nullnullnullnullnull2
nullinance leases 2null0 2nullnull 13null null null null nullnull3
Trade and other payanullles 3nullnullnull3null 1nullnull null null null null 3nullnullnull32
null1nullnullnull3 1nullnullnull0null 11null32null nullnullnullnull1 3nullnull 10null null3null10null
The tanullle nullelonull summarises the maturity profile of the nullompanynulls financial lianullilities nullincludinnull trade and other payanulllesnull at 30 April 2010 and
30 April 200null nullased on contractual undiscounted paymentsnull
Contractual cash nullows
nullithin
one year

Chenda said...

The FontBox library should be added to the classpath.
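
One quick way to confirm this diagnosis is a small check run with the same -classpath as the parser. This is only a sketch and not part of PDFBox: the class name ClasspathCheck and the helper isOnClasspath are made up for illustration, and the two class names it looks for are taken from the stack traces in the comments above.

```java
// Sketch: report whether the classes the parser needs are visible on the classpath.
class ClasspathCheck {

    // Returns true if the named class can be located by this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names copied from the stack traces posted in the comments.
        String[] required = {
            "org.pdfbox.pdfparser.PDFParser", // expected in the PDFBox 0.7.3 jar
            "org.fontbox.afm.AFMParser"       // expected in the FontBox jar
        };
        for (String name : required) {
            System.out.println(name + (isOnClasspath(name) ? " : found" : " : MISSING"));
        }
    }
}
```

If a line is reported as MISSING, the jar that should contain that class is probably absent from the -classpath argument (or from the IDE's library list).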

Mariana said...

Thank you very much!

Anonymous said...

Hello guys, I have a small problem when trying to read a PDF file. I get this:

Past v vismaz divas probl mas, kas ir j risina programm t jiem, t s ir plaša projekta
funkcionalit te un t izveides laiks.

in place of:

Pastāv vismaz divas problēmas, kas ir jārisina programmētājiem, tās ir plaša projekta
funkcionalitāte un tā izveides laiks.

Can someone tell me how I can add other encodings to the text parser?

Gaurav said...

Respected Sir,
I need to convert a PDF containing a hyperlink to HTML, so that the hyperlink is displayed in the HTML and clicking it redirects to the specified URL. If, while extracting the PDF, I can parse the hyperlink in such a way that I can identify the converted text as a hyperlink, my job will be done.
If you have any solution, please reply to me at this email: gaurav.das@saggezza.com

Thanks in advance

Anonymous said...

Works great thanks.

I am using PDFBox 1.6.0, which has a class called ExtractText. It does the same thing but is a lot simpler.

ExtractText.main(new String[]{"your pdf.pdf"});

It strips everything and gives you a text file.

priyanka said...

Thanks, sir, it helped a lot.

Anonymous said...

I can't understand how to run this code. Kindly explain it in more detail.
Please explain point 3.
I'll be very thankful to you.

asif ali said...

Please tell me how to run this code. Eclipse said there is no project to import. Can anyone please tell me how to run this code ASAP?

Anonymous said...

Thanks for this useful post.
This code works well for some PDFs, but for some PDFs generated from SWIFT messages I am getting the following error:

Parsing text from PDF file MT1xx.pdf....
An exception occured in parsing the PDF Document.
java.lang.NullPointerException
at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at newpackage.ConvertPDFToTEXT.pdftoText(ConvertPDFToTEXT.java:166)
at newpackage.ConvertPDFToTEXT.main(ConvertPDFToTEXT.java:206)
PDF to Text Conversion failed.


Please help me in this.

abhishek joshi said...

Hi All,

I am trying to convert PDF to HTML.
For that purpose I am using the ExtractText class.
After running the program I get the following exception:

Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:860)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:825)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:750)
at pack.ExtractText.startExtraction(ExtractText.java:180)
at pack.ExtractText.main(ExtractText.java:60)
Caused by: java.util.NoSuchElementException
at java.util.AbstractList$Itr.next(Unknown Source)
at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
... 5 more


This problem seems to be with PDF version 1.5.

The resolution I have found so far is "to use PDFBOX 1.3". I am still trying to find an appropriate solution but could not find much.

Any help will be appreciated.

Thanks,
Abhi.

saravanapriyan vallinayagam said...

I want to convert a PDF file to a blob and store the blob content in a MySQL database.
I then want to retrieve the blob from the DB and display it in the browser as a PDF file. Please help me.

Kiran Dhanave said...

Hi Prasanna,

I copied the code and it is working fine. But I also want to preserve formatting (font, color, etc.), like creating an MS Word file from a PDF (restoring images, etc.). Is it possible?


Regards,
Kiran

Anonymous said...

I am getting the exception below when I use the PDFBox 1.5.0 jar. Can anybody tell me how to resolve this?
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/fontbox/afm/AFMParser
at org.apache.pdfbox.pdmodel.font.PDFont.addAdobeFontMetric(PDFont.java:136)
at org.apache.pdfbox.pdmodel.font.PDFont.getAdobeFontMetrics(PDFont.java:108)
at org.apache.pdfbox.pdmodel.font.PDFont.(PDFont.java:101)
at com.hds.ebook.utility.testMerge.doIt(testMerge.java:43)
at com.hds.ebook.utility.testMerge.main(testMerge.java:97)

Anonymous said...

Hi,

I am running the same code on Android, with whatever changes were needed. I need to extract text only. The problem I am facing is that when the file size is more than 6 MB, I run out of memory.
This problem occurs while parsing the PDF. When we call parser.parse(), the whole PDF file is parsed and loaded into the buffer, which takes a lot of memory.
Can we parse the PDF page by page, so that it does not consume so much memory?
For parsing a 3 MB PDF, the memory used is about 40 MB.

Sasank said...

I am able to extract text from the PDF and convert it to a text file. However, if the PDF contains a tick mark or check mark, it is not displayed in the converted text. Is this supported or not?

Veera Saravanan said...


VEERA..........

Hi everyone,

I have a small issue when I run my application, which converts PDF files to text files. I get output for almost all PDF files, but for some PDF files an error occurs and they are not converted to text. Please explain how to rectify this.


java.lang.NullPointerException
........filename2.txt
PDF convert failed
at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at file_convertprocess.PDF_to_Text.pdftoText(PDF_to_Text.java:58)
at file_convertprocess.Main.main(Main.java:31)


Mackmilan Selva said...

Hi Prasanna,

Your program is working fine,
but I need to know how to extract text from scanned-image PDF files.
Please help me.

Here -------begins the journey said...

I am getting this exception, please help.

C:\testjava>java PDFTextParser eng.pdf english.txt
Parsing text from PDF file eng.pdf....
Exception in thread "main" java.lang.NoClassDefFoundError: org/pdfbox/pdfparser/PDFParser
at PDFTextParser.pdftoText(PDFTextParser.java:50)
at PDFTextParser.main(PDFTextParser.java:101)
Caused by: java.lang.ClassNotFoundException: org.pdfbox.pdfparser.PDFParser
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 2 more

sagar devanahalli said...

Parsing text from PDF file a.pdf....
An exception occured in parsing the PDF Document.
org.pdfbox.exceptions.WrappedIOException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:59)
at PDFTextParser.main(PDFTextParser.java:101)
java.lang.NullPointerException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:117)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:59)
at PDFTextParser.main(PDFTextParser.java:101)
PDF to Text Conversion failed.

Luz Flores said...

Hi,
I've been working with this API but I have a problem.
I have a very specific format in a PDF, but the stripper is returning the text in a random order. Does anyone know how the stripper establishes the order for a document?
Or how can I change this?

ANKUSH RAINA said...

Hello Prasanna, I have a requirement to compare two PDF files that contain images as well as text. I need to compare the images inside and mark any differing areas in a third, generated PDF. Can you help me? It's very urgent!

someone said...

Hi,
When I use this example to extract pdf files, I get gibberish content which is unreadable, any ideas would be appreciated.

Thanks,

Anonymous said...

http://www.blogger.com/profile/03273093803320353055 Hey,
can you please tell me how you resolved this problem? I am having the same errors. It's urgent.

Vinay said...

hi,
It's good code, buddy.

Madhu Agrawal said...

I have no words to thank you. You saved me a hell of a lot of trouble. Thank you so much.

AnanD Navalar said...

java PDFTextParser FirstPdf.pdf aa.txt

AnanD Navalar said...

Where do I put my PDF file?
java PDFTextParser FirstPdf.pdf aa.txt
I ran this on my command line but nothing happened after that.
th

AnanD Navalar said...

help me

Prasad said...

Hi Prasanna, thanks for the great post. I am able to extract the data from a saved PDF file. If I have to extract the data from a URL which opens a PDF file, how do I do that?

Thanks in advance
Prasad

Prasanna Seshadri said...

Prasad, in that case use a URL library (like Python's urllib, or similar modules in other languages) to read the contents and feed them to the parser.
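
In Java, the same idea can be sketched with the standard library alone: copy the bytes behind the URL to a temporary file, then pass that file's path to the parser. The names PdfDownloader and downloadToTempFile are hypothetical and used only for illustration; pdftoText is the method from Listing 1.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch: save a remote PDF locally so a file-based parser can read it.
class PdfDownloader {

    // Copies the bytes behind the URL into a temporary .pdf file and returns its path.
    static Path downloadToTempFile(String url) throws IOException {
        Path target = Files.createTempFile("download-", ".pdf");
        try (InputStream in = URI.create(url).toURL().openStream()) {
            // createTempFile already created the file, so overwrite it
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: java PdfDownloader <url-of-pdf>");
            return;
        }
        Path pdf = downloadToTempFile(args[0]);
        System.out.println("Saved to " + pdf);
        // The saved path can then be handed to the parser from Listing 1, e.g.
        // new PDFTextParser().pdftoText(pdf.toString());
    }
}
```

Going through a temporary file keeps the parser code unchanged, at the cost of one extra copy on disk. The temporary file is not deleted by this sketch.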

Diana Ross said...

I used Aspose.PDF for Java to convert my PDF file to text, and I am very satisfied with the result. You can also convert text files to PDF with this library. I found it very useful; you should try it too.

http://www.aspose.com/java/pdf-component.aspx

Prasanna Seshadri said...

Thanks, that's helpful.

Goitom Gebrehiwot said...

Hello Dears,

When I try to index text files using the code below, I come across errors like (Error one4, Error one5, Error one6, Error one3, ....). I tried saving the file in different formats (UTF-8, Big Endian, UTF), but only the number of errors changed.

The text is in Ethiopic (Geez), and I have my own analyzer.

The environment is: Windows 7, Netbeans IDE 7.3.1, and I have included the necessary jar files. Please help me to avoid these errors.

// Reads a UTF-8 text file and adds it to the Lucene index as one document.
// Requires java.io.* plus org.apache.lucene.document.Document,
// org.apache.lucene.document.Field and org.apache.lucene.index.IndexWriter.
public void addTextDocument(String htmlPath, IndexWriter Writerindex) throws Exception {
    File file = new File(htmlPath);
    StringBuilder buffer = new StringBuilder();

    // try-with-resources closes the reader (and the underlying stream) even on failure
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), "utf-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // readLine() strips the line break; re-add it so that words on
            // adjacent lines are not fused together in the indexed content
            buffer.append(line).append('\n');
        }
    }

    String content = buffer.toString();
    String url = file.getName();

    Document document = new Document();
    if (!url.equals("")) {
        document.add(Field.Keyword("url", url));
    }
    if (!content.equals("")) {
        document.add(Field.Text("content", content));
    }

    try {
        System.out.println("=====================================");
        Writerindex.addDocument(document);
        System.out.println("=====================================");
    } catch (IOException e) {
        e.printStackTrace();
    }
}


Copyright © 2008 Prasanna Seshadri, www.prasannatech.net, All Rights Reserved.
No part of the content or this site may be reproduced without prior written permission of the author.