Converting PDF to text is useful in many applications, from search engines that index PDF documents to other data processing tasks. I was looking for a Java-based API to convert PDF to text (in other words, a PDF text parser in Java), and after going through many articles, the PDFBox project came to my rescue. PDFBox is a library that can handle different types of PDF documents, including encrypted ones, and extract their text. It also has a command-line utility for converting PDF documents to text.
I needed a reusable Java class to convert PDF documents to text in one of my projects, and the Java code below does this using the PDFBox Java API. It takes two command-line parameters: the input PDF file and the output text file to which the parsed text from the PDF document will be written.
This code was tested with PDFBox 0.7.3, although it should work with other versions of PDFBox as well (newer Apache releases use org.apache.pdfbox package names, so the imports may need adjusting). It can be easily integrated with other Java applications and can also be used as a command-line utility. The steps to compile and run the code are given after the listing.
Listing 1: PDFTextParser.java
1: /*
2:  * PDFTextParser.java
3:  * Author: S.Prasanna
4:  *
5:  */
6:
7: import org.pdfbox.cos.COSDocument;
8: import org.pdfbox.pdfparser.PDFParser;
9: import org.pdfbox.pdmodel.PDDocument;
10: import org.pdfbox.pdmodel.PDDocumentInformation;
11: import org.pdfbox.util.PDFTextStripper;
12:
13: import java.io.File;
14: import java.io.FileInputStream;
15: import java.io.PrintWriter;
16:
17: public class PDFTextParser {
18:
19:     PDFParser parser;
20:     String parsedText;
21:     PDFTextStripper pdfStripper;
22:     PDDocument pdDoc;
23:     COSDocument cosDoc;
24:     PDDocumentInformation pdDocInfo;
25:
26:     // PDFTextParser Constructor
27:     public PDFTextParser() {
28:     }
29:
30:     // Extract text from PDF Document
31:     String pdftoText(String fileName) {
32:
33:         System.out.println("Parsing text from PDF file " + fileName + "....");
34:         File f = new File(fileName);
35:
36:         if (!f.isFile()) {
37:             System.out.println("File " + fileName + " does not exist.");
38:             return null;
39:         }
40:
41:         try {
42:             parser = new PDFParser(new FileInputStream(f));
43:         } catch (Exception e) {
44:             System.out.println("Unable to open PDF Parser.");
45:             return null;
46:         }
47:
48:         try {
49:             parser.parse();
50:             cosDoc = parser.getDocument();
51:             pdfStripper = new PDFTextStripper();
52:             pdDoc = new PDDocument(cosDoc);
53:             parsedText = pdfStripper.getText(pdDoc);
54:         } catch (Exception e) {
55:             System.out.println("An exception occured in parsing the PDF Document.");
56:             e.printStackTrace();
57:             try {
58:                 if (cosDoc != null) cosDoc.close();
59:                 if (pdDoc != null) pdDoc.close();
60:             } catch (Exception e1) {
61:                 e1.printStackTrace();
62:             }
63:             return null;
64:         }
65:         System.out.println("Done.");
66:         return parsedText;
67:     }
68:
69:     // Write the parsed text from PDF to a file
70:     void writeTexttoFile(String pdfText, String fileName) {
71:
72:         System.out.println("\nWriting PDF text to output text file " + fileName + "....");
73:         try {
74:             PrintWriter pw = new PrintWriter(fileName);
75:             pw.print(pdfText);
76:             pw.close();
77:         } catch (Exception e) {
78:             System.out.println("An exception occured in writing the pdf text to file.");
79:             e.printStackTrace();
80:         }
81:         System.out.println("Done.");
82:     }
83:
84:     // Extracts text from a PDF Document and writes it to a text file
85:     public static void main(String args[]) {
86:
87:         if (args.length != 2) {
88:             System.out.println("Usage: java PDFTextParser <InputPDFFile> <OutputTextFile>");
89:             System.exit(1);
90:         }
91:
92:         PDFTextParser pdfTextParserObj = new PDFTextParser();
93:         String pdfToText = pdfTextParserObj.pdftoText(args[0]);
94:
95:         if (pdfToText == null) {
96:             System.out.println("PDF to Text Conversion failed.");
97:         }
98:         else {
99:             System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
100:             pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
101:         }
102:     }
103: }
The above code takes two command-line parameters: the input PDF file and the output text file. The pdftoText method (line 31) handles the text parsing, and the writeTexttoFile method (line 70) writes the parsed text to the output file.
Compiling and running the code:
I used PDFBox 0.7.3 to compile and run the above code, so you need to add its jars to your Java project settings.
1. Download PDFBox 0.7.3 from here.
2. Unzip PDFBox-0.7.3.zip.
3. Under the PDFBox-0.7.3 folder, add the jar in the lib directory (PDFBox-0.7.3.jar) and the jars in the external directory (other packages used by PDFBox 0.7.3) to the classpath to compile and run the code.
Note: I used JDK 1.6 to compile the above code.
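As a rough aid for step 3, the sketch below collects every .jar file found directly inside the lib and external folders and prints a classpath string that could be passed to javac or java with -cp. The class name ClasspathBuilder and the method buildClasspath are illustrative names and are not part of PDFBox; only the folder names come from the steps above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of PDFBox): builds a -cp string from jar folders.
class ClasspathBuilder {

    // Returns the absolute paths of all .jar files found directly inside the
    // given directories, joined with the platform's path separator.
    static String buildClasspath(File... dirs) {
        List<String> jars = new ArrayList<String>();
        for (File dir : dirs) {
            File[] files = dir.listFiles();
            if (files == null) {
                continue; // not a directory, or not readable
            }
            Arrays.sort(files); // predictable order within each folder
            for (File f : files) {
                if (f.isFile() && f.getName().toLowerCase().endsWith(".jar")) {
                    jars.add(f.getAbsolutePath());
                }
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < jars.size(); i++) {
            if (i > 0) {
                sb.append(File.pathSeparator);
            }
            sb.append(jars.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: args[0] is the unzipped PDFBox-0.7.3 folder.
        File root = new File(args.length > 0 ? args[0] : "PDFBox-0.7.3");
        String cp = buildClasspath(new File(root, "lib"), new File(root, "external"));
        // Append the current directory so PDFTextParser.class is found as well.
        System.out.println(cp + File.pathSeparator + ".");
    }
}
```

The printed string can be used with both javac -cp and java -cp. Put it in quotes if any folder name contains spaces.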




55 comments:
I also had a similar problem. Although I am able to extract text, I am not able to keep the formatting information, which is crucial for me. I spent so much time figuring out how to get the text. It would have been better if I had seen this post before I started looking for Java APIs :-(
Hi,
Thanks for saying so, glad that you landed on this page.
Thanks for sharing this useful information prasanna.
hi prasanna.. nice post.
it works fine when i use it as it was. but when i change the method pdftoText from default to public static. and i called it from another class, the compiler said
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at TA.document.PDFHandler.getText(PDFHandler.java:45)
at TA.eksperimen.main(eksperimen.java:33)
Caused by: java.lang.ClassNotFoundException: org.fontbox.afm.FontMetric
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
... 13 more
FYI i rename PDFTextParser(String filename) to PDFHandler(File file)
and i called it from eksperimen.java with syntax
file = new File("C:\\Users\\root\\Documents\\indexDocs\\presentation_tips.pdf");
tes = PDFHandler.getText(file); System.out.println("PDFHandler\n" + tes);
Sorry.. i rename the class PDFTextParser to PDFHandler
and also String pdftoText(String fileName) to getText(File file).
so can u help me to point out my mistake..?
Hi,
From the error you are getting, it looks like some jars are missing from your classpath. You need to add the jars I mentioned in the post; let me know if that helps.
does PDFBox 0.7.3 work on jdk 1.4.2
Thanks for this post. Helped me a lot. I havent touched Java in almost three years and was lost as to how I would use this.
Hi Manish,
PDFBox should work in JDK 1.4.2, but I haven't tried it.
Psybuck, thanks for saying so.
thanx for the reply ,
i used ur code as it is ,i am getting error for this line
PrintWriter pw = new PrintWriter(fileName); because the PrintWriter constructor cannot take a String object as a parameter. What should I do? Can you help me?
Oh, got your problem. From Java 5 onwards, PrintWriter has a constructor that takes a String file name. In 1.4.2, you need to create a FileWriter from the file name and then create the PrintWriter from that FileWriter, so change that part accordingly.
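A minimal sketch of that change, using only standard java.io classes. The class name TextFileWriter is illustrative; the method mirrors writeTexttoFile from the listing but avoids the PrintWriter(String) constructor, which is only available from Java 5.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Illustrative replacement for writeTexttoFile that also compiles on JDK 1.4.2.
class TextFileWriter {

    static void writeTextToFile(String pdfText, String fileName) throws IOException {
        // Wrap a FileWriter instead of passing the file name to PrintWriter directly.
        PrintWriter pw = new PrintWriter(new FileWriter(fileName));
        try {
            pw.print(pdfText);
        } finally {
            pw.close(); // flushes the buffer and releases the file handle
        }
    }
}
```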
thanx it worked but my original formatting in pdf was lost in txt file ,the alignment and all what to do
HI,
i need some help in pdfbox ,can i get all data from pdf file with out losing formatting information
Hi,
I am not sure how you want the text to be, but if you have all text extracted, then it should be fine.
hi prasanna ,
i am getting the text correctly but the actual pdf file looks different .here is the link to the pdf file .Can i get the same design as pdf in my text file
sorry here is the link to my pdf
http://www.wikifortio.com/338703/MPRPOV.pdf
Hello,
I was trying to do this, but I got an error that said this:
Exception in thread "main" java.lang.NoClassDefFoundError: org.apache.pdfbox.pdmodel.font.PDTrueTypeFont
at java.lang.Class.initializeClass(libgcj.so.81)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:101)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:62)
at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:123)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:191)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:173)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:330)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:254)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:210)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:143)
I can see the class inside org.apache.pdfbox.pdmodel.font.PDTrueTypeFont, but it's not getting resolved. I added the jar file as an external jar to my Eclipse project settings. Can anyone please tell me what I might be doing wrong?
Thanks,
SSD.
Adjie
Our problems are similar. The solution is to include the fontbox.jar file under /external in the pdfbox download.
Thanks,
SSD.
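One way to find out which jar is missing before running the parser is to probe for the classes named in the stack traces. This is only a diagnostic sketch: ClasspathCheck and isOnClasspath are made-up names, and the class names in main are copied from errors reported in the comments on this post.

```java
// Diagnostic sketch: reports whether classes named in the stack traces can be loaded.
class ClasspathCheck {

    // Returns true if the named class is visible to this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // 'false' skips static initializers; we only test visibility.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            return false; // class found, but something it depends on is missing
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.pdfbox.util.PDFTextStripper", // PDFBox itself
            "org.fontbox.afm.AFMParser",       // FontBox, from the external folder
            "org.apache.log4j.Logger"          // log4j, reported missing in one comment
        };
        for (int i = 0; i < required.length; i++) {
            System.out.println(required[i] + " : "
                    + (isOnClasspath(required[i]) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath you use for PDFTextParser; any line marked MISSING points to a jar that still has to be added.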
Useful entry. I can run this all fine on a windows environment, but when i transfer this to a linux server there is a problem in the pdfbox parser and returns that the header is corrupt.
Are there any dependencies in the parser to adobe reader or anything like that?
Thanks,
Grainne
thank you very much. i would buy you a cup of coffee if you were around :)
Hi Grainne,
I think Windows versus Linux should not matter, because the code only depends on the jars. One suggestion would be to try the latest versions of the jars for the dependent packages.
Hi Seker,
Thanks for saying so, I would love to have that coffee :)
I tried to run the code that you provided but it does not seem to be working
I added the jar in lib path
but it gave an error
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:350)
at org.pdfbox.pdmodel.font.PDFont.getAverageFontWidthFromAFMFile(PDFont.java:313)
at org.pdfbox.pdmodel.font.PDSimpleFont.getAverageFontWidth(PDSimpleFont.java:231)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:276)
at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at PDFTextParser.pdftoText(PDFTextParser.java:48)
at PDFTextParser.main(PDFTextParser.java:93)
Caused by: java.lang.ClassNotFoundException: org.fontbox.afm.AFMParser
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
... 14 more
I am using JDK 1.6
Please help its urgent
Regards
Ayush
Hi,
Is there any way I can convert a pdf document directly from a web link instead of a file on the local system?
Appreciated,
where to give d pdf file name and output filename
The code as posted cannot read from a web link, but it can be enhanced to do that.
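One possible enhancement for reading from a web link, sketched with standard library classes only: copy the PDF behind the URL to a temporary file, then hand that file's path to pdftoText from the listing. UrlDownloader and downloadToTempFile are illustrative names, not part of PDFBox or of the original code.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

// Illustrative helper: saves the content behind a URL to a temporary file.
class UrlDownloader {

    static File downloadToTempFile(String url) throws IOException {
        File tmp = File.createTempFile("download", ".pdf");
        InputStream in = new URL(url).openStream();
        try {
            OutputStream out = new FileOutputStream(tmp);
            try {
                byte[] buffer = new byte[8192];
                int n;
                // Copy the stream in chunks until the end is reached.
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            } finally {
                out.close();
            }
        } finally {
            in.close();
        }
        return tmp;
    }
}
```

The returned file could then be passed on, for example as pdfTextParserObj.pdftoText(tmp.getPath()).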
Hi,
You can see the way in which the java class is run, use the command line arguments for giving input pdf and output text file.
Hi Prasanna,
I have run your code but end up to these errors. Could you help me please...Thanks in advance
org.pdfbox.exceptions.WrappedIOException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:62)
at PDFTextParser.main(PDFTextParser.java:107)
java.lang.NullPointerException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:117)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:62)
at PDFTextParser.main(PDFTextParser.java:107)
PDF to Text Conversion failed.
Hi,
Did you use the same JDK and the jars mentioned in the blog, let me know.
Great post. I have a better implementation that preserves formatting information. Needless to say, the code is proprietary to the company that I work for, but I can give you guys hints. There is a TextPosition object which you can access by subclassing PDFTextStripper.
I also recommend using the latest Apache incubator project.
You can then override a method called writePage in PDFTextStripper,
calling
List textBoxes = getCharactersByArticle().get(0);
Collections.sort(textBoxes, new TextPositionComparator());
See the javadocs for PDFTextStripper.
After getting this list in the writePage() method of PDFTextStripper (PDFBox 0.8), you have a sorted list of text objects with positions that you may print out as you like, using your custom logic for determining spacing.
If you want I can help you out... email me... though detailed help will require donation.
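The idea in this comment, sorting text fragments by position and then deciding the spacing yourself, can be sketched without PDFBox. In the sketch below, Fragment is a hypothetical stand-in for PDFBox's TextPosition (it carries only x, y and the text), and LayoutSketch, render and the charWidth parameter are assumptions chosen for illustration. It also assumes that fragments on the same line share exactly the same y value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a positioned piece of text (not the PDFBox class).
class Fragment {
    final float x;
    final float y;
    final String text;

    Fragment(float x, float y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

class LayoutSketch {

    // Renders fragments as plain text, keeping rough horizontal alignment.
    // charWidth is the assumed width of one output character in page units.
    static String render(List<Fragment> fragments, float charWidth) {
        List<Fragment> sorted = new ArrayList<Fragment>(fragments);
        // Sort top to bottom (assuming smaller y is higher), then left to right.
        Collections.sort(sorted, new Comparator<Fragment>() {
            public int compare(Fragment a, Fragment b) {
                int byY = Float.compare(a.y, b.y);
                return byY != 0 ? byY : Float.compare(a.x, b.x);
            }
        });

        StringBuilder out = new StringBuilder();
        StringBuilder line = new StringBuilder();
        Float currentY = null;
        for (Fragment f : sorted) {
            if (currentY != null && f.y != currentY.floatValue()) {
                out.append(line).append('\n'); // y changed: start a new line
                line.setLength(0);
            }
            currentY = Float.valueOf(f.y);
            int column = Math.round(f.x / charWidth);
            // If earlier text already reached the target column, separate with
            // a single space; otherwise pad with spaces up to that column.
            if (line.length() > 0 && line.length() >= column) {
                line.append(' ');
            }
            while (line.length() < column) {
                line.append(' ');
            }
            line.append(f.text);
        }
        if (currentY != null) {
            out.append(line).append('\n'); // flush the last line
        }
        return out.toString();
    }
}
```

In a real subclass of PDFTextStripper, the x, y and text values would come from the TextPosition objects the commenter mentions. How to read them depends on the PDFBox version, so check its javadocs.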
Hi,
Thanks for the offer, appreciate a lot, will definitely getback to you when I rework this code for enhancement.
hi,
i would work with pdfbox and i will loose my nerves soon, because it won´t work. maybe you can help me.
i copied the jars from external and lib in a separate folder (named pdfbox) in my eclipse-plugin-folder. then i added these jars to my classpath and i see them under "referenced libariers".
the sourcecode reports no error. but in case of running this code i get the message "noclassdeffounderror: org/pdfbox/util/splitter".
It works like a charm :)
Thank u very much.
I am able to run the code and the text is generated out of a PDF file.
But when the PDF is large, say 200 MB, a Java OutOfMemory exception is thrown.
Any idea how I should go about this?
My second question is: can I get the contents of a chosen page only as text, without loading the whole file?
Hi,
This code is just meant to show how the PDF parsing API works. For big files, you can try a different stream to avoid OutOfMemory errors.
Hi,
I am new to this pdfbox.
I have downloaded the PDFBox-0.7.3 and i set the CLASSPATH using Environment variable in control panel/system.
i dont know how to compile and run the code. can you suggest me.
i tried using javac PDFTextParser.java it giving error message.
Hello,
I have a pdf which is basically scanned from a piece of printed paper. Does PDFBox support reading text from such PDF?
when i executed the above code then i got the following error. could u pls help me out.
Parsing text from PDF file c:\\test.pdf....
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Logger
at org.pdfbox.pdfparser.BaseParser.(BaseParser.java:70)
at test.pdftoText(test.java:33)
at test.main(test.java:84)
Hi Prasanna,
First of all: Thanks for such a good post.
Problem: I used JDK1.6 and the jars mentioned in the blog.
The program Compiled fine. But when I tried to run it with the following command:
java PDFTextParser sample1.pdf aa.txt
or
java PDFTextParser sample1.pdf aa
I am getting the following exception.
Parsing text from PDF file sample1.pdf....
An exception occured in parsing the PDF Document.
org.pdfbox.exceptions.WrappedIOException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:49)
at PDFTextParser.main(PDFTextParser.java:91)
java.lang.NullPointerException
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:117)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:49)
at PDFTextParser.main(PDFTextParser.java:91)
PDF to Text Conversion failed.
can you please help me to resolve the problem.
Thanks in advance.
Hi prasanna !
when i am including the jar from the 'lib' in PDFBox-0.7.3 in the classpath, the error i am getting is:
Exception in thread "main" java.lang.NoClassDefFoundError: and
Caused by: java.lang.ClassNotFoundException: and
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
please solve it out..
looking forward for a reply soon !
-prtk
thanks very much for your post,it is very helpful.
am getting some exceptions concerning pdftextparser.i would like more help to know how to resolve the.the following are the exceptions:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
at org.apache.pdfbox.pdfparser.BaseParser.(BaseParser.java:58)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:846)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:814)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
at brainuppdfparser.BrainupPdfParser.main(BrainupPdfParser.java:35)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 5 more
hello,
i am getting around 100 errors with this code. when i unzipped PDFBox-0.7.3, fontbox.jar was not there. and these errors, i am including. please help...!!!
C:\Program Files\Java\jdk1.6.0_14\bin\PDFBox-0.7.3\org\pdfbox\pdmodel\common\PDRectangle.java:38: package org.fontbox.util does not exist
import org.fontbox.util.BoundingBox;
^
C:\Program Files\Java\jdk1.6.0_14\bin\PDFBox-0.7.3\org\pdfbox\pdmodel\font\PDFont.java:33: package org.fontbox.afm does not exist
import org.fontbox.afm.AFMParser;
^
C:\Program Files\Java\jdk1.6.0_14\bin\PDFBox-0.7.3\org\pdfbox\pdmodel\font\PDFont.java:35: package org.fontbox.afm does not exist
import org.fontbox.afm.FontMetric;
and so on
hi again. the previous program was solved, but here's a new one.
Please help...
An exception occured in parsing the PDF Document.
org.pdfbox.exceptions.WrappedIOException: org.pdfbox.util.operator.ShowTextGlyph
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:45)
at PDFTextParser.main(PDFTextParser.java:87)
java.lang.ClassNotFoundException: org.pdfbox.util.operator.ShowTextGlyph
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:169)
at org.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:122)
at org.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:119)
at PDFTextParser.pdftoText(PDFTextParser.java:45)
at PDFTextParser.main(PDFTextParser.java:87)
PDF to Text Conversion failed.
Thanks!
Its really a good post..
i just found this link...
it says... pdfbox isn't it??
so ... it's online pdf2txt w/ pdfbox?
http://www.fileformat.info/convert/doc/pdf2txt.htm
Hi, first of all thank you for your effort in making this. With it I can convert the PDF into a text file, but the format is completely changed. Can I produce the text file without losing the format? In the PDF I have a table-like format. It is not necessary to print a table in the text file, but I want a table-like layout for display.
Thankyou,
Pradeep
check it
Hi Guys
Im having these 2 Exceptions
org.pdfbox.wrappedioexception.
and
java.lang.NullPointerException
Hi very good post in fact!.
Could you give me a hand with this error in order to know what is happening ?
C:\Program Files\Java\jdk1.6.0_19\bin>java PDFTextParser corte300121112009Price.pdf salidacorte.txt
Parsing text from PDF file corte300121112009Price.pdf....
Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:350)
at org.pdfbox.pdmodel.font.PDFont.getAverageFontWidthFromAFMFile(PDFont.java:313)
at org.pdfbox.pdmodel.font.PDSimpleFont.getAverageFontWidth(PDSimpleFont.java:231)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:276)
at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at PDFTextParser.pdftoText(PDFTextParser.java:61)
at PDFTextParser.main(PDFTextParser.java:101)
Caused by: java.lang.ClassNotFoundException: org.fontbox.afm.AFMParser
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 14 more
Hello Prashant,
Thank you very much for sharing this article.
I still need some more help, I have been able to convert a PDF file into a text file successfully, however my pdf file is as follows:
Column1      Column2
Ram is       1000
manager
Sam is       2000
supervisor
So when its extracted into a text file it appears as:
Column1
Column2
Ram is
manager
1000
Sam is
supervisor
2000
The challenge i am facing is to determine if "Ram is manager " should go to Column1 and not
"Ram is " ---> Column1
"manager " ---> Column2
Hope I was able to explain my problem clearly.
If anyone has dealt with this issue, would really appreciate if they can share there solution. I have already wasted a weeks time exploring the pdfbox and itext APIs
Thanks to all in advance.
Chandra
Hi !!
I want to know if one can extract the bibliography data from pdf documents like research papers using pdfbox ?? I need it for my thesis !!
Thanks
I NEED TO GET FONT NAME AND SIZE FROM THE PDF TEXT FILES..ANYBODY HELP TO ME?