HTML Text Parser: Converting HTML to Text in Java using NekoHTML

Tuesday, February 3, 2009

While working on my Desktop Search project on Solaris, one of my tasks was to develop parsers that extract text from different document types, such as PDF, HTML and OpenOffice, for indexing. I thought it would be straightforward, but it turned out to be a little tricky because several libraries are available for parsing each of these formats. HTML is a little more interesting because the JDK itself ships a Swing HTML parser, which I hacked on for a while. I then came to know about other open source HTML parsers for Java that make the task simpler, and decided to go with NekoHTML.
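
For comparison, here is roughly what text extraction with the Swing parser in the JDK can look like. This is only a minimal sketch and not the code from my project: the class name SwingTextExtractor and the method textOf are hypothetical, and separating text chunks with a single space is an assumption made for readability.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class, not part of the original project.
public class SwingTextExtractor {

    // Returns the text chunks reported by the Swing HTML parser, separated by spaces.
    static String textOf(String html) {
        final StringBuilder out = new StringBuilder();

        // The parser calls handleText for each run of character data it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append(' ');
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new IllegalStateException("Unable to parse HTML", e);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(textOf("<html><body><h1>Valley</h1><p>A low area.</p></body></html>"));
    }
}
```

The callback style works without any extra jars, but it gives you a stream of events rather than a tree, which is one reason a DOM based parser can be more convenient.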

NekoHTML is an open source HTML parser for Java. It parses an HTML document into an XML DOM style representation from which the text contents can be retrieved. To extract text from HTML you also need the Apache Xerces parser for Java alongside this library. For more information on using NekoHTML, do look at its FAQ section. The good thing about the NekoHTML project is that it is updated regularly.

The code below uses the latest NekoHTML and Apache Xerces libraries. Instructions for compiling the source against the respective jars follow the listing.

Listing 1: HTMLTextParser.java

1: /*
2:  * HTMLTextParser.java
3:  * Author: S.Prasanna
4:  *
5:  */
6:
7: import java.io.File;
8: import java.io.FileInputStream;
9: import java.io.PrintWriter;
10:
11: import org.w3c.dom.DocumentFragment;
12: import org.w3c.dom.Node;
13: import org.w3c.dom.NodeList;
14:
15: import org.xml.sax.InputSource;
16:
17: import org.cyberneko.html.parsers.DOMFragmentParser;
18:
19: import org.apache.xerces.dom.CoreDocumentImpl;
20:
21: public class HTMLTextParser {
22:
23:     FileInputStream fin = null;
24:     StringBuffer TextBuffer = null;
25:     InputSource inSource = null;
26:
27:     // HTMLTextParser Constructor
28:     public HTMLTextParser() {
29:     }
30:
31:     // Gets the text content from Nodes recursively
32:     void processNode(Node node) {
33:         if (node == null) return;
34:
35:         // Process a text node
36:         if (node.getNodeType() == Node.TEXT_NODE) {
37:             TextBuffer.append(node.getNodeValue());
38:         } else if (node.hasChildNodes()) {
39:             // Process the Node's children
40:
41:             NodeList childList = node.getChildNodes();
42:             int childLen = childList.getLength();
43:
44:             for (int count = 0; count < childLen; count++)
45:                 processNode(childList.item(count));
46:         }
47:         else return;
48:     }
49:
50:     // Extracts text from HTML Document
51:     String htmltoText(String fileName) {
52:
53:         DOMFragmentParser parser = new DOMFragmentParser();
54:
55:         System.out.println("Parsing text from HTML file " + fileName + "....");
56:         File f = new File(fileName);
57:
58:         if (!f.isFile()) {
59:             System.out.println("File " + fileName + " does not exist.");
60:             return null;
61:         }
62:
63:         try {
64:             fin = new FileInputStream(f);
65:         } catch (Exception e) {
66:             System.out.println("Unable to open HTML file " + fileName + " for reading.");
67:             return null;
68:         }
69:
70:         try {
71:             inSource = new InputSource(fin);
72:         } catch (Exception e) {
73:             System.out.println("Unable to open Input source from HTML file " + fileName);
74:             return null;
75:         }
76:
77:         CoreDocumentImpl codeDoc = new CoreDocumentImpl();
78:         DocumentFragment doc = codeDoc.createDocumentFragment();
79:
80:         try {
81:             parser.parse(inSource, doc);
82:         } catch (Exception e) {
83:             System.out.println("Unable to parse HTML file " + fileName);
84:             return null;
85:         }
86:
87:         TextBuffer = new StringBuffer();
88:
89:         // Node is a super interface of DocumentFragment, so no typecast needed
90:         processNode(doc);
91:
92:         System.out.println("Done.");
93:
94:         return TextBuffer.toString();
95:     }
96:
97:     // Writes the parsed text from HTML to a file
98:     void writeTexttoFile(String htmlText, String fileName) {
99:
100:         System.out.println("\nWriting HTML text to output text file " + fileName + "....");
101:         try {
102:             PrintWriter pw = new PrintWriter(fileName);
103:             pw.print(htmlText);
104:             pw.close();
105:         } catch (Exception e) {
106:             System.out.println("An exception occurred in writing the html text to file.");
107:             e.printStackTrace();
108:         }
109:         System.out.println("Done.");
110:     }
111:
112:     // Extracts text from an HTML Document and writes it to a text file
113:     public static void main(String args[]) {
114:
115:         if (args.length != 2) {
116:             System.out.println("Usage: java HTMLTextParser <input html file> <output text file>");
117:             System.exit(1);
118:         }
119:
120:         HTMLTextParser htmlTextParserObj = new HTMLTextParser();
121:         String htmlToText = htmlTextParserObj.htmltoText(args[0]);
122:
123:         if (htmlToText == null) {
124:             System.out.println("HTML to Text Conversion failed.");
125:         }
126:         else {
127:             System.out.println("\nThe text parsed from the HTML Document....\n" + htmlToText);
128:             htmlTextParserObj.writeTexttoFile(htmlToText, args[1]);
129:         }
130:     }
131: }
Explanation:

The above code uses the DOMFragmentParser class from the NekoHTML library (line 53). DOMFragmentParser has a parse method that takes an InputSource (line 71) and a DocumentFragment (line 78) instance. The CoreDocumentImpl class (line 77) in the Xerces parser library has a factory method, createDocumentFragment, which creates the DocumentFragment instance. The HTML content is then parsed into a DOM based representation of the HTML document, from which the text contents are extracted recursively (line 32), similar to the way I developed a text parser for OpenOffice documents using the JDOM library.
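
To see the recursive extraction step in isolation, here is a minimal sketch that runs without the NekoHTML and Xerces jars. It makes two assumptions. First, it uses the XML parser bundled with the JDK as a stand-in for NekoHTML, so the input must be well-formed markup. Second, it skips script and style elements, which Listing 1 does not do. The names DomTextWalk, isSkipped, collectText and textOf are hypothetical.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class illustrating the traversal used by processNode.
public class DomTextWalk {

    // True for elements whose contents should not end up in the extracted text.
    // The comparison ignores case so it does not depend on how a parser reports names.
    static boolean isSkipped(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) return false;
        String name = node.getNodeName();
        return name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style");
    }

    // Depth-first walk: append text nodes, recurse into everything else.
    static void collectText(Node node, StringBuilder out) {
        if (node == null || isSkipped(node)) return;
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Parses well-formed markup with the JDK parser and returns its text content.
    static String textOf(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collectText(doc, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><script>var x = 1;</script></head>"
                + "<body><h1>Valley</h1><p>A low area between hills.</p></body></html>";
        System.out.println(textOf(page));
    }
}
```

Running main prints ValleyA low area between hills. with no separator between the heading and the paragraph, because the text nodes are simply concatenated, just as in Listing 1. Appending a space or newline after each text node is an easy variation if the words need to stay apart for indexing.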

Confused? So was I when I was searching for the right classes and parsers. Go through the javadoc for NekoHTML and the Xerces parser to understand more about the code, or else follow the procedure below to run it.

Compiling and Running the code:

1. Download NekoHTML. The latest version is 1.9.11 at the time of this writing.
2. Untar/unzip the nekohtml-1.9.11 archive and add nekohtml.jar from the extracted nekohtml-1.9.11 folder to the CLASSPATH.
3. Download the Apache Xerces parser for Java. The latest version is 2.9.1 at the time of this writing.
4. Untar/unzip the Xerces-J-bin.2.9.1 archive (if you downloaded the binaries) and add xercesImpl.jar from the extracted xerces-2_9_1 folder to the CLASSPATH. The code should compile with these two jars.

Note: I used JDK 1.6 to compile the above code.

9 comments:

chanu said...

hello sir,
i am an engineering student
doing my project which need a html parser to parse file.
i have read ur code for htmlparsing using nekohtml its good using a recursive function
my problem is actually retrieving the actual content from a wiki site
it works for that also but i am getting the text not related to the context
i mean while pasing a text of a file suppose http://en.wikipedia.org/wiki/Valley
i am getting text in main body content about valley and also some coding done in source code
can u fix it and modify code to get the exact contents about valley (file)
please post the modified code
plese mail if possible chanikya.cse.vits@gmail.com

Sumit said...

Great code Prasanna! It works like a gem. Thanks for sharing!

Ankit said...

Hi Prasanna,
Great code. Very clean. I don't believe I've used so many try-catches in one method.
I've been experimenting with Neko for a project, and it seems unreliable at times. The parsing is spot on, but retrieving , not as much.
I've been using the DOMParser and when I give the file path as argument, sometimes it parses it and other times simply says unknown protocol. This is what I wrote:
String filePath="c:\\x.html"
DOMParser p1= new DOMParser();
p1.parse(filePath);
Doc d1= p1.getDocument();

This code ran well enough in my test bed, but in the final implementation it cant pick up on the file path. Is there a known error or bug or am I doing something wrong?

I know you have used InputSource, but the DOMParser does mention that you can either use InputSource or pass a file path and it will resolve it as an input source.

Lastly, a bit unrelated, when opening a file in Java why cant it process file paths with spaces like "c:\\my file\\x.html". This may be simply restricted to DOM in general.

Please get back to me if possible at schmuck_dud@yahoo.com

1..2......3...........NILL said...

Hello,

The code works great.. Thanks for sharing it..
Is there any way to get rid of the contents of < script > tags. Although the tag are gotten rid off, the actual script is still visible.

Thanks.

Anonymous said...

hi..

can you plz tell me how to convert html to word document using java...
my mail id is aucse_n@yahoo.com

M412 said...
This comment has been removed by the author.
abe.izar said...

may i please use your code in an open source project?
(http://code.google.com/p/orayta/)
thanks.

Boris said...

Hi Prasanna,
Great Code, very clean an thanks for sharing it.
I have read your code and it gives me possibility to know how to use Nekohtml. But I don't understand the working of Node's function.
Until now, I don't see in your code where you specify which text fragment you want to extract.
please, can you tell me how to do? Thanks

Atef Charef said...

How you do that!
System.out.println("\nThe text parsed from the HTML Document....\n" + htmlToText);
htmlTextParserObj.writeTexttoFile(htmlToText, args[1]);
you write the same String in the new file!!
htmlToText...
Mistake right? I test the code and I didn't find a good result


Copyright © 2008 Prasanna Seshadri, www.prasannatech.net, All Rights Reserved.
No part of the content or this site may be reproduced without prior written permission of the author.