I recently had to figure out a way to extract all of the text from a .DOCX file (Word 2007 document) in Java for a project I’m working on. Surprisingly, it’s even easier than extracting text from a regular .DOC file.
A .DOCX file is really a ZIP package of XML documents. To see for yourself, take any .DOCX file and change its extension to .ZIP (if Windows warns you about changing the extension, click “Yes”; also, the .DOCX file has to contain something, since a blank document will yield an error). Next, unzip the package as you would any other ZIP file.
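If you would rather not rename anything, the same check can be done from Java. The following is only a minimal sketch: the class name DocxEntryLister, the method listEntries, and the path "file.docx" are made up for illustration. It opens the .DOCX with ZipFile and collects the name of every entry in the package.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class, not part of the original post.
public class DocxEntryLister {

    // Returns the name of every entry in the ZIP package at the given path.
    public static List<String> listEntries(File docx) throws IOException {
        List<String> names = new ArrayList<>();
        // try-with-resources closes the ZIP file when the block exits
        try (ZipFile zip = new ZipFile(docx)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // "file.docx" is a placeholder; pass a real path as the first argument.
        File docx = new File(args.length > 0 ? args[0] : "file.docx");
        for (String name : listEntries(docx)) {
            System.out.println(name);
        }
    }
}
```

Run against a real document, the list should include word/document.xml, which is the entry the rest of this post works with.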
After extraction, you will see folders and files within the package. The file we are interested in is word/document.xml. If you like, open it and inspect it. It may look intimidating if you’re not familiar with XML, but that’s OK! We just have to figure out a way to read the ZIP file in Java and parse document.xml to get the text we want.
Lucky for us, Java has a class for handling ZIPs: ZipFile. Not only that, but we can parse the XML file once we retrieve it. If you looked into the XML file, you saw that the text we need sits inside the <w:t> tags. After some cleanup of each string, we have clean text to work with, stored in a vector. Strictly speaking, each entry is the text of one <w:t> element, which is a run of text rather than a whole line or paragraph, so a paragraph with mixed formatting may be spread across several entries.
Here’s the code:
// Imports this snippet needs
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Vector;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
/**
* Since DOCX files are actually ZIP packages containing
* XML, we will unzip the file, look for the document.xml
* file and parse it for the document text.
*/
Vector<String> paragraphs = new Vector<String>(); // Vector for storing paragraphs
final String file = "file.docx"; // Change this to your file name
try ( ZipFile docxFile = new ZipFile( new File( file ) ) ) { // Java 7+: the ZIP file is closed automatically
ZipEntry documentXML = docxFile.getEntry("word/document.xml");
InputStream documentXMLIS = docxFile.getInputStream( documentXML );
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
Document doc = dbf.newDocumentBuilder().parse( documentXMLIS );
Element tElement = doc.getDocumentElement();
NodeList n = (NodeList) tElement.getElementsByTagName("w:t");
for( int j = 0; j < n.getLength(); j++ ) {
Node child = n.item( j );
String currentLine = child.getTextContent();
currentLine = currentLine.replaceAll("[\\r\\n\\t]", ""); // Strip carriage returns, newlines, and tabs
currentLine = currentLine.trim();
// Only get lines that contain something
if ( !currentLine.isEmpty() ) { // trim() above already removed surrounding whitespace
paragraphs.add( currentLine ); // Add this line to the vector
} // end if
} // end for loop
} catch (ZipException e) {
e.printStackTrace();
} catch (DOMException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (ParserConfigurationException e) {
e.printStackTrace();
} // end try block
At the end of this snippet, all of the lines from the DOCX file are stored within the vector and may be accessed or used any way you wish.
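One thing to keep in mind: a <w:t> element holds a single run of text, and Word can split one paragraph into several runs (for example, when part of it is bold). If you need one string per paragraph, a possible variation is to walk the <w:p> elements and join the <w:t> text inside each one. The sketch below is only an illustration of that idea. The class name DocxParagraphs and the method paragraphs are made up, and it expects an InputStream for word/document.xml like the one obtained in the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of the original post.
public class DocxParagraphs {

    // Reads document.xml from the stream and returns one string per non-empty <w:p>.
    public static List<String> paragraphs(InputStream documentXml) throws Exception {
        // The default factory is not namespace-aware, so "w:p" and "w:t"
        // are matched as literal tag names, as in the snippet above.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(documentXml);
        NodeList pNodes = doc.getElementsByTagName("w:p");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < pNodes.getLength(); i++) {
            Element p = (Element) pNodes.item(i);
            NodeList runs = p.getElementsByTagName("w:t");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent()); // join the runs of this paragraph
            }
            if (text.length() > 0) {
                result.add(text.toString()); // skip paragraphs with no text
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Tiny made-up document: one paragraph split into two runs, then a second paragraph.
        String xml = "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"><w:body>"
                + "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Second</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String paragraph : paragraphs(in)) {
            System.out.println(paragraph);
        }
    }
}
```

This keeps the same DOM approach as the original snippet. One limitation of the sketch: a paragraph nested inside another paragraph (a text box, for instance) would have its text appear twice, once on its own and once as part of the outer paragraph.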
Questions or comments? Leave them in the forum or comments section.
Tags: docx, Microsoft Word 2007, xml

April 17th, 2009 at 3:45 am
Thanks a lot! The HOWTO is really useful as neither Nutch nor Apache POI support Office Open XML at the moment.