Monday, September 10, 2012



Today's entry - XML parsing: the traditional approaches


Introduction

There are a few traditional XML parsing methods. Our parser hopes to make XML parsing a lot easier by replacing these methods with graphical, cloud-based parsing. But it is still useful to know the traditional methods for XML parsing and how they are used.

In the next few blog entries I will talk about DOM, SAX and StAX parsing and then I will go on to tell you about our parser and our adventure so far. 

But first....

What is XML parsing?

XML parsing means "reading" the XML file in and extracting its content (or the relevant parts of its content). XML information is encoded in tag names, tag text values and attributes. XML parsing pulls out the required items (whether they are tag names, text values or attributes).
What exactly is pulled out of a document depends on what the program requires.
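
To make those three kinds of information concrete, here is a minimal sketch that reads a tag name, an attribute and a text value from a one-line document. It uses the DOM API that ships with the JDK (covered later in this entry). The class name ParsingBasics, the method describeRoot and the sample XML string are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class ParsingBasics {

    // Parses a small XML string and reports the tag name, the "id" attribute
    // and the text value of its root element.
    static String describeRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            return "tag=" + root.getTagName()
                    + ", id=" + root.getAttribute("id")
                    + ", text=" + root.getTextContent();
        } catch (Exception e) {
            // Hypothetical error handling: wrap parser failures in an unchecked exception.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // prints: tag=genre, id=3, text=thriller
        System.out.println(describeRoot("<genre id=\"3\">thriller</genre>"));
    }
}
```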




Different types of XML parser


DOM

DOM stands for Document Object Model. It is an API that allows for navigation of the entire document as if it were a tree of objects called nodes.
Basically, a DOM parser reads in an XML document and stores the entire document in memory as a tree.
DOM then provides an API for searching for various elements in that tree.
Everything in the tree is a node: each element is a node, and so is each piece of text inside an element.
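
To see the tree of nodes for yourself, the following sketch walks a parsed document recursively and prints one line per element or text node, indented by depth. The class name DomTreeWalk and its methods are made up for illustration, and the XML is read from a string rather than a file to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomTreeWalk {

    // Appends one line for this node (if it is an element or non-blank text),
    // then visits all of its children one level deeper.
    static void walk(Node node, int depth, StringBuilder out) {
        String label = null;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            label = "element: " + node.getNodeName();
        } else if (node.getNodeType() == Node.TEXT_NODE
                && !node.getNodeValue().trim().isEmpty()) {
            label = "text: " + node.getNodeValue().trim();
        }
        if (label != null) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(label).append('\n');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    // Parses the XML string and returns the indented outline of its tree.
    static String outline(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            walk(root, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(
                "<store><book><title>A Flock of Ships</title></book></store>"));
    }
}
```

Note that the text "A Flock of Ships" shows up as its own node underneath the title element, which matters later when we read values out of the tree.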


Advantages of using DOM over SAX or StAX
  1. DOM is easier to use than SAX or StAX. The other two APIs require a lot of extra coding in order to pull various elements out of an XML file and to hold these elements in a data structure. This is especially the case for complex rules which search for parent and child tags.
  2. DOM can be used to edit and add nodes to an XML file.
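
As a rough illustration of the second point, this sketch changes the text of an existing element, appends a new child element, and writes the tree back out as text using the JDK's Transformer API. The class name DomEditExample, the method editFirstBook and the sample values are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomEditExample {

    // Changes the price of the first <book>, appends a new <genre> child to it,
    // and returns the edited document serialized back to XML text.
    static String editFirstBook(String xml, String newPrice, String genre) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            Element book = (Element) doc.getElementsByTagName("book").item(0);

            // Edit an existing node: replace the text of the <price> element.
            book.getElementsByTagName("price").item(0).setTextContent(newPrice);

            // Add a new node: create <genre> and attach it to the book.
            Element genreElement = doc.createElement("genre");
            genreElement.setTextContent(genre);
            book.appendChild(genreElement);

            // Write the in-memory tree back out as text.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not edit XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<store><book id=\"1\"><title>A Flock of Ships</title>"
                + "<price>15.00</price></book></store>";
        System.out.println(editFirstBook(xml, "12.50", "thriller"));
    }
}
```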

Problems
  1. DOM implementations tend to be memory intensive, as they generally require the entire document to be loaded into memory and constructed as a tree of objects before access is allowed. This drastically limits the maximum XML file size.
  2. DOM is relatively slow. SAX, StAX and our new parser are much faster.
  3. Although DOM is easier to use than SAX or StAX, it still requires quite a lot of programming to extract each piece of information from a file, and this can be mundane, error-prone coding.

How to read in a DOM document
  1. Start with an XML document. Here is a sample document containing a list of books. For each book element there are child elements with title, price, author and genre. Each book has an attribute with name id and a value.



<?xml version="1.0"?>
<store>
        <book id="1">
                <title>Object-Oriented Software Engineering</title>
                <author>Stephen R. Schach</author>
                <price>50.00</price>
                <genre>software</genre>
        </book>
        <book id="500">
                <title>A Flock of Ships</title>
                <author>Brian Callison</author>
                <price>15.00</price>
                <genre>thriller</genre>
        </book>
</store>



  1. Next, we create a DocumentBuilderFactory object. This is used to create the DOM parser object.
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    
  2. We now use the factory we created to create the parser.
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    
  3. Now we must load the XML file as a File object in Java in order to use it with the parser.
    File fXmlFile = new File("c:\\books.xml");
    
  4. Next, we parse the actual file by calling dBuilder.parse(fXmlFile). The file is loaded into memory as a hierarchical series of nodes, which is stored in a Document object (called doc in the code below). All parsing is then done on this Document object and its elements.
  5. If we want to get a list of book elements (elements with tagname = book) we do so by using the method getElementsByTagName. This returns a type of list object called a NodeList.
    NodeList nList = doc.getElementsByTagName("book");
  6. For each element we must get the node by calling item(n) on the list. We can do this in a loop.
    for (int i = 0; i < nList.getLength(); i++) {
        Node nNode = nList.item(i);
    }


  1. Now, for each node we find, we have to check if it is an element. If so, we can cast it to an Element object and get the values of its child nodes.
    if (nNode.getNodeType() == Node.ELEMENT_NODE) {
        Element eElement = (Element) nNode;
        // work with the child elements of this book here
    }
  1. We now have a book element and we need to get its child elements. If we want the child node called author, we call getElementsByTagName on the book element to get its child elements called author. We want the first author element (there is only one per book element), so we use .item(0) to get it and then .getChildNodes() to get the child nodes of that element. In DOM the text value is stored in a text child node. So the author element has a child node which contains the text value of the tag, e.g. Stephen R. Schach.
    String tagname = "author";
    NodeList nlList = eElement.getElementsByTagName(tagname).item(0).getChildNodes();
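
Putting the steps above together, a complete program might look roughly like the sketch below. A few things here are my own assumptions rather than part of the steps: the XML is parsed from a string instead of c:\books.xml so the example runs anywhere, the class name BookAuthors and the method authors are made up, and the final call to nlList.item(0).getNodeValue() is one way to read the text out of the text child node.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class BookAuthors {

    // The sample document from this entry, shortened to the fields used here.
    static final String SAMPLE =
            "<?xml version=\"1.0\"?>"
            + "<store>"
            + "<book id=\"1\"><title>Object-Oriented Software Engineering</title>"
            + "<author>Stephen R. Schach</author></book>"
            + "<book id=\"500\"><title>A Flock of Ships</title>"
            + "<author>Brian Callison</author></book>"
            + "</store>";

    // Returns one "id: author" entry per book element in the document.
    static List<String> authors(String xml) {
        try {
            // Create the factory, then the parser.
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();

            // Parse the whole document into memory (from a string here, not a File).
            Document doc = dBuilder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // Find every book element and loop over the list.
            NodeList nList = doc.getElementsByTagName("book");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nList.getLength(); i++) {
                Node nNode = nList.item(i);
                if (nNode.getNodeType() == Node.ELEMENT_NODE) {
                    Element eElement = (Element) nNode;

                    // The author text lives in a text child node of <author>.
                    NodeList nlList = eElement.getElementsByTagName("author")
                            .item(0).getChildNodes();
                    String author = nlList.item(0).getNodeValue();

                    result.add(eElement.getAttribute("id") + ": " + author);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (String line : authors(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

Even for this small task the code is fairly long, which is the third problem listed above. This sketch also assumes every book has exactly one non-empty author element; a real program would need to check for missing ones.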


Next I am going to talk about SAX and StAX parsers. It is good to learn about the different options available for parsing, as each is better in different scenarios.
After that, I will be talking about our cloud-based graphical alternative to traditional XML parsing.

Check out our free developer version at http://www.sxml.com.au:8080/Expresso/login.jsp


or find out more at www.sxml.com.au


