Monday, September 24, 2012


Expresso client code: the inner workings - Part 1: XML connection client code in Java


About client code

Client code is useful. It enables Expresso parser to be accessed programmatically. Basically, you set up rules for parsing or web service modules using the GUI and then remotely access those rules from your own project using client code. This makes client code powerful!

Various types of client code

Client code comes in different languages. We are currently aiming to increase its usefulness by adding new languages so that you can use Expresso parser whether you are a Java, C++, Ruby or JavaScript developer. This list will keep growing and we are eager for suggestions as to new languages which we can support!

There are two types of client code: XML connection code and Web service module code. The former allows you to parse an XML file using your prepared rules. The latter allows you to consume a prepared vendor web service. 

How client code works

Client code connects to Expresso using HTTPS. It passes parameters in an HTTPS request, which Expresso processes, and it gets a result back.

The values returned from XML connection client code

A three dimensional array is returned from an XML connection. Most times you will only need one or two dimensions of this array.

The outer array is a list of rules which you parsed. If you only parsed using one rule, there will only be one item in this array.
So, we start with an array which has one element for each rule parsed. If we wish to handle the results of any one rule we simply choose that element from the array.
For example, if you are parsing three rules and you wish to process the results of the first rule, simply use the first element of this array.

Each rule element contains a 2 dimensional array.
This middle array contains each return type.

So, what are return types? 

Simple rule - 1 return type
Well, if you have a rule which says "get element address and return it" you are returning each address element value. Your rule has one return type - the address.
In this case, you will have a one element array. The element will simply be a list of addresses.  

Complex rule - multiple return types
If you then say "get element address and return it AND also get element postcode" you are getting two return types - address and postcode. So, the results will contain a list of addresses and their corresponding postcodes.

In this case, you will have a two element array. The first element will be a list of addresses. The second element will be a list of postcodes.

As you can imagine, the inner array is this list of returns e.g. the list of addresses.
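As a rough illustration of the complex rule, here is a small sketch which pairs each address with its postcode by position. It assumes that the results for one rule arrive as a Java String[][] with one inner array per return type. The class name, method name and sample values are made up for this example and are not part of the generated client code.

```java
import java.util.ArrayList;
import java.util.List;

class ReturnTypePairing {

    // Pairs the i-th value of the first return type (address)
    // with the i-th value of the second return type (postcode).
    static List<String> pair(String[][] ruleResult) {
        String[] addresses = ruleResult[0];
        String[] postcodes = ruleResult[1];
        List<String> pairs = new ArrayList<>();
        int count = Math.min(addresses.length, postcodes.length);
        for (int i = 0; i < count; i++) {
            pairs.add(addresses[i] + " (" + postcodes[i] + ")");
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Sample data only: element 0 holds addresses, element 1 holds postcodes.
        String[][] ruleResult = {
            {"1 Main Street", "22 High Road"},
            {"3000", "3121"}
        };
        for (String line : pair(ruleResult)) {
            System.out.println(line);
        }
    }
}
```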

Here is an example of a simple rule. We start with the following XML file:

<books>
<book>
A brief history of everything
</book>

<book>
History of Europe 1900 - present
</book>
</books>

We create a rule for this XML file. Our rule is as follows:
Return the text value of any element called book.

We then call client code to run our one rule search. The results are a three dimensional array as follows:

1. Our outer array is the rule. It will have one element as there is only one rule. We take this element and look inside. It is a 2 dimensional array - the middle array.
2. The middle array will have 1 element in it as we are only returning one return type i.e. book value. We take this element and look inside. It is an array.
3. This inner array contains the value of each book element. i.e. A brief history of everything, History of Europe 1900 - present.

We can loop through this array and print out the values.
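The walk through the three arrays could look something like the sketch below. It assumes the result is held as a Java String[][][] indexed as results[rule][return type][value]. The class and method names are our own for this example, and the sample data simply mirrors the two book values above.

```java
import java.util.ArrayList;
import java.util.List;

class ThreeDimensionalResults {

    // Walks results[rule][returnType][value] and builds one printable line per value.
    static List<String> describe(String[][][] results) {
        List<String> lines = new ArrayList<>();
        for (int rule = 0; rule < results.length; rule++) {
            String[][] returnTypes = results[rule];          // the middle array
            for (int type = 0; type < returnTypes.length; type++) {
                for (String value : returnTypes[type]) {     // the inner array
                    lines.add("rule " + rule + ", return type " + type + ": " + value);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // One rule, one return type, two book values.
        String[][][] results = {
            {
                {"A brief history of everything", "History of Europe 1900 - present"}
            }
        };
        for (String line : describe(results)) {
            System.out.println(line);
        }
    }
}
```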

The values returned from web service connection client code

Web service connection code is simpler. We return an array containing two elements. The first element is the XML response from the web service. The second element is the parsed XML response as an array of results. 

XML Connection client code in java in more detail


The steps involved in the client remote connection
There are three major parts to this client code. These are:
  1. Setting parameter values
  2. Sending an HTTPS request
  3. Reading the response 

Setting parameter values

We set various values for the parameters. Some of these are required such as the username, password, connection name and company of the sender. 
There are then some optional parameters which allow you to specify the location of the XML file,  the XML file itself (if on your system) and whether or not you wish to use a cached version of the file.

There are also advanced parameters which enable you to do things such as specify particular rules, supply parameters and sort results.

Part 1: parameters

Required parameters

  1. Username - the name you use to login to the website.
  2. Password  - the password you use to login to the website.
  3. Company  - the company name you use to login to the website.
  4. ConnectionName - the name of the XML connection you wish to parse. This is the name you supplied when creating the connection on the website.

Optional simple parameters

  1. xml source - the source of the XML you will parse. You have three choices here: client, web or server.  
    1. You can use an XML file which you supply with the request i.e. it is uploaded. This is client mode. It allows you to supply a new XML file with each request.
    2. You can use a web-based XML file. This is called web mode. When you set up a connection on the website you have the option to supply a URL rather than uploading an XML file. Now you use this URL again to access the XML file. Since the URL has been saved with your account you do not need to supply it.
    3. In most cases the XML file is uploaded to the website when creating a connection and this XML file stored on the server is used for parsing. This is server mode.
    4. If no mode is supplied server mode is used by default.
  2. XML File  - If using client mode, the XML file is supplied with the request. This field is its location on your local system and it is specified here so that the file can be loaded as a string and sent with the request. This is only required with client mode.
  3. caching - This specifies whether or not the file will be parsed using a cached version. It defaults to false.

Optional advanced parameters

  1. mode - This allows you to parse by a selection of rules rather than all the rules associated with that connection. You can choose to parse a connection with all its associated rules by using mode = all. This is the default. You can specify one or more rules to parse with by listing these rules as the mode. Each rule should be separated by &.
  2. sortBy - This allows you to sort the results in ascending order. For simple rules, sortBy should specify the rule name and 0 as there is only one possible return type to sort by. Otherwise choose which of the return types to sort by e.g. if returning the title and price of a list of books, choose 0 to sort by title and 1  to sort by price. 
  3. dynamic Parameters - These can be used  to modify rules on the fly depending on user input. You can add a new value to a rule and this value will be used with the rule. e.g. You can have a rule which searches for tag = book and price is  > 5.00. You can then add a parameter of 10.00 to the rule and the rule will become tag = book and price is  > 10.00. 
  4. URL parameters - if you are using a web based XML source and the URL changes with each request you can supply URL parameters to dynamically create the URL where the XML file is found. 
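To make the parameter list more concrete, here is a minimal sketch of collecting parameters and turning them into a form-encoded request body. The parameter names used here (username, password, company, connectionName, xmlSource, caching, mode) are assumptions based on the descriptions above. The exact names and the encoding used by the generated client code may differ, so treat this as an illustration only.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class RequestParameters {

    // Builds a form-encoded body such as "a=1&b=2" from the given parameters.
    static String encode(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            body.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // All names and values below are hypothetical placeholders.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("username", "alice");
        params.put("password", "secret");
        params.put("company", "Acme");
        params.put("connectionName", "books");
        params.put("xmlSource", "server");   // client, web or server
        params.put("caching", "false");
        params.put("mode", "rule1&rule2");   // two rules separated by &
        System.out.println(encode(params));
    }
}
```

Note that the & inside the mode value is escaped as %26 by the encoder, so it cannot be confused with the & which separates one parameter from the next.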

Part 2: sending the request
The request is sent via HTTPS to the Expresso parser and the results are returned.
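A rough sketch of this step using the standard HttpURLConnection class is shown below. The endpoint URL is a placeholder and not the real Expresso address, and the use of a form-encoded POST is an assumption. The generated client code is the reference for the real URL and request format.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class HttpsRequestSketch {

    // Placeholder endpoint, assumed for illustration only.
    static final String ENDPOINT = "https://example.com/Expresso/remote";

    // Configures a POST connection. Nothing is sent over the network yet.
    static HttpURLConnection prepare(String endpoint) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded; charset=UTF-8");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the request body and reads the whole response as text.
    static String send(HttpURLConnection conn, String body) throws IOException {
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = prepare(ENDPOINT);
        System.out.println(conn.getRequestMethod() + " " + conn.getURL());
        // Calling send(conn, body) here would perform the actual network request.
    }
}
```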

Part 3: Dealing with the returns

The returns are checked for errors and then the 3 dimensional array is looped through and the values are stored and printed out. 


Part 4: Possible error messages 

ERROR CODE 1: incorrect user details

This means that your username, password or company is not correct. 

ERROR CODE 2: userFileStore is missing

This error means that the username or company you supplied does not exist on the server. Check these parameters and contact SXML Help if this happens.

ERROR CODE 3: file is missing from request. Please ensure this field has been added

This means that the fileForXMLUpload parameter is blank and that you have chosen client as your XML file source. Ensure that the correct local location for the XML file to be uploaded is supplied. 

ERROR CODE 4: remote file name on server does not exist at this location

The file you are trying to parse does not exist on the server. This can be caused by choosing not to save the file when creating an XML connection or by deleting an XML connection.  Check that you have correctly spelled the XML connection name supplied and that this connection exists and that the 'save file' option is set to true. 

ERROR CODE 5 - parsing error

This means that there was an error parsing this XML file. The error details are supplied. 

ERROR CODE 6: file not saved on remote Server. Please login to web page to upload file

The file you are trying to parse does not exist on the server. This can be caused by choosing not to save the file when creating an XML connection or by deleting an XML connection.  Check that you have correctly spelled the XML connection name supplied and that this connection exists and that the 'save file' option is set to true. 


ERROR CODE 7: cache could not be located

This means that the cache related to the XML file does not exist. Ensure that you choose 'caching' as true when creating the connection. 
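One possible way to check a response for these errors is sketched below. It assumes that an error arrives as text containing "ERROR CODE" followed by a number, as in the messages listed above. How the real client code receives and detects errors is not shown here, so the class and method names are hypothetical.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ErrorCodeCheck {

    // Matches "ERROR CODE 3: ..." as well as "ERROR CODE 5 - ...".
    private static final Pattern ERROR = Pattern.compile("ERROR CODE\\s+(\\d+)");

    // Returns the error code found in the response text, or empty if there is none.
    static OptionalInt errorCode(String response) {
        Matcher matcher = ERROR.matcher(response);
        if (matcher.find()) {
            return OptionalInt.of(Integer.parseInt(matcher.group(1)));
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        String response = "ERROR CODE 1: incorrect user details";
        OptionalInt code = errorCode(response);
        if (code.isPresent()) {
            System.out.println("Request failed with error code " + code.getAsInt());
        } else {
            System.out.println("No error found");
        }
    }
}
```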



Monday, September 17, 2012


Expresso Parser time trial benchmarks - The results are in!

 Time trials 

The Expresso parser was tested in its non-caching mode.
The parser was tested against the leading Java XML parsers in the field. These were Xerces DOM, Woodstox StAX, Piccolo SAX and VTD-XML.

When Expresso was tested against VTD-XML, both parsers were tested in non-caching mode.

Each parsing was the first and only parsing of the file by the parser. There were no loops involved or any other complexities.

Files used


The Expresso parser was tested on a simple search for TAG = PERSONA on a Shakespeare play john.xml.

Files of various sizes were used, ranging from 850 KB to 2.5 MB.

Larger files were simply the file john.xml repeated multiple times.


Results

 



Check out our free developer version at http://www.sxml.com.au:8080/Expresso/login.jsp



Expresso Parser Large File parsing - The results are in!

The Expresso parser works well with massive XML files including files up to 35GB in size.

As the parser is not limited by file size, it is potentially possible to parse files of any size.

The Expresso parser is limited only by the number of elements returned.

According to the latest tests, Expresso can now return 230,000 elements under normal JVM memory conditions.

That's right, 230,000 elements!





Accessing Expresso remotely: 


The power of client code 

Expresso client code allows you to remotely interact with the Expresso Parser through your own application in either Java or JavaScript. The number of supported languages will be extended in future to include C++, Ruby and node.js among other languages.

Accessing expresso remotely 

Expresso client code is used to access the service. It is available for both XML parsing connections and web service modules, presently in Java and in JavaScript with JSON.

The client code generation page

The Expresso client code generation page is accessible by clicking on the 'client code' header.



When on the generation page select a language and select either XML connections or web services. Click generate. Client code is generated on the screen.




The client code is already populated with your username, company and the last connection or web service which you accessed as well as any web service parameters you supplied.

All you have to do is paste this code into your own project, add your access password (your normal Expresso account password) and start using the client code.

Java access through HTTPS

The Java client code works as follows.
  1. The parameters needed for the request are created and given values.
  2. An HTTPS request is created using the URL of the Expresso Remote Parser.
  3. The HTTPS request is sent and the results are arranged in a three dimensional array.
  4. The outer array is for each rule which has been parsed.
  5. The middle array contains each return set.
  6. Each inner array contains each of the return types within the return set.

JavaScript access through JSON

The JavaScript client code works as follows.
  1. Within script tags a JSON request over HTTPS is created.
  2. The various required parameters are added to the JSON request.
  3. The JSON request is sent and the result is available for processing.

Updating client code graphically

You can modify your client code parameters using our graphical tool. When you generate client code simply click 'modify client code graphically' and you are brought to a page where you can update any parameter and the resulting client code is produced. 




Forms Wizard and dynamic parameter forms

Expresso allows you to create an HTML form and backing Java servlet code. You can then publish this HTML code to your website where users can supply parameter values for XML parsing rules and have their stored XML file parsed with those parameters.
For example, you might wish to create an XML parsing rule which is TAG = SHOP and tag value = "X" where X is a value entered by a user. With Forms you can easily do this. You can create a form linked to a rule where the user enters a value for part of the rule.

To create a dynamic parameter form

  • Click on the Forms header to get to the forms section. Choose the particular XML connection for which you wish to create a form.

  •  The rules associated with this XML connection are listed. You now go through each rule deciding whether to create form elements for that rule or not.
  • For each rule, you can choose a form element for any rule part.
  • Choose a field name for the form element.

 
  1. When you have chosen all the form elements the client code is created.
  2. You can paste this client code into your own project and start using it straight away.




 





 

 

Monday, September 10, 2012

Getting started with Expresso parser 






Today I will show you how to parse an XML file with Expresso.  We will be dealing with the visual aspect of parsing a file.


Register for a free developer account

  1. To register simply enter a username, password and a company. If you are not currently working for a company, enter your username again in the company field.
  2. Choose free developer version and click 'register'.


Login to the site


  1. Enter your username, password and company and click 'login'. You will be brought to the dashboard.







The dashboard
The main area is called the dashboard. This contains a list of all XML connections and web services which you have created. Each can be edited.







Create a new XML connection

  1. In dashboard, click 'add new connection' and you are brought to a new page.
  2. Enter a name for the connection, upload an XML file or specify its URI, choose your settings and click save.
  3. The new connection will now be shown on the dashboard.







Create a new parsing rule 

  1. Click on the rules section of the connection listing to see a particular connection's rules.
  2. Click 'add new' to add a new parsing rule. A popup box appears. Fill in details.
  3. To search for child elements click 'add child'. A hierarchy of search rules can be created.
  4. You can return various aspects of an element including tagname, tag value, attribute name and attribute value.
  5. You can use regular expressions and mathematical operators in the search.
  6. You can search for child elements or descendants.
  7. Click save to save the rule.
  8. Once you save a rule it appears on the right hand side of the screen. When you select a new connection only the selected connection's rules are shown.











See results
1.  Click on a rule's run method to run the rule against the XML file. The results are shown on screen.










XML Parsing on the cloud

Why cloud parsing?

We chose to provide a cloud-based parser as we believe that the painful integration layer that companies go through can be minimized or avoided.

My husband asked me once why companies buy third party software and still take months to start using it? 

My answer: The integration layer. 






What does cloud parsing mean for us?
Instead of downloading an XML parser and adding it as a third party library to your existing project, you can simply register and login to our website and start parsing online.

No more conflicts
This means that you can use our parser from any environment as long as you have a web browser!
There are no more conflict issues with JVMs, other 3rd party libraries or other parsers. 

We handle the complexity
If you are running your application on something like node.js then outsource all the mathematical complexity of XML parsing to us!! 
We handle the parsing and send you back your results over HTTPS.



How to get started with Expresso parser 

  1. Register for a free developer version
  2. Login and click the browse button to upload an XML file or enter the URL of an HTTP-based file.
  3. Click a button to open the graphical parser which lets you visually generate parsing rules.
  4. Click "run" to test your newly created parsing rule against the XML file you are using and see the results on screen.

How to connect with client code 
  1. After creating your rules you can connect remotely to the parser using Java or JavaScript with JSON.
  2. Our client code generator creates client code specific to your user details and to the last XML connection you created.
  3. Simply paste this code into your project and immediately start connecting to the parser remotely.
  4. Results from each parsing can be returned as Java arrays or as JSON objects.

How to share files 
  1. If you have a medium or large business with multiple users you can set up groups and roles graphically within Expresso.
  2. You can then use these groups and roles to share XML file connections and associated parsing rules securely among your team.
  3. You can use roles to limit access e,g, to read only.

How to access global Web services 
  1. We are creating a library of globally used web services which are set up using our parser.
  2. You can browse through these, choose web services you like and add them to your parsing suite.
  3. You can then graphically modify the web service method used and any parameters and consume the web service from the parsing suite and remotely using client code. 





The benefits of Expresso Parser



  1. Faster parsing
  2. Parse larger files with no memory restrictions
  3. Parse files simply.
  4. Graphically set search rules for parsing without the need for any parsing code.
  5. Parsing is error-proof with a GUI for rules setting
  6. XML search results can be tested immediately
  7. Instantly see the results of parsing without any parsing code.
  8. Modify parsing rules dynamically without changing the parsing code.
  9. Carry out complex search on large XML files quickly and efficiently.
  10. Can handle inheritance searches
  11. Permanently store, manage and re-run graphically created XML parsing rules.
  12. No integration layer
  13. Business rules are separate from underlying code.
  14. No costly, time consuming coding changes needed when business rules change.



  1. Faster parsing
SXML parses faster than DOM and SAX and is even faster when in caching mode. The SXML caching is based on virtual tag ids and so allows for changes in the values of elements in an XML file as well as alterations in tag values.

  2. Parse larger files with no memory restrictions
SXML allows a user to parse large files in caching mode without memory restrictions. XML files of 15GB can be parsed without any speed decrease.

  3. Parse files simply
Files are parsed based on pre-set rules. No parsing code is needed. Client code is needed only to connect to the server and to obtain results for a specified file.

  4. Graphically set search rules for parsing without the need for any parsing code.
Say goodbye to complicated, time consuming and error prone XML parsing code. XML parsing can now be set graphically and tested dynamically on an uploaded XML file. The small amount of client code needed to access the SXML server remotely for later parsing is generated automatically.

  5. Parsing is error-proof with a GUI for rules setting
Since parsing is carried out with a graphical user interface, it is not error prone like writing parsing code.

  6. XML search results can be tested immediately
When a user creates a search rule, they can see the results immediately, so they are able to test the validity of their search rules, ensuring that they are returning the correct data from the XML file.

  7. Instantly see the results of parsing without any parsing code.
Graphically set a parsing rule for a file and see the results of the parsed file appear on the screen.

  8. Modify parsing rules dynamically without changing the parsing code.
Manage and maintain multiple search rules for each file. Modify search rules dynamically without having to update the client code.

  9. Carry out complex searches on large XML files quickly and efficiently.
Complex searches can be carried out intuitively using SXML Categories, the powerful engine behind the rules parser. SXML Categories uses graph theory to allow the parser to search for results from an XML file while bypassing the regular, time consuming tree navigation.

  10. Can handle inheritance searches
SXML allows a user to search an XML file for elements based on the value of their parent, ancestor or sibling elements.

  11. Permanently store, manage and re-run graphically created XML parsing rules.
Each user keeps their own store of XML rules for each XML file which they have tested. These rules can be modified, re-tested or deleted. The user can remotely parse a file based on one or more of the search rules which have been developed for this file.

  12. No integration layer
SXML is available as a service. It does not need any integration layer as it is not installed on a user's machine. There are no interaction issues with various software versions and no security issues with having a new piece of software installed.

  13. Business rules are separate from underlying code.
The XML rules are stored separately from the XML parsing code and can be viewed, managed and modified using the GUI.

  14. No costly, time consuming coding changes needed when business rules change.
When a business rule changes, why spend three months or more updating parsing code for XML files containing the underlying data? Why risk future errors with XML parsing code in order to facilitate changes in business rules?
In the real world business rules change all the time, so you need an XML parser which will be automatically updated when you graphically set new business rules. SXML allows rules to be changed online using the Rules parser and no code changes are required to parse the XML with these rules.


For now, our beta version is available here. Try our XML Parser.


or find out more at www.sxml.com.au





Problems with traditional XML parsers


DOM - Specific problems


  1. Speed - DOM is slow. It is slower than SAX, StAX or Expresso Parser. It's really slow. This doesn't matter too much if you're parsing one file. But what if you're parsing loads of files? It all adds up!
  2. Memory - Probably the biggest problem with DOM specifically is memory. It works for small files but it can run into memory errors surprisingly quickly!

SAX - Specific problems

  1. Difficulty - SAX is a challenge to use. It's much tougher to learn, especially for those coming  from an Object oriented school of thought.
  2. Parsing complexity - SAX makes you hold your own data structure so for complex parsing searches where you want to find the parent element  and the child element you have to take care of all that tracking and data storage yourself. 
  3. SAX cannot be used to create or modify XML. 


StAX- Specific problems

  1. Difficulty - StAX is easier to use than SAX but it is still much more difficult to set up and adjust to than DOM. 
  2. StAX does not provide a way to validate files. 
  3. The only way to use it is through continuous if-else conditions.
  4. There are no access functions for getting child nodes. 

General problems 

These are problems with all the parsers which Expresso fixes

  1. Speed. Expresso is faster than DOM, SAX or StAX.
  2. Memory. Expresso can parse massive files such as 15GB files which can cause problems to even SAX parsers.
  3. Difficult API - Expresso takes the pain out of parsing by replacing the difficult job of instantiating a parser, loading files and writing all that parsing code with a graphical parser. This is a neat little pop up box which allows you to create parsing rules easily and test them immediately against your live XML file.
  4. Conflicts - With the other parsers you are always dependent on your local environment. There can be conflicts between various parsing libraries, other 3rd party libraries or JVMs. There can be issues when moving from one environment to another. Expresso is completely cloud based. 
  5. No general way to share XML parsing rules -  Expresso provides a secure online place to share XML files and their parsing rules within an organisation using fine grained access control. 
  6. No general visual environment for managing XML files and parsing rules - Expresso provides a friendly, visual tool for managing all your XML files. 
  7. You have to load the file - Expresso can work with remote files through HTTPS so you don't need a local copy of the file.








Announcement: Parser name change 


Parsing Time will now be called Expresso


After lots of talk and deliberation we have finally found a name we are happy with: Expresso. For the last two years while we worked on this parser we called it SXML. That stood for something? XML and it was a placeholder as we couldn't come up with a name.
All the good XML based puns are gone already! 
We didn't want to have a boring corporate sounding name or a buzz word name. 
We are all techies so we struggled to do some right-brain activities. We put it off. We thought about the parsing. Finally, we came up with parsing time.  But not everyone liked it. It sounded a bit like a children's TV show and that's not really the audience we were going for (although our parser is so simple that children can use it :) 
So, we went with the Universal parser... the panacea for all your parsing worries. But then we found out that the name was taken by a company in Japan. 

So... we did some brainstorming. We all like coffee (good Melbourne style barista standard coffee) and I am missing good coffee since I returned to live in Ireland (why can't you make a good cup of coffee, Ireland? Why?) so we naturally came to Expresso.

It's not spelt the same way as the coffee. It's there to reflect the speed of the parser. It's really fast, like super fast, so we have called it expresso.


So, if you hear EXPRESSO parser!!! It's us. 

Say it together everyone .... 


EXPRESSO parser!!!












XML parsing - traditional parsers: 

Pull-parsing

Pull parsers are similar to SAX in that they read in a file sequentially, piece by piece, and do not store the entire document. However, these parsers are the next step up from SAX.
Pull parsers include XMLReader in PHP and .NET and StAX in Java.


Pull parsers treat the document as a series of items which are read in sequence. They contain the concept of a cursor which can be moved to various locations in the incoming file.
There is an iterator that sequentially visits the various elements, attributes, and data in an XML document.
There are then methods which can use this iterator, test the current item to see its type and, if it is the expected type, pull out aspects of the element such as text value or attributes.
These methods also have the task of moving the cursor on to the next element.


Advantages
  1. Pull-parsing code can be more straightforward to understand and maintain than SAX parsing code.
  2. Pull-parsing can be faster and more memory efficient than DOM
  3. Can be used to read objects


Disadvantages
  1. More difficult to use than DOM and has a tougher learning curve.
  2. Creates a massive if-else loop in code which can be messy and unmaintainable
  3. You can only go in a forward direction
  4. No XML file validation

How to parse a document with StAX
  1. Create the parser factory object. This factory is then used to create the parser.
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
  2. Create a file input stream and pass it to the factory method to create the parser (reader).
    InputStream input = new FileInputStream(new File("C:/STAX/catalog.xml"));
    XMLStreamReader xmlStreamReader = inputFactory.createXMLStreamReader(input);
  3. Call the hasNext() method to see if there are other events remaining and the next() method to move on to the next one.
    while (xmlStreamReader.hasNext()) {
        int event = xmlStreamReader.next();
        // the checks in the following steps go here, inside the loop
    }
  4. In order to skip a type of event you don't want to process, add a simple check inside the loop which identifies this type of event and continues on.
    if (event == XMLStreamConstants.ENTITY_DECLARATION) {
        continue;
    }
  5. Each call to next() pulls in the next event. When the event is the start of an element you can extract data from that element. In this example, we get the element's name.
    if (event == XMLStreamConstants.START_ELEMENT) {
        System.out.println("Element Local Name:" + xmlStreamReader.getLocalName());
    }
  6. You can also loop around the attributes of an element and get the name and value of an attribute based on its index in the loop e.g.
    for (int i = 0; i < xmlStreamReader.getAttributeCount(); i++) {
        System.out.println(xmlStreamReader.getAttributeLocalName(i) + "=" + xmlStreamReader.getAttributeValue(i));
    }
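Putting the StAX steps together, a small self-contained sketch might look like this. To keep it runnable without a file on disk it reads the XML from a string instead of C:/STAX/catalog.xml. The sample catalog, the class name and the method name are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxBookTitles {

    // Pulls events one by one and collects the text of every <book> element.
    static List<String> bookTitles(String xml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            XMLStreamReader reader = inputFactory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "book".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    titles.add(reader.getElementText().trim());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<catalog>"
                + "<book>A brief history of everything</book>"
                + "<book>History of Europe 1900 - present</book>"
                + "</catalog>";
        for (String title : bookTitles(xml)) {
            System.out.println(title);
        }
    }
}
```

Even in this tiny example the if condition inside the loop is where all the parsing logic lives, which is why StAX code tends to grow into a long chain of if-else checks.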

In the next entry I will be talking about the general drawbacks of using DOM, SAX or StAX and how our new approach to parsing solves these problems. 








XML parsing - traditional parsers - SAX 


Simple API for XML

SAX stands for Simple API for XML. It is different from DOM in the way it reads in XML. DOM reads in an entire file in one go and stores it all in memory. SAX reads a file in sequentially, piece by piece.
SAX is known as event-driven as a document is read serially and its contents are reported as callbacks to various methods on a handler object of the user's design.
So, a user creates some code to instantiate a SAX parser and read in the XML file.
Next, the user creates a series of methods which act on certain information pulled out of a file.
These methods can then go off with this extracted data and do things with it.
The methods themselves are triggered by various elements being found in the part of the document which has just been read in.


Advantages of using SAX
  1. SAX is fast and efficient to implement
  2. SAX can handle large files

Problems
  1. SAX is difficult to use for extracting information at random from the XML, since it tends to burden the application author with keeping track of what part of the document is being processed
  2. SAX is difficult to use for any kind of complicated search
  3. SAX is seen as more daunting to learn for OO programmers as it uses callbacks rather than an OO API.
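
To make the first problem concrete, here is a minimal sketch of the bookkeeping a SAX handler needs just to collect the text of every author element which sits directly inside a book element. The class name AuthorCollector and the sample XML are assumptions made up for illustration. The handler has to remember which elements are currently open and buffer the text itself, because SAX tells it nothing about where in the document it is.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler written for this sketch.
class AuthorCollector extends DefaultHandler {
    // Stack of the elements which are currently open, innermost first.
    private final Deque<String> openElements = new ArrayDeque<>();
    // Buffer for the text of the element being read.
    private final StringBuilder text = new StringBuilder();
    final List<String> authors = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElements.push(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text value, so we append.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openElements.pop();
        // Keep the text only if this is an author whose parent is a book.
        if ("author".equals(qName) && "book".equals(openElements.peek())) {
            authors.add(text.toString().trim());
        }
        text.setLength(0);
    }

    static List<String> collect(String xml) throws Exception {
        AuthorCollector handler = new AuthorCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.authors;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<store><book><author>Brian Callison</author></book></store>";
        System.out.println(collect(xml));
    }
}
```

Every extra condition in a search (a grandparent, an attribute value, a sibling) means more state to track by hand in the handler, which is why complicated searches get awkward quickly.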




How to parse a document with SAX
  1. Create a Java class for the parsing. We will call it Sax. For the callbacks to work it needs to implement ContentHandler, most easily by extending DefaultHandler.
  2. In the static main method, we will set up the parser and in the other methods we will handle the parsing callbacks.
  3. So, in the main method, create the parser factory object. This factory is then used to create the parser.
    SAXParserFactory spf = SAXParserFactory.newInstance();
  4. Create the parser from the factory object.
    SAXParser saxParser = spf.newSAXParser();
  5. Use the parser to create an XMLReader object.
    XMLReader xmlReader = saxParser.getXMLReader();
  6. Set the content handler to an instance of our Sax class, which contains the callback methods.
    xmlReader.setContentHandler(new Sax());
  7. Set an error handler to deal with any errors.
    xmlReader.setErrorHandler(new MyErrorHandler(System.err));
  8. Parse the XML file. Here convertToFileURL is a helper method (not shown) which turns the file name into a file URL.
    xmlReader.parse(convertToFileURL(fileName));
  9. Now, create the first callback method, startElement. This method will be called when a new element is found and it will receive the element's namespaceURI, local name, qName and attributes to be handled inside the method.
    public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException
  10. Finally, create another Java class to handle errors. Call it MyErrorHandler. This class implements the ErrorHandler interface and contains the methods which catch parsing errors, along with helper methods such as this one, which builds a readable message from an exception:
    private String getParseExceptionInfo(SAXParseException spe)
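
Putting the ten steps together gives something like the sketch below. I have made a few assumptions so that it runs on its own: the XML comes from a string instead of a file (so convertToFileURL is not needed), the parse method and the elementNames list are my own additions, and the bodies of the MyErrorHandler methods are just one plausible way to fill them in. The factory is also made namespace aware so that localName is filled in.

```java
import java.io.PrintStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class Sax extends DefaultHandler {
    // Names of the elements seen so far, in document order.
    final List<String> elementNames = new ArrayList<>();

    // Callback: called by the parser each time an element starts.
    @Override
    public void startElement(String namespaceURI, String localName, String qName, Attributes atts) {
        elementNames.add(localName);
    }

    static List<String> parse(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // needed for localName to be filled in
        SAXParser saxParser = spf.newSAXParser();
        XMLReader xmlReader = saxParser.getXMLReader();
        Sax handler = new Sax();
        xmlReader.setContentHandler(handler);
        xmlReader.setErrorHandler(new MyErrorHandler(System.err));
        xmlReader.parse(new InputSource(new StringReader(xml)));
        return handler.elementNames;
    }

    public static void main(String[] args) throws Exception {
        // Sample XML made up for this sketch.
        System.out.println(parse("<store><book id=\"1\"><title>A Flock of Ships</title></book></store>"));
    }
}

class MyErrorHandler implements ErrorHandler {
    private final PrintStream out;

    MyErrorHandler(PrintStream out) {
        this.out = out;
    }

    // Builds a readable message from a parse exception.
    private String getParseExceptionInfo(SAXParseException spe) {
        return "line " + spe.getLineNumber() + ": " + spe.getMessage();
    }

    @Override
    public void warning(SAXParseException spe) {
        out.println("Warning: " + getParseExceptionInfo(spe));
    }

    @Override
    public void error(SAXParseException spe) throws SAXException {
        throw new SAXException("Error: " + getParseExceptionInfo(spe));
    }

    @Override
    public void fatalError(SAXParseException spe) throws SAXException {
        throw new SAXException("Fatal error: " + getParseExceptionInfo(spe));
    }
}
```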



NEXT ENTRY

Next I am going to talk about StAX parsers. It is good to learn about different options available for parsing as each is better in different scenarios.
After that I will be talking about our cloud-based graphical alternative to traditional XML parsing.









Today's entry - XML parsing: the traditional approaches


Introduction

There are a few traditional XML parsing methods. Our parser hopes to make XML parsing a lot easier by replacing these methods with graphical, cloud-based parsing. But it is useful to know the various other methods for XML parsing and how they are used. 

In the next few blog entries I will talk about DOM, SAX and StAX parsing and then I will go on to tell you about our parser and our adventure so far. 

But first....

What is XML parsing?

XML parsing means "reading" the XML file in and extracting its content (or the relevant parts of its content). XML information is encoded in tag names, tag text values and attributes. XML parsing pulls out the required items (whether they be tag names, values or attributes).
What exactly is pulled out of a document depends on what the program requires.




Different types of XML parser


DOM

DOM stands for Document Object Model. It is an API that allows for navigation of the entire document as if it were a tree of objects called nodes.
Basically, a DOM parser reads in an XML document and stores the entire document in memory as a tree.
DOM then provides an API for searching for various elements in that tree.
Every item in the tree, whether it is an element, an attribute or a piece of text, is a node.


Advantages of using DOM
  1. DOM is easier to use than SAX or StAX. The other two APIs require a lot of extra coding in order to pull various elements out of an XML file and to hold these elements in a data structure. This is especially the case for complex rules which search for parent and child tags.
  2. DOM can be used to edit and add nodes to an XML file.
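
As a quick illustration of the second point, here is a minimal sketch which parses a small document, appends a new book element and writes the edited XML back out as a string. The class name DomEditDemo, the method addBook and the sample XML are assumptions of mine. The serialization step uses the standard Transformer API, which is one common way to turn a DOM tree back into text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class DomEditDemo {

    // Parses the XML, appends a new <book> with the given id and title, and returns the edited XML.
    static String addBook(String xml, String id, String title) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Build the new nodes and attach them under the root element.
        Element book = doc.createElement("book");
        book.setAttribute("id", id);
        Element titleElement = doc.createElement("title");
        titleElement.setTextContent(title);
        book.appendChild(titleElement);
        doc.getDocumentElement().appendChild(book);

        // Serialize the in-memory tree back to text.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<store><book id=\"1\"><title>A Flock of Ships</title></book></store>";
        System.out.println(addBook(xml, "2", "Object-Oriented Software Engineering"));
    }
}
```

This kind of in-place editing is only possible because DOM keeps the whole tree in memory, which is also the source of the problems listed next.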

Problems
  1. DOM implementations tend to be memory intensive, as they generally require the entire document to be loaded into memory and constructed as a tree of objects before access is allowed. This drastically limits the maximum XML file size.
  2. DOM is relatively slow. SAX, StAX and our new parser are much faster.
  3. Although DOM is easier to use than SAX or StAX, it still requires quite a lot of programming to extract each piece of information from a file, and this can be mundane, error-prone coding.

How to read in a DOM document
  1. Start with an XML document. Here is a sample document containing a list of books. For each book element there are child elements with title, price, author and genre. Each book has an attribute with name id and a value.



<?xml version="1.0"?>
<store>
        <book id="1">
                <title>Object-Oriented Software Engineering</title>
                <author>Stephen R. Schach</author>
                <price>50.00</price>
                <genre>software</genre>
        </book>
        <book id="500">
                <title>A Flock of Ships</title>
                <author>Brian Callison</author>
                <price>15.00</price>
                <genre>thriller</genre>
        </book>
</store>



  2. Next, we create a DocumentBuilderFactory object. This is used to create the DOM parser object.
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    
  3. We now use the factory we created to create the parser.
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    
  4. Now we must load the XML file as a File object in Java in order to use this file with the parser.
    File fXmlFile = new File("c:\\books.xml");
    
  5. Next, we parse the actual file. It is loaded into memory as a hierarchical series of nodes which is stored in a Document object. All parsing is then done on this Document object and its elements.
    Document doc = dBuilder.parse(fXmlFile);
  6. If we want to get a list of book elements (elements with the tag name book) we do so by using the method getElementsByTagName. This returns a type of list object called a NodeList.
    NodeList nList = doc.getElementsByTagName("book");
  7. For each element we must get the node by calling item(i) on the list. We can do this in a loop.
    for (int i = 0; i < nList.getLength(); i++) {
        Node nNode = nList.item(i);
    }
  8. Now, for each node we find, we have to check if it is an element. If so, we can cast it to an Element object and get the value of its child nodes.
    if (nNode.getNodeType() == Node.ELEMENT_NODE)
    {
        Element eElement = (Element) nNode;
    }
  9. We now have a book element and we need to get the child elements. If we want to get the child node called author, we call getElementsByTagName on the book element to get its child elements called author. We want the first author element (there is only one per book element) so we use .item(0) to get this and then we use .getChildNodes to get the child nodes of this element. In DOM the text value is stored in a text child node. So, the author element has a child node which contains the text value of the tag i.e. Stephen R. Schach. We read that text with getNodeValue.
    String tagname = "author";
    NodeList nlList = eElement.getElementsByTagName(tagname).item(0).getChildNodes();
    String value = nlList.item(0).getNodeValue();
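
The steps above fit together roughly as in the sketch below. To keep it self-contained I have assumed the XML arrives as a string rather than as c:\books.xml, and the class name DomReadDemo and the methods getTagValue and listBooks are my own names for illustration. It returns one line per book with its id, title and author.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class DomReadDemo {

    // Returns the text of the first descendant element called tagName, e.g. the author of a book.
    static String getTagValue(String tagName, Element eElement) {
        NodeList nlList = eElement.getElementsByTagName(tagName).item(0).getChildNodes();
        // The text value lives in the first child node, which is a text node.
        return nlList.item(0).getNodeValue();
    }

    // Returns one line per book element: "id: title by author".
    static List<String> listBooks(String xml) throws Exception {
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.parse(new InputSource(new StringReader(xml)));

        NodeList nList = doc.getElementsByTagName("book");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nList.getLength(); i++) {
            Node nNode = nList.item(i);
            if (nNode.getNodeType() == Node.ELEMENT_NODE) {
                Element eElement = (Element) nNode;
                result.add(eElement.getAttribute("id") + ": "
                        + getTagValue("title", eElement) + " by "
                        + getTagValue("author", eElement));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<store>"
                + "<book id=\"1\"><title>Object-Oriented Software Engineering</title>"
                + "<author>Stephen R. Schach</author></book>"
                + "<book id=\"500\"><title>A Flock of Ships</title>"
                + "<author>Brian Callison</author></book>"
                + "</store>";
        listBooks(xml).forEach(System.out::println);
    }
}
```

Note that getTagValue assumes the tag exists and has text in it. A book with a missing or empty author element would cause a NullPointerException, which is a good example of the mundane checking that DOM code tends to accumulate.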


Next I am going to talk about SAX and StAX parsers. It is good to learn about the different options available for parsing as each is better in different scenarios.
In the next blog entries, I will be talking about our cloud-based graphical alternative to traditional XML parsing.




Tuesday, September 4, 2012

XML parser



XML Parsing in the Cloud

This is a site about my startup company's new, innovative parsing suite. Firstly, a little about us and our motivations. Our project is a collaboration between two companies, both startups. The first is Australian web company Technocrat, who are experts in web development. Technocrat have been in business since June 2009 and they are based in Sydney and Melbourne, Australia.
The second company is Irish startup company Crave Technologies, which was started by Laura Cavanagh (that's me!) and John Craddock. John and I are both software developers who have worked extensively with XML and Java.

The aims behind our project were as follows:
1. To create a parser which is faster than the current fastest Java parsers.
2. To create a parser which can parse massive files.
3. To simplify XML parsing so as to remove the mundane, repetitive and time-consuming coding required to parse XML. By taking this element out, developers' time is freed up so that they can concentrate on the more challenging tasks of logically designing what information is required and how it will be extracted from the XML file.
4. To create a parser on the cloud. Our aim here was to take the initial pain out of changing parsers. I have seen the effort involved in changing parsers from, e.g., DOM to SAX. With our parser you can simply log in and start parsing straight away. There are no installation issues, version issues, environmental conflicts, etc.
5. To create a way to consume SOAP-based web services without writing code or using difficult third-party libraries. Although there are wizards out there for SOAP-based web services, many of these are unreliable, don't work with complex WSDL files and crash easily.
6. To provide a secure, visible environment to store and manage XML files, web services and parsing rules, as well as to share these items among various members of an organization.
7. To provide a way for users to programmatically access the service using Java or JavaScript from their own systems without having to write parsing code.

In the next blog post I will talk about general XML parsing methods and the issues with XML parsing at the moment.

