XML With Python
DOM (DOCUMENT OBJECT MODEL)
The entire file is loaded into memory, so its content can be accessed like an object.
This gives fast, direct access to any part of the document and makes it possible to modify the file.
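A minimal sketch of the DOM style, using xml.dom.minidom from the Python standard library (one of several DOM-like options). The sample document and the variable names are made up for illustration:

```python
from xml.dom import minidom

# Hypothetical sample document, used only for illustration.
xml_text = "<transaction><from>Jani</from><amount>500</amount></transaction>"

# The whole document is parsed into an in-memory tree.
dom = minidom.parseString(xml_text)

# Any element can be reached directly, in any order.
amount_node = dom.getElementsByTagName("amount")[0]
original_amount = amount_node.firstChild.data

# The tree can also be changed and written back out.
amount_node.firstChild.data = "750"
updated_xml = dom.documentElement.toxml()

print(original_amount)
print(updated_xml)
```

Reading and modifying both work on the in-memory tree, which is exactly what becomes expensive when the file is very large.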
SAX (SIMPLE API FOR XML)
Reads the file piece by piece and processes the XML through events.
The file is never held in memory as a whole, which is a huge plus for huge files or limited resources!
It is read-only.
CONCLUSION:
Both methods can do the same job when it comes to reading a file, but the way of working is very different. There are many use cases where you can combine the strengths of both.
For example, if you need to process very large files containing data from many years, you can split them up with SAX and then process the smaller files further with DOM.
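As a rough illustration of combining both styles, here is a sketch using xml.dom.pulldom from the standard library. Note that this is not the file-splitting approach described above: instead of writing smaller files, it streams through the document and builds a small DOM tree per transaction element. The sample data is made up for illustration:

```python
from xml.dom import pulldom

# Hypothetical sample data, used only for illustration.
xml_text = (
    "<transactions>"
    "<transaction><from>Jani</from><amount>500</amount></transaction>"
    "<transaction><from>Jean</from><amount>1000</amount></transaction>"
    "</transactions>"
)

amounts = []
events = pulldom.parseString(xml_text)
for event, node in events:
    # Stream through the document event by event, SAX style ...
    if event == pulldom.START_ELEMENT and node.tagName == "transaction":
        # ... and build a small DOM tree for this one element only.
        events.expandNode(node)
        node.normalize()  # merge text nodes that arrived in pieces
        amount = node.getElementsByTagName("amount")[0].firstChild.data
        amounts.append(float(amount))

print(amounts)
```

Only one transaction subtree is expanded at a time, so the DOM convenience is limited to a small part of the document.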
LET’S GET STARTED WITH SAX
In this blog, we will focus on getting started with SAX, as it was, at least for me, the lesser-known method. It has some peculiarities in the way you should write your code: because you can’t access all the tags and their content immediately, you have to know up front what you want to achieve. Developing with SAX can be compared to reading a book: line by line, tag by tag.
Working with Chunks
As said before, the file is processed piece by piece, and those pieces are called chunks. This is very important to remember when developing with the SAX libraries: you receive the information as if you were reading the file from start to end.
For example, you assume the content of a specific tag <ABC>1234</ABC> is 1234. But if the current chunk ends at 12 and the next one starts with 34, the text arrives in two pieces, and code that looks at only one piece works with the wrong result.
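The effect can be sketched without a real file by calling the handler methods by hand, the way a parser might when the text is split. The two handler classes and the helper function below are hypothetical names used only for this illustration:

```python
import xml.sax


class NaiveHandler(xml.sax.ContentHandler):
    """Keeps only the most recent piece of text (the mistake)."""

    def __init__(self):
        super().__init__()
        self.value = ""

    def characters(self, content):
        self.value = content


class BufferingHandler(xml.sax.ContentHandler):
    """Appends every piece of text it receives."""

    def __init__(self):
        super().__init__()
        self.value = ""

    def characters(self, content):
        self.value += content


def deliver_in_two_pieces(handler):
    # Simulate a parser delivering <ABC>1234</ABC> as two chunks.
    handler.startElement("ABC", {})
    handler.characters("12")
    handler.characters("34")
    handler.endElement("ABC")
    return handler.value


naive_result = deliver_in_two_pieces(NaiveHandler())
buffered_result = deliver_in_two_pieces(BufferingHandler())
print(naive_result)     # 34
print(buffered_result)  # 1234
```

Whether and where a real parser splits the text depends on the parser and its buffer size, so the safe assumption is that it can happen anywhere.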
Our Case
We want to analyze our XML files, which contain huge amounts of transaction data from the past years. We want to know how many transactions took place and what the total amount was. To keep things easy to understand, we will use just a small subset to guide us through the development.
<transactions>
    <transaction>
        <from>Jani</from>
        <amount>500</amount>
        <currency>Don’t forget me this weekend!</currency>
    </transaction>
    <transaction>
        <from>Jean</from>
        <amount>1000</amount>
        <currency>Don’t forget me this weekend!</currency>
    </transaction>
</transactions>
Let’s get started
Start by importing the xml.sax module and creating a handler.
import xml.sax
This handler takes care of how the file is processed. It’s based on the ContentHandler from xml.sax, from which we will override some functions and adjust them to our needs. In the __init__ function we define some variables that we will use to store data while walking over the file.
class TransactionHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        # Used for correct working
        self.currentTag = ""
        self.content = ""
        # Used for counting
        self.transactionCount = 0
        self.transactionAmount = 0
Now we will start implementing the functions. These are triggered automatically by the parser. When the parser sees a new XML element, the function startElement gets called. Since we don’t know the value of the tag at this moment, we save the tag name so we can handle it when the endElement function gets called.
    def startElement(self, tag, attributes):
        self.currentTag = tag
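To see the order in which these functions are called, here is a small standalone sketch. EventTracer is a hypothetical helper class, separate from the TransactionHandler we are building, and the one-line document is made up for illustration:

```python
import xml.sax


class EventTracer(xml.sax.ContentHandler):
    """Records the order in which the parser calls the handler."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def startElement(self, tag, attributes):
        self.events.append("start " + tag)

    def endElement(self, tag):
        self.events.append("end " + tag)

    def endDocument(self):
        self.events.append("endDocument")


tracer = EventTracer()
xml.sax.parseString(b"<transaction><amount>500</amount></transaction>", tracer)
for event in tracer.events:
    print(event)
```

The start and end events are nested exactly like the tags in the file, which is why state has to be carried from one call to the next.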
When the parser finds text inside a tag, it is passed to the characters function. Since we are working with chunks, we don’t know if all the data has been loaded yet. So we append the content to a variable and handle it when endElement gets called.
    # get characters (important for chunks!)
    def characters(self, content):
        self.content += content
Now in endElement we can add all our logic, because all the information we need is available: the tag and its content.
In endElement we get the tag plus the content we saved before. When the current tag is amount, we add its value to the variable transactionAmount.
When the closing tag is transaction, we add 1 to the variable transactionCount. We could also do this in startElement, but I decided to bundle all the logic in endElement.
    def endElement(self, tag):
        if self.currentTag == "amount":
            self.transactionAmount += float(self.content)
        if tag == "transaction":
            self.transactionCount += 1
        # Reset the value of the content
        self.content = ""
        self.currentTag = ""
The endDocument function is called when the parser reaches the end of the document. This is a good moment to do some final work, for example storing the values so they can be accessed outside the handler.
    def endDocument(self):
        self.sendValues()

    def sendValues(self):
        self.output = {
            'amount': self.transactionAmount,
            'count': self.transactionCount
        }
Turn on the engine please!
Now that everything is set up, it’s time to push the XML file through it! We create a new SAX parser and tell it to use the handler we just created.
parser = xml.sax.make_parser()
handler = TransactionHandler()
parser.setContentHandler(handler)
# path and file are assumed to be defined elsewhere and to point at the XML file
parser.parse(str(path) + str(file))
print(handler.output)
Output of the script:
{'amount': 1500.0, 'count': 2}
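To try this without a file on disk, here is a condensed, self-contained sketch that feeds the sample data to the same kind of handler through xml.sax.parseString. It is a variation, not the exact code from above: it checks the tag passed to endElement instead of a saved currentTag, clears the buffer in startElement, and leaves out the currency elements of the sample for brevity.

```python
import xml.sax

# The sample transactions from above (currency elements left out).
SAMPLE = b"""<transactions>
  <transaction>
    <from>Jani</from>
    <amount>500</amount>
  </transaction>
  <transaction>
    <from>Jean</from>
    <amount>1000</amount>
  </transaction>
</transactions>"""


class TransactionHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.content = ""
        self.transactionCount = 0
        self.transactionAmount = 0.0

    def startElement(self, tag, attributes):
        # Start with an empty buffer for every new element.
        self.content = ""

    def characters(self, content):
        # Text may arrive in several pieces, so always append.
        self.content += content

    def endElement(self, tag):
        if tag == "amount":
            self.transactionAmount += float(self.content)
        elif tag == "transaction":
            self.transactionCount += 1
        self.content = ""

    def endDocument(self):
        self.output = {
            "amount": self.transactionAmount,
            "count": self.transactionCount,
        }


handler = TransactionHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.output)  # {'amount': 1500.0, 'count': 2}
```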
And that’s how you handle XML files when you can’t load them completely into memory. I agree, it’s quite a lot of code for this small result, and I know this could be done with DOM in just a couple of lines. But when handling large files (>1 GB) with limited resources, this is a nice alternative. What I like most about this method is that you can make it as complex as you like.
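For comparison, a DOM version of the same calculation could look roughly like this. It is a sketch using xml.dom.minidom on an inline copy of the sample data (currency elements left out), and it loads the whole document into memory:

```python
from xml.dom import minidom

# Inline copy of the sample transactions, used only for illustration.
xml_text = """<transactions>
  <transaction><from>Jani</from><amount>500</amount></transaction>
  <transaction><from>Jean</from><amount>1000</amount></transaction>
</transactions>"""

dom = minidom.parseString(xml_text)
amount_nodes = dom.getElementsByTagName("amount")

result = {
    "amount": sum(float(node.firstChild.data) for node in amount_nodes),
    "count": len(dom.getElementsByTagName("transaction")),
}
print(result)  # {'amount': 1500.0, 'count': 2}
```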
Did this blog inspire you? Don’t know how to proceed further with this? Do you need any advice on this topic? Was it too brief, and do you want to know more? Don’t hesitate to contact us.
Blog written by Neil Van den Broeck