XML Differencing Overview

posted 11:41AM May 17, 2004 with tags by Lars Trieloff

There was a interesting discussion going on on xml-doc@yahoogroups.com and before I drop the mails from my INBOX, I will repost the excerpt of them here, before I will drop the facts from my mind.

The Theory - Papers written on XML differencing

A Delta Format for XML: Identifying Changes in XML Files and Representing the Changes in XML - by Robin La Fontaine
This paper describes how changes to XML documents and data files can be represented in XML and proposes a delta format for XML. Although Canonical XML provides a mechanism for verifying that two XML files or documents are equal, it is more often necessary to determine the differences between two XML documents. Such differences should ideally be represented in XML and this paper describes how such changes can be represented in XML with minimal additional attributes and elements.
This paper is one of the foundations for the product DeltaXML below.
Merging XML files: a new approach providing intelligent merge of XML data sets - by Robin La Fontaine
This paper proposes a systematic approach to merging based on the use of an intermediate XML file that contains both of the files to be merged in a formal structure that clearly identifies data that is common to both files and data that is unique to one of the files. The advantage of this intermediate file is that many of the conflicts that typically emerge when XML data is merged can be identified and resolved. The resolution of these conflicts is a key to achieving a useful merge.
Compiled from a presentation at XML Europe 2002
XUpdate - language for updating XML documents

This specification defines the syntax and semantics of the XUpdate update language. An update in the XUpdate language is expressed as a well-formed XML document.

XUpdate makes extensive use of the expression language defined by XPath for selecting elements for updating and for conditional processing.

XUpdate is a pure descriptive language which is designed with references to the definition of XSL Transformations XSLT.

This specification is nearly three years old, but there is no reference to implementations.
Tree-to-tree Correction for Document Trees - David T. Barnard, Gwen Clarke, Nicolas Duncan
Abstract: Documents can be represented as ordered labelled trees. Finding the editing distance between documents is a particular case of the general problem for trees. We give a detailed survey of previous results, presenting them in a single notation to elucidate their commonalities. We then discuss two ways of extending these results---first, by changing the set of primitive editing operations used by existing algorithms and, second, by post-processing the output of the algorithms to recognize patterns
Version Management of XML Documents - Shu-Yao Chien, Vassilis J. Tsotras, Carlo Zaniolo Lecture Notes in Computer Science
The problem of ensuring efficient storage and fast retrieval for multi-version structured documents is important because of the recent popularity of XML documents and semistructured information on the web. Traditional document version control systems, e.g. RCS, which model documents as a sequence of lines of text and use the shortest edit script to represent version differences, can be inefficient and they do not preserve the logical structure of the original document.
Comparing of SGML documents - Mohammad Taghi Hajiaghayi
Documents can be represented as structures with a hierarchial arrangement of text and non-text nodes, where nodes are labeled by category names such as paragraph and section. Representing documents this way is a natural consequence of using the Standard Generalized Markup Language(SGML) to encode text documents which has many applications in different areas.
TopBlend - An Efficient Implementation of HtmlDiff in Java
The World Wide Web is growing rapidly with new and changing web content. Detecting changes in web pages is crucial for website masters who care about website integrity. It is also convenient for web surfers who are constantly looking for new products, services, or information. This paper describes TopBlend, a new HTML differencing tool implemented in Java. TopBlend uses the fast Jacobson-Vo algorithm, which solves the Heaviest Common Subsequence problem, for page comparison.
Change Detection in Hierarchically Structured Information - Sudarshan S. Chawathe, Anand Rajaraman, Hector Garcia-Molina, Jennifer Widom
Detecting and representing changes to data is important for active databases, data warehousing, view maintenance, and version and configuration management. Most previous work in change management has dealt with flat-file and relational data
X-Diff: An Effective Change Detection Algorithm for XML Documents - Yuan Wang David J. DeWitt Jin-Yi Cai
Over the next several years XML is likely to replace HTML as the standard web publishing language and data transportation format. Since online information changes frequently, being able to quickly detect changes in XML documents is important to Internet query systems, search engines, and continuous query systems. Previous work in change detection on XML or other hierarchically structured documents used an ordered tree model, in which left-to-right order among siblings is important and it affects the change result. In this paper, we argue that an unordered model (only ancestor relationships are significant) is more suitable for most database applications. Using an unordered model, change detection is substantially harder than using the ordered model, but the change result that it generates is more accurate. We propose X-Diff, an effective algorithm that integrates key XML structure characteristics with standard tree-to-tree correction techniques. We also analyze the algorithm and compare it with XyDiff [CAM02], a published XML diff algorithm. We provide an experimental evaluation on both algorithms.
building a good mac os x citizen an xml differencing application for mac os x - Kenneth Macarthur
Since the publication of the original XML 1.0 recommendation [1] by the World Wide Web Consortium in 1998, the use of XML as a vendor- and technology-independent means of exchanging data has exploded. During the intervening period, Apple Computer, Inc. released its next-generation, UNIX-based operating system, Mac OS X [2], and, in so doing, transformed the Mac into a truly compelling platform on which to develop and use software. Taking these two facts together along with recent developments in the field of XML differencing it is now considered timely to develop an application for Mac OS X that performs on XML documents a task for which there is likely to be growing demand amongst Mac users as the volume of XML in existence spirals upwards, namely that of difference calculation and display. In this report, we examine a number of algorithms for computing differences between XML documents and pragmatically select one for use in such an application based on its availability in the form of a functioning implementation. We also examine Mac OS X, focusing on the operating system from the perspective of the software developer. We then document the substance of a project to develop a differencing application for Mac OS X using the Cocoa [3] application framework. We discuss the problems we encountered using this framework and our chosen programming language, and highlight the limitations of the algorithm, and of our application.
Change Detection of XML Documents Using Signatures - Latifur Khan, Lei Wang and Yan Rao
XML is an emerging standard for the representation and exchange of Internet data. The characteristics of XML, tree-structured (i.e. a collection of nodes) and self-descriptive, facilitate the detection of changes in an XML document in minute detail and at a finer grain than obtainable at the document level. There is at present no really effective method for detecting these changes. We wish to propose such a method here. Rather than inspecting all nodes between two versions of XML documents, we propose an effective algorithm, called top-down, which will detect changes in XML documents by exploring a subset of nodes in the tree.
Detecting Changes in XML Documents - Gregory Cobena, Serge Abiteboul Amelie Marian
We present a diff algorithm for XML data. This work is motivated by the support for change control in the context of the Xyleme project that is investigating dynamic warehouses capable of storing massive volume of XML data. Because of the context, our algorithm has to be very efficient in terms of speed and memory space even at the cost of some loss of quality . Also, it considers, besides insertions, deletions and updates (standard in diffs), a move operation on subtrees that is essential in the context of XML. Intuitively, our diff algorithm uses signatures to match (large) subtrees that were left unchanged between the old and new versions.

The Tools - Software for XML Diff and Merge

DeltaXML - Managing Change in an XML Environment
DeltaXML can be used to find differences in XML files, to merge revisions and to syncronize XML. DeltaXML can be integrated into existing products by using standard SAX and Trax APIs. It will even perform o files upto 50 MB.
No product on the market today comes close to DeltaXML in scope or performance. Our customers are building solutions for XML software testing, merging XML, displaying changes in XML data on the web or in printed documents, improving security, reducing bandwidth and identifying changes for database update and synchronization. The provision of changes only or changes plus unchanged data in the delta file, the handling of unordered data and ability to use keys to control the comparison process provide significant business benefits to our customers and make DeltaXML the preferred choice for leading XML users.
This is why the pricing of this high-profile tool starts at 5000 US$/EUR.
RevXml - a tool for traking changes in XML/SGML tagged documents
RevXml tracks all changes between two XML or SGML tagged documents or data. It takes into account the tree structure of the document, the attributes and the text in between the XML-tags
The website mentioned in the press release requires the Flash plugin.
Versim Differencing Toolkit - by Verisim Technologies
  • Finds all the differences between XML instances, including comments and processing instructions.
  • Finds changes within text nodes to the word level.
  • Flexible API makes it simple to exclude certain nodes or node types from the difference without pre- or post-processing.
  • Provides easy programmatic access to the difference result.
A 30-day evaluation version can be downloaded fo free, the full license is priced around 500 US$/EUR.
XML Diff and Merge Tool - by IBM alphaworks
XML Diff and Merge Tool is a Java program that can be used for reconciling or understanding changes made by a single user to an XML document OR for reconciling or understanding changes made by several people to a single document.
The tool is available for all Java 2 Platforms and can be downloaded for free, but a commercial license costs US$ 1500.
TurboXML - mentioned by Binisiya
TIBCO TurboXML is an Integrated Development Environment (IDE) for developing and managing XML assets. With facilities for creating, validating, converting, and managing XML schemas, XML files and DTDs, TurboXML provides a best-of-class XML implementation platform that combines industry leading XML solutions (XML schema authoring, XML instance editing, and XML project management) into an simple and intuitive IDE. TurboXML, which is currently deployed in thousands of organizations around the world, is the first XML IDE to offer comprehensive support for the latest XML standard - the XML Schema Recommendation.
besides that it offers a visual differencing tool, but It cannot accessed via an programming interface. A single user license is avaiable for around 250 US$.
Arbortext Epic - high-end XML/SGML editor
Arbortext Epic is a high-end XML authoring solution. It costs around 700 US$ for a single user license, but it also includes a tool to visualize changes and differences in XML files.
Designed for power authors, Epic Editor meets the most demanding challenges faced by organizations that employ teams of authors, each working with large amounts of material, all of which needs to be delivered to multiple audiences and disseminated in multiple forms—Web, CD-ROM, print, PDF and wireless.
DiffMK (Perl) & DiffMK (Java) -

In a nutshell, diffmk works like this: it converts the documents into two lists of nodes (text and/or element nodes) and attempts to find the longest common subsequence of nodes. Phrased another way, it finds the smallest number of additions and deletions to each list that are required to make the two lists the same.

Unfortunately, the actual set of changes is not always the shortest distance between two versions. So, naturally, the algorithm sometimes identifies changes incorrectly. However, it does identify all of the changes. In other words, it sometimes incorrectly identifies what we would consider a "change" as an "addition followed by a deletion" or an "addition followed by a deletion" as a "change", but it never misses anything. So the highlighted text always shows at least everything that has changed, and sometimes a little more.

Available for free with source code, aimed at writers who want to find differences between two revisions of an XML document.
VMTools - diffing and patching
VM Tools has the following features:
  • Uses sophisticated algorithm to calculate difference between two XML documents
  • Supports generic OTA_Update message generation/processing
  • The difference document generated is optimized for minimal size
  • Core code is generic and extensible
  • Designed to plug into your java application
  • Minimal subset XPath implementation allows use of simple XPath notation to reference change points within the source document.
  • Supports mark-up style documents in addition to data documents
  • The tools are available for free under a open-source license. All tools are accessible via a Java API.
XmlDiff - by logilab.com
XMLdiff is a python tool that figures out the differences between two similar XML files, in the same way the diff utility does it for text files. It was developed for the Narval project and could also be used as a library or as a command line tool. It can work either with XML files or DOM trees
Available under the conditions of the GPL and written in Python. Not very extensive documentation.
XML Diff and Patch 1.0 - by Microsoft
Microsoft XML Diff and Patch is a set of tools for comparison of two XML documents and application of the changes (patching). XML Diff detect addition, deletion and other changes between two XML documents. It also detects structural changes like a move of a XML subtree. It produces Xml Diff Language Diffgram (XDL diffgram or just diffgram) that describes the differences between the two XML documents. The diffgram can be then used to display these differences or to perform a patch operation using the XML Patch tool.
The .NET-based tool can be tested from the website demonstration and downloaded for free.
XML Linear Time Treediff - by Christian Nentwich
This package allows you to compute the difference between to DOM trees in n steps, where n is the number of nodes in the first of the two trees. Speed comes at a price, so there are some points...
Michael Smith has also written an xmlhack.com article on this free and open-source Java tool
XML-SemanticDiff - Perl Module
erl extension for comparing XML documents
Not too much documentation here. Read the source, dude!
diffxml & patchxml - Tools for comparing and patching XML files

The standard Unix tools diff and patch are used to find the differences between text files and to apply the differences. These tools operate on a line by line basis using well-studied methods for computing the longest common subsequence (LCS).

Using these tools on hierarchically structured data (XML etc) leads to sub-optimal results, as they are incapable of recognizing the tree-based structure of these files.

This project aims to provide Open Source XML diff and patch utilities which operate on the hierarchical structure of XML documents.

There is also an xmlhack.com article by Michael Smith on this tool. For documentation, there is the dissertation of the author and the specification of the output format. The program is written in Java and available to the conditions of the GNU Public License
XML TreeDiff - by IBM alphaworks
XML TreeDiff is a set of Java beans that enable efficient differentiation and updating of DOM trees, similar to diff and patch differentiation and update data files. However, rather than differentiating the file representations of the documents (that is, the XML files), XML TreeDiff runs directly on the DOMs themselves. This way, the differences are directly expressed in terms of native tree operations such as change node, delete node, or insert node, rather than line mismatches.
Nearly four years old, this tool is available for all Java platforms.
Dommitt v2.0 Unordered Diff Tool - Scalable XML tools for the World Wide Web
On June 19th, 2002 Dommitt debuted the latest version of its XML Differencing Tool (v2.0). A significant new feature added to v2.0 is the ability to perform an unordered difference of XML documents. The significance of this new feature is that the gap between SQL and XML vanishes. SQL deals with result sets that have no order. An ordered XML document attaches significance to child order. So, an ordered diff algorithm would indicate some spurious differences in XML documents generated by SQL queries. This new unordered diff algorithm ignores the order in which elements are laid out in the document. No user input is required to determine unordered differences. Dommitt is the only commercial tool offering this capability.
A Java API is available, but I could not find any pricing information.

Comments:

Post a Comment:
Comments are closed for this entry.