The Surveying Problem: XSLT Bonus!

Update

So I started looking at graph-tools, but unfortunately, I fell at the first hurdle; I couldn’t even get it to compile. Kept running out of memory at the same point, I suppose I should have gone for more than 16 Gb when I upgraded my computer earlier this year.

The solution is to use a docker image that is helpfully provided by the graph-tools maintainer, but in order to do that, I need to learn docker. Making some great process, but it might be another week or so before I can get my Python code running against containerized Python. So while I do that, here’s a post I wrote a few weeks ago but was saving for a rainy day…

Background

Shortly after the interview where I came up with the DFS solution to the surveying problem I had an amusing thought. The panel said I could use any language I wanted, so I thought to myself, what would the panel have thought if I decided to try and solve the problem in XSLT1https://en.wikipedia.org/wiki/XSLT?

Anyone who knows me knows I have a strange fondness for XSLT. That strange fondness goes back to my first job after my Ph.D., where XSLT was used extensively to transform XML documents. The majority of my colleagues hated that language since it has a trifecta of attributes against it:

  1. It’s functional, as opposed to imperative
  2. It’s very rarely used
  3. It is itself written in XML

I personally always enjoyed writing XSLT; I saw the need to learn a functional programming language as an interesting challenge, and once I got used to thinking in a functional way, the code came naturally. The fact that it was written in XML wasn’t too much of an issue thanks to a decent IDE, the only real issue was that with a (comparatively) small user community, it was often up to me to figure out the best way to do things.

Another important thing to know about XSLT, besides the fact that it’s a functional language, is that it’s a transformation language, i.e. it works by transforming an input document into an output document. An XSLT file can’t do anything by itself, it must be applied to something.

The Solution

Anyway, enough background. I’ll now go through the code I wrote to solve this problem in XSLT. As I mentioned above, an XSLT file must be applied to an input document, so here’s my (abridged) input document:

<grid>
  <el>
    <x>0</x>
    <y>16</y>
  </el>
  <el>
    <x>0</x>
    <y>17</y>
  </el>
  <el>
    <x>0</x>
    <y>19</y>
  </el>
  <el>
    <x>1</x>
    <y>3</y>
  </el>
  <el>
    <x>1</x>
    <y>4</y>
  </el>
</grid>

It’s a pretty simple document format and required some pretty trivial modifications to the build_graph() code from a previous article to spit out XML instead of a Python set object. Each <el> element represents a location in the set, and is defined by it’s <x> and <y> child elements.

Next is the XSLT itself:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="xs exsl"
    version="1.0">
    
   <xsl:include href="firstPass.xslt" />
   <xsl:include href="secondPass.xslt" />
    
    <xsl:output method="xml" indent="yes"/>
    
    <xsl:template match="grid">
        <!-- Actually implement the logic -->
        <xsl:variable name="firstPass">
            <xsl:apply-templates select="el" mode="first" />
        </xsl:variable>
        
        <!-- Strip duplicate locations and empty reservoirs -->
        <xsl:variable name="reservoirs">
            <xsl:apply-templates select="exsl:node-set($firstPass)/reservoir" mode="secondPass"/>
        </xsl:variable>
                
        <!-- Generate summary results -->
        <xsl:element name="results">
            <xsl:element name="numberOfReservoirs">
                <xsl:value-of select="count(exsl:node-set($reservoirs)/reservoir)" />
            </xsl:element>
            <xsl:element name="reservoirs">
                <xsl:apply-templates select="exsl:node-set($reservoirs)/reservoir" mode="results" />
            </xsl:element>
        </xsl:element>
    </xsl:template>
    
    <!-- Results wrapper template -->
    <xsl:template match="reservoir" mode="results">
        <xsl:copy>
            <xsl:attribute name="size">
                <xsl:value-of select="count(location)" />
            </xsl:attribute>
            <xsl:copy-of select="location"/>
        </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="xs exsl"
    version="1.0">
    
    <!-- First time through we need to create the reservoir element -->
    <xsl:template match="el" mode="first">        
        <xsl:element name="reservoir">

                <xsl:element name="location">
                    <xsl:copy-of select="*" />
                </xsl:element>
                
                <xsl:variable name="this_x" select="number(x/text())"/>
                <xsl:variable name="this_y" select="number(y/text())"/>
                
                <!-- Register that we have visited this element before -->
                <xsl:variable name="visited">
                    <xsl:copy-of select="." />
                </xsl:variable>
                
                <!-- We rely on the grid being sorted in increasing x and y. -->
                <!-- The first time through, we will only find a neighbor with increasing or same x -->
                <xsl:apply-templates select="following::el[
                    (x = $this_x and y = $this_y + 1) or
                    (x = $this_x + 1 and y = $this_y - 1) or
                    (x = $this_x + 1 and y = $this_y) or
                    (x = $this_x + 1 and y = $this_y + 1)]" mode="recursive">
                    <xsl:with-param name="visited" select="$visited" />
                </xsl:apply-templates>
            
        </xsl:element>
    </xsl:template>
    
    <!-- Subsequent times through we don't, but we do need to check we haven't been here before -->
    <xsl:template match="el" mode="recursive">
        <xsl:param name="visited" />
        
        <xsl:variable name="currentElement" select="." />
        
        <!-- Check if we have been here before, which stops infinite recursion -->
        <xsl:if test="not(exsl:node-set($visited)/el[x = $currentElement/x][y = $currentElement/y])">
            
            <xsl:variable name="this_x" select="number(x/text())"/>
            <xsl:variable name="this_y" select="number(y/text())"/>
            
            <!-- Add the current location to the list of visited locations -->
            <xsl:variable name="newVisited">
                <xsl:copy-of select="exsl:node-set($visited)/el" />
                <xsl:copy-of select="$currentElement" />
            </xsl:variable>
            
            <xsl:element name="location">
                <xsl:copy-of select="*" />
            </xsl:element>
            
            <!-- Apply this template over all neighbor locations -->
            <!-- Here we might need to go 'up' and 'left', so we need to check negative offsets -->
            <xsl:apply-templates select="../el[
                (x = $this_x - 1 and y = $this_y - 1) or
                (x = $this_x - 1 and y = $this_y) or
                (x = $this_x - 1 and y = $this_y + 1) or
                (x = $this_x and y = $this_y - 1) or
                (x = $this_x and y = $this_y + 1) or
                (x = $this_x + 1 and y = $this_y - 1) or
                (x = $this_x + 1 and y = $this_y) or
                (x = $this_x + 1 and y = $this_y + 1)]" mode="recursive">
                <xsl:with-param name="visited" select="$newVisited" />
            </xsl:apply-templates>
        </xsl:if>
    </xsl:template>
    
</xsl:stylesheet>
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="1.0">
    
    <!-- We need a separator here, otherwise we can't tell the difference between e.g. 1,20 and 12,0 -->
    <xsl:key name="locations-by-coords" match="location" use="concat(x, ',', y)" />
    
    <!-- Only match reservoirs that have at least one location after stripping duplicates -->
    <xsl:template match="reservoir[count(location[generate-id() = generate-id(key('locations-by-coords', concat(x, ',', y))[1])]) != 0]" mode="secondPass">
        <xsl:copy>
            <!-- Iterate over the unique locations in a reservoir -->
            <xsl:apply-templates select="location[generate-id() = generate-id(key('locations-by-coords', concat(x, ',', y))[1])]" mode="secondPass" />
        </xsl:copy>
    </xsl:template>
    
    <!-- Actually output the location -->
    <xsl:template match="location" mode="secondPass">
        <xsl:copy-of select="." />
    </xsl:template>
    
</xsl:stylesheet>

You can see one of the things I hate about XSLT, which is how wordy and bloated the source is! In case you didn’t notice, it’s in three different files just to avoid hundreds and hundreds of lines of scrolling.

The code is executed in two parts. The first is extremely similar to the Python DFS solution, although because XSLT is a functional language and all variables are immutable, there’s no way of determining if a location has already been visited. There are some optimizations that can be made by ensuring the locations are visited in a well-defined order using the <xsl:sort /> function, but it doesn’t allow the elimination of visiting the same location twice. Consider the following situation:

Even if the list of locations is sorted by increasing x and then increasing y, or even increasing distance from the origin, you still wouldn’t necessarily identify (3,0) as being a part of the well unless you go all the way down to (3,2), and then back up to (3,0). At this point, you may already have started a new well at (3,0) which is a duplicate of the first well. The global ‘visited’ set in the Python implementation solves this problem, but that’s not possible in a functional implementation.

The solution here is to add a second pass through which does two things. The first is that it removes duplicate nodes within the same well, caused when there are multiple possible ways of getting from one active location to another. The second is to remove all reservoirs that are a subset of another reservoir. Through some clever sorting, we can ensure that we will always have the full reservoir before any subsets, which means we can perform this filtering by doing a global filter on unique locations. The code implements Muenchian Grouping2https://en.wikipedia.org/wiki/XSLT/Muenchian_grouping to enable this, which is a neat way of removing duplicate nodes based on arbitrary but definable rules for what constitutes a duplicate within an element.

The final step is then a template that wraps the reservoir results with a summary of how many reservoirs were found in total and adds the size of each reservoir to each reservoir element. An excerpt of the output is given below.

<?xml version="1.0" encoding="UTF-8"?>
<results>
   <numberOfReservoirs>21</numberOfReservoirs>
   <reservoirs>
      <reservoir size="3">
         <location>
            <x>0</x>
            <y>16</y>
         </location>
         <location>
            <x>0</x>
            <y>17</y>
         </location>
         <location>
            <x>1</x>
            <y>17</y>
         </location>
      </reservoir>
      <reservoir size="2">
         <location>
            <x>0</x>
            <y>19</y>
         </location>
         <location>
            <x>1</x>
            <y>20</y>
         </location>
      </reservoir>

Thoughts

First things first, I know this is a terrible solution. It’s inefficient in that it requires multiple passes to generate the right output, and it fails completely at pretty modestly-sized grids (anything above 50×50 takes too long to complete). It might be possible to perform some improvements, but it’s not something I really care to do.

I realize that XSLT was never designed to perform this kind of work, but then I thought to myself: “how you would go about any kind of efficient BFS implementation without being able to rely on a global ‘visited’ tracker to avoid hitting the same location twice?” Is it even possible to implement an O(n) BFS algorithm in a functional language?

The answer seems to be yes! An article on Stack Overflow3https://stackoverflow.com/questions/30464163/functional-breadth-first-search has a solution in Haskell that seems to do exactly what’s required. I have no experience writing in Haskell, but it seems like an extremely interesting approach. If I get time, it’s something I’ll play around with to see how it works, and whether it makes sense for a problem like this!

Until then, back to docker…