{"id":41,"date":"2020-07-31T13:18:00","date_gmt":"2020-07-31T17:18:00","guid":{"rendered":"https:\/\/andygrigg.com\/?p=41"},"modified":"2020-08-02T18:37:04","modified_gmt":"2020-08-02T22:37:04","slug":"the-surveying-problem-2","status":"publish","type":"post","link":"https:\/\/andygrigg.com\/?p=41","title":{"rendered":"The Surveying Problem Part 3: The Recursive Solution"},"content":{"rendered":"\n<p>As I mentioned in my previous post, I have been working on some possible solutions to the following problem:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>You are working for a company that is responsible for processing survey data for an Oil and Gas company. You are given data that summarizes whether oil was found at a certain grid reference. The assumption is that if oil is found in contiguous grid references (or grid elements), then they form a single oil reservoir. The objective is to write a program that will determine the number and locations of the reservoirs.<\/p><\/blockquote>\n\n\n\n<p>The question was posed to me in an interview context, so I only had access to standard Python libraries. As a result, the solution I came up with used a recursive algorithm, here&#8217;s the code:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"enlighter\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\"\"\" Generate a grid of elements with a random layout of elements that contain oil\nbuild_grid: Build the grid with an optional random seed\nget_neighbors: Find the number of active neighbor elements for a given element\n\"\"\"\n\nimport random\nimport itertools\n\nTHRESHOLD = 0.85\n\ndef get_neighbors(this_element, grid):\n    \"\"\" Returns a list of neighbor location objects\n\n     this_element: A 2-tuple representing the x and y coordinates of an element\n     grid: A dictionary containing all elements and their state\n     \"\"\"\n    x_coord, y_coord = this_element\n    x_offsets = [-1, 0, 1]\n    y_offsets = [-1, 0, 1]\n\n    neighbors = list()\n    for x_offset in x_offsets:\n        for y_offset in y_offsets:\n            if x_offset == 0 and y_offset == 0:\n                continue\n\n            coord = (x_coord + x_offset, y_coord + y_offset)\n            if coord in grid:\n                neighbors.append(coord)\n    return neighbors\n\n\ndef build_grid(size, seed=None):\n    \"\"\" Build a square grid of elements, where each element may or may not contain oil\n\n     size: The number of elements along one edge of the grid\n     seed: Random seed to be used to generate the grid\n     \"\"\"\n    random.seed(seed)\n\n    initial_grid = set()\n    for location in itertools.product(range(0, size), repeat=2):\n        if random.random() > THRESHOLD:\n            initial_grid.add(location)\n\n    grid = set()\n    # Cluster the grid. If an active element is not isolated,\n    # or if an inactive element has at least 4 active neighbors\n    for location in itertools.product(range(0, size), repeat=2):\n        state = location in initial_grid\n        sites_nearby = get_neighbors(location, initial_grid)\n        neighbor_count = len(sites_nearby)\n        if (state and neighbor_count != 0) or neighbor_count >= 4:\n            grid.add(location)\n\n    return grid\n<\/pre>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><\/div>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><\/div>\n\n\n\n<p>It&#8217;s worth saying that whilst I came up with the broad algorithm during the interview, the version above contains a few improvements over the original, both to the structure of the code and the performance.<\/p>\n\n\n\n<p>One example: the code in the previous post had the locations object as a <code>dict()<\/code>, where each key was a 2-tuple representing the x and y coordinates, and each value was either <code>True<\/code> or <code>False<\/code>. This was fine, but when trying with large grids (1000 x 1000+) I was using a not-insignificant amount of memory, so I switched to using a list containing the <code>True<\/code> 2-tuples only.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Performance Optimization<\/h2>\n\n\n\n<p>However, imagine my horror when I realized that while I reduced my RAM usage, I slowed down the performance on a 100&#215;100 grid by a factor of over 1000! I must admit I had to resort to the Profiler in PyCharm to point me in the right direction, which pointed me firmly in the direction of my <code>get_neighbors<\/code> method.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-medium\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"230\" src=\"https:\/\/i0.wp.com\/andygrigg.com\/wp-content\/uploads\/2020\/08\/Slow-1-e1596390972159-300x230.png?resize=300%2C230&#038;ssl=1\" alt=\"\" class=\"wp-image-55\" srcset=\"https:\/\/i0.wp.com\/andygrigg.com\/wp-content\/uploads\/2020\/08\/Slow-1-e1596390972159.png?resize=300%2C230&amp;ssl=1 300w, https:\/\/i0.wp.com\/andygrigg.com\/wp-content\/uploads\/2020\/08\/Slow-1-e1596390972159.png?resize=768%2C588&amp;ssl=1 768w, https:\/\/i0.wp.com\/andygrigg.com\/wp-content\/uploads\/2020\/08\/Slow-1-e1596390972159.png?w=788&amp;ssl=1 788w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/figure><\/div>\n\n\n\n<p>The Profiler wasn&#8217;t kind enough to tell me that it was specifically my <code>coord in all_locations<\/code> call that was causing the slowdown, but the <code>get_neighbors<\/code> function doesn&#8217;t do a whole lot, so I quickly realized this must be the cause. After a quick Google, I came up with the Python TimeComplexity page<sup class=\"modern-footnotes-footnote \" data-mfn=\"1\" data-mfn-post-scope=\"000000000f4cd4ce00000000421c9005_41\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-000000000f4cd4ce00000000421c9005_41-1\">1<\/a><\/sup><span id=\"mfn-content-000000000f4cd4ce00000000421c9005_41-1\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"1\"><a rel=\"noreferrer noopener\" href=\"https:\/\/wiki.python.org\/moin\/TimeComplexity\" target=\"_blank\">TimeComplexity<\/a><\/span>. As long as the keys stored in a <code>dict<\/code> have sensible hashing functions (which my 2-tuples do), an <code>in<\/code> call with a <code>dict<\/code> is O(1), compared to O(n) for a list.<\/p>\n\n\n\n<p>I moved away from a <code>dict<\/code> because I had no values, so instead of switching to a list, I should have switched to a <code>set<\/code>. A <code>set<\/code> is basically a <code>dict<\/code> without values (it turns out in CPython that&#8217;s exactly how it&#8217;s implemented), and is an extremely useful type, especially when you want to perform set-type operations. Here&#8217;s the call graph using a set instead of a list, as you can see the bottleneck in <code>get_neighbors<\/code> is gone.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-medium\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"298\" src=\"https:\/\/i0.wp.com\/andygrigg.com\/wp-content\/uploads\/2020\/08\/Fast-e1596392148881-300x298.png?resize=300%2C298&#038;ssl=1\" alt=\"\" class=\"wp-image-58\" srcset=\"https:\/\/i0.wp.com\/andygrigg.com\/wp-content\/uploads\/2020\/08\/Fast-e1596392148881.png?resize=300%2C298&amp;ssl=1 300w, https:\/\/i0.wp.com\/andygrigg.com\/wp-content\/uploads\/2020\/08\/Fast-e1596392148881.png?resize=150%2C150&amp;ssl=1 150w, https:\/\/i0.wp.com\/andygrigg.com\/wp-content\/uploads\/2020\/08\/Fast-e1596392148881.png?w=618&amp;ssl=1 618w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/figure><\/div>\n\n\n\n<p>In case you are wondering, the other 60% is in <code>_find_and_load<\/code>, so not much we can do there. This was for a grid size of around 100&#215;100 though, and the <code>_find_and_load<\/code> overhead decreases as expected as the grid size is increased.<\/p>\n\n\n\n<p>That&#8217;s pretty much it in terms of the recursive approach. Performance is not bad, here are some results for some example runs:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th>Grid Size<\/th><th class=\"has-text-align-left\" data-align=\"left\">Time<\/th><th># of Reservoirs<\/th><th>Time \/ Reservoir<\/th><th>Time \/ Element<\/th><th>Reservoir Frequency<\/th><\/tr><\/thead><tbody><tr><td>10&#215;10<\/td><td class=\"has-text-align-left\" data-align=\"left\">1.528E-05 s<\/td><td>2<\/td><td>7.64E-06 s<\/td><td>1.528E-07 s<\/td><td>0.02<\/td><\/tr><tr><td>100&#215;100<\/td><td class=\"has-text-align-left\" data-align=\"left\">3.417E-03 s<\/td><td>306<\/td><td>1.117E-05 s<\/td><td>3.417E-07 s<\/td><td>0.0306<\/td><\/tr><tr><td>1000&#215;1000<\/td><td class=\"has-text-align-left\" data-align=\"left\">3.657E-01 s<\/td><td>28,425<\/td><td>1.287E-05 s<\/td><td>3.657E-07 s<\/td><td>0.0284<\/td><\/tr><tr><td>10000&#215;10000<\/td><td class=\"has-text-align-left\" data-align=\"left\">8.298E+01 s<\/td><td>2,841,360<\/td><td>2.920E-05 s<\/td><td>8.298E-07 s<\/td><td>0.0284<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The Time\/Element and Time\/Reservoir results are pretty flat, which is probably what I would expect. The average number of reservoirs per element (reservoir frequency) is also reasonably flat, which trends towards 0.0284. It might be interesting to look into this a bit more, especially how it depends on how the grid itself was made and the clustering rules.  #TODO<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Memory Use<\/h2>\n\n\n\n<p>One definite negative to the code above is that even with a shift to storing only active elements in a set, it&#8217;s extremely memory intensive. The final run on a 10,000&#215;10,000 grid was using around 4 Gb of memory at peak, which I think is probably due to two factors:<\/p>\n\n\n\n<p>Firstly, I can&#8217;t use a yield-type idiom because I&#8217;m bouncing around looking at elements before and after the one I am processing. Also, during the clustering process, both the original and clustered grids have to be in memory at once, so I can examine the current state while I build the new state.<\/p>\n\n\n\n<p>Secondly, native Python types are pretty bloated, memory-wise<sup class=\"modern-footnotes-footnote \" data-mfn=\"2\" data-mfn-post-scope=\"000000000f4cd4ce00000000421c9005_41\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-000000000f4cd4ce00000000421c9005_41-2\">2<\/a><\/sup><span id=\"mfn-content-000000000f4cd4ce00000000421c9005_41-2\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"2\"><a href=\"https:\/\/stackoverflow.com\/questions\/1331471\/in-memory-size-of-a-python-structure\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/stackoverflow.com\/questions\/1331471\/in-memory-size-of-a-python-structure<\/a><\/span>. An <code>int<\/code> is 24 bytes and the overhead for a <code>tuple<\/code> is 56 bytes, which means each element needs 104 bytes. The original grid is 15% active, which means for a grid of 10,000 x 10,000 elements 1.5 Gb of RAM is required. Assuming the clustered grid is around 10% active, that&#8217;s 2.5 Gb of memory to store both the original and clustered grid. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Next Steps<\/h2>\n\n\n\n<p>So, pros and cons to this approach, and definitely worth looking at more. However, my plan is to dig into the graph-based approach to see how the code complexity and performance changes.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As I mentioned in my previous post, I have been working on some possible solutions to the following problem: You are working for a company that is responsible for processing survey data for an Oil and Gas company. You are given data that summarizes whether oil was found at a certain grid reference. The assumption [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[10],"tags":[13,14,15,5],"class_list":["post-41","post","type-post","status-publish","format-standard","hentry","category-code","tag-memory","tag-performance","tag-profiling","tag-python"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/andygrigg.com\/index.php?rest_route=\/wp\/v2\/posts\/41","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/andygrigg.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/andygrigg.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/andygrigg.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/andygrigg.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=41"}],"version-history":[{"count":21,"href":"https:\/\/andygrigg.com\/index.php?rest_route=\/wp\/v2\/posts\/41\/revisions"}],"predecessor-version":[{"id":128,"href":"https:\/\/andygrigg.com\/index.php?rest_route=\/wp\/v2\/posts\/41\/revisions\/128"}],"wp:attachment":[{"href":"https:\/\/andygrigg.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=41"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/andygrigg.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=41"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/andygrigg.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=41"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}