The Mobile Isolation era begins: Smart DOM

Gautam Altekar

March 20, 2021

Mobile Isolation from Menlo Security features Smart Document Object Model (DOM), a next-generation browser remoting technology designed to deliver a best-in-class browsing experience despite the unique challenges of the mobile environment and an ever-expanding web platform.

Remoting technology is a cornerstone of any Remote Browser Isolation (RBI) system: it enables the system to offload the most dangerous phases of rendering a web page (fetch and execution) to an Isolated Browser in the cloud, while still allowing the user to see the results of that rendering and interact with them on the Endpoint Browser.

diagram showing how endpoint and isolated browsers work

Remote Browser Isolation. The Menlo Isolation Platform offloads potentially unsafe page fetch and execution to an Isolated Browser while providing a transparent and interactive browsing experience on the user’s Endpoint Browser. No client installs required.

Under the hood, Menlo’s remoting technology, known as Adaptive Clientless Rendering (ACR)^[1], uses a technique called DOM Mirroring to render content on the endpoint. It works by mirroring the Isolated Browser’s DOM tree, excluding dangerous content such as <script> elements, to a JavaScript Thin Client running on the Endpoint Browser.

illustration of endpoint and isolated browsers mirroring DOM tree

DOM Mirroring is a remoting technique used in the Menlo Isolation Platform. It works by mirroring safe DOM elements and associated resources to the Endpoint Browser.
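
The filtering step behind DOM Mirroring can be illustrated with a small model. The sketch below is written in Python purely for illustration (the real Thin Client is JavaScript), and the `Node` type, the `ACTIVE_TAGS` deny-list, and the attribute rules are all assumptions of this example rather than Menlo's actual filter. It copies a tree while dropping `<script>`-like elements, inline event handlers, and `javascript:` URLs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical, heavily simplified DOM node."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    text: str = ""


# Assumed deny-list for this sketch only; a production filter would be stricter.
ACTIVE_TAGS = {"script", "object", "embed"}


def mirror(node: Node) -> Optional[Node]:
    """Return a copy of the subtree with active content removed, or None."""
    if node.tag.lower() in ACTIVE_TAGS:
        return None
    # Drop inline event handlers (onclick, onload, ...) and javascript: URLs.
    safe_attrs = {
        name: value
        for name, value in node.attrs.items()
        if not name.lower().startswith("on")
        and not value.strip().lower().startswith("javascript:")
    }
    mirrored = (mirror(child) for child in node.children)
    safe_children = [child for child in mirrored if child is not None]
    return Node(node.tag, safe_attrs, safe_children, node.text)
```

Calling `mirror` on a page root yields a tree that keeps the structure and text of the original but contains nothing executable under the rules assumed here.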

DOM Mirroring pioneered a secure and transparent browsing experience that surpasses that of traditional pixel-based remoting technologies. Its security stems from the guarantee that no active content from the page reaches the endpoint, which disrupts the kill chain of modern-day browser exploits. Its superior browsing experience derives from its ability to (a) offload the bulk of interactive animation rendering work, such as scrolling, to the Endpoint Browser, enabling a low-latency, network-independent interactive user experience, and (b) retain the semantic visibility that the broader browser ecosystem (including screen readers, password managers, and built-in browser functionality such as text search and copy/paste) needs in order to function.

Now, as we scale our browser ecosystem to include mobile browser support, an opportunity emerges to evolve our DOM-based technology to meet the challenges of the Mobile Isolation era. These challenges are two-fold:

Keeping up with the evolving web platform. Mobile introduces a new set of browsers that places additional strain on DOM Mirroring’s ability to keep up with ever-expanding web platform features. For instance, emerging DOM/CSS capabilities such as Shadow DOM, CSS Layout API, and CSS Paint API all require dedicated DOM Mirroring support (see example below), and that effort is multiplied after accounting for cross-browser variations in these APIs as well as the need to test and maintain each such API across browsers.
Power and network-efficient rendering. Mobile introduces a broad range of devices, many of which are constrained in power, performance, and network bandwidth and latency. Optimizing for these characteristics requires minimizing the amount of rendering work done on the endpoint as well as careful management of network resources, neither of which was a primary consideration in the original design of DOM Mirroring.

Smart DOM: The Next Evolution

To address the challenges, Menlo’s Mobile Isolation introduces Smart DOM: the next iteration of our pioneering DOM-based rendering technology. Like DOM Mirroring, Smart DOM leverages the power of the DOM to provide a transparent user experience while retaining the security benefits that come with executing active content away from the endpoint. But, unlike DOM Mirroring—which focuses on mirroring the DOM tree and associated resources—Smart DOM focuses on mirroring the page’s Layer Tree representation and then transforming it, at the Endpoint Browser, into a safe DOM.

The Layer Tree representation and its subsequent translation to a full-fledged DOM enables Smart DOM to provide secure, accurate, and efficient rendering across a range of browsers and a constantly evolving web platform.

Smart DOM Architecture. The key idea is to mirror the Compositor-generated Layer Tree to the Endpoint Browser. A JavaScript Thin Client running on the Endpoint Browser transforms the Layer Tree to a semantically equivalent DOM that is then composited and rasterized using the Endpoint Browser’s GPU-optimized machinery.

The Layer Tree is a Chromium post-layout structure that represents the page as a tree of Layers, each of which precisely defines what content to draw and where.

Layers. Chromium decomposes the page into a set of layers, each of which captures a subtree of DOM content. Layers enable independent rasterization, and they enable compositing to be offloaded to a remote rendering device, which in our case is the Endpoint Browser.

Non-video content in the Layer Tree, held in Layers of type cc::PictureLayer, is described by a tree of primitive vector-based drawing operations known as a Display List.

diagram of display items and display list

Display List. A vector-based drawing associated with cc::PictureLayer objects in the Chromium Layer Tree representation. It enables generation of the final bitmap (rasterization) to be offloaded to the Endpoint Browser, providing network-efficient transfer and crisp results regardless of the endpoint’s resolution. Each Display List consists of a handful of Display Items, a.k.a. drawing commands, that are well-suited for DOM or GPU-accelerated rasterization.
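
To make the shape of these structures concrete, here is a minimal Python model of a Layer Tree with Display Lists attached. It is a sketch under stated assumptions: the class and field names are invented for this example, the Display List is flattened to a simple list rather than a tree, and only the two item kinds named in this article (DrawText and DrawImage) are modeled. Chromium's actual cc classes are considerably richer.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple, Union


@dataclass
class DrawText:
    x: float
    y: float
    text: str


@dataclass
class DrawImage:
    x: float
    y: float
    width: float
    height: float
    resource_id: str  # hypothetical handle to an image sent separately


DisplayItem = Union[DrawText, DrawImage]


@dataclass
class Layer:
    kind: str  # "Layer", "PictureLayer" or "VideoLayer"
    bounds: Tuple[float, float, float, float]  # x, y, width, height
    display_list: List[DisplayItem] = field(default_factory=list)
    children: List["Layer"] = field(default_factory=list)


def walk(layer: Layer) -> Iterator[Layer]:
    """Yield every layer in the tree, parents before children."""
    yield layer
    for child in layer.children:
        yield from walk(child)


def count_items(root: Layer) -> int:
    """Total number of drawing commands in the whole tree."""
    return sum(len(layer.display_list) for layer in walk(root))
```

Because each item is a small vector command rather than a bitmap, a tree like this can be serialized compactly and rasterized at whatever resolution the endpoint uses.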

The Layer Tree^[1] and associated Display Lists are produced by the Chromium browser in the final stages of its rendering pipeline^[2]. While Chromium natively uses the Layer Tree for efficiently rendering content on the local GPU, we’ve adapted our Chromium-based Isolated Browser to redirect the Layer Tree to the Endpoint Browser, in effect enabling remote GPU-optimized rendering.

To render the Layer Tree, Smart DOM’s JavaScript-based Thin Client employs a novel transformation procedure, called DOM Reconstruction, that converts the Layer Tree back into DOM form. Once in DOM form, the Endpoint Browser’s built-in GPU-optimized rendering machinery does all the remaining work of generating user-visible pixels.

------------

^[1] More recent versions of Chromium have refactored the Layer Tree into a Layer List plus a set of Property Trees, but given that the two representations are conceptually equivalent, we’ll speak only of Layer Trees to simplify the discussion.

diagram of layer tree, reconstructed DOM tree, and rendered result

DOM Reconstruction. The Thin Client transforms the Layer Tree back to a safe DOM that differs from the page’s original DOM yet is semantically equivalent to it. cc::Layer, cc::VideoLayer, and cc::PictureLayer are the different types of Layer objects in the Chromium Compositor (cc).
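
A rough feel for DOM Reconstruction at the layer level can be had from the following Python sketch, which turns a toy Layer Tree into an HTML string. Everything here is an assumption made for illustration: the `Layer` fields, the choice of absolutely positioned `<div>` and `<video>` elements, the `data-layer` attribute, and the use of a plain string as a stand-in for a PictureLayer's Display List. The actual Thin Client is JavaScript and builds live DOM nodes.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class Layer:
    kind: str            # "Layer", "PictureLayer" or "VideoLayer"
    x: float             # position relative to the parent layer (assumed)
    y: float
    width: float
    height: float
    content: str = ""    # stand-in for a PictureLayer's Display List
    children: List["Layer"] = field(default_factory=list)


def reconstruct(layer: Layer) -> str:
    """Emit one positioned element per layer, recursing into children."""
    style = (
        f"position:absolute;left:{layer.x:g}px;top:{layer.y:g}px;"
        f"width:{layer.width:g}px;height:{layer.height:g}px"
    )
    tag = "video" if layer.kind == "VideoLayer" else "div"
    # Text is always escaped, so page content can never become markup.
    inner = escape(layer.content)
    inner += "".join(reconstruct(child) for child in layer.children)
    return f'<{tag} data-layer="{escape(layer.kind)}" style="{style}">{inner}</{tag}>'
```

The output uses only a handful of element types and inline geometry, which mirrors the point made above: the reconstructed DOM is different from the page's original DOM but draws the same thing.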

The concept of mirroring Compositor data structures such as the Layer Tree and Display Lists is not new to Smart DOM: it was introduced by the open source Chromium Blimp project as far back as 2015 for the purposes of Chromium-to-Chromium remoting^[3]. However, we believe that Smart DOM, with its unique DOM Reconstruction technique, is the first Compositor-based remoting system compatible with unmodified, thin-client-based endpoint browsers, a critical requirement for providing transparent Remote Browser Isolation.

In reality, Smart DOM has long been a mainstay of Menlo’s Adaptive Clientless Rendering system, rendering pages for desktop browsers behind the scenes when performance and accuracy requirements gave it the clear upper hand over DOM Mirroring. After years of maturation, we believe that Smart DOM is now ready to fulfil its promise, starting with the mobile realm.

Remoting for the Mobile Isolation Era

DOM Reconstruction confers Smart DOM with key benefits that make it ideal for mobile browsers.

Rendering Accuracy that Scales with the Web Platform

Smart DOM provides accurate rendering that is agnostic to the particular endpoint browser in use and the web features used by the page, for two reasons. First, the Layer Tree and associated Display Lists are simple, lowest-common-denominator representations that have an equivalent representation in the well-supported subset of the DOM API. For example, the DrawImage item corresponds to the equivalent DOM <img> element, and the DrawText item maps to an equivalent DOM text node.

More importantly, because they are rendering structures that must remain simple enough for GPU compositing and rasterization, the Layer Tree and Display Lists remain stable even as the DOM and CSS APIs grow in complexity. This property is critical for supporting the latest web platform features, such as Shadow DOM, in a timely fashion.

illustration of how isolated browser renders shadow elements

DOM API Evolution Poses a Challenge for DOM Mirroring. Shadow DOM, a new web platform API, requires explicit emulation support for it to render well across all platforms with DOM Mirroring. By contrast, Smart DOM remoting operates at the Compositor layer and therefore does not require special provisions for rendering Shadow DOM.

Power-Efficient Rendering

Targeting a DOM-tree representation of the page, even if in reconstructed form, enables Smart DOM to inherit a decade’s worth of DOM rendering advances that ultimately improve CPU utilization and reduce overall power draw. Examples of such advances include GPU-accelerated compositing and rasterization, efficient GPU memory management, and pipelined drawing. What’s more, rendering continues to be a core focus area as browser makers invest aggressively in improving the overall usability of the web platform.

By contrast, a WebGL- or Canvas API-based remoting approach would need to implement these advances from scratch in the constrained environment of the browser, or employ heavyweight techniques such as compiling Chromium (or Skia) to WebAssembly^[4], which create a different set of challenges. WebGL, in particular, is often disabled in virtualized environments, further diminishing its applicability, and most browser Canvas API implementations are not optimized for rendering multi-layered web pages.

Beyond the inherited savings of using existing rendering machinery, Smart DOM naturally saves on CPU, and by extension power, by offloading JavaScript execution and, more uniquely, the bulk of page layout work, to the Isolated Browser. As the product of the Layout phase in the Isolated Browser’s rendering pipeline, the Layer Tree contains the precise geometries of all page content. The Endpoint Browser must still perform some layout in order to render the DOM that is reconstructed from the Layer Tree, but the most CPU-intensive phases of rendering, such as style resolution and geometry calculation, have already been done on the Isolated Browser.

Prioritized Bandwidth Allocation

As a post-layout rendering structure, the Layer Tree enables Smart DOM to accurately determine what content is most relevant to the user, and to deprioritize or even omit portions of the Layer Tree update that are not of interest. Selective deprioritization enables Smart DOM to minimize network usage in the interest of optimal battery life while preserving the user experience.

Here are some ways in which Smart DOM prioritizes content:

The Layers and Display Items that are actually visible in the user’s viewport are given higher bandwidth priority over those that are not.
Display List resources such as images and glyphs are sent to the Endpoint Browser only when those resources are drawn.
Layers and Display Items that have greater value to the user may be prioritized over those that have less value. For instance, expensive background animations and ad frames may be throttled down while the user is idle or focused on interacting with other portions of the page.
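
The first and third strategies above can be sketched as a simple ordering function. This Python model is an assumption-laden illustration, not Smart DOM's actual scheduler: the `LayerUpdate` type, the `low_value` flag, and the rule of ranking by visible area are all invented for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rect:
    x: float
    y: float
    w: float
    h: float


@dataclass
class LayerUpdate:
    name: str
    bounds: Rect
    low_value: bool = False  # e.g. an ad frame or background animation


def visible_area(rect: Rect, viewport: Rect) -> float:
    """Area of the intersection between rect and the viewport."""
    w = min(rect.x + rect.w, viewport.x + viewport.w) - max(rect.x, viewport.x)
    h = min(rect.y + rect.h, viewport.y + viewport.h) - max(rect.y, viewport.y)
    return max(w, 0) * max(h, 0)


def prioritize(
    updates: List[LayerUpdate], viewport: Rect, throttle_low_value: bool = False
) -> Tuple[List[LayerUpdate], List[LayerUpdate]]:
    """Split updates into (send_now, deferred); send_now is largest-visible first."""
    send_now, deferred = [], []
    for update in updates:
        area = visible_area(update.bounds, viewport)
        if area == 0 or (throttle_low_value and update.low_value):
            deferred.append(update)
        else:
            send_now.append((area, update))
    send_now.sort(key=lambda pair: pair[0], reverse=True)
    return [update for _, update in send_now], deferred
```

This kind of decision is only possible because post-layout geometry is known on the Isolated Browser; the same ranking cannot be computed from an unlaid-out DOM tree.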

By contrast, effective sub-page bandwidth prioritization is harder to accomplish with DOM Mirroring. The potential dependencies between DOM elements and CSS are the core challenge. Without accounting for these dependencies, throttling or omitting updates may result in broken rendering.

Smart DOM Inherits the Best of DOM Mirroring

Security Against Zero-Days

As with DOM Mirroring, Smart DOM does not send active content of any kind to the endpoint, thus breaking the kill chain of modern day exploits. The rationale is straightforward: the Layer Tree is a compiled representation of renderable DOM and CSS artifacts. Active content, and <script> elements in particular, are not renderable and therefore are naturally filtered out during the Layout stage of the Isolated Browser’s rendering pipeline.

Security is further bolstered by the fact that the reconstructed DOM offers little opportunity to trigger exploits on the Endpoint Browser. It uses a minimal and well-used subset of the DOM/CSS API regardless of the content in the DOM of the isolated page, which means that a malicious page has very little control over the structure and content of the reconstructed DOM tree.

Finally, Smart DOM inherits DOM Mirroring’s defense-in-depth measures. For instance, to handle the case of a compromised Isolated Browser, mechanisms such as Content Security Policy further protect the endpoint from malicious script execution.

A Native Browsing Experience

Besides accuracy of rendering, a key element of a transparent user experience is the buttery-smooth feeling of 60 FPS scrolling. DOM Mirroring does this well: armed with the DOM tree and CSS, the Endpoint Browser performs most animations, including scrolling, pinch-zoom, and CSS animations, entirely locally.

Smart DOM, too, is designed to render animations entirely locally on the Endpoint Browser, but without mirroring the DOM tree. For instance, the Layer Tree contains everything the Thin Client needs to identify the Layer that is the target of a scroll action, a process known as Hit Testing. Once the target is identified, the Thin Client adjusts the scroll offset component of the target Layer’s transform and applies it via a CSS transform update on the corresponding reconstructed DOM node, completing the local scroll animation.
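
The hit-test-then-transform sequence can be modeled in a few lines. The Python sketch below is a simplification under several assumptions: all layer rectangles are given in viewport coordinates, scrolling is vertical only, later children paint on top of earlier ones, and the `Layer` fields and `translate3d` output format are choices made for this example rather than details of the real Thin Client.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Layer:
    name: str
    x: float
    y: float
    width: float
    height: float
    scrollable: bool = False
    scroll_y: float = 0.0
    content_height: float = 0.0
    children: List["Layer"] = field(default_factory=list)


def hit_test(layer: Layer, px: float, py: float) -> Optional[Layer]:
    """Return the topmost scrollable layer under the point, if any."""
    inside = (layer.x <= px < layer.x + layer.width
              and layer.y <= py < layer.y + layer.height)
    if not inside:
        return None
    # Later children are assumed to paint on top, so test them first.
    for child in reversed(layer.children):
        hit = hit_test(child, px, py)
        if hit is not None:
            return hit
    return layer if layer.scrollable else None


def scroll_by(layer: Layer, dy: float) -> str:
    """Clamp the new scroll offset and return the CSS transform to apply."""
    max_scroll = max(layer.content_height - layer.height, 0)
    layer.scroll_y = min(max(layer.scroll_y + dy, 0), max_scroll)
    return f"translate3d(0px, {0 - layer.scroll_y:g}px, 0px)"
```

In a browser, the returned string would be assigned to the `transform` style of the reconstructed node for the target layer, which lets the endpoint's compositor move the content without a round trip to the Isolated Browser.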

Compatibility with the Broader Browser Ecosystem

A key advantage of DOM-based rendering technologies is that they play well with the broader browser ecosystem. Password managers and native browser features like copy/paste all rely on the semantics exposed by the DOM to function correctly. For instance, in order to trigger password auto-fill, password managers look for <input type="password"> elements in the DOM. To give another example, the browser looks for DOM text nodes in order to trigger native text selection and to provide native search highlighting.

In contrast to DOM-based rendering, Canvas- and WebGL-based remoting technologies break compatibility because they do not render a semantically rich DOM. While shadowing and emulation techniques can fill some of the gaps, the challenges of dealing with an opaque bag of pixels devoid of semantics remain an obstacle to a truly native experience.

Like DOM Mirroring, Smart DOM is designed to work well with the broader browser ecosystem. It does this by transforming the Layer Tree into a semantically rich DOM, i.e., one in which DOM text nodes expose text semantics, anchor elements expose link semantics, <input> elements trigger password manager auto-fill, and so on.

The key idea behind the semantically rich DOM is a rendering representation we term a Semantic Display List (SDL). A Semantic Display List augments the classic notion of a Display List with DOM-level properties of the content being drawn. Examples of such properties include the associated DOM text, input element type, and associated links (properly sanitized). Generating a semantically rich DOM, then, is a matter of emitting the DOM nodes corresponding to the semantic properties.
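
As a rough illustration of how semantic properties might drive DOM emission, the Python sketch below attaches text, link, and input-type properties to a drawing item and emits the matching element. The `SemanticItem` type, the scheme allow-list, and the input-type allow-list are assumptions of this example; the article does not specify the SDL format or the sanitization rules Menlo applies.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional
from urllib.parse import urlparse


@dataclass
class SemanticItem:
    """Hypothetical Display Item augmented with DOM-level properties."""
    text: str = ""
    link: Optional[str] = None
    input_type: Optional[str] = None


# Assumed allow-lists for this sketch only.
SAFE_SCHEMES = {"http", "https"}
SAFE_INPUT_TYPES = {"text", "password", "email", "search"}


def sanitize_link(url: str) -> Optional[str]:
    """Keep only links with an allowed scheme; drop javascript: and the like."""
    return url if urlparse(url).scheme in SAFE_SCHEMES else None


def emit_dom(item: SemanticItem) -> str:
    """Emit the DOM fragment that exposes the item's semantics."""
    if item.input_type is not None:
        kind = item.input_type if item.input_type in SAFE_INPUT_TYPES else "text"
        return f'<input type="{kind}">'
    text = escape(item.text)
    href = sanitize_link(item.link) if item.link else None
    if href:
        return f'<a href="{escape(href)}">{text}</a>'
    return text  # becomes a plain DOM text node
```

With output like this, a password manager sees a real password field and the browser sees real text nodes and anchors, even though none of the page's original markup was sent to the endpoint.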

Conclusion

Smart DOM pushes the boundaries of RBI remoting technology by enabling accurate and efficient rendering that is well suited for mobile browsers and an ever-expanding web platform, all while retaining the security and user-experience benefits of its predecessor. DOM Reconstruction, the core technique underlying Smart DOM, shows it is feasible to achieve DOM-based rendering and its benefits without actually mirroring the DOM. Like other components of Menlo Security’s Isolation Platform, Smart DOM is the product of Menlo’s nearly decade-long experience in developing and deploying RBI for millions of users across hundreds of organizations. We look forward to sharing more insights about Smart DOM as well as the broader Isolation Platform in future articles.

Acknowledgements

Some of the figures in this article, particularly those related to the Chromium rendering pipeline, were adapted from or inspired by Steve Kobes’ excellent “Life of a Pixel” presentation^[2], which is licensed under the Creative Commons Attribution 2.5 License.

References

1. Whitepaper: Adaptive Clientless Rendering. https://info.menlosecurity.com/Menlo-Security-Cloud-Platform-Powered-by-Isolation-WhitePaper.html

2. Life of a Pixel by Steve Kobes (Chromium Project): http://bit.ly/lifeofapixel, 2020.

3. The Chromium Blimp Project (discontinued): https://chromium.googlesource.com/chromium/src.git/+/49.0.2623.112/blimp/, 2015.

4. Skia CanvasKit: https://skia.org/user/modules/canvaskit
