LANGSEC: Taming the Weird Machines
Introduction
I want to get some of my opinions on the current state of computer security out there, but first I want to highlight one of the most exciting and, in my view, most promising recent developments in security: language-theoretic security (LangSec). If you are already familiar with the concepts, feel free to skip the next few paragraphs of background and jump to my analysis; otherwise, buckle up for a little ride!
Background
If I were to distill the core of the LangSec movement into a single thesis it would be this: the complexity of our computing systems (both software and hardware) has reached such a degree that data must be treated as formally as code. A concrete example is return-oriented programming (ROP), where instead of executing shellcode loaded into memory by the attacker, a number of gadgets are found in existing code (such as libc) and their addresses are chained together on the stack; as the ret instruction is repeatedly executed, the semantics of the gadgets are carried out. This hybrid execution environment, in which existing code is driven by a buffer overflow of data, is one example of a weird machine.
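To make the entry point concrete, here is a minimal sketch (my own, not from the LangSec literature) of the kind of unchecked copy that hands an attacker control of the stack; the function and buffer names are hypothetical:

```c
#include <string.h>

/* Hypothetical vulnerable routine: attacker-controlled input is copied into a
 * fixed-size stack buffer with no length check. Overflowing buf overwrites the
 * saved return address with the address of the first ROP gadget; each gadget
 * ends in ret, which pops the next gadget address from the attacker-supplied
 * chain on the stack. */
void handle_request(const char *input, size_t input_len)
{
    char buf[64];
    memcpy(buf, input, input_len);   /* no bounds check: classic stack overflow */
    /* ... process buf ... */
}
```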
Such weird machines crop up in many places: the Intel x86 MMU has been shown to be Turing-complete, the metadata of ELF executable files can drive execution during the loading and dynamic-linking stage, and so on. This highlights the fact that data can be treated as instructions or code on these weird machines: much like Java bytecode is mere data to an x86 CPU yet is interpreted as code by the JVM. The JVM is a formal, explicit machine, much like the x86 CPU; weird machines, on the other hand, are ad hoc, implicit, and generally not intentionally created. Many exploits are simply shellcode developed for a weird machine instead of the native CPU.
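To make the "data is code to some machine" idea concrete, here is a toy sketch (my own illustration, with invented opcodes): a few bytes of data completely determine what a trivial interpreter computes, just as ELF metadata drives the loader or a ROP chain drives the CPU's ret logic.

```c
#include <stdio.h>

/* Toy "machine": the byte array is plain data, yet it fully determines what
 * this interpreter computes. Invented opcodes: 0x01 = push next byte,
 * 0x02 = add top two stack items, 0x00 = halt and return top of stack. */
static int run(const unsigned char *program)
{
    int stack[16], sp = 0;
    for (const unsigned char *pc = program; ; pc++) {
        switch (*pc) {
        case 0x01: stack[sp++] = *++pc; break;
        case 0x02: sp--; stack[sp - 1] += stack[sp]; break;
        case 0x00: return stack[sp - 1];
        default:   return -1;   /* reject unknown opcodes explicitly */
        }
    }
}

int main(void)
{
    const unsigned char data[] = { 0x01, 2, 0x01, 3, 0x02, 0x00 };
    printf("%d\n", run(data));   /* prints 5: the data acted as code */
    return 0;
}
```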
As a result, a computing system contains countless weird machines at various positions in the system (some overlapping), blurring the line of what the machine is and where the computation is actually occurring. From this thesis, two general action plans are laid out:
- Formally parse data (no ad hoc, implicit parsing)
- Use as limited a set of "tools" from the toolbox as possible (e.g. minimize large library inclusion, shrink protocol/data structure complexity, restrict your language usage to safer sub-sets, etc...)
The former is the core of LangSec: it means treating data as strictly as code is treated during compilation or interpretation, with no allowance for ambiguity. Every program that takes input in some form performs parsing on that data, even if the parser is trivial. Errors and weird machines arise when that parsing is performed in an ad hoc or implicit manner. Ad hoc parsing is generally built around a programming narrative (i.e. what is in the developer's mind while coding) of a benign user: there is no formal specification, and the error states for invalid input are non-existent or ill-defined, covering only the cases the developer happened to think of while programming. Because this relies so heavily on the developer having a priori knowledge of all possible invalid inputs and writing specific checks for each, all an attacker needs to do is find a single input class the developer forgot. A formal specification of the input format, translated into an explicit parser, lessens the degree to which the developer alone is in charge of the security of the input-processing routine. An explicit parser can also be formally verified against the specification far more easily, catching implementation errors.
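As an illustration (my own sketch, not taken from the LangSec papers), consider a hypothetical length-prefixed message format: an ad hoc parser trusts the embedded length field, while an explicit recognizer checks the whole input against the specification before any processing happens. All names and the format itself are invented.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical wire format: [1-byte payload length][payload bytes]. */

/* Ad hoc style: trusts the embedded length field, so a message whose declared
 * length exceeds what was actually received reads out of bounds -- the input
 * class the developer forgot. */
void parse_ad_hoc(const uint8_t *msg, uint8_t *out)
{
    uint8_t len = msg[0];
    memcpy(out, msg + 1, len);                  /* len is attacker-controlled */
}

/* Explicit recognizer: the input is accepted only if it matches the
 * specification exactly; every other input is rejected before any
 * processing takes place. */
int parse_strict(const uint8_t *msg, size_t msg_len,
                 uint8_t *out, size_t out_len)
{
    if (msg_len < 1) return -1;                 /* missing length byte */
    uint8_t len = msg[0];
    if ((size_t)len + 1 != msg_len) return -1;  /* length must match exactly */
    if (len > out_len) return -1;               /* payload must fit the destination */
    memcpy(out, msg + 1, len);
    return (int)len;
}
```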
The latter action plan, or recipe for improvement, is to restrict the tools available to a program to the bare minimum. This can entail removing unneeded libraries and modules (e.g. the Heartbleed bug lived in a module that was not needed in many deployments), sandboxing the application following the least-privilege principle, or restricting the language to a safer subset (e.g. SafeHaskell). By shrinking the tools at your disposal to only those you need, you minimize the tools the attacker has in the event of an exploit. Safer runtime environments and language subsets can also provide finer-grained information to an OS or reference monitor, allowing a compromise to be detected earlier in the "cyber kill chain".
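One concrete way to shrink the toolbox on Linux is seccomp. The sketch below (mine, with error handling and real application logic omitted) enters strict mode after setup, so the kernel thereafter allows only read, write, exit, and sigreturn system calls; even a successful exploit then has very little machinery left to drive.

```c
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

int main(void)
{
    /* Do all privileged setup (open files, sockets, etc.) first... */

    /* ...then enter strict seccomp: from here on only read(), write(),
     * exit() and sigreturn() are permitted; any other system call kills
     * the process, so an attacker who gains code execution inherits a
     * drastically smaller toolbox. */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
        return 1;

    const char msg[] = "running with a minimal syscall set\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);

    /* exit_group(2) is not on the strict whitelist, so terminate with the
     * raw exit(2) syscall rather than returning from main(). */
    syscall(SYS_exit, 0);
}
```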
Thoughts
This concept is similar (though much more tractable to deploy in practice) to the programming paradigm of total functional programming, in which all programs are provably terminating (and therefore not Turing-complete) and every function must produce a result for every possible input. In general, programs are "useful" because of their ability to take input, perform computations on that input, and generate output in a designated format. To fulfill these requirements, a program needs to fully parse and understand the input it is provided, without ambiguity, and handle every possible input gracefully.
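A loose analogue in an imperative setting (my own illustration, not part of the LangSec or total-programming literature) is a function whose result type forces a defined answer for every possible input, including the ones a benign-user narrative would never send:

```c
#include <stdbool.h>

/* The result type makes the "not a digit" case explicit, so the caller is
 * forced to handle it: there is no input for which the behaviour is
 * unspecified, and no path on which the function fails to return. */
typedef struct {
    bool ok;      /* false: the input byte was not a decimal digit */
    int  value;   /* meaningful only when ok is true */
} digit_result;

digit_result parse_digit(char c)
{
    if (c >= '0' && c <= '9')
        return (digit_result){ .ok = true, .value = c - '0' };
    return (digit_result){ .ok = false, .value = 0 };
}
```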
The next step (number 2 above) is to minimize the chance of an execution going astray. This can be realized by minimizing the size of the execution "playground", or using as little rope as possible to hang yourself. By restricting your programming constructs, included libraries, and so on, you give the attacker a smaller and more restrictive arena in which to extend and persist a foothold.
In conclusion, LangSec dials in on the root cause of the security problems sprouting up around us and fundamentally changes the "cyber playing-field": instead of the cat-and-mouse game between defense and offense fought daily in government and industry, it ties the theoretical foundations of security to a practical path forward. I look forward to watching this field expand and become a best practice of the software engineering discipline.
Cyber-security Philosopher and Boffin