Using Benford's law to understand code regularity and detect malware
This is a rough outline of an idea I am experimenting with.
Several active research areas in computer science investigate recurring patterns and mathematical properties underlying program behavior. Observations about program behavior have motivated the design and implementation of various algorithms. For example, Chris Okasaki’s algorithms on Braun trees take advantage of an observation about the structure and arrangement of nodes. Similarly, various meta-studies examine the correlation between the semantics of certain programming primitives and the occurrence of runtime errors. Other studies examine the percentage of language primitives that are likely to be present in a given program. While these examples are diverse, they illustrate that the more we learn about the properties of programs, the more accurately we can analyze software.
Examining properties that give rise to regularity could also help us detect malware. If code has properties that produce expected patterns, determining which deviations from those patterns signal malicious activity could be useful for threat modeling and abuse mitigation.
What is Benford’s Law?
Benford’s law describes a pattern found in many numerical datasets, both those created by humans and those that occur in nature, particularly when the values span several orders of magnitude. According to Benford’s law, roughly 30% of the numbers in such a dataset start with 1 and about 17.6% start with 2; the frequency of each subsequent leading digit decays logarithmically down to 9, which is the leading digit of only ~4.6% of the numbers.
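To make the percentages concrete: the standard statement of the law is that a leading digit d (from 1 to 9) occurs with probability log10(1 + 1/d). The short Python sketch below tabulates those expected frequencies. The function name benford_expected is my own choice for illustration and is not taken from any library.

```python
import math


def benford_expected():
    """Return {digit: probability} for leading digits 1..9 under Benford's law.

    Uses the standard formula P(d) = log10(1 + 1/d).
    The function name is hypothetical, chosen for this sketch.
    """
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


if __name__ == "__main__":
    # Print each leading digit with its expected share of the dataset.
    for digit, probability in benford_expected().items():
        print(f"{digit}: {probability:.1%}")
```

The nine probabilities sum to 1, since the product of (1 + 1/d) for d from 1 to 9 is exactly 10.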
Although it holds consistently in practice, this law is counter-intuitive: one might expect each of the nine possible leading digits to occur with equal probability. Benford’s law has been used to analyze elections, detect tax fraud, identify spammy bot accounts on social platforms, and distinguish deepfakes from real image and video data. It also shows up in nature: in earthquakes, the brightness of gamma rays that reach Earth as recorded by the Fermi space telescope, the rotation rates of dead stars, and infectious disease numbers reported to the World Health Organization.
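Applications like fraud detection generally come down to measuring how far a dataset's observed leading digits stray from the Benford distribution. One common way to do that is Pearson's chi-squared statistic, sketched below in Python. This is only an illustration under my own assumptions: the inputs are integers, zeros are skipped, and the names leading_digit and benford_chi_squared are hypothetical. It is not a description of how any particular fraud or malware detector works.

```python
import math
from collections import Counter


def leading_digit(n):
    """Return the first decimal digit of a non-zero integer (sign ignored)."""
    return int(str(abs(n))[0])


def benford_chi_squared(values):
    """Pearson chi-squared statistic of observed leading digits vs. Benford.

    Assumes `values` is an iterable of integers; zeros are skipped because
    they have no leading non-zero digit. Larger results mean a larger
    deviation from the Benford distribution.
    """
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    total = len(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)  # Benford's expected count
        chi2 += (counts.get(d, 0) - expected) ** 2 / expected
    return chi2


if __name__ == "__main__":
    powers_of_two = [2 ** k for k in range(200)]  # known to follow Benford closely
    uniform_digits = list(range(1, 1000))         # every leading digit equally common
    print("powers of two:", benford_chi_squared(powers_of_two))
    print("1..999:       ", benford_chi_squared(uniform_digits))
```

With nine digit categories the statistic has 8 degrees of freedom, so a value above roughly 15.5 would reject the Benford hypothesis at the 5% level. Whether a threshold like that is meaningful for numbers extracted from code is exactly the open question here.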