M2 internship subject

Dataflow analysis of malicious binary codes. Toward a study of the cartography of functionalities and their correlations.

The Carbone team at Loria has developed, with the support of the High Security Lab (LHS), an innovative method called morphological analysis. This method can detect code similarities, identify functionalities embedded in binary code, and detect malware. The objective of the internship is to rebuild the dataflow graph in order to map the set of functionalities used inside a malicious code.

Detailed subject

Keynote at FPS 2016

A Morphological Approach to Binary Code Analysis

FPS 2016 – 9th International Symposium on Foundations & Practice of Security

Abstract
Binary code analysis is a complex process that today can only be performed by skilled cybersecurity experts, whose workload keeps increasing. Use cases include vulnerability detection, testing, clustering and classification, malware analysis, etc. We develop a tool named Gorille, which is based on the reconstruction of a high-level semantics for the binary code. Control flow graphs provide a fair level of abstraction for dealing with the binary code they represent. After applying graph rewriting rules to normalize these graphs, our software tackles the subgraph search problem in a way that is both efficient and convenient for this kind of graph. This technique is described as morphological analysis because it recognizes the whole shape of the malware.
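To make the two steps named in the abstract more concrete (normalizing a control flow graph by rewriting, then searching for a known subgraph), here is a minimal Python sketch. It is not Gorille's implementation: the graph encoding, the node labels ("inst", "jcc", "call", "ret"), the single rewriting rule and the naive backtracking search are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# A CFG is modelled as {node: (label, [successors])}.
# The labels used here are illustrative, not Gorille's actual node kinds.
Graph = Dict[str, Tuple[str, List[str]]]


def collapse_chains(cfg: Graph) -> Graph:
    """Hypothetical rewriting rule: remove every plain 'inst' node that has
    exactly one successor, redirecting its incoming edges to that successor."""
    g = {n: (lab, list(succ)) for n, (lab, succ) in cfg.items()}
    changed = True
    while changed:
        changed = False
        for node, (label, succ) in list(g.items()):
            if label == "inst" and len(succ) == 1 and succ[0] != node:
                target = succ[0]
                del g[node]
                for other in g:
                    lab, out = g[other]
                    g[other] = (lab, [target if s == node else s for s in out])
                changed = True
                break  # restart the scan on the rewritten graph
    return g


def find_subgraph(pattern: Graph, target: Graph) -> Optional[Dict[str, str]]:
    """Naive backtracking search for an injective, label-preserving mapping
    that sends every edge of `pattern` to an edge of `target`."""
    p_nodes = list(pattern)

    def consistent(mapping: Dict[str, str]) -> bool:
        for p, (_, succ) in pattern.items():
            if p not in mapping:
                continue
            for s in succ:
                if s in mapping and mapping[s] not in target[mapping[p]][1]:
                    return False
        return True

    def extend(i: int, mapping: Dict[str, str]) -> Optional[Dict[str, str]]:
        if i == len(p_nodes):
            return dict(mapping)
        p = p_nodes[i]
        for t in target:
            if t in mapping.values() or target[t][0] != pattern[p][0]:
                continue
            mapping[p] = t
            if consistent(mapping):
                found = extend(i + 1, mapping)
                if found is not None:
                    return found
            del mapping[p]
        return None

    return extend(0, {})


# Toy CFG of a sample, and a toy signature to look for.
sample: Graph = {
    "a": ("inst", ["b"]),
    "b": ("inst", ["c"]),
    "c": ("jcc", ["d", "e"]),
    "d": ("call", ["f"]),
    "e": ("inst", ["f"]),
    "f": ("ret", []),
}
signature: Graph = {
    "x": ("jcc", ["y", "z"]),
    "y": ("call", ["z"]),
    "z": ("ret", []),
}

normalized = collapse_chains(sample)
match = find_subgraph(signature, normalized)
```

On this toy input the signature is found only after normalization, because the rewriting removes the sequential node that separated the conditional jump from the return. A real tool would need a far more efficient search than this exponential backtracking.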

That being said, some pitfalls still need to be considered. First of all, the output can only be as good as the input data. Static disassembly is known to be unable to recover a perfect control flow graph in general, since the problem is undecidable. As a matter of fact, malware makes heavy use of obfuscation techniques such as opaque predicates to hide its payload and confuse analyses. Dynamic analysis should therefore be used along with static disassembly to combine their strengths. Another dangerous pitfall, feared by every expert, is the false positive rate: false alarms make experts waste precious time assessing whether the threat is real. Shared binary code is not always relevant, since many programs embed statically linked standard libraries. Gorille's solution to this issue lies in graph rewriting. By rewriting classic subgraphs into configuration-based special nodes, we obtain an even higher-level abstraction of the control flow graph.
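As an illustration of the opaque predicates mentioned above, here is a minimal Python sketch of the idea. The function names and the specific predicate are assumptions picked for the example; real malware applies such constructs at the machine-code level.

```python
def opaque_predicate(x: int) -> bool:
    # x * (x + 1) is a product of two consecutive integers, hence always even,
    # so this test is true for every integer x.
    return (x * (x + 1)) % 2 == 0


def obfuscated_payload(x: int) -> str:
    if opaque_predicate(x):
        return "payload"  # the only path taken at run time
    return "decoy"        # dead code that still shows up as a branch in a static CFG


# Checking a range of inputs confirms that the branch never varies.
always_true = all(opaque_predicate(x) for x in range(-1000, 1000))
```

A static disassembler that cannot prove the predicate constant keeps both branches in the control flow graph, whereas a dynamic trace only ever observes the first one. This is one reason the two approaches complement each other.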