Research field: Analysis
Venue: ACM Transactions on Software Engineering and Methodology, Volume 34, Issue 6
Finding similar code is important for software engineering, intellectual property protection, and security. Adversaries increasingly defeat similar-code detection through obfuscation, such as transforming the code or scattering the code they wish to hide among long sequences. Moving code far enough apart poses a specific challenge for solutions that rely on localized features (e.g., n-grams) or attention mechanisms, because the code parts are distributed beyond the local context window. We introduce a neural network solution pattern called “Cybertron” that addresses this problem by using reinforcement learning to train a code abstraction and summarization function, which converts arbitrarily long code into fixed-length real-valued vectors in a way that is optimized for similarity search. The key to the design is the selection of important code elements and an abstraction that preserves semantic function while minimizing syntactic feature information. We evaluated the approach on a three-challenge benchmark of obfuscated JavaScript, a scripting language that is commonly obfuscated and for which code-mixing is a rising challenge. The evaluation shows that our approach identifies obfuscated code even within large scripts with an AUC of 78%, outperforming current state-of-the-art sequence models by 7–35%.
| Publication year | 2025 |
|---|---|
| Citations | 0 |
| Country of publication | Canada |
| Site | ACM |