Searching statically-linked vulnerable library functions in executable code

Translation of Project Zero: Searching statically-linked vulnerable library functions in executable code

Summary

It is hard to detect, at the binary level, files that statically link a vulnerable library.

This post presents results from statically analyzing binaries to detect vulnerable open-source libraries.

Technique

Vulnerabilities are found in open-source libraries all the time, so using them carries risk and calls for care.

Patching a dynamically linked library only requires replacing the library, but patching a statically linked library requires updating every binary it was linked into.

This is an opportunity: vulnerabilities can be found in a target easily, without having to hunt for new ones.

It is also a challenge, because tooling for this task is lacking.

The most common ways to detect a statically linked library:
  • Searching for unique strings
  • Guessing
  • Using the BinDiff tool

1. Tech Stack and Algorithms for Implementation

Tech Problem

Efficient fuzzy search at a relatively large scale.

Why fuzzy search is needed: differences between compilers, changes in optimization, and changes in the code all introduce noise.

Algorithm 1 - Fuzzy Search

An algorithm used for spell checking and spelling correction.

Even when the query is not the exact keyword, it returns closely related or similar results.

Example : Misissippi -> Mississippi

Because it is a similarity search that returns near matches, it fits this research.

It is more effective than a rigid, exact search.

Fuzzy Matching

The CFG changes depending on the compiler, the compile options, and the optimization level.

Code duplication, instruction movement, and scheduling make the disassembly vary.

Fuzzy matching is needed because variants of the same function have to be identified.

Algorithm 2 - SimHash

An algorithm used by search engines to remove duplicate documents.

It is one part of the Locality Sensitive Hashing procedure.

It is needed to detect similar code.

  • If bit n of a feature's hash is 0, subtract 1 from the n-th float; if it is 1, add 1
  • Convert the float vector to a 128-bit vector: positive entries become 1, negative entries become 0

XOR two computed fingerprints (hashes) and run POPCNT on the result to get their Hamming distance.
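The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the FunctionSimSearch implementation: the function names are made up for this example, and MD5 is only a convenient stand-in for "some 128-bit hash of a feature".

```python
import hashlib

HASH_BITS = 128

def feature_hash(feature: str) -> int:
    """Map a feature to a 128-bit integer (MD5 is an arbitrary stand-in)."""
    return int.from_bytes(hashlib.md5(feature.encode()).digest(), "big")

def simhash(features) -> int:
    """Combine the per-feature hashes into one 128-bit SimHash."""
    acc = [0.0] * HASH_BITS
    for feature in features:
        h = feature_hash(feature)
        for n in range(HASH_BITS):
            # Bit n set: add 1 to the n-th float, otherwise subtract 1.
            acc[n] += 1.0 if (h >> n) & 1 else -1.0
    result = 0
    for n, value in enumerate(acc):
        if value > 0:  # positive -> bit 1, otherwise bit 0
            result |= 1 << n
    return result

def hamming_distance(a: int, b: int) -> int:
    """XOR, then count the set bits (what POPCNT does in hardware)."""
    return bin(a ^ b).count("1")
```

Two functions that share most of their features should end up with fingerprints that differ in only a few bits, which is what makes the Hamming distance usable as a similarity measure.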

Candidate features

Features are needed on which to base the similarity decision.

  • Subgraphs of the CFG

  • n-grams (consecutive runs) of asm mnemonics

  • Function prologue

The function prologue cannot be used to tell similar functions apart.

Approximate Nearest-Neighbor Search

A way to cheaply approximate the nearest hash.

Locality Sensitive Hashing - computes a similarity-preserving hash for each function; used to find the most similar hashes in a collection whose size is not fixed

Random lines are drawn to split the space into partitions, and each partition is treated as a bucket for classification.

Algorithm 3 - LSH(Locality Sensitive Hashing)

A kind of ANNS (Approximate Nearest-Neighbor Search).

The vector space is split into lower-dimensional pieces, a hash is computed for each, and the hashes are placed into data buckets to judge similarity.

High-dimensional data sets are mapped to lower dimensions.

Points within a certain range of each other fall into the same bucket.

Random bits for LSH

The input is a bit vector, so its bits are subsampled. -> Compute permutations of the bits.

To get k different hashes, compute k permutations. Permuting 128 bits is cheap, so this is a reasonable choice.

Data Structure

<PermutationIndex, k-th-permutation-of-input-hash, result-id>
<k, perm_k(input) & (0xFFL << 56), 0>

Run k binary searches and collect the hash buckets they hit as candidates; the Hamming distance between each candidate and the input hash can then be computed.

The most memory- and cache-efficient version uses a sorted flat array / vector.

std::set is efficient for insertions; a sorted vector is efficient when reads dominate.
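A rough Python sketch of this index follows, assuming 128-bit hashes, integer result ids, and buckets defined by the top 8 bits of each permuted hash (mirroring the 0xFF mask in the tuple above). The class and function names are hypothetical; the real index is written in C++.

```python
import bisect
import random

HASH_BITS = 128
PREFIX_BITS = 8  # a bucket is identified by the top 8 bits of the permuted hash

def make_permutations(k, seed=0):
    """Return k random permutations of the bit positions 0..127."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        perm = list(range(HASH_BITS))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def permute(h, perm):
    """Move bit perm[dst] of h to position dst."""
    out = 0
    for dst, src in enumerate(perm):
        out |= ((h >> src) & 1) << dst
    return out

class PermutationIndex:
    """Sorted flat list of (perm_index, permuted_hash, result_id) tuples."""

    def __init__(self, k=4, seed=0):
        self.perms = make_permutations(k, seed)
        self.entries = []   # kept sorted so lookups can use binary search
        self.hashes = {}    # result_id -> original hash

    def add(self, result_id, h):
        self.hashes[result_id] = h
        for i, perm in enumerate(self.perms):
            bisect.insort(self.entries, (i, permute(h, perm), result_id))

    def query(self, h):
        """Return sorted (hamming_distance, result_id) for all bucket hits."""
        shift = HASH_BITS - PREFIX_BITS
        candidates = set()
        for i, perm in enumerate(self.perms):
            prefix = permute(h, perm) >> shift
            lo = bisect.bisect_left(self.entries, (i, prefix << shift, -1))
            hi = bisect.bisect_left(self.entries, (i, (prefix + 1) << shift, -1))
            candidates.update(rid for _, _, rid in self.entries[lo:hi])
        return sorted((bin(h ^ self.hashes[r]).count("1"), r) for r in candidates)
```

Each query costs k binary searches plus one XOR and bit count per candidate. `bisect.insort` keeps the list sorted on every insertion, which is simple but slow for bulk loads; appending everything and sorting once would be the read-heavy variant.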

Problem in SimHash

Every feature in the input set is treated as equally important, which is a problem.

Fortunately this can be solved by adding or subtracting a per-feature weight to the float vector instead of +1 / -1.
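As a hedged illustration of the weighted variant: the sketch below replaces the +1 / -1 update with +w / -w, where `weights` is a hypothetical dictionary from feature to weight and MD5 again stands in for the per-feature 128-bit hash.

```python
import hashlib

HASH_BITS = 128

def feature_hash(feature: str) -> int:
    """Map a feature to a 128-bit integer (MD5 is an arbitrary stand-in)."""
    return int.from_bytes(hashlib.md5(feature.encode()).digest(), "big")

def weighted_simhash(features, weights, default_weight=1.0) -> int:
    """SimHash where each feature votes with its own weight."""
    acc = [0.0] * HASH_BITS
    for feature in features:
        w = weights.get(feature, default_weight)
        h = feature_hash(feature)
        for n in range(HASH_BITS):
            # Add the weight if bit n is set, subtract it otherwise.
            acc[n] += w if (h >> n) & 1 else -w
    return sum(1 << n for n, value in enumerate(acc) if value > 0)
```

A feature with a large weight dominates the fingerprint, and a feature with weight 0 drops out entirely, so learning the weights amounts to learning which features matter.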

Cheap Gradient Principle

Automatic differentiation is at the core of ML. -> The cheap gradient principle is what makes it practical here.

The value of the loss function has to be minimized. -> A function that involves the weights is needed.

With data labeled "same" or "not same", an effective loss function can be specified.

Loss Function for SimHash Distance

The distance is the Hamming distance between two bit vectors, so the gradient is very likely to be 0.

A simple fix is to leave out the last step of the hash computation.

Replace the hash comparison with a Euclidean distance. -> ineffective

With that distance, the easiest "solution" is simply to shrink the weights of two similar functions to 0.

Penalize

Penalize when functions labeled similar end up far apart, or when functions labeled dissimilar end up close together.

Conditions on the function

Both values have the same sign -> the function is negative or zero

The values have different signs -> the function is positive.

Gradient / incentive : moves the inputs toward having the same sign

The derived function:

g(x,y) = -xy/sqrt((x^2)*(y^2)+1) + 1

The loss is high when x and y have different signs and approaches 0 when they have the same sign.

Problem : the graph is flat over large regions, so the value is nearly constant there.

g(x,y) * d(x,y) // d(x,y) = sqrt((x-y)^2 + 0.01)

This gives a function that meets the conditions.

The parameters can now be tuned.
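A small numeric sketch of this loss term, assuming d(x,y) = sqrt((x - y)^2 + 0.01) so that the square root is defined for every input. `pair_loss` and `attract_loss` are names invented for this example; summing over the components of two float vectors is likewise an assumption about how the per-component terms are combined.

```python
import math

def g(x: float, y: float) -> float:
    """Near 0 when x and y share a sign, near 2 when the signs differ."""
    return -x * y / math.sqrt(x * x * y * y + 1.0) + 1.0

def d(x: float, y: float) -> float:
    """Smoothed absolute difference; the 0.01 keeps it differentiable at x == y."""
    return math.sqrt((x - y) ** 2 + 0.01)

def pair_loss(x: float, y: float) -> float:
    """Per-component loss: sign disagreement scaled by how far apart x and y are."""
    return g(x, y) * d(x, y)

def attract_loss(u, v) -> float:
    """Hypothetical total loss for two pre-threshold float vectors of a similar pair."""
    return sum(pair_loss(a, b) for a, b in zip(u, v))

print(pair_loss(1.0, 1.0), pair_loss(1.0, -1.0))
```

Multiplying by d(x,y) removes the flat regions of g: when the signs disagree, the loss now keeps growing with the distance between x and y, so the gradient still points somewhere useful.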

2. Get Data For Training

Generating Training Data

  • Generating data has to be simple.
  • Compile open-source code with many compiler versions and many option settings
    • It is unknown which options and compiler version the target library was built with, so as many comparison variants as possible are needed.
  • symbol info parsing - needed to build groups of function variants
  • Group the variants of each function produced by the different compile options
  • Dissimilar pairs can be defined as different functions with different symbols.

However, symbol parsing and CFG reconstruction come with many implementation problems.

Issue 1 - Symbols

There is no cross-platform tool for parsing PDB files.

Switching to GCC or Clang would solve this, but even fewer projects build without Visual Studio. -> Substitution is not an option.

The attempt to build the same codebase reliably with both VS and GCC was abandoned.

The C++ mangling convention differs between compilers. -> The same function ends up with different names.

This can be solved by stripping the type info and unifying the notation of the three compilers with a hackish tool.

Issue 2 - CFG, Polluted Data Sets

Disassemblers make mistakes when handling switch statements: truncated functions and wrongly assigned basic blocks. gcc binaries compiled with -fPIE and -fPIC are even more problematic, and these options are the default on current Linux systems.

The stack cookie check path does not return, which causes problems in the disassembler's CFG. A more robust disassembler would fix this, but for now the only workaround is to compile with -fno-PIE -fno-PIC.

Real Data Generation

The data can be generated with the testdata/generate_training_data.py script.

It analyzes every binary file in the testdata/ELF and testdata/PE folders.

Extract Function Symbols

ELF : extracted with objdump when DWARF debug info is present

PE : no way to analyze PDB files on Linux was found, so the DIA2Dump tool is used

Result File

EXE ID : SHA256

./extracted_symbols_<EXEID>.txt
[exe ID] [exe path] [function address] [base64 encoded symbol] false
./functions_<EXEID>.txt
[exe ID]:[function address] [sequence of 128-bit hashes per feature]
./[training|validation]_data_[seen|unseen]/attract.txt
./repulse.txt
[exe ID]:[function address] [exe ID]:[function address]

Split Training/Validation Data

Why training and validation data are generated separately: part of the data is used for training, the rest is held out, and detection is then tested on the held-out part.

(1) Detect variants of functions that were seen during training
  • Hold out some variants of a function

  • Train on the rest

  • Validate on the held-out variants

(2) Detect variants of a function even though no version of it was available during training
  • The most useful case, but practically infeasible.
  • Training/validation data must be split by function group (the set of variants of the same function code).
  • Hold out some function groups, train on the other groups, and validate on the held-out groups.
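The two kinds of split can be sketched as follows. This is only an illustration under assumptions: `groups` is a hypothetical dictionary mapping a symbol name to its list of `exeid:address` variants, and the 20% hold-out fraction is arbitrary; the actual generate_training_data.py may work differently.

```python
import random

def split_by_group(groups, holdout_fraction=0.2, seed=0):
    """Return (seen_train, seen_val, unseen) from a symbol -> variants dict."""
    rng = random.Random(seed)
    names = sorted(groups)
    rng.shuffle(names)

    # Question (2): whole function groups that training never sees.
    n_unseen = max(1, int(len(names) * holdout_fraction))
    unseen = {name: list(groups[name]) for name in names[:n_unseen]}

    # Question (1): hold out a few variants of groups that are trained on.
    seen_train, seen_val = {}, {}
    for name in names[n_unseen:]:
        variants = list(groups[name])
        rng.shuffle(variants)
        if len(variants) > 2:
            cut = max(1, int(len(variants) * holdout_fraction))
        else:
            cut = 0  # too few variants to hold any out
        seen_val[name] = variants[:cut]
        seen_train[name] = variants[cut:]
    return seen_train, seen_val, unseen
```

Splitting by group rather than by individual function is the important part: if variants of one function leak into both sides, the "unseen" validation set no longer measures generalization.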

3. Training

Training Issue

Automatic parallelization, GPU offload

  • TensorFlow
  • Julia + AutoDiff

However, to keep dependencies down, the loss function was written in C++.

SPII

The SPII C++ library is used for the minimization

  • Pros : a clean, pleasant programming model
  • Cons : CPU only. Needs to be replaced by something that uses the GPU.

Real Training

thomasdullien@machine-learning-training:~/sources/functionsimsearch/bin$ ./trainsimhashweights -data=/mnt/training_data/training_data_seen/ --weights=weights_seen.txt

L-BFGS ran for 500 iterations, and a snapshot was written every 20 training steps.

L-BFGS

Limited-memory BFGS - Wikipedia

An optimization algorithm; one of the quasi-Newton methods.

Quasi-Newton

Quasi-Newton method - Wikipedia

A family of methods that approximate the Hessian matrix from successive gradient evaluations instead of computing second derivatives.

In the graph, the x-axis is the training step and the y-axis is the difference in bits (distance) between similar and dissimilar function pairs. After step 420 the curve declines.

Further optimization is no longer needed at that point.

The distance gap improved from 10 bits to 25 bits. -> Training did learn to recognize function variants.

4. Result Analysis

Training Results

  1. Visualize the high-dimensional data - using t-SNE or MDS
  2. AUC (Area-under-ROC-curve) analysis
  3. Inspect the learned feature weights to judge the training result

t-SNE Visualisation

T-distributed Stochastic Neighbor Embedding

A dimensionality-reduction visualization that maps high dimensions to low dimensions.

It uses the t-distribution to group items of similar similarity together for display.

./plot_function_groups.py ../bin/symbols.txt ../bin/unit_index.txt /tmp/unit_features.html
./plot_function_groups.py ../bin/symbols.txt ../bin/learnt_index.txt /tmp/learnt_features.html

Not every feature benefited from training. For some features the result got worse.

TPR, FPR, IRR, ROC-curve

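  • TP (True Positive) : a true item classified as true
  • TPR (True Positive Rate) : recall, the share of truly matching items that are reported as matches
  • FP (False Positive) or Type 1 Error : a false item classified as true
  • FPR (False Positive Rate) : the share of non-matching items that are reported as matches
  • IRR (Irrelevant Result Rate) : the share of irrelevant results
  • ROC-Curve : a graph with FPR on the x-axis and TPR on the y-axis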
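As a quick worked example, the three rates can be computed from raw counts. TPR and FPR use the standard definitions; treating IRR as the fraction of returned results that are false positives is an assumption made for this sketch, and the function name is hypothetical.

```python
def rates(tp: int, fp: int, tn: int, fn: int):
    """Return (TPR, FPR, IRR) from confusion-matrix counts."""
    tpr = tp / (tp + fn) if tp + fn else 0.0  # found matches / all real matches
    fpr = fp / (fp + tn) if fp + tn else 0.0  # false alarms / all non-matches
    irr = fp / (tp + fp) if tp + fp else 0.0  # false alarms / all returned results
    return tpr, fpr, irr

tpr, fpr, irr = rates(tp=50, fp=10, tn=90, fn=50)
print(f"TPR={tpr:.2f} FPR={fpr:.2f} IRR={irr:.2f}")
```

Sweeping the cut-off distance and recomputing these rates at each value gives the points of the ROC curve.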

These help decide how many hash buckets to pick and help quantify the approximation rate and accuracy.

A script was written to generate the ROC data.

testdata/evaluate_ROC_curve.py --symbols=/media/thomasdullien/roc/symbols.txt --dbdump=/media/thomasdullien/roc/search.index.txt --index=/media/thomasdullien/roc/search.index

The graphs are produced with gnuplot.

gnuplot -c ./testdata/plot_results_of_evaluate_ROC_curve.gnuplot ./untrained_roc.txt
gnuplot -c ./testdata/tpr_fpr_curve.gnuplot ./untrained_roc.txt ./trained_roc.txt

Graph analysis (before training)

Graph 1

  • Once TPR exceeds 50%, about 20% of the results are irrelevant. The cut-off distance is around 25 bits.
  • Judging by TPR reaching 55%, results become irrelevant from about 35 bits on. Training the weights further could improve this.

Graph 2

  • TPR and FPR are flat.
  • Expanding the bits steadily shrinks the search space.
  • Relevance needs to improve.

Graph 3

  • Compares the gain in recall against the loss in accuracy

Graph analysis (after training)

Graph 1

  • Around 10 bits, IRR drops from 15% to 5%
  • TPR drops
  • 33% success
  • IRR improved, but the overall result got worse

Graph 2

  • Accepting an IRR of 15% gives a 45% success rate (before training)
  • Accepting an IRR of 5%, the success rate drops to 40% (after training)

Training results

  • Training is effective at lowering IRR
  • The untrained results were better in terms of success rate.

Generalizing out-of-sample function

Out-of-sample : predicting on samples that were not part of training

For question (2), a graph of the mean distance difference, plotted the same way as for question (1)

After 80 training steps it rose from 11.42 bits to 12.81 bits.

5. Practical Results

Practical Searching

FunctionSimSearch provides Python bindings so that it can be integrated with RE tools such as IDA, Radare, Binary Ninja, and Miasm.

 jsonstring = (... load the JSON ... )
 fg = functionsimsearch.FlowgraphWithInstructions()
 fg.from_json(jsonstring)
 hasher = functionsimsearch.SimHasher("../testdata/weights.txt")
 function_hash = hasher.calculate_hash(fg)

The flow graph is passed to the Python API as JSON.

{
 "edges": [ { "destination": 1518838580, "source": 1518838565 }, (...) ],
 "name": "CFG",
 "nodes": [
   {
     "address": 1518838565,
     "instructions": [
       { "mnemonic": "xor", "operands": [ "EAX", "EAX" ] },
       { "mnemonic": "cmp", "operands": [ "[ECX + 4]", "EAX" ] },
       { "mnemonic": "jnle", "operands": [ "5a87a334" ] } ]
   }, (...)  ]
}

Feed this JSON data in as input and the function hash comes back as a Python tuple.
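A sketch of how such a JSON string could be assembled from a tool's own data, using only the standard `json` module. `make_flowgraph_json` and its argument layout are hypothetical; only the key names are taken from the JSON example above, and the call into functionsimsearch itself is left out.

```python
import json

def make_flowgraph_json(nodes, edges, name="CFG") -> str:
    """nodes: [(address, [(mnemonic, [operands...]), ...])], edges: [(source, destination)]."""
    graph = {
        "edges": [{"source": src, "destination": dst} for src, dst in edges],
        "name": name,
        "nodes": [
            {
                "address": address,
                "instructions": [
                    {"mnemonic": mnemonic, "operands": list(operands)}
                    for mnemonic, operands in instructions
                ],
            }
            for address, instructions in nodes
        ],
    }
    return json.dumps(graph)

jsonstring = make_flowgraph_json(
    nodes=[(1518838565, [("xor", ["EAX", "EAX"]),
                         ("cmp", ["[ECX + 4]", "EAX"]),
                         ("jnle", ["5a87a334"])])],
    edges=[(1518838565, 1518838580)],
)
```

The resulting `jsonstring` has the same shape as the example and is what would be handed to `fg.from_json(...)` in the snippet above.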

Python Plugin Binding

Searching For unrar Code In mpengine.dll

Use ida_example.py to look for code from the unrar open-source library inside mpengine.dll.

Result is 125.000000 - build.VS2015\unrar32\Release\UnRAR.exe 'memcpy_s' (1 in inf searches)
Result is 125.000000 - build.VS2015\unrar32\MinSize\UnRAR.exe 'memcpy_s' (1 in inf searches)
Result is 125.000000 - build.VS2015\unrar32\FullOpt\UnRAR.exe 'memcpy_s' (1 in inf searches)
--------------------------------------
Result is 108.000000 - build.VS2015\unrar32\MinSize\UnRAR.exe '?RestartModelRare@ModelPPM@@AAEXXZ' (1 in 12105083908.189119 searches)
Result is 107.000000 - build.VS2013\unrar32\MinSize\UnRAR.exe '?RestartModelRare@ModelPPM@@AAEXXZ' (1 in 3026270977.047280 searches)
Result is 107.000000 - build.VS2010\unrar32\MinSize\UnRAR.exe '?RestartModelRare@ModelPPM@@AAEXXZ' (1 in 3026270977.047280 searches)
--------------------------------------
Result is 106.000000 - build.VS2010\unrar32\Release\UnRAR.exe '?Execute@RarVM@@QAEXPAUVM_PreparedProgram@@@Z' (1 in 784038800.726675 searches)
Result is 106.000000 - build.VS2010\unrar32\FullOpt\UnRAR.exe '?Execute@RarVM@@QAEXPAUVM_PreparedProgram@@@Z' (1 in 784038800.726675 searches)
Result is 105.000000 - build.VS2010\unrar32\MinSize\UnRAR.exe '?Execute@RarVM@@QAEXPAUVM_PreparedProgram@@@Z' (1 in 209474446.235050 searches)
--------------------------------------
Result is 106.000000 - build.VS2010\unrar32\MinSize\UnRAR.exe '?ExecuteCode@RarVM@@AAE_NPAUVM_PreparedCommand@@I@Z' (1 in 784038800.726675 searches)
--------------------------------------
Result is 105.000000 - ar\build.VS2015\unrar64\Debug\UnRAR.exe 'strrchr' (1 in 209474446.235050 searches)
Result is 105.000000 - ar\build.VS2013\unrar64\Debug\UnRAR.exe 'strrchr' (1 in 209474446.235050 searches)
Result is 105.000000 - ar\build.VS2012\unrar64\Debug\UnRAR.exe 'strrchr' (1 in 209474446.235050 searches)
--------------------------------------

Result analysis

Result is ??

?? is the number of bits, out of the 128-bit hash, that match.

memcpy_s - 97.6% (125bit/128bit)

  • Bit similarity : 97.6%
  • Matching bits : 125bit/128bit
  • Nearly identical apart from minor changes
  • The CFG matches

ppmii :: ModelPPM :: RestartModelRare - 84.3% (108bit/128bit)

  • Bit similarity : 84.3%
  • Matching bits : 108bit/128bit
  • Structure offsets changed quite a bit
  • The CFG did not change much.
  • The first basic block differs -> still similar

RarVM :: ExecuteCode - 82.8% (106bit/128bit)

  • Bit similarity : 82.8%
  • Matching bits : 106bit/128bit
  • The constant 0x17D7840 is the same.

[FP Case] RarTime :: operator == - 97.6% (125bit/128bit)

  • Bit similarity : 97.6%
  • Matching bits : 125bit/128bit
  • The bits were judged very similar, but this is a False Positive
  • A different function whose bits were nevertheless judged similar.
  • Can easily happen with operator==

libtiff in Adobe Reader

Adobe Reader once shipped an outdated version of libtiff, which resulted in a vulnerability.

Search libtiff Code in All Windows DLLs

  • A directory of builds compiled with various Visual Studio versions and various compile settings
  • Use DIA2Dump on the PDB files to obtain .debugdump files.
for i in $(find /media/thomasdullien/storage/libtiff/PE/ -name tiff.dll); do ./addfunctionstoindex --input=$i --format=PE --index=/var/tmp/work/simhash.index; done

This builds the SimHash index for tiff.dll.

~/Desktop/sources/functionsimsearch/testdata/generate_training_data.py --work_directory=/var/tmp/work/ --executable_directory=/media/thomasdullien/storage/libtiff/ --generate_fingerprints=True --generate_json_data=False

cat /var/tmp/work/extracted_symbols* > /var/tmp/work/simhash.index.meta

generate_training_data.py extracts the function symbols and generates the training data.

for i in $(find /media/DLLs -iname '*.dll'); do echo $i; ./matchfunctionsindex --index=/var/tmp/work/simhash.index --input $i; done

matchfunctionsindex searches every DLL against the generated data and index.

Output

/home/thomasdullien/Desktop/sources/adobe/binaries/AGM.dll
(...)
/home/thomasdullien/Desktop/sources/adobe/binaries/BIBUtils.dll
(...)
[!] (3191/3788 - 23 branching nodes) 0.851562: cf1cc98bead49abf.53135c10 matches 39dd1e8a79a9f2bc.1001d43d /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2015.32bits.O1/libtiff.dll PackBitsEncode( tiff*, unsigned char*, int, unsigned short)

/media/dlls/Windows/SysWOW64/WindowsCodecs.dll
[!] Executable id is cf1cc98bead49abf
[!] Loaded search index, starting disassembly.
[!] Done disassembling, beginning search.
[!] (3191/3788 - 23 branching nodes) 0.851562: cf1cc98bead49abf.53135c10 matches 39dd1e8a79a9f2bc.1001d43d /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2015.32bits.O1/libtiff.dll PackBitsEncode( tiff*, unsigned char*, int, unsigned short)
[!] (3192/3788 - 23 branching nodes) 0.804688: cf1cc98bead49abf.53135c12 matches 4614edc967480a0d.1002329a /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2013.32bits.O2/libtiff.dll
[!] (3192/3788 - 23 branching nodes) 0.804688: cf1cc98bead49abf.53135c12 matches af5e68a627daeb0.1002355a /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2013.32bits.Ox/libtiff.dll
[!] (3192/3788 - 23 branching nodes) 0.804688: cf1cc98bead49abf.53135c12 matches a5f4285c1a0af9d9.10017048 /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2017.32bits.O1/libtiff.dll PackBitsEncode( tiff*, unsigned char*, int, unsigned short)
[!] (3277/3788 - 13 branching nodes) 0.828125: cf1cc98bead49abf.5313b08e matches a5f4285c1a0af9d9.10014477 /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2017.32bits.O1/libtiff.dll 

A match against libtiff.dll was found in WindowsCodecs.dll.

Result analysis

At first glance the CFGs do not look all that similar.

Zooming in, there are similar regions.

The function name from the PDB, PackBitsEncode, matches, which makes the result trustworthy.

The investigation showed that WindowsCodecs.dll uses libtiff version 3.9.5. It also links libjpeg.

6. Conclusion

Lessons

Search Index vs Linear Sweep

  • On modern CPUs with large memory, a linear sweep is fast.

  • Load hash, XOR, count bits: hundreds of millions of hashes can be checked this way.

  • It is unclear how many hashes actually need to be compared.

  • A search index may be over-engineered.

  • With efficient storage handling, a simple linear sweep may be the better choice.
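For comparison with the index, a linear sweep is only a few lines. A minimal sketch, assuming the hashes are held as Python integers in a flat list; the function name and the default cut-off of 25 bits are illustrative choices, not values from the tool.

```python
def linear_sweep(query: int, hashes, max_distance: int = 25):
    """Return sorted (distance, index) for every hash within max_distance bits."""
    hits = []
    for index, h in enumerate(hashes):
        distance = bin(query ^ h).count("1")  # load, XOR, count bits
        if distance <= max_distance:
            hits.append((distance, index))
    hits.sort()
    return hits

print(linear_sweep(0b1011, [0b1011, 0b0000, 0b1111], max_distance=1))
```

In pure Python this is far slower than native code using POPCNT over a packed array, but the structure is the same: no index to build or maintain, just one pass over contiguous memory.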

Search Simple String

  • When looking for a statically linked library, searching for known/magic strings will be the most effective approach in most cases (90%).

  • The more distinctive strings there are, the better the results.

  • The tool studied here is useful when open-source library code was copy-pasted and compiled in (that is, when there are no strings to go by).

  • It could be applied to mpengine.dll.

Still Hard Problem

  • Only 40% of the results can be trusted.

  • IRR has to come down and TPR has to improve to 90% or more.

Future Step

The training code should be rewritten in TensorFlow or Julia.

  • Because the loss function was coded in C++, training takes a long time.
  • It is nice as a single-language codebase.
  • Parallelizing training on a GPU would make things easier.

The L-BFGS optimizer should be replaced with SGD.

  • L-BFGS is a poor fit for large data.

Triplet and Quadruplet Training

Choosing more effective features

  • The current features (mnemonic tuples, the CFG, constants that are not multiples of 4) are not good.
  • Operands, structure offsets, and strings, which carry important information, are not taken into account.

Graph-NN

  • A lot of ML research on learning from graphs has been done.
  • Related paper - CCS'17
  • It should be able to detect what string matching cannot.

Using function adjacency

  • Functions pulled in from a library are laid out next to each other at the binary level.
  • Using this information would make matches more certain.

Replace the ANN tree data structure with a flat array

  • The ANN data structure is complex, so it can be inefficient compared with a linear sweep.
    • It adds storage overhead.
  • If a linear sweep is used, it has to be more effective than LSH-style bit permutations.
