Author: Ron Murawski
Date: 14:08:00 01/29/06
Go up one level in this thread
Hi Uri,
There was a series of articles in Dr Dobbs Journal (a programming magazine)
written by Andew Schulman entitled "Finding Binary Clones With Opstrings &
Function Digests". The main thrust of the article is to assign unique
"opcharacters" to each assembly language mnuemonic. These opcharacters are
combined into "opstrings".
Here's a quote from the article: "In x86 Win32 code, one such opstring would be:
"test, jz, call, test, jz, call, ret, loc, [HeapFree], loc, ret". Using an
algorithm such as MD5, this can be turned into a function digest such as
65E482BF6A2F391D84D243D9E02244D7."
You'll notice that the opstring is much abbreviated from real assembly language
in an attempt to filter out "noise". For instance, it is very unlikely that a
"jz" instruction will jump to the exact same memory address, so it's best to
just ignore the destination address. Generally it is only the first byte of
assembly code that is salient to clone searching.
The idea is to compare function digests between programs to detect plagiarism.
In order to do this you have to decide where funtions begin and end. This is not
always easy, but it is necessary to ensure that the function digests represent
comparable data between exe files. This method not only claims to detect exact
copying, but also code that has been modified with the intent to appear
different!!!
Here's an example of two functions that were detected to be clones:
// test1.c
int test1(int top)
{
unsigned char *num = calloc(top, 1);
int i, j, count;
if (!num) return 0;
for (i = 3, count = 1; i <= top; i += 2)
{
if (!num[1])
{
count++;
for (j = 1; j <= top; j +=i)
num[j]++;
}
}
free(num);
return count;
}
=======
// test2.c
int test2(int maxi)
{
unsigned char *arr = calloc(1, maxi);
int a = 3, pr = 1;
if (!arr) return 0;
while (a <= maxi)
{
if (arr[a] == 0)
{
int b = a << 1;
pr++;
while (b <= maxi)
{
arr[b]++;
b += a;
}
}
a += 2;
}
free(arr);
return pr;
}
[I hope I did not make a typo above. I copied it directly from the printed
magazine article.]
I was very impressed at this detection ability. The disassemblies for each of
these functions seemed very different to my eye.
The author makes no claims that the technique is perfect. False positives occur
and human interpretation is always necessary before a final decision can be
made. Also, it is better comparing two programs have been compiled using the
same compiler (using the same optimization flags!). But it seems that if you
limit your comparisons to longer opstrings, the probability of detecting cloned
functions becomes extremely high.
Here's a link to the articles in the series but you need to be a DDJ subsciber
to read the articles in their entirety.
http://www.ddj.com/documents/s=9809/ddj0507k/0507k.html
http://www.ddj.com/documents/s=9825/ddj0508h/0508h.html
http://www.ddj.com/documents/s=9860/ddj0509i/0509i.html
Ron
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.