Table of Contents
- Intro
- The malware sample used in this blog post
- The KSA and PRGA functions
- Observing call into function
- Key-scheduling algorithm (KSA)
- Pseudo-random generation algorithm (PRGA)
- Conclusion
Intro
During a deep-dive analysis of a recent sample, another use of RC4 came up, and I thought it was a good example to show how it appears in malware. The RC4 encryption algorithm is encountered quite often in malware due to its ease of use, its ability to hide strings and other information, and the way it lets samples change their hashes quickly (sometimes dynamically per infection).
The well-known wiki page with pseudo-code is found here:
https://en.wikipedia.org/wiki/RC4
I translated the pseudo-code on the wiki to Python so that it can be used to test data and keys:
rc4.py
The goal is to be able to spot RC4 very quickly, and because its KSA/PRGA loops are so small it is typically easy to recognize. The examples below walk through a specific case to help see it in a real-world sample.
The malware sample used in this blog post
- MD5 hash: 8e3ecc68ad8bb0db61b5de65d8381eff
- VirusTotal link to sample: https://www.virustotal.com/gui/file/20c8baddda18909c2cd6eb78ac904fe9ac1a1e96db37698b157b12f745ce1ff8/detection
The KSA and PRGA functions
The Key-scheduling algorithm (KSA) and the Pseudo-random generation algorithm (PRGA) may be all in one call or split into two separate calls. In the case of this sample it is observed to be a single call.
The RC4 encryption can be thought of as two phases:
1) Generating the S-box array by initializing it with a key
2) Using the S-box array to generate a keystream that is XORed with any data you want to encrypt or decrypt
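The two phases can be sketched in a few lines of Python. This is a minimal illustration written from the Wikipedia pseudo-code, not the sample's code and not necessarily identical to the rc4.py script mentioned earlier; the function name rc4 is my own choice.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data with RC4 (the operation is symmetric)."""
    if not key:
        raise ValueError("key must not be empty")

    # Phase 1 (KSA): start from the identity permutation, then mix in the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]

    # Phase 2 (PRGA): generate one keystream byte per data byte and XOR it in.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)


# Test vector listed on the Wikipedia RC4 page.
ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())         # bbf316e8d940af0ad3
print(rc4(b"Key", ciphertext))  # b'Plaintext'
```

Because the same call both encrypts and decrypts, a key recovered from a debugger can be fed straight back in to decode captured data.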
Observing call into function
The key pieces of information that the RC4 routine needs are the data, the length of the data, and the key that will be used when initializing the stream cipher. When debugging live, we can see at the call into the function a pointer to the MZ byte at the start of the data that will be encrypted in ecx, the length of this data in edx, and finally the key at [esp], with its bytes highlighted in the Dump 1 window.
Key-scheduling algorithm (KSA)
KSA identity permutation (initialization)
Pseudo-code
for i from 0 to 255
    S[i] := i
endfor
When looking for the presence of RC4 there are a few indicators to look for:
1) Loops for 256 iterations (look for compares of 100h or sometimes FFh)
2) An array being populated with the value of a counter (0, 1, 2, 3, …)
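The first indicator can be turned into a rough byte-pattern search over a code section. The sketch below is only a heuristic and rests on several assumptions: 32-bit x86 code (as in this sample, which uses ecx, edx, and esp), a loop bound checked with a cmp against a 32-bit immediate of 100h, and a hypothetical helper name find_cmp_100h. Compilers can express the same bound in other ways (for example an 8-bit counter or a compare with FFh), so a miss proves nothing and every hit still needs to be checked in the disassembly.

```python
import re

# Assumed encodings (32-bit x86):
#   cmp eax, 100h      -> 3D 00 01 00 00
#   cmp <reg32>, 100h  -> 81 F8..FF 00 01 00 00
CMP_100H = re.compile(rb"(?:\x3d|\x81[\xf8-\xff])\x00\x01\x00\x00")


def find_cmp_100h(code: bytes) -> list:
    """Return the offsets of candidate 'cmp reg, 100h' instructions."""
    return [match.start() for match in CMP_100H.finditer(code)]


# Hand-made byte string: nop; cmp ecx, 100h; jl; cmp eax, 100h
sample = bytes.fromhex("90 81f900010000 7cf0 3d00010000")
print(find_cmp_100h(sample))  # [1, 9]
```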
When debugging live, we can see the buffer that was allocated with the series of bytes from 00 to FF. In this algorithm, the array of bytes is defined as “S” and is often referred to as the “S-box”.
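The same 00 to FF pattern can be searched for in a memory dump taken after the first loop has finished. The sketch below assumes the S-box is stored either as 256 single bytes or as 256 little-endian 4-byte integers; both layouts are assumptions about how an implementation might store it, and the helper names are hypothetical.

```python
import struct


def identity_sbox(entry_size: int = 1) -> bytes:
    """Return the freshly initialized S-box as it would sit in memory."""
    if entry_size == 1:
        return bytes(range(256))                  # 00 01 02 ... FF
    if entry_size == 4:
        return struct.pack("<256I", *range(256))  # little-endian dwords
    raise ValueError("only 1- and 4-byte entries are sketched here")


def find_identity_sbox(dump: bytes, entry_size: int = 1) -> int:
    """Offset of the first identity S-box in dump, or -1 if absent."""
    return dump.find(identity_sbox(entry_size))


# Stand-in for a real dump: padding, the initialized S-box, more padding.
dump = b"\xcc" * 16 + bytes(range(256)) + b"\xcc" * 16
print(find_identity_sbox(dump))  # 16
```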
KSA loop to mix in key bytes
Pseudo-code
j := 0
for i from 0 to 255
    j := (j + S[i] + key[i mod keylength]) mod 256
    swap values of S[i] and S[j]
endfor
You now want to look for a second loop across the same array that:
1) Loops for 100h iterations like before
2) References bytes from the key
This second loop scrambles the S-box array into the final array that the PRGA uses to generate the bytes that are XORed with the data to be encrypted or decrypted.
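One way to confirm that a suspected key really is the RC4 key is to run the KSA yourself and compare the result with the scrambled S-box dumped from the debugger after the second loop. The sketch below assumes a byte-wide S-box dump; the names ksa, is_permutation, and key_matches_sbox are hypothetical helpers, and the dump here is simulated because no real bytes from the sample are reproduced in this post.

```python
def ksa(key: bytes) -> bytes:
    """Run the RC4 key-scheduling algorithm and return the final S-box."""
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    return bytes(S)


def is_permutation(buf: bytes) -> bool:
    """A valid S-box holds every value 0..255 exactly once."""
    return len(buf) == 256 and sorted(buf) == list(range(256))


def key_matches_sbox(key: bytes, dumped_sbox: bytes) -> bool:
    """True if running the KSA with key reproduces the dumped S-box."""
    return ksa(key) == dumped_sbox


dumped = ksa(b"Key")  # simulated dump; in practice, copy it from the debugger
print(is_permutation(dumped))             # True
print(key_matches_sbox(b"Key", dumped))   # True
print(key_matches_sbox(b"Wiki", dumped))  # False
```

The permutation check is also a quick sanity test on its own: a 256-byte buffer that still contains each value exactly once after the loop is a strong hint that you are looking at an S-box.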
Pseudo-random generation algorithm (PRGA)
Pseudo-code
i := 0
j := 0
while GeneratingOutput:
    i := (i + 1) mod 256
    j := (j + S[i]) mod 256
    swap values of S[i] and S[j]
    K := S[(S[i] + S[j]) mod 256]
    output K
endwhile
The PRGA is where the magic happens: it generates the keystream bytes that are XORed with your data bytes.
There are several indicators to look for when trying to identify this in the assembly:
1) This will be a third loop, but it runs for the length of the data rather than for 100h iterations
2) The loop will be iterating over each byte in the data
3) At the very end of the loop you should see an XOR operation between the generated PRGA byte and a byte from the data
As this loop iterates you can watch the plaintext bytes in memory slowly encrypt.
Conclusion
Since RC4 is used so heavily in malware, it is important to be able to identify it quickly during static analysis. Fortunately, the very distinct sequence of 100h loops usually makes it easy to identify and confirm.