A PoC implementation of an advanced in-memory evasion technique that spoofs the Thread Call Stack. The technique allows bypassing thread-based memory examination rules and better hiding shellcode while it resides in process memory.

Intro

This is an example implementation of the Thread Stack Spoofing technique, aiming to evade Malware Analysts, AVs and EDRs looking for references to shellcode’s frames in an examined thread’s call stack.
The idea is to hide references to the shellcode on the thread’s call stack, thus masquerading allocations containing malware’s code.

This implementation, along with my ShellcodeFluctuation, gives the Offensive Security community sample implementations to catch up on the offering made by commercial C2 products, so that we can do no worse in our Red Team toolings.

Implementation has changed

The current implementation differs heavily from what was originally published.
This is because I realised there is a way simpler approach to terminate the processing of the thread’s call stack and hide the shellcode’s related frames: simply writing 0 to the return address of the first frame we control:


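    void WINAPI MySleep(DWORD _dwMilliseconds)
    {
        [...]
        // Locate the slot holding MySleep's return address and remember its value.
        auto overwrite = (PULONG_PTR)_AddressOfReturnAddress();
        const auto origReturnAddress = *overwrite;
        // A null return address makes the stack walk stop at this frame.
        *overwrite = 0;

        [...]
        // Put the real return address back before returning to the caller.
        *overwrite = origReturnAddress;
    }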

The previous implementation, utilising StackWalk64, can be accessed in commit c250724.

This implementation is much more stable and works nicely on both Debug and Release builds under two architectures: x64 and x86.

Demo

This is how a call stack may look when it is NOT spoofed:

 

And this is how it looks when thread stack spoofing is enabled:

 

Above we can see that the last frame on our call stack is our MySleep callback.
One may wonder whether this immediately introduces new IOCs. Hunting rules can look for threads whose call stacks do not unwind into the following expected thread entry points located within system libraries:

kernel32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21

Although the call stack of the spoofed thread may look rather odd at first, a brief examination of my system showed that there are other threads not unwinding to the above entry points as well:

The above screenshot shows a thread of unmodified Total Commander x64. As we can see, its call stack pretty much resembles our own in terms of initial call stack frames.

Why should we care about carefully faking our call stack when there are processes exhibiting traits that we can simply mimic?
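
Seen from the defender's side, the hunting rule mentioned above can be sketched as follows. This is a minimal, portable illustration and not part of the PoC: it assumes the call stack is already available as symbolised strings ordered from the innermost frame outwards, and the names frameIs and unwindsToExpectedEntryPoints are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: true when `frame` (e.g. "ntdll!RtlUserThreadStart+0x21")
// names `symbol`, ignoring an optional "+offset" suffix.
static bool frameIs(const std::string& frame, const std::string& symbol) {
    assert(!symbol.empty());
    return frame.compare(0, symbol.size(), symbol) == 0 &&
           (frame.size() == symbol.size() || frame[symbol.size()] == '+');
}

// Frames are ordered innermost first, so the thread entry points are
// expected to be the last two entries.
bool unwindsToExpectedEntryPoints(const std::vector<std::string>& frames) {
    if (frames.size() < 2) return false;
    const std::string& prev = frames[frames.size() - 2];
    const std::string& last = frames[frames.size() - 1];
    return frameIs(prev, "kernel32!BaseThreadInitThunk") &&
           frameIs(last, "ntdll!RtlUserThreadStart");
}
```

As the Total Commander example shows, such a rule would also flag legitimate threads, so on its own it is at best a weak signal.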

How it works?

The rough algorithm is as follows:

  1. Read the shellcode’s contents from a file.
  2. Acquire all the necessary function pointers from dbghelp.dll and call SymInitialize.
  3. Hook kernel32!Sleep so that it points back to our callback.
  4. Inject and launch the shellcode via VirtualAlloc + memcpy + CreateThread. The thread should start from our runShellcode function to avoid having the Thread’s StartAddress point somewhere unexpected and anomalous (such as ntdll!RtlUserThreadStart+0x21).
  5. As soon as Beacon attempts to sleep, our MySleep callback gets invoked.
  6. We then overwrite the last return address on the stack with 0, which effectively should finish the call stack.
  7. Finally, a call to ::SleepEx is made to let the Beacon sleep while waiting for further communication.
  8. After the sleep is finished, we restore the previously saved original function return addresses and execution is resumed.
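
Steps 6 to 8 boil down to a save, zero, restore sequence. Below is a minimal sketch of that pattern on a simulated slot. It does not touch a real return address, ScopedZero and sleepWithTerminatedStack are hypothetical names, and the callback merely stands in for the ::SleepEx call.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical RAII guard: remembers the value stored in `slot`, writes 0,
// and writes the remembered value back when the scope ends.
class ScopedZero {
public:
    explicit ScopedZero(std::uintptr_t* slot) : slot_(slot), saved_(*slot) { *slot_ = 0; }
    ~ScopedZero() { *slot_ = saved_; }
    ScopedZero(const ScopedZero&) = delete;
    ScopedZero& operator=(const ScopedZero&) = delete;
    std::uintptr_t saved() const { return saved_; }

private:
    std::uintptr_t* slot_;
    std::uintptr_t saved_;
};

// Simulated steps 6 to 8: zero the slot, run the "sleep", restore the slot.
template <typename Fn>
void sleepWithTerminatedStack(std::uintptr_t* slot, Fn whileSleeping) {
    assert(slot != nullptr);
    ScopedZero guard(slot);   // step 6: the slot now holds 0
    whileSleeping();          // step 7: stand-in for ::SleepEx
}                             // step 8: guard restores the original value
```

Tying the restore to scope exit means the original value is put back on every path out of the function, including early returns.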

Function return addresses are scattered all around the thread’s stack memory area, pointed to by the RBP/EBP register.
In order to find them on the stack, we first need to collect frame pointers, then dereference them for overwriting:

(the above image was borrowed from Eli Bendersky’s post named Stack frame layout on x86-64)

	*(PULONG_PTR)(frameAddr + sizeof(void*)) = Fake_Return_Address;

The initial implementation of ThreadStackSpoofer did that in the walkCallStack and spoofCallStack functions; however, the current implementation shows that those efforts are not required to maintain a stealthy call stack.
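
To illustrate why a null return address ends the call stack, here is a toy frame-pointer walk over a simulated stack. It is only a sketch under simplifying assumptions: "pointers" are indices into a vector, every frame is a [saved frame pointer][return address] pair, and walkFrames is a hypothetical helper rather than anything the PoC or Windows provides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simulated stack layout: stack[fp] holds the caller's frame pointer (an index),
// stack[fp + 1] holds the return address of the frame at `fp`.
std::vector<std::uintptr_t> walkFrames(const std::vector<std::uintptr_t>& stack,
                                       std::size_t fp) {
    assert(fp < stack.size());
    std::vector<std::uintptr_t> returns;
    while (fp + 1 < stack.size()) {
        const std::uintptr_t ret = stack[fp + 1];
        if (ret == 0) break;              // a null return address ends the walk
        returns.push_back(ret);
        const std::size_t next = static_cast<std::size_t>(stack[fp]);
        if (next <= fp) break;            // callers must live at higher addresses
        fp = next;
    }
    return returns;
}
```

With three chained frames the walk yields three return addresses; once the second frame's return address is set to 0, only the first one is reported and everything beyond it stays hidden from the walker.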

Example run

Use case:

C:\> ThreadStackSpoofer.exe <shellcode> <spoof>

Where:

  • <shellcode> is a path to the shellcode file
  • <spoof> when 1 or true enables thread stack spoofing; anything else disables it.

Example run that spoofs Beacon’s thread call stack:

PS D:\dev2\ThreadStackSpoofer> .\x64\Release\ThreadStackSpoofer.exe .\tests\beacon64.bin 1
[.] Reading shellcode bytes...
[.] Hooking kernel32!Sleep...
[.] Injecting shellcode...
[+] Shellcode is now running.
[>] Original return address: 0x1926747bd51. Finishing call stack...

===> MySleep(5000)

[<] Restoring original return address...
[>] Original return address: 0x1926747bd51. Finishing call stack...

===> MySleep(5000)

[<] Restoring original return address...
[>] Original return address: 0x1926747bd51. Finishing call stack...


How do I use it?

Look at the code and its implementation, understand the concept and re-implement it within your own Shellcode Loaders that you utilise to deliver your Red Team engagements.
This is yet another technique for advanced in-memory evasion that increases your Team’s chances of not getting caught by Anti-Viruses, EDRs and Malware Analysts taking a look at your implants.

While developing your advanced shellcode loader, you might also want to implement:

  • Process Heap Encryption: take inspiration from this blog post: Hook Heaps and Live Free, which can let you evade Beacon configuration extractors like BeaconEye
  • Change your Beacon’s memory pages protection to RW (from RX/RWX) and encrypt their contents, using the Shellcode Fluctuation technique, right before sleeping (that could evade scanners such as Moneta or pe-sieve)
  • Clear out any leftovers from the Reflective Loader to avoid in-memory signature-based detections
  • Unhook everything you might have hooked (such as AMSI, ETW, WLDP) before sleeping and then re-hook afterwards.

Actually this is not (yet) a true stack spoofing

As has been pointed out to me, the technique here is not yet truly living up to its name of being a stack spoofer. Since we’re merely overwriting return addresses on the thread’s stack, we’re not spoofing the remaining areas of the stack itself. Moreover, we’re leaving our call stack impossible to unwind, which makes it look anomalous, since the system will not be able to properly walk the entire chain of call stack frames.

Although I’m aware of these shortcomings, at the moment I’ve left it as is, since I cared mostly about evading automated scanners that could iterate over processes, enumerate their threads, walk those threads’ stacks and pick up on any return address pointing back to non-image memory (such as SEC_PRIVATE, the kind allocated dynamically by VirtualAlloc and friends). A focused malware analyst would immediately spot the oddity and consider the thread rather unusual, hunting down our implant. More than sure about it. Yet, I don’t believe that nowadays automated scanners such as AV/EDR have the sort of heuristics implemented that would actually walk each thread’s stack to verify whether it can be unwound ¯\_(ツ)_/¯ .
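
The kind of automated check described above can be sketched in a few lines. This is a simplified model written from the scanner's perspective, not code from the PoC or from any real product: the memory map is a hand-built list, RegionType is a hypothetical stand-in for the image/mapped/private distinction, and suspiciousReturns is an illustrative name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the backing type of a memory region.
enum class RegionType { Image, Mapped, Private };

struct Region {
    std::uintptr_t base;
    std::uintptr_t size;
    RegionType type;
};

// Returns every return address that does not land in image-backed memory.
std::vector<std::uintptr_t> suspiciousReturns(const std::vector<Region>& regions,
                                              const std::vector<std::uintptr_t>& returns) {
    std::vector<std::uintptr_t> hits;
    for (std::uintptr_t addr : returns) {
        bool inImage = false;
        for (const Region& r : regions) {
            assert(r.size > 0);
            if (addr >= r.base && addr - r.base < r.size) {
                inImage = (r.type == RegionType::Image);
                break;
            }
        }
        if (!inImage) hits.push_back(addr);
    }
    return hits;
}
```

A scanner feeding this check with a stack walk that stopped at a null return address never sees the frames beyond it, which is exactly the gap the technique relies on. It also hints at the countermeasure: treating a walk that ends prematurely as a signal in itself.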

Surely this project (and the commercial implementations found in C2 frameworks) gives AV & EDR vendors arguments to consider implementing appropriate heuristics covering such a novel evasion technique.

In order to improve this technique, one can aim for a true Thread Stack Spoofer by inserting carefully crafted fake stack frames established in a reverse-unwinding process.
Read more on this idea below.

Implementing a true Thread Stack Spoofer

An hours-long conversation with namazso taught me that in order to aim for a proper thread stack spoofer we would need to reverse the x64 call stack unwinding process.
Firstly, one needs to carefully study the stack unwinding process explained in (a) linked below. When the system traverses a thread’s call stack on the x64 architecture, it does not simply rely on return addresses scattered around the thread’s stack, but rather it:

  1. Takes the return address.
  2. Attempts to identify the function containing that address (with RtlLookupFunctionEntry).
  3. That function returns RUNTIME_FUNCTION, UNWIND_INFO and UNWIND_CODE structures. These structures describe the function’s beginning address, its ending address, and where all the code sequences that modify RBP or RSP are.
  4. The system needs to know about all stack and frame pointer modifications that happened in each function across the call stack, to then virtually roll back these changes and virtually restore the stack pointers as they were when the call to the processed call stack frame happened (this is implemented in RtlVirtualUnwind).
  5. The system processes all UNWIND_CODEs that the examined function exhibits to precisely compute the location of that frame’s return address and the stack pointer value.
  6. Through this emulation, the system is able to walk down the chain of call stack frames and effectively “unwind” the call stack.
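
The forward direction of this process, as the system performs it, can be modelled in miniature. The sketch below is a heavily simplified illustration of steps 4 and 5, not a reimplementation of RtlVirtualUnwind: memory is a map of 8-byte slots, only three operations are modelled, the pushed register is always RBP, the frame register offset is assumed to be 0, and all type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Simplified model of three unwind operations.
enum class Uwop { PushNonvol, AllocSmall, SetFpreg };

struct UnwindOp {
    Uwop op;
    std::uint64_t operand;   // AllocSmall: allocation size in bytes; unused otherwise
};

struct Context {
    std::uint64_t rsp;
    std::uint64_t rbp;
    std::uint64_t rip;
};

using Memory = std::map<std::uint64_t, std::uint64_t>;   // address -> 8-byte value

// Rolls back one function's prologue and returns the caller's context.
// `ops` are listed in reverse prologue order, as unwind codes are stored.
Context virtualUnwind(const Memory& mem, Context ctx, const std::vector<UnwindOp>& ops) {
    for (const UnwindOp& u : ops) {
        switch (u.op) {
        case Uwop::SetFpreg:                 // undo "mov rbp, rsp"
            ctx.rsp = ctx.rbp;
            break;
        case Uwop::AllocSmall:               // undo "sub rsp, N"
            assert(u.operand % 8 == 0);
            ctx.rsp += u.operand;
            break;
        case Uwop::PushNonvol:               // undo "push rbp"
            ctx.rbp = mem.at(ctx.rsp);
            ctx.rsp += 8;
            break;
        }
    }
    ctx.rip = mem.at(ctx.rsp);               // the return address sits on top now
    ctx.rsp += 8;                            // caller's RSP right after the call returns
    return ctx;
}
```

For a prologue of push rbp; sub rsp, 0x20; mov rbp, rsp, the operations are applied as SetFpreg, AllocSmall(0x20), PushNonvol, which recovers the caller's RBP, the return address and the caller's RSP regardless of how far the function body moved RSP afterwards.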

In order to interfere with this process we would need to revert it by having our own reversed form of RtlVirtualUnwind. We would need to iterate over functions defined in a module (let it be kernel32), scan each function’s UNWIND_CODE codes and closely emulate them backwards (as compared to RtlVirtualUnwind and, precisely, RtlpUnwindPrologue) in order to find the locations on the stack where to put our fake return addresses.

namazso mentions the necessity to introduce 3 fake stack frames to nicely stitch the call stack:

  1. A “desync” frame (consider it a gadget frame) that unwinds differently compared to the caller of our MySleep (having a different UWOP, i.e. Unwind Operation code). We do this by looking through all functions from a module, looking through their UWOPs, and calculating how big the fake frame should be. This frame must have UWOPs different from those of our MySleep’s caller.
  2. The next frame that we want to find is a function that unwinds by popping into RBP from the stack, basically through the UWOP_PUSH_NONVOL code.
  3. For the third frame we need a function that restores RSP from RBP through the UWOP_SET_FPREG code.

The restored RSP must be set to the RSP taken from wherever control flow entered our MySleep, so that all our frames become hidden as a result of the third gadget unwinding there.

In order to begin the process, one can iterate over the executable’s .pdata by dereferencing the IMAGE_DIRECTORY_ENTRY_EXCEPTION data directory entry.
Consider the example below:

    // UNWIND_INFO and UNWIND_CODE are assumed to be defined elsewhere
    // with the field names used here.
    ULONG_PTR imageBase = (ULONG_PTR)GetModuleHandleA("kernel32");
    PIMAGE_NT_HEADERS64 pNthdrs = PIMAGE_NT_HEADERS64(imageBase + PIMAGE_DOS_HEADER(imageBase)->e_lfanew);

    auto excdir = pNthdrs->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXCEPTION];
    if (excdir.Size == 0 || excdir.VirtualAddress == 0)
        return;

    // .pdata is an array of RUNTIME_FUNCTION entries.
    auto begin = PRUNTIME_FUNCTION(excdir.VirtualAddress + imageBase);
    auto end = PRUNTIME_FUNCTION(excdir.VirtualAddress + imageBase + excdir.Size);

    UNWIND_HISTORY_TABLE mshist = { 0 };
    DWORD64 imageBase2 = 0;

    // Look up the unwind data of MySleep's caller; imageBase2 receives its module base.
    PRUNTIME_FUNCTION currFrame = RtlLookupFunctionEntry(
        (DWORD64)caller,
        &imageBase2,
        &mshist
    );
    if (currFrame == NULL)
        return;

    // UnwindData is an RVA relative to the caller's module, not to kernel32.
    UNWIND_INFO *mySleep = (UNWIND_INFO*)(currFrame->UnwindData + imageBase2);
    UNWIND_CODE myFrameUwop = (UNWIND_CODE)(mySleep->UnwindCodes[0]);

    log("1. MySleep RIP UWOP: ", myFrameUwop.UnwindOpcode);

    for (PRUNTIME_FUNCTION it = begin; it < end; ++it)
    {
        UNWIND_INFO* unwindData = (UNWIND_INFO*)(it->UnwindData + imageBase);
        UNWIND_CODE frameUwop = (UNWIND_CODE)(unwindData->UnwindCodes[0]);

        if (frameUwop.UnwindOpcode != myFrameUwop.UnwindOpcode)
        {
            // Found a candidate function for a desync gadget frame

        }
    }

The process is a bit convoluted, yet it boils down to reverting the thread’s call stack unwinding process by substituting arbitrary stack frames with carefully selected other ones, in a ROP-like approach.

This PoC does not follow or replicate this algorithm, because my current understanding allows me to accept the call stack finishing on an EXE-based stack frame, and I don’t want to overcomplicate either my shellcode loaders or this PoC. I leave the exercise of implementing this and sharing it publicly to a keen reader. Or maybe I’ll sit down and have a try at doing this myself, given some more spare time 🙂

More information:


Word of caution

If you plan on adding this functionality to your own shellcode loaders / toolings, be sure to AVOID unhooking kernel32.dll.
An attempt to unhook kernel32 will restore the original Sleep functionality, preventing our callback from being called.
If our callback is not called, the thread will be unable to spoof its own call stack by itself.

If that’s what you want to have, then you might need to run another, watchdog thread, making sure that the Beacon’s thread gets spoofed whenever it sleeps.

If you’re using Cobalt Strike and the unhook-bof BOF by Raphael Mudge, be sure to check out my Pull Request that adds an optional parameter to the BOF specifying libraries that should not be unhooked.

This way you can maintain your hooks in kernel32:

beacon> unhook kernel32
[*] Running unhook.
    Will skip these modules: wmp.dll, kernel32.dll
[+] host called home, sent: 9475 bytes
[+] received output:
ntdll.dll <.text>
Unhook is done.

Modified unhook-bof with an option to ignore specified modules


Final remark

This PoC was designed to work with Cobalt Strike’s Beacon shellcodes. The Beacon is known to call out to kernel32!Sleep to await further instructions from its C2.
This loader leverages that fact by hooking Sleep in order to perform its housekeeping.

This implementation might not work with other shellcodes on the market (such as Meterpreter) if they don’t use Sleep to cool down.
Since this is merely a Proof of Concept showing the technique, I don’t intend on adding support for any other C2 framework.

When you understand the concept, surely you’ll be able to translate it into your shellcode requirements and adapt the solution to your advantage.

Please do not open GitHub issues related to “this code doesn’t work with XYZ shellcode”; they’ll be closed immediately.


Show Support

This and other projects are the outcome of sleepless nights and plenty of hard work. If you like what I do and appreciate that I always give back to the community,
consider buying me a coffee (or better a beer) just to say thank you!


Author

   Mariusz Banach / mgeeky, 21
<mb [at] binary-offensive.com>
(https://github.com/mgeeky)

