
A PoC implementation for an advanced in-memory evasion technique that spoofs the thread call stack. This technique allows bypassing thread-based memory examination rules and better hiding shellcode while it resides in process memory.

Intro

This is an example implementation of the Thread Stack Spoofing technique, aiming to evade Malware Analysts, AVs and EDRs looking for references to the shellcode's frames in an examined thread's call stack.
The idea is to hide references to the shellcode on the thread's call stack, thus masquerading allocations containing the malware's code.

Together with my ShellcodeFluctuation, this implementation gives the Offensive Security community sample implementations to catch up with the offering made by commercial C2 products, so that we can do no worse in our Red Team toolings.

Implementation has changed

The current implementation differs heavily from what was originally published.
This is because I realised there is a much simpler approach to terminate the walk of the thread's call stack and hide the shellcode's related frames: simply writing 0 to the return address of the first frame we control:


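void WINAPI MySleep(DWORD _dwMilliseconds)
{
    [...]
    // Save the return address of the first frame we control, then zero it
    // so that the call stack walk terminates here.
    auto overwrite = (PULONG_PTR)_AddressOfReturnAddress();
    const auto origReturnAddress = *overwrite;
    *overwrite = 0;

    [...]
    // Restore the original return address before returning to the caller.
    *overwrite = origReturnAddress;
}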

The previous implementation, utilising StackWalk64, can be accessed in commit c250724.

This implementation is much more stable and works nicely on both Debug and Release builds under two architectures: x64 and x86.

Demo

This is how a call stack may look when it is NOT spoofed:

 

And this is how it looks when thread stack spoofing is enabled:

 

Above we can see that the last frame on our call stack is our MySleep callback.
One may wonder whether this immediately introduces new IOCs. Hunting rules can look for threads whose call stacks do not unwind into the following expected thread entry points located within system libraries:

kernel32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21
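
A hunting rule of this kind could be sketched roughly as below. This is only a minimal illustration working on already symbolized frames given as text; the helper names stripOffset and unwindsToExpectedEntry are hypothetical, and a real scanner would obtain the frames from a proper stack walk rather than from strings.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: drops a trailing "+0x..." offset from a symbolized frame.
std::string stripOffset(const std::string& frame) {
    const auto plus = frame.find('+');
    return plus == std::string::npos ? frame : frame.substr(0, plus);
}

// Returns true when the two outermost frames are the expected thread entry points.
// `frames` is ordered innermost first, the way debuggers usually print call stacks.
bool unwindsToExpectedEntry(const std::vector<std::string>& frames) {
    if (frames.size() < 2) {
        return false;  // too short to contain both entry points
    }
    const std::string& thunk = frames[frames.size() - 2];
    const std::string& start = frames[frames.size() - 1];
    return stripOffset(thunk) == "kernel32!BaseThreadInitThunk"
        && stripOffset(start) == "ntdll!RtlUserThreadStart";
}
```

A stack that simply ends on some other frame, such as one finishing on a callback inside the EXE, would be flagged by such a rule.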

Although the call stack of the spoofed thread may look rather odd at first, a brief examination of my system showed that there are other threads not unwinding to the above entry points as well:

The above screenshot shows a thread of an unmodified Total Commander x64. As we can see, its call stack pretty much resembles our own in terms of the initial call stack frames.

Why should we care about carefully faking our call stack when there are processes exhibiting traits that we can simply mimic?

How it works?

The rough algorithm is as follows:

  1. Read the shellcode's contents from a file.
  2. Acquire all the necessary function pointers from dbghelp.dll, call SymInitialize.
  3. Hook kernel32!Sleep, pointing it back to our callback.
  4. Inject and launch the shellcode via VirtualAlloc + memcpy + CreateThread. The thread should start from our runShellcode function to avoid having the thread's StartAddress point somewhere unexpected and anomalous (such as ntdll!RtlUserThreadStart+0x21).
  5. As soon as Beacon attempts to sleep, our MySleep callback gets invoked.
  6. We then overwrite the last return address on the stack with 0, which effectively should finish the call stack.
  7. Finally, a call to ::SleepEx is made to let the Beacon sleep while waiting for further communication.
  8. After the sleep is finished, we restore the previously saved original return address and execution is resumed.

Function return addresses are scattered all around the thread's stack memory area, pointed to by the RBP/EBP register.
In order to find them on the stack, we first need to collect frame pointers, then dereference them for overwriting:

(the above image was borrowed from Eli Bendersky's post named Stack frame layout on x86-64)

	*(PULONG_PTR)(frameAddr + sizeof(void*)) = Fake_Return_Address;

The initial implementation of ThreadStackSpoofer did that in the walkCallStack and spoofCallStack functions, however the current implementation shows that these efforts are not required to maintain a stealthy call stack.
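
The frame-pointer layout described above can be illustrated on a simulated stack. The sketch below is an assumption-laden model, not code from this project: it builds three fake frames in ordinary memory, where each frame stores the caller's frame pointer in its first slot and the return address in the next one, and then walks the chain. The names collectReturnAddresses and demoWalk are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Walks a classic frame-pointer chain. At each frame address the first slot holds
// the caller's saved frame pointer and the slot right after it holds the return
// address, i.e. the value found at frameAddr + sizeof(void*).
std::vector<std::uintptr_t> collectReturnAddresses(std::uintptr_t frameAddr,
                                                   std::size_t maxFrames = 64) {
    std::vector<std::uintptr_t> out;
    while (frameAddr != 0 && out.size() < maxFrames) {
        out.push_back(*reinterpret_cast<const std::uintptr_t*>(frameAddr + sizeof(void*)));
        frameAddr = *reinterpret_cast<const std::uintptr_t*>(frameAddr);  // go to the caller's frame
    }
    return out;
}

// Builds a simulated three-frame chain in a local array (not a real thread stack)
// and returns the return addresses found while walking it.
std::vector<std::uintptr_t> demoWalk() {
    std::uintptr_t stack[6] = {};
    auto addr = [&](int i) { return reinterpret_cast<std::uintptr_t>(&stack[i]); };
    stack[0] = addr(2); stack[1] = 0x1111;  // innermost frame
    stack[2] = addr(4); stack[3] = 0x2222;  // its caller
    stack[4] = 0;       stack[5] = 0x3333;  // outermost frame, chain ends here
    return collectReturnAddresses(addr(0));
}
```

Note that frame-pointer chains like this are only reliable when the code was compiled with frame pointers preserved, which is one reason why x64 Windows relies on unwind metadata instead.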

Example run

Use case:

C:\> ThreadStackSpoofer.exe <shellcode> <spoof>

Where:

  • <shellcode> is a path to the shellcode file
  • <spoof> when 1 or true will enable thread stack spoofing and anything else disables it.

Example run that spoofs Beacon's thread call stack:
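
The argument handling described above could look roughly like the sketch below. It is an illustration only: the Options struct and the parseArgs helper are hypothetical names, and the case-sensitive comparison is an assumption rather than documented behaviour of the tool.

```cpp
#include <string>

// Hypothetical container for the two documented command-line parameters.
struct Options {
    std::string shellcodePath;
    bool spoof = false;
};

// Parses "<shellcode> <spoof>". Returns false when an argument is missing.
// Assumption: only the exact strings "1" and "true" enable spoofing.
bool parseArgs(int argc, const char* const* argv, Options& out) {
    if (argc < 3) {
        return false;
    }
    out.shellcodePath = argv[1];
    const std::string flag = argv[2];
    out.spoof = (flag == "1" || flag == "true");
    return true;
}
```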

PS D:\dev2\ThreadStackSpoofer> .\x64\Release\ThreadStackSpoofer.exe .\tests\beacon64.bin 1
[.] Reading shellcode bytes...
[.] Hooking kernel32!Sleep...
[.] Injecting shellcode...
[+] Shellcode is now running.
[>] Original return address: 0x1926747bd51. Finishing call stack...

===> MySleep(5000)

[<] Restoring original return address...
[>] Original return address: 0x1926747bd51. Finishing call stack...

===> MySleep(5000)

[<] Restoring original return address...
[>] Original return address: 0x1926747bd51. Finishing call stack...


How do I use it?

Look at the code and its implementation, understand the concept and re-implement it within your own Shellcode Loaders that you utilise to deliver your Red Team engagements.
This is yet another technique for advanced in-memory evasion that increases your Team's chances of not getting caught by Anti-Viruses, EDRs and Malware Analysts taking a look at your implants.

While developing your advanced shellcode loader, you might also want to implement:

  • Process Heap Encryption: take inspiration from this blog post: Hook Heaps and Live Free, which can let you evade Beacon configuration extractors like BeaconEye
  • Change your Beacon's memory pages protection to RW (from RX/RWX) and encrypt their contents, using the Shellcode Fluctuation technique, right before sleeping (that could evade scanners such as Moneta or pe-sieve)
  • Clear out any leftovers from the Reflective Loader to avoid in-memory signatured detections
  • Unhook everything you might have hooked (such as AMSI, ETW, WLDP) before sleeping and then re-hook afterwards.

Actually this is not (yet) a true stack spoofing

As has been pointed out to me, the technique here is not yet truly holding up to its name of being a stack spoofer. Since we're merely overwriting return addresses on the thread's stack, we're not spoofing the remaining areas of the stack itself. Moreover, we're leaving our call stack impossible to unwind, which makes it look anomalous since the system will not be able to properly walk the entire chain of call stack frames.

Although I'm aware of these shortcomings, at the moment I've left it as is, since I cared mostly about evading automated scanners that could iterate over processes, enumerate their threads, walk those threads' stacks and pick up on any return address pointing back to non-image memory (such as MEM_PRIVATE, the kind allocated dynamically by VirtualAlloc and friends). A focused malware analyst would immediately spot the oddity and consider the thread rather unusual, hunting down our implant. More than sure about it. Yet, I don't believe that automated scanners such as AV/EDR nowadays have the sort of heuristics implemented that would actually walk each thread's stack to verify whether it can be unwound ¯\_(ツ)_/¯ .

Surely this project (and the commercial implementations found in C2 frameworks) gives AV & EDR vendors arguments to consider implementing appropriate heuristics covering such a novel evasion technique.

In order to improve this technique, one can aim for a true Thread Stack Spoofer by inserting carefully crafted fake stack frames established in a reverse-unwinding process.
Read more on this idea below.

Implementing a true Thread Stack Spoofer

An hours-long conversation with namazso taught me that in order to aim for a proper thread stack spoofer we would need to reverse the x64 call stack unwinding process.
Firstly, one needs to carefully study the stack unwinding process explained in (a) linked below. When the system traverses a thread's call stack on the x64 architecture, it does not simply rely on return addresses scattered around the thread's stack, but rather it:

  1. Takes a return address.
  2. Attempts to identify the function containing that address (with RtlLookupFunctionEntry).
  3. That function returns a RUNTIME_FUNCTION entry, which leads to the UNWIND_INFO and UNWIND_CODE structures. These structures describe the function's beginning address, its ending address, and all the code sequences that modify RBP or RSP.
  4. The system needs to know about all stack and frame pointer modifications that happened in each function across the call stack, to then virtually roll back these changes and virtually restore the stack pointers to their values at the time the call to the processed frame happened (this is implemented in RtlVirtualUnwind).
  5. The system processes all the UNWIND_CODEs that the examined function exhibits to precisely compute the location of that frame's return address and stack pointer value.
  6. Through this emulation, the system is able to walk down the chain of call stack frames and effectively "unwind" the call stack.
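
To make steps 4 and 5 more tangible, below is a heavily simplified sketch of how unwind codes translate into the location of a return address. It is a toy model under stated assumptions: only two operations are handled (UWOP_PUSH_NONVOL and UWOP_ALLOC_SMALL, using their documented numeric values and sizes), and the ToyUnwindCode type and both functions are hypothetical names, not the real UNWIND_CODE layout or the RtlVirtualUnwind logic.

```cpp
#include <cstdint>
#include <vector>

// Stand-ins for two x64 unwind operations; the values follow the documented UWOP_* codes.
enum UnwindOp : std::uint8_t {
    PushNonvol = 0,  // UWOP_PUSH_NONVOL: the prologue pushed one 8-byte nonvolatile register
    AllocSmall = 2,  // UWOP_ALLOC_SMALL: the prologue allocated opInfo * 8 + 8 bytes
};

// Toy representation of a single unwind code (assumption: not the real bit-field layout).
struct ToyUnwindCode {
    UnwindOp op;
    std::uint8_t opInfo;  // register number for PushNonvol, size factor for AllocSmall
};

// Sums up how many bytes the function's prologue moved RSP down by.
std::uint64_t prologueStackBytes(const std::vector<ToyUnwindCode>& codes) {
    std::uint64_t bytes = 0;
    for (const auto& code : codes) {
        switch (code.op) {
            case PushNonvol: bytes += 8; break;
            case AllocSmall: bytes += std::uint64_t(code.opInfo) * 8 + 8; break;
        }
    }
    return bytes;
}

// Rolling the prologue back from the current RSP yields the slot holding the return address.
std::uint64_t returnAddressSlot(std::uint64_t rsp, const std::vector<ToyUnwindCode>& codes) {
    return rsp + prologueStackBytes(codes);
}
```

For a prologue of push rbp followed by sub rsp, 0x20, the codes would be an AllocSmall with opInfo 3 and a PushNonvol for RBP, so the return address sits 40 bytes above the current RSP.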

In order to interfere with this process we would need to revert it by having our own reverted form of RtlVirtualUnwind. We would need to iterate over the functions defined in a module (let it be kernel32), scan each function's UNWIND_CODE codes and closely emulate them backwards (as compared to RtlVirtualUnwind and, precisely, RtlpUnwindPrologue) in order to find the locations on the stack where to put our fake return addresses.

namazso mentions the necessity to introduce 3 fake stack frames to nicely stitch the call stack:

  1. A "desync" frame (consider it a gadget-frame) that unwinds differently compared to the caller of our MySleep (having a different UWOP, Unwind Operation code). We do this by looking through all functions from a module, looking through their UWOPs and calculating how big the fake frame should be. This frame must have UWOPs different than our MySleep's caller.
  2. The next frame that we want to find is a function that unwinds by popping into RBP from the stack, basically through the UWOP_PUSH_NONVOL code.
  3. For the third frame we need a function that restores RSP from RBP through the code UWOP_SET_FPREG.

The restored RSP must be set to the RSP value taken from wherever control flow entered our MySleep, so that all our frames become hidden as a result of the third gadget unwinding there.

In order to begin the process, one can iterate over the executable's .pdata by dereferencing the IMAGE_DIRECTORY_ENTRY_EXCEPTION data directory entry.
Consider the example below:

    ULONG_PTR imageBase = (ULONG_PTR)GetModuleHandleA("kernel32");
    PIMAGE_NT_HEADERS64 pNthdrs = PIMAGE_NT_HEADERS64(imageBase + PIMAGE_DOS_HEADER(imageBase)->e_lfanew);

    // Locate the exception directory (.pdata) holding the RUNTIME_FUNCTION entries.
    auto excdir = pNthdrs->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXCEPTION];
    if (excdir.Size == 0 || excdir.VirtualAddress == 0)
        return;

    auto begin = PRUNTIME_FUNCTION(excdir.VirtualAddress + imageBase);
    auto end = PRUNTIME_FUNCTION(excdir.VirtualAddress + imageBase + excdir.Size);

    UNWIND_HISTORY_TABLE mshist = { 0 };
    DWORD64 imageBase2 = 0;

    // Look up the unwind data of MySleep's caller; imageBase2 receives its module base.
    PRUNTIME_FUNCTION currFrame = RtlLookupFunctionEntry(
        (DWORD64)caller,
        &imageBase2,
        &mshist
    );

    // No entry is returned for code without registered unwind data.
    if (currFrame == NULL)
        return;

    // UnwindData is an RVA relative to the caller's own module, hence imageBase2.
    UNWIND_INFO *mySleep = (UNWIND_INFO*)(currFrame->UnwindData + imageBase2);
    UNWIND_CODE myFrameUwop = (UNWIND_CODE)(mySleep->UnwindCodes[0]);

    log("1. MySleep RIP UWOP: ", myFrameUwop.UnwindOpcode);

    for (PRUNTIME_FUNCTION it = begin; it < end; ++it)
    {
        UNWIND_INFO* unwindData = (UNWIND_INFO*)(it->UnwindData + imageBase);
        UNWIND_CODE frameUwop = (UNWIND_CODE)(unwindData->UnwindCodes[0]);

        if (frameUwop.UnwindOpcode != myFrameUwop.UnwindOpcode)
        {
            // Found a candidate function for a desync gadget frame

        }
    }

The process is a bit convoluted, yet boils down to reverting the thread's call stack unwinding process by substituting arbitrary stack frames with carefully selected other ones, in a ROP-like approach.

This PoC does not follow this algorithm, because my current understanding allows me to accept the call stack finishing on an EXE-based stack frame, and I don't want to overcomplicate either my shellcode loaders or this PoC. I'm leaving the exercise of implementing this and sharing it publicly to a keen reader. Or maybe I'll sit down and have a try at doing this myself given some more spare time 🙂

Additional info:


Word of caution

If you plan on adding this functionality to your own shellcode loaders / toolings, be sure to AVOID unhooking kernel32.dll.
An attempt to unhook kernel32 will restore the original Sleep functionality, preventing our callback from being called.
If our callback is not called, the thread will be unable to spoof its own call stack by itself.

If that's what you want to have, then you might need to run another watchdog thread, making sure that the Beacon's thread gets spoofed whenever it sleeps.

If you're using Cobalt Strike and the unhook-bof BOF by Raphael Mudge, be sure to check out my Pull Request that adds an optional parameter to the BOF specifying libraries that should not be unhooked.

This way you can maintain your hooks in kernel32:

beacon> unhook kernel32
[*] Running unhook.
    Will skip these modules: wmp.dll, kernel32.dll
[+] host called home, sent: 9475 bytes
[+] received output:
ntdll.dll <.text>
Unhook is done.

Modified unhook-bof with option to ignore specified modules


Final remark

This PoC was designed to work with Cobalt Strike's Beacon shellcodes. The Beacon is known to call out to kernel32!Sleep to await further instructions from its C2.
This loader leverages that fact by hooking Sleep in order to perform its housekeeping.

This implementation might not work with other shellcodes on the market (such as Meterpreter) if they don't use Sleep to cool down.
Since this is merely a Proof of Concept showing the technique, I don't intend on adding support for any other C2 framework.

When you understand the concept, surely you'll be able to translate it into your shellcode requirements and adapt the solution to your advantage.

Please do not open Github issues related to "this code doesn't work with XYZ shellcode", they'll be closed immediately.


Show Support

This and other projects are the outcome of sleepless nights and plenty of hard work. If you like what I do and appreciate that I always give back to the community,
consider buying me a coffee (or better, a beer) just to say thank you!


Author

   Mariusz Banach / mgeeky, 21
<mb [at] binary-offensive.com>
(https://github.com/mgeeky)

