This is probably the most useless optimization I've ever made. I can imagine this solution could be used as part of a more complex search algorithm or, in rare cases, to search for a combination in a short string. Anyway, have fun reading this.
Full source code can be found here: https://github.com/Lezh1k/Codewars_c/blob/master/src/trench_assault.c
In short, there are two groups of letters. Each letter has a 'weight.'
Full problem statement
The left side letters and their power:
w - 4
p - 3
b - 2
s - 1
The right side letters and their power:
m - 4
q - 3
d - 2
z - 1
So, we need a function that returns each symbol's weight and "side" for the input string. The side can be encoded as the sign of the weight (negative for the left side, positive for the right).
Let's define the letters and sides:
static const char left_side_letters[] = {'s', 'b', 'p', 'w', 0};
static const char right_side_letters[] = {'z', 'd', 'q', 'm', 0};
static const char relief_letters[] = {' ', '-', '|', 0};
These arrays will be cast to uint32_t. Do not do this in production code: it violates strict aliasing. Use a union instead.
Something like this:
typedef union short_str {
uint32_t val;
char arr[4];
} short_str_t;
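As a quick illustration of the safe alternatives (this sketch is not part of the solution below): the union lets you fill the bytes and read them back as one 32-bit value, and a `memcpy` into a `uint32_t` is the other strict-aliasing-safe option, which compilers turn into a single 32-bit load anyway.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The union trick: write bytes via .arr, read them back as one
 * 32-bit value via .val (the byte order of .val depends on
 * endianness). */
typedef union short_str {
    uint32_t val;
    char arr[4];
} short_str_t;

/* memcpy-based load: strictly portable, compiles to one mov on x86. */
static uint32_t load_u32(const char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Both read the same four bytes, so `short_str_t{.arr = ...}.val` and `load_u32` agree on any platform.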
The most obvious way is a simple switch, but it's easy to miss a case there. So here is the initial, simplest implementation of the 'weight' function:
int weight_slow(char s) {
for (const char *pl = relief_letters; *pl; ++pl) {
if (*pl != s)
continue;
return 0;
}
for (const char *pl = left_side_letters; *pl; ++pl) {
if (*pl != s)
continue;
return -((int)(pl - left_side_letters) + 1);
}
for (const char *pl = right_side_letters; *pl; ++pl) {
if (*pl != s)
continue;
return (int)(pl - right_side_letters) + 1;
}
// invalid input, raise error
exit(1);
}
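A few expected values for reference; this is a condensed, self-contained copy of the code above, so it can be checked in isolation:

```c
#include <assert.h>
#include <stdlib.h>

static const char left_side_letters[]  = {'s', 'b', 'p', 'w', 0};
static const char right_side_letters[] = {'z', 'd', 'q', 'm', 0};
static const char relief_letters[]     = {' ', '-', '|', 0};

/* Same linear scans as weight_slow above: relief characters map to 0,
 * left-side letters to -(index + 1), right-side letters to index + 1. */
static int weight_slow(char s) {
    for (const char *pl = relief_letters; *pl; ++pl)
        if (*pl == s) return 0;
    for (const char *pl = left_side_letters; *pl; ++pl)
        if (*pl == s) return -((int)(pl - left_side_letters) + 1);
    for (const char *pl = right_side_letters; *pl; ++pl)
        if (*pl == s) return (int)(pl - right_side_letters) + 1;
    exit(1); /* invalid input */
}
```

So `weight_slow('w')` is -4, `weight_slow('m')` is 4, and `weight_slow(' ')` is 0, matching the power tables at the top.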
But this didn't feel like the fastest way to find a character in a short string. The four letters of each set fit into one 32-bit integer, so we can broadcast the searched byte into a 32-bit mask and XOR it with one of the letter sets.
The main idea is pretty simple. For example, the left-side letters can be packed as 0x73627077 (bytes in string order). If we are looking for the symbol 'p' (0x70), we XOR each byte with 0x70, and only the match gives zero as the XOR result. In our case, the result is 0x03120007. The only thing left is to find the index of the 0x00 byte in an integer. This is possible; see the weight function below.
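To make the XOR step concrete, here is a minimal byte-by-byte sketch (working on bytes sidesteps the endianness of the packed integer):

```c
#include <assert.h>

/* XOR every byte of the left-side letter set with the searched
 * character; only the matching position becomes 0x00. */
static void xor_demo(unsigned char c, unsigned char out[4]) {
    const unsigned char letters[4] = {'s', 'b', 'p', 'w'}; /* 73 62 70 77 */
    for (int i = 0; i < 4; ++i)
        out[i] = letters[i] ^ c;
}
```

For `c = 'p'` (0x70) the output bytes are 0x03, 0x12, 0x00, 0x07: the zero byte at index 2 marks the match, exactly the 0x03120007 from the text.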
The first attempt used SSE, because there are dedicated CPU instructions for comparing bytes and extracting a match mask.
.intel_syntax noprefix
.text
.global barr8_char_idx
# Function prototype:
# int barr8_char_idx(const char* array, char input_char);
barr8_char_idx:
# array (address of 8-byte array) -> rdi
# input_char -> esi
# note: this loads 16 bytes, reading past the end of the 8-byte array
movdqu xmm0, [rdi]
# Broadcast the input_char byte across an SSE register
# (pshufd alone would broadcast the dword, not the byte)
movd xmm1, esi
punpcklbw xmm1, xmm1
punpcklwd xmm1, xmm1
pshufd xmm1, xmm1, 0
# Compare each byte of xmm0 with xmm1
pcmpeqb xmm0, xmm1
# Extract the comparison result into a bitmask (one bit per byte)
pmovmskb eax, xmm0
# Keep only the bits that correspond to the 8 array bytes
and eax, 0xff
jz not_found
# Index of the lowest set bit == index of the matching byte
bsf eax, eax
ret
not_found:
mov eax, -1
ret
This one works slowly (even slower than the first solution): the data is not aligned, and using SSE is overkill for such a small problem. But SSE has ready-made instructions for comparing registers and converting the result into a bitmask, so it was a useful proof of concept.
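For reference, the same idea can be written with SSE2 intrinsics instead of hand-written assembly. This is a sketch, not the code from the repository: it pads the letters into a 16-byte buffer to avoid the out-of-bounds load, uses `_mm_set1_epi8` for a correct byte broadcast, and assumes a gcc/clang x86 target for `__builtin_ctz`.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <string.h>

/* Returns the index of `c` in the first `n` bytes of `arr` (n <= 16),
 * or -1 if it is not there. */
static int sse2_char_idx(const char *arr, int n, char c) {
    char buf[16] = {0};
    memcpy(buf, arr, (size_t)n);             /* stay in bounds */
    __m128i data = _mm_loadu_si128((const __m128i *)buf);
    __m128i key  = _mm_set1_epi8(c);         /* broadcast the byte */
    /* One mask bit per byte: set where data == key. */
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(data, key));
    mask &= (1 << n) - 1;                    /* drop the zero padding */
    return mask ? __builtin_ctz(mask) : -1;
}
```

The `mask &= (1 << n) - 1` line matters when `c` is 0: without it, the zero padding bytes would produce false matches.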
So the main challenge is to find the 0x00 byte in a uint32. This is possible: we can turn every non-zero byte of the uint32 into 0xff and every zero byte into 0x7f. Inverting that result gives all zeros except at the position of the 0x00 byte, which becomes 0x80. Then the only thing left is to count trailing zero bits and divide the result by 8 to get the byte index. There are several ways to count trailing zero bits.
The second attempt is the fastest implementation at this time:
int zbyte_32(uint32_t x) {
// for 0 byte set 0x7f, for other bytes - 0xff
uint32_t y = (x & 0x7f7f7f7f) + 0x7f7f7f7f;
// inverting gives 0x80 where 0 byte was and 0 for other bytes
y = ~(y | x | 0x7f7f7f7f);
// This check is necessary because calling __builtin_ctz with 0 is undefined
// behavior. Without it, gcc/clang optimize the weight function down to a
// plain `return 0;` statement.
if (y == 0) {
return -1;
}
// find index of first non zero bit in int32_t
int n = __builtin_ctz(y);
// divide this index by 8 to get byte index (instead of bit index)
return n >> 3;
}
int weight(char s) {
uint32_t s_msk = (uint32_t)s * 0x01010101;
uint32_t relief_val = *(const uint32_t *)relief_letters;
uint32_t left_side_val = *(const uint32_t *)left_side_letters;
uint32_t right_side_val = *(const uint32_t *)right_side_letters;
int w = zbyte_32(relief_val ^ s_msk);
if (w != -1)
return 0;
w = zbyte_32(right_side_val ^ s_msk);
if (w != -1)
return w + 1;
w = zbyte_32(left_side_val ^ s_msk);
if (w != -1)
return -w - 1;
__builtin_unreachable();
}
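The fast path can be sanity-checked in isolation. The sketch below reproduces `zbyte_32` and `weight` from above, with one deliberate change: the raw pointer casts are replaced by `memcpy` loads so the snippet stays strictly portable (the behavior is identical, and on x86 it compiles to the same single load).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const char left_side_letters[]  = {'s', 'b', 'p', 'w', 0};
static const char right_side_letters[] = {'z', 'd', 'q', 'm', 0};
static const char relief_letters[]     = {' ', '-', '|', 0};

/* Index of the 0x00 byte in x, or -1 if there is none. */
static int zbyte_32(uint32_t x) {
    uint32_t y = (x & 0x7f7f7f7f) + 0x7f7f7f7f;
    y = ~(y | x | 0x7f7f7f7f);
    return y ? __builtin_ctz(y) >> 3 : -1;
}

/* Strict-aliasing-safe 32-bit load. */
static uint32_t load_u32(const char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static int weight(char s) {
    uint32_t s_msk = (uint32_t)(unsigned char)s * 0x01010101u;
    if (zbyte_32(load_u32(relief_letters) ^ s_msk) != -1)
        return 0;
    int w = zbyte_32(load_u32(right_side_letters) ^ s_msk);
    if (w != -1)
        return w + 1;
    w = zbyte_32(load_u32(left_side_letters) ^ s_msk);
    if (w != -1)
        return -w - 1;
    return 0; /* unreachable for valid input */
}
```

The results match `weight_slow` on every valid character: for example `weight('w')` is -4 and `weight('z')` is 1.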
I also tried to optimize the zbyte_32 function: the `if (y == 0) return -1;` check seemed excessive. The BSF instruction scans the source operand for the lowest set bit; if one is found, it clears ZF and loads its index into the destination, otherwise it sets ZF. So why check y before calling this function? I can use BSF, then check the ZF flag and return -1 if it is set.
So I tried this implementation:
.intel_syntax noprefix
.text
.global zbyte_32_asm
# Function prototype:
# extern int zbyte_32_asm(uint32_t x);
zbyte_32_asm:
# Input: edi -> input uint32_t x
mov eax, edi
and eax, 0x7f7f7f7f
add eax, 0x7f7f7f7f
or eax, edi
or eax, 0x7f7f7f7f
not eax
bsf eax, eax
jz not_found
shr eax, 3
ret
not_found:
mov eax, -1
ret
It runs roughly 8 times slower than what the compilers generate. See the profiling results below.
To profile the function, I tried to get the weight of each symbol of the test string sbpwzdqm -|sbpwzdqm -|sbpwzdqm -|sbpwzdqm -|
10,000,000 times. Here are the results:
slow() took 1.2938900000 seconds to execute
fast() took 0.1967720000 seconds to execute
asm() took 1.6532540000 seconds to execute
In most cases, micro-optimizations are a waste of time, as it's much more useful (and sometimes easier) to reduce algorithmic complexity, add a cache, change the memory allocator, etc. However, in rare cases, micro-optimization is the only way to achieve the necessary service performance.