This is probably the most useless optimization I've ever made. I can imagine this solution could be used as part of a more complex search algorithm or, in rare cases, to search for a combination in a short string. Anyway, have fun reading this.
Full source code can be found here: https://github.com/Lezh1k/Codewars_c/blob/master/src/trench_assault.c
In short, there are two groups of letters. Each letter has a 'weight.'
Full problem statement
The left side letters and their power:
w - 4
p - 3
b - 2
s - 1
The right side letters and their power:
m - 4
q - 3
d - 2
z - 1
So, we need a function that returns each symbol's weight and "side" for the input string. The side can be encoded as the sign of the weight (negative for the left side, positive for the right).
Let's define the letters and sides:
static const char left_side_letters[] = {'s', 'b', 'p', 'w', 0};
static const char right_side_letters[] = {'z', 'd', 'q', 'm', 0};
static const char relief_letters[] = {' ', '-', '|', 0};
These arrays will be cast to uint32_t. Do not do this in production code: it violates strict aliasing. Use a union instead.
Something like this:
typedef union short_str {
uint32_t val;
char arr[4];
} short_str_t;
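As a quick illustration of the safe alternatives (this sketch is not part of the solution below): the union lets you fill the bytes and read them back as one 32-bit value, and a `memcpy` into a `uint32_t` is the other strict-aliasing-safe option, which compilers turn into a single 32-bit load anyway.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The union trick: write bytes via .arr, read them back as one
 * 32-bit value via .val (the byte order of .val depends on
 * endianness). */
typedef union short_str {
    uint32_t val;
    char arr[4];
} short_str_t;

/* memcpy-based load: strictly portable, compiles to one mov on x86. */
static uint32_t load_u32(const char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Both read the same four bytes, so `short_str_t{.arr = ...}.val` and `load_u32` agree on any platform.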
The most obvious way is a simple switch, but it's easy to miss a case there. So here is the initial, simplest implementation of the 'weight' function:
int weight_slow(char s) {
for (const char *pl = relief_letters; *pl; ++pl) {
if (*pl != s)
continue;
return 0;
}
for (const char *pl = left_side_letters; *pl; ++pl) {
if (*pl != s)
continue;
return -((int)(pl - left_side_letters) + 1);
}
for (const char *pl = right_side_letters; *pl; ++pl) {
if (*pl != s)
continue;
return (int)(pl - right_side_letters) + 1;
}
// invalid input, raise error
exit(1);
}
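A few expected values for reference; this is a condensed, self-contained copy of the code above, so it can be checked in isolation:

```c
#include <assert.h>
#include <stdlib.h>

static const char left_side_letters[]  = {'s', 'b', 'p', 'w', 0};
static const char right_side_letters[] = {'z', 'd', 'q', 'm', 0};
static const char relief_letters[]     = {' ', '-', '|', 0};

/* Same linear scans as weight_slow above: relief characters map to 0,
 * left-side letters to -(index + 1), right-side letters to index + 1. */
static int weight_slow(char s) {
    for (const char *pl = relief_letters; *pl; ++pl)
        if (*pl == s) return 0;
    for (const char *pl = left_side_letters; *pl; ++pl)
        if (*pl == s) return -((int)(pl - left_side_letters) + 1);
    for (const char *pl = right_side_letters; *pl; ++pl)
        if (*pl == s) return (int)(pl - right_side_letters) + 1;
    exit(1); /* invalid input */
}
```

So `weight_slow('w')` is -4, `weight_slow('m')` is 4, and `weight_slow(' ')` is 0, matching the power tables at the top.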
But this didn't feel like the fastest way to find a character in a short string. The four letters of each set fit into one 32-bit integer, so we can broadcast the searched byte into a 32-bit mask and XOR it with one of the letter sets.
The main idea is pretty simple. For example, the left-side letters can be packed as 0x73627077 (bytes in string order). If we are looking for the symbol 'p' (0x70), we XOR each byte with 0x70, and only the match gives zero as the XOR result. In our case, the result is 0x03120007. The only thing left is to find the index of the 0x00 byte in an integer. This is possible; see the weight function below.
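To make the XOR step concrete, here is a minimal byte-by-byte sketch (working on bytes sidesteps the endianness of the packed integer):

```c
#include <assert.h>

/* XOR every byte of the left-side letter set with the searched
 * character; only the matching position becomes 0x00. */
static void xor_demo(unsigned char c, unsigned char out[4]) {
    const unsigned char letters[4] = {'s', 'b', 'p', 'w'}; /* 73 62 70 77 */
    for (int i = 0; i < 4; ++i)
        out[i] = letters[i] ^ c;
}
```

For `c = 'p'` (0x70) the output bytes are 0x03, 0x12, 0x00, 0x07: the zero byte at index 2 marks the match, exactly the 0x03120007 from the text.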
The first attempt used SSE, because there are dedicated CPU instructions for comparing bytes and extracting a match mask.
.intel_syntax noprefix
.text
.global barr8_char_idx
# Function prototype:
# int barr8_char_idx(const char* array, char input_char);
barr8_char_idx:
# array (address of 8-byte array) -> rdi
# input_char -> esi
# note: this loads 16 bytes, reading past the end of the 8-byte array
movdqu xmm0, [rdi]
# Broadcast the input_char byte across an SSE register
# (pshufd alone would broadcast the dword, not the byte)
movd xmm1, esi
punpcklbw xmm1, xmm1
punpcklwd xmm1, xmm1
pshufd xmm1, xmm1, 0
# Compare each byte of xmm0 with xmm1
pcmpeqb xmm0, xmm1
# Extract the comparison result into a bitmask (one bit per byte)
pmovmskb eax, xmm0
# Keep only the bits that correspond to the 8 array bytes
and eax, 0xff
jz not_found
# Index of the lowest set bit == index of the matching byte
bsf eax, eax
ret
not_found:
mov eax, -1
ret
This one works slowly (even slower than the first solution): the data is not aligned, and using SSE is overkill for such a small problem. But SSE has ready-made instructions for comparing registers and converting the result into a bitmask, so it was a useful proof of concept.
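For reference, the same idea can be written with SSE2 intrinsics instead of hand-written assembly. This is a sketch, not the code from the repository: it pads the letters into a 16-byte buffer to avoid the out-of-bounds load, uses `_mm_set1_epi8` for a correct byte broadcast, and assumes a gcc/clang x86 target for `__builtin_ctz`.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <string.h>

/* Returns the index of `c` in the first `n` bytes of `arr` (n <= 16),
 * or -1 if it is not there. */
static int sse2_char_idx(const char *arr, int n, char c) {
    char buf[16] = {0};
    memcpy(buf, arr, (size_t)n);             /* stay in bounds */
    __m128i data = _mm_loadu_si128((const __m128i *)buf);
    __m128i key  = _mm_set1_epi8(c);         /* broadcast the byte */
    /* One mask bit per byte: set where data == key. */
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(data, key));
    mask &= (1 << n) - 1;                    /* drop the zero padding */
    return mask ? __builtin_ctz(mask) : -1;
}
```

The `mask &= (1 << n) - 1` line matters when `c` is 0: without it, the zero padding bytes would produce false matches.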
So the main challenge is to find the 0x00 byte in a uint32. This is possible: we can turn every non-zero byte of the uint32 into 0xff and every zero byte into 0x7f. Inverting that result gives all zeros except at the position of the 0x00 byte, which becomes 0x80. Then the only thing left is to count trailing zero bits and divide the result by 8 to get the byte index. There are several ways to count trailing zero bits.
The second attempt is the fastest implementation at this time:
int zbyte_32(uint32_t x) {
// for 0 byte set 0x7f, for other bytes - 0xff
uint32_t y = (x & 0x7f7f7f7f) + 0x7f7f7f7f;
// inverting gives 0x80 where 0 byte was and 0 for other bytes
y = ~(y | x | 0x7f7f7f7f);
// This check is necessary because calling __builtin_ctz with 0 is undefined
// behavior. Without it, gcc/clang optimize the weight function down to a
// plain `return 0;` statement.
if (y == 0) {
return -1;
}
// find index of first non zero bit in int32_t
int n = __builtin_ctz(y);
// divide this index by 8 to get byte index (instead of bit index)
return n >> 3;
}
int weight(char s) {
uint32_t s_msk = (uint32_t)s * 0x01010101;
uint32_t relief_val = *(const uint32_t *)relief_letters;
uint32_t left_side_val = *(const uint32_t *)left_side_letters;
uint32_t right_side_val = *(const uint32_t *)right_side_letters;
int w = zbyte_32(relief_val ^ s_msk);
if (w != -1)
return 0;
w = zbyte_32(right_side_val ^ s_msk);
if (w != -1)
return w + 1;
w = zbyte_32(left_side_val ^ s_msk);
if (w != -1)
return -w - 1;
__builtin_unreachable();
}
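The fast path can be sanity-checked in isolation. The sketch below reproduces `zbyte_32` and `weight` from above, with one deliberate change: the raw pointer casts are replaced by `memcpy` loads so the snippet stays strictly portable (the behavior is identical, and on x86 it compiles to the same single load).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const char left_side_letters[]  = {'s', 'b', 'p', 'w', 0};
static const char right_side_letters[] = {'z', 'd', 'q', 'm', 0};
static const char relief_letters[]     = {' ', '-', '|', 0};

/* Index of the 0x00 byte in x, or -1 if there is none. */
static int zbyte_32(uint32_t x) {
    uint32_t y = (x & 0x7f7f7f7f) + 0x7f7f7f7f;
    y = ~(y | x | 0x7f7f7f7f);
    return y ? __builtin_ctz(y) >> 3 : -1;
}

/* Strict-aliasing-safe 32-bit load. */
static uint32_t load_u32(const char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static int weight(char s) {
    uint32_t s_msk = (uint32_t)(unsigned char)s * 0x01010101u;
    if (zbyte_32(load_u32(relief_letters) ^ s_msk) != -1)
        return 0;
    int w = zbyte_32(load_u32(right_side_letters) ^ s_msk);
    if (w != -1)
        return w + 1;
    w = zbyte_32(load_u32(left_side_letters) ^ s_msk);
    if (w != -1)
        return -w - 1;
    return 0; /* unreachable for valid input */
}
```

The results match `weight_slow` on every valid character: for example `weight('w')` is -4 and `weight('z')` is 1.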
I also tried to optimize the zbyte_32 function: the `if (y == 0) return -1;` check seemed excessive. The BSF instruction scans the source operand for the lowest set bit; if one is found, it clears ZF and loads its index into the destination, otherwise it sets ZF. So why check y before calling this function? I can use BSF, then check the ZF flag and return -1 if it is set.
So I tried this implementation:
.intel_syntax noprefix
.text
.global zbyte_32_asm
# Function prototype:
# extern int zbyte_32_asm(uint32_t x);
zbyte_32_asm:
# Input: edi -> input uint32_t x
mov eax, edi
and eax, 0x7f7f7f7f
add eax, 0x7f7f7f7f
or eax, edi
or eax, 0x7f7f7f7f
not eax
bsf eax, eax
jz not_found
shr eax, 3
ret
not_found:
mov eax, -1
ret
It runs roughly 8 times slower than what the compilers generate. See the profiling results below.
To profile the function, I tried to get the weight of each symbol of the test string sbpwzdqm -|sbpwzdqm -|sbpwzdqm -|sbpwzdqm -|
10,000,000 times. Here are the results:
slow() took 1.2938900000 seconds to execute
fast() took 0.1967720000 seconds to execute
asm() took 1.6532540000 seconds to execute
In most cases, micro-optimizations are a waste of time, as it's much more useful (and sometimes easier) to reduce algorithmic complexity, add a cache, change the memory allocator, etc. However, in rare cases, micro-optimization is the only way to achieve the necessary service performance.