Valkey 8에 반영된 개선사항 - new hash table

개발하면서/코드보면서 2025. 5. 4. 22:23

작년 3월, Redis는 기존 BSD 라이선스에서 Redis Source Available License (RSALv2)과
Server Side Public License (SSPLv1) 듀얼 라이센스를 적용했다.

https://redis.io/blog/redis-adopts-dual-source-available-licensing/

https://github.com/redis/redis/pull/13157

개인 사용자나 Redis를 이용한 라이브러리들은 영향 없지만

AWS, Azure, GCP 같은 클라우드 업체에서 라이선스 버전 이후의 Redis는 서비스 제공이 어렵게 된 거다.

그러자 라이센스 반발로 만들어진 Redis 포크 프로젝트 Valkey를 많은 대기업들이 지원하기로 했다.

그저 개인의 치기 어린 포크 프로젝트가 아니다.

https://news.hada.io/topic?id=14436

1년이 지난 지금,

antirez가 복귀하고 VectorSet 데이터 타입이 추가되고, Redis에 AGPL 라이선스도 추가했지만 딱히 끌리지 않는다.
※ antirez 옹 미안해요..

올 초부터 Valkey에서 달라진 점들을 알고 싶었는데, 이번 기회에 알아보고자 한다.

https://valkey.io/blog/ 내용 중 Valkey 8 글들을 Google NotebookLM에 던져놓고 알아보기 시작했다.

자료를 요약해 주고 정리도 해줘서 시작하는 허들은 낮출 수 있었지만 결국 내 걸로 만들려면 많은 시간과 노력은 필요했다.

Valkey 8에 적용된 개선 내용

new hash table
dictionary per slot
key embedding
multi-threaded architecture
specific command/data type optimizations

이번글에서는 Valkey 8.1에 반영된 새로운 hash table을 알아본다.

https://valkey.io/blog/new-hash-table/

Valkey · A new hash table

Viktor is an open source developer from Ericsson and one of the maintainers of Valkey.

valkey.io

결론부터 말하면 새로운 hash table 구조는

key-value 하나당 약 20 byte를 아끼고, TTL이 적용된 key-value는 약 30 byte를 아낀다고 한다.

와우!!!!!

새로운 hash table을 알아보기 전에 기존 구조를 알아보자.

Valkey 7.2 Hash Table

오래전에 파악한 내용이 있어 다시 봤는데 음... 다시 정리해 보자.

Dict

Dict는 다양한 자료구조를 담는 기본적인 redis의 자료구조입니다. key, value형식으로 되어있고 value에 사용자가 입력한 자료구조(?)들이 들어가게 되죠출처 : http://en.wikipedia.org/wiki/File:Hash_table_5_0_1_1

dol9.tistory.com

Valkey 7.2에서 dict 구조를 살펴보면 아래와 같다.

특정 조건(해시충돌, 키 사용량 등)을 기준으로 bucket 수를 늘리기 위한 rehash를 목적으로 dictEntry **ht_table[2]가 있다.

rehash가 발생하지 않을 때는 ht_table[0]에서 조회/추가/삭제를 하지만,

rehash 중일 때는 추가는 ht_table[1]만, 조회, 삭제는 ht_table[0]과 ht_table[1] 두 곳에 진행한다.

위 구조에 "FOO" -> "BAR" key-value 쌍으로 저장된 상태에서

조회를 하려고 $ get FOO 를 실행하면 메모리 접근은 4번 일어난다.

1. ht_table[0][hash_function("FOO")%hash_size]

2. ht_table[0][hash_function("FOO")%hash_size] -> key

3. ht_table[0][hash_function("FOO")%hash_size] -> val

4. ht_table[0][hash_function("FOO")%hash_size] -> val -> ptr

그리고 주소 정보를 담고 있는 포인터는 공짜가 아니다. 64bit OS 기준 8 byte다.

Valkey 8.1 Hash Table

아예 새로운 프로젝트다. 새롭게 보이는 kvstore, hashtable 파일에 블로그에 쓰인 디자인들이 구현되어 있다.

코드 주석에 쓰여있는 친절한 설명과 Copilot의 /explain 기능, 그리고 상상력을 더해 알아보자.

새로운 Hash Table의 흐름은 다음과 같다.

1. kvstore와 hashtable 조합으로 기존 dict를 교체했다.

서버가 뜰 때 관리하는 DB 구조체가 있는데(7.2 redisDb, 8.1 serverDb)
가장 중요한 dict와 expires key-value를 dict에서 kvstore + hashtable로 바뀌었다.

※ dict 변수명을 keys로 바꾼 것 나이스!!

kvstore 없이 바로 hashtable만 써도 될 것 같은데 뭔가 이유가 있겠지.. 하며 다음을 기약한다.

// Valkey 7.2
/* Redis database representation. There are multiple databases identified
 * by integers from 0 (the default database) up to the max configured
 * database. The database number is the 'id' field in the structure. */
typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB */
    dict *expires;              /* Timeout of keys with a timeout set */
    dict *blocking_keys;        /* Keys with clients waiting for data (BLPOP)*/
    dict *blocking_keys_unblock_on_nokey;   /* Keys with clients waiting for
                                             * data, and should be unblocked if key is deleted (XREADEDGROUP).
                                             * This is a subset of blocking_keys*/
    dict *ready_keys;           /* Blocked keys that received a PUSH */
    dict *watched_keys;         /* WATCHED keys for MULTI/EXEC CAS */
    int id;                     /* Database ID */
    long long avg_ttl;          /* Average TTL, just for stats */
    unsigned long expires_cursor; /* Cursor of the active expire cycle. */
    list *defrag_later;         /* List of key names to attempt to defrag one by one, gradually. */
    clusterSlotToKeyMapping *slots_to_keys; /* Array of slots to keys. Only used in cluster mode (db 0). */
} redisDb;

struct dict {
    dictType *type;

    dictEntry **ht_table[2];
    unsigned long ht_used[2];

    long rehashidx; /* rehashing not in progress if rehashidx == -1 */

    /* Keep small vars at end for optimal (minimal) struct padding */
    int16_t pauserehash; /* If >0 rehashing is paused (<0 indicates coding error) */
    signed char ht_size_exp[2]; /* exponent of size. (size = 1<<exp) */

    void *metadata[];           /* An arbitrary number of bytes (starting at a
                                 * pointer-aligned address) of size as defined
                                 * by dictType's dictEntryBytes. */
};



// Valkey 8.1
/* Database representation. There are multiple databases identified
 * by integers from 0 (the default database) up to the max configured
 * database. The database number is the 'id' field in the structure. */
typedef struct serverDb {
    kvstore *keys;                        /* The keyspace for this DB */
    kvstore *expires;                     /* Timeout of keys with a timeout set */
    dict *blocking_keys;                  /* Keys with clients waiting for data (BLPOP)*/
    dict *blocking_keys_unblock_on_nokey; /* Keys with clients waiting for
                                           * data, and should be unblocked if key is deleted (XREADEDGROUP).
                                           * This is a subset of blocking_keys*/
    dict *ready_keys;                     /* Blocked keys that received a PUSH */
    dict *watched_keys;                   /* WATCHED keys for MULTI/EXEC CAS */
    int id;                               /* Database ID */
    long long avg_ttl;                    /* Average TTL, just for stats */
    unsigned long expires_cursor;         /* Cursor of the active expire cycle. */
} serverDb;

struct _kvstore {
    int flags;
    hashtableType *dtype;
    hashtable **hashtables;
    int num_hashtables;
    int num_hashtables_bits;
    list *rehashing;                          /* List of hash tables in this kvstore that are currently rehashing. */
    int resize_cursor;                        /* Cron job uses this cursor to gradually resize hash tables (only used if num_hashtables > 1). */
    int allocated_hashtables;                 /* The number of allocated hashtables. */
    int non_empty_hashtables;                 /* The number of non-empty hashtables. */
    unsigned long long key_count;             /* Total number of keys in this kvstore. */
    unsigned long long bucket_count;          /* Total number of buckets in this kvstore across hash tables. */
    unsigned long long *hashtable_size_index; /* Binary indexed tree (BIT) that describes cumulative key frequencies up until
                                               * given hashtable-index. */
    size_t overhead_hashtable_lut;            /* Overhead of all hashtables in bytes. */
    size_t overhead_hashtable_rehashing;      /* Overhead of hash tables rehashing in bytes. */
};

struct hashtable {
    hashtableType *type;
    ssize_t rehash_idx;        /* -1 = rehashing not in progress. */
    bucket *tables[2];         /* 0 = main table, 1 = rehashing target.  */
    size_t used[2];            /* Number of entries in each table. */
    int8_t bucket_exp[2];      /* Exponent for num buckets (num = 1 << exp). */
    int16_t pause_rehash;      /* Non-zero = rehashing is paused */
    int16_t pause_auto_shrink; /* Non-zero = automatic resizing disallowed. */
    size_t child_buckets[2];   /* Number of allocated child buckets. */
    void *metadata[];
};

2. 캐시라인에 맞춘 64 byte에 bucket을 만들었다.

캐시라인 설명을 위해 어렴풋하게 알고 있는 프로그램 실행 흐름을 얘기하면
작성한 파일을 메인 메모리에 올리고 명령어와 데이터를 CPU로 가져와서 실행한다고 알고 있다.

이때 CPU가 데이터를 빨리 가져오기 위해 CPU 캐시가 존재하는데

캐시 라인은 CPU 캐시 메모리의 기본 데이터 전송 단위, 즉 캐시의 최소 단위 크기를 말하고

요즘 하드웨어는 64 바이트라고 한다.

이를 적극 활용하려고 Valkey 8.1에서는 bucket 크기를 64 바이트로 맞췄다.

bucket 크기를 64 byte로 bucket을 맞췄을 때 장점은 hash 충돌이 발생한 경우

CPU에서 추가 메모리 접근 없이 이미 CPU 캐시에 로드한 데이터에 다른 Entry 정보도 있기 때문에 빠른 탐색이 가능하다.

※ 값이 다른 key인데 hash 값이 같다면 동일한 bucket에 담겨 있을 테고 그럼 두 번째 Entry에 저장이 된다.

한 bucket은 8byte의 메타데이터와 8 byte Entry 7개로 구성되어 있다. (8 + 8*7 = 64)

메타 데이터는 1. chain 여부, 2. key 존재 여부, 3. Entry 7개의 key를 1 byte로 표현한 secondary hash로 구성되어 있다.

하나씩 알아보자.

3. bucket이 가득 찼을 땐 자식 bucket을 만들어 처리한다.

위에서 하나의 bucket에는 7개의 Entry가 존재한다고 했다.

그럼 7개 Entry가 있는 bucket에 Entry를 추가하면 어떤 일이 일어날까?

1. 새로운 Bucket을 만들고

2. 기존 Bucket의 일곱 번째 Entry를 새로운 Bucket 첫번째로 옮긴다.

3. 기존 Bucket 메타데이터 수정
bucket_from->chained = 1 그리고 bucket_from->presence &= ~(1 << pos_from)

3. 추가할 Entry는 새로운 bucket 두 번째 Entry에 추가해 준다.

4. 해당 bucket에 Entry 존재 여부는 bit로 먼저 판단한다.

해당 bucket의 데이터가 비었는지, 가득 찼는지, Entry에 데이터가 있는지 없는지를 판단할 때
7개 Entry를 직접 조회하지 않고 presence를 이용한다.

1이면 존재, 0이면 비어있다는 의미다.

앞에서 살펴본 Child bucket이 생성 됐는데 키가 삭제되면서 Child bucket이 비었거나
Entry 중간에 비어있다면 정리해 줄 때도 presence가 잘 쓰인다.

처음 보는 함수가 있어 살펴보니 gcc 빌트인 함수라고 한다. 오.. 득템
__builtin_popcount: parameter 값을 이진수로 나타냈을 때 1이 몇 개인지 카운트해 주는 함수다.

__builtin_ctz(count trailing zeros): parameter 값을 이진수로 나타냈을 때 LSB부터 왼쪽으로 이동하면서
0이 몇개인지 카운트해주는 함수다. 반대는 __builtin_clz(MSB부터 오른쪽으로 이동)

static void compactBucketChain(hashtable *ht, size_t bucket_index, int table_index) {
    bucket *b = &ht->tables[table_index][bucket_index];
    while (b->chained) {
        bucket *next = getChildBucket(b);
        // child bucket도 chained 타입이고 비어있다면 child bucket 삭제(포인트 바꾸고 메모리 해제)
        if (next->chained && next->presence == 0) {
            /* Empty bucket in the middle of the chain. Remove it from the chain. */
            bucket *next_next = getChildBucket(next);
            b->entries[ENTRIES_PER_BUCKET - 1] = next_next;
            zfree(next);
            if (ht->type->trackMemUsage) ht->type->trackMemUsage(ht, -sizeof(bucket));
            ht->child_buckets[table_index]--;
            continue;
        }

        // child bucket이 비어있고 chained 타입이 아니라면 삭제(prune)
        if (!next->chained && (next->presence == 0 || __builtin_popcount(next->presence) == 1)) {
            /* Next is the last bucket and it's empty or has only one entry.
             * Delete it and turn b into an "unchained" bucket. */
            pruneLastBucket(ht, b, next, table_index);
            return;
        }

        // 중간에 비어있는 Entry가 있다면 가장 마지막에 있는 child bucket의 Entry로 채움
        if (__builtin_popcount(b->presence) < ENTRIES_PER_BUCKET - 1) {
            /* Fill the holes in the bucket. */
            for (int pos = 0; pos < ENTRIES_PER_BUCKET - 1; pos++) {
                if (!isPositionFilled(b, pos)) {
                    fillBucketHole(ht, b, pos, table_index);
                    if (!b->chained) return;
                }
            }
        }

        /* Bucket is full. Move forward to next bucket. */
        b = next;
    }
}

5. 입력한 key가 몇 번째 Entry인지 secondary hash로 판단 후 Entry key값을 비교한다.

데이터 저장 할 때 key 값에 hash 함수를 적용해 64 비트로 만든 후
그중 상위 8비트를 highBit 함수로 구해서 hashes에 저장한다.

# <limits.h> 64 bit 8
#define CHAR_BIT  __CHAR_BIT__

/* For the hash bits stored in the bucket, we use the highest bits of the hash
 * value, since these are not used for selecting the bucket. */
static inline uint8_t highBits(uint64_t hash) {
    return hash >> (CHAR_BIT * 7);
}

/* Helper to insert an entry. Doesn't check if an entry with a matching key
 * already exists. This must be ensured by the caller. */
static void insert(hashtable *ht, uint64_t hash, void *entry) {
    ...
    bucket *b = findBucketForInsert(ht, hash, &pos_in_bucket, &table_index);
    b->entries[pos_in_bucket] = entry;
    b->presence |= (1 << pos_in_bucket);
    b->hashes[pos_in_bucket] = highBits(hash);
    ht->used[table_index]++;
}

그럼 나중에 해당 key를 조회하거나 삭제할 때 bucket에 있는 Entry key를 하나씩 꺼내고 비교하는 게 아니라

hash값에 상위 8비트와 hashes에 있는 값을 비교하고(isPositionFilled(b, pos) && b->hashes[pos] == h2)

매칭되면 실제 key를 비교하는 식이다. (compareKeys)

static bucket *findBucket(hashtable *ht, uint64_t hash, const void *key, int *pos_in_bucket, int *table_index) {
    if (hashtableSize(ht) == 0) return 0;
    uint8_t h2 = highBits(hash);
    int table;
    ...
    for (table = 0; table <= 1; table++) {
        ...
        do {
            /* Find candidate entries with presence flag set and matching h2 hash. */
            for (int pos = 0; pos < numBucketPositions(b); pos++) {
                if (isPositionFilled(b, pos) && b->hashes[pos] == h2) {
                    /* It's a candidate. */
                    void *entry = b->entries[pos];
                    const void *elem_key = entryGetKey(ht, entry);
                    if (compareKeys(ht, key, elem_key) == 0) {
                        /* It's a match. */
                        assert(pos_in_bucket != NULL);
                        *pos_in_bucket = pos;
                        if (table_index) *table_index = table;
                        return b;
                    }
                }
            }
            b = getChildBucket(b);
        } while (b != NULL);
    }
    return NULL;
}

진짜 비트 하나하나 알차게 쓰네...

6. 메모리를 아끼기 위해 embed type이 있는데 항상 적용되는 건 아니다.

8.1에서 메모리 접근 횟수 감소와 메모리 사용이 줄어든 이유는 serverObject에 포인터가 아닌 값을 직접 저장하고 있기 때문이다.

그럼 Valkey 8.1은 항상 embed type으로 저장이 될까? 아니다.

저장할 때 value 길이가 44 이하일 때 Embed String 타입으로 저장하고,

44 초과면 7.2에서 본것처럼 포인터로 저장한다.

/* Create a string object with EMBSTR encoding if it is smaller than
 * OBJ_ENCODING_EMBSTR_SIZE_LIMIT, otherwise the RAW encoding is
 * used.
 *
 * The current limit of 44 is chosen so that the biggest string object
 * we allocate as EMBSTR will still fit into the 64 byte arena of jemalloc. */
#define OBJ_ENCODING_EMBSTR_SIZE_LIMIT 44
robj *createStringObject(const char *ptr, size_t len) {
    if (len <= OBJ_ENCODING_EMBSTR_SIZE_LIMIT)
        return createEmbeddedStringObject(ptr, len);
    else
        return createRawStringObject(ptr, len);
}

serverObject에 key, value, expire 데이터가 embed 타입으로 어떻게 저장되는지 알아보자.
※ object.c의 `static robj *createEmbeddedStringObjectWithKeyAndExpire`

serverObject 구조체는 다음과 같다.

type + encoding + lru + hasexpire + hasembkey + refcount = 64 bit

64 bit 다음에 void* ptr이 존재한다.

구조체에 bit 필드 선언하는 방법과 구조체에서 마지막 void 포인터는 구조체 크기에 영향을 안 준다는 구글 AI 검색 내용.

크흐~~ 옛날에 고생하면서 찾았던 기억이 나면서 스샷 한번 ㅋㅋㅋ

static robj *createEmbeddedStringObjectWithKeyAndExpire(const char *val_ptr,
                                                        size_t val_len,
                                                        const sds key,
                                                        long long expire) {
    // 64 bit + value size + expire size(optional) + key size(optional)
    size_t min_size = sizeof(robj) + val_sds_size;
    if (expire != -1) {
        min_size += sizeof(long long);
    }
    if (has_embkey) {
        /* Size of embedded key, incl. 1 byte for prefixed sds hdr size. */
        min_size += 1 + key_sds_size;
    }

    /* 64 bit 알차게 값 할당 */
    size_t bufsize = 0;
    robj *o = zmalloc_usable(min_size, &bufsize);
    o->type = OBJ_STRING;
    o->encoding = OBJ_ENCODING_EMBSTR;
    o->refcount = 1;
    o->lru = 0;
    o->hasexpire = (expire != -1);
    o->hasembkey = has_embkey;


    /* embedded 데이터 설정을 위해 void포인터 크기 만큼(8 byte) 이동 */
    char *data = (void *)(o + 1);

    /* Set the expire field. */
    if (o->hasexpire) {
        *(long long *)data = expire;
        data += sizeof(long long);
    }

    /* Copy embedded key. */
    if (o->hasembkey) {
        *data++ = sdsHdrSize(key_sds_type);
        sdswrite(data, key_sds_size, key_sds_type, key, key_sds_len);
        data += key_sds_size;
    }

    /* Copy embedded value (EMBSTR) always as SDS TYPE 8. Account for unused
     * memory in the SDS alloc field. */
    size_t remaining_size = bufsize - (data - (char *)(void *)o);
    o->ptr = sdswrite(data, remaining_size, SDS_TYPE_8, val_ptr, val_len);

    return o;
}

이런 느낌이다.

새로운 해시 테이블 코드 보면서 앞으로 Redis와 합쳐지기는 어렵다는 생각이 더 굳어졌다.

메모리 아끼고 성능 개선하려고 만들어진 코드를 보며 재미도 느끼고 오... 이렇게 까지 한다고? 신기해하며 코드를 봤는데

한편으론 Redis보다 복잡해진 느낌도 받아서 좀 더 알아봐야겠다.

저작자표시 비영리 변경금지 (새창열림)

ABOUT ME

인기 태그

조급하지말고 천천히 조급하지말고 천천히

Valkey 8에 적용된 개선 내용

Valkey 7.2 Hash Table

Valkey 8.1 Hash Table

1. kvstore와 hashtable 조합으로 기존 dict를 교체했다.

2. 캐시라인에 맞춘 64 byte에 bucket을 만들었다.

3. bucket이 가득 찼을 땐 자식 bucket을 만들어 처리한다.

4. 해당 bucket에 Entry 존재 여부는 bit로 먼저 판단한다.

5. 입력한 key가 몇 번째 Entry인지 secondary hash로 판단 후 Entry key값을 비교한다.

6. 메모리를 아끼기 위해 embed type이 있는데 항상 적용되는 건 아니다.

티스토리툴바

ABOUT ME

인기 태그

Valkey 8에 적용된 개선 내용

Valkey 7.2 Hash Table

Valkey 8.1 Hash Table

1. kvstore와 hashtable 조합으로 기존 dict를 교체했다.

2. 캐시라인에 맞춘 64 byte에 bucket을 만들었다.

3. bucket이 가득 찼을 땐 자식 bucket을 만들어 처리한다.

4. 해당 bucket에 Entry 존재 여부는 bit로 먼저 판단한다.

5. 입력한 key가 몇 번째 Entry인지 secondary hash로 판단 후 Entry key값을 비교한다.

6. 메모리를 아끼기 위해 embed type이 있는데 항상 적용되는 건 아니다.

관련글 관련글 더보기

티스토리툴바