CMU15445 (Fall 2019) 之 Project#2 - Hash Table 详解

本文主要是介绍CMU15445 (Fall 2019) 之 Project#2 - Hash Table 详解，对大家解决编程问题具有一定的参考价值，需要的程序猿们随着小编来一起学习吧！

前言

该实验要求实现一个基于线性探测法的哈希表，但是与直接放在内存中的哈希表不同的是，该实验假设哈希表非常大，无法整个放入内存中，因此需要将哈希表进行分割，将多个键值对放在一个 Page 中，然后搭配上一个实验实现的 Buffer Pool Manager 一起食用。哈希表的大致结构如下图所示：

下面介绍如何实现一个线程安全的哈希表。

代码实现

Page 布局

从上图可以看出，多个键值对被放在 Page 里面，作为 Page 的数据存在磁盘中。为了更好地组织和管理这些键值对，实验任务一要求我们实现两个类：HashTableHeaderPage 和 HashTableBlockPage，HashTableHeaderPage 保存着 block index 到 page id 的映射关系以及其他哈希表元数据，每个哈希表只有一个 HashTableHeaderPage，而 HashTableBlockPage 可以有多个。

HashTableHeaderPage 有以下几个字段：

字段	大小	描述
`lsn_`	4 bytes	Log sequence number (Used in Project 4)
`size_`	4 bytes	Number of Key & Value pairs the hash table can hold
`page_id_`	4 bytes	Self Page Id
`next_ind_`	4 bytes	The next index to add a new entry to `block_page_ids_`
`block_page_ids_`	4080 bytes	Array of block `page_id_t`

这些字段总共 4096 字节，正好是一个 Page 的大小，在 src/include/common/config.h 中可以修改 PAGE_SIZE 的大小。该类的实现代码如下：

复制namespace bustub {
page_id_t HashTableHeaderPage::GetBlockPageId(size_t index) {
  assert(index < next_ind_);
  return block_page_ids_[index];
}

page_id_t HashTableHeaderPage::GetPageId() const { return page_id_; }

void HashTableHeaderPage::SetPageId(bustub::page_id_t page_id) { page_id_ = page_id; }

lsn_t HashTableHeaderPage::GetLSN() const { return lsn_; }

void HashTableHeaderPage::SetLSN(lsn_t lsn) { lsn_ = lsn; }

void HashTableHeaderPage::AddBlockPageId(page_id_t page_id) { block_page_ids_[next_ind_++] = page_id; }

size_t HashTableHeaderPage::NumBlocks() { return next_ind_; }

void HashTableHeaderPage::SetSize(size_t size) { size_ = size; }

size_t HashTableHeaderPage::GetSize() const { return size_; }

}  // namespace bustub

Hash Table Block Page

HashTableBlockPage 包含多个 slot，用于保存键值对，所以该类定义了查询、插入和删除键值对的函数。为了跟踪每个 slot 的使用情况，该类包含以下三个数据成员：

occupied_ : 第 i 位置 1 表示 Page 的第 i 个 slot 上存储了键值对或者之前存了键值对但之后被删除了（起到墓碑的作用）
readable_ : 第 i 位置 1 表示 Page 的第 i 个 slot 上存储了键值对
array_ : 用于保存键值对的数组

每个键值对的大小为 sizeof(std::pair<KeyType, ValueType>) 字节（记为 PS），每个键值对对应两个 bit（occupied 和 readable）即 1/4 个字节，所以一个 Page 最多能保存 BLOCK_ARRAY_SIZE = PAGE_SIZE / (PS + 1/4) 个键值对，即每个 Page 有 BLOCK_ARRAY_SIZE 个 slot。

由于 occupied_ 和 readable_ 被定义为 char 数组，所以需要两个辅助函数 GetBit 和 SetBit 来访问第 i 位的比特。

复制namespace bustub {
/**
 * Store indexed key and and value together within block page. Supports
 * non-unique keys.
 *
 * Block page format (keys are stored in order):
 *  ----------------------------------------------------------------
 * | KEY(1) + VALUE(1) | KEY(2) + VALUE(2) | ... | KEY(n) + VALUE(n)
 *  ----------------------------------------------------------------
 *
 *  Here '+' means concatenation.
 *
 */
template <typename KeyType, typename ValueType, typename KeyComparator>
class HashTableBlockPage {
 public:
  // Delete all constructor / destructor to ensure memory safety
  HashTableBlockPage() = delete;

  KeyType KeyAt(slot_offset_t bucket_ind) const;
  ValueType ValueAt(slot_offset_t bucket_ind) const;
  bool Insert(slot_offset_t bucket_ind, const KeyType &key, const ValueType &value);
  void Remove(slot_offset_t bucket_ind);
  bool IsOccupied(slot_offset_t bucket_ind) const;
  bool IsReadable(slot_offset_t bucket_ind) const;

 private:
  bool GetBit(const std::atomic_char *array, slot_offset_t bucket_ind) const;
  void SetBit(std::atomic_char *array, slot_offset_t bucket_ind, bool value);

  std::atomic_char occupied_[(BLOCK_ARRAY_SIZE - 1) / 8 + 1];

  // 0 if tombstone/brand new (never occupied), 1 otherwise.
  std::atomic_char readable_[(BLOCK_ARRAY_SIZE - 1) / 8 + 1];
  MappingType array_[0];
};

}  // namespace bustub

实现代码如下：

复制namespace bustub {

template <typename KeyType, typename ValueType, typename KeyComparator>
KeyType HASH_TABLE_BLOCK_TYPE::KeyAt(slot_offset_t bucket_ind) const {
  return array_[bucket_ind].first;
}

template <typename KeyType, typename ValueType, typename KeyComparator>
ValueType HASH_TABLE_BLOCK_TYPE::ValueAt(slot_offset_t bucket_ind) const {
  return array_[bucket_ind].second;
}

template <typename KeyType, typename ValueType, typename KeyComparator>
bool HASH_TABLE_BLOCK_TYPE::Insert(slot_offset_t bucket_ind, const KeyType &key, const ValueType &value) {
  if (IsReadable(bucket_ind)) {
    return false;
  }

  array_[bucket_ind] = {key, value};
  SetBit(readable_, bucket_ind, true);
  SetBit(occupied_, bucket_ind, true);
  return true;
}

template <typename KeyType, typename ValueType, typename KeyComparator>
void HASH_TABLE_BLOCK_TYPE::Remove(slot_offset_t bucket_ind) {
  SetBit(readable_, bucket_ind, false);
}

template <typename KeyType, typename ValueType, typename KeyComparator>
bool HASH_TABLE_BLOCK_TYPE::IsOccupied(slot_offset_t bucket_ind) const {
  return GetBit(occupied_, bucket_ind);
}

template <typename KeyType, typename ValueType, typename KeyComparator>
bool HASH_TABLE_BLOCK_TYPE::IsReadable(slot_offset_t bucket_ind) const {
  return GetBit(readable_, bucket_ind);
}

template <typename KeyType, typename ValueType, typename KeyComparator>
bool HASH_TABLE_BLOCK_TYPE::GetBit(const std::atomic_char *array, slot_offset_t bucket_ind) const {
  return array[bucket_ind / 8] & (1 << bucket_ind % 8);
}

template <typename KeyType, typename ValueType, typename KeyComparator>
void HASH_TABLE_BLOCK_TYPE::SetBit(std::atomic_char *array, slot_offset_t bucket_ind, bool value) {
  if (value) {
    array[bucket_ind / 8] |= (1 << bucket_ind % 8);
  } else {
    array[bucket_ind / 8] &= ~(1 << bucket_ind % 8);
  }
}

// DO NOT REMOVE ANYTHING BELOW THIS LINE
template class HashTableBlockPage<int, int, IntComparator>;
template class HashTableBlockPage<GenericKey<4>, RID, GenericComparator<4>>;
template class HashTableBlockPage<GenericKey<8>, RID, GenericComparator<8>>;
template class HashTableBlockPage<GenericKey<16>, RID, GenericComparator<16>>;
template class HashTableBlockPage<GenericKey<32>, RID, GenericComparator<32>>;
template class HashTableBlockPage<GenericKey<64>, RID, GenericComparator<64>>;

}  // namespace bustub

哈希表

声明

实验要求我们实现哈希表的插入、查找、删除和调整大小的的操作，对应的类声明如下，为了完成这些操作，我们多定义了几个私有的辅助函数和成员变量：

复制namespace bustub {

#define HASH_TABLE_TYPE LinearProbeHashTable<KeyType, ValueType, KeyComparator>

template <typename KeyType, typename ValueType, typename KeyComparator>
class LinearProbeHashTable : public HashTable<KeyType, ValueType, KeyComparator> {
 public:

  explicit LinearProbeHashTable(const std::string &name, BufferPoolManager *buffer_pool_manager,
                                const KeyComparator &comparator, size_t num_buckets, HashFunction<KeyType> hash_fn);

  bool Insert(Transaction *transaction, const KeyType &key, const ValueType &value) override;
  bool Remove(Transaction *transaction, const KeyType &key, const ValueType &value) override;
  bool GetValue(Transaction *transaction, const KeyType &key, std::vector<ValueType> *result) override;
  void Resize(size_t initial_size);
  size_t GetSize();

 private:
  using slot_index_t = size_t;
  using block_index_t = size_t;
  enum class LockType { READ = 0, WRITE = 1 };

  /**
   * initialize header page and allocate block pages for it
   * @param page the hash table header page
   */
  void InitHeaderPage(HashTableHeaderPage *page);

  /**
   * get index according to key
   * @param key the key to be hashed
   * @return a tuple contains slot index, block page index and bucket index
   */
  std::tuple<slot_index_t, block_index_t, slot_offset_t> GetIndex(const KeyType &key);

  /**
   * linear probe step forward
   * @param bucket_index the bucket index
   * @param block_index the hash table block page index
   * @param header_page hash table header page
   * @param raw_block_page raw hash table block page
   * @param block_page hash table block page
   * @param lock_type lock type of block page
   */
  void StepForward(slot_offset_t &bucket_index, block_index_t &block_index, Page *&raw_block_page,
                   HASH_TABLE_BLOCK_TYPE *&block_page, LockType lockType);

  bool InsertImpl(Transaction *transaction, const KeyType &key, const ValueType &value);
    
  inline bool IsMatch(HASH_TABLE_BLOCK_TYPE *block_page, slot_offset_t bucket_index, const KeyType &key,
                      const ValueType &value) {
    return !comparator_(key, block_page->KeyAt(bucket_index)) && value == block_page->ValueAt(bucket_index);
  }

  inline HashTableHeaderPage *HeaderPageCast(Page *page) {
    return reinterpret_cast<HashTableHeaderPage *>(page->GetData());
  }

  inline HASH_TABLE_BLOCK_TYPE *BlockPageCast(Page *page) {
    return reinterpret_cast<HASH_TABLE_BLOCK_TYPE *>(page->GetData());
  }

  /**
   * get the slot number of hash table block page
   * @param block_index the index of hash table block page
   * @return the slot number of block page
   */
  inline size_t GetBlockArraySize(block_index_t block_index){
    return block_index < num_pages_ - 1 ? BLOCK_ARRAY_SIZE : last_block_array_size_;
  }

  // member variable
  page_id_t header_page_id_;
  BufferPoolManager *buffer_pool_manager_;
  KeyComparator comparator_;
  std::vector<page_id_t> page_ids_;
  size_t num_buckets_;
  size_t num_pages_;
  size_t last_block_array_size_;

  // Readers includes inserts and removes, writer is only resize
  ReaderWriterLatch table_latch_;

  // Hash function
  HashFunction<KeyType> hash_fn_;
};

}  // namespace bustub

构造函数

在构造函数中负责根据用户指定的 num_buckets （也就是 slot 的数量）分配 Page，最后一个 Page 的 slot 数量可能少于前面的 Page。这里还将每个 HashTableBlockPage 对应的 page_id 保存到 page_ids_ 成员里面了，这样之后就不需要仅仅为了知道某个 HashTableBlockPage 的 page_id 而去找 BufferPoolManager 索要 HashTableHeaderPage。

复制template <typename KeyType, typename ValueType, typename KeyComparator>
HASH_TABLE_TYPE::LinearProbeHashTable(const std::string &name, BufferPoolManager *buffer_pool_manager,
                                      const KeyComparator &comparator, size_t num_buckets,
                                      HashFunction<KeyType> hash_fn)
    : buffer_pool_manager_(buffer_pool_manager),
      comparator_(comparator),
      num_buckets_(num_buckets),
      num_pages_((num_buckets - 1) / BLOCK_ARRAY_SIZE + 1),
      last_block_array_size_(num_buckets - (num_pages_ - 1) * BLOCK_ARRAY_SIZE),
      hash_fn_(std::move(hash_fn)) {
  auto page = buffer_pool_manager->NewPage(&header_page_id_);
  page->WLatch();

  InitHeaderPage(HeaderPageCast(page));

  page->WUnlatch();
  buffer_pool_manager_->UnpinPage(header_page_id_, true);
}

template <typename KeyType, typename ValueType, typename KeyComparator>
void HASH_TABLE_TYPE::InitHeaderPage(HashTableHeaderPage *header_page) {
  header_page->SetPageId(header_page_id_);
  header_page->SetSize(num_buckets_);

  page_ids_.clear();
  for (size_t i = 0; i < num_pages_; ++i) {
    page_id_t page_id;
    buffer_pool_manager_->NewPage(&page_id);
    buffer_pool_manager_->UnpinPage(page_id, false);
    header_page->AddBlockPageId(page_id);
    page_ids_.push_back(page_id);
  }
}

查找

哈希表使用线性探测法进行键值对的查找，由于实验要求哈希表支持插入同键不同值的键值对，所以在线性探测过程中需要将所有相同键的值插入 result 向量中：

复制template <typename KeyType, typename ValueType, typename KeyComparator>
bool HASH_TABLE_TYPE::GetValue(Transaction *transaction, const KeyType &key, std::vector<ValueType> *result) {
  table_latch_.RLock();

  // get slot index, block page index and bucket index according to key
  auto [slot_index, block_index, bucket_index] = GetIndex(key);

  // get block page that contains the key
  auto raw_block_page = buffer_pool_manager_->FetchPage(page_ids_[block_index]);
  raw_block_page->RLatch();
  auto block_page = BlockPageCast(raw_block_page);

  // linear probe
  while (block_page->IsOccupied(bucket_index)) {
    // find the correct position
    if (block_page->IsReadable(bucket_index) && !comparator_(key, block_page->KeyAt(bucket_index))) {
      result->push_back(block_page->ValueAt(bucket_index));
    }

    StepForward(bucket_index, block_index, raw_block_page, block_page, LockType::READ);

    // break loop if we have returned to original position
    if (block_index * BLOCK_ARRAY_SIZE + bucket_index == slot_index) {
      break;
    }
  }

  // unlock
  raw_block_page->RUnlatch();
  buffer_pool_manager_->UnpinPage(raw_block_page->GetPageId(), false);
  table_latch_.RUnlock();
  return result->size() > 0;
}

GetIndex 函数根据 key 计算出对应的 slot_index、block_index 和 bucket_index（就是 slot offset），结合上图就能理解该函数的工作原理了：

复制template <typename KeyType, typename ValueType, typename KeyComparator>
auto HASH_TABLE_TYPE::GetIndex(const KeyType &key) -> std::tuple<slot_index_t, block_index_t, slot_offset_t> {
  slot_index_t slot_index = hash_fn_.GetHash(key) % num_buckets_;
  block_index_t block_index = slot_index / BLOCK_ARRAY_SIZE;
  slot_offset_t bucket_index = slot_index % BLOCK_ARRAY_SIZE;
  return {slot_index, block_index, bucket_index};
}

在线性探测过程中，我们可能从从一个 HashTableBlockPage 跳到下一个，这时候需要更新 bucket_index 和 block_index：

复制template <typename KeyType, typename ValueType, typename KeyComparator>
void HASH_TABLE_TYPE::StepForward(slot_offset_t &bucket_index, block_index_t &block_index, Page *&raw_block_page,
                                  HASH_TABLE_BLOCK_TYPE *&block_page, LockType lockType) {
  if (++bucket_index != GetBlockArraySize(block_index)) {
    return;
  }

  // move to next block page
  if (lockType == LockType::READ) {
    raw_block_page->RUnlatch();
  } else {
    raw_block_page->WUnlatch();
  }
  buffer_pool_manager_->UnpinPage(page_ids_[block_index], false);

  // update index
  bucket_index = 0;
  block_index = (block_index + 1) % num_pages_;

  // update page
  raw_block_page = buffer_pool_manager_->FetchPage(page_ids_[block_index]);
  if (lockType == LockType::READ) {
    raw_block_page->RLatch();
  } else {
    raw_block_page->WLatch();
  }
  block_page = BlockPageCast(raw_block_page);
}

插入

实验要求哈希表不允许插入已经存在的键值对，同时插入过程中如果回到了最初的位置，说明没有可用的 slot 用于插入键值对，这时需要将哈希表的大小翻倍。由于 Resize 的函数也要用到插入操作，如果直接调用 Insert 会出现死锁，所以这里使用 InsertImpl 来实现插入：

复制template <typename KeyType, typename ValueType, typename KeyComparator>
bool HASH_TABLE_TYPE::Insert(Transaction *transaction, const KeyType &key, const ValueType &value) {
  table_latch_.RLock();
  auto success = InsertImpl(transaction, key, value);
  table_latch_.RUnlock();
  return success;
}

template <typename KeyType, typename ValueType, typename KeyComparator>
bool HASH_TABLE_TYPE::InsertImpl(Transaction *transaction, const KeyType &key, const ValueType &value) {
  // get slot index, block page index and bucket index according to key
  auto [slot_index, block_index, bucket_index] = GetIndex(key);

  // get block page that contains the key
  auto raw_block_page = buffer_pool_manager_->FetchPage(page_ids_[block_index]);
  raw_block_page->WLatch();
  auto block_page = BlockPageCast(raw_block_page);

  bool success = true;
  while (!block_page->Insert(bucket_index, key, value)) {
    // return false if (key, value) pair already exists
    if (block_page->IsReadable(bucket_index) && IsMatch(block_page, bucket_index, key, value)) {
      success = false;
      break;
    }

    StepForward(bucket_index, block_index, raw_block_page, block_page, LockType::WRITE);

    // resize hash table if we have returned to original position
    if (block_index * BLOCK_ARRAY_SIZE + bucket_index == slot_index) {
      raw_block_page->WUnlatch();
      buffer_pool_manager_->UnpinPage(raw_block_page->GetPageId(), false);

      Resize(num_pages_);
      std::tie(slot_index, block_index, bucket_index) = GetIndex(key);

      raw_block_page = buffer_pool_manager_->FetchPage(page_ids_[block_index]);
      raw_block_page->WLatch();
      block_page = BlockPageCast(raw_block_page);
    }
  }

  raw_block_page->WUnlatch();
  buffer_pool_manager_->UnpinPage(raw_block_page->GetPageId(), success);
  return success;
}

调整大小

由于实验假设哈希表很大，所以我们不能将原本的键值对全部保存到内存中，然后调整 HashTableHeaderPage 的大小，复用 HashTableBlockPage 并创建新的 Page，再把键值对重新插入。而是应该直接创建新的 HashTableHeaderPage 和 HashTableBlockPage ，并删除旧的哈希表页：

复制template <typename KeyType, typename ValueType, typename KeyComparator>
void HASH_TABLE_TYPE::Resize(size_t initial_size) {
  table_latch_.WLock();
  num_buckets_ = 2 * initial_size;
  num_pages_ = (num_buckets_ - 1) / BLOCK_ARRAY_SIZE + 1;
  last_block_array_size_ = num_buckets_ - (num_pages_ - 1) * BLOCK_ARRAY_SIZE;

  // save the old header page id
  auto old_header_page_id = header_page_id_;
  std::vector<page_id_t> old_page_ids(page_ids_);

  // get the new header page
  auto raw_header_page = buffer_pool_manager_->NewPage(&header_page_id_);
  raw_header_page->WLatch();
  InitHeaderPage(HeaderPageCast(raw_header_page));

  // move (key, value) pairs to new space
  for (size_t block_index = 0; block_index < num_pages_; ++block_index) {
    auto old_page_id = old_page_ids[block_index];
    auto raw_block_page = buffer_pool_manager_->FetchPage(old_page_id);
    raw_block_page->RLatch();
    auto block_page = BlockPageCast(raw_block_page);

    // move (key, value) pair from each readable slot
    for (slot_offset_t bucket_index = 0; bucket_index < GetBlockArraySize(block_index); ++bucket_index) {
      if (block_page->IsReadable(bucket_index)) {
        InsertImpl(nullptr, block_page->KeyAt(bucket_index), block_page->ValueAt(bucket_index));
      }
    }

    // delete old page
    raw_block_page->RUnlatch();
    buffer_pool_manager_->UnpinPage(old_page_id, false);
    buffer_pool_manager_->DeletePage(old_page_id);
  }

  raw_header_page->WUnlatch();
  buffer_pool_manager_->UnpinPage(header_page_id_, false);
  buffer_pool_manager_->DeletePage(old_header_page_id);
  table_latch_.WUnlock();
}

删除

删除操作和查找操作很像，不过是将找到的 slot 标上墓碑罢了：

复制template <typename KeyType, typename ValueType, typename KeyComparator>
bool HASH_TABLE_TYPE::Remove(Transaction *transaction, const KeyType &key, const ValueType &value) {
  table_latch_.RLock();

  // get slot index, block page index and bucket index according to key
  auto [slot_index, block_index, bucket_index] = GetIndex(key);

  // get block page that contains the key
  auto raw_block_page = buffer_pool_manager_->FetchPage(page_ids_[block_index]);
  raw_block_page->WLatch();
  auto block_page = BlockPageCast(raw_block_page);

  bool success = false;
  while (block_page->IsOccupied(bucket_index)) {
    // remove the (key, value) pair if find the matched readable one
    if (IsMatch(block_page, bucket_index, key, value)) {
      if (block_page->IsReadable(bucket_index)) {
        block_page->Remove(bucket_index);
        success = true;
      } else {
        success = false;
      }
      break;
    }

    // step forward
    StepForward(bucket_index, block_index, raw_block_page, block_page, LockType::WRITE);

    // break loop if we have returned to original position
    if (block_index * BLOCK_ARRAY_SIZE + bucket_index == slot_index) {
      break;
    }
  }

  raw_block_page->WUnlatch();
  buffer_pool_manager_->UnpinPage(raw_block_page->GetPageId(), success);
  table_latch_.RUnlock();
  return success;
}

获取大小

最后是获取哈希表的大小操作，直接返回 num_buckets_ 就行了：

复制template <typename KeyType, typename ValueType, typename KeyComparator>
size_t HASH_TABLE_TYPE::GetSize() {
  return num_buckets_;
}

测试

对哈希表的测试结果如下，6 个测试全部通过了：

总结

该实验主要考察对线性探测哈希表、缓冲池管理器和读写锁的理解，难度相比上一个实验略有提升，但是理解了哈希表的结构图之后应该就不难完成该实验了，以上~~

转 https://www.cnblogs.com/zhiyiYo/p/16453495.html

这篇关于CMU15445 (Fall 2019) 之 Project#2 - Hash Table 详解的文章就介绍到这儿，希望我们推荐的文章对大家有所帮助，也希望大家多多支持为之网！

C/C++教程

CMU15445 (Fall 2019) 之 Project#2 - Hash Table 详解

前言

代码实现

Page 布局

Hash Table Header Page

Hash Table Block Page

哈希表

声明

构造函数

查找

插入

调整大小

删除

获取大小

测试

总结

前端开发

后端开发

移动端开发

数据库

服务器运维

人工智能

区块链

游戏开发

网站运营

大数据/云计算

软件工程

软件/开发工具使用

资讯