The Interns Company
Advanced

Consistent Hashing

Distributed Systems, Load Balancing, Scalability, System Design

Overview

Consistent hashing is a distributed hashing technique in which the assignment of a key does not depend directly on the number of servers or objects in the system. As a result, only a small share of keys has to move when servers are added or removed, which makes it well suited to distributed caches, databases, and load balancers.

What is Consistent Hashing?

Consistent hashing is a special kind of hashing that minimizes the number of keys that need to be remapped when a hash table is resized. It's particularly useful in distributed systems where we need to scale up or down the number of servers without having to rebuild the entire mapping of keys to servers.

In traditional modulo-based hashing, where a key is assigned to server hash(key) mod N, most keys need to be remapped when the number of slots (servers) changes. This can be catastrophic in a distributed system where data migration is expensive. Consistent hashing solves this by ensuring that when a server is added or removed, only about K/N keys need to be remapped on average, where K is the number of keys and N is the number of servers.
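To make the contrast concrete, the sketch below counts how many keys change servers under plain modulo hashing when a cluster grows from 4 to 5 servers. The `hash32` helper (FNV-1a followed by a MurmurHash3-style finalizer), the key names, and the cluster sizes are illustrative assumptions rather than anything this article prescribes.

```typescript
// Illustrative 32-bit string hash: FNV-1a followed by the MurmurHash3
// finalizer for better bit mixing. Any well-distributed hash would do.
function hash32(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0; // unsigned, in [0, 2^32 - 1]
}

// Traditional scheme: server index = hash(key) mod N
function moduloServer(key: string, serverCount: number): number {
  return hash32(key) % serverCount;
}

// Count keys whose server changes when the server count changes
function countMoved(keys: string[], before: number, after: number): number {
  return keys.filter(k => moduloServer(k, before) !== moduloServer(k, after)).length;
}

const sampleKeys = Array.from({ length: 10000 }, (_, i) => `key-${i}`);
const movedFraction = countMoved(sampleKeys, 4, 5) / sampleKeys.length;

console.log(`modulo hashing moved ${(movedFraction * 100).toFixed(1)}% of keys`);
console.log(`share taken by one new server out of 5: ${(100 / 5).toFixed(1)}%`);
```

With a uniform hash, a key keeps its server only when hash mod 4 equals hash mod 5, which holds for 4 of every 20 residues, so roughly four in five keys move. Under consistent hashing, the expected movement is only the share the new server takes over, roughly one in five here.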

[Figure: Consistent Hashing Ring Structure]

Why Use Consistent Hashing?

  • Minimal Key Redistribution: When servers are added or removed, only a small fraction of keys need to be remapped
  • Load Balancing: With a well-distributed hash function, keys spread roughly evenly across servers
  • Scalability: Easy to scale up or down without major data movement
  • High Availability: When a server fails, only its keys move, and they move to the next servers on the ring, which simplifies failover and recovery
  • Reduced Hotspots: Virtual nodes help distribute load more evenly

How Consistent Hashing Works

Consistent hashing maps both servers and keys onto a fixed circular space or "ring" (typically the range 0 to 2^32-1). Each key is assigned to the first server found at or after its position when walking clockwise around the ring, wrapping around to the start of the range if necessary.

[Figure: Adding a New Server]
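The clockwise rule and the effect of adding a server can be traced by hand on a toy ring. The sketch below assumes a tiny ring of positions 0 to 99 with hand-picked server and key positions (a real ring derives positions from a hash function); the names `RingServer` and `ownerOf` are hypothetical.

```typescript
// A server placed at a fixed ring position (positions are hand-picked here;
// a real ring derives them from a hash function)
type RingServer = { name: string; position: number };

// Return the first server at or after keyPosition, wrapping to the start
function ownerOf(servers: RingServer[], keyPosition: number): string {
  if (servers.length === 0) throw new Error("ring is empty");
  const sorted = [...servers].sort((a, b) => a.position - b.position);
  const owner = sorted.find(s => s.position >= keyPosition) ?? sorted[0];
  return owner.name;
}

const ringBefore: RingServer[] = [
  { name: "A", position: 10 },
  { name: "B", position: 40 },
  { name: "C", position: 80 },
];

// Adding server D at position 60 splits the arc (40, 80] that C owned
const ringAfter: RingServer[] = [...ringBefore, { name: "D", position: 60 }];

const keyPositions = [5, 25, 45, 70, 95];
const ownersBefore = keyPositions.map(p => ownerOf(ringBefore, p));
const ownersAfter = keyPositions.map(p => ownerOf(ringAfter, p));

console.log(ownersBefore.join(",")); // A,B,C,C,A
console.log(ownersAfter.join(","));  // A,B,D,C,A
```

Only the key at position 45 changes owner, because it lies in the arc between B and the new server D. The key at position 95 wraps around to A in both cases.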

Virtual Nodes

One challenge with basic consistent hashing is that the distribution of keys can become uneven, especially with a small number of servers. Virtual nodes solve this by placing each physical server at multiple points on the ring, so that each server owns many small arcs instead of one large one.

[Figure: Virtual Nodes Distribution]
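One way to see the effect of virtual nodes is to measure each server's share of keys with one point per server and then with many. The following is a minimal sketch under stated assumptions: the `hash32` helper (FNV-1a plus a MurmurHash3-style finalizer), the `server#i` labeling of virtual nodes, and the counts of 200 virtual nodes and 30000 keys are all illustrative choices.

```typescript
// Illustrative 32-bit string hash (FNV-1a + MurmurHash3 finalizer)
function hash32(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

type RingPoint = { position: number; server: string };

// Place `vnodes` points per server on the ring, sorted by position
function buildRing(servers: string[], vnodes: number): RingPoint[] {
  const points: RingPoint[] = [];
  for (const server of servers) {
    for (let i = 0; i < vnodes; i++) {
      points.push({ position: hash32(`${server}#${i}`), server });
    }
  }
  return points.sort((a, b) => a.position - b.position);
}

// Binary search for the first point at or after the key's position
function lookup(ring: RingPoint[], key: string): string {
  const target = hash32(key);
  let lo = 0;
  let hi = ring.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (ring[mid].position < target) lo = mid + 1;
    else hi = mid;
  }
  return ring[lo === ring.length ? 0 : lo].server; // wrap past the end
}

// Fraction of `keyCount` sample keys that each server receives
function keyShares(servers: string[], vnodes: number, keyCount: number): Map<string, number> {
  const ring = buildRing(servers, vnodes);
  const counts = new Map<string, number>();
  servers.forEach(s => counts.set(s, 0));
  for (let i = 0; i < keyCount; i++) {
    const owner = lookup(ring, `key-${i}`);
    counts.set(owner, (counts.get(owner) ?? 0) + 1);
  }
  const shares = new Map<string, number>();
  counts.forEach((count, server) => shares.set(server, count / keyCount));
  return shares;
}

const clusterServers = ["server1", "server2", "server3"];
const sharesWithOneVnode = keyShares(clusterServers, 1, 30000);
const sharesWithManyVnodes = keyShares(clusterServers, 200, 30000);

console.log("1 vnode each:", Array.from(sharesWithOneVnode.entries()));
console.log("200 vnodes each:", Array.from(sharesWithManyVnodes.entries()));
```

With a single point per server, the shares depend entirely on where three positions happen to land, so they can be far from one third each. With 200 points per server, each share should come out close to one third, at the cost of a larger ring to store and search.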

Implementation Example

Here's an implementation of consistent hashing in TypeScript with support for virtual nodes:

typescript
class ConsistentHash<T> {
  // Maps a ring position (hash value) to the physical node that owns it
  private ring: Map<number, T>;
  // Ring positions in ascending order, used for clockwise lookups
  private sortedKeys: number[];
  // Number of virtual nodes placed on the ring per physical node
  private virtualNodes: number;
  private hashFn: (key: string) => number;

  constructor(nodes: T[] = [], vnodes: number = 100) {
    this.ring = new Map();
    this.sortedKeys = [];
    this.virtualNodes = vnodes;

    // Simple multiply-by-33 string hash, for illustration only.
    // A production ring would use a better-distributed hash function.
    this.hashFn = (key: string): number => {
      let hash = 0;
      for (let i = 0; i < key.length; i++) {
        // hash * 33 + char code, truncated to a signed 32-bit integer
        hash = (((hash << 5) + hash) + key.charCodeAt(i)) | 0;
      }
      // Reinterpret as unsigned so positions fall in [0, 2^32 - 1]
      return hash >>> 0;
    };

    // Add initial nodes
    nodes.forEach(node => this.addNode(node));
  }

  // Ring position of the i-th virtual node of `node`.
  // Assumes String(node) is unique per node (true for string IDs).
  private virtualHash(node: T, i: number): number {
    return this.hashFn(`${node}-${i}`);
  }

  // Index of the first ring position >= hash, wrapping to 0 past the end
  private firstIndexAtOrAfter(hash: number): number {
    let lo = 0;
    let hi = this.sortedKeys.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.sortedKeys[mid] < hash) lo = mid + 1;
      else hi = mid;
    }
    return lo === this.sortedKeys.length ? 0 : lo;
  }

  addNode(node: T): void {
    // Add virtual nodes; on a hash collision the existing owner keeps the slot
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.virtualHash(node, i);
      if (!this.ring.has(hash)) this.ring.set(hash, node);
    }

    // Update sorted keys
    this.sortedKeys = Array.from(this.ring.keys()).sort((a, b) => a - b);
  }

  removeNode(node: T): void {
    // Remove virtual nodes, but only the slots this node actually owns
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.virtualHash(node, i);
      if (this.ring.get(hash) === node) this.ring.delete(hash);
    }

    // Update sorted keys
    this.sortedKeys = Array.from(this.ring.keys()).sort((a, b) => a - b);
  }

  getNode(key: string): T | null {
    if (this.sortedKeys.length === 0) return null;

    // Find the first node at or after the key's position on the ring
    const nodeKey = this.sortedKeys[this.firstIndexAtOrAfter(this.hashFn(key))];
    // Use ?? rather than || so falsy node values (0, "") are not dropped
    return this.ring.get(nodeKey) ?? null;
  }

  getNodes(key: string, count: number): T[] {
    if (this.sortedKeys.length === 0) return [];

    const nodes: T[] = [];
    const seen = new Set<T>();
    let index = this.firstIndexAtOrAfter(this.hashFn(key));

    // Collect unique physical nodes, walking the ring at most once.
    // The step bound prevents an infinite loop when `count` exceeds
    // the number of distinct physical nodes.
    for (let step = 0; step < this.sortedKeys.length && nodes.length < count; step++) {
      const node = this.ring.get(this.sortedKeys[index])!;
      if (!seen.has(node)) {
        seen.add(node);
        nodes.push(node);
      }
      index = (index + 1) % this.sortedKeys.length;
    }

    return nodes;
  }
}

// Usage example
const ch = new ConsistentHash<string>(['server1', 'server2', 'server3'], 100);

// Add a new server
ch.addNode('server4');

// Get server for a key
const server = ch.getNode('user123');
console.log(`Key 'user123' is mapped to ${server}`);

// Get multiple servers for replication
const replicas = ch.getNodes('user123', 3);
console.log(`Replicas for 'user123' are: ${replicas.join(', ')}`);

// Remove a server
ch.removeNode('server2');

Real-World Applications

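  • Distributed Caches: Memcached client libraries (for example, those using the ketama algorithm) use consistent hashing to distribute keys across nodes; Redis Cluster uses a related scheme based on 16384 fixed hash slots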
  • Content Delivery Networks (CDNs): To determine which edge server should cache specific content
  • Load Balancers: For distributing requests across backend servers
  • Distributed Databases: Systems like Apache Cassandra and Amazon's Dynamo (whose design influenced DynamoDB) use consistent hashing for data partitioning
  • Distributed Object Storage: Systems like OpenStack Swift use a consistent-hashing ring to distribute objects across storage nodes

Best Practices

  • Use Virtual Nodes: Implement virtual nodes to achieve better key distribution
  • Choose Appropriate Hash Function: Use a fast hash function with uniformly distributed output (for example, MurmurHash or xxHash) to ensure even distribution
  • Consider Replication: Store data on multiple nodes for fault tolerance
  • Monitor Distribution: Keep track of key distribution to detect and address hotspots
  • Handle Node Failures: Implement automatic failover and recovery mechanisms
  • Cache Node Locations: Cache node lookup results to improve performance, and invalidate the cache whenever ring membership changes
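For the monitoring practice above, one simple signal is the ratio between the busiest server's key count and the mean. The sketch below is only one possible metric; the function name `loadImbalance`, the sample counts, and the 1.25 alerting threshold are hypothetical and would need tuning per system.

```typescript
// Summarize how evenly keys are spread, given per-server key counts
function loadImbalance(counts: Map<string, number>): { mean: number; maxOverMean: number } {
  const values = Array.from(counts.values());
  if (values.length === 0) throw new Error("no servers to measure");
  const total = values.reduce((sum, n) => sum + n, 0);
  const mean = total / values.length;
  const max = Math.max(...values);
  // 1.0 means perfectly even; larger values indicate a hotter node
  return { mean, maxOverMean: mean === 0 ? 1 : max / mean };
}

// Hypothetical counts, as might be gathered from per-node metrics
const observedCounts = new Map<string, number>([
  ["server1", 900],
  ["server2", 1000],
  ["server3", 1700],
]);

const HOTSPOT_THRESHOLD = 1.25; // assumed alerting threshold
const report = loadImbalance(observedCounts);
const hasHotspot = report.maxOverMean > HOTSPOT_THRESHOLD;

console.log(report.mean);                   // 1200
console.log(report.maxOverMean.toFixed(2)); // 1.42
console.log(hasHotspot);                    // true
```

A ratio that stays well above 1 suggests raising the number of virtual nodes or checking for a few very popular keys, which no placement scheme can spread on its own.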

Conclusion

Consistent hashing is a fundamental technique in distributed systems that enables efficient scaling and high availability. By minimizing the amount of data that needs to be redistributed when the system topology changes, it provides a robust foundation for building large-scale distributed systems.

While the basic concept is straightforward, careful consideration must be given to implementation details such as virtual nodes, hash function selection, and handling edge cases to ensure optimal performance and reliability in production environments.