Optimizing Memory and Performance: The Power of HashSets, Dictionaries, and Numeric Keys
When working with large datasets in software development, the difference between success and failure often comes down to understanding your tools and making the right choices. A recent experience migrating 40 million database records taught me valuable lessons about data structures, memory optimization, and the importance of choosing the right key types. This journey from an out-of-memory failure to a successful, efficient solution highlights fundamental principles that every developer should understand.
The Problem: Running Out of Memory
The task seemed straightforward: migrate user-episode metadata from one database system to another, checking existing records to determine whether to insert new entries or update existing ones. The initial approach seemed logical—load all existing records into a dictionary for fast lookups:
```csharp
Dictionary<string, tblUserEpisode> existingItems =
    db.tblUserEpisodes.ToDictionary(
        x => $"u={x.UserId}/e={x.EpisodeId}",
        y => y
    );
```
This worked fine in testing with smaller datasets. However, when dealing with 40 million records, the application crashed with out-of-memory errors. The problem wasn’t immediately obvious, but a closer look revealed the issue.
Each dictionary entry was storing:
- A string key like "u=12345/e=67890" (roughly 30 bytes of character data alone, before .NET string object overhead)
- A full tblUserEpisode object with multiple properties, ORM change tracking, and LINQ overhead
This added up to roughly 200-300 bytes per record. For 40 million records, that’s 8-12 GB of memory—far beyond what was available.
The Solution: Understanding Data Structures
Step 1: Choose the Right Data Structure
The first breakthrough came from asking a fundamental question: “What do I actually need to store?”
The original code stored entire tblUserEpisode objects, but the logic only needed to know:
- Does this record exist?
- What is its UpdatedAtUtc value? (to determine whether an update is needed)
This realization led to understanding the difference between HashSet<T> and Dictionary<TKey, TValue>:
- HashSet<T>: Stores only keys, perfect for existence checks. If you only need to know “does this exist?”, a HashSet is ideal.
- Dictionary<TKey, TValue>: Stores key-value pairs. When you need both the key and associated data, a Dictionary is the right choice.
Since we needed both existence checking AND the UpdatedAtUtc value, a Dictionary<long, DateTime> was the perfect fit—storing only the essential information.
Step 2: Replace String Keys with Numeric Keys
The second optimization came from examining the key structure. String keys are convenient and readable, but they come with significant overhead:
Memory Overhead:
- Each string in .NET carries object overhead (about 16 bytes of header on a 64-bit runtime)
- The string length (4 bytes)
- The actual character data (2 bytes per character in UTF-16)
- Additional memory alignment
Performance Overhead:
- String hashing is more expensive (must iterate through characters)
- String comparison requires character-by-character checking
- More allocations mean more garbage collection pressure
The original key format "u={UserId}/e={EpisodeId}" was essentially combining two integers into a string. But why use a string when we can combine them into a single numeric value?
The Magic of Bit Manipulation
Both UserId and EpisodeId are 32-bit integers. A 64-bit long can hold both values simultaneously using bit manipulation:
```csharp
long numericKey = ((long)userId << 32) | (uint)episodeId;
```
Here’s how it works:
- (long)userId << 32 shifts the user ID into the upper 32 bits of the long
- (uint)episodeId occupies the lower 32 bits (the uint cast prevents sign extension from clobbering the upper bits)
- The | (OR) operator combines them
To extract the values later (if needed):
```csharp
int userId = (int)(numericKey >> 32);
int episodeId = (int)(numericKey & 0xFFFFFFFF);
```
This creates a unique numeric key in exactly 8 bytes, compared to a variable-length string that could be 15-25 bytes or more.
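A quick round-trip check makes the packing concrete. This is a standalone sketch (the Pack/Unpack helper names are illustrative, not from the migration code):

```csharp
// Pack two 32-bit IDs into one 64-bit key.
static long Pack(int userId, int episodeId) =>
    ((long)userId << 32) | (uint)episodeId;

// Unpack the key back into its two components.
static (int UserId, int EpisodeId) Unpack(long key) =>
    ((int)(key >> 32), (int)(key & 0xFFFFFFFF));

long key = Pack(12345, 67890);
var (u, e) = Unpack(key);
// u == 12345, e == 67890

// The (uint) cast in Pack matters: without it, a negative episodeId
// would sign-extend to 64 bits and overwrite the userId bits.
```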
The Final Optimized Solution
Here’s what we ended up with:
```csharp
Dictionary<long, DateTime> existingItemKeys = db.tblUserEpisodes
    .Select(x => new { x.UserId, x.EpisodeId, x.UpdatedAtUtc }) // fetch only the three columns we need
    .AsEnumerable() // switch to client-side evaluation: the bit-shift key expression can't be translated to SQL,
                    // and streaming avoids materializing a 40M-element list first
    .ToDictionary(
        x => ((long)x.UserId << 32) | (uint)x.EpisodeId,
        x => x.UpdatedAtUtc
    );
```
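With the dictionary in place, the insert-or-update decision per incoming record becomes a single TryGetValue call. The following is a sketch of the shape of that loop; incomingRecords, InsertRecord, and UpdateRecord are hypothetical names, and the batching/SaveChanges details of the actual migration are omitted:

```csharp
foreach (var record in incomingRecords)
{
    long key = ((long)record.UserId << 32) | (uint)record.EpisodeId;

    if (!existingItemKeys.TryGetValue(key, out DateTime existingUpdatedAt))
    {
        // No existing row: insert a new one.
        InsertRecord(record);
    }
    else if (record.UpdatedAtUtc > existingUpdatedAt)
    {
        // Existing row is stale: update it.
        UpdateRecord(record);
    }
    // Otherwise the existing row is already current; skip it.
}
```

TryGetValue performs the existence check and the value retrieval in one hash lookup, which is exactly the pair of questions the migration needs answered.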
Memory Impact:
- Before: ~200-300 bytes per record = 8-12 GB for 40M records
- After: 16 bytes of payload per record (8 for the key + 8 for the DateTime), plus modest per-entry dictionary overhead = roughly 640 MB for 40M records
- Result: a reduction in memory usage of more than an order of magnitude (roughly 12-19x)
Performance Benefits Beyond Memory
The improvements weren’t just about memory:
Faster Lookups
- Numeric keys: O(1) hash computation with simple integer operations
- String keys: O(n) hash computation where n is string length, plus character-by-character comparison
Reduced Garbage Collection Pressure
- Fewer allocations mean less work for the garbage collector
- More predictable performance with fewer GC pauses
Better Cache Locality
- Numeric keys are more cache-friendly than string objects scattered in memory
When to Use Each Data Structure
HashSet<T>
Use when you only need to check existence:
```csharp
HashSet<long> existingIds = new HashSet<long>(ids);
if (existingIds.Contains(newId)) { /* exists */ }
```
Dictionary<TKey, TValue>
Use when you need to store and retrieve values:
```csharp
Dictionary<long, DateTime> keyToTimestamp = new Dictionary<long, DateTime>();
if (keyToTimestamp.TryGetValue(key, out DateTime timestamp)) { /* fast lookup; avoids an exception when the key is missing */ }
```
Numeric Keys vs String Keys
Use numeric keys when:
- You can derive a unique numeric representation
- Performance and memory are critical
- The data naturally has numeric identifiers
Use string keys when:
- The key is naturally a string (email addresses, file paths, etc.)
- Readability is more important than performance
- The dataset is small enough that overhead doesn’t matter
Real-World Impact
In our migration scenario, the optimizations meant:
- Memory usage dropped from 8-12 GB to 640 MB – fitting comfortably in available RAM
- Lookups became faster – integer operations are faster than string operations
- The application ran successfully – no more out-of-memory crashes
- Garbage collection improved – fewer allocations meant fewer GC pauses
The migration that previously failed now completes successfully, processing all 40 million records without issues.
Key Takeaways
- Choose the right data structure: Use HashSet for existence checks, Dictionary when you need values
- Prefer numeric keys when possible: They're more memory-efficient and faster to hash and compare
- Store only what you need: Don’t load full objects if you only need a few fields
- Consider bit manipulation: Combining multiple integers into a single key can be very efficient
- Measure and optimize: Profile your memory usage and performance to find bottlenecks
Conclusion
The difference between a solution that works and one that fails often comes down to understanding your tools. HashSet and Dictionary are powerful data structures, but using them effectively requires understanding when to use each one and how to optimize your key selection.
For large datasets, numeric keys can dramatically reduce memory usage and improve performance. The next time you’re working with large datasets, ask yourself: “Do I really need the full object, or just a few fields? Can I use a numeric key instead of a string?” These small optimizations can make the difference between a solution that works and one that doesn’t.
This optimization reduced memory usage by more than an order of magnitude and eliminated out-of-memory errors in a production migration handling 40 million records. Sometimes the best solutions are the simplest ones, just applied correctly.