Optimizing Memory and Performance: The Power of HashSets, Dictionaries, and Numeric Keys
When working with large datasets in software development, the difference between success and failure often comes down to understanding your tools and making the right choices. A recent experience migrating 40 million database records taught me valuable lessons about data structures, memory optimization, and the importance of choosing the right key types. This journey from an out-of-memory failure to a successful, efficient solution highlights fundamental principles that every developer should understand.
The Problem: Running Out of Memory
The task seemed straightforward: migrate user-episode metadata from one database system to another, checking existing records to determine whether to insert new entries or update existing ones. The initial approach seemed logical—load all existing records into a dictionary for fast lookups:
```csharp
Dictionary<string, tblUserEpisode> existingItems =
    db.tblUserEpisodes.ToDictionary(
        x => $"u={x.UserId}/e={x.EpisodeId}",
        y => y
    );
```
This worked fine in testing with smaller datasets. However, when dealing with 40 million records, the application crashed with out-of-memory errors. The problem wasn’t immediately obvious, but a closer look revealed the issue.
Each dictionary entry was storing:
- A string key like "u=12345/e=67890" (roughly 30 bytes of character data alone, before .NET string object overhead)
- A full tblUserEpisode object with multiple properties, ORM change tracking, and LINQ overhead
This added up to roughly 200-300 bytes per record. For 40 million records, that’s 8-12 GB of memory—far beyond what was available.
The Solution: Understanding Data Structures
Step 1: Choose the Right Data Structure
The first breakthrough came from asking a fundamental question: “What do I actually need to store?”
The original code stored entire tblUserEpisode objects, but the logic only needed to know:
- Does this record exist?
- What is its UpdatedAtUtc value? (to determine whether an update is needed)
This realization led to understanding the difference between HashSet<T> and Dictionary<TKey, TValue>:
- HashSet<T>: Stores only keys, perfect for existence checks. If you only need to know “does this exist?”, a HashSet is ideal.
- Dictionary<TKey, TValue>: Stores key-value pairs. When you need both the key and associated data, a Dictionary is the right choice.
Since we needed both existence checking AND the UpdatedAtUtc value, a Dictionary<long, DateTime> was the perfect fit—storing only the essential information.
Step 2: Replace String Keys with Numeric Keys
The second optimization came from examining the key structure. String keys are convenient and readable, but they come with significant overhead:
Memory Overhead:
- Each string in .NET carries object overhead (about 16 bytes of header on a 64-bit runtime)
- The string length (4 bytes)
- The actual character data (2 bytes per character in UTF-16)
- Additional memory alignment
Performance Overhead:
- String hashing is more expensive (must iterate through characters)
- String comparison requires character-by-character checking
- More allocations mean more garbage collection pressure
The original key format "u={UserId}/e={EpisodeId}" was essentially combining two integers into a string. But why use a string when we can combine them into a single numeric value?
The Magic of Bit Manipulation
Both UserId and EpisodeId are 32-bit integers. A 64-bit long can hold both values simultaneously using bit manipulation:
```csharp
long numericKey = ((long)userId << 32) | (uint)episodeId;
```
Here’s how it works:
- (long)userId << 32 shifts the user ID into the upper 32 bits of the long
- (uint)episodeId occupies the lower 32 bits (the uint cast prevents sign extension from clobbering the upper bits)
- The | (OR) operator combines them
To extract the values later (if needed):
```csharp
int userId = (int)(numericKey >> 32);
int episodeId = (int)(numericKey & 0xFFFFFFFF);
```
This creates a unique numeric key in exactly 8 bytes, compared to a variable-length string that could be 15-25 bytes or more.
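A quick round-trip check makes the packing concrete. This is a standalone sketch (the Pack/Unpack helper names are illustrative, not from the migration code):

```csharp
// Pack two 32-bit IDs into one 64-bit key.
static long Pack(int userId, int episodeId) =>
    ((long)userId << 32) | (uint)episodeId;

// Unpack the key back into its two components.
static (int UserId, int EpisodeId) Unpack(long key) =>
    ((int)(key >> 32), (int)(key & 0xFFFFFFFF));

long key = Pack(12345, 67890);
var (u, e) = Unpack(key);
// u == 12345, e == 67890

// The (uint) cast in Pack matters: without it, a negative episodeId
// would sign-extend to 64 bits and overwrite the userId bits.
```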
The Final Optimized Solution
Here’s what we ended up with:
```csharp
Dictionary<long, DateTime> existingItemKeys = db.tblUserEpisodes
    .Select(x => new { x.UserId, x.EpisodeId, x.UpdatedAtUtc }) // fetch only the three columns we need
    .AsEnumerable() // switch to client-side evaluation: the bit-shift key expression can't be translated to SQL,
                    // and streaming avoids materializing a 40M-element list first
    .ToDictionary(
        x => ((long)x.UserId << 32) | (uint)x.EpisodeId,
        x => x.UpdatedAtUtc
    );
```
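With the dictionary in place, the insert-or-update decision per incoming record becomes a single TryGetValue call. The following is a sketch of the shape of that loop; incomingRecords, InsertRecord, and UpdateRecord are hypothetical names, and the batching/SaveChanges details of the actual migration are omitted:

```csharp
foreach (var record in incomingRecords)
{
    long key = ((long)record.UserId << 32) | (uint)record.EpisodeId;

    if (!existingItemKeys.TryGetValue(key, out DateTime existingUpdatedAt))
    {
        // No existing row: insert a new one.
        InsertRecord(record);
    }
    else if (record.UpdatedAtUtc > existingUpdatedAt)
    {
        // Existing row is stale: update it.
        UpdateRecord(record);
    }
    // Otherwise the existing row is already current; skip it.
}
```

TryGetValue performs the existence check and the value retrieval in one hash lookup, which is exactly the pair of questions the migration needs answered.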
Memory Impact:
- Before: ~200-300 bytes per record = 8-12 GB for 40M records
- After: 16 bytes of payload per record (8 for the key + 8 for the DateTime), plus modest per-entry dictionary overhead = roughly 640 MB for 40M records
- Result: a reduction in memory usage of more than an order of magnitude (roughly 12-19x)
Performance Benefits Beyond Memory
The improvements weren’t just about memory:
Faster Lookups
- Numeric keys: O(1) hash computation with simple integer operations
- String keys: O(n) hash computation where n is string length, plus character-by-character comparison
Reduced Garbage Collection Pressure
- Fewer allocations mean less work for the garbage collector
- More predictable performance with fewer GC pauses
Better Cache Locality
- Numeric keys are more cache-friendly than string objects scattered in memory
When to Use Each Data Structure
HashSet<T>
Use when you only need to check existence:
```csharp
HashSet<long> existingIds = new HashSet<long>(ids);
if (existingIds.Contains(newId)) { /* exists */ }
```
Dictionary<TKey, TValue>
Use when you need to store and retrieve values:
```csharp
Dictionary<long, DateTime> keyToTimestamp = new Dictionary<long, DateTime>();
if (keyToTimestamp.TryGetValue(key, out DateTime timestamp)) { /* fast lookup; avoids an exception when the key is missing */ }
```
Numeric Keys vs String Keys
Use numeric keys when:
- You can derive a unique numeric representation
- Performance and memory are critical
- The data naturally has numeric identifiers
Use string keys when:
- The key is naturally a string (email addresses, file paths, etc.)
- Readability is more important than performance
- The dataset is small enough that overhead doesn’t matter
Real-World Impact
In our migration scenario, the optimizations meant:
- Memory usage dropped from 8-12 GB to 640 MB – fitting comfortably in available RAM
- Lookups became faster – integer operations are faster than string operations
- The application ran successfully – no more out-of-memory crashes
- Garbage collection improved – fewer allocations meant fewer GC pauses
The migration that previously failed now completes successfully, processing all 40 million records without issues.
Key Takeaways
- Choose the right data structure: Use HashSet for existence checks, Dictionary when you need values
- Prefer numeric keys when possible: They're more memory-efficient and faster to hash and compare
- Store only what you need: Don’t load full objects if you only need a few fields
- Consider bit manipulation: Combining multiple integers into a single key can be very efficient
- Measure and optimize: Profile your memory usage and performance to find bottlenecks
Conclusion
The difference between a solution that works and one that fails often comes down to understanding your tools. HashSet and Dictionary are powerful data structures, but using them effectively requires understanding when to use each one and how to optimize your key selection.
For large datasets, numeric keys can dramatically reduce memory usage and improve performance. The next time you’re working with large datasets, ask yourself: “Do I really need the full object, or just a few fields? Can I use a numeric key instead of a string?” These small optimizations can make the difference between a solution that works and one that doesn’t.
This optimization reduced memory usage by more than an order of magnitude and eliminated out-of-memory errors in a production migration handling 40 million records. Sometimes the best solutions are the simplest ones, just applied correctly.