diff --git a/analysis/03-Two_Sum_Is_Not_About_Numbers/analysis_two_sum_not_about_numbers.md b/analysis/03-Two_Sum_Is_Not_About_Numbers/analysis_two_sum_not_about_numbers.md deleted file mode 100644 index 7632cde..0000000 --- a/analysis/03-Two_Sum_Is_Not_About_Numbers/analysis_two_sum_not_about_numbers.md +++ /dev/null @@ -1,173 +0,0 @@ -# Analysis #XX — Two Sum Is Not About Numbers - -## Problem (LeetCode-style) - -You are given a list of records. Each record contains: - -- an identifier -- a value -- optional metadata - -Your task is to find whether there exists a pair of records whose values sum to a given target. - -Return the identifiers of any such pair. - -Constraints: -- Each record may be used at most once -- At most one valid answer exists - ---- - -## Typical Interview Thinking - -1. Start with brute force: - - Check all pairs → O(n²) - -2. Optimize: - - Use a hash map - - Store seen values - - Lookup complement (target - value) - -```cpp -unordered_map<int, int> seen; - -for (int i = 0; i < n; ++i) { - int complement = target - nums[i]; - - if (seen.count(complement)) { - return {seen[complement], i}; - } - - seen[nums[i]] = i; -} -``` - -Time complexity: O(n) -Space complexity: O(n) - ---- - -## What This Actually Tests - -- Pattern recognition -- Familiarity with hash maps -- Knowledge of time complexity -- Prior exposure to the problem - ---- - -## Real-World Version (Logs & Event Correlation) - -### Synthetic Log Example - -``` -2026-04-16T10:15:01.123Z service=api event=parse_input latency=12ms request_id=req-1001 -2026-04-16T10:15:01.130Z service=cache event=cache_miss latency=48ms request_id=req-1001 -2026-04-16T10:15:01.135Z service=db event=read_user latency=55ms request_id=req-1001 -2026-04-16T10:15:01.144Z service=net event=external_call latency=47ms request_id=req-1001 -2026-04-16T10:15:01.151Z service=cache event=cache_miss latency=60ms request_id=req-3001 -2026-04-16T10:15:01.154Z service=net event=external_call latency=52ms request_id=req-3001 -``` 
- ---- - -## Real Problem - -Detect whether there exist two events: - -- belonging to the same request_id -- occurring within a time window -- whose combined latency exceeds a threshold - ---- - -## Where LeetCode Logic Breaks - -### 1. Not Exact Match -LeetCode: -``` -a + b == target -``` - -Reality: -``` -a + b > threshold -``` - ---- - -### 2. Context Matters (request_id) - -You cannot mix unrelated events. - ---- - -### 3. Time Window - -Events must be close in time. - ---- - -### 4. Streaming Data - -- Data arrives continuously -- May be out of order -- Cannot store everything - ---- - -## Real Engineering Approach - -### Core Idea - -Maintain sliding windows per request_id. - -### Pseudocode - -``` -for each incoming event: - bucket = active_events[event.request_id] - - remove old events outside time window - - for each old_event in bucket: - if event.latency + old_event.latency > threshold: - report anomaly - - add event to bucket -``` - ---- - -## Additional Real Constraints - -- Out-of-order events -- Missing logs -- Duplicate events -- Noise filtering -- Memory limits - ---- - -## Key Takeaway - -Two Sum is not about numbers. - -It is about recognizing patterns in controlled environments. - -Real engineering problems are about: - -- defining valid data -- handling imperfect inputs -- managing time and memory -- maintaining system behavior under constraints - ---- - -## Project Perspective - -Exists in real engineering? -→ Yes, but heavily transformed - -Exists in interview form? 
-→ Yes, but oversimplified diff --git a/analysis/03-Two_Sum_Is_Not_About_Numbers/examples/two_sum_logs_demo.cpp b/analysis/03-Two_Sum_Is_Not_About_Numbers/examples/two_sum_logs_demo.cpp new file mode 100644 index 0000000..7102a6f --- /dev/null +++ b/analysis/03-Two_Sum_Is_Not_About_Numbers/examples/two_sum_logs_demo.cpp @@ -0,0 +1,223 @@ +#include <cstddef> +#include <cstdint> +#include <deque> +#include <exception> +#include <iostream> +#include <sstream> +#include <stdexcept> +#include <string> +#include <unordered_map> +#include <vector> + +struct Event { + std::int64_t timestamp_ms; + std::string service; + std::string event; + int latency_ms; + std::string request_id; + std::string raw_line; +}; + +struct PairResult { + bool found = false; + Event first; + Event second; +}; + +std::int64_t parseTimestampMs(const std::string& timestamp) +{ + // Expected format: + // YYYY-MM-DDTHH:MM:SS.mmmZ + // For this demo we only convert the HH:MM:SS.mmm part to milliseconds. + const std::size_t t_pos = timestamp.find('T'); + const std::size_t z_pos = timestamp.find('Z'); + + if (t_pos == std::string::npos || z_pos == std::string::npos) { + throw std::runtime_error("Invalid timestamp: " + timestamp); + } + + const std::string time_part = timestamp.substr(t_pos + 1, z_pos - t_pos - 1); + + int hours = 0; + int minutes = 0; + int seconds = 0; + int millis = 0; + char colon1 = '\0'; + char colon2 = '\0'; + char dot = '\0'; + + std::istringstream iss(time_part); + iss >> hours >> colon1 >> minutes >> colon2 >> seconds >> dot >> millis; + + if (!iss || colon1 != ':' || colon2 != ':' || dot != '.') { + throw std::runtime_error("Invalid time part: " + time_part); + } + + return (((hours * 60LL) + minutes) * 60LL + seconds) * 1000LL + millis; +} + +Event parseLogLine(const std::string& line) +{ + std::istringstream iss(line); + + std::string timestamp; + std::string service_token; + std::string event_token; + std::string latency_token; + std::string request_token; + + if (!(iss >> timestamp >> service_token >> event_token >> latency_token >> request_token)) { + throw 
std::runtime_error("Cannot parse log line: " + line); + } + + auto valueAfterEquals = [](const std::string& token) -> std::string { + const std::size_t pos = token.find('='); + if (pos == std::string::npos || pos + 1 >= token.size()) { + throw std::runtime_error("Invalid token: " + token); + } + return token.substr(pos + 1); + }; + + Event result; + result.timestamp_ms = parseTimestampMs(timestamp); + result.service = valueAfterEquals(service_token); + result.event = valueAfterEquals(event_token); + + std::string latency_value = valueAfterEquals(latency_token); + if (latency_value.size() < 3 || latency_value.substr(latency_value.size() - 2) != "ms") { + throw std::runtime_error("Invalid latency token: " + latency_token); + } + latency_value.erase(latency_value.size() - 2); + result.latency_ms = std::stoi(latency_value); + + result.request_id = valueAfterEquals(request_token); + result.raw_line = line; + + return result; +} + +std::vector<Event> parseLogs(const std::vector<std::string>& lines) +{ + std::vector<Event> events; + events.reserve(lines.size()); + + for (const std::string& line : lines) { + events.push_back(parseLogLine(line)); + } + + return events; +} + +void printPair(const PairResult& result, const std::string& label) +{ + std::cout << label << '\n'; + + if (!result.found) { + std::cout << " no pair found\n\n"; + return; + } + + std::cout << " first : " << result.first.raw_line << '\n'; + std::cout << " second: " << result.second.raw_line << '\n'; + std::cout << " combined latency: " + << (result.first.latency_ms + result.second.latency_ms) + << "ms\n\n"; +} + +PairResult interviewStyleReduction(const std::vector<Event>& events, int threshold_ms) +{ + // Intentionally wrong for the real-world problem: + // it ignores request_id and time. 
+ for (std::size_t i = 0; i < events.size(); ++i) { + for (std::size_t j = i + 1; j < events.size(); ++j) { + if (events[i].latency_ms + events[j].latency_ms > threshold_ms) { + return PairResult{true, events[i], events[j]}; + } + } + } + + return PairResult{}; +} + +PairResult realSlidingWindowDetection(const std::vector<Event>& events, + int threshold_ms, + std::int64_t window_ms) +{ + std::unordered_map<std::string, std::deque<Event> > active_events; + + for (const Event& current : events) { + std::deque<Event>& bucket = active_events[current.request_id]; + + while (!bucket.empty() && + (current.timestamp_ms - bucket.front().timestamp_ms) > window_ms) { + bucket.pop_front(); + } + + for (const Event& previous : bucket) { + const std::int64_t delta = current.timestamp_ms - previous.timestamp_ms; + + if (delta >= 0 && delta <= window_ms && + previous.latency_ms + current.latency_ms > threshold_ms) { + return PairResult{true, previous, current}; + } + } + + bucket.push_back(current); + } + + return PairResult{}; +} + +void printEvents(const std::vector<Event>& events) +{ + std::cout << "Synthetic log stream:\n"; + for (const Event& event : events) { + std::cout << " " << event.raw_line << '\n'; + } + std::cout << '\n'; +} + +int main() +{ + try { + const std::vector<std::string> raw_logs = { + "2026-04-16T10:15:01.100Z service=api event=parse_input latency=12ms request_id=req-1001", + "2026-04-16T10:15:01.110Z service=cache event=cache_miss latency=48ms request_id=req-1001", + "2026-04-16T10:15:01.120Z service=auth event=token_check latency=58ms request_id=req-2001", + "2026-04-16T10:15:01.130Z service=db event=read_user latency=43ms request_id=req-3001", + "2026-04-16T10:15:01.135Z service=db event=read_user latency=55ms request_id=req-1001", + "2026-04-16T10:15:01.144Z service=net event=external_call latency=47ms request_id=req-1001", + "2026-04-16T10:15:01.200Z service=cache event=cache_miss latency=60ms request_id=req-3001", + "2026-04-16T10:15:01.260Z service=net event=external_call latency=52ms request_id=req-3001" + }; + 
+ const int threshold_ms = 100; + const std::int64_t window_ms = 20; + + const std::vector<Event> events = parseLogs(raw_logs); + + printEvents(events); + + std::cout << "Threshold: " << threshold_ms << "ms\n"; + std::cout << "Time window: " << window_ms << "ms\n\n"; + + const PairResult naive_result = interviewStyleReduction(events, threshold_ms); + printPair(naive_result, "Interview-style reduction (ignores request_id and time):"); + + const PairResult real_result = realSlidingWindowDetection(events, threshold_ms, window_ms); + printPair(real_result, "Streaming sliding-window detection:"); + + std::cout << "Notes:\n"; + std::cout << " - The interview-style version can produce a false correlation.\n"; + std::cout << " - In this dataset, it first matches 48ms from req-1001 with 58ms from req-2001.\n"; + std::cout << " - That pair exceeds the threshold, but it is operationally meaningless.\n"; + std::cout << " - The streaming version only correlates events from the same request_id\n"; + std::cout << " and only within the configured time window.\n"; + + return 0; + } + catch (const std::exception& ex) { + std::cerr << "Error: " << ex.what() << '\n'; + return 1; + } +} diff --git a/analysis/03-Two_Sum_Is_Not_About_Numbers/readme.md b/analysis/03-Two_Sum_Is_Not_About_Numbers/readme.md index 44a0c1e..6961a79 100644 --- a/analysis/03-Two_Sum_Is_Not_About_Numbers/readme.md +++ b/analysis/03-Two_Sum_Is_Not_About_Numbers/readme.md @@ -1,4 +1,6 @@ -# Analysis #XX — Two Sum Is Not About Numbers +# Analysis #03 — Two Sum Is Not About Numbers + +--- ## Problem @@ -9,9 +11,10 @@ At first glance, the problem looks trivial: This is one of the most well-known interview questions, commonly referred to as **Two Sum**. It is simple, clean, and perfectly defined: -- a static array -- exact arithmetic -- a guaranteed answer + +- a static array +- exact arithmetic +- a guaranteed answer And that’s exactly why it works so well in interviews. 
@@ -21,10 +24,10 @@ And that’s exactly why it works so well in interviews. A candidate is expected to go through a familiar progression: -1. Start with brute force (O(n²)) -2. Recognize inefficiency -3. Optimize using a hash map -4. Achieve O(n) time complexity +1. Start with brute force (O(n²)) +2. Recognize inefficiency +3. Optimize using a hash map +4. Achieve O(n) time complexity ```cpp unordered_map<int, int> seen; @@ -40,7 +43,7 @@ for (int i = 0; i < n; ++i) { } ``` -The “correct” answer is not about solving the problem. +The “correct” answer is not really about solving the problem. It is about recognizing the pattern. @@ -50,35 +53,41 @@ It is about recognizing the pattern. Despite its simplicity, this problem evaluates: -- familiarity with standard patterns -- ability to choose a data structure -- understanding of time complexity +- familiarity with standard patterns +- ability to choose a data structure +- understanding of time complexity But most importantly: > it tests whether you have seen this problem before. +A candidate who has already practiced this family of tasks will likely reach the expected answer quickly. + +A candidate who has spent years solving real engineering problems may still pause — not because the problem is hard, but because the interview expects a very specific kind of answer. + --- ## A Subtle Shift -Now let’s take the same idea and move it one step closer to reality. +Now let’s move the same idea one step closer to reality. Instead of numbers, we have **log events**. Instead of a static array, we have a **stream**. -Instead of a clean equality, we have **imperfect data and thresholds**. +Instead of a clean equality, we have **imperfect data, context, and thresholds**. 
--- ## Synthetic Log Example -``` +```text 2026-04-16T10:15:01.123Z service=api event=parse_input latency=12ms request_id=req-1001 2026-04-16T10:15:01.130Z service=cache event=cache_miss latency=48ms request_id=req-1001 2026-04-16T10:15:01.135Z service=db event=read_user latency=55ms request_id=req-1001 +2026-04-16T10:15:01.141Z service=auth event=token_check latency=18ms request_id=req-2001 2026-04-16T10:15:01.144Z service=net event=external_call latency=47ms request_id=req-1001 +2026-04-16T10:15:01.149Z service=db event=read_user latency=22ms request_id=req-2001 2026-04-16T10:15:01.151Z service=cache event=cache_miss latency=60ms request_id=req-3001 2026-04-16T10:15:01.154Z service=net event=external_call latency=52ms request_id=req-3001 ``` @@ -92,9 +101,9 @@ We are no longer asked to find two numbers. Instead, the problem becomes: > Detect whether there exist two events: -> - belonging to the same request -> - occurring close in time -> - whose combined latency exceeds a threshold +> - belonging to the same request +> - occurring within a time window +> - whose combined latency exceeds a threshold This still *looks* like Two Sum. @@ -102,60 +111,104 @@ But it is not. --- +## How LeetCode Thinking Tries to Adapt + +The first instinct is to simplify. + +Take the log stream, ignore most of the structure, extract just the latency values, and reduce everything back to “numbers in an array”. + +That leads to a familiar line of thinking: + +1. Collect latencies +2. Search for matching pairs +3. Try to reuse the same hash map pattern +4. Treat the task as another variation of Two Sum + +This is exactly what interview training encourages: + +> reduce the problem until it matches a known template. + +That works beautifully in interviews. + +But this is also where the model starts to break. + +--- + ## Where the Interview Model Breaks -### 1. No Exact Match +### 1. 
It Is Not an Exact-Match Problem Interview version: -``` + a + b == target -``` Real version: -``` -a + b > threshold -``` -We are not searching for a perfect complement. +a + b > threshold + +We are not searching for a perfect complement. We are evaluating a condition. --- -### 2. Context Is Mandatory +### 2. Context Cannot Be Ignored -You cannot combine arbitrary events. +A latency of 55ms from one request and 52ms from another may exceed the threshold. -A latency spike only makes sense **within the same request**. +But together they mean nothing. -Without context, the result is meaningless. +Without context, the result is technically correct — and completely useless. --- -### 3. Time Matters +### 3. Time Makes the Problem Harder Events are not just values — they exist in time. -Two events five seconds apart may not be related at all. +Two events may belong to the same request and still be unrelated if they are too far apart. This introduces: -- time windows -- ordering issues -- temporal constraints + +- time windows +- ordering +- eviction --- -### 4. Data Is Not Static +### 4. The Data Is Not Static -LeetCode assumes: -- full dataset -- already loaded -- perfectly ordered +Interview assumptions: + +- full dataset available +- stable ordering +- perfect input Reality: -- streaming input -- delayed events -- missing entries -- out-of-order delivery + +- streaming data +- out-of-order events +- missing logs +- duplicates + +The “single clean pass over an array” stops being a valid model. + +--- + +### 5. Pattern Matching Becomes a Trap + +The more familiar the pattern, the stronger the temptation: + +> “This is just Two Sum.” + +But in reality: + +- request_id defines grouping +- timestamp defines relevance +- streaming defines constraints + +These are not details. + +They are the problem. --- @@ -169,21 +222,22 @@ It becomes: > “determine which events are comparable at all” -And that is a fundamentally different problem. +The arithmetic is trivial. 
+ +The system is not. --- ## Real Engineering Approach -Instead of solving a mathematical puzzle, we build a system. +Instead of solving a puzzle, we build a mechanism. ### Core Idea -Maintain a sliding window of recent events per request. +Maintain a sliding window per request_id. ### Pseudocode -``` for each incoming event: bucket = active_events[event.request_id] @@ -194,7 +248,6 @@ for each incoming event: report anomaly add event to bucket -``` --- @@ -202,17 +255,52 @@ for each incoming event: Now we must deal with: -- bounded memory -- streaming constraints -- time-based eviction -- correlation logic +- bounded memory +- streaming constraints +- time-based eviction +- request-level grouping -And beyond that: +And then reality hits: -- out-of-order events -- duplicate logs -- partial data -- noise filtering +- out-of-order events +- duplicate logs +- partial data +- noise + +At this point, the original Two Sum is almost unrecognizable. + +--- + +## Demo + +See example implementation: + +- examples/two_sum_logs_demo.cpp + +--- + +## Example Output + +Interview-style reduction: + combines events from different request_id → false positive + +Streaming solution: + finds valid pair within same request and time window + +--- + +## Explanation + +The interview-style solution produces a mathematically valid result. + +But it mixes unrelated events. + +The streaming solution respects: + +- request boundaries +- time constraints + +Which makes the result meaningful. --- @@ -222,20 +310,20 @@ The difficulty is not in computing a sum. The difficulty is in defining: -- what data is valid -- what events belong together -- what “close enough” means -- how the system behaves under imperfect conditions +- what data is valid +- what events belong together +- what “close enough” means +- how the system behaves under imperfect conditions --- ## Key Takeaway -Two Sum is often presented as a problem about numbers. +Two Sum is not about numbers. 
-In reality, it is a problem about assumptions. +It is about assumptions. -Remove those assumptions, and the problem changes completely. +Remove those assumptions — and the problem changes completely. > The challenge is not finding two values. > The challenge is understanding whether those values should ever be compared. @@ -245,7 +333,14 @@ Remove those assumptions, and the problem changes completely. ## Project Perspective Exists in real engineering? -→ Yes, but as event correlation under constraints +→ Yes, but as event correlation under constraints Exists in interview form? -→ Yes, but stripped of context and complexity +→ Yes, but stripped of context and complexity + +--- + +## Final Note + +The algorithm was never the hard part. +The assumptions were.
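
The readme's pseudocode reports every anomalous pair, while `realSlidingWindowDetection` in the demo returns only the first one it finds. A minimal sketch of the report-all variant is given here. `MiniEvent` and `findAllSlowPairs` are hypothetical names introduced for illustration and are not part of this patch. The sketch assumes events arrive in non-decreasing timestamp order, so it does not address the out-of-order or duplicate cases the readme lists.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, trimmed-down event type for this sketch
// (the demo's Event struct carries more fields).
struct MiniEvent {
    std::int64_t timestamp_ms;
    int latency_ms;
    std::string request_id;
};

// Collects every (earlier, later) pair that shares a request_id, lies within
// window_ms of each other, and whose combined latency exceeds threshold_ms.
// Assumption: events are supplied in non-decreasing timestamp order.
std::vector<std::pair<MiniEvent, MiniEvent>> findAllSlowPairs(
    const std::vector<MiniEvent>& events,
    int threshold_ms,
    std::int64_t window_ms)
{
    assert(window_ms >= 0);

    std::unordered_map<std::string, std::deque<MiniEvent>> active;
    std::vector<std::pair<MiniEvent, MiniEvent>> pairs;

    for (const MiniEvent& current : events) {
        std::deque<MiniEvent>& bucket = active[current.request_id];

        // Evict events of this request that have left the time window.
        while (!bucket.empty() &&
               current.timestamp_ms - bucket.front().timestamp_ms > window_ms) {
            bucket.pop_front();
        }

        // Everything still in the bucket is inside the window.
        for (const MiniEvent& previous : bucket) {
            if (previous.latency_ms + current.latency_ms > threshold_ms) {
                pairs.emplace_back(previous, current);
            }
        }

        bucket.push_back(current);
    }

    return pairs;
}
```

Fed the eight events from the demo's `raw_logs` with a 100 ms threshold, this sketch finds one pair for a 20 ms window (55 ms + 47 ms on req-1001) and four pairs for a 100 ms window. Memory stays bounded per request only as long as the window is short; buckets for finished requests are never dropped here, which a production version would need to handle.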