Closes #01

2026-05-18 15:31:52 -04:00
parent 269d560847
commit 41be5a2e24
3 changed files with 384 additions and 239 deletions
--- a/analysis/03-Two_Sum_Is_Not_About_Numbers/analysis_two_sum_not_about_numbers.md
+++ b/analysis/03-Two_Sum_Is_Not_About_Numbers/analysis_two_sum_not_about_numbers.md
@@ -1,173 +0,0 @@
-# Analysis #XX — Two Sum Is Not About Numbers
-
-## Problem (LeetCode-style)
-
-You are given a list of records. Each record contains:
-
- an identifier
- a value
- optional metadata
-
-Your task is to find whether there exists a pair of records whose values sum to a given target.
-
-Return the identifiers of any such pair.
-
-Constraints:
- Each record may be used at most once
- At most one valid answer exists
-
---
-
-## Typical Interview Thinking
-
-1. Start with brute force:
-   - Check all pairs → O(n²)
-
-2. Optimize:
-   - Use a hash map
-   - Store seen values
-   - Lookup complement (target - value)
-
-```cpp
-unordered_map<int, int> seen;
-
-for (int i = 0; i < n; ++i) {
-    int complement = target - nums[i];
-
-    if (seen.count(complement)) {
-        return {seen[complement], i};
-    }
-
-    seen[nums[i]] = i;
-}
-```
-
-Time complexity: O(n)  
-Space complexity: O(n)
-
---
-
-## What This Actually Tests
-
- Pattern recognition
- Familiarity with hash maps
- Knowledge of time complexity
- Prior exposure to the problem
-
---
-
-## Real-World Version (Logs & Event Correlation)
-
-### Synthetic Log Example
-
-```
-2026-04-16T10:15:01.123Z service=api    event=parse_input   latency=12ms request_id=req-1001
-2026-04-16T10:15:01.130Z service=cache  event=cache_miss    latency=48ms request_id=req-1001
-2026-04-16T10:15:01.135Z service=db     event=read_user     latency=55ms request_id=req-1001
-2026-04-16T10:15:01.144Z service=net    event=external_call latency=47ms request_id=req-1001
-2026-04-16T10:15:01.151Z service=cache  event=cache_miss    latency=60ms request_id=req-3001
-2026-04-16T10:15:01.154Z service=net    event=external_call latency=52ms request_id=req-3001
-```
-
---
-
-## Real Problem
-
-Detect whether there exist two events:
-
- belonging to the same request_id
- occurring within a time window
- whose combined latency exceeds a threshold
-
---
-
-## Where LeetCode Logic Breaks
-
-### 1. Not Exact Match
-LeetCode:
-```
-a + b == target
-```
-
-Reality:
-```
-a + b > threshold
-```
-
---
-
-### 2. Context Matters (request_id)
-
-You cannot mix unrelated events.
-
---
-
-### 3. Time Window
-
-Events must be close in time.
-
---
-
-### 4. Streaming Data
-
- Data arrives continuously
- May be out of order
- Cannot store everything
-
---
-
-## Real Engineering Approach
-
-### Core Idea
-
-Maintain sliding windows per request_id.
-
-### Pseudocode
-
-```
-for each incoming event:
-    bucket = active_events[event.request_id]
-
-    remove old events outside time window
-
-    for each old_event in bucket:
-        if event.latency + old_event.latency > threshold:
-            report anomaly
-
-    add event to bucket
-```
-
---
-
-## Additional Real Constraints
-
- Out-of-order events
- Missing logs
- Duplicate events
- Noise filtering
- Memory limits
-
---
-
-## Key Takeaway
-
-Two Sum is not about numbers.
-
-It is about recognizing patterns in controlled environments.
-
-Real engineering problems are about:
-
- defining valid data
- handling imperfect inputs
- managing time and memory
- maintaining system behavior under constraints
-
---
-
-## Project Perspective
-
-Exists in real engineering?  
-→ Yes, but heavily transformed
-
-Exists in interview form?  
-→ Yes, but oversimplified
--- a/analysis/03-Two_Sum_Is_Not_About_Numbers/examples/two_sum_logs_demo.cpp
+++ b/analysis/03-Two_Sum_Is_Not_About_Numbers/examples/two_sum_logs_demo.cpp
@@ -0,0 +1,223 @@
+#include <algorithm>
+#include <cstdint>
+#include <deque>
+#include <iomanip>
+#include <iostream>
+#include <sstream>
+#include <stdexcept>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+struct Event {
+    std::int64_t timestamp_ms;
+    std::string service;
+    std::string event;
+    int latency_ms;
+    std::string request_id;
+    std::string raw_line;
+};
+
+struct PairResult {
+    bool found = false;  // default member initializer: the PairResult{...} brace-init below needs C++14 or later
+    Event first;
+    Event second;
+};
+
+std::int64_t parseTimestampMs(const std::string& timestamp)
+{
+    // Expected format:
+    // YYYY-MM-DDTHH:MM:SS.mmmZ
+    // For this demo only the HH:MM:SS.mmm part is converted to milliseconds; the date is ignored, so all events are assumed to share one day.
+    const std::size_t t_pos = timestamp.find('T');
+    const std::size_t z_pos = timestamp.find('Z');
+
+    if (t_pos == std::string::npos || z_pos == std::string::npos) {
+        throw std::runtime_error("Invalid timestamp: " + timestamp);
+    }
+
+    const std::string time_part = timestamp.substr(t_pos + 1, z_pos - t_pos - 1);
+
+    int hours = 0;
+    int minutes = 0;
+    int seconds = 0;
+    int millis = 0;
+    char colon1 = '\0';
+    char colon2 = '\0';
+    char dot = '\0';
+
+    std::istringstream iss(time_part);
+    iss >> hours >> colon1 >> minutes >> colon2 >> seconds >> dot >> millis;
+
+    if (!iss || colon1 != ':' || colon2 != ':' || dot != '.') {
+        throw std::runtime_error("Invalid time part: " + time_part);
+    }
+
+    return (((hours * 60LL) + minutes) * 60LL + seconds) * 1000LL + millis;
+}
+
+Event parseLogLine(const std::string& line)
+{
+    std::istringstream iss(line);
+
+    std::string timestamp;
+    std::string service_token;
+    std::string event_token;
+    std::string latency_token;
+    std::string request_token;
+
+    if (!(iss >> timestamp >> service_token >> event_token >> latency_token >> request_token)) {
+        throw std::runtime_error("Cannot parse log line: " + line);
+    }
+
+    auto valueAfterEquals = [](const std::string& token) -> std::string {
+        const std::size_t pos = token.find('=');
+        if (pos == std::string::npos || pos + 1 >= token.size()) {
+            throw std::runtime_error("Invalid token: " + token);
+        }
+        return token.substr(pos + 1);
+    };
+
+    Event result;
+    result.timestamp_ms = parseTimestampMs(timestamp);
+    result.service = valueAfterEquals(service_token);
+    result.event = valueAfterEquals(event_token);
+
+    std::string latency_value = valueAfterEquals(latency_token);
+    if (latency_value.size() < 3 || latency_value.substr(latency_value.size() - 2) != "ms") {
+        throw std::runtime_error("Invalid latency token: " + latency_token);
+    }
+    latency_value.erase(latency_value.size() - 2);
+    result.latency_ms = std::stoi(latency_value);
+
+    result.request_id = valueAfterEquals(request_token);
+    result.raw_line = line;
+
+    return result;
+}
+
+std::vector<Event> parseLogs(const std::vector<std::string>& lines)
+{
+    std::vector<Event> events;
+    events.reserve(lines.size());
+
+    for (const std::string& line : lines) {
+        events.push_back(parseLogLine(line));
+    }
+
+    return events;
+}
+
+void printPair(const PairResult& result, const std::string& label)
+{
+    std::cout << label << '\n';
+
+    if (!result.found) {
+        std::cout << "  no pair found\n\n";
+        return;
+    }
+
+    std::cout << "  first : " << result.first.raw_line << '\n';
+    std::cout << "  second: " << result.second.raw_line << '\n';
+    std::cout << "  combined latency: "
+              << (result.first.latency_ms + result.second.latency_ms)
+              << "ms\n\n";
+}
+
+PairResult interviewStyleReduction(const std::vector<Event>& events, int threshold_ms)
+{
+    // Intentionally wrong for the real-world problem:
+    // it ignores request_id and time.
+    for (std::size_t i = 0; i < events.size(); ++i) {
+        for (std::size_t j = i + 1; j < events.size(); ++j) {
+            if (events[i].latency_ms + events[j].latency_ms > threshold_ms) {
+                return PairResult{true, events[i], events[j]};
+            }
+        }
+    }
+
+    return PairResult{};
+}
+
+PairResult realSlidingWindowDetection(const std::vector<Event>& events,
+                                      int threshold_ms,
+                                      std::int64_t window_ms)
+{
+    std::unordered_map<std::string, std::deque<Event> > active_events;
+
+    for (const Event& current : events) {
+        std::deque<Event>& bucket = active_events[current.request_id];
+
+        while (!bucket.empty() &&
+               (current.timestamp_ms - bucket.front().timestamp_ms) > window_ms) {
+            bucket.pop_front();  // assumes events for a request_id arrive in timestamp order
+        }
+
+        for (const Event& previous : bucket) {
+            const std::int64_t delta = current.timestamp_ms - previous.timestamp_ms;
+
+            if (delta >= 0 && delta <= window_ms &&
+                previous.latency_ms + current.latency_ms > threshold_ms) {
+                return PairResult{true, previous, current};
+            }
+        }
+
+        bucket.push_back(current);
+    }
+
+    return PairResult{};
+}
+
+void printEvents(const std::vector<Event>& events)
+{
+    std::cout << "Synthetic log stream:\n";
+    for (const Event& event : events) {
+        std::cout << "  " << event.raw_line << '\n';
+    }
+    std::cout << '\n';
+}
+
+int main()
+{
+    try {
+        const std::vector<std::string> raw_logs = {
+            "2026-04-16T10:15:01.100Z service=api    event=parse_input   latency=12ms request_id=req-1001",
+            "2026-04-16T10:15:01.110Z service=cache  event=cache_miss    latency=48ms request_id=req-1001",
+            "2026-04-16T10:15:01.120Z service=auth   event=token_check   latency=58ms request_id=req-2001",
+            "2026-04-16T10:15:01.130Z service=db     event=read_user     latency=43ms request_id=req-3001",
+            "2026-04-16T10:15:01.135Z service=db     event=read_user     latency=55ms request_id=req-1001",
+            "2026-04-16T10:15:01.144Z service=net    event=external_call latency=47ms request_id=req-1001",
+            "2026-04-16T10:15:01.200Z service=cache  event=cache_miss    latency=60ms request_id=req-3001",
+            "2026-04-16T10:15:01.260Z service=net    event=external_call latency=52ms request_id=req-3001"
+        };
+
+        const int threshold_ms = 100;
+        const std::int64_t window_ms = 20;
+
+        const std::vector<Event> events = parseLogs(raw_logs);
+
+        printEvents(events);
+
+        std::cout << "Threshold: " << threshold_ms << "ms\n";
+        std::cout << "Time window: " << window_ms << "ms\n\n";
+
+        const PairResult naive_result = interviewStyleReduction(events, threshold_ms);
+        printPair(naive_result, "Interview-style reduction (ignores request_id and time):");
+
+        const PairResult real_result = realSlidingWindowDetection(events, threshold_ms, window_ms);
+        printPair(real_result, "Streaming sliding-window detection:");
+
+        std::cout << "Notes:\n";
+        std::cout << "  - The interview-style version can produce a false correlation.\n";
+        std::cout << "  - In this dataset, it first matches 48ms from req-1001 with 58ms from req-2001.\n";
+        std::cout << "  - That pair exceeds the threshold, but it is operationally meaningless.\n";
+        std::cout << "  - The streaming version only correlates events from the same request_id\n";
+        std::cout << "    and only within the configured time window.\n";
+
+        return 0;
+    }
+    catch (const std::exception& ex) {
+        std::cerr << "Error: " << ex.what() << '\n';
+        return 1;
+    }
+}
--- a/analysis/03-Two_Sum_Is_Not_About_Numbers/readme.md
+++ b/analysis/03-Two_Sum_Is_Not_About_Numbers/readme.md
@@ -1,4 +1,6 @@
-# Analysis #XX — Two Sum Is Not About Numbers
+# Analysis #03 — Two Sum Is Not About Numbers
+
+---

 ## Problem

@@ -9,9 +11,10 @@ At first glance, the problem looks trivial:
 This is one of the most well-known interview questions, commonly referred to as **Two Sum**.

 It is simple, clean, and perfectly defined:
- a static array
- exact arithmetic
- a guaranteed answer
+
+- a static array  
+- exact arithmetic  
+- a guaranteed answer  

 And that’s exactly why it works so well in interviews.

@@ -21,10 +24,10 @@ And that’s exactly why it works so well in interviews.

 A candidate is expected to go through a familiar progression:

-1. Start with brute force (O(n²))
-2. Recognize inefficiency
-3. Optimize using a hash map
-4. Achieve O(n) time complexity
+1. Start with brute force (O(n²))  
+2. Recognize inefficiency  
+3. Optimize using a hash map  
+4. Achieve O(n) time complexity  

 ```cpp
 unordered_map<int, int> seen;
@@ -40,7 +43,7 @@ for (int i = 0; i < n; ++i) {
 }
 ```

-The “correct” answer is not about solving the problem.
+The “correct” answer is not really about solving the problem.

 It is about recognizing the pattern.

@@ -50,35 +53,41 @@ It is about recognizing the pattern.

 Despite its simplicity, this problem evaluates:

- familiarity with standard patterns
- ability to choose a data structure
- understanding of time complexity
+- familiarity with standard patterns  
+- ability to choose a data structure  
+- understanding of time complexity  

 But most importantly:

 > it tests whether you have seen this problem before.

+A candidate who has already practiced this family of tasks will likely reach the expected answer quickly.
+
+A candidate who has spent years solving real engineering problems may still pause — not because the problem is hard, but because the interview expects a very specific kind of answer.
+
 ---

 ## A Subtle Shift

-Now let’s take the same idea and move it one step closer to reality.
+Now let’s move the same idea one step closer to reality.

 Instead of numbers, we have **log events**.

 Instead of a static array, we have a **stream**.

-Instead of a clean equality, we have **imperfect data and thresholds**.
+Instead of a clean equality, we have **imperfect data, context, and thresholds**.

 ---

 ## Synthetic Log Example

-```
+```text
 2026-04-16T10:15:01.123Z service=api    event=parse_input   latency=12ms request_id=req-1001
 2026-04-16T10:15:01.130Z service=cache  event=cache_miss    latency=48ms request_id=req-1001
 2026-04-16T10:15:01.135Z service=db     event=read_user     latency=55ms request_id=req-1001
+2026-04-16T10:15:01.141Z service=auth   event=token_check   latency=18ms request_id=req-2001
 2026-04-16T10:15:01.144Z service=net    event=external_call latency=47ms request_id=req-1001
+2026-04-16T10:15:01.149Z service=db     event=read_user     latency=22ms request_id=req-2001
 2026-04-16T10:15:01.151Z service=cache  event=cache_miss    latency=60ms request_id=req-3001
 2026-04-16T10:15:01.154Z service=net    event=external_call latency=52ms request_id=req-3001
 ```
@@ -92,9 +101,9 @@ We are no longer asked to find two numbers.
 Instead, the problem becomes:

 > Detect whether there exist two events:
-> - belonging to the same request
-> - occurring close in time
-> - whose combined latency exceeds a threshold
+> - belonging to the same request  
+> - occurring within a time window  
+> - whose combined latency exceeds a threshold  

 This still *looks* like Two Sum.

@@ -102,60 +111,104 @@ But it is not.

 ---

+## How LeetCode Thinking Tries to Adapt
+
+The first instinct is to simplify.
+
+Take the log stream, ignore most of the structure, extract just the latency values, and reduce everything back to “numbers in an array”.
+
+That leads to a familiar line of thinking:
+
+1. Collect latencies  
+2. Search for matching pairs  
+3. Try to reuse the same hash map pattern  
+4. Treat the task as another variation of Two Sum  
+
+This is exactly what interview training encourages:
+
+> reduce the problem until it matches a known template.
+
+That works beautifully in interviews.
+
+But this is also where the model starts to break.
+
+---
+
 ## Where the Interview Model Breaks

-### 1. No Exact Match
+### 1. It Is Not an Exact-Match Problem

 Interview version:
-```
+
 a + b == target
-```

 Real version:
-```
-a + b > threshold
-```

-We are not searching for a perfect complement.
+a + b > threshold
+
+We are not searching for a perfect complement.  
 We are evaluating a condition.

 ---

-### 2. Context Is Mandatory
+### 2. Context Cannot Be Ignored

-You cannot combine arbitrary events.
+Latencies of 55ms from req-1001 and 52ms from req-3001 add up to 107ms, which may exceed the threshold.

-A latency spike only makes sense **within the same request**.
+But together they mean nothing.

-Without context, the result is meaningless.
+Without context, the result is technically correct — and completely useless.

 ---

-### 3. Time Matters
+### 3. Time Makes the Problem Harder

 Events are not just values — they exist in time.

-Two events five seconds apart may not be related at all.
+Two events may belong to the same request and still be unrelated if they are too far apart.

 This introduces:
- time windows
- ordering issues
- temporal constraints
+
+- time windows  
+- ordering  
+- eviction  

 ---

-### 4. Data Is Not Static
+### 4. The Data Is Not Static

-LeetCode assumes:
- full dataset
- already loaded
- perfectly ordered
+Interview assumptions:
+
+- full dataset available  
+- stable ordering  
+- perfect input  

 Reality:
- streaming input
- delayed events
- missing entries
- out-of-order delivery
+
+- streaming data  
+- out-of-order events  
+- missing logs  
+- duplicates  
+
+The “single clean pass over an array” stops being a valid model.
+
+---
+
+### 5. Pattern Matching Becomes a Trap
+
+The more familiar the pattern, the stronger the temptation:
+
+> “This is just Two Sum.”
+
+But in reality:
+
+- request_id defines grouping  
+- timestamp defines relevance  
+- streaming defines constraints  
+
+These are not details.
+
+They are the problem.

 ---

@@ -169,21 +222,22 @@ It becomes:

 > “determine which events are comparable at all”

-And that is a fundamentally different problem.
+The arithmetic is trivial.
+
+The system is not.

 ---

 ## Real Engineering Approach

-Instead of solving a mathematical puzzle, we build a system.
+Instead of solving a puzzle, we build a mechanism.

 ### Core Idea

-Maintain a sliding window of recent events per request.
+Maintain a sliding window per request_id.

 ### Pseudocode

-```
 for each incoming event:
    bucket = active_events[event.request_id]

@@ -194,7 +248,6 @@ for each incoming event:
            report anomaly

    add event to bucket
-```

 ---

@@ -202,17 +255,52 @@ for each incoming event:

 Now we must deal with:

- bounded memory
- streaming constraints
- time-based eviction
- correlation logic
+- bounded memory  
+- streaming constraints  
+- time-based eviction  
+- request-level grouping  

-And beyond that:
+And then reality hits:

- out-of-order events
- duplicate logs
- partial data
- noise filtering
+- out-of-order events  
+- duplicate logs  
+- partial data  
+- noise  
+
+At this point, the original Two Sum is almost unrecognizable.
+
+---
+
+## Demo
+
+See the example implementation:
+
+- examples/two_sum_logs_demo.cpp
+
+---
+
+## Example Output
+
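+Interview-style reduction:
+  pairs 48ms (req-1001) with 58ms (req-2001): 106ms exceeds the 100ms threshold, but the request_id differs → false positive
+
+Streaming solution:
+  pairs 55ms with 47ms, both from req-1001 and 9ms apart: 102ms exceeds the threshold → valid pair within the 20ms window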
+
+---
+
+## Explanation
+
+The interview-style solution produces a mathematically valid result.
+
+But it mixes unrelated events.
+
+The streaming solution respects:
+
+- request boundaries  
+- time constraints  
+
+Which makes the result meaningful.

 ---

@@ -222,20 +310,20 @@ The difficulty is not in computing a sum.

 The difficulty is in defining:

- what data is valid
- what events belong together
- what “close enough” means
- how the system behaves under imperfect conditions
+- what data is valid  
+- what events belong together  
+- what “close enough” means  
+- how the system behaves under imperfect conditions  

 ---

 ## Key Takeaway

-Two Sum is often presented as a problem about numbers.
+Two Sum is not about numbers.

-In reality, it is a problem about assumptions.
+It is about assumptions.

-Remove those assumptions, and the problem changes completely.
+Remove those assumptions — and the problem changes completely.

 > The challenge is not finding two values.  
 > The challenge is understanding whether those values should ever be compared.
@@ -245,7 +333,14 @@ Remove those assumptions, and the problem changes completely.
 ## Project Perspective

 Exists in real engineering?  
-→ Yes, but as event correlation under constraints
+→ Yes, but as event correlation under constraints  

 Exists in interview form?  
-→ Yes, but stripped of context and complexity
+→ Yes, but stripped of context and complexity  
+
+---
+
+## Final Note
+
+The algorithm was never the hard part.  
+The assumptions were.
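
A compact way to check what the two strategies report on the demo dataset is sketched below. It is an illustration only, not part of the commit: `MiniEvent`, `demoDataset`, `naivePair`, and `windowedPair` are hypothetical names, timestamps are written as millisecond offsets after 10:15:01.000 instead of being parsed, and events are assumed to arrive in timestamp order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, trimmed-down record: only the fields both strategies read.
struct MiniEvent {
    std::int64_t timestamp_ms;  // offset in ms after 10:15:01.000
    int latency_ms;
    std::string request_id;
};

// Indices of the reported pair, or (-1, -1) when nothing qualifies.
using IndexPair = std::pair<int, int>;

// Same timestamps, latencies, and request ids as raw_logs in the demo's main().
std::vector<MiniEvent> demoDataset()
{
    return {
        {100, 12, "req-1001"}, {110, 48, "req-1001"},
        {120, 58, "req-2001"}, {130, 43, "req-3001"},
        {135, 55, "req-1001"}, {144, 47, "req-1001"},
        {200, 60, "req-3001"}, {260, 52, "req-3001"},
    };
}

// Interview-style reduction: latencies only, request_id and time ignored.
IndexPair naivePair(const std::vector<MiniEvent>& events, int threshold_ms)
{
    for (std::size_t i = 0; i < events.size(); ++i) {
        for (std::size_t j = i + 1; j < events.size(); ++j) {
            if (events[i].latency_ms + events[j].latency_ms > threshold_ms) {
                return {static_cast<int>(i), static_cast<int>(j)};
            }
        }
    }
    return {-1, -1};
}

// Per-request sliding window. Assumes in-order arrival, so after eviction
// every index left in a bucket is within window_ms of the current event.
IndexPair windowedPair(const std::vector<MiniEvent>& events,
                       int threshold_ms,
                       std::int64_t window_ms)
{
    assert(window_ms >= 0);

    // request_id -> indices of events still inside the window
    std::unordered_map<std::string, std::deque<int>> active;

    for (int i = 0; i < static_cast<int>(events.size()); ++i) {
        const MiniEvent& current = events[i];
        std::deque<int>& bucket = active[current.request_id];

        // Evict events of this request that are older than the window.
        while (!bucket.empty() &&
               current.timestamp_ms - events[bucket.front()].timestamp_ms > window_ms) {
            bucket.pop_front();
        }

        for (int j : bucket) {
            if (events[j].latency_ms + current.latency_ms > threshold_ms) {
                return {j, i};
            }
        }

        bucket.push_back(i);
    }
    return {-1, -1};
}
```

With the demo's settings (threshold 100ms, window 20ms), `naivePair` reports indices (1, 2): the 48ms event from req-1001 and the 58ms event from req-2001, which belong to different requests. `windowedPair` reports indices (4, 5): the 55ms and 47ms events, both from req-1001 and 9ms apart.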