Designing Data-Intensive Applications
Foundations of Data Systems
- Data system categorization: Look beyond traditional categories (database, queue, cache) to see data systems as tools with overlapping functionality
- Key requirements: Focus on reliability, scalability, and maintainability as foundational properties
- Reliability definition: Build systems that work correctly even when faults occur
- Scalability approach: Design for graceful handling of growth in data, traffic, and complexity
- Maintainability emphasis: Create systems that many engineers can work on productively
- Fault vs. failure: Distinguish between faults (a component deviating from its spec) and failures (the system as a whole stops providing the required service)
- Hardware faults: Design redundancy for hardware components expected to fail
- Software faults: Prevent systematic errors through testing, isolation, and monitoring
- Human errors: Minimize through well-designed interfaces, sandboxed environments, and easy recovery
- Measurement importance: Use metrics such as response-time percentiles and throughput to make quantitative statements about scalability (a percentile sketch follows this list)
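A minimal sketch of the kind of measurement this implies, assuming response times have been collected in milliseconds; the sample latencies and the nearest-rank percentile helper are illustrative, not from the book.

```python
# Sketch: describing load handling with response-time percentiles (p50/p95/p99).
# The sample latencies below are made up for illustration.

def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending-sorted list (p in 0..100)."""
    if not sorted_values:
        raise ValueError("no measurements")
    rank = max(0, int(round(p / 100 * len(sorted_values))) - 1)
    return sorted_values[rank]

latencies_ms = sorted([12, 15, 11, 230, 14, 13, 17, 900, 16, 12])

for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

Percentiles matter more than averages here: the p99 (the 900 ms outlier above) is what the slowest requests actually experience, and it is usually the number a scalability target is written against.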
Data Models and Query Languages
- Data model impact: Recognize how your data model shapes application code and thinking
- Relational model benefits: Leverage the relational model for its generality and flexibility
- Document model advantages: Consider document model for schema flexibility and locality
- Impedance mismatch awareness: Be conscious of the object-relational mismatch
- Schema evolution: Plan for schema changes over time regardless of data model
- Normalization trade-offs: Balance normalization against read performance and complexity
- Graph model applicability: Use graph models when many-to-many relationships are core to your data
- Query language paradigms: Understand differences between declarative and imperative approaches
- MapReduce limitations: Recognize the expressiveness limitations of MapReduce patterns
- Declarative preference: Prefer declarative query languages, which leave the system free to optimize and parallelize execution (a comparison sketch follows this list)
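To make the declarative-versus-imperative contrast concrete, here is a small sketch in the spirit of the familiar "sharks" query; the table, data, and use of sqlite3 are illustrative assumptions.

```python
# Sketch: the same query written imperatively (a loop that spells out *how*)
# and declaratively (SQL that states *what*, leaving the query planner free
# to choose indexes, ordering, and parallelism).
import sqlite3

animals = [("shark", "Sharks"), ("salmon", "Salmonidae"), ("lamprey", "Petromyzontidae")]

# Imperative: the access pattern is fixed in application code.
sharks_imperative = [name for name, family in animals if family == "Sharks"]

# Declarative: only the desired result is stated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)", animals)
sharks_declarative = [row[0] for row in
                      conn.execute("SELECT name FROM animals WHERE family = 'Sharks'")]

assert sharks_imperative == sharks_declarative == ["shark"]
```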
Storage and Retrieval
- Storage engine classification: Distinguish between OLTP (transaction processing) and OLAP (analytics) workloads
- Log-structured designs: Consider log-structured storage engines for high write throughput (a toy example follows this list)
- B-tree characteristics: Leverage B-trees for ordered data with strong read performance
- Comparing LSM and B-trees: Choose LSM-trees for high write throughput, B-trees for strong read performance
- Clustered index impact: Understand impact of clustered indexes on data locality
- Secondary index overhead: Account for write amplification from maintaining secondary indexes
- In-memory databases: Use in-memory databases when performance justifies the cost
- Data warehouse purpose: Build separate systems optimized for analytics queries
- Column-oriented storage: Use column storage for analytical workloads with many columns
- Compression benefits: Apply compression to reduce storage costs and improve throughput
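For a feel of the log-structured idea mentioned above, here is a toy append-only key-value store with an in-memory hash index; the file format is invented for illustration, and real engines add segmentation, compaction, and crash recovery on top.

```python
# Sketch: append-only log + hash index (byte offsets into the log). Writes are
# sequential appends; reads do one seek. Keys must not contain commas and
# values must not contain newlines in this toy format.
import os

class AppendOnlyKV:
    def __init__(self, path="kv.log"):
        self.path = path
        self.index = {}                      # key -> byte offset of newest record
        open(self.path, "ab").close()        # ensure the log file exists

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # record starts at current end of file
            f.write(record)
        self.index[key] = offset             # newest write wins

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)
            line = f.readline().decode("utf-8").rstrip("\n")
        _, _, value = line.partition(",")
        return value

db = AppendOnlyKV()
db.set("42", '{"name": "San Francisco"}')
db.set("42", '{"name": "Oakland"}')          # the old record stays in the log
print(db.get("42"))                          # the index points at the newest value
```

Compacting away superseded records and making the index recoverable after a crash are exactly the parts that separate a real LSM-style engine from this sketch.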
Encoding and Evolution
- Encoding importance: Select appropriate encodings for data in memory, on disk, and over networks
- Backward compatibility: Maintain the ability to read old data with new code
- Forward compatibility: Ensure data written by new code can still be read by old code (both directions are sketched after this list)
- Language-specific formats: Avoid language-specific serialization formats due to coupling and performance issues
- Text format trade-offs: Understand that text formats (JSON, XML) are human-readable but verbose
- Binary encoding benefits: Use binary encodings for efficiency and reduced size
- Schema evolution: Design for schema changes without downtime
- Schema registry value: Consider schema registries to manage schemas and evolution
- RPC complications: Be aware of the many failure modes unique to RPC calls
- Async workflows: Design for asynchronous message passing when appropriate
- Message broker benefits: Use message brokers to decouple producers from consumers
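A small sketch of what backward and forward compatibility look like in application code, using JSON for readability; in practice a schema-based binary format (Avro, Protocol Buffers, Thrift) gives the same properties with less ceremony. The "user"/"nickname" schema here is hypothetical.

```python
# Sketch: a record type that gained an optional "nickname" field in v2.
import json

def decode_user_v2(raw: str) -> dict:
    """New code reading old data: supply a default for the missing new field
    (backward compatibility)."""
    doc = json.loads(raw)
    return {"id": doc["id"], "name": doc["name"], "nickname": doc.get("nickname")}

def decode_user_v1(raw: str) -> dict:
    """Old code reading new data: ignore fields it does not know about rather
    than rejecting the record (forward compatibility)."""
    doc = json.loads(raw)
    return {"id": doc["id"], "name": doc["name"]}

old_record = '{"id": 1, "name": "Ada"}'
new_record = '{"id": 2, "name": "Grace", "nickname": "Amazing Grace"}'

print(decode_user_v2(old_record))   # new reader, old data: nickname defaults to None
print(decode_user_v1(new_record))   # old reader, new data: unknown field ignored
```

The same two rules (new fields must be optional or carry defaults, and unknown fields must be tolerated) are what allow rolling upgrades without downtime.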
Replication
- Replication purpose: Replicate data for data locality, availability, and read scaling
- Synchronous vs. asynchronous: Weigh the durability guarantee of synchronous replication against the lower write latency and higher availability of asynchronous replication
- Single-leader benefits: Use single-leader replication for simplicity when read scaling is sufficient
- Replica setup: Configure new replicas properly to avoid inconsistency
- Replica failure: Design robust processes for handling node failures and rejoins
- Replication lag: Account for replication lag in system design and user experience
- Read consistency: Choose appropriate read consistency for your application needs
- Multi-leader complexities: Recognize the conflict resolution requirements of multi-leader setups
- Leaderless trade-offs: Understand durability and consistency trade-offs in leaderless systems
- Quorum calculations: Choose w and r so that w + r > n, guaranteeing that read and write quorums overlap in leaderless replication; this makes stale reads unlikely but does not by itself give strong consistency (a worked check follows this list)
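A minimal worked check of the quorum arithmetic; the node counts are illustrative.

```python
# Sketch: the quorum condition for leaderless replication. With n replicas,
# writes acknowledged by w nodes and reads consulting r nodes, w + r > n
# forces every read set to overlap every write set. Reads stay available
# with up to n - r nodes down, writes with up to n - w nodes down.

def quorums_overlap(n: int, w: int, r: int) -> bool:
    return w + r > n

for n, w, r in [(3, 2, 2), (5, 3, 3), (5, 1, 1)]:
    print(f"n={n} w={w} r={r}: overlap={quorums_overlap(n, w, r)}, "
          f"reads tolerate {n - r} failures, writes tolerate {n - w}")
```

Even with overlapping quorums, sloppy quorums, concurrent writes, and partially failed writes can still surface stale or conflicting values, which is why the bullet above stops short of claiming linearizability.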
Partitioning
- Partitioning purpose: Partition data to improve scalability and balance load
- Key-based partitioning: Choose partitioning schemes based on key distribution knowledge
- Hash partitioning: Use hash partitioning to distribute load evenly across partitions (a routing sketch follows this list)
- Range partitioning: Apply range partitioning when range queries are important
- Secondary index challenges: Address the scatter/gather overhead of secondary indexes across partitions
- Rebalancing approaches: Select rebalancing strategies that minimize data movement
- Request routing: Create a robust system for routing requests to appropriate partitions
- Skew avoidance: Design to avoid hot spots from skewed workloads
- Parallel query execution: Build mechanisms for parallel query execution across partitions
- Combining techniques: Combine replication and partitioning for scalable, available systems
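A sketch of hash-based request routing, assuming a fixed number of partitions chosen up front (the usual trick, since hashing modulo a changing node count would reshuffle almost every key):

```python
# Sketch: route each key to one of a fixed set of partitions by hashing.
# A stable hash is used instead of Python's built-in hash(), which is
# randomized per process and therefore useless for routing.
import hashlib

NUM_PARTITIONS = 8          # fixed up front; rebalancing moves whole partitions between nodes

def partition_for(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

for key in ["user:1", "user:2", "order:2024-01-01", "order:2024-01-02"]:
    print(key, "->", partition_for(key))
```

Hashing spreads load evenly but destroys key ordering, which is exactly why range queries and document-partitioned secondary indexes turn into scatter/gather requests across all partitions.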
Transactions
- Transaction purpose: Use transactions to simplify error handling and concurrency for applications
- ACID interpretation: Understand the practical meaning of Atomicity, Consistency, Isolation, and Durability
- Weak isolation impacts: Be aware of anomalies possible under weak isolation levels
- Read phenomena: Recognize and prevent dirty reads, non-repeatable reads, and phantom reads
- Isolation levels: Choose appropriate isolation levels based on application requirements
- Optimistic vs. pessimistic: Select optimistic or pessimistic concurrency control based on how much contention you expect (an optimistic sketch follows this list)
- Serializability cost: Understand the performance cost of serializable isolation
- Two-phase locking: Use 2PL when strong isolation is required despite performance impact
- Serializable snapshot isolation: Consider SSI for serializable isolation with better performance
- External consistency need: Determine if your application requires external consistency/linearizability
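A sketch of the optimistic side of that choice: read a version, do the work, and commit only if the version is unchanged. The in-memory store, version tuples, and retry loop are invented for illustration; a real database performs the check atomically at commit time.

```python
# Sketch: optimistic concurrency control via compare-and-set on a version number.

class WriteConflict(Exception):
    pass

store = {"balance": (100, 0)}            # key -> (value, version)

def compare_and_set(key, new_value, expected_version):
    value, version = store[key]
    if version != expected_version:
        raise WriteConflict(f"{key} was modified after it was read")
    store[key] = (new_value, version + 1)

def withdraw(amount, retries=3):
    for _ in range(retries):
        balance, version = store["balance"]   # optimistic read, no lock taken
        if balance < amount:
            raise ValueError("insufficient funds")
        try:
            compare_and_set("balance", balance - amount, version)
            return
        except WriteConflict:
            continue                          # conflict detected: retry the transaction
    raise WriteConflict("gave up after retries")

withdraw(30)
print(store["balance"])                       # (70, 1)
```

Under low contention the retries are rare and this beats pessimistic locking; under high contention most attempts abort and 2PL-style locking tends to win.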
The Trouble with Distributed Systems
- Partial failure reality: Design for partial failures in distributed systems
- Unreliable networks: Assume the network will fail in various ways
- Timeout tuning: Set timeouts carefully based on expected response distributions
- Network faults handling: Handle network partitions and faults gracefully
- Clock synchronization limitations: Don’t rely on synchronized clocks for critical functionality
- Monotonic clock usage: Use monotonic clocks, not time-of-day clocks, for measuring elapsed time (see the sketch after this list)
- Anomaly detection: Implement detection of clock skew and network issues
- Failure detection: Create robust failure detection mechanisms with appropriate timeouts
- Knowledge limitations: Acknowledge the fundamental limitations of knowledge in distributed systems
- Byzantine fault tolerance: Determine if Byzantine fault tolerance is needed for your threat model
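The monotonic-clock point is easy to show concretely; the workload being timed here is a stand-in.

```python
# Sketch: measure elapsed time with a monotonic clock. The time-of-day clock
# (time.time()) can jump backwards or forwards when NTP steps it, so a
# duration computed from it can even come out negative.
import time

start = time.monotonic()                 # only ever moves forward
total = sum(range(1_000_000))            # placeholder for the work being timed
elapsed = time.monotonic() - start
print(f"work took {elapsed * 1000:.1f} ms (result {total})")

# time.time() remains the right choice for *points* in time (event timestamps
# to compare across machines), just not for durations or timeout bookkeeping.
```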
Consistency and Consensus
- Consistency models: Understand the spectrum from eventual consistency to strict serializability
- Linearizability benefits: Use linearizability when recency guarantees are critical
- Linearizability costs: Be aware of the performance and availability costs of linearizability
- Causality importance: Respect causal dependencies in your system design
- Lamport timestamps: Use Lamport timestamps to generate a total ordering of events that is consistent with causality (a sketch follows this list)
- Total order broadcast: Recognize that total order broadcast is equivalent to consensus and is the foundation of state machine replication
- Consensus algorithms: Choose appropriate consensus algorithms for your requirements
- Two-phase commit limitations: Understand the coordinator failure vulnerability in 2PC
- Distributed transactions: Use distributed transactions cautiously due to performance and operational costs
- Idempotence design: Design operations to be idempotent when possible
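A compact sketch of Lamport timestamps, since they come up above; the node names and message flow are invented for illustration.

```python
# Sketch: Lamport timestamps. Each node keeps a counter; every event is tagged
# (counter, node_id), and receiving a message bumps the counter past the
# sender's. Sorting the tags gives a total order consistent with causality,
# though it cannot distinguish concurrent events from causally ordered ones.

class LamportClock:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, message_ts):
        self.counter = max(self.counter, message_ts[0]) + 1
        return (self.counter, self.node_id)

a, b = LamportClock("A"), LamportClock("B")
e1 = a.tick()              # (1, 'A') local event on A
msg = a.tick()             # (2, 'A') timestamp sent along with a message
e2 = b.receive(msg)        # (3, 'B') ordered after everything A has done so far
e3 = b.tick()              # (4, 'B')
print(sorted([e1, msg, e2, e3]))   # compare counters first, node ids break ties
```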
Batch Processing
- Batch processing value: Leverage batch processing for cost-efficient high-volume data processing
- Unix philosophy: Apply Unix philosophy of composability to batch job design
- MapReduce workflow: Structure complex analytics as multi-stage MapReduce jobs
- Sorting benefits: Utilize sorting to bring related records together for efficient grouping and joining in batch jobs (a miniature map-sort-reduce example follows this list)
- Partitioning effect: Use partitioning to parallelize batch processing
- Fault tolerance: Design batch jobs to handle partial failures gracefully
- Task granularity: Choose appropriate task sizes to balance parallelism and overhead
- Output immutability: Treat outputs as immutable for simpler recovery and debugging
- Separation of concerns: Separate the logical data processing from physical execution
- Workflow coordination: Coordinate workflows using scheduler tools like Airflow or Oozie
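The map, sort, reduce shape referenced above fits in a few lines; the documents are made up and the word count is the standard toy example.

```python
# Sketch: a miniature MapReduce-style word count. The mapper emits (key, value)
# pairs, sorting plays the role of the shuffle that brings equal keys together,
# and the reducer folds each run of equal keys into one output record.
from itertools import groupby
from operator import itemgetter

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

mapped = [(word, 1) for doc in documents for word in doc.split()]   # map
mapped.sort(key=itemgetter(0))                                      # shuffle/sort
counts = {word: sum(v for _, v in group)                            # reduce
          for word, group in groupby(mapped, key=itemgetter(0))}
print(counts)
```

The same pipeline in Unix tools would be roughly `tr ' ' '\n' | sort | uniq -c`; the composability is the point, since each stage reads one immutable input and writes one output the next stage can consume.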
Stream Processing
- Stream processing applications: Apply stream processing for event-rich domains
- Event stream origin: Capture event streams at their source to enable derived views
- Messaging delivery: Choose message delivery semantics based on application needs
- Messaging systems: Select messaging systems based on throughput, latency, and durability requirements
- Database integration: Integrate streams with databases through change data capture
- Stream processing patterns: Implement common patterns like windowing, sampling, and joining
- Time handling: Handle the discrepancy between event time and processing time (a windowing sketch follows this list)
- Watermark heuristics: Use watermarks to reason about event completeness
- State management: Manage state carefully in stream processors
- Idempotent operations: Design stream operations to be idempotent for exactly-once semantics
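A sketch of the event-time windowing referenced above; the event shape, timestamps, and one-minute tumbling window are hypothetical.

```python
# Sketch: count events per tumbling window keyed by *event* time, so an event
# that arrives late still lands in the window it belongs to. Processing-time
# windows would instead count it wherever it happened to show up.
from collections import defaultdict

WINDOW_SECONDS = 60

events = [                       # (event time as epoch seconds, user_id)
    (1_700_000_005, "alice"),
    (1_700_000_042, "bob"),
    (1_700_000_065, "alice"),
    (1_700_000_010, "carol"),    # delivered late, assigned by its event time anyway
]

counts_per_window = defaultdict(int)
for event_time, _user in events:
    window_start = event_time - (event_time % WINDOW_SECONDS)
    counts_per_window[window_start] += 1

for window_start in sorted(counts_per_window):
    print(window_start, counts_per_window[window_start])

# A real stream processor pairs this with watermarks (to decide when a window
# is complete enough to emit) and durable state (to survive restarts).
```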
The Future of Data Systems
- Data integration approaches: Evaluate different approaches to integrating disparate systems
- Derived data value: Recognize the value of derived data for system flexibility
- Batch and stream unification: Unify batch and stream processing conceptually
- Unbundling databases: Consider unbundling database functionality for specific needs
- Separation of storage and processing: Separate storage from processing for flexibility
- Correctness enforcement: Push correctness enforcement to the infrastructure layer
- Metadata importance: Leverage metadata for system observability and evolution
- Ethical data use: Consider the ethical implications of data collection and processing
- Predictive analytics impact: Anticipate the privacy consequences of predictive analytics and pervasive data collection when designing data systems
- End-to-end thinking: Design data systems with end-to-end application needs in mind
Key Takeaways
- Reliability engineering: Build systems that continue working correctly even when things go wrong
- Scalability planning: Design for growth in data volume, complexity, and traffic from the start
- Maintainability focus: Prioritize operability, simplicity, and evolvability in system design
- Data model selection: Choose data models based on application access patterns, not just data structure
- Storage engine fit: Select storage engines based on workload characteristics (write-heavy vs. read-heavy)
- Encoding flexibility: Use encodings that allow for independent evolution of services
- Replication purpose: Apply replication patterns based on specific needs for availability, latency, or scalability
- Partitioning strategy: Partition data to scale write throughput and balance load appropriately
- Consistency levels: Select consistency levels based on actual application requirements, not dogma
- System integration: Design thoughtful integration between systems using batch, stream processing, and derived data