Top 12 System Design Metrics for PMs

Nrupal Das
Published in Bootcamp · 5 min read · Aug 30, 2023

When designing a new product, PMs should not simply hand every system design decision to architects and engineers. They must understand the decisions and tradeoffs being made, and they should participate in and contribute constructively to the decision-making process. After all, these design decisions have a major impact on your product.

A basic or advanced understanding of distributed system architecture, or of general system design topics, is very helpful for product managers.

Top System Design Metrics

These metrics will help you evaluate your product's performance and allow you to understand, predict, and make effective decisions about it.

  1. Scalability
  2. Latency & Response Time
  3. Throughput
  4. Concurrency
  5. User Experience
  6. Caching Efficiency
  7. Database Performance
  8. Data Consistency & Replication
  9. Fault Tolerance & High Availability
  10. Resource Utilization
  11. Security & Authorization
  12. Cost Efficiency

Scalability:

  • What will be your product adoption velocity?
  • What are the best-case and worst-case scenarios?
  • What is the ratio of user adoption to data storage requirements?
  • Can one user be very different from another user regarding their needs from the system?

Example: If you are a flight booking engine, each added user has a roughly linear impact on the demands on your system. However, suppose you are an average payments company and you suddenly sign Uber as a customer and start processing its payments. In this B2B setting, one customer can disproportionately impact the system.

A social media company that generates and stores user-generated data as a graph will have a very different ratio of data to users than a travel company, where a user creates an account and books flights. In the travel case, the user data generated is fairly straightforward and can generally be accommodated in an RDBMS.
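To make the users-to-data ratio concrete, here is a rough back-of-envelope sketch. The function name, the growth rate, and the per-user storage figures are all hypothetical placeholders, so substitute your own estimates.

```python
def projected_storage_gb(users_now, monthly_growth_rate, months, mb_per_user):
    """Project user count and storage after compounding monthly growth.

    All inputs are estimates you supply; nothing here is measured.
    """
    users = users_now * (1 + monthly_growth_rate) ** months
    storage_gb = users * mb_per_user / 1024
    return users, storage_gb


# Hypothetical inputs: 100k users today, 10% monthly growth, 12 months out.
# The per-user storage figures are made up to contrast the two product types.
for label, mb_per_user in [("travel booking", 0.5), ("social graph", 50.0)]:
    users, gb = projected_storage_gb(100_000, 0.10, 12, mb_per_user)
    print(f"{label}: {users:,.0f} users -> {gb:,.0f} GB")
```

Running the same growth assumption through two different per-user figures shows why the data model matters as much as the adoption curve.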

Latency & Response Time:

Latency: the time a request takes to travel from the client (user device) to the server and back.

Response Time: the time the server takes to respond to a particular query/request.

Note that some sources use "response time" for the total time the user waits, network latency included; this article uses it for the server-side portion only. Network speed and device performance affect latency, and both are largely outside your control, whereas response time is largely under the company’s control: it depends on the server hardware and the software running on it.
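Averages hide slow requests, so teams usually look at percentiles. The sketch below adds the network part and the server part for each request and reports the median and the 95th percentile. The sample numbers are invented, and `percentile` is a simple nearest-rank helper written for this example, not a library function.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical measurements in milliseconds, one pair per request.
network_ms = [40, 42, 45, 41, 300, 43, 44, 40, 46, 42]  # client <-> server travel
server_ms = [20, 22, 21, 25, 23, 180, 20, 21, 22, 24]   # server processing

# What the user actually waits for is the sum of both parts.
total_ms = [n + s for n, s in zip(network_ms, server_ms)]

print("p50 total:", percentile(total_ms, 50), "ms")
print("p95 total:", percentile(total_ms, 95), "ms")
```

A large gap between the median and the 95th percentile is a signal that a minority of users are having a much worse experience than the typical one.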

Throughput:

Throughput: the number of requests a system can handle per unit of time, typically measured in requests per second or transactions per second (TPS).

You must ensure that throughput remains high as the system scales.
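As a minimal sketch of how throughput can be read off request logs, the snippet below buckets request timestamps into whole seconds and reports the peak and the average. The function names and timestamps are made up for illustration; real systems usually get these numbers from their monitoring stack.

```python
from collections import Counter


def requests_per_second(timestamps):
    """Count requests in each whole-second bucket (timestamps are in seconds)."""
    return Counter(int(t) for t in timestamps)


def peak_and_average(timestamps):
    """Return (peak requests in any one second, average requests per second)."""
    buckets = requests_per_second(timestamps)
    span_seconds = max(buckets) - min(buckets) + 1
    return max(buckets.values()), len(timestamps) / span_seconds


# Hypothetical arrival times of five requests.
peak, average = peak_and_average([0.1, 0.5, 0.9, 1.2, 3.7])
print(f"peak: {peak} req/s, average: {average:.2f} req/s")
```

The gap between peak and average is often what drives capacity decisions, since the system has to survive the peak, not the average.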

Concurrency:

Concurrency: the number of simultaneous requests or users a system can handle.

The system should have enough headroom to handle concurrent users during business peaks. Capacity can be provisioned in advance when you can predict that traffic will peak for foreseeable reasons, such as a sale or a holiday season.
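One common way to estimate the concurrency you need is Little's Law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. The sketch below applies it with a headroom factor. The function name, the headroom default, and the peak figures are assumptions for illustration.

```python
import math


def required_concurrency(requests_per_second, avg_seconds_per_request, headroom=0.3):
    """Estimate in-flight requests via Little's Law (L = arrival rate * time in system),
    then add a safety margin. `headroom` is an assumed buffer, e.g. 0.3 for 30%."""
    in_flight = requests_per_second * avg_seconds_per_request
    with_headroom = math.ceil(in_flight * (1 + headroom))
    return in_flight, with_headroom


# Hypothetical peak: 2,000 requests per second, each taking 250 ms on average.
in_flight, target = required_concurrency(2000, 0.25, headroom=0.5)
print(f"average in flight: {in_flight:.0f}, capacity target: {target}")
```

Note that the law works on averages, so treat the result as a starting point for load testing rather than a guarantee.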

User Experience:

Many items in this list affect the user experience, but the most pronounced are latency, response time, throughput, concurrency, and caching efficiency. If you consider each of these from the user's perspective and make the necessary investment, you are far more likely to provide a superior user experience. This is an amazing paper to read on how Amazon made user experience the centrepiece of building new technology.

Caching Efficiency:

Generally speaking, in most products, reads (serving information) outnumber writes (creating or updating information), often by a ratio of around 10:1. Consequently, fast retrieval of information becomes a major need, which makes a cache, and the efficiency of that cache, worth pursuing.

However, if your application is write-heavy, cache efficiency may be less of a concern.
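Cache efficiency is usually tracked as a hit ratio: the share of reads served from the cache. The sketch below replays a sequence of reads through a small least-recently-used (LRU) cache and then estimates the average read time. LRU is just one eviction policy chosen for this example, and the latency figures are invented.

```python
from collections import OrderedDict


def hit_ratio(keys, capacity):
    """Replay read keys through an LRU cache and return hits / total reads."""
    cache = OrderedDict()
    hits = 0
    for key in keys:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True              # load into cache on a miss
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(keys)


def effective_latency_ms(ratio, cache_ms, db_ms):
    """Average read time given a hit ratio and per-path latencies."""
    return ratio * cache_ms + (1 - ratio) * db_ms


# Hypothetical read pattern and latencies (2 ms from cache, 50 ms from the database).
ratio = hit_ratio(["a", "b", "a", "c", "a", "b"], capacity=2)
print(f"hit ratio: {ratio:.2f}")
print(f"average read: {effective_latency_ms(ratio, 2, 50):.1f} ms")
```

Replaying real access logs through different cache sizes is a cheap way to see how much extra memory actually buys you.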

Database Performance:

Some of the key factors that influence your database performance are as follows:

  • Hardware resources: What kind of CPU, RAM, storage, and network speed powers your servers?
  • Database design: What kind of schema design, indexing, and/or partitioning is used for your database?
  • Query optimisation: How well optimised are your queries?
  • Access patterns: How well do you understand and act on the data access patterns?
  • Concurrency control: How well are you handling locking and isolation levels?
  • Caching: Is there a caching mechanism to store frequently used data in memory and reduce DB hits?
  • Growth: What data volume is your application handling, and at what rate will it continue to grow?
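To illustrate the indexing point from the list above, here is a small sketch using Python's built-in `sqlite3` module. It asks SQLite how it plans to run the same query before and after an index is added. The `bookings` table, its columns, and the index name are hypothetical, and other databases expose query plans differently.

```python
import sqlite3

# In-memory database with a made-up bookings table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (id INTEGER PRIMARY KEY, user_id INTEGER, route TEXT)"
)
conn.executemany(
    "INSERT INTO bookings (user_id, route) VALUES (?, ?)",
    [(i % 1000, f"route-{i % 50}") for i in range(10_000)],
)


def query_plan(sql):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


QUERY = "SELECT * FROM bookings WHERE user_id = 42"

plan_before = query_plan(QUERY)  # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_bookings_user ON bookings (user_id)")
plan_after = query_plan(QUERY)   # the planner can now use the index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the shift from scanning the whole table to searching through the index is the part to look for.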

Data Consistency & Replication:

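Decide how much data consistency your system needs, given the individual facets of your distributed computing architecture. When data is replicated across nodes, users may see stale or conflicting values unless consistency is enforced, and that hurts the user experience.

Consistency is one of the three properties in the CAP theorem.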

Fault Tolerance & High Availability:

These map to the other two properties in the CAP theorem: partition tolerance and availability. The theorem states that a distributed system cannot guarantee all three of consistency, availability, and partition tolerance at once. In practice, network partitions cannot be ruled out, so the real choice is whether the system gives up consistency or availability while a partition lasts.

You need to define which of these properties matter most for your system.
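To put numbers on availability, it helps to translate a percentage into the downtime it permits. The sketch below does that, and also shows how availability drops when a request depends on several components in a row. It assumes component failures are independent, which is a simplification, and the figures are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability):
    """Downtime allowed per year for an availability expressed as a fraction (0.999 = 99.9%)."""
    return (1 - availability) * MINUTES_PER_YEAR


def serial_availability(*components):
    """Availability of a request path that needs every component to be up.

    Assumes failures are independent, which real systems only approximate.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


for target in (0.99, 0.995, 0.999, 0.9999):
    print(f"{target:.2%} -> {downtime_minutes_per_year(target):,.0f} minutes of downtime per year")

# Hypothetical path: load balancer, application server, and database, each at 99.9%.
print(f"three components in series: {serial_availability(0.999, 0.999, 0.999):.4%}")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the last fraction of a percent tends to be the most expensive.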

Resource Utilization:

Alerts on resource utilization let you keep tabs on which systems are seeing ‘hot’ usage, which are seeing sustained overloads, and so on. Acting on this information helps make your system more resilient.
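The difference between a brief spike and a sustained overload can be captured with a simple rule. The sketch below labels a series of utilization samples. The 80% threshold and the three-sample window are arbitrary assumptions, and production alerting tools offer far richer rules than this.

```python
def classify(samples, hot=0.80, sustained=3):
    """Label a series of utilization samples (each between 0.0 and 1.0).

    'overloaded': at least `sustained` consecutive samples at or above `hot`
    'hot':        at least one sample at or above `hot`, but no sustained run
    'ok':         every sample below `hot`
    """
    longest_run = current_run = 0
    for value in samples:
        current_run = current_run + 1 if value >= hot else 0
        longest_run = max(longest_run, current_run)
    if longest_run >= sustained:
        return "overloaded"
    return "hot" if longest_run > 0 else "ok"


# Hypothetical CPU utilization readings for three servers.
readings = {
    "api-1": [0.40, 0.55, 0.50, 0.45],
    "api-2": [0.50, 0.90, 0.60, 0.55],
    "db-1": [0.85, 0.90, 0.95, 0.88],
}
for server, samples in readings.items():
    print(server, "->", classify(samples))
```

Separating the two cases keeps short spikes from paging anyone while still flagging the servers that genuinely need more capacity.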

Security & Authorization:

Needless to say, you have to be careful about your application's security and authorisation. I am not an expert in this area; consult the security experts in your company to get it right, then follow up with penetration testing by either an internal team or an external vendor to confirm that your application holds up.

Cost Efficiency:

Lastly, be conscious that the curve of cost versus improved user experience rises sharply in the >99.5% region, whether the target is availability, partition tolerance, latency, or something else. In other words, how critical is the need, and how much do you have to spend to keep your SLA within that range? In a few cases, it might make sense not to invest and, in those rare scenarios, handle the user experience some other way.

Example: Instead of spending $1M upgrading your system so that situation X never occurs, it might sometimes make sense to accept that X can occur and give a $100 gift voucher to each affected user when it does. As a product manager, you have to think this through and make the right business decision. It is not a technology decision; it is a product decision.
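The voucher example can be turned into a quick expected-cost comparison. The $1M upgrade and the $100 voucher come from the example above; the incident probability and the number of affected users are assumptions you would need to estimate, and the sketch ignores softer costs such as reputation and churn.

```python
def expected_voucher_cost(incident_probability, affected_users, voucher_usd=100):
    """Expected compensation cost: chance of the incident times the total payout."""
    return incident_probability * affected_users * voucher_usd


def break_even_users(upgrade_cost_usd, incident_probability, voucher_usd=100):
    """Number of affected users at which vouchers cost as much as the upgrade."""
    return upgrade_cost_usd / (incident_probability * voucher_usd)


UPGRADE_COST_USD = 1_000_000  # from the example above

# Hypothetical: a 50% chance of the incident over the planning period, 2,000 users affected.
vouchers = expected_voucher_cost(0.5, 2_000)
print(f"expected voucher cost: ${vouchers:,.0f} vs upgrade: ${UPGRADE_COST_USD:,}")
print(f"break-even at {break_even_users(UPGRADE_COST_USD, 0.5):,.0f} affected users")
```

If the number of users you expect to be affected is well below the break-even figure, compensating them is the cheaper option on paper, and the remaining question is whether the non-monetary costs change that answer.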

#systemdesign #productdesign #productmanagement

Product Management | Chevening Fellow, Oxford University | ISB | Author | Successfully Co-founded 2 Startups