This document is intended for operators and developers of Manta seeking to better understand Manta’s runtime behavior, both under normal operation and during incident response.

You may also find Manta’s other documentation, such as the Operator Guide, helpful.

This document is organized into a few broad sections:

  • Incident Response Decision Tree: this chapter attempts to walk you through the process of understanding common problems with Manta. This often links directly to specific common tasks later in the document.

  • Basic investigation tasks: this chapter covers a number of different specific tasks associated with debugging Manta, like understanding a Muskie log entry, checking what instances of a service are registered in DNS, and so on.

  • Quick references to HTTP status codes and headers used in Manta

  • Deployment-specific details: this chapter lists a number of details that your company or organization should separately document.

  • General FAQ

Incident Response Decision Tree

Choose a place to start:

Many of these sections refer to checking recent historical metrics. See Deployment-specific details for more information.

Investigating a decrease in overall throughput

Start here if you have reason to believe that inbound or outbound throughput to Manta has recently been reduced.

First, can you confirm the decrease in throughput from Manta’s own metrics? If not, the problem may be external to Manta. Since problems often unfold over several minutes, you may have to look back over several hours to get a sense of recent throughput.

Figure 1. A sudden drop in inbound throughput after several hours of steady performance.
  • Has the error rate increased? Check recent historical metrics for an increase in 400-level or 500-level errors.

    • Yes, there’s an increase in the overall error rate. See Investigating an increase in error rate. It may be that many requests that were previously accounting for throughput are now failing. (If the error rate is not high enough to account for the change in throughput, then it may be useful to investigate latency instead.)

      Figure 2. A transient spike in 503 errors. (Note the absolute number is extremely low here. To see this against a larger number of successful requests, you may need to click on "503" in the legend to show only the 503s.)
    • No, the error rate is unchanged. Has webapi latency increased? Check recent historical metrics for an increase in average latency, p90 latency, or p95 latency.

      • Yes, webapi latency has increased. See Investigating an increase in latency.

        Figure 3. An increase in webapi tail latency.
      • No, webapi latency is unchanged or lower. If there are no errors and no increased latency, then Manta appears healthy. Check upstream network connections, client metrics, and client behavior. (e.g., has the client changed its workload?)

Background: By definition, the throughput of the system (either inbound or outbound) is the number of completed requests per second times average request size (in terms of inbound or outbound bytes, depending on which throughput you’re interested in). If throughput is lower, then one of the following is likely true:

  1. There are fewer requests per second because there’s a fixed concurrency and average request latency has increased. This is generally a server problem. For example, it’s common for there to be a fixed number of clients, each with a fixed maximum concurrency. In that situation, if Manta requests start taking twice as long, then the number of requests completed per second will naturally be cut in half, cutting overall throughput in half.

  2. Average request size has reduced because requests are failing (resulting in small requests). This is generally a server problem, though it could also be a client problem. For example, if the average upload is 10 MiB, but suddenly 10% of requests are failing before they reach the upload stage, overall throughput is likely to degrade by 10%, since 10% of requests are now uploading almost no data.

  3. There are fewer requests per second because clients are making fewer requests. This would be a client change (and may not be an issue). For example, this could happen when an operator turns off some clients.

  4. Average request size has reduced because of a deliberate change in the client workload. This would be a client change (and may not be an issue).

Investigating an increase in latency

When there’s reason to suspect an overall increase in webapi latency, real-time metric dashboards can be used to identify the source. The first step is typically confirming whether latency is visible at webapi.

Note

A Manta deployment can easily execute tens of thousands of requests per second, each with an associated latency. It’s useful to summarize a large number of latencies with a single number so that we can identify trends, both for raising alarms and for general situational awareness during incident response.

Average latency is the sum of latencies divided by the request count. This is useful for understanding changes in throughput, since throughput changes that result from increased latency would manifest as increased average latency.

However, normal-looking averages can obscure high-latency requests that have an outsize impact on clients. As a result, we often focus on tail latency, or the latencies of the slowest requests. The notation p99 refers to the 99th-percentile latency: the value below which 99% of requests completed. If the p90 is 300ms, then 90% of requests completed within 300ms.

When making broad statements about latency, it’s helpful to be specific about whether you’re talking about average latency or tail latency. However, the questions below are intentionally ambiguous, as increases in either average or tail latency might be cause for further investigation.
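For a rough sense of these summaries, you can compute them directly from a local copy of a Muskie log file using the json tool described later in this document. This is only a sketch (muskie.log is a hypothetical local copy, and a metrics dashboard is normally the better tool for this):

# estimate the p90 of the "latency" field (milliseconds) across audit entries
grep 'handled: ' muskie.log | json -ga latency | sort -n |
    awk '{ a[NR] = $1 } END { if (NR) print a[int(NR * 0.9)] }'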

Figure 4. A steady increase in webapi tail latency.
Figure 5. A transient spike in webapi tail latency.

Has latency at Muskie increased?

  • No, Muskie latency has not increased. There does not appear to be an increase in latency from Manta. If clients are reporting elevated latency, try to confirm that. One approach is to collect information from a client about a specific slow request and build a request timeline. This can help determine if latency is coming from within Manta or not.

  • Yes, Muskie latency has increased. At this point, you can find or generate a specific slow request and then understand its latency. But when you’re looking at a trend like this, it’s often helpful to look at latency at other parts of the system. The next question is: Has latency at Electric-Moray increased?

    • Yes, latency has increased at Electric-Moray, too. In this case, has latency at Moray increased as well?

      • Yes, latency has increased at Moray, too. See below.

      • No, there’s no corresponding increase in latency at Moray. This is unusual. There may be a problem with particular electric-moray instances, or the whole tier may be overloaded. See Investigating a slow tier of Node processes.

    • No, there’s no corresponding increase in latency at Electric-Moray. This is unusual. There may be a problem with particular webapi instances, or the whole tier may be overloaded. See Investigating a slow tier of Node processes.

Figure 6. A decrease in webapi latency that’s well correlated with a decrease in maximum Moray tail latency across all shards.

If you found above that latency at Moray has increased, take a look at the Moray latency metrics broken out by shard name (or by zonename, mapping each zonename to its shard name; see the command sketch after this list for one way to do that). Remember to ignore shard 1. Have multiple Moray shards seen an increase in latency?

  • No, only one shard has elevated latency. You’ll need to dig into the behavior of this shard. See below.

  • Yes, multiple Moray shards have elevated latency. Moray shards are generally independent, so problems with multiple shards may need to be investigated separately. On the other hand, you may wind up looking at system-wide PostgreSQL metrics next, in which case you may be able to answer questions about several shards at once.
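One way to map Moray zonenames to shard names is with manta-adm on the headnode. This is only a sketch; the available column names may vary with your manta-adm version:

# list Moray zones along with their shard numbers (run from the headnode global zone)
manta-adm show -o zonename,shard moray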

If you’ve established that a particular shard’s Moray latency has increased, the next step is to identify whether Moray or PostgreSQL is the source of the latency. This is a bit more difficult than the previous steps because we do not have metrics for p99 query latency from PostgreSQL.

There are a couple of different questions to ask:

  • Are all Moray zones affected in the same way? Specifically, compare tail latency, average latency (if possible), and queue depth across zones for this shard. Are all of them elevated, or are some different than others?

    • Yes, all Moray zones in this shard appear affected in the same way. This suggests a problem with PostgreSQL rather than Moray.

    • No, some Moray zones in this shard appear much more affected than others. This is more likely to reflect a problem with specific Moray instances rather than PostgreSQL. See the question below about database connections, and see also Investigating a slow tier of Node processes.

  • Is there a high rate of Moray queueing on this shard, relative to other shards? If there’s a high rate of queueing, the database might be the source of the latency. If not, it’s possible that Moray is the source of the problem.

  • If there is queueing at Moray, check the number of backend connections (or processes) reported on the PostgreSQL dashboard. Does this shard have the same number of database connections as other shards?

    • Yes, this shard has the same number of connections as other shards. Move on to other questions.

    • No, this shard has fewer connections than other shards. This may indicate a connection management problem at Moray. If Moray instances lose track of connections, they may be limited in how much work they can dispatch to PostgreSQL, resulting in increased latency as requests queue up.

    • No, this shard has far more connections than other shards. If this shard has over 500 connections, that may indicate a problem with Moray. There are generally supposed to be at most 64 connections per Moray zone, and we usually deploy 3-6 Moray zones per shard, so even a fully deployed shard should have no more than about 384 connections (64 per zone across at most 6 zones).

Based on these questions, if it seems like the problem is associated with PostgreSQL, see Investigating PostgreSQL latency. Otherwise, the problem is likely with Moray. See Investigating a slow tier of Node processes.

Investigating an increase in error rate

There are a few major kinds of errors.

  • A 500-level response (that is, a well-formed HTTP response with a status code between 500 and 599) generally reflects a problem with Manta.

  • A 400-level response (that is, a well-formed HTTP response with a status code between 400 and 499) may happen under normal operation and may indicate no problem at all, or it may reflect a client issue. In rare cases, a 499 response can reflect a server issue. The details depend on the specific type of error and whether the client expects it or not.

  • If the client gives up before the server has sent a response, then the client will likely report a client timeout, while the server will likely report a connection that was abruptly closed by the client.

  • If there’s a networking issue that causes the client or server to abandon the connection, both sides will generally report an explicit socket error.

Errors that manifest as explicit HTTP responses (i.e., 400-level and 500-level responses) are visible to operators via logs and metrics provided by Manta.

There are less common failures as well:

  • A client may find no servers in DNS. This may or may not be visible to operators.

  • A client may time out attempting to resolve DNS. This is unlikely to be visible to Manta operators.

  • A client may time out attempting to establish a TCP connection to Manta. This is not likely to be visible to Manta operators.

The rest of this section discusses specific response codes for errors and what they mean.

Response code 507

Response code 503

A 503 Service Unavailable response generally indicates that Manta is refusing some requests because it is overloaded or some dependencies are not functioning. There are three major cases where this happens:

  • At least one Moray instance is at its maximum queue length and is refusing new requests.

  • There are not enough online storage nodes to handle the upload.

  • Muskie did not respond to the request quickly enough.

In all cases, you can follow Investigating a specific request that has failed to find the cause of the failure.

If you already have a particular 503 response, you can quickly determine which of these cases caused it.

If you have a large number of 503s, you can check for systemic causes:

  • Are there Moray shards with high queue lengths? Check recent historical metrics for Moray queue length. If any zone or shard has more than a few thousand items queued, it may be causing 503 responses.

Response code 502

Manta no longer issues this response code. If you see it, please file a bug. Historically, this was associated with slow Muskie requests. These are generally now reported as 503s.

Response code 500

This generally indicates a server-side bug. See Finding (or generating) a failed request to learn why the request failed.

Response code 499

499 is an internal status code used to describe when a client appears to have abandoned a request. Specifically, this is recorded when a client closes its socket before finishing a request. In this case, there is no response, since the server may have no place to send it.

499s may be seen if:

  • the client gave up (timed out) before the server sent an initial response (e.g., a 100-continue)

  • the client crashed (closing its sockets)

  • a network issue disrupted the connection between client and server

If you already have the Muskie log entry for a specific 499 request, was the latency fairly high? (If you know the client’s timeout, was the request latency longer than that timeout?) Check the "latency" field in the Muskie log entry. Also compare the Date header in the request with the timestamp of the log entry. If these don’t match up, the request may have been queued somewhere before being processed by Muskie.
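For example, assuming you’ve saved the audit entry as a single JSON object in entry.json (a hypothetical file name), you can pull out the relevant fields with the json tool:

# show the request latency (milliseconds), the client-supplied Date header,
# and the time at which Muskie logged the entry
json latency req.headers.date time < entry.json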

Response code 413

Other 400-level response codes

These are usually client issues, though it’s always possible there are server-side bugs that cause erroneous 400-level responses. (For an obscure example, see MANTA-2319.) The only way to be sure is to examine the request and response to see if the response appears correct.

Investigating a specific request that has failed

Start here if you want to understand why a specific request has failed. These steps will help you find corresponding log entries with more detail.

Ideally, you’ll want to have:

  • the x-server-name header from the response

  • the x-request-id header from the response

  • the approximate time of the response (which calendar hour it was sent)

  • the IP address that the client used to reach Manta

In many cases, you can get by with only some of this information. The more information you have, the easier (and faster) it will be to find more information.

You might also try generating your own request to investigate.

If you find the log entry, see Understanding a Muskie log entry for details. If you find none, see If there is no Muskie log entry.

If you have the x-request-id and x-server-name headers

The x-server-name header gives you the uuid for the "webapi" zone that processed this request.

  • Was the request completed after the top of the current hour?

    • Yes, the request was handled after the top of the hour. The log entry will be inside the Muskie zone. First, find the datacenter where the Muskie zone that handled the request is deployed. From the headnode of that datacenter, use manta-oneach to search the Muskie log file for the request id:

      manta-oneach -z WEBAPI_ZONE_UUID 'svcs -L muskie | xargs cat | grep REQUEST_ID' | bunyan

      filling in WEBAPI_ZONE_UUID from the x-server-name header and REQUEST_ID from the x-request-id header.

    • No, the request was handled earlier than that. The log entry will generally be in a historical log file inside Manta itself. Use mlogin or mget to fetch the path:

      /poseidon/stor/logs/muskie/YYYY/MM/DD/HH/UUID8.log

      where YYYY/MM/DD/HH represent the year, month, day, and hour when the request completed and UUID8 is the first 8 characters of the x-server-name header. If this object does not exist in Manta, and Manta has been having availability issues, then the historical log file may still be inside the corresponding "webapi" zone. Log into the "webapi" zone and use grep to search for the request ID in the files in /var/log/manta/upload.

    • I don’t know when the request was handled. In this case, you need to check all of the log files mentioned above. You may be able to use a Manta job to scan a large number of historical files at once. For example, you can search all of a day’s log files for one server using:

      mfind -t o -n UUID8.log /poseidon/stor/logs/muskie/YYYY/MM/DD |
          mjob create -o -m 'grep REQUEST_ID || true' -r bunyan

      As before, UUID8 is the first 8 characters of the x-server-name header.

If you find the log entry, see Understanding a Muskie log entry for details. If you find none, see If there is no Muskie log entry.

If you have the x-request-id, but no x-server-name

In this case, you have to check the log files for all "webapi" zones to find the log entry.

  • Was the request completed since the top of the current hour?

    • Yes, the request was handled since the top of the hour. The log entry will be inside the Muskie zone. Separately for each datacenter in this Manta, use manta-oneach to search all the Muskie logs:

      manta-oneach -s webapi 'svcs -L muskie | xargs cat | grep REQUEST_ID' | bunyan
    • No, the request was handled earlier than that. Use a job to search historical logs with names:

      /poseidon/stor/logs/muskie/YYYY/MM/DD/HH/*.log

      where YYYY/MM/DD/HH represent the year, month, day, and hour when the request completed.

      For example, you can search all log files for a particular hour with:

      mfind -t o /poseidon/stor/logs/muskie/YYYY/MM/DD/HH |
          mjob create -o -m 'grep REQUEST_ID || true' -r bunyan
    • I don’t know when the request was handled. In this case, you need to check all of the log files mentioned above.

If you find the log entry, see Understanding a Muskie log entry for details. If you find none, see If there is no Muskie log entry.

If you don’t have the x-request-id

If you don’t have the request id, then you’ll need some other information about the request that you can use to filter it. Examples include:

  • the name of the account, if that account only made a few requests around the time in question

  • the path that was used, if that’s relatively unique among requests

  • a particular client header that’s somewhat uncommon

  • a very small time window in which the request may have happened

If you have this sort of information, your best bet is to use some combination of grep or json to scan all of the log entries for the appropriate time.

Tip
When working out a grep or json pipeline, it’s helpful to use mlogin to get an interactive shell for a particular Muskie log file. There, you can practice your shell pipeline a few times until it matches what you want, possibly using slightly different parameters (e.g., a different account name) than you’ll use for the real search, since you probably didn’t happen to pick a log file containing the precise entry you’re looking for. Then run that same shell pipeline in a Manta job over a much larger number of Muskie log files.
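For example, here is a hedged sketch of the job form, filtering an hour’s logs by account login and pulling out a few fields with the json tool (ACCOUNT_LOGIN is a placeholder, and the field list is just one reasonable choice):

mfind -t o /poseidon/stor/logs/muskie/YYYY/MM/DD/HH |
    mjob create -o -m 'grep ACCOUNT_LOGIN || true' -r 'json -ga req.url res.statusCode latency'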

If you find the log entry, see Understanding a Muskie log entry for details. If you find none, see If there is no Muskie log entry.

If there is no Muskie log entry

There’s a difference between there being no Muskie log entry and not being able to find the Muskie log entry for a request.

You may know that there’s no log entry for a request if:

  • you have the rough timestamp and x-server-name header, found a non-empty log for that server for that hour, and there’s no entry for the request in it, or

  • you know the rough timestamp of the request, found non-empty log files for all servers for that hour, and there’s no matching request

Otherwise, it’s possible that the log entry was lost (e.g., if a log file was lost or clobbered, due to a bug or extended availability loss).

  • Did the HTTP response contain an x-server-name or x-request-id header?

    • Yes, there was a response with these headers. In this case, a Muskie instance definitely handled the request. There should be a log entry.

    • There was a response, but it did not contain these headers. In this case, the response very likely came from the load balancer and not Muskie. See Finding a load balancer log entry to find more information about the request. This typically happens for one of two reasons:

      • Muskie took too long (usually more than two minutes) to handle the request. Note that even though the load balancer may have reported a 500-level error, the request may have completed successfully (or failed for some other reason) inside Muskie. There are two sub-cases:

        • Muskie did process the request, but it took longer than the load balancer timeout. This is often a sign of high latency at the metadata tier.

        • Muskie stopped processing the request. This would be a bug in Muskie. It often leads to file descriptor leaks and memory leaks, so it’s very serious. Examples: MANTA-3338, MANTA-2916, MANTA-2907.

      • Muskie sent an invalid HTTP response. (This is very uncommon. Example: MANTA-3489.)

    • There was no response, or the client timed out before receiving a response. It would be very unusual for the system to produce no response within 2 minutes of a request being completed, but it’s not uncommon for a client to give up before receiving a response.

    • I don’t know if there was a response.

In all of these cases, you can get more information about what happened by following Finding a load balancer log entry.

Investigating elevated Electric-Moray latency

Investigating elevated Moray latency

Investigating elevated PostgreSQL latency

Basic investigation tasks

Investigating a slow tier of Node processes

The following services within the Manta data path are Node.js-based:

  • Webapi (muskie)

  • Electric-Moray

  • Moray

  • Authcache (mahi)

  • Nameservice (not covered in this section)

Note
This section assumes that you’re looking at a specific tier of services (i.e., one of the ones listed above). If you’re looking at Moray specifically, as you follow this section, consider only the Moray instances for the particular shard you’re looking at.

First, have you already confirmed that this tier is reporting high latency? If not, check metric dashboards first to see whether latency is high. See Investigating an increase in latency.

Next, have you also confirmed that the reason for the latency is not due to a dependent service? For example, if you’re here because of Electric-Moray, have you confirmed that Moray isn’t also seeing high latency? If you haven’t, see Investigating an increase in latency for how to locate which tier is the origin of overall latency.

At this point, you’ve confirmed that this tier appears to be the source of a latency increase. Now, use the latency metric dashboards to answer the next question: is the increase in latency affecting nearly all instances?

  • Yes, nearly all instances are reporting high latency. This might reflect insufficient capacity. Individual Node processes are usually close to capacity when they exceed about 80-85% of CPU utilization. Each component typically has multiple processes deployed in each zone, which means the saturation point (80-85% per process) is different for each type of zone:

    Service          Processes per zone   Saturation point     Equivalent CPUs
    webapi           16                   1300% (zone-wide)    13 CPUs
    electric-moray   4                    325% (zone-wide)     3.25 CPUs
    moray            4                    325% (zone-wide)     3.25 CPUs
    authcache        1                    80% (zone-wide)      0.8 CPUs

    Given the above guidelines, use CPU utilization graphs to determine whether most instances in this service are close to CPU saturation. (The Deployment-specific details section recommends having both a line graph showing the minimum zone-wide CPU usage for each service and a heat map of zone-wide CPU utilization. If the line graph is above these guidelines, the service definitely looks saturated. The heat map allows you to identify cases where some instances might have more headroom available, but most of them are still too busy.) You can also spot-check CPU usage directly on the affected zones; see the command sketch after this list.

    • Yes, most instances are nearing saturation. Deploy more instances of the service in question. See Scaling up a component. Additionally, if the workload has not changed substantially, you may want to check for a memory leak that may have affected many processes. If such a leak is present, deploying more instances will likely only help for a little while, until the new instances suffer the same leak.

    • No, many instances appear to have plenty of headroom. This is very unusual, so it’s worth double-checking that latency is elevated across the board, but latency at dependent services is not high, and CPU utilization is not high. If this is really the case, pick any specific process showing high latency and see Investigating a slow process. Other avenues to consider: is the dependent service close to CPU saturation? If so, clients of the dependent service may see much higher latency than the service reports because of queueing. Is there evidence of elevated packet loss? This can also increase client-side latency without manifesting as latency reported by the dependent service.

  • No, only some instances are reporting high latency. In this case, this service does not appear generally overloaded, although it’s possible that some instances are. Next question: can you tell from the per-request metrics for this tier whether the workload is evenly distributed across instances?

    • Yes, the workload is evenly distributed across instances. In this case, it appears that processes are generally doing comparable work, but some are doing it much slower. The next step is to use the dashboard to identify a zone with particularly high latency and dig deeper into a specific slow process. See Investigating a slow process.

    • No, the workload is unevenly distributed. See Investigating service discovery. (Remember, if you’re looking at Moray, this section assumes you’re looking only at instances for a particular shard. As described under Investigating an increase in latency, if you see latency at Moray, you should first isolate the shard and investigate latency in each shard separately.)
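As a rough spot check to complement the dashboards, you can look at per-process CPU usage directly with manta-oneach and prstat. This is only a sketch; adjust the service name and prstat options as needed, and remember that prstat’s CPU column reports a share of all CPUs on the machine, not of a single CPU:

# show the busiest processes in each webapi zone (one one-second sample per zone)
manta-oneach -s webapi 'prstat -c -n 10 1 1'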

Investigating PostgreSQL latency

Finding (or generating) a failed request

When trying to understand either an explicit error or high latency, it can be helpful to investigate the log entry written by "webapi" for a specific request.

Do you already have information about a specific request you want to investigate?

Next question: does the problem appear to be reproducible? Try reproducing the problem with the node-manta command-line tools (e.g., mls). You can use the -v flag and redirect stderr to the bunyan command to see request and response headers, like this:

$ mls -v /poseidon/public 2> >(bunyan)
[2018-07-19T21:18:48.146Z] DEBUG: mls/MantaClient/7054 on zathras.local (/Users/dap/install/node-v4.9.1-darwin-x64/lib/node_modules/manta/lib/client.js:1536 in ls): ls: entered (req_id=4b4927be-fc1f-4dd8-88fe-2ae75dcbc262, path=/poseidon/public)
[2018-07-19T21:18:48.148Z] DEBUG: mls/MantaClient/7054 on zathras.local (/Users/dap/install/node-v4.9.1-darwin-x64/lib/node_modules/manta/lib/client.js:1128 in info): info: entered (req_id=0f34c68e-2072-405a-be3a-248e8020f1ba, path=/poseidon/public, id=0f34c68e-2072-405a-be3a-248e8020f1ba, query={})
    headers: {
      "accept": "application/json, */*",
      "x-request-id": "0f34c68e-2072-405a-be3a-248e8020f1ba"
    }
[2018-07-19T21:18:48.189Z] TRACE: mls/MantaClient/7054 on zathras.local (/Users/dap/install/node-v4.9.1-darwin-x64/lib/node_modules/manta/node_modules/restify-clients/lib/HttpClient.js:314 in rawRequest): request sent
    HEAD /poseidon/public HTTP/1.1
    Host: us-central.manta.mnx.io:null
    accept: application/json, */*
    x-request-id: 0f34c68e-2072-405a-be3a-248e8020f1ba
    date: Thu, 19 Jul 2018 21:18:48 GMT
    authorization: Signature keyId="/dap/keys/56:f3:e1:56:3d:e6:f7:83:a9:ce:19:5d:62:ba:5c:1f",algorithm="rsa-sha1",headers="date",signature="kG7IydhNO06ImfI6hFzFXXoSrWT6+2kCcDUC3swGebIr7YxeDcLEWMxGzB4z5lC29Vv7kgpLGaTc218m+63D0Y3M84LTNCvM1Va5COetXhIHkkAlBtXpJt5MUjqsRFK1xrpGKJjDc1QIBGSQIDmh4p6wNjofeaLX8jYnYa7FagW5iyQIHQmpAwe/AO+9Bg7fXBgzfvVZjWfhLaBA4G2CwuCSlkpF7mR7t04pTn+oxOmufE5h6XI/VLNsLZyQkc6prBFDoSiOLMgZsGfdsF11J9c/lCK/PW1y4MlTZBDGG8W1F0ssUEx0euLdm4TsqoBc1cfeIC43fV6sR2nN/CSiow=="
    user-agent: restify/1.4.1 (x64-darwin; v8/4.5.103.53; OpenSSL/1.0.2o) node/4.9.1
    accept-version: ~1.0
[2018-07-19T21:18:48.861Z] TRACE: mls/MantaClient/7054 on zathras.local (/Users/dap/install/node-v4.9.1-darwin-x64/lib/node_modules/manta/node_modules/restify-clients/lib/HttpClient.js:210 in onResponse): Response received
    HTTP/1.1 200 OK
    content-type: application/x-json-stream; type=directory
    result-set-size: 5
    date: Thu, 19 Jul 2018 21:18:48 GMT
    server: Manta
    x-request-id: 0f34c68e-2072-405a-be3a-248e8020f1ba
    x-response-time: 254
    x-server-name: 511ea59e-2a4a-486b-9258-ef59016d064d
    connection: keep-alive
    x-request-received: 1532035128176
    x-request-processing-time: 684
[2018-07-19T21:18:48.867Z] DEBUG: mls/MantaClient/7054 on zathras.local (/Users/dap/install/node-v4.9.1-darwin-x64/lib/node_modules/manta/lib/client.js:820 in get): get: entered (req_id=dce478bd-6bc7-451b-ac2b-22d74d7bfd37, path=/poseidon/public, id=dce478bd-6bc7-451b-ac2b-22d74d7bfd37)
    headers: {
      "accept": "application/x-json-stream",
      "x-request-id": "4b4927be-fc1f-4dd8-88fe-2ae75dcbc262",
      "date": "Thu, 19 Jul 2018 21:18:48 GMT",
      "authorization": "Signature keyId=\"/dap/keys/56:f3:e1:56:3d:e6:f7:83:a9:ce:19:5d:62:ba:5c:1f\",algorithm=\"rsa-sha1\",headers=\"date\",signature=\"kG7IydhNO06ImfI6hFzFXXoSrWT6+2kCcDUC3swGebIr7YxeDcLEWMxGzB4z5lC29Vv7kgpLGaTc218m+63D0Y3M84LTNCvM1Va5COetXhIHkkAlBtXpJt5MUjqsRFK1xrpGKJjDc1QIBGSQIDmh4p6wNjofeaLX8jYnYa7FagW5iyQIHQmpAwe/AO+9Bg7fXBgzfvVZjWfhLaBA4G2CwuCSlkpF7mR7t04pTn+oxOmufE5h6XI/VLNsLZyQkc6prBFDoSiOLMgZsGfdsF11J9c/lCK/PW1y4MlTZBDGG8W1F0ssUEx0euLdm4TsqoBc1cfeIC43fV6sR2nN/CSiow==\""
    }
    --
    query: {
      "limit": 1024
    }
[2018-07-19T21:18:48.872Z] TRACE: mls/MantaClient/7054 on zathras.local (/Users/dap/install/node-v4.9.1-darwin-x64/lib/node_modules/manta/node_modules/restify-clients/lib/HttpClient.js:314 in rawRequest): request sent
    GET /poseidon/public?limit=1024 HTTP/1.1
    Host: us-central.manta.mnx.io:null
    accept: application/x-json-stream
    x-request-id: 4b4927be-fc1f-4dd8-88fe-2ae75dcbc262
    date: Thu, 19 Jul 2018 21:18:48 GMT
    authorization: Signature keyId="/dap/keys/56:f3:e1:56:3d:e6:f7:83:a9:ce:19:5d:62:ba:5c:1f",algorithm="rsa-sha1",headers="date",signature="kG7IydhNO06ImfI6hFzFXXoSrWT6+2kCcDUC3swGebIr7YxeDcLEWMxGzB4z5lC29Vv7kgpLGaTc218m+63D0Y3M84LTNCvM1Va5COetXhIHkkAlBtXpJt5MUjqsRFK1xrpGKJjDc1QIBGSQIDmh4p6wNjofeaLX8jYnYa7FagW5iyQIHQmpAwe/AO+9Bg7fXBgzfvVZjWfhLaBA4G2CwuCSlkpF7mR7t04pTn+oxOmufE5h6XI/VLNsLZyQkc6prBFDoSiOLMgZsGfdsF11J9c/lCK/PW1y4MlTZBDGG8W1F0ssUEx0euLdm4TsqoBc1cfeIC43fV6sR2nN/CSiow=="
    user-agent: restify/1.4.1 (x64-darwin; v8/4.5.103.53; OpenSSL/1.0.2o) node/4.9.1
    accept-version: ~1.0
[2018-07-19T21:18:49.365Z] TRACE: mls/MantaClient/7054 on zathras.local (/Users/dap/install/node-v4.9.1-darwin-x64/lib/node_modules/manta/node_modules/restify-clients/lib/HttpClient.js:210 in onResponse): Response received
    HTTP/1.1 200 OK
    content-type: application/x-json-stream; type=directory
    result-set-size: 5
    date: Thu, 19 Jul 2018 21:18:49 GMT
    server: Manta
    x-request-id: 4b4927be-fc1f-4dd8-88fe-2ae75dcbc262
    x-response-time: 219
    x-server-name: 60771e58-2ad0-4c50-8b23-86b72f9307f8
    connection: keep-alive
    transfer-encoding: chunked
    x-request-received: 1532035128869
    x-request-processing-time: 496
-rwxr-xr-x 1 poseidon            17 Dec 04  2015 agent.sh
drwxr-xr-x 1 poseidon             0 Sep 18  2014 manatee
drwxr-xr-x 1 poseidon             0 Jun 18  2013 medusa
drwxr-xr-x 1 poseidon             0 Aug 01  2013 minke
drwxr-xr-x 1 poseidon             0 Nov 07  2013 stud
[2018-07-19T21:18:49.480Z] DEBUG: mls/MantaClient/7054 on zathras.local (/Users/dap/install/node-v4.9.1-darwin-x64/lib/node_modules/manta/lib/client.js:887 in onEnd): get: done (req_id=dce478bd-6bc7-451b-ac2b-22d74d7bfd37, path=/poseidon/public)

From the output, we can see that this operation made two requests. The second one has x-server-name: 60771e58-2ad0-4c50-8b23-86b72f9307f8 and x-request-id: 4b4927be-fc1f-4dd8-88fe-2ae75dcbc262. You can now use these to start Investigating a specific request that has failed.

If the problem isn’t quite so easily reproducible, but you suspect it still affects a variety of requests, you can use manta-mlive to generate more requests and collect debug output for all of them. For mlive, you’ll want to set LOG_LEVEL in the environment to generate the debug logging and you’ll likely want to redirect stderr to a file that you can search through later:

$ LOG_LEVEL=trace ./bin/mlive -S 2>debug.out
2018-07-19T21:24:01.307Z: reads okay, writes stuck (4/4 ok since start)
2018-07-19T21:24:02.310Z: all okay (17/17 ok since 2018-07-19T21:24:01.307Z)
^C

As before, you can use the bunyan tool to format the log:

$ bunyan debug.out
...
[2018-07-19T21:25:01.716Z] TRACE: mlive/MantaClient/9435 on zathras.local: request sent
    HEAD /dap/stor/mlive HTTP/1.1
    Host: us-central.manta.mnx.io:null
    accept: application/json, */*
    x-request-id: c317603c-82d4-4b2e-ac4b-066c9ece1864
    date: Thu, 19 Jul 2018 21:25:01 GMT
    authorization: Signature keyId="/dap/keys/56:f3:e1:56:3d:e6:f7:83:a9:ce:19:5d:62:ba:5c:1f",algorithm="rsa-sha1",headers="date",signature="oJZZIDh1qT8PeSSpz09bIzYT4LYK6rqXS2G5bHhh2r37SNOs0vBkFHUhfso6tSq1hmHIlkCEMXX9zGLIvYxQtHj6/KtiNgZgyWzGHms+qhc2gziXnOrMybxmWqJwipd8rAJCdDBV0B5FlCDeELWIA+1LifGDqqLdDZT4ScBUNOm9JG2+mha2U+pFbNtaXQQyyoPgopk+4ur4OHYpcaK/KY6WdC91quLIaIKV28VMtPoN/q/15lzRj6G6L7mbIMyd48ut0EbmTTR/CfYq9dquNsWDlyWgEJJVYyPZ9odAE34YQiYt/N4JXH7Crr9M6md9GtZonY+DbP8vvb5+7xr8dA=="
    user-agent: restify/1.4.1 (x64-darwin; v8/4.5.103.53; OpenSSL/1.0.2o) node/4.9.1
    accept-version: ~1.0
[2018-07-19T21:25:02.548Z] TRACE: mlive/MantaClient/9435 on zathras.local: Response received
    HTTP/1.1 200 OK
    last-modified: Tue, 16 Dec 2014 01:17:29 GMT
    content-type: application/x-json-stream; type=directory
    result-set-size: 45
    date: Thu, 19 Jul 2018 21:25:02 GMT
    server: Manta
    x-request-id: c317603c-82d4-4b2e-ac4b-066c9ece1864
    x-response-time: 462
    x-server-name: 39adec6c-bded-4a14-9d80-5a8bfc1121f9
    connection: keep-alive
    x-request-received: 1532035501703
    x-request-processing-time: 844

You can use grep, json, or other tools to filter the output for requests of interest (e.g., those with a particular HTTP response code or an x-response-time larger than some value). From the filtered results, you can identify an x-server-name and x-request-id and then see Investigating a specific request that has failed.
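For example, assuming the client records response details under a client_res field (as restify-based clients generally do), here is a hedged sketch using bunyan’s -c condition flag:

# show only entries whose recorded response had a 500-level status code
bunyan -c 'this.client_res && this.client_res.statusCode >= 500' debug.out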

Checking recent historical metrics

Locating log files

Real-time logs contain data from the current hour. These are typically stored as regular files within each zone. The specific file used varies by type of zone. See the Operator Guide for details. The manta-oneach command can be used as a low-level way to scan these real-time logs. For example, a common way to count recent 500-level errors in webapi logs is:

manta-oneach --service=webapi 'svcs -L muskie | xargs cat | grep -c "handled: 5"'

Since the real-time logs only store the current hour’s data, at 01:02Z, this would only scan 2 minutes worth of data. At 01:58Z, this would scan 58 minutes worth of data.

If we were looking for a specific request that took place at 01:37Z, then we’d look at the real-time log file immediately after the request happened until 02:00Z (i.e., for 23 minutes). After that, we’d have to look at the historical logs.

Historical logs are maintained by rotating the real-time logs at the top of each hour and then uploading them into Manta. Once in Manta, they’re stored under:

/poseidon/stor/logs/COMPONENT/YYYY/MM/DD/HH/SHORTZONE.log

where:

  • COMPONENT varies based on the component you’re looking for

  • YYYY, MM, DD, and HH represent the year, month, day, and hour for the entries in the log file

  • SHORTZONE is the first 8 characters of the zone’s uuid.

For example, to find the load balancer (haproxy) logs from zone f6817865-10fb-416c-a079-47941ac2aab4 from 2018-12-05T01:37Z, we would look for the object:

/poseidon/stor/logs/haproxy/2018/12/05/01/f6817865.log

You can scan a number of these logs at once using a compute job. For example, you could look for requests from IP address 10.1.2.3 in all of the load balancer logs from December 5, 2018 using:

mfind -t o /poseidon/stor/logs/haproxy/2018/12/05 |
    mjob create -o -m 'grep 10.1.2.3'

You can adjust the mfind invocation as needed to scan a broader or more narrow range. You could also use the -n argument to mfind to select log files from a particular load balancer:

mfind -n f6817865.log -t o /poseidon/stor/logs/haproxy/2018/12/05 |
    mjob create -o -m 'grep 10.1.2.3'

When working with a Manta deployment where jobs aren’t functional, you can instead mget log files and do your filtering locally, like so:

mfind -t o /poseidon/stor/logs/haproxy/2018/12/05 |
    while read f; do mget $f | grep 10.1.2.3; done

Note that the above sequence of commands will download the entirety of every log file found by mfind, which may consume an inconvenient amount of time and disk space. To mitigate this, you can use head to fetch a smaller number of log entries, like so:

mfind -t o /poseidon/stor/logs/haproxy/2018/12/05 |
    while read f; do mget $f | head -n 1000 | grep 10.1.2.3; done

When the above command sequence is run, each invocation of mget will terminate once head has read 1000 lines.

Note: the archival process for historical logs first rotates the logs to new files under /var/log/manta/upload. A few minutes later, these are uploaded to Manta and then removed from the local filesystem. If the upload fails, the files are kept in /var/log/manta/upload for up to two days. In extreme situations where Manta has been down for over an hour, you may find recent historical log files in /var/log/manta/upload, and you can scan them similar to the live log files using manta-oneach.

Understanding a Muskie log entry

Muskie (the Manta API server) logs an audit entry for every request that it completes. These logs are useful for:

  • understanding how Manta handled a particular request (e.g., how long did it take? where was the time spent? what metadata shards were involved? what storage nodes were involved? what errors were encountered?)

  • understanding the workload Manta is serving (e.g., what percentage of requests are GETs? what percentage are failing? what’s the distribution of sizes for uploaded objects? which accounts are making the most requests?)

Note that for real-time incident response, it’s often faster to start with Checking recent historical metrics. These logs are most useful for digging deeper into a particular request or the workload overall.

Muskie logs in bunyan format, which is a JSON-based format. You typically use the bunyan tool to view them. You can also use the json tool to filter and aggregate them.
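For example, here are a couple of sketches for working with a local copy of a Muskie log file (the file name is hypothetical; fetch the file with mget or copy it out of the zone first):

# show only 500-level audit entries, formatted for reading
bunyan -c 'this._audit && this.res && this.res.statusCode >= 500' 204ac483.log

# count requests by status code
grep 'handled: ' 204ac483.log | json -ga res.statusCode | sort | uniq -c | sort -n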

For more information, see Muskie log entry properties.

If you have a particular Muskie log entry with an error message in it and you want to know more about the error, see Details about specific error messages.

Contents of a GET log entry

Below is an example log entry for a GET request, formatted using the bunyan command-line tool. See Muskie log entry properties for more details.

[2017-08-01T03:03:13.985Z]  INFO: muskie/HttpServer/79465 on 204ac483-7e7e-4083-9ea2-c9ea22f459fd: handled: 200 (audit=true, _audit=true, operation=getstorage, billable_operation=LIST, logicalRemoteAddress=172.27.4.22, remoteAddress=127.0.0.1, remotePort=64628, reqHeaderLength=754, resHeaderLength=269, err=false, latency=26, entryShard=tcp://3.moray.staging.joyent.us:2020, route=getstorage, req.owner=4d649f41-cf87-ca1d-c2c0-bb6a9004311d)
    GET /poseidon/stor/manta_gc/mako/1.stor.staging.joyent.us?limit=1024 HTTP/1.1
    accept: */*
    x-request-id: a080d88b-8e42-4a98-a6ec-12e1b0dbf612
    date: Tue, 01 Aug 2017 03:03:13 GMT
    authorization: Signature keyId="/poseidon/keys/ef:0e:27:45:c5:95:4e:92:ba:ab:03:17:e5:3a:60:14",algorithm="rsa-sha256",headers="date",signature="Q74o9RHIwrDT15ogL2WeB/jankUIqJAtMM5t7+VzrHxzoB52/BoqEnq9uMY0wEvPJxv+Lf1VyLG/IBXCXeUx+fZlkhKWIWd2jkpLRdVLKwZ4nnqTfHM+YXhZ0vSN1X1W2demmnpPRTRK/RaG21pyvlbIrSTwI+N5MtKFDh9/4Ks43wSyM4MvqZZWywfs7LgKz7UtjL1Z+juhJDT8mrfQYCDpZw/NDhHmoslKsMFesMrMjPALy/CBSB23800+MhLiFB7LT0nTyCLonPBmIOjrQCZu99ICXbCxx096XCzZ2XBOK1Pe4eoDUHWx5ukTbCJV63QA+gvcvDCbS5BdDn0Xiw=="
    user-agent: restify/1.4.1 (ia32-sunos; v8/3.14.5.9; OpenSSL/1.0.1i) node/0.10.32
    accept-version: ~1.0
    host: manta.staging.joyent.us
    connection: keep-alive
    x-forwarded-for: ::ffff:172.27.4.22
    --
    HTTP/1.1 200 OK
    last-modified: Sat, 22 Mar 2014 01:17:01 GMT
    content-type: application/x-json-stream; type=directory
    result-set-size: 1
    date: Tue, 01 Aug 2017 03:03:13 GMT
    server: Manta
    x-request-id: a080d88b-8e42-4a98-a6ec-12e1b0dbf612
    x-response-time: 26
    x-server-name: 204ac483-7e7e-4083-9ea2-c9ea22f459fd
    --
    req.caller: {
      "login": "poseidon",
      "uuid": "4d649f41-cf87-ca1d-c2c0-bb6a9004311d",
      "groups": [
        "operators"
      ],
      "user": null
    }
    --
    req.timers: {
      "earlySetup": 32,
      "parseDate": 8,
      "parseQueryString": 28,
      "handler-3": 127,
      "checkIfPresigned": 3,
      "enforceSSL": 3,
      "ensureDependencies": 5,
      "_authSetup": 5,
      "preSignedUrl": 3,
      "checkAuthzScheme": 4,
      "parseAuthTokenHandler": 36,
      "signatureHandler": 73,
      "parseKeyId": 59,
      "loadCaller": 133,
      "verifySignature": 483,
      "parseHttpAuthToken": 5,
      "loadOwner": 268,
      "getActiveRoles": 43,
      "gatherContext": 27,
      "setup": 225,
      "getMetadata": 5790,
      "storageContext": 8,
      "authorize": 157,
      "ensureEntryExists": 3,
      "assertMetadata": 3,
      "getDirectoryCount": 7903,
      "getDirectory": 10245
    }

The raw JSON, formatted with the json tool, looks like this:

{
  "name": "muskie",
  "hostname": "204ac483-7e7e-4083-9ea2-c9ea22f459fd",
  "pid": 79465,
  "component": "HttpServer",
  "audit": true,
  "level": 30,
  "_audit": true,
  "operation": "getstorage",
  "billable_operation": "LIST",
  "logicalRemoteAddress": "172.27.4.22",
  "remoteAddress": "127.0.0.1",
  "remotePort": 64628,
  "reqHeaderLength": 754,
  "req": {
    "method": "GET",
    "url": "/poseidon/stor/manta_gc/mako/1.stor.staging.joyent.us?limit=1024",
    "headers": {
      "accept": "*/*",
      "x-request-id": "a080d88b-8e42-4a98-a6ec-12e1b0dbf612",
      "date": "Tue, 01 Aug 2017 03:03:13 GMT",
      "authorization": "Signature keyId=\"/poseidon/keys/ef:0e:27:45:c5:95:4e:92:ba:ab:03:17:e5:3a:60:14\",algorithm=\"rsa-sha256\",headers=\"date\",signature=\"Q74o9RHIwrDT15ogL2WeB/jankUIqJAtMM5t7+VzrHxzoB52/BoqEnq9uMY0wEvPJxv+Lf1VyLG/IBXCXeUx+fZlkhKWIWd2jkpLRdVLKwZ4nnqTfHM+YXhZ0vSN1X1W2demmnpPRTRK/RaG21pyvlbIrSTwI+N5MtKFDh9/4Ks43wSyM4MvqZZWywfs7LgKz7UtjL1Z+juhJDT8mrfQYCDpZw/NDhHmoslKsMFesMrMjPALy/CBSB23800+MhLiFB7LT0nTyCLonPBmIOjrQCZu99ICXbCxx096XCzZ2XBOK1Pe4eoDUHWx5ukTbCJV63QA+gvcvDCbS5BdDn0Xiw==\"",
      "user-agent": "restify/1.4.1 (ia32-sunos; v8/3.14.5.9; OpenSSL/1.0.1i) node/0.10.32",
      "accept-version": "~1.0",
      "host": "manta.staging.joyent.us",
      "connection": "keep-alive",
      "x-forwarded-for": "::ffff:172.27.4.22"
    },
    "httpVersion": "1.1",
    "owner": "4d649f41-cf87-ca1d-c2c0-bb6a9004311d",
    "caller": {
      "login": "poseidon",
      "uuid": "4d649f41-cf87-ca1d-c2c0-bb6a9004311d",
      "groups": [
        "operators"
      ],
      "user": null
    },
    "timers": {
      "earlySetup": 32,
      "parseDate": 8,
      "parseQueryString": 28,
      "handler-3": 127,
      "checkIfPresigned": 3,
      "enforceSSL": 3,
      "ensureDependencies": 5,
      "_authSetup": 5,
      "preSignedUrl": 3,
      "checkAuthzScheme": 4,
      "parseAuthTokenHandler": 36,
      "signatureHandler": 73,
      "parseKeyId": 59,
      "loadCaller": 133,
      "verifySignature": 483,
      "parseHttpAuthToken": 5,
      "loadOwner": 268,
      "getActiveRoles": 43,
      "gatherContext": 27,
      "setup": 225,
      "getMetadata": 5790,
      "storageContext": 8,
      "authorize": 157,
      "ensureEntryExists": 3,
      "assertMetadata": 3,
      "getDirectoryCount": 7903,
      "getDirectory": 10245
    }
  },
  "resHeaderLength": 269,
  "res": {
    "statusCode": 200,
    "headers": {
      "last-modified": "Sat, 22 Mar 2014 01:17:01 GMT",
      "content-type": "application/x-json-stream; type=directory",
      "result-set-size": 1,
      "date": "Tue, 01 Aug 2017 03:03:13 GMT",
      "server": "Manta",
      "x-request-id": "a080d88b-8e42-4a98-a6ec-12e1b0dbf612",
      "x-response-time": 26,
      "x-server-name": "204ac483-7e7e-4083-9ea2-c9ea22f459fd"
    }
  },
  "err": false,
  "latency": 26,
  "entryShard": "tcp://3.moray.staging.joyent.us:2020",
  "route": "getstorage",
  "msg": "handled: 200",
  "time": "2017-08-01T03:03:13.985Z",
  "v": 0
}
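The values in req.timers record the time spent in each request-handling phase, in microseconds; in this entry they sum to roughly 25,700 microseconds, which lines up with the latency of 26 (milliseconds). A quick way to see where a slow request spent its time, assuming the entry has been saved to entry.json (a hypothetical file name):

# list the request-handling phases by time spent, slowest last (values in microseconds)
json req.timers < entry.json | tr -d '",' | sort -k2 -n | tail -6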

Contents of a PUT log entry

Below is an example log entry for a PUT request, formatted using the json tool. See Muskie log entry properties for more details.

{
  "name": "muskie",
  "hostname": "204ac483-7e7e-4083-9ea2-c9ea22f459fd",
  "pid": 79465,
  "component": "HttpServer",
  "audit": true,
  "level": 30,
  "_audit": true,
  "operation": "putdirectory",
  "billable_operation": "PUT",
  "logicalRemoteAddress": "172.27.3.22",
  "reqHeaderLength": 655,
  "req": {
    "method": "PUT",
    "url": "/poseidon/stor/logs/config-agent/2017/08/01/02",
    "headers": {
      "user-agent": "curl/7.37.0",
      "host": "manta.staging.joyent.us",
      "accept": "*/*",
      "date": "Tue, 01 Aug 2017 03:01:10 GMT",
      "authorization": "Signature keyId=\"/poseidon/keys/ef:0e:27:45:c5:95:4e:92:ba:ab:03:17:e5:3a:60:14\",algorithm=\"rsa-sha256\",signature=\"VkRkcUK7Y796whM3/IsAl+wVvsu9pKwVGNHIHxLqeBtJZqrR+cbgWZ/E9uchhsxsMezvVXVN7hMXhiSxlfnGJKjPoTKJzfJNSW8WEUhu7rMilRi9WkYGvxo/PpdplK0/Evx1dvxHSX2TiAoTgBs5s6IyP7j6LgySfDu6TzJu/9HJdLzIwAf/TTiHU15okOUoJGbcNb+OcGN/mp+EZpYbNbJ8+I585v1ZLTuta1eAPngUPWp5E7Vm5sUpJH87/8bx2H/3HaMB9YCCacorZ7NkVS5Mbiaz0ptYYEESj8DCJScKnEVrM/L97zGuTPOnQ38Il/CZfENAP7ZH2u029h3WSg==\"",
      "connection": "close",
      "content-type": "application/json; type=directory",
      "x-forwarded-for": "::ffff:172.27.3.22"
    },
    "httpVersion": "1.1",
    "owner": "4d649f41-cf87-ca1d-c2c0-bb6a9004311d",
    "caller": {
      "login": "poseidon",
      "uuid": "4d649f41-cf87-ca1d-c2c0-bb6a9004311d",
      "groups": [
        "operators"
      ],
      "user": null
    },
    "timers": {
      "earlySetup": 94,
      "parseDate": 45,
      "parseQueryString": 32,
      "handler-3": 268,
      "checkIfPresigned": 8,
      "enforceSSL": 7,
      "ensureDependencies": 9,
      "_authSetup": 10,
      "preSignedUrl": 7,
      "checkAuthzScheme": 8,
      "parseAuthTokenHandler": 78,
      "signatureHandler": 155,
      "parseKeyId": 166,
      "loadCaller": 346,
      "verifySignature": 1164,
      "parseHttpAuthToken": 12,
      "loadOwner": 234,
      "getActiveRoles": 43,
      "gatherContext": 28,
      "setup": 315,
      "getMetadata": 13345,
      "storageContext": 14,
      "authorize": 409,
      "ensureParent": 222,
      "mkdir": 841
    }
  },
  "resHeaderLength": 215,
  "res": {
    "statusCode": 204,
    "headers": {
      "connection": "close",
      "last-modified": "Tue, 01 Aug 2017 03:01:01 GMT",
      "date": "Tue, 01 Aug 2017 03:01:11 GMT",
      "server": "Manta",
      "x-request-id": "ac2a5780-7665-11e7-b9e8-cf86a4bf1253",
      "x-response-time": 18,
      "x-server-name": "204ac483-7e7e-4083-9ea2-c9ea22f459fd"
    }
  },
  "latency": 18,
  "entryShard": "tcp://3.moray.staging.joyent.us:2020",
  "parentShard": "tcp://2.moray.staging.joyent.us:2020",
  "route": "putdirectory",
  "msg": "handled: 204",
  "time": "2017-08-01T03:01:11.048Z",
  "v": 0
}

Understanding latency for a specific request

Finding a load balancer log entry

When to investigate the load balancer

All HTTP requests to Manta travel through an haproxy-based load balancer (in a component sometimes called "muppet") before reaching the Manta API ("webapi" or "muskie"). This load balancer is one of the first components that processes these requests when they arrive at Manta. For many problems internal to Manta, it’s more useful to look at log entries at Muskie (webapi). However, there are several cases where it’s helpful to investigate the load balancer:

  • when investigating a problem where the client reports not having received a normal HTTP response (e.g., a "connection refused", "connection reset", or a connection timeout)

  • when investigating a problem where the client reports having received a 500-level error with no "x-server-name" header. (This generally indicates the response was sent by the load balancer, which happens when Muskie sends an invalid response or fails to send a response within a given timeout.)

  • when investigating a problem where Muskie reports surprising client behavior (e.g., client closed its connection mid-upload, or Muskie timed out waiting for a client to either upload or download data)

  • when investigating a failed request for which there appears to be no Muskie log entry at all

  • to identify the source IP address of a client in a case where Muskie fails to report that in its own log entry

Generally, if a client receives a well-formed response from Manta, the Muskie logs have more useful information than the load balancer logs. In these other cases where either the client or Muskie is doing something surprising, the load balancer log entries can provide more information about exactly what happened.

There’s a major caveat to the load balancer logs: haproxy is only able to interpret HTTP-level information about the first request on each TCP connection. For clients using HTTP keep-alive, where multiple HTTP requests are sent sequentially over a TCP connection, you may not find information in the haproxy logs about requests after the first one.

Finding load balancer log entries

First, see Locating log files for information on where to find real-time and historical log files. For the load balancer, real-time log files are stored in /var/log/haproxy.log. Historical log files are stored in Manta under /poseidon/stor/logs/haproxy.

Usually when you’re looking for a load balancer log entry, you know one or more of the following:

  • an approximate time of the request (almost always necessary)

  • the URL that was requested

  • the remote IP address that made the request

  • the status code that was returned

  • that the request experienced an unusual HTTP exchange (e.g., a malformed server response, a client timeout, or a server timeout)

  • the particular load balancer that handled the request

Since these are plaintext logs, you can use grep or awk to filter or summarize them. See Understanding a load balancer log entry for more.

Often you don’t know which load balancer handled a particular request. In that case, you need to scan all of them for a given time. That might involve manta-oneach or a compute job. Again, see Locating log files.
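For example, to scan the live logs on every load balancer for requests from a particular client IP (this assumes the load balancer’s service name in manta-oneach is "loadbalancer"; adjust for your deployment):

# count matching entries in each load balancer's live haproxy log
manta-oneach --service=loadbalancer 'grep -c 10.1.2.3 /var/log/haproxy.log'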

Understanding a load balancer log entry

The load balancer logs in a plaintext format described in the haproxy documentation. (The haproxy documentation is also plaintext so it’s not possible to link directly to the right section, but look for the section called "Log format".) Our load balancer logs these through syslog, which prepends a few fields.

Here’s an example entry from our load balancer logs:

2018-12-05T18:32:01+00:00 42563f8d-4d61-4045-ab87-c71560388399 haproxy[65158]: ::ffff:72.2.115.97:42121 [05/Dec/2018:18:30:01.859] https secure_api/be6 2/0/0/-1/120005 502 245 - - SH-- 247/192/240/19/0 0/0 "GET /thoth/stor/thoth?limit=1024 HTTP/1.1"

We have a tool called haplog for converting our haproxy log entries into JSON. Often, the easiest way to filter and summarize these log entries is to pass the log through haplog, use json to extract the relevant fields, and then use grep or awk to summarize. We can use it like this:

$ haplog 42563f8d.log | json
...
{
  "syslog_date": "2018-12-05T18:32:01.000Z",
  "syslog_hostname": "42563f8d-4d61-4045-ab87-c71560388399",
  "pid": 65158,
  "client_ip": "72.2.115.97",
  "client_port": 42121,
  "accept_date": "05/Dec/2018:18:30:01.859",
  "frontend_name": "https",
  "backend_name": "secure_api",
  "server_name": "be6",
  "Tq": 2,
  "Tw": 0,
  "Tc": 0,
  "Tr": -1,
  "Tt": 120005,
  "status_code": 502,
  "bytes_read": 245,
  "termination_state": {
    "raw": "SH--",
    "termination_cause": "BACKEND_ABORT",
    "state_at_close": "WAITING_FOR_RESPONSE_HEADERS",
    "persistence_cookie_client": "N/A",
    "persistence_cookie_server": "N/A"
  },
  "actconn": 247,
  "feconn": 192,
  "beconn": 240,
  "srv_conn": 19,
  "retries": 0,
  "srv_queue": 0,
  "backend_queue": 0,
  "http_request": "GET /thoth/stor/thoth?limit=1024 HTTP/1.1"
}

There’s quite a lot of information here! Among the most relevant bits:

  • syslog_date (example: "2018-12-05T18:32:01.000Z")

    The timestamp when this entry was logged. This usually corresponds to the end of an HTTP request or TCP connection. This is very useful for constructing timelines of what happened.

  • syslog_hostname (example: "42563f8d-4d61-4045-ab87-c71560388399")

    On our systems, this is the zonename of the particular load balancer that handled this request.

  • client_ip (example: "72.2.115.97")

    The (remote) IP address of the client that connected to Manta.

  • accept_date (example: "05/Dec/2018:18:30:01.859")

    The timestamp when the TCP connection was accepted by the load balancer. This is very useful for constructing timelines of what happened.

  • server_name (example: "be6")

    On our system, this is a unique identifier that indicates which Muskie zone handled this request. This identifier varies between load balancer zones and over time. To know which Muskie zone it corresponds to, find the corresponding line in the haproxy configuration file (at /opt/smartdc/muppet/etc/haproxy.cfg), which contains the IP address of the Muskie zone that handled the request. This is useful for understanding the flow of the request through Manta.

  • termination_state (example: "SH--")

    A four-character code that describes how the TCP session finally terminated. This is among the most useful fields for understanding abnormal behavior from the client or Muskie. It can be used to tell whether the client or server did something unexpected (like closing the TCP connection) or stopped responding. For details on what each code means, see the haproxy documentation linked above.

  • http_request (example: "GET /thoth/stor/thoth?limit=1024 HTTP/1.1")

    The first line of the HTTP request, which contains the HTTP method and request URL.
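For example, to map the server_name value be6 back to a Muskie IP address, you can search the configuration in the load balancer zone that logged the entry. A sketch:

# run inside the load balancer zone that handled the request
grep -w be6 /opt/smartdc/muppet/etc/haproxy.cfg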

There are a few telltale symptoms here:

  • The elapsed time between accept_date and syslog_date is exactly two minutes. The load balancer has a two-minute timeout for Muskie responding to requests.

  • The termination status SH-- is documented (in the haproxy docs) to mean:

    SH The server aborted before sending its full headers, or it crashed.

This entry appears to reflect Muskie closing its TCP connection (without sending an HTTP response) after exactly two minutes. If we didn’t know what happened here, we at least know now that Muskie did something unexpected and not the client, and we also know which Muskie zone it was.

Build a request timeline

Building a request timeline can be incredibly valuable for understanding why a particular request failed or why a request took as long as it did. Timelines are especially useful for understanding apparently mysterious failure modes like request timeouts, "socket hang-up" errors, "connection reset" errors, and the like.

A request timeline is simply a list of events related to a request and the precise timestamps when they were reported. You can include as much or as little detail as is relevant to your problem. It’s common to focus on the phases that happen inside Muskie. These are the simplest to collect and they’re often sufficient when failure modes are explicit (e.g., an authentication failure or a 404 response). A pretty complete request timeline could include a lot more than that about what happens at the load balancer and storage nodes.

Note
Keep in mind that wall clocks vary across physical machines. When comparing timestamps recorded from different systems, beware that differences may simply be a result of differences in the wall clock time on those systems. Within Manta, clocks are usually synchronized to within a few hundred milliseconds. (You can check this for the servers you’re looking at using NTP tools.) Timestamps reported by clients can be much further off, but if they’re also using NTP to synchronize their clocks, they may well be pretty close.

A brief review of request handling

A typical GET request for an object stored in Manta runs as follows:

  • The client establishes a TCP connection to the Manta load balancer. As part of this process, the load balancer establishes a TCP connection to a Muskie instance to handle any requests that arrive over this connection.

  • The client constructs an HTTP request and sends it over the TCP connection.

  • The load balancer forwards the request to Muskie.

  • Muskie begins processing the request.

    • There are a number of (usually very fast) setup phases within Muskie.

    • Muskie authenticates the request. This may involve fetching authentication and authorization information from Mahi, in which case an HTTP request will be sent over the network to Mahi to fetch basic information about the account.

    • Muskie fetches metadata associated with the object. This step involves a few network RPC calls to the metadata tier: first to an Electric-Moray instance, which forwards the request to a Moray instance, which executes a SQL query against a PostgreSQL primary.

    • Muskie authorizes the request. This involves comparing the authenticated user’s credentials to the credentials required to access the object (specified in its metadata).

    • Muskie makes an HTTP request to a storage node hosting a copy of the object.

    • Assuming this request succeeds, Muskie reports a successful response header to the client. Then data is sent from the storage node, through Muskie, through the load balancer, and to the client.

    • The request concludes when the last data is sent to the client.

PUT requests to upload objects are similar except that there’s an additional metadata RPC after all the data has streamed to the storage nodes. Other types of requests (e.g., creation and deletion of directories) are largely similar, but generally don’t involve storage nodes.

This is a simplification. For details, see the Muskie source code. (Don’t be afraid to crack it open!)

There are 1-2 dozen phases of request handling within Muskie, but most of the elapsed time of a request happens in only a handful of phases that relate to making requests to external services. These are described below.

Key events during a request lifetime

The table below explains a number of events that happen while processing a request and where you can find more information about it. Many of these entries refer to entries in logs that are documented elsewhere in this guide. See Understanding a Muskie log entry and Understanding a load balancer log entry.

Remember, it’s not necessary to collect all of these to start! Start with the basics and flesh out what looks relevant. Some of what’s below won’t apply to every request. However, if you’re stumped about a strange failure mode, it’s often helpful to construct a pretty complete timeline, as you’ll often find surprising gaps or unusual intervals (e.g., exactly 60 seconds from when something started until when some obscure error message was reported, which might suggest a timeout).

If you don’t know where to start, consider a timeline that just includes:

  • when the request was constructed on the client

  • when the request was received by the load balancer (if possible)

  • when the request was received by Muskie

  • when the metadata phases of request handling were completed in Muskie

  • when the request was completed by Muskie

It’s common to start there, skim the req.timers field (mentioned below) to look for unusually long phases (e.g., those taking upwards of 1 second), and add those to the timeline as needed.
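
For example, to skim the slowest phases of a single request, you can extract req.timers from the Muskie log entry and sort by value. A minimal sketch (REQUEST_ID is a placeholder; the values are in microseconds):

$ grep "$REQUEST_ID" /var/log/muskie.log | json req.timers | tr -d '",' | sort -k2 -n | tail -5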

Event Component where it happens Where to find the timestamp Notes

Client establishes a TCP connection to the load balancer.

Load balancer

Load balancer log entry, accept_date field. Note that this is not the date at the very start of the load balancer log entry.

Both the client and server may use timeouts that measure from this timestamp. It can be helpful to check for intervals of round numbers (e.g., exactly 1 or 2 minutes).

Client constructs the HTTP request.

Client (outside Manta)

Muskie log entry, req.headers.date (the Date header supplied by the client with the HTTP request)

Large gaps between when the client constructed the request and when Muskie began processing it could indicate queueing in Muskie or somewhere in the network before reaching Muskie.

Muskie begins processing the request.

Muskie

Muskie log entry. There’s no discrete field for this, but you can compute it by taking the timestamp when Muskie completed the request (see below) and subtracting the total latency of the request (including time to transfer data). To calculate the total latency, sum all the timers in req.timers (reported in microseconds). (See the sketch after this table.)

This is a very useful point at which to divide an investigation, since large gaps in time before this point indicate queueing prior to the request reaching Muskie, which generally involve different tools and data sources than latency after this point.

Muskie makes HTTP request to Mahi (authcache) to load information needed to authenticate and authorize this request.

Muskie

Muskie log entry, req.timers. These requests generally occur during the loadCaller and loadOwner phases, though they can also happen during parseHttpAuthToken (if the x-auth-token header is provided by the client), or getActiveRoles. You can generally infer that if these HTTP requests were made, they started after one of these phases began and completed before the same phase ended. To determine the actual start and end timestamps, you have to compute them based on the latency of the previous phases and the time when Muskie began processing the request (or the latency of the subsequent phases and the time when Muskie finished processing the request).

Most requests do not actually involve any of these HTTP requests because the information is usually cached in Muskie. However, latency here may indicate a problem with Mahi or the network.

Muskie makes RPCs to load metadata.

Muskie

Muskie log entry, req.timers. These requests most commonly occur during the getMetadata phase. For the actual start and end timestamps, you must compute them based on the latency of the other phases and the time the request started (or completed).

High latency here indicates a slow read RPC to the metadata tier.

Muskie issues HTTP requests to storage nodes.

Muskie

Muskie log entry, req.timers and req.sharksContacted. For GET requests, this happens during the streamFromSharks phase. For PUT requests, this happens during the startSharkStreams phase. Since multiple storage nodes may be contacted, sometimes in parallel and sometimes sequentially, the sharksContacted field has more fine-grained information about the time for each particular storage node.

This step is often forgotten, but it’s important to complete before Muskie sends successful response headers to the client. High latency between when these requests are issued and when storage nodes send headers back may indicate a DNS problem, a network problem, or a storage node that’s offline or overloaded.

Muskie sends response headers.

Muskie

Muskie log entry, latency field (elapsed time in milliseconds, which must be added to the timestamp when Muskie began processing the request)

Sometimes called latency-to-first-byte, this is a significant point in the request because until this point, the client has heard nothing from the server while the server authenticates the client, loads metadata, and authorizes the request. Many clients have timeouts (sometimes as short as a few seconds) from when they send the request until they see response headers. If there’s a long gap in this range and then Muskie subsequently reports a client error (e.g., timed out waiting for the client or a closed connection from the client), the client may have given up.

Also, the work associated with the rest of this request depends on how large the end user’s object is and how much network bandwidth they have available to Manta. As a result, it’s common to summarize this latency-to-first-byte rather than the whole latency of the request, as this number is more comparable across different types of requests.

Muskie begins streaming data between storage nodes and the client.

Muskie

Muskie log entry, req.timers field. For GET operations, the phase where data is transferred is called streamFromSharks. For PUT operations, the phase is called sharkStreams.

This should be about the same time as the response headers are sent.

Storage nodes process HTTP requests

Mako

mako-access log. Entries in the mako-access log include both a timestamp when they completed and a latency, which allows you to compute the start timestamp.

High latency between when Muskie began transferring data to or from storage nodes and when storage nodes report having started may indicate overload on the storage node or a network problem.

Muskie finishes streaming data between storage nodes and the client.

Muskie

See the row above where Muskie begins streaming data. The end of the corresponding phase indicates when this finished.

The time for this phase is highly variable depending on how much data is requested and the bandwidth available to the client. It’s more useful to compute throughput (as the total bytes sent divided by the latency of this phase) than look at the latency directly. Low throughput here can be indicative of almost anything: a slow client, a slow network, overloaded Muskie, or an overloaded storage node.

Muskie makes RPCs to save metadata.

Muskie

Muskie log entry, req.timers. These requests most commonly occur during the saveMetadata phase. For the actual start and end timestamps, you must compute them based on the latency of the other phases and the time the request started (or completed).

Another possible indicator of this timestamp may be given by the mtime stored in the object metadata, assuming it hasn’t been changed since the request you’re investigating completed.

High latency here indicates a slow write RPC to the metadata tier.

Load balancer indicates the TCP session ended.

Load balancer

Load balancer log, syslog_date field.

This is generally the last time the client heard anything from the server associated with this request. Normally, this indicates the end of the request. In pathological cases where a metadata operation takes several minutes, the load balancer may terminate the request, logging this entry and sending a failure response to the client, even though Muskie is still processing the request. This is typically followed by a Muskie log entry with a timestamp after this point. The latency of the various phases in the Muskie log entry points to what took so long.
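
As a concrete example of the computation described above, you can recover the time when Muskie began processing a request by summing req.timers (microseconds) and subtracting the total from the entry’s completion timestamp. A rough sketch (REQUEST_ID is a placeholder):

$ grep "$REQUEST_ID" /var/log/muskie.log | json time
$ grep "$REQUEST_ID" /var/log/muskie.log | json req.timers | tr -d '",' |
      awk '$2 ~ /^[0-9]+$/ { sum += $2 } END { printf("total: %d us (%.3f s)\n", sum, sum / 1e6) }'

Subtracting the total from the time value gives the approximate time when Muskie began processing the request.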

There’s loads more information available in the system. Depending on your problem, you may need to get more creative. Examples:

  • The approximate time when any system (or zone) halted or booted can be determined from the output of last reboot, uptime, SMF service times, or process start times.

  • The precise time when a process started can be extracted from ps(1) (see the sketch after this list).

  • The precise time when a system panicked can be extracted from a crash dump from the panic_hrestime and panic_hrtime variables.

  • The time that a process crashed (dumped core) can be inferred from the mtime of the core file or entries in the SMF log file.
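
For example, on illumos systems ps(1) can report a process’s start time and elapsed running time directly. A minimal sketch (PID is a placeholder):

$ ps -o pid,stime,etime,args -p "$PID"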

Example request timeline

Let’s make our own test request and make a timeline for it:

$ mget -v /dap/stor/cmd.tgz 2> >(bunyan) > /dev/null
...
[2019-05-22T18:22:55.329Z] TRACE: mget/MantaClient/12438 on blinky.local (/Users/dap/install/lib/node_modules/manta/node_modules/restify-clients/lib/HttpClient.js:210 in onResponse): Response received (client_res={})
    HTTP/1.1 200 OK
    etag: cb1036e4-3b57-c118-cd46-961f6ebe12d0
    last-modified: Tue, 08 Apr 2014 21:56:13 GMT
    accept-ranges: bytes
    content-type: application/octet-stream
    content-md5: vVRjo74mJquDRsoW2HJM/g==
    content-length: 17062152
    durability-level: 2
    date: Wed, 22 May 2019 18:22:55 GMT
    server: Manta
    x-request-id: 0b241d9c-d076-4c9b-b954-3b65adb73c73
    x-response-time: 187
    x-server-name: 204ac483-7e7e-4083-9ea2-c9ea22f459fd
    connection: keep-alive
    x-request-received: 1558549375054
    x-request-processing-time: 275

With the debug output from the client command, we can easily find the Muskie instance (204ac483-7e7e-4083-9ea2-c9ea22f459fd) that handled this request and request-id (0b241d9c-d076-4c9b-b954-3b65adb73c73). (This is just an example. If you don’t have these in your case, see Finding (or generating) a failed request.)

To find the Muskie log entry, we must first find Muskie zone 204ac483-7e7e-4083-9ea2-c9ea22f459fd. We can see which datacenter it’s in using manta-adm show:

[root@headnode (staging-1) ~]# manta-adm show -a webapi
SERVICE          SH DATACENTER ZONENAME
webapi            1 staging-1  380920d9-ed44-4bcd-b61c-4b99f49c1329
webapi            1 staging-2  204ac483-7e7e-4083-9ea2-c9ea22f459fd
webapi            1 staging-3  6e59a763-6f6a-46a1-926e-90c1b7fc370b

It’s in staging-2. Following the instructions above, we can search for the log entry for this request:

[root@headnode (staging-2) ~]# manta-oneach -z 204ac483-7e7e-4083-9ea2-c9ea22f459fd 'grep 0b241d9c-d076-4c9b-b954-3b65adb73c73 /var/log/muskie.log'
SERVICE          ZONE     OUTPUT
webapi           204ac483 {"name":"muskie","hostname":"204ac483-7e7e-4083-9ea2-c9ea22f459fd","pid":969231,"component":"HttpServer","audit":true,"level":30,"_audit":true,"operation":"getstorage","billable_operation":"GET","bytesTransferred":"17062152","logicalRemoteAddress":"172.20.5.18","remoteAddress":"127.0.0.1","remotePort":53399,"reqHeaderLength":503,"req":{"method":"GET","url":"/dap/stor/cmd.tgz","headers":{"accept":"*/*","x-request-id":"0b241d9c-d076-4c9b-b954-3b65adb73c73","date":"Wed, 22 May 2019 18:22:55 GMT","authorization":"Signature keyId=\"/dap/keys/c0:24:6d:54:c4:5e:72:15:b5:65:21:80:69:81:cb:14\",algorithm=\"ecdsa-sha256\",headers=\"date\",signature=\"MEUCIBaFtaKwO2XwOfJGXPw2ITfr1C2v7fHi0lMcsCa4049FAiEAkaZ/oPTYNbYdVvAXJM9pFq3rJgZB2F7gUQjKu2UG1m4=\"","user-agent":"restify/1.4.1 (x64-darwin; v8/3.14.5.9; OpenSSL/1.0.1t) node/0.10.45","accept-version":"~1.0","host":"manta.staging.joyent.us","connection":"keep-alive","x-forwarded-for":"::ffff:172.20.5.18"},"httpVersion":"1.1","owner":"bc8cd146-fecb-11e1-bd8a-bb6f54b49808","caller":{"login":"dap","uuid":"bc8cd146-fecb-11e1-bd8a-bb6f54b49808","groups":[],"user":null},"timers":{"earlySetup":99,"parseDate":30,"parseQueryString":35,"handler-3":202,"checkIfPresigned":10,"enforceSSL":8,"ensureDependencies":10,"_authSetup":9,"preSignedUrl":7,"checkAuthzScheme":14,"parseAuthTokenHandler":300,"signatureHandler":152,"parseKeyId":149,"loadCaller":3396,"verifySignature":4258,"parseHttpAuthToken":15,"loadOwner":4353,"getActiveRoles":192,"gatherContext":34,"setup":303,"getMetadata":169779,"storageContext":17,"authorize":201,"ensureEntryExists":5,"assertMetadata":3,"getDirectoryCount":3,"getDirectory":3,"negotiateContent":65,"checkIfMatch":10,"checkIfNoneMatch":5,"checkIfModified":10,"checkIfUnmodified":5,"verifyRange":4,"streamFromSharks":576907}},"resHeaderLength":369,"res":{"statusCode":200,"headers":{"etag":"cb1036e4-3b57-c118-cd46-961f6ebe12d0","last-modified":"Tue, 08 Apr 2014 21:56:13 GMT","accept-ranges":"bytes","content-type":"application/octet-stream","content-md5":"vVRjo74mJquDRsoW2HJM/g==","content-length":"17062152","durability-level":2,"date":"Wed, 22 May 2019 18:22:55 GMT","server":"Manta","x-request-id":"0b241d9c-d076-4c9b-b954-3b65adb73c73","x-response-time":187,"x-server-name":"204ac483-7e7e-4083-9ea2-c9ea22f459fd"}},"latency":187,"objectId":"cb1036e4-3b57-c118-cd46-961f6ebe12d0","sharksContacted":[{"shark":"1.stor.staging.joyent.us","result":"ok","timeToFirstByte":2,"timeTotal":577,"_startTime":1558549375336}],"entryShard":"tcp://3.moray.staging.joyent.us:2020","route":"getstorage","_auditData":true,"dataLatency":762,"dataSize":17062152,"latencyToFirstByte":187,"msg":"handled: 200","time":"2019-05-22T18:22:55.914Z","v":0}

or, formatted with bunyan:

[2019-05-22T18:22:55.914Z]  INFO: muskie/HttpServer/969231 on 204ac483-7e7e-4083-9ea2-c9ea22f459fd: handled: 200 (audit=true, _audit=true, operation=getstorage, billable_operation=GET, bytesTransferred=17062152, logicalRemoteAddress=172.20.5.18, remoteAddress=127.0.0.1, remotePort=53399, reqHeaderLength=503, resHeaderLength=369, latency=187, objectId=cb1036e4-3b57-c118-cd46-961f6ebe12d0, entryShard=tcp://3.moray.staging.joyent.us:2020, route=getstorage, _auditData=true, dataLatency=762, dataSize=17062152, latencyToFirstByte=187, req.owner=bc8cd146-fecb-11e1-bd8a-bb6f54b49808)
    GET /dap/stor/cmd.tgz HTTP/1.1
    accept: */*
    x-request-id: 0b241d9c-d076-4c9b-b954-3b65adb73c73
    date: Wed, 22 May 2019 18:22:55 GMT
    authorization: Signature keyId="/dap/keys/c0:24:6d:54:c4:5e:72:15:b5:65:21:80:69:81:cb:14",algorithm="ecdsa-sha256",headers="date",signature="MEUCIBaFtaKwO2XwOfJGXPw2ITfr1C2v7fHi0lMcsCa4049FAiEAkaZ/oPTYNbYdVvAXJM9pFq3rJgZB2F7gUQjKu2UG1m4="
    user-agent: restify/1.4.1 (x64-darwin; v8/3.14.5.9; OpenSSL/1.0.1t) node/0.10.45
    accept-version: ~1.0
    host: manta.staging.joyent.us
    connection: keep-alive
    x-forwarded-for: ::ffff:172.20.5.18
    --
    HTTP/1.1 200 OK
    etag: cb1036e4-3b57-c118-cd46-961f6ebe12d0
    last-modified: Tue, 08 Apr 2014 21:56:13 GMT
    accept-ranges: bytes
    content-type: application/octet-stream
    content-md5: vVRjo74mJquDRsoW2HJM/g==
    content-length: 17062152
    durability-level: 2
    date: Wed, 22 May 2019 18:22:55 GMT
    server: Manta
    x-request-id: 0b241d9c-d076-4c9b-b954-3b65adb73c73
    x-response-time: 187
    x-server-name: 204ac483-7e7e-4083-9ea2-c9ea22f459fd
    --
    sharksContacted: [
      {
        "shark": "1.stor.staging.joyent.us",
        "result": "ok",
        "timeToFirstByte": 2,
        "timeTotal": 577,
        "_startTime": 1558549375336
      }
    ]
    --
    req.caller: {
      "login": "dap",
      "uuid": "bc8cd146-fecb-11e1-bd8a-bb6f54b49808",
      "groups": [],
      "user": null
    }
    --
    req.timers: {
      "earlySetup": 99,
      "parseDate": 30,
      "parseQueryString": 35,
      "handler-3": 202,
      "checkIfPresigned": 10,
      "enforceSSL": 8,
      "ensureDependencies": 10,
      "_authSetup": 9,
      "preSignedUrl": 7,
      "checkAuthzScheme": 14,
      "parseAuthTokenHandler": 300,
      "signatureHandler": 152,
      "parseKeyId": 149,
      "loadCaller": 3396,
      "verifySignature": 4258,
      "parseHttpAuthToken": 15,
      "loadOwner": 4353,
      "getActiveRoles": 192,
      "gatherContext": 34,
      "setup": 303,
      "getMetadata": 169779,
      "storageContext": 17,
      "authorize": 201,
      "ensureEntryExists": 5,
      "assertMetadata": 3,
      "getDirectoryCount": 3,
      "getDirectory": 3,
      "negotiateContent": 65,
      "checkIfMatch": 10,
      "checkIfNoneMatch": 5,
      "checkIfModified": 10,
      "checkIfUnmodified": 5,
      "verifyRange": 4,
      "streamFromSharks": 576907
    }

Similarly, we can look for a load balancer log entry, which gets us this:

2019-05-22T18:22:56+00:00 710f85a9-6f13-4fcf-8f3f-a03c4a508714 haproxy[550758]: ::ffff:172.20.5.18:56707 [22/May/2019:18:22:55.125] https secure_api/be1 6/0/0/188/934 200 17062612 - - ---- 25/8/24/8/0 0/0 "GET /dap/stor/cmd.tgz HTTP/1.1"

From these log entries, we can put together this timeline:

18:22:55    Z  Client generates "Date" header for request
               (source: Muskie log entry, request headers)

18:22:55.125Z  Load balancer haproxy accepts TCP request from client (172.20.5.18)
               (source: Load balancer log entry, accept timestamp)

18:22:55.153Z  Muskie begins processing request
               (source: Muskie log entry timestamp minus the sum of all handler times)

18:22:55.166Z  Muskie begins loading metadata for request
               (source: Muskie log entry, inferred wall time for "getMetadata" handler start)

18:22:55.336Z  Muskie begins HTTP request to storage node "1.stor.staging.joyent.us"
               (source: Muskie log entry, "sharksContacted" object)

18:22:55.337Z  Muskie begins streaming data from storage node
               (source: Muskie log entry, inferred wall time for "streamFromSharks" handler start)

18:22:55.913Z  Muskie finishes streaming data from storage node
               (source: Muskie log entry, "sharksContacted" object)

18:22:55.914Z  Muskie records the request completed
               (source: Muskie log entry timestamp)

18:22:56    Z  Load balancer haproxy terminates session
               (source: haproxy log message timestamp)

18:22:56.059Z  Load balancer haproxy terminates session
               (source: accept time (see above) plus 934ms total latency reported by haproxy)

Details about specific error messages

"Request has exceeded …​ bytes"

You may observe any of the following symptoms:

  • a 413 ("Request Entity Too Large") response code

  • an error with name MaxContentLengthExceededError

  • an error with a message like request has exceeded 11 bytes (the actual number of bytes varies)

  • a Muskie log entry showing MaxSizeExceededError

All of these indicate that the client made a streaming upload and attempted to upload more bytes than Manta expected. If the max-content-length header was provided, its value was too small for the content that was uploaded. If the header was not provided, then the default value picked for the maximum streaming upload size was not sufficient. In either case, the issue can be resolved by having the client provide a max-content-length that’s at least as large as the object that it’s attempting to upload.

Here’s an example that demonstrates the case:

$ echo 1234567890 | mput -q -H 'max-content-length: 7' /dap/public/foo
mput: MaxContentLengthExceededError: request has exceeded 11 bytes
$

See the section on "Not enough free space for …​ MB" for more details on streaming uploads and the max-content-length header. Note that the error message incorrectly implies that 11 bytes was the limit. This is a bug. The actual limit was whatever was specified by the max-content-length header or else a default value described in that linked section.
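
As described above, the fix is to supply a max-content-length at least as large as the actual upload. Repeating the example with a sufficient value (11 bytes here, including the newline added by echo) should succeed:

$ echo 1234567890 | mput -q -H 'max-content-length: 11' /dap/public/foo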

Note that for non-streaming uploads, the failure mode for attempting to upload too many bytes is less well-defined, as this is likely to break the framing of HTTP requests on the TCP connection.

"No storage nodes available for this request"

Certain underlying conditions (described below) result in the following symptoms:

  • an error (reported either from the client or server) with message "No storage nodes available for this request"

  • a 503 response code (reported either from the client or server), likely having a "Retry-After" header

  • a client-reported error called "InternalError"

  • a server-reported error called "SharksExhaustedError"

If you see these symptoms, read on for more information.

Note
This section does not apply to all 503 response codes nor all instances of InternalError. However, the condition described here always produces a 503 response code, an InternalError reported to the client, and a SharksExhaustedError reported internally, and both the client and the server errors always include this message.

These symptoms occur when Manta gave up on an object upload request (i.e., a PUT request) while trying to connect to storage nodes to store copies of the requested object. In many cases, retrying the request is likely to succeed because different storage nodes will be selected that are likely to be available. For that reason, the response generally includes a Retry-After HTTP header. If the problem is persistent, there may be an unusual problem affecting a lot of storage nodes (e.g., a whole datacenter partition).

More specifically: for each upload request, Manta selects up to three sets of storage nodes that will be used to store copies of the object. Each set contains as many storage nodes as there will be copies of the object (two by default, but this can be overridden by the client). Manta first attempts to initiate an object upload to all the storage nodes in the first set. If any of those fails, it moves on to the second set, and so on until all sets are exhausted, at which point the request fails.

Based on the way this works, it’s possible for Manta to report this failure even when most of the system is functioning normally — it only takes three storage node failures. Keep in mind, though, that when Manta selects storage nodes for these sets (before even attempting to connect to them), it only selects storage nodes that have reported their own capacity relatively recently (typically within the last minute or so), so it’s expected that most of them should be online and functioning. That’s why retries are likely to succeed and why persistent occurrence of this problem may indicate a more unusual network issue.

To understand the problem in more detail:

Using the sharksContacted property, you can Build a request timeline. You should be able to tell from this property which storage nodes were part of which set (e.g., the first set of two, the second set of two, and so on). This will allow you to confirm that Manta failed to reach at least one of the storage nodes in each set, and it will indicate which storage nodes it failed to contact.
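
A quick way to pull just this property out of the Muskie log entry for the failed request (a sketch; REQUEST_ID is a placeholder):

$ grep "$REQUEST_ID" /var/log/muskie.log | json sharksContacted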

What if there are fewer than three sets of storage nodes? This can happen when there aren’t enough storage nodes meeting the desired storage requirements (e.g., that copies are in different datacenters). In small deployments, this can lead to higher error rates than expected.

Once you’ve determined which storage nodes Manta failed to reach, you can dig further into why Manta failed to reach them. This is largely beyond the scope of this guide, but below are some questions you might start with:

  • Was the storage node actually online at this time? You may be able to use logs in the zone (e.g., /var/log/mako-access.log) or the output of commands like last reboot or uptime to demonstrate that the zone was online (or not). Similarly, you could look for evidence that the server itself was online using logs or these same commands from the global zone. You could also look for crash dumps (indicating an operating system panic) in /var/crash/volatile in the global zone, or check the logs in /var/log in the global zone.

  • Was the storage node reachable at this time? This can be difficult to debug after the fact, but you can look for evidence in some of these same logs (e.g., /var/log/mako-access.log). Evidence of serving requests indicates at least some remote systems were able to reach it. A sudden drop in request activity is consistent with network problems. However, many of the symptoms of network issues (including disconnects, timeouts, and reset connections) are also consistent with systemic problems that caused important programs to stop running for an extended period (e.g., major I/O problems or problems with the operating system).

If you’re debugging the problem right after it happened or if it’s currently reproducible, these questions are easier to answer. For example, if you can log into a zone, then it’s not offline. You can check for network connectivity with tools like ping(1).
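
For example, if the problem is happening now, a quick liveness check across all storage zones from the headnode might look like this (a sketch using manta-oneach):

$ manta-oneach -s storage 'uptime'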

"Not enough free space for …​ MB"

This message (associated with 507 errors) indicates that Manta does not have enough space available on enough storage nodes for the write that was requested. This would be surprising in production environments, although it’s easy to induce even in production by requesting an absurd size. For example, you’ll see this if you attempt to upload an enormous object:

$ mput -H 'max-content-length: 1125899906842624' /dap/stor/waytoobig
mput: NotEnoughSpaceError: not enough free space for 1073741824 MB

Manta determines a maximum object size for each upload request and validates that enough storage is available to satisfy the request. When there aren’t enough storage nodes that have the required amount of space, the request fails with this message. (A request will also fail if the client attempts to upload more than the maximum expected size, but that produces a different error.)

In determining the maximum object size for an upload request, Manta supports two kinds of uploads: fixed-size and streaming. A fixed-size upload is one that specifies the size of the object using the content-length request header. Manta uses this value directly as the maximum object size. A streaming upload is specified using the transfer-encoding: chunked header. In this case, the space that Manta allocates up front (and validates) is given by the max-content-length header, if provided, or else a server-side, configurable default value. The storage.defaultMaxStreamingSizeMB property in Muskie’s configuration file determines this value. This is typically populated from the MUSKIE_DEFAULT_MAX_STREAMING_SIZE_MB SAPI property. The typical value is 5120 MB. This default is documented in the Manta API, so changing it may break working clients. Note that the default may be too large for some development environments, which can cause streaming uploads to fail with this message. An object PUT must specify either transfer-encoding: chunked or content-length.
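
To check the streaming-upload default in a running deployment, you can read it out of the Muskie configuration. A sketch (the configuration path /opt/smartdc/muskie/etc/config.json is an assumption and may differ in your deployment):

$ manta-oneach -s webapi 'json -f /opt/smartdc/muskie/etc/config.json storage.defaultMaxStreamingSizeMB'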

To understand a specific instance of this error, ask:

  • Does the size in the error message match what you (and the end user) expected would be required for this request? The process by which Manta determines this size is described above. If this does not match what you and the end user expect, that may reflect a bug. (More commonly, end users may not understand the semantics of content-length and max-content-length or operators don’t understand the defaultMaxStreamingSizeMB default value. Again, see above and the linked API documentation for an explanation of those.)

  • If the value matches what you expect, does Manta actually have that much space? This question is more complex than it sounds. Keep in mind that:

    • Manta will only consider storage zones currently using less than a configurable maximum amount of space. This is determined by the storage.maxUtilizationPct property of Muskie configuration files, which comes from the MUSKIE_MAX_UTILIZATION_PCT SAPI property. For requests from operator accounts, the storage.maxOperatorUtilizationPct property is used instead, which comes from the MUSKIE_MAX_OPERATOR_UTILIZATION_PCT SAPI property.

  • Multiple copies are written for each new object. The number of copies is determined by the durability-level header that is specified by the client. All copies of the object must be assigned to separate storage zones. The requested amount of space must be present on as many storage zones as there are copies to be created.

    • If this is a multi-datacenter deployment (i.e., if the storage.multiDC Muskie configuration property (which comes from the MUSKIE_MULTI_DC SAPI property) is set to a true value), then copies must be spread across datacenters as well. If the number of copies is less than the number of datacenters (e.g., 2 copies in a 3-datacenter deployment), then all copies must be placed into separate datacenters (so there must be a storage zone in each datacenter with enough capacity). If the number of copies is at least the number of datacenters, at least one copy must be placed in each datacenter.

Example: in the common case of durability-level: 2 (the default) in a multi-DC deployment with content-length of 10 GB, there must be 10 GB of space available on at least one storage zone in each of two datacenters. The request could fail with this error even if there are two storage zones with enough space if those storage zones are all in the same datacenter.

The mpicker tool can be used to determine how much space is available on each storage zone. The mpicker poll command summarizes available space on each storage zone that’s under the configured utilization percentage. For example:

[root@380920d9 (webapi) ~]$ /opt/smartdc/muskie/bin/mpicker poll
polling options:
 max percentage utilization: 92

DATACENTER MANTA_STORAGE_ID          %USED         AVAIL_MB      TIMESTAMP
staging-3  3.stor.staging.joyent.us  91            197218        2019-04-10T21:15:35.726Z
staging-2  2.stor.staging.joyent.us  87            306039        2019-04-10T21:15:35.221Z
staging-1  1.stor.staging.joyent.us  81            456521        2019-04-10T21:15:37.506Z
[root@380920d9 (webapi) ~]$

The mpicker choose command can be used to simulate Manta’s storage allocation behavior for a request of a given size. Here’s an example request for a size that would fail, given the above output:

[root@380920d9 (webapi) ~]$ /opt/smartdc/muskie/bin/mpicker poll -j | /opt/smartdc/muskie/bin/mpicker choose -s 310000
replicas requested: 2
size requested: 310000 MB
storage server count by datacenter:
        staging-3: 1 (0 available)
        staging-2: 1 (0 available)
        staging-1: 1 (1 available)
total storage server count: 3 (1 available)

could not choose storage node set: 2 copies requested, but only 1 DC(s) have sufficient space

If you run mpicker choose in this way using the same size as you see in the "Not enough free space for … MB" message, you’d typically expect to see the same error.

The mpicker command has a number of options to control the output and simulate different conditions. See the help output for details.

Locating metadata for an object

See Locating object data in the Manta Operator’s Guide.

Locating actual data for an object

See Locating object data in the Manta Operator’s Guide.

Locating a particular server

See Locating servers in the Manta Operator’s Guide.

Locating a particular zone

See Locating Manta component zones in the Manta Operator’s Guide.

Locating a particular database shard

Finding what shard a particular zone is part of

Save debugging state and restart a process

Investigating service discovery

Investigating a slow process

Investigating why a process is on-CPU

Investigating why a process is off-CPU

Check for a garbage collection issue (or memory leak)

Scaling up a component

Characterizing a problem

It’s very valuable to briefly and precisely characterize a problem:

  • when asking for help from others, so they can quickly understand the relevant context

  • for summarizing status to stakeholders

  • for handing off work to others (e.g., when escalating to another team or changing shifts)

  • as a form of rubber-ducking — it often helps spark insight

When characterizing a problem, include the basic facts:

  • Exactly what you’ve observed: 500 errors? 503 errors? Elevated latency? It’s useful to be as specific as you can (e.g., "a 5% increase in 500 errors"), but it’s better to be vague (e.g., "an increase in latency") than to say something false (e.g., "an increase in average latency" when only tail latency is affected).

  • Something about when the observations started. Again, it’s useful to be as specific as possible (e.g., "starting at 2018-08-09T16:47Z"), but it’s better to be vague (e.g., "seems to have increased since yesterday") than incorrect (e.g., "it’s higher than yesterday" when you’re really just eyeballing an average value from a graph).

  • What other observations you’ve made (or anything else you’ve tried, and the results)

  • (if you’re asking for help) Any specific questions you have.

It’s useful to mention what conclusions you’ve drawn, but try to distinguish facts (e.g., "moray queue lengths are high") from inferences ("moray is overloaded").

Here’s a good status summary:

Webapi is reporting an increase in 500 and 503 errors since 2018-08-09T16:58Z. These do not seem correlated with increased latency, nor with any errors from the metadata tier. How do we check for errors from the storage tier?

It’s often very helpful to include screenshots (MacOS instructions). Here’s an example:

Elevated latency at Webapi

Here’s one way to characterize this:

We saw elevated latency since early on 8/1 (UTC)

Better:

We saw a modest increase in tail latency (both p90 and p99) starting early on 8/1 and continuing through 8/4 (UTC). This was followed by a sharp, significant increase on 8/6.

More detailed (may be better, depending on the situation):

We saw a modest increase in tail latency (both p90 and p99) starting early on 8/1 and continuing through 8/4 (UTC). This was followed by a sharp, significant increase on 8/6. During both increases, latency was much less stable than before. p90 peaked at about 50% higher than normal, while p95 spiked about 3x what it was before.

Of course, it’s easy to give too much detail, too. Think about what’s likely to be relevant for your audience.

Grafana’s features can help you make quantitative statements. For example, take this graph:

Dips in throughput at Webapi

About the dip on 8/6, you could say:

We saw a dip in throughput on 8/6.

Better would be:

There was a brief dip in inbound throughput on 8/6 around 12:36Z.

Since Grafana can show you both the average for the whole period, plus the value at a point, you can make more specific statements:

There was a brief dip in inbound throughput on 8/6 around 12:36Z. It dropped about 38% relative to the average for 14 days around it (3.3 GBps during the dip, compared to a recent average of about 5.3 GBps).

Quick references

Manta HTTP Quick Reference

HTTP Status Codes in Manta

Related links:

Here we cover only status codes with particular meanings within Manta or that are commonly used within Manta.

Code HTTP Meaning in Manta

100

Continue

The client requested extra initial validation, and the server has not yet rejected the request.

200

OK

Most commonly used for successful GETs

201

Created

Most commonly used for creating jobs and multipart uploads (not object PUT operations)

204

No Content

Used for successful directory creations, directory removals, object uploads, object deletes, snaplink creation, and a handful of other operations

400

Bad Request

The client sent an invalid HTTP request (e.g., one with an incorrect MD5 checksum)

401

Unauthorized

The client sent an invalid or unsupported signature, or it did not send any signature.

403

Forbidden

The client failed to authenticate, or it authenticated and was not allowed to access the resource.

408

Request Timeout

The server did not receive a complete request from the client within a reasonable timeout.

409

Conflict

The client sent an invalid combination of parameters for an API request.

412

Precondition Failed

The client issued a conditional request and the conditions were not true. (For example, this could have been a PUT-if-the-object-does-not-already-exist, and the object already existed.)

413

Request Entity Too Large

The client attempted a streaming upload and sent more bytes than were allowed based on the max-content-length header. See "Request has exceeded …​ bytes" for details.

429

Too Many Requests

The client is being rate-limited by the server because it issued too many requests in too short a period.

499

(not in HTTP)

The 499 status is used to indicate that the client appeared to abandon the request. (In this case, it’s not possible to send a response. The 499 code is used for internal logging and statistics.) This was originally used in nginx.

500

Internal Server Error

Catch-all code for a failure to process this request.

502

Bad Gateway

Historically, this code was emitted by Manta when requests took more than two minutes to complete. This was an artifact of the load balancer. Modern versions of Manta report this as a 503.

503

Service Unavailable

This code generally indicates that the system is overloaded and cannot process more work. In practice, this currently means that a particular metadata shard’s queue is full, that Muskie took too long to respond to the request, or that there aren’t enough working storage nodes with enough disk space to satisfy this upload.

507

Not Enough Space

The Manta deployment is out of physical disk space for new objects. See "Not enough free space for …​ MB" for details.

Generally:

  • Responses with status codes above 400 (400 through 599) are often called errors. In many cases, though, 400-level errors do not indicate that anything is wrong. For example, a 404 may be normal behavior for a client that checks for the existence of a particular object before doing some other operation.

  • For errors (except for 499), the response body should contain a JSON object containing more information: a Manta-specific error code and message.

  • Generally, 400-level codes (i.e., codes from 400 to 499) indicate that the request failed due to something within the client’s control.

  • Generally, 500-level codes (i.e., codes from 500 to 599) indicate a server-side failure.

HTTP Headers in Manta

Related links:

Here we cover only headers with particular meanings within Manta or that are commonly used within Manta.

Header Request/Response Origin Meaning

Content-Length

Both

HTTP

See Streaming vs. fixed-size requests.

Content-MD5

Both

HTTP

MD5 checksum of the body of a request or response. It’s essential that clients and servers validate this on receipt.

Content-Type

Both

HTTP, Manta

Describes the type (i.e., MIME type) of the body of the request or response. Manta uses a special content-type, application/json; type=directory, to represent a Manta directory.

Date

Both

HTTP

The time when the request or response was generated. This is often useful when debugging for putting together a timeline.

Transfer-encoding: chunked

Both

HTTP

See Streaming vs. fixed-size requests.

any header starting with m-

Both

Manta

Arbitrary user-provided headers.

Result-Set-Size

Response

Manta

For GET or HEAD requests on directories, this header indicates how many items are in the directory.

x-request-id

Both

Manta

A unique identifier for this request. This can be used to locate details about a request in Manta logs. Clients may specify this header on requests, in which case Manta will use the requested id. Otherwise, Manta will generate one and provide it with the response.

x-server-name

Response

Manta

A unique identifier for the frontend instance that handled this request. Specifically, this identifies the "webapi" zone that handled the request.

Requests using "100-continue"

HTTP allows clients to specify a header called Expect: 100-continue to request that the server validate the request headers before the client sends the rest of it. For example, suppose a client wants to upload a 10 GiB object to /foo/stor/bar/obj1, but /foo/stor/bar does not exist. With Expect: 100-continue, the server can immediately send a "404 Not Found" response (because the parent directory doesn’t exist). Without this header, HTTP would require that the client send the entire 10 GiB request.

When Expect: 100-continue is specified with the request headers, then the client waits for a 100-continue response before proceeding to send the body of the request.

We mention this behavior because error handling for requests that do not use 100-continue can be surprising. For example, when the client doesn’t specify this header, the server might still choose to send a 400- or 500-level response immediately, but it must still wait for the client to send the whole request. There have been bugs in the past where the server did not read the rest of the request, resulting in a memory leak and a timeout from the client’s perspective (because the client has no reason to read a response before it has finished sending the request, if it didn’t use 100-continue).

Streaming vs. fixed-size requests

In order to frame HTTP requests and responses, one of two modes must be used:

  • A request or response can specify a content-length header that indicates exactly how many bytes of data will be contained in the body; or

  • A request or response can specify transfer-encoding: chunked, which indicates that the body will be sent in chunks, each of which is preceded by a size

Manta treats these two modes a little differently. If an upload request has a content-length, then Manta ensures that the storage nodes chosen to store the data have enough physical space available. Requests with transfer-encoding: chunked are called streaming uploads. For these uploads, a maximum content length is assumed by the server that’s used to validate that storage nodes contain enough physical space. The maximum content length for a streaming upload can be overridden using the max-content-length header.
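
For example, with the node-manta CLI, an upload from a local file is fixed-size (mput knows the size and sends a content-length header), while an upload from a pipe is streaming (chunked), so you may want to supply max-content-length (in bytes). A sketch with hypothetical paths:

$ mput -f ./report.tgz /dap/stor/report.tgz
$ tar czf - report/ | mput -H 'max-content-length: 10737418240' /dap/stor/report.tgz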

Validating the contents of requests and responses

It’s critical that clients and servers validate the body of responses and requests. Some types of corruption are impossible to report any other way.

Corrupted requests and responses can manifest in a number of ways:

  • the sender may stop sending after too few bytes

  • the sender may send EOF after sending too few bytes

  • the sender may send too many bytes

  • the body may have the right number of bytes, but have incorrect bytes

Importantly, because of the two modes of transfer described above (under Streaming vs. fixed-size requests), the reader of a request or response always knows how many bytes to expect. In the cases above:

  • If the sender stops sending bytes after too few bytes (but the socket is still open for writes in both directions), then the reader will fail the operation due to a timeout. For example, if the client does this, then the server will report a 408 error. The client must implement its own timeout to cover the case where the server fails in this way.

  • If the sender sends EOF after too few bytes, this would be a bad request or response. If a client did this, then the server would report a 400 error. The client must implement its own check to cover the case where the server fails in this way. At this point in the HTTP operation, the client may have already read a successful response (i.e., a 200), and it needs to be sophisticated enough to treat the truncated body as an error anyway.

  • If the sender sends too many bytes, then the request or response would be complete, but the next request or response would likely be invalid.

  • When possible, clients and servers should generally send a Content-MD5 header. This allows the receiver to compute an MD5 checksum on the body and verify that the correct bytes were sent. For object downloads, Manta always stores the MD5 computed from the original upload, and it always provides the Content-MD5 header on responses. If a client provides a Content-MD5 header on an upload, then Manta validates the body it receives against that checksum. When both of these mechanisms are used by both client and server, a client can be sure of end-to-end integrity. (See the sketch below.)

Note: MD5 checksums are deprecated for security purposes due to the risk of deliberate collisions. While they are not appropriate for security, accidental MD5 collisions remain rare enough for MD5 checksums to be used for basic integrity checks.
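
As a concrete check, you can compare a locally computed checksum against the content-md5 header that Manta reports. A sketch (./report.tgz is a hypothetical local file; minfo is the node-manta tool that prints the response headers for an object):

$ openssl dgst -md5 -binary ./report.tgz | openssl enc -base64
$ minfo /dap/stor/report.tgz | grep -i content-md5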

Muskie log entry properties

Below is a summary of the most relevant fields for an audit log entry. (Note that Muskie sometimes writes out log entries unrelated to the completion of an HTTP request. Only log entries with "audit": true represent completion of an HTTP request. Other log entries have other fields.)
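
To pull out just the audit entries (and a few fields of interest) from a Muskie log, you can filter on this property with the json tool. A minimal sketch:

$ json -ga -c 'this.audit === true' req.method req.url res.statusCode latency < /var/log/muskie.log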

General Muskie-provided properties

JSON property Example value Meaning

audit

true

If true, this entry describes completion of an HTTP request. Otherwise, this is some other type of log entry, and many of the fields below may not apply.

latency

26

Time in milliseconds between when Muskie started processing this request and when the response headers were sent. This is commonly called time to first byte. See also building a request timeline. This should generally match the x-response-time response header.

operation

getstorage

Manta-defined token that describes the type of operation. In this case, getstorage refers to an HTTP GET from a user’s stor directory.

req

See specific properties below.

Object describing the incoming request

req.method

GET

HTTP method for this request (specified by the client)

req.url

"/poseidon/stor/manta_gc/mako/1.stor.staging.joyent.us?limit=1024"

URL (path) provided for this request (specified by the client)

req.headers

{
    "accept": "*/*",
    "x-request-id": "a080d88b-8e42-4a98-a6ec-12e1b0dbf612",
    "date": "Tue, 01 Aug 2017 03:03:13 GMT",
    "authorization": "Signature keyId=\"/poseidon/keys/ef:0e:27:45:c5:95:4e:92:ba:ab:03:17:e5:3a:60:14\",algorithm=\"rsa-sha256\",headers=\"date\",signature=\"...\"",
    "user-agent": "restify/1.4.1 (ia32-sunos; v8/3.14.5.9; OpenSSL/1.0.1i) node/0.10.32",
    "accept-version": "~1.0",
    "host": "manta.staging.joyent.us",
    "connection": "keep-alive",
    "x-forwarded-for": "::ffff:172.27.4.22"
}

Headers provided with this request (specified by the client). The Date header is particularly useful to note, as this usually reflects the timestamp (on the client) when the client generated the request. This is useful when constructing a request timeline. In particular, problems with the network (timeouts and retransmissions) or queueing any time before Muskie starts processing the request can be identified using this header, provided that the client clock is not too far off from the server clock.

req.caller

{
    "login": "poseidon",
    "uuid": "4d649f41-cf87-ca1d-c2c0-bb6a9004311d",
    "groups": [ "operators" ],
    "user": null
}

Object describing the account making this request. Note that this is not necessarily the same as the owner of the resource (req.owner). They commonly differ when the caller uses operator privileges to access objects in someone else’s account or when any user makes an authenticated request to access public data in another user’s account.

req.caller.login

"poseidon"

For authenticated requests, the name of the account that made the request.

req.caller.uuid

"4d649f41-cf87-ca1d-c2c0-bb6a9004311d"

For authenticated requests, the unique identifier for the account that made the request.

req.caller.groups

[ "operators" ]

For authenticated requests, a list of groups that the caller is part of. Generally, the only interesting group is "operators", which grants the caller privileges to read from and write to any account.

req.caller.user

null

For authenticated requests from a subuser of the account, the name of the subuser account.

req.owner

"4d649f41-cf87-ca1d-c2c0-bb6a9004311d"

Unique identifier for the account that owns the requested resource. This is generally the uuid of the account at the start of the URL (i.e., for a request of "/poseidon/stor", this would be the uuid of the account poseidon).

res

See specific properties below.

Describes the HTTP response sent by Muskie to the client.

res.statusCode

200

HTTP-level status code.

res.headers

{
    "last-modified": "Sat, 22 Mar 2014 01:17:01 GMT",
    "content-type": "application/x-json-stream; type=directory",
    "result-set-size": 1,
    "date": "Tue, 01 Aug 2017 03:03:13 GMT",
    "server": "Manta",
    "x-request-id": "a080d88b-8e42-4a98-a6ec-12e1b0dbf612",
    "x-response-time": 26,
    "x-server-name": "204ac483-7e7e-4083-9ea2-c9ea22f459fd"
}

Headers sent in the response from Muskie to the client. Among the most useful is the x-request-id header, which should uniquely identify this request. You can use this to correlate observations from the client or other parts of the system.

route

"getstorage"

Identifies the name of the restify route that handled this request.

Muskie-provided properties for debugging only

JSON property Example value Meaning

entryShard

"tcp://3.moray.staging.joyent.us:2020"

When present, this indicates the shard that was queried for the metadata for req.url. Unfortunately, this field is not currently present when Muskie fails to fetch metadata, either because of a Moray failure or just because the metadata is missing (i.e., the path doesn’t exist).

err

false

Error associated with this request, if any. See Details about specific error messages.

objectId

"bf54fb8a-6cb5-4683-8655-f9ad90b984d4"

When present, this is the unique identifier for the Manta object identified by req.url when the request was made. This is helpful when trying to verify that a request fetched the exact object that you expect (and not another object that had the same name at the time).

parentShard

"tcp://2.moray.staging.joyent.us:2020"

When present, this indicates the shard that was queried for the metadata for the parent directory of req.url. This is only present when the parent metadata was fetched (which is common for PUT requests, but not GET or DELETE requests). Unfortunately, this field is not currently present when Muskie fails to fetch metadata, either because of a Moray failure or just because the metadata is missing (i.e., the path doesn’t exist).

logicalRemoteAddress

"172.27.4.22"

The (remote) IP address of the client connected to Manta. Note that clients aren’t connected directly to Muskie. When using TLS ("https" URLs), clients connect to stud in the loadbalancer component. Stud connects to haproxy in the same container. haproxy in the load balancer container connects to another haproxy instance in the Muskie container. That haproxy instance connects to a Muskie process. The client’s IP is passed through this chain and recorded in logicalRemoteAddress.

remoteAddress, remotePort

"127.0.0.1", 64628

The IP address and port of the TCP connection over which this request was received. Generally, Muskie only connects directly to an haproxy inside the same zone, so the remote address will usually be 127.0.0.1. Neither of these fields is generally interesting except when debugging interactions with the local haproxy.

req.timers

{
    "earlySetup": 32,
    "parseDate": 8,
    "parseQueryString": 28,
    "handler-3": 127,
    "checkIfPresigned": 3,
    "enforceSSL": 3,
    "ensureDependencies": 5,
    "_authSetup": 5,
    "preSignedUrl": 3,
    "checkAuthzScheme": 4,
    "parseAuthTokenHandler": 36,
    "signatureHandler": 73,
    "parseKeyId": 59,
    "loadCaller": 133,
    "verifySignature": 483,
    "parseHttpAuthToken": 5,
    "loadOwner": 268,
    "getActiveRoles": 43,
    "gatherContext": 27,
    "setup": 225,
    "getMetadata": 5790,
    "storageContext": 8,
    "authorize": 157,
    "ensureEntryExists": 3,
    "assertMetadata": 3,
    "getDirectoryCount": 7903,
    "getDirectory": 10245
}

An object describing the time in microseconds spent in each phase of the request processing pipeline. This is useful for identifying where latency was incurred. The names in this object are the names of the functions inside Muskie responsible for the corresponding phase of request processing.

sharksContacted

[ {
  "shark": "1.stor.staging.joyent.us",
  "result": "ok",
  "timeToFirstByte": 2,
  "timeTotal": 902,
  "_startTime": 1509505866032
}, {
  "shark": "2.stor.staging.joyent.us",
  "result": "ok",
  "timeToFirstByte": 1,
  "timeTotal": 870,
  "_startTime": 1509505866033
} ]

This field should be present for Manta requests that make requests to individual storage nodes. The value is an array of the storage nodes contacted as part of the request, including the result of each subrequest, when it started, and how long it took.

For GET requests, these subrequests are GET requests to the individual storage nodes hosting a copy of the requested object. These subrequests happen serially, and Muskie stops as soon as one succeeds.

For PUT requests, the storage node subrequests are PUT requests to the individual storage nodes on which a copy of the new object will be stored. If all goes well, you’ll see N sharks contacted (where N is the client’s requested durability level, typically 2), all of them successful, and these subrequests will run concurrently with each other. If any of them fails, Manta will try another set of N sharks, and then up to one more set of N. For durability level 2, you may therefore see up to 6 sharks contacted: three sets of two. The sets run sequentially, while the requests within each set run concurrently.

Bunyan-provided properties

JSON property Example value Meaning

time

"2017-08-01T03:03:13.985Z"

ISO 8601 timestamp closest to when the log entry was generated.

hostname

"204ac483-7e7e-4083-9ea2-c9ea22f459fd"

The hostname of the system that generated the log entry. For us, this is generally a uuid corresponding to the zonename of the Muskie container.

pid

79465

The pid of the process that generated the log entry.

level

30

Bunyan-defined log level. This is a numeric value corresponding to conventional values like 'debug', 'info', 'warn', etc. You can filter based on level using the bunyan command.

msg

"handled: 200"

For Muskie audit log entries, the message is always "handled: " followed by the HTTP status code of the response.
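
As a concrete illustration of working with these fields, the sketch below pulls the slowest pipeline phases out of each audit log entry. It assumes jq is available wherever you’ve copied the log, and "muskie.log" is simply a stand-in for a Muskie audit log in bunyan (JSON-per-line) format; neither is guaranteed to be present in a Muskie zone.

# "muskie.log" is a stand-in for a Muskie audit log in bunyan (JSON-per-line)
# format; jq is assumed to be available wherever you've copied the log.
# For each entry that recorded request timers, print the request URL and the
# three slowest pipeline phases (values are microseconds).
jq -c 'select(.req.timers != null)
       | {url: .req.url,
          slowest: (.req.timers | to_entries | sort_by(-.value) | .[0:3])}' \
   muskie.log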


Debugging tools quick reference

See also the Manta Tools Overview.

Many of these tools have manual pages or sections in this guide about how to use them. You can generally view the manual page with man TOOLNAME in whatever context you can run the tool.
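
For example, from the headnode global zone you might read the manta-oneach manual page and then use the tool to check SMF service health across all webapi zones. This is a sketch: the scope options shown (in particular -s/--service) should be confirmed against the manual page for your version.

# Read the manual page for a tool from a context where it's installed.
man manta-oneach

# Sketch, assuming the -s/--service scope option: report any SMF services
# that are in a problem state in every "webapi" zone.
manta-oneach -s webapi 'svcs -xv'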

Tool Where you run it Has manual page? Purpose

manta-oneach(1)

headnode GZ or "manta" zones

Yes

Run arbitrary commands in various types of Manta zones

manta-login(1)

headnode GZ or "manta" zones

Yes

Open a shell in a particular Manta zone

mlocate

"webapi" zone

No

Fetch metadata for an object (including what shard it’s on)

moray(1) tools

"moray", "electric-moray" zones

Yes

Fetch rows directly from Moray

moraystat.d

"moray" zones

No

Shows running stats about Moray RPC activity

pgsqlstat tools

"postgres" zones (need to be copied in as needed)

No

Report on PostgreSQL activity

bunyan

Anywhere

Yes

Pretty-print bunyan-format log files. With -p PID, show live verbose log entries from a process.

curl

Anywhere

Yes

curl is (among other things) a general-purpose HTTP client. It can be used to make test requests to Manta itself as well as various components within Manta, including authcache and storage.

proc(1) tools (also called the ptools, which includes pfiles, pstack, and others)

Anywhere

Yes

Inspect various properties of a process, including its open files, thread stacks, working directory, signal mask, etc.

netstat(1M)

Anywhere

Yes

Shows information about the networking stack, including open TCP connections and various counters (including error counters).

vfsstat(1M)

Anywhere

Yes

Shows running stats related to applications' use of the filesystem (e.g., reads and writes)

prstat(1M)

Anywhere

Yes

Shows running stats related to applications' use of CPU and memory

mpstat(1M)

Anywhere

Yes

Shows running stats related to system-wide CPU usage

zonememstat(1M)

Anywhere

Yes

Shows running stats related to zone-wide memory usage

mdb_v8

Anywhere

No

Inspect JavaScript-level state in core files from Node.js processes.

Glossary of jargon

bounce (as in: "bounce a box", "bounce a service")

Bouncing a box or a service means restarting it. Bouncing a box usually means rebooting a server. Bouncing a service usually means restarting an SMF service (killing any running processes and allowing the system to restart them).

bound (as in: "CPU-bound", "disk-bound", "I/O-bound")

A program or a workload is said to be "X-bound" for some resource X when its performance is limited by that resource. For example, the performance of a CPU-bound process is limited by the amount of CPU available to it. "Disk-bound" (or "I/O-bound") usually means that a process or workload is limited by the I/O performance of the storage subsystem, which may be a collection of disks organized into a ZFS pool.

box

A box is a physical server (as opposed to a virtual machine or container).

container/zone/VM

A container is a lightweight virtualized environment, usually having its own process namespace, networking stack, filesystems, and so on. For most purposes, a container looks like a complete instance of the operating system, but there may be many containers running within one instance of the OS. They generally cannot interact with each other except through narrow channels like the network. The illumos implementation of containers is called zones. SmartOS also runs hardware-based virtual machines inside zones (i.e., a heavyweight hardware-virtualized environment within the lightweight OS-virtualized environment); while those are technically running in a container, the term container is usually only applied to zones not running a hardware-based virtualization environment. For historical reasons, within Triton and SmartOS, zones are sometimes called VMs, though that term sometimes refers only to the hardware-virtualized variety. The three terms are often used interchangeably (and also interchangeably with instance, since most components are deployed within their own container).

headroom

Idle capacity for a resource. For example, we say there’s CPU headroom on a box when some CPUs are idle some of the time. This usually means the system is capable of doing more work (at least with respect to this resource).

instance (general, SAPI)

Like service, instance can refer to a number of different things, including an instance of a SAPI service or of an SMF service. Most commonly, "instance" refers to an instance of a SAPI service.

latency

Latency refers to how much time an operation takes. It can apply to any discrete operation: a disk I/O request, a database transaction, a remote procedure call, a system call, establishment of a TCP connection, an HTTP request, and so on.

out of (as in: "out of CPU")

We sometimes say a box is out of a resource when that resource is fully utilized (i.e., "out of CPU" when all CPUs are busy).

pegged, slammed, swamped

These are all synonyms for being out of some resource. "The CPUs are pegged" means a box has very little CPU headroom (i.e., the CPUs are mostly fully utilized). You can also say "one CPU is pegged" (i.e., that CPU is fully utilized). You might also say "the disks are swamped" (i.e., they’re nearly always busy doing I/O). See also saturated.

saturated

A resource is saturated when processes are failing to use the resource because it’s already fully utilized. For example, when CPUs are saturated, threads that are ready to run have to wait in queues. When a network port is saturated, packets are dropped. Similar to pegged, but more precise.

service (general)

Service can refer to a SAPI service (see below), an SMF service (see below), or it may be used more generally to describe almost any useful function provided by a software component. As a verb (e.g., "this process is servicing requests"), it usually means "to process [requests]".

service (SAPI)

Within SAPI (the Triton facility for managing configuration and deployment of cloud applications like Manta), a service refers to a collection of instances providing similar functionality. It usually describes a type of component (e.g., "storage" or "webapi") that may have many instances. These instances usually share images and configuration, and within SAPI, the service is the place where such configuration is stored.

service (SMF)

Within the operating system, an SMF service is a piece of configuration that usually describes long-running programs that should be automatically restarted under various failure conditions. For example, we define an SMF service for "mahi-v2" (our authentication cache) so that the operating system automatically starts the service upon boot and restarts it if the process exits or dumps core. (Within SMF, it’s actually instances of a service that get started, stopped, restarted, and so on. For many services, there’s only one "default" instance, and the terms are often used interchangeably: usually someone will say "I restarted the mahi-v2 service" rather than "I restarted the sole instance of the mahi-v2 service". However, for some services (notably "muskie", "moray", "electric-moray", and "binder") we do deploy multiple instances, and it may be important to be more precise (e.g., "three of the muskie instances in this zone are in maintenance").) See smf(5).

shard

A shard generally refers to a database that makes up a fraction of a larger logical database. For example, the Manta metadata tier is one logical data store, but it’s divided into a number of equally-sized shards. In sharded systems like this, incoming requests are directed to individual shards in a deterministic way based on some sharding key. (Many systems use a customer id for this purpose; Manta traditionally uses the name of the parent directory of the resource requested.) In Manta, each shard typically uses 2-3 databases for high availability, but these aren’t separate shards because they’re exact copies of each other. Sharding refers to a collection of disjoint databases that together make up a much larger dataset.

tail latency

When discussing a collection of operations, tail latency refers to the latency of the slowest operations (i.e., the tail of the distribution). This is often quantified using a high-numbered percentile. For example, if the 99th percentile of requests is 300ms, then 99% of requests have latency at most 300ms. As compared with an average or median latency, the 99th percentile better summarizes the latency of the slowest requests.

Deployment-specific details

This documentation is general-purpose for any Manta deployment, but it makes reference to specific infrastructure that operators should provide for their own deployments. In particular, it’s strongly recommended that operators deploy:

  • one or more Prometheus instances to pull runtime metrics from various Manta components, with rules defined to precompute commonly-used metrics

  • one or more Grafana instances to support dashboards based on these runtime metrics

We recommend creating a few different dashboards:

  • a "basic health" dashboard that shows:

    • webapi requests completed with ability to break out by operation and status code

    • webapi inbound and outbound throughput

    • webapi average, p90, p95, and p99 latency

    • electric-moray average, p90, p95, and p99 latency

    • moray average, p90, p95, and p99 latency

  • a "shard health" dashboard that shows:

    • moray RPCs per second with ability to break out by shard and RPC type

    • moray RPC average, p90, p95, and p99 latency

  • a "postgres" dashboard that shows:

    • transactions committed and rolled back with ability to break out by shard

    • transactions-til-wraparound-autovacuum, with ability to break out by shard and table

    • live and dead tuples, with ability to break out by shard and table

    • the fraction of dead tuples to total tuples

  • a CPU utilization dashboard that shows:

    • line graphs for each service showing the min and max zone-wide CPU utilization for that service

    • a heat map for each service showing the zone-wide CPU utilization for each zone in that service

The rest of this guide makes reference to "checking recent historical metrics". This refers to using one of these dashboards to observe these metrics, typically for a period before and during an incident.

(Unfortunately, there is not much documentation yet on the metrics available from Manta or how to create these dashboards. Specific metrics are often documented with the components that provide them.)

For more information about Grafana (including how to build dashboards and interact with them), see the Grafana documentation.

For more information about Prometheus, see the official Prometheus documentation.

Once these dashboards are created, it’s recommended to create a landing page for incident responders that links to this guide as well as the endpoints for viewing these dashboards. Any credentials needed for these endpoints should be documented or distributed to responders.

Metrics

Real-time metrics provided by Manta form the basis of situational awareness, particularly during incident response. Understanding the metrics provided by the various components internal to Manta requires a basic understanding of these components and how they work together.

This section discusses a number of useful metrics exposed by Manta. When we say a metric is exposed by Manta, we usually mean that Triton (via CMON) and/or individual Manta components collect these metrics and serve them over HTTP in Prometheus format. Within a Manta deployment, it’s up to operators to set up systems for collecting, presenting, and alerting on these metrics. For more on this, see Deployment-specific details. There are screenshots from existing deployments below, but the specific metrics available and the appearance of graphs may vary in your deployment.

Note
This section covers operational metrics used to understand Manta’s runtime behavior. These are not end-user-visible metrics.
Note
This section covers what information is available and how to interpret it. Unfortunately, there is no single place today that documents the list of metrics available.

Key service metrics

Key metrics for assessing the health of a service are driven by whatever the customer cares about. For Manta, that’s typically:

  • the error rate

  • throughput of data in and out of Manta

  • latency

These are closely related to the Four Golden Signals described in the Site Reliability Engineering book from Google. That section provides a good summary of these metrics and why they’re so important.

A Manta deployment typically seeks to achieve a particular level of performance, usually expressed in terms of throughput (objects per second read or written or bytes per second read or written) or latency (the time required to complete each request) while maintaining an acceptable error rate. Errors must be considered when looking at performance, since many types of error responses can be served very quickly — that doesn’t mean the service is working as expected!

Note
We use the term error rate to refer to the fraction of requests that failed with an error (e.g., "3% of responses were 500-level errors" would be a 3% error rate). The same term is sometimes used to refer to the count of errors per unit time (e.g., "3 errors per second"), but that’s usually less useful for describing the service’s health.

Since there may be thousands of requests completed per second at any given time, when understanding latency, we almost always use some aggregated form of latency:

  • Average latency is useful for use-cases that primarily care about throughput because the average latency is closely related to throughput: throughput (in requests per second) equals the total number of clients divided by the average latency per request. When all you know is that throughput has gone down, it’s hard to tell whether there’s a problem because it could reflect a server issue or a lack of client activity. Average latency resolves this ambiguity: if average latency has gone up, there’s likely a server problem. If not, clients have likely stopped trying to do as much work.

  • Tail latency refers to the latency of the slowest requests (the tail of the latency distribution). Tail latency is often a better indicator of general service health than average latency because small or occasional problems often affect tail latency significantly even when they don’t affect the average that much. Tail latency is often expressed as a percentile: for example, the shorthand "p90" refers to the 90th percentile, which is the minimum latency of the slowest 10% of requests. Similarly, p99 is the minimum latency of the slowest 1% of requests; and so on. When we say that the p99 is 300ms, we mean that 99% of requests completed within 300ms.

Not only do we use these metrics (error rate, throughput, and latency) to describe Manta’s overall health, but we also use them to describe the health of the various components within Manta.

Top-level Manta metrics

To understand Manta’s overall health, we typically look at the error rate, throughput, and latency for Muskie (webapi) since Muskie most directly handles the HTTP requests being issued by end users.

Note
If you’re looking for help understanding a Manta problem, see the Incident Response Decision Tree. This section provides background on the metrics referenced in that section.

Errors at webapi: For background on errors in Manta, see Investigating an increase in error rate. That section also discusses how to debug specific errors. At this level, we’re typically talking about explicit 500-level errors. When evaluating an error rate against expectations, we usually use its inverse, the success rate: the fraction of requests that completed successfully. This is often measured in "nines". Generally, if a deployment seeks a particular level of availability (e.g., 99.9%), an incident may be raised if the error rate exceeds the corresponding target percentage (e.g., 0.1%) for some period of time (e.g., 5 minutes); the incident may be considered resolved once the error rate falls back below this threshold. The specific targets vary by deployment, but availability targets are usually at least 99%. Generally, errors indicate a software failure, a significant server-side failure, or unexpected overload, none of which is supposed to be common in production deployments.

Throughput at webapi: Manta throughput can be measured in

  • objects created per second

  • objects fetched per second

  • bytes written per second

  • bytes read per second

Any of these might be significant to end users. The target throughput in any of these dimensions is deployment-specific. As mentioned above, it’s very difficult to use throughput directly to assess the server’s behavior because it’s significantly affected by client behavior. Average latency may be more useful.

Latency at webapi: For webapi, we usually define latency as the time-to-first-byte for each request. For uploads, this is the time between when the request is received at the server and when the server tells the client to proceed with the upload. For downloads, this is the time between when the request is received and when the client receives the first bytes of data. For other requests (which do not involve object data transfer), we look at the whole latency of the request. As we’ve defined it, latency includes the time used by the server to parse the request, validate it, authenticate it, authorize it, load metadata, and initiate any data transfer, but it does not include the time to transfer data. This is useful, since transfer time depends on object size, and we usually want to factor that out.

For general system health, we typically monitor Muskie error rate and tail latency (p90, p99). When throughput is important, average latency is also useful for the reasons mentioned above.
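
To make this concrete, the sketch below shows how these numbers might be computed with ad hoc queries against the Prometheus HTTP API, using the webapi metric names described in the summary table later in this chapter. The hostname, port, and especially the label used for the HTTP status code are assumptions; check your deployment’s Prometheus configuration and metric labels before relying on these queries.

# Placeholder endpoint: adjust hostname/port for your Prometheus instance.
# Fraction of webapi requests over the last 5 minutes that returned a
# 500-level status.  The "statusCode" label name is an assumption; check
# the labels your deployment actually exposes.
curl -sG http://prometheus.example.com:9090/api/v1/query \
    --data-urlencode 'query=
        sum(rate(http_requests_completed{statusCode=~"5.."}[5m]))
          / sum(rate(http_requests_completed[5m]))'

# Estimated webapi p99 latency (milliseconds), assuming the usual Prometheus
# histogram bucket naming convention for http_request_latency_ms.
curl -sG http://prometheus.example.com:9090/api/v1/query \
    --data-urlencode 'query=
        histogram_quantile(0.99,
          sum(rate(http_request_latency_ms_bucket[5m])) by (le))'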

Key metrics internal to Manta

When understanding problems with Manta, we use the same key metrics — error rate, throughput, and latency — measured by various other components.

Note
If you’re looking for help understanding a Manta problem, see the Incident Response Decision Tree. This section provides background on the metrics referenced in that section.

Metadata tier: We generally consider the metadata tier to include:

  • Electric-Moray, which is the gateway for nearly all requests to the metadata tier that come from the data path

  • Moray, each instance of which handles requests for a particular database shard

  • PostgreSQL (deployed under Manatee), which ultimately services requests to the metadata tier

Electric-Moray and Moray both operate in terms of RPCs. They both expose metrics that count the number of RPCs completed and failed, as well as histograms that can be used to calculate average latency and estimate tail latency.

PostgreSQL operates in terms of transactions. Manta exposes metrics collected from PostgreSQL about transactions completed and aborted, but not latency. We typically use Moray latency as a proxy for PostgreSQL latency.

Storage tier: We do not currently record request throughput, latency, or error rate from the storage tier. However, Triton (via CMON) collects network bytes transferred and ZFS bytes read and written, which are useful proxies for inbound and outbound data transfer.

Other operational metrics

Manta exposes a number of other useful metrics:

  • CPU utilization, broken out by zone (and filterable by type of component). For stateless services (i.e., most services within Manta), this is a useful way to determine whether instances (or a whole service) are overloaded. For example, webapi instances are typically deployed with 16 effectively single-threaded processes. If a webapi instance is using close to 1600% CPU (i.e., the equivalent of 16 CPUs), it’s likely overloaded, and end users are likely to experience elevated latency as a result. To interpret these values, you generally have to know how many CPUs a particular component can typically use. (For a live spot check on a single compute node, see the prstat sketch after this list.)

  • Disk utilization, broken out by zone (and filterable by type of component). This is useful for understanding disk capacity at both the metadata tier and storage tier.

  • PostgreSQL active connections, broken out by shard. This roughly reflects how much concurrent work is happening on each PostgreSQL shard. This can be useful for identifying busy or slow shards (though it can be hard to tell if a shard is slow because it’s busy or if it’s busy because it’s slow).

  • PostgreSQL vacuum activity, described below.

  • TCP errors (collected via Triton’s CMON), including failed connection attempts, listen drops, and retransmitted packets. These can reflect various types of network issues.

  • OS anonymous allocation failures (collected via Triton’s CMON). This particular event indicates that a process attempted to allocate memory but failed because it has reached a memory cap. Many programs do not handle running out of memory well, so these allocation failures can sometimes result in cascading failures.
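
If you want to spot-check zone-wide CPU utilization live on a single compute node (rather than through the dashboards described above), the prstat tool summarizes CPU usage per zone. A minimal sketch, run from the global zone:

# Show processes plus a per-zone summary of CPU and memory usage, refreshing
# every second.  The CPU column is a percentage of the whole machine, so
# interpret it against the number of CPUs the zone's processes can actually use.
prstat -Z 1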

Summary of metrics

Below is a rough summary of the metrics exposed by Manta and which components expose them. There are several caveats:

  • This information is subject to change without notice as the underlying software evolves!

  • This table does not describe which Prometheus instances collect, aggregate, and serve each metric. See Deployment-specific details.

  • Relatedly, in large deployments, Prometheus recording rules may be used to precalculate important metrics. These are not documented here.

  • Many metrics provide a number of breakdowns using Prometheus labels.

Component being measured Where the metric is collected Metric name Notes

Manta itself

webapi

muskie_inbound_streamed_bytes

count of bytes uploaded to Muskie (all uploads, including in-progress and failures). This is a primary metric for end users.

Manta itself

webapi

muskie_outbound_streamed_bytes

count of bytes downloaded from Muskie (all downloads, including in-progress and failures). This is a primary metric for end users.

Manta itself

webapi

http_requests_completed

count of requests completed, with labels for individual HTTP response codes. This can be used to calculate the error rate as well. This is the basis for several primary metrics for end users.

Manta itself

webapi

http_request_latency_ms

histogram of request latency, used to calculate average latency and to estimate percentiles. This is a primary metric for end users.

Electric-Moray

electric-moray

fast_requests_completed

count of requests completed, with a label for RPC method name. This is useful for measuring overall throughput at the metadata tier.

Electric-Moray

electric-moray

fast_server_request_time_seconds

histogram of RPC latency, with a label for RPC method name. This is useful for calculating average latency and estimating tail latency at the metadata tier.

Moray

moray

fast_requests_completed

Same as for electric-moray, but this is measured for a particular Moray instance (and so a particular database shard).

Moray

moray

fast_server_request_time_seconds

Same as for electric-moray, but this is measured for a particular Moray instance (and so a particular database shard).

PostgreSQL

pgstatsmon

Various

A number of stats exposed by PostgreSQL are collected and exposed by pgstatsmon. For the authoritative set, see the pgstatsmon source. These stats are named according to their PostgreSQL names, so for example the xact_commit stat in the pg_stat_database view is exposed as pg_stat_database_xact_commit. Labels are used to identify the PostgreSQL instance, which can often be used to break out by shard.

TCP stack

CMON (in Triton)

tcp_listen_drop_count, tcp_listen_drop_Qzero_count

count of the number of times a TCP connect attempt was dropped on the server side, often due to overload. This is useful for identifying TCP server problems.

TCP stack

CMON (in Triton)

tcp_failed_connection_attempt_count

count of the number of times a TCP connect attempt failed on the client side. This is useful for identifying when clients are having issues, even if you can’t see corresponding server-side failures.

TCP stack

CMON (in Triton)

tcp_retransmitted_segment_count

count of the number of times a TCP packet was retransmitted. This can indicate a network problem or a software problem on either end of the TCP connection, but interpreting this stat is difficult because there are many non-failure cases where packets may be retransmitted.

OS

CMON (in Triton)

mem_anon_alloc_fail

count of the number of times an operating system process attempted to allocate memory but failed because the container would exceed its cap. This often indicates a type of memory exhaustion.

OS

CMON (in Triton)

cpu_user_usage, cpu_sys_usage

count of the number of nanoseconds of CPU time (user and system time, respectively) used by a given container, with labels for the container being measured. This is useful for understanding CPU usage, including problems of CPU saturation.

ZFS (filesystem)

CMON (in Triton)

zfs_used, zfs_available

gauge of the number of bytes used and available, with labels for the container being measured. This is useful for identifying containers that are low on disk space and for understanding overall system storage capacity.

Predicting autovacuum activity

Background on vacuum in Manta

Autovacuum activity in PostgreSQL is a major source of degraded performance in large deployments, known to cause throughput degradation of as much as 70% on a per-shard basis. It’s helpful for operators to understand some of the basics of autovacuum; a deeper understanding requires digging rather deep into PostgreSQL internals. The PostgreSQL documentation describes autovacuum, the reason for it, and the conditions that trigger it in detail.

Operators should understand at least the following:

  • "Vacuum" is a long-running activity that runs on a per-table basis. This is a maintenance operation that generally has to be run periodically on all tables in all PostgreSQL databases.

  • "Autovacuum" is the name for any vacuum that is scheduled and managed by PostgreSQL itself, as opposed to one that an operator kicks off explicitly.

  • Manta has two primary tables: "manta" and "manta_directory_counts". As mentioned above, each vacuum operation runs on one table at a time (though multiple vacuums can be running at the same time on different tables.)

  • There’s generally a significant degradation in both average query latency and tail latency while vacuum is running. In fixed-concurrency deployments (i.e., when there are a fixed number of clients), an increase in average latency corresponds directly to a decrease in throughput.

We classify vacuum operations into two types:

  • Normal vacuums clean up tuples (rows) in the table that have been invalidated since the last vacuum (usually by UPDATE and DELETE operations). PostgreSQL kicks these off whenever the number of dead tuples exceeds a configurable fraction of the table size, which is generally 20%.

  • "Anti-wraparound vacuums" (also sometimes called "wraparound vacuums", "freeze vacuums", or "aggressive vacuums") are responsible for freezing old tuples. PostgreSQL kicks these off whenever it’s been more than a fixed number of transactions since the last time this was done.

Note that each type of vacuum may do the work of the other. A normal vacuum may freeze some tuples, and a freeze vacuum will generally clean up dead tuples. This classification is about what caused PostgreSQL to start the vacuum, and it’s useful because we can monitor the underlying metrics in order to predict when PostgreSQL will kick off vacuum operations.

Again, there’s significantly more information about all of this in the above-linked PostgreSQL documentation.

Using metrics to predict normal autovacuums

As mentioned above, a normal vacuum is kicked off when the number of dead tuples has exceeded 20% of the total tuples in the table. We can see this in Manta metrics. Here’s a graph of live tuples, dead tuples, and the fraction of dead tuples for a made-up table called "test_table" (normally in Manta this would be the "manta" or "manta_directory_counts" table):

metrics postgresql tuples autovac
Figure 7. A graph of dead tuples, live tuples, and autovacuum activity in a test environment.

In this graph:

  • In the upper graph, the green line shows live tuples. This system is running a heavy INSERT workload, so the count of live tuples increases relatively constantly.

  • In the upper graph, the yellow line shows dead tuples. A fraction of this workload runs UPDATE queries, so there’s a steady increase in dead tuples as well.

  • In the upper graph, the blue line (which goes with the right-hand y-axis) shows the percentage of tuples in the table that are dead. This value also climbs, though not at a linear rate.

  • In the bottom graph, the green bars represent periods where a normal vacuum was running. (You can ignore the yellow bars in this graph.)

Critically, autovacuum starts running when the blue line reaches 20%, for the reasons described above. Further, when the vacuum finishes, the count (and fraction) of dead tuples drops suddenly because the vacuum has cleaned up those dead tuples. As a result, the blue line can be used to predict when normal vacuums will kick off.
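
If you need to check these numbers directly on a shard rather than through the dashboards, they ultimately come from PostgreSQL’s statistics views. A minimal sketch, run from within the relevant "postgres" zone; the database name and connection options shown are assumptions and vary by deployment:

# Database name ("moray") and connection options are deployment-specific.
# Show live tuples, dead tuples, the dead-tuple fraction, and the last
# autovacuum time for each user table.
psql -d moray -c "
    SELECT relname,
           n_live_tup,
           n_dead_tup,
           round(n_dead_tup::numeric
                 / greatest(n_live_tup + n_dead_tup, 1), 3) AS dead_frac,
           last_autovacuum
      FROM pg_stat_user_tables
     ORDER BY dead_frac DESC;"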

Using metrics to predict anti-wraparound autovacuums

As mentioned above, an anti-wraparound vacuum is kicked off on a table when the number of transactions in a database that have been executed since the last such vacuum exceeds some threshold. Manta exposes this metric as well.

Typically, as a workload runs, the transactions-until-wraparound-vacuum decreases at a rate determined by how many transactions are running in the database. For a single shard, we can plot this on a line graph (one graph for each major table):

metrics postgresql wraparound leadup
Figure 8. Graphs of the number of transactions left until the next wraparound autovacuum kicks off for the "manta" and "manta_directory_counts" tables.

For a large number of shards, we can plot this as a heat map, which helps us see the pattern across shards:

metrics postgresql wraparound leadup heatmap
Figure 9. Heat maps of the number of transactions left until the next wraparound autovacuum. Brighter blocks indicate a larger number of data points (shards).

In the right-hand heat map, the bright line above 400M indicates that most shards are over 400 million transactions away from the next wraparound autovacuum. The darker line around 100M shows that a smaller number are much closer to the threshold. The left-hand heat map shows much greater variance for the "manta" table, though there’s a cluster (a bright line) just under 100M transactions from the next wraparound autovacuum.

When any of these lines reaches zero, we’d expect PostgreSQL to kick off a wraparound autovacuum. The line will continue decreasing (to negative numbers) until the wraparound autovacuum completes, at which point it will jump back up. Here, we can see a whole wraparound autovacuum cycle:

metrics postgresql wraparound duration
Figure 10. Graphs of the number of transactions until the next wraparound autovacuum, plus the number of wraparound autovacuums running, for a particular shard over the course of one autovacuum operation.

We see here that we’d expect a wraparound autovacuum to kick off when the count reaches 0. It keeps falling until the vacuum completes, at which point it jumps back up. Another round will kick off when the line reaches zero again. (Note that the lower graph here is a prediction, based directly on the graph above it. It’s possible (though not common in practice) that PostgreSQL won’t actually have kicked off the wraparound autovacuum at this time.)

Because of this behavior, the graph of transactions until wraparound autovacuum can be used to predict when wraparound autovacuums will kick off for each shard.
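
You can also check this directly on a shard by comparing each table’s transaction age against the configured autovacuum_freeze_max_age. Again, a minimal sketch to run from within the relevant "postgres" zone, with the same caveats about database name and connection options; note that it ignores any per-table autovacuum_freeze_max_age override:

# Database name ("moray") and connection options are deployment-specific.
# For the two main Manta tables, show the current transaction age and how
# many transactions remain before an anti-wraparound autovacuum would start
# (per the global autovacuum_freeze_max_age; per-table overrides, if any,
# are not considered here).
psql -d moray -c "
    SELECT c.relname,
           age(c.relfrozenxid) AS xid_age,
           current_setting('autovacuum_freeze_max_age')::bigint
             - age(c.relfrozenxid) AS txns_until_wraparound_autovacuum
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE n.nspname = 'public'
       AND c.relname IN ('manta', 'manta_directory_counts');"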