Scheduled SuiteScripts That Hit External APIs: A Production Survival Guide

Most scheduled SuiteScripts I read that talk to an external API treat failure as a single category: something went wrong, log it, move on. That works in dev. In production it costs you nights.

The real shape of failure in a NetSuite-to-API integration is three different things that look the same on the surface but need very different responses. Get this wrong and you'll either retry forever against a misconfigured API key, or abort the whole script when one bad record could have been quietly skipped, or silently drop a thousand invoices into a "submitted=false, error=null" state where nobody knows they exist.

This is the pattern I run in production scheduled scripts that hit external APIs. Real code, nothing theoretical. The scope of this post is specifically the "iterate over a queue of NetSuite records, POST each one to a third-party API, stamp the result" shape. Other scheduled-script shapes (cleanup jobs, batch field updates, internal aggregations) need a different set of guards.

The three tiers of failure

Every error from an external API call falls into exactly one of three buckets:

Permanent. Something about this specific record will never succeed. Bad customer email, missing required field, business-rule violation upstream. No retry helps. The right response is to mark the record with a permanent error message, exclude it from future runs, and move on to the next.
Retryable. A transient problem. Network blip, 5xx from the upstream, rate-limit. Retry within the same run with a small budget. If it still fails, leave the record in the queue for the next scheduled run.
Systemic. Something is broken about the integration itself. Your API key is wrong. The endpoint moved. Your account got suspended. EVERY record will fail the same way. Halt the script immediately and alert someone. Don't retry, don't move on.

The hardest part is classifying correctly. Get it from the HTTP status code where you can:

classify-error.js

const SYSTEMIC_ERROR_CODES = [400, 401, 403, 404, 405, 422];

const classify = (statusCode) => {
  if (statusCode >= 200 && statusCode < 300)        return 'success';
  if (statusCode === 419)                            return 'duplicate'; // already submitted
  if (SYSTEMIC_ERROR_CODES.includes(statusCode))     return 'systemic';
  return 'retryable'; // 429 and 5xx. Network errors throw instead; the caller's catch treats them the same way.
};

Some details that matter:

422 is systemic, not retryable. Unprocessable Entity usually means your request shape is wrong (a field your payload builder never sends, a malformed body), so every record built the same way fails the same way. Retrying the same payload won't help. Mark systemic and stop.
419 (where applicable) means duplicate. Some APIs return this when the same order ID has already been submitted. Treat it as success and mark the record submitted, because that's the end state the user wanted.
429 is retryable but special. If you retry rate-limit errors inside the same run, add an exponential backoff or you'll just hammer the API harder. For a scheduled script that runs once a day, deferring to the next run is often the cleanest response.
Network errors throw exceptions, not status codes. Wrap the call in try/catch. An exception falls into the retryable bucket by default.
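The status-only classifier above never returns the permanent tier, because a status code alone rarely says whether one record is bad or the whole integration is. One possible extension is sketched below. It assumes a hypothetical API that returns a JSON error body with a machine-readable code; the body shape, the code names, and the `classifyResponse` and `readErrorCode` helpers are all illustrative, not part of any real API.

```javascript
// Assumption: the upstream API returns a JSON error body like
// { "error": { "code": "INVALID_EMAIL" } }. These code names are hypothetical.
const PERMANENT_BODY_CODES = ['INVALID_EMAIL', 'MISSING_REQUIRED_FIELD', 'BUSINESS_RULE_VIOLATION'];
const SYSTEMIC_STATUS_CODES = [400, 401, 403, 404, 405, 422];

// Pull the error code out of the body; returns null when absent or unparseable.
const readErrorCode = (body) => {
  try {
    const parsed = JSON.parse(body);
    return (parsed && parsed.error && parsed.error.code) || null;
  } catch (e) {
    return null;
  }
};

const classifyResponse = (statusCode, body) => {
  if (statusCode >= 200 && statusCode < 300) return 'success';
  if (statusCode === 419) return 'duplicate';
  // Check the body before the status list, so a per-record validation
  // failure is not promoted to a systemic halt.
  if (PERMANENT_BODY_CODES.includes(readErrorCode(body))) return 'permanent';
  if (SYSTEMIC_STATUS_CODES.includes(statusCode)) return 'systemic';
  return 'retryable';
};
```

If the API gives you nothing machine-readable in the body, the safer default is the one in the original classifier: treat the 4xx as systemic and let a human look.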

The per-record retry loop

Within a single record, retry a couple of times before giving up. Don't loop infinitely. Two attempts is usually enough; three is wasteful unless the API is genuinely flaky.

process-record.js

const MAX_RETRY_ATTEMPTS = 2;

// Returns false only when the whole run must halt (systemic failure); true otherwise.
const processRecord = (rec, counters) => {
  let lastFailure = '';

  for (let attempt = 1; attempt <= MAX_RETRY_ATTEMPTS; attempt++) {
    try {
      const response = submitToApi(rec);

      if (response.success) {
        stampSubmitted(rec.id, response.body);
        counters.successCount++;
        counters.consecutiveRetryable = 0; // reset on success
        return true;
      }

      // Systemic. Halt the whole script.
      if (isSystemicError(response.statusCode)) {
        sendSystemicAlert(response.statusCode, response.body, counters.successCount);
        counters.halted = true;
        return false;
      }

      // Retryable status (429, 5xx). Fall through to the next attempt.
      lastFailure = 'HTTP ' + response.statusCode;
    } catch (e) {
      // Network error. Same treatment as a 5xx.
      lastFailure = 'Network error: ' + e.message;
    }
  }

  // Retry budget spent. Count it once, log it, and move on to the next record.
  counters.errorCount++;
  counters.consecutiveRetryable++;
  log.error({ title: 'Retryable failure', details: 'Record ' + rec.id + ': ' + lastFailure });
  return true;
};

The consecutiveRetryable counter is the load-bearing part. It's a circuit breaker.
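The `submitToApi` helper that `processRecord` calls is not shown in this post. A minimal sketch of what it might look like follows, assuming the standard `N/https` module (`https.post` returning a response with `code` and `body`). The module is passed in as a parameter so the sketch can run outside NetSuite, and `endpoint`, `apiKey`, `buildPayload`, and the Bearer-token header are assumptions about the upstream API, not facts about it.

```javascript
// Assumptions: `https` is the N/https module (injected), and `endpoint`,
// `apiKey`, `buildPayload`, and Bearer auth are hypothetical details.
const makeSubmitToApi = ({ https, endpoint, apiKey, buildPayload }) => (rec) => {
  // https.post throws on network failures; the caller's catch handles that.
  const response = https.post({
    url: endpoint,
    body: JSON.stringify(buildPayload(rec)),
    headers: {
      'Content-Type': 'application/json',
      Authorization: 'Bearer ' + apiKey,
    },
  });

  const statusCode = response.code;
  return {
    // A 419 duplicate counts as success: the record already exists upstream.
    success: (statusCode >= 200 && statusCode < 300) || statusCode === 419,
    statusCode,
    body: response.body,
  };
};
```

Folding the duplicate case into `success` here keeps the retry loop unaware of it, which matches the "treat it as success" rule from the classification section.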

The circuit breaker

If three records in a row hit retryable failures, the upstream is almost certainly down. Burning through 500 invoices when the API is offline is wasted execution time and wasted governance budget. Stop, alert, let the next scheduled run try again.

main-loop.js

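const MAX_CONSECUTIVE_RETRY = 3;

const script = runtime.getCurrentScript(); // from N/runtime
const startTime = Date.now();
const counters = { successCount: 0, errorCount: 0, consecutiveRetryable: 0, halted: false };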

for (const rec of records) {
  // governance + wall-clock guards (next section)
  if (script.getRemainingUsage() < GOVERNANCE_THRESHOLD) break;
  if (Date.now() - startTime >= MAX_RUNTIME_MS)         break;

  const shouldContinue = processRecord(rec, counters);
  if (!shouldContinue) break; // systemic, already alerted

  if (counters.consecutiveRetryable >= MAX_CONSECUTIVE_RETRY) {
    sendSystemicAlert(
      0,
      counters.consecutiveRetryable + ' consecutive retryable errors. Possible API outage.',
      counters.successCount,
    );
    counters.halted = true;
    break;
  }
}

The circuit breaker fires after three retryable-in-a-row, not three retryable total. A record that fails its first attempt and succeeds on the retry doesn't count; only a record that exhausts its retry budget increments the counter. The counter resets on every success. This is the right semantic: "three in a row" tells you the upstream is broken, while "three in total" tells you nothing.
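To make the "in a row" versus "in total" distinction concrete, here is a small standalone sketch that replays a list of per-record outcomes through the same counter logic. The `findBreakerTrip` helper and the outcome labels are illustrative only; they are not part of the script above.

```javascript
// Each outcome is the final result for one record after its retries:
// 'ok' or 'retryable'. Returns the index at which the breaker trips, or -1.
const findBreakerTrip = (outcomes, maxConsecutive) => {
  let consecutive = 0;
  for (let i = 0; i < outcomes.length; i++) {
    consecutive = outcomes[i] === 'retryable' ? consecutive + 1 : 0;
    if (consecutive >= maxConsecutive) return i;
  }
  return -1;
};

// Three failures in total, never three in a row: the breaker stays closed.
const scattered = findBreakerTrip(['retryable', 'ok', 'retryable', 'ok', 'retryable'], 3); // -1

// Three in a row: the breaker trips on the third consecutive failure.
const outage = findBreakerTrip(['ok', 'retryable', 'retryable', 'retryable', 'ok'], 3); // 3
```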

Governance and wall-clock guards

Scheduled SuiteScripts have two ways to die badly. Either they run out of usage units (governance) or they hit the 60-minute wall-clock cap and NetSuite kills the execution. In both cases, anything you were going to do but hadn't yet is just dropped on the floor.

The fix is to check both before every record:

guards.js

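// These checks run at the top of the for loop in main-loop.js, which is why they `break`.
// `script` is runtime.getCurrentScript() (N/runtime); `startTime` is Date.now() captured at the start of the run.
const GOVERNANCE_THRESHOLD = 200; // bail when we have less than this left
const MAX_RUNTIME_MS = 55 * 60 * 1000; // 55 min. NS hard-caps at 60.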

if (script.getRemainingUsage() < GOVERNANCE_THRESHOLD) {
  log.audit({ title: 'Yield', details: 'Governance threshold reached.' });
  break; // exit the loop cleanly; remaining records picked up next run
}

if (Date.now() - startTime >= MAX_RUNTIME_MS) {
  log.audit({ title: 'Yield', details: 'Runtime limit reached.' });
  break;
}

The threshold values matter. 200 governance units is enough headroom to update one more record, log the audit, send a summary email, and exit cleanly. 55 minutes (not 60) gives you the same headroom in time.

The records you don't get to should be designed so they're naturally picked up on the next scheduled run. In my case, "submitted=true" is the exit flag. Anything where that flag is still false (and no permanent error) is automatically eligible next time. No external queue, no state outside NetSuite. The flag IS the queue.

All of this assumes that, when a run starts, the custom fields exist and the API credentials are valid. If the fields have gone missing or the secrets are wrong, the error handling above is moot because the script fails at step zero. The init check pattern catches that.

Permanent errors stay sticky

The other half of "what runs next time" is permanent errors. If a record has a permanent error and you don't mark it, the next run picks it up, fails it again, and stamps it again. You burn governance on records that will never succeed.

The fix is a permanent-error field on the source record. Once it's populated, the SuiteQL query in the next run excludes it:

SELECT id, tranid /* ... */
  FROM Transaction t
 WHERE t.type = 'CustInvc'
   AND (t.custbody_submitted = 'F' OR t.custbody_submitted IS NULL)
   AND (t.custbody_error IS NULL OR t.custbody_error = '')
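One way to run that query at the start of each execution is sketched below, assuming the standard `N/query` module (`runSuiteQL` and `asMappedResults`). The module is injected so the sketch runs outside NetSuite, and the `loadEligibleRecords` name is illustrative.

```javascript
// Assumption: `query` is the N/query module, injected by the caller.
const ELIGIBLE_INVOICES_SQL = `
  SELECT t.id, t.tranid
    FROM Transaction t
   WHERE t.type = 'CustInvc'
     AND (t.custbody_submitted = 'F' OR t.custbody_submitted IS NULL)
     AND (t.custbody_error IS NULL OR t.custbody_error = '')`;

// Returns plain objects so the rest of the script never touches the result set.
const loadEligibleRecords = (query) =>
  query
    .runSuiteQL({ query: ELIGIBLE_INVOICES_SQL })
    .asMappedResults()
    .map((row) => ({ id: row.id, tranid: row.tranid }));
```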

Permanent errors are sticky on purpose. A human has to clear the error field to re-attempt. Usually they're fixing the underlying issue (adding a customer email, correcting a missing field) and then clearing the error. The script never re-attempts on its own. That's the right default. If the record failed permanently, retrying without human review is at best wasted work and at worst the same wrong write twice.
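The write side of that contract can be sketched as two small helpers, assuming the standard `N/record` module and its `submitFields` call. The module is injected so the sketch runs outside NetSuite. The field IDs are the ones from the query above; `makeStampers`, `stampPermanentError`, and the 300-character cap are assumptions for illustration.

```javascript
// Assumptions: `record` is the N/record module (injected), and the length cap
// is an arbitrary guard for a text field, not a documented requirement.
const MAX_ERROR_LENGTH = 300;

const makeStampers = (record) => ({
  // Done: the record leaves the queue.
  stampSubmitted: (id) =>
    record.submitFields({
      type: 'invoice',
      id,
      values: { custbody_submitted: true, custbody_error: '' },
    }),

  // Quarantined: excluded from future runs until a human clears the field.
  stampPermanentError: (id, message) =>
    record.submitFields({
      type: 'invoice',
      id,
      values: { custbody_error: String(message).slice(0, MAX_ERROR_LENGTH) },
    }),
});
```

Note that nothing in the script ever clears `custbody_error` on a failed record. That is what keeps the error sticky.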

The summary email is the operations channel

At the end of every run, send a summary email with stat cards, a table of failed records (each one linked directly to the record in NetSuite for one-click access), and any soft warnings the API surfaced. This is the thing your operations team actually reads. Make it good. Inline CSS so it renders in every email client. Link the record IDs so the recipient can click straight through to fix.
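A stripped-down sketch of the failed-records table is below. It assumes `failures` is an array of `{ id, tranid, message }` objects collected during the run, and that `resolveUrl` is a callback you supply to turn a record ID into a link (inside NetSuite, `N/url`'s `resolveRecord` could fill that role). The function names and styling are illustrative.

```javascript
// Escape API-supplied text before it goes into the email body.
const escapeHtml = (value) =>
  String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');

// Inline style repeated on every cell, since many email clients drop <style> blocks.
const CELL = 'padding:6px 10px;border:1px solid #ddd;font-family:Arial,sans-serif;font-size:13px;';

const buildSummaryHtml = ({ successCount, errorCount, failures }, resolveUrl) => {
  const rows = failures
    .map((f) =>
      '<tr>' +
      '<td style="' + CELL + '"><a href="' + escapeHtml(resolveUrl(f.id)) + '">' + escapeHtml(f.tranid) + '</a></td>' +
      '<td style="' + CELL + '">' + escapeHtml(f.message) + '</td>' +
      '</tr>')
    .join('');

  return (
    '<p style="font-family:Arial,sans-serif;">Submitted: <b>' + successCount + '</b>, failed: <b>' + errorCount + '</b></p>' +
    '<table style="border-collapse:collapse;">' +
    '<tr><th style="' + CELL + '">Invoice</th><th style="' + CELL + '">Error</th></tr>' +
    rows +
    '</table>'
  );
};
```

The returned string can be passed as the `body` of an `N/email` send call.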

Systemic errors get a separate, louder alert: subject prefixed with "CRITICAL:", red banner in the body, link to the API Secrets settings page (or wherever the failure is most likely to be fixed). Different urgency, different presentation, different recipient list if appropriate.

What I'd skip if you're getting started

You don't need a Map/Reduce script for this. A Scheduled Script handles batches of up to a few thousand records cleanly with the governance and wall-clock guards above. Map/Reduce is the right tool for genuinely parallel processing of independent items (millions of rows, no shared state), not for a sequential queue with an external API call.

You don't need exponential backoff on retries if your scheduled run is daily. Just defer to the next run. Backoff matters when you're inside a tight loop firing dozens of calls per second.

You don't need an external queue. NetSuite custom fields ARE the queue. submitted=true means done. Populated error means quarantined. Both blank means eligible. The data lives where the business logic lives.
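Those three states can be written down as a small function. This is a sketch for illustration: the field names mirror the custom fields in the query earlier in the post, and `queueState` is a hypothetical helper operating on a SuiteQL result row, where checkboxes come back as 'T' or 'F'.

```javascript
// Maps one SuiteQL result row to its queue state.
const queueState = (row) => {
  if (row.custbody_submitted === 'T') return 'done';
  if (row.custbody_error) return 'quarantined';
  return 'eligible';
};
```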
