ZIP: 173
Title: Bech32 Format
Owners: Daira-Emma Hopwood <daira-emma@electriccoin.co>
Credits: Pieter Wuille <pieter.wuille@gmail.com>
         Greg Maxwell <greg@xiph.org>
         Rusty Russell
         Mark Friedenbach
Status: Final
Category: Standards / Wallet
Created: 2018-06-13
License: MIT

Terminology

The key words "MUST", "MUST NOT", "SHOULD", and "SHOULD NOT" in this document are to be interpreted as described in BCP 14 1 when, and only when, they appear in all capitals.

The term "network upgrade" in this document is to be interpreted as described in ZIP 200. 4

The term "Sapling" in this document is to be interpreted as described in ZIP 205. 5

Abstract

This document proposes a checksummed base32 format, "Bech32", and a standard for Sapling addresses and keys using it.

Motivation

Since launch, Zcash has relied on Base58 addresses with a truncated double-SHA256 checksum, inherited from Bitcoin. They were part of the original Bitcoin software and their scope was extended in BIP 13 6 for P2SH (Pay-to-Script-Hash) addresses. However, both the character set and the checksum algorithm have limitations:

To address these issues, Bitcoin adopted a new encoding called Bech32 7, for use with address types added by its Segregated Witness proposal. Zcash does not implement Segregated Witness, but it reuses Bech32 with address and key types introduced by the Sapling network upgrade 5.

Since the description of Bech32 in 7 is partially entangled with Segregated Witness address formats, we have found it clearer to write this ZIP to specify Bech32 separately. This specification should be read in conjunction with section 5.6 ("Encodings of Addresses and Keys") of the Zcash Sapling protocol specification 2, and with ZIP 32 which specifies additional key types for support of shielded hierarchical deterministic wallets 3.

Specification

We describe the general checksummed base32 format called ''Bech32''. Its use for Sapling payment addresses and keys is defined in the Sapling protocol specification 2.

A Bech32 string consists of:

0 1 2 3 4 5 6 7
+0 q p z r y 9 x 8
+8 g f 2 t v d w 0
+16 s 3 j n 5 4 k h
+24 c e 6 m u a 7 l

Checksum

The last six characters of the data part form a checksum and contain no information. Valid strings MUST pass the criteria for validity specified by the Python3 code snippet below. The checksum is defined so that the function bech32_verify_checksum returns true when its arguments are:

  • hrp: the human-readable part as a string;
  • data: the data part as a list of integers representing the characters after conversion using the table above.
def bech32_polymod(values):
  GEN = [0x3b6a57b2, 0x26508e6d, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3]
  chk = 1
  for v in values:
    b = (chk >> 25)
    chk = (chk & 0x1ffffff) << 5 ^ v
    for i in range(5):
    chk ^= GEN[i] if ((b >> i) & 1) else 0
  return chk

def bech32_hrp_expand(s):
  return [ord(x) >> 5 for x in s] + [0] + [ord(x) & 31 for x in s]

def bech32_verify_checksum(hrp, data):
  return bech32_polymod(bech32_hrp_expand(hrp) + data) == 1

This implements a BCH code; in the case where the encoded string is at most 90 characters, this code guarantees detection of any error affecting at most 4 characters and has less than a 1 in 109 chance of failing to detect more errors. More details about the properties can be found in the Checksum Design section. The human-readable part is processed by first feeding the higher 3 bits of each character's US-ASCII value into the checksum calculation followed by a zero and then the lower 5 bits of each US-ASCII value.

To construct a valid checksum given the human-readable part and (non-checksum) values of the data-part characters, the code below can be used:

def bech32_create_checksum(hrp, data):
  values = bech32_hrp_expand(hrp) + data
  polymod = bech32_polymod(values + [0,0,0,0,0,0]) ^ 1
  return [(polymod >> 5 * (5 - i)) & 31 for i in range(6)]

Error correction

One of the properties of these BCH codes is that they can be used for error correction. An unfortunate side effect of error correction is that it erodes error detection: correction changes invalid inputs into valid inputs, but if more than a few errors were made then the valid input may not be the correct input. Use of an incorrect but valid input can cause funds to be lost irrecoverably. Because of this, implementations SHOULD NOT implement correction beyond potentially suggesting to the user where in the string an error might be found, without suggesting the correction to make.

Uppercase/lowercase

The lowercase form is used when determining a character's value for checksum purposes.

Encoders MUST always output an all-lowercase Bech32 string. If an uppercase version of the encoding result is desired (e.g. for presentation purposes, or QR code use), then an uppercasing procedure can be performed external to the encoding process.

Decoders MUST NOT accept strings where some characters are uppercase and some are lowercase (such strings are referred to as mixed-case strings).

For presentation, lowercase is usually preferable, but inside QR codes uppercase SHOULD be used, as those permit the use of alphanumeric mode, which is 45% more compact than the byte mode that would otherwise be used.

Encoding

  • Start with the bits of the raw encoding of the appropriate address or key type, most significant bit per byte first.
  • Re-arrange those bits into groups of 5, and pad with zeroes at the end if needed.
  • Translate those bits, most significant bit first, to characters using the table above.

Decoding

Software interpreting a Bech32-encoded address MUST first verify that the human-readable part matches that of a specified address type, and similarly for keys.

If this check passes, convert the rest of the data to bytes:

  • Translate the values using the table above to 5 bits, most significant bit first.
  • Re-arrange those bits into groups of 8 bits. Any incomplete group at the end MUST be 4 bits or fewer, MUST be all zeroes, and is discarded.
  • The resulting groups are interpreted as the raw encoding of the appropriate address or key type, with the most significant bit first in each byte.

Decoders SHOULD enforce known-length restrictions on address and key types.

For example, 2 specifies that the length of the raw encoding of a Sapling payment address is 43 bytes (88 + 256 bits). This results in a Bech32-encoded Sapling payment address being 78 characters long.

Compatibility

Only software supporting the Sapling network upgrade is able to use these addresses.

There is no effect on support for transparent addresses and keys, or for Sprout z-addresses and keys; these will continue to be encoded using Base58Check, and MUST NOT be encoded using Bech32.

Rationale

Why use base32 at all?

The lack of mixed case makes it more efficient to read out loud or to put into QR codes. It does come with a 15% length increase. However, the length of a Bech32-encoded Sapling payment address remains below 80 characters, which reduces the likelihood of line splitting in terminals or email. Thus, cutting-and-pasting of addresses is still reliable.

Why call it Bech32?

"Bech" contains the characters BCH (the error detection algorithm used) and sounds a bit like "base". The term Bech32 is established for Bitcoin and there was no reason to use a different name for it in Zcash Sapling.

Why not support Bech32 encoding of Sprout or transparent addresses?

This was not considered to be sufficiently well-motivated given the compatibility issues that would arise from having two formats for these address types, with pre-Sapling software not supporting the new format.

Why include a separator in addresses?

That way the human-readable part is unambiguously separated from the data part, avoiding potential collisions with other human-readable parts that share a prefix. It also allows us to avoid having character-set restrictions on the human-readable part. The separator is ''1'' because using a non-alphanumeric character would complicate copy-pasting of addresses (with no double-click selection in several applications). Therefore an alphanumeric character outside the normal character set was chosen.

Why not use an existing base32 character set?

Existing character sets for base-32 encodings include RFC 3548 and z-base-32. The set used for Bech32 was chosen to minimize ambiguity according to this visual similarity data, and the ordering is chosen to minimize the number of pairs of similar characters (according to the same data) that differ in more than 1 bit. As the checksum is chosen to maximize detection capabilities for low numbers of bit errors, this choice improves its performance under some error models.

Why are the high bits of the human-readable part processed first?

This design choice had a rationale for Bitcoin Segregated Witness addresses (see 7) that does not apply to Zcash Sapling addresses. It is retained for compatibility with Bech32 encoders/decoders written for Bitcoin.

Reference implementations

Note that the encoders written for Bitcoin may make assumptions specific to Segregated Witness address formats that do not apply to Zcash. Only the Python one has been tested with Zcash at the time of writing.

Examples

TODO: add valid Sapling payment addresses and keys, and their corresponding raw encodings.

Test vectors

The following strings are valid Bech32:

The following strings are not valid Bech32 (with reason for invalidity):

Checksum design

Design choices

BCH codes can be constructed over any prime-power alphabet and can be chosen to have a good trade-off between size and error-detection capabilities. Unlike Reed-Solomon codes, they are not restricted in length to one less than the alphabet size. While they also support efficient error correction, the implementation of just error detection is very simple.

We pick 6 checksum characters as a trade-off between length of the addresses and the error-detection capabilities, as 6 characters is the lowest number sufficient for a random failure chance below 1 per billion. For the length of data we're most interested in protecting (up to 77 bytes excluding the separator, for a Sapling payment address), BCH codes can be constructed that guarantee detecting up to 4 errors. Longer data is also supported with slightly weaker error detection.

Selected properties

Many of these codes perform badly when dealing with more errors than they are designed to detect, but not all. For that reason, we consider codes that are designed to detect only 3 errors as well as 4 errors, and analyse how well they perform in practice.

The specific code chosen here is the result of:

  • Starting with an exhaustive list of 159605 BCH codes designed to detect 3 or 4 errors up to length 93, 151, 165, 341, 1023, and 1057.
  • From those, requiring the detection of 4 errors up to length 71, resulting in 28825 remaining codes.
  • From those, choosing the codes with the best worst-case window for 5-character errors, resulting in 310 remaining codes.
  • From those, picking the code with the lowest chance for not detecting small numbers of ''bit'' errors.

As a naive search would require over 6.5 * 1019 checksum evaluations, a collision-search approach was used for analysis. The code can be found here.

Properties

The following table summarizes the chances for detection failure (as multiples of 1 in 109).

Window length Number of wrong characters
Length Description ≤4 5 6 7 8 ≥9
8 Longest detecting 6 errors 0 1.127 0.909 n/a
18 Longest detecting 5 errors 0 0.965 0.929 0.932 0.931
19 Worst case for 6 errors 0 0.093 0.972 0.928 0.931
39 Length ≤ 39 characters 0 0.756 0.935 0.932 0.931
59 Length ≤ 59 characters 0 0.805 0.933 0.931
71 Length ≤ 71 characters 0 0.830 0.934 0.931
89 Longest detecting 4 errors 0 0.867 0.933 0.931

TODO: fill in table for a Sapling payment address, and check the following paragraph.

This means that when 5 changed characters occur randomly distributed in the 77 characters (excluding the separator) of a Sapling payment address, there is a chance of at most 0.867 per billion that it will go undetected. When those 5 changes occur randomly within a 19-character window, that chance goes down to 0.093 per billion. As the number of errors goes up, the chance converges towards 1 in 230 = 0.931 per billion.

The chosen code performs reasonably well up to 1023 characters, but is best for lengths up to 89 characters (excluding the separator). Since the search for suitable codes was based on the requirements for Bitcoin P2WPKH and P2WSH addresses, it is not quite optimal for the address lengths used by Sapling, but the advantages of compatibility with existing Bitcoin libraries were considered to outweigh this consideration.

Acknowledgements

This document is closely based on BIP 173 written by Pieter Wuille and Greg Maxwell, which was inspired by the address proposal by Rusty Russell and the base32 proposal by Mark Friedenbach. BIP 173 also had input from Luke Dashjr, Johnson Lau, Eric Lombrozo, Peter Todd, and various other reviewers.

References

1 Information on BCP 14 — "RFC 2119: Key words for use in RFCs to Indicate Requirement Levels" and "RFC 8174: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words"
2 Zcash Protocol Specification, Version 2021.2.16 or later
3 ZIP 32: Shielded Hierarchical Deterministic Wallets
4 ZIP 200: Network Upgrade Mechanism
5 ZIP 205: Deployment of the Sapling Network Upgrade
6 BIP 13: Address Format for pay-to-script-hash
7 BIP 173: Base32 address format for native v0-16 witness outputs