(Where I had to manually test something, I used Godbolt. (Except for Ruby, which I tested locally because Godbolt doesn't permit passing command-line arguments to ruby. There's a box for it, but it gets parsed as a filename.) The characters I tested were é and 𐐀, and I only tried the latest x86_64/amd64 version of the compiler/interpreter if there were multiple versions to choose from.)
-
Unicode provides a recommendation for identifier syntax in Annex 31, defining the properties XID_Start and XID_Continue (and also ID_Start and ID_Continue, but these seem to be less used).
-
The C standards are paywalled, but according to cppreference.com, from C99 on, identifiers can contain \u and \U escape sequences, and implementations may accept actual Unicode characters. That lasts until the current draft version ("C23", although it isn't guaranteed to be finalized this year). As of C23, they may not contain escapes, but the XID_Start and XID_Continue properties are used. Also, as of C23, identifiers must be in Normalization Form C.
In practice? GCC permits unescaped Unicode characters, as well as $ on some targets. I can't find documentation on what Clang supports, but in practice it appears to accept both escaped and unescaped. MSVC doesn't even document support for escapes, but they seem to work fine, although unescaped characters are rejected. All three support characters beyond the BMP to exactly the same extent as they support characters within it.
Also, a fairly large set of identifiers are reserved for standard use (it's UB if you define them), and compilers are allowed to not distinguish identifiers that only differ after a certain length. Exactly how many characters is that? It depends on what kind of identifier it is, but since C99 it is generally long enough that it shouldn't matter. Still worth noting.
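To make the Normalization Form C requirement concrete, here is a minimal Python sketch using the standard library's unicodedata module. The helper name is_nfc is made up for illustration; it only shows what "being in NFC" means for a name and says nothing about how any particular C compiler checks it.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point (U+00E9)
decomposed = "e\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT


def is_nfc(name: str) -> bool:
    """Hypothetical helper: True if `name` is already in Normalization Form C."""
    return unicodedata.normalize("NFC", name) == name


# The two spellings render identically but are different strings.
print(composed == decomposed)
# Only the composed spelling is already in NFC.
print(is_nfc(composed), is_nfc(decomposed))
# Normalizing the decomposed spelling yields the composed one.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

A language that requires NFC would reject the decomposed spelling outright, while one that normalizes identifiers would treat the two spellings as the same name.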
-
The C++ standards are also paywalled, but again according to cppreference.com, the rules are the same as in C23 (even for earlier versions of C++, early enough that it doesn't say when this became the case).
In practice, GCC and Clang support escape sequences as well, and MSVC only accepts escape sequences. Once again, a variety of identifiers are reserved and cause UB if defined. All three support characters beyond the BMP to exactly the same extent as they support characters within it.
-
C# (and presumably other .NET languages) has its own specialized set of requirements. Also, you can define identifiers that overlap with keywords if you prefix them with @, or if you use a Unicode escape. (The @ is explicitly not part of the identifier.) In practice, entering characters above the BMP confuses the parser, even if they are escaped.
-
D defines its identifier syntax as allowing letters, digits, _, and "universal alphas", a term which is apparently defined in the C99 standard. As such, I have no idea what it means. In practice, D accepted é but not 𐐀.
-
Dart (it's in chapter 17.38) only allows ASCII letters, digits (after the first character), and _, and in some cases $. Also, a name is private iff it starts with an _ (chapter 6.2). Unlike Python, Dart enforces this.
-
Go uses the Unicode general category of a character to determine whether it is allowed in an identifier. Also, a name is exported if it starts with a capital letter. I imagine nobody designing the language really considered caseless writing systems.
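The caseless-script problem is easy to see with a rough Python sketch. The category sets below (letters are Lu, Ll, Lt, Lm, Lo plus underscore; digits are Nd; exported means the first character is Lu) are my reading of the Go spec and should be treated as an assumption, and both function names are made up. Keywords and the blank identifier are ignored.

```python
import unicodedata

# Assumed letter categories; see the caveat above.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_go_like_identifier(name: str) -> bool:
    """Hypothetical check: a letter first, then letters or decimal digits."""
    def is_letter(ch: str) -> bool:
        return ch == "_" or unicodedata.category(ch) in LETTER_CATEGORIES

    def is_digit(ch: str) -> bool:
        return unicodedata.category(ch) == "Nd"

    if not name:
        return False
    return is_letter(name[0]) and all(is_letter(c) or is_digit(c) for c in name[1:])


def is_exported(name: str) -> bool:
    """Hypothetical check: exported if the first character is an uppercase letter."""
    return bool(name) and unicodedata.category(name[0]) == "Lu"


for name in ("\u00c9lan", "\U00010400", "\u65e5\u672c"):  # Élan, 𐐀, 日本
    print(name, is_go_like_identifier(name), is_exported(name))
```

The last name is made of category Lo characters, which have no case, so under these assumptions it is a valid identifier that can never count as exported.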
-
Haskell allows identifiers to begin with ASCII or Unicode letters, and continue with ASCII or Unicode letters, ASCII or Unicode decimal digits (it specifies "decimal"; in practice, however, it also accepted 𝋮, the Mayan numeral 14), or '. The first letter of the identifier is capitalized for some kinds of referent (types, modules, etc) and lowercase for others (functions, etc). There are no fewer than six namespaces, and names may be used for multiple referents in multiple namespaces almost freely.
-
Java (and presumably other JVM languages) has its own specialized set of requirements, which are conveniently exposed by the standard library. Notably, they're always evaluated according to Unicode 10. The standard library also offers checkers for whether a character can start or continue a "Unicode identifier". I can't find any information on what exactly that is, at least in a cursory search. The decompiler doesn't like non-ASCII characters in identifiers, though, replacing them with ?. Characters above the BMP work fine, modulo the decompiler issue.
-
JavaScript/ECMAScript uses ID_Start and ID_Continue (not the XID versions), but also allows $. Unicode escape sequences are permitted, but only for ease of typing; you can't use them to create an identifier that would otherwise be invalid. You can't even use them to avoid collisions with keywords. There are multiple kinds of keywords with differing behaviour if you try to use them as identifiers.
-
Julia has its own specialized set of requirements, which are based on Unicode general categories, but with various exceptions. Notably, most mathematical operators can be used as identifiers some of the time (they must be wrapped in parens to disambiguate some cases). Identifiers are normalized to form C, and a few custom equivalences also apply. By (perhaps ill-advised) convention, identifiers usually do not contain any word separation.
-
Lua permits only ASCII letters and digits, and underscores.
-
Nim requires identifiers to be made of letters, digits, and underscores; to start with a letter; to not end with an underscore; and to not contain two underscores in a row. All non-ASCII characters are considered letters, but this may change in the future.
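The Nim rules as I've described them can be written down as a small Python predicate. This is a sketch of the description above, not of Nim's actual lexer, and the function name is made up.

```python
def is_nim_like_identifier(name: str) -> bool:
    """Hypothetical check following the four rules described above."""
    def is_letter(ch: str) -> bool:
        # Every non-ASCII character counts as a letter.
        return ord(ch) > 127 or ch.isalpha()

    def is_allowed(ch: str) -> bool:
        return is_letter(ch) or ch.isdigit() or ch == "_"

    return (
        bool(name)
        and is_letter(name[0])                # must start with a letter
        and all(is_allowed(c) for c in name)  # letters, digits, underscores only
        and not name.endswith("_")            # no trailing underscore
        and "__" not in name                  # no two underscores in a row
    )


for name in ("foo_bar", "\u00e91", "_foo", "foo_", "foo__bar", "1abc"):
    print(repr(name), is_nim_like_identifier(name))
```

Only the first two names pass; each of the others breaks exactly one of the rules.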
-
OCaml only permits ASCII letters, digits, and _, and expects the capitalization of the first letter to be correct for the referent's namespace. The implementation recognizes some ISO 8859-1 characters as letters, but this is apparently deprecated.
-
PHP requires identifiers to begin with a letter or underscore, followed by letters, numbers, and underscores. However, "letter" is defined as including "the bytes from 128 through 255". Assuming UTF-8, that means everything outside of ASCII.
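The "everything outside of ASCII" claim follows from how UTF-8 works: every byte in the encoding of a non-ASCII character is 128 or above. A quick Python sketch shows this, along with a byte-level pattern for the rule as described. The pattern and function name are my own illustration and assume UTF-8 source; they are not taken from PHP itself.

```python
import re

# Letter or underscore first, then letters, digits, underscores,
# where "letter" also covers the bytes 128 through 255.
PHP_LIKE_NAME = re.compile(rb"[A-Za-z_\x80-\xff][A-Za-z0-9_\x80-\xff]*")


def is_php_like_name(name: str) -> bool:
    """Hypothetical check applied to the UTF-8 bytes of `name`."""
    return PHP_LIKE_NAME.fullmatch(name.encode("utf-8")) is not None


# Every byte of a non-ASCII character's UTF-8 encoding is >= 128.
for ch in ("\u00e9", "\u2603", "\U00010400"):  # é, ☃, 𐐀
    encoded = ch.encode("utf-8")
    print(ch, list(encoded), all(b >= 0x80 for b in encoded))

print(is_php_like_name("\u2603"), is_php_like_name("1abc"))
```

Because the rule works on bytes rather than characters, even a symbol such as ☃ qualifies as a "letter" here.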
-
Protocol Buffers only allows ASCII letters for the first character, and letters, digits and _ for following characters. (I linked proto2, but it's the same in proto3.)
-
Python uses a slightly modified version of XID_Start and XID_Continue, according to whatever version of the Unicode databases it was built with (i.e. the same one as is in the standard library's unicodedata module). Identifiers beginning with an underscore are considered "private", although this is only weakly enforced (from module import * doesn't import them). Identifiers beginning and ending with two underscores (four total) aren't reserved per se, but have special semantics defined by the language. In a class, identifiers beginning but not ending with two underscores are mangled with the class's name.
-
R uses the C standard library's isalnum function, which means it depends on the current system locale. It also permits underscores and periods. However, non-identifiers can also be used as identifiers if you use the indirect accessors get and assign, or apparently in some contexts via string literals.
-
Racket (representing the Lisps, which are here for an extreme example rather than because they are particularly common) has a very liberal identifier syntax, although it isn't expressly documented anywhere. From testing, it accepts +, ^, @, ⸮, 𐁕, 𐤿, and presumably many others, in addition to the usual é and 𐐀 that I tested with. I happen to remember that there are some Lisps with a separate namespace for functions, and some without.
-
Ruby allows ASCII letters, digits (after the first character), _, and any "character with the eighth bit set". In practice, it may or may not accept non-ASCII characters in any context, depending on the command-line arguments passed to the ruby executable, and maybe other factors. I was able to persuade it to accept é in a variable name using -Eutf-8, but it still rejected 𐐀 (although my terminal displayed it funny so the problem might be there). Also, Ruby permits function names to end in ! (conventionally signaling mutation), ? (conventionally signaling that this function will return a boolean), or = (which the language interprets specially, generating a setter instead of a regular function) and uses the special prefixes $, @, and @@ to determine variable scope.
-
Rust uses XID_Start and XID_Continue, and also allows defining identifiers that overlap with keywords if you prefix them with r#. Identifiers are automatically converted to Normalization Form C. A single underscore is not an identifier. In some contexts, mostly to do with the filesystem and linker, identifiers must be ASCII. Rust also has five namespaces, although only two (type and value) are likely to be relevant to us. And as I mentioned, Rust uses bare integers as identifiers for tuple fields.
-
Swift allows ASCII letters, _, "a noncombining alphanumeric Unicode character in the Basic Multilingual Plane, or a character outside the Basic Multilingual Plane that isn’t in a Private Use Area" to begin an identifier, and then extends that to digits and combining characters. Reserved words can be used as identifiers if wrapped in backticks, which are not considered part of the identifier. The compiler sometimes creates identifiers beginning with $; code can use these, but not create more.
-
Web IDL only allows ASCII letters, digits (after the first character), _, and -.
-
Zig defines its identifiers in terms of "alphabetic" and "alphanumeric" characters (and _). It seems to mean ASCII. However, you can also use a string literal as an identifier if you prefix it with @.
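Going back to the Python entry for a moment: the behaviour described there can be poked at directly from Python itself, using str.isidentifier() and a throwaway class. The class and attribute names below are made up for illustration, and the isidentifier() results depend on the Unicode database your interpreter was built with.

```python
class Account:
    def __init__(self):
        self.__balance = 0                    # mangled to _Account__balance
        self._hint = "private by convention"  # single underscore: not mangled


acct = Account()

# Both of my test characters are accepted as identifiers; a leading digit is not.
print("\u00e9".isidentifier(), "\U00010400".isidentifier(), "1abc".isidentifier())

# The double-underscore attribute is stored under its mangled name.
print(sorted(vars(acct)))
```

The second line of output lists _Account__balance and _hint, which shows that the mangling is a rename rather than real access control.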