@xiabingquan
Created March 11, 2025 12:10
Given a binary file of tokens in Megatron format, calculate the number of tokens it contains. Each token is assumed to occupy exactly 4 bytes.
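import re


def calc_num_token_of_bin(filesize: str) -> float:
    # Parse a size string such as '10G' or '1T': an integer followed by G or T.
    # The character class is [GT]; the original [G|T] also matched a literal '|'.
    m = re.findall(r"(\d+)([GT])", filesize)
    assert len(m) == 1, f"Expect string like '10G', '1T', but got {filesize}"
    digit, dim = int(m[0][0]), m[0][1]
    # Exponent of the binary-to-decimal factor 1024/1000
    # (renamed from `pow` so it no longer shadows the builtin).
    exponent = 2 if dim == 'G' else 3
    # 4 bytes per token. The result is a float scaled to the input's prefix
    # (e.g. '10G' gives about 2.62), not an absolute token count.
    return digit / 4 * (1024 / 1000) ** exponent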
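When an absolute token count is needed rather than a scaled estimate, the same 4-bytes-per-token assumption can be applied to the byte size directly. The following is a minimal sketch, not part of the original gist: the helper names `exact_num_tokens_from_size` and `exact_num_tokens_from_file` are hypothetical, and it assumes that `G` and `T` denote binary units (GiB and TiB).

```python
import os
import re

BYTES_PER_TOKEN = 4  # assumption taken from the description: 4 bytes per token

# Assumption: 'G' and 'T' are binary units (GiB and TiB).
UNIT_BYTES = {"G": 1024 ** 3, "T": 1024 ** 4}


def exact_num_tokens_from_size(filesize: str) -> int:
    """Hypothetical helper: absolute token count for a size string like '10G'."""
    m = re.fullmatch(r"(\d+)([GT])", filesize)
    if m is None:
        raise ValueError(f"Expect string like '10G', '1T', but got {filesize!r}")
    return int(m.group(1)) * UNIT_BYTES[m.group(2)] // BYTES_PER_TOKEN


def exact_num_tokens_from_file(path: str) -> int:
    """Hypothetical helper: absolute token count from the file's size on disk."""
    size = os.path.getsize(path)
    if size % BYTES_PER_TOKEN != 0:
        raise ValueError(f"{path} is {size} bytes, not a multiple of {BYTES_PER_TOKEN}")
    return size // BYTES_PER_TOKEN
```

Reading the size with `os.path.getsize` avoids the rounding that a human-readable string such as `10G` introduces, so it is the safer option whenever the file itself is available.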