Created
March 11, 2025 12:10
Given a binary token file in Megatron format, calculate the number of tokens it contains. Each token is expected to occupy exactly 4 bytes.
import re


def calc_num_token_of_bin(filesize: str) -> float:
    # Parse a human-readable size such as '10G' or '1T'.
    m = re.findall(r"(\d+)([GT])", filesize)
    assert len(m) == 1, f"Expect string like '10G', '1T', but got {filesize}"
    digit, dim = int(m[0][0]), m[0][1]
    exponent = 2 if dim == 'G' else 3
    # 4 bytes per token; the (1024 / 1000) factor accounts for binary vs. decimal size prefixes.
    return digit / 4 * (1024 / 1000) ** exponent
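The function above estimates the count from a rounded, human-readable size string. When the file itself is available locally, the count can instead be taken from its exact byte size. The following is a minimal sketch under that assumption; `count_tokens_exact` and `BYTES_PER_TOKEN` are hypothetical names that are not part of the original gist, and the 4-byte token width comes from the description above.

```python
import os

# Assumption from the description: every token occupies exactly 4 bytes.
BYTES_PER_TOKEN = 4


def count_tokens_exact(path: str, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Return the number of tokens in a binary token file (hypothetical helper)."""
    size = os.path.getsize(path)  # exact file size in bytes
    if size % bytes_per_token != 0:
        # A remainder means the file does not match the assumed token width.
        raise ValueError(
            f"{path}: size {size} is not a multiple of {bytes_per_token} bytes"
        )
    return size // bytes_per_token
```

Calling `count_tokens_exact("tokens.bin")` (a placeholder path) returns an integer, and raises `ValueError` if the file size is inconsistent with the assumed token width.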