@widnyana
Last active March 11, 2018 19:45
fasttext files converter

based on Kyubyong/wordvectors#14 (comment)

Alex Dodge did a little poking around in the fasttext history and confirmed that the file format was different a year ago. In the old format:

  • There's no magic number or version at the top of the file.
  • There's no "pruneidx_size" value in the header for the dictionary object.
  • There's no "quant" boolean before each of the two matrix objects.
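
The first difference gives a cheap way to tell the two formats apart: a current file starts with the magic number, while an old file starts directly with the Args fields. Here is a minimal sketch, assuming the file was written on a little-endian machine. The helper name `looks_like_new_format` is hypothetical, and the constant is the same magic value the conversion script writes.

```python
import struct

FASTTEXT_MAGIC = 0x2F4F16BA  # same constant the conversion script writes


def looks_like_new_format(header: bytes) -> bool:
    """Return True if the first 4 bytes match the fastText magic number.

    `header` is the first few bytes of a .bin file. Hypothetical helper;
    assumes little-endian byte order.
    """
    if len(header) < 4:
        return False
    (magic,) = struct.unpack("<i", header[:4])
    return magic == FASTTEXT_MAGIC
```

Passing the first four bytes of a model file (for example `open(path, "rb").read(4)`) is enough. A file that fails the check is either in the old format or not a fasttext model at all.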

The script below converts one of the old fasttext files into something the current version can read. For example, querying a converted file (here tr.bin2):

$ echo merhaba | fasttext print-word-vectors tr.bin2
merhaba 0.12206 0.066014 0.093112 -0.043492 0.5207 0.057019 0.20127 0.20933 0.057977 -0.29209 0.087561 0.05825 0.50264 -0.17409 0.19332 -0.08724 0.35125 0.045985 0.21882 0.1872 0.16603 0.21172 0.17046 0.062976 -0.022134 -0.50327 -0.064927 0.1336 0.10681 -0.1902 0.030359 -0.075208 -0.19389 0.40742 0.078176 0.11845 -0.057126 0.52497 0.11417 0.36205 -0.055332 -0.2492 0.46497 0.72146 0.42214 0.082853 0.035755 -0.1644 -0.23566 0.1037 -0.079192 0.15678 -0.14464 -0.023746 0.11418 0.21951 -0.20679 -0.11682 -0.020332 -0.07834 0.27913 -0.59613 -0.15867 0.15623 0.066335 0.078509 -0.0045359 -0.15227 -0.025417 -0.14899 -0.25298 0.2158 -0.26728 0.071114 -0.86768 -0.39044 -0.36575 0.053666 0.38771 0.3328 0.085293 -0.12563 0.13022 -0.21437 0.31115 0.013396 0.02462 -0.25962 -0.51704 -0.55816 0.43276 0.25894 -0.55603 0.3785 -0.13968 0.0031102 0.23232 0.11755 0.17286 -0.14933 0.19528 0.36565 -0.19717 0.066704 -0.20812 -0.32329 -0.09979 -0.34596 0.12763 -0.26259 -0.13747 -0.056275 0.47636 -0.068787 0.05284 -0.16213 -0.57922 -0.15148 0.31464 0.23883 -0.43305 0.21852 -0.082744 0.26875 -0.28505 -0.379 -0.24597 -0.11538 0.22466 -0.17107 0.047522 0.31911 0.15056 0.21347 0.16531 -0.078537 0.14234 0.090975 -0.4294 0.067041 0.085503 0.41908 0.18248 0.18221 0.10699 -0.21135 0.1343 -0.05573 -0.16256 -0.39946 0.086395 -0.030858 -0.66857 0.58846 0.17388 0.56812 -0.088791 -0.024312 -0.054497 -0.075219 -0.0048822 -0.17311 0.070715 0.080788 0.14496 0.45174 0.071725 -0.14704 0.56277 0.058342 0.67329 0.22379 -0.13657 -0.11677 0.31955 0.21028 -0.24803 -0.34743 0.0019436 0.26037 0.49244 0.2648 -0.07083 -0.26863 -0.24654 -0.025958 -0.27783 -0.045067 -0.068344 0.16087 0.11595 -0.044365 0.029121 0.12629 0.28304 0.23161 -0.17879 -0.092399 -0.38922 -0.24235
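
Each output line is the word followed by its vector components, separated by spaces. To compare vectors before and after conversion, a line can be parsed as sketched below. `parse_vector_line` is a hypothetical helper, and the sketch assumes the word itself contains no whitespace.

```python
def parse_vector_line(line: str):
    """Split one print-word-vectors output line into (word, [floats])."""
    word, *values = line.split()
    return word, [float(v) for v in values]
```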
#!/usr/bin/env python3
# Usage: <this script> OLD_MODEL.bin NEW_MODEL.bin
#
# All struct formats use the "=" prefix: native byte order, standard sizes,
# no alignment padding. fasttext writes each field back to back, so the
# dictionary header is 20 bytes; the default "@" mode would pad "iiiq" to
# 24 bytes on typical 64-bit platforms and misalign everything after it.
import mmap
import struct
import sys

# Map the old-format model read-only; the mapping stays valid after fd closes.
with open(sys.argv[1], 'rb') as fd:
    M = mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)

with open(sys.argv[2], 'wb') as fd:
    # Magic number and version
    fd.write(struct.pack('=ii', 0x2F4F16BA, 12))
    offset = 0

    # Arg object: 12 ints and one double, copied unchanged
    span = struct.calcsize("=iiiiiiiiiiiid")
    fd.write(M[offset:offset+span])
    offset = span

    # Dictionary header, with one extra field
    span = struct.calcsize("=iiiq")
    size, = struct.unpack("=i", M[offset:offset+4])
    fd.write(M[offset:offset+span])
    offset += span
    fd.write(struct.pack("=q", -1))  # pruneidx_size

    # Dictionary contents: each entry is a NUL-terminated word
    # followed by an int64 count and a one-byte type
    span = struct.calcsize("=qc")
    offset1 = offset
    for _ in range(size):
        end = M.find(b'\0', offset1)
        if end < 0:
            raise ValueError("unterminated word in dictionary")
        offset1 = end + 1 + span
    fd.write(M[offset:offset1])
    offset = offset1

    # Add the "quant_input" bool
    fd.write(struct.pack("=?", False))

    # Input matrix: two int64 dimensions, then m*n floats
    span = struct.calcsize("=qq")
    m, n = struct.unpack("=qq", M[offset:offset+span])
    fd.write(M[offset:offset+span])
    offset += span
    span = m * n * struct.calcsize("=f")
    fd.write(M[offset:offset+span])
    offset += span

    # "qout" boolean
    fd.write(struct.pack("=?", False))

    # Output matrix: same layout as the input matrix
    span = struct.calcsize("=qq")
    m, n = struct.unpack("=qq", M[offset:offset+span])
    fd.write(M[offset:offset+span])
    offset += span
    span = m * n * struct.calcsize("=f")
    fd.write(M[offset:offset+span])
    offset += span
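
To sanity-check a converted file without running fasttext, its header can be read back. The sketch below assumes the layout the script writes: magic number, version, then the Args block of 12 ints and one double. It also assumes the first Args field is the vector dimension. `read_header` is a hypothetical helper, not part of fasttext.

```python
import struct

MAGIC, VERSION = 0x2F4F16BA, 12  # values written by the conversion script


def read_header(data: bytes):
    """Parse magic, version and the Args block from the start of a converted file.

    Uses "=" so the fields are read back to back with no alignment padding.
    Returns (version, args), where args is a tuple of 12 ints and one double.
    """
    magic, version = struct.unpack_from("=ii", data, 0)
    if magic != MAGIC:
        raise ValueError("missing fastText magic number")
    args = struct.unpack_from("=12id", data, 8)
    return version, args
```

Reading the first 64 bytes of the output file and passing them to `read_header` should give version 12. Under the assumption above, `args[0]` should equal the dimension the model was trained with.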