Research into using machine learning to generate .sch files.

Procuring a large schematic dataset

Use the GitHub API to batch-download .sch files.

  1. The GitHub code search API requires you to limit searches to users, orgs, or repos. So we first have to get a list of repos that use the KiCad or EAGLE "languages".
https://api.github.com/search/repositories?q=language:kicad&per_page=100&page=1
https://api.github.com/search/repositories?q=language:eagle&per_page=100&page=1

At 100 results per page, a maximum of 34 pages is returned.

  2. Once we have a list of repos, we can extract the users from these results and target our next set of queries at those users. For instance, the mossmann/hackrf repo showed up in our first query. From this we can assume that mossmann might have more KiCad/EAGLE repos, so let's include him in our .sch file search.
https://api.github.com/search/code?q=endComp+in:file+language:kicad+user:mossmann

Here we are looking for files that include the text endComp in repositories written in KiCad by the mossmann user. We use endComp as our search query because we are only interested in .sch files that actually include components, and $endComp is a required tag for such files.
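
Putting those two queries together, a rough sketch of the whole procurement step could look something like the script below. This is hypothetical glue code rather than the exact script used for this gist: it assumes Python 3 with only the standard library, expects a personal access token in a GITHUB_TOKEN environment variable (the code search endpoint requires authentication), and skips rate-limit handling and pagination of the code search results.

# procure_sch.py: hypothetical sketch of the dataset procurement step
import json
import os
import urllib.request

API = "https://api.github.com"
HEADERS = {"Authorization": "token " + os.environ["GITHUB_TOKEN"]}

def get_json(url):
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as res:
        return json.loads(res.read().decode("utf-8"))

# 1. Collect the owners of repos whose "language" is KiCad or EAGLE.
users = set()
for language in ("kicad", "eagle"):
    for page in range(1, 35):  # at most 34 pages of 100 results
        url = "%s/search/repositories?q=language:%s&per_page=100&page=%d" % (API, language, page)
        items = get_json(url).get("items", [])
        if not items:
            break
        users.update(item["owner"]["login"] for item in items)

# 2. For each user, search for files containing endComp (the EAGLE query
#    is analogous) and download the raw file contents into data/sch/.
os.makedirs("data/sch", exist_ok=True)
for user in users:
    url = "%s/search/code?q=endComp+in:file+language:kicad+user:%s" % (API, user)
    for item in get_json(url).get("items", []):
        raw_url = (item["html_url"]
                   .replace("github.com", "raw.githubusercontent.com")
                   .replace("/blob/", "/"))
        filename = "%s_%s_%s" % (user, item["repository"]["name"], item["name"])
        urllib.request.urlretrieve(raw_url, os.path.join("data/sch", filename))

Running something like this leaves the downloaded schematics in data/sch/, which is the directory the processing steps below operate on.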

Processing Dataset

Utils

List files smaller than n bytes

find data/sch/*.sch -type f -size -4096c

That output can also be piped to xargs so that the contents of those files are concatenated together and saved to disk:

find data/sch/*.sch -type f -size -4096c | xargs cat > data/4KB_concated.txt

Training and Sampling with torch-rnn

If no Docker container is running, launch one (with a mounted volume) like:

# mounting data/ from current directory
docker run -ti -v $(pwd)/data:/root/torch-rnn/data crisbal/torch-rnn:base bash

Then pre-process the data with:

FILE_BASENAME='data/concated/4KB_concated'
python scripts/preprocess.py \
  --input_txt "$FILE_BASENAME.txt" \
  --output_h5 "$FILE_BASENAME.h5" \
  --output_json "$FILE_BASENAME.json"

And train with default hyperparameters like:

th train.lua \
  -input_h5 "$FILE_BASENAME.h5" \
  -input_json "$FILE_BASENAME.json" \
  -checkpoint_every 250 \
  -gpu -1 # disable gpu support

Samples can be generated from checkpoints with:

CHECKPOINT=1000
SAMPLE_SIZE=2000
th sample.lua -checkpoint "cv/checkpoint_$CHECKPOINT.t7" \
  -length $SAMPLE_SIZE \
  -start_text "EESchema Schematic File Version " \
  -gpu -1