Dataset | LLM-jp-13B-v1.0 | weblab-10b | PLaMo-13B | Stockmark-13b | Japanese StableLM Alpha | Notes |
---|---|---|---|---|---|---|
mc4 | ◯ | ◯ | ◯ | ◯ | ◯ | |
wikipedia | ◯ | ◯ | ◯ | ◯ | The StableLM page links to dumps.wikipedia | |
pile | ◯ | ◯ | | | | |
RedPajama | ◯ | ◯ | | | | |
cc100 | ◯ | ◯ | | | | |
the stack |
#!/bin/bash
#
# Rules for the Unity folder structure
# https://qiita.com/takish/items/8608ba9070755da3ae6d
# This script creates the folder structure described there, after an empty Unity project has been created
# Run it from the root folder of the empty project
# Base directory for Unity project Assets
base_dir="./Assets"
# Load only 10 MB from the-stack dataset and display the beginning
import sys
from datasets import load_dataset
dataset = load_dataset("bigcode/the-stack", split="train", streaming=True)
data_subset = []
total_size = 0
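The preview above is cut off right after `data_subset` and `total_size` are initialized. As a rough sketch of how a size-capped read could continue, the helper below accumulates records until a byte budget is reached. The function name `take_up_to_bytes`, the `text_key="content"` field, and measuring size as UTF-8 byte length are all assumptions made for illustration, not the original author's code. A small in-memory generator stands in for the streaming dataset so the example runs offline.

```python
from typing import Dict, Iterable, List, Tuple

LIMIT_BYTES = 10 * 1024 * 1024  # 10 MB budget, matching the comment in the snippet


def take_up_to_bytes(
    records: Iterable[Dict[str, str]],
    limit_bytes: int = LIMIT_BYTES,
    text_key: str = "content",  # assumed field name holding the file text
) -> Tuple[List[Dict[str, str]], int]:
    """Collect records from an iterable until adding one more would exceed limit_bytes."""
    subset: List[Dict[str, str]] = []
    total_size = 0
    for record in records:
        # Size is measured as the UTF-8 byte length of the text field (an assumption).
        size = len(record[text_key].encode("utf-8"))
        if total_size + size > limit_bytes:
            break
        subset.append(record)
        total_size += size
    return subset, total_size


# Hypothetical stand-in for a streaming dataset: ten records of 400 bytes each.
fake_stream = ({"content": "x" * 400} for _ in range(10))
subset, total = take_up_to_bytes(fake_stream, limit_bytes=1000)
print(len(subset), total)  # 2 800
```

Because the helper only iterates over its input, a streaming dataset object that yields dictionaries could presumably be passed in place of `fake_stream`, provided the records really carry the assumed text field.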
# Reference URL:
# https://note.com/oriki111/n/n49ae98873a98?sub_rt=share_h
# Run command. The execution-time log is written to a text file
# python3 mc4_load.py | tee mc4_load.txt
# Create the virtual environment
# python3.12 -m venv myenv
# Activate the virtual environment
'''
Dataset-related information
https://huggingface.co/datasets/graelo/wikipedia <- the Japanese dataset can be loaded
https://huggingface.co/datasets/wikipedia <- the Japanese dataset cannot be loaded
https://dumps.wikimedia.org/jawiki/
'''
'''
Run command. The execution-time log is written to a text file
# python3 wikipedia_en_load.py | tee wikipedia_en_load.txt
# Reference URL:
# Building a pretrained small language model (0.15B) on Google Colab
# https://ayousanz.hatenadiary.jp/entry/2024/01/23/225623
#
'''
Dataset-related information
https://huggingface.co/datasets/graelo/wikipedia <- the Japanese dataset can be loaded
https://huggingface.co/datasets/wikipedia <- the Japanese dataset cannot be loaded
https://dumps.wikimedia.org/jawiki/
'''
[Problem]
https://atcoder.jp/contests/abc214/tasks/abc214_c
[Reference]
https://atcoder.jp/contests/abc214/editorial/2438
According to the editorial, the correct answer is only reached after going around the circle twice
'''
import sys
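The preview stops at the import, so the following is only a minimal sketch of the "two laps around the circle" idea from the note above. It assumes the linked problem is the circular relay in which person `i` passes an item to the next person after `s[i]` time units and person `i` also receives an item directly at time `t[i]`. The function name `first_receive_times` and the list-based interface are illustrative choices, not the original author's code. A single lap can miss an early arrival that has to wrap around from the end of the circle to the start, which is why the relaxation runs for `2 * n` steps.

```python
from typing import List


def first_receive_times(s: List[int], t: List[int]) -> List[int]:
    """Earliest time each person on the circle first holds an item.

    s[i]: time person i needs to pass the item on to person (i + 1) % n.
    t[i]: time at which person i is handed an item directly.
    """
    n = len(s)
    ans = list(t)
    # Relax each position against its predecessor, going around the circle twice
    # so that improvements can propagate across the wrap-around point.
    for step in range(2 * n):
        i = step % n
        prev = (i - 1) % n
        ans[i] = min(ans[i], ans[prev] + s[prev])
    return ans


print(first_receive_times([4, 1, 5], [3, 10, 100]))   # [3, 7, 8]
print(first_receive_times([1, 1, 1], [100, 1, 100]))  # [3, 1, 2]
```

In the second call, person 0 only gets the value 3 on the second lap: after one lap the result would still be `[100, 1, 2]`, because the early item at person 1 has not yet travelled past the end of the list back to the start.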