Created
May 8, 2016 15:39
言語処理100本ノック (NLP 100 Exercise): Chapter 2 notes (UNIX command basics)
{ | |
"cells": [ | |
{ | |
"metadata": { | |
"toc": "true" | |
}, | |
"cell_type": "markdown", | |
"source": "# Table of Contents\n <p><div class=\"lev1\"><a href=\"#2章-Unixコマンドの基礎-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Chapter 2: UNIX Command Basics</a></div><div class=\"lev2\"><a href=\"#10.-行数のカウント-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>10. Counting lines</a></div><div class=\"lev3\"><a href=\"#powershell-1.1.1\"><span class=\"toc-item-num\">1.1.1&nbsp;&nbsp;</span>powershell</a></div><div class=\"lev2\"><a href=\"#11.-タブをスペースに置換-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>11. Replacing tabs with spaces</a></div><div class=\"lev3\"><a href=\"#powershell-1.2.1\"><span class=\"toc-item-num\">1.2.1&nbsp;&nbsp;</span>powershell</a></div><div class=\"lev2\"><a href=\"#12.-1列目をcol1.txtに,2列目をcol2.txtに保存-1.3\"><span class=\"toc-item-num\">1.3&nbsp;&nbsp;</span>12. Saving column 1 to col1.txt and column 2 to col2.txt</a></div><div class=\"lev3\"><a href=\"#powershell-の場合-1.3.1\"><span class=\"toc-item-num\">1.3.1&nbsp;&nbsp;</span>With PowerShell</a></div><div class=\"lev2\"><a href=\"#13.-col1.txtとcol2.txtをマージ-1.4\"><span class=\"toc-item-num\">1.4&nbsp;&nbsp;</span>13. Merging col1.txt and col2.txt</a></div><div class=\"lev3\"><a href=\"#powershell-の場合-1.4.1\"><span class=\"toc-item-num\">1.4.1&nbsp;&nbsp;</span>With PowerShell</a></div><div class=\"lev2\"><a href=\"#14.-先頭からN行を出力-1.5\"><span class=\"toc-item-num\">1.5&nbsp;&nbsp;</span>14. Printing the first N lines</a></div><div class=\"lev3\"><a href=\"#powershellの場合-1.5.1\"><span class=\"toc-item-num\">1.5.1&nbsp;&nbsp;</span>With PowerShell</a></div><div class=\"lev2\"><a href=\"#15.-末尾のN行を出力-1.6\"><span class=\"toc-item-num\">1.6&nbsp;&nbsp;</span>15. Printing the last N lines</a></div><div class=\"lev3\"><a href=\"#powershell-の場合-1.6.1\"><span class=\"toc-item-num\">1.6.1&nbsp;&nbsp;</span>With PowerShell</a></div><div class=\"lev2\"><a href=\"#16.-ファイルをN分割する-1.7\"><span class=\"toc-item-num\">1.7&nbsp;&nbsp;</span>16. Splitting a file into N parts</a></div><div class=\"lev3\"><a href=\"#powershell-の場合-1.7.1\"><span class=\"toc-item-num\">1.7.1&nbsp;&nbsp;</span>With PowerShell</a></div><div class=\"lev2\"><a href=\"#17.-1列目の文字列の異なり-1.8\"><span class=\"toc-item-num\">1.8&nbsp;&nbsp;</span>17. Distinct strings in the first column</a></div><div class=\"lev3\"><a href=\"#powershell-の場合-1.8.1\"><span class=\"toc-item-num\">1.8.1&nbsp;&nbsp;</span>With PowerShell</a></div><div class=\"lev2\"><a href=\"#18.-各行を3コラム目の数値の降順にソート-1.9\"><span class=\"toc-item-num\">1.9&nbsp;&nbsp;</span>18. Sorting lines in descending order of the numeric value in column 3</a></div><div class=\"lev3\"><a href=\"#windows-powershell-1.9.1\"><span class=\"toc-item-num\">1.9.1&nbsp;&nbsp;</span>windows powershell</a></div><div class=\"lev2\"><a href=\"#19.-各行の1コラム目の文字列の出現頻度を求め,出現頻度の高い順に並べる-1.10\"><span class=\"toc-item-num\">1.10&nbsp;&nbsp;</span>19. Sorting first-column strings by frequency of occurrence, highest first</a></div><div class=\"lev3\"><a href=\"#powershell-の場合-1.10.1\"><span class=\"toc-item-num\">1.10.1&nbsp;&nbsp;</span>With PowerShell</a></div><div class=\"lev2\"><a href=\"#参考リンク-1.11\"><span class=\"toc-item-num\">1.11&nbsp;&nbsp;</span>References</a></div>" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "- [NLP 100 Exercise, Chapter 1 (Warm-up)](http://nbviewer.jupyter.org/gist/Cartman0/77c669b28f674179e459869881da7a56)" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "# Chapter 2: UNIX Command Basics" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
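"source": "hightemp.txt is a tab-separated file that stores records of the highest temperatures in Japan, with the columns \"prefecture\", \"location\", \"℃\", and \"date\".\nWrite programs that perform the following tasks, using hightemp.txt as the input file.\nThen run the same tasks with UNIX commands and check the programs' results." | |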
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## 10. Counting lines\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Count the number of lines. Use the wc command to check." | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "import sys\nsys.getdefaultencoding()", | |
"execution_count": 1, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"execution_count": 1, | |
"data": { | |
"text/plain": "'utf-8'" | |
}, | |
"metadata": {} | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "with open('hightemp.txt', 'r', encoding='utf-8') as file:\n    print(len(file.readlines()))", | |
"execution_count": 2, | |
"outputs": [ | |
{ | |
"text": "24\n", | |
"output_type": "stream", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"collapsed": true | |
}, | |
"cell_type": "markdown", | |
"source": "### powershell" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "```\nGet-Content -Encoding UTF8 .\\hightemp.txt | Measure-Object -Line\n```\nor\n```\ncat -Encoding UTF8 .\\hightemp.txt | Measure-Object -Line\n```\n\nalso works (`cat` is a PowerShell alias for `Get-Content`).\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## 11. Replacing tabs with spaces" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Replace each tab character with a single space character. Use the sed, tr, or expand command to check." | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "with open('hightemp.txt', 'r', encoding='utf-8') as file:\n    #print(file.readlines())\n    replace_space = file.read().replace('\\t', ' ')\n    #print(list(replace_space))\n    print(replace_space)", | |
"execution_count": 3, | |
"outputs": [ | |
{ | |
"text": "高知県 江川崎 41 2013-08-12\n埼玉県 熊谷 40.9 2007-08-16\n岐阜県 多治見 40.9 2007-08-16\n山形県 山形 40.8 1933-07-25\n山梨県 甲府 40.7 2013-08-10\n和歌山県 かつらぎ 40.6 1994-08-08\n静岡県 天竜 40.6 1994-08-04\n山梨県 勝沼 40.5 2013-08-10\n埼玉県 越谷 40.4 2007-08-16\n群馬県 館林 40.3 2007-08-16\n群馬県 上里見 40.3 1998-07-04\n愛知県 愛西 40.3 1994-08-05\n千葉県 牛久 40.2 2004-07-20\n静岡県 佐久間 40.2 2001-07-24\n愛媛県 宇和島 40.2 1927-07-22\n山形県 酒田 40.1 1978-08-03\n岐阜県 美濃 40 2007-08-16\n群馬県 前橋 40 2001-07-24\n千葉県 茂原 39.9 2013-08-11\n埼玉県 鳩山 39.9 1997-07-05\n大阪府 豊中 39.9 1994-08-08\n山梨県 大月 39.9 1990-07-19\n山形県 鶴岡 39.9 1978-08-03\n愛知県 名古屋 39.9 1942-08-02\n\n", | |
"output_type": "stream", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "### powershell" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "In PowerShell, the tab character is written as `` `t `` (a backtick followed by `t`)." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "```\n$file = Get-Content -Encoding UTF8 .\\hightemp.txt\n$file -replace \"`t\", \" \"\n```" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## 12. Saving column 1 to col1.txt and column 2 to col2.txt" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Extract only the first column of each line and save it to col1.txt, and extract only the second column and save it to col2.txt. Use the cut command to check." | |
}, | |
{ | |
"metadata": { | |
"code_folding": [], | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "def cut_col_lines(file_name_in:str, cut_idx:int):\n    with open(file_name_in, 'r', encoding='utf-8') as file_in:\n        col_lines = [line.split()[cut_idx] for line in file_in.readlines()]\n    return col_lines\n\ndef cut_col_out(file_name_in:str, cut_idx:int, file_name_out:str):\n    col_lines = cut_col_lines(file_name_in, cut_idx)\n    with open(file_name_out, 'w', encoding='utf-8') as file_out:\n        for line in col_lines:\n            file_out.write(line + '\\n')\n\ncut_col_out('hightemp.txt', 0, 'col1.txt')\ncut_col_out('hightemp.txt', 1, 'col2.txt')", | |
"execution_count": 4, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "### With PowerShell" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "```\n$file = Get-Content -Encoding UTF8 .\\hightemp.txt\n$f = $file -replace \"`t\", \" \"\n\n$f | %{$_.split(\" \")[0]} > col1.txt\n$f | %{$_.split(\" \")[1]} > col2.txt\n```" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## 13. Merging col1.txt and col2.txt" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
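"source": "Combine the col1.txt and col2.txt files created in problem 12 into a text file that holds the first and second columns of the original file side by side, separated by a tab. Use the paste command to check." | |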
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "def merge(filename1:str, filename2:str, filename_out:str, separate='\\t'):\n    with open(filename1, 'r', encoding='utf-8') as file1, open(filename2, 'r', encoding='utf-8') as file2:\n        with open(filename_out, 'w', encoding='utf-8') as file_out:\n            for l1, l2 in zip(file1.readlines(), file2.readlines()):\n                file_out.write(l1.split()[0] + separate + l2.split()[0] + '\\n')", | |
"execution_count": 5, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"collapsed": true, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "merge('col1.txt', 'col2.txt', 'merge.txt')", | |
"execution_count": 6, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "### With PowerShell" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "This seems difficult on Windows." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## 14. Printing the first N lines\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Receive a natural number N by some means such as a command-line argument, and display only the first N lines of the input. Use the head command to check." | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "import sys\nfrom itertools import islice\n\ndef head(filename, N=1):\n    with open(filename, 'r', encoding='utf-8') as file:\n        # islice stops at end of file, so N may exceed the line count\n        sys.stdout.writelines(islice(file, N))", | |
"execution_count": 7, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "head('col1.txt', 26)", | |
"execution_count": 8, | |
"outputs": [ | |
{ | |
"text": "高知県\n埼玉県\n岐阜県\n山形県\n山梨県\n和歌山県\n静岡県\n山梨県\n埼玉県\n群馬県\n群馬県\n愛知県\n千葉県\n静岡県\n愛媛県\n山形県\n岐阜県\n群馬県\n千葉県\n埼玉県\n大阪府\n山梨県\n山形県\n愛知県\n", | |
"output_type": "stream", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "### With PowerShell" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "```\ncat -Encoding UTF8 .\\col1.txt -Head 25\n```" | |
}, | |
{ | |
"metadata": { | |
"collapsed": true | |
}, | |
"cell_type": "markdown", | |
"source": "## 15. Printing the last N lines\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Receive a natural number N by some means such as a command-line argument, and display only the last N lines of the input. Use the tail command to check." | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "import sys\ndef tail(filename, N=1):\n    with open(filename, 'r', encoding='utf-8') as file:\n        sys.stdout.write(''.join(file.readlines()[-N:]))", | |
"execution_count": 9, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "tail('col1.txt', 25)", | |
"execution_count": 10, | |
"outputs": [ | |
{ | |
"text": "高知県\n埼玉県\n岐阜県\n山形県\n山梨県\n和歌山県\n静岡県\n山梨県\n埼玉県\n群馬県\n群馬県\n愛知県\n千葉県\n静岡県\n愛媛県\n山形県\n岐阜県\n群馬県\n千葉県\n埼玉県\n大阪府\n山梨県\n山形県\n愛知県\n", | |
"output_type": "stream", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "### With PowerShell" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "```\ncat -Encoding UTF8 .\\col1.txt -Tail 2\n```" | |
}, | |
{ | |
"metadata": { | |
"collapsed": true | |
}, | |
"cell_type": "markdown", | |
"source": "## 16. Splitting a file into N parts" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Receive a natural number N by some means such as a command-line argument, and split the input file into N parts by line. Do the same with the split command." | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "import sys\nimport math\n\ndef split(filename:str, N=2, filename_out=False):\n    with open(filename, 'r', encoding='utf-8') as file:\n        lines = file.readlines()\n    idx = 0\n    length = len(lines)\n    ratio = math.ceil(length/N)\n    for i in range(N):\n        sys.stdout.writelines(lines[idx:idx + ratio])\n        print()\n\n        if filename_out:\n            with open(filename_out + str(i) + '.txt', 'w', encoding='utf-8') as file_out:\n                file_out.writelines(lines[idx:idx + ratio])\n        idx = idx + ratio", | |
"execution_count": 11, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"scrolled": true, | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "split('hightemp.txt', 6)", | |
"execution_count": 12, | |
"outputs": [ | |
{ | |
"text": "高知県\t江川崎\t41\t2013-08-12\n埼玉県\t熊谷\t40.9\t2007-08-16\n岐阜県\t多治見\t40.9\t2007-08-16\n山形県\t山形\t40.8\t1933-07-25\n\n山梨県\t甲府\t40.7\t2013-08-10\n和歌山県\tかつらぎ\t40.6\t1994-08-08\n静岡県\t天竜\t40.6\t1994-08-04\n山梨県\t勝沼\t40.5\t2013-08-10\n\n埼玉県\t越谷\t40.4\t2007-08-16\n群馬県\t館林\t40.3\t2007-08-16\n群馬県\t上里見\t40.3\t1998-07-04\n愛知県\t愛西\t40.3\t1994-08-05\n\n千葉県\t牛久\t40.2\t2004-07-20\n静岡県\t佐久間\t40.2\t2001-07-24\n愛媛県\t宇和島\t40.2\t1927-07-22\n山形県\t酒田\t40.1\t1978-08-03\n\n岐阜県\t美濃\t40\t2007-08-16\n群馬県\t前橋\t40\t2001-07-24\n千葉県\t茂原\t39.9\t2013-08-11\n埼玉県\t鳩山\t39.9\t1997-07-05\n\n大阪府\t豊中\t39.9\t1994-08-08\n山梨県\t大月\t39.9\t1990-07-19\n山形県\t鶴岡\t39.9\t1978-08-03\n愛知県\t名古屋\t39.9\t1942-08-02\n\n", | |
"output_type": "stream", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "### With PowerShell" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "This seems difficult in PowerShell.\n\n```\n$split_num = 2 # number of parts\n\n$count = 0;\n$file_name = \".\\hightemp.txt\"\n$file = cat -Encoding UTF8 $file_name\ncat -Encoding UTF8 $file_name -ReadCount ($file.count / $split_num) |\n    ForEach-Object {\n        $count ++\n        $cfs = \"{0:D3}\" -f $count;\n        $_ > ($file_name + '_' + $cfs)\n    }\n```" | |
}, | |
{ | |
"metadata": { | |
"collapsed": true | |
}, | |
"cell_type": "markdown", | |
"source": "## 17. Distinct strings in the first column" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Find the set of distinct strings in the first column. Use the sort and uniq commands to check." | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "def cut_col_set(file_name_in:str, col_idx:int):\n    lines = cut_col_lines(file_name_in, col_idx)\n    return set(lines)\n\ncut_col_set('hightemp.txt', 0)", | |
"execution_count": 13, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"execution_count": 13, | |
"data": { | |
"text/plain": "{'千葉県',\n '和歌山県',\n '埼玉県',\n '大阪府',\n '山形県',\n '山梨県',\n '岐阜県',\n '愛媛県',\n '愛知県',\n '群馬県',\n '静岡県',\n '高知県'}" | |
}, | |
"metadata": {} | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "### With PowerShell\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "```\n$file = Get-Content -Encoding UTF8 .\\hightemp.txt\n$f = $file -replace \"`t\", \" \"\n$f | %{$_.split(\" \")[0]} | sort | Get-Unique\n```" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## 18. Sorting lines in descending order of the numeric value in column 3" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
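"source": "Sort the lines in reverse order of the numeric value in the third column (note: reorder the lines without changing their contents). Use the sort command to check (for this problem, the result need not match the result of running the command)." | |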
}, | |
{ | |
"metadata": { | |
"collapsed": true, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "import sys\ndef sort(filename:str, col_idx=2):\n    with open(filename, 'r', encoding='utf-8') as file:\n        file_lines = [l.replace('\\t', ' ') for l in file.readlines()]\n    # key is a function; float() makes the comparison numeric rather than lexicographic\n    sys.stdout.writelines(sorted(file_lines, key=lambda l: float(l.split()[col_idx]), reverse=True))", | |
"execution_count": 14, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "sort('hightemp.txt', 2)", | |
"execution_count": 15, | |
"outputs": [ | |
{ | |
"text": "高知県 江川崎 41 2013-08-12\n埼玉県 熊谷 40.9 2007-08-16\n岐阜県 多治見 40.9 2007-08-16\n山形県 山形 40.8 1933-07-25\n山梨県 甲府 40.7 2013-08-10\n和歌山県 かつらぎ 40.6 1994-08-08\n静岡県 天竜 40.6 1994-08-04\n山梨県 勝沼 40.5 2013-08-10\n埼玉県 越谷 40.4 2007-08-16\n群馬県 館林 40.3 2007-08-16\n群馬県 上里見 40.3 1998-07-04\n愛知県 愛西 40.3 1994-08-05\n千葉県 牛久 40.2 2004-07-20\n静岡県 佐久間 40.2 2001-07-24\n愛媛県 宇和島 40.2 1927-07-22\n山形県 酒田 40.1 1978-08-03\n岐阜県 美濃 40 2007-08-16\n群馬県 前橋 40 2001-07-24\n千葉県 茂原 39.9 2013-08-11\n埼玉県 鳩山 39.9 1997-07-05\n大阪府 豊中 39.9 1994-08-08\n山梨県 大月 39.9 1990-07-19\n山形県 鶴岡 39.9 1978-08-03\n愛知県 名古屋 39.9 1942-08-02\n", | |
"output_type": "stream", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "### windows powershell" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "```\n# Using CSV makes this easy\n\nImport-Csv -Encoding UTF8 -Delimiter \"`t\" -Header \"loc1\", \"loc2\", \"hight\", \"date\" .\\hightemp.txt | sort -Property hight -Descending\n```" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
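"source": "On Linux, sort with\n```\nsort -n -r -k 3,3 hoge.txt\n```\n\n(`-n` compares the key numerically and `-k 3,3` restricts the key to the third column; `hoge.txt` is a placeholder file name)." | |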
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## 19. Sorting first-column strings by frequency of occurrence, highest first" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Find how often each first-column string occurs, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands to check." | |
}, | |
{ | |
"metadata": { | |
"collapsed": true, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "import collections\n\ndef count_dict(filename_in:str, col_idx=0):\n    d = collections.defaultdict(int)\n    with open(filename_in, 'r', encoding='utf-8') as file:\n        f_lines = file.readlines()\n    for line in f_lines:\n        d[line.split()[col_idx]] += 1\n    return d\n\ndef count_sort(filename_in:str, col_idx=0, descending=True):\n    d = count_dict(filename_in, col_idx)\n    print(sorted(d.items(), key=lambda l:l[1], reverse=descending))", | |
"execution_count": 16, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "count_dict('hightemp.txt', 0)\ncount_sort('hightemp.txt', 0)", | |
"execution_count": 17, | |
"outputs": [ | |
{ | |
"text": "[('山形県', 3), ('群馬県', 3), ('山梨県', 3), ('埼玉県', 3), ('静岡県', 2), ('岐阜県', 2), ('愛知県', 2), ('千葉県', 2), ('大阪府', 1), ('和歌山県', 1), ('高知県', 1), ('愛媛県', 1)]\n", | |
"output_type": "stream", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "### With PowerShell" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "```\ncat -Encoding UTF8 .\\hightemp.txt | %{$_.split()[0]} | Group-Object | sort -Property count -Descending\n```" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## References" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "- [NLP 100 Exercise with Python (Chapter 2, Part 1)](http://qiita.com/gamma1129/items/92b23219a5b9d8333dad)\n- [NLP 100 Exercise with Python (Chapter 2, Part 2)](http://qiita.com/gamma1129/items/6afee2034d6028847e1a)\n- [NLP 100 Exercise, Chapter 2 in Python](http://qiita.com/piyo56/items/37cf702c2b5a7f5b5d72)" | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3", | |
"language": "python" | |
}, | |
"language_info": { | |
"nbconvert_exporter": "python", | |
"name": "python", | |
"codemirror_mode": { | |
"version": 3, | |
"name": "ipython" | |
}, | |
"version": "3.5.1", | |
"file_extension": ".py", | |
"pygments_lexer": "ipython3", | |
"mimetype": "text/x-python" | |
}, | |
"toc": { | |
"toc_threshold": "6", | |
"toc_cell": true, | |
"toc_number_sections": true, | |
"toc_window_display": false | |
}, | |
"hide_input": false, | |
"gist": { | |
"id": "", | |
"data": { | |
"description": "言語処理100本ノック (NLP 100 Exercise): Chapter 2 notes (UNIX command basics)", | |
"public": true | |
} | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |