2018-05-05

Golangのmapとsliceはどちらが速いのか

GoのパフォーマンスTipsメモにインデックスアクセスについて、以下のように述べられている。

mapのインデックスアクセスはsliceの数十倍遅い。 100件以下の場合バイナリサーチでsliceから目的の値を探すほうが早い。 100要素超えくらいからmapのアクセス速度一定の恩恵が発揮される。

実際にベンチマークを取ってみる。

測定したいこと

要素数の増加に従って、sliceとmapでインデックスアクセス速度がどのように変わるか
sliceから特定の要素を探すとき、mapと性能分岐点になる要素数はいくつか

条件

sliceは[]int型とする
sliceの中身はsort済みとする(indexをそのまま要素の値とする)
mapはmap[int]int型とする

環境

macOS High Sierra version 10.13.4（17E202）
MacBook Pro (Retina, 13-inch, Early 2015)
2.7 GHz Intel Core i5
16 GB 1867 MHz DDR3
Intel Iris Graphics 6100 1536 MB

❯ go version
go version go1.10.1 darwin/amd64

固定indexへアクセス

要素数nに対して、中間のn/2のインデックスにアクセスする時間を測定する。

実装

func benchmarkOfIndexAccessToMap(n int, b *testing.B) {
    m := make(map[int]int, n)
    idx := n / 2

    for i := 0; i < b.N; i++ {
        _ = m[idx]
    }
}
func BenchmarkOfIndexAccessToMap10(b *testing.B)   { benchmarkOfIndexAccessToMap(10, b) }
func BenchmarkOfIndexAccessToMap30(b *testing.B)   { benchmarkOfIndexAccessToMap(30, b) }
func BenchmarkOfIndexAccessToMap70(b *testing.B)   { benchmarkOfIndexAccessToMap(70, b) }
func BenchmarkOfIndexAccessToMap100(b *testing.B)  { benchmarkOfIndexAccessToMap(100, b) }
func BenchmarkOfIndexAccessToMap300(b *testing.B)  { benchmarkOfIndexAccessToMap(300, b) }
func BenchmarkOfIndexAccessToMap700(b *testing.B)  { benchmarkOfIndexAccessToMap(700, b) }
func BenchmarkOfIndexAccessToMap1000(b *testing.B) { benchmarkOfIndexAccessToMap(1000, b) }
func BenchmarkOfIndexAccessToMap3000(b *testing.B) { benchmarkOfIndexAccessToMap(3000, b) }

func benchmarkOfIndexAccessToSlice(n int, b *testing.B) {
    s := make([]int, n)
    idx := n / 2

    for i := 0; i < b.N; i++ {
        _ = s[idx]
    }
}
func BenchmarkOfIndexAccessToSlice10(b *testing.B)   { benchmarkOfIndexAccessToSlice(10, b) }
func BenchmarkOfIndexAccessToSlice30(b *testing.B)   { benchmarkOfIndexAccessToSlice(30, b) }
func BenchmarkOfIndexAccessToSlice70(b *testing.B)   { benchmarkOfIndexAccessToSlice(70, b) }
func BenchmarkOfIndexAccessToSlice100(b *testing.B)  { benchmarkOfIndexAccessToSlice(100, b) }
func BenchmarkOfIndexAccessToSlice300(b *testing.B)  { benchmarkOfIndexAccessToSlice(300, b) }
func BenchmarkOfIndexAccessToSlice700(b *testing.B)  { benchmarkOfIndexAccessToSlice(700, b) }
func BenchmarkOfIndexAccessToSlice1000(b *testing.B) { benchmarkOfIndexAccessToSlice(1000, b) }
func BenchmarkOfIndexAccessToSlice3000(b *testing.B) { benchmarkOfIndexAccessToSlice(3000, b) }

結果

BenchmarkOfIndexAccessToMap10-4           1000000000          2.81 ns/op
BenchmarkOfIndexAccessToMap30-4            1000000000          2.81 ns/op
BenchmarkOfIndexAccessToMap70-4            1000000000          2.81 ns/op
BenchmarkOfIndexAccessToMap100-4           1000000000          2.83 ns/op
BenchmarkOfIndexAccessToMap300-4           1000000000          2.81 ns/op
BenchmarkOfIndexAccessToMap700-4           1000000000          2.81 ns/op
BenchmarkOfIndexAccessToMap1000-4          1000000000          2.81 ns/op
BenchmarkOfIndexAccessToMap3000-4          1000000000          2.81 ns/op

BenchmarkOfIndexAccessToSlice10-4          2000000000          0.33 ns/op
BenchmarkOfIndexAccessToSlice30-4          2000000000          0.33 ns/op
BenchmarkOfIndexAccessToSlice70-4          2000000000          0.33 ns/op
BenchmarkOfIndexAccessToSlice100-4         2000000000          0.33 ns/op
BenchmarkOfIndexAccessToSlice300-4         2000000000          0.33 ns/op
BenchmarkOfIndexAccessToSlice700-4         2000000000          0.33 ns/op
BenchmarkOfIndexAccessToSlice1000-4        2000000000          0.33 ns/op
BenchmarkOfIndexAccessToSlice3000-4        2000000000          0.33 ns/op

表とグラフにまとめると以下の通り。

要素数	map	slice
10	2.81	0.33
30	2.81	0.33
70	2.81	0.33
100	2.83	0.33
300	2.81	0.33
700	2.81	0.33
1000	2.81	0.33
3000	2.81	0.33

f:id:cipepser:20180505152007p:plain

単純なインデックスアクセスであれば、sliceとmapどちらも固定の時間しか掛からない（sliceでも探索するわけではないので当然といえば当然）。アクセス速度はsliceのほうが約8.5倍速い。アクセスしたい要素のインデックスが予めわかっているようなケースや、各要素をloopで順番に処理していくような操作では、sliceを使ったほうがいいと考えられる。

異なるindexへアクセス

アクセスしたい要素のインデックスがわからないケースも想定してベンチマークを取ってみる。乱数でインデックスを決定することも考えたが、nに対して、b.Nが十分大きいことが前測定でわかったので、i%nがアクセスしたい要素とする。sliceの場合は、要素をBinary Searchで探索する必要があるので、実装し、同時に測定した。Binary Search単体でどれくらい時間を要するのかも確認したかったので、sliceへのインデックスアクセスをしない測定も同時に行っている。

実装

func benchmarkOfMap(n int, b *testing.B) {
    m := make(map[int]int, n)

    for i := 0; i < b.N; i++ {
        _ = m[i%n]
    }
}
func BenchmarkOfMap10(b *testing.B)   { benchmarkOfMap(10, b) }
func BenchmarkOfMap30(b *testing.B)   { benchmarkOfMap(30, b) }
func BenchmarkOfMap70(b *testing.B)   { benchmarkOfMap(70, b) }
func BenchmarkOfMap100(b *testing.B)  { benchmarkOfMap(100, b) }
func BenchmarkOfMap300(b *testing.B)  { benchmarkOfMap(300, b) }
func BenchmarkOfMap700(b *testing.B)  { benchmarkOfMap(700, b) }
func BenchmarkOfMap1000(b *testing.B) { benchmarkOfMap(1000, b) }
func BenchmarkOfMap3000(b *testing.B) { benchmarkOfMap(3000, b) }

func benchmarkOfSearchElementOfSlice(n int, b *testing.B) {
    s := make([]int, n)
    for i := range s {
        s[i] = i
    }

    for i := 0; i < b.N; i++ {
        _ = s[search(i%n, s)]
    }
}
func BenchmarkOfSearchElementOfSlice10(b *testing.B)   { benchmarkOfSearchElementOfSlice(10, b) }
func BenchmarkOfSearchElementOfSlice30(b *testing.B)   { benchmarkOfSearchElementOfSlice(30, b) }
func BenchmarkOfSearchElementOfSlice70(b *testing.B)   { benchmarkOfSearchElementOfSlice(70, b) }
func BenchmarkOfSearchElementOfSlice100(b *testing.B)  { benchmarkOfSearchElementOfSlice(100, b) }
func BenchmarkOfSearchElementOfSlice300(b *testing.B)  { benchmarkOfSearchElementOfSlice(300, b) }
func BenchmarkOfSearchElementOfSlice700(b *testing.B)  { benchmarkOfSearchElementOfSlice(700, b) }
func BenchmarkOfSearchElementOfSlice1000(b *testing.B) { benchmarkOfSearchElementOfSlice(1000, b) }
func BenchmarkOfSearchElementOfSlice3000(b *testing.B) { benchmarkOfSearchElementOfSlice(3000, b) }

func benchmarkOfSearch(n int, b *testing.B) {
    s := make([]int, n)
    for i := range s {
        s[i] = i
    }

    for i := 0; i < b.N; i++ {
        _ = search(i%n, s)
    }
}
func BenchmarkOfSearch10(b *testing.B)   { benchmarkOfSearch(10, b) }
func BenchmarkOfSearch30(b *testing.B)   { benchmarkOfSearch(30, b) }
func BenchmarkOfSearch70(b *testing.B)   { benchmarkOfSearch(70, b) }
func BenchmarkOfSearch100(b *testing.B)  { benchmarkOfSearch(100, b) }
func BenchmarkOfSearch300(b *testing.B)  { benchmarkOfSearch(300, b) }
func BenchmarkOfSearch700(b *testing.B)  { benchmarkOfSearch(700, b) }
func BenchmarkOfSearch1000(b *testing.B) { benchmarkOfSearch(1000, b) }
func BenchmarkOfSearch3000(b *testing.B) { benchmarkOfSearch(3000, b) }

Binary Searchは以下で実装。

package slicevsmap

// search returns an index of element `e` in slice `s`.
// s must be sorted slice to execute binary search.
func search(e int, s []int) int {
    idx := binarySearch(e, 0, len(s)-1, s)
    if idx < 0 {
        panic("not found")
    }
    return idx
}

func binarySearch(e, l, r int, s []int) int {
    i := (l + r) / 2

    if i == l {
        switch {
        case s[l] == e:
            return l
        case s[r] == e:
            return r
        default:
            return -1
        }
    }

    switch {
    case s[i] < e:
        return binarySearch(e, i+1, r, s)
    case s[i] > e:
        return binarySearch(e, l, i-1, s)
    case s[i] == i:
        return i
    }

    return -1
}

結果

BenchmarkOfMap10-4                        100000000          14.9 ns/op
BenchmarkOfMap30-4                         100000000          14.8 ns/op
BenchmarkOfMap70-4                         100000000          14.7 ns/op
BenchmarkOfMap100-4                        100000000          14.8 ns/op
BenchmarkOfMap300-4                        100000000          14.6 ns/op
BenchmarkOfMap700-4                        100000000          14.7 ns/op
BenchmarkOfMap1000-4                       100000000          14.6 ns/op
BenchmarkOfMap3000-4                       100000000          14.6 ns/op

BenchmarkOfSearchElementOfSlice10-4        100000000          23.8 ns/op
BenchmarkOfSearchElementOfSlice30-4        50000000           29.2 ns/op
BenchmarkOfSearchElementOfSlice70-4        50000000           33.4 ns/op
BenchmarkOfSearchElementOfSlice100-4       50000000           34.9 ns/op
BenchmarkOfSearchElementOfSlice300-4       30000000           48.7 ns/op
BenchmarkOfSearchElementOfSlice700-4       20000000           70.6 ns/op
BenchmarkOfSearchElementOfSlice1000-4      20000000           77.6 ns/op
BenchmarkOfSearchElementOfSlice3000-4      20000000           89.5 ns/op

BenchmarkOfSearch10-4                      100000000          23.1 ns/op
BenchmarkOfSearch30-4                      50000000           29.0 ns/op
BenchmarkOfSearch70-4                      50000000           32.9 ns/op
BenchmarkOfSearch100-4                     50000000           34.9 ns/op
BenchmarkOfSearch300-4                     30000000           48.2 ns/op
BenchmarkOfSearch700-4                     20000000           69.7 ns/op
BenchmarkOfSearch1000-4                    20000000           77.1 ns/op
BenchmarkOfSearch3000-4                    20000000           88.8 ns/op

こちらも表とグラフにまとめると以下の通り。

要素数	slice + search	searchのみ	map
10	23.8	23.1	14.9
30	29.2	29	14.8
70	33.4	32.9	14.7
100	34.9	34.9	14.8
300	48.7	48.2	14.6
700	70.6	69.7	14.7
1000	77.6	77.1	14.6
3000	89.5	88.8	14.6

f:id:cipepser:20180505152011p:plain

mapは要素数に依存しない定数時間になっている。sliceのほうは、Binary SearchがO(logN)で、探索に大部分の時間を費やしてしまっていることがわかる。sort済みのsliceですらこの結果になってしまうので、Existsメソッドを生やしたいような場合では、mapを使ったほうがよいと思われる。

結論

要素数の増加に従って、sliceとmapでインデックスアクセス速度がどのように変わるか
→ 単純なインデックスアクセスであれば、sliceのほうが8.5倍程度速い
sliceから特定の要素を探すとき、mapと性能分岐点になる要素数はいくつか
→ searchも含めて行う場合は、mapのほうが速い

同内容をGithubにも上げている。

References

GoのパフォーマンスTipsメモ

2018-05-03

ビットコインとブロックチェーン:暗号通貨を支える技術を読んだ

Mastering Bitcoinの邦訳であるビットコインとブロックチェーン:暗号通貨を支える技術/アンドレアス・M・アントノプロス (著), 今井崇也 (翻訳), 鳩貝淳一郎 (翻訳)を読みました。

Blockchainのバイブルと名高い本書ですが、その評判に違わない内容でした。 2009年にナカモトサトシが論文を発表して以来、数多くの記事がネット上に存在し、十分な情報がある現在ですが、どうしても今一歩疑問が解決できない事柄が多々ありました。個人的には以下の2つが、なんとなくはわかるけど、どうもしっくりこない事柄でした。

UTXO
マークル木

具体的にはそれぞれ、以下のような疑問がずっとありましたが、本書を読んで解決することができました。

UTXOの疑問点

UTXOがアカウントベースではないというのは、理解できるものの以下がよくわかりませんでした。

特定のアドレスの残高はどうやってわかるのか

この疑問は、以下の記載で氷解しました。

ウォレットは、ブロックチェーンをスキャンしてユーザに属しているすべてのUTXOを掻き集め、残高を計算しているのです。

(P.120 第5章トランザクション / トランザクションアウトプットより引用)

Blockchainの機能として残高がわかるわけではなく、ウォレットの機能なのですね。その他にも以下のような疑問がありました。

コインをロックするとは、どういうことなのか
任意の精度でアウトプットがロックされてしまうと、ゴミのようなコインばかりになってしまうのではないか

これらの疑問は以下の文章で理解できました。

・ビットコインの最小単位であるsatoshi単位で表されたビットコイン金額・アウトプットを使用するにあたって満たさなければいけない「解除条件(encumbrance)」として知られているlocking script

(P.122 第5章トランザクション / トランザクションアウトプットより引用)

コインのロックとは、locking scriptのことであり、これを解除する条件を満たせるのが、そのアドレスの秘密鍵を持っている人です。具体的には、秘密鍵で作成したunlock scriptを作成すること、というのもscript言語の節で理解できました。また、任意の精度ではなく、satoshi単位で量子化されているのも、考えてみれば当たり前ですが、本書を読まなければずっとモヤモヤしたままでした。

マークル木の疑問点

マークル木では、ブロックに含まれるトランザクションをマークルルートで代表するのは理解できるものの以下の疑問がありました。

client-sideで検証する際にマークルルートだけでは不足するのではないのか

再び本書から引用すると、以下の通りでした。

特定のトランザクションがブロックに格納されていることを証明するために、ビットコインノードはlog2(N)個の32バイトのハッシュ値を作るだけでよく、（中略）千個以上のトランザクションから、1個のトランザクションを特定するためのマークルパスを10〜12個のハッシュ値（320〜384バイト）で効率的に作り出すことができるのです。

(P.178 第7章ブロックチェーン / マークルツリーより引用)

疑問は正しく、マークルルートだけでは、特定のトランザクションが格納されている検証はできないものの、すべてのトランザクションを持つ必要はなく、log2(N)で検証に必要なトランザクション数を減らせるというのが肝のようでした。

網羅性が高い

本書はBlockchain、とりわけBitcoinについて非常に網羅的です。目次を引用しますが、第3章〜第8章のBlockchainの基盤的な仕組みがとても勉強になりました。逆に第9章は少し古いので、読み飛ばすのもありだと思います。

第1章　イントロダクション第2章　ビットコインの仕組み第3章　ビットコインクライアント第4章　キー、アドレス、ウォレット第5章　トランザクション(取引) 第6章　ビットコイン・ネットワーク第7章　ブロックチェーン第8章　マイニングとコンセンサス(合意形成) 第9章　代替的なチェーン、通貨、アプリケーション第10章　ビットコインの安全性

また、インターネット上の記事を見ていると「ビットコインアドレスは1から始まる」という説明をよく見ると思いますが、本書では、Version prefixがどのように付けられるから、それをbase58エンコーディングをした結果、先頭が1になるというような説明が、P2SHやWIF形式といった種類ごとにまとめられています。本書中でも「xから始めるアドレス」という表現は何度も登場し、ネット上でこれだけみんなが言っていたのも本書がバイブルであることを裏付けているようでした。

自分で手を動かせる

本書にはコードが多数含まれています。コマンドライン上で動かすbitcoin coreやlibbitcoin/bitcoin explorerや EthereumのVitalikが作成したpythonライブラリpybitcointoolsなどを使って、上記のxから始めるアドレスというようなアドレスを実際に生成してみたり、bitcoinコアのコマンドからblockやtx、utxoの情報を取得したりします。

コードは多いですが、自分で実装を考える部分はほぼなく、写経する形になると思います。むしろ大変なのは、150GB(2018/04/13時点)を超えるフルノードの同期と、バージョンが違うことによる./configureとmakeのエラー解消だと思います。実際、読書時点の2018/04/13時点で安定バージョンだったv0.16.0では、本書の内容から変更されている部分もありました。第3章から第6章を自分でやってみた際の結果と変更に応じて対処した点は、こちらにログを残しているのでご参考にどうぞ。

その他収穫だったところ

今までわからなかったscript言語もスタックマシンであるという説明があり、実際にscriptSigとscriptPubKeyが組み合わされることでUTXOのロックを解除する流れや簡単な例が記載されており、マルチシグのような条件付きスクリプトも読めるようになったのは大きな収穫です。その他にも階層的決定性ウォレットやトランザクションプール、extra nonceなど興味深いトピックが多数含まれており、とても勉強になりました。

こういう人におすすめ

上述の通り、コマンドライン上でbxを実行したり、C++、pythonのコードもありますが、そこまで難しくはないです。とはいえ、どの言語も触ったことがないという方だと少し難しい部分もあるかと思うので、技術的な観点からBlockchainを理解したいエンジニアの方には自信をもってオススメできます。

ビットコインとブロックチェーン:暗号通貨を支える技術

作者: アンドレアス・M・アントノプロス,今井崇也,鳩貝淳一郎
出版社/メーカー: エヌティティ出版
発売日: 2016/07/14
メディア: 大型本
この商品を含むブログ (7件) を見る

References

2018-04-21

Golangで言語処理100本ノック2015 目次

言語処理100本ノック 2015を完走したので、目次として各記事へのリンクをまとめました。

2018-04-21

Golangで言語処理100本ノック2015 第10章: ベクトル空間法 (II)

言語処理100本ノック 2015の第10章: ベクトル空間法 (II)の10問です。

第10章では，前章に引き続き単語ベクトルの学習に取り組む．

90. word2vecによる学習

81で作成したコーパスに対してword2vecを適用し，単語ベクトルを学習せよ．さらに，学習した単語ベクトルの形式を変換し，86-89のプログラムを動かせ．

まずは、word2vecで単語ベクトルを学習します。問題文のリンクはシェルスクリプトなのとSVNがリンク切れしているようなので、Githubのコピー版を使いました。第9章に合わせて引数を指定しています（default値と同じものもありますが、合わせてる意図を込めて明示的に記載しています）。

❯ ./word2vec -train <PATH_TO>/q81.out.txt -output ./trained_model.txt -size 300 -window 5 -negative 5

gist3119b38c63d0bd410eda14984c5d84f7

❯ go run q90.go
------------------------------
86. 単語ベクトルの表示
United_States
[1.823542 0.998212 0.151048 -0.756645 1.062619 1.191252 -1.759574 0.093121 -0.225518 -0.293923 -0.021835 -0.36047 0.672428 -0.316323 0.304076 0.622363 -0.327869 0.557787 0.303997 -0.940469 -0.303781 0.679504 -0.780764 0.235443 0.232059 0.787089 -0.503982 -0.413358 -0.577007 -1.864737 0.285715 0.063075 -0.653608 -0.666802 -0.401068 -0.225341 -0.076025 0.597728 -0.323169 1.686404 0.298269 0.329775 -1.483898 -0.060094 0.38326 -1.090787 -0.401468 0.584051 0.483002 0.289693 -1.364987 1.05808 1.084504 0.758496 -0.700064 0.921075 -0.389838 0.557126 -0.623444 0.545065 -1.866777 -1.540459 1.311491 1.252319 0.584708 -0.154731 1.178817 0.145006 -1.706097 -1.087952 -0.094799 -0.308998 -0.412141 0.747653 2.131562 0.277976 0.198845 1.520575 -0.453697 -0.340365 0.949026 -0.940041 0.362253 0.049247 0.16048 -0.014336 -1.221997 -0.341103 -1.099968 -0.17678 -1.028104 1.222829 -0.116968 -0.874536 -1.352752 -0.337586 0.153726 0.579554 1.410893 0.468515 -1.762948 0.614649 -0.508029 1.843732 2.288618 0.165021 -0.006009 -1.207882 -1.004512 0.660928 0.515029 0.591341 0.323199 1.665782 -1.4572 0.073172 -0.635124 -1.256417 1.200009 0.773504 0.258664 0.897132 -0.431907 0.014102 -0.112152 -0.070823 0.695606 -1.465958 -0.108996 -0.74142 0.804069 0.519639 0.821394 -1.453141 0.468004 1.578241 -0.163182 0.910499 0.310518 0.19384 -1.736236 -0.587345 0.305146 -0.955576 -1.035768 0.599607 -0.326483 -0.009862 -1.529698 -0.566457 0.233704 0.179851 -0.4688 -0.276734 0.579735 0.494316 -1.086717 -0.186733 -0.430683 0.061927 0.117999 -0.238442 1.233732 -0.431326 0.158956 1.685206 0.689615 1.619143 -0.512952 -1.22104 0.772325 -0.222188 -1.231234 0.127665 0.533659 2.21996 -0.220213 1.019247 -0.677278 -0.2214 -0.357192 1.181603 -1.539749 0.452797 1.653844 0.00879 0.271083 0.091245 -0.554843 -0.198385 -0.695354 -0.549476 -0.176233 0.779347 0.180929 0.028113 -0.33744 -1.248068 -1.201352 1.12126 0.239276 -0.086537 -0.847377 1.124556 -0.540444 1.216385 0.264993 -1.182609 -0.306502 0.783623 0.056023 -0.558314 0.707812 -0.169515 0.096631 0.444101 1.047364 -0.872509 -0.020277 1.113677 -1.054098 -2.080168 -2.928403 0.592833 -0.539894 0.34649 -2.267012 -0.184786 -0.975977 -0.045709 -0.022163 -0.288505 1.045948 -0.659885 -0.128221 -1.098224 -1.027867 0.375439 -0.435416 -1.238411 2.494976 0.797522 -0.704402 0.91279 -1.0877 -0.417203 -1.122003 0.06181 -1.862646 0.039802 0.147812 0.598384 0.06901 0.504204 0.778565 1.255542 -0.272787 1.952137 0.68609 1.174417 -0.386691 -0.239997 -0.05854 0.032976 -1.42854 0.638762 0.967045 0.75671 -0.599432 0.962889 -1.163935 0.57435 -1.434526 -0.838759 -0.423522 1.184661 0.49942 0.541755 -0.663361 -0.103821 0.315 0.847865 0.156598 1.353468 -0.383525 0.892902 0.247807 0.091043 0.789759 0.365569 -0.163217 1.9303 -1.300313 -1.616886 -1.752266 0.6924 -0.864655 -2.085828 1.231503 -0.501419]
------------------------------
87. 単語の類似度
United_Stetes v.s. U.S.
0.7745992551063425
------------------------------
88. 類似度の高い単語10件
1   England : 1
2   Wales : 0.6799484101924722
3   Scotland : 0.6601875749325679
4   Ireland : 0.576445075380364
5   Britain : 0.5294550459380852
6   Hampshire : 0.5080241507481739
7   Somerset : 0.5048523334327183
8   Lancashire : 0.4975930638881287
9   London : 0.49608850575325425
10   Glasgow : 0.490454852366284
------------------------------
89. 加法構成性によるアナロジー
1   Spain : 0.6948565586316053
2   Athens : 0.681273120661114
3   Greece : 0.564727893639371
4   Denmark : 0.5412544877031749
5   Romania : 0.5190412855968768
6   Armenia : 0.5134651735824168
7   Egypt : 0.5112680989598707
8   Argentina : 0.5040413172446792
9   Portugal : 0.5017621437939799
10   Austria : 0.4927839578682253

8章で苦労したものと違って学習にかかる時間も短く、結果も直感どおりでGoogleすごいなと実感させられますね。。。

91. アナロジーデータの準備

単語アナロジーの評価データをダウンロードせよ．このデータ中で": "で始まる行はセクション名を表す．例えば，": capital-common-countries"という行は，"capital-common-countries"というセクションの開始を表している．ダウンロードした評価データの中で，"family"というセクションに含まれる評価事例を抜き出してファイルに保存せよ．

gist82757c3b8676b53ccfd7cdcde06b2349

❯ go run q91.go
❯ head -10 ../data/q91.out.txt
: family
boy girl brother sister
boy girl brothers sisters
boy girl dad mom
boy girl father mother
boy girl grandfather grandmother
boy girl grandpa grandma
boy girl grandson granddaughter
boy girl groom bride
boy girl he she

こちらも評価データがリンク切れしているので、こちらを使いました。

92. アナロジーデータへの適用

91で作成した評価データの各事例に対して，vec(2列目の単語) - vec(1列目の単語) + vec(3列目の単語)を計算し，そのベクトルと類似度が最も高い単語と，その類似度を求めよ．求めた単語と類似度は，各事例の末尾に追記せよ．このプログラムを85で作成した単語ベクトル，90で作成した単語ベクトルに対して適用せよ．

前処理

Q85のモデルのフォーマットをQ90と同一にするため以下前処理を行います。

gist0bb71f60790e48b728636e259707691f

❯ go run q92pre.go
❯ head -2 ../data/q85_model.txt
23699 300
Tibet 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

本題

gist33df6b590a9ea39ddb3735bb0926537a

❯ go run q92.go
// 題意に従い、modelを変えて二度実行しています。

❯ head -20 ../data/q92_model_q85.out.txt
: family
boy girl brother sister brother 1.000000
boy girl brothers sisters Undifined -1.100000
boy girl dad mom Undifined NaN
boy girl father mother Undifined -1.100000
boy girl grandfather grandmother Undifined -1.100000
boy girl grandpa grandma Undifined NaN
boy girl grandson granddaughter Undifined -1.100000
boy girl groom bride Undifined NaN
boy girl he she he 1.000000
boy girl his her his 1.000000
boy girl husband wife Undifined -1.100000
boy girl king queen king 1.000000
boy girl man woman Undifined -1.100000
boy girl nephew niece Undifined -1.100000
boy girl policeman policewoman Undifined NaN
boy girl prince princess Undifined -1.100000
boy girl son daughter son 1.000000
boy girl sons daughters Undifined -1.100000
boy girl stepbrother stepsister Undifined NaN

❯ head -20 ../data/q92_model_q90.out.txt
: family
boy girl brother sister brother 0.877781
boy girl brothers sisters brothers 0.824237
boy girl dad mom girl 0.698522
boy girl father mother father 0.870984
boy girl grandfather grandmother grandfather 0.759021
boy girl grandpa grandma Undifined NaN
boy girl grandson granddaughter grandson 0.724518
boy girl groom bride girl 0.670416
boy girl he she he 0.881071
boy girl his her his 0.865413
boy girl husband wife husband 0.893910
boy girl king queen king 0.857420
boy girl man woman man 0.870114
boy girl nephew niece nephew 0.743345
boy girl policeman policewoman girl 0.700381
boy girl prince princess prince 0.755579
boy girl son daughter son 0.879387
boy girl sons daughters sons 0.880932
boy girl stepbrother stepsister Undifined NaN

Q85のほうはダメダメですね。Q90も正解率は高くなさそうです。

93. アナロジータスクの正解率の計算

92で作ったデータを用い，各モデルのアナロジータスクの正解率を求めよ．

gist68faf311e8fe37dacde2b1fd69682492

// Q85
❯ go run q93.go
0

// Q90
❯ go run q93.go
0.043478260869565216

Q85のmodelではひとつも正解できませんでした。Q90でも4%と精度はあまりよくないですね。 Q92で見たように零ベクトルが多くUndifinedになってしまいました。

94. WordSimilarity-353での類似度計算

The WordSimilarity-353 Test Collectionの評価データを入力とし，1列目と2列目の単語の類似度を計算し，各行の末尾に類似度の値を追加するプログラムを作成せよ．このプログラムを85で作成した単語ベクトル，90で作成した単語ベクトルに対して適用せよ．

gist79cb64278142bed6d7ca1b58c5529ec8

// Q85
❯ go run q93.go
❯ head -10 ../data/q94_model_q85.out.txt
Word 1,Word 2,Human (mean)
love,sex,6.77,NaN
tiger,cat,7.35,NaN
tiger,tiger,10.00,NaN
book,paper,7.46,NaN
computer,keyboard,7.62,NaN
computer,internet,7.58,NaN
plane,car,5.77,NaN
train,car,6.31,NaN
telephone,communication,7.50,NaN

// Q90
❯ go run q93.go
Word 1,Word 2,Human (mean)
love,sex,6.77,0.344764
tiger,cat,7.35,0.650308
tiger,tiger,10.00,1.000000
book,paper,7.46,0.457775
computer,keyboard,7.62,0.448387
computer,internet,7.58,0.518989
plane,car,5.77,0.397299
train,car,6.31,0.460842
telephone,communication,7.50,0.441498

人力で評価した関連する単語とこれまで学習してきたモデルの比較をしていきます。 Q85はすべてがNaNになっているわけではないのですが、大部分がNaNですね。一方でQ90はなかなか良さそうです。集計は次の問題です。

95. WordSimilarity-353での評価

94で作ったデータを用い，各モデルが出力する類似度のランキングと，人間の類似度判定のランキングの間のスピアマン相関係数を計算せよ．

gist7d5274b15f7a20948013658fd297bbfa

// Q85
❯ go run q95.go
-0.010072800290837701

// Q90
❯ go run q95.go
0.5951617769609865

今回はQ85で同順位が多いので、スピアマンの順位相関係数 - Wikipediaにも載っている同順位の場合で実装しようと思いましたが、dgryski/go-onlinestatsを拝借しました。

96. 国名に関するベクトルの抽出

word2vecの学習結果から，国名に関するベクトルのみを抜き出せ．

gist448273c49432eb5c3e7b5fa0d0051fb5

❯ go run q96.go
// 出力はgobでバイナリフォーマットのため省略

Q81でも用いたList of countries of the world in alphabetical orderの国名を抽出しました。前処理として複合語から成る国名は_で結合しています。

97. k-meansクラスタリング

96の単語ベクトルに対して，k-meansクラスタリングをクラスタ数k=5として実行せよ．

gisteeeff6fc40045a328486829456ddb759

❯ go run q97.go
4 : Sierra_Leone
3 : Wake_Island
2 : Virgin_Islands
3 : El_Salvador
1 : Costa_Rica
3 : Marshall_Islands
2 : American_Samoa
3 : Northern_Mariana_Islands
1 : Saudi_Arabia
4 : Cape_Verde
3 : Wallis_and_Futuna
3 : Central_African_Republic
3 : San_Marino
4 : Isle_of_Man
5 : United_Kingdom
3 : Bosnia_and_Herzegovina
4 : Puerto_Rico
3 : Cayman_Islands
5 : New_Zealand
3 : West_Bank
3 : Faroe_Islands
1 : Solomon_Islands
3 : Norfolk_Island
3 : United_Arab_Emirates
3 : Saint_Lucia
1 : Sri_Lanka
3 : Trinidad_and_Tobago
3 : Western_Sahara
2 : Hong_Kong
2 : United_States
1 : Dominican_Republic
3 : Cook_Islands
3 : Serbia_and_Montenegro
3 : French_Polynesia
3 : Saint_Helena
3 : Gaza_Strip
3 : French_Guiana
3 : Equatorial_Guinea
1 : Papua_New_Guinea
3 : Christmas_Island
3 : British_Virgin_Islands
1 : Czech_Republic
1 : New_Caledonia
1 : South_Africa
3 : Burkina_Faso
3 : Antigua_and_Barbuda

Q96で学習モデルに含まれていない(零ベクトルになる)国名は弾いてしまっているので46カ国での結果になりました。 k-meansの実体はkMeansに実装し、各国が割り当てられたlabelが変化しなくなるまでループを回しています。

98. Ward法によるクラスタリング

96の単語ベクトルに対して，Ward法による階層型クラスタリングを実行せよ．さらに，クラスタリング結果をデンドログラムとして可視化せよ．

gistb57d94e6ad48f7909011a28f0c1a0c3c

結果です。

f:id:cipepser:20180407153927p:plain

本文で実装した内容でパッケージ化したので、以下に使い方を含めてまとめています。【Golang】Ward法で階層的クラスタリングするパッケージを書いた - 逆さまにした

99. t-SNEによる可視化

96の単語ベクトルに対して，ベクトル空間をt-SNEで可視化せよ．

Q98と同様に可視化部分は自分で実装する必要があったのでgonum/plotで以下のように実装しました。

gist63aad7d67083a70dc8230315588ae407

本題はこちら。

gistab950a1bd807782eab10859a55633e56

結果です。

f:id:cipepser:20180408130612p:plain

t-SNEの計算は、tsne4goを使いました。最初可視化した際には黒一色で表示していましたが、意味が理解できなかったので、Q97のk-meansで色分けしています。そもそもt-SNEとk-meansの相性がいいのか知りませんが、気持ち分類できているかなというところでしょうか。

感想

以上で言語処理処理100本ノックの全100問完走です。途中で半年くらい間開けてしまったりしたので結構長く掛かりましたが完走できてよかったです。ただ、前章のQ85が微妙な結果だったので、Q85が実装できたら本章も再チャレンジしたいです。

もともとGoを触り始めて慣れるために始めたのですが、いろいろ実装できて自然言語処理の勉強に加えて、Goの勉強にもなりました。特にGoで100本ノックやっている方をほかに見かけないので自力で頑張る必要があったのも結果的にはよかったのかと思います。お付き合いありがとうございました。

コードはすべてまとめてGithubにも置いてあります。

Reference

言語処理100本ノック 2015
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin, "Placing Search in Context: The Concept Revisited", ACM Transactions on Information Systems, 20(1):116-131, January 2002.
スピアマンの順位相関係数 - Wikipedia
dgryski/go-onlinestats
List of countries of the world in alphabetical order
k平均法 - Wikipedia
linkage - MathWorks
ウォード法によるクラスタリングのやり方
【Golang】Ward法で階層的クラスタリングするパッケージを書いた - 逆さまにした
t-SNEによるイケてる次元圧縮&可視化
t-SNE を用いた次元圧縮方法のご紹介
CSS Colors - w3schools.com
tsne4go

2018-04-14

【Golang】Ward法で階層的クラスタリングするパッケージを書いた

背景

言語処理100本ノックのQ98を解くにあたって、Go実装がなかったので実装し、goClusteringというパッケージにしました。k-meansもQ97で実装したので、将来的には他のクラスタリングアルゴリズムも統合するかもしれません。

概要

内部的には、階層構造を二分木で表現し、距離尺度に分散を用いて2つのグループを一つにまとめる操作を再帰で実装しています。実装はこちら。

可視化には、gonum/plotを用いています。こちらも樹形図は自前で実装しています。以下のように可視化できます。

f:id:cipepser:20180407145255p:plain

使い方

READMEにも記載していますが、こちらでも紹介します。

How to Install

go getでインストールします。

go get github.com/cipepser/goClustering/...

可視化でgonum/plotを使うのでこちらもgo getします。

go get gonum.org/v1/plot/...

How to Use

入力はn * dの行列Xです。ここでnは観測データ数、dは特徴ベクトルの次元です。以下を実行することで、ward.Tree型の結果を得られます。

T := ward.Ward(X)

次にこれを可視化します。

d, _ := plotter.NewDendrogram(T)
p, err := plot.New()
p.Add(d)

図のタイトルや軸名を設定します。

p.Title.Text = "Dendrogram"
p.X.Label.Text = "data"
p.Y.Label.Text = "distance"

葉ノードを設定します。要素が多くなった場合には回転することもできます。

p.NominalX("aaa", "bbb", "ccc", "ddd", "eee", "fff")
p.X.Tick.Label.Rotation = math.Pi / 3
p.X.Tick.Label.YAlign = draw.YCenter
p.X.Tick.Label.XAlign = draw.XRight

Example

package main

import (
    "math"

    "github.com/cipepser/goClustering/ward"
    "github.com/cipepser/goClustering/vis"

    "gonum.org/v1/plot"
    "gonum.org/v1/plot/vg"
    "gonum.org/v1/plot/vg/draw"
)

func main() {
    // input data set
    X := [][]float64{
        {0, 0},
        {2, 2},
        {1, 1},
        {2, -1.2},
        {3, 2.2},
        {3.5, 0.5},
    }

    // Ward's method
    T := ward.Ward(X)

    // draw the dendrogram
    d, err := plotter.NewDendrogram(T)
    if err != nil {
        panic(err)
    }

    p, err := plot.New()
    if err != nil {
        panic(err)
    }
    p.Add(d)

    p.Title.Text = "Dendrogram"
    p.X.Label.Text = "data"
    p.NominalX("aaa", "bbb", "ccc", "ddd", "eee", "fff")

    p.X.Tick.Label.Rotation = math.Pi / 3
    p.X.Tick.Label.YAlign = draw.YCenter
    p.X.Tick.Label.XAlign = draw.XRight

    p.Y.Label.Text = "distance"

    // save as a png file
    file := "img.png"
    if err = p.Save(10*vg.Inch, 6*vg.Inch, file); err != nil {
        panic(err)
    }
}

f:id:cipepser:20180407145255p:plain

References

2018-04-07

Golangで言語処理100本ノック2015 第9章: ベクトル空間法 (I)

言語処理100本ノック 2015の第9章: ベクトル空間法 (I)の10問です。

enwiki-20150112-400-r10-105752.txt.bz2は，2015年1月12日時点の英語のWikipedia記事のうち，約400語以上で構成される記事の中から，ランダムに1/10サンプリングした105,752記事のテキストをbzip2形式で圧縮したものである．このテキストをコーパスとして，単語の意味を表すベクトル（分散表現）を学習したい．第9章の前半では，コーパスから作成した単語文脈共起行列に主成分分析を適用し，単語ベクトルを学習する過程を，いくつかの処理に分けて実装する．第9章の後半では，学習で得られた単語ベクトル（300次元）を用い，単語の類似度計算やアナロジー（類推）を行う．なお，問題83を素直に実装すると，大量（約7GB）の主記憶が必要になる．メモリが不足する場合は，処理を工夫するか，1/100サンプリングのコーパス enwiki-20150112-400-r100-10576.txt.bz2を用いよ．

※コード中ではr100 コーパスで実装している箇所もありますが、難関と評判のQ85でハマりr10コーパスにしています。

80. コーパスの整形

文を単語列に変換する最も単純な方法は，空白文字で単語に区切ることである．ただ，この方法では文末のピリオドや括弧などの記号が単語に含まれてしまう．そこで，コーパスの各行のテキストを空白文字でトークンのリストに分割した後，各トークンに以下の処理を施し，単語から記号を除去せよ．

トークンの先頭と末尾に出現する次の文字を削除: .,!?;:()[]'"
空文字列となったトークンは削除

以上の処理を適用した後，トークンをスペースで連結してファイルに保存せよ．

gist4ea9f91550e8b265f80751d8bd005e6b

❯ go run q80.go
❯ head -3 ../data/q80.out.txt
Anarchism
Anarchism is political philosophy that advocates stateless societies often defined as self-governed voluntary institutions but that several authors have defined as more specific institutions based on non-hierarchical free associations Anarchism holds the state to be undesirable unnecessary or harmful While anti-statism is central anarchism entails opposing authority or hierarchical organisation in the conduct of human relations including but not limited to the state system
As subtle and anti-dogmatic philosophy anarchism draws on many currents of thought and strategy Anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy There are many types and traditions of anarchism not all of which are mutually exclusive Anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism Strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications Anarchism is usually considered radical left-wing ideology and much of anarchist economics and anarchist legal philosophy reflect anti-authoritarian interpretations of communism collectivism syndicalism mutualism or participatory economics

トークンの先頭と末尾と書いてありますが、"-izein").のようなトークンで不要な記号の除去ができないので、一つの記号を除去したあと再帰的にremoveSymbol()を呼び出すことで、きちんと不要な記号を削除できるようにしました。

81. 複合語からなる国名への対処

英語では，複数の語の連接が意味を成すことがある．例えば，アメリカ合衆国は"United States"，イギリスは"United Kingdom"と表現されるが，"United"や"States"，"Kingdom"という単語だけでは，指し示している概念・実体が曖昧である．そこで，コーパス中に含まれる複合語を認識し，複合語を1語として扱うことで，複合語の意味を推定したい．しかしながら，複合語を正確に認定するのは大変むずかしいので，ここでは複合語からなる国名を認定したい．

インターネット上から国名リストを各自で入手し，80のコーパス中に出現する複合語の国名に関して，スペースをアンダーバーに置換せよ．例えば，"United States"は"United_States"，"Isle of Man"は"Isle_of_Man"になるはずである．

gistcfab66b743e9507879fb63a272086782

❯ go run q81.go
❯  head -10 ../data/q81.out.txt | grep United
The term anarchism isma compound word composed from the word anarchy and the suffix -ism themselves derived respectively from the Greek i.e anarchy from anarchos meaning one without rulers from the privative prefix ἀν- an- i.e without and archos i.e leader ruler cf archon or arkhē i.e authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal infinitive suffix -ίζειν -izein The first known use of this word was in 1539."Anarchist was the term adopted by Maximilien de Robespierre to attack those on the left whom he had used for his own ends during the French Revolution but was determined to get rid of though among these anarchists there were few who exhibited the social revolt characteristics of later anarchists There would be many revolutionaries of the early nineteenth century who contributed to the anarchist doctrines of the next generation such as William Godwin and Wilhelm Weitling but they did not use the word anarchist or anarchism in describing themselves or their beliefs Pierre-Joseph Proudhon was the first political philosopher to call himself an anarchist marking the formal birth of anarchism in the mid-nineteenth century Since the 1890s from France the term libertarianism has often been used as synonym for anarchism and was used almost exclusively in this sense until the 1950s in the United_States its use as synonym is still common outside the United_States On the other hand some use libertarianism to refer to individualistic free-market philosophy only referring to free-market anarchism as libertarian anarchism

List of countries of the world in alphabetical orderから国名リストを拝借しました。ただし、一語からなる国名の場合は、_で繋げてあげる必要もないので複数語からなる国名のみをターゲットとしています。

82. 文脈の抽出

81で作成したコーパス中に出現するすべての単語tに関して，単語tと文脈語cのペアをタブ区切り形式ですべて書き出せ．ただし，文脈語の定義は次の通りとする．

ある単語tの前後d単語を文脈語cとして抽出する（ただし，文脈語に単語tそのものは含まない）
単語tを選ぶ度に，文脈幅dは{1,2,3,4,5}の範囲でランダムに決める．

gist5d8c3571da29019ef73a187ab1f77edb

❯ go run q82.go
❯ head -20 ../data/q82.out.txt
Anarchism   Anarchism
Anarchism   is
Anarchism   political
Anarchism   philosophy
Anarchism   Anarchism
Anarchism   is
Anarchism   political
Anarchism   philosophy
is  Anarchism
is  Anarchism
is  political
is  philosophy
is  that
is  advocates
is  stateless
political   is
political   Anarchism
political   Anarchism
political   philosophy
political   that

あるトークンの前後d単語を抜き出すだけなので簡単ですね。

83. 単語／文脈の頻度の計測

82の出力を利用し，以下の出現分布，および定数を求めよ．

f(t,c): 単語tと文脈語cの共起回数
f(t,∗): 単語tの出現回数
f(∗,c): 文脈語cの出現回数
N: 単語と文脈語のペアの総出現回数

gist817e4f2f21365f6871e98e1f25dfc2a5

go run q83.go
N:  22760703

結果は表示しきれないため、総出現回数Nのみを表示しています。

84. 単語文脈行列の作成

83の出力を利用し，単語文脈行列 $X$ を作成せよ．ただし，行列Xの各要素 $X_{tc}$ は次のように定義する．

$f(t,c) \geq 10$ ならば， $X_{tc}=PPMI(t,c)=max \bigl\{ log \frac{ {N} \times f(t,c) } { f(t,∗) \times f(∗,c) },0 \bigr\}$
$f(t,c) \lt 10$ ならば， $X_{tc}=0$

ここで， $PPMI(t,c)$ はPositive Pointwise Mutual Information（正の相互情報量）と呼ばれる統計量である．なお，行列 $X$ の行数・列数は数百万オーダとなり，行列のすべての要素を主記憶上に載せることは無理なので注意すること．幸い，行列 $X$ のほとんどの要素は0になるので，非0の要素だけを書き出せばよい．

gista18d59730627de797661466de14525ab

// PPMIの大きい上位10個のみを表示
❯ go run q84.go
the number of the word:  390101
the number of contex word:  390101
length of X:  22760703
n(X)/(n(t) * n(c)):  0.0001495655404405392
------------------
{{<br>MONSIEUR LECOQ} 13.30823695164182}
{{Gols Gols} 13.02055487919004}
{{nah nah} 12.979732884669785}
{{LECOQ <br>MONSIEUR} 12.929583100984313}
{{Kimmage Kimmage} 12.915194363532214}
{{Terenure Terenure} 12.822248251022769}
{{Priorswood Priorswood} 12.792134270774987}
{{ου for} 12.728418456388878}
{{molalities as} 12.5192987064402}
{{Snana Made} 12.513398984313012}

何やらhtmlタグが入っていますが、先に進むことにします。。。このあとの問題が次元削減なので、この時点での要素がどれくらいの規模になるのか表示させました。また同時にどのくらいスパースなのかも見ておくことにしました。0.01%くらいの充填率のようです。

85. 主成分分析による次元圧縮

84で得られた単語文脈行列に対して，主成分分析を適用し，単語の意味ベクトルを300次元に圧縮せよ．

全量の確認

まず最初に述べておきますが、全量は実装できていません。

インプットとなるQ82の出力は以下のように6930万行となっています。

❯ wc ../data/q82.out.txt
 69306584 138756729 852050914 ../data/q82.out.txt

本問題では、PCAを行う必要があるため、Golangで主成分分析する - 逆さまにしたでも紹介したpackage stat - gonumを利用しています。ところがこの方法でPCAをしようとするとpanic: runtime error: makeslice: len out of rangeのエラーが発生します。 Sparse matrix formatsを用いて、Sparseに行列を持つところまでは実装していますが、statパッケージでは、データの格納以前に初期化として、runtimeのmakesliceを行うため上記エラーで失敗します。どうしてもPCAできず、トレーニングデータのサイズを小さくすることで対処しています。

❯ head -1000000 ../data/q82.out.txt > ../data/q82_tmp.out.txt

gist002868aaa7d7abfa20365039735c192a

実は上記でも3500万行くらいであればメモリの確保はできるのですが、スワップが発生して計算が終わらないので、100万行まで落としています。それでも1日以上掛かりました。。。

❯ go run q85.go
// 出力はgobで行っておりバイナリフォーマットのため省略

86. 単語ベクトルの表示

85で得た単語の意味ベクトルを読み込み，"United States"のベクトルを表示せよ．ただし，"United States"は内部的には"United_States"と表現されていることに注意せよ．

gistffdac9cafb3d1efad815290df09730a1

❯ go run q86.go
United_States : &{{1 [-0.014999052638340955 -0.06666512410686669 0.002520661466195838 
(中略)
-6.730425781930617e-15 -1.0561597069590151e-14]} 300}

問題文をそのまま実装してあげればOKです。もはやスパースではなくなっていますね。 goのmapは順番を保存しないので（一般的に保存しないため、以前改修されました）、rowの位置はQ85のdicttで保存しておいたものを使います。

87. 単語の類似度

85で得た単語の意味ベクトルを読み込み，"United States"と"U.S."のコサイン類似度を計算せよ．ただし，"U.S."は内部的に"U.S"と表現されていることに注意せよ．

gistc5505e492f63667d5fe9e84e746ba955

❯ go run q87.go
0.4862803570718515

Q86で表示したベクトルのコサイン類似度を計算します。 matパッケージにはDot関数があるので、そちらで内積とノルムを計算します。最初に実装したときは、ノルムの平方根を取るのを忘れてしまいました。。。

データ数が少ないからか、コサイン類似度は0.486でした。

88. 類似度の高い単語10件

85で得た単語の意味ベクトルを読み込み，"England"とコサイン類似度が高い10語と，その類似度を出力せよ．

gistc699a2323bcbc81fda3f93df14b922a5

※コード中にも記載していますが、Englandが零ベクトルとなったため、United_Statesで代用しています。トレーニングデータを増やす必要があります。。。

❯ go run q88.go
1   case : 0.9548240697912775
2   2005 : 0.9108531345148447
3   2008 : 0.9019300375926617
4   addition : 0.8825165206539555
5   2012 : 0.8660833551129663
6   2001 : 0.8660833551129661
7   highest : 0.4998879708231505
8   turn : 0.4994728313281741
9   UK : 0.49755475904637503
10   event : 0.49489020770038084

Q87の拡張版みたいな問題です。 sortするためにCosineSimilarity型を定義しています。 United_Statesは西暦に関することが多いですね。caseも類似度が高いので各年で起きた事件などでしょうか。

89. 加法構成性によるアナロジー

85で得た単語の意味ベクトルを読み込み，vec("Spain") - vec("Madrid") + vec("Athens")を計算し，そのベクトルと類似度の高い10語とその類似度を出力せよ．

gista29690d4cf9fe2c6deac6c8cad633092

❯ go run q89.go
1   Spain : 1.0000000000000002
2   terms : 0.9997587662737554
3   languages : 0.9990898124310773
4   cases : 0.9967353219953207
5   small : 0.9960712723292813
6   Germany : 0.9843980937275112
7   areas : 0.9791005134719385
8   church : 0.9774278193367258
9   Rome : 0.9751437717985347
10   2000 : 0.9482871637356568

実装自体は、Q88をほとんど使いまわせます。分散表現することで、単語が意味ベクトルとなったため、加算減算ができるようになりました。ただし、上記では、Q88と同様にMadridとAthensが零ベクトルとなってしまっているため、Spainだけで類似度を計算しているのと同じです。結果を見ると実装自体は間違っていないので、トレーニングデータが不足していますね。

感想

解答中にも記載していますが、Q85,88,89は十分な結果が得られていません。分散表現を実装することで単語の加法減法ができるようになったことと、どのように分散表現を作るのかは勉強になりましたが、Sparseな行列に対するPCAをうまく実装するのは改めて戻ってきたいと思います。

いつも通りコードはGithubにも置いてあります。

Reference

2018-03-31

GolangでIBLTを実装してみた

前回の記事では、IBLTの概要について記載しました。今回はIBLTのデータ構造と実装について記載します。なお、使った図などはこちらにスライドとしてまとめています。

データ構造

IBLTは以下のようにm個のcellを持ちます。各cellには、count、keySum、valueSumの3フィールドがあります。

f:id:cipepser:20180317154816p:plain

今回はkeyとvalueはint型として実装します。 cellは上記3つのフィールドを持つstructとし、IBLTをcellのsliceとして、以下のように定義します。

type cell struct {
    count, keySum, valueSum int
}

type IBLT []cell

ハッシュ関数

IBLTではBloomFilterと同じようにkey-value pairをInsert/Deleteするためcellを決定するためハッシュ関数を用います。今回の実装上は、説明用の図に合わせ、ハッシュ関数の数をk = 3としています。

メソッドの実装

前回の記事も記載しましたがIBLTでサポートされるメソッドは以下の4つです。

(key, value)のInsert
(key)のGet
(key, value)のDelete
ListEntries

それぞれ実装していきます。

Insert

まずは、Insertです。 Insertは単純で、TをIBLTとし、i = 1, ..., kのハッシュ関数 $h_i$ に対して以下の操作を行います。

$T[h_i [key$ ].count++]
$T[h_i [key$ ].sumKey += key]
$T[h_i [key$ ].sumValue += value]

Insert(key=5, value=10)とすると以下のようになります。

f:id:cipepser:20180317163818p:plain

さらにInsert(key=2, value=30)を追加でInsertしてみます。

f:id:cipepser:20180317163825p:plain 実装は以下のようになります。

func (iblt IBLT) Insert(key, value int) {
    hash := util.CalcMD5Hash(strconv.Itoa(key))
    hashA := hash[:int(len(hash)/2)]
    hashB := hash[int(len(hash)/2):]

    i64HashA, _ := strconv.ParseInt(hashA, 16, 64)
    i64HashB, _ := strconv.ParseInt(hashB, 16, 64)

    for i := 0; i < k; i++ {
        iblt[util.DoubleHashing(i64HashA, i64HashB, i, len(iblt))].count++
        iblt[util.DoubleHashing(i64HashA, i64HashB, i, len(iblt))].keySum += key
        iblt[util.DoubleHashing(i64HashA, i64HashB, i, len(iblt))].valueSum += value
    }
}

Get

次にGetです。フローチャートに表すと以下のようになります。

f:id:cipepser:20180317160012p:plain

何パターンか例を見ていきましょう。 (key, value) = (5, 10), (2, 30)が挿入済みのIBLTに対して、Get(key=2)を行ってみます。

f:id:cipepser:20180317160643p:plain

1つ目のハッシュ関数 $h_1$ でcountが1かつkeySumがkeyと一致したため正しくvalue=30を返せています。

次に(key, value) = (5, 10), (2, 30)が挿入済みのIBLT(1つ前と同じ)に対して、Get(key=3)(Insertしていないkey)を行ってみます。

f:id:cipepser:20180317160651p:plain

1つ目のハッシュ関数 $h_1$ でcountが0となったため正しくfalseを返せています。

最後に(key, value) = (5, 10), (2, 30)が挿入済みのIBLT(前2つと同じ)に対して、Get(key=9)(Insertしていないkey)を行ってみます。

f:id:cipepser:20180317160656p:plain

1つ目のハッシュ関数 $h_1$ でcountが1となりましたが、keySumがkeyと一致しませんでした。 2,3つ目のハッシュ関数 $h_2, h_3$ ではcountが2となり、k = 3個のハッシュ関数の探索が終わったため、正しくfalseを返します。

Getに関する補足

上で見たようにGetは、ハッシュ関数で特定したcellのcountが1かつkeySumがkeyと一致（一つしかInsertしていないcellでkeyが一致する）であれば、valueSumを返します。BloomFiterと同じく、いくつかのハッシュ関数が衝突したとしても一つでも衝突してない(countが0)のcellがあれば、正しくfalseを返します。

BloomFilterと異なる点としては、ハッシュ衝突が起こっても、keySumのチェックがあるため、keyが異なればfalseを返します。ただし注意が必要なのは、countが1のときにのみ正しいvalueを返してくれる点です。Insertが複数に渡って行われた場合、countが2以上のcellばかりになり、正しくInsertしたkeyであってもfalseを返します。つまり、false-negativeがあります（falseと返したが、実はkeyはInsertされている）。

false-negativeが起こる例も見てみましょう。 (key, value) = (5, 10), (2, 30), (3, 20)が挿入済みのIBLT(これまでの例より(3, 20)が多い)に対して、Get(key=2)(Insert済みのkey)を行ってみます。

f:id:cipepser:20180317161729p:plain

1,2,3つ目のハッシュ関数 $h_1, h_2, h_3$ のいずれでもcountが2となり、k = 3個のハッシュ関数の探索が終わったため、誤ってfalseを返します。

この辺は、原論文で言及されていないのですが、あまり嬉しくない性質ですね。

実装は以下です。

func (iblt IBLT) Get(key int) (bool, int) {
    hash := util.CalcMD5Hash(strconv.Itoa(key))
    hashA := hash[:int(len(hash)/2)]
    hashB := hash[int(len(hash)/2):]

    i64HashA, _ := strconv.ParseInt(hashA, 16, 64)
    i64HashB, _ := strconv.ParseInt(hashB, 16, 64)

    for i := 0; i < k; i++ {
        idx := util.DoubleHashing(i64HashA, i64HashB, i, len(iblt))
        if iblt[idx].count == 0 {
            return false, 0
        }
        if iblt[idx].count == 1 {
            if iblt[idx].keySum == key {
                return true, iblt[idx].valueSum
            }
            return false, 0
        }
    }
    return false, 0
}

Delete

DeleteはInsertの逆の手順となります。

$T[h_i [key$ ].count--]
$T[h_i [key$ ].sumKey -= key]
$T[h_i [key$ ].sumValue -= value]

(key, value) = (5, 10), (2, 30)が挿入済みのIBLTにDelete(key=2, value=30)を行ってみます。

f:id:cipepser:20180317162114p:plain

(key, value) = (5, 10)だけInsertした状態と一致し、正しくDeleteできています。ここで、Insertしていないkeyに対してもDeleteが実行できることに注意してください。

実装は以下です。

func (iblt IBLT) Delete(key, value int) {
    hash := util.CalcMD5Hash(strconv.Itoa(key))
    hashA := hash[:int(len(hash)/2)]
    hashB := hash[int(len(hash)/2):]

    i64HashA, _ := strconv.ParseInt(hashA, 16, 64)
    i64HashB, _ := strconv.ParseInt(hashB, 16, 64)

    for i := 0; i < k; i++ {
        iblt[util.DoubleHashing(i64HashA, i64HashB, i, len(iblt))].count--
        iblt[util.DoubleHashing(i64HashA, i64HashB, i, len(iblt))].keySum -= key
        iblt[util.DoubleHashing(i64HashA, i64HashB, i, len(iblt))].valueSum -= value
    }
}

ListEntries

最後にListEntriesです。 ListEntriesはkey-value pairを列挙するため、以下のPairを定義します。

type Pair struct {
    key, value int
}

フローチャートは以下のようになります。

f:id:cipepser:20180317162424p:plain

これまでと同じように例を見ていきます。 (key, value) = (5, 10), (2, 30)が挿入済みのIBLTにからPairを列挙するため、 ListEntries()を行ってみます。

f:id:cipepser:20180317162441p:plain

cellを前から順番に見ていき、countが1となるcellが見つかれば、そのPairを列挙、削除し、また最初からcellを見ていきます。 Insert済みであった(5, 10), (2, 30)が列挙できました。

実装は以下です。

func (iblt IBLT) ListEntries() []Pair {
    var pairs []Pair
    newIblt := NewIBLT(len(iblt))
    copy(newIblt, iblt)

LABEL:
    for i := 0; i < len(newIblt); i++ {
        if newIblt[i].count == 1 {
            pairs = append(pairs,
                Pair{
                    key:   newIblt[i].keySum,
                    value: newIblt[i].valueSum,
                },
            )
            newIblt.Delete(newIblt[i].keySum, newIblt[i].valueSum)
            goto LABEL
        }
    }
    return pairs
}

最後に

false-negativeがあるので使い所を選ぶのではないかというのが所感ですが、集合一致のような列挙が必要な用途では要素をサマライズしたテーブルを渡せるIBLTは有用そうです。実際、Grapheneでは、新しいブロックに2000tx入っている前提で2.6KBに圧縮できると述べられており、同条件のCompact Blocksの10KBに比べて3倍以上の圧縮効率となります。

また、今回実装したIBLTだけでなく、BloomFilterやcounting BloomFiterの実装もGithubにあげてあります。