空飛ぶロボットのつくりかた

ロボットをつくるために必要な技術をまとめます。ロボットの未来についても考えたりします。

機械学習のお勉強(Webアプリケーション)

教科書

GitHub - rasbt/python-machine-learning-book: The "Python Machine Learning (1st edition)" book code repository and info resource

robonchu.hatenablog.com の内容を実行している前提

例:http://raschkas.pythonanywhere.com/results

学習済みのscikit-learn推定器をシリアライズする

  • モデルの永続化のひとつの方法はpickleを使う

  • Numpyが含まれている場合はjoblibのほうが効率的らしい

import pickle
import os

dest = os.path.join('movieclassifier', 'pkl_objects')
if not os.path.exists(dest):
    os.makedirs(dest)

pickle.dump(stop, open(os.path.join(dest, 'stopwords.pkl'), 'wb'), protocol=2)  # 4)   
pickle.dump(clf, open(os.path.join(dest, 'classifier.pkl'), 'wb'), protocol=2)  # 4)

シリアライズ

import pickle
import re
import os
from vectorizer import vect

clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))

import numpy as np
label = {0:'negative', 1:'positive'}

example = ['I love this movie']
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[clf.predict(X)[0]], clf.predict_proba(X).max()*100))

上記スクリプトで使っている関数

from sklearn.feature_extraction.text import HashingVectorizer
import re
import os
import pickle

cur_dir = os.path.dirname(__file__)
stop = pickle.load(open(
                os.path.join(cur_dir, 
                'pkl_objects', 
                'stopwords.pkl'), 'rb'))

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) \
                   + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)

SQLiteデータベースをセットアップ

Webアプリケーションの予測に対するユーザーからのフィードバックを収集する

SQLite

f:id:robonchu:20171126101921p:plain

参考:

SQLite Home Page

SQLite入門

batteries included

pythonのbatteries includedの考え方: 標準ライブラリだけでも様々なことができる

参考: What “Batteries Included” Means | Musings of an Anonymous Geek

データベースの作成

batteries includedの考え方からsqlite3というSQLiteを操作できる標準APIがある

11.13. sqlite3 — SQLite データベースに対する DB-API 2.0 インタフェース — Python 2.7.14 ドキュメント

  • データベースの作成
import sqlite3
import os

if os.path.exists('reviews.sqlite'):
    os.remove('reviews.sqlite')

conn = sqlite3.connect('reviews.sqlite')
c = conn.cursor()
c.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)')

example1 = 'I love this movie'
c.execute("INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))", (example1, 1))

example2 = 'I disliked this movie'
c.execute("INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))", (example2, 0))

conn.commit()
conn.close()
  • データベースの内容確認
conn = sqlite3.connect('reviews.sqlite')
c = conn.cursor()

c.execute("SELECT * FROM review_db WHERE date BETWEEN '2015-01-01 10:10:10' AND DATETIME('now')")
results = c.fetchall()

conn.close()

print(results)

結果

[(u'I love this movie', 1, u'2017-11-26 01:06:38'), (u'I disliked this movie', 0, u'2017-11-26 01:06:38')]

Firefoxのバージョン 57.0ではSQLite Managerのpluginがつかえず...

f:id:robonchu:20171126101144p:plain

FlaskでのWebアプリ開発

f:id:robonchu:20171126101934p:plain

参考:

Quickstart — Flask Documentation (0.12)

Flaskの簡単な使い方 - Qiita

[Python] 軽量WebフレームワークのFlaskに入門(準備、起動、HTML、静的ファイル、GET、POSTなど) - YoheiM .NET

なれるための単純な開発

1st_flask_app_1/
├── app.py  <- メインのコード
└── templates  
    └── first_app.html  <- webブラウザでレンダリングする静的なHTMLファイル

app.py

from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('first_app.html')

if __name__ == '__main__':
    app.run(debug=True)

first_app.html

<!doctype html>
<html>
  <head>
    <title>First app</title>
  </head>
  <body>

  <div>
    Hi, this is my first Flask web app!
  </div>

  </body>
</html>

HTMLのわかりやすい参考: HTML | MDN

実行手順

  1. python app.py

  2. ブラウザでhttp://127.0.0.1:5000/にアクセス

  3. 「Hi, this is my first Flask web app! 」とでていることを確認

WTFormsでデータ収集

参考: WTForms Documentation

Flask にフォームがないから WTForms 使ってみた - present

WTFormsを学ぶ その1 - 学んだことをメモする日記

WTFormsを学ぶ その2 - 学んだことをメモする日記

WTFormsを学ぶ その3 - 学んだことをメモする日記

下記のような名前を入力して挨拶をするものをつくる

f:id:robonchu:20171126110945p:plain

f:id:robonchu:20171126110949p:plain

1st_flask_app_2/
├── app.py
├── static
│   └── style.css
└── templates
    ├── _formhelpers.html
    ├── first_app.html
    └── hello.html

app.py

from flask import Flask, render_template, request
from wtforms import Form, TextAreaField, validators

app = Flask(__name__)

class HelloForm(Form):
    sayhello = TextAreaField('',[validators.DataRequired()])

@app.route('/')
def index():
    form = HelloForm(request.form)
    return render_template('first_app.html', form=form)

@app.route('/hello', methods=['POST'])
def hello():
    form = HelloForm(request.form)
    if request.method == 'POST' and form.validate():
        name = request.form['sayhello']
        return render_template('hello.html', name=name)
    return render_template('first_app.html', form=form)

if __name__ == '__main__':
    app.run(debug=True)

_formhelpers.html

Jinja2というテンプレートエンジンを用いている

{% macro render_field(field) %}
  <dt>{{ field.label }}
  <dd>{{ field(**kwargs)|safe }}
  {% if field.errors %}
    <ul class=errors>
    {% for error in field.errors %}
      <li>{{ error }}</li>
    {% endfor %}
    </ul>
  {% endif %}
  </dd>
  </dt>
{% endmacro %}

style.css : CSScascading style sheets)ファイル

body {
    font-size: 2em;
}

first_app.html

<!doctype html>
<html>
  <head>
    <title>First app</title>
   <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
  </head>
  <body>

{% from "_formhelpers.html" import render_field %}

<div>What's your name?</div>
<form method=post action="/hello">

  <dl>
      {{ render_field(form.sayhello) }}
  </dl>

  <input type=submit value='Say Hello' name='submit_btn'>

</form>

  </body>
</html>

hello.html

<!doctype html>
<html>
  <head>
    <title>First app</title>
   <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
  </head>
  <body>

<div>Hello {{ name }}</div>


  </body>
</html>

Jinja2

テンプレートエンジン. pythonの結果からHTMLを作成したいときなどに便利.

Welcome to Jinja2 — Jinja2 Documentation (2.11)

jinja2 — it-note 1.0 ドキュメント

CSScascading style sheets)ファイル

HTMLドキュメントのルック&フィール(コンピュータの操作画面の見た目や操作感のこと)が指定できる

映画レビュー分類器をwebアプリとして実装

下記のようなレビューを書いて、その予測結果をフィードバックし、感謝を伝えるツールを作る

http://raschkas.pythonanywhere.com/results

f:id:robonchu:20171126110953p:plainf:id:robonchu:20171126111000p:plainf:id:robonchu:20171126111153p:plain

movieclassifier_with_update/
├── app.py
├── pkl_objects
│   ├── classifier.pkl
│   └── stopwords.pkl
├── reviews.sqlite
├── static
│   └── style.css
├── templates
│   ├── _formhelpers.html
│   ├── results.html
│   ├── reviewform.html
│   └── thanks.html
├── update.py
└── vectorizer.py

app.py

from flask import Flask, render_template, request
from wtforms import Form, TextAreaField, validators
import pickle
import sqlite3
import os
import numpy as np

# import HashingVectorizer from local dir
from vectorizer import vect

app = Flask(__name__)

######## Preparing the Classifier
cur_dir = os.path.dirname(__file__)
clf = pickle.load(open(os.path.join(cur_dir,
                 'pkl_objects',
                 'classifier.pkl'), 'rb'))
db = os.path.join(cur_dir, 'reviews.sqlite')

def classify(document):
    label = {0: 'negative', 1: 'positive'}
    X = vect.transform([document])
    y = clf.predict(X)[0]
    proba = np.max(clf.predict_proba(X))
    return label[y], proba

def train(document, y):
    X = vect.transform([document])
    clf.partial_fit(X, [y])

def sqlite_entry(path, document, y):
    conn = sqlite3.connect(path)
    c = conn.cursor()
    c.execute("INSERT INTO review_db (review, sentiment, date)"\
    " VALUES (?, ?, DATETIME('now'))", (document, y))
    conn.commit()
    conn.close()

######## Flask
class ReviewForm(Form):
    moviereview = TextAreaField('',
                                [validators.DataRequired(),
                                validators.length(min=15)])

@app.route('/')
def index():
    form = ReviewForm(request.form)
    return render_template('reviewform.html', form=form)

@app.route('/results', methods=['POST'])
def results():
    form = ReviewForm(request.form)
    if request.method == 'POST' and form.validate():
        review = request.form['moviereview']
        y, proba = classify(review)
        return render_template('results.html',
                                content=review,
                                prediction=y,
                                probability=round(proba*100, 2))
    return render_template('reviewform.html', form=form)

@app.route('/thanks', methods=['POST'])
def feedback():
    feedback = request.form['feedback_button']
    review = request.form['review']
    prediction = request.form['prediction']

    inv_label = {'negative': 0, 'positive': 1}
    y = inv_label[prediction]
    if feedback == 'Incorrect':
        y = int(not(y))
    train(review, y)
    sqlite_entry(db, review, y)
    return render_template('thanks.html')

if __name__ == '__main__':
    app.run(debug=True)

reviewform.html

<!doctype html>
<html>
  <head>
    <title>Movie Classification</title>
   <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
  </head>
  <body>

<h2>Please enter your movie review:</h2>

{% from "_formhelpers.html" import render_field %}

<form method=post action="/results">
  <dl>
    {{ render_field(form.moviereview, cols='30', rows='10') }}
  </dl>
  <div>
      <input type=submit value='Submit review' name='submit_btn'>
  </div>
</form>

  </body>
</html>

results.html

<!doctype html>
<html>
  <head>
    <title>Movie Classification</title>
   <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
  </head>
  <body>

<h3>Your movie review:</h3>
<div>{{ content }}</div>

<h3>Prediction:</h3>
<div>This movie review is <strong>{{ prediction }}</strong>
     (probability: {{ probability }}%).</div>

<div id='button'>
      <form action="/thanks" method="post">
        <input type=submit value='Correct' name='feedback_button'>
        <input type=submit value='Incorrect' name='feedback_button'>
        <input type=hidden value='{{ prediction }}' name='prediction'>
        <input type=hidden value='{{ content }}' name='review'>
      </form>
</div>

<div id='button'>
      <form action="/">
        <input type=submit value='Submit another review'>
      </form>
</div>

  </body>
</html>

style.css

body{
    width:600px;
}

.button{
    padding-top: 20px;
}

thanks.html

<!doctype html>
<html>
  <head>
    <title>Movie Classification</title>
   <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
  </head>
  <body>

<h3>Thank you for your feedback!</h3>

<div id='button'>
      <form action="/">
        <input type=submit value='Submit another review'>
      </form>
</div>

  </body>
</html>

実行

python app.py

Webアプリケーションをパブ陸Webサーバーにデプロイ

これまでのアプリのローカルでのテストが完了すればパブリックでデプロイする

PythonAnywhere

Host, run, and code Python in the cloud: PythonAnywhere

  1. アカウントの登録

  2. Add a new web app

  3. File Upload or do something

そうすれば<ユーザー名>.pythonanywhere.comでアクセスできる

参考:

Flask Tutorial (part 5) - deploying to PythonAnywhere - YouTube

Simple Flask Hosting: PythonAnywhere

映画レビュー分類器の更新

サーバーがクラッシュした際にデータを消えないようにするには更新されるたびにclfオブジェクトをシリアライズすることが考えられる。

しかし、ユーザーの数が増えるに従い、計算効率が悪くなる、また、同時に更新した際にはファイルが壊れるかもしれない。

なので、もうひとつの方法として、 SQLiteデータベースに収集されたフィードバックデータを使った予測モデルの更新 がある。

実装

update.py *

import pickle
import sqlite3
import numpy as np
import os

# import HashingVectorizer from local dir
from vectorizer import vect

def update_model(db_path, model, batch_size=10000):

    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute('SELECT * from review_db')

    results = c.fetchmany(batch_size)
    while results:
        data = np.array(results)
        X = data[:, 0]
        y = data[:, 1].astype(int)

        classes = np.array([0, 1])
        X_train = vect.transform(X)
        model.partial_fit(X_train, y, classes=classes)
        results = c.fetchmany(batch_size)

    conn.close()
    return model

cur_dir = os.path.dirname(__file__)

clf = pickle.load(open(os.path.join(cur_dir,
                  'pkl_objects',
                  'classifier.pkl'), 'rb'))
db = os.path.join(cur_dir, 'reviews.sqlite')

clf = update_model(db_path=db, model=clf, batch_size=10000)

# Uncomment the following lines if you are sure that
# you want to update your classifier.pkl file
# permanently.

# pickle.dump(clf, open(os.path.join(cur_dir,
#             'pkl_objects', 'classifier.pkl'), 'wb')
#             , protocol=4)

これをapp.pyで呼び出すようにする

以下の二箇所をapp.pyに追加

from update import update_model
if __name__ == '__main__' :
    clf = update_model(dp_path=db, model=clf, batch_size=10000)

softkineticをros(kinetic)で動かしみる

f:id:robonchu:20171125212310p:plain

f:id:robonchu:20171125212320j:plain

softkinetic(ToFセンサ)を3つゲットしたので動かしてみる

SoftKinetic - 3D Vision Leader

softkinetic - ROS Wiki

softkineticのSDKのダウンロード

  1. アカウント登録

  2. SDKのダウンロード:Download

  3. ~/.bashrcにpathを追加:export LD_LIBRARY_PATH=/opt/softkinetic/DepthSenseSDK/lib/:$LD_LIBRARY_PATH

rosのsoftkineticのpackageをインストール&demoの実行

  1. git clone https://github.com/ipa320/softkinetic.git

  2. branchをkinetic-devに

  3. ワークスペースを作ってcatkin_make

  4. source devel/setup.bash

  5. roslaunch softkinetic_camera softkinetic_camera_demo.launch

f:id:robonchu:20171125212350p:plain

イエイ、動いた。

これを使ってPCLのお勉強とNewロボットを作ろう。

Error: No space left on device

これが出たら以下の記事を参考に対処

jsk_recognition/install_softkinetic_camera.rst at master · jsk-ros-pkg/jsk_recognition · GitHub

Softkinetic node StreamingException with senz3D - ROS Answers: Open Source Q&A Forum

機械学習のお勉強(BoW)

教科書

GitHub - rasbt/python-machine-learning-book: The "Python Machine Learning (1st edition)" book code repository and info resource

映画のデータセットの取得

http://ai.stanford.edu/~amaas/data/sentiment/

import pyprind
import pandas as pd
import os

# change the `basepath` to the directory of the
# unzipped movie dataset

#basepath = '/Users/Sebastian/Desktop/aclImdb/'
basepath = './aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r') as infile: #, #encoding='utf-8') as infile:
                txt = infile.read()
            df = df.append([[txt, labels[l]]], ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']

BoW

Bag of Words:

  1. ドキュメントの集合全体から一意なトークンからなる語彙を作成する(一意なトークン(ソースコード中の文字の並び中の、1個ないし複数個の文字から成る文字列):単語etc...)

  2. 各ドキュメントでの各単語の出現回数を含んだ特徴ベクトルを構築する

特徴ベクトルの大半は0となる疎ベクトルになる

単語を特徴ベクトルに変換

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

print(count.vocabulary_)

結果

{u'and': 0, u'sun': 4, u'is': 1, u'two': 7, u'one': 2, u'weather': 8, u'sweet': 5, u'the': 6, u'shining': 3}
print(bag.toarray())

結果

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
  • 1グラム:"the" , "sun" , "is"

  • 2グラム:"the sun" , "sun is"

単語の関連性の評価

TF-IDF : Term Frequency-Inverse Document Frequency

tf-idf(t, d) = tf(t, d) x idf(t, d)

tf:ドキュメントdにおける単語tの出現回数

idf = log (n_d) / (1 + df(t, d))

n_d: ドキュメントの総数

df : 単語tを含んでいるドキュメントdの個数

sk-learnでの実装

np.set_printoptions(precision=2)
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

結果

[[ 0.    0.43  0.    0.56  0.56  0.    0.43  0.    0.  ]
 [ 0.    0.43  0.    0.    0.    0.56  0.43  0.    0.56]
 [ 0.5   0.45  0.5   0.19  0.19  0.19  0.3   0.25  0.19]]

"is"の重要度が下がっている

sk-learnではtf-idfの最後にL2正規化を行っている。

テキストデータのクレンジング

文章から顔文字以外の句読点を削除する

import re
def preprocessor(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub('[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text

正規表現に参考になるサイト: Python Regular Expressions  |  Google for Education  |  Google Developers

ドキュメントをトークン化

ワードステミング:単語を原型にする。関連する単語を同じ互換にマッピングできる。

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

print tokenizer('runners like running and thus they run')]
print tokenizer_porter('runners like running and thus they run')

結果

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']
[u'runner', 'like', u'run', 'and', u'thu', 'they', 'run']

高度な自然言語処理NLTK Book を参考

ストップワードの除去:ごくありふれた単語 (ex: is , and , has etc ...)

import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
if w not in stop]

結果

['runner', u'like', u'run', u'run', 'lot']

ドキュメントを分類するロジスティック回帰モデルのトレーニング

肯定的、否定的に分類する

GridSearchCVオブジェクトを使ってロジスティック回帰モデルの最適なパラメータ集合を求める

X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)

print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

最適なパラメータセットは以下となる

Best parameter set: {'vect__tokenizer': <function tokenizer at 0x11851c6a8>, 'clf__C': 10.0, 'vect__stop_words': None, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1)} 

正解率

CV Accuracy: 0.897
Test Accuracy: 0.899

この結果から90%の正解率が予測できることがわかる。

テキスト分類ではナイーブベイズ分類器が有名

オンラインアルゴリズムとアウトオブコア学習

ローカルドライブからドキュメントを直接ストリーミングし、ドキュメントの小さなミニバッチを使ってロジスティック回帰モデルをトレーニングする

トークンに分割 & ドキュメントをひとつずつ読み込んで返すジェネレータを作成

import numpy as np
import re
from nltk.corpus import stopwords

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

print next(stream_docs(path='./movie_data.csv'))

結果

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich family used their influence to cover the murder for more than twenty years. However, a snoopy detective and convicted perjurer in disgrace was able to disclose how the hideous crime was committed. The screenplay shows the investigation of Mark and the last days of Martha in parallel, but there is a lack of the emotion in the dramatization. My vote is seven.<br /><br />Title (Brazil): Not Available"',
 1)

get_minibatch関数を定義する

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

アウトオブコア学習のためにハッシュトリックを利用

特徴量を2**21に設定。ロジスティック回帰分類器を初期化している。

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
doc_stream = stream_docs(path='./movie_data.csv')
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

結果

Accuracy: 0.867

正解率は僅かに下がるが額スユ時間が大幅に削減される

また、以下のように最後の5000個のドキュメントを使ってモデルを逐次的に更新できる

clf = clf.partial_fit(X_test, y_test)

Word2Vec

ニューラルネットワークに基づく教師なし学習アルゴリズムであり、単語間の関係を自動的に学習しようとする。

Google Code Archive - Long-term storage for Google Code Project Hosting.

絵で理解するWord2vecの仕組み - Qiita

機械学習のお勉強(アンサンブル学習)

アンサンブル学習

様々な分類器をひとつのメタ分類器として組み合わせる。それによって高い汎化性能をもたらす。

  • 多数決に基づく予測

  • トレーニングデータセットのランダムな組み合わせを繰り返し抽出することによる過学習の抑制

  • 誤答から学習する弱学習器による強力なモデルの構築

f:id:robonchu:20171029165214p:plain

多数決

f:id:robonchu:20171029165403p:plain

y^ = mode{C1(x), C2(x),...,Cm(x)} (mode:最頻値)

C(x) = sign[ΣCj] = { 1 (ΣC > 0) or -1 (ΣC < 0)

アンサンブルの誤分類率

P:二項分布の確率質量関数

k:誤分類の個数

y:要素数

P(y>=k) = Σ nCk εk (1-ε)(n-k) = ε

from scipy.misc import comb
import math

def ensemble_error(n_classifier, error):
    k_start = int(math.ceil(n_classifier / 2.0)) # 引数の値以上の最小の整数を返す
    probs = [comb(n_classifier, k) * error**k * (1-error)**(n_classifier - k)
             for k in range(k_start, n_classifier + 1)]
    return sum(probs)

ensemble_error(n_classifier=11, error=0.25)

結果

0.034327507019042969

ここの分類率error=0.25より誤分類の確率が低いことがわかる。

このベース分類率を0.0~1.0にふった時の誤分類率は以下のようになる。

import numpy as np

error_range = np.arange(0.0, 1.01, 0.01)
ens_errors = [ensemble_error(n_classifier=11, error=error)
              for error in error_range]

import matplotlib.pyplot as plt

plt.plot(error_range, 
         ens_errors, 
         label='Ensemble error', 
         linewidth=2)

plt.plot(error_range, 
         error_range, 
         linestyle='--',
         label='Base error',
         linewidth=2)

plt.xlabel('Base error')
plt.ylabel('Base/Ensemble error')
plt.legend(loc='upper left')
plt.grid()
plt.tight_layout()
# plt.savefig('./figures/ensemble_err.png', dpi=300)
plt.show()

f:id:robonchu:20171029171501p:plain

ベース分類器の誤分類率が0.5を下回っていれば、アンサンブル学習の誤分類率がそれよりも低くなることがわかる。

多数決分類器の実装

3つの分類器があって、それぞれの重みが0.2,0.2,0.6の場合

import numpy as np

np.argmax(np.bincount([0, 0, 1], 
                      weights=[0.2, 0.2, 0.6]))

ひとつ目、ふたつ目の分類器が0とクラス分類しても、みっつ目の分類器が1と分類すれば、結果は1となる。

sk-learnではクラスラベルの確率が得られるので、その確率pに重みをつけて多数決を行うこともできる。

y^ = arg max Σ w * p

ex = np.array([[0.9, 0.1],
               [0.8, 0.2],
               [0.4, 0.6]])

p = np.average(ex, 
               axis=0, 
               weights=[0.2, 0.2, 0.6])
print p
print np.argmax(p)

結果

array([ 0.58,  0.42])
0

多数決方式の分類アルゴリズムの構築(スタッキング)

上記を組み合わせた実装

from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.externals import six
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np
import operator


class MajorityVoteClassifier(BaseEstimator, 
                             ClassifierMixin):
    """ A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
      Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'} (default='label')
      If 'classlabel' the prediction is based on the argmax of
        class labels. Else if 'probability', the argmax of
        the sum of probabilities is used to predict the class label
        (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers], optional (default=None)
      If a list of `int` or `float` values are provided, the classifiers
      are weighted by importance; Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers, vote='classlabel', weights=None):

        self.classifiers = classifiers
        self.named_classifiers = {key: value for key, value
                                  in _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        """ Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of training samples.

        y : array-like, shape = [n_samples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        if self.vote not in ('probability', 'classlabel'):
            raise ValueError("vote must be 'probability' or 'classlabel'"
                             "; got (vote=%r)"
                             % self.vote)

        if self.weights and len(self.weights) != len(self.classifiers):
            raise ValueError('Number of classifiers and weights must be equal'
                             '; got %d weights, %d classifiers'
                             % (len(self.weights), len(self.classifiers)))

        # Use LabelEncoder to ensure class labels start with 0, which
        # is important for np.argmax call in self.predict
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X, self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self

    def predict(self, X):
        """ Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of training samples.

        Returns
        ----------
        maj_vote : array-like, shape = [n_samples]
            Predicted class labels.
            
        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X), axis=1)
        else:  # 'classlabel' vote

            #  Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                      for clf in self.classifiers_]).T

            maj_vote = np.apply_along_axis(
                                      lambda x:
                                      np.argmax(np.bincount(x,
                                                weights=self.weights)),
                                      axis=1,
                                      arr=predictions)
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    def predict_proba(self, X):
        """ Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        Returns
        ----------
        avg_proba : array-like, shape = [n_samples, n_classes]
            Weighted average probability for each class per sample.

        """
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])
        avg_proba = np.average(probas, axis=0, weights=self.weights)
        return avg_proba

    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        if not deep:
            return super(MajorityVoteClassifier, self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in six.iteritems(self.named_classifiers):
                for key, value in six.iteritems(step.get_params(deep=True)):
                    out['%s__%s' % (name, key)] = value
            return out

参考:Python の無名関数( lambda )の使い方 - Life with Python

データセットの準備

あやめのデータセット👇

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[50:, [1, 2]], iris.target[50:]
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test =\
       train_test_split(X, y, 
                        test_size=0.5, 
                        random_state=1)

ロジスティック回帰、決定機分類器、k近傍分類器でのそれぞれの結果

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.pipeline import Pipeline
if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import cross_val_score
else:
    from sklearn.model_selection import cross_val_score

clf1 = LogisticRegression(penalty='l2', 
                          C=0.001,
                          random_state=0)

clf2 = DecisionTreeClassifier(max_depth=1,
                              criterion='entropy',
                              random_state=0)

clf3 = KNeighborsClassifier(n_neighbors=1,
                            p=2,
                            metric='minkowski')

pipe1 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf1]])
pipe3 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf3]])

clf_labels = ['Logistic Regression', 'Decision Tree', 'KNN']

print('10-fold cross validation:\n')
for clf, label in zip([pipe1, clf2, pipe3], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='roc_auc')
    print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))

結果

10-fold cross validation:

ROC AUC: 0.92 (+/- 0.20) [Logistic Regression]
ROC AUC: 0.92 (+/- 0.15) [Decision Tree]
ROC AUC: 0.93 (+/- 0.10) [KNN]

アンサンブル学習の結果

# Majority Rule (hard) Voting

mv_clf = MajorityVoteClassifier(classifiers=[pipe1, clf2, pipe3])

clf_labels += ['Majority Voting']
all_clf = [pipe1, clf2, pipe3, mv_clf]

for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='roc_auc')
    print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))

結果

ROC AUC: 0.92 (+/- 0.20) [Logistic Regression]
ROC AUC: 0.92 (+/- 0.15) [Decision Tree]
ROC AUC: 0.93 (+/- 0.10) [KNN]
ROC AUC: 0.97 (+/- 0.10) [Majority Voting]

ここからアンサンブル学習の結果が良いことがわかる

アンサンブル分類器の評価とチューニング

from sklearn.metrics import roc_curve
from sklearn.metrics import auc

colors = ['black', 'orange', 'blue', 'green']
linestyles = [':', '--', '-.', '-']
for clf, label, clr, ls \
        in zip(all_clf,
               clf_labels, colors, linestyles):

    # assuming the label of the positive class is 1
    y_pred = clf.fit(X_train,
                     y_train).predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_true=y_test,
                                     y_score=y_pred)
    roc_auc = auc(x=fpr, y=tpr)
    plt.plot(fpr, tpr,
             color=clr,
             linestyle=ls,
             label='%s (auc = %0.2f)' % (label, roc_auc))

plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1],
         linestyle='--',
         color='gray',
         linewidth=2)

plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.grid()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

# plt.tight_layout()
# plt.savefig('./figures/roc.png', dpi=300)
plt.show()

結果

f:id:robonchu:20171029175505p:plain

決定領域の確認

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)

from itertools import product

all_clf = [pipe1, clf2, pipe3, mv_clf]

x_min = X_train_std[:, 0].min() - 1
x_max = X_train_std[:, 0].max() + 1
y_min = X_train_std[:, 1].min() - 1
y_max = X_train_std[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(nrows=2, ncols=2, 
                        sharex='col', 
                        sharey='row', 
                        figsize=(7, 5))

for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        all_clf, clf_labels):
    clf.fit(X_train_std, y_train)
    
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.3)
    
    axarr[idx[0], idx[1]].scatter(X_train_std[y_train==0, 0], 
                                  X_train_std[y_train==0, 1], 
                                  c='blue', 
                                  marker='^',
                                  s=50)
    
    axarr[idx[0], idx[1]].scatter(X_train_std[y_train==1, 0], 
                                  X_train_std[y_train==1, 1], 
                                  c='red', 
                                  marker='o',
                                  s=50)
    
    axarr[idx[0], idx[1]].set_title(tt)

plt.text(-3.5, -4.5, 
         s='Sepal width [standardized]', 
         ha='center', va='center', fontsize=12)
plt.text(-10.5, 4.5, 
         s='Petal length [standardized]', 
         ha='center', va='center', 
         fontsize=12, rotation=90)

plt.tight_layout()
# plt.savefig('./figures/voting_panel', bbox_inches='tight', dpi=300)
plt.show()

結果

f:id:robonchu:20171029180116p:plain

これをみるにそれぞれの決定領域を掛けあわせたような結果となっている

パラメータのチューニング

mv_clf.get_params()

結果

{'decisiontreeclassifier': DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=1,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, presort=False, random_state=0,
             splitter='best'),
 'decisiontreeclassifier__class_weight': None,
 'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 1,
 'decisiontreeclassifier__max_features': None,
 'decisiontreeclassifier__max_leaf_nodes': None,
 'decisiontreeclassifier__min_impurity_decrease': 0.0,
 'decisiontreeclassifier__min_impurity_split': None,
 'decisiontreeclassifier__min_samples_leaf': 1,
 'decisiontreeclassifier__min_samples_split': 2,
 'decisiontreeclassifier__min_weight_fraction_leaf': 0.0,
 'decisiontreeclassifier__presort': False,
 'decisiontreeclassifier__random_state': 0,
 'decisiontreeclassifier__splitter': 'best',
 'pipeline-1': Pipeline(memory=None,
      steps=[('sc', StandardScaler(copy=True, with_mean=True, with_std=True)), ['clf', LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False)]]),
 'pipeline-1__clf': LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False),
 'pipeline-1__clf__C': 0.001,
 'pipeline-1__clf__class_weight': None,
 'pipeline-1__clf__dual': False,
 'pipeline-1__clf__fit_intercept': True,
 'pipeline-1__clf__intercept_scaling': 1,
 'pipeline-1__clf__max_iter': 100,
 'pipeline-1__clf__multi_class': 'ovr',
 'pipeline-1__clf__n_jobs': 1,
 'pipeline-1__clf__penalty': 'l2',
 'pipeline-1__clf__random_state': 0,
 'pipeline-1__clf__solver': 'liblinear',
 'pipeline-1__clf__tol': 0.0001,
 'pipeline-1__clf__verbose': 0,
 'pipeline-1__clf__warm_start': False,
 'pipeline-1__memory': None,
 'pipeline-1__sc': StandardScaler(copy=True, with_mean=True, with_std=True),
 'pipeline-1__sc__copy': True,
 'pipeline-1__sc__with_mean': True,
 'pipeline-1__sc__with_std': True,
 'pipeline-1__steps': [('sc',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ['clf',
   LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
             intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
             penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
             verbose=0, warm_start=False)]],
 'pipeline-2': Pipeline(memory=None,
      steps=[('sc', StandardScaler(copy=True, with_mean=True, with_std=True)), ['clf', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            metric_params=None, n_jobs=1, n_neighbors=1, p=2,
            weights='uniform')]]),
 'pipeline-2__clf': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            metric_params=None, n_jobs=1, n_neighbors=1, p=2,
            weights='uniform'),
 'pipeline-2__clf__algorithm': 'auto',
 'pipeline-2__clf__leaf_size': 30,
 'pipeline-2__clf__metric': 'minkowski',
 'pipeline-2__clf__metric_params': None,
 'pipeline-2__clf__n_jobs': 1,
 'pipeline-2__clf__n_neighbors': 1,
 'pipeline-2__clf__p': 2,
 'pipeline-2__clf__weights': 'uniform',
 'pipeline-2__memory': None,
 'pipeline-2__sc': StandardScaler(copy=True, with_mean=True, with_std=True),
 'pipeline-2__sc__copy': True,
 'pipeline-2__sc__with_mean': True,
 'pipeline-2__sc__with_std': True,
 'pipeline-2__steps': [('sc',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ['clf',
   KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
              metric_params=None, n_jobs=1, n_neighbors=1, p=2,
              weights='uniform')]]}

get_paramsメソッドは、キーがパラメータ名、バリューがパラメータ値とするディクショナリを返している。

パラメータをふって結果をみてみる

if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

params = {'decisiontreeclassifier__max_depth': [1, 2],
          'pipeline-1__clf__C': [0.001, 0.1, 100.0]}

grid = GridSearchCV(estimator=mv_clf,
                    param_grid=params,
                    cv=10,
                    scoring='roc_auc')
grid.fit(X_train, y_train)


if Version(sklearn_version) < '0.18':
    for params, mean_score, scores in grid.grid_scores_:
        print("%0.3f +/- %0.2f %r"
              % (mean_score, scores.std() / 2.0, params))

else:
    cv_keys = ('mean_test_score', 'std_test_score','params')

    for r, _ in enumerate(grid.cv_results_['mean_test_score']):
        print("%0.3f +/- %0.2f %r"
              % (grid.cv_results_[cv_keys[0]][r], 
                 grid.cv_results_[cv_keys[1]][r] / 2.0, 
                 grid.cv_results_[cv_keys[2]][r]))

print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)

結果

0.967 +/- 0.05 {'pipeline-1__clf__C': 0.001, 'decisiontreeclassifier__max_depth': 1}
0.967 +/- 0.05 {'pipeline-1__clf__C': 0.1, 'decisiontreeclassifier__max_depth': 1}
1.000 +/- 0.00 {'pipeline-1__clf__C': 100.0, 'decisiontreeclassifier__max_depth': 1}
0.967 +/- 0.05 {'pipeline-1__clf__C': 0.001, 'decisiontreeclassifier__max_depth': 2}
0.967 +/- 0.05 {'pipeline-1__clf__C': 0.1, 'decisiontreeclassifier__max_depth': 2}
1.000 +/- 0.00 {'pipeline-1__clf__C': 100.0, 'decisiontreeclassifier__max_depth': 2}

Best parameters: {'pipeline-1__clf__C': 100.0, 'decisiontreeclassifier__max_depth': 1}
Accuracy: 1.00

ブーストラップ標本を用いた分類器アンサンブルの構築(バギング)

f:id:robonchu:20171029181114p:plain

具体例

f:id:robonchu:20171029181324p:plain

バギングの効果の確認を行う

データセットの準備

import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

# drop 1 class
df_wine = df_wine[df_wine['Class label'] != 1]

y = df_wine['Class label'].values
X = df_wine[['Alcohol', 'Hue']].values

from sklearn.preprocessing import LabelEncoder
if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split


le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test =\
            train_test_split(X, y, 
                             test_size=0.40, 
                             random_state=1)

バギングの実装

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='entropy', 
                              max_depth=None,
                              random_state=1)

bag = BaggingClassifier(base_estimator=tree,
                        n_estimators=500, 
                        max_samples=1.0, 
                        max_features=1.0, 
                        bootstrap=True, 
                        bootstrap_features=False, 
                        n_jobs=1, 
                        random_state=1)

決定木とバギングの比較

from sklearn.metrics import accuracy_score

tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))

bag = bag.fit(X_train, y_train)
y_train_pred = bag.predict(X_train)
y_test_pred = bag.predict(X_test)

bag_train = accuracy_score(y_train, y_train_pred) 
bag_test = accuracy_score(y_test, y_test_pred) 
print('Bagging train/test accuracies %.3f/%.3f'
      % (bag_train, bag_test))

結果

Decision tree train/test accuracies 1.000/0.833
Bagging train/test accuracies 1.000/0.896

決定境界の可視化

import numpy as np
import matplotlib.pyplot as plt

x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(nrows=1, ncols=2, 
                        sharex='col', 
                        sharey='row', 
                        figsize=(8, 3))


for idx, clf, tt in zip([0, 1],
                        [tree, bag],
                        ['Decision Tree', 'Bagging']):
    clf.fit(X_train, y_train)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train == 0, 0],
                       X_train[y_train == 0, 1],
                       c='blue', marker='^')

    axarr[idx].scatter(X_train[y_train == 1, 0],
                       X_train[y_train == 1, 1],
                       c='red', marker='o')

    axarr[idx].set_title(tt)

axarr[0].set_ylabel('Alcohol', fontsize=12)
plt.text(10.2, -1.2,
         s='Hue',
         ha='center', va='center', fontsize=12)

plt.tight_layout()
# plt.savefig('./figures/bagging_region.png',
#            dpi=300,
#            bbox_inches='tight')
plt.show()

f:id:robonchu:20171029182840p:plain

アダブーストによる弱学習器

ブースティングの原型版

  1. トレーニングデータセットDからトレーニングサンプルのランダムなサブセットd1を非復元抽出し、弱学習器C1をトレーニングする

  2. 2つ目のランダムなトレーニングサブセットd2をトレーニングデータセットから非復元抽出し、以前にご分類されたサンプルの50%を追加して、弱学習C2をトレーニングする

  3. トレーニングデータセットDからC1とC2の結果が異なるトレーニングサンプルd3を洗い出し、3つ目の学習器C3をトレーニングする

  4. 弱学習器C1,C2,C3を多数決により組み合わせる

アダブースト

f:id:robonchu:20171029183836p:plain

Ensembles (4): AdaBoost - YouTube

  1. トレーニングサンプルの重みベクトルwを設定する

  2. 全体でm回実行するブースティングのうちj回目のステップで3~8を繰り返す

  3. 重み付けされた弱学習器をトレーニング

  4. クラスラベルの予測

  5. 重み付けされた誤分類率を計算

  6. 重みの更新に用いる係数の計算

  7. 重みの更新

  8. 重みを正規化して合計が1になるようにする

  9. 入力された特徴行列に対する最終予測を計算する

f:id:robonchu:20171029214519p:plain

決定木とアダブーストの比較

from sklearn.ensemble import AdaBoostClassifier

tree = DecisionTreeClassifier(criterion='entropy', 
                              max_depth=1,
                              random_state=0)

ada = AdaBoostClassifier(base_estimator=tree,
                         n_estimators=500, 
                         learning_rate=0.1,
                         random_state=0)

tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))

ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)

ada_train = accuracy_score(y_train, y_train_pred) 
ada_test = accuracy_score(y_test, y_test_pred) 
print('AdaBoost train/test accuracies %.3f/%.3f'
      % (ada_train, ada_test))

結果

Decision tree train/test accuracies 0.845/0.854
AdaBoost train/test accuracies 1.000/0.875

テストデータセットの性能が向上しているが、バリアンスは大きくなっている。

決定領域の確認

x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(8, 3))


for idx, clf, tt in zip([0, 1],
                        [tree, ada],
                        ['Decision Tree', 'AdaBoost']):
    clf.fit(X_train, y_train)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train == 0, 0],
                       X_train[y_train == 0, 1],
                       c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train == 1, 0],
                       X_train[y_train == 1, 1],
                       c='red', marker='o')
    axarr[idx].set_title(tt)

axarr[0].set_ylabel('Alcohol', fontsize=12)
plt.text(10.2, -1.2,
         s='Hue',
         ha='center', va='center', fontsize=12)

plt.tight_layout()
# plt.savefig('./figures/adaboost_region.png',
#           dpi=300,
#           bbox_inches='tight')
plt.show()

f:id:robonchu:20171029215422p:plain

ROS x LeapMotionのお勉強

LeapMotionを使ってみる

準備

SDKのダウンロード V2 Tracking — Leap Motion Developer

install

sudo dpkg --install Leap-2.3.1+31549-x64.deb

ubuntu16.04でもし動かなければ Ubuntu 16.04にLeapMotion導入 - Qiita を参考に修正

  • leapを起動したい時
sudo service leapd start
  • 止めたい時
sudo service leapd stop
  • 状態の確認がしたい時
sudo service leapd status

*設定

LeapControlPanel --showsettings

ROSで起動

cppの場合は自作する必要あり

以下を参考に動かせる

leap_motion - ROS Wiki

GitHub - ros-drivers/leap_motion: Leap Motion ROS integration

GitHub - YuOhara/leap_motion at leap_camera

ステレオカメラとして動かしたい!

しかし、下記エラーで動かず。。。原因調査中。

[ERROR] [1510999568.734786707]: Skipping XML Document "/opt/ros/kinetic/share/gmapping/nodelet_plugins.xml" which had no Root Element.  This likely means the XML is malformed or missing.

Skipping XML documentation · Issue #55 · ros-perception/slam_gmapping · GitHub

このissueが原因

GitHub - gavanderhoorn/slam_gmapping at install_nodelet_plugin_xml

のソースからオーバーレイで解決

GitHub - qboticslabs/leap_client: A ROS C++ wrapper around the Leap Motion API

leap motion公式HPの説明↓

https://developer.leapmotion.com/documentation/v2/python/devguide/Leap_Images.html?proglang=python

https://community.leapmotion.com/t/sdk-2-1-raw-data-get-pixel-position-xyz/1604/12

Leap Motion SDK - YouTube

ROS x C++のお勉強

背景

PCLや画像処理の高速化などC++でしか扱えないものを使えるようになりたい。

ROSパッケージの構成検討

robonchu.hatenablog.com

最小構成(hello world)

CMakeLists.txt

cmake_minimum_required(VERSION 2.8.3)
project(hello)

find_package(catkin REQUIRED COMPONENTS
  roscpp
  std_msgs
)

catkin_package(
#  INCLUDE_DIRS include
#  LIBRARIES chapter03
#  CATKIN_DEPENDS roscpp std_msgs
#  DEPENDS system_lib
)

include_directories(
# include
  ${catkin_INCLUDE_DIRS}
)

add_executable(hello_world src/hello_world.cpp)
target_link_libraries(hello_world ${catkin_LIBRARIES})

package.xml

<?xml version="1.0"?>
<package format="2">
  <name>hello</name>
  <version>0.0.0</version>
  <description>The hello package</description>
  <maintainer email="mouse@todo.todo">mouse</maintainer>
  <license>TODO</license>

  <buildtool_depend>catkin</buildtool_depend>

  <build_depend>roscpp</build_depend>
  <build_depend>std_msgs</build_depend>

  <build_export_depend>roscpp</build_export_depend>
  <build_export_depend>std_msgs</build_export_depend>

  <exec_depend>roscpp</exec_depend>
  <exec_depend>std_msgs</exec_depend>
</package>

hello_world.cpp

#include <ros/ros.h>

int main(int argc, char **argv)
{
  ros::init(argc, argv, "hello_world");
  ros::NodeHandle nh;

  ros::Rate rate(1);
  while(ros::ok()){
    ROS_INFO_STREAM("Hello ROS world !!!");
    rate.sleep();
  }
  return 0;
}

ImageTransport

image_transport/Tutorials - ROS Wiki

cv_bridge/Tutorials/UsingCvBridgeToConvertBetweenROSImagesAndOpenCVImages - ROS Wiki

subscriber

#include <ros/ros.h>
#include <image_transport/image_transport.h>
#include <opencv2/highgui/highgui.hpp>
#include <cv_bridge/cv_bridge.h>

void imageCallback(const sensor_msgs::ImageConstPtr& msg)
{
  try {
    cv::imshow("ImageSubscriber", cv_bridge::toCvShare(msg, "bgr8")->image);
    if(cv::waitKey(30) >= 0)
      ros::shutdown();
  }
  catch (cv_bridge::Exception& e) {
    ROS_ERROR("Could not convert from '%s' to 'bgr8'.",
          msg->encoding.c_str());
  }
}

int main(int argc, char **argv)
{
  ros::init(argc, argv, "camera_subscriber");
  ros::NodeHandle nh;
  cv::namedWindow("Image_Subscriber");
  cv::startWindowThread();

  image_transport::ImageTransport it(nh);
  image_transport::Subscriber sub;
  sub = it.subscribe("/usb_cam/image_raw", 1,imageCallback);

  ros::spin();
  cv::destroyWindow("camera_Subscriber");
  return EXIT_SUCCESS;
}

camera subscriber カメラのInfo情報が追加

#include <ros/ros.h>
#include <image_transport/image_transport.h>
#include <opencv2/highgui/highgui.hpp>
#include <cv_bridge/cv_bridge.h>

void imageCallback(const sensor_msgs::ImageConstPtr& msg, 
     const sensor_msgs::CameraInfoConstPtr& info)
{
  try {
    cv::imshow("ImageCameraSubscriber", cv_bridge::toCvShare(msg, "bgr8")->image);
    if(cv::waitKey(30) >= 0)
      ros::shutdown();
  }
  catch (cv_bridge::Exception& e) {
    ROS_ERROR("Could not convert from '%s' to 'bgr8'.",
          msg->encoding.c_str());
  }
}

int main(int argc, char **argv)
{
  ros::init(argc, argv, "camera_camerasubscriber");
  ros::NodeHandle nh;
  cv::namedWindow("Image_CameraSubscriber");
  cv::startWindowThread();

  image_transport::ImageTransport it(nh);
  image_transport::Subscriber sub = it.subscribe("/usb_cam/image_raw", 1,imageCallback);

  ros::spin();
  cv::destroyWindow("camera_Subscriber");

  return EXIT_SUCCESS;
}

参考:

image_transport/Tutorials/SubscribingToImages - ROS Wiki

PCL

TODO

DockerでDeepLearningの環境を作ってみる

Dockerを支えるLinuxカーネル

f:id:robonchu:20171112160308p:plain

参考:

Etsukata blog: Docker を支える Linux Kernel の機能 (概要編)

Docker内部で利用されているLinuxカーネルの機能 (namespace/cgroups) - Qiita

Linuxカーネル Docker関連 namespaceのメモ « Rest Term

NVIDIA-DOCKER

Internals · NVIDIA/nvidia-docker Wiki · GitHub

Motivation · NVIDIA/nvidia-docker Wiki · GitHub

Motivation · NVIDIA/nvidia-docker Wiki · GitHub

環境構築

nvidia-docker でポータブルな機械学習の作業環境を作る - Qiita

が今回の教科書。わかりやすい説明ありがとうございます。

f:id:robonchu:20171112162226p:plain

手順

  1. dockerインストール:Get Docker CE for Ubuntu | Docker Documentation

  2. nvidia-dockerインストール:GitHub - NVIDIA/nvidia-docker: Build and run Docker containers leveraging NVIDIA GPUs

  3. sudo なしで docker を使えるように:Dockerコマンドをsudoなしで実行する方法 - Qiita

  4. Dockerfileの作成(作業環境の初期状態の設計図)

  5. imageの作成(作業環境の初期状態)

  6. containerの作成とログイン( image を元に新しい作業環境を作る。)

ワークフロー

初回

  1. Dockerfile 作成

  2. nvidia-docker build ... でイメージ作成。失敗したら goto 1

  3. nvidia-docker run ... でコンテナ作成&ログイン コンテナ環境を確認してexitしてコンテナ停止

コンテナ内作業&ホストでの作業

  1. docker start ... でコンテナを起動してログイン

  2. screen|tmux を起動

  3. コンテナ内で何か作業する

  4. ctrl-p + ctrl-q で一旦ホストへ戻る

  5. ホストで何か作業する

  6. docker attach ... でコンテナ内へ戻る。goto 2

  7. 作業完了したら exit でコンテナを停止する

  8. 作業再開したければ goto 1

Dockerfile

参考:

kerasのDockerfile 趣味でディープラーニングするための GPU 環境を安上がりに作る方法 - Qiita

chainerのDockerfile nvdia-docker で ディープラーニング用の環境を作る – GitHub 出張所 – プログラム関係のブログはここに

例1:Dockerfile

nvidia-docker/Dockerfile at master · NVIDIA/nvidia-docker · GitHub

FROM nvidia/cuda:8.0-devel-ubuntu16.04

CMD nvidia-smi -q

例2:Dockerfile

FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
ENV DEBIAN_FRONTEND "noninteractive"

RUN apt-get update
RUN apt-get -y \
    -o Dpkg::Options::="--force-confdef" \
    -o Dpkg::Options::="--force-confold" dist-upgrade

RUN apt-get install -y --no-install-recommends \
    sudo ssh \
    zsh screen cmake unzip git curl wget vim tree htop \
    python-dev python-pip python-setuptools \
    python3-dev python3-pip python3-setuptools \
    build-essential \
    graphviz \
    libatlas-base-dev libopenblas-base libopenblas-dev liblapack-dev \
    build-essential \
    libopencv-dev \
    python-numpy \
    protobuf-compiler

RUN groupadd -g 1942 ubuntu
RUN useradd -m -u 1942 -g 1942 -d /home/ubuntu ubuntu
RUN echo "ubuntu ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
RUN chown -R ubuntu:ubuntu /home/ubuntu
RUN chsh -s /usr/bin/zsh ubuntu

USER ubuntu
WORKDIR /home/ubuntu
ENV HOME /home/ubuntu

#### zsh ####
RUN bash -c "$(curl -fsSL https://raw.githubusercontent.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"

#### zsh path setting ####
RUN echo 'export PATH=/usr/local/cuda/bin:$PATH' >> /home/ubuntu/.zshrc
RUN echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> /home/ubuntu/.zshrc
RUN echo 'export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH' >> /home/ubuntu/.zshrc

RUN echo 'export CPATH=/usr/local/cuda-8.0/targets/x86_64-linux/include:$CPATH' >> /home/ubuntu/.zshrc

RUN echo 'export CPATH=$HOME/cuda/include:$CPATH' >> /home/ubuntu/.zshrc
RUN echo 'export LIBRARY_PATH=$HOME/cuda/lib64:$LIBRARY_PATH' >> /home/ubuntu/.zshrc
RUN echo 'export LD_LIBRARY_PATH=$HOME/cuda/lib64:$LD_LIBRARY_PATH' >> /home/ubuntu/.zshrc

RUN echo 'export LC_ALL=C.UTF-8' >> /home/ubuntu/.zshrc
RUN echo 'export LANG=C.UTF-8' >> /home/ubuntu/.zshrc

RUN echo 'shell "/usr/bin/zsh"' >>  /home/ubuntu/.screenrc


#### pyenv install ####
RUN git clone https://github.com/yyuu/pyenv.git /home/ubuntu/.pyenv
RUN echo 'export PYENV_ROOT="$HOME/.pyenv"' >> /home/ubuntu/.zshrc
RUN echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> /home/ubuntu/.zshrc
RUN echo 'eval "$(pyenv init -)"' >> /home/ubuntu/.zshrc
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
RUN eval "$(pyenv init -)"


#### anaconda ####
RUN pyenv install anaconda3-4.2.0
RUN pyenv rehash
RUN pyenv global anaconda3-4.2.0
RUN conda update --all
RUN pip install --upgrade pip


#### tensorflow ####
RUN pip install cython
RUN pip install tensorflow-gpu
RUN pip install keras


WORKDIR $HOME/

例3:Dockerfile フレームワークたくさん入り 深層学習全部入りコンテナ(Keras/TensorFlow/Chainer/Pytorch/Open AI Gym/Anaconda)で、nividia-dockerを使う - Qiita

FROM nvidia/cuda:8.0-cudnn6-runtime

RUN apt-get update
RUN apt-get install -y curl git unzip imagemagick bzip2
RUN git clone git://github.com/yyuu/pyenv.git .pyenv

WORKDIR /
ENV HOME  /
ENV PYENV_ROOT /.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH

RUN pyenv install anaconda3-4.4.0
RUN pyenv global anaconda3-4.4.0
RUN pyenv rehash

RUN pip install opencv-python tqdm h5py keras tensorflow-gpu kaggle-cli gym
RUN pip install chainer
RUN pip install http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp36-cp36m-manylinux1_x86_64.whl 
RUN pip install torchvision
EOF

imageの作成

  1. Dockerfileの作成

  2. nvidia-docker build -t docker-gpu-test-workspace ./

  3. docker images

chainerのイメージ:https://hub.docker.com/r/chainer/chainer/

pytorchのイメージ:GitHub - pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration

PyTorchのDockerイメージを作成してみた | SoraLab / ソララボ

https://hub.docker.com/search/?isAutomated=0&isOfficial=0&page=1&pullCount=0&q=pytorch&starCount=0

containerの作成とログイン

例1

nvidia-docker run -it docker-gpu-test-workspace

例2

nvidia-docker run \
  --name my-workspace \
  -v /data/:/data/ \
  --net="host" \
  -ti \
  docker-gpu-test-workspace\
    /usr/bin/zsh

test sample

例1:

(container)# apt install wget
(container)# wget https://raw.githubusercontent.com/fchollet/keras/master/examples/mnist_cnn.py
(container)# python3 mnist_cnn.py

結果

Using TensorFlow backend.
2017-11-12 08:16:25.823619: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-12 08:16:25.912916: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-12 08:16:25.913187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 0.928
pciBusID: 0000:01:00.0
totalMemory: 1.96GiB freeMemory: 1.61GiB
2017-11-12 08:16:25.913216: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:01:00.0, compute capability: 5.0)
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
11493376/11490434 [==============================] - 16s 1us/step
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
2017-11-12 08:16:43.096840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:01:00.0, compute capability: 5.0)
60000/60000 [==============================] - 135s 2ms/step - loss: 0.3366 - acc: 0.8971 - val_loss: 0.0830 - val_acc: 0.9757
Epoch 2/12
60000/60000 [==============================] - 19s 316us/step - loss: 0.1126 - acc: 0.9666 - val_loss: 0.0527 - val_acc: 0.9831
Epoch 3/12
60000/60000 [==============================] - 18s 308us/step - loss: 0.0841 - acc: 0.9747 - val_loss: 0.0414 - val_acc: 0.9860
Epoch 4/12
60000/60000 [==============================] - 19s 310us/step - loss: 0.0695 - acc: 0.9792 - val_loss: 0.0402 - val_acc: 0.9864
Epoch 5/12
60000/60000 [==============================] - 18s 307us/step - loss: 0.0606 - acc: 0.9821 - val_loss: 0.0347 - val_acc: 0.9885
Epoch 6/12
60000/60000 [==============================] - 19s 314us/step - loss: 0.0548 - acc: 0.9842 - val_loss: 0.0319 - val_acc: 0.9897
Epoch 7/12
60000/60000 [==============================] - 19s 309us/step - loss: 0.0513 - acc: 0.9849 - val_loss: 0.0317 - val_acc: 0.9895
Epoch 8/12
60000/60000 [==============================] - 19s 310us/step - loss: 0.0473 - acc: 0.9861 - val_loss: 0.0319 - val_acc: 0.9892
Epoch 9/12
60000/60000 [==============================] - 19s 310us/step - loss: 0.0423 - acc: 0.9872 - val_loss: 0.0290 - val_acc: 0.9907
Epoch 10/12
60000/60000 [==============================] - 18s 307us/step - loss: 0.0411 - acc: 0.9878 - val_loss: 0.0299 - val_acc: 0.9904
Epoch 11/12
60000/60000 [==============================] - 18s 304us/step - loss: 0.0383 - acc: 0.9887 - val_loss: 0.0294 - val_acc: 0.9902
Epoch 12/12
60000/60000 [==============================] - 18s 306us/step - loss: 0.0361 - acc: 0.9887 - val_loss: 0.0354 - val_acc: 0.9890
Test loss: 0.035408690984
Test accuracy: 0.989

うごけばおけ

ためになる参考

nvidia-docker - AdwaysEngineersBlog - Adwaysエンジニアブログ

PyTorchのDockerイメージを作成してみた | SoraLab / ソララボ

PyTorch の Dockerfile に Opencv もインストールして使えるようにする - Qiita

深層学習全部入りコンテナ(Keras/TensorFlow/Chainer/Pytorch/Open AI Gym/Anaconda)で、nividia-dockerを使う - Qiita