2016/11/28 Artificial Intelligence Assignment

Note:

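• Project objectives:

1) To understand the basic principles of artificial intelligence

2) To gain a practical understanding of each intelligence theory by studying the core theories of artificial intelligence and running related programs

3) To improve leadership through teamwork

• Project method:

1) To improve teamwork, each team consists of three members.

2) The FAQ on artificial neural networks and the specific project steps are given below.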

Step 0: Obtain FAQ and source code

Use the various books and Internet references on artificial neural networks.

https://www.codeproject.com/KB/AI/brainnet/brainnet_src.zip (backpropagation algorithm source code, VB source files)
https://github.com/neuroph/neuroph/tree/master/neuroph-2.9/Samples/src/main/java/org/neuroph/samples/uci (not checked yet)

https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ (worth reading; the Python code is in a Git repository)

http://zerkpage.tripod.com/project.htm (C++ source code. If your Visual Studio version is 2008 or later, change the header <iostream.h> to <iostream> and add 'using std::xxxx' declarations for cin, cout, etc. getch and kbhit are deprecated; use _getch and _kbhit instead.)

Any source code for artificial neural networks may be used.

Step 1: Select a suitable DB from the following website.

Method:

• From the following Machine Learning Repository (http://archive.ics.uci.edu/ml/), choose one DB for either a Classification or a Regression task.

http://www.cs.toronto.edu/~delve/data/adult/desc.html

I downloaded the adult dataset, which is one of the classification datasets. Details are given below.

The Adult dataset

The information is a replica of the notes for the abalone dataset from the UCI repository.

1. Title of Database: adult

2. Sources:

(a) Original owners of database (name/phone/snail address/email address): US Census Bureau.
(b) Donor of database (name/phone/snail address/email address): Ronny Kohavi and Barry Becker,
Data Mining and Visualization
Silicon Graphics.
e-mail: ronnyk@sgi.com
(c) Date received (databases may change over time without name change!): 05/19/96

3. Past Usage:

(a) Complete reference of article where it was described/used: @inproceedings{kohavi-nbtree,
author={Ron Kohavi},
title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid},
booktitle={Proceedings of the Second International Conference on Knowledge Discovery and Data Mining},
year = 1996,
pages={to appear}}
(b) Indication of what attribute(s) were being predicted: Salary greater or less than 50,000.
(c) Indication of study's results (i.e. Is it a good domain to use?): Hard domain with a nice number of records.
The following results obtained using MLC++ with default settings
for the algorithms mentioned below.

	Algorithm	Error (%)
1	C4.5	15.54
2	C4.5-auto	14.46
3	C4.5-rules	14.94
4	Voted ID3 (0.6)	15.64
5	Voted ID3 (0.8)	16.47
6	T2	16.84
7	1R	19.54
8	NBTree	14.10
9	CN2	16.00
10	HOODG	14.82
11	FSS Naive Bayes	14.05
12	IDTM (Decision table)	14.46
13	Naive-Bayes	16.12
14	Nearest-neighbor (1)	21.42
15	Nearest-neighbor (3)	20.35
16	OC1	15.04
17	Pebls	Crashed. Unknown why (bounds WERE increased)

4. Relevant Information Paragraph:

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

5. Number of Instances

48842 instances, mix of continuous and discrete (train=32561, test=16281)
45222 if instances with unknown values are removed (train=30162, test=15060)
Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).

6. Number of Attributes

6 continuous, 8 nominal attributes.

7. Attribute Information:

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
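
A backpropagation network needs numeric inputs, so the attributes above have to be encoded before training. The following is a minimal Python sketch, assuming one-hot encoding for nominal attributes and min-max scaling for continuous ones. The helper names (one_hot, min_max_scale) and the sample ages are illustrative and do not come from the dataset documentation.

```python
# Illustrative sketch: turn one nominal and one continuous attribute into
# numeric network inputs. The helper names are hypothetical.

SEX_VALUES = ["Female", "Male"]  # category list taken from the attribute list


def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max_scale(values):
    """Scale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [39, 50, 17, 90]  # made-up sample ages
scaled_ages = min_max_scale(ages)
encoded_sex = one_hot("Male", SEX_VALUES)
print(scaled_ages)
print(encoded_sex)
```

With this encoding every nominal attribute contributes as many input neurons as it has categories, so the input layer becomes considerably wider than the 14 raw attributes.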

8. Missing Attribute Values:

7% have missing values.

9. Class Distribution:

Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
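
The class distribution also gives a simple baseline for judging a trained network: a classifier that always predicts the more frequent label is wrong only on the records carrying the other label. The short sketch below works this out from the percentages above; the variable names are illustrative.

```python
# Share of each label in percent, copied from the class distribution above
# (all records, unknowns included).
class_share = {">50K": 23.93, "<=50K": 76.07}

# A classifier that always predicts the most frequent label is wrong
# exactly on the records that carry any other label.
majority_label = max(class_share, key=class_share.get)
baseline_error = 100.0 - class_share[majority_label]

print(majority_label, round(baseline_error, 2))
```

For comparison, the error values listed under Past Usage range from 14.05 to 21.42, all below this baseline.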

10. Notes for Delve

One prototask (income) has been defined, using attributes 1-13 as inputs and income level as a binary target.
Missing values - These are confined to attributes 2 (workclass), 7 (occupation) and 14 (native-country). The prototask only uses cases with no missing values.
The income prototask comes with two priors, differing according to if attribute 4 (education) is considered to be nominal or ordinal.
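
Since the prototask only uses cases with no missing values, the same filtering can be reproduced when reading the raw records. This is a small sketch under the assumption that a missing value is written as "?" and that each record has already been split into a list of fields; both points should be checked against the actual data files.

```python
MISSING_MARKER = "?"  # assumption: check the actual data file for its marker


def drop_incomplete(rows, marker=MISSING_MARKER):
    """Keep only rows in which no field equals the missing-value marker."""
    return [row for row in rows if marker not in row]


# Made-up rows with three fields each (age, workclass, occupation).
rows = [
    ["39", "State-gov", "Adm-clerical"],
    ["54", "?", "?"],
    ["28", "Private", "Sales"],
]
complete = drop_incomplete(rows)
print(len(complete), "of", len(rows), "rows kept")
```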

Step 2: Split the selected data into two groups.

Method:

1. Training data set for learning and validation

adult\dataset.data\dataset.data

adult\dataset.spec

2. Test data set for prediction and final error or performance measure

adult\income\prototask.data\prototask.data

adult\income\prototask.spec

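Note: Read the details of the selected data carefully and check whether it has already been split into a training set and a testing set. Also check whether the training set has been further split into a learning data set and a validation data set.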

Step 3: Write a program for BP and Train an Artificial Neural Network

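Method:

• Any code or simulator may be used, but regardless of the programming language, the backpropagation (BP) algorithm must be used.

• If it was not already split in Step 2, split the training data set into a learning set and a validation set (e.g. 50/50). This is so that training can use the technique called cross-validation.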
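
The learning/validation split can be sketched as a seeded shuffle followed by a cut. The function name split_learning_validation and the integer stand-in data are assumptions made for illustration; the real input would be the encoded training records.

```python
import random


def split_learning_validation(samples, learning_fraction=0.5, seed=0):
    """Shuffle a copy of `samples` and cut it into (learning, validation)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(len(shuffled) * learning_fraction)
    return shuffled[:cut], shuffled[cut:]


training_set = list(range(10))  # stand-in for the real training records
learning, validation = split_learning_validation(training_set, 0.5)
print(len(learning), len(validation))
```

Changing the seed or the fraction gives a different split for repeated runs. A single split like this is usually called hold-out validation; k-fold cross-validation would rotate which part is held out.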

Hints:

• The learning rate should be as small as possible (e.g. 0.0001), and a large momentum term is recommended (e.g. 0.8).

• Use only one hidden layer.

• Choose a high number of neurons (nodes or units) in the hidden layer (e.g. 20). If needed, add a few more neurons and compare the results (e.g. the lower the learning/validation error, the better the result); this may differ depending on the data.

• Run your algorithm several times (e.g. 10 times) with different initializations. Alternatively, split the learning/validation set above in a different way and run it again (e.g. change the split from 50/50 to 60/40).

• Do not stop at the 10 runs suggested above; repeat the run several more times so that the network maps the results as well as possible.
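
The hints above (one hidden layer, a learning rate, a momentum term, a seeded initialization) can be put together in a small pure-Python sketch. It is not the code from any of the linked sources: the class name OneHiddenLayerNet, the sigmoid activations, the single output neuron, the squared-error measure and the per-sample (online) updates are all assumptions chosen to keep the example short. The toy data is logical OR, and the learning rate is 0.1 rather than 0.0001 only because the toy set has four samples.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class OneHiddenLayerNet:
    """Tiny network: inputs -> hidden (sigmoid) -> one output (sigmoid)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra entry for the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        # Previous weight changes, needed for the momentum term.
        self.dw_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.dw_out = [0.0] * (n_hidden + 1)

    def forward(self, x):
        xb = list(x) + [1.0]  # append the constant bias input
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_hidden]
        hb = hidden + [1.0]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hb)))
        return xb, hb, out

    def train_sample(self, x, target, lr, momentum):
        """One online BP step; returns the squared error before the update."""
        xb, hb, out = self.forward(x)
        # Output delta: error times the sigmoid derivative.
        delta_out = (target - out) * out * (1.0 - out)
        # Hidden deltas use the output weights from before this update.
        delta_hidden = [delta_out * self.w_out[j] * hb[j] * (1.0 - hb[j])
                        for j in range(len(self.w_hidden))]
        for j in range(len(self.w_out)):
            self.dw_out[j] = lr * delta_out * hb[j] + momentum * self.dw_out[j]
            self.w_out[j] += self.dw_out[j]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                self.dw_hidden[j][i] = (lr * delta_hidden[j] * xb[i]
                                        + momentum * self.dw_hidden[j][i])
                row[i] += self.dw_hidden[j][i]
        return (target - out) ** 2


def mean_squared_error(net, data):
    return sum((t - net.forward(x)[2]) ** 2 for x, t in data) / len(data)


# Toy data (logical OR), used only to show that the error goes down.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = OneHiddenLayerNet(n_in=2, n_hidden=4, seed=1)
error_before = mean_squared_error(net, data)
for epoch in range(500):
    for x, t in data:
        net.train_sample(x, t, lr=0.1, momentum=0.8)
error_after = mean_squared_error(net, data)
print(error_before, error_after)
```

Passing a different seed to the constructor gives a different initialization, which is one way to carry out the repeated runs suggested in the hints.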

Task 1: Plot the relationship between the number of iterations (epochs: x-axis) and the learning error and validation error (each on the y-axis). Call this Plot 1.
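
One way to prepare the data for such a plot is to record both errors once per epoch and write them out as a table that any plotting tool can read. The sketch below uses only the standard csv module; the function name curves_to_csv and the three error values are made up to show the format.

```python
import csv
import io


def curves_to_csv(learning_errors, validation_errors):
    """Return CSV text with one row per epoch: epoch, learning, validation."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["epoch", "learning_error", "validation_error"])
    pairs = zip(learning_errors, validation_errors)
    for epoch, (learn_err, valid_err) in enumerate(pairs, start=1):
        writer.writerow([epoch, learn_err, valid_err])
    return buffer.getvalue()


# Made-up error values for three epochs, only to show the output format.
csv_text = curves_to_csv([0.30, 0.20, 0.15], [0.32, 0.25, 0.24])
print(csv_text)
```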

Step 4: Reduce the error using the following method

Method:

• Early stopping method: after running repeatedly, select the network with the minimum error and stop training at the point where the error is at its minimum.
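
The stopping rule can be sketched as a scan over the validation errors recorded per epoch. The patience parameter (how many epochs without improvement are tolerated before stopping) is an assumption added here to make the rule concrete; the assignment itself only asks for the network at the minimum error.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return (best_epoch, stop_epoch), both counted from 1.

    Training stops once the validation error has not improved for
    `patience` consecutive epochs; the network saved at best_epoch is kept.
    """
    best_error = float("inf")
    best_epoch = 0
    for epoch, error in enumerate(validation_errors, start=1):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(validation_errors)


# Made-up validation errors: they fall, then rise as the net overfits.
errors = [0.40, 0.30, 0.25, 0.26, 0.27, 0.29, 0.35]
best, stopped = early_stopping_epoch(errors, patience=3)
print(best, stopped)  # prints "3 6"
```

In a real training loop the weights would be copied whenever a new best validation error is found, so that the network from the best epoch can be restored after stopping.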

Task 2: Plot the relationship between the number of iterations (epochs: x-axis) and the learning error and validation error (each on the y-axis) once more. Call this Plot 2.

Step 5: Measure the performance on the test set

Method: Two network structures have been trained so far:

• Task 1: Trained with learning and validation data set (so-called “without regularisation method”)

• Task 2: Trained with “early stopping method”

• Use the two trained neural network structures above to check the final network performance. The test data set separated in the earlier step is used here; until now, only the training data set has been used.
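
For a two-class problem such as the income label, the final performance can be reported as the percentage of misclassified test cases. The sketch below assumes a single sigmoid output thresholded at 0.5; the function names and the four sample outputs are illustrative.

```python
def error_rate_percent(predicted_labels, true_labels):
    """Percentage of test cases whose predicted label differs from the truth."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("both label lists must have the same length")
    wrong = sum(p != t for p, t in zip(predicted_labels, true_labels))
    return 100.0 * wrong / len(true_labels)


def to_label(network_output, threshold=0.5):
    """Map a sigmoid output in [0, 1] to one of the two income labels."""
    return ">50K" if network_output >= threshold else "<=50K"


# Made-up network outputs and true labels for four test cases.
outputs = [0.9, 0.2, 0.6, 0.1]
truth = [">50K", "<=50K", "<=50K", "<=50K"]
predictions = [to_label(o) for o in outputs]
test_error = error_rate_percent(predictions, truth)
print(test_error)  # prints 25.0
```

Computing the same number for both networks (Task 1 and Task 2) on the same test set makes the two results directly comparable.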

Task 3: Report the performance error of your networks and preferably compare them with the performance error of other ANN methods (see the various related papers available on the Internet)

Write a short report (2~3 pages maximum)

• Explain which database was used

• Explain the network structure(s) used (number of nodes/layers)

• Show the graphs mentioned in steps 3 and 4 (Tasks 1 and 2)

• State your results in step 5 (Task 3)

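http://sanghyukchun.github.io/42/ (detailed explanation of the BP algorithm)

http://staff.itee.uq.edu.au/janetw/cmc/chapters/BackProp/index2.html (another explanation)

Analysis of the Tripod code backprop.cpp

A mapping file was needed, so I wrote it as a header.

Newly written header file

Completed project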

https://github.com/xodmf1215/AI-backpropagation