2016/11/28 Artificial Intelligence Assignment


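Note:

  • Project objectives:
1)  To understand the basic principles of artificial intelligence
2)  To gain a practical understanding of intelligence theory by studying the core theory of artificial intelligence and running related programs
3)  To improve leadership through teamwork

  • Project method:
1)  To improve teamwork, each team consists of 3 members.
2)  The project deals with artificial neural networks (FAQ); the specific project steps are as follows.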


Step0: Obtain FAQ and Source code

  • Use a variety of books and Internet references on artificial neural networks
https://www.codeproject.com/KB/AI/brainnet/brainnet_src.zip (source code for the backpropagation algorithm, VB source files)
https://github.com/neuroph/neuroph/tree/master/neuroph-2.9/Samples/src/main/java/org/neuroph/samples/uci (not checked yet)

https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ (worth reading; the Python code is in a Git repository)

http://zerkpage.tripod.com/project.htm (C++ source code / If your Visual Studio version is 2008 or later, change the header <iostream.h> to <iostream> and add 'using std::xxxx' declarations for cin, cout, etc. / getch and kbhit are deprecated; use _getch and _kbhit instead.)
  • Any source code for artificial neural networks may be used

Step1: Select a suitable DB from the following website.

Method:

  • From the following Machine Learning Repository (http://archive.ics.uci.edu/ml/), choose one DB for either a Classification or a Regression task.


http://www.cs.toronto.edu/~delve/data/adult/desc.html
I downloaded the Adult dataset, which is one of the classification datasets. Details are given below.

The Adult dataset


The information is a replica of the notes for the abalone dataset from the UCI repository.

1. Title of Database: adult


2. Sources:


(a) Original owners of database (name/phone/snail address/email address)
US Census Bureau.

(b) Donor of database (name/phone/snail address/email address)
Ronny Kohavi and Barry Becker,
Data Mining and Visualization
Silicon Graphics.
e-mail: ronnyk@sgi.com 

(c) Date received (databases may change over time without name change!)
05/19/96

3. Past Usage:


(a) Complete reference of article where it was described/used
@inproceedings{kohavi-nbtree,
author={Ron Kohavi},
title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid},
booktitle={Proceedings of the Second International Conference on Knowledge Discovery and Data Mining},
year = 1996,
pages={to appear}} 
(b) Indication of what attribute(s) were being predicted
Salary greater or less than 50,000.
(c) Indication of study's results (i.e. Is it a good domain to use?)
Hard domain with a nice number of records.
The following results obtained using MLC++ with default settings
for the algorithms mentioned below. 
 #   Algorithm                Error
 1   C4.5                     15.54
 2   C4.5-auto                14.46
 3   C4.5-rules               14.94
 4   Voted ID3 (0.6)          15.64
 5   Voted ID3 (0.8)          16.47
 6   T2                       16.84
 7   1R                       19.54
 8   NBTree                   14.10
 9   CN2                      16.00
10   HOODG                    14.82
11   FSS Naive Bayes          14.05
12   IDTM (Decision table)    14.46
13   Naive-Bayes              16.12
14   Nearest-neighbor (1)     21.42
15   Nearest-neighbor (3)     20.35
16   OC1                      15.04
17   Pebls                    Crashed. Unknown why (bounds WERE increased)

4. Relevant Information Paragraph:

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

5. Number of Instances

  • 48842 instances, mix of continuous and discrete (train=32561, test=16281)
  • 45222 if instances with unknown values are removed (train=30162, test=15060)
  • Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).

6. Number of Attributes

6 continuous, 8 nominal attributes.

7. Attribute Information:

  1. age: continuous.
  2. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  3. fnlwgt: continuous.
  4. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  5. education-num: continuous.
  6. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  7. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  8. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  9. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  10. sex: Female, Male.
  11. capital-gain: continuous.
  12. capital-loss: continuous.
  13. hours-per-week: continuous.
  14. native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  15. class: >50K, <=50K
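
A neural network needs numeric inputs, so the nominal attributes listed above have to be encoded in some way. The following is a minimal sketch of one common option, one-hot encoding, using only three of the attributes. The helper names `one_hot` and `encode_example`, and the decision to pass the continuous value through unscaled, are assumptions made for illustration and are not part of the assignment.

```python
# Category lists copied from the attribute description above.
SEX = ["Female", "Male"]
RACE = ["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"]


def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode_example(age, race, sex):
    """Encode a (hypothetical) three-attribute record as one flat input vector."""
    return [float(age)] + one_hot(race, RACE) + one_hot(sex, SEX)


vector = encode_example(39, "White", "Male")
print(vector)
```

In a real run the continuous attributes would probably also need rescaling, since values such as fnlwgt are several orders of magnitude larger than the 0/1 indicator inputs.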

8. Missing Attribute Values:

7% have missing values.

9. Class Distribution:

Probability for the label '>50K' : 23.93% / 24.78% (without unknowns) 
Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns) 
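
The class distribution gives a useful reference point: a classifier that always answers with the majority label is wrong exactly as often as the minority label occurs. The short calculation below is only a sketch of that arithmetic; it uses the overall percentages quoted above, so the figure for the test split alone may differ slightly.

```python
# Percentages quoted above (all records, including those with unknowns).
p_over_50k = 23.93
p_under_50k = 76.07

# Always predicting the majority class misclassifies every minority record.
majority_label = "<=50K" if p_under_50k >= p_over_50k else ">50K"
baseline_error = min(p_over_50k, p_under_50k)

print(majority_label, baseline_error)
```

Any trained network should therefore reach a test error clearly below this value to be worth reporting; every algorithm in the results table above does.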

10. Notes for Delve

  1. One prototask (income) has been defined, using attributes 1-13 as inputs and income level as a binary target.
  2. Missing values - These are confined to attributes 2 (workclass), 7 (occupation) and 14 (native-country). The prototask only uses cases with no missing values.
  3. The income prototask comes with two priors, differing according to if attribute 4 (education) is considered to be nominal or ordinal.
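
Since the prototask only uses cases with no missing values, such records have to be filtered out before training. The sketch below shows one way to do that, assuming that the file is comma-separated and that unknown values are written as "?"; both points should be checked against the downloaded files. The two sample rows are illustrative, not taken from the data.

```python
MISSING = "?"  # assumption: unknown values are marked with "?" in the file


def parse_record(line):
    """Split one comma-separated record into whitespace-stripped fields."""
    return [field.strip() for field in line.split(",")]


def has_missing(fields):
    """True if any field of the record is the missing-value marker."""
    return MISSING in fields


# Illustrative rows in the 15-column layout described above.
lines = [
    "40, Private, 120000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Male, 0, 0, 40, United-States, <=50K",
    "54, ?, 180000, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Black, Male, 0, 0, 60, Canada, >50K",
]
records = [parse_record(line) for line in lines]
complete = [r for r in records if not has_missing(r)]

# Consistency check on the instance counts quoted in section 5.
removed = 48842 - 45222
removed_percent = 100.0 * removed / 48842

print(len(records), len(complete), removed, round(removed_percent, 1))
```

The count check agrees with the "7% have missing values" figure: 3620 of the 48842 records, about 7.4%, are dropped.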

 

Step 2: Split the selected data into 2 groups.

Method:

1.   Training data set for learning and validation

adult\dataset.data\dataset.data
adult\dataset.spec

2.   Test data set for prediction and final error or performance measure

adult\income\prototask.data\prototask.data
adult\income\prototask.spec

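Note: Read the description of the selected data carefully and check whether a training set and a testing set have already been separated. Also check whether the training set has already been separated into a learning data set and a validation data set.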

 

Step 3: Write a program for BP and Train an Artificial Neural Network

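Method:

  • Any code or simulator may be used, but regardless of the programming language, the backpropagation (BP) algorithm must be used.

  • If this was not already done in step 2, split the training data set into a learning set and a validation set (e.g. 50/50). This is so that training can use the technique called cross-validation.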
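
The split described above can be sketched as follows. The function name `split_learning_validation` and its parameters are hypothetical; a fixed seed is used only so that the split is reproducible.

```python
import random


def split_learning_validation(records, learning_fraction=0.5, seed=0):
    """Shuffle a copy of `records` and cut it into learning and validation parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * learning_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the training records: ten integer ids.
training = list(range(10))
learning, validation = split_learning_validation(training, 0.5, seed=1)
print(len(learning), len(validation))
```

Calling it with `learning_fraction=0.6`, or with a different `seed`, gives the alternative splits suggested in the hints. Strictly speaking, a single split like this is hold-out validation; repeating it over several different splits moves it towards cross-validation.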

Hints:

  • The learning rate should be as small as possible (e.g. 0.0001), and a large value is recommended for the momentum term (e.g. 0.8).

  • Use only one hidden layer.

  • Choose a high number of neurons (nodes or units) in the hidden layer (e.g. 20). Where appropriate, add a few more neurons and compare the results (the lower the learning/validation error, the better the result); this may differ depending on the data.

  • Run your algorithm several times (e.g. 10 times) with different initializations. Alternatively, split the learning/validation set in a different way and run again (e.g. change the split from 50/50 to 60/40).

  • To get the best possible mapping, do not stop at the 10 runs suggested above; repeat the runs many more times.
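
To make the hints concrete, here is a minimal sketch of backpropagation with one hidden layer and a momentum term, written with the standard library only. It is not the code from any of the linked projects. The class name, the sigmoid activation, the squared-error loss, and the toy data (logical OR) are all assumptions for illustration. The learning rate is set to 0.1 rather than the recommended 0.0001 only so that the toy example improves within a few hundred epochs.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class OneHiddenLayerNet:
    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one hidden unit; the last entry is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        # Previous weight changes, needed for the momentum term.
        self.dw_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.dw_out = [0.0] * (n_hidden + 1)

    def forward(self, x):
        xb = list(x) + [1.0]  # append constant bias input
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_hidden]
        hb = hidden + [1.0]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hb)))
        return xb, hb, out

    def train_example(self, x, target, lr, momentum):
        xb, hb, out = self.forward(x)
        # Output delta for squared error with a sigmoid output unit.
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas use the output weights before they are updated.
        delta_hidden = [delta_out * self.w_out[j] * hb[j] * (1.0 - hb[j])
                        for j in range(len(self.w_hidden))]
        for j in range(len(self.w_out)):
            self.dw_out[j] = -lr * delta_out * hb[j] + momentum * self.dw_out[j]
            self.w_out[j] += self.dw_out[j]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                self.dw_hidden[j][i] = (-lr * delta_hidden[j] * xb[i]
                                        + momentum * self.dw_hidden[j][i])
                row[i] += self.dw_hidden[j][i]

    def mean_squared_error(self, data):
        return sum((self.forward(x)[2] - t) ** 2 for x, t in data) / len(data)


# Toy data (logical OR), only to show that the error goes down.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
net = OneHiddenLayerNet(n_in=2, n_hidden=20, seed=0)
error_before = net.mean_squared_error(data)
for epoch in range(200):
    for x, t in data:
        net.train_example(x, t, lr=0.1, momentum=0.8)
error_after = net.mean_squared_error(data)
print(error_before, error_after)
```

Changing `seed` gives the different initializations mentioned in the hints, and recording `mean_squared_error` on the learning and validation sets after every epoch gives the two curves needed for the plots.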

Task 1: Plot the relationship between the number of iterations (epochs: x) and the learning error and validation error (y). Call this Plot 1.

 

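Step 4: Reduce the error using the following method

Method:

  • Early stopping method: run training repeatedly, select the network whose error is lowest, and stop training at the point where the error reaches its minimum.

Task 2: Plot the relationship between the number of iterations (epochs: x) and the learning error and validation error (y) once again. Call this Plot 2.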
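
Early stopping can be sketched as a small piece of bookkeeping on the validation error curve. The two functions below are hypothetical helpers: `best_epoch` finds the epoch with the lowest validation error, and `stop_epoch` shows a common variant that stops once the error has not improved for `patience` consecutive epochs. The error values in `curve` are made up for illustration.

```python
def best_epoch(validation_errors):
    """Return (epoch index, error) of the lowest validation error."""
    best = min(range(len(validation_errors)),
               key=lambda e: validation_errors[e])
    return best, validation_errors[best]


def stop_epoch(validation_errors, patience):
    """Return the epoch at which training would stop with the given patience."""
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_error = err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(validation_errors) - 1


# Made-up validation errors: falling at first, then rising (overfitting).
curve = [0.40, 0.31, 0.27, 0.25, 0.26, 0.28, 0.31]
print(best_epoch(curve), stop_epoch(curve, patience=2))
```

In an actual training loop, a copy of the weights would be saved whenever the validation error improves, so that the network from the best epoch can be restored after stopping.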

 

Step 5: Measure the performance on the test set

Method: Two network structures have been trained so far:

  • Task 1: Trained with learning and validation data set (so-called “without regularisation method”)

  • Task 2: Trained with “early stopping method”

  • Use the two trained networks above to check the final performance of the networks. This is where the test data set separated in the earlier step is used; up to this point only the training data set has been used.

Task 3: Report the performance error of your networks and preferably compare them with the performance error of other ANN methods (refer to the various related papers available on the Internet)
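
The performance error for Task 3 can be reported as a misclassification rate in percent, which makes it directly comparable with the error column of the MLC++ results in the dataset description. The following is a sketch under stated assumptions: the network has a single output in [0, 1], a threshold of 0.5 decides the label, and the outputs shown are made up.

```python
def to_label(output, threshold=0.5):
    """Map a network output in [0, 1] to one of the two class labels."""
    return ">50K" if output >= threshold else "<=50K"


def error_rate_percent(predictions, labels):
    """Percentage of predictions that differ from the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return 100.0 * wrong / len(labels)


# Made-up network outputs and true labels for five test records.
outputs = [0.91, 0.12, 0.40, 0.77, 0.05]
labels = [">50K", "<=50K", ">50K", ">50K", "<=50K"]

predictions = [to_label(o) for o in outputs]
test_error = error_rate_percent(predictions, labels)
print(test_error)
```

Computing this once for the Task 1 network and once for the Task 2 network, both on the same test set, gives the two numbers to compare.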

 

Write a short report (2~3 pages maximum)

  • Describe which database was used

  • Describe the network structure(s) used (number of nodes/layers)

  • Show the graphs mentioned in steps 3 and 4 (Tasks 1 and 2)


  • State your results in step 5 (Task 3)


http://sanghyukchun.github.io/42/ (a detailed explanation of the BP algorithm)
http://staff.itee.uq.edu.au/janetw/cmc/chapters/BackProp/index2.html (another explanation)

Analysis of the TRIPOD code backprop.cpp
A mapping file was needed, so I wrote one as a header.
The newly written header file

Completed project
https://github.com/xodmf1215/AI-backpropagation
