2016/11/28 Artificial Intelligence Assignment
Note:
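- Project objectives:
1) To understand the basic principles of artificial intelligence
2) To gain a practical understanding of each intelligence theory by working through the core theories of artificial intelligence and running related programs
3) To improve leadership through teamwork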
- Project method:
1) Each team consists of three members, to improve teamwork.
2) The FAQ on artificial neural networks and the concrete project steps are as follows.
Step 0: Obtain FAQ and Source code
- Use various books and online references on artificial neural networks
https://github.com/neuroph/neuroph/tree/master/neuroph-2.9/Samples/src/main/java/org/neuroph/samples/uci (not checked yet)
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ (worth reading; Python code is in the Git repository)
http://zerkpage.tripod.com/project.htm (C++ source code. If you build it with Visual Studio 2008 or a similar version, change the header <iostream.h> to <iostream> and add 'using std::xxxx' declarations for cin, cout, etc. getch and kbhit are deprecated; use _getch and _kbhit instead.)
- Any source code for artificial neural networks may be used
Step 1: Select a suitable DB from the following website.
Method:
- From the Machine Learning Repository (http://archive.ics.uci.edu/ml/), choose one DB from the Classification or Regression tasks.
http://www.cs.toronto.edu/~delve/data/adult/desc.html
I downloaded the Adult dataset, which is one of the classification datasets. Details are given below.
The Adult dataset
The information is a replica of the notes for the abalone dataset from the UCI repository.
1. Title of Database: adult
2. Sources:
- (a) Original owners of database (name/phone/snail address/email address)
- US Census Bureau.
- (b) Donor of database (name/phone/snail address/email address)
- Ronny Kohavi and Barry Becker,
Data Mining and Visualization
Silicon Graphics.
e-mail: ronnyk@sgi.com
- (c) Date received (databases may change over time without name change!)
- 05/19/96
3. Past Usage:
- (a) Complete reference of article where it was described/used
- @inproceedings{kohavi-nbtree,
author={Ron Kohavi},
title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid},
booktitle={Proceedings of the Second International Conference on Knowledge Discovery and Data Mining},
year = 1996,
pages={to appear}}
- (b) Indication of what attribute(s) were being predicted
- Salary greater or less than 50,000.
- (b) Indication of study's results (i.e. Is it a good domain to use?)
- Hard domain with a nice number of records.
The following results obtained using MLC++ with default settings
for the algorithms mentioned below.
| # | Algorithm | Error (%) |
|---|---|---|
| 1 | C4.5 | 15.54 |
| 2 | C4.5-auto | 14.46 |
| 3 | C4.5-rules | 14.94 |
| 4 | Voted ID3 (0.6) | 15.64 |
| 5 | Voted ID3 (0.8) | 16.47 |
| 6 | T2 | 16.84 |
| 7 | 1R | 19.54 |
| 8 | NBTree | 14.10 |
| 9 | CN2 | 16.00 |
| 10 | HOODG | 14.82 |
| 11 | FSS Naive Bayes | 14.05 |
| 12 | IDTM (Decision table) | 14.46 |
| 13 | Naive-Bayes | 16.12 |
| 14 | Nearest-neighbor (1) | 21.42 |
| 15 | Nearest-neighbor (3) | 20.35 |
| 16 | OC1 | 15.04 |
| 17 | Pebls | Crashed. Unknown why (bounds WERE increased) |
4. Relevant Information Paragraph:
Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
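The extraction condition can be read as a simple record filter. The following is a minimal sketch in Python, assuming each census record is a dict keyed by the field names that appear in the condition. The dict layout, the function name `is_clean_record`, and the sample records are illustrative assumptions, not the original extraction code.

```python
def is_clean_record(rec):
    """Return True if the record passes the extraction condition quoted above."""
    return (rec["AAGE"] > 16 and rec["AGI"] > 100
            and rec["AFNLWGT"] > 1 and rec["HRSWK"] > 0)

# Hypothetical records, made up purely for illustration.
records = [
    {"AAGE": 39, "AGI": 40000, "AFNLWGT": 77516, "HRSWK": 40},
    {"AAGE": 15, "AGI": 500, "AFNLWGT": 1200, "HRSWK": 10},   # fails AAGE > 16
    {"AAGE": 50, "AGI": 90, "AFNLWGT": 8000, "HRSWK": 35},    # fails AGI > 100
]
clean = [r for r in records if is_clean_record(r)]
```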
5. Number of Instances
- 48842 instances, mix of continuous and discrete (train=32561, test=16281)
- 45222 if instances with unknown values are removed (train=30162, test=15060)
- Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
6. Number of Attributes
6 continuous, 8 nominal attributes.
7. Attribute Information:
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- class: >50K, <=50K
8. Missing Attribute Values:
7% have missing values.
9. Class Distribution:
Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
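The class distribution gives a useful reference point for the errors reported later: a classifier that always answers '<=50K' is wrong exactly on the '>50K' cases. The sketch below works through that arithmetic and compares it with a few entries from the MLC++ results table. It assumes the table's Error column is a percentage, and the variable names are illustrative.

```python
# Share of the '>50K' class in percent, copied from the class distribution.
share_over_50k = {"all": 23.93, "without_unknowns": 24.78}

# Always predicting '<=50K' misclassifies exactly the '>50K' cases.
baseline_error = dict(share_over_50k)

# Errors copied from the MLC++ results table (assumed to be percentages).
reported = {"FSS Naive Bayes": 14.05, "NBTree": 14.10, "Nearest-neighbor (1)": 21.42}

# Compare against the smaller of the two baselines to stay conservative.
reference = min(baseline_error.values())
beats_baseline = {name: err < reference for name, err in reported.items()}
```

A trained network whose test error is close to 24% has therefore learned little beyond the majority class.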
10. Notes for Delve
- One prototask (income) has been defined, using attributes 1-13 as inputs and income level as a binary target.
- Missing values - These are confined to attributes 2 (workclass), 7 (occupation) and 14 (native-country). The prototask only uses cases with no missing values.
- The income prototask comes with two priors, differing according to if attribute 4 (education) is considered to be nominal or ordinal.
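Because the prototask only uses cases with no missing values, incomplete rows have to be removed before training. The following is a minimal sketch, assuming missing values are marked with `?` in the data file; that marker is an assumption and should be checked against the downloaded files. The function name and the sample rows are hypothetical.

```python
MISSING = "?"   # assumed marker for an unknown value; verify in the data files

def drop_incomplete(rows):
    """Keep only the rows in which no field equals the missing-value marker."""
    return [row for row in rows if MISSING not in row]

# Made-up rows, shortened to three fields: workclass, occupation, native-country.
rows = [
    ["Private", "Sales", "United-States"],
    ["?", "?", "United-States"],
    ["State-gov", "Adm-clerical", "?"],
]
complete = drop_incomplete(rows)
```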
Step 2: Split the selected data into two groups.
Method:
1. Training data set for learning and validation
adult\dataset.data\dataset.data
adult\dataset.spec
2. Test data set for prediction and final error or performance measure
adult\income\prototask.data\prototask.data
adult\income\prototask.spec
Note: Read the description of the selected data carefully and check whether the training set and the test set have already been separated. Also check whether the training set has already been split into a learning data set and a validation data set.
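If the data has not been split already, both splits can be done with one small helper. The sketch below is only an illustration: the function name `split_rows`, the seeds, and the stand-in rows are assumptions. According to the dataset description above, the Adult data already comes with a train/test split (train=32561, test=16281), so only the second call would be needed for it.

```python
import random

def split_rows(rows, first_fraction, seed=0):
    """Shuffle a copy of rows and split it into two lists.

    first_fraction is the share of rows that goes into the first list.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * first_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(30))                                     # stand-in for real data rows
training, test = split_rows(rows, 2 / 3)                   # 2/3 training, 1/3 test
learning, validation = split_rows(training, 0.5, seed=1)   # 50:50 split
```

Changing the second call to `split_rows(training, 0.6, seed=1)` gives a 60:40 split instead.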
Step 3: Write a program for BP and Train an Artificial Neural Network
Method:
- Any code or simulator may be used, but regardless of the programming language, the backpropagation (BP) algorithm must be used.
- If this was not done in Step 2, split the training data set into a learning set and a validation set (e.g. 50:50). This is so that training can use the technique called cross-validation.
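To make the BP requirement concrete, here is a minimal sketch of online backpropagation with a momentum term for a network with one hidden layer. It is not the code of any of the referenced projects: the class `TinyMLP`, the sigmoid activation, the squared-error loss, the weight range, and the toy OR data are all assumptions chosen to keep the example short. The learning rate is also larger than a real run on the Adult data would use, only so that the toy example finishes quickly.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyMLP:
    """One hidden layer, sigmoid units, one sigmoid output (illustrative)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # The last weight in each row is the bias weight.
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                    for _ in range(n_hidden)]
        self.w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        # Previous weight changes, needed for the momentum term.
        self.dw_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.dw_o = [0.0] * (n_hidden + 1)

    def forward(self, x):
        xb = list(x) + [1.0]                      # inputs plus bias input
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_h]
        hb = hidden + [1.0]                       # hidden outputs plus bias input
        out = sigmoid(sum(w * v for w, v in zip(self.w_o, hb)))
        return xb, hb, out

    def train_step(self, x, target, lr, momentum):
        """One online BP update for the error E = 0.5 * (target - out)^2."""
        xb, hb, out = self.forward(x)
        delta_o = (out - target) * out * (1.0 - out)   # dE/d(net input of output)
        # Hidden deltas use the output weights before they are updated.
        delta_h = [delta_o * self.w_o[j] * hb[j] * (1.0 - hb[j])
                   for j in range(len(self.w_h))]
        for j in range(len(self.w_o)):
            self.dw_o[j] = -lr * delta_o * hb[j] + momentum * self.dw_o[j]
            self.w_o[j] += self.dw_o[j]
        for j, row in enumerate(self.w_h):
            for i in range(len(row)):
                self.dw_h[j][i] = (-lr * delta_h[j] * xb[i]
                                   + momentum * self.dw_h[j][i])
                row[i] += self.dw_h[j][i]
        return 0.5 * (target - out) ** 2

def mean_error(net, data):
    return sum(0.5 * (t - net.forward(x)[2]) ** 2 for x, t in data) / len(data)

# Toy data (logical OR), used only to show that the error goes down.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = TinyMLP(n_in=2, n_hidden=4, seed=1)
error_before = mean_error(net, data)
for epoch in range(2000):
    for x, t in data:
        net.train_step(x, t, lr=0.1, momentum=0.8)
error_after = mean_error(net, data)
```

For the Adult data, the nominal attributes would first have to be turned into numbers, which this sketch does not cover.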
Hints:
- The learning rate should be as small as possible (e.g. 0.0001), and a large momentum term is recommended (e.g. 0.8).
- Use only one hidden layer.
- Choose a high number of neurons (nodes or units) in the hidden layer (e.g. 20). Depending on the case, add a few more neurons and compare the results (e.g. the smaller the learning/validation error, the better the result). This can differ depending on the data.
- Run your algorithm several times (e.g. 10 times) with different initializations. Alternatively, split the learning/validation sets from above in a different way and run again (e.g. change the split from 50:50 to 60:40).
- Do not stop at the 10 runs given as an example above; repeat more runs so that the network maps to the results as well as possible.
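The roles of the learning rate and the momentum term in the first hint can be written as the usual gradient-descent update with momentum. This is the standard textbook form, given as a sketch: the symbols $\eta$ (learning rate), $\alpha$ (momentum), $E$ (error) and $w_{ij}$ (a single weight) are notation chosen here for illustration and do not come from the assignment text.

```latex
\Delta w_{ij}(t) = -\eta \, \frac{\partial E}{\partial w_{ij}} + \alpha \, \Delta w_{ij}(t-1),
\qquad
w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t)
```

With the example values $\eta = 0.0001$ and $\alpha = 0.8$, a gradient that stays constant leads to an effective step of about $\eta / (1 - \alpha) = 0.0005$ times the gradient, which is why a large momentum can partly compensate for a very small learning rate.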
Task 1: Plot the relationship between the number of iterations (epochs: x-axis) and the learning error and validation error (each on the y-axis). Call this Plot 1.
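One possible way to produce such a plot is to record both errors after every epoch and draw them with matplotlib. The sketch below assumes matplotlib is installed for the plotting part; the function names, the file name `plot1.png`, and the error values are made up for illustration.

```python
def curve_rows(learning_errors, validation_errors):
    """Return (epoch, learning_error, validation_error) rows, epochs from 1."""
    return [(epoch, le, ve) for epoch, (le, ve)
            in enumerate(zip(learning_errors, validation_errors), start=1)]

def save_plot(rows, path, title):
    """Draw both error curves against the epoch number and save an image."""
    import matplotlib
    matplotlib.use("Agg")                 # render to a file, no display needed
    import matplotlib.pyplot as plt
    epochs = [r[0] for r in rows]
    plt.figure()
    plt.plot(epochs, [r[1] for r in rows], label="learning error")
    plt.plot(epochs, [r[2] for r in rows], label="validation error")
    plt.xlabel("epochs")
    plt.ylabel("error")
    plt.title(title)
    plt.legend()
    plt.savefig(path)
    plt.close()

# Made-up error values, for illustration only.
rows = curve_rows([0.40, 0.30, 0.25], [0.42, 0.35, 0.36])
# save_plot(rows, "plot1.png", "Plot 1")   # uncomment when matplotlib is installed
```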
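Step 4: Reduce the error using the following method
Method:
- Early stopping method: after running many iterations, select the network whose error is at its minimum, and stop training at the point where the error reaches that minimum.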
Task 2: Once again, plot the relationship between the number of iterations (epochs: x-axis) and the learning error and validation error (each on the y-axis). Call this Plot 2.
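Early stopping can be sketched as a scan over the per-epoch errors. The version below assumes that the error being monitored is the validation error, and it adds a `patience` parameter (how many epochs without improvement are tolerated); both are assumptions, since the assignment text does not fix either. The function name and the error values are hypothetical.

```python
def early_stopping_epoch(validation_errors, patience):
    """Return (best_epoch, best_error), with epochs counted from 1.

    Training counts as stopped once the validation error has not
    improved for `patience` consecutive epochs.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors, start=1):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_error

# Made-up validation errors; the late drop to 0.30 is never reached.
val = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
best_epoch, best_error = early_stopping_epoch(val, patience=3)
```

In a real training loop, a copy of the weights would be saved whenever a new best epoch is found, so that the network from that epoch can be restored after stopping.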
Step 5: Measure the performance on the test set
Method: Two network structures have been trained so far:
- Task 1: Trained with learning and validation data set (so-called "without regularisation method")
- Task 2: Trained with "early stopping method"
- Check the final performance of the networks using the two trained structures above. Use the test data set separated in the earlier step here; until now, only the training data set has been used.
Task 3: Report the performance error of your networks and preferably compare them with the performance error of other ANN methods (refer to various related papers on the internet)
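For a classification task such as the Adult data, the performance error can be reported as the percentage of misclassified test cases. The following is a minimal sketch under that assumption; the 0.5 threshold, the function names, and the outputs of the two hypothetical networks are made up for illustration.

```python
def error_rate(predictions, targets):
    """Percentage of test cases whose predicted label differs from the target."""
    wrong = sum(1 for p, t in zip(predictions, targets) if p != t)
    return 100.0 * wrong / len(targets)

def to_label(output, threshold=0.5):
    """Map a network output in [0, 1] to a class label (threshold is assumed)."""
    return ">50K" if output >= threshold else "<=50K"

# Made-up outputs of two hypothetical networks on four test cases.
targets = [">50K", "<=50K", "<=50K", "<=50K"]
net1_outputs = [0.9, 0.6, 0.2, 0.1]     # Task 1 network (no early stopping)
net2_outputs = [0.8, 0.3, 0.2, 0.4]     # Task 2 network (early stopping)
err1 = error_rate([to_label(o) for o in net1_outputs], targets)
err2 = error_rate([to_label(o) for o in net2_outputs], targets)
```

Reporting the error in percent makes it directly comparable with the error figures listed in the dataset description.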
Write a short report (2~3 pages maximum)
- Explain which database was used
- Explain the network structure(s) used (number of nodes/layers)
- Show the graphs mentioned in steps 3 and 4 (Tasks 1 and 2)
- State your results in step 5 (Task 3)
http://sanghyukchun.github.io/42/ (a detailed explanation of the BP algorithm)
http://staff.itee.uq.edu.au/janetw/cmc/chapters/BackProp/index2.html (another explanation)
Analysis of the Tripod code backprop.cpp
A mapping file was needed, so I wrote it as a header.
The newly written header file
Finished project
https://github.com/xodmf1215/AI-backpropagation