
Notes on the Microsoft OpenHack Machine Learning Challenge

easymind223  2022-12-10


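I was lucky enough to take part in the Microsoft OpenHack challenge. The tasks were not especially hard, but they were fun and I learned a lot. I also had the good fortune to meet Liang Jian (梁健) from Microsoft (thank you for your help!) and made many friends in the field. I'm writing up this memorable competition and sharing the code for those who could not attend.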

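Dataset (link at the end of the post)

Each team first receives a dataset: images of gear provided by a mountaineering company, including ice axes, boots, carabiners, those spiked snow claws (crampons), gloves, hardshell jackets, harnesses and so on. There are 12 classes with a few hundred samples each, and our job is to classify and recognize these images.

A quick look:

[Image: sample images from the dataset]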

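Challenges:

There are six challenges in total. In brief:

1. Set up the environment (skipped)

2. Normalize the images (both color and size)

3. Classify the images with a classical machine learning method, precision > 0.8

4. Classify the images with a deep learning method, precision > 0.9

5. Deployment (skipped)

6. Object detection (on a brand-new dataset: detect whether climbers in the snow are wearing helmets. These are aerial images, so it is a bit harder.)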

_______________________________________

Below are the detailed description and code for each challenge.

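Challenge 2

Complete the following tasks:

Pick a base color, such as white, and pad every image whose aspect ratio is not 1:1.
Reshape each image into a 128x128x3 pixel array without simply stretching it.
Make sure the pixel values of each image span 0 to 255 inclusive, i.e. [0, 255], also known as "contrast stretching".

Normalize or equalize so that the pixels fall in the [0, 255] range.

Success criteria
The team runs a code cell in a Jupyter Notebook that plots the original image and then the padded image with normalized or equalized pixel values, and shows it to the coach.
The team runs a code cell in a Jupyter Notebook for the coach showing that the histogram of pixel values spans 0 to 255 (inclusive).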
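
The padding and resizing tasks listed above can be sketched as follows. This is a minimal illustration using Pillow, not necessarily what was used in the competition: the helper name `pad_to_square` and its parameters are made up for this example, and centering the image on the canvas is an assumption.

```python
from PIL import Image


def pad_to_square(img, fill=(255, 255, 255), size=128):
    """Pad a PIL image to a 1:1 aspect ratio with `fill`, then resize it.

    Hypothetical helper for illustration only.
    """
    img = img.convert('RGB')
    w, h = img.size
    side = max(w, h)
    # Square canvas in the fill color (white by default)
    canvas = Image.new('RGB', (side, side), fill)
    # Paste the original in the center so the padding is split evenly
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    # The canvas is already square, so resizing does not distort the content
    return canvas.resize((size, size))


# A 200x100 black image stands in for a real gear photo
demo = Image.new('RGB', (200, 100), (0, 0, 0))
out = pad_to_square(demo)
print(out.size)  # (128, 128)
```

Because the image is padded before it is resized, the aspect ratio of the object is preserved, which is what "without simply stretching it" asks for.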

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image


def normalize(src):
    """Contrast-stretch each RGB channel of a PIL image to the full [0, 255] range."""
    # Work on RGB only, so an alpha channel (if any) is dropped rather than stretched
    arr = np.array(src.convert('RGB')).astype('float')
    for i in range(3):
        minval = arr[..., i].min()
        maxval = arr[..., i].max()
        # A constant channel cannot be stretched; leave it as is
        if minval != maxval:
            arr[..., i] -= minval
            arr[..., i] *= 255.0 / (maxval - minval)
    # Round before casting so the maximum lands exactly on 255
    return Image.fromarray(np.rint(arr).astype(np.uint8))


def plot_rgb_hist(img_array):
    """Overlay the density histograms of the R, G and B channels."""
    for channel, color in enumerate(('r', 'g', 'b')):
        plt.hist(img_array[:, :, channel].flatten(), bins=256, density=True,
                 facecolor=color, edgecolor=color)
    plt.axis([0, 255, 0, 0.03])


plt.figure(figsize=(10, 10))  # set the figure size

# src = Image.open("100974.jpeg")
src = Image.open("rose.jpg")
src_array = np.array(src.convert('RGB'))

plt.subplot(2, 2, 1), plt.title('src')
plt.imshow(src), plt.axis('off')

plt.subplot(2, 2, 2), plt.title('src hist')
plot_rgb_hist(src_array)  # histogram of the original image

dst = normalize(src)
dst_array = np.array(dst)

plt.subplot(2, 2, 3), plt.title('dst')
plt.imshow(dst), plt.axis('off')

plt.subplot(2, 2, 4), plt.title('dst hist')
plot_rgb_hist(dst_array)  # histogram of the normalized image

[Image: the original and normalized images with their RGB histograms]

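Challenge 3

Use a non-parametric classification method (see the reference documentation) to build a model that predicts the class of new outdoor gear images, trained on the preprocessed 128x128x3 gear images from Challenge 2. The algorithm can be any existing non-parametric classifier from the scikit-learn library. Show the coach the precision on the provided test set; the precision score must exceed 80%.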

import os
from PIL import Image

dir_data = "data/preprocess_images/"

equipments = ['axes', 'boots', 'carabiners', 'crampons', 'gloves', 'hardshell_jackets', 'harnesses', 'helmets',
              'insulated_jackets', 'pulleys', 'rope', 'tents']
train_data = []  # one flattened grayscale image per entry
y = []           # class name of each image

for equip_name in equipments:
    dir_equip = os.path.join(dir_data, equip_name)

    for filename in os.listdir(dir_equip):
        if 'jpeg' in filename:
            name = os.path.join(dir_equip, filename)
            # Convert to grayscale and flatten to a 1-D list of pixel values
            img = Image.open(name).convert('L')
            train_data.append(list(img.getdata()))
            y.append(equip_name)

from sklearn import neighbors
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import precision_score, recall_score

# Hold out 30% of the images as the test set
train_X, test_X, train_y, test_y = train_test_split(train_data, y, test_size=0.3, random_state=0)

# k-nearest neighbors with the default k, using a k-d tree for the neighbor search
clf_knn = neighbors.KNeighborsClassifier(algorithm='kd_tree')
clf_knn.fit(train_X, train_y)
y_pred = clf_knn.predict(test_X)

import itertools
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Divide each row by its sum so that every true class adds up to 1
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Write each cell's value on top of it, in white on dark cells
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
np.set_printoptions(precision=2)
confusion_mat = confusion_matrix(test_y, y_pred, labels=equipments)

# Plot non-normalized confusion matrix
plt.figure(figsize=(10, 10))
plot_confusion_matrix(confusion_mat, classes=equipments,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure(figsize=(10, 10))
plot_confusion_matrix(confusion_mat, classes=equipments, normalize=True,
                      title='Normalized confusion matrix')

plt.show()

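Since the requirement is precision > 0.8, many scikit-learn algorithms should be able to meet it. I chose KNN, which gave fairly high accuracy and should be more than enough.

[Image: confusion matrix of the KNN classifier]

Computing precision and recall shows both comfortably above 0.8.

[Image: precision and recall results]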
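
The precision and recall computation itself is not shown in the post. A minimal sketch of how it could be done with scikit-learn follows; the toy labels are made up for illustration, and macro averaging is an assumption, since the post does not say which averaging was used. With the real data, `test_y` and `y_pred` from the KNN code would take the place of the toy lists.

```python
from sklearn.metrics import precision_score, recall_score

# Toy ground-truth and predicted labels (illustrative only, not competition data)
y_true = ['axes', 'boots', 'boots', 'rope', 'rope', 'rope']
y_hat = ['axes', 'boots', 'rope', 'rope', 'rope', 'boots']

# Macro averaging: compute the metric per class, then take the unweighted mean
precision = precision_score(y_true, y_hat, average='macro')
recall = recall_score(y_true, y_hat, average='macro')

print('precision: %.3f' % precision)
print('recall:    %.3f' % recall)
```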

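Challenge 4

Success criteria: use a deep learning model, such as a CNN, to analyze complex data.
The team runs a code cell in a Jupyter Notebook for the coach showing a model accuracy of 90% or higher.

An accuracy above 0.9 is hard to reach with the classical machine learning algorithms in scikit-learn, so at this point a CNN is the way to go.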

import os
import numpy as np
from PIL import Image

dir_data = "data/preprocess_images/"

equipments = ['axes', 'boots', 'carabiners', 'crampons', 'gloves', 'hardshell_jackets', 'harnesses', 'helmets',
              'insulated_jackets', 'pulleys', 'rope', 'tents']
train_data = []  # one 2-D grayscale image per entry
y = []           # integer class index of each image

for class_idx, equip_name in enumerate(equipments):
    dir_equip = os.path.join(dir_data, equip_name)
    for filename in os.listdir(dir_equip):
        if 'jpeg' in filename:
            name = os.path.join(dir_equip, filename)
            # Keep the 2-D layout this time: the CNN needs the spatial structure
            img = Image.open(name).convert('L')
            train_data.append(np.array(img))
            y.append(class_idx)
train_data = np.asarray(train_data)

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
import numpy as np
import keras

num_classes = 12
img_rows = 128
img_cols = 128
train_X, test_X, train_y, test_y = train_test_split(train_data, y, test_size=0.3, random_state=0)

# Add a trailing channel axis: (samples, rows, cols, 1) for grayscale input
train_X = train_X.reshape(train_X.shape[0], img_rows, img_cols, 1)
test_X = test_X.reshape(test_X.shape[0], img_rows, img_cols, 1)

# Scale pixel values from [0, 255] to [0, 1]
train_X = train_X.astype('float32')
test_X = test_X.astype('float32')
train_X /= 255
test_X /= 255
print('train_X shape:', train_X.shape)
print(train_X.shape[0], 'train samples')
print(test_X.shape[0], 'test samples')

# convert class vectors to binary class matrices
train_y = keras.utils.to_categorical(train_y, num_classes)
test_y = keras.utils.to_categorical(test_y, num_classes)

from keras.layers import Dense, Flatten, MaxPooling2D, Conv2D
from keras.models import Sequential
import keras

# Two convolution + max-pooling stages, followed by a small fully connected classifier
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=(128, 128, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.5))
model.add(Dense(12, activation='softmax'))

# Note: Adadelta's default learning rate depends on the Keras version
# (1.0 in standalone Keras 2.x, 0.001 in tf.keras), so training speed may differ
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

# The held-out test set doubles as validation data here
model.fit(train_X, train_y,
          batch_size=128,
          epochs=50,
          verbose=1,
          validation_data=(test_X, test_y))
score = model.evaluate(test_X, test_y, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

The CNN's confusion matrix is considerably better than KNN's.

[Image: confusion matrix of the CNN classifier]
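
Because the labels in Challenge 4 are one-hot encoded and the network outputs softmax probabilities, both have to be turned back into class indices before a confusion matrix can be computed. A minimal sketch with made-up numbers (the arrays below are illustrative, not results from the competition; with the real model, `model.predict(test_X)` and `test_y` would take their place):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Stand-in for softmax outputs: 4 samples, 3 classes
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
# Stand-in for one-hot encoded ground truth
one_hot = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [0, 1, 0]])

# argmax along the class axis recovers the class index of each sample
pred_idx = probs.argmax(axis=1)
true_idx = one_hot.argmax(axis=1)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(true_idx, pred_idx, labels=[0, 1, 2])
print(cm)
```

The resulting matrix can then be passed to the same `plot_confusion_matrix` function used for KNN.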

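After many training runs, adjusting the convolutional layers and hyperparameters each time, I finally got a fairly good result.

[Image: final CNN result]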

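Challenge 6

Using a deep learning framework and a commonly used model such as Faster R-CNN, train an object detection model. The model must detect every helmet that appears in an image and mark it with a bounding box.

For this challenge you first have to label the samples yourself; annotating a few hundred images was exhausting. We used VoTT for labeling, which conveniently generates a sample description file automatically. For the Faster R-CNN code we followed a red blood cell detection project on GitHub, https://github.com/THULiusj/CosmicadDetection-Keras-Tensorflow-FasterRCNN. There is too much code to post here.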
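
The last step of the task, marking each detected helmet with a box, can be sketched independently of the detector. The example below assumes detections arrive as `(x1, y1, x2, y2, score)` tuples in pixel coordinates; that format, the helper name `draw_boxes` and the 0.5 score threshold are assumptions for illustration, not the output format of the project referenced above.

```python
from PIL import Image, ImageDraw


def draw_boxes(img, detections, min_score=0.5, color=(255, 0, 0)):
    """Draw a rectangle for every detection whose score is at least min_score.

    Hypothetical helper; returns the annotated copy and the number of boxes drawn.
    """
    out = img.convert('RGB')  # convert() returns a new image, the input is untouched
    draw = ImageDraw.Draw(out)
    kept = 0
    for x1, y1, x2, y2, score in detections:
        if score < min_score:
            continue  # skip low-confidence detections
        draw.rectangle([x1, y1, x2, y2], outline=color, width=2)
        kept += 1
    return out, kept


# A blank image and two made-up detections stand in for a real aerial photo
img = Image.new('RGB', (100, 100), (0, 0, 0))
detections = [(10, 10, 40, 40, 0.9), (50, 50, 80, 80, 0.2)]
annotated, kept = draw_boxes(img, detections)
print(kept)  # 1
```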

Finally, an example result:

[Image: helmet detection result]

Links to the dataset and the VoTT tool for this post:

https://pan.baidu.com/s/1FFw0PLJrrOhwR6J1HexPJA

Extraction code: s242
