Tencent AI Lab AAAI 2018 Oral Paper: Training L1-Regularized Models with Orthant-Wise Passive Descent Algorithms


Preface: Tencent AI Lab had 12 papers accepted at AAAI 2018, a top international conference in artificial intelligence held in New Orleans, USA. The Tencent Technology Engineering official account exclusively compiled the paper "Training L1-Regularized Models with Orthant-Wise Passive Descent Algorithms", which was accepted as an oral presentation at AAAI 2018. The paper was completed independently by Tencent AI Lab; its author is Jianqiao Wangni.



Summary

L1-regularized models are a widely used tool for analyzing high-dimensional data. For such models on modern large-scale internet data, studying their optimization algorithms can improve convergence speed, which in turn can significantly improve model accuracy within a limited time budget, or reduce the dependence on server resources. Although classical stochastic gradient descent (SGD) is applicable to many models such as neural networks, it cannot handle the non-differentiability of the L1 norm.


In this paper, we propose a new stochastic optimization method, the Orthant-Wise Passive Descent Algorithm (OPDA). The starting point of the algorithm is that the L1-norm function is continuously differentiable within any single orthant; therefore, after each update the model parameters are projected back onto the orthant occupied at the previous iteration. We use stochastic variance-reduced gradients (SVRG) to produce the descent direction, and combine them with a quasi-Newton method to exploit second-order information of the loss function. We also propose a new orthant projection operator that makes the algorithm converge to a sparser optimum. Theoretically, we prove that the algorithm converges linearly on strongly convex and smooth loss functions. On typical sparse datasets such as RCV1, we evaluated the algorithm on L1/L2-regularized logistic regression under different parameter settings, where it significantly outperforms the existing linearly convergent algorithm Proximal-SVRG. In experiments with convolutional neural networks (CNNs), it also outperforms Proximal-SGD and related algorithms, showing that the method performs well on both convex and non-convex objectives.

 

Full Text of the English Presentation

Hello, everyone, I am Jianqiao Wangni, from Tencent AI Lab.


1. Introduction: Learning sparse representations has been a very important task for data analysis. For example, in biology, the genetic analysis of a single individual usually involves millions of genes. In financial time-series prediction and online advertising, there are also many cases where the number of samples is even smaller than the data dimension, which is an ill-conditioned problem without a sparse prior. So, for conventional models like logistic regression and linear regression, we add an L1-norm regularization, the sum of absolute values, to build robust applications on high-dimensional data; it is very effective for learning sparse representations. To give an intuitive example: the blue areas are the constraint regions, the L1-norm ball on the left and the L2-norm ball on the right, respectively, while the red circles are the contours of the average squared loss. The intersection point between a ball and the contours is the solution of the corresponding regularized model. We can see that the solution of the L1-regularized model lies near the Y-axis, which means the element in the X-dimension is closer to zero.
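
As a minimal illustration of this geometric picture, one can compare how many coefficients survive under L1 versus L2 regularization; the synthetic dataset and parameter choices below are hypothetical and not taken from the paper.

```python
# A minimal sketch (not from the paper): compare the sparsity of L1- and
# L2-regularized logistic regression on synthetic high-dimensional data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical setting: fewer samples than features, few informative dimensions.
X, y = make_classification(n_samples=200, n_features=1000,
                           n_informative=10, random_state=0)

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=1.0).fit(X, y)

print("nonzero coefficients, L1:", np.count_nonzero(l1_model.coef_))
print("nonzero coefficients, L2:", np.count_nonzero(l2_model.coef_))
```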


2. Formal Definition: Now we go to the analytical part. We study a regularized function P(x), which equals F(x) + R(x), where F(x) is the average of N loss functions, each depending on one data sample, and R(x) is the L1 regularization. We also assume that each loss function is twice differentiable, strongly convex and smooth, which are standard assumptions in convex optimization. The L1 norm itself is not differentiable. One of the most representative optimization methods is the proximal method, which iteratively takes a gradient descent step and then solves a proximal problem at the current point.
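
A minimal sketch of one such proximal step, assuming the standard composite form P(x) = F(x) + lam * ||x||_1, where the proximal problem of the L1 term reduces to elementwise soft-thresholding (grad_F below is a hypothetical callable for the gradient of F):

```python
# Minimal sketch of one proximal gradient (ISTA) step for P(x) = F(x) + lam * ||x||_1.
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient_step(x, grad_F, lam, eta):
    """One step: gradient descent on the smooth part F, then the L1 proximal map.

    grad_F is a hypothetical callable returning the gradient of F at a point;
    eta is the step size and lam the regularization weight.
    """
    return soft_threshold(x - eta * grad_F(x), eta * lam)
```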


3. Reference 1: Our primary reference is the orthant-wise limited-memory quasi-Newton method, OWL-QN, which is based on L-BFGS, a representative quasi-Newton method, and which overcomes the obstacle that the L1 norm is not differentiable. The method restricts the updated parameter to stay within a certain orthant, because within any single orthant the absolute value function is actually differentiable. A key component of OWL-QN is the subgradient at zero points. The subgradient of the L1 regularization R(x) can be either positive lambda or negative lambda. Take the third branch as an example: we study a single dimension, the i-th dimension, of the current point, X_i, and of the gradient, V_i. If X_i equals zero and V_i plus lambda is negative, then the subgradient is set to V_i plus lambda; after subtracting this subgradient, X_i becomes a positive value, where the gradient of R(x) is exactly positive lambda. This keeps the subgradient of the L1 norm consistent across one iteration.
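
This orthant-consistent rule can be sketched as follows, with v standing for the gradient of the smooth loss F and lam for the regularization weight; this is a simplified rendering, not the paper's exact pseudo-code:

```python
# Sketch of the orthant-wise (OWL-QN-style) subgradient of F(x) + lam * ||x||_1.
# v is the gradient of the smooth loss F at x; lam is the L1 weight.
import numpy as np

def pseudo_gradient(x, v, lam):
    g = np.zeros_like(x)
    for i in range(len(x)):
        if x[i] > 0:                   # inside the positive orthant: d|x_i|/dx_i = +1
            g[i] = v[i] + lam
        elif x[i] < 0:                 # inside the negative orthant: d|x_i|/dx_i = -1
            g[i] = v[i] - lam
        else:                          # x_i == 0: choose the sign-consistent branch
            if v[i] + lam < 0:         # descent will make x_i positive
                g[i] = v[i] + lam
            elif v[i] - lam > 0:       # descent will make x_i negative
                g[i] = v[i] - lam
            # otherwise leave g[i] = 0
    return g
```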


4. Reference 2: To process massive internet data, many optimization methods have been proposed to speed up training. Stochastic gradient descent, SGD, is a popular choice. However, SGD generally needs a decreasing step size to converge, so it only has a sublinear convergence rate. Recently, stochastic variance-reduction methods such as SVRG and SAGA have been shown to converge without decreasing step sizes and to achieve linear convergence rates on smooth and strongly convex models. In SGD, the descent direction at the k-th step, v_k, is evaluated on a stochastic subset S_k of the dataset. In SVRG, we periodically calculate a full gradient, which depends on all the data, at a reference point, say tilde-X. This full gradient forms the third term of the SVRG direction v_k; to keep the direction unbiased in expectation, we subtract a stochastic gradient at tilde-X, calculated on the same subset S_k.
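
A minimal sketch of this variance-reduced direction, with grad_batch as a hypothetical helper that averages the sample gradients over an index set:

```python
# Sketch of the SVRG descent direction on a mini-batch S_k.
# grad_batch(x, S) is a hypothetical helper returning the average gradient of the
# loss functions indexed by S at point x; full_grad_tilde is the full gradient of
# F at the reference point x_tilde.
def svrg_direction(x_k, x_tilde, full_grad_tilde, S_k, grad_batch):
    return grad_batch(x_k, S_k) - grad_batch(x_tilde, S_k) + full_grad_tilde
```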


5. The First Step: Now we proceed to our method. Although efficient, OWL-QN can be further improved, for example by dropping the line-search procedure, or by using a stochastic gradient instead of the accurate but costly full gradient. Inspired by SVRG and OWL-QN, we develop a stochastic optimization method for L1-regularized models. This is more complicated than the smooth case. In the first step, we calculate the SVRG direction of the loss function F(x); then we use an idea from OWL-QN to push the descent direction toward maintaining the same orthant after one iteration, and this will be used as a reference orthant. Our actual descent direction V_k is calculated as in the third equation here.
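
One plausible reading of this first step, reusing the SVRG and pseudo-gradient sketches above (the exact equation is the one on the slide and in the paper):

```python
# Sketch of the first OPDA step: form an SVRG direction for the smooth loss F,
# then apply the orthant-wise rule so the direction accounts for the L1 term and
# defines a reference orthant. Reuses svrg_direction and pseudo_gradient above.
def opda_direction(x_k, x_tilde, full_grad_tilde, S_k, grad_batch, lam):
    g = svrg_direction(x_k, x_tilde, full_grad_tilde, S_k, grad_batch)
    return pseudo_gradient(x_k, g, lam)
```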


6. The Second-Order Information: At this point, we have an optional choice for the descent direction: we can use second-order information of the loss function, either by calculating the Hessian matrix at the current point or by estimating an approximate Hessian, as in quasi-Newton methods. The descent direction D_k is then obtained by minimizing the following quadratic expansion around the current point. If we use L-BFGS, we do not actually need to perform the matrix inversion in the equation; it can be done through efficient matrix-vector multiplications. We may also directly assign V_k to D_k, as in a typical first-order method. After this step, the orthant of the direction D_k has to be consistent with V_k, which means that if some dimensions of D_k have signs different from those of V_k, they are aligned to zero; we denote the aligned version by P_k.
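
The sign alignment of D_k against V_k can be sketched as a simple elementwise projection; this is an illustration, not the paper's exact operator:

```python
# Sketch of aligning the (quasi-Newton) direction d with the reference direction v:
# dimensions whose signs strictly disagree with v are set to zero, giving p.
import numpy as np

def align_direction(d, v):
    p = d.copy()
    p[d * v < 0] = 0.0
    return p
```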


7. Final Updates: The aforementioned calculation does not explicitly involve the partial derivative of R(x), the L1 regularization, except as the alignment reference, since we want to avoid adding extra variance to the stochastic gradient. To make the solution sparse, we introduce a novel alignment operator that encourages zero elements: if X and Y have different signs, or the absolute value of X is less than a threshold, X is forced to zero. With this alignment operator, each time we complete the previous calculation we examine whether the next point lies in the same orthant as the current point; if not, some dimensions of the next point are set to zero. After this step, clearly more dimensions of X_k become exactly zero instead of remaining small but nonzero values.
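
A minimal sketch of this alignment operator as described, where eps is a hypothetical name for the paper's threshold parameter:

```python
# Sketch of the alignment operator applied to the updated point: a coordinate of x
# is forced to zero if it changes sign relative to the reference y, or if its
# magnitude falls below a small threshold eps.
import numpy as np

def align_point(x, y, eps):
    z = x.copy()
    z[(x * y < 0) | (np.abs(x) < eps)] = 0.0
    return z
```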


8. Convergence Analysis: In the paper, we prove that, under the assumptions of smoothness and strong convexity, our method converges at a linear rate. Due to the limited time, we will not go into the details of this rather heavy mathematical analysis.


9. Synthetic Experiments: To visualize a comparison, we plot the optimization trajectories on a simple two-dimensional synthetic function in this figure. Our method, OPDA, is denoted by the red line, and the baseline, proximal gradient descent, by the blue line. After the same number of iterations, we see that our method converges to the minimum faster.


10. Experiments on Convex Cases: We also ran experiments on logistic regression with both L2 and L1 regularization for binary classification. In this part, we compare our method with proximal-SVRG, which is also linearly convergent. In the figure, the Y-axis denotes the suboptimality, i.e., how far the current objective value is from the minimum, and the X-axis represents the number of data passes. We consistently found that OPDA runs faster than proximal-SVRG across different step sizes and different L2 and L1 regularization strengths.


11. Experiments on Deep Learning: We also conducted experiments with L1-regularized convolutional neural networks, or sparse CNNs, to demonstrate the efficiency in the nonconvex case. This application is also useful for reducing the parameter size of neural networks. The red line represents our method, and the blue line is proximal-SVRG. We test different scales of L1 regularization. We see that OPDA converges faster than proximal-SVRG, and the gap becomes larger when the L1 regularization is stronger. Actually, due to the orthant-wise nature of our method, many dimensions of the descent direction and of the updated point are forced to zero during alignment, which makes the effective step smaller; nevertheless, our method still converges much faster than the proximal methods with the same step size. This shows that our proposed alignment operator does calibrate the direction for the better, making the overall framework more efficient in terms of iterations while adding only negligible extra arithmetic operations.
