Spark --- Word Frequency Count (Code Snippet)

luren-hometown     2022-12-17


This post shares the process of doing word frequency counting with Spark driven from Python:

1. Create a project (here it was created inside an existing project; a standalone wordcount project works just as well)

① Create a txt file: wordcount.txt (contents: the same as the file used in Word Frequency Count (Part 1))

② Create a py file: word.py

from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf().setAppName("word").setMaster("local")
sc = SparkContext(conf=conf)
wordcount = sc.textFile(r"E:\Hbaseapi\wordcount")  # local path to the wordcount input file; adjust to your environment
counts = wordcount.flatMap(lambda x: x.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .collect()
print(counts)

Output:

[('development', 1), ('producing', 1), ('among', 1), ('Source,', 1), ('for', 1), ('quality', 1), ('to', 1), ('influencers', 1), ('advances', 1), ('collaborative', 1), ('model', 1), ('in', 1), ('the', 2), ('of', 1), ('has', 1), ('successful', 1), ('Software', 1), ("Foundation's", 1), ('most', 1), ('long', 1), ('that', 1), ('uded', 1), ('as', 1), ('Open', 1), ('The', 1), ('commitment', 1), ('software', 1), ('consistently', 1), ('a', 1), ('development.', 1), ('high', 1), ('future', 1), ('Apache', 1), ('served', 1), ('open', 1), ('https://s.apache.org/PIRA', 1)]
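The collected list comes back in no particular order. If a ranked result is wanted, the pairs can be sorted by count before collecting; the sketch below is one possible extension that reuses the sc and the wordcount RDD defined above, and sortBy is a standard RDD method:

# Optional: sort the (word, count) pairs by count, descending, before collecting.
# This reuses the wordcount RDD created above.
top = wordcount.flatMap(lambda x: x.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda pair: pair[1], ascending=False) \
    .collect()
print(top)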

2. If the data set for the word count is small, you can do it like this:

from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf().setAppName("word").setMaster("local")
sc = SparkContext(conf=conf)
data = [r"uded among the most successful influencers in Open Source, The Apache Software Foundation's       commitment to collaborative development has long served as a model for producing consistently       high quality software that advances the future of open development. https://s.apache.org/PIRA      "]
datardd = sc.parallelize(data)

result = datardd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b).collect()
print(result)

Output:

[('', 18), ('development', 1), ('producing', 1), ('among', 1), ('Source,', 1), ('for', 1), ('quality', 1), ('to', 1), ('influencers', 1), ('served', 1), ('collaborative', 1), ('in', 1), ('the', 2), ('Open', 1), ('of', 1), ('has', 1), ('long', 1), ('https://s.apache.org/PIRA', 1), ('successful', 1), ('Software', 1), ('most', 1), ('consistently', 1), ('a', 1), ("Foundation's", 1), ('uded', 1), ('as', 1), ('advances', 1), ('The', 1), ('commitment', 1), ('software', 1), ('that', 1), ('development.', 1), ('high', 1), ('future', 1), ('Apache', 1), ('model', 1), ('open', 1)]
18/07/27 17:14:34 INFO SparkContext: Invoking stop() from shutdown hook
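The ('', 18) entry appears because split(' ') produces an empty string for every extra space in the runs of spaces inside the input. Calling split() with no argument splits on any run of whitespace and drops the empty tokens; a minimal sketch reusing the datardd from above:

# split() with no separator splits on runs of whitespace, so no empty-string tokens are produced.
result = datardd.flatMap(lambda x: x.split()) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .collect()
print(result)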

Summary:

① Prerequisite for driving Spark word counts from Python on Windows: the machine must have the Spark and Java environments configured (the steps are similar to configuring Python), and both installations must be verified before the job will run.

② Pay attention to the compatibility between the local Python and Spark versions. This machine had Python 3.6 with Spark 1.6, which are clearly incompatible, so Python 3.5 had to be installed instead; the same applies to Python and Spark on Linux.

③ Be careful with collect() in real work: big-data jobs often process data at the PB scale, and calling collect() without thinking will simply crash the driver by exhausting its memory.

For case ③, the recommended approach is:

from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf().setAppName("word").setMaster("local")
sc = SparkContext(conf=conf)
data = [r"uded among the most successful influencers in Open Source, The Apache Software Foundation's       commitment to collaborative development has long served as a model for producing consistently       high quality software that advances the future of open development. https://s.apache.org/PIRA      "]
datardd = sc.parallelize(data)

# result = datardd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b).collect()
# print(result)
result = datardd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
def f(x):
    print(x)

result2 = result.foreach(f)  # foreach() returns None; the records are printed as they are visited
print(result2)
Explanation: foreach() iterates over the result and prints the records one by one in the background, which avoids the risk of blowing up memory by collecting everything at once.
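Besides foreach(), other common ways to keep a huge result off the driver are take(), toLocalIterator(), and saveAsTextFile(); all three are standard RDD methods. A minimal sketch reusing the un-collected result RDD from above (the output directory is a hypothetical example path):

# Pull only a small sample of records to the driver.
print(result.take(10))

# Stream the result to the driver one partition at a time instead of all at once.
for pair in result.toLocalIterator():
    print(pair)

# Or keep the data off the driver entirely and write it out to storage
# (the directory below is only an example path and must not already exist).
result.saveAsTextFile(r"E:\Hbaseapi\wordcount_output")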

 
