Analyzing social phenomena with Python: the Matthew effect and power laws

 

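Python is a powerful tool for data mining. For data processing there are numpy and pandas; for visualization there is matplotlib; and for scraping there is a heavyweight like scrapy.

A while ago I used scrapy to crawl nearly a million questions (each with its text and number of answers) from the "Society" topic on Zhihu. The data sat unused in mongodb until I recently came across the term "power law" in the book "Mining of Massive Datasets". The original passage reads: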

Often, the existence of power laws with values of the exponent higher than 1 are explained by the Matthew effect. In the biblical Book of Matthew, there is a verse about “the rich get richer.” Many phenomena exhibit this behavior, where getting a high value of some property causes that very property to increase. For example, if a Web page has many links in, then people are more likely to find the page and may choose to link to it from one of their pages as well. As another example, if a book is selling well on Amazon, then it is likely to be advertised when customers go to the Amazon site. Some of these people will choose to buy the book as well, thus increasing the sales of this book.

 

Power Laws: Many phenomena obey a law that can be expressed as $y = cx^{a}$ for some power $a$, often around −2. Such phenomena include the sales of the xth most popular book, or the number of in-links to the xth most popular page.

 

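I had heard of the Matthew effect long ago, but the power law was new to me. It also goes a step further by expressing the relationship between the two quantities as a formula (a power relationship, which becomes linear after taking logarithms of both sides), and a conclusion of that kind is more useful. So I decided to test the law on the data mentioned above, starting with how many times each city is mentioned.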
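To make the formula concrete: taking logarithms of y = c·x^a gives log y = log c + a·log x, so on log-log axes a power law is a straight line with slope a. The following is a minimal sketch, not the code used in this post, that estimates c and a from rank-ordered counts with ordinary least squares. The function name `fit_power_law` is made up for illustration. A log-log fit like this is only a rough check; maximum-likelihood estimators are generally considered more reliable for real data.

```python
import math


def fit_power_law(values):
    """Estimate (c, a) in y = c * x**a, where x is the 1-based rank.

    `values` are assumed to be sorted from most to least frequent.
    Non-positive values are skipped because their logarithm is undefined.
    """
    points = [
        (math.log(rank), math.log(y))
        for rank, y in enumerate(values, start=1)
        if y > 0
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two positive values")

    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    a = sxy / sxx                       # slope of the line in log-log space
    c = math.exp(mean_y - a * mean_x)   # intercept, mapped back from log space
    return c, a


# Synthetic data that follows y = 1000 * x**-2 exactly.
synthetic = [1000 * rank ** -2 for rank in range(1, 51)]
c, a = fit_power_law(synthetic)
print(round(c, 3), round(a, 3))
```

On real counts the fitted exponent depends heavily on how much of the tail is included, so it is best read as a rough indicator rather than a measurement.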

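The code is as follows:

Lines 11-12: matplotlib's default fonts do not cover Chinese characters, so a font installed on the local machine has to be loaded.

Lines 14-17: connect to mongodb and to the relevant data.

Line 23: query the database for "questions" that contain the given keyword.

Line 32: sort the array by the number of times each keyword appears.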
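The counting and sorting steps described in these notes can be sketched as follows. This is an illustration under assumptions, not the original listing: it works on an in-memory list of question strings rather than a mongodb query, and the name `count_mentions` and the sample data are invented for the example.

```python
from collections import Counter


def count_mentions(questions, keywords):
    """Count how many questions mention each keyword, most frequent first.

    A question that contains a keyword several times still counts once,
    which mirrors counting the documents matched by a "contains" query.
    Keywords that never appear are left out of the result.
    """
    counts = Counter()
    for text in questions:
        for keyword in keywords:
            if keyword in text:
                counts[keyword] += 1
    return counts.most_common()  # list of (keyword, count), descending


# Hypothetical sample data standing in for the crawled questions.
sample_questions = [
    "北京和上海的房价为什么这么高?",
    "在北京生活是什么体验?",
    "广州的夏天有多热?",
]
cities = ["北京", "上海", "广州", "深圳"]
for city, n in count_mentions(sample_questions, cities):
    print(city, n)
```

With nearly a million records, scanning every question once per keyword in Python is slow; pushing the "contains" filter into the database, as the original code does, avoids loading all the text into memory.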

The resulting chart looks like this:

[Figure: screenshot of the chart of mention counts per city, 2016-09-10]

 

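The curve roughly follows the shape of a power function. The long tail and the 80/20 rule also show up in it.

I then tried other kinds of data: countries, social phenomena, years, occupations, and forms of address. Only the variable 'xx' needs to be changed. The results are as follows: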
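One way to put a number on the 80/20 observation is to compute the share of all mentions that goes to the top 20% of items. The sketch below is only an illustration; the helper `top_share` and the sample counts are hypothetical and not taken from the crawled data.

```python
def top_share(counts, fraction=0.2):
    """Return the share of the total held by the top `fraction` of items.

    The number of items in the top group is rounded down, but at least
    one item is always included.
    """
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("counts must contain at least one positive value")
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / total


# Made-up counts: one dominant item followed by a tail.
print(top_share([50, 20, 10, 10, 10]))
```

A value well above the chosen fraction points to a concentrated head, while a value close to the fraction points to a roughly even distribution.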

[Figures: five screenshots of the corresponding charts, 2016-09-10]

 

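These also roughly follow the power law, which suggests that it is a widespread phenomenon. The reasons behind it are beyond the scope of this post.

This kind of analysis is fun, and it also helps in understanding some social phenomena. Python makes a task like this very convenient. The scrapy spider in particular took only a dozen or so lines of code, although it ran all night; mongoengine is also pleasant to use; and the data analysis trio of numpy, pandas, and matplotlib is powerful and quick to work with.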

 
