Falling into the Python * Repetition Operator Trap

Date: 2014-01-18 | Category: Python | Read: 123 words, ~1 min

Python has a special symbol, "*", that serves both as the multiplication operator for numbers and as the repetition operator for sequences. When it is used for repetition, be careful: the repeated items may all be references to one and the same object at the same memory address. Test code:

    grid_width = 2
    grid_height = 2

    def modify_grid(cells, row, col, val):
        cells[row][col] = val
        print(cells)

    # testing 1
    print('\ntesting 1')
    cells = [[88 for col in range(grid_width)] for row in range(grid_height)]
    print(cells)
    modify_grid(cells, 0, 1, 66)

    # testing 2: In the trap
    print('\ntesting 2')
    cells = [[88] * grid_width] * grid_height
    print(cells)
    modify_grid(cells, 0, 1, 66)
    print('\n')
    cells = [["88"] * grid_width] * grid_height
    print(cells)
    modify_grid(cells, 0, 1, "66")

    # testing 3
    print('\ntesting 3')
    cells = []
    for idx in range(grid_height):
        cells.append([88] * grid_width)
    print(cells)
    modify_grid(cells, 0, 1, 66)

    # testing 4
    print('\ntesting 4')
    cells = [123] * (grid_height * grid_width)
    print(cells)
    cells[1] = 321
    print(cells)

Results

Read more »
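The aliasing behind this trap can be checked directly with the `is` operator. The following is a minimal Python 3 sketch, not the post's own test script; the names `shared`, `independent` and `flat` are made up for illustration.

```python
width, height = 2, 2

# The outer * copies references, so every row is the same list object.
shared = [[88] * width] * height
shared[0][1] = 66
print(shared)                  # [[88, 66], [88, 66]]
print(shared[0] is shared[1])  # True

# A list comprehension builds a fresh inner list for each row.
independent = [[88] * width for _ in range(height)]
independent[0][1] = 66
print(independent)                       # [[88, 66], [88, 88]]
print(independent[0] is independent[1])  # False

# Repeating immutable values in a flat list is harmless:
# assignment rebinds one slot and leaves the others alone.
flat = [123] * (width * height)
flat[1] = 321
print(flat)                    # [123, 321, 123, 123]
```

The inner `[88] * width` is safe for the same reason as the flat case: integers are immutable, so an element can only be replaced, never mutated in place.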

Fixing Error 1723 When Installing or Uninstalling the JRE or JDK

Date: 2014-01-15 | Category: Tool | Read: 14 words, ~1 min

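Uninstalling the JDK and JRE failed with "Error 1723: java dll missing". Most of the guides found online did not help; in the end the fix came from a repair tool on Microsoft's website: Fix problems that programs cannot be installed or uninstalled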

Processing Pasted HTML Forms with JS

Date: 2014-01-12 | Category: FrontEnd | Read: 1 word, ~1 min

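Today I used JavaScript to process a pasted HTML form; the code is as follows: Some days later I wrote another short snippet; the code is as follows: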

[Algorithm] An exch Function with O(0) Extra Space

Date: 2014-01-10 | Category: Algorithm.DataStruct | Read: 66 words, ~1 min

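The usual exch function:

    public static void exch(int[] nums, int i, int j) {
        int tmp = nums[i];
        nums[i] = nums[j];
        nums[j] = tmp;
    }

An exch function that uses no auxiliary space:

    public static void exch(int[] nums, int i, int j) {
        // Guard: XOR-ing an element with itself would zero it out.
        if (i == j) return;
        nums[i] ^= nums[j];
        nums[j] ^= nums[i];
        nums[i] ^= nums[j];
    }

The control flow and the states of nums[i] and nums[j] after each statement are as follows ("/" means unchanged; the right-hand sides are written in terms of the original values):

| nums[i] | nums[j] |
|---|---|
| = nums[i] ^ nums[j] | / |
| / | = nums[j] ^ (nums[i] ^ nums[j]) = nums[i] |
| = (nums[i] ^ nums[j]) ^ nums[i] = nums[j] | / |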
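The three XOR steps are easy to trace in Python as well. The sketch below is illustrative only: `exch_xor` and `exch_xor_unguarded` are hypothetical names, not part of the post. It also shows one caveat worth checking: when both indices are the same, the unguarded version zeroes the element instead of leaving it alone.

```python
def exch_xor_unguarded(nums, i, j):
    """Swap nums[i] and nums[j] with three XORs and no temporary variable."""
    nums[i] ^= nums[j]  # nums[i] now holds a ^ b
    nums[j] ^= nums[i]  # b ^ (a ^ b) = a
    nums[i] ^= nums[j]  # (a ^ b) ^ a = b


def exch_xor(nums, i, j):
    """Same swap, but skip the case where both indices name one slot."""
    if i == j:
        return
    exch_xor_unguarded(nums, i, j)


values = [3, 5, 9]
exch_xor(values, 0, 2)
print(values)  # [9, 5, 3]

same = [7]
exch_xor_unguarded(same, 0, 0)
print(same)    # [0], because 7 ^ 7 == 0
```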

ClusterShell in Practice

Date: 2014-01-07 | Category: Tool Linux | Read: 18 words, ~1 min

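Introduction to ClusterShell

ClusterShell provides a lightweight, unified, and robust Python framework for command execution, well suited to easing the day-to-day administration of Linux clusters. Its benefits are:

- An efficient, parallel, and highly scalable Python command execution engine.
- A unified node group syntax and access to external groups.
- Tools such as clush and nodeset that make cluster setup and daily administration tasks more efficient.

ClusterShell in Practice

Installation

First configure passwordless (key-based) SSH access to the nodes in the cluster. Then install ClusterShell on the master/working node.

    apt-get install clustershell

Configuration and Practice

ClusterShell Tools

ClusterShell includes the following tools:

- clush (help documentation)
- clubak (help documentation)
- nodeset / cluset (help documentation)

The ClusterShell online documentation is at http://clustershell.readthedocs.io/.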
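To give a feel for what a folded node name such as `node[1-3,7]` stands for, here is a small pure-Python sketch. It is not ClusterShell's implementation and does not use its API; `expand_nodeset` is a hypothetical helper that handles only a single bracketed range list without zero padding.

```python
import re


def expand_nodeset(pattern):
    """Expand a folded name like 'node[1-3,7]' into individual node names.

    Illustrative only: one bracket group, plain integers, no padding.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([0-9,\-]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    prefix, ranges, suffix = match.groups()
    names = []
    for part in ranges.split(","):
        low, _, high = part.partition("-")
        high = high or low  # a single index such as "7"
        for index in range(int(low), int(high) + 1):
            names.append(f"{prefix}{index}{suffix}")
    return names


print(expand_nodeset("node[1-3,7]"))  # ['node1', 'node2', 'node3', 'node7']
print(expand_nodeset("master"))       # ['master']
```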

Using CRAN Task Views

Date: 2014-01-07 | Category: DataScience | Read: 53 words, ~1 min

[HBase] Dissecting the put Operation in HBase Shell

Date: 2014-01-03 | Category: BigData | Read: 144 words, ~1 min

After reading the post "HBase Shell datatype conversion", I had the impression that the cell values produced by both of the following operations look like text:

    put 'mytable', '2342', 'cf:c1', '67'
    put 'mytable', '2341', 'cf:c1', 23

To find out what really happens, the only option was to read the HBase Shell code. HBase Shell is written in Ruby, so the first step is to locate the code:

    cd $HBASE_HOME
    find . -name '*.rb' -print

This turns up $HBASE_HOME/lib/ruby/shell/commands/put.rb, whose location in the GitHub repository is https://github.com/apache/hbase/commits/master/hbase-shell/src/main/ruby/shell/commands/put.rb:

    def command(table, row, column, value, timestamp=nil, args = {})
      put table(table), row, column, value, timestamp, args
    end

    def put(table, row, column, value, timestamp = nil, args = {})
      format_simple_command do
        table._put_internal(row, column, value, timestamp, args)
      end
    end

That in turn leads to $HBASE_HOME/lib/ruby/hbase/table.rb, whose location in the GitHub repository is https://github.com/apache/hbase/blob/master/hbase-shell/src/main/ruby/hbase/table.rb:

    def _put_internal(row, column, value, timestamp = nil, args = {})
      p = org.

Read more »

[HBase] Build Script for the Java Client Program

Date: 2014-01-03 | Category: BigData | Read: 15 words, ~1 min

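The build script build.sh used in the previous post, "[HBase] Storing Primitive Data Types", is as follows:

    #!/bin/bash
    HADOOP_HOME=/usr/local/hadoop
    HBASE_HOME=/usr/local/hbase
    # Current directory, HBase configuration, and everything HBase itself needs.
    CLASSPATH=.:$HBASE_HOME/conf:$(hbase classpath)
    javac -cp "$CLASSPATH" HBasePrimitiveDataTypeTest.java
    java -cp "$CLASSPATH" HBasePrimitiveDataTypeTest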

[HBase] Storing Primitive Data Types

Date: 2014-01-02 | Category: BigData | Read: 161 words, ~1 min

I did not yet understand how primitive data types are stored in HBase or how they are displayed in HBase Shell, so I ran a small experiment to satisfy my curiosity. The following code stores and reads primitive data types:

    byte[] cf = Bytes.toBytes(CF_DEFAULT);
    Put put = new Put(Bytes.toBytes("test"));

    byte[] val = Bytes.toBytes("123");
    System.out.println("Bytes for str: " + bytesToHex(val) + ",len=" + val.length);
    put.addColumn(cf, Bytes.toBytes("str"), val);

    short shortVal = 123;
    val = Bytes.toBytes(shortVal);
    System.out.println("Bytes for short:" + bytesToHex(val) + ",len=" + val.length);
    put.addColumn(cf, Bytes.toBytes("short"), val);

    int intVal = 123;
    val = Bytes.toBytes(intVal);
    System.out.println("Bytes for int:" + bytesToHex(val) + ",len=" + val.length);
    put.addColumn(cf, Bytes.toBytes("int"), val);

    long longVal = 123L;
    val = Bytes.toBytes(longVal);
    System.out.println("Bytes for long:" + bytesToHex(val) + ",len=" + val.length);
    put.addColumn(cf, Bytes.toBytes("long"), val);

    float floatVal = 123;
    val = Bytes.

Read more »
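The byte layouts this kind of experiment looks at can be approximated without a cluster. The sketch below is not the post's Java program and does not call HBase. It assumes that `Bytes.toBytes` encodes strings as UTF-8 and numbers as fixed-width big-endian values (2, 4 and 8 bytes for short, int and long, and IEEE 754 single precision for float), and mimics that with Python's `struct` module. The dictionary name `encodings` is made up for illustration.

```python
import struct

# Assumed to mirror Bytes.toBytes(...): UTF-8 text, big-endian numbers.
encodings = {
    "str": "123".encode("utf-8"),
    "short": struct.pack(">h", 123),    # 2 bytes
    "int": struct.pack(">i", 123),      # 4 bytes
    "long": struct.pack(">q", 123),     # 8 bytes
    "float": struct.pack(">f", 123.0),  # 4 bytes, IEEE 754
}

for name, raw in encodings.items():
    print(f"Bytes for {name}: {raw.hex()},len={len(raw)}")
```

Under these assumptions the string "123" is stored as the three ASCII bytes 31 32 33, while the number 123 is stored as the single value 0x7b padded with leading zero bytes. The same logical value therefore looks quite different depending on the type it was written with.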

[Hadoop] Handling InputSplits That Cross Block Boundaries

Date: 2014-01-02 | Category: BigData | Read: 41 words, ~1 min

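A Mapper reads files from HDFS for data processing. Thanks to classes such as InputFormat, InputSplit, RecordReader, and LineReader, Mapper user code only has to process the input key-value pairs it is handed. The notes below look at how MapReduce splits uncompressed text file input. The classes involved are:

InputFormat and its subclasses

The InputFormat class performs the following operations:

- Validates that the job's input files and directories exist.
- Splits the input files into InputSplits; file-based InputFormats split each file into logical splits according to file size.
- Instantiates the RecordReader that parses records from each split.

InputFormat has two main subclasses:

- TextInputFormat: parses text files. It produces one record per line; the key is a LongWritable holding the offset in the file, and the value is a Text holding the content of the line.
- SequenceFileInputFormat: parses Hadoop binary SequenceFiles. A SequenceFile may be uncompressed, record-compressed, or block-compressed. Unlike TextInputFormat, the key-value pairs of SequenceFileInputFormat are generic.

InputSplit and its subclasses

An InputSplit is the logical representation of the subset of data that a single Mapper processes. An InputSplit will most likely not contain a whole number of records: the first and last records in a split may be incomplete, and making sure every record is processed is the RecordReader's job. The subclasses of InputSplit include:

- FileSplit: represents a section of an input file, getLength() bytes long. FileSplits are returned by the call to InputFormat.getSplits(JobContext) and passed to the InputFormat class to instantiate the RecordReader.
- CombineFileSplit: packs multiple files into one split (by default each file is smaller than the split size).

RecordReader and its subclasses

A RecordReader turns the data in an input split into the key-value pairs that the Mapper processes. Records do not necessarily line up with split or block boundaries; the RecordReader determines where records lie and handles record boundaries. RecordReader has the following subclasses:

- LineRecordReader: handles text files.
- SequenceFileRecordReader: handles Sequence files.

LineReader

LineReader reads from the file, parses lines, and obtains key-value pairs.

The processing flow is as follows:

FileInputFormat.getSplits(JobContext) mainly computes the InputSplits. It first checks whether the input file is splittable. If the file stream is uncompressed, or compressed with a splittable codec such as bzip2, the file can be split; otherwise the whole file becomes a single InputSplit. For a splittable file, the split size is max(max(1, mapreduce.input.fileinputformat.split.minsize), min(mapreduce.input.fileinputformat.split.maxsize, blockSize)). If neither the minimum nor the maximum split size is configured, the split size equals the block size, which defaults to 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later). The file is cut according to this split size, and each split records the file path, its start offset, its actual length, and the hosts that hold the input data. As long as the remaining data is within 1.1 times the split size, it is placed in a single InputSplit.

LineRecordReader mainly obtains key-value pairs from an InputSplit:

- The LineRecordReader constructor learns whether a custom line delimiter is in use.
- initialize(InputSplit, TaskAttemptContext) learns the start and end (= start + splitLength) of the InputSplit. If start is not 0, it skips the first line, regardless of whether that line is complete. In other words, the RecordReader for the previous InputSplit processes the first line of this InputSplit, and the RecordReader for this InputSplit processes the first line of the next one.
- When processing the first InputSplit, nextKeyValue() has to skip a UTF BOM if one is present.

LineReader mainly reads data from the file input stream, detects CR/LF/CRLF line endings when no custom line delimiter is set, and obtains key-value pairs.

None of the classes above performs the actual reads of HDFS files and blocks. For local and remote reads, study the code of classes such as org.apache.hadoop.hdfs.client.HdfsDataInputStream and org.apache.hadoop.hdfs.DFSInputStream.

Reference: How does Hadoop process records split across block boundaries?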
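The two rules described above, the split-size formula with its 1.1 slack factor and the "skip the first line unless start is 0" convention, can be simulated on an in-memory byte string. This is a simplified Python sketch under stated assumptions, not Hadoop's code: `compute_split_size`, `compute_splits` and `read_split_lines` are hypothetical names, only LF line endings are handled, and a reader keeps going as long as a line starts at or before the end of its split.

```python
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """max(max(1, minsize), min(maxsize, blockSize)) from the text above."""
    return max(max(1, min_size), min(max_size, block_size))


def compute_splits(file_length, split_size):
    """Return (start, length) pairs covering the whole file."""
    splits = []
    remaining = file_length
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_length - remaining, split_size))
        remaining -= split_size
    if remaining != 0:
        splits.append((file_length - remaining, remaining))
    return splits


def _next_line_start(data, pos):
    """Offset just past the next LF at or after pos (or end of data)."""
    newline = data.find(b"\n", pos)
    return len(data) if newline == -1 else newline + 1


def read_split_lines(data, start, length):
    """Lines owned by one split, following the skip-first-line rule."""
    end = start + length
    pos = start
    if start != 0:
        # The previous split's reader handles this line, complete or not.
        pos = _next_line_start(data, pos)
    lines = []
    # A line that begins at or before `end` belongs to this split,
    # even if it runs past the split boundary.
    while pos <= end and pos < len(data):
        next_pos = _next_line_start(data, pos)
        lines.append(data[pos:next_pos].rstrip(b"\n"))
        pos = next_pos
    return lines


data = b"aaa\nbbb\nccc\nddd\n"
splits = compute_splits(len(data), 6)
per_split = [read_split_lines(data, s, n) for s, n in splits]
print(splits)     # [(0, 6), (6, 6), (12, 4)]
print(per_split)  # [[b'aaa', b'bbb'], [b'ccc', b'ddd'], []]
```

In this toy input the line "bbb" straddles the first boundary and is read by the first split, while "ddd" starts exactly on the boundary of the last split and is still read by the second one. Every line is therefore emitted exactly once even though no split boundary respects line boundaries.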