String Processing Performance in IDL

Original link: https://www.nv5geospatialsoftware.com/Learn/Blogs/Blog-Details/string-processing-performance-in-idl

Anonymous author, Thursday, February 19, 2015

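IDL performs array-based operations very efficiently, but most processing tasks still require some amount of string parsing and manipulation. I picked three common string processing tasks and analyzed them in more depth to find the best strategy for each.

The first example is finding all strings that start with a given substring. IDL 8.4 introduced many new built-in methods for string variables, one of which is "StartsWith". Below is the code I used to compare four different ways of finding which strings in a string array begin with the word "end".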

pro StrTest_StartsWith
  compile_opt idl2,logical_predicate

  ; Read amoeba.pro from the IDL distribution into a string array
  f = file_which('amoeba.pro')
  str = strarr(file_lines(f))
  openr, lun, f, /get_lun
  readf, lun, str
  free_lun, lun

  ; Reference answer used to check every method
  first = str.StartsWith('end')

  n = 50000
  times = dblarr(4)
  methods = ['StartsWith','STRCMP','STREGEX','STRPOS']

  for method=0,3 do begin
    t0 = tic()
    case method of
      0: for i=0, n-1 do x = str.StartsWith('end')
      1: for i=0, n-1 do x = strcmp(str,'end',3)
      2: for i=0, n-1 do x = stregex(str,'^end',/boolean)
      3: for i=0, n-1 do x = strpos(str,'end') eq 0
    endcase
    times[method] = toc(t0)
    print, array_equal(x,first) ? 'Same answer' : 'Different answer'
  endfor

  ; Report the methods from fastest to slowest
  print, string(methods[sort(times)] + ':', format='(a-15)') + $
    string(times[sort(times)], format='(g0)'), $
    format='(a)'
end

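The first method uses the new built-in "StartsWith" method. The second uses STRCMP with a third argument that specifies the number of characters to compare. The third uses a regular expression with STREGEX, and the last uses STRPOS and compares the result with 0, which means the pattern was found starting at position 0. When I ran this code in IDL 8.4, I got the following results (times in seconds):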

Same answer
Same answer
Same answer
Same answer
STRCMP:        0.128
StartsWith:    0.147
STRPOS:        0.91
STREGEX:       1.497

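All four methods return a byte array of zeros and ones indicating which strings match. STRCMP with three arguments turned out to be the fastest, with the new "StartsWith" method close behind. STREGEX was more than ten times slower than either of them in this test, so it should be avoided unless a more complex expression really requires it.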
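To see that the four tests agree before timing them, they can be tried on a small array at the IDL prompt. The following is only a sketch: the sample strings and the variable name `lines` are made up for illustration and are not part of the benchmark, and the StartsWith calls assume IDL 8.4 or later. All four tests are case-sensitive by default, so 'ENDCASE' does not match unless FOLD_CASE is set.

```idl
; Hypothetical sample data, not taken from amoeba.pro
lines = ['endfor', 'end', '  endif', 'blend', 'ENDCASE']

; Each expression yields a byte array with 1 where the string begins with 'end'
print, lines.StartsWith('end')            ; values: 1 1 0 0 0
print, strcmp(lines, 'end', 3)            ; values: 1 1 0 0 0
print, stregex(lines, '^end', /boolean)   ; values: 1 1 0 0 0
print, strpos(lines, 'end') eq 0          ; values: 1 1 0 0 0

; Case-insensitive variants using the FOLD_CASE keyword
print, lines.StartsWith('end', /fold_case)   ; values: 1 1 0 0 1
print, strcmp(lines, 'end', 3, /fold_case)   ; values: 1 1 0 0 1
```

Note that '  endif' does not match because of its leading blanks. If indentation should be ignored, one option is to strip leading whitespace first, for example with strtrim(lines, 1).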

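In the second example, the goal is to replace the first equal sign (=) with a colon (:) in every line that contains at least one equal sign. Any additional equal signs on the line should be left unchanged. This is useful for converting the format of name/value pairs stored in a text file. I used four different methods to achieve the same result: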

pro StrTest_Substring
  compile_opt idl2,logical_predicate

  ; Read amoeba.pro from the IDL distribution into a string array
  f = file_which('amoeba.pro')
  str = strarr(file_lines(f))
  openr, lun, f, /get_lun
  readf, lun, str
  free_lun, lun

  n = 2000

  ; Reference answer used to check every method
  index = str.IndexOf('=')
  w = where(index ne -1)
  index = index[w]
  first = str
  first[w] = str[w].Substring(0,index-1)+':'+str[w].Substring(index+1)

  methods = ['Substring','STRPUT','Split/Join','BYTARR']
  times = dblarr(4)

  for method=0,3 do begin
    t0 = tic()
    case method of
      ; IndexOf/Substring: rebuild each matching line around the first '='
      0: for i=0, n-1 do begin
        index = str.IndexOf('=')
        w = where(index ne -1)
        index = index[w]
        y = str[w]
        x = str
        x[w] = y.SubString(0,index-1)+':'+y.SubString(index+1)
      endfor
      ; STRPUT: overwrite the character at the position found by STRPOS
      1: for i=0, n-1 do begin
        x = str
        pos = strpos(str,'=')
        foreach xx, x, j do begin
          if pos[j] ne -1 then begin
            strput, xx, ':', pos[j]
            x[j] = xx
          endif
        endforeach
      endfor
      ; Split/Join: split on '=', then rejoin the first piece with ':'
      2: for i=0, n-1 do begin
        x = str
        foreach xx, x, j do begin
          parts = xx.Split('=')
          if parts.length gt 1 then x[j] = ([parts[0],parts[1:*].join('=')]).join(':')
        endforeach
      endfor
      ; BYTARR: work on character codes (61b is '=', 58b is ':')
      3: for i=0, n-1 do begin
        b = byte(str)
        b[maxInd[where(max(b eq 61b, dimension=1, maxInd))]] = 58b
        x = string(b)
      endfor
    endcase
    times[method] = toc(t0)
    print, array_equal(x,first) ? 'Same answer' : 'Different answer'
  endfor

  ; Report the methods from fastest to slowest
  print, string(methods[sort(times)] + ':', format='(a-15)') + $
    string(times[sort(times)], format='(g0)'), $
    format='(a)'
end

Same answer
Same answer
Same answer
Same answer
BYTARR:        0.148
STRPUT:        0.187
Substring:     0.188
Split/Join:    1.456
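The BYTARR variant packs several steps into one line, so it may help to see them separately. The following is a sketch on made-up data (the variable names `s`, `isEq`, `hasEq`, and `firstIdx` are hypothetical), entered at the IDL prompt. It assumes, as the benchmark code does, that MAX reports the subscript of the first maximum it finds along each string, so that only the first '=' is changed. Unlike the benchmark, it also guards against the case where no string contains '='.

```idl
; Hypothetical sample data
s = ['a=b=c', 'no sign', 'x = 1']

; Character codes as a 2-D byte array with dimensions
; [length of longest string, number of strings], padded with zeros
b = byte(s)

; 1 wherever a character is '=' (character code 61)
isEq = b eq 61b

; Per string: hasEq is 1 if the string contains '=', and firstIdx holds
; the 1-D subscript into b at which MAX found its value
hasEq = max(isEq, firstIdx, dimension=1)

; Overwrite only those positions with ':' (character code 58)
w = where(hasEq, count)
if count gt 0 then b[firstIdx[w]] = 58b

; Convert the byte array back to a string array
print, string(b)
```

On this input, only the first '=' of 'a=b=c' should change, and 'no sign' should come back untouched.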

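The cryptic byte array method turned out to be the fastest, even though it does a lot of copying and contains no obvious string-handling functions. This is because IDL operates on arrays very efficiently; for example, its internal array indexing gives good, predictable memory access patterns. However, I don't really recommend this approach here, because the code is very hard to understand and hard to modify when needed. I would also avoid the Split/Join approach, which was nearly eight times slower than the others.

Using "IndexOf" and "Substring" works well here. Note in particular that the "Substring" method is similar to STRMID, but it accepts arrays of positions that match the size of the string array, so each string can be cut at a different place. This is a major improvement over the old STRMID. For example, to extract each string up to and including the first "e", you can use: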

IDL> a=['!Hello!', 'test','this one!']
IDL> a.Substring(0,a.IndexOf('e'))
!He
te
this one

Or, to extract everything from the first colon onward:

IDL> x = ((orderedhash(!cpu))._overloadPrint())
IDL> x
HW_VECTOR: 0
VECTOR_ENABLE: 0
HW_NCPU: 6
TPOOL_NTHREADS: 6
TPOOL_MIN_ELTS: 100000
TPOOL_MAX_ELTS: 0
IDL> x.Substring(x.IndexOf(':'))
: 0
: 0
: 6
: 6
: 100000
: 0
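The same two methods can also split each line into a name and a value. The following is a minimal sketch, assuming IDL 8.4 or later; the procedure name `SplitAtColon_Sketch` and the sample strings are hypothetical. It starts the second substring one character past the colon so that the colon itself is dropped, and it skips lines where IndexOf returns -1.

```idl
pro SplitAtColon_Sketch
  compile_opt idl2

  ; Hypothetical name/value lines, similar in shape to the output above
  s = ['HW_NCPU: 6', 'TPOOL_MIN_ELTS: 100000', 'no colon here']

  pos = s.IndexOf(':')            ; -1 where a string has no colon
  w = where(pos ne -1, count)

  if count gt 0 then begin
    names  = s[w].Substring(0, pos[w]-1)            ; text before the colon
    values = strtrim(s[w].Substring(pos[w]+1), 2)   ; text after it, blanks trimmed
    print, names + ' -> ' + values, format='(a)'
    ; prints:
    ; HW_NCPU -> 6
    ; TPOOL_MIN_ELTS -> 100000
  endif
end
```

Passing pos[w] as an array is what lets each string be cut at its own colon position in a single call.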

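The last example replaces every occurrence of = with =>. I used two different methods for this: one uses the new "Replace" method on string variables, and the other uses STRSPLIT/STRJOIN. The results show that the new Replace method is about five times faster.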

pro StrTest_Replace
  compile_opt idl2,logical_predicate

  ; Read amoeba.pro from the IDL distribution into a string array
  f = file_which('amoeba.pro')
  str = strarr(file_lines(f))
  openr, lun, f, /get_lun
  readf, lun, str
  free_lun, lun

  n = 5000

  ; Reference answer used to check both methods
  first = str.Replace('=', '=>')

  methods = ['Replace','STRSPLIT']
  times = dblarr(2)

  for method=0,1 do begin
    t0 = tic()
    case method of
      ; Replace: one vectorized call on the whole string array
      0: for i=0, n-1 do begin
        x = str.Replace('=','=>')
      endfor
      ; STRSPLIT/STRJOIN: split each line on '=' and rejoin with '=>'
      1: for i=0, n-1 do begin
        x = str
        foreach xx, x, j do x[j] = strjoin(strsplit(xx,'=',/extract),'=>')
      endfor
    endcase
    times[method] = toc(t0)
    print, array_equal(x,first) ? 'Same answer' : 'Different answer'
  endfor

  ; Report the methods from fastest to slowest
  print, string(methods[sort(times)] + ':', format='(a-15)') + $
    string(times[sort(times)], format='(g0)'), $
    format='(a)'
end

Same answer
Same answer
Replace:       0.545
STRSPLIT:      2.778
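One caveat that the benchmark does not exercise: by default STRSPLIT discards empty fields, so the STRSPLIT/STRJOIN approach is not a drop-in replacement for Replace when delimiters are adjacent or sit at the start or end of a string. The two methods agreed on amoeba.pro, presumably because that file has no such lines. The sketch below uses a made-up input string and shows how the PRESERVE_NULL keyword of STRSPLIT restores the same result as Replace.

```idl
s = 'a==b'    ; hypothetical input with two adjacent '=' characters

print, s.Replace('=', '=>')                                       ; a=>=>b
print, strjoin(strsplit(s, '=', /extract), '=>')                  ; a=>b
print, strjoin(strsplit(s, '=', /extract, /preserve_null), '=>')  ; a=>=>b
```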
