跳转至

Testing Arrays for Invalid Values

原文链接: https://www.nv5geospatialsoftware.com/Learn/Blogs/Blog-Details/testing-arrays-for-invalid-values

16955 Rate this article:

No rating

Testing Arrays for Invalid Values

Anonym Thursday, June 19, 2014

I was recently writing a function where I needed to validate that an array passed into it contained no negative values. I quickly wrote a test using IDL's WHERE() function, but then got thinking about how efficient that was and if there was a faster and/or less memory intensive way to do this. After giving the matter some thought, I came up with a few different ways to implement this test, and was a little surprised how the times varied.

Here is some test code I put together that tests the different implementations I came up with. I use the TIC/TOC routines to perform the timing of a number of iterations. Each iteration generates a new array using RANDOMN() to generate a new array each time, to avoid any data caching that might taint my results. I put a variable increment inside the IF test, to make sure it wouldn't be optimized away.

pro test_find_negatives

compile_opt idl2

arraySize = 1000000

iterations = 100

fooWhere = 0

fooMin = 0

fooTotal = 0

fooHasValue = 0

print, 'where'

TIC

FOR i = 1, iterations DO BEGIN

seed = i

data = RANDOMN(seed, arraySize)

w = WHERE(data LT 0, count)

IF (count GT 0) THEN BEGIN

++fooWhere

ENDIF

ENDFOR

TOC

print, 'min'

TIC

FOR i = 1, iterations DO BEGIN

seed = i

data = RANDOMN(seed, arraySize)

IF (MIN(data) LT 0) THEN BEGIN

++fooMin

ENDIF

ENDFOR

TOC

print, 'total'

TIC

FOR i = 1, iterations DO BEGIN

seed = i

data = RANDOMN(seed, arraySize)

IF (TOTAL(data LT 0) GT 0) THEN BEGIN

++fooTotal

ENDIF

ENDFOR

TOC

print, 'hasValue'

TIC

FOR i = 1, iterations DO BEGIN

seed = i

data = RANDOMN(seed, arraySize)

IF ((data LT 0).HasValue(1)) THEN BEGIN

++fooHasValue

ENDIF

ENDFOR

TOC

print, fooWhere, fooMin, fooTotal, fooHasValue

end

As you can see, I came up with four implementations. The first is the class WHERE() function, which returns an array of all the bad values. I could have called N_ELEMENTS() on that returned array, but used the slightly more efficient positional parameter which returns the number of elements that are returned. The second implementation uses MIN(), and then just compares that value against 0. This approach may have limited utility for other types of tests, but for finding negative numbers it works. The third implementation uses TOTAL() to sum up the contents of a transient array of Boolean values. This one is a little expensive since it has to create that transient array, as we'll see in the timing results. Observant readers will notice the fourth implementation and wonder what it is. This is a preview of a new feature in IDL 8.3.1, where we have static methods available on variables. In this case the HasValue() method is testing the transient array of Boolean variables to see if any of them are true. This has a slight optimization over the TOTAL() approach, in that the method has short circuit logic to return as soon as it finds the desired value, instead of having to process the entire array like TOTAL() does.

Now on the to the timing results, which can be seen in the table below. I ran the code with many different values of iterations and arraySize, to see how the array size and other system differences like the memory manager would affect the outcome. The columns are the different array sizes, and the rows are for different numbers of iterations of the four approaches. The times are listed in milliseconds, which may be more precision than can be accurately measured by TIC/TOC.

WHERE iterations 100 1000 10000 100000 1000000
100 0 5 44 371 3340
1000 9 47 427 3780 35000
10000 90 466 4330 37800 335000
MIN iterations
100 1 3 32 339 3070
1000 7 36 311 3510 30900
10000 79 344 3320 34800 306000
TOTAL iterations
100 1 4 36 371 3210
1000 8 39 353 3740 33100
10000 83 387 3780 36300 321000
HasValue iterations
100 51 5 36 375 3280
1000 10 41 360 3760 33300
10000 105 410 3760 37000 327000

At first glance the timing looks relatively close, but upon inspection of the Profiler View we can see that these tests are dominated by the RANDOMN() function more than anything else. So I added another FOR loop that only called RANDOMN() and timed that, then subtracted those times and got much more interesting results that compare the real array test logic. You can't use the Profiler View to do this, because in cases like TOTAL(), the creation of that transient array of Boolean values doesn't get measured by the profiler.

WHERE iterations 100 1000 10000 100000 1000000
100 0 2 14 77 246
1000 3 13 126 493 4550
10000 17 136 1340 8070 31700
MIN iterations
100 1 0 2 45 64
1000 1 2 10 224 443
10000 6 14 322 5040 2260
TOTAL iterations
100 1 1 6 77 207
1000 2 5 52 447 2700
10000 10 57 785 6580 17200
HasValue iterations
100 51 2 6 81 282
1000 4 7 59 467 2900
10000 32 80 766 7290 23200

These numbers are much more interesting, and show how much the different approaches vary in performance. We can see how increasing the number of iterations by a factor of 10 doesn't always increase the time by a factor of 10, but for small arrays I think we are more influenced by other system load and timer accuracy more than anything else. As the number of iterations increases, we are also more likely to fragment the process memory and see that influence in the numbers. This is most notable in the 10,000 iteration results for MIN(), that the 100,000 element array took longer than the 1,000,000 element array. I think this is definitely a case where the memory manager was having trouble trying to find contiguous blocks of memory to use for all the iterations.

But the important part of this analysis is the more qualitative result that MIN() is far superior to everything else. This is most likely due to the fact that it does not have to deal with the memory manager to create transient input arrays like TOTAL() or HasValue(), nor a dynamically resized result array like WHERE(). The performance improvement is more pronounced as the array size increases, since it is avoiding allocations of larger and larger arrays. But MIN() has less generality than the other options, so if you were checking that you had only odd numbers, for example, then you wouldn't be able to use MIN().

The TOTAL() function comes in second place, but only by a little over HasValue(). Both of these approaches have to create that transient input array, which takes time to allocate. I think TOTAL() is faster because it is a free function, whereas HasValue is a class method so there's a small amount of overhead in the dereferencing of the class to get the memory address of the method to invoke.

The WHERE() function was far and away the slowest. It avoids creating the transient input array, but since it has no a priori knowledge of how many output elements it will have, it has to use array appending for each element, which slams the memory manager with a new allocation for each found element. Of course, if you wanted to know which elements were invalid, you would have to use WHERE() and pay the cost in time in space, but if you just want to know whether there are any bad elements, this is not your best option.

All in all, the numbers aren't that different for a single call to any of these tests, but they are a good demonstration of the thought process that should go into how you design you code to be time and space efficient.

Why Hadoop is Kind of a Big Deal Free Remote Sensing Data