Testing Arrays for Invalid Values

原文链接: https://www.nv5geospatialsoftware.com/Learn/Blogs/Blog-Details/testing-arrays-for-invalid-values

16955 Rate this article:

No rating

Testing Arrays for Invalid Values

Anonym Thursday, June 19, 2014

I was recently writing a function where I needed to validate that an array passed into it contained no negative values. I quickly wrote a test using IDL's WHERE() function, but then got thinking about how efficient that was and if there was a faster and/or less memory intensive way to do this. After giving the matter some thought, I came up with a few different ways to implement this test, and was a little surprised how the times varied.

Here is some test code I put together that tests the different implementations I came up with. I use the TIC/TOC routines to perform the timing of a number of iterations. Each iteration generates a new array using RANDOMN() to generate a new array each time, to avoid any data caching that might taint my results. I put a variable increment inside the IF test, to make sure it wouldn't be optimized away.

pro test_find_negatives

compile_opt idl2

arraySize = 1000000

iterations = 100

fooWhere = 0

fooMin = 0

fooTotal = 0

fooHasValue = 0

print, 'where'

TIC

FOR i = 1, iterations DO BEGIN

seed = i

data = RANDOMN(seed, arraySize)

w = WHERE(data LT 0, count)

IF (count GT 0) THEN BEGIN

++fooWhere

ENDIF

ENDFOR

TOC

print, 'min'

TIC

FOR i = 1, iterations DO BEGIN

seed = i

data = RANDOMN(seed, arraySize)

IF (MIN(data) LT 0) THEN BEGIN

++fooMin

ENDIF

ENDFOR

TOC

print, 'total'

TIC

FOR i = 1, iterations DO BEGIN

seed = i

data = RANDOMN(seed, arraySize)

IF (TOTAL(data LT 0) GT 0) THEN BEGIN

++fooTotal

ENDIF

ENDFOR

TOC

print, 'hasValue'

TIC

FOR i = 1, iterations DO BEGIN

seed = i

data = RANDOMN(seed, arraySize)

IF ((data LT 0).HasValue(1)) THEN BEGIN

++fooHasValue

ENDIF

ENDFOR

TOC

print, fooWhere, fooMin, fooTotal, fooHasValue

end

As you can see, I came up with four implementations. The first is the class WHERE() function, which returns an array of all the bad values. I could have called N_ELEMENTS() on that returned array, but used the slightly more efficient positional parameter which returns the number of elements that are returned. The second implementation uses MIN(), and then just compares that value against 0. This approach may have limited utility for other types of tests, but for finding negative numbers it works. The third implementation uses TOTAL() to sum up the contents of a transient array of Boolean values. This one is a little expensive since it has to create that transient array, as we'll see in the timing results. Observant readers will notice the fourth implementation and wonder what it is. This is a preview of a new feature in IDL 8.3.1, where we have static methods available on variables. In this case the HasValue() method is testing the transient array of Boolean variables to see if any of them are true. This has a slight optimization over the TOTAL() approach, in that the method has short circuit logic to return as soon as it finds the desired value, instead of having to process the entire array like TOTAL() does.

Now on the to the timing results, which can be seen in the table below. I ran the code with many different values of iterations and arraySize, to see how the array size and other system differences like the memory manager would affect the outcome. The columns are the different array sizes, and the rows are for different numbers of iterations of the four approaches. The times are listed in milliseconds, which may be more precision than can be accurately measured by TIC/TOC.


WHERE iterations	100	1000	10000	100000	1000000
100	0	5	44	371	3340
1000	9	47	427	3780	35000
10000	90	466	4330	37800	335000

MIN iterations
100	1	3	32	339	3070
1000	7	36	311	3510	30900
10000	79	344	3320	34800	306000

TOTAL iterations
100	1	4	36	371	3210
1000	8	39	353	3740	33100
10000	83	387	3780	36300	321000

HasValue iterations
100	51	5	36	375	3280
1000	10	41	360	3760	33300
10000	105	410	3760	37000	327000

At first glance the timing looks relatively close, but upon inspection of the Profiler View we can see that these tests are dominated by the RANDOMN() function more than anything else. So I added another FOR loop that only called RANDOMN() and timed that, then subtracted those times and got much more interesting results that compare the real array test logic. You can't use the Profiler View to do this, because in cases like TOTAL(), the creation of that transient array of Boolean values doesn't get measured by the profiler.


WHERE iterations	100	1000	10000	100000	1000000
100	0	2	14	77	246
1000	3	13	126	493	4550
10000	17	136	1340	8070	31700

MIN iterations
100	1	0	2	45	64
1000	1	2	10	224	443
10000	6	14	322	5040	2260

TOTAL iterations
100	1	1	6	77	207
1000	2	5	52	447	2700
10000	10	57	785	6580	17200

HasValue iterations
100	51	2	6	81	282
1000	4	7	59	467	2900
10000	32	80	766	7290	23200

These numbers are much more interesting, and show how much the different approaches vary in performance. We can see how increasing the number of iterations by a factor of 10 doesn't always increase the time by a factor of 10, but for small arrays I think we are more influenced by other system load and timer accuracy more than anything else. As the number of iterations increases, we are also more likely to fragment the process memory and see that influence in the numbers. This is most notable in the 10,000 iteration results for MIN(), that the 100,000 element array took longer than the 1,000,000 element array. I think this is definitely a case where the memory manager was having trouble trying to find contiguous blocks of memory to use for all the iterations.

But the important part of this analysis is the more qualitative result that MIN() is far superior to everything else. This is most likely due to the fact that it does not have to deal with the memory manager to create transient input arrays like TOTAL() or HasValue(), nor a dynamically resized result array like WHERE(). The performance improvement is more pronounced as the array size increases, since it is avoiding allocations of larger and larger arrays. But MIN() has less generality than the other options, so if you were checking that you had only odd numbers, for example, then you wouldn't be able to use MIN().

The TOTAL() function comes in second place, but only by a little over HasValue(). Both of these approaches have to create that transient input array, which takes time to allocate. I think TOTAL() is faster because it is a free function, whereas HasValue is a class method so there's a small amount of overhead in the dereferencing of the class to get the memory address of the method to invoke.

The WHERE() function was far and away the slowest. It avoids creating the transient input array, but since it has no a priori knowledge of how many output elements it will have, it has to use array appending for each element, which slams the memory manager with a new allocation for each found element. Of course, if you wanted to know which elements were invalid, you would have to use WHERE() and pay the cost in time in space, but if you just want to know whether there are any bad elements, this is not your best option.

All in all, the numbers aren't that different for a single call to any of these tests, but they are a good demonstration of the thought process that should go into how you design you code to be time and space efficient.

Why Hadoop is Kind of a Big Deal Free Remote Sensing Data