Suggested "tweaks" for the MARS cipher
======================================

We suggest the following two "tweaks" to the key expansion procedure 
of the MARS cipher. 

==================================================================

1. In the procedure for modification of the multiplication keys, 
  (Lines 20-32 in the pseudocode on Page 22) we would like to handle 
  the most significant bit of the mask M differently than what is 
  currently done. 

  Specifically, in the modified code we always reset the most significant 
  bit of the mask M to zero. Namely, we replace Lines 25-26 of the 
  pseudocode with 

    25. M_l=1 iff w_l belongs to a sequence of 10 consecutive 0's or 1's 
    26.       in w, and also 2 <= l <= 30 and w_{l-1} = w_l = w_{l+1}

  This change has no effect on either the efficiency or the security 
  of the cipher, and it is only meant to make the specification easier 
  to read and understand: 

  In the English description of this procedure, we explain that after 
  initializing the mask M, we reset to 0 those bits in M that correspond 
  to the "end of runs of 1's or 0's in the key word w". However, the way 
  this step was implemented, the most significant bit of M was reset to 0 
  only if the corresponding bit of w was 1. 

  Since the highest bit of w always "correspond to the end of a run", 
  the spec will be more intuitive if we always reset the highest bit 
  of M to 0. 


  (We thank Brian Gladman for suggesting that we make this change.)



==================================================================

2. The second change is meant mainly to simplify implementation of 
  the key expansion procedure in extremely low-memory environments:

  During the first round of the AES, much discussion revolved around the
  suitability of implementing the AES candidates on low-memory smart cards.  
  We recommend that the low-memory smart card not be used as a criterion in 
  the judging of the AES candidates.  We have provided supporting reasons 
  for this position in our official comments (available on NIST's AES page).  
  Some of the reasons include the fact that low-end smart cards are not 
  secure, 16 and 32 bit smart cards with over 1KB of RAM are already 
  available, and new smart cards have built in hardware countermeasures 
  against power attacks.  Therefore, the selection of the AES algorithm 
  should not be influenced by low-end smart cards.  In spite of these 
  facts, many still insist on analyzing the AES candidates in constrained 
  environments and consider this to be an issue.  

  To address this point, we are proposing the following simple change 
  to the MARS key expansion routine, which will enable implementation 
  in severely limited environments. Instead of calculating the 40 words 
  "in one shot", we use essentially the same procedure to calculate 10 
  words at a time, and run it four times. More details follow: 

  In the current procedure, we manipulate a temporary array T[] for 
  some time, and then take the resulting array as the words of the 
  expanded key. Namely, currently the process of computing the key 
  consists of 

    a. Apply linear transformation to fill a temporary array T[]
    b. Stir the array T[] for several rounds (currently, 7 rounds)
    c. Reorder the words in T[] into K[]

  and then

    d. modify the key words used for multiplication

  Instead, we propose to repeat the same process four times, each 
  time computing a quarter of the expanded key array. Namely we propose 
  the following:

    1. Repeat four times 
      a. Do linear transformation (using the same formula as before) 
      b. Stir the array T[] for several rounds (using the same 
         formula as before, we propose 4 rounds)
      c. Reorder the words in T[] into the next ten words in K[]

    2. modify the multiplication keys 

  The advantage is that T[] can be a lot smaller than 40 words, 
  and so this new procedure runs faster and takes up less memory than 
  the current one. Moreover, we propose to keep T[] larger than 10 
  words (specifically, we propose that T be 15-word long), and we 
  believe that this also makes the new procedure cryptographically 
  stronger than the current one. 

  To see why, note that the new process resembles the way most 
  pseudorandom-generators are built. Namely, you keep a state, and 
  each time you use part of the state to output words, and then 
  re-shuffle it so that you can output more words. In a sense, we 
  are building an "almost pseudorandom-generator", using the previous 
  key expansion procedure as our "re-shuffling" operation. 

  Implementation of this idea requires a minor change to the linear 
  transformation formula. (The reasoning for this is provided later 
  in this note.) Specifically, the previous formula was 

    T[i] = ((T[i-7] xor T[i-2]) <<< 3) xor k[i mod n] xor i

  where T[] is the temporary array, k[] is the original key and n is 
  the number of words in that key. In the new routine, we instead 
  initialize T[] with the words of k[] at the very beginning (so 
  there is no need for the "xor k[i mod n]" part) and then do in 
  the j'th iteration (j=0..3) 

    T[i] = T[i] xor ((T[i-7] xor T[i-2]) <<< 3) xor (4i+j)

  The stirring formula remains unchanged, 

    T[i] = (T[i] + S[ low 9 bits of T[i-1] ]) <<< 9

  A pseudocode for this new procedure is provided, followed by a more 
  comprehensive discussion. 


  Key-Expansion(input: k[], n; output: K[])
  -----------------------------------------
   1. n is the number of words in the original key buffer k[], (4 <= n <= 14)
   2. K[] is the expanded key array, consisting of 40 words
   3. T[] is a temporary array, consisting of 15 words, T[0] .. T[14]
   4. B[] is a fixed table of four words

   5. // Initialize B[] 
   6. B[]= {0xa4a8d57b, 0x5b5d193b, 0xc8a8309b, 0x73f9a978}

   7. // Initialize T[] with the key data
   8. T[0..n-1] = k[0..n-1],  T[n] = n,  T[n+1..14] = 0

   9. for j=0 to 3 do            // compute 10 words of K[] in each iteration
  10.    // Do linear transformation
  11.    for i=0 to 14 do 
  12.        T[i]= T[i] xor ((T[i-7 mod 15] xor T[i-2 mod 15])<<<3) xor (4i+j)

  13.    // Do four rounds of stirring
  14.    repeat four times 
  15.       for i = 0 to 14 do
  16.           T[i] = (T[i] + S[low 9 bits of T[i-1 mod 15]]) <<< 9
  17.    end-repeat

  18.    // Store next 10 key words into K[]
  19.    for i = 0 to 9
  20.        K[10*j+i] = T[4*i mod 15]
  21. end-for

  22. // Modify multiplication key-words
  23. for i = 5, 7, ... 35 do
  24.     j = least two bits of K[i]
  25.     w = K[i] with both of the least two bits set to 1

  26.     // Generate a bit-mask M (if K[i] should not be modified then M=0)
  27.     M_l=1 iff w_l belongs to a sequence of 10 consecutive 0's or 1's 
  28.           in w, and also 2 <= l <= 30 and w_{l-1} = w_l = w_{l+1}

  29.     // Select a pattern from the fixed table and rotate it
  30.     r = least five bits of K[i-1]                 // rotation amount
  31.     p = B[j] <<< r

  32.     // Modify K[i] with p under the control of the mask M
  33.     K[i] = w xor (p and M)
  34. end-for  


  We stress that the main consideration in the design of the key expansion 
  was and remains cryptographic strength. Hence we use essentially the same 
  design for the modified procedure. Indeed we believe that the change also 
  makes the procedure cryptographically stronger than the current version. 
  Incidentally, this change also improves the performance of the key setup 
  procedure in high-memory environments. 

  The advantages of this new procedure over the current one (in order 
  of importance) are: 

  a. It is more "one way" than the current one. With the current MARS 
   key setup, knowing only 12 "properly selected" words of the expanded 
   key enables you to reconstruct the original key and subsequently derive 
   all the other words of the expanded key. With the new procedure, even 
   knowing all the words of the expanded key, it is not clear how to 
   efficiently reconstruct the original key. Moreover, it seems that 
   knowing only part of the expanded key words still does not imply an 
   efficient way to derive the other words. 

  b. It is much easier to implement it in a limited-memory environment.
   specifically, all you need is the 15-word "temporary" array 
   T[] (which is 60 bytes of memory), and a few "scratch" words. 

  c. It is faster than current procedure. We estimate that this is about 
   15% faster. 


  The drawback of the new procedure with respect to the current one 
  is that the user-supplied key cannot be longer than 14 words (448 bits). 
  This does not seem to be a major concern, though, since longer keys 
  can always be hashed (say, using SHA) before they are used in the 
  cipher. (Doing this will also eliminate the "equivalent keys" that 
  can be shown when using long keys with the current procedure.) 


  The reasoning for the changes that we made to the linear transformation 
  formula are as follows: 

  1. The reason that we initialize T[] with k[] instead of incorporating
    the words of k[] to the formula, is to avoid having to store the 
    original key, in addition to the temporary array T[]. 

  2. The reason that we replaced the "xor i" with "xor (4i+j)" is to 
    avoid the remote possibility of having a "fixed point" in this 
    procedure: If we view the "linear transformation + stirring" as 
    a "random permutation", then there is some probability that it 
    has a fixed point. Moreover, there is a small probability that 
    this fixed point can actually result from some key (most likely, 
    a 14-word key. This has probability of about 2^{-32}). If this 
    is true, then this fixed point key implies an extended key where 
    K[0..9] = K[10..19] = K[20..29] = K[30..39]

    This does not seem to be a big deal. In particular, we don't see why 
    such key would be "weak" for encryption, it is only remotely likely 
    to exist with very long keys (14 words) and the probability of such 
    "weak keys" is then about 2^{-450}. 

    Still, whenever a repeated function is used, it is a good idea to 
    make some tweaks in it from iteration to iteration, to overcome the 
    possibility of fixed points. A simple such tweak would be to use 
    different xor values in different iterations, thus the "xor (4i+j)"
    term. 

  3. Other change that we made are: 

    - Eliminated the initialization of T[] with seven constants from 
      the S-box. This is just not needed. 

    - Changed the reordering formula, to interact better with 
      the size of T[].

    - In the modification of the multiplication key words, the new 
      procedure uses K[i-1] when modifying K[i] (see Line 32 in the 
      pseudocode above), while the old one uses K[i+3]. The reason 
      for this is to make sure that these two words belong to the 
      same "batch" of 10 words.