Techno Blender
Digitally Yours.

Interview Question: Select a Random Line from a File (in Rust) | by Carl M. Kadie | Sep, 2022

0 224


A cool and useful algorithm explained and extended

By Carl M. Kadie and Christopher Meek

A crab selecting a random line of text from a file — Source: https://openai.com/dall-e-2/

If I (Carl) interviewed you for a job at Microsoft, I probably asked you this question:

How would you select a random line from a text file of unknown length?

I asked this question in Microsoft Research, in the original anti-spam team, and in an Office Machine Learning/Data Science group. We love the question because it touches on probability, algorithms, and systems issues.

Even having retired from Microsoft, we still think about the random-line question. For example, we recently learned about bytecount. It is a Rust crate that uses SIMD CPU instructions to speed up, among other things, counting the lines in a file. As we’ll describe later, our learning of the crate led — indirectly — to improving the state-of-the-art solution to the random-line question.

We’ve organized this article as a series of questions and hints. If you wish, you can try to answer the questions yourself, before seeing our answers.

In this article, we’ll give answers in Rust. For the answers in Python, see the Python edition of this article. You might also enjoy reading both editions as a way to compare the two languages.

We gave interviewees this advice for whiteboard interviews:

  • Feel free to ask clarifying questions. (In the context of this article, you can read a little further to see if we offer clarifications.)
  • Start with a correct algorithm first, even if it might be suboptimal. If we want a better algorithm, we will prompt you for it with follow-up questions.
  • Don’t worry about exact syntax. We don’t, for example, care if you remember the exact method for generating a random number.
  • If you get stuck, we can provide some hints. (Again, in the context of this article, read a little farther to look for hints.)

Let’s start with the questions:

Q: How would you select a random line from a text file of unknown length?

A: We must first clarify the meaning of a “random line”. We mean that every line in the text file has an equal chance of being returned. In other words, we want the algorithm to select among the lines via a uniform distribution.

This requirement means you cannot use this algorithm, that we’ll call AlgoRandomSeek:

  • Ask the filesystem for the length of the file.
  • Randomly select a file position and seek to that position.
  • Return a line near the position.

Sub Q: What’s wrong with AlgoRandomSeek?

Sub A: Although the algorithm can return any line in the file, it returns longer lines with higher probability than shorter lines and, thus, does not select lines via a uniform distribution.

Aside: We like being asked for this clarification. It distinguishes the everyday meaning of “random” from the technical meaning used in statistics, data science, decision theory, and machine learning.

Hint: If you were coding in Rust and needed to solve the random line problem quickly (but correctly) what would you do?

We asked GitHub Copilot for a solution in Python. Translating its answer to Rust gave us this code. Call it AlgoReadAll:

Aside: For the full Rust project, including the Cargo.toml and use statements, see project random-line on GitHub. All the examples use the fetch-data crate to retrieve the sample file(s) as needed.

This algorithm is obviously correct and fast. As a bonus, if it encounters a bad file, it uses ? to return an error result. For another bonus, you could mention that the code returns None on empty files.

On the downside, this code reads the whole file into memory and, so, won’t work on large files.

Let’s follow up by asking for a solution that works on large files:

Q: Using little memory, how could you select a random line from a text file of unknown length?

A: We must first clarify “little memory”. For this question, assume that you can store any line from the file, or several lines, but not all the lines.

AlgoTwoPass solves the problem by 1. counting the lines in the file, 2. randomly selecting a line index, 3. returning the line with that index. (Throughout this article, “index0” is an index that starts its count at 0. If an index starts counting from 1, we’ll call it “index1”.)

On my machine, this outputs:

1,683 of 146,933: Some(“to gather material for illustrations of the poems of Robert”)

AlgoTwoPassis correct. We’d also give this code a bonus for:

  • using a random number generator with an explicit seed — machine learning and data science often require reproducibility, even from randomness.
  • mentioning that it returns None when applied to an empty file.
  • using let item = item_result? to check for possible file errors.
  • [very minor] formatting numbers with thousands separators.

Not required, but interesting would be this more functional implementation:

Aside 1: Background information: This code uses Rust iterators. When next() is applied to a Rust iterator, the next value in a sequence is returned inside a Some or, if the sequence is complete, None is returned. The expression BufReader::new(File::open(&path)?).lines() returns an iterator over “line results”. Each line result is either a string (the next line) or an error (found when trying to read that line). Although Rust defines a count() method on iterators, we should not use it here. Why? Because the count() method doesn’t check for error values. Likewise, Rust defines an nth() method that advances an iterator n+1 values. It returns the (n+1)ᵗʰ value but ignores any intermediate error values. To check for error values from .line(), the answer above defines and usestry_count and try_nth functions. [Thanks to the Reddit Rust community for help on this issue.]

Aside 2: We would not penalize someone for making an off-by-one error with rng.gen_range or nth/try_nth. We might, however, ask how the code could be tested to check for such errors.

We next follow up by asking for a faster algorithm:

Q: In one pass, can you select a random line from a text file of unknown length?

A: After a hint, we’ll develop an algorithm via a series of sub-questions.

Hint: Think about this recursively.

Sub Q: Put yourself in the position of the program. If we promise you the file contains exactly one line, what should you do?

Sub A: If the file contains exactly one line, just output it.

Sub Q: Whoops, we lied to you. The file may contain exactly one line or two, but no other number of lines. What should you do?

Sub A: With probability 0.5, replace the result we were going to output, with the second line.

Sub Q: Sorry, we lied again! The file may contain exactly one, two, or three lines. What should you do? What is the probability of selection for each line? How can this be generalized?

Sub A: With probability ⅓, replace the result we were going to output, with the third line.

The probability of selection for each line is:

  • 1st line: 1 × ½ × ⅔= ⅓
  • 2nd line: ½ × ⅔= ⅓
  • 3rd line: ⅓= ⅓

So, the probability distribution is uniform. We can generalize this to AlgoOnePass:

Aside: I (Carl) recall one interviewee who started to prove this algorithm’s correctness via induction. I could tell they could easily do it, so I stopped them and moved on. They earned a bonus.

Bonus Q: We never asked this in an interview, but can you write AlgoOnePass recursively?

Bonus A: Here is AlgoOnePassRecurse:

Rust crashes if you recurse more than a few thousand times. This code uses the crate tailcall to avoid that crash. The recursive and non-recursive version of the algorithm run at about the same speed. The code earns a bonus for being generic across result iterators. For another bonus, it accepts the rng, the random number generator, as &mut, which allows rng to continue to be used.

Aside: Thinking practically, is one-pass important compared to two-pass? Often not. If you can wait 1 second for a random line, you can probably wait 1 to 2 seconds. On the other hand, some data — “streaming data” — can’t be accessed twice, so AlgoOnePass has practical value.

One straightforward generalization to AlgoOnePass, that we won’t cover, is selecting multiple random lines, not just one. Wikipedia describes (more-or-less) AlgoOnePass and this generalization in its article on Reservoir sampling: Simple Algorithm R.

In interviews, this is usually where we stopped with “random line from a file”. However, we recently learned about the Rust crate bytecount. The bytecount crate uses SIMD CPU instructions to speed up, among other things, counting the lines in a file. That got us playing with the problem again. This led to a new approach and an improved algorithm. The new algorithm doesn’t use bytecount. It does, however, in the specific case of returning one random line, outperform the more general Optimal: Algorithm L described in Wikipedia.

Aside: We call this the “new algorithm”, but it may have been discovered earlier. In any case, we hope you’ll find the algorithm and its development interesting.

As before, we’ll present the new algorithm via a series of questions and hints. We have not, however, ever put these questions to an interviewee.

Q: In one pass, can you select a random line from a text file of unknown length such that the random number generator is called much less than the number of lines, n?

A: We must clarify “much less”. We mean that the number of calls to the random number generator is less than O(n) [see Big O notation — Wikipedia]. In other words, reducing the calls by one or by half is not enough. The random number of calls required should grow proportional to, for example, log(n) or sqrt(n).

Hint: First modify AlgoOnePass to print the index of every item that gets assigned to result. Call these the “keep indices”. Run the code on, say, 1,000,000 items.

Outputs:

1 2 4 14 38 112 210 914 4,512 5,659 6,242 13,388 917,008

This says that when we run with random seed 0, the first item is kept as a possible final random item. Then the 2nd item and then 4th. Then no item is kept until the 14th item, then the 38th. If the iterator contains between 917,008 and 1,000,000 items, then the 917,008th item will be the last item kept and so will be the final random item.

The keep indices seem to grow roughly exponentially. If that conjecture is true, the number of keep indices is O(log n), where n is the number of items in the iterator.

Sub Q: Can we randomly generate a sequence of keep indices directly?

Sub A: Yes! We’ve detailed our solution in a short, on-line, technical paper [Meek & Kadie, Streaming Random Selection Using the Attenuated Geometric Distribution, 2022]. We call the distribution of keep indices the “Attenuated Geometric Distribution”. We show that if index1is a keep index number then we can generate the next one with:

let r: f64 = rng.gen();
index1 += ((r * (index1 as f64) / (1.0 — r)).ceil() as usize).max(1);

where rng.gen() generates a uniform floating-point value between 0 .0 (inclusive) and 1.0 (exclusive). Bonus: (…).max(1) handles the very, very unlikely case of generating 0.0 randomly. Also, recall our convention that “index1” is an index that starts counting from 1 instead of 0.

We can now use this insight to create AlgoOnePassSkip:

We can get some confidence in the code by using it to pick a number between 0 (inclusive) and 100 (exclusive), 100,000 times. (We use the expression (0..100).map(Ok::<_, std::io::Error>) to create an iterator of result values, Ok(0), Ok(1), … Ok(99). The code expects result values.)

If we plot with Python, the plots should look uniform, which they do:

Histogram and Line Plot suggesting a Uniform Distribution

This algorithm differs from Optimal: Algorithm L (recommended by Wikipedia) in two important ways.

  • AlgoOnePassSkip only works for selecting one random item, while Algorithm L can select any specified number of random items.
  • When only one random item is desired, AlgoOnePassSkip needs one random number per keep index, while Algorithm L needs two.

Thus, for the special case in which we want only one random item, AlgoOnePassSkip uses half as many random draws as Algorithm L.

We’ve seen four ways to select an item randomly from a sequence of unknown length. In the context of a text file, the first solution required that the file fit into memory. The next solution used less memory but required two passes through the file. We then used a probability calculation to reduce this to one pass. That one-pass solution required one random number per line. The last solution required many fewer random numbers than lines. It also uses half as many random numbers as the “optimal” (more general) algorithm.

None of our code uses system-level methods such as bytecount to speed up the linear pass through a file. Adding system-level optimizations would be an interesting extension.


A cool and useful algorithm explained and extended

By Carl M. Kadie and Christopher Meek

DALL·E 2022–08–19 06.57.13 — A crab selecting a random line of text from a file, award-winning photograph.
A crab selecting a random line of text from a file — Source: https://openai.com/dall-e-2/

If I (Carl) interviewed you for a job at Microsoft, I probably asked you this question:

How would you select a random line from a text file of unknown length?

I asked this question in Microsoft Research, in the original anti-spam team, and in an Office Machine Learning/Data Science group. We love the question because it touches on probability, algorithms, and systems issues.

Even having retired from Microsoft, we still think about the random-line question. For example, we recently learned about bytecount. It is a Rust crate that uses SIMD CPU instructions to speed up, among other things, counting the lines in a file. As we’ll describe later, our learning of the crate led — indirectly — to improving the state-of-the-art solution to the random-line question.

We’ve organized this article as a series of questions and hints. If you wish, you can try to answer the questions yourself, before seeing our answers.

In this article, we’ll give answers in Rust. For the answers in Python, see the Python edition of this article. You might also enjoy reading both editions as a way to compare the two languages.

We gave interviewees this advice for whiteboard interviews:

  • Feel free to ask clarifying questions. (In the context of this article, you can read a little further to see if we offer clarifications.)
  • Start with a correct algorithm first, even if it might be suboptimal. If we want a better algorithm, we will prompt you for it with follow-up questions.
  • Don’t worry about exact syntax. We don’t, for example, care if you remember the exact method for generating a random number.
  • If you get stuck, we can provide some hints. (Again, in the context of this article, read a little farther to look for hints.)

Let’s start with the questions:

Q: How would you select a random line from a text file of unknown length?

A: We must first clarify the meaning of a “random line”. We mean that every line in the text file has an equal chance of being returned. In other words, we want the algorithm to select among the lines via a uniform distribution.

This requirement means you cannot use this algorithm, that we’ll call AlgoRandomSeek:

  • Ask the filesystem for the length of the file.
  • Randomly select a file position and seek to that position.
  • Return a line near the position.

Sub Q: What’s wrong with AlgoRandomSeek?

Sub A: Although the algorithm can return any line in the file, it returns longer lines with higher probability than shorter lines and, thus, does not select lines via a uniform distribution.

Aside: We like being asked for this clarification. It distinguishes the everyday meaning of “random” from the technical meaning used in statistics, data science, decision theory, and machine learning.

Hint: If you were coding in Rust and needed to solve the random line problem quickly (but correctly) what would you do?

We asked GitHub Copilot for a solution in Python. Translating its answer to Rust gave us this code. Call it AlgoReadAll:

Aside: For the full Rust project, including the Cargo.toml and use statements, see project random-line on GitHub. All the examples use the fetch-data crate to retrieve the sample file(s) as needed.

This algorithm is obviously correct and fast. As a bonus, if it encounters a bad file, it uses ? to return an error result. For another bonus, you could mention that the code returns None on empty files.

On the downside, this code reads the whole file into memory and, so, won’t work on large files.

Let’s follow up by asking for a solution that works on large files:

Q: Using little memory, how could you select a random line from a text file of unknown length?

A: We must first clarify “little memory”. For this question, assume that you can store any line from the file, or several lines, but not all the lines.

AlgoTwoPass solves the problem by 1. counting the lines in the file, 2. randomly selecting a line index, 3. returning the line with that index. (Throughout this article, “index0” is an index that starts its count at 0. If an index starts counting from 1, we’ll call it “index1”.)

On my machine, this outputs:

1,683 of 146,933: Some(“to gather material for illustrations of the poems of Robert”)

AlgoTwoPassis correct. We’d also give this code a bonus for:

  • using a random number generator with an explicit seed — machine learning and data science often require reproducibility, even from randomness.
  • mentioning that it returns None when applied to an empty file.
  • using let item = item_result? to check for possible file errors.
  • [very minor] formatting numbers with thousands separators.

Not required, but interesting would be this more functional implementation:

Aside 1: Background information: This code uses Rust iterators. When next() is applied to a Rust iterator, the next value in a sequence is returned inside a Some or, if the sequence is complete, None is returned. The expression BufReader::new(File::open(&path)?).lines() returns an iterator over “line results”. Each line result is either a string (the next line) or an error (found when trying to read that line). Although Rust defines a count() method on iterators, we should not use it here. Why? Because the count() method doesn’t check for error values. Likewise, Rust defines an nth() method that advances an iterator n+1 values. It returns the (n+1)ᵗʰ value but ignores any intermediate error values. To check for error values from .line(), the answer above defines and usestry_count and try_nth functions. [Thanks to the Reddit Rust community for help on this issue.]

Aside 2: We would not penalize someone for making an off-by-one error with rng.gen_range or nth/try_nth. We might, however, ask how the code could be tested to check for such errors.

We next follow up by asking for a faster algorithm:

Q: In one pass, can you select a random line from a text file of unknown length?

A: After a hint, we’ll develop an algorithm via a series of sub-questions.

Hint: Think about this recursively.

Sub Q: Put yourself in the position of the program. If we promise you the file contains exactly one line, what should you do?

Sub A: If the file contains exactly one line, just output it.

Sub Q: Whoops, we lied to you. The file may contain exactly one line or two, but no other number of lines. What should you do?

Sub A: With probability 0.5, replace the result we were going to output, with the second line.

Sub Q: Sorry, we lied again! The file may contain exactly one, two, or three lines. What should you do? What is the probability of selection for each line? How can this be generalized?

Sub A: With probability ⅓, replace the result we were going to output, with the third line.

The probability of selection for each line is:

  • 1st line: 1 × ½ × ⅔= ⅓
  • 2nd line: ½ × ⅔= ⅓
  • 3rd line: ⅓= ⅓

So, the probability distribution is uniform. We can generalize this to AlgoOnePass:

Aside: I (Carl) recall one interviewee who started to prove this algorithm’s correctness via induction. I could tell they could easily do it, so I stopped them and moved on. They earned a bonus.

Bonus Q: We never asked this in an interview, but can you write AlgoOnePass recursively?

Bonus A: Here is AlgoOnePassRecurse:

Rust crashes if you recurse more than a few thousand times. This code uses the crate tailcall to avoid that crash. The recursive and non-recursive version of the algorithm run at about the same speed. The code earns a bonus for being generic across result iterators. For another bonus, it accepts the rng, the random number generator, as &mut, which allows rng to continue to be used.

Aside: Thinking practically, is one-pass important compared to two-pass? Often not. If you can wait 1 second for a random line, you can probably wait 1 to 2 seconds. On the other hand, some data — “streaming data” — can’t be accessed twice, so AlgoOnePass has practical value.

One straightforward generalization to AlgoOnePass, that we won’t cover, is selecting multiple random lines, not just one. Wikipedia describes (more-or-less) AlgoOnePass and this generalization in its article on Reservoir sampling: Simple Algorithm R.

In interviews, this is usually where we stopped with “random line from a file”. However, we recently learned about the Rust crate bytecount. The bytecount crate uses SIMD CPU instructions to speed up, among other things, counting the lines in a file. That got us playing with the problem again. This led to a new approach and an improved algorithm. The new algorithm doesn’t use bytecount. It does, however, in the specific case of returning one random line, outperform the more general Optimal: Algorithm L described in Wikipedia.

Aside: We call this the “new algorithm”, but it may have been discovered earlier. In any case, we hope you’ll find the algorithm and its development interesting.

As before, we’ll present the new algorithm via a series of questions and hints. We have not, however, ever put these questions to an interviewee.

Q: In one pass, can you select a random line from a text file of unknown length such that the random number generator is called much less than the number of lines, n?

A: We must clarify “much less”. We mean that the number of calls to the random number generator is less than O(n) [see Big O notation — Wikipedia]. In other words, reducing the calls by one or by half is not enough. The random number of calls required should grow proportional to, for example, log(n) or sqrt(n).

Hint: First modify AlgoOnePass to print the index of every item that gets assigned to result. Call these the “keep indices”. Run the code on, say, 1,000,000 items.

Outputs:

1 2 4 14 38 112 210 914 4,512 5,659 6,242 13,388 917,008

This says that when we run with random seed 0, the first item is kept as a possible final random item. Then the 2nd item and then 4th. Then no item is kept until the 14th item, then the 38th. If the iterator contains between 917,008 and 1,000,000 items, then the 917,008th item will be the last item kept and so will be the final random item.

The keep indices seem to grow roughly exponentially. If that conjecture is true, the number of keep indices is O(log n), where n is the number of items in the iterator.

Sub Q: Can we randomly generate a sequence of keep indices directly?

Sub A: Yes! We’ve detailed our solution in a short, on-line, technical paper [Meek & Kadie, Streaming Random Selection Using the Attenuated Geometric Distribution, 2022]. We call the distribution of keep indices the “Attenuated Geometric Distribution”. We show that if index1is a keep index number then we can generate the next one with:

let r: f64 = rng.gen();
index1 += ((r * (index1 as f64) / (1.0 — r)).ceil() as usize).max(1);

where rng.gen() generates a uniform floating-point value between 0 .0 (inclusive) and 1.0 (exclusive). Bonus: (…).max(1) handles the very, very unlikely case of generating 0.0 randomly. Also, recall our convention that “index1” is an index that starts counting from 1 instead of 0.

We can now use this insight to create AlgoOnePassSkip:

We can get some confidence in the code by using it to pick a number between 0 (inclusive) and 100 (exclusive), 100,000 times. (We use the expression (0..100).map(Ok::<_, std::io::Error>) to create an iterator of result values, Ok(0), Ok(1), … Ok(99). The code expects result values.)

If we plot with Python, the plots should look uniform, which they do:

Histogram and Line Plot suggesting a Uniform Distribution

This algorithm differs from Optimal: Algorithm L (recommended by Wikipedia) in two important ways.

  • AlgoOnePassSkip only works for selecting one random item, while Algorithm L can select any specified number of random items.
  • When only one random item is desired, AlgoOnePassSkip needs one random number per keep index, while Algorithm L needs two.

Thus, for the special case in which we want only one random item, AlgoOnePassSkip uses half as many random draws as Algorithm L.

We’ve seen four ways to select an item randomly from a sequence of unknown length. In the context of a text file, the first solution required that the file fit into memory. The next solution used less memory but required two passes through the file. We then used a probability calculation to reduce this to one pass. That one-pass solution required one random number per line. The last solution required many fewer random numbers than lines. It also uses half as many random numbers as the “optimal” (more general) algorithm.

None of our code uses system-level methods such as bytecount to speed up the linear pass through a file. Adding system-level optimizations would be an interesting extension.

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.
Leave a comment