Also see Fast GPU bounding boxes on tree-structured scenes[1] (unpublished paper) and notes toward a blog post[2]. This is a highly tuned GPU implementation of parentheses matching. It's actually used in Vello (the classic version in which we offload basically all the work to the GPU, not the newer CPU-GPU hybrid version in which tracking the blend stack is done on the CPU).
Earlier versions of the work were featured on HN [3][4], but this is much more sophisticated. (plus a few more zero-comment submissions)
The basic idea (bicyclic semigroup and binary search) is the same as the submission. I think earliest attribution is to Bar-On and Vishkin[5] from 1985. Another implementation of this idea is in pareas[6], an experimental GPU-accelerated compiler.
I believe this work is publishable and would love to work with a student to resubmit it. Especially if you're a student or prof in Sydney, please reach out.
You can solve the same problem with Range Min Query Tree. The query for balanced or unbalanced parentheses is O(log N). A two or three level RMQTree can represent billions of parentheses already (a two-level tree of 65536 branch factor = 4B parentheses). The query is O(65536 + 65536) or effectively O(1). For a four-level tree of 256 branch factor, the query is O(256 + 256 + 256 + 256) or O(1). It becomes a problem of memory access on the number of levels vs the number of entries to process per level. A total = 0 and min > 0 mean the parentheses are balanced.
The parentheses can be broken up into multiple segments, and be processed in parallel. Multiple RMQTrees can be used, one per segment, and the trees can be processed in parallel. Need to have a final pass to sum up the 'total' and 'min' from all the RMQTrees. Parallel processing probably doesn't gain much.
I've heard that phrase couple of times, and cannot stop noting that then every real-world algorithm is "effectively O(1)", because in real world we have bounded inputs and RAM. E.g. every integer is O(2^64) = O(1).
If we need to say that something is really fast, let's just say that. E.g if a CPU needs to iterate something 65536 + 65536 times, we can just say that it would take about 0.1ms on 1GHz CPU, no need to involve asymptotic.
E.g. Scala boasts that their Vector implementation is "effectively constant", while in doing up to 5 non-consecutive RAM accesses, that screws up the CPU cache. But if we can bound something to "no more than 5 operations", then I can say that any array in 32-bit arch is "no more than 2^32 operations", which is equal to O(1).
If the implementation of qsort doesn't switch over to another algorithm for sufficiently small arrays (or when the recursion gets to a sufficiently small chunk) it could be improved.
You have ignored the much larger space and time cost of constructing the range minimum query tree in the first place. Furthermore, nobody would arbitrarily cut off a problem at “billions” and then consider O(√billions) or O(∜billions) to be “effectively O(1)”: a theoretician would complain that O() by definition measures the limiting behavior as the problem size tends to ∞, an engineer would complain that 65536 or 256 is very a important factor in practice, and both types are better served by leaving O(√n) or O(∜n) as-is instead of trying to hand-wave it away.
If you stare at the CYK algorithm long enough and see it for the dynamic programming approach it is, you'll then realize that you can do the same parallelization trick for any context-free grammar!
Getting to discover Oleg Kiselyov's work for the first time is such a treat. His web archive is incredible! I'm envious of the author and anyone else discovering it today.
Petition to have the question added to leetcode. Two variants one with a single type of parens such as () and one with mixed parens (),{},[]. In each problem implement the map method which takes a chunk sequence of the overall input, the reduce method which takes an array of map method output for each chunk and returns the overall answer.
interesting algorithm. if you can limit nesting depth to 255, the radix (now counting) sort version would become decently performant too. I think that's quite reasonable unless there's some unholy preprocessor-based metaprogramming going on
This is above my pay grade. I cannot begin to image having to actually implement that:
> To start off, we will discuss a smaller instance of the problem, which is given a single type of parentheses that has its open and closed forms '(' and ')', determine if a string of parentheses is balanced.
And in the real world, there are parentheses in strings (escaped or not) and in comments.
The rendering appears to be done specifically by the js hosted on jsdelivr so if you've blocked that as a script source you'll just get the raw LaTeX (which I assume we're all fluent in anyways, of course!)
Earlier versions of the work were featured on HN [3][4], but this is much more sophisticated. (plus a few more zero-comment submissions)
The basic idea (bicyclic semigroup and binary search) is the same as the submission. I think earliest attribution is to Bar-On and Vishkin[5] from 1985. Another implementation of this idea is in pareas[6], an experimental GPU-accelerated compiler.
I believe this work is publishable and would love to work with a student to resubmit it. Especially if you're a student or prof in Sydney, please reach out.
[1]: https://arxiv.org/abs/2205.11659
[2]: https://github.com/raphlinus/raphlinus.github.io/issues/66
[3]: https://news.ycombinator.com/item?id=24385095
[4]: https://news.ycombinator.com/item?id=27164009
[5]: https://dl.acm.org/doi/10.1145/3318.3478
[6]: https://github.com/Snektron/pareas
You can solve the same problem with Range Min Query Tree. The query for balanced or unbalanced parentheses is O(log N). A two or three level RMQTree can represent billions of parentheses already (a two-level tree of 65536 branch factor = 4B parentheses). The query is O(65536 + 65536) or effectively O(1). For a four-level tree of 256 branch factor, the query is O(256 + 256 + 256 + 256) or O(1). It becomes a problem of memory access on the number of levels vs the number of entries to process per level. A total = 0 and min > 0 mean the parentheses are balanced.
The parentheses can be broken up into multiple segments, and be processed in parallel. Multiple RMQTrees can be used, one per segment, and the trees can be processed in parallel. Need to have a final pass to sum up the 'total' and 'min' from all the RMQTrees. Parallel processing probably doesn't gain much.
I've heard that phrase couple of times, and cannot stop noting that then every real-world algorithm is "effectively O(1)", because in real world we have bounded inputs and RAM. E.g. every integer is O(2^64) = O(1).
If we need to say that something is really fast, let's just say that. E.g if a CPU needs to iterate something 65536 + 65536 times, we can just say that it would take about 0.1ms on 1GHz CPU, no need to involve asymptotic.
E.g. Scala boasts that their Vector implementation is "effectively constant", while in doing up to 5 non-consecutive RAM accesses, that screws up the CPU cache. But if we can bound something to "no more than 5 operations", then I can say that any array in 32-bit arch is "no more than 2^32 operations", which is equal to O(1).
https://okmij.org/ftp/
> To start off, we will discuss a smaller instance of the problem, which is given a single type of parentheses that has its open and closed forms '(' and ')', determine if a string of parentheses is balanced.
And in the real world, there are parentheses in strings (escaped or not) and in comments.
I wouldn't even know where to start.