Exploring Frameworks and Association Rules algorithms
Part of our development tasks at Singularities was to have an association rules mining algorithm running in a distributed environment. First we had to choose our framework for distributed process.
We used Flink at the beginning because it had better performance according to some benchmarks we found on Internet. It was giving us good results, but faced a huge problem with it, it's community was small and there wasn't much help you could find from them. Also a tiny adoption in comparison to other frameworks made it less likely to grow as fast as we were expecting.
We managed to implement the Apriori Algorithm in Flink.
We decided to move on with Spark where we first migrated the Apriori Algorithm implemented with Flink. The migration process was quite simple, because these frameworks are very similar to each other.
Association Rules mining algorithms
We had previously implemented an apriori algorithm, then we tried to migrate it to Flink, and then Spark. Our experience with this algorithm is that it takes too much memory and CPU. The biggest problem with apriori was the generating of combinations to be counted. After trying several different ways we found that step way too CPU intensive to make apriori feasible.
FP-Growth's association generation is by far the fastest one we tested. Because the algorithm achieves almost a linear O(n) with it's use of the FP-Tree it becomes the fastest way to get the associations. The problems with it came as soon as the rules come into play. The limitations we currently have in the reasoner make it impossible to process all the rules in a given dataset, so we use the top 100 rules (according to their confidence). The problem with FP-Growth is that, it produces every possible association, and that leads to a huge amount of rules and an absurd number of those are redundant.
We tested the FP-Growth from Spark's MLLib, it worked great, but it was implemented for only 1 client/entity/list of transactions at the same time, we needed it to run for multiple clients at the same time. We started adapting the code from MLLib, but then we found out about TNR so we stopped.
Top-k Non-Redundant Association Rules (TNR)
The TNR algorithm was a fantastic solution to our problem because we could paralellize it for every customer, it's incredibly quick and only generates the best K rules, with no redundancy. TNR solved every issue we had with FP-Growth, so we didn't finished adapting the code.
We found an open source library for machine learning algorithms made by Philippe Fournier-Viger. You can find his webpage here, and his SPMF library here.
We first ran some tests using 100 rules, 60 for delta and 90 % for minimum confidence for each rule. We found out 3 things while experimenting with this algorithm:
If a client has only 1 transaction, the algorithm gets stuck.
If a client has few transactions the algorithm takes a lot of time (using k=100, delta=60 and minConf = 0.9).
The algorithm works only with Integer item types and for 1 client at the same time.
We managed to fix these issues by giving the algorithm a minimum number of transactions, so it doesn't take too much time and doesn't get stuck if the client has only 1 transaction.
Also we modified the code in order to make it work with a generic type of item (if the type is comparable), and to make it work within Spark. We still had one more issue and it was that with certain type of clients with a high delta the algorithm got stuck. We fixed this by setting delta to 0.
TNR gave us the best results considering time and output. It was pretty fast, each client had an average duration of less than 1 second, and the output rules were also good because they were not redundant and were the top K of them. Also because of the way TNR fills the tree and chooses the rules, the more transactions a customer has the faster it goes. When we tested in favor of number of transactions instead of number of customers the algorithm's performance was significantly faster.