
Can you provide the size of the largest file, the size of the smallest file, the average file size, and the number of files you need to work on? Do the lines in each file need to stay in the same order, or can they be sorted alphabetically?
I've been reading again the other post you started, where you were provided with a regular expression (regex). The rest of the posts in that thread went on to test some timings with respect to that regex. I had also explained my natural aversion to using the lookahead in that regex. At 5 million lines the lookahead is possibly not the best option for you; that regex could be causing your machine to hang, and I think that's why it is hanging.
said in Remove duplicate lines in separate files:
With 5 million lines, my machine will hang.
You mentioned something about Linux; maybe that's where you should direct your efforts. The only other possibility I see is using some other product. This is a concept at a high level; obviously it will involve lots of steps and lots of manual work.
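To put some flesh on the "some other product" idea, here is a minimal Python sketch of my own (nothing from the other thread, and the file name is a placeholder): the first function re-scans the remainder of the file for every line, the way a whole-file lookahead tends to, which is roughly n²/2 comparisons and will grind to a halt long before 5 million lines; the second makes a single pass, keeping a set of lines already seen.

```python
# Minimal illustration: a lookahead-style scan versus a single pass.

def dedupe_like_lookahead(lines):
    """Keep a line only if it does not appear again later (the last copy of a
    duplicate survives). Re-scans the rest of the list for every line, so the
    work grows quadratically with the number of lines."""
    kept = []
    for i, line in enumerate(lines):
        if line not in lines[i + 1:]:
            kept.append(line)
    return kept

def dedupe_single_pass(lines):
    """Keep the first occurrence of each line: one pass, one set lookup per line."""
    seen = set()
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept

if __name__ == "__main__":
    # "input.txt" is a placeholder; this assumes one file's lines fit in memory.
    with open("input.txt", encoding="utf-8") as f:
        lines = f.readlines()
    for line in dedupe_single_pass(lines):
        print(line, end="")
```

A single-pass script like this only covers duplicates within one file that fits in memory; spreading the work across roughly 100 large files is where the splitting idea below comes in.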
said in Remove duplicate lines in separate files:
You seem to have something difficult for me?
My idea was not difficult; however, there would have been a number of steps to do. Now that you have changed the difficulty to 100 or so files, I don't think my idea as it was would be achievable without adding more steps. Often when a problem is too big it should be looked at in a different way. So the new idea would still take a number of steps, but it might work:
For each file, sort the lines alphabetically; if the current line order needs to be kept, a line number can be added to each line before sorting.
Each file would then be cut into a number of smaller files, say along alphabetical lines: all lines starting with "a" in one file, "b" in another, and so on. Depending on size it may even need to be "aa", "ab", "ac", etc.
Combine all the "a" files together and find the duplicates.
Once the duplicates are found and removed, sort the lines by the original file name they came from.
Combine the lines back into the original files and re-sort them according to their original line numbers.
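For what it's worth, here is a rough Python sketch of that staged idea, under some assumptions of my own: the input files are plain UTF-8 text, a "duplicate" is an identical line wherever it appears across all files, and the first copy encountered is the one that is kept. It skips the per-file alphabetical sort, because picking a bucket from the first character does the same grouping directly, but it keeps the line-number tagging so the original order can be restored. All names and paths are placeholders.

```python
import os
from glob import glob

WORK_DIR = "scratch"        # placeholder scratch directory for the bucket files
INPUT_GLOB = "data/*.txt"   # placeholder pattern matching the ~100 input files

def split_into_buckets(input_files):
    """Tag every line with its source file and line number, then spread the
    lines across smaller bucket files chosen by the line's first character."""
    os.makedirs(WORK_DIR, exist_ok=True)
    buckets = {}
    for path in input_files:
        with open(path, encoding="utf-8") as src:
            for lineno, line in enumerate(src):
                if not line.endswith("\n"):
                    line += "\n"                      # guard a final line with no newline
                name = f"bucket_{ord(line[0]):04x}"   # bucket chosen by first character
                if name not in buckets:
                    buckets[name] = open(os.path.join(WORK_DIR, name), "w", encoding="utf-8")
                # record where the line came from so it can be put back later
                buckets[name].write(f"{path}\t{lineno}\t{line}")
    for handle in buckets.values():
        handle.close()

def dedupe_buckets(input_files):
    """Within each bucket drop lines already seen (first copy wins) and append
    the survivors to one scratch file per original input file."""
    keep_path = {p: os.path.join(WORK_DIR, f"keep_{i}.txt") for i, p in enumerate(input_files)}
    keep = {p: open(kp, "w", encoding="utf-8") for p, kp in keep_path.items()}
    for bucket in glob(os.path.join(WORK_DIR, "bucket_*")):
        seen = set()   # only one bucket's unique lines are held in memory at a time
        with open(bucket, encoding="utf-8") as f:
            for record in f:
                path, lineno, line = record.split("\t", 2)
                if line not in seen:
                    seen.add(line)
                    keep[path].write(f"{lineno}\t{line}")
    for handle in keep.values():
        handle.close()
    return keep_path

def rebuild_files(keep_path):
    """Re-sort each scratch file by original line number and write the
    deduplicated version of each input file (assumes one file's worth of
    lines fits in memory for this final sort)."""
    for original, scratch in keep_path.items():
        entries = []
        with open(scratch, encoding="utf-8") as f:
            for record in f:
                lineno, line = record.split("\t", 1)
                entries.append((int(lineno), line))
        entries.sort()
        with open(original + ".dedup", "w", encoding="utf-8") as out:   # placeholder output name
            for _, line in entries:
                out.write(line)

if __name__ == "__main__":
    files = sorted(glob(INPUT_GLOB))
    split_into_buckets(files)
    rebuild_files(dedupe_buckets(files))
```

The only thing that has to fit in memory at any one time is the set of unique lines in a single bucket, which is the whole point of cutting the files up before comparing them.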