On the internet there are many parallel corpora, for example: http://opus.lingfil.uu.se/Wikipedia.php
But as you can see, this data is very noisy: it contains wrong or very poor translations, translations into other languages, ISBN codes, bare numbers, etc.
I need an automatic filtering tool that removes the poor data (with as little loss of good data as possible) to produce parallel texts of quality similar to:
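To illustrate the kind of noise removal I mean, here is a minimal rule-based pre-filter sketch (Python only as an example, since the language is open to you; every threshold and the ISBN pattern are assumptions that would need tuning, not a specification):

```python
import re

# Illustrative thresholds -- assumptions, not requirements; they need tuning.
MIN_TOKENS, MAX_TOKENS = 3, 100   # sentence length bounds in tokens
MAX_LEN_RATIO = 2.0               # max allowed source/target length ratio
MIN_ALPHA_FRACTION = 0.5          # min share of alphabetic characters

ISBN_RE = re.compile(r'\bISBN[\s:]*[\d\- ]{9,}', re.IGNORECASE)

def looks_clean(src: str, tgt: str) -> bool:
    """Cheap heuristics that drop obvious noise: ISBN codes, bare numbers,
    and pairs whose lengths are too mismatched to be translations."""
    for sent in (src, tgt):
        tokens = sent.split()
        if not MIN_TOKENS <= len(tokens) <= MAX_TOKENS:
            return False
        if ISBN_RE.search(sent):
            return False
        # Mostly digits/punctuation -> likely a code or number, not a sentence.
        alpha = sum(ch.isalpha() for ch in sent)
        if alpha / max(len(sent), 1) < MIN_ALPHA_FRACTION:
            return False
    ratio = len(src.split()) / max(len(tgt.split()), 1)
    return 1 / MAX_LEN_RATIO <= ratio <= MAX_LEN_RATIO
```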
The tool should be language-independent.
Finding the best method will be part of the job, but here is a tool that works well yet is far too slow, loses a lot of good data, and requires a lot of manual parameter tuning: https://github.com/krzwolk/Text-Corpora-Adaptation-Tool/commits?author=krzwolk
It also does not strictly filter data but rather selects in-domain data. Nonetheless, I believe the methods used there (Levenshtein distance, perplexity, TF-IDF) may be useful; see the sketch below for one way the Levenshtein idea could be applied.
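As one concrete use of Levenshtein-style similarity: in noisy parallel corpora, pairs where the target is nearly identical to the source are usually untranslated copies rather than translations. A sketch using only the Python standard library (the 0.9 threshold is an assumption):

```python
from difflib import SequenceMatcher

def is_untranslated_copy(src: str, tgt: str, threshold: float = 0.9) -> bool:
    """Flag pairs whose target is (almost) a verbatim copy of the source --
    in a parallel corpus that usually means the segment was never translated."""
    return SequenceMatcher(None, src.lower(), tgt.lower()).ratio() >= threshold
```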
Most importantly, in one step of the data filtering I would like to use comparison against an n-gram language model: https://en.wikipedia.org/wiki/N-gram
I can provide the language models, but I would need the tool to use them to decide whether a sentence is plausible in the language or should be filtered out.
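As a sketch of that step, assuming the models are in an ARPA or binary format readable by KenLM's Python bindings (the file name and the perplexity threshold below are placeholders I made up):

```python
import kenlm  # KenLM Python bindings: https://github.com/kpu/kenlm

# Placeholder model path and threshold -- both are assumptions.
model = kenlm.Model('target_language.arpa')
PERPLEXITY_THRESHOLD = 1000.0

def plausible_in_language(sentence: str) -> bool:
    """Keep a sentence only if the n-gram LM does not find it wildly improbable."""
    return model.perplexity(sentence) <= PERPLEXITY_THRESHOLD
```

In practice the threshold would probably be set relative to the corpus-wide perplexity distribution (e.g. dropping the worst few percent of sentences) rather than as a fixed absolute number.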
The tool should work under Linux; the programming language is not important to me.