Well, adding veracity to an already-complex dataset will not only increase the amount of data to be analyzed, but it could also introduce values that skew distributions or lead to predictions based on artificial values. Making a dataset easier to work with does not guarantee that it will output useful information. It might be necessary to tweak your data mining scripts or algorithms to accommodate changes in the accuracy of the content.
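As a minimal sketch of that risk (using made-up numbers, not any particular dataset), here is how naively filling gaps with artificial placeholder values can shift a distribution before any mining step even runs:

```python
import statistics

observed = [12.0, 15.5, 14.2, 13.8, 16.1]   # real measurements
artificial = [0.0] * 3                       # naive placeholder fill for missing rows

print("mean before fill:", statistics.mean(observed))
print("mean after fill: ", statistics.mean(observed + artificial))
# The second mean drops sharply, so any model trained on the filled set
# would be making predictions based partly on values that were never observed.
```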
Adding variability can complicate matters even further. For instance, imagine you had a dataset of plain-text English but wanted to incorporate Arabic content to analyze similarities and differences between the two sets. Not only would you have to find someone capable of translating that content, but you would also have to make sure the translated Arabic is encoded consistently (for example in UTF-8, since ASCII cannot represent Arabic characters). Furthermore, you would have to account for basic differences in syntax, such as sentence structure, which would rely heavily on the judgment and bias of the human translator. Such problems can render datasets useless if the work is not done through a proven method.
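A small sketch of the encoding issue, assuming some translated files arrive in a legacy Arabic code page (Windows-1256 here is just an illustrative assumption), is to normalize everything to UTF-8 before merging it with the English set:

```python
# Hypothetical file content saved in the Windows-1256 Arabic code page.
arabic_bytes = "مرحبا بالعالم".encode("cp1256")

try:
    text = arabic_bytes.decode("utf-8")
except UnicodeDecodeError:
    # Fall back to the legacy Arabic code page, then re-encode consistently.
    text = arabic_bytes.decode("cp1256")

normalized = text.encode("utf-8")   # one standard for the whole dataset
print(text, "->", len(normalized), "bytes as UTF-8")
```

In practice the fallback list would come from whatever sources actually feed the dataset; the point is simply that every file is converted to one agreed-upon encoding before analysis.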