Artificial intelligence (AI) can be used to automate work processes, create forecasts or open up new business areas. Enormous amounts of data are required to exploit its full potential. This data must not only be plentiful, but also complete and of good quality. This blog post shows what these requirements mean in concrete terms. Data preparation and analysis are at least as important as (if not more important than) the data modeling itself.
Clean data and yet a problem?
Since 2015 we have been conducting a code and learning week with the company. Last year's training topic was artificial intelligence. In various workshops and experiments, we explored algorithms and data processing. We designed the trainings on the basis of real scenarios and data from two of our existing customers: schulerauktionen.ch and mtextur.com (at this point: thank you very much for your approval).
The digital archive of the Schuler auction house contains almost 80,000 art objects. For each object we know, among other things, the title, object description, art genre, estimated and selling prices, and 1 to 15 illustrations. In addition, Google Analytics and internal statistics provide usage metrics.
The digital material archive mtextur holds almost 50,000 CAD and BIM textures for architecture and design. For each texture we know, among other things, the manufacturer, type of material, type of use, classification, colour, approx. 20 technical codes and 1 to 10 illustrations. In addition, Google Analytics and internal statistics provide usage metrics.
At Schuler Auctions as well as at mtextur, the data is available in a clean and structured form. So nothing stood in the way of smooth progress – at least that is what we thought. However, when we started to analyze and prepare the data, some common pitfalls became apparent.
Data collections and their pitfalls with regard to artificial intelligence
In both projects, one of the objectives was to predict the classification on the basis of photos. In the case of mtextur, this could be used to design "successful" textures for specific areas (we know the success metrics of each texture). At Schuler Auctions we envisioned a consultant bot that would predict the presumed value of an uploaded art item.
To anticipate the result: in both projects, the data situation prevented us from achieving these two goals. Considering the large amount of well-prepared and structured data that was available, this may be surprising. Why is that?
Although the data was available in sufficient quantity, it did not cover the entire spectrum. At Schuler Auctions, we know all auction items and their selling prices. However, these are only the objects that made it into an auction at all. All worthless items that are sorted out in advance are not documented – a classic case of selection bias. From a data point of view, however, these rejected items are at least as interesting (and important).
The large number of records was spread over a wide spectrum in both projects. Since each sub-domain has its own rules (concrete behaves differently from wood, modern art is subject to different rules than ancient weapons), the usable amount of data shrank drastically. In the end there were only a few thousand records per classification – far too few for our project.
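This effect can be spotted early with a simple frequency count before any modeling begins. The following sketch uses hypothetical records and an illustrative threshold – the real field names and sample sizes would depend on the project:

```python
from collections import Counter

# Hypothetical records: each item carries a classification label.
records = [
    {"id": 1, "classification": "concrete"},
    {"id": 2, "classification": "wood"},
    {"id": 3, "classification": "concrete"},
    {"id": 4, "classification": "metal"},
]

# Count how many examples each class actually has.
counts = Counter(r["classification"] for r in records)

# Flag classes with too few examples to train on (threshold is illustrative;
# real image-classification tasks typically need far more per class).
MIN_SAMPLES = 2
too_small = [cls for cls, n in counts.items() if n < MIN_SAMPLES]
print(too_small)
```

Running such a check on the full data set makes it immediately visible which classifications are realistically trainable and which would need more data first.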
Data may be neatly classified, but classifications and hierarchies may be applied inconsistently. For example, when auctioning a major art collection, Schuler Auctions creates a dedicated category (for example, "Collection XY"). The category "Collection XY" then sits alongside a regular category (for example, "Swiss paintings"). What makes sense from the user's point of view distorts the data set in the background: an art object of the genre "Swiss paintings" suddenly appears in two categories.
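Such double memberships are easy to surface by inverting the category mapping. The category names and IDs below are hypothetical, but the pattern applies to any category-to-items structure:

```python
from collections import defaultdict

# Hypothetical mapping of categories to object IDs; "Collection XY"
# overlaps with the regular genre "Swiss paintings".
categories = {
    "Swiss paintings": {101, 102, 103},
    "Collection XY": {103, 104},
    "Ancient weapons": {201},
}

# Invert the mapping: in which categories does each object appear?
membership = defaultdict(set)
for cat, ids in categories.items():
    for obj_id in ids:
        membership[obj_id].add(cat)

# Objects counted more than once distort any per-category statistics.
duplicates = {obj: cats for obj, cats in membership.items() if len(cats) > 1}
print(duplicates)
```

Whether the duplicates should be merged, dropped, or mapped to a canonical category is a domain decision – but they must at least be known before training.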
Things that are difficult for us humans to distinguish (for example, whether a photo shows real wood or an imitation) are also difficult for machines.
Although we had tens of thousands of records with 10 to 20 well-structured attributes per project, data modeling remained challenging. And at that point we had not yet written any code, nor thought about suitable algorithms.
Data analysis is crucial
The two example projects illustrate where the pitfalls of data preparation can lie. It becomes clear that great attention must be paid to the data basis.
The heterogeneity of the existing data can drastically reduce the usable amount of data. Does your data cover a broad field, or does it come from a niche?
Seemingly useless data suddenly becomes central in the context of artificial intelligence. At Schuler Auctions, for example, it is all those objects that are rejected. Which data could this be in your case?
Structured data is not necessarily cleanly separated data. Do you store different kinds of data in the same attribute, or do you "misuse" your software as a practical workaround in some areas?
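A quick way to detect such a "misused" attribute is to validate its values against the format they are supposed to have. The attribute name and values below are hypothetical; in practice the pattern would match your expected format:

```python
import re

# Hypothetical values of a single "estimate" attribute that has been
# misused over the years: plain prices, ranges, and free-text notes mixed in.
values = ["1200", "800-1500", "on request", "2500", "see catalogue"]

# Expected format: a plain integer price.
price_pattern = re.compile(r"^\d+$")

clean = [v for v in values if price_pattern.match(v)]
dirty = [v for v in values if not price_pattern.match(v)]

print(f"{len(clean)} clean, {len(dirty)} need cleanup: {dirty}")
```

The share of non-conforming values is a useful first indicator of how much cleanup a supposedly structured attribute really needs.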
It is very difficult to validate things that we humans cannot do ourselves. This is currently also the subject of many research projects (for example, the detection of deepfakes).
Successful AI projects start with detailed data analysis and preparation. It takes a lot of experience to identify the common challenges, and a lot of knowledge about the data itself. Very close cooperation between client and agency is essential for success. And the opportunities that come with it? They are endless!