
Chapter 259 Crazy Data (1/2)

Take this example:

In image recognition, we often need millions of pieces of manually labeled data.

In speech recognition, we may need thousands of hours of manually labeled data.

When it comes to machine translation, tens of millions of annotated sentence pairs are needed.

To be honest, as a technician who had come here from a previous life a few years in the future,

Lin Hui had really never taken the value of manually annotated data seriously.

But now it seemed that he had clearly been underestimating the value of this thing.

Lin Hui remembered a set of figures he had seen in his previous life, from 2017, concerning human translation.

The cost of human translation is about 5 to 10 cents per word, and the average sentence is about 30 words long.

If we need to label 10 million bilingual sentence pairs, that is, to find experts to translate 10 million sentences, the cost of this labeling comes to almost 22 million US dollars.

It can be seen that the cost of data labeling is very, very high.
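A quick back-of-the-envelope check of that figure, written as a minimal Python sketch (the 5-10 cents per word, 30 words per sentence, and 10 million sentence pairs are the numbers quoted above; taking the 7.5-cent midpoint for a point estimate is an added assumption):

# Rough check of the ~22 million USD labeling estimate quoted above.
cost_per_word_low, cost_per_word_high = 0.05, 0.10   # 5-10 US cents per word
words_per_sentence = 30
sentence_pairs = 10_000_000

low = cost_per_word_low * words_per_sentence * sentence_pairs    # 15,000,000 USD
high = cost_per_word_high * words_per_sentence * sentence_pairs  # 30,000,000 USD
mid = (low + high) / 2                                           # 22,500,000 USD, close to the ~22 million quoted

print(low, mid, high)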

And this is only the data labeling cost in 2017.

Now, doesn’t the labeling cost mean higher data labeling costs?

You should know that there is almost no emphasis on unsupervised learning now.

In terms of unsupervised learning, there are almost no models that can be used.

Mainstream machine learning still relies on supervised learning and semi-supervised learning.

All supervised learning and semi-supervised learning are basically inseparable from manually labeled data.

Measured from this perspective, wouldn't the large amount of ready-made manually labeled data owned by Lin Hui be a huge invisible wealth?

In the previous life, in 2017, labeling 10 million pieces of bilingual data would have cost more than 20 million US dollars.

And in this time and space, 2014 is a period when machine learning as a whole lags behind.

How much does it cost to label the same 10 million pieces of bilingual data?

Lin Hui felt that 10 million pieces of bilingual annotated data would cost two to three billion US dollars.

The figure of "two to three billion US dollars" seems a bit scary.

But it's actually not an exaggeration.

There are two reasons why it is not an exaggeration:

First, even in the previous life, the cost of data annotation only dropped significantly after the advent of special learning techniques such as dual learning.

Before this, the word "cheap" had nothing to do with data annotation.

Again, take the examples Lin Hui listed earlier as a reference:

In the previous life in 2017, the cost of 10 million bilingual translation annotations was approximately US$22 million;

Note that this is only a bilingual translation annotation.

"Bilingual translation" is just a label for translation between two languages.

Just the translation and labeling between the two languages ​​requires more than 20 million US dollars?

How much does it cost to translate between hundreds of languages?

This problem is not complicated; it is a simple combination problem:

C(100, 2) = 4950; 4950 × 22 million US dollars = 108.9 billion US dollars.

It is not difficult to see that if it is necessary to support translation between hundreds of languages, the cost of manually annotating training sets will reach hundreds of billions of dollars.
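The combinatorics behind that number, as a minimal Python sketch (the 100 languages and the roughly 22 million US dollars per language pair are the figures used above):

import math

languages = 100
cost_per_pair_usd = 22_000_000            # ~22 million USD per bilingual pair, as above

pairs = math.comb(languages, 2)           # C(100, 2) = 4950 language pairs
total_usd = pairs * cost_per_pair_usd     # 4950 * 22,000,000 = 108,900,000,000 USD

print(pairs, total_usd)                   # 4950  108900000000, i.e. about 108.9 billion USD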

And this is only an estimate under ideal circumstances. If such annotation is to be carried out step by step, the actual cost will be far more than this.

After all, the cost of translation between many less common languages is obviously higher than that between mainstream languages.

Of course, in actual practice no one would really go about annotating data for hundreds of languages step by step.

But this estimate also fully illustrates that data labeling will be expensive for a long time.

By the same token, data labeling in this time and space is still expensive right now.

And because machine learning research in this time and space lags behind, the cost of data annotation now is even higher than it was in the same period of the previous life.

Second, the times are developing rapidly. You should know that the scientific calculators that can be casually bought in any ordinary store today can, in terms of actual efficiency, reliability, and ease of use, beat the computers of the 1950s and 1960s that cost tens of millions of dollars to build and occupied hundreds or even thousands of square meters.

In that case, the dirt-cheap calculators of later generations, if taken back a few decades, would still have a market even at a price of millions of dollars, and might even be quite competitive.

Lin Hui did not raise this example because he intended to go back a few decades and sell calculators.

Lin Hui just wanted to use this to show that the wheel of the times is moving forward, and technology is also developing rapidly.

Especially in the post-Internet era, it is no exaggeration to say that the development of science and technology is changing with each passing day.

Under such circumstances, it is perfectly normal that technologies which will receive little attention a few years from now can be exchanged for large amounts of wealth a few years earlier.

Not to mention something like data annotation, which for a long time only deep-pocketed companies have been able to afford to play with; why couldn't it be exchanged for wealth?

In short, Lin Hui didn’t see anything wrong with the estimate that “10 million pieces of bilingual annotated data would cost two to three billion US dollars now.”

In fact, even the estimate of "two to three billion US dollars" may be somewhat conservative.

In the industrial structure of artificial intelligence, the main body consists of the application layer, the technology layer, and the base layer.

The application layer contains solutions and product services.

The technology layer includes application technology, algorithm theory, and platform frameworks.

The base layer contains infrastructure and data.
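A minimal sketch of that three-layer structure as a Python data structure (the layer names and their contents are the ones listed above; writing them as a dict is just one illustrative choice):

# The three-layer industrial structure of AI described above.
ai_industry_structure = {
    "application layer": ["solutions", "product services"],
    "technology layer": ["application technology", "algorithm theory", "platform frameworks"],
    "base layer": ["infrastructure", "data"],
}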

Measured from this perspective, data can even be regarded as the cornerstone of artificial intelligence to a certain extent.

This is exactly the case.

It involves the troika of artificial intelligence: algorithms, computing power, and data.

Algorithms seem very important, but you must know that many times, without high-quality data, it is difficult to train high-quality algorithms.

Although data is usually invisible and intangible, no one can ignore the importance of data.

Labeled data, in particular, is very important.

At present, supervised machine learning is still the main way of learning and training neural networks.

Supervised machine learning is inseparable from labeled data.

Supervised machine learning requires labeled data as prior experience.

In supervised machine learning, the labeled data is divided proportionally into a training set and a test set.

The machine learns a model from the training set, and then runs it on the test set to obtain the model's accuracy.

The algorithm engineers identify the model's shortcomings from the test results, feed the data problems back to the data annotators, and then the process is repeated until the model's metrics meet the requirements for going live...
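A minimal sketch of that supervised-learning loop in Python (illustrative only: scikit-learn, logistic regression, synthetic stand-in data, and the 80/20 split are added assumptions, not anything named in the text):

# Illustrative supervised-learning cycle: split labeled data, train, evaluate,
# and in practice feed error cases back to the annotators until metrics are good enough.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # stand-in labeled data

# Divide the labeled data proportionally into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # learn a model from the training set
accuracy = accuracy_score(y_test, model.predict(X_test))         # measure it on the test set
print("test accuracy:", accuracy)

# In a real workflow, error cases would be reviewed, data problems fed back to the
# annotators, and the cycle repeated until the model's metrics meet launch requirements.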

In the current situation where there are almost no applications of unsupervised learning, large-scale, high-quality manually labeled data sets can even be said to be an urgent need for the development of the current machine learning industry.

In this case, the importance of data and annotated data cannot be overemphasized.

That's why Lin Hui said that valuation was actually an underestimate.

However, the so-called valuation doesn't matter that much. If it really comes to selling the annotated data, the specific price can be negotiated slowly.

Lin Hui needs a lot of money, but if he ends up negotiating with some of the super giants in the future, he may not be fixated on money alone.

Exchanging it for resources that Lin Hui is interested in is not out of the question.

To be honest, some of the resources of these top giants are quite tempting to Lin Hui.

To come back to the annotated data Lin Hui currently has:

When it came to online text translation, Lin Hui almost immediately thought of SimpleT, an app that had been on his phone in his previous life.

SimpleT is a software developed and tested by the company where Lin Hui worked in his previous life.

This software is not well known because it is still in the alpha testing stage.

The purpose of alpha testing is to evaluate the functionality, localization, usability, reliability, performance and support of the software product.

Special attention is paid to the product's interface and features.

Alpha testing can begin once coding of the software product is completed.

It can also be started after module (subsystem) testing is completed.

It can also begin after the product has been confirmed, during the testing process, to have reached a certain level of stability and reliability.

The alpha internal testing of SimpleT software was started after confirming that SimpleT reached a certain level of stability and reliability.

So although SimpleT is still in internal testing,

the technical level of the software is already quite mature, and it is only about one round of public beta away from an official launch.

Lin Hui had originally planned to replicate this software and enter the translation software market when the time was right.

Paying attention to the special value of annotated data.
To be continued...