We have now seen methods such as fit(), transform(), and fit_transform() in many of scikit-learn’s classes. Almost all tutorials, including the ones I’ve written, simply tell you to use one of these methods. The obvious question is: what do these methods mean? What does it mean to fit something and to transform something? The transform() method makes some sense, since it just transforms the data, but what about fit()? In this post, we’ll try to understand the difference between the two.
To better understand the meaning of these methods, we’ll take the Imputer class as an example, because it has all of them. But before we get started, keep in mind that fitting something like an imputer is different from fitting a whole model.
You use an Imputer to handle missing data in your dataset. It gives you easy ways to replace NaNs and blanks with something like the mean or the median of the column. But before it can replace these values, it has to calculate the value that will be used as the replacement. If you tell the Imputer that you want the mean of all the values in a column to replace all the NaNs in that column, the Imputer has to calculate that mean first. This step of calculating the value is what the fit() method does.
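Here is a minimal sketch of that fit step. It assumes a recent scikit-learn release, where the old Imputer class has been replaced by SimpleImputer in sklearn.impute. The tiny array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A made-up single-column dataset with one missing value.
X = np.array([[1.0], [2.0], [np.nan], [6.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X)  # only computes the replacement value; X itself is unchanged

# The learned value is kept on the object: the mean of 1, 2 and 6.
print(imputer.statistics_)  # [3.]
```

Note that after fit() the NaN is still sitting in X. Nothing has been replaced yet.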
Next, the transform() method simply replaces the NaNs in the column with the newly calculated value and returns the new dataset. That’s pretty simple. The fit_transform() method does both things internally and makes life easier by exposing a single method. But there are cases where you want to call fit() and transform() separately.
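To make the relationship between the three methods concrete, here is a toy imputer written in plain Python. ToyMeanImputer and its mean_ attribute are hypothetical names invented for this sketch. It is not how scikit-learn implements its imputer, only an illustration of the idea.

```python
import math


class ToyMeanImputer:
    """Hypothetical stand-in for an imputer, for illustration only."""

    def fit(self, values):
        # Learn the replacement value from the non-missing entries.
        known = [v for v in values if not math.isnan(v)]
        self.mean_ = sum(known) / len(known)
        return self

    def transform(self, values):
        # Replace every NaN with the value learned in fit().
        return [self.mean_ if math.isnan(v) else v for v in values]

    def fit_transform(self, values):
        # Convenience method: fit, then transform the same data.
        return self.fit(values).transform(values)


column = [1.0, float("nan"), 5.0]
filled = ToyMeanImputer().fit_transform(column)
print(filled)  # [1.0, 3.0, 5.0]
```

Calling fit(column) followed by transform(column) on the same object gives the same result as the single fit_transform(column) call.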
When you are training a model, you use the training dataset. On this dataset, you use the Imputer to calculate the value and replace the blanks. But when you apply the trained model to the test dataset, you don’t calculate the mean or median again. You use the same value that you used on your training dataset. To do this, you call the fit() method on your training dataset, which only calculates the value and keeps it internally in the Imputer. Then you call the transform() method on the test dataset with the same Imputer object. This way, the value calculated for the training set, which was saved internally in the object, is used on the test dataset as well.
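The train/test workflow can be sketched like this, again assuming a recent scikit-learn where SimpleImputer takes the place of Imputer. The arrays X_train and X_test are made-up examples.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training and test columns, each with a missing value.
X_train = np.array([[1.0], [np.nan], [3.0]])
X_test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")

# Learn the mean from the training data (2.0) and fill the training set.
X_train_filled = imputer.fit_transform(X_train)

# Reuse the stored training mean on the test set; nothing is recalculated.
X_test_filled = imputer.transform(X_test)

print(X_test_filled[0, 0])  # 2.0
```

The missing test value is filled with 2.0, the training mean, rather than with 10.0, which is what the test set’s own mean would have been.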
To put it simply, you can use the fit_transform() method on the training set, since you need to both fit and transform that data. Alternatively, you can call fit() on the training dataset to get the value and later call transform() on the test data with it. Let me know if you have any comments or if anything is unclear.