1
00:00:05,580 --> 00:00:08,820
- So let's talk about the carbon
footprint of transformers.
2
00:00:08,820 --> 00:00:10,530
Maybe you've seen
headlines such as this one
3
00:00:10,530 --> 00:00:13,530
that training a single AI
model can emit as much carbon
4
00:00:13,530 --> 00:00:16,020
as five cars in their lifetimes.
5
00:00:16,020 --> 00:00:19,440
So when is this true
and is it always true?
6
00:00:19,440 --> 00:00:21,803
Well, it actually depends
on several things.
7
00:00:21,803 --> 00:00:23,430
Most importantly, it depends
8
00:00:23,430 --> 00:00:24,960
on the type of energy you're using.
9
00:00:24,960 --> 00:00:26,267
If you're using renewable energy such as
10
00:00:26,267 --> 00:00:30,670
solar, wind, hydroelectricity,
you're really
11
00:00:30,670 --> 00:00:33,810
emitting very little
carbon, if any at all.
12
00:00:33,810 --> 00:00:36,769
If you're using non-renewable
energy sources such as coal
13
00:00:36,769 --> 00:00:39,570
then your carbon
footprint is a lot higher
14
00:00:39,570 --> 00:00:43,260
because essentially you are emitting
a lot of greenhouse gases.
15
00:00:43,260 --> 00:00:44,670
Another aspect is training time.
16
00:00:44,670 --> 00:00:47,232
So the longer you train,
the more energy you use
17
00:00:47,232 --> 00:00:50,250
the more energy you use, the
more carbon you emit, right?
18
00:00:50,250 --> 00:00:51,270
So this really adds up
19
00:00:51,270 --> 00:00:53,520
especially if you're
training large models for
20
00:00:53,520 --> 00:00:56,460
hours and days and weeks.
21
00:00:56,460 --> 00:00:58,380
The hardware you use also matters
22
00:00:58,380 --> 00:01:00,930
because some GPUs, for
example, are more efficient
23
00:01:00,930 --> 00:01:05,460
than others, and utilizing
them properly,
24
00:01:05,460 --> 00:01:07,500
so using them a hundred
percent all the time
25
00:01:07,500 --> 00:01:10,650
can really reduce the energy
consumption that you have.
26
00:01:10,650 --> 00:01:13,290
And then once again, reduce
your carbon footprint.
27
00:01:13,290 --> 00:01:15,870
There are also other aspects, such as I/O
28
00:01:15,870 --> 00:01:17,730
such as data, et cetera, et cetera.
29
00:01:17,730 --> 00:01:20,940
But these are the main three
that you should focus on.
30
00:01:20,940 --> 00:01:23,340
So when I talk about energy
sources and carbon intensity
31
00:01:23,340 --> 00:01:24,420
what does that really mean?
32
00:01:24,420 --> 00:01:27,480
So if you look at the top of the screen
33
00:01:27,480 --> 00:01:30,480
you have a carbon footprint
34
00:01:30,480 --> 00:01:33,860
of a cloud computing
instance in Mumbai, India
35
00:01:33,860 --> 00:01:38,700
which emits 920 grams of
CO2 per kilowatt hour.
36
00:01:38,700 --> 00:01:40,110
This is almost one kilogram
37
00:01:40,110 --> 00:01:43,680
of CO2 per kilowatt hour
of electricity used.
38
00:01:43,680 --> 00:01:45,150
If you compare that with Montreal, Canada
39
00:01:45,150 --> 00:01:48,720
where I am right now, it's 20
grams of CO2 per kilowatt hour.
40
00:01:48,720 --> 00:01:50,040
So that's a really, really big difference.
41
00:01:50,040 --> 00:01:54,240
That's more than 40
times more carbon emitted
42
00:01:54,240 --> 00:01:55,950
in Mumbai versus Montreal.
43
00:01:55,950 --> 00:01:57,720
And so this can really, really add up.
44
00:01:57,720 --> 00:01:59,820
If you're training a model
for weeks, for example
45
00:01:59,820 --> 00:02:01,920
you're multiplying the carbon
46
00:02:01,920 --> 00:02:03,450
that you're emitting by 40.
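To make that arithmetic concrete, here is a minimal Python sketch; the 1,000 kWh figure is an assumed energy use for a long training run, and only the two carbon intensities come from the examples above.

    # Assumed energy use for a long training run (hypothetical figure).
    energy_kwh = 1_000
    # Carbon intensities quoted above, in grams of CO2 per kilowatt hour.
    intensity_g_per_kwh = {"Mumbai": 920, "Montreal": 20}
    for region, g_per_kwh in intensity_g_per_kwh.items():
        kg_co2 = energy_kwh * g_per_kwh / 1_000
        print(f"{region}: {kg_co2:.0f} kg of CO2")
    # Mumbai: 920 kg of CO2 vs. Montreal: 20 kg of CO2, a 46x gap.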
47
00:02:03,450 --> 00:02:05,070
So choosing the right instance
48
00:02:05,070 --> 00:02:07,080
choosing a low carbon compute instance
49
00:02:07,080 --> 00:02:09,690
is really the most impactful
thing that you can do.
50
00:02:09,690 --> 00:02:13,020
And this is where it can really add up
51
00:02:13,020 --> 00:02:15,930
if you're training in a very
52
00:02:15,930 --> 00:02:17,580
carbon-intensive region.
53
00:02:19,170 --> 00:02:21,750
Other elements to consider: for example,
54
00:02:21,750 --> 00:02:22,770
using pre-trained models.
55
00:02:22,770 --> 00:02:25,590
That's the machine learning
equivalent of recycling.
56
00:02:25,590 --> 00:02:28,292
When you have pre-trained
models available and use them,
57
00:02:28,292 --> 00:02:30,120
you're not emitting any
carbon at all, right?
58
00:02:30,120 --> 00:02:31,230
You're not retraining anything.
59
00:02:31,230 --> 00:02:33,450
So that's also doing your homework
60
00:02:33,450 --> 00:02:35,574
and kind of looking around
at what already exists.
61
00:02:35,574 --> 00:02:37,890
Fine-tuning instead of
training from scratch.
62
00:02:37,890 --> 00:02:38,723
So once again
63
00:02:38,723 --> 00:02:40,590
if you find a model that
is almost what you need
64
00:02:40,590 --> 00:02:43,530
but not quite, fine-tuning
the last couple of layers,
65
00:02:43,530 --> 00:02:45,210
making it really fit your purpose instead
66
00:02:45,210 --> 00:02:46,500
of training a large transformer
67
00:02:46,500 --> 00:02:48,810
from scratch, can really help.
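As a minimal sketch of that idea with the transformers library (the checkpoint name and the two-label task here are assumptions for illustration, not the video's example):

    from transformers import AutoModelForSequenceClassification

    # Start from a pre-trained checkpoint instead of training from scratch.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # hypothetical task setup
    )
    # Freeze the pre-trained encoder so only the new classification
    # head on top is updated, which keeps training short and cheap.
    for param in model.base_model.parameters():
        param.requires_grad = False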
68
00:02:48,810 --> 00:02:51,270
Starting with smaller experiments
69
00:02:51,270 --> 00:02:52,800
and debugging as you go.
70
00:02:52,800 --> 00:02:54,630
So that means, for example,
71
00:02:54,630 --> 00:02:58,770
figuring out data encoding,
72
00:02:58,770 --> 00:03:01,170
making sure that there are
no small bugs that you'll
73
00:03:01,170 --> 00:03:03,840
only realize, you know,
16 hours into training;
74
00:03:03,840 --> 00:03:05,820
starting small and really making sure
75
00:03:05,820 --> 00:03:08,760
that what you're doing and
your code are stable.
76
00:03:08,760 --> 00:03:11,430
And then finally doing
a literature review to
77
00:03:11,430 --> 00:03:13,740
choose hyperparameter
ranges and then following
78
00:03:13,740 --> 00:03:15,900
up with a random search
instead of a grid search.
79
00:03:15,900 --> 00:03:18,420
So random searches over hyperparameter
80
00:03:18,420 --> 00:03:21,300
combinations have actually been
shown to be as efficient
81
00:03:21,300 --> 00:03:24,000
in finding the optimal
configuration as grid search.
82
00:03:24,000 --> 00:03:27,510
But obviously you're not trying
all possible combinations
83
00:03:27,510 --> 00:03:29,520
you're only trying a subset of them.
84
00:03:29,520 --> 00:03:31,800
So this can really help as well.
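Here is a minimal sketch of a random search over hyperparameter combinations; the ranges are hypothetical ones you might pick from a literature review, and train_and_evaluate is a stand-in for your own training code.

    import random

    # Hypothetical ranges, e.g. chosen from a literature review.
    search_space = {
        "learning_rate": [1e-5, 3e-5, 5e-5, 1e-4],
        "batch_size": [16, 32, 64],
        "weight_decay": [0.0, 0.01, 0.1],
    }
    # A grid search would try all 4 * 3 * 3 = 36 combinations;
    # random search samples only a small subset of them.
    for _ in range(8):
        config = {k: random.choice(v) for k, v in search_space.items()}
        print(config)  # train_and_evaluate(config) would go here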
85
00:03:31,800 --> 00:03:32,760
So now if we go back
86
00:03:32,760 --> 00:03:36,300
to the original paper by
Strubell et al. in 2019,
87
00:03:36,300 --> 00:03:39,180
the infamous five cars
in their lifetimes paper.
88
00:03:39,180 --> 00:03:40,013
If you just look
89
00:03:40,013 --> 00:03:43,606
at a 200 million
parameter transformer,
90
00:03:43,606 --> 00:03:46,950
its carbon footprint is
around 200 pounds of CO2
91
00:03:46,950 --> 00:03:47,940
which is significant
92
00:03:47,940 --> 00:03:49,980
but it's nowhere near five cars, right?
93
00:03:49,980 --> 00:03:52,893
It's not even a transatlantic flight.
94
00:03:52,893 --> 00:03:55,020
How it really adds up is when you're doing
95
00:03:55,020 --> 00:03:56,190
neural architecture search
96
00:03:56,190 --> 00:03:58,560
when you're doing
hyperparameter tuning, and
97
00:03:58,560 --> 00:04:00,930
this is trying all possible combinations
98
00:04:00,930 --> 00:04:01,763
et cetera, et cetera.
99
00:04:01,763 --> 00:04:02,596
And this is where
100
00:04:02,596 --> 00:04:05,400
the 600,000 pounds of CO2 came from.
101
00:04:05,400 --> 00:04:08,490
So this is really where things add up.
102
00:04:08,490 --> 00:04:11,880
But if you're doing things
mindfully and conscientiously,
103
00:04:11,880 --> 00:04:16,410
then your carbon footprint
won't be as big
104
00:04:16,410 --> 00:04:20,040
as the paper implied. Some tools can help figure
105
00:04:20,040 --> 00:04:22,111
out exactly how much CO2 you're emitting.
106
00:04:22,111 --> 00:04:24,270
There's a web-based tool called the machine
107
00:04:24,270 --> 00:04:26,430
learning emissions
calculator, which allows you
108
00:04:26,430 --> 00:04:29,010
to manually input, for example,
which hardware you used
109
00:04:29,010 --> 00:04:30,488
how many hours you used it for
110
00:04:30,488 --> 00:04:34,260
where it was located,
locally or in the cloud.
111
00:04:34,260 --> 00:04:35,640
And then it's gonna give you an estimate
112
00:04:35,640 --> 00:04:37,560
of how much CO2 you emitted.
113
00:04:37,560 --> 00:04:40,200
Another tool that does
this programmatically
114
00:04:40,200 --> 00:04:41,190
is called CodeCarbon.
115
00:04:41,190 --> 00:04:45,112
So you can pip install it,
or you can go to the GitHub repo,
116
00:04:45,112 --> 00:04:48,120
and essentially it runs
in parallel to your code.
117
00:04:48,120 --> 00:04:49,085
So essentially you call it
118
00:04:49,085 --> 00:04:51,060
and then you do all your training.
119
00:04:51,060 --> 00:04:53,760
And then at the end it's
gonna give you
120
00:04:53,760 --> 00:04:57,210
a CSV file with an
estimate of your emissions.
121
00:04:57,210 --> 00:04:59,250
And it's gonna give you some comparisons.
122
00:04:59,250 --> 00:05:01,230
It's got a visual UI
where you can really look
123
00:05:01,230 --> 00:05:04,680
at how this compares to
driving a car or watching TV.
124
00:05:04,680 --> 00:05:06,060
So it can give you an idea
125
00:05:06,060 --> 00:05:07,740
of the scope of your emissions as well.
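A minimal sketch of that workflow (pip install codecarbon; the training loop itself is left out):

    from codecarbon import EmissionsTracker

    tracker = EmissionsTracker()
    tracker.start()
    # ... your training code runs here, alongside the tracker ...
    emissions = tracker.stop()  # also writes an emissions.csv file
    print(f"Estimated emissions: {emissions} kg of CO2-eq")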
126
00:05:07,740 --> 00:05:09,930
And actually, CodeCarbon is
already integrated into AutoNLP,
127
00:05:09,930 --> 00:05:12,270
and hopefully
people will be using it
128
00:05:12,270 --> 00:05:15,240
out of the box and easily
tracking their emissions all
129
00:05:15,240 --> 00:05:17,523
through training and
deploying transformers.