- So let's talk about the carbon footprint of transformers. Maybe you've seen headlines such as this one: that training a single AI model can emit as much carbon as five cars in their lifetimes. So when is this true, and is it always true? Well, it actually depends on several things. Most importantly, it depends on the type of energy you're using. If you're using renewable energy such as solar, wind, or hydroelectricity, you're really emitting very little carbon, almost none at all. If you're using non-renewable energy sources such as coal, the carbon footprint is a lot higher, because essentially you are emitting a lot of greenhouse gases.

Another aspect is training time: the longer you train, the more energy you use, and the more energy you use, the more carbon you emit. This really adds up, especially if you're training large models for hours, days, and weeks.

The hardware you use also matters, because some GPUs, for example, are more efficient than others, and using them properly, at close to a hundred percent utilization, can really reduce your energy consumption, and once again, reduce your carbon footprint. There are also other aspects, such as I/O, data, et cetera, but these are the main three that you should focus on.

So when I talk about energy sources and carbon intensity, what does that really mean? If you look at the top of the screen, you have the carbon footprint of a cloud computing instance in Mumbai, India, which emits 920 grams of CO2 per kilowatt-hour. That's almost one kilogram of CO2 per kilowatt-hour of electricity used. Compare that with Montreal, Canada, where I am right now: 20 grams of CO2 per kilowatt-hour. That's a really big difference, more than 40 times as much carbon emitted in Mumbai as in Montreal. And this can really add up. If you're training a model for weeks, for example, you're multiplying the carbon that you're emitting by more than 40.
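To put those two numbers side by side, here is a quick back-of-the-envelope sketch. The two intensity figures are the ones quoted above; the power draw and the training time are made-up illustrative values.

# Emissions estimate: energy used (kWh) times grid carbon intensity (g CO2eq/kWh).
# Intensities are the figures quoted above; power draw and duration are hypothetical.
GRID_INTENSITY_G_PER_KWH = {"Mumbai": 920, "Montreal": 20}

power_draw_kw = 0.3      # hypothetical: one GPU drawing about 300 W
training_hours = 24 * 7  # hypothetical: one week of training

energy_kwh = power_draw_kw * training_hours
for region, intensity in GRID_INTENSITY_G_PER_KWH.items():
    co2_kg = energy_kwh * intensity / 1000
    print(f"{region}: {co2_kg:.1f} kg CO2eq for {energy_kwh:.0f} kWh")

# Mumbai: ~46.4 kg, Montreal: ~1.0 kg -- the same job, 46 times the emissions.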
So choosing the right instance, choosing a low-carbon compute instance, is really the most impactful thing that you can do, and this is where it can really add up if you're training in a very carbon-intensive region.

Other elements to consider: using pre-trained models, for example, which is the machine learning equivalent of recycling. When pre-trained models are available and you use them, you're not emitting any carbon at all, because you're not retraining anything. So that also means doing your homework and looking around at what already exists.

Fine-tuning instead of training from scratch: once again, if you find a model that is almost what you need, but not quite, fine-tuning the last couple of layers to make it really fit your purpose, instead of training a large transformer from scratch, can really help (there's a small sketch of this after this passage).

Starting with smaller experiments and debugging as you go: that means, for example, figuring out data encoding and making sure there are no small bugs that you'll only realize 16 hours into training. Start small and really make sure that your code is stable.

And then finally, doing a literature review to choose hyperparameter ranges, and then following up with a random search instead of a grid search. Random searches over hyperparameter combinations have actually been shown to be as efficient at finding the optimal configuration as grid search, but obviously you're not trying all possible combinations, only a subset of them. So this can really help as well.
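To make the fine-tuning point above concrete, here is a minimal sketch, assuming the Hugging Face transformers library: the pre-trained body is frozen and only a small classification head is trained. The checkpoint name and the number of labels are just placeholders.

from transformers import AutoModelForSequenceClassification

# Load a pre-trained checkpoint with a fresh classification head on top.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze the pre-trained encoder so only the new head receives gradients.
for param in model.base_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable:,} of {total:,} parameters")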
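And here is what the random search itself might look like. The hyperparameter ranges are illustrative stand-ins for whatever your literature review suggests, and train_and_evaluate is a hypothetical placeholder for your own training run.

import random

# Ranges you would pick from a literature review; illustrative values only.
search_space = {
    "learning_rate": [1e-5, 3e-5, 5e-5, 1e-4],
    "batch_size": [8, 16, 32, 64],
    "warmup_ratio": [0.0, 0.06, 0.1],
}

random.seed(0)
n_trials = 8  # a full grid search would need 4 * 4 * 3 = 48 runs
for trial in range(n_trials):
    config = {name: random.choice(values) for name, values in search_space.items()}
    print(f"trial {trial}: {config}")
    # score = train_and_evaluate(config)  # placeholder for your training run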
So now if we go back to the original paper by Strubell et al. in 2019, the infamous "five cars in their lifetimes" paper: if you just look at a 200-million-parameter transformer, its carbon footprint is around 200 pounds of CO2, which is significant, but it's nowhere near five cars. It's not even a transatlantic flight. Where it really adds up is when you're doing neural architecture search, when you're doing hyperparameter tuning by trying all possible combinations, et cetera, et cetera. That's where the 600,000 pounds of CO2 came from; this is really where things add up. But if you're doing things mindfully and conscientiously, your carbon footprint won't be as big as the paper implied.

Now, some tools to figure out exactly how much CO2 you're emitting. There's a web-based tool called the Machine Learning Emissions Calculator, which allows you to manually input, for example, which hardware you used, how many hours you used it for, and where it was located, locally or in the cloud. It then gives you an estimate of how much CO2 you emitted.

Another tool that does this programmatically is called CodeCarbon. You can pip install it, you can go to the GitHub, and essentially it runs in parallel to your code. You call it, then you do all your training, and at the end it gives you an estimate, a CSV file, with an estimate of your emissions, along with some comparisons. It has a visual UI where you can really look at how your emissions compare to driving a car or watching TV, so it can give you an idea of their scope as well. And actually, CodeCarbon is already integrated into AutoNLP, and hopefully people will be using it out of the box and easily tracking their emissions all through training and deploying transformers.
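As a rough sketch of what that looks like in practice (pip install codecarbon, then wrap your run with a tracker; train here is a stand-in for your actual training loop):

from codecarbon import EmissionsTracker

def train():
    # Stand-in for your actual training loop.
    sum(i * i for i in range(10_000_000))

tracker = EmissionsTracker()  # detects hardware and location automatically
tracker.start()
try:
    train()
finally:
    # stop() returns the estimate in kg CO2eq and also writes emissions.csv.
    emissions_kg = tracker.stop()
    print(f"estimated emissions: {emissions_kg:.6f} kg CO2eq")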