AI Supremacy

Microsoft's Epic Lawsuit around GitHub Copilot for software piracy explained

It's STILL the A.I. wild-wild west. How could such a cute mascot be a pirate in disguise, you ask?

Michael Spencer
Nov 11, 2022

Hey Guys,

Congratulations, readers: nearly 8,000 of you now read A.I. Supremacy. For all the immense effort, I'm still trying to find a business model that makes sense. Perhaps your vote can help steer me in the right direction:


This poll will help me get your valuable feedback on how to proceed, since I want to be as reader-centric as possible.

Check out my archives while they are still free; I've written a lot about A.I. topics so far in 2022. As I approach my one-year anniversary on Substack, I'm open to offering sponsored ads on this newsletter, which would enable me to offer more free posts. Anyway, enough of that administrative stuff. Let's dive into today's topic.


AI-driven coding tool might generate other people's code – who knew? Well, Redmond, for one

What do copyright and piracy even mean in the era of generative A.I.? The prevailing attitude seems to be: let's just train the tool on everyone else's data.

Yet Microsoft went ahead with it anyway after acquiring GitHub and drawing on OpenAI's capabilities. It's a serious matter in terms of the kind of copyright lawsuits generative A.I. is going to face. It's software piracy, at least on some level!

GitHub Copilot – a programming auto-suggestion tool trained from public source code on the internet – has been caught generating what appears to be copyrighted code, prompting an attorney to look into a possible copyright infringement claim.

While I believe code generation and auto-complete make software developers more productive, there is a cost. At a time when A.I. ethics, safety, and open-source accessibility are supposedly being prioritized, the business reality doesn't reflect it.

This lawsuit represents a growing concern from programmers, artists, and other people that AI systems may be using their code, artwork, and other data without permission.

On October 17th, 2022, Matthew Butterick, a lawyer, designer, and developer, announced he is working with the Joseph Saveri Law Firm to investigate the possibility of filing a copyright claim against GitHub. There are two potential lines of attack here: is GitHub improperly training Copilot on open-source code, and is the tool improperly emitting other people's copyrighted work – pulled from the training data – as code suggestions to users?

A lot of folks have been warning about this issue. In June 2022, Matt wrote about the legal problems with GitHub Copilot, in particular its mishandling of open-source licenses.

Microsoft needs to get its act together. While it profiteers on generative A.I., we obviously need more A.I. ethics and rule of law here.

This is, after all, one of Microsoft's AI-as-a-Service subscriptions. GitHub Copilot is a cloud-based intelligent tool that analyzes existing code to suggest lines of code and entire functions in real time, directly within the editor. The extension is available in integrated development environments such as Visual Studio, Visual Studio Code, Neovim, and JetBrains IDEs. GitHub Copilot is available to all developers for $10/month or $100/year.
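To make concrete what "auto-suggestion" means here: the developer typically types a comment or a function signature, and the model proposes the body. The snippet below is a hypothetical illustration of that workflow, not actual Copilot output.

```python
# A developer types a comment and a function signature; a Copilot-style
# assistant proposes the body. The suggested completion below is a
# hypothetical illustration, not actual Copilot output.

# Developer's prompt:
# "return the n-th Fibonacci number, iteratively"
def fib(n: int) -> int:
    # --- suggested completion begins here ---
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

The copyright question arises because a suggested body like this may closely match code the model saw during training.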

There is a host of rising competitors to GitHub Copilot, but it perhaps has the most swag so far. I can see Matt's point, though; Butterick has been critical of Copilot since its launch. In June he published a blog post arguing that "any code generated by Copilot may contain lurking license or IP violations," and thus should be avoided.

Everyone was afraid something like this would happen when Microsoft acquired GitHub for $7.5 billion back in 2018. Let's try to unpack Matt's position here: according to OpenAI, Codex was trained on "tens of millions of public repositories" including code on GitHub. Microsoft has emphasized Copilot's ability to suggest larger blocks of code, like the entire body of a function.

It's all very worrisome if you care about A.I. ethics at all. Microsoft itself has vaguely described the training material as "billions of lines of public code". But Copilot researcher Eddie Aftandilian confirmed in a recent podcast (@ 36:40) that Copilot is "train[ed] on public repos on GitHub".

Microsoft and OpenAI must be relying on a fair-use argument. In fact we know this is so, because former GitHub CEO Nat Friedman claimed during the Copilot technical preview that "training [machine-learning] systems on public data is fair use".

Well—is it? The answer isn't a matter of opinion; it's a matter of law. Naturally, Microsoft, OpenAI, and other researchers have been promoting the fair-use argument. Nat Friedman further asserted that there is "jurisprudence" on fair use that is "broadly relied upon by the machine[-]learning community". But the Software Freedom Conservancy disagreed, and pressed Microsoft for evidence to support its position. According to SFC director Bradley Kuhn—

[W]e inquired privately with Friedman and other Microsoft and GitHub representatives in June 2021, asking for solid legal references for GitHub's public legal positions … They provided none.

Why couldn't Microsoft produce any legal authority for its position? Because SFC is correct: there isn't any. Or so claims Matt.

There’s clearly a rift in the community. That same month, Denver Gingerich and Bradley Kuhn of the Software Freedom Conservancy (SFC) said their organization would stop using GitHub, largely as a result of Microsoft and GitHub releasing Copilot without addressing concerns about how the machine-learning model dealt with different open source licensing requirements.

Some people aren't happy with the heavy-handed way Big Tech is going about this: training AI to auto-suggest code on what amounts to the work of others.

Copilot is powered by Codex, an AI system that was created by OpenAI and licensed to Microsoft. According to OpenAI, Codex was trained on “millions of public repositories” and is “an instance of transformative fair use.”

No Legal Precedent

The problem is that all of these generative A.I. systems are entering grey areas where the profit motive means A.I. ethics, regulation, and the rule of law are totally absent.

For instance, even if a court ultimately rules that certain kinds of AI training are fair use—which seems possible—it may also rule out others. As of today, we have no idea where Copilot or Codex sits on that spectrum. Neither does Microsoft nor OpenAI.

Since its launch, the developer community has heavily criticized Microsoft’s GitHub Copilot due to potential copyright violations. Microsoft of course has turned a blind eye.

It's a trial for generative A.I., not just Microsoft, as there are many related cases and instances of how this works. Microsoft, GitHub, and OpenAI are being sued for allegedly violating copyright law by reproducing open-source code using AI.

Microsoft, its subsidiary GitHub, and its business partner OpenAI are all basically complicit. No attribution, nothing.

Copilot's capacity to copy code verbatim, or nearly so, surfaced last week when Tim Davis, a professor of computer science and engineering at Texas A&M University, found that Copilot, when prompted, would reproduce his copyrighted sparse matrix transposition code.
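For context on what was reproduced: sparse matrix transposition is a classic, compact algorithm, exactly the kind of short, self-contained function a model can memorize from training data. The sketch below is a generic transpose of a matrix in Compressed Sparse Row (CSR) form via counting sort, written for illustration only; it is not Davis's copyrighted code.

```python
def csr_transpose(n_rows, n_cols, indptr, indices, data):
    """Transpose a matrix stored in Compressed Sparse Row (CSR) form.

    Generic counting-sort approach, shown for illustration; this is not
    Tim Davis's copyrighted implementation.
    """
    nnz = len(data)
    # Count the entries in each column: these become row sizes of A^T.
    t_indptr = [0] * (n_cols + 1)
    for j in indices:
        t_indptr[j + 1] += 1
    # Prefix-sum the counts to get row pointers of the transpose.
    for j in range(n_cols):
        t_indptr[j + 1] += t_indptr[j]
    # Scatter every entry (i, j, v) into position (j, i) of the transpose.
    t_indices = [0] * nnz
    t_data = [0] * nnz
    next_slot = list(t_indptr[:-1])
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            dest = next_slot[j]
            t_indices[dest] = i
            t_data[dest] = data[k]
            next_slot[j] += 1
    return t_indptr, t_indices, t_data
```

Because routines like this are short, deterministic, and widely mirrored across public repositories, a model trained on public code can plausibly emit a known implementation nearly verbatim, which is what Davis reported observing.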

What’s the Big Deal?
